C++无序关联容器：unordered_set与unordered_multiset详解

丁香医生

1. 无序关联容器概述

在C++标准库中，unordered_set和unordered_multiset是基于哈希表实现的无序关联容器。它们与有序容器set/multiset最大的区别在于：不维护元素的排序状态，而是通过哈希函数将元素映射到桶(bucket)中，从而提供平均O(1)时间复杂度的访问性能。

注意：虽然理论上是O(1)时间复杂度，但在实际应用中，哈希冲突会导致性能下降，因此合理设置哈希函数和桶数量非常重要。

1.1 核心特性对比

让我们先通过一个表格对比这两种容器的核心特性：

特性	unordered_set	unordered_multiset
元素唯一性	唯一	可重复
底层实现	哈希表	哈希表
插入时间复杂度	平均O(1)	平均O(1)
查找时间复杂度	平均O(1)	平均O(1)
删除时间复杂度	平均O(1)	平均O(1)
内存占用	较高	较高
迭代器稳定性	修改后可能失效	修改后可能失效

1.2 哈希表工作原理

unordered_set和unordered_multiset的底层实现都是哈希表，其工作原理可以概括为：

哈希函数：将元素值转换为size_t类型的哈希值
桶映射：通过取模运算将哈希值映射到特定桶
冲突处理：通常采用链地址法（每个桶维护一个链表）

cpp复制// 哈希表示例结构
template<typename Key>
struct HashNode {
    Key key;
    HashNode* next;
};

template<typename Key>
class HashTable {
    std::vector<HashNode<Key>*> buckets;
    // ...
};

这种结构使得在理想情况下（哈希冲突少），查找、插入和删除操作都能在常数时间内完成。

2. 容器定义与初始化

2.1 基本定义方式

unordered_set和unordered_multiset定义在<unordered_set>头文件中，使用时需要包含：

cpp复制#include <unordered_set>

定义容器时有多种初始化方式：

cpp复制// 默认构造函数
std::unordered_set<int> us1;  // 空集合
std::unordered_multiset<int> ums1;  // 空多重集合

// 列表初始化
std::unordered_set<int> us2 = {1, 3, 5, 7, 9};
std::unordered_multiset<int> ums2 = {1, 3, 3, 5, 7, 9};  // 允许重复

// 范围初始化
std::vector<int> vec = {2, 4, 6, 8};
std::unordered_set<int> us3(vec.begin(), vec.end());

2.2 高级初始化参数

除了基本的初始化方式，还可以指定哈希表参数：

cpp复制// 指定初始桶数
std::unordered_set<int> us4(16);  // 初始16个桶

// 自定义哈希函数
auto hasher = [](const std::string& s) {
    return std::hash<std::string>()(s) % 100;
};
std::unordered_set<std::string, decltype(hasher)> us5(10, hasher);

// 完整参数：桶数、哈希函数、键相等比较函数
std::unordered_set<std::string> us6(10, 
    std::hash<std::string>(), 
    std::equal_to<std::string>());

提示：选择合适的初始桶数可以减少rehash操作，提升性能。一般建议设置为预计元素数量的1.5倍左右。

3. 元素操作详解

3.1 插入元素

插入操作有多种形式，效率也有所不同：

cpp复制std::unordered_set<int> us;

// 1. insert单元素
auto ret1 = us.insert(5);  // 返回pair<iterator, bool>

// 2. insert多元素
us.insert({1, 3, 7, 9});  // 插入初始化列表

// 3. insert范围
std::vector<int> vec = {11, 13};
us.insert(vec.begin(), vec.end());

// 4. emplace构造并插入（最高效）
us.emplace(15);  // 直接在容器内构造元素

对于unordered_multiset，插入操作总是成功，因为允许重复元素：

cpp复制std::unordered_multiset<int> ums;
ums.insert(3);
ums.insert(3);  // 允许重复插入

3.2 删除元素

删除操作也有多种形式：

cpp复制std::unordered_set<int> us = {1, 3, 5, 7, 9};

// 1. 通过值删除
size_t count = us.erase(5);  // 返回删除的元素数量

// 2. 通过迭代器删除
auto it = us.find(3);
if (it != us.end()) {
    us.erase(it);  // 注意：删除后迭代器失效
}

// 3. 删除范围
us.erase(us.begin(), us.end());  // 清空容器

对于unordered_multiset，erase(key)会删除所有匹配的元素：

cpp复制std::unordered_multiset<int> ums = {1, 3, 3, 5};
size_t num_erased = ums.erase(3);  // 返回2，删除了两个3

3.3 查找元素

查找操作是哈希表的强项，平均时间复杂度为O(1)：

cpp复制std::unordered_set<int> us = {1, 3, 5, 7, 9};

// 1. find查找单个元素
auto it = us.find(5);
if (it != us.end()) {
    std::cout << "Found: " << *it << std::endl;
}

// 2. count统计元素出现次数
if (us.count(7) > 0) {
    std::cout << "7 exists" << std::endl;
}

// 3. equal_range（主要用于multiset）
std::unordered_multiset<int> ums = {1, 3, 3, 5};
auto range = ums.equal_range(3);
for (auto it = range.first; it != range.second; ++it) {
    std::cout << *it << " ";  // 输出所有3
}

4. 哈希表管理与性能优化

4.1 桶接口

哈希表的性能很大程度上取决于桶的数量和分布：

cpp复制std::unordered_set<int> us = {1, 3, 5, 7, 9};

// 获取桶数量
size_t bucket_count = us.bucket_count();

// 查看特定元素所在的桶
size_t bucket = us.bucket(5);

// 查看桶中的元素数量
size_t bucket_size = us.bucket_size(bucket);

// 遍历所有桶
for (size_t i = 0; i < us.bucket_count(); ++i) {
    std::cout << "Bucket " << i << " has " << us.bucket_size(i) << " elements\n";
}

4.2 负载因子控制

负载因子(load factor)是元素数量与桶数量的比值，直接影响哈希表性能：

cpp复制std::unordered_set<int> us;

// 获取当前负载因子
float current_load = us.load_factor();

// 获取/设置最大负载因子
float max_load = us.max_load_factor();
us.max_load_factor(0.75f);  // 设置新的最大负载因子

// 当负载因子超过最大值时，容器会自动rehash

4.3 手动调整哈希表

可以通过rehash和reserve手动优化哈希表：

cpp复制std::unordered_set<int> us;

// reserve：预留足够空间
us.reserve(100);  // 准备存储100个元素

// rehash：直接设置桶数量
us.rehash(128);  // 桶数至少为128

// 查看当前哈希策略
std::cout << "Bucket count: " << us.bucket_count()
          << ", Load factor: " << us.load_factor()
          << ", Max load factor: " << us.max_load_factor() << std::endl;

性能提示：在已知元素数量的情况下，预先调用reserve()可以避免插入时的多次rehash，显著提升性能。

5. 实际应用场景

5.1 快速查找案例

cpp复制// 大型数据集中快速查找
std::unordered_set<std::string> dictionary;

// 加载字典数据
load_dictionary(dictionary);  // 假设有这样一个函数

// 快速检查单词是否存在
std::string word;
while (std::cin >> word) {
    if (dictionary.find(word) != dictionary.end()) {
        std::cout << word << " is in the dictionary\n";
    } else {
        std::cout << word << " is misspelled\n";
    }
}

5.2 高效去重方案

cpp复制// 从向量中去除重复元素
std::vector<int> numbers = {1, 2, 3, 2, 1, 4, 5, 4, 6};

// 方法1：使用unordered_set
std::unordered_set<int> unique_set(numbers.begin(), numbers.end());
std::vector<int> unique_vec1(unique_set.begin(), unique_set.end());

// 方法2：保持原始顺序
std::unordered_set<int> seen;
std::vector<int> unique_vec2;
for (int num : numbers) {
    if (seen.insert(num).second) {  // 插入成功表示首次出现
        unique_vec2.push_back(num);
    }
}

5.3 元素频率统计

cpp复制// 统计单词频率
std::unordered_multiset<std::string> word_bag;

// 读取文本
std::string word;
while (std::cin >> word) {
    word_bag.insert(word);
}

// 输出频率
std::unordered_set<std::string> unique_words(word_bag.begin(), word_bag.end());
for (const auto& w : unique_words) {
    std::cout << w << ": " << word_bag.count(w) << " times\n";
}

6. 高级技巧与注意事项

6.1 自定义哈希函数

对于自定义类型，需要提供哈希函数：

cpp复制struct Person {
    std::string name;
    int age;
    
    bool operator==(const Person& other) const {
        return name == other.name && age == other.age;
    }
};

// 自定义哈希函数
struct PersonHasher {
    size_t operator()(const Person& p) const {
        return std::hash<std::string>()(p.name) ^ std::hash<int>()(p.age);
    }
};

std::unordered_set<Person, PersonHasher> people;
people.insert({"Alice", 30});
people.insert({"Bob", 25});

6.2 迭代器失效问题

哈希表的修改操作可能导致迭代器失效：

cpp复制std::unordered_set<int> us = {1, 3, 5, 7, 9};

// 安全的遍历方式
for (auto it = us.begin(); it != us.end(); ) {
    if (*it % 2 == 0) {
        it = us.erase(it);  // erase返回下一个有效迭代器
    } else {
        ++it;
    }
}

// 不安全的操作（可能导致未定义行为）
std::unordered_set<int> us2 = {1, 2, 3};
auto it1 = us2.find(2);
auto it2 = us2.find(3);
us2.erase(it1);
// 此时it2可能已经失效，不能再使用

6.3 性能优化实践

选择合适的初始大小：避免频繁rehash
优化哈希函数：减少冲突
考虑内存局部性：频繁访问的元素可以集中存储

cpp复制// 性能优化示例
class OptimizedHashSet {
private:
    std::unordered_set<std::string> set_;
    static const size_t INIT_BUCKETS = 1024;
    
public:
    OptimizedHashSet() {
        set_.reserve(INIT_BUCKETS);
        set_.max_load_factor(0.7f);
    }
    
    void insert(const std::string& s) {
        set_.insert(s);
        if (set_.size() > set_.bucket_count() * set_.max_load_factor()) {
            set_.rehash(set_.bucket_count() * 2);
        }
    }
};

7. 与有序容器的对比选择

7.1 性能对比

操作	unordered_set	set
插入	O(1)平均	O(log n)
查找	O(1)平均	O(log n)
删除	O(1)平均	O(log n)
遍历	O(n)	O(n)
有序遍历	不支持	自动排序

7.2 选择建议

使用unordered_set/unordered_multiset的情况：
- 需要极快的查找速度
- 不关心元素顺序
- 元素类型有良好的哈希函数
- 内存不是主要限制因素
使用set/multiset的情况：
- 需要有序遍历
- 元素类型没有合适的哈希函数
- 内存有限
- 需要稳定的迭代器（不因插入删除而失效）

cpp复制// 选择示例：高频查找 vs 有序遍历
void process_data(const std::vector<int>& data) {
    if (need_fast_lookup) {
        std::unordered_set<int> lookup(data.begin(), data.end());
        // 快速查找操作...
    } else if (need_ordered_output) {
        std::set<int> ordered(data.begin(), data.end());
        // 有序遍历操作...
    }
}