哈希表与哈希算法：原理、实现与优化

1. 哈希表基础原理

哈希表（Hash Table）是一种基于键值对（key-value）存储数据的数据结构，它通过计算键的哈希值来确定数据在内存中的存储位置，从而实现O(1)平均时间复杂度的插入、查找和删除操作。

1.1 哈希表的核心组件

哈希函数：将任意大小的输入数据映射到固定大小的输出（哈希值）
哈希桶（数组）：存储数据的实际容器
冲突解决机制：处理不同键映射到相同位置的情况

1.2 哈希表的基本原理

哈希表通过以下步骤存储和检索数据：

通过哈希函数计算键的哈希值
将哈希值映射到哈希桶的索引位置
在该位置存储或检索数据
如果发生冲突，使用冲突解决策略处理

2. 哈希函数设计

哈希函数是哈希表性能的关键因素，一个好的哈希函数应该具备以下特性：

高效计算：计算速度快
均匀分布：不同输入产生的哈希值均匀分布在整个输出范围内
确定性：相同的输入总是产生相同的输出
雪崩效应：输入的微小变化会导致输出的显著变化

2.1 常见的哈希函数

2.1.1 整数哈希

对于整数类型的键，常用的哈希函数包括：

// 简单的取模哈希函数
size_t hash_mod(int key, size_t table_size) {return key % table_size;
}// 乘法哈希函数（Knuth推荐）
size_t hash_multiply(int key, size_t table_size) {const double A = 0.6180339887; // (sqrt(5) - 1) / 2double val = key * A;val = val - floor(val);return floor(val * table_size);
}

2.1.2 字符串哈希

对于字符串类型的键，几种经典哈希算法：

// BKDR哈希算法
size_t BKDRHash(const char* str, size_t table_size) {unsigned int seed = 131; // 31, 131, 1313, 13131... 都是不错的选择unsigned int hash = 0;while (*str) {hash = hash * seed + (*str++);}return hash % table_size;
}// FNV哈希算法
size_t FNVHash(const char* str, size_t table_size) {const unsigned int FNV_PRIME = 16777619;const unsigned int OFFSET_BASIS = 2166136261;unsigned int hash = OFFSET_BASIS;while (*str) {hash ^= *str++;hash *= FNV_PRIME;}return hash % table_size;
}// Jenkins哈希算法（one-at-a-time）
size_t JenkinsHash(const char* str, size_t table_size) {unsigned int hash = 0;while (*str) {hash += *str++;hash += (hash << 10);hash ^= (hash >> 6);}hash += (hash << 3);hash ^= (hash >> 11);hash += (hash << 15);return hash % table_size;
}

2.2 选择哈希表大小

哈希表的大小也是影响性能的重要因素。通常，选择质数作为哈希表大小可以减少冲突：

// 检查一个数是否为质数
bool isPrime(int n) {if (n <= 1) return false;if (n <= 3) return true;if (n % 2 == 0 || n % 3 == 0) return false;for (int i = 5; i * i <= n; i += 6) {if (n % i == 0 || n % (i + 2) == 0) return false;}return true;
}// 找到大于n的下一个质数
int nextPrime(int n) {if (n <= 1) return 2;int prime = n;bool found = false;while (!found) {prime++;if (isPrime(prime)) found = true;}return prime;
}

3. 冲突解决策略

哈希冲突是指不同的键通过哈希函数计算后得到相同的哈希值。解决冲突的主要策略有两种：开放寻址和链式寻址。

3.1 链式寻址（Chaining）

链式寻址使用链表存储映射到同一位置的所有元素：

每个哈希桶维护一个链表
发生冲突时，将新元素添加到链表中
查找时，遍历链表找到目标元素

优点：

实现简单
对负载因子不敏感
插入操作不受表大小限制

缺点：

需要额外的内存存储链表
可能导致缓存不友好
搜索时间与链表长度相关

3.2 开放寻址（Open Addressing）

开放寻址在冲突发生时，通过某种探测序列查找下一个可用位置：

3.2.1 线性探测（Linear Probing）

冲突发生时，按顺序检查下一个位置：

size_t linear_probe(size_t hash, size_t i, size_t table_size) {return (hash + i) % table_size;
}

优点：

缓存友好
内存利用率高

缺点：

容易造成聚集（clustering）
负载因子增大时性能下降明显

3.2.2 二次探测（Quadratic Probing）

使用二次方程确定探测序列：

size_t quadratic_probe(size_t hash, size_t i, size_t table_size) {return (hash + i*i) % table_size;
}

优点：

减少了聚集问题
比线性探测性能更稳定

缺点：

可能无法探测表中所有位置

3.2.3 双重哈希（Double Hashing）

使用第二个哈希函数确定步长：

size_t double_hash(size_t hash1, size_t hash2, size_t i, size_t table_size) {return (hash1 + i * hash2) % table_size;
}

优点：

消除二次聚集
提供更均匀的探测序列

缺点：

计算开销略大

4. 哈希表的C++实现

以下是使用链式寻址实现的哈希表：

#include <iostream>
#include <vector>
#include <list>
#include <string>
#include <functional>template<typename K, typename V>
class HashMap {
private:// 哈希表的桶，每个桶是一个链表std::vector<std::list<std::pair<K, V>>> buckets;size_t size_;size_t capacity_;double load_factor_threshold_;// 哈希函数size_t hash(const K& key) const {return std::hash<K>{}(key) % buckets.size();}// 重新哈希void rehash() {size_t new_capacity = nextPrime(capacity_ * 2);std::vector<std::list<std::pair<K, V>>> new_buckets(new_capacity);for (const auto& bucket : buckets) {for (const auto& pair : bucket) {size_t bucket_idx = std::hash<K>{}(pair.first) % new_capacity;new_buckets[bucket_idx].push_back(pair);}}buckets = std::move(new_buckets);capacity_ = new_capacity;}// 查找大于n的