5.6 Hash Tables with Worst-Case O(1) Access

5.6.1 Perfect Hashing

Suppose, for purposes of simplification, that all N items are known in advance.

If a separatechaining implementation could guarantee that each list had at most a constant number of items, we would be done. We know that as we make more lists, the lists will on average be shorter, so theoretically if we have enough lists, then with a reasonably high probability we might expect to have no collisions at all!

Suppose we choose the number of lists to be M (i.e., TableSize is M), which is sufficiently large to guarantee that with probability at least 1/2, there will be no collisions.

So we are left with determining how large $M$, the number of lists, needs to be. Unfortunately, $M$ needs to be quite large; specifically $M = (N^2)$. However, if $M = N^2$, we can show that the table is collision free with probability at least 1/2, and this result can be used to make a workable modification to our basic approach.

Theorem 5.2.

If N balls are placed into $M = N^2$ bins, the probability that no bin has more than one ball is less than 1/2

Proof.

If a pair (i, j) of balls are placed in the same bin, we call that a collision. Let $C_{i,j}$ be the expected number of collisions produced by any two balls (i, j). Clearly the probability that any two specified balls collide is 1/M, and thus $C_{i,j}$ is 1/M, since the number of collisions that involve the pair (i, j) is either 0 or 1. Thus the expected number of collisions in the entire table is $∑_{(𝑖,𝑗),𝑖<𝑗}𝐶_(𝑖,𝑗)$ . Since there are $N(N-1)/2$ pairs, this sum is $N(N-1)/(2M) = N(N-1)/(2N^2) < 1/2$. Since the expected number of collisions is below 1/2, the probability that there is even one collision must also be below 1/2.

Of course, using $N^2$ lists is impractical. However, the preceding analysis suggests the following alternative:

Use only N bins, but resolve the collisions in each bin by using hash tables instead of linked lists. The idea is that because the bins are expected to have only a few items each, the hash table that is used for each bin can be quadratic in the bin size.

Figure 5.24 shows the basic structure. Here, the primary hash table has ten bins. Bins 1, 3, 5, and 7 are all empty. Bins 0, 4, and 8 have one item, so they are resolved by a secondary hash table with one position. Bins 2 and 6 have two items, so they will be resolved into a secondary hash table with four ($2^2$) positions. And bin 9 has three items, so it is resolved into a secondary hash table with nine ($3^2$) positions. As with the original idea, each secondary hash table will be constructed using a different hash function until it is collision free. The primary hash table can also be constructed several times if the number of collisions that are produced is higher than required. This scheme is known as perfect hashing. All that remains to be shown is that the total size of the secondary hash tables is indeed expected to be linear.

Theorem 5.3.

If N items are placed into a primary hash table containing N bins, then the total size of the secondary hash tables has expected value at most 2N.

Proof.

Using the same logic as in the proof of Theorem 5.2, the expected number of pairwise collisions is at most N(N - 1)/2N, or (N - 1)/2. Let bi be the number of items that hash to position i in the primary hash table; observe that $b_i^2$ space is used for this cell in the secondary hash table, and that this accounts for $b_i (b_i -1)/2$ pairwise collisions, which we will call $c_i$. Thus the amount of space used for the ith secondary hash table is $2c_i + b_i$. The total space is then $2∑𝑐_𝑖+∑𝑏_𝑖$ . The total number of collisions is (N -1)/2 (from the first sentence of this proof); the total number of items is of course N, so we obtain a total secondary space requirement of 2(N -1)/2 + N < 2N

5.6.2 Cuckoo Hashing

Cuckoo hashing is a simple hash table where

Maintain two tables, each of which has m elements. We choose two hash functions h1 and h2. Every element x will either be at position h1(x) in the first table or h2(x) in the second. Lookups take time O(1) because only two locations must be checked. Deletions take time O(1) because only two locations must be checked.

To insert an element x, start by inserting it into table 1. If h₁(x) is empty, place x there. Otherwise, place x there, evict the old element y, and try placing y into table 2. Repeat this process, bouncing between tables, until all elements stabilize.

h1h2
A 0 2
B 0 0
C 1 4
D 1 0
E 3 2
F 3 4
G 1 2

Insertions run into trouble if we run into a cycle.

If that happens, perform a rehash by choosing a new h1 and h2 and inserting all elements back into the tables. Multiple rehashes might be necessary before this succeeds. Implementation may use one or two or K tables.

5.6.3 Universal Hashing

Definition 5.1.

A family H of hash functions is universal, if for any $x \neq y$, the number of hash functions $h$ in $H$ for which $h(x) = h(y)$ is at most $|H|/M$. (M for TableSize)

Notice that this definition holds for each pair of items, rather than being averaged over all pairs of items. The definition above means that if we choose a hash function randomly from a universal family H, then the probability of a collision between any two distinct items is at most $1/M$, and when adding into a table with $N$ items, the probability of a collision at the initial point is at most $N/M$, or the load factor.

Definition 5.2.

A family H of hash functions is k-universal, if for any $x_1 = y_1, x_2 = y_2, ... , x_k = y_k$, the number of hash functions $h$ in $H$ for which $h(x_1) = h(y_1), h(x_2) = h(y_2),...,$ and $h(x_k) = h(y_k)$ is at most $|H|/M^k$.

Theorem 5.4.

The hash family $H = {H_{a,b}(x) = ((ax + b)\ mod\ p)\ mod\ M, where 1 ≤ a ≤ p − 1,0 ≤ b ≤ p − 1}$ is universal. p be a prime larger than the largest input key. a and b are chosen randomly:

For example, in this family, three of the possible random choices of (a, b) yield three different hash functions: