For an updated version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. For practical implementation of hash tables in C, see C/HashTables.

These are theoretical notes on hashing based largely on MotwaniRaghavan §§8.4–8.5 (which is in turn based on work of Carter and Wegman on universal hashing and of Fredman, Komlós, and Szemerédi on O(1) worst-case hashing).

# 1. Hashing: basics

- Want to store s elements from a universe M of size m in a table with indices drawn from an index space N of size n.
- Typically we assume M = [m] = {0..m-1} and N = [n] = {0..n-1}.
- If m ≤ n, just use an array. Otherwise use a **hash function** h:M→N.
- When m > n, we always get **collisions**: pairs (x,y) with x≠y and h(x) = h(y).
- Collisions are handled using some secondary data structure in each bin.
  - Linked lists are typical (hashing with **chaining**).
  - FKS (see below) uses hash tables.

- Choosing a random hash function from some good family of hash functions keeps the expected number of collisions small even for adversarially chosen inputs, because the adversary can't predict which elements will collide.

Goal: Do each search in O(1+s/n) expected time, which for s = O(n) is much better than the Θ(log n) time for balanced trees or SkipLists.

This only works if we are working in a RAM (random-access machine model), where we can access arbitrary memory locations in time O(1) and similarly compute arithmetic operations on O(log m)-bit values in time O(1). There is an argument that in reality any actual RAM machine requires either Ω(log N) time to read one of N memory locations (routing costs) or, if one is particularly pedantic, Ω(N^{1/3}) time (speed of light + finite volume for each location). We will ignore this argument.

# 2. Universal hash families

A family of hash functions H is:

- **2-universal** if for any x≠y, Pr[h(x)=h(y)] ≤ 1/n for a uniform random h∈H.
- **Strongly 2-universal** if for any x_{1}≠x_{2}∈M and any y_{1},y_{2}∈N, Pr[h(x_{1})=y_{1} ∧ h(x_{2})=y_{2}] ≤ 1/n^{2} for a uniform random h∈H.
- **k-universal** usually means **strongly k-universal**: given distinct x_{1}...x_{k} and any y_{1}...y_{k}, Pr[h(x_{i})=y_{i} ∀i] ≤ 1/n^{k}. It is possible to generalize the weak version of 2-universality to get a weak version of k-universality (Pr[h(x_{1}), ..., h(x_{k}) are all equal] ≤ 1/n^{k-1}), but this generalization is not as useful as strong k-universality.

- Let δ(x,y,h) = 1 if x≠y and h(x)=h(y), and 0 otherwise.
- Abuse of notation: for sets, δ(X,Y,H) = ∑_{x∈X, y∈Y, h∈H} δ(x,y,h), with e.g. δ(x,Y,h) = δ({x},Y,{h}).

- H is 2-universal iff δ(x,y,H) ≤ |H|/n for all x≠y.
- If H is all functions M→N, we get equality; but a smaller family might do better, since a particular h might tend to map distinct values to distinct places.
- Lower bound: For any family H, ∃x,y s.t. δ(x,y,H) ≥ |H|/n - |H|/m.
  - Proof: Count collisions in the inverse image of each element z.
  - Have δ(h^{-1}(z), h^{-1}(z), h) = |h^{-1}(z)|⋅(|h^{-1}(z)|-1).
  - Sum over all z to get δ(M,M,h) = ∑_{z} |h^{-1}(z)|⋅(|h^{-1}(z)|-1).
  - Use convexity or Lagrange multipliers to argue that the right-hand side is minimized, subject to ∑_{z} |h^{-1}(z)| = m, when all pre-images have the same size m/n.
  - This gives δ(M,M,h) ≥ n(m/n)(m/n-1) = m^{2}(1/n - 1/m).
  - Sum over all h to get δ(M,M,H) ≥ |H| m^{2}(1/n - 1/m).
  - PigeonholePrinciple says some pair x≠y has δ(x,y,H) ≥ δ(M,M,H)/(m(m-1)) > |H|(1/n - 1/m). (Note a bit of slop there: we divide by m(m-1) < m^{2}.)
- Since 1/m is usually much smaller than 1/n, we are happy if we get the 2-universal upper bound of |H|/n.

Claim: h_{ab}(x) = ((ax + b) mod p) mod n, where a∈ℤ_{p}-{0}, b∈ℤ_{p}, and p is a prime ≥ m, gives a 2-universal family.

- Proof: Again, count collisions.
  - Split h_{ab} as g(f_{ab}(x)), where f_{ab}(x) = (ax+b) mod p and g(x) = x mod n.
  - Claim: If x≠y, then δ(x,y,H) = δ(ℤ_{p},ℤ_{p},g).
    - Proof: Let r≠s ∈ ℤ_{p}. Then ax+b = r and ay+b = s has a unique solution (a,b) mod p (because ℤ_{p} is a FiniteField); note that a ≠ 0 since r ≠ s.
    - So (a,b) ↦ (f_{ab}(x), f_{ab}(y)) is a 1-1 map between pairs (a,b) and pairs (r,s) with r≠s.
    - ⇒ δ(x,y,H) = ∑_{r≠s} δ(r,s,g) = δ(ℤ_{p},ℤ_{p},g).
  - For fixed r, ⌈p/n⌉ is an upper bound on the number of elements in g^{-1}(g(r)); subtract one to get the number of s that collide with r.
  - So the total number of collisions δ(ℤ_{p},ℤ_{p},g) ≤ p(⌈p/n⌉-1) ≤ p(p-1)/n = |H|/n.


- A 2-universal hash family ⇒ hashing with chaining costs O(1+s/n) expected time per operation.
- Proof: E[cost of operation on key x] = O(1 + E[length of linked list at h(x)]) = O(1 + E[# of y∈S that collide with x]) = O(1 + s⋅Pr[a fixed y≠x collides with x]) = O(1+s/n).

# 3. FKS hashing with O(s) space and O(1) worst-case search time

Goal is to hash a static set S so that we never pay more than constant time for search (not just in expectation).

A **perfect hash function** for a set S is a hash function h:M→N that is injective on S (x≠y ⇒ h(x)≠h(y) for x,y∈S).

Claim: If H is 2-universal and s^{2} ≤ n, then there is a perfect h∈H for S.

- Proof: Usual collision-counting argument.
  - For all x≠y, have δ(x,y,H) ≤ |H|/n.
  - So δ(S,S,H) ≤ s(s-1) |H|/n.
  - PigeonholePrinciple says ∃h∈H such that δ(S,S,h) ≤ s(s-1)/n < 1.
  - δ(S,S,h) < 1 ⇒ δ(S,S,h) = 0, since δ(S,S,h) is always an integer.

Practical variant: If s^{2} ≤ αn for some α < 1, then Pr[h is perfect for S] ≥ 1 - α: by Markov's inequality, Pr[δ(S,S,h) ≥ 1] ≤ E[δ(S,S,h)] ≤ s(s-1)/n < α.

- This immediately gives O(1) search time with O(s^{2}) space.
- Next step: Hash to n = s bins, then rehash perfectly within each bin.
- Time cost is O(1) for the outer hash plus O(1) for the inner (perfect) hash, giving O(1) total.
- Space cost is ∑_{i} O(s_{i}^{2}), where s_{i} = | { x∈S | h(x) = i } | = | h^{-1}(i) ∩ S |.
- Observe that ∑_{i} s_{i}^{2} = ∑_{i} (s_{i} + s_{i}(s_{i}-1)) = s + δ(S,S,h).
- Since H is 2-universal, have δ(S,S,H) ≤ |H| s(s-1)/n.
- PigeonholePrinciple says ∃h such that δ(S,S,h) ≤ (1/|H|) δ(S,S,H) ≤ s(s-1)/n = s-1 < s, using n = s.
- So the space cost is really O(s) + O(s) = O(s).

- The same practical consideration as above applies: by making n = 2s (say), we get a constant probability of choosing a good h (one with δ(S,S,h) ≤ s); keep trying until we get one.

# 4. Cuckoo hashing

Goal: Get O(1) search time in a dynamic hash table at the cost of a messy insertion procedure. In fact, each search takes only two reads, which can be done in parallel; this is optimal by a lower bound of Pagh (probe.pdf), which also shows a matching upper bound for static dictionaries. Cuckoo hashing is an improved version of this result that allows for dynamic insertions.

We'll mostly be following the presentation in the original cuckoo hashing paper by Pagh and Rodler: cuckoo-jour.pdf.

## 4.1. Structure

We have two tables T_{1} and T_{2} of size r each, with separate, independent hash functions h_{1} and h_{2}. These functions are assumed to be k-universal for some sufficiently large value k; as long as we never look at more than k values at once, this means we can treat them effectively as random functions. (In practice, using crummy hash functions seems to work just fine, a common property of hash tables.)

Every key x is stored either in T_{1}[h_{1}(x)] or T_{2}[h_{2}(x)]. So the search procedure just looks at both of these locations and returns whichever one contains x (or fails if neither contains x).

To insert a value x_{1}, we put it in T_{1}[h_{1}(x_{1})] or T_{2}[h_{2}(x_{1})]. If one or both of these locations is empty, we put it there. Otherwise we have to kick out some value that is in the way (this is the cuckoo part of cuckoo hashing, named after the bird that lays its eggs in other birds' nests). So we let x_{2} be the value previously stored in T_{1}[h_{1}(x_{1})], and put x_{1} in T_{1}[h_{1}(x_{1})]. We now have a new "nestless" value x_{2}, which we swap with whatever is in T_{2}[h_{2}(x_{2})]. If that location was empty, we are done; otherwise, we get a new value x_{3} that we have to put in T_{1}, and so on. The procedure terminates when we find an empty spot, or when enough iterations have passed that we don't expect to find one, in which case we rehash the entire table. The code from the Pagh-Rodler paper looks like this:

    procedure insert(x)
        if lookup(x) then return
        loop MaxLoop times
            x ↔ T_{1}[h_{1}(x)]
            if x = ⊥ then return
            x ↔ T_{2}[h_{2}(x)]
            if x = ⊥ then return
        end loop
        rehash()
        insert(x)
    end

A detail not included in the above code is that we always rehash (in theory) after r^{2} insertions; this avoids potential problems with the hash functions used in the paper not being universal enough.

The main question is how long it takes the insertion procedure to terminate, assuming the table is not too full. Letting s = |S|, we will assume r/s ≥ 1+ε for some fixed ε.

First let's look at what happens during an insert if we have a lot of nestless values. We have a sequence of values x_{1}, x_{2}, ..., where each consecutive pair x_{i}, x_{i+1} collides in h_{1} or h_{2}. Assuming we don't reach the MaxLoop limit, there are three main possibilities (the cases below):

- Eventually we reach an empty position without seeing the same key twice.
- Eventually we see the same key twice: there are some i and j>i such that x_{j} = x_{i}. Since x_{i} was already moved once, when we reach it the second time we will try to move it back, displacing x_{i-1}. This process continues until we have restored x_{2} to T_{1}[h_{1}(x_{1})], displacing x_{1} to T_{2}[h_{2}(x_{1})] and possibly creating a new sequence of nestless values. Two outcomes are now possible:
  - Some x_{l} is moved to an empty location. We win!
  - Some x_{l} is moved to a location we've already looked at. We lose! In this case we are playing musical chairs with more players than chairs, and have to rehash.

Let's look at the probability that we get the last, *closed loop* case. Following Pagh-Rodler, we let v be the number of distinct nestless keys in the loop. We can now count how many different ways such a loop can form: there are v^{3} choices for i, j, and l, r^{v-1} choices of cells for the loop, and s^{v-1} choices for the non-x_{1} elements of the loop. Each of the 2v hash values involved matches its cell with probability r^{-1}, giving a total probability of v^{3} r^{v-1} s^{v-1} r^{-2v} = v^{3}(s/r)^{v}/(rs). Summing this over all v gives (1/(rs)) ∑ v^{3}(s/r)^{v} = O(1/(rs)) = O(1/r^{2}) (the series converges under the assumption that s/r < 1). Since the cost of hitting a closed loop is O(r), this case adds only O(1) to the expected insertion cost.
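As a check on the arithmetic, the per-loop probability simplifies as follows, with the sum converging because s/r ≤ 1/(1+ε) < 1:

```latex
v^3 \, r^{v-1} s^{v-1} r^{-2v}
  = v^3 \, s^{v-1} r^{-(v+1)}
  = \frac{v^3}{rs}\left(\frac{s}{r}\right)^{v},
\qquad
\frac{1}{rs}\sum_{v \ge 1} v^3 \left(\frac{s}{r}\right)^{v}
  = O\!\left(\frac{1}{rs}\right).
```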

Now we look at what happens if we don't get a closed loop. It's a little messy to analyze the behavior of keys that appear more than once in the sequence, so the trick used in the paper is to observe that for any sequence of nestless keys x_{1}...x_{p}, there is a subsequence of size at least p/3 with no repetitions that starts with x_{1}. Since no key can show up more than twice, there are only two candidate subsequences that start with x_{1}: either x_{1}...x_{j-1} or x_{1}=x_{i+j-1}...x_{p}, and a case analysis shows that at least one of these is big. We can then argue that the probability of getting a sequence of v distinct keys starting with x_{1} in either T_{1} or T_{2} is at most 2(s/r)^{v-1} (since we have to hit a nonempty spot, with probability ≤ s/r, at each step, but there are two possible starting locations), which gives an expected insertion time bounded by ∑ 3v (s/r)^{v-1} = O(1).

Using a slightly more detailed analysis (see the paper), it can be shown that the bound for non-constant ε is O(1+1/ε).

# 5. Bloom filters

See MitzenmacherUpfal §5.5.3 for basics and a principled analysis or Bloom filter for many variations and the collective wisdom of the unwashed masses.