(For a more up-to-date version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf.)

In a **data stream computation** we are presented with a sequence of pairs (i_{t},c_{t}) where 1≤i_{t}≤n is an **index** and c_{t} is a **count**, and we want to maintain a small data structure, known as a **sketch**, that will allow us to approximately answer statistical queries about the vector a given by a_{i} = ∑_{t, i[t]=i} c_{t}. The size of the sketch should be polylogarithmic in the size of a and the length of the stream and polynomial in the error bounds, and updating the sketch given a new data point should be cheap. The motivation is the existence of data sets that are too large to store at all (e.g., network traffic statistics), or too large to store in fast memory (e.g., very large database tables). By building a sketch we can make one pass through the data set but answer queries after the fact, with some loss of accuracy.

The **count-min** sketch of Cormode and Muthukrishnan (see also Mitzenmacher and Upfal §13.4) gives approximations of a_{i}, ∑_{i=l..r} a_{i}, and a⋅b with various error bounds, and can be used for more complex tasks like finding "heavy hitters"—indices with high weight. The easiest case is approximating a_{i} when all c_{t} are non-negative, so we'll concentrate on that.

# 1. Structure

A bit like a Bloom filter made out of counters.

Build a two-dimensional array c with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, where ε is the error bound and δ is the probability of exceeding the error bound. Choose d pairwise-independent hash functions h_{1},…,h_{d}, each chosen independently and each mapping indices 1…n to columns 1…w. Initialize c to all zeroes.
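A minimal way to set this structure up might look like the following Python sketch. The function names, the prime p, and the specific hash family h(x) = ((ax+b) mod p) mod w are illustrative assumptions, not part of the notes; any pairwise-independent family would do.

```python
# Hypothetical setup for a count-min sketch.  Each of the d rows gets
# its own pairwise-independent hash h_j(x) = ((a*x + b) mod p) mod w,
# with p a prime larger than the index universe.
import math
import random

def make_sketch(eps, delta, p=2**61 - 1, seed=None):
    rng = random.Random(seed)
    w = math.ceil(math.e / eps)          # width  w = ceil(e/eps)
    d = math.ceil(math.log(1 / delta))   # depth  d = ceil(ln(1/delta))
    return {
        "w": w, "d": d, "p": p,
        # one (a, b) pair per row; a != 0 so the map is non-degenerate
        "h": [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)],
        "c": [[0] * w for _ in range(d)],   # counters, all zero
    }

def bucket(s, j, i):
    """Row j's hash of index i: which counter i lands in."""
    a, b = s["h"][j]
    return ((a * i + b) % s["p"]) % s["w"]
```

Note that the space is d·w = O((1/ε) log(1/δ)) counters, independent of n.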

# 2. Updates

Given an update (i_{t},c_{t}), increment c[j,h_{j}(i_{t})] by c_{t} for each j=1…d. (This is the *count* part of count-min.)
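As a sketch of this step (with assumed names and a toy width and depth), each update bumps exactly one counter per row:

```python
# Illustrative update step for a count-min sketch; the hash family
# and all names here are assumptions, not from the notes.
import random

P = 2**61 - 1                    # prime modulus for the hash family
W, D = 16, 4                     # toy width and depth
_rng = random.Random(1)
H = [(_rng.randrange(1, P), _rng.randrange(P)) for _ in range(D)]
c = [[0] * W for _ in range(D)]  # the counter array

def h(j, i):
    a, b = H[j]
    return ((a * i + b) % P) % W

def update(i, ct):
    """Process the stream pair (i_t, c_t): one increment per row."""
    for j in range(D):
        c[j][h(j, i)] += ct
```

Each update touches d counters, so processing a stream element costs O(d) = O(ln(1/δ)) time.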

# 3. Queries

## 3.1. Point queries

Here we want to estimate a_{i} for some fixed i. There are two cases, depending on whether the increments are all non-negative, or arbitrary. In both cases we will get an estimate whose error is linear in both the error parameter ε and the L1-weight ‖a‖_{1} of a. It follows that the relative error will be low for heavy points, but we may get a large relative error for light points (and especially large for points that don't appear in the data set at all).

### 3.1.1. Non-negative case

To estimate a_{i}, compute â_{i} = min_{j} c[j,h_{j}(i)]. (This is the *min* part.) Then a_{i} ≤ â_{i}, and with probability at least 1-δ, â_{i} ≤ a_{i} + ε‖a‖_{1}.

Proof: The lower bound is easy. Since each pair (i_{t},c_{t}) with i_{t}=i increments c[j,h_{j}(i)] by c_{t} for every j, and no counter ever decreases, we have an invariant that a_{i}≤â_{i} throughout the computation.

For the upper bound, let I_{ijk} be the indicator for the event that (i≠k) ∧ (h_{j}(i) = h_{j}(k)), i.e., that we get a collision between i and k using h_{j}. Pairwise independence of h_{j} gives E[I_{ijk}] = 1/w ≤ ε/e.

Now let X_{ij} = ∑_{k=1..n} I_{ijk}a_{k}. Then c[j,h_{j}(i)] = a_{i} + X_{ij}. (The fact that X_{ij}≥0 gives an alternate proof of the lower bound.) Now use linearity of expectation to get E[X_{ij}] = E[∑_{k} I_{ijk}a_{k}] = ∑_{k} a_{k}E[I_{ijk}] ≤ ∑_{k} a_{k} (ε/e) = (ε/e)‖a‖_{1}. So Pr[c[j,h_{j}(i)] > a_{i} + ε‖a‖_{1}] ≤ Pr[X_{ij} > e·E[X_{ij}]] ≤ 1/e, by Markov's inequality. With d choices for j, each using an independently chosen hash function, the probability that every count is too big is at most (1/e)^{d} = exp(-d) ≤ exp(-ln(1/δ)) = δ.
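The estimate and its one-sided guarantee can be checked on a toy example. The class below is an illustrative sketch with assumed names and hash family, not the paper's code:

```python
# Toy count-min point query for non-negative increments.  All names,
# the hash family, and the parameters are illustrative assumptions.
import random

class CountMin:
    def __init__(self, w, d, p=2**61 - 1, seed=0):
        rng = random.Random(seed)
        self.w, self.d, self.p = w, d, p
        self.h = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
        self.c = [[0] * w for _ in range(d)]

    def _bucket(self, j, i):
        a, b = self.h[j]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, ct):
        for j in range(self.d):
            self.c[j][self._bucket(j, i)] += ct

    def query(self, i):
        # the "min" part: collisions only inflate counters, so the
        # smallest counter for i is the least-contaminated estimate
        return min(self.c[j][self._bucket(j, i)] for j in range(self.d))
```

Because every counter for i equals a_{i} plus non-negative collision noise, the returned minimum never underestimates a_{i}.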

### 3.1.2. General case

If the increments might be negative, instead of using the min count we use the median count: â_{i} = median_{j} c[j,h_{j}(i)]. We again define the error term X_{ij} as above, and observe that E[|X_{ij}|] = E[|∑_{k} I_{ijk}a_{k}|] ≤ ∑_{k} |a_{k}|E[I_{ijk}] ≤ ∑_{k} |a_{k}|(ε/e) = (ε/e)‖a‖_{1}. Using Markov's inequality, we get Pr[|X_{ij}| > 3ε‖a‖_{1}] ≤ Pr[|X_{ij}| > 3e·E[|X_{ij}|]] ≤ 1/(3e) < 1/8. In order for the median to be off by more than 3ε‖a‖_{1}, at least d/2 of these low-probability events must occur. The expected number that occur is μ = d/8, so letting S be the actual number and applying the Chernoff bound we are looking at Pr[S ≥ (1+3)μ] ≤ (e^{3}/4^{4})^{d/8} ≤ (e^{3/8}/2)^{ln (1/δ)} = δ^{ln 2 - 3/8} < δ^{1/4} (the actual exponent is about 0.31, but 1/4 is easier to deal with). It follows that

Pr[a_{i}-3ε‖a‖_{1} ≤ â_{i} ≤ a_{i}+3ε‖a‖_{1}] > 1-δ^{1/4}.

One way to think about this is that getting an estimate within ε‖a‖_{1} of the right value with probability at least 1-δ requires 3 times the width and 4 times the depth—or 12 times the space and 4 times the time—when we aren't assuming increments are non-negative.
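Only the combining step changes in code. A hypothetical sketch of the median query (names, hash family, and parameters are assumptions):

```python
# Toy median-based point query for streams that may contain negative
# increments.  All names and parameters are illustrative assumptions.
import random
import statistics

class CountMedian:
    def __init__(self, w, d, p=2**61 - 1, seed=0):
        rng = random.Random(seed)
        self.w, self.d, self.p = w, d, p
        self.h = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
        self.c = [[0] * w for _ in range(d)]

    def _bucket(self, j, i):
        a, b = self.h[j]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, ct):
        for j in range(self.d):
            self.c[j][self._bucket(j, i)] += ct

    def query(self, i):
        # min is no longer safe: a negative collision can drag a
        # counter below a_i, so take the median across the d rows
        return statistics.median(self.c[j][self._bucket(j, i)]
                                 for j in range(self.d))
```

Choosing d odd keeps the median an actual row value rather than an average of two.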

## 3.2. Inner products

Here we want to estimate a⋅b, where a and b are both stored as count-min sketches. The paper concentrates on the case where a and b are both non-negative, which has applications in estimating the size of a join in a database. The method is to estimate a⋅b as min_{j} ∑_{k} c_{a}[j,k]⋅c_{b}[j,k].

For a single j, the sum consists of both good values and bad collisions; we have ∑_{k} c_{a}[j,k]⋅c_{b}[j,k] = ∑_{i=1..n} a_{i}b_{i} + ∑_{p≠q, h_{j}(p)=h_{j}(q)} a_{p}b_{q}. The second term has expectation ∑_{p≠q} Pr[h_{j}(p)=h_{j}(q)]a_{p}b_{q} ≤ ∑_{p≠q} (ε/e)a_{p}b_{q} ≤ ∑_{p,q} (ε/e)a_{p}b_{q} = (ε/e)‖a‖_{1}‖b‖_{1}. So as in the point-query case we get probability at most e^{-1} that a single j gives a value that's more than ε‖a‖_{1}‖b‖_{1} too high, so the probability that the min value is too high is at most exp(-d) ≤ δ.
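A hypothetical sketch of the inner-product estimate, for two sketches built with the *same* hash functions (sharing the hashes is required so the row-wise products line up; all names here are assumptions):

```python
# Toy inner-product estimate from two count-min sketches that share
# their hash functions.  All names and parameters are illustrative.
import random

def make_rows(w, d, p=2**61 - 1, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]

def empty(w, d):
    return [[0] * w for _ in range(d)]

def update(c, hashes, w, p, i, ct):
    for j, (a, b) in enumerate(hashes):
        c[j][((a * i + b) % p) % w] += ct

def inner_product(ca, cb):
    # min over rows of the bucketwise dot product; for non-negative
    # streams, collisions only add noise, so this never underestimates
    return min(sum(x * y for x, y in zip(ra, rb))
               for ra, rb in zip(ca, cb))
```

This is a one-sided estimate like the point query: the true a⋅b is a lower bound on the returned value.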

## 3.3. Finding heavy hitters

Here we want to find the heaviest elements in the set: those indices i for which a_{i} exceeds φ‖a‖_{1} for some constant threshold φ. The easy case is when increments are non-negative (for the general case, see the paper), and uses a method from a previous paper by Charikar, Chen, and Farach-Colton (http://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf). Instead of trying to find the elements after the fact, we extend the data structure and update procedure to track all the heavy elements found so far (stored in a heap), as well as ‖a‖_{1} = ∑ c_{t}. When a new increment (i,c) comes in, we first update the count-min structure and then do a point query on a_{i}; if â_{i} ≥ φ‖a‖_{1}, we insert i into the heap, and if not, we delete i along with any other value whose stored point-query estimate has dropped below threshold.

The trick here is that the threshold φ‖a‖_{1} only increases over time (remember that we are assuming non-negative increments). So if some element i is below threshold at time t, it can only go above threshold if it shows up again, and we have a probability of at least 1-δ of including it then.
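The bookkeeping in that update loop might be sketched as follows. The class names and structure are assumptions, and a plain dict of candidates stands in for the heap in the notes:

```python
# Illustrative heavy-hitter tracking on top of a count-min sketch,
# assuming non-negative increments.  Names are assumptions; a dict of
# candidate indices stands in for the heap described in the notes.
import random

class CountMin:
    def __init__(self, w, d, p=2**61 - 1, seed=0):
        rng = random.Random(seed)
        self.w, self.d, self.p = w, d, p
        self.h = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
        self.c = [[0] * w for _ in range(d)]

    def update(self, i, ct):
        for j, (a, b) in enumerate(self.h):
            self.c[j][((a * i + b) % self.p) % self.w] += ct

    def query(self, i):
        return min(self.c[j][((a * i + b) % self.p) % self.w]
                   for j, (a, b) in enumerate(self.h))

class HeavyHitters:
    def __init__(self, phi, w=64, d=4, seed=0):
        self.phi = phi
        self.sketch = CountMin(w, d, seed=seed)
        self.total = 0       # running ||a||_1 = sum of all c_t seen
        self.heavy = {}      # candidate heavy hitters -> stored estimate

    def update(self, i, ct):
        self.sketch.update(i, ct)
        self.total += ct
        est = self.sketch.query(i)
        if est >= self.phi * self.total:
            self.heavy[i] = est
        # evict any candidate whose stored estimate has dropped below
        # the (only ever increasing) threshold phi * ||a||_1
        for k in [k for k, v in self.heavy.items()
                  if v < self.phi * self.total]:
            del self.heavy[k]
```

Since the threshold only rises, eviction is safe: a truly heavy element will be re-inserted the next time it appears, with probability at least 1-δ per appearance.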