Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

(For a more up-to-date version of these notes, see http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf.)

In a data stream computation we are presented with a sequence of pairs (i_t,c_t), where 1≤i_t≤n is an index and c_t is a count, and we want to maintain a small data structure, known as a sketch, that will allow us to approximately answer statistical queries about the vector a given by a_i = ∑_{t, i_t=i} c_t. The size of the sketch should be polylogarithmic in the size of a and the length of the stream and polynomial in the error bounds, and updating the sketch given a new data point should be cheap. The motivation is the existence of data sets that are too large to store at all (e.g., network traffic statistics), or too large to store in fast memory (e.g., very large database tables). By building a sketch we can make one pass through the data set but answer queries after the fact, with some loss of accuracy.

The count-min sketch of Cormode and Muthukrishnan (see also MitzenmacherUpfal §13.4) gives approximations of a_i, ∑_{i=l..r} a_i, and a⋅b with various error bounds, and can be used for more complex tasks like finding "heavy hitters" (indices with high weight). The easiest case is approximating a_i when all c_t are non-negative, so we'll concentrate on that.

1. Structure

A bit like a Bloom filter made out of counters.

Build an array c with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, where ε is the error bound and δ is the probability of exceeding the error bound. Choose d hash functions h_1,…,h_d, each mapping {1..n} to {1..w}, independently of one another from a pairwise-independent family. Initialize c to all zeroes.
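As a rough illustration of the sizing rule, here is a minimal Python sketch. The names sketch_dimensions, make_hash, and new_sketch are made up for this example, and the hash family ((a·i + b) mod p) mod w is an assumption: it is the usual Carter-Wegman construction, which gives collision probability at most 1/w for distinct indices below p, and that is the only property the analysis in these notes uses. Columns are numbered 0..w-1 rather than 1..w.

```python
import math
import random

# Assumption: indices are non-negative integers smaller than this prime.
MERSENNE_PRIME = (1 << 61) - 1

def sketch_dimensions(eps, delta):
    """Return (w, d) with w = ceil(e/eps) and d = ceil(ln(1/delta))."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1.0 / delta))
    return w, d

def make_hash(w, rng):
    """Draw one hash function i -> ((a*i + b) mod p) mod w at random."""
    a = rng.randrange(1, MERSENNE_PRIME)
    b = rng.randrange(0, MERSENNE_PRIME)
    return lambda i: ((a * i + b) % MERSENNE_PRIME) % w

def new_sketch(eps, delta, seed=None):
    """Return (counts, hashes): a d-by-w array of zeroes and d hash functions."""
    rng = random.Random(seed)
    w, d = sketch_dimensions(eps, delta)
    hashes = [make_hash(w, rng) for _ in range(d)]  # drawn independently per row
    counts = [[0] * w for _ in range(d)]
    return counts, hashes
```

For example, ε = 0.01 and δ = 0.01 give a 5-by-272 array, independent of n and of the length of the stream.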

2. Updates

Given an update (i_t,c_t), increment c[j,h_j(i_t)] by c_t for j=1..d. (This is the count part of count-min.)
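The update rule might look like the following in Python. This is only a sketch under assumptions: the array is a list of d rows, the toy sizes w = 8 and d = 3 are arbitrary, and the hash functions are an illustrative ((a·i + b) mod p) mod w family rather than anything prescribed by the notes.

```python
import random

def update(counts, hashes, i, c):
    """Apply the stream item (i, c): add c to one counter in every row."""
    for j, h in enumerate(hashes):
        counts[j][h(i)] += c

# Toy setup (assumed sizes, illustrative hash family).
w, d = 8, 3
P = (1 << 61) - 1
rng = random.Random(0)
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
hashes = [lambda i, a=a, b=b: ((a * i + b) % P) % w for a, b in params]
counts = [[0] * w for _ in range(d)]

for i, c in [(3, 5), (7, 2), (3, 1)]:
    update(counts, hashes, i, c)
```

Each update touches exactly d counters, so every row always sums to the total of the counts seen so far.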

3. Queries

3.1. Point queries

Here we want to estimate a_i for some fixed i. There are two cases, depending on whether the increments are all non-negative, or arbitrary. In both cases we will get an estimate whose error is linear in both the error parameter ε and the L1-weight ‖a‖₁ of a. It follows that the relative error will be low for heavy points, but we may get a large relative error for light points (and especially large for points that don't appear in the data set at all).

3.1.1. Non-negative case

To estimate a_i, compute â_i = min_j c[j,h_j(i)]. (This is the min part.) Then a_i ≤ â_i, and with probability at least 1-δ, â_i ≤ a_i + ε‖a‖₁.

Proof: The lower bound is easy. Since each pair (i,c_t) increments each c[j,h_j(i)] by c_t, and no increment is negative, we have an invariant that a_i ≤ c[j,h_j(i)] for every j, and hence a_i ≤ â_i, throughout the computation.

For the upper bound, let I_ijk be the indicator for the event that (i≠k) ∧ (h_j(i) = h_j(k)), i.e., that we get a collision between i and k using h_j. Pairwise independence of h_j gives E[I_ijk] = 1/w ≤ ε/e.

Now let X_ij = ∑_{k=1..n} I_ijk a_k. Then c[j,h_j(i)] = a_i + X_ij. (The fact that X_ij≥0 gives an alternate proof of the lower bound.) Now use linearity of expectation to get E[X_ij] = E[∑_k I_ijk a_k] = ∑_k a_k E[I_ijk] ≤ ∑_k a_k (ε/e) = (ε/e)‖a‖₁. So Pr[c[j,h_j(i)] > a_i + ε‖a‖₁] = Pr[X_ij > ε‖a‖₁] ≤ Pr[X_ij > e E[X_ij]] < 1/e, by Markov's inequality. With d choices for j, and each hash function chosen independently, the probability that every count is too big is at most (1/e)^d = exp(-d) ≤ exp(-ln(1/δ)) = δ.
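To make the non-negative point query concrete, here is a small self-contained Python sketch. The class name CountMinSketch and its methods are hypothetical, and the hash family ((a·i + b) mod p) mod w is an assumed stand-in for the pairwise-independent functions h_j.

```python
import math
import random

class CountMinSketch:
    """Count-min sketch for non-negative increments (illustrative only)."""
    P = (1 << 61) - 1  # assumption: indices are integers in [0, P)

    def __init__(self, eps, delta, seed=None):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1.0 / delta))
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(self.d)]
        self.counts = [[0] * self.w for _ in range(self.d)]

    def _cell(self, j, i):
        a, b = self.params[j]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c):
        if c < 0:
            raise ValueError("the min estimate assumes non-negative increments")
        for j in range(self.d):
            self.counts[j][self._cell(j, i)] += c

    def query(self, i):
        """Return the estimate min_j c[j, h_j(i)], which never undercounts."""
        return min(self.counts[j][self._cell(j, i)] for j in range(self.d))
```

The lower bound a_i ≤ â_i holds for every choice of hash functions; only the upper bound a_i + ε‖a‖₁ is probabilistic.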

3.1.2. General case

If the increments might be negative, instead of using the min count we use the median count: â_i = median_j c[j,h_j(i)]. We again define the error term X_ij as above, and observe that E[|X_ij|] = E[|∑_k I_ijk a_k|] ≤ ∑_k |a_k| E[I_ijk] ≤ ∑_k |a_k| (ε/e) = (ε/e)‖a‖₁. Using Markov's inequality, we get Pr[|X_ij| > 3ε‖a‖₁] ≤ Pr[|X_ij| > 3e E[|X_ij|]] < 1/(3e) < 1/8. In order for the median to be off by more than 3ε‖a‖₁, we need at least d/2 of these low-probability events to occur. The expected number that occur is at most μ = d/8, so applying the Chernoff bound to the number S of such events we are looking at Pr[S ≥ (1+3)μ] ≤ (e³/4⁴)^{d/8} = (e^{3/8}/2)^d ≤ (e^{3/8}/2)^{ln(1/δ)} = δ^{ln 2 - 3/8} < δ^{1/4} (the actual exponent is about 0.31, but 1/4 is easier to deal with). It follows that

Pr[a_i-3ε‖a‖₁ ≤ â_i ≤ a_i+3ε‖a‖₁] > 1-δ^{1/4}.

One way to think about this is that getting an estimate within ε‖a‖₁ of the right value with probability at least 1-δ requires 3 times the width and 4 times the depth—or 12 times the space and 4 times the time—when we aren't assuming increments are non-negative.
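A minimal Python sketch of the median-based query, assuming the caller picks the width w and depth d directly (for instance the larger values suggested by the analysis). The class name SignedCountMin and the hash family ((a·i + b) mod p) mod w are illustrative assumptions, not part of the original notes.

```python
import random
import statistics

class SignedCountMin:
    """Count-min style sketch queried with the median; increments may be negative."""
    P = (1 << 61) - 1  # assumption: indices are integers in [0, P)

    def __init__(self, w, d, seed=None):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]
        self.counts = [[0] * w for _ in range(d)]

    def _cell(self, j, i):
        a, b = self.params[j]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c):
        # Same update rule as before; c can now have either sign.
        for j in range(self.d):
            self.counts[j][self._cell(j, i)] += c

    def query(self, i):
        """Return median_j c[j, h_j(i)]; the error can go in either direction."""
        return statistics.median(
            self.counts[j][self._cell(j, i)] for j in range(self.d))
```

Unlike the min estimate, the median can undershoot as well as overshoot a_i, which is why the guarantee above is two-sided. With an even depth, statistics.median averages the two middle counters and may return a non-integer.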

3.2. Inner products

Here we want to estimate a⋅b, where a and b are both stored as count-min sketches c_a and c_b built with the same hash functions. The paper concentrates on the case where a and b are both non-negative, which has applications in estimating the size of a join in a database. The method is to estimate a⋅b as min_j ∑_k c_a[j,k]⋅c_b[j,k].

For a single j, the sum consists of both good values and bad collisions; we have ∑_k c_a[j,k]⋅c_b[j,k] = ∑_{i=1..n} a_i b_i + ∑_{p≠q, h_j(p)=h_j(q)} a_p b_q. The second term has expectation ∑_{p≠q} Pr[h_j(p)=h_j(q)] a_p b_q ≤ ∑_{p≠q} (ε/e) a_p b_q ≤ ∑_{p,q} (ε/e) a_p b_q ≤ (ε/e)‖a‖₁‖b‖₁. So as in the point-query case we get probability at most e^{-1} that a single j gives a value that's more than ε‖a‖₁‖b‖₁ too high, so the probability that the min value is too high is at most exp(-d) ≤ δ.
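The inner-product estimate can be sketched in Python as follows. The helper names make_hashes, sketch_vector, and inner_product_estimate are invented for this example, vectors are given as dicts from index to non-negative weight, and the hash family is an assumed ((a·i + b) mod p) mod w construction. Both sketches have to use the same hash functions, otherwise the row-wise products are meaningless.

```python
import random

P = (1 << 61) - 1  # assumption: indices are integers in [0, P)

def make_hashes(w, d, seed=None):
    """Draw d hash functions into {0..w-1}, to be shared by both sketches."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return [lambda i, a=a, b=b: ((a * i + b) % P) % w for a, b in params]

def sketch_vector(vec, hashes, w):
    """Build the count array for a vector given as {index: weight}."""
    counts = [[0] * w for _ in hashes]
    for i, value in vec.items():
        for j, h in enumerate(hashes):
            counts[j][h(i)] += value
    return counts

def inner_product_estimate(ca, cb):
    """Return min_j sum_k ca[j][k] * cb[j][k]."""
    return min(sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(ca, cb))
```

For non-negative vectors every row's sum is at least a⋅b, since collisions only add non-negative cross terms, so taking the minimum can only help.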

3.3. Finding heavy hitters

Here we want to find the heaviest elements in the set: those indices i for which a_i exceeds φ‖a‖₁ for some constant threshold φ. The easy case is when increments are non-negative (for the general case, see the paper), and uses a method from a previous paper by Charikar, Chen, and Farach-Colton (http://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf). Instead of trying to find the elements after the fact, we extend the data structure and update procedure to track all the heavy elements found so far (stored in a heap), as well as ‖a‖₁ = ∑ c_t. When a new increment (i,c) comes in, we first update the count-min structure and then do a point query on a_i; if â_i ≥ φ‖a‖₁, we insert i into the heap, and if not, we delete i along with any other value whose stored point-query estimate has dropped below threshold.

The trick here is that the threshold φ‖a‖₁ only increases over time (remember that we are assuming non-negative increments). So if some element i is below threshold at time t, it can only go above threshold if it shows up again, and we have a probability of at least 1-δ of including it then.
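One possible way to code the tracking procedure in Python is shown below. This is a sketch under assumptions rather than the paper's implementation: the class name HeavyHitters is made up, the hash family is an illustrative ((a·i + b) mod p) mod w construction, and the heap uses lazy deletion (stale entries stay in the heap until they reach the top) alongside a dict of the current candidates.

```python
import heapq
import random

class HeavyHitters:
    """Track indices whose estimate is at least phi * ||a||_1 (non-negative increments)."""
    P = (1 << 61) - 1  # assumption: indices are integers in [0, P)

    def __init__(self, w, d, phi, seed=None):
        rng = random.Random(seed)
        self.w, self.phi, self.total = w, phi, 0
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]
        self.counts = [[0] * w for _ in range(d)]
        self.best = {}  # index -> stored estimate, for current candidates
        self.heap = []  # min-heap of (estimate, index); may hold stale entries

    def update(self, i, c):
        self.total += c  # running value of ||a||_1
        cells = [((a * i + b) % self.P) % self.w for a, b in self.params]
        for j, k in enumerate(cells):
            self.counts[j][k] += c
        est = min(self.counts[j][k] for j, k in enumerate(cells))
        threshold = self.phi * self.total
        if est >= threshold:
            self.best[i] = est
            heapq.heappush(self.heap, (est, i))
        else:
            self.best.pop(i, None)
        # Evict every candidate whose stored estimate fell below the threshold.
        while self.heap and self.heap[0][0] < threshold:
            old_est, k = heapq.heappop(self.heap)
            if self.best.get(k) == old_est:
                del self.best[k]

    def heavy(self):
        """Return the current candidate indices in sorted order."""
        return sorted(self.best)
```

In this version a stored estimate is never smaller than the true count at the index's last appearance, so an index that really is above threshold at the end is always reported; the randomness only affects how many light indices slip in as false positives.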
