# 1. Count-min banking

Suppose we have a Cormode-Muthukrishnan count-min filter (see DataStreamComputation), but we allow negative increments while still using the min function to compute results.

Show that as long as the total value a_{i} for every index i is non-negative, the same bounds hold as in the non-negative increments case.

Suppose that we are using the count-min filter to process bank transactions, and that we reject any increment (i_{t}, c_{t}) that would cause some counter in the count-min filter to drop below zero. The idea is that at the end of the day, any depositor for which a_{i} > 0 will ask for their money back, while any depositor for which a_{i} < 0 will vanish, leaving the bank with the missing money. Let S be the set of indices i for which a_{i} < 0, and let T be the set of indices i for which a_{i} > 0. What bounds can you put on ∑_{i∈S} a_{i} as a function of the number of indices n and ∑_{i∈T} a_{i}?
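
To make the setup concrete, here is a minimal Python sketch of a count-min filter with the rejection rule described above. The class name and hashing scheme are illustrative assumptions, not a production implementation:

```python
import random

class CountMinBank:
    """Toy count-min sketch that rejects any increment which would drive
    a touched counter below zero (a sketch of the scheme above)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # one random salt per row, standing in for depth hash functions
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, i):
        return [(r, hash((salt, i)) % self.width)
                for r, salt in enumerate(self.salts)]

    def update(self, i, c):
        """Apply increment (i, c); return False (reject) if any touched
        counter would drop below zero."""
        cells = self._cells(i)
        if any(self.table[r][col] + c < 0 for r, col in cells):
            return False
        for r, col in cells:
            self.table[r][col] += c
        return True

    def query(self, i):
        """Estimate a_i as the minimum over the touched counters."""
        return min(self.table[r][col] for r, col in self._cells(i))
```

For example, after `update(1, 5)` a further `update(1, -6)` is rejected, while `update(1, -5)` brings the estimate back down to zero.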

## 1.1. Solution

The essential idea is that the analysis of the count-min filter in this case only uses a_{i} and not the actual increments, so everything goes through.

It is easy to show that -∑_{i∈S} a_{i} ≤ ∑_{i∈T} a_{i}, since the (non-negative) sum of all values in the count-min filter is proportional to the sum of all increments. For a sufficiently long sequence of transactions, it should be possible for the adversary to get arbitrarily close to this bound: it can just send in small negative transactions until it gets lucky and drains out all the cash.
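
As a quick sanity check of the inequality, here is a toy simulation (single-row sketch, i.e. depth 1; all parameters arbitrary) that applies random increments under the rejection rule and verifies -∑_{i∈S} a_{i} ≤ ∑_{i∈T} a_{i}:

```python
import random

rng = random.Random(2)
w, n = 8, 32                          # arbitrary toy parameters
salt = rng.getrandbits(32)
col = lambda i: hash((salt, i)) % w   # a single hash row (depth 1)
table = [0] * w
a = [0] * n                           # true totals a_i
for _ in range(2000):
    i, c = rng.randrange(n), rng.randint(-3, 3)
    if table[col(i)] + c >= 0:        # reject increments that would go negative
        table[col(i)] += c
        a[i] += c
neg = -sum(v for v in a if v < 0)     # -sum over S
pos = sum(v for v in a if v > 0)      # sum over T
assert neg <= pos                     # the bank never loses more than it holds
```

The invariant holds for any run because each accepted increment adds c to exactly one non-negative counter, so ∑_{i} a_{i} equals the sum of the counters, which is ≥ 0.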

# 2. A corrupt random number generator

Suppose you are asked to provide a mechanism that generates a sequence of n random bits, with the guarantees that (a) each bit is equally likely to be 0 or 1, and (b) the bits are pairwise independent. For nefarious reasons of your own, you would like the probability of getting all ones to be as large as possible.

- Use Chebyshev's inequality to get an upper bound on the probability that all bits are one.
- Give a construction that approaches this upper bound to within a constant factor.

## 2.1. Solution

Call the bits X_{1}...X_{n} and let S = ∑ X_{i}. Then ES = ∑ EX_{i} = n/2 and Var[S] = ∑ Var[X_{i}] = n/4 (since the bits are pairwise independent), so from Chebyshev's inequality we get Pr[S ≥ n] ≤ Var[S]/(n/2)^{2} = (n/4)/(n^{2}/4) = 1/n.

Recall the construction of pairwise-independent bits where we generate m independent bits Y_{1}...Y_{m}, and for each nonempty subset S of [m] we let X_{S} = ⊕_{i∈S} Y_{i}. This gives 2^{m}-1 pairwise independent bits, with the property that if all the Y_{i} are 0, so are all the X_{S}. By taking Z_{S} = ¬X_{S} we get that all 2^{m}-1 Z_{S} are 1 with probability exactly 2^{-m}. Let m = ⌈lg (n+1)⌉. Then by taking the first n of the 2^{m}-1 = 2^{⌈lg (n+1)⌉}-1 ≥ n random variables Z, we obtain n pairwise independent random bits that are all 1 with probability 2^{-⌈lg (n+1)⌉} ≥ 2^{-(1 + lg (n+1))} = 1/(2(n+1)). This is (asymptotically) within a factor of 2 of the upper bound, and for n = 2^{m}-1 exactly we get a probability of 2^{-m} = 1/(n+1) of all ones.
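
A short Python sketch of this construction (the function name is hypothetical); for m = 3 it yields n = 7 pairwise independent fair bits that are all 1 exactly when every Y_{i} is 0, i.e. with probability 2^{-m} = 1/8:

```python
import random

def pairwise_ones_bits(m, rng):
    """Generate 2^m - 1 pairwise independent fair bits Z_S = not X_S,
    where X_S is the XOR of Y_i over a nonempty subset S of m
    independent fair bits Y_1..Y_m."""
    y = [rng.randrange(2) for _ in range(m)]
    bits = []
    for mask in range(1, 2 ** m):   # nonempty subsets of [m]
        x = 0
        for i in range(m):
            if mask >> i & 1:
                x ^= y[i]
        bits.append(1 - x)          # complement: Z_S = not X_S
    return bits
```

A Monte Carlo run at m = 3 lands near the exact all-ones probability of 1/8.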

# 3. Hash table expansion

To obtain O(1) search costs, the load factor of a hash table must be bounded by a constant. The usual method for doing this is to grow the hash table if it gets too full. We can abstract this as an on-line problem where we are given a request sequence x_{1}, x_{2}, ..., where x_{t} ∈ ℤ^{+} is the number of elements in the hash table at time t, and we must respond with a sequence of hash table sizes y_{1}, y_{2}, ..., where y_{t} ≥ x_{t} for all t, and whenever y_{t} ≠ y_{t-1} we incur a cost y_{t}.

Give the best algorithm you can for this problem, measured in terms of the competitive ratio. Assume that the adversary is oblivious.

## 3.1. Solution

Without loss of generality we can assume that {x_{t}} is non-decreasing, so that OPT(x_{1}...x_{k}) = x_{k}. In fact, we can go even further and assume that x_{t} = t, since it costs the adversary nothing to add elements slowly. The adversary strategy can then be summarized simply by giving k. On the other side, we can restrict our attention to "lazy" algorithms for which y_{i} is non-decreasing (as it costs us nothing to leave y_{i} high) and that do not change y_{i} unless necessary. A natural way to specify such an algorithm is to give some increasing sequence of sizes s_{1}, s_{2}, ..., where y_{i} is the smallest s_{j} that is at least max_{t≤i} x_{t}.

For the simple doubling strategy, the sequence is 1, 2, 4, 8, .... This gives a competitive ratio that approaches 4 in the limit. One factor of 2 comes from summing the geometric series in the cost: s_{m} + s_{m-1} + ... = s_{m}(1 + 1/2 + 1/4 + ...) = 2s_{m}. The other comes because the adversary can choose k = s_{j} + 1, forcing us to pay roughly 2s_{j+1} = 4s_{j} ≅ 4⋅OPT. We can reduce the constant by making it harder for the adversary to figure out where to stop.
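
A small sketch (helper name hypothetical) that computes the doubling strategy's total cost on x_{t} = t and shows the ratio approaching 4 at the adversary's preferred stopping points k = 2^{j} + 1:

```python
def doubling_cost(k):
    """Total cost of the doubling strategy (sizes 1, 2, 4, ...) on the
    request sequence x_t = t for t = 1..k: pay each power of two up to
    the first one that is at least k."""
    cost, size = 1, 1          # pay for the initial table of size 1
    while size < k:
        size *= 2
        cost += size
    return cost
```

At k = 2^{10} + 1 the cost is 1 + 2 + ... + 2^{11} = 2^{12} - 1, for a ratio of 4095/1025, just under 4.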

We do so by choosing a random value r uniformly in the range [1/2, 1] and letting s_{j} = 2^{j} r. Now suppose the adversary stops at k. Let z be the smallest power of 2 greater than or equal to k (i.e., z = 2^{⌈lg k⌉}). We now consider two cases depending on the value of r:

- If rz ≥ k, then the algorithm's last increase costs rz. This event occurs with probability (z-k)/(z/2) = 2(z-k)/z, and the expected cost of the last increase conditioned on it occurring is (k+z)/2.
- If rz < k, then the algorithm's last increase costs 2rz. This event occurs with probability 1-2(z-k)/z, and the expected cost conditioned on it occurring is (z+2k)/2.

So the algorithm's expected cost for its last increase is (z-k)(z+k)/z + (1-2(z-k)/z)(z+2k)/2 = (z^{2} + 2k^{2})/2z, and its expected cost for all increases is bounded by twice this, (z^{2}+2k^{2})/z. Dividing by k gives a competitive ratio of (z^{2}+2k^{2})/zk = z/k + 2k/z. At k = z this is 3; at k = z/2 it is also 3. Differentiating with respect to k shows a unique extremum in the range [z/2, z] at k = z/√2, where the ratio is 2√2 < 3 (so the extremum is a minimum). So we get 3 as the bound on the competitive ratio using this randomized strategy.
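
A Monte Carlo sketch of this randomized strategy (helper name hypothetical): sizes s_{j} = 2^{j} r with r uniform in [1/2, 1], evaluated on x_{t} = t at k = 1000, a near-worst-case point since z = 1024 sits just above k. The estimated expected ratio should land just under 3:

```python
import random

def randomized_cost(k, r):
    """Total cost of the lazy strategy with sizes s_j = 2^j * r on the
    request sequence x_t = t up to k: pay every size actually adopted,
    starting from the smallest s_j >= 1 (since x_1 = 1)."""
    s = r
    while s < 1:              # sizes below 1 are never adopted
        s *= 2
    cost = s
    while s < k:
        s *= 2
        cost += s
    return cost

# Estimate the expected competitive ratio at k = 1000 (so z = 1024).
rng = random.Random(0)
k, trials = 1000, 50000
est = sum(randomized_cost(k, rng.uniform(0.5, 1.0))
          for _ in range(trials)) / (trials * k)
```

The estimate comes out slightly below the bound z/k + 2k/z ≈ 2.98; the small gap is the finite geometric series (the total cost is a bit less than twice the last size).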