

(See http://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf for a more up-to-date version of these notes.)

Collection of inequalities useful for proving upper bounds on RandomizedAlgorithms.

1. Law of total probability

If the (countably many) events { Ai } partition the probability space (i.e., they don't intersect each other and their union has probability 1), then for any event B,

\begin{align*}
\Pr[B] = \sum \Pr[B|A_i] \Pr[A_i]
\end{align*}

and for any random variable X with finite expectation,

\begin{align*}
\E[X] = \sum \E[X|A_i] \Pr[A_i].
\end{align*}

This allows probabilities and expectations to be computed by case analysis, conditioning on each event Ai in turn. Pretty much every analysis of a ProbabilisticRecurrence uses this method.
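
For concreteness, here is a small Python sketch of computing an expectation by conditioning on a two-event partition; the particular experiment (a coin that decides whether to roll one die or two) is chosen only for illustration.

    # Sketch: computing E[X] by conditioning on a partition, then checking the
    # answer by brute-force enumeration of the sample space.
    # X = one die roll if a fair coin lands heads, the sum of two rolls if tails.
    from itertools import product

    e_by_cases = 0.5 * 3.5 + 0.5 * 7.0   # E[X|heads] Pr[heads] + E[X|tails] Pr[tails]

    e_brute = 0.0
    for coin, d1, d2 in product('HT', range(1, 7), range(1, 7)):
        x = d1 if coin == 'H' else d1 + d2
        e_brute += x / (2 * 6 * 6)        # each outcome has probability 1/72

    print(e_by_cases, e_brute)            # both 5.25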

2. Union bound (Boole's inequality)

For any countable collection of events { Ai },

\begin{align*}
\Pr\left[\bigcup A_i\right] \le \sum \Pr[A_i].
\end{align*}

Typical use: show that if an algorithm can fail only if various improbable events occur, then the probability of failure is no greater than the sum of the probabilities of these events.

2.1. Examples
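
A small Python sketch (the numbers are arbitrary): compare the union bound with the exact probability that at least one of k unlikely events occurs, in the special case where the events happen to be independent.

    # Sketch: the union bound vs. the exact probability that at least one of
    # k independent bad events (each of probability p) occurs.
    p, k = 0.001, 50
    union_bound = k * p              # sum of the individual probabilities
    exact = 1 - (1 - p) ** k         # exact value when the events are independent
    print(union_bound, exact)        # 0.05 vs. ~0.0488: the bound overestimates, but safely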

3. Markov's inequality

If X ≥ 0, then for any α > 0, Pr[X ≥ α] ≤ EX/α. (See RandomVariables for proof.)

Useful for converting expectation bounds into probability bounds. Does not work in reverse: see BadCaseForMomentGeneratingFunctions.

3.1. Examples
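
A small Python sketch (the choice of distribution and of α is arbitrary): Markov's inequality applied to the number of die rolls needed to see a 6, compared with an empirical estimate of the same tail probability.

    # Sketch: Markov's inequality for X = number of fair-die rolls until the first 6,
    # so X >= 0 and E[X] = 6.  Compare the bound E[X]/alpha with an empirical estimate.
    import random

    random.seed(1)
    alpha, trials = 20, 100000

    def rolls_until_six():
        n = 1
        while random.randrange(6) != 5:
            n += 1
        return n

    empirical = sum(rolls_until_six() >= alpha for _ in range(trials)) / trials
    print(6 / alpha, empirical)   # Markov bound 0.3 vs. true value (5/6)^19 ~ 0.031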

4. Chebyshev's inequality

Pr[|X - EX| ≥ α] ≤ Var[X]/α².

Proof is by applying Markov's inequality to (X-EX)².

This is a weak concentration bound. Often useful when X is a sum of random variables, since if S = ∑ Xi, then we can calculate Var[S] = ∑i,j Cov(Xi, Xj) = ∑ Var[Xi] + ∑i≠j Cov(Xi,Xj), where Var[X] = E[X²] - (EX)² and Cov(X,Y) = E[XY] - EX EY.

Note that Cov(X,Y) = 0 when X and Y are independent; this makes Chebyshev's inequality particularly useful for pairwise-independent random variables. It also works well when the Xi are indicator variables, since if E[Xi] = p, we can easily compute Var[Xi] = E[Xi²] - (E Xi)² = p - p² = p(1-p). This is always bounded by p, and for small p the bound is close to tight.
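
A small Python sketch of this calculation (the parameters n, p, and α are arbitrary): for a sum S of independent indicators, Var[S] = np(1-p), and Chebyshev gives a tail bound that an empirical estimate easily beats.

    # Sketch: Chebyshev's inequality for a sum S of n independent indicator variables,
    # each 1 with probability p, so Var[S] = n p (1-p) <= n p.
    import random

    random.seed(2)
    n, p, alpha, trials = 1000, 0.1, 30, 10000
    mean, var = n * p, n * p * (1 - p)        # E[S] = 100, Var[S] = 90
    chebyshev = var / alpha ** 2              # Pr[|S - E[S]| >= 30] <= 0.1

    hits = sum(abs(sum(random.random() < p for _ in range(n)) - mean) >= alpha
               for _ in range(trials))
    print(chebyshev, hits / trials)           # bound 0.1 vs. empirical ~0.002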

4.1. Examples

4.1.1. Balls in bins

Let Xi be the indicator that the i-th of m balls lands in a particular one of n bins. Then E[Xi] = 1/n, giving E[∑ Xi] = m/n, and Var[Xi] = 1/n - 1/n², giving Var[∑ Xi] = m/n - m/n². So the probability that we get k + m/n or more balls in a particular bin is at most (m/n - m/n²)/k² < m/(nk²), and applying the union bound, the probability that we get k + m/n or more balls in any of the n bins is less than m/k². Setting this equal to ε and solving for k gives a probability of at most ε of getting more than m/n + √(m/ε) balls in any of the bins. This is still not as good a bound as we can prove, but it's at least non-trivial.
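
A small Python sketch of this guarantee (m, n, ε, the seed, and the number of trials are arbitrary): the observed maximum load essentially never exceeds m/n + √(m/ε), which illustrates how slack the Chebyshev-plus-union-bound threshold is.

    # Sketch: empirically checking that, with probability at least 1 - eps, no bin
    # receives more than m/n + sqrt(m/eps) balls (here m = n = 1000, eps = 0.1).
    import random
    from collections import Counter

    random.seed(3)
    m = n = 1000
    eps = 0.1
    threshold = m / n + (m / eps) ** 0.5      # 1 + 100 = 101 balls

    failures = sum(max(Counter(random.randrange(n) for _ in range(m)).values()) > threshold
                   for _ in range(200))
    print(failures / 200, eps)                # observed failure rate (typically 0) vs. eps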

4.1.2. Lazy select

This is the example from MotwaniRaghavan §3.3.

We want to find the k-th smallest element S(k) of a set S of size n. (The parentheses around the index indicate that we are considering the sorted version of the set S(1) < S(2) < ... < S(n).) The idea is to:

  1. Sample a multiset R of n^{3/4} elements of S with replacement and sort them. This takes O(n^{3/4} log n^{3/4}) = o(n) comparisons so far.

  2. Use our sample to find an interval that is likely to contain S(k). The idea is to pick indices l = (k-n^{3/4})n^{-1/4} and r = (k+n^{3/4})n^{-1/4} and use R(l) and R(r) as endpoints (we are omitting some floors and maxes here to simplify the notation; for a more rigorous presentation see MotwaniRaghavan). The hope is that the interval P = [R(l),R(r)] in S will both contain S(k) and be small, with |P| ≤ 4n^{3/4} + 2. We can compute the elements of P in 2n comparisons exactly by comparing every element with both R(l) and R(r).

  3. If both these conditions hold, sort P (o(n) comparisons) and return S(k). If not, try again.

We want to get a bound on how likely it is that P either misses S(k) or is too big.

For any fixed k, the probability that one sample in R is ≤ S(k) is exactly k/n, so the expected number X of samples ≤ S(k) is exactly kn^{-1/4}. The variance of X can be computed by summing the variances of the indicator variables that each sample is ≤ S(k), which gives Var[X] = n^{3/4}(k/n)(1-k/n) ≤ n^{3/4}/4. Applying Chebyshev's inequality gives Pr[|X - kn^{-1/4}| ≥ √n] ≤ (n^{3/4}/4)/n = n^{-1/4}/4.

Now let's look at the probability that P misses S(k) because R(l) is too big, where l = kn^{-1/4} - √n. This is

\begin{align*}
\Pr\left[R_{(l)} > S_{(k)}\right]
&=
\Pr\left[X < k n^{-1/4} - \sqrt{n}\right]
\\
&\le
\Pr\left[\left|X - k n^{-1/4}\right| \ge \sqrt{n}\right]
\\
&\le
n^{-1/4}/4
\end{align*}

(with the caveat that we are being sloppy about round-off errors).

Similarly,

\begin{align*}
\Pr\left[R_{(r)} < S_{(k)}\right]
&\le
\Pr\left[X \ge k n^{-1/4} + \sqrt{n}\right]
\\
&\le
\Pr\left[\left|X - k n^{-1/4}\right| \ge \sqrt{n}\right]
\\
&\le
n^{-1/4}/4.
\end{align*}

So the total probability that P misses S(k) is at most n^{-1/4}/2.

Now we want to show that |P| is small. We will do so by showing that it is likely that R(l) ≥ S(k-2n^{3/4}) and R(r) ≤ S(k+2n^{3/4}). Let Xl be the number of samples in R that are ≤ S(k-2n^{3/4}) and Xr be the number of samples in R that are ≤ S(k+2n^{3/4}). Then we have E[Xl] = kn^{-1/4} - 2√n and E[Xr] = kn^{-1/4} + 2√n, and Var[Xl] and Var[Xr] are both bounded by n^{3/4}/4.

We can now compute

\begin{align*}
\Pr\left[R_{(l)} < S_{(k-2n^{3/4})}\right]
&\le
\Pr\left[X_l \ge k n^{-1/4} - \sqrt{n}\right]
\\
&\le
\Pr\left[\left|X_l - \E[X_l]\right| \ge \sqrt{n}\right]
\\
&\le
n^{-1/4}/4
\end{align*}

by the same Chebyshev's inequality argument as before, and get the symmetric bound on the other side for Pr[R(r) > S(k+2n^{3/4})]. This gives a total bound of n^{-1/4}/2 that P is too big, and hence a bound of n^{-1/4}/2 + n^{-1/4}/2 = n^{-1/4} = o(1) on the probability that the algorithm fails on its first attempt.

The total expected number of comparisons is thus given by T(n) = 2n + o(n) + O(n^{-1/4} T(n)) = 2n + o(n).
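
For concreteness, here is a rough Python sketch of the algorithm as described above. The floors, index clamping, and retry loop are choices made for this sketch rather than details from MotwaniRaghavan, and extreme ranks (k very close to 1 or n) would need the special cases treated there.

    # Rough sketch of lazy select; parameter handling is simplified for illustration.
    import random

    def lazy_select(S, k):
        """Return the k-th smallest element of S (1-indexed)."""
        n = len(S)
        while True:
            # Step 1: sample n^{3/4} elements with replacement and sort the sample.
            r_size = max(1, round(n ** 0.75))
            R = sorted(random.choice(S) for _ in range(r_size))

            # Step 2: pick sample ranks about k*n^{-1/4} +/- sqrt(n), clamped to valid indices.
            x = k * n ** -0.25
            lo = min(max(int(x - n ** 0.5), 0), r_size - 1)
            hi = min(max(int(x + n ** 0.5), 0), r_size - 1)
            a, b = R[lo], R[hi]

            # Compare every element of S against both endpoints (2n comparisons).
            P = sorted(y for y in S if a <= y <= b)
            rank_a = sum(y < a for y in S)           # elements of S strictly below a

            # Step 3: if P contains S_(k) and is not too big, answer; otherwise retry.
            if rank_a < k <= rank_a + len(P) and len(P) <= 4 * n ** 0.75 + 2:
                return P[k - rank_a - 1]

    random.seed(4)
    S = list(range(1, 100001))
    random.shuffle(S)
    print(lazy_select(S, 50000))   # 50000, the median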

5. Chernoff bounds

Another application of Markov's inequality, now to exp(αS), where S = ∑ Xi and the Xi are independent.

The trick is that because the Xi are independent, so are the related variables exp(αXi), and since exp(αS) = ∏ exp(αXi) we get E[exp(αS)] = ∏ E[exp(αXi)].

The quantity E[exp(αS)], treated as a function of α, is called the moment generating function of S, because it expands into ∑k E[S^k]α^k/k!, the exponential generating function (see GeneratingFunctions) for the series of k-th moments E[S^k]. Note that it may not converge for all S and α (see BadCaseForMomentGeneratingFunctions); we will be careful to choose α for which it does converge and for which Markov's inequality gives us good bounds.

5.1. The classic Chernoff bound

Let each Xi for i=1..n be a 0-1 random variable with expectation pi, so that ES = μ = ∑ pi. Then E[exp(αXi)] = pi e^α + (1-pi), and E[exp(αS)] = ∏ (pi e^α + 1 - pi). Let δ≥0 and α>0. Markov's inequality then gives

\begin{align*}
\Pr\left[S \ge (1+\delta)\mu \right]
&=
\Pr\left[e^{\alpha S} \ge \exp\left(\alpha(1+\delta)\mu\right)\right]
\\
&\le
\frac{\E[e^{\alpha S}]}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&=
\frac{\prod_{i=1}^{n} (p_i e^{\alpha}+1-p_i)}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&=
\frac{\prod_{i=1}^{n} (1 + p_i(e^\alpha - 1))}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&\le
\frac{\prod_{i=1}^{n} \exp\left(p_i(e^\alpha - 1)\right)}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&=
\frac{\exp\left(\sum_{i=1}^{n} p_i (e^\alpha - 1)\right)}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&=
\frac{\exp\left((e^\alpha - 1) \mu\right)}{\exp\left(\alpha(1+\delta)\mu\right)}
\\
&=
\left(\frac{\exp\left(e^\alpha - 1\right)}{\exp\left(\alpha(1+\delta)\right)}\right)^\mu
\\
&=
\left(\exp\left(e^\alpha - 1 - \alpha(1+\delta)\right)\right)^\mu.
\end{align*}

The sneaky inequality step in the middle uses the fact that (1+x) ≤ exp(x) for all x, which itself is one of the most useful inequalities you can memorize.

We now choose α to minimize the base in the last expression, by minimizing e^α - 1 - α(1+δ). Setting the derivative with respect to α to zero gives e^α = 1+δ, or α = ln(1+δ). If we plug this back into the bound, we get

\begin{align*}
\Pr\left[S \ge (1+\delta)\mu \right]
&\le
\left(\exp\left((1 + \delta) - 1 - (1+\delta)\ln (1+\delta)\right)\right)^\mu
\\
&=
\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.
\end{align*}

The base of this rather atrocious quantity is e⁰/1¹ = 1 at δ=0, and its derivative with respect to δ is negative for δ>0 (the easiest way to show this is to substitute δ=x-1 first). So the bound is never greater than 1, and is strictly less than 1 as soon as δ>0. We also have that the bound decreases exponentially in μ for any fixed δ>0.
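
As a quick numerical sanity check (the parameters are arbitrary), here is a Python sketch comparing this bound with the exact upper tail of a sum of fair coins, where each pi = 1/2 and μ = n/2.

    # Sketch: the classic Chernoff bound (e^d / (1+d)^(1+d))^mu vs. the exact upper tail
    # of a Binomial(n, 1/2) variable (so mu = n/2), for a few values of delta.
    from math import comb, exp, log

    n, mu = 100, 50
    for delta in (0.1, 0.2, 0.5, 1.0):
        k = int((1 + delta) * mu)                       # integer threshold (1+delta)*mu
        exact = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        bound = exp(mu * (delta - (1 + delta) * log(1 + delta)))
        print(delta, exact, bound)                      # exact tail is always <= the bound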

5.1.1. Variants

The full Chernoff bound can be difficult to work with. There are approximate variants that substitute a weaker but less intimidating bound. Two of the more useful (both are used in the examples below) are:

  * Pr[S ≥ (1+δ)μ] ≤ exp(-μδ²/3), valid for 0 ≤ δ ≤ 1 (the small-δ approximation); and

  * Pr[S ≥ R] ≤ 2^{-R}, valid for R ≥ 2eμ (the 2^{-R} variant).

The plot below shows the relation between the various bounds for μ=1 in the region where they cross each other:

chernoff-plot.png

5.1.2. Applications

5.1.2.1. Balls in bins again

Let's try applying the Chernoff bound to the balls-in-bins problem. Here we let S = ∑ Xi (i=1..m) be the number of balls in a particular bin, with Xi the indicator that the i-th ball lands in the bin, EXi = pi = 1/n, and ES = μ = m/n. So the Chernoff bound gives

\begin{align*}
\Pr[S \ge m/n + k]
&=
\Pr[S \ge (m/n) (1+kn/m)]
\\
&\le
\left(\frac{e^{kn/m}}{(1+kn/m)^{1+kn/m}}\right)^{m/n}.
\end{align*}

For m=n, this collapses to the somewhat nicer but still pretty horrifying e^k/(k+1)^{k+1}.

Staying with m=n, if we are bounding the probability of having large bins, we can use the 2^{-R} variant to show that the probability that any particular bin has more than 2 lg n balls (for example) is at most n^{-2}, giving a probability of at most 1/n that there exists such a bin. This is not as strong as what we can get out of the full Chernoff bound. If we take the logarithm of e^k/(k+1)^{k+1}, we get k - (k+1) ln(k+1); if we then substitute k = c ln n / ln ln n - 1, we get

\begin{align*}
c \ln n / \ln \ln n - 1
- (c \ln n / \ln \ln n) \ln (c \ln n / \ln \ln n)
&=
(\ln n) \left(
  \frac{c}{\ln \ln n}
  - \frac{1}{\ln n}
  - \frac{c}{\ln \ln n} \left(
     \ln c + \ln \ln n - \ln \ln \ln n
  \right)
\right)
\\
&=
(\ln n) \left(
  \frac{c}{\ln \ln n}
  - \frac{1}{\ln n}
  - \frac{c \ln c}{\ln \ln n}
  - c
  + \frac{c \ln \ln \ln n}{\ln \ln n}
\right)
\\
&=
(\ln n)(-c + o(1)).
\end{align*}

So the probability of getting more than c ln n / ln ln n balls in any particular bin is bounded by exp((ln n)(-c + o(1))) = n^{-c+o(1)}, and the union bound over all n bins gives n^{1-c+o(1)}. This gives a maximum bin size of O(log n / log log n) with any fixed probability bound n^{-a}, by taking c large enough, for sufficiently large n.
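
For a quick empirical look at this O(log n / log log n) scale (the value of n and the seed are arbitrary):

    # Sketch: throw n balls into n bins and compare the observed maximum load
    # with the ln(n)/ln(ln(n)) scale from the bound above.
    import random
    from collections import Counter
    from math import log

    random.seed(5)
    n = 100000
    counts = Counter(random.randrange(n) for _ in range(n))
    print(max(counts.values()), log(n) / log(log(n)))   # typically ~7 vs. ~4.7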

5.1.2.2. Sums of fair coins

Suppose we flip n fair coins, and let S be the number that come up heads. We expect μ = n/2 heads on average. How many extra heads can we get, if we want to stay within a probability bound of n^{-c}?

Here we use the small-δ approximation, which gives Pr[S ≥ (1+δ)(n/2)] ≤ exp(-δ²n/6). Setting exp(-δ²n/6) = n^{-c} gives δ = √(6 ln n^c / n) = √(6c ln n / n). The actual excess over the mean is δ(n/2) = (n/2)√(6c ln n / n) = √((3/2) c n ln n). By symmetry, the same bound applies to extra tails. So if we flip 1000 coins and see more than 676 heads (roughly the bound when c=3), we can reasonably conclude that either (a) our coin is biased, or (b) we just hit a rare one-in-a-billion jackpot.

In algorithm analysis, the √((3/2)c) part usually gets absorbed into the asymptotic notation, and we just say that with probability at least 1 - n^{-c}, the sum of n random bits is within O(√(n log n)) of n/2.
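
A small Python check of the 676-heads figure (computing the exact binomial tail, which the asymptotic argument of course does not need):

    # Sketch: the n = 1000, c = 3 calculation above.  The excess sqrt((3/2) c n ln n)
    # is about 176, the small-delta Chernoff bound is about 1e-9, and the exact
    # binomial tail beyond 676 heads is far smaller still.
    from math import comb, exp, log

    n, c = 1000, 3
    excess = (1.5 * c * n * log(n)) ** 0.5            # ~176.3 extra heads
    delta = 2 * excess / n                            # excess = delta * (n/2)
    chernoff = exp(-delta ** 2 * n / 6)               # ~1e-9 = n^{-c}
    exact = sum(comb(n, i) for i in range(677, n + 1)) / 2 ** n
    print(excess, chernoff, exact)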

5.2. Lower bound version

We can also use Chernoff bounds to show that a sum of independent random variables isn't too small. The essential idea is to repeat the upper bound argument with a negative value of α, which makes e^{α(1-δ)μ} an increasing function of δ. The resulting bound is:

\begin{align*}
\Pr\left[S \le (1-\delta)\mu \right]
&\le
\left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu.
\end{align*}

A simpler but weaker version of this bound is exp(-μδ²/2). Both bounds hold for all δ with 0≤δ≤1.

6. Probabilistic recurrence relations

See ProbabilisticRecurrences.

7. Jensen's inequality

If X is a random variable and f is a convex function,1 then f(EX) ≤ E[f(X)]. See Jensen's inequality for variants, for the side conditions omitted here that exclude pathological f's and X's, and for several proofs. The inequality holds in reverse when f is concave (i.e., when -f is convex).
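
For example, a trivial Python sketch with the convex function f(x) = x² and X a fair die roll:

    # Sketch: Jensen's inequality with f(x) = x^2 and X a fair die roll: f(E[X]) <= E[f(X)].
    xs = range(1, 7)
    e_x = sum(xs) / 6                       # E[X] = 3.5
    e_fx = sum(x * x for x in xs) / 6       # E[X^2] = 15.1666...
    print(e_x ** 2, e_fx)                   # 12.25 <= 15.1666...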

8. Azuma-Hoeffding inequality

See Martingales.


CategoryRandomizedAlgorithmsNotes

  1. A function f is convex if, for any x, y, and 0≤μ≤1, f(μx+(1-μ)y) ≤ μf(x)+(1-μ)f(y). One way to prove this is to show d²f/dx² ≥ 0. (1)

