Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. Peaks and valleys

Suppose we flip a fair coin n times. For each i, 1≤i≤n, let X_i be the indicator of the event that the i-th coin-flip is heads. For each i, 2≤i≤n-1, let Y_i be the indicator for the event that X_{i-1}=0, X_i=1, and X_{i+1}=0; if this event occurs we say that the sequence of coin-flips has a peak at i. Let $S = \sum_{i=2}^{n-1} Y_i$ be the random variable counting the number of peaks in the sequence.

1. What is E[S]?
2. What is Var[S]?

1.1. Solution

1. To compute E[S], we just need to compute E[Y_i] for each i and sum the expectations. Each variable Y_i is one provided we get a pattern THT in the three coins surrounding position i; since the coins are fair and independent, this event occurs with probability (1/2)³ = 1/8. Summing over all n-2 values of i gives E[S] = (n-2)/8.

2. To compute Var[S], we use the sum formula for variance, taking into account the fact that not all of the Y_i are pairwise independent. First let's compute Var[Y_i]. We have previously shown that E[Y_i] = 1/8; since Y_i² = Y_i this also means E[Y_i²] = 1/8. So Var[Y_i] = E[Y_i²] - (E[Y_i])² = 1/8 - (1/8)² = 7/64.

Now we need to compute Cov(Y_i,Y_j) for all i≠j. We'll consider several cases.

If j=i+1, then there is a sequence of 4 coin-flips starting at position i-1 such that Y_i=1 iff the first three coins have the pattern THT and Y_j=1 iff the last three coins have the pattern THT. But we can't realize both patterns simultaneously because they disagree on the middle two coins. So in this case we have E[Y_i Y_{i+1}] = 0, and Cov(Y_i,Y_{i+1}) = 0 - E[Y_i]E[Y_{i+1}] = 0 - (1/8)(1/8) = -1/64.

If j=i+2, then there is a sequence of 5 coin-flips starting at position i-1 such that Y_i=Y_j=1 iff the coins come up THTHT. This occurs with probability (1/2)⁵ = 1/32, giving Cov(Y_i,Y_{i+2}) = 1/32 - (1/8)(1/8) = 1/32 - 1/64 = 1/64.

If j>i+2, then there is no overlap between the Y_i coins and the Y_j coins. The random variables Y_i and Y_j are thus independent, and we have Cov(Y_i,Y_j) = 0.

The cases where j<i are symmetric.
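The three covariance cases can be checked by exhaustive enumeration. The following is a minimal sketch, not part of the original solution: the function names `peak` and `peak_covariance` are hypothetical, flips are encoded as 0 for tails and 1 for heads, and `gap` stands for j-i.

```python
from fractions import Fraction
from itertools import product


def peak(flips, i):
    """Return 1 if flips has tails, heads, tails at indices i-1, i, i+1, else 0."""
    return int(flips[i - 1] == 0 and flips[i] == 1 and flips[i + 1] == 0)


def peak_covariance(gap):
    """Exact Cov(Y_i, Y_j) for j = i + gap, enumerating all fair-coin outcomes."""
    length = gap + 3      # the window covers coins i-1 through j+1
    i, j = 1, 1 + gap     # 0-based positions of the two peak centers in the window
    outcomes = list(product((0, 1), repeat=length))
    weight = Fraction(1, len(outcomes))   # every outcome is equally likely
    e_i = sum(peak(f, i) for f in outcomes) * weight
    e_j = sum(peak(f, j) for f in outcomes) * weight
    e_ij = sum(peak(f, i) * peak(f, j) for f in outcomes) * weight
    return e_ij - e_i * e_j
```

With exact fractions, `peak_covariance(1)`, `peak_covariance(2)`, and `peak_covariance(3)` evaluate to -1/64, 1/64, and 0, and `peak_covariance(0)` gives Var[Y_i] = 7/64.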

Applying the sum formula gives:

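$\begin{align*} \operatorname{Var}[S] &= \sum_{i=2}^{n-1} \sum_{j=2}^{n-1} \operatorname{Cov}(Y_i,Y_j) \\ &= \sum_{i=4}^{n-1} \operatorname{Cov}(Y_i,Y_{i-2}) + \sum_{i=3}^{n-1} \operatorname{Cov}(Y_i,Y_{i-1}) + \sum_{i=2}^{n-1} \operatorname{Var}(Y_i) + \sum_{i=2}^{n-2} \operatorname{Cov}(Y_i,Y_{i+1}) + \sum_{i=2}^{n-3} \operatorname{Cov}(Y_i,Y_{i+2}) \\ &= (1/64)(n-4) + (-1/64)(n-3) + (7/64)(n-2) + (-1/64)(n-3) + (1/64)(n-4) \\ &= \frac{(n-4) - (n-3) + 7(n-2) - (n-3) + (n-4)}{64} \\ &= \frac{7n-16}{64}. \end{align*}$

The sums range over the n-2 variables Y_2,...,Y_{n-1} that make up S. The term counts assume n≥4; for n=3, S = Y_2 and Var[S] = Var[Y_2] = 7/64.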
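As a sanity check, E[S] and Var[S] can be computed exactly for small n by enumerating all 2ⁿ flip sequences. The sketch below is not part of the original solution: the helper names `peak_count` and `peak_moments` are made up for illustration, flips are encoded as 0 for tails and 1 for heads, and only the interior positions 2 through n-1 are counted, as in the definition of S.

```python
from fractions import Fraction
from itertools import product


def peak_count(flips):
    """Number of interior positions that are heads (1) with tails (0) on both sides."""
    return sum(
        1
        for i in range(1, len(flips) - 1)
        if flips[i - 1] == 0 and flips[i] == 1 and flips[i + 1] == 0
    )


def peak_moments(n):
    """Exact (E[S], Var[S]) for n fair flips, averaging over all 2**n sequences."""
    total = total_sq = 0
    for flips in product((0, 1), repeat=n):
        s = peak_count(flips)
        total += s
        total_sq += s * s
    mean = Fraction(total, 2**n)
    # Var[S] = E[S^2] - (E[S])^2, kept as an exact fraction.
    return mean, Fraction(total_sq, 2**n) - mean * mean
```

The running time grows as 2ⁿ, so this is only practical for small n, but the exact fractions it returns can be compared directly against a closed form for several values of n.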

2. Stronger than Markov

Let X_1,...,X_n be the indicator variables for independent coin-flips with bias p; that is, each X_i is 1 with probability p and 0 with probability 1-p. Let $S = \sum_{i=1}^{n} X_i$.

1. Compute E[e^S].
2. Use the expected value above and Markov's inequality to compute an upper bound on Pr[S≥m] = Pr[e^S≥e^m].
3. Can you find values of p, n, and m for which this bound is smaller than the bound E[S]/m given by a direct application of Markov's inequality?

2.1. Solution

1. Since the X_i are all independent, so are the random variables $e^{X_i}$. So we have

$\begin{align*} \operatorname{E}\left[e^S\right] &= \operatorname{E}\left[e^{\sum_{i=1}^{n} X_i}\right] \\ &= \operatorname{E}\left[\prod_{i=1}^{n} e^{X_i}\right] \\ &= \prod_{i=1}^{n} \operatorname{E}\left[e^{X_i}\right] \\ &= (pe^1 + (1-p)e^0)^n \\ &= (pe+1-p)^n. \end{align*}$

2. Markov's inequality gives Pr[e^S≥e^m] ≤ E[e^S]/e^m = (pe+1-p)ⁿ/e^m.

3. The bound is generally better when m is significantly greater than E[S] = pn. For example, if p=1/n and m=n, the new bound is (e/n+1-1/n)ⁿ/eⁿ = (1/n+1/e-1/(en))ⁿ = (1/e)ⁿ(1+(e-1)/n)ⁿ; for large n this is approximately e^(e-1)·(1/e)ⁿ, which is very small (though not as small as the exact probability (1/n)ⁿ). A direct application of Markov's inequality gives only n(1/n)/n = 1/n.

A similar case is when p=1/2 and m=n. Here the bound is ((e+1)/2)ⁿ/eⁿ = ((e+1)/(2e))ⁿ ≈ (0.6839...)ⁿ. The direct Markov bound in this case is only (n/2)/n = 1/2; it doesn't even depend on n.
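The two bounds can be compared numerically against the exact tail probability. This is a small sketch under the assumptions of the problem (S is a sum of n independent indicators with bias p, and m is a positive integer); the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def exponential_markov_bound(p, n, m):
    """Bound on Pr[S >= m] from Markov's inequality applied to e^S."""
    return (p * math.e + 1 - p) ** n / math.exp(m)


def direct_markov_bound(p, n, m):
    """Bound on Pr[S >= m] from Markov's inequality applied to S itself."""
    return p * n / m


def exact_tail(p, n, m):
    """Exact Pr[S >= m] when S is Binomial(n, p) and m is an integer."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(max(m, 0), n + 1)
    )
```

For instance, calling all three functions with p=0.5 and n=m=10 shows the exponential bound falling well below the direct bound of 1/2 while remaining above the exact probability 2⁻¹⁰.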

The main difference between applying Markov's inequality directly to S and applying it to e^S is that in the latter case we need independence to compute E[e^S]. The technique of bounding E[e^{αS}] for appropriate choices of α goes by the name of Chernoff bounds and is a standard method for proving bounds on the tails of the distribution of a random variable.
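For reference, here is a sketch of the same calculation with a free parameter α>0; it is not worked out in the original solution. Since e^{αx} is increasing in x, the events S≥m and e^{αS}≥e^{αm} coincide, and independence of the X_i gives

$\begin{align*} \Pr[S \ge m] &= \Pr\left[e^{\alpha S} \ge e^{\alpha m}\right] \\ &\le \frac{\operatorname{E}\left[e^{\alpha S}\right]}{e^{\alpha m}} \\ &= \frac{\prod_{i=1}^{n} \operatorname{E}\left[e^{\alpha X_i}\right]}{e^{\alpha m}} \\ &= \frac{(pe^{\alpha}+1-p)^n}{e^{\alpha m}}. \end{align*}$

Taking α=1 recovers the bound from part 2; other choices of α may give a smaller bound for particular p, n, and m.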

3. At the genome factory

A custom gene splicing shop charges $2 to splice a thymine (T) nucleotide into a single-strand DNA fragment, and $1 each for adenine (A) and cytosine (C). Guanine (G) is free. So, for example, constructing the sequence GATTACA costs 0+1+2+2+1+1+1=8 dollars.

1. Write a generating function such that the z^k coefficient gives the number of ways to buy a single nucleotide for k dollars.
2. Write a generating function such that the z^k coefficient gives the number of ways to buy a single DNA strand consisting of n nucleotides for k dollars.
3. Give a simple expression for the number of strands of length n that cost k dollars.

3.1. Solution

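1. 1+2z+z²: there is one way to spend $0 (G), two ways to spend $1 (A or C), and one way to spend $2 (T).
2. (1+2z+z²)ⁿ, since each of the n positions is chosen separately and multiplying generating functions adds the costs.
3. The trick here is that (1+2z+z²) factors as (1+z)². So (1+2z+z²)ⁿ = (1+z)²ⁿ. From the binomial theorem, this is equal to $\sum_{k=0}^{2n} {2n \choose k} z^k$. It follows that there are exactly ${2n \choose k}$ strands of length n that cost k dollars.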
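The count can be confirmed by brute force for small n. The sketch below is illustrative only: `PRICE`, `strand_cost`, and `count_by_cost` are hypothetical names, and the prices are the ones stated in the problem.

```python
from itertools import product
from math import comb

# Price in dollars of each letter, as given in the problem statement.
PRICE = {"G": 0, "A": 1, "C": 1, "T": 2}


def strand_cost(strand):
    """Total price of a strand given as a sequence of the letters G, A, C, T."""
    return sum(PRICE[letter] for letter in strand)


def count_by_cost(n):
    """List whose k-th entry is the number of length-n strands costing k dollars."""
    counts = [0] * (2 * n + 1)   # costs range from 0 (all G) to 2n (all T)
    for strand in product(PRICE, repeat=n):
        counts[strand_cost(strand)] += 1
    return counts


# Compare the brute-force counts with the binomial coefficients for n = 3.
print(count_by_cost(3) == [comb(6, k) for k in range(7)])
```

Here `strand_cost("GATTACA")` returns 8, matching the worked example in the problem, and `count_by_cost(1)` returns the coefficients 1, 2, 1 of the single-letter generating function.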