Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

We start with some definitions:

1. Stochastic processes

A stochastic process is a sequence of random variables X₀, X₁, ..., typically indexed either by ℕ (a discrete-time stochastic process) or ℝ (a continuous-time stochastic process; sometimes ℝ⁺ if we don't consider times less than 0). Interpretation: A random process that evolves over time. Common examples are martingales (described below), and Markov processes, where the distribution of X_i+1 depends only on X_i and not on any previous states.

2. Sigma-algebras

A sigma-algebra (or σ-algebra) is a family of sets that includes a maximal universe Ω and the empty set ∅, is closed under complement with respect to Ω and under countable unions. The elements of this family are called measurable sets. Any probability space includes a σ-algebra that describes what events are assigned probabilities in the space. For discrete spaces this is often all subsets of Ω. For continuous spaces a restricted σ-algebra is needed to avoid paradoxes; the usual choice for e.g. [0,1] are the Lebesgue-measurable sets. For our purposes the useful property of σ-algebras is that they can be used to represent restricted information.

For example, suppose I roll two six-sided dice but only tell you what the sum of the two dice is. What I see is a probability space with 36 elements (all outcomes of the two dice). What you see is a random variable with only 11 values (the possible sums 2..12). You can detect whether some events occured ("is the sum even?") but not others ("is the first die 3?"). The collection of events that you can decide includes both ∅ and Ω, and is closed under complement and union, so it's a σ-algebra. The structure of this σ-algebra is that it includes a set A if and only if whenever A contains some point ω = (x,y), it contains all other points ω'=(x',y') where x+y = x'+y'.

We say that a random variable X is measurable with respect to a σ-algebra ℱ (or just measurable ℱ for short) if the event [X≤x] is measurable (i.e., included in the σ-algebra) for all x. For discrete random variables this just means that any event of the form X∈S is measurable. One way to get a σ-algebra is to take a random variable consider the smallest σ-algebra with respect to which X is measurable; this is called the σ-algebra generated by X and is written <X>. In general, we can also consider σ-algebras generated by any collection of events <A,B,C> or random variables <X₁,X₂,X₃>; in each case we are looking at the smallest σ-algebra with respect to which all of these events or random variables are measurable.

In the two-dice example, the σ-algebra we defined is just <X₁+X₂>. If I tell you the value of the first as well, we get <X₁+X₂,X₁>, which happens to include every event in the underlying probability space. We could also have σ-algebras <X₁>, <X₂>, <X₁,[X₁<X₂]>, that correspond to different information.

For discrete probability spaces, we can think of a σ-algebra as consisting of all unions of a set of elementary events or atoms, which are events with the property that no proper subset except ∅ is also contained in the σ-algebra. In this case, a random variable X is measurable ℱ if and only if it is constant on each atom of ℱ. The atoms of ℱ form a partition of Ω; each outcome ω∈Ω lies in exactly one atom. As a special case, the σ-algebra <X> for discrete X has as its elementary events all nonempty sets of the form X^-1(x).

2.1. Filtrations

With a stochastic process, it is natural to talk not only about the value of the process at time t (which is just the random variable X_t), but also what we know at time t. The usual assumption is that we never forget anything; so at minimum we can decide any event in the σ-algebra <X₁...X_t>. If we have additional information (perhaps there is some underlying process used to generate the X_i, and we can observe details of this process that are not necessarily included in the value of the X_i), then we have a sequence of nested σ-algebras ℱ₀⊆ℱ₁⊆ℱ₂⊆... . Any such sequence is called a filtration; a stochastic process {X_t} is adapted to a filtration {ℱ_t} if X_t is measurable ℱ_t for all t.

2.2. Conditional expectations

We've previously seen the definition of E[X|A] = ∑ x Pr[X=x|A], the expectation of X conditioned on an event A. We can also defined the expectation E[X|Y] of X conditioned on a random variable Y, or in even more generality the expectation E[X|ℱ] of X conditioned on a σ-algebra ℱ. In the first case, we define E[X|Y] = E[X|Y=y] when Y=y; unlike ordinary conditional expectation, which is always a constant, E[X|Y] is a random variable that may take on a different value for each value of y (but is constant if we fix the value of Y). In a discrete probability space, E[X|ℱ] is defined by considering the atoms A of ℱ and defining E[X|ℱ] = E[X|A] whenever we land in A. It is not hard to see that this is equivalent to E[X|Y] if Y generates ℱ.

For continuous spaces, E[X|ℱ] is a little trickier to define, and may not even exist if X is unbounded. One way to define it is to observe that the set of all real-valued random variables defined on a given probability space form a vector space (given any random variables X and Y, aX+bY is also a random variable), and that the subset of random variables that are measurable ℱ is a subspace of this vector space. So we can define E[X|ℱ] as the closest point to X in the subspace of random variables that are measurable ℱ. In practice, we are never going to actually do this.

Conditional expectations satisfy all the usual properties of expectations, so E[aX+bY|ℱ] = aE[X|ℱ] +bE[Y|ℱ], X≥Y ⇒ E[X|ℱ] ≥ E[Y|ℱ], etc. One very useful fact that only comes up because of the conditioning is that if ℱ₁⊆ℱ₂, then E[X|ℱ₁] = E[E[X|ℱ₂]|ℱ₁]; in other words, when computing an expectation it doesn't matter if we average all at once or in stages, as long as we end up averaging across the same elementary events. A complement to this fact is that if X is measurable ℱ, it acts like a constant: E[X|ℱ] = X and E[XY|ℱ] = X E[Y|ℱ]. (We won't prove these facts here, see e.g. GrimmetStirzaker.)

3. Martingales

So now let's imagine we have a stochastic process {X_t} (what we have at time t) adapted to a filtration {ℱ_t} (what we know at time t). We can think of this process as describing the state of our finances after we've been playing in a casino for a while. If the casino is perfectly fair (unlike what happens in real life), then each bet we place should have an expected return of 0. In other words, if we want to estimate based on what we know at time t how much money we will have at time t+1, our best guess would be X_t.

Formally, a stochastic process as above is a martingale if E[X_t+1|ℱ_t] = X_t. Often we replace ℱ_t with the σ-algebra generated by X₀...X_t and write this as E[X_t+1|X₀...X_t] = X_t. The useful property of martingales is that we can verify the martingale property locally, by proving either that E[X_t+1|ℱ_t] = X_t or equivalently that E[X_t+1 - X_t|ℱ_t] = E[X_t+1|ℱ_t] - X_t = 0. But this local property has strong consequences that apply across long intervals of time, as we will see below.

3.1. Examples of martingales

Let X_t+1 = X_t ± b_t where +b_t and -b_t occur with equal probability b_t is measurable ℱ_t, and the outcome ±b_t is measurable ℱ_t+1 (in other words, my "bet" b_t can only depend on what has happened so far and not on the future, but my knowledge ℱ_t includes the outcome of all past bets). Then {X_t|ℱ_t} is a martingale. In general, if Y_t+1-Y_t = b_t(X_t+1-X_t) where (X_t,ℱ_t) is a martingale and b_t is measurable ℱ_t, then Y_t is also a martingale with respect ℱ_t. Proof: E[Y_t+1-Y_t|ℱ_t] = E[b_t(X_t+1-X_t)|ℱ_t] = b_t E[X_t+1-X_t|ℱ_t] = b_t⋅0 = 0.
Special case: A random ±1 walk is a martingale.
Another random walk: Let X_t+1 = X_t±1 with equal probability and let Y_t = X_t² - t. Then E[Y_t+1|Y_t] = ½(X_t+1)²+½(X_t-1)² - (t+1) = X_t² + 1 - (t+1) = X_t² - t = Y_t. So {Y_t} is a martingale.
Doob's martingale: Let Z be any random variable and {ℱ_t} any filtration. Let X_t= E[Z|ℱ_t] (suppose that all such expectations exist). Then E[X_t+1|ℱ_t] = E[E[Z|ℱ_t+1]|ℱ_t] = E[Z|ℱ_t] = X_t; (X_t,ℱ_t) is a martingale.

3.2. Properties of martingales

The defining property of a martingale is that E[X_t+1|ℱ_t] = X_t. What if we want to compute E[X_t] without conditioning, or E[X_t+k|ℱ_t], where k > 1?

We can easily show by induction on k that, for k≥0, E[X_t+k|ℱ_t] = X_t. The base is that when k = 0, E[X_t+k|ℱ_t] = E[X_t|ℱ_t] = X_t since {X_t} is adapted to {ℱ_t} and thus X_t is measurable ℱ_t. For larger k, compute E[X_t+k|ℱ_t] = E[E[X_t+k|ℱ_t+k-1]|ℱ_t] = E[X_t+k-1|ℱ_t] = X_t.

What about E[X_t], with no conditioning? Here we observe E[X_t] = E[E[X_t|ℱ₀]] = E[X₀]. In other words, martingales never go anywhere, at least in expectation.

We can apply this to the martingales we've seen so far:

For an arbitrary betting strategy on a fair game, I neither make nor lose money on average. So if I have a strategy that makes me $1 by time t with probability p, then I must lose $x with probability (1-p), where E[X_t] = p - (1-p)x = 0 implies x = p/(1-p). This means that I can (by doubling my bet each time I lose, for example) make $1 with probability 1-10^-6 if I am willing to accept a probability of 10^-6 of losing 10^-6-1 dollars. This is probably a bad strategy for ordinary people but it works out well for insurance companies, who charge a small additional premium to make a profit above the zero expected return.
A random walk that runs for t steps starting at position 0 ends with E[X_t²-t] = 0 or E[X_t²] = t.

4. Martingales as sums of uncorrelated random variables

Let {X_t,ℱ_t} be a martingale with X₀=0. Define, for t>0, Δ_t, = X_t-X_t-1. Then we can write X_t as $\sum_{i=1}^{t} \Delta_i$ , where E[Δ_i|Δ₁...Δ_i-1] = E[X_i-X_i-1|ℱ_i] = 0. In other words, Δ_i is uncorrelated with Δ₁...Δ_i-1. This is not quite as good a condition as independence, but is good enough that martingales often act very much like sums of independent random variables.

4.1. Variance of Xt

For example, suppose we want to compute the Var[X_t] = E[(X_t-EX_t)²] = E[X_t²] (recall that we are assuming X₀ = 0, so EX_t = 0 by the martingale property). We have Var[X_t] = Var[∑_i≤t Δ_i] = ∑_i,j≤t Cov(Δ_i,Δ_j) = ∑_i≤t Var[Δ_i] + 2 ∑_{i<j≤t Cov(Δ}i_,Δj_{). Now Var[Δ}i_{] = E[Δ}i_{^2^], since EΔ}i_{= 0, and Cov(Δ}i_,Δj_{) = E[Δ}i_Δj_{] - E[Δ}i_]E[Δj_{] = E[E[Δ}j_|Δi_]Δi_{] - 0⋅0 = 0 - 0 = 0, since E[Δ}j_|Δi_{] = E[E[Δ}j_|ℱj-1]|Δ_i] = 0. It follows that Var[X_t] is exactly ∑_i≤t E[Δ_i²].

This means that if we have a bound |Δ_i| ≤ c_i for all i, we can compute Var[X_t] ≤ ∑_i≤t c_i², and apply Chebyshev's inequality to the result to get a bound on the tails of X_t. But in fact we can do much better by using moment generating functions; the result is an inequality known as the Azuma-Hoeffding inequality (named for its independent co-discoverers), which is the martingale version of Chernoff bounds.

4.2. The Azuma-Hoeffding inequality

Same setup as above: X₀ = 0, Δ_t = X_t - X_t-1 with |Δ_t|≤c_t for all t. Then

$\begin{align*} \Pr[X_t \ge x] &\le \exp\left(\frac{-x^2}{2\sum_{i=1}^{t} c_i^2}\right). \end{align*}$

Variants:

If we don't assume X₀ = 0, this becomes a bound on X_t-X₀.
Since {-X_t,ℱ_t} is also a martingale, the same bound applies, which gives the symmetric bound Pr[X_t-X₀ ≤ -x] ≤ exp(-x²/2∑c_i²).
Applying the union bound gives Pr[|X_t-X₀| ≥ x] ≤ 2 exp(-x²/2∑c_i²).

The proof is by the usual trick of applying Markov's inequality to E[exp(α∑X_t)] for appropriate choice of α, with some tricks in bounding E[exp(α∑X_t)]. See Azuma-Hoeffding inequality.

4.2.1. Application: the method of bounded differences

Basic idea: Azuma-Hoeffding applies to any process that we can model as revealing the inputs x₁...x_n to some function f(x₁...x_n) one at a time, provided changing any one input changes f by at most c (the Lipschitz condition). The reason is that when the Lipschitz condition holds, then E[f(x₁...x_n) | x₁...x_t] is a martingale that satisfies the requirements of the inequality.

This allows us to show, for example, that the number of colors needed to color a random graph is tightly concentrated, by considering a process where x_i reveals all the edges between vertex i and smaller vertices (a vertex exposure martingale).

5. Stopping times

Let {ℱ_t} be a filtration. A random variable T is a stopping time for {ℱ_t} if T≥0 and the event [T≤t] is measurable ℱ_t for all t. In simple terms, T is a stopping time if you know at time t whether you've stopped or not.

Stopping times are very handy when used with martingales, because of the

Optional Stopping Theorem

If (X_t,ℱ_t) is a martingale and T is a stopping time for {ℱ_t}, then E[X_T] = E[X₀] if

Pr[T<∞] = 1,
E[|X_T|] < ∞, and
lim_t→∞ E[ X_t⋅[T > t] ] = 0.

The first condition says that T is finite with probability 1 (i.e., eventually we do stop). The second condition puts a bound on how big |X_T| can get, which excludes some bad outcomes where we accept a small probability of a huge loss in order to get a large probability of a small gain. The third says that the contribution of large values of t to E[X_T] goes to zero as we consider larger and larger t; the term [T > t] is the indicator variable for the event that T is larger than t.

It would be nice if we could show E[X_T] = E[X₀] without the side conditions, but in general this isn't true. For example, the double-or-win martingale strategy eventually returns 1 with probability 1, so if T is the time we stop playing, we have Pr[T<∞] = 1, E[|X_T|] < ∞, but E[X_T] = 1 ≠ E[X₀] = 0. Here we have E[X_t⋅[T>t]] = -2^t-1(1/2)^t-1 = -1 for t>0, so it's the third condition that's violated. (The same thing happens for a simple ±1 random walk that stops at +1, but it's a bit harder to calculate E[X_t⋅[T>t]] in this case.)

To prove the optional stopping theorem, it helps to start with a simpler version:

Optional Stopping Theorem (baby version)

If (X_t,ℱ_t) is a martingale and T is a stopping time for {ℱ_t}, then for any n∈ℕ, E[X_min(T,n)] = E[X₀].

Proof

Define Y_t = <<X₀ + latex($\sum_{i=1}^{t} (X_t-X_t-1) [T\\le t-1]$)>>. Then (Y_t,ℱ_t) is a martingale since we can treat [T≤t-1] as a sequence of bets. But then E[X_min(T,n)] = E[Y_n] = E[Y₀] = E[X₀].

So now we'll prove the full version by considering E[X_min(T,n)] and showing that, under the conditions of the theorem, it approaches E[X_T] as n goes to infinity. First observe that, for any n, X_T = X_min(T,n) + [T>n](X_T-X_n), because either T≤n, and we just get X_T, or T>n, and we get X_n+(X_T-X_n) = X_T. Now take expectations: E[X_T] = E[X_min(T,n)] + E[[T>n]X_T] - E[[T>n]X_n]. Condition (3) on the theorem gives lim_n→∞ E[[T>n]X_n]→0. If we can show that the middle term also vanishes in the limit, we are done.

Here we use condition (2). Observe that E[[T>n] X_T] = $\sum_{t=n+1}^{\infty} \E[[T=t] X,,t,,]$ . Compare this with E[X_T] = $\sum_{t=0}^{\infty} \E[[T=t] X,,t,,]$ ; this is a convergent series, so in the limit the sum of the terms for i=0..n converges to E[X_T]. But this means that the sum of the remaining terms for i=n+1..∞ converges to zero. So the middle term goes to zero as n goes to infinity.

We thus have E[X_T] = lim_n→∞ E[X_min(T,n)] + E[[T>n]X_T] - E[[T>n]X_n] = lim_n→∞ E[X_{min(T,n)] = E[X}0,,]. This completes the proof.

Using the full-blown optional stopping theorem is a pain in the neck, because conditions (2) and (3) are often hard to test directly. So in practice one generally chooses one of several weaker but easier variants:

Optional Stopping Theorem (bounded time): If (X_t,ℱ_t) is a martingale and T is a stopping time for {ℱ_t} with T≤n always for some fixed n, then E[X_T] = E[X₀]. Proof: X_T = X_min(T,n).
Optional Stopping Theorem (bounded range): If (X_t,ℱ_t) is a martingale and T is a stopping time for {ℱ_t} with Pr[T<∞] = 0 and |X_t| ≤ M always for some fixed M and all t, then E[X_T] = E[X₀]. Proof: Use the full OST, with (1) given, (2) implied by |X_t| ≤ M, and (3) follows from |E[X_t[T>t]| ≤ E[|X_t|[T>t]] ≤ M Pr[T>t] → 0 since Pr[T<∞] = 1.
Optional Stopping Theorem (bounded increments): If (X_t,ℱ_t) is a martingale and T is a stopping time for {ℱ_t} where Pr[T<∞] = 1, ET≤∞, and |X_t-X_t-1| ≤ c always for all t, then E[X_T] = E[X₀]. Proof: Here we go back into the original proof of the OST, but there is a simpler argument that E[(X_T-X_n)[T > n]] converges to 0. The idea is that |E[(X_T-X_n)[T>n]]| = |E[∑_t≥n (X_t+1-X_t) [T>t]]| ≤ E[∑_t≥n |(X_t+1-X_t)| [T>t]] ≤ E[∑_t≥n c[T>t] ]. Now use the fact that E[T] = ∑_t [T>t] to show that this is again the tail of a convergent sequence, and thus converges to zero.

5.1. Applications

Let X_t be a ±1 random walk that starts at 0 and stops if it reaches +a or -b. Let T be the time at which the process stops. We can easily show that Pr[T<∞] = 1 and ET < ∞ by observing that from any state of the random walk, there is a probability of at least 2^-(a+b) that it stops within a+b steps (by flipping heads a+b times in a row), so that if we consider a sequence of intervals if length a+b, the expected number of such intervals we can have before we stop is at most 2^a+b. We have bounded increments by the definition of the process (bounded range also works). So E[X_T] = E[X₀] = 0 and the probability p of landing on a instead of -b must satisfy pa - (1-p)b = 0, giving p = b/(a+b).
Let X_t and T be as above and let Y_t = X_t²-t, which we previously showed to be a martingale. Again we have bounded increments (but not bounded range!), since X_t² ≤ max(a,b)² for all t and t only changes by 1 at each step. So E[Y_T] = 0, which gives E[T] = E[X_T²]. But we can calculate E[X_T²], since it is a² Pr[X_T = a] + b² Pr[X_t = -b] = a²(b/(a+b)) + b²(a/(a+b)) = (a²b+b²a)/(a+b) = ab.

6. Submartingales and supermartingales

Sometimes we can't guarantee E[X_t+1|ℱ_t] = X_t, but we can nonetheless use X_t as a lower or upper bound on E[X_t+1|ℱ_t]. An adapted process (X_t,ℱ_t) is a submartingale if X_t ≤ E[X_t+1|ℱ_t] for all t and a supermartingale if X_t ≥ E[X_t+1|ℱ_t]. Note that the quantity that is "sub" (below) or "super" (above) is always where we are now: so submartingales tend to go up over time while supermartingales tend to go down. If a process is both a submartingale and a supermartingale, it's a martingale.

Sub/super-martingales are handy when we can't set up an exact martingale, but we don't mind because we only care about one-sided bounds anyway. Most of the properties we have seen for martingales hold for sub/super-martingales if we replace equality with ≤ or ≥: for example, in a submartingale, E[X₀] ≤ E[X_t] for all t.

6.1. The Doob-Meyer decomposition

The easiest way to get these results in general is to write a submartingale (X_t,ℱ_t) as X_t = Y_t + Z_t, where Y_t is a martingale and Z_t is a non-decreasing predictable process with respect to ℱ_t; this means that Z_t+1 is always measurable ℱ_t. The existence of this decomposition (known as the Doob decomposition or sometimes the Doob-Meyer decomposition among people who get tired of naming everything after Doob) is easily shown: define Z₀ = 0, Z_t+1 = Z_t + E[X_t+1-X_t|ℱ_t] and Y_t = X_t - Z_t. Then X_t is trivially equal to Y_t + Z_t, and it's not hard to see that Z_t is both non-decreasing and predictable (the non-decreasing part is just the submartingale property). To show that Y_t is a martingale, compute E[Y_t+1|ℱ_t] = E[X_t+1-Z_t+1|ℱ_t] = E[X_t+1-E[X_t+1-X_t|ℱ_t]|ℱ_t] = E[X_t+1-X_t+1+X_t|ℱ_t] = X_t.

For supermartingales, the same decomposition works, but now Z_t is non-increasing.

6.2. Azuma-Hoeffding inequality for sub- and super-martingales

We can use the existence of the Doob decomposition to prove results about a submartingale or supermartingale X_t even if we don't know how to compute it. For example, suppose (X_t,ℱ_t) is a submartingale with bounded increments, that is, that |X_t-X_t-1| ≤ c_t for fixed constants c_t and all t. Decompose X_t as Y_t+Z_t as above. From |X_t-X_t-1| ≤ c_t we have |(Y_t+Zt_)-(Yt-1_+Zt-1)| ≤ c_t or |(Y_t-Y_t-1)+(Z_t-Z_t-1)| ≤ c_t. We want to get a bound on |Y_t-Y_{t-1|; first we get a bound on Z}t_-Zt-1_{by observing that E[Y}t_-Yt-1_{] = 0 implies that Y}t_-Yt-1_{≥0 with nonzero probability. Suppose that this event occurs; then everything inside the absolute value is non-negative, and we have (Y}t_-Yt-1_)+(Zt_-Zt-1) ≤ c_t which gives 0 ≤ (Z_t-Z_t-1) ≤ c_t (if we also take into account the fact that Z_t is non-decreasing). But then (Y_t-Y_t-1) must lie between -2c_t and c_t, or we violate the constraints on X_t. So we can apply the Azuma-Hoeffding inequality to Y_t with the constants 2c_t, which gives Pr[Y_t-Y₀ ≤ -x] ≤ $\exp\left(\frac{-x^2}{8 \sum_{i=1}^{t} c_i^2}\right)$ . But since X₀ = Y₀ and X_t ≥ Y_t, this gives the same bound for Pr[X_t - X₀ ≤ -x].

A similar bound on Pr[X_t - X₀ ≥ x] holds for supermartingales. Note that in each case only one side of the usual Azuma-Hoeffding bound holds. We can't say much about how fast a submartingale with bounded increments rises (or a supermartingale falls), because it could be that Z_t accounts for nearly all of X_t.

6.3. Sub- and super-martingales from functions of martingales

Suppose (X_t,ℱ_t) is a martingale, and let f be a convex function. Then Jensen's inequality says that E[f(X_t+1)|ℱ_t+] ≥ f(E[X_t+1|ℱ_t]) = f(X_t). In other words, when f is convex, (f(X_t),ℱ_t) is a submartingale. Where convex functions turn martingales into submartingales, concave functions turn martingales into supermartingales. This can be very handy if you want to establish that some sequence is a sub/super-martingale quickly.

Example: Let f(x) = |x|. Then f is convex, so we have that if (X_t,ℱ_t) is a martingale, (|X_t|,ℱ_t) is a submartingale. This means, for example, that if we start a ±1 random walk at some random location X₀, then without knowing anything about the distribution of X₀ we can nonetheless conclude that for any fixed t, E[|X_t|] ≥ E[|X₀|].

This fact is not as useful as one might think for getting bounds: given a martingale, it is almost always better to work with the martingale directly and then apply Jensen's inequality or f afterwards to get the desired bound.

6.4. Supermartingales and recurrences

When we solve a probabilistic recurrence, we get an upper bound on the work that remains in each state. If we have such a bound, we can get a supermartingale that may allow us to prove concentration bounds on the cost of our algorithm. Let C_t be the cost we've paid so far, and Y_t be an upper bound on the expected cost remaining, with ℱ_t the usual σ-algebra describing everything that has happened up to time t. Then we have Y_t ≥ E[C_∞-C_t|ℱ_t].

We can reverse this by finding a non-negative Y_t such that E[Y_t-Y_t+1|ℱ_t] ≥ [C_t+1-C_t|ℱ_t], so that the change in Y_t pays for the cost of each step. If this holds, then for X_t = Y_t + C_t we have E[X_t+1-X_t|ℱ_t] = E[Y_t+1-Y_t+C_t+1-C_t|ℱ_t] ≤ 0, and (X_t,ℱ_t) is a supermartingale. Now let T be the time at which the algorithm terminates, and suppose that one of the conditions for the Optional Stopping Theorem holds for T and (X_t,ℱ_t) (typical cases are that there is a fixed upper bound on the total time for the algorithm or that there is a fixed upper bound on the cost; if neither holds, we may have to do a little bit of work). Then E[C_T] ≤ E[X_T] ≤ E[X₀] = E[Y₀] by the supermartingale property. But since we now have a supermartingale instead of just a recurrence, we may be able to get stronger bounds by using Azuma-Hoeffding.

6.5. Example: QuickSort

For example, let's take QuickSort. We previously calculated (see RandomizedAlgorithms) that 2 n ln n is a bound on the number of comparisons needed to sort an array of n elements. Let's try to turn this into a supermartingale. Recall that QuickSort takes an unsorted array and recursively partitions it into smaller unsorted arrays, terminating when each array has only one element. Let's imagine that we do these partitions in parallel; i.e., at each time step we'll partition every block we've got left with more than one element. Since each such step reduces the size of the largest unsorted block, we finish after at most n such steps.

We expect each block of size n_i to take no more than 2n_i ln n_i comparisons to complete. So we will use the sum ∑ 2 n_i ln n_i over all blocks remaining at time t as our estimate Y_t of the remaining time. If we can show that the change in Y_t pays for the number of comparisons at step t on average, then the supermartingale property will hold for C_t+Y_t and we'll get E[C_∞] ≤ Y₀ = 2 n ln n.

Let's consider the partition of a single block of size n_i. The cost of the partition is n_i-1 comparisons. The expected change in Y_t contributed by this particular block is E[2 n_i ln n - (2 k ln k + 2 (n_i-k-1) ln (n,,i-k-1))], where k is uniform in the range 0..n-1 (we assume that all elements are distinct so that the pivot ends up in a block by itself). The subtracted terms are equal to

$\begin{align*} 2 \cdot \frac{1}{n} \sum_{k=0}^{n-1} 2 k \ln k &\le \frac{4}{n} \int_{2}^{n} n \ln n \\ &= \frac{4}{n} \left(\frac{n^2 \ln n}{2} - \frac{n^2}{4} - \ln 2 + 1\right) \\ &= 2 n \ln n - n - \ln 2 + 1 &\le 2 n \ln n - n. \end{align*}$

The -n pays for the n-1 cost on average. We thus get that {C_t + ∑ 2 n_i ln n_i} is a supermartingale, and that the expected number of comparisons to sort n elements is at most 2 n ln n.

We already knew this. Can we learn anything else?

6.5.1. A mediocre high-probability bound using Azuma-Hoeffding

Let's try to get a bound on how much X_t = C_t+Y_t changes at each step. We can compute C_t+1-C_t = ∑ (n_i-1) ≤ n, and Y_t+1-Y_t ≤ 0, so X_t+1-X_t ≤ n. In the other direction, when we split a block of size n_i into two blocks of size k and n_i-k-1, the smallest value we can get for 2 k ln k + 2 (n_i-k-1) ln (n_i-k-1) is when the split is exactly even (to prove this, use convexity of the potential function). In this case we get a drop from 2 n_i ln n_i to 4 ((n_i-1)/2) ln ((n_i-1)/2) = 2 (n_i - 1) ln ((n_i-1)/2) = 2 (n_i - 1) ln (n_i(1-1/2n_i)) = 2(n_i - 1) (ln n_i + ln (1-1/2n_i)) ≥ 2(n_i - 1) (ln n_i - ln 2) = 2 n_i ln n_i - ln n_i - 2 n_i ln 2 + ln 2, giving a drop of at most ln n_i + 2 n_i ln 2 + ln 2 = ln (2n_i) + 2n_i. It is not hard to see that this expression is maximized when all n_i = 1 (which is a sign that our bound is not going to be tight), which gives a total over all i that is at most n(2 + ln 2) < 3n.

Now plug this into Azuma-Hoeffding (the one-sided version) to get Pr[C_n - 2 n ln n ≥ x] ≤ exp(-x²/8n⋅(3n)²), which starts getting small for x ≥ √(72n³) or approximately x ≥ 9n^3/2. This is a pretty weak bound, although it's much better than we get with just Markov's inequality.

(There are much better ways to do this.)

CategoryRandomizedAlgorithmsNotes