[FrontPage] [TitleIndex] [WordIndex]

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

# 1. Stochastic processes

A stochastic process is a sequence of random variables X0, X1, ..., typically indexed either by ℕ (a discrete-time stochastic process) or ℝ (a continuous-time stochastic process; sometimes ℝ+ if we don't consider times less than 0). Interpretation: A random process that evolves over time. Common examples are martingales (described below), and Markov processes, where the distribution of Xi+1 depends only on Xi and not on any previous states.

# 2. Sigma-algebras

A sigma-algebra (or σ-algebra) is a family of sets that includes a maximal universe Ω and the empty set ∅, is closed under complement with respect to Ω and under countable unions. The elements of this family are called measurable sets. Any probability space includes a σ-algebra that describes what events are assigned probabilities in the space. For discrete spaces this is often all subsets of Ω. For continuous spaces a restricted σ-algebra is needed to avoid paradoxes; the usual choice for e.g. [0,1] are the Lebesgue-measurable sets. For our purposes the useful property of σ-algebras is that they can be used to represent restricted information.

For example, suppose I roll two six-sided dice but only tell you what the sum of the two dice is. What I see is a probability space with 36 elements (all outcomes of the two dice). What you see is a random variable with only 11 values (the possible sums 2..12). You can detect whether some events occured ("is the sum even?") but not others ("is the first die 3?"). The collection of events that you can decide includes both ∅ and Ω, and is closed under complement and union, so it's a σ-algebra. The structure of this σ-algebra is that it includes a set A if and only if whenever A contains some point ω = (x,y), it contains all other points ω'=(x',y') where x+y = x'+y'.

We say that a random variable X is measurable with respect to a σ-algebra ℱ (or just measurable ℱ for short) if the event [X≤x] is measurable (i.e., included in the σ-algebra) for all x. For discrete random variables this just means that any event of the form X∈S is measurable. One way to get a σ-algebra is to take a random variable consider the smallest σ-algebra with respect to which X is measurable; this is called the σ-algebra generated by X and is written <X>. In general, we can also consider σ-algebras generated by any collection of events <A,B,C> or random variables <X1,X2,X3>; in each case we are looking at the smallest σ-algebra with respect to which all of these events or random variables are measurable.

In the two-dice example, the σ-algebra we defined is just <X1+X2>. If I tell you the value of the first as well, we get <X1+X2,X1>, which happens to include every event in the underlying probability space. We could also have σ-algebras <X1>, <X2>, <X1,[X1<X2]>, that correspond to different information.

For discrete probability spaces, we can think of a σ-algebra as consisting of all unions of a set of elementary events or atoms, which are events with the property that no proper subset except ∅ is also contained in the σ-algebra. In this case, a random variable X is measurable ℱ if and only if it is constant on each atom of ℱ. The atoms of ℱ form a partition of Ω; each outcome ω∈Ω lies in exactly one atom. As a special case, the σ-algebra <X> for discrete X has as its elementary events all nonempty sets of the form X-1(x).

## 2.1. Filtrations

With a stochastic process, it is natural to talk not only about the value of the process at time t (which is just the random variable Xt), but also what we know at time t. The usual assumption is that we never forget anything; so at minimum we can decide any event in the σ-algebra <X1...Xt>. If we have additional information (perhaps there is some underlying process used to generate the Xi, and we can observe details of this process that are not necessarily included in the value of the Xi), then we have a sequence of nested σ-algebras ℱ0⊆ℱ1⊆ℱ2⊆... . Any such sequence is called a filtration; a stochastic process {Xt} is adapted to a filtration {ℱt} if Xt is measurable ℱt for all t.

## 2.2. Conditional expectations

We've previously seen the definition of E[X|A] = ∑ x Pr[X=x|A], the expectation of X conditioned on an event A. We can also defined the expectation E[X|Y] of X conditioned on a random variable Y, or in even more generality the expectation E[X|ℱ] of X conditioned on a σ-algebra ℱ. In the first case, we define E[X|Y] = E[X|Y=y] when Y=y; unlike ordinary conditional expectation, which is always a constant, E[X|Y] is a random variable that may take on a different value for each value of y (but is constant if we fix the value of Y). In a discrete probability space, E[X|ℱ] is defined by considering the atoms A of ℱ and defining E[X|ℱ] = E[X|A] whenever we land in A. It is not hard to see that this is equivalent to E[X|Y] if Y generates ℱ.

For continuous spaces, E[X|ℱ] is a little trickier to define, and may not even exist if X is unbounded. One way to define it is to observe that the set of all real-valued random variables defined on a given probability space form a vector space (given any random variables X and Y, aX+bY is also a random variable), and that the subset of random variables that are measurable ℱ is a subspace of this vector space. So we can define E[X|ℱ] as the closest point to X in the subspace of random variables that are measurable ℱ. In practice, we are never going to actually do this.

Conditional expectations satisfy all the usual properties of expectations, so E[aX+bY|ℱ] = aE[X|ℱ] +bE[Y|ℱ], X≥Y ⇒ E[X|ℱ] ≥ E[Y|ℱ], etc. One very useful fact that only comes up because of the conditioning is that if ℱ1⊆ℱ2, then E[X|ℱ1] = E[E[X|ℱ2]|ℱ1]; in other words, when computing an expectation it doesn't matter if we average all at once or in stages, as long as we end up averaging across the same elementary events. A complement to this fact is that if X is measurable ℱ, it acts like a constant: E[X|ℱ] = X and E[XY|ℱ] = X E[Y|ℱ]. (We won't prove these facts here, see e.g. GrimmetStirzaker.)

# 3. Martingales

So now let's imagine we have a stochastic process {Xt} (what we have at time t) adapted to a filtration {ℱt} (what we know at time t). We can think of this process as describing the state of our finances after we've been playing in a casino for a while. If the casino is perfectly fair (unlike what happens in real life), then each bet we place should have an expected return of 0. In other words, if we want to estimate based on what we know at time t how much money we will have at time t+1, our best guess would be Xt.

Formally, a stochastic process as above is a martingale if E[Xt+1|ℱt] = Xt. Often we replace ℱt with the σ-algebra generated by X0...Xt and write this as E[Xt+1|X0...Xt] = Xt. The useful property of martingales is that we can verify the martingale property locally, by proving either that E[Xt+1|ℱt] = Xt or equivalently that E[Xt+1 - Xt|ℱt] = E[Xt+1|ℱt] - Xt = 0. But this local property has strong consequences that apply across long intervals of time, as we will see below.

## 3.1. Examples of martingales

• Let Xt+1 = Xt ± bt where +bt and -bt occur with equal probability bt is measurable ℱt, and the outcome ±bt is measurable ℱt+1 (in other words, my "bet" bt can only depend on what has happened so far and not on the future, but my knowledge ℱt includes the outcome of all past bets). Then {Xt|ℱt} is a martingale. In general, if Yt+1-Yt = bt(Xt+1-Xt) where (Xt,ℱt) is a martingale and bt is measurable ℱt, then Yt is also a martingale with respect ℱt. Proof: E[Yt+1-Yt|ℱt] = E[bt(Xt+1-Xt)|ℱt] = bt E[Xt+1-Xt|ℱt] = bt⋅0 = 0.

• Special case: A random ±1 walk is a martingale.
• Another random walk: Let Xt+1 = Xt±1 with equal probability and let Yt = Xt2 - t. Then E[Yt+1|Yt] = ½(Xt+1)2+½(Xt-1)2 - (t+1) = Xt2 + 1 - (t+1) = Xt2 - t = Yt. So {Yt} is a martingale.

• Doob's martingale: Let Z be any random variable and {ℱt} any filtration. Let Xt= E[Z|ℱt] (suppose that all such expectations exist). Then E[Xt+1|ℱt] = E[E[Z|ℱt+1]|ℱt] = E[Z|ℱt] = Xt; (Xt,ℱt) is a martingale.

## 3.2. Properties of martingales

The defining property of a martingale is that E[Xt+1|ℱt] = Xt. What if we want to compute E[Xt] without conditioning, or E[Xt+k|ℱt], where k > 1?

We can easily show by induction on k that, for k≥0, E[Xt+k|ℱt] = Xt. The base is that when k = 0, E[Xt+k|ℱt] = E[Xt|ℱt] = Xt since {Xt} is adapted to {ℱt} and thus Xt is measurable ℱt. For larger k, compute E[Xt+k|ℱt] = E[E[Xt+k|ℱt+k-1]|ℱt] = E[Xt+k-1|ℱt] = Xt.

What about E[Xt], with no conditioning? Here we observe E[Xt] = E[E[Xt|ℱ0]] = E[X0]. In other words, martingales never go anywhere, at least in expectation.

We can apply this to the martingales we've seen so far:

• For an arbitrary betting strategy on a fair game, I neither make nor lose money on average. So if I have a strategy that makes me $1 by time t with probability p, then I must lose$x with probability (1-p), where E[Xt] = p - (1-p)x = 0 implies x = p/(1-p). This means that I can (by doubling my bet each time I lose, for example) make $1 with probability 1-10-6 if I am willing to accept a probability of 10-6 of losing 10-6-1 dollars. This is probably a bad strategy for ordinary people but it works out well for insurance companies, who charge a small additional premium to make a profit above the zero expected return. • A random walk that runs for t steps starting at position 0 ends with E[Xt2-t] = 0 or E[Xt2] = t. # 4. Martingales as sums of uncorrelated random variables Let {Xt,ℱt} be a martingale with X0=0. Define, for t>0, Δt, = Xt-Xt-1. Then we can write Xt as , where E[Δi1...Δi-1] = E[Xi-Xi-1|ℱi] = 0. In other words, Δi is uncorrelated with Δ1...Δi-1. This is not quite as good a condition as independence, but is good enough that martingales often act very much like sums of independent random variables. ## 4.1. Variance of Xt For example, suppose we want to compute the Var[Xt] = E[(Xt-EXt)2] = E[Xt2] (recall that we are assuming X0 = 0, so EXt = 0 by the martingale property). We have Var[Xt] = Var[∑i≤t Δi] = ∑i,j≤t Cov(Δij) = ∑i≤t Var[Δi] + 2 ∑i<j≤t Cov(Δij). Now Var[Δi] = E[Δi^2^], since EΔi = 0, and Cov(Δij) = E[ΔiΔj] - E[Δi]E[Δj] = E[E[Δjii] - 0⋅0 = 0 - 0 = 0, since E[Δji] = E[E[Δj|ℱj-1]|Δi] = 0. It follows that Var[Xt] is exactly ∑i≤t E[Δi2]. This means that if we have a bound |Δi| ≤ ci for all i, we can compute Var[Xt] ≤ ∑i≤t ci2, and apply Chebyshev's inequality to the result to get a bound on the tails of Xt. But in fact we can do much better by using moment generating functions; the result is an inequality known as the Azuma-Hoeffding inequality (named for its independent co-discoverers), which is the martingale version of Chernoff bounds. ## 4.2. The Azuma-Hoeffding inequality Same setup as above: X0 = 0, Δt = Xt - Xt-1 with |Δt|≤ct for all t. Then Variants: 1. If we don't assume X0 = 0, this becomes a bound on Xt-X0. 2. Since {-Xt,ℱt} is also a martingale, the same bound applies, which gives the symmetric bound Pr[Xt-X0 ≤ -x] ≤ exp(-x2/2∑ci2). 3. Applying the union bound gives Pr[|Xt-X0| ≥ x] ≤ 2 exp(-x2/2∑ci2). The proof is by the usual trick of applying Markov's inequality to E[exp(α∑Xt)] for appropriate choice of α, with some tricks in bounding E[exp(α∑Xt)]. See Azuma-Hoeffding inequality. ### 4.2.1. Application: the method of bounded differences Basic idea: Azuma-Hoeffding applies to any process that we can model as revealing the inputs x1...xn to some function f(x1...xn) one at a time, provided changing any one input changes f by at most c (the Lipschitz condition). The reason is that when the Lipschitz condition holds, then E[f(x1...xn) | x1...xt] is a martingale that satisfies the requirements of the inequality. This allows us to show, for example, that the number of colors needed to color a random graph is tightly concentrated, by considering a process where xi reveals all the edges between vertex i and smaller vertices (a vertex exposure martingale). # 5. Stopping times Let {ℱt} be a filtration. A random variable T is a stopping time for {ℱt} if T≥0 and the event [T≤t] is measurable ℱt for all t. In simple terms, T is a stopping time if you know at time t whether you've stopped or not. Stopping times are very handy when used with martingales, because of the Optional Stopping Theorem If (Xt,ℱt) is a martingale and T is a stopping time for {ℱt}, then E[XT] = E[X0] if 1. Pr[T<∞] = 1, 2. E[|XT|] < ∞, and 3. limt→∞ E[ Xt⋅[T > t] ] = 0. The first condition says that T is finite with probability 1 (i.e., eventually we do stop). The second condition puts a bound on how big |XT| can get, which excludes some bad outcomes where we accept a small probability of a huge loss in order to get a large probability of a small gain. The third says that the contribution of large values of t to E[XT] goes to zero as we consider larger and larger t; the term [T > t] is the indicator variable for the event that T is larger than t. It would be nice if we could show E[XT] = E[X0] without the side conditions, but in general this isn't true. For example, the double-or-win martingale strategy eventually returns 1 with probability 1, so if T is the time we stop playing, we have Pr[T<∞] = 1, E[|XT|] < ∞, but E[XT] = 1 ≠ E[X0] = 0. Here we have E[Xt⋅[T>t]] = -2t-1(1/2)t-1 = -1 for t>0, so it's the third condition that's violated. (The same thing happens for a simple ±1 random walk that stops at +1, but it's a bit harder to calculate E[Xt⋅[T>t]] in this case.) To prove the optional stopping theorem, it helps to start with a simpler version: Optional Stopping Theorem (baby version) If (Xt,ℱt) is a martingale and T is a stopping time for {ℱt}, then for any n∈ℕ, E[Xmin(T,n)] = E[X0]. Proof Define Yt = <<X0 + latex($\sum_{i=1}^{t} (Xt-Xt-1) [T\\le t-1]\$)>>. Then (Yt,ℱt) is a martingale since we can treat [T≤t-1] as a sequence of bets. But then E[Xmin(T,n)] = E[Yn] = E[Y0] = E[X0].

So now we'll prove the full version by considering E[Xmin(T,n)] and showing that, under the conditions of the theorem, it approaches E[XT] as n goes to infinity. First observe that, for any n, XT = Xmin(T,n) + [T>n](XT-Xn), because either T≤n, and we just get XT, or T>n, and we get Xn+(XT-Xn) = XT. Now take expectations: E[XT] = E[Xmin(T,n)] + E[[T>n]XT] - E[[T>n]Xn]. Condition (3) on the theorem gives limn→∞ E[[T>n]Xn]→0. If we can show that the middle term also vanishes in the limit, we are done.

Here we use condition (2). Observe that E[[T>n] XT] = . Compare this with E[XT] = ; this is a convergent series, so in the limit the sum of the terms for i=0..n converges to E[XT]. But this means that the sum of the remaining terms for i=n+1..∞ converges to zero. So the middle term goes to zero as n goes to infinity.

We thus have E[XT] = limn→∞ E[Xmin(T,n)] + E[[T>n]XT] - E[[T>n]Xn] = limn→∞ E[Xmin(T,n)] = E[X0,,]. This completes the proof.

Using the full-blown optional stopping theorem is a pain in the neck, because conditions (2) and (3) are often hard to test directly. So in practice one generally chooses one of several weaker but easier variants:

Optional Stopping Theorem (bounded time)

If (Xt,ℱt) is a martingale and T is a stopping time for {ℱt} with T≤n always for some fixed n, then E[XT] = E[X0]. Proof: XT = Xmin(T,n).

Optional Stopping Theorem (bounded range)

If (Xt,ℱt) is a martingale and T is a stopping time for {ℱt} with Pr[T<∞] = 0 and |Xt| ≤ M always for some fixed M and all t, then E[XT] = E[X0]. Proof: Use the full OST, with (1) given, (2) implied by |Xt| ≤ M, and (3) follows from |E[Xt[T>t]| ≤ E[|Xt|[T>t]] ≤ M Pr[T>t] → 0 since Pr[T<∞] = 1.

Optional Stopping Theorem (bounded increments)

If (Xt,ℱt) is a martingale and T is a stopping time for {ℱt} where Pr[T<∞] = 1, ET≤∞, and |Xt-Xt-1| ≤ c always for all t, then E[XT] = E[X0]. Proof: Here we go back into the original proof of the OST, but there is a simpler argument that E[(XT-Xn)[T > n]] converges to 0. The idea is that |E[(XT-Xn)[T>n]]| = |E[∑t≥n (Xt+1-Xt) [T>t]]| ≤ E[∑t≥n |(Xt+1-Xt)| [T>t]] ≤ E[∑t≥n c[T>t] ]. Now use the fact that E[T] = ∑t [T>t] to show that this is again the tail of a convergent sequence, and thus converges to zero.

## 5.1. Applications

• Let Xt be a ±1 random walk that starts at 0 and stops if it reaches +a or -b. Let T be the time at which the process stops. We can easily show that Pr[T<∞] = 1 and ET < ∞ by observing that from any state of the random walk, there is a probability of at least 2-(a+b) that it stops within a+b steps (by flipping heads a+b times in a row), so that if we consider a sequence of intervals if length a+b, the expected number of such intervals we can have before we stop is at most 2a+b. We have bounded increments by the definition of the process (bounded range also works). So E[XT] = E[X0] = 0 and the probability p of landing on a instead of -b must satisfy pa - (1-p)b = 0, giving p = b/(a+b).

• Let Xt and T be as above and let Yt = Xt2-t, which we previously showed to be a martingale. Again we have bounded increments (but not bounded range!), since Xt2 ≤ max(a,b)2 for all t and t only changes by 1 at each step. So E[YT] = 0, which gives E[T] = E[XT2]. But we can calculate E[XT2], since it is a2 Pr[XT = a] + b2 Pr[Xt = -b] = a2(b/(a+b)) + b2(a/(a+b)) = (a2b+b2a)/(a+b) = ab.

# 6. Submartingales and supermartingales

Sometimes we can't guarantee E[Xt+1|ℱt] = Xt, but we can nonetheless use Xt as a lower or upper bound on E[Xt+1|ℱt]. An adapted process (Xt,ℱt) is a submartingale if Xt ≤ E[Xt+1|ℱt] for all t and a supermartingale if Xt ≥ E[Xt+1|ℱt]. Note that the quantity that is "sub" (below) or "super" (above) is always where we are now: so submartingales tend to go up over time while supermartingales tend to go down. If a process is both a submartingale and a supermartingale, it's a martingale.

Sub/super-martingales are handy when we can't set up an exact martingale, but we don't mind because we only care about one-sided bounds anyway. Most of the properties we have seen for martingales hold for sub/super-martingales if we replace equality with ≤ or ≥: for example, in a submartingale, E[X0] ≤ E[Xt] for all t.

## 6.1. The Doob-Meyer decomposition

The easiest way to get these results in general is to write a submartingale (Xt,ℱt) as Xt = Yt + Zt, where Yt is a martingale and Zt is a non-decreasing predictable process with respect to ℱt; this means that Zt+1 is always measurable ℱt. The existence of this decomposition (known as the Doob decomposition or sometimes the Doob-Meyer decomposition among people who get tired of naming everything after Doob) is easily shown: define Z0 = 0, Zt+1 = Zt + E[Xt+1-Xt|ℱt] and Yt = Xt - Zt. Then Xt is trivially equal to Yt + Zt, and it's not hard to see that Zt is both non-decreasing and predictable (the non-decreasing part is just the submartingale property). To show that Yt is a martingale, compute E[Yt+1|ℱt] = E[Xt+1-Zt+1|ℱt] = E[Xt+1-E[Xt+1-Xt|ℱt]|ℱt] = E[Xt+1-Xt+1+Xt|ℱt] = Xt.

For supermartingales, the same decomposition works, but now Zt is non-increasing.

## 6.2. Azuma-Hoeffding inequality for sub- and super-martingales

We can use the existence of the Doob decomposition to prove results about a submartingale or supermartingale Xt even if we don't know how to compute it. For example, suppose (Xt,ℱt) is a submartingale with bounded increments, that is, that |Xt-Xt-1| ≤ ct for fixed constants ct and all t. Decompose Xt as Yt+Zt as above. From |Xt-Xt-1| ≤ ct we have |(Yt+Zt)-(Yt-1+Zt-1)| ≤ ct or |(Yt-Yt-1)+(Zt-Zt-1)| ≤ ct. We want to get a bound on |Yt-Yt-1|; first we get a bound on Zt-Zt-1 by observing that E[Yt-Yt-1] = 0 implies that Yt-Yt-1≥0 with nonzero probability. Suppose that this event occurs; then everything inside the absolute value is non-negative, and we have (Yt-Yt-1)+(Zt-Zt-1) ≤ ct which gives 0 ≤ (Zt-Zt-1) ≤ ct (if we also take into account the fact that Zt is non-decreasing). But then (Yt-Yt-1) must lie between -2ct and ct, or we violate the constraints on Xt. So we can apply the Azuma-Hoeffding inequality to Yt with the constants 2ct, which gives Pr[Yt-Y0 ≤ -x] ≤ . But since X0 = Y0 and Xt ≥ Yt, this gives the same bound for Pr[Xt - X0 ≤ -x].

A similar bound on Pr[Xt - X0 ≥ x] holds for supermartingales. Note that in each case only one side of the usual Azuma-Hoeffding bound holds. We can't say much about how fast a submartingale with bounded increments rises (or a supermartingale falls), because it could be that Zt accounts for nearly all of Xt.

## 6.3. Sub- and super-martingales from functions of martingales

Suppose (Xt,ℱt) is a martingale, and let f be a convex function. Then Jensen's inequality says that E[f(Xt+1)|ℱt+] ≥ f(E[Xt+1|ℱt]) = f(Xt). In other words, when f is convex, (f(Xt),ℱt) is a submartingale. Where convex functions turn martingales into submartingales, concave functions turn martingales into supermartingales. This can be very handy if you want to establish that some sequence is a sub/super-martingale quickly.

Example

Let f(x) = |x|. Then f is convex, so we have that if (Xt,ℱt) is a martingale, (|Xt|,ℱt) is a submartingale. This means, for example, that if we start a ±1 random walk at some random location X0, then without knowing anything about the distribution of X0 we can nonetheless conclude that for any fixed t, E[|Xt|] ≥ E[|X0|].

This fact is not as useful as one might think for getting bounds: given a martingale, it is almost always better to work with the martingale directly and then apply Jensen's inequality or f afterwards to get the desired bound.

## 6.4. Supermartingales and recurrences

When we solve a probabilistic recurrence, we get an upper bound on the work that remains in each state. If we have such a bound, we can get a supermartingale that may allow us to prove concentration bounds on the cost of our algorithm. Let Ct be the cost we've paid so far, and Yt be an upper bound on the expected cost remaining, with ℱt the usual σ-algebra describing everything that has happened up to time t. Then we have Yt ≥ E[C-Ct|ℱt].

We can reverse this by finding a non-negative Yt such that E[Yt-Yt+1|ℱt] ≥ [Ct+1-Ct|ℱt], so that the change in Yt pays for the cost of each step. If this holds, then for Xt = Yt + Ct we have E[Xt+1-Xt|ℱt] = E[Yt+1-Yt+Ct+1-Ct|ℱt] ≤ 0, and (Xt,ℱt) is a supermartingale. Now let T be the time at which the algorithm terminates, and suppose that one of the conditions for the Optional Stopping Theorem holds for T and (Xt,ℱt) (typical cases are that there is a fixed upper bound on the total time for the algorithm or that there is a fixed upper bound on the cost; if neither holds, we may have to do a little bit of work). Then E[CT] ≤ E[XT] ≤ E[X0] = E[Y0] by the supermartingale property. But since we now have a supermartingale instead of just a recurrence, we may be able to get stronger bounds by using Azuma-Hoeffding.

## 6.5. Example: QuickSort

For example, let's take QuickSort. We previously calculated (see RandomizedAlgorithms) that 2 n ln n is a bound on the number of comparisons needed to sort an array of n elements. Let's try to turn this into a supermartingale. Recall that QuickSort takes an unsorted array and recursively partitions it into smaller unsorted arrays, terminating when each array has only one element. Let's imagine that we do these partitions in parallel; i.e., at each time step we'll partition every block we've got left with more than one element. Since each such step reduces the size of the largest unsorted block, we finish after at most n such steps.

We expect each block of size ni to take no more than 2ni ln ni comparisons to complete. So we will use the sum ∑ 2 ni ln ni over all blocks remaining at time t as our estimate Yt of the remaining time. If we can show that the change in Yt pays for the number of comparisons at step t on average, then the supermartingale property will hold for Ct+Yt and we'll get E[C] ≤ Y0 = 2 n ln n.

Let's consider the partition of a single block of size ni. The cost of the partition is ni-1 comparisons. The expected change in Yt contributed by this particular block is E[2 ni ln n - (2 k ln k + 2 (ni-k-1) ln (n,,i-k-1))], where k is uniform in the range 0..n-1 (we assume that all elements are distinct so that the pivot ends up in a block by itself). The subtracted terms are equal to

The -n pays for the n-1 cost on average. We thus get that {Ct + ∑ 2 ni ln ni} is a supermartingale, and that the expected number of comparisons to sort n elements is at most 2 n ln n.

We already knew this. Can we learn anything else?

### 6.5.1. A mediocre high-probability bound using Azuma-Hoeffding

Let's try to get a bound on how much Xt = Ct+Yt changes at each step. We can compute Ct+1-Ct = ∑ (ni-1) ≤ n, and Yt+1-Yt ≤ 0, so Xt+1-Xt ≤ n. In the other direction, when we split a block of size ni into two blocks of size k and ni-k-1, the smallest value we can get for 2 k ln k + 2 (ni-k-1) ln (ni-k-1) is when the split is exactly even (to prove this, use convexity of the potential function). In this case we get a drop from 2 ni ln ni to 4 ((ni-1)/2) ln ((ni-1)/2) = 2 (ni - 1) ln ((ni-1)/2) = 2 (ni - 1) ln (ni(1-1/2ni)) = 2(ni - 1) (ln ni + ln (1-1/2ni)) ≥ 2(ni - 1) (ln ni - ln 2) = 2 ni ln ni - ln ni - 2 ni ln 2 + ln 2, giving a drop of at most ln ni + 2 ni ln 2 + ln 2 = ln (2ni) + 2ni. It is not hard to see that this expression is maximized when all ni = 1 (which is a sign that our bound is not going to be tight), which gives a total over all i that is at most n(2 + ln 2) < 3n.

Now plug this into Azuma-Hoeffding (the one-sided version) to get Pr[Cn - 2 n ln n ≥ x] ≤ exp(-x2/8n⋅(3n)2), which starts getting small for x ≥ √(72n3) or approximately x ≥ 9n3/2. This is a pretty weak bound, although it's much better than we get with just Markov's inequality.

(There are much better ways to do this.)

2014-06-17 11:58