We start with some definitions:

# 1. Stochastic processes

A **stochastic process** is a collection of random variables X_{0}, X_{1}, ..., typically indexed either by ℕ (a **discrete-time** stochastic process) or by ℝ (a **continuous-time** stochastic process; sometimes ℝ^{+} if we don't consider times less than 0). Interpretation: a random process that evolves over time. Common examples are **martingales** (described below) and **Markov processes**, where the distribution of X_{i+1} depends only on X_{i} and not on any previous states.

# 2. Sigma-algebras

A **sigma-algebra** (or **σ-algebra**) is a family of sets that includes a maximal universe Ω and the empty set ∅, and is closed under complement with respect to Ω and under countable unions. The elements of this family are called **measurable sets**. Any probability space includes a σ-algebra that describes which events are assigned probabilities in the space. For discrete spaces this is often the set of all subsets of Ω. For continuous spaces a restricted σ-algebra is needed to avoid paradoxes; the usual choice for e.g. [0,1] is the family of Lebesgue-measurable sets. For our purposes the useful property of σ-algebras is that they can be used to represent restricted information.

For example, suppose I roll two six-sided dice but only tell you what the sum of the two dice is. What I see is a probability space with 36 elements (all outcomes of the two dice). What you see is a random variable with only 11 values (the possible sums 2..12). You can detect whether some events occurred ("is the sum even?") but not others ("is the first die 3?"). The collection of events that you can decide includes both ∅ and Ω, and is closed under complement and union, so it's a σ-algebra. The structure of this σ-algebra is that it includes a set A if and only if whenever A contains some point ω = (x,y), it contains all other points ω'=(x',y') where x+y = x'+y'.
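On a finite space like this one, the atoms of the σ-algebra can be enumerated explicitly. Here is a small Python sketch (our own illustration, not part of the notes' development) that builds the atoms of the σ-algebra generated by the sum of two dice and checks which of the two events above is measurable:

```python
from itertools import product

# Sample space: all 36 outcomes of two six-sided dice.
omega = set(product(range(1, 7), repeat=2))

# Atoms of the sigma-algebra generated by X1+X2: one atom per possible sum.
atoms = {}
for (x, y) in omega:
    atoms.setdefault(x + y, set()).add((x, y))

print(len(atoms))  # 11 possible sums, 2..12

# "Is the sum even?" never splits an atom, so it is a union of atoms: measurable.
even_sum = {w for w in omega if sum(w) % 2 == 0}
print(any(atoms[s] - even_sum and atoms[s] & even_sum for s in atoms))  # False

# "Is the first die 3?" splits atoms (e.g. the atom for sum 4): not measurable.
first_is_3 = {w for w in omega if w[0] == 3}
print(any(atoms[s] - first_is_3 and atoms[s] & first_is_3 for s in atoms))  # True
```

An event is measurable with respect to this σ-algebra exactly when it neither splits any atom nor leaves one partially covered, which is what the two `any(...)` checks test.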

We say that a random variable X is **measurable** with respect to a σ-algebra ℱ (or just **measurable ℱ** for short) if the event [X≤x] is measurable (i.e., included in the σ-algebra) for all x. For discrete random variables this just means that any event of the form X∈S is measurable. One way to get a σ-algebra is to take a random variable X and consider the smallest σ-algebra with respect to which X is measurable; this is called the σ-algebra **generated** by X and is written <X>. In general, we can also consider σ-algebras generated by any collection of events <A,B,C> or random variables <X_{1},X_{2},X_{3}>; in each case we are looking at the smallest σ-algebra with respect to which all of these events or random variables are measurable.

In the two-dice example, the σ-algebra we defined is just <X_{1}+X_{2}>. If I tell you the value of the first as well, we get <X_{1}+X_{2},X_{1}>, which happens to include every event in the underlying probability space. We could also have σ-algebras <X_{1}>, <X_{2}>, <X_{1},[X_{1}<X_{2}]>, that correspond to different information.

For discrete probability spaces, we can think of a σ-algebra as consisting of all unions of a set of **elementary events** or **atoms**, which are events with the property that no proper subset except ∅ is also contained in the σ-algebra. In this case, a random variable X is measurable ℱ if and only if it is constant on each atom of ℱ. The atoms of ℱ form a partition of Ω; each outcome ω∈Ω lies in exactly one atom. As a special case, the σ-algebra <X> for discrete X has as its elementary events all nonempty sets of the form X^{-1}(x).

## 2.1. Filtrations

With a stochastic process, it is natural to talk not only about the value of the process at time t (which is just the random variable X_{t}), but also what we know at time t. The usual assumption is that we never forget anything; so at minimum we can decide any event in the σ-algebra <X_{1}...X_{t}>. If we have additional information (perhaps there is some underlying process used to generate the X_{i}, and we can observe details of this process that are not necessarily included in the value of the X_{i}), then we have a sequence of nested σ-algebras ℱ_{0}⊆ℱ_{1}⊆ℱ_{2}⊆... . Any such sequence is called a **filtration**; a stochastic process {X_{t}} is **adapted** to a filtration {ℱ_{t}} if X_{t} is measurable ℱ_{t} for all t.

## 2.2. Conditional expectations

We've previously seen the definition of E[X|A] = ∑ x Pr[X=x|A], the expectation of X conditioned on an event A. We can also define the expectation E[X|Y] of X conditioned on a random variable Y, or in even more generality the expectation E[X|ℱ] of X conditioned on a σ-algebra ℱ. In the first case, we define E[X|Y] = E[X|Y=y] when Y=y; unlike ordinary conditional expectation, which is always a constant, E[X|Y] is a random variable that may take on a different value for each value of Y (but is constant if we fix the value of Y). In a discrete probability space, E[X|ℱ] is defined by considering the atoms A of ℱ and defining E[X|ℱ] = E[X|A] whenever we land in A. It is not hard to see that this is equivalent to E[X|Y] if Y generates ℱ.
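On a discrete space this definition can be executed directly: average X over the atom the outcome lands in. A minimal sketch (the helper name `cond_exp` is our own), using the two-dice space with X = the first die and Y = the sum; it also illustrates that averaging E[X|Y] over all outcomes recovers E[X]:

```python
from fractions import Fraction
from itertools import product

# Uniform probability space of two dice; X = first die, Y = sum of the dice.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))

def cond_exp(X, Y):
    """Return E[X|Y] as a function of the outcome: the average of X over
    the atom [Y = Y(w)] that contains the outcome w."""
    atoms = {}
    for w in omega:
        atoms.setdefault(Y(w), []).append(X(w))
    return lambda w: Fraction(sum(atoms[Y(w)]), len(atoms[Y(w)]))

X = lambda w: w[0]
Y = lambda w: w[0] + w[1]
EXY = cond_exp(X, Y)

# On the atom [Y = 4] = {(1,3),(2,2),(3,1)}, E[X|Y] = (1+2+3)/3 = 2.
print(EXY((2, 2)))  # 2
# Averaging E[X|Y] over all outcomes recovers E[X] = 7/2.
print(sum(p * EXY(w) for w in omega))  # 7/2
```

Exact rationals are used so that the equality E[E[X|Y]] = E[X] comes out exact rather than up to floating-point error.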

For continuous spaces, E[X|ℱ] is a little trickier to define, and may not even exist if X is unbounded. One way to define it is to observe that the set of all real-valued random variables defined on a given probability space form a vector space (given any random variables X and Y, aX+bY is also a random variable), and that the subset of random variables that are measurable ℱ is a subspace of this vector space. So we can define E[X|ℱ] as the closest point to X in the subspace of random variables that are measurable ℱ. In practice, we are never going to actually do this.

Conditional expectations satisfy all the usual properties of expectations, so E[aX+bY|ℱ] = aE[X|ℱ] + bE[Y|ℱ], X≥Y ⇒ E[X|ℱ] ≥ E[Y|ℱ], etc. One very useful fact that only comes up because of the conditioning is that if ℱ_{1}⊆ℱ_{2}, then E[X|ℱ_{1}] = E[E[X|ℱ_{2}]|ℱ_{1}]; in other words, when computing an expectation it doesn't matter if we average all at once or in stages, as long as we end up averaging across the same elementary events. A complement to this fact is that if X is measurable ℱ, it acts like a constant: E[X|ℱ] = X and E[XY|ℱ] = X E[Y|ℱ]. (We won't prove these facts here; see e.g. Grimmett and Stirzaker.)

# 3. Martingales

So now let's imagine we have a stochastic process {X_{t}} (what we have at time t) adapted to a filtration {ℱ_{t}} (what we know at time t). We can think of this process as describing the state of our finances after we've been playing in a casino for a while. If the casino is perfectly fair (unlike what happens in real life), then each bet we place should have an expected return of 0. In other words, if we want to estimate based on what we know at time t how much money we will have at time t+1, our best guess would be X_{t}.

Formally, a stochastic process as above is a **martingale** if E[X_{t+1}|ℱ_{t}] = X_{t}. Often we replace ℱ_{t} with the σ-algebra generated by X_{0}...X_{t} and write this as E[X_{t+1}|X_{0}...X_{t}] = X_{t}. The useful property of martingales is that we can verify the martingale property locally, by proving either that E[X_{t+1}|ℱ_{t}] = X_{t} or equivalently that E[X_{t+1} - X_{t}|ℱ_{t}] = E[X_{t+1}|ℱ_{t}] - X_{t} = 0. But this local property has strong consequences that apply across long intervals of time, as we will see below.

## 3.1. Examples of martingales

- Let X_{t+1} = X_{t} ± b_{t}, where +b_{t} and -b_{t} occur with equal probability, b_{t} is measurable ℱ_{t}, and the outcome ±b_{t} is measurable ℱ_{t+1} (in other words, my "bet" b_{t} can only depend on what has happened so far and not on the future, but my knowledge ℱ_{t} includes the outcome of all past bets). Then (X_{t},ℱ_{t}) is a martingale. In general, if Y_{t+1}-Y_{t} = b_{t}(X_{t+1}-X_{t}), where (X_{t},ℱ_{t}) is a martingale and b_{t} is measurable ℱ_{t}, then Y_{t} is also a martingale with respect to ℱ_{t}. Proof: E[Y_{t+1}-Y_{t}|ℱ_{t}] = E[b_{t}(X_{t+1}-X_{t})|ℱ_{t}] = b_{t}E[X_{t+1}-X_{t}|ℱ_{t}] = b_{t}⋅0 = 0. Special case: a random ±1 walk is a martingale.
- Another random walk: Let X_{t+1} = X_{t} ± 1 with equal probability, and let Y_{t} = X_{t}^{2} - t. Then E[Y_{t+1}|ℱ_{t}] = ½(X_{t}+1)^{2} + ½(X_{t}-1)^{2} - (t+1) = X_{t}^{2} + 1 - (t+1) = X_{t}^{2} - t = Y_{t}. So (Y_{t},ℱ_{t}) is a martingale.
- Doob's martingale: Let Z be any random variable and {ℱ_{t}} any filtration, and let X_{t} = E[Z|ℱ_{t}] (suppose that all such expectations exist). Then E[X_{t+1}|ℱ_{t}] = E[E[Z|ℱ_{t+1}]|ℱ_{t}] = E[Z|ℱ_{t}] = X_{t}, so (X_{t},ℱ_{t}) is a martingale.
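One-step calculations like the one for Y_{t} = X_{t}^{2} - t are easy to sanity-check mechanically. A minimal Python sketch (our own illustration) verifies the martingale property exactly at every position and time in a range:

```python
# One-step martingale check for Y_t = X_t^2 - t, where X_{t+1} = X_t +/- 1:
# E[Y_{t+1} | X_t = x] = 0.5*(x+1)^2 + 0.5*(x-1)^2 - (t+1), which should
# equal x^2 - t = Y_t for every position x and time t.
def next_expectation(x, t):
    return 0.5 * (x + 1) ** 2 + 0.5 * (x - 1) ** 2 - (t + 1)

ok = all(next_expectation(x, t) == x * x - t
         for x in range(-10, 11) for t in range(20))
print(ok)  # True
```

The check is exact because all the quantities involved are integers (the halves cancel), so floating-point rounding never enters.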

## 3.2. Properties of martingales

The defining property of a martingale is that E[X_{t+1}|ℱ_{t}] = X_{t}. What if we want to compute E[X_{t}] without conditioning, or E[X_{t+k}|ℱ_{t}], where k > 1?

We can easily show by induction on k that, for k≥0, E[X_{t+k}|ℱ_{t}] = X_{t}. The base is that when k = 0, E[X_{t+k}|ℱ_{t}] = E[X_{t}|ℱ_{t}] = X_{t} since {X_{t}} is adapted to {ℱ_{t}} and thus X_{t} is measurable ℱ_{t}. For larger k, compute E[X_{t+k}|ℱ_{t}] = E[E[X_{t+k}|ℱ_{t+k-1}]|ℱ_{t}] = E[X_{t+k-1}|ℱ_{t}] = X_{t}.

What about E[X_{t}], with no conditioning? Here we observe E[X_{t}] = E[E[X_{t}|ℱ_{0}]] = E[X_{0}]. In other words, martingales never go anywhere, at least in expectation.

We can apply this to the martingales we've seen so far:

- For an arbitrary betting strategy on a fair game, I neither make nor lose money on average. So if I have a strategy that makes me $1 by time t with probability p, then I must lose $x with probability (1-p), where E[X_{t}] = p - (1-p)x = 0 implies x = p/(1-p). This means that I can (by doubling my bet each time I lose, for example) make $1 with probability 1-10^{-6} if I am willing to accept a probability of 10^{-6} of losing 10^{6}-1 dollars. This is probably a bad strategy for ordinary people, but it works out well for insurance companies, who charge a small additional premium to make a profit above the zero expected return.
- A random walk that runs for t steps starting at position 0 ends with E[X_{t}^{2}-t] = 0, or E[X_{t}^{2}] = t.
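The doubling-strategy arithmetic can be checked exactly. A small sketch of our own, using exact rationals: after n rounds of double-or-win with an initial $1 bet, the many small wins and the single large loss cancel exactly, for every n:

```python
from fractions import Fraction

# Double-or-win betting: bet 2^k on round k+1 after k straight losses, and stop
# at the first win (net +1). After n rounds, either we won $1 somewhere, or we
# lost 1 + 2 + ... + 2^(n-1) = 2^n - 1 dollars.
def expected_profit(n):
    half = Fraction(1, 2)
    p_win = sum(half ** (k + 1) for k in range(n))  # Pr[first win at round k+1]
    p_bust = half ** n                              # Pr[n straight losses]
    return p_win * 1 + p_bust * (-(2 ** n - 1))

print([expected_profit(n) for n in (1, 5, 20)])  # [Fraction(0, 1), Fraction(0, 1), Fraction(0, 1)]
```

The win probability is 1 - 2^{-n} and the loss contributes -(2^n - 1)2^{-n} = -(1 - 2^{-n}), so the expectation is 0 for every n, exactly as the martingale property demands.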

# 4. Martingales as sums of uncorrelated random variables

Let (X_{t},ℱ_{t}) be a martingale with X_{0}=0. Define, for t>0, Δ_{t} = X_{t}-X_{t-1}. Then we can write X_{t} as X_{t} = ∑_{i=1}^{t} Δ_{i}, where E[Δ_{i}|Δ_{1}...Δ_{i-1}] = E[E[X_{i}-X_{i-1}|ℱ_{i-1}]|Δ_{1}...Δ_{i-1}] = 0. In other words, Δ_{i} is **uncorrelated** with Δ_{1}...Δ_{i-1}. This is not quite as good a condition as independence, but it is good enough that martingales often act very much like sums of independent random variables.

## 4.1. Variance of X_{t}

For example, suppose we want to compute Var[X_{t}] = E[(X_{t}-EX_{t})^{2}] = E[X_{t}^{2}] (recall that we are assuming X_{0} = 0, so EX_{t} = 0 by the martingale property). We have Var[X_{t}] = Var[∑_{i≤t} Δ_{i}] = ∑_{i,j≤t} Cov(Δ_{i},Δ_{j}) = ∑_{i≤t} Var[Δ_{i}] + 2∑_{i<j≤t} Cov(Δ_{i},Δ_{j}). Now Var[Δ_{i}] = E[Δ_{i}^{2}], since EΔ_{i} = 0, and for i<j, Cov(Δ_{i},Δ_{j}) = E[Δ_{i}Δ_{j}] - E[Δ_{i}]E[Δ_{j}] = E[E[Δ_{j}|Δ_{i}]Δ_{i}] - 0⋅0 = 0, since E[Δ_{j}|Δ_{i}] = E[E[Δ_{j}|ℱ_{j-1}]|Δ_{i}] = 0. It follows that Var[X_{t}] is exactly ∑_{i≤t} E[Δ_{i}^{2}].

This means that if we have a bound |Δ_{i}| ≤ c_{i} for all i, we can compute Var[X_{t}] ≤ ∑_{i≤t} c_{i}^{2}, and apply Chebyshev's inequality to the result to get a bound on the tails of X_{t}. But in fact we can do much better by using moment generating functions; the result is an inequality known as the Azuma-Hoeffding inequality (named for its independent co-discoverers), which is the martingale version of Chernoff bounds.
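The identity Var[X_{t}] = ∑_{i≤t} E[Δ_{i}^{2}] can be verified exactly for a small betting martingale whose bets depend on the history, by enumerating all 2^{t} outcome sequences. This is a sketch of our own; the betting rule (bet 1 plus the number of previous up-moves) is an arbitrary choice, picked only so that the increments are dependent but uncorrelated:

```python
from fractions import Fraction
from itertools import product

# Exact check of Var[X_t] = sum_i E[Delta_i^2] for a betting martingale whose
# bet depends on the history: b_i = 1 + (number of previous up-moves), with
# Delta_i = +/- b_i equally likely. Enumerate all 2^t sign sequences.
t = 8
prob = Fraction(1, 2 ** t)
var_x = Fraction(0)
sum_sq = Fraction(0)
for signs in product((-1, 1), repeat=t):
    x, ups, ssq = 0, 0, 0
    for s in signs:
        b = 1 + ups
        d = b * s
        if s > 0:
            ups += 1
        x += d
        ssq += d * d
    var_x += prob * x * x    # contributes to E[X_t^2] = Var[X_t]
    sum_sq += prob * ssq     # contributes to sum_i E[Delta_i^2]

print(var_x == sum_sq)  # True
```

Although the Δ_{i} here are far from independent (later bets grow with earlier wins), the cross terms E[Δ_{i}Δ_{j}] vanish exactly, which is all the variance computation needs.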

## 4.2. The Azuma-Hoeffding inequality

Same setup as above: X_{0} = 0, Δ_{t} = X_{t} - X_{t-1}, with |Δ_{t}|≤c_{t} for all t. Then for all t and all x > 0,

Pr[X_{t} ≥ x] ≤ exp(-x^{2}/2∑_{i≤t}c_{i}^{2}).

Variants:

- If we don't assume X_{0} = 0, this becomes a bound on X_{t}-X_{0}: Pr[X_{t}-X_{0} ≥ x] ≤ exp(-x^{2}/2∑c_{i}^{2}).
- Since {-X_{t},ℱ_{t}} is also a martingale, the same bound applies to it, which gives the symmetric bound Pr[X_{t}-X_{0} ≤ -x] ≤ exp(-x^{2}/2∑c_{i}^{2}).
- Applying the union bound gives Pr[|X_{t}-X_{0}| ≥ x] ≤ 2 exp(-x^{2}/2∑c_{i}^{2}).

The proof is by the usual trick of applying Markov's inequality to E[exp(αX_{t})] for an appropriate choice of α, with some tricks in bounding E[exp(αX_{t})] = E[exp(α∑_{i≤t}Δ_{i})]. See Azuma-Hoeffding inequality.
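As an empirical illustration (not a proof), we can compare the Azuma-Hoeffding bound against simulated tail frequencies for the simplest case, a ±1 random walk with c_{i} = 1 for all i:

```python
import math
import random

random.seed(1)

# Empirical illustration of Azuma-Hoeffding for a +/-1 random walk (c_i = 1):
# Pr[X_t >= x] <= exp(-x^2 / (2 t)).
t, x, trials = 100, 30, 20_000
hits = sum(sum(random.choice((-1, 1)) for _ in range(t)) >= x
           for _ in range(trials))

empirical = hits / trials
bound = math.exp(-x * x / (2 * t))
print(empirical <= bound)  # True
```

Here the bound is exp(-900/200) ≈ 0.011, while the true tail probability is closer to 0.001; Azuma-Hoeffding is not tight, but it has the right exponential shape.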

### 4.2.1. Application: the method of bounded differences

Basic idea: Azuma-Hoeffding applies to any process that we can model as revealing the inputs x_{1}...x_{n} to some function f(x_{1}...x_{n}) one at a time, provided changing any one input changes f by at most c (the **Lipschitz condition**). The reason is that when the Lipschitz condition holds, then E[f(x_{1}...x_{n}) | x_{1}...x_{t}] is a martingale that satisfies the requirements of the inequality.

This allows us to show, for example, that the number of colors needed to color a random graph is tightly concentrated, by considering a process where x_{i} reveals all the edges between vertex i and smaller vertices (a **vertex exposure martingale**).
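A quick illustration of the method on a simpler Lipschitz function than the chromatic number: let f be the number of distinct values among n independent uniform draws from {0,...,n-1}. Changing one draw changes f by at most 1, so the two-sided bound Pr[|f - E[f]| ≥ x] ≤ 2 exp(-x^{2}/2n) applies. A simulation of our own (using the sample mean as a stand-in for E[f]) shows the empirical tail sitting comfortably below the bound:

```python
import math
import random

random.seed(6)

# Method of bounded differences: f = number of distinct values among n uniform
# draws from {0,...,n-1}. Changing one draw changes f by at most 1 (c_i = 1),
# so Pr[|f - E[f]| >= x] <= 2 exp(-x^2 / (2 n)).
n, trials = 100, 20_000
samples = [len({random.randrange(n) for _ in range(n)}) for _ in range(trials)]
mean = sum(samples) / trials  # sample mean, standing in for E[f]

x = 15
empirical = sum(abs(s - mean) >= x for s in samples) / trials
bound = 2 * math.exp(-x * x / (2 * n))
print(empirical <= bound)  # True
```

The same recipe works for the vertex exposure martingale; the only change is that each "input" x_{i} is the whole bundle of edges from vertex i to smaller vertices, and f is the chromatic number.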

# 5. Stopping times

Let {ℱ_{t}} be a filtration. A random variable T is a **stopping time** for {ℱ_{t}} if T≥0 and the event [T≤t] is measurable ℱ_{t} for all t. In simple terms, T is a stopping time if you know at time t whether you've stopped or not.

Stopping times are very handy when used with martingales, because of the

- Optional Stopping Theorem
  If (X_{t},ℱ_{t}) is a martingale and T is a stopping time for {ℱ_{t}}, then E[X_{T}] = E[X_{0}] if
  1. Pr[T<∞] = 1,
  2. E[|X_{T}|] < ∞, and
  3. lim_{t→∞} E[X_{t}⋅[T > t]] = 0.

The first condition says that T is finite with probability 1 (i.e., eventually we do stop). The second condition puts a bound on how big |X_{T}| can get, which excludes some bad outcomes where we accept a small probability of a huge loss in order to get a large probability of a small gain. The third says that the contribution of large values of t to E[X_{T}] goes to zero as we consider larger and larger t; the term [T > t] is the indicator variable for the event that T is larger than t.

It would be nice if we could show E[X_{T}] = E[X_{0}] without the side conditions, but in general this isn't true. For example, the double-or-win martingale strategy eventually returns 1 with probability 1, so if T is the time we stop playing, we have Pr[T<∞] = 1 and E[|X_{T}|] = 1 < ∞, but E[X_{T}] = 1 ≠ E[X_{0}] = 0. Here E[X_{t}⋅[T>t]] = -(2^{t}-1)(1/2)^{t}, which converges to -1 rather than 0 as t→∞, so it's the third condition that's violated. (The same thing happens for a simple ±1 random walk that stops at +1, but it's a bit harder to calculate E[X_{t}⋅[T>t]] in this case.)

To prove the optional stopping theorem, it helps to start with a simpler version:

- Optional Stopping Theorem (baby version)
  If (X_{t},ℱ_{t}) is a martingale and T is a stopping time for {ℱ_{t}}, then for any n∈ℕ, E[X_{min(T,n)}] = E[X_{0}].
- Proof
  Define Y_{t} = X_{0} + ∑_{i=1}^{t} (X_{i}-X_{i-1})[T > i-1]. Then (Y_{t},ℱ_{t}) is a martingale, since we can treat the indicators [T > i-1] = 1 - [T ≤ i-1], each measurable ℱ_{i-1}, as a sequence of bets. But Y_{n} = X_{min(T,n)}, so E[X_{min(T,n)}] = E[Y_{n}] = E[Y_{0}] = E[X_{0}].

So now we'll prove the full version by considering E[X_{min(T,n)}] and showing that, under the conditions of the theorem, it approaches E[X_{T}] as n goes to infinity. First observe that, for any n, X_{T} = X_{min(T,n)} + [T>n](X_{T}-X_{n}), because either T≤n, and we just get X_{T}, or T>n, and we get X_{n}+(X_{T}-X_{n}) = X_{T}. Now take expectations: E[X_{T}] = E[X_{min(T,n)}] + E[[T>n]X_{T}] - E[[T>n]X_{n}]. Condition (3) of the theorem gives lim_{n→∞} E[[T>n]X_{n}] = 0. If we can show that the middle term also vanishes in the limit, we are done.

Here we use condition (2). Observe that E[[T>n]X_{T}] = ∑_{i=n+1}^{∞} E[[T=i]X_{i}]. Compare this with E[X_{T}] = ∑_{i=0}^{∞} E[[T=i]X_{i}]; by condition (2) this series converges absolutely, so the sum of the terms for i=0..n converges to E[X_{T}], and the sum of the remaining terms for i=n+1..∞ converges to zero. So the middle term goes to zero as n goes to infinity.

We thus have E[X_{T}] = lim_{n→∞} (E[X_{min(T,n)}] + E[[T>n]X_{T}] - E[[T>n]X_{n}]) = lim_{n→∞} E[X_{min(T,n)}] = E[X_{0}]. This completes the proof.

Using the full-blown optional stopping theorem is a pain in the neck, because conditions (2) and (3) are often hard to test directly. So in practice one generally chooses one of several weaker but easier variants:

- Optional Stopping Theorem (bounded time)
  If (X_{t},ℱ_{t}) is a martingale and T is a stopping time for {ℱ_{t}} with T≤n always, for some fixed n, then E[X_{T}] = E[X_{0}]. Proof: X_{T} = X_{min(T,n)}.
- Optional Stopping Theorem (bounded range)
  If (X_{t},ℱ_{t}) is a martingale and T is a stopping time for {ℱ_{t}} with Pr[T<∞] = 1 and |X_{t}| ≤ M always, for some fixed M and all t, then E[X_{T}] = E[X_{0}]. Proof: Use the full OST, with (1) given, (2) implied by |X_{t}| ≤ M, and (3) following from |E[X_{t}[T>t]]| ≤ E[|X_{t}|[T>t]] ≤ M Pr[T>t] → 0, since Pr[T<∞] = 1.
- Optional Stopping Theorem (bounded increments)
  If (X_{t},ℱ_{t}) is a martingale and T is a stopping time for {ℱ_{t}} where Pr[T<∞] = 1, E[T] < ∞, and |X_{t}-X_{t-1}| ≤ c always, for all t, then E[X_{T}] = E[X_{0}]. Proof: Here we go back to the original proof of the OST, but there is a simpler argument that E[(X_{T}-X_{n})[T > n]] converges to 0. The idea is that |E[(X_{T}-X_{n})[T>n]]| = |E[∑_{t≥n}(X_{t+1}-X_{t})[T>t]]| ≤ E[∑_{t≥n}|X_{t+1}-X_{t}|⋅[T>t]] ≤ E[∑_{t≥n}c[T>t]] = c∑_{t≥n}Pr[T>t]. Now use the fact that E[T] = ∑_{t≥0}Pr[T>t] < ∞ to conclude that this is again the tail of a convergent series, and thus converges to zero.

## 5.1. Applications

- Let X_{t} be a ±1 random walk that starts at 0 and stops when it reaches +a or -b, and let T be the time at which the process stops. We can easily show that Pr[T<∞] = 1 and E[T] < ∞ by observing that from any state of the random walk, there is a probability of at least 2^{-(a+b)} that it stops within a+b steps (by flipping heads a+b times in a row), so that if we consider a sequence of intervals of length a+b, the expected number of such intervals before we stop is at most 2^{a+b}. We have bounded increments by the definition of the process (bounded range also works). So E[X_{T}] = E[X_{0}] = 0, and the probability p of landing on +a instead of -b must satisfy pa - (1-p)b = 0, giving p = b/(a+b).
- Let X_{t} and T be as above, and let Y_{t} = X_{t}^{2}-t, which we previously showed to be a martingale. Again we have bounded increments (but not bounded range!), since X_{t}^{2} ≤ max(a,b)^{2} for all t ≤ T and t changes by exactly 1 at each step. So E[Y_{T}] = 0, which gives E[T] = E[X_{T}^{2}]. But we can calculate E[X_{T}^{2}]: it is a^{2}Pr[X_{T} = a] + b^{2}Pr[X_{T} = -b] = a^{2}(b/(a+b)) + b^{2}(a/(a+b)) = (a^{2}b+ab^{2})/(a+b) = ab.
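Both predictions, Pr[hit +a] = b/(a+b) and E[T] = ab, are easy to test by simulation. A sketch of our own, with a=3 and b=2, so the predicted values are 0.4 and 6:

```python
import random

random.seed(2)

# Gambler's ruin: a +/-1 walk from 0, absorbed at +a or -b. The optional
# stopping arguments above predict Pr[hit +a] = b/(a+b) and E[T] = ab.
def run(a, b):
    pos, steps = 0, 0
    while -b < pos < a:
        pos += random.choice((-1, 1))
        steps += 1
    return pos == a, steps

a, b, trials = 3, 2, 20_000
results = [run(a, b) for _ in range(trials)]
p_hat = sum(win for win, _ in results) / trials
t_hat = sum(steps for _, steps in results) / trials
print(round(p_hat, 2), round(t_hat, 1))  # close to b/(a+b) = 0.4 and ab = 6
```

Note that the simulation uses two different martingales at once: X_{t} itself gives the exit probability, and X_{t}^{2}-t gives the expected exit time.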

# 6. Submartingales and supermartingales

Sometimes we can't guarantee E[X_{t+1}|ℱ_{t}] = X_{t}, but we can nonetheless use X_{t} as a lower or upper bound on E[X_{t+1}|ℱ_{t}]. An adapted process (X_{t},ℱ_{t}) is a **submartingale** if X_{t} ≤ E[X_{t+1}|ℱ_{t}] for all t and a **supermartingale** if X_{t} ≥ E[X_{t+1}|ℱ_{t}]. Note that the quantity that is "sub" (below) or "super" (above) is always where we are now: so submartingales tend to go up over time while supermartingales tend to go down. If a process is both a submartingale and a supermartingale, it's a martingale.

Sub/super-martingales are handy when we can't set up an exact martingale, but we don't mind because we only care about one-sided bounds anyway. Most of the properties we have seen for martingales hold for sub/super-martingales if we replace equality with ≤ or ≥: for example, in a submartingale, E[X_{0}] ≤ E[X_{t}] for all t.

## 6.1. The Doob-Meyer decomposition

The easiest way to get these results in general is to write a submartingale (X_{t},ℱ_{t}) as X_{t} = Y_{t} + Z_{t}, where Y_{t} is a martingale and Z_{t} is a non-decreasing **predictable process** with respect to ℱ_{t}; this means that Z_{t+1} is always measurable ℱ_{t}. The existence of this decomposition (known as the *Doob decomposition* or sometimes the *Doob-Meyer decomposition* among people who get tired of naming everything after Doob) is easily shown: define Z_{0} = 0, Z_{t+1} = Z_{t} + E[X_{t+1}-X_{t}|ℱ_{t}] and Y_{t} = X_{t} - Z_{t}. Then X_{t} is trivially equal to Y_{t} + Z_{t}, and it's not hard to see that Z_{t} is both non-decreasing and predictable (the non-decreasing part is just the submartingale property). To show that Y_{t} is a martingale, compute E[Y_{t+1}|ℱ_{t}] = E[X_{t+1}-Z_{t+1}|ℱ_{t}] = E[X_{t+1}-E[X_{t+1}-X_{t}|ℱ_{t}]|ℱ_{t}] = E[X_{t+1}-X_{t+1}+X_{t}|ℱ_{t}] = X_{t}.
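As a concrete instance, for the submartingale X_{t} = S_{t}^{2}, where S_{t} is a ±1 random walk, the construction gives Z_{t+1} = Z_{t} + E[X_{t+1}-X_{t}|ℱ_{t}] = Z_{t} + 1, recovering the martingale Y_{t} = S_{t}^{2} - t from the examples above. A small sketch of our own tracing the construction step by step:

```python
import random

random.seed(3)

# Doob decomposition traced for the submartingale X_t = S_t^2, S_t a +/-1 walk.
# The predictable increment Z_{t+1} - Z_t = E[X_{t+1} - X_t | F_t]
#   = 0.5*((s+1)^2 - s^2) + 0.5*((s-1)^2 - s^2) = 1
# is computable before the step is taken, so Z_t = t and Y_t = X_t - Z_t = S_t^2 - t.
s, z = 0, 0
for t in range(1, 51):
    # predictable part: averaged over the next step, using only what F_t knows
    z += 0.5 * ((s + 1) ** 2 - s * s) + 0.5 * ((s - 1) ** 2 - s * s)
    s += random.choice((-1, 1))
    y = s * s - z
    assert z == t and y == s * s - t

print(int(z))  # 50
```

The key point the trace makes visible is that z is updated *before* the coin is flipped: the predictable process depends only on ℱ_{t}, never on the step it is about to pay for.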

For supermartingales, the same decomposition works, but now Z_{t} is non-increasing.

## 6.2. Azuma-Hoeffding inequality for sub- and super-martingales

We can use the existence of the Doob decomposition to prove results about a submartingale or supermartingale X_{t} even if we don't know how to compute it. For example, suppose (X_{t},ℱ_{t}) is a submartingale with bounded increments, that is, |X_{t}-X_{t-1}| ≤ c_{t} for fixed constants c_{t} and all t. Decompose X_{t} as Y_{t}+Z_{t} as above. From |X_{t}-X_{t-1}| ≤ c_{t} we have |(Y_{t}+Z_{t})-(Y_{t-1}+Z_{t-1})| ≤ c_{t}, or |(Y_{t}-Y_{t-1})+(Z_{t}-Z_{t-1})| ≤ c_{t}. We want to get a bound on |Y_{t}-Y_{t-1}|; first we get a bound on Z_{t}-Z_{t-1} by observing that E[Y_{t}-Y_{t-1}|ℱ_{t-1}] = 0 implies that Y_{t}-Y_{t-1} ≥ 0 with nonzero probability. Suppose that this event occurs; then everything inside the absolute value is non-negative, and we have (Y_{t}-Y_{t-1})+(Z_{t}-Z_{t-1}) ≤ c_{t}, which gives 0 ≤ Z_{t}-Z_{t-1} ≤ c_{t} (if we also take into account the fact that Z_{t} is non-decreasing; note that Z_{t}-Z_{t-1} is predictable, so it does not depend on whether the event occurs). But then Y_{t}-Y_{t-1} must lie between -2c_{t} and c_{t}, or we violate the constraints on X_{t}. So we can apply the Azuma-Hoeffding inequality to Y_{t} with the constants 2c_{t}, which gives Pr[Y_{t}-Y_{0} ≤ -x] ≤ exp(-x^{2}/8∑_{i≤t}c_{i}^{2}). But since X_{0} = Y_{0} and X_{t} ≥ Y_{t}, this gives the same bound for Pr[X_{t} - X_{0} ≤ -x].

A similar bound on Pr[X_{t} - X_{0} ≥ x] holds for supermartingales. Note that in each case only one side of the usual Azuma-Hoeffding bound holds. We can't say much about how fast a submartingale with bounded increments rises (or a supermartingale falls), because it could be that Z_{t} accounts for nearly all of X_{t}.

## 6.3. Sub- and super-martingales from functions of martingales

Suppose (X_{t},ℱ_{t}) is a martingale, and let f be a convex function. Then Jensen's inequality says that E[f(X_{t+1})|ℱ_{t}] ≥ f(E[X_{t+1}|ℱ_{t}]) = f(X_{t}). In other words, when f is convex, (f(X_{t}),ℱ_{t}) is a submartingale. Where convex functions turn martingales into submartingales, concave functions turn them into supermartingales. This can be very handy if you want to establish quickly that some sequence is a sub/super-martingale.

- Example
  Let f(x) = |x|. Then f is convex, so if (X_{t},ℱ_{t}) is a martingale, (|X_{t}|,ℱ_{t}) is a submartingale. This means, for example, that if we start a ±1 random walk at some random location X_{0}, then without knowing anything about the distribution of X_{0} we can nonetheless conclude that for any fixed t, E[|X_{t}|] ≥ E[|X_{0}|].

This fact is not as useful as one might think for getting bounds: given a martingale, it is almost always better to work with the martingale directly and then apply Jensen's inequality or f afterwards to get the desired bound.

## 6.4. Supermartingales and recurrences

When we solve a probabilistic recurrence, we get an upper bound on the work that remains in each state. If we have such a bound, we can get a supermartingale that may allow us to prove concentration bounds on the cost of our algorithm. Let C_{t} be the cost we've paid so far, and Y_{t} be an upper bound on the expected cost remaining, with ℱ_{t} the usual σ-algebra describing everything that has happened up to time t. Then we have Y_{t} ≥ E[C_{∞}-C_{t}|ℱ_{t}].

We can reverse this by finding a non-negative Y_{t} such that E[Y_{t}-Y_{t+1}|ℱ_{t}] ≥ E[C_{t+1}-C_{t}|ℱ_{t}], so that the change in Y_{t} pays for the cost of each step. If this holds, then for X_{t} = Y_{t} + C_{t} we have E[X_{t+1}-X_{t}|ℱ_{t}] = E[Y_{t+1}-Y_{t}+C_{t+1}-C_{t}|ℱ_{t}] ≤ 0, and (X_{t},ℱ_{t}) is a supermartingale. Now let T be the time at which the algorithm terminates, and suppose that one of the conditions for the Optional Stopping Theorem holds for T and (X_{t},ℱ_{t}) (typical cases are that there is a fixed upper bound on the total time for the algorithm or that there is a fixed upper bound on the cost; if neither holds, we may have to do a little bit of work). Then E[C_{T}] ≤ E[X_{T}] ≤ E[X_{0}] = E[Y_{0}] by the supermartingale property. But since we now have a supermartingale instead of just a recurrence, we may be able to get stronger bounds by using Azuma-Hoeffding.

## 6.5. Example: QuickSort

For example, let's take QuickSort. We previously calculated (see RandomizedAlgorithms) that 2 n ln n is a bound on the expected number of comparisons needed to sort an array of n elements. Let's try to turn this into a supermartingale. Recall that QuickSort takes an unsorted array and recursively partitions it into smaller unsorted arrays, terminating when each array has only one element. Let's imagine that we do these partitions in parallel; i.e., at each time step we'll partition every block we've got left with more than one element. Since each such step reduces the size of the largest unsorted block, we finish after at most n such steps.

We expect each block of size n_{i} to take no more than 2n_{i} ln n_{i} comparisons to complete. So we will use the sum ∑ 2 n_{i} ln n_{i} over all blocks remaining at time t as our estimate Y_{t} of the remaining time. If we can show that the change in Y_{t} pays for the number of comparisons at step t on average, then the supermartingale property will hold for C_{t}+Y_{t} and we'll get E[C_{∞}] ≤ Y_{0} = 2 n ln n.

Let's consider the partition of a single block of size n_{i}. The cost of the partition is n_{i}-1 comparisons. The expected change in Y_{t} contributed by this particular block is E[2 n_{i} ln n_{i} - (2 k ln k + 2 (n_{i}-k-1) ln (n_{i}-k-1))], where k, the size of the left sub-block, is uniform in the range 0..n_{i}-1 (we assume that all elements are distinct, so that the pivot ends up in a block by itself). By symmetry, the subtracted terms have expectation

E[2 k ln k + 2 (n_{i}-k-1) ln (n_{i}-k-1)] = (4/n_{i}) ∑_{k=0}^{n_{i}-1} k ln k ≤ (4/n_{i}) ∫_{0}^{n_{i}} x ln x dx = (4/n_{i})((n_{i}^{2}/2) ln n_{i} - n_{i}^{2}/4) = 2 n_{i} ln n_{i} - n_{i}.

The -n_{i} pays for the n_{i}-1 cost on average. We thus get that {C_{t} + ∑ 2 n_{i} ln n_{i}} is a supermartingale, and that the expected number of comparisons to sort n elements is at most 2 n ln n.
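As a sanity check on the 2 n ln n bound, here is a small simulation of our own that counts comparisons made by randomized QuickSort and compares the average against 2 n ln n:

```python
import math
import random

random.seed(4)

def quicksort_comparisons(arr):
    """Count comparisons made by randomized QuickSort (distinct elements)."""
    if len(arr) <= 1:
        return 0
    pivot = random.choice(arr)
    less = [v for v in arr if v < pivot]
    greater = [v for v in arr if v > pivot]
    # len(arr) - 1 comparisons against the pivot, then recurse on both sides
    return len(arr) - 1 + quicksort_comparisons(less) + quicksort_comparisons(greater)

n, trials = 200, 200
avg = sum(quicksort_comparisons(list(range(n))) for _ in range(trials)) / trials
print(avg <= 2 * n * math.log(n))  # True
```

The observed average sits noticeably below 2 n ln n (the exact expectation is 2(n+1)H_{n} - 4n ≈ 2 n ln n - 2.85n), which is consistent with the potential argument being an upper bound.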

We already knew this. Can we learn anything else?

### 6.5.1. A mediocre high-probability bound using Azuma-Hoeffding

Let's try to get a bound on how much X_{t} = C_{t}+Y_{t} changes at each step. We can compute C_{t+1}-C_{t} = ∑ (n_{i}-1) ≤ n, and Y_{t+1}-Y_{t} ≤ 0, so X_{t+1}-X_{t} ≤ n. In the other direction, when we split a block of size n_{i} into two blocks of size k and n_{i}-k-1, the smallest value we can get for 2 k ln k + 2 (n_{i}-k-1) ln (n_{i}-k-1) is when the split is exactly even (to prove this, use convexity of the potential function). In this case the potential drops from 2 n_{i} ln n_{i} to 4 ((n_{i}-1)/2) ln ((n_{i}-1)/2) = 2 (n_{i}-1) ln ((n_{i}-1)/2). Writing (n_{i}-1)/2 = (n_{i}/2)(1-1/n_{i}) and using ln(1-1/n_{i}) ≥ -ln 2 for n_{i} ≥ 2 gives 2 (n_{i}-1) ln ((n_{i}-1)/2) ≥ 2 (n_{i}-1)(ln n_{i} - 2 ln 2) = 2 n_{i} ln n_{i} - 2 ln n_{i} - 4(n_{i}-1) ln 2, so the drop is at most 2 ln n_{i} + 4(n_{i}-1) ln 2. Each element of a block of size n_{i} therefore contributes at most (2 ln n_{i})/n_{i} + 4 ln 2 ≤ 2/e + 4 ln 2 < 4 to the total drop, which is thus at most 4n (the looseness here is a sign that our bound is not going to be tight).

Now plug this into Azuma-Hoeffding (the one-sided version for supermartingales) to get Pr[C_{n} - 2 n ln n ≥ x] ≤ exp(-x^{2}/8n⋅(4n)^{2}) = exp(-x^{2}/128n^{3}), which starts getting small for x ≥ √(128n^{3}), or approximately x ≥ 12 n^{3/2}. This is a pretty weak bound, although it's much better than we get with just Markov's inequality.

(There are much better ways to do this.)