Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Contents

Markov chains
Stationary distributions
1. The ergodic theorem
  1. Proof
Bounding convergence rates
1. Coupling
2. Reversible chains
The Metropolis-Hastings algorithm and simulated annealing
1. Examples of simulated annealing with provable convergence times
  1. Single peak
  2. Single peak with very small amounts of noise
Extended examples of convergence bounds
1. k-colorings of a degree d graph
2. Another example: Independent sets

1. Markov chains

A Markov chain or Markov process is a stochastic process where the distribution of X_t+1 depends only on the value of X_t and not any previous history. (Formally, for all sets of states A, E[ [X_t+1∈A] | X₀...X_t ] = E[ [X_t+1∈A] | X_t ]; contrast with the martingale property.) The state space of the chain is just the set of all values that each X_t can have. A Markov chain is finite or countable if it has a finite or countable state space, respectively. We'll mostly be interested in countable Markov chains. We'll also assume that our Markov chains are homogeneous, which means that Pr[X_t+1 = j | X_t = i] doesn't depend on t.

For a countable Markov chain, we can describe its behavior completely by giving the state space and the one-step transition probabilities p_ij = Pr[X_t+1 = j | X_t = i]. Given p_ij, we can calculate two-step transition probabilities p⁽²⁾_ij = Pr[X_t+2 = j | X_t = i] = ∑_k Pr[X_t+2 = j | X_t+1 = k] Pr[X_t+1 = k | X_t = i] = ∑_k p_ik p_kj. This is identical to the formula for matrix multiplication, and so we can think of the transition probabilities as given by a transition matrix P, then p⁽²⁾_ij = (P²)_ij and in general the n-step transition probability p⁽ⁿ⁾_ij = (Pⁿ)_ij.

Conversely, given any matrix with non-negative entries where the rows sum to 1 (∑_j P_ij = 1, or P1 = 1), there is a corresponding Markov chain given by p_ij = P_ij. Such a matrix is called a stochastic matrix.

The general formula for (n+m)-step transition probabilities is that p^(n+m)_ij = ∑_k p⁽ⁿ⁾_ikp^(m)_kj. This is known as the Chapman-Kolmogorov equation and is equivalent to the matrix identity P^n+m = PⁿP^m.

A distribution over states of the Markov chain at some time t can be given by a row vector x, where x_i = Pr[X_t = i]. To compute the distribution at time t+1, we use the law of total probability: Pr[X_t+1 = j] = ∑_i Pr[X_t = i] Pr[X_t+1 = j | X_t = i] = ∑_i x_i p_ij. Again we have the formula for matrix multiplication (where we treat x as a 1×i matrix); so the distribution vector at time t+1 is just xP, and at time t+n is xPⁿ.

We like Markov chains for two reasons:

They describe what happens in a randomized algorithm; the state space is just the set of all states of the algorithm, and the Markov property holds because the algorithm can't remember anything that isn't part of its state. So if we want to analyze randomized algorithms, we will need to get good at analyzing Markov chains.
They can be used to do sampling over interesting distributions. Under appropriate conditions (see below), the state of a Markov chain converges to a stationary distribution. If we build the right Markov chain, we can control what this stationary distribution looks like, run the chain for a while, and get a sample close to the stationary distribution.

In both cases we want to have a bound on how long it takes the Markov chain to converge, either because it tells us when our algorithm terminates, or because it tells us how long to mix it up before looking at the current state.

1.1. Examples

A fair ±1 random walk. The state space is ℤ, the transition probabilities are p_ij = 1/2 if |i-j| = 1, 0 otherwise. This is an example of a Markov chain that is also a martingale.
A fair ±1 random walk on a cycle. As above, but now the state space is ℤ/m, the integers mod m. An example of a finite Markov chain.
Random walks with absorbing/reflecting barriers.
Random walk on a graph G = (V,E). The state space is V, the transition probalities are p_uv = 1/d(u) if uv ∈ E. (One can also have more general transition probabilities.)
The Markov chain given by X_t+1 = X_t+1 with probability 1/2, and 0 with probability 1/2. The state space is ℕ.
2-SAT algorithm. State is a truth-assignment. The transitional probabilities are messy but arise from the following process: pick an unsatisfied clause, pick one of its two variables uniformly at random, and invert it. Then there is an absorbing state at any satisfying assignment. (A similar process works for 2-colorability, 3-SAT, 3-colorability, etc.; note that for the NP-hard problems, it may take a while to reach an absorbing state.)

1.2. Classification of states

Given a Markov chain, define f_ij(n) = Pr[j ∉ {X₁...X_n-1} ∧ j = X_n | X₀ = i]; f_ij(n) is the probability that the first passage time from i to j equals n. Define f_ij = ∑_n∈ℤ^+^ f_ij(n); this is the probability that the chain ever reaches j starting from i. A state is persistent if f_ii = 1 and transient if f_ii < 1. Finally, define the mean recurrence time μ_i = ∑_n∈ℤ^+^ n f_i(n) if i is persistent and μ_i = ∞ if i is transient. We use the mean recurrence time to further classify states:

f_i	μ_i	classification
< 1	∞	transient
1	∞	null persistent
1	< ∞	non-null persistent

Examples: In a random walk with absorbing barriers at ±n, states -n and +n are (non-null) persistent, and the rest of the states are transient. In general, any state in a finite chain is either non-null persistent or transient, and there is at least one persistent state. In a random walk on ℤ, all states are null persistent. In the process on ℕ where X_t+1 = X_t+1 with probability 1, all states are transient.

The period d(i) of a state i is gcd({ n : p_i(n) > 0 }). A state i is aperiodic if d(i) = 1, and periodic otherwise. A Markov chain is aperiodic if all its states are aperiodic.

The most well-behaved states are the aperiodic non-null persistent states; these are called ergodic. A Markov chain is ergodic if all its states are.

1.3. Reachability

State i communicates with state j if p_ij(n) > 0 for some n; i.e., it is possible to reach j from i. This is often abbreviated as i→j. Two states i and j intercommunicate if i→j and j→i; this is similarly abbreviated as i↔j. If i↔j, it's not hard to show that i and j have the same period and classification. A set of states S is closed if p_ij = 0 for all i∈S, j∉S, and irreducible if i↔j for all i,j in S.

Using graph-theoretic terminology, i→j if j is reachable from i (through edges corresponding to transition with nonzero probabilities), and a set of states is closed if it is strongly connected. We can thus decompose a Markov chain into strongly-connected components, and observe that all states in a particular strongly-connected component are persistent if and only if the strongly-connected component has no outgoing edges (i.e. is closed). For finite chains there must be at least one closed strongly-connected component (corresponding to a sink in the quotient DAG); this gives an instant proof that finite chains have persistent states.

If the entire chain forms a single strongly-connected component, we say it is irreducible.

2. Stationary distributions

The key useful fact about irreducible Markov chains is that (at least in the finite case) they have stationary distributions, where a stationary distribution is a distribution π (non-negative row vector summing to 1) such that πP = π. The full theorem is:

Theorem: An irreducible Markov chain has a stationary distribution π if and only if all states are non-null persistent, and if π exists, then π_i = 1/μ_i for all i.

The proof of this theorem takes a while, so we won't do it (see GrimmetStirzaker if you want to see the details). For finite chains, the first part follows from the Perron-Frobenius theorem, which says that stochastic matrices always have an eigenvector corresponding to an eigenvalue of 1. The second part (π_i = 1/μ_i) is just a special case of the renewal theorem.

For non-irreducible Markov chains, there is a stationary distribution on each closed irreducible subset, and the stationary distributions for the chain as a whole are all convex combinations of these stationary distributions.

Examples: In the random walk on ℤ_m the stationary distribution satisfies π_i = 1/m for all i (immediate from symmetry). By contrast, the random walk on ℤ has no stationary distribution (the states are all null persistent). The process on ℕ where p_ij = 1/2 for j = i+1 or j = 0 has a stationary distribution π given by π_i = 2^-i-1; this is an example of an infinite chain that nonetheless has only non-null persistent states. For a random walk with absorbing barriers at ±n, there is no unique stationary distribution; instead, any vector π where π_i = 0 unless i = ±n and π1 = 1 is stationary.

Note that it's generally not a good strategy to compute π by computing μ first; instead, solve the matrix equation πP = π by rewriting it as π(P-I) = 0 and then compute μ_i = 1/p_i.

2.1. The ergodic theorem

The basic version of the ergodic theorem says that if an aperiodic Markov chain has a stationary distribution π (i.e., it's ergodic), then it converges to π if we run it long enough. This is not terribly hard to prove, using a very clever technique called coupling, where we take two copies of the chain, once of which starts in an arbitrary distribution, and one of which starts in the stationary distribution, and show that we can force them to converge to each other by carefully correlating their transitions. Since coupling is a generally useful technique for bounding rate of convergence, we'll give the proof of the ergodic theorem below.

A more general version of the ergodic theorem says that for any Markov chain and any aperiodic state j, then p_jj(n) → 1/μ_j and p_ij(n) → f_ij/μ_j. This allows us in principle to compute the limit distribution for any aperiodic Markov chain with a given starting distribution x by summing ∑_i x_if_ij/μ_j. It also implies that transient and non-null persistent states vanish in the limit.

2.1.1. Proof

We'll do this by constructing a coupling between two copies of the Markov chain. In the first process {X_t}, we start with an arbitrary initial distribution for X₀. In the second {Y_t}, we start with Pr[Y_t = i] = π_i for all i. We now evolve the joint process {(X_t,Y_t)} by the rule (i,i')→(j,j') with probability p_ijp_i'j' if i≠i' and (i,i')→(j,j') with probability p_ij if i=i' and j=j' and 0 if i=i' and j≠j'. In other words, we let X_t and Y_t both wander around independently until they collide, after which they stick together and wander around together.

The reason this shows convergence is that we can write

Pr[X_t=j] = Pr[X_t=Y_t] Pr[Y_t=j|X_t=Y_t] + Pr[X_t≠Y_t] Pr[X_t=j|X_t≠Y_t]

and similarly

Pr[Y_t=j] = Pr[X_t=Y_t] Pr[Y_t=j|X_t=Y_t] + Pr[X_t≠Y_t] Pr[Y_t=j|X_t≠Y_t].

Suppose now that Pr[X_t=Y_t] → 1. Then the first equation gives Pr[X_t=j] - Pr[Y_t=j|X_t=Y_t] → 0 and the second gives Pr[Y_t=j] - Pr[Y_t=j|X_t=Y_t] → 0. Since both Pr[X_t=j] and Pr[Y_t=j] converge to the same value, they converge to each other, giving Pr[X_t=j] → Pr[Y_t=j] = π_j.

So now we just need to show Pr[X_t=Y_t] → 1. Without loss of generality, suppose that X₀ = j, where j is a non-null persistent aperiodic state. With probability π_j, X₀=Y₀ and we are done. If not, let i = Y₀ pick some future time n for which p⁽ⁿ⁾_ij > 0. If we are very luck, p⁽ⁿ⁾_j is also > 0, and so we get a nonzero probability of collision. If we are not so lucky, let m be some value such that p^(n+m)_j > 0 and p^(m)_j > 0 (if no such value exists, then j is periodic). Then we have Pr[X_t=Y_t=j] ≥ Pr[X_t=j] Pr[Y_t=j] = p⁽ⁿ⁾_ijp^(m)_jjp^(n+m)_jj > 0. So there is a finite time n+m after which we get some nonzero probability that X_t and Y_t collide. If we do not collide, repeat the argument starting with i = Y_n+m. If we let k be the maximum value of n+m taken over all initial states i, and p > 0 be the minimum probability of collision over all initial states i after k steps, then we have that Pr[X_αk≠Y_αk] ≤ (1-p)^α → 0.

3. Bounding convergence rates

In order to use the stationary distribution of a Markov chain to do sampling, we need to have a bound on the rate of convergence to tell us when it is safe to take a sample. There are two standard techniques for doing this: coupling, where we show that a copy of the process starting in an arbitrary state can be made to converge to a copy starting in the stationary distribution; and spectral methods, where we bound the rate of convergence by looking at the second-largest eigenvalue of the transition matrix. We'll start with coupling because it requires less development.

(See also this survey of the relationship between the various methods.)

3.1. Coupling

Note: these notes will be somewhat sketchy. If you want to read more about coupling, a good place to start might be Chapter 11 of MitzenmacherUpfal, this chapter of the unpublished but nonetheless famous Aldous-Fill manuscript (which is a good place to learn about Markov chains and Markov chain Monte Carlo methods in general¹, or even an entire book: Lindvall, Torgny, Lectures on the Coupling Method, Wiley, 1992. We'll mostly be using examples from the Aldous-Fill text.

3.1.1. The basic coupling lemma

The same sort of coupling trick that proves eventual convergence to π can in some cases be used to prove fast convergence. The idea is that we can bound the total variation distance d_TV(p, π) = max_A |Pr_p[A]-Pr_π[A]| = ∑_i min(p_i-π_i, 0) = (1/2) ∑_i |p_i-π_i| by bounding Pr[X_t≠Y_t], where X_t is the copy of the process that starts in some arbitrary distribution and Y_t is the copy that starts in π. Recall that, for all j,

Pr[X_t=j] = Pr[X_t=Y_t] Pr[Y_t=j|X_t=Y_t] + Pr[X_t≠Y_t] Pr[X_t=j|X_t≠Y_t]

and

Pr[Y_t=j] = Pr[X_t=Y_t] Pr[Y_t=j|X_t=Y_t] + Pr[X_t≠Y_t] Pr[Y_t=j|X_t≠Y_t].

Subtract to get

Pr[X_t=j] - Pr[Y_t=j] = Pr[X_t≠Y_t] (Pr[X_t=j|X_t≠Y_t] - Pr[Y_t=j|X_t≠Y,,t]).

Now observe that the parenthesized term is somewhere in the range -1 to 1, giving

|Pr[X_t=j] - Pr[Y_t=j]| ≤ Pr[X_t≠Y_t].

But in fact we can do much better than this. If we sum the second-to-last equation over all j, we get

∑_j |Pr[X_t=j] - Pr[Y_t=j]|
= Pr[X_t≠Y_t] ∑_j |Pr[X_t=j|X_t≠Y_t]-Pr[Y_t=j|X_t≠Y_t]|
≤ Pr[X_t≠Y_t] ∑_j (Pr[X_t=j|X_t≠Y_t]+Pr[Y_t=j|X_t≠Y_t])
≤ 2 Pr[X_t≠Y_t].

So the total variation distance between the distributions of X_t and Y_t is bounded by half of this, or Pr[X_t≠Y_t]. Armed with this fact, we can now go hunting for couplings that make X_t=Y_t with high probability for large enough t.

3.1.2. Random walk on a cycle

Let's suppose we do a random walk on ℤ_m, where to avoid periodicity at each step we stay put with probability 1/2, move counterclockwise with probability 1/4, and move clockwise with probability 1/4. To make our lives slightly easier, we'll assume m is even. What's a good choice for a coupling to show this process converges quickly?

We'll assume we choose X's move first and then try to figure out Y's move in response. If X_t=Y_t, we'll just have Y move with X (the usual outcome). If X_t-Y_t is even, we'll have Y_t+1-Y_t = -(X_t+1-X_t); X and Y move in opposite directions. If X_t-Y_t is odd, we'll have Y move if and only if X doesn't; this makes the distance between them (in both directions, because m is even) even after one step, and the other transitions will keep the distance between them even. In each case the movement of Y follows the distribution given by the underlying Markov chain; we just happen to correlate it in a somewhat convoluted way with the movement of X.

Now let us ask how long it takes on average until X_t and Y_t collide. We have at most one step to make the distance between them even. After that point, the distance increases or decreases by exactly 2 with probability 1/4 each, and stays the same with probability 1/2. If we consider only the steps where the distance changes, we have a random ±2 walk with absorbing barriers at 0 and m, which is equivalent to a random ±1 walk with absorbing barriers at 0 and m/2; the worst-case starting position for this second walk is exactly in the middle, giving an expected time to reach a barrier of (m/4)² = m²/16. But we have to multiply this by 2 to account for the fact that we only take a step every other time unit on average, and add one for the possible initial parity-fixing step, which gives a bound of m²/8 + 1 on the expected time to collide. Using Markov's inequality, after m²/4 + 2 steps we have Pr[X_t≠Y_t] ≤ 1/2, and by iterating this argument after αm²/4 steps we will have Pr[X_t≠Y_t] ≤ 2^-α. This gives a mixing time to reach d_TV ≤ of at most (m²/4 + 2) lg(1/ε), which means that the process is rapid mixing: the mixing time is polynomial in the size of the underlying process and log(1/ε).

It's worth noting that this example was very carefully rigged to make the coupling argument clean. It still works (perhaps with a slight change in the bound) if m is not even or Pr[X_t+1=X_t] ≠ Pr[X_t+1≠X_t], but the details are messier.

3.1.3. Random walk on a hypercube

Start with a bit-vector of length n. At each step, choose an index uniformly at random, and set the value of the bit-vector at that index to 0 or 1 with equal probability. How long until we get a nearly-uniform distribution over all 2ⁿ possible bit-vectors?

Here we apply the same transformation to both the X and Y vectors. It's easy to see that the two vectors will be equal once every index has been selected once. The waiting time for this to occur is just the waiting time nH_n for the coupon collector problem. We can either use this expected time directly to show that the process mixes in time O(n log n log (1/ε)) as above, or we can use known sharp concentration bounds on coupon collector (see for example MotwaniRaghavan §3.6.3, which shows lim_n→∞ Pr[T ≥ n(ln n + c)] = 1-exp(-exp(-c))) to show that in the limit n ln n + n ln ln (1/(1-ε)) = n ln n + O(n log(1/ε)) is enough.²

We can improve the bound slightly by observing that, on average, half the bits in X₀ and Y₀ are already equal; doing this right involves summing over a lot of cases, so we won't do it.

3.1.4. Various shuffling algorithms

Here we have a deck of n cards, and we repeatedly apply some random transformation to the deck to converge to a stationary distribution that is uniform over all permutations of the cards (usually this is obvious by symmetry, so we won't bother proving it). Our goal is to show that the expected coupling time at which our deck ends up in the same permutation as an initially-stationary deck is small. We do this by counting how many cards k are in the same position in both decks, and showing that, for a suitable coupling, (a) this quantity never decreases, and (b) it increases with some nonzero probability at each step. The expected coupling time is then ∑_k 1/Pr[k+1|k].

3.1.4.1. Move-to-top

This is a variant of card shuffling that is interesting mostly because it gives about the easiest possible coupling argument. At each step, we choose one of the cards uniformly at random (including the top card) and move it to the top of the deck. How long until the deck is fully shuffled, i.e. until the total variation distance between the actual distribution and the stationary distribution is bounded by ε?

Here the trick is that when we choose a card to move to the top in the X process, we choose the same card in the Y process. It's not hard to see that this links the two cards together so that they are always in the same position in the deck in all future states. So to keep track of how well the coupling is working, we just keep track of how many cards are linked in this way, and observe that as soon as n-1 are, the two decks are identical.

Note: Unlike some of the examples below, we don't consider two cards to be linked just because they are in the same position. We are only considering cards that have gone through the top position in the deck (which corresponds to some initial segment of the deck, viewed from above). The reason is that these cards never become unlinked: if we pick two cards from the initial segment, the cards above them move down together. But deeper cards that happen to match might become separated if we pull a card from one deck that is above the matched pair while its counterpart in the other deck is below the matched pair.

Having carefully processed the above note, given k linked cards the probability that the next step links another pair of cards is exactly (n-k)/n. So the expected time until we get k+1 cards is n/(n-k), and if we sum these waiting times for k=0..n-1, we get nH_n, the waiting time for the coupon collector problem. So the bound on the mixing time is the same as for the random walk on a hypercube.

3.1.4.2. Random exchange of arbitrary cards

Here we pick two cards uniformly and independently at random and swap them. (Note there is a 1/n chance they are the same card; if we exclude this case, the Markov chain has period 2.) To get a coupling, we reformulate this process as picking a random card and a random location, and swapping the chosen card with whatever is in the chosen location in both the X and Y processes.

First let's observe that the number of linked cards never decreases. Let x_i, y_i be the position of card i in each process, and suppose x_i=y_i. If neither card i nor position x_i is picked, i doesn't move, and so it stays linked. If card i is picked, then both copies are moved to the same location; it stays linked. If position x_i is picked, then it may be that i becomes unlinked; but this only happens if the card j that is picked has x_j≠y_j. In this case j becomes linked, and the number of linked cards doesn't drop.

Now we need to know how likely it is that we go from k to k+1 linked cards. We've already seen a case where the number of linked cards increases; we pick two cards that aren't linked and a location that contains cards that aren't linked. The probability of doing this is ((n-k)/n)², so our total expected waiting time is n² ∑ (n-k)^-2 = n² ∑ k^-2 ≤ nπ²/6 (see Basel problem for an extensive discussion of the useful but non-obvious fact that ∑_k∈ℕ+ k^-2 = ζ(2) = π²/6.) The final bound is O(n² log(1/ε)).

This bound is much worse that the bound for move-to-top, which is surprising. In fact, the real bound is O(n log n) with high probability, although the proof uses very different methods (see Chapter 7 of Aldous-Fill). This shows that the coupling method doesn't always give tight bounds (perhaps we need a better coupling?).

3.1.4.3. Random exchange of adjacent cards

Suppose now that we only swap adjacent cards. Specifically, we choose one of the n positions i in the deck uniformly at random, and then swap the cards a positions i and i+1 (mod n) with probability 1/2. (The 1/2 is there for the usual reason of avoiding periodicity.)

So now we want a coupling between the X and Y processes where each possible swap occurs with probability 1/(2n) on both sides, but somehow we correlate things so that like cards are pushed together but never pulled apart. The trick is that we will use the same position i on both sides, but be sneaky about when we swap.

The coupled process works like this. Let D be the set of indices i where the same card appears in both decks at position i or at position i+1. Then we do:

For i∈D, swap (i,i+1) in both decks with probability 1/2n.
For i∉D, swap (i,i+1) in the X deck only with probability 1/2n.
For i∉D, swap (i,i+1) in the Y deck only with probability 1/2n.
Do nothing with probability |D|/2n.

It's worth checking that the total probability of all these events is |D|/2n + 2(n-|D|)/2n + |D|/2n = 1. More important is that if we consider only one of the decks, the probability of doing a swap at (i,i+1) is exactly 1/2n (since we catch either case 1 or 2 for the X deck or 1 or 3 for the Y deck).

Now suppose that some card c is at position x in X and y in Y. If x=y, then both x and x-1 are in D, so the only way the card can move is if it moves in both decks: linked cards stay linked. If x≠y, then c moves in deck X or deck Y, but not both. (The only way it can move in both is in case 1, where i=x and i+1=y or vice verse; but in this case i can't be in D since the copy of c at position x doesn't match whatever is in deck Y, and the copy at position y doesn't match what's in deck X.) In this case the distance x-y goes up or down by 1 with equal probability 1/2n. Considering x-y (mod m), we have a "lazy" random walk that moves with probability 1/n, with absorbing barriers at 0 and n. The worst-case expected time to converge is n(n/2)² = n³/4, giving Pr[time for c to become linked ≥ αn³/8] ≤ 2^-α using the usual argument. Now apply the union bound to get Pr[time for every c to become linked ≥ αn³/8] ≤ n2^-α to get an expected coupling time of O(n³ log n). In this case (sayeth Aldous-Fill, quoting a then-unpublished result of David Bruce Wilson that eventually appeared in this very nice paper) the bound is optimal up to a constant factor.

3.1.4.4. Real-world shuffling

In real life, the handful of people who still use physical playing cards tend to use a dovetail shuffle, which is closely approximated by the reverse of a process where each card in a deck is independently assigned to a left or right pile and the left pile is place on top of the right pile. Coupling doesn't really help much here; instead, the process can be analyzed using more sophisticated techniques due to Bayer and Diaconis; the short version of the result is that Θ(log n) shuffles are needed to randomize a deck of size n.

3.2. Reversible chains

(See also Aldous-Fill chapter 3, from which many of the details in the notes below are taken.)

Some chains have the property of being reversible. Formally, a chain with transition probabilities p_ij is reversible if there is a distribution π such that π_ip_ij = π_jp_ji (the detailed balance equations—this says that in the stationary distribution, the probability of seeing a transition from i to j is equal to the probability of see a transition from j to i). If this is the case, then ∑_i π_ip_ij = ∑_i π_jp_ji = π_j, which means that π is stationary.

This often gives a very quick way to compute the stationary distribution, since if we know π_i, and p_ij≠0, then π_j = π_ip_ij/p_ji.

The reason that such a chain is called reversible is that if we start in the stationary distribution at time 0, then the sequence of random variables (X₀,...,X_t) has exactly the same distribution as the reversed sequence (X_t,...,X₀). Proof: Start with Pr[∀i X_i=x_i] = $\pi_{x_0} \prod_{i=0}^{t} p_{x_i x_{i+1}}$ . Now observe that we can move the $\pi_{x_0}$ across the first factor to get $p_{x_1 x_0} \pi_{x_1} \prod_{i=1}^{t} p_{x_i x_{i+1}}$ and in general $\prod{i=0}^{j-1} p_{x_{i+1} x_{i}} \pi_{x_j} \prod_{i=j}^{t} p_{x_i x_{i+1}}$ . At j=t we get $\pi_{x_t} \prod_{i=0}^{t-1} p_{x_{i+1} x_i}$ = Pr[∀i X_i = x_t-i].

Note that reversible chains can't have a period higher than 2, since we can always step back to where we came from.

Some examples of reversible chains:

Random walk on a graph: Given two adjacent vertices u and v, we have π_v = π_ud(v)/d(u). This suggests π_v = c d(v) for some constant c and all v (the stationary distribution is proportional to the degree), and the constant c is determined by ∑_v π_v = c ∑ d(v) = 1, giving π_v = d(v)/∑_u d(u).
Random walk on a weighted graph: Here each edge has a weight w_uv where 0 < w_uv = w_vu < ∞, with self-loops permitted. A step of the random walk goes from u to v with probability w_uv/∑_v' w_uv'. It is easy to show that this random walk has stationary distribution π_u = ∑_u w_uv / ∑_u∑_v w_uv, generalizing the previous case, and that the resulting Markov chain satisfies the detailed balance equations.

3.2.1. Time-reversed chains

Given any Markov chain with transition matrix P and stationary distribution π, define the time-reversed chain with matrix P^* where π_ip_ij = π_jp^*_ji. (Check: ∑_i p^*_ji = ∑_i p_ijπ_i/π_j = π_j/π_j = 1, so it's a Markov chain, and the fact P^*'s stationary distribution is the same as P's, and that in general P^*'s paths starting from the stationary distribution are a reverse of P's paths starting from the same distribution follows from an argument similar to that for reversible chains.) This gives an alternate definition of a reversible chain as a chain for which P = P^*.

Examples:

Given a biased random walk on a cycle that moves right with probability p and left with probability q, its time-reversal is the walk that moves left with probability p and right with probability q. (Here the fact that the stationary distribution is uniform makes things simple.)
Given the random walk defined by X_t+1 = X_t+1 with probability 1/2 and 0 with probability 1/2, we have π_i = 2^-i-1. This is not reversible (there is a transition from 1 to 2 but none from 2 to 1), but we can reverse it by setting p^*_ij = 1 for i = j+1 and p^*_0i = 2^-i-1. (Check: π_ip_{i i+1} = 2^-i-1(1/2) = π_i+1p^*_{i+1 i} = 2^-i-2(1); π_ip_i0 = 2^-i-1(1/2) = π₀p^*_0i = (1/2) 2^-i-1.)

Reversed versions of chains with messier stationary distributions are messier.

We can use time-reversal to generate reversible chains from arbitrary chains. The chain with transition matrix (P+P^*)/2 (corresponding to moving 1 step forward or back with equal probability at each step) is always a reversible chain.

3.2.2. Spectral properties of a reversible chain

(See Aldous-Fill chapter 3, section 4, or these lecture notes from Ravi Montenegro.)

Suppose P is irreducible and reversible, i.e. that π_i≠0 for all i and π_ip_ij = π_jp_ji for all i,j. Divide both sides by √(π_iπ_j) to get (π_i)^1/2 p_ij (π_j)^-1/2 = (π_j)^1/2 p_ji (π_i)^-1/2. This shows that the matrix S with entries S_ij = (π_i)^1/2 p_ij (π_j)^-1/2 is symmetric.

Why we care about this: Any symmetric real-valued matrix has can, by the spectral theorem, be decomposed as S = UΛU' where Λ is a diagonal matrix whose entries are eigenvalues of S and the columns of U are orthonormal eigenvectors. If you didn't take the advice to take a linear algebra course as part of your undergraduate Computer Science major, an eigenvector of a matrix is a vector v such that vS = λv, where λ is the corresponding eigenvalue—in other words, it's a vector that changes in length but not direction when we run it through S. The fact that the columns of U are orthonormal (orthogonal to each other, and having length 1), means that they act like a set of coordinate axes and that vU gives the coordinates of v in this coordinate system, while (vU)U' undoes the transformation by multiplying each coordinate by the corresponding row vector, since UU' equals the identity matrix I.

In more detail, we can write an arbitrary row vector v = c₁u₁ + c₂u₂ + ... c_nu_n, where the vectors u_i correspond to the rows of U' and the coefficients c_i are unique. If we multiply v by S on the right, we get vUΛU' = ∑ λ_ic_iu_i, and if we multiply it by S^t, we get v(U∆U')^t = vUΛ^tU' = ∑ λ_i^tc_iu_i. So repeatedly multiplying by S corresponds to shrinking v along the eigenvectors with eigenvalues whose absolute value is less than 1, while leaving behind the component(s) corresponding to eigenvalues 1 or greater in absolute value.

What if we multiply v by P^t? Let Π be the diagonal matrix given by Π_ii = (π_i)^1/2; then we can write S = ΠPΠ^-1 and thus P = Π^-1SΠ. So P^t = (Π^-1SΠ)^t = Π^-1S^tΠ = Π^-1UΛ^tU'Π. So we can compute vP^t by dividing each component of v by the square root of the stationary distribution, shrinking along all the eigenvectors, and scaling back. If we expand out all the matrix products to get a specific (vP^t)_ij, we obtain the spectral representation formula:

$\begin{align*} \Pr[X_t = j | X_0 = i] &= \sqrt{\frac{\pi_j}{\pi_i}} \sum_{m=1}^{n} \lambda_m^t u_{mi} u_{mj}. \end{align*}$

Assuming that P is irreducible and aperiodic, we have a unique limit to this process. This means that S has exactly one eigenvalue 1 (corresponding to the eigenvector whose entries are (π_i)^1/2), and all the remaining eigenvalues λ satisfy λ^t → 0 as t → ∞. We can write the eigenvalues of S (which turn out to also be the eigenvalues of P, although the eigenvectors are different) as 1 = λ₁ ≥ λ₂ ≥ ... ≥ λ_n > -1.³ Assuming that |λ₂| ≥ |λ_n|, as t grows large λ₂^t will dominate the other smaller eigenvalues, and so the size of λ₂ will control the rate of convergence of the underlying Markov process.

In detail, we have that the first eigenvector u₁ satisfies u_1i = √π_i; so the first term in the spectral representation formula for Pr[X_t = j | X₀ = i] is √(π_j/π_i) 1^t √(π_iπ_j) = π_j. Each subsequent term is √(π_j/π_i) (λ_m)^t u_mi u_mj ≤ √(π_j/π_i) (λ_m)^t, giving a bound of (n-1) √(π_j/π_i) (λ_max)^t where λ_max = max(λ₂,|λ_n|) on the total (in the unlikely event that all the eigenvalues except λ₁ are close to λ_max) and O(√(π_i/π_j) (λ_max)^t) in the typical case. In either case, since λ_max < 1, we get an exponentially-decreasing bound.

It is often convenient to express this bound as exp(-t/c) for an appropriate constant. Here we observe that λ_max = 1-(1-λ_max) ≤ exp(-(1-λ_max)) gives (λ_max)^t ≤ exp(-(1-λ_max)t). It is customary to describe this fact by giving the mixing rate or relaxation time τ₂ = 1/(1-λ_max), this being the time for which this bound drops by a factor of 1/e.

So far we have just considered the probability that we land in a particular state. To bound total variation distance, we (following the Montenegro notes mentioned above) go through the L_p norm. Given a vector v, define ||v||_p,π = (∑_i |v_i|^p π_i)^1/p. (This is a weighted version of the usual L_p norm.) Then d_TV(p,π) = (1/2) ||π-p||₁ = (1/2) ||1-p/π||_1,π. A general property of L_p norms is that if we increase p, we also increase ||v||_p,π, so we can bound d_TV(p,π) by (1/2) ||1-p/π||_2,π, which is a little easier to compute. Omitting the details of the proof (see here), the result is that, starting from an initial vector x that puts all of its weight on state i,

$\begin{align*} d_{TV}(xP^t,\pi) &\le \frac{1}{2} \lambda_{\max}^t \sqrt{\frac{1-\pi_i}{\pi_i}}. \end{align*}$

So now we just need a tool for bounding λ_max.

3.2.3. Conductance

Definition:

$\begin{align*} \Phi(S) &= \frac{\sum_{i\in S, j \not\in S} \pi_i p_{ij}}{\pi(S)}. \end{align*}$

Basically probability of leaving S on the next step starting from the stationary distribution conditioned on being in S. A measure of how easy it is to escape from a set. Can also be thought of as a weighted version of edge expansion.

$\begin{align*} \Phi &= \min_{0 < \pi(S) \le 1/2} \Phi(S). \end{align*}$

As above, but now we take the minimum over all S with probability ≤ 1/2. The usefulness of conductance is that it bounds λ₂:

Theorem: In a reversible Markov chain, 1 - 2Φ ≤ λ₂ ≤ 1 - Φ²/2.

This result was proved in Alexander Sinclair's Ph.D. dissertation. For lazy walks we always have λ₂ = λ_max, and so we can convert this to a bound on the relaxation time:

Corollary: 1/2Φ ≤ τ₂ ≤ 2/Φ².

In other words the high conductance implies low relaxation time and vice versa, up to squaring.

3.2.3.1. Computing conductance directly

For very simple Markov chains we can compute the conductance directly. Consider a lazy random walk on an odd cycle. Any proper subset S has at least two outgoing edges, each of which carries a flow of 1/4n, giving Φ_S ≥ (1/4n)/π(S). If we now take the minimum of Φ_S over all S with π(S) ≤ 1/2, we get Φ ≥ 1/n, which gives τ₂ ≤ 2n². This is essentially the same bound as we got from coupling.

Here's a slightly more complicated chain. Take two copies of K_n, and join them by a path with n edges. Now consider Φ_S for a lazy random walk on this graph where S consists of half the graph, split in the middle of the path (we can't actually do this exactly, but we'll get as close as we can). There is a single outgoing edge uv, with π(u) = d(u)/2|E| = 2/(2n(n-1)n/2 + n) = 2n^-2 and p_uv = 1/4, for π(u) p_uv = n^-2/2. By symmetry, we have π(S) → 1/2 as n → ∞, giving Φ_S → n^-2(1/2)/(1/2) = n^-2. So we have n²/2 ≤ τ₂ ≤ 2n⁴.2n^-2.

How does this compare to the actual mixing time? In the stationary distribution, we have a constant probability of being in each of the copies of K_n. Suppose we start in the left copy. At each step there is a 1/n chance that we are sitting on the path endpoint. We then step onto the path with probability 1/n, and reach the other end before coming back with probability 1/n. So (assuming we can make this extremely sloppy handwaving argument rigorous) it takes at least n³ steps on average before we reach the other copy of K_n, which gives us a rough estimate of the mixing time of Θ(n³). In this case the exponent is exactly in the middle of the bounds derived from conductance.

3.2.3.2. Edge expansion using canonical paths

(Here and below we are mostly following the presentation in guruswami-survey.pdf, except with slightly different examples and substantially more errors.)

For more complicated Markov chains, it is helpful to have a tool for bounding conductance that doesn't depend on intuiting what sets have the smallest boundary. The canonical paths method does this by assigning a unique path γ_xy from each state x to each state y in a way that doesn't send too many paths across any one edge. So if we have a partition of the state space into sets S and T, then there are |S|⋅|T| paths from states in S to states in T, and since (a) every one of these paths crosses an S-T edge, and (b) each S-T edge carries at most ρ paths, there must be at least |S|⋅|T|/ρ S-T edges.

Let's start with a small example. Let G = C_n□C_m, the n×m torus. A lazy random walk on this graph moves north, east, south, or west with probability 1/4 each, wrapping around when the coordinates reach n or m. Since this is a random walk on a regular graph, the stationary distribution is uniform. What is the relaxation time?

Intuitively, we expect it to be O(max(n,m)²), because we can think of this two-dimensional random walk as consisting of two one-dimensional random walks, one in the horizontal direction and one in the vertical direction, and we know that a random walk on a cycle mixes in O(n²) time. Unfortunately, the two random walks are not independent: every time I take a horizontal step is a time I am not taking a vertical step. We can show that the expected coupling time is O(n²+m²) by running two sequential instances of the coupling argument for the cycle, where we first link the two copies in the horizontal direction and then in the vertical direction. So this gives us one bound on the mixing time. But what happens if we try to use conductance?

Here it is probably easiest to start with just a cycle. Given points x and y on C_n, let the canonical path γ_xy be a shortest path between them, breaking ties so that half the paths go one way and half the other. Then each each is crossed by exactly k paths of length k for each k=1..n/2-1, and at most n/4 paths of length n/2 (0 if n is odd), giving a total of ρ ≤ (n/2-1)(n/2)/2 + n/4 = n²/8 paths across the edge.

If we now take an S-T partition where |S| = m, we get at least m(n-m)/ρ = 8m(n-m)/n² S-T edges. This peaks at m=n/2, where we get 2 edges—exactly the right number—and in general when m≤n/2 we get at least 8m(n/2)/n² = 4m/n outgoing edges, giving a conductance Φ_S ≥ (1/4n)(4m/n)/(m/n) = 1/n. (This is essentially what we got before, except we have to divide by 2 because we are doing a lazy walk. Note that for small m, the bound is a gross underestimate, since we know that every nonempty proper subset has at least 2 outgoing edges.)

Now let's go back to the torus C_n□C_m. Given x and y, define γ_xy to be the L-shaped path that first changes x₁ to y₁ by moving the shortest distance vertically, then changes x₂ to y₂ by moving the shortest distance horizontally. For a vertical edge, the number of such paths that cross it is bounded by n²m/8, since we get at most n²/8 possible vertical path segments and for each such vertical path segment there are m possible horizontal destinations to attach to it. For a horizontal edge, n and m are reversed, giving m²n/8 paths. To make our life easier, let's assume n≥m, giving a maximum of ρ = n²m/8 paths.

For |S| ≤ nm/2, this gives at least |S|⋅|G-S|/ρ ≥ |S|(nm/2)/(n²m/8) = 4|S|/n outgoing edges. We thus have φ_S ≥ (1/8nm)(4|S|/n)/(|S|/nm) = 1/2n. This gives τ₂ ≤ 2/Φ² ≤ 8n², pretty much what we expect.

3.2.3.3. Congestion and canonical paths

For less symmetrical chains, we weight paths by the probabilities of their endpoints when counting how many cross each edge, and treat the flow across the edge as a capacity. This gives the congestion of a collection of canonical paths Γ = {γ_xy}, which is computed as

$\begin{align*} \rho(\Gamma) &= \max_{uv \in E} \frac{1}{\pi(u)p_{uv}} \sum_{\gamma_{xy} \ni uv} \pi(x) \pi(y), \end{align*}$

and we define ρ = min_Γ ρ(Γ).

Now we can show that

$\begin{align*} \Phi \ge \frac{1}{2 \rho}, \end{align*}$

from which it follows that τ₂ ≤ 2/φ² ≤ 8ρ².

Proof: Pick some set of canonical paths Γ with ρ(Γ) = ρ. Now pick some S-T partition with Φ(S) = Φ. Consider a flow where we route π(x)π(y) units of flow along each path γ_xy with x∈S and y∈T. This gives a total flow of π(S)π(T) ≥ π(S)/2. We are going to show that we need a lot of capacity across the S-T cut to carry this flow, which will give the lower bound on conductance.

For any S-T edge uv, we have

$\begin{align*} \frac{1}{\pi(u)p_{uv}} \sum_{\gamma_{xy} \ni uv} \pi(x) \pi(y) &\le \rho \end{align*}$

$\begin{align*} \sum_{\gamma_{xy} \ni uv} \pi(x) \pi(y) &\le \rho \pi(u) p_{uv}. \end{align*}$

Since each S-T path crosses at least one S-T edge, we have

$\begin{align*} \pi(S) \pi(T) &= \sum_{x \in S, y \in T} \pi(x) \pi(y) \\ &\le \sum_{u \in S, t \in V, uv \in E} \sum_{\gamma_{xy} \ni uv} \pi(x) \pi(y) \\ &\le \sum_{u \in S, t \in V, uv \in E} \rho \pi(u) p_{uv}. \\ &= \rho \sum_{u \in S, t \in V, uv \in E} \pi(u) p_{uv}. \end{align*}$

But then

$\begin{align*} \Phi(S) &= \frac{\sum_{u \in S, t \in V, uv \in E} \pi(u) p_{uv}}{\pi(S)} \\ &\ge \frac{\pi(S) \pi(T)/\rho}{\pi(S)} \\ &\ge \frac{1}{2\rho}. \end{align*}$

4. The Metropolis-Hastings algorithm and simulated annealing

The basic idea of Metropolis-Hastings (sometimes just called Metropolis) is that we start with a reversible Markov chain P with a known stationary distribution π, but we'd rather get a different stationary distribution μ, where μ_i = f(i) / ∑_j f(j) is proportional to some function f≥0 on states that we can compute easily. Define q_ij for i≠j to be p_ij min(1, π_if(j)/π_jf(i)), = p_ij min(1, π_iμ_j/π_jμ_i) and let q_ii be whatever probability is left over. Now consider two states i and j, and suppose that π_if(j) ≥ π_jf(i). Then μ_iq_ij = μ_ip_ij while μ_jq_ji = μ_jp_ji(π_jμ_i/π_iμ_j) = p_ji(π_jμ_i/π_i) = μ_i (p_jiπ_j) / π_i = μ_i (p_ijπ_i) / π_i = μ_ip_ij (note the use of reversibility of P in the second-to-last step). So the Q chain is a reversible Markov chain with stationary distribution μ.

We can simplify this in some cases by making our underlying chain a random walk on a graph with maximum degree d, where each transition occurs with probability exactly 1/d (for lower-degree nodes, there is some leftover probability of staying put). Then we have π(i) = π(j) for all i, j, so the new transition probability q_ij are just (1/d) min(1, f(j)/f(j)).⁴

A typical application is that we want to sample according to Pr[i|A], but A is highly improbable (so we can't just use rejection sampling), and Pr[i|A] is easy to compute for any fixed i but tricky to compute for arbitrary events (so we can't use divide-and-conquer). If we let f(i) ∝ Pr[i|A], then Metropolis-Hastings will do exactly what we want, assuming it converges in a reasonable amount of time.

Another application, which generally involves tinkering with the chain while it's running, is the global optimization heuristic known as simulated annealing. Here we have some function g that we are trying to minimize. So we set f(i) = exp(-αg(i)) for some α>0. Running Metropolis-Hastings gives a stationary distribution that is exponentially weighted to small values of g; if i is the global minimum and j is some state with high g(j), then π(i) = π(j) exp(α(g(j)-g(i)), which for large enough α goes a long way towards compensating for the fact that in most problems there are likely to be exponentially more bad j's than good i's. The problem is that the same analysis applies if i is a local minimum and j is on the boundary of some depression around i; large α means that it is exponentially unlikely that we escape this depression and find the global minimum.

The simulated annealing hack is to vary α over time; initially, we set α small, so that we get conductance close to that of the original Markov chain. This gives us a sample that is roughly uniform, with a small bias towards states with smaller g(i). After some time we increase α to force the process into better states. The hope is that by increasing α slowly, by the time we are stuck in some depression, it's a deep one—optimal or close to it. If it doesn't work, we can randomly restart and/or decrease α repeatedly to jog the chain out of whatever depression it is stuck in. How to do this effectively is deep voodoo that depends on the structure of the underlying chain and the shape of g(i), so most of the time people who use simulated annealing just try it out with some generic annealing schedule and hope it gives some useful result. (This is what makes it a heuristic rather than an algorithm. It's continued survival is a sign that it does work at least sometimes.)

4.1. Examples of simulated annealing with provable convergence times

It turns out to be surprisingly hard to come up with an interesting toy example of simulated annealing with provable convergence (see the Extended examples section below for some failed attempts). Here are some non-interesting examples.

4.1.1. Single peak

Let's suppose x is a random walk on an n-dimensional hypercube (i.e., n-bit vectors where we set 1 bit at a time), g(x) = |x|, and we want to maximize g. Now a transition that increase |x| is accepted always and a transition that decreases |x| is accepted only with probability e^-α. For large enough α, this puts a constant fraction of π on the single peak at x=1; the observation is that that there are only ${n \choose k}$ ≤ n^k points with k zeros, so the total weight of all points is at most π(1) ∑_k≥0 n^k exp(-αk) = π(1) ∑ exp(ln n - α)^k = π(1) / (1 - exp(ln n - α)) = π(1) ⋅ O(1) when α > ln n, giving π(1) = Ω(1) in this case.

So what happens with convergence? Let p = exp(-α). Let's try doing a path coupling (see below) between two adjacent copies x and y of the Metropolis-Hastings process, where we first pick a bit to change, then pick a value to assign to it, accepting the change in both processes if we can. The expected change in |x-y| is then (1/2n) (-1-p), since if we pick the bit where x and y differ, we have probability 1/2n of setting both to 1 and probability p/2n of setting both to 0, and if we pick any other bit, we get the same distribution of outcomes in both processes. This gives a general bound of E[|X_t+1-Y_t+1| | |X_t-Y_t|] ≤ (1-(1+p)/2n) |X_t-Y_t|, from which we have E[|X_t-Y_t|] ≤ exp(-t(1+p)/2n) E[|X₀-Y₀|] ≤ n exp(-t(1+p)/2n). So after t = 2n/(1+p) ln(n/ε) steps we expect to converge to within ε of the stationary distribution in total variation distance. This gives an O(n log n) algorithm for finding the peak. This is kind of a silly example, but in this case the running time for simulated annealing is almost equal to the Θ(n) optimum, if we scramble x by letting g(x) = |x⊕r| for some random constant r.

4.1.2. Single peak with very small amounts of noise

Now we'll let g:2ⁿ→ℕ be some arbitrary Lipschitz function (i.e., |g(x)-g(y)| ≤ |x-y|) and ask for what values of p = exp(-α) the Metropolis-Hastings walk with f(i) = exp(-αg(i)) can be shown to converge quickly. Given adjacent states x and y, with x_i≠y_i but x_j=y_j for all j≠i, we still have a probability of at least (1+p)/2n of coalescing the states by setting x_i=y_i. But now there is a possibility that if we try to move to (x[j/b], y[j/b]) for some j and b, that x rejects while y does not or vice versa (note if x_j=y_j=b, we don't move in either copy of the process). Conditioning on j and b, this occurs with probability 1-p precisely when x[j/b] < x and y[j/b] ≥ y or vice versa, giving an expected increase in |x-y| of (1-p)/2n. We still get an expected net change of -2p/2n = -p/n provided there is only one choice of j and b for which this occurs. So we converge in time τ(ε) ≤ (n/p) log(n/ε) in this case.⁵

One way to think of this is that the shape of the neighborhoods of nearby points is similar. If I go up in a particular direction from point x, it's very likely that I go up in the same direction from some neighbor y of x.

If there are more bad choices for j and b, then we need a much larger value of p: the expected net change is now (k(1-p)-1-p)/2n = (k-1-(k+1)p)/2n, which is only negative if p > (k-1)/(k+1). This gives much weaker pressure towards large values of g, which still tends to put us in high neighborhoods but creates the temptation to fiddle with α to try to push us even higher once we think we are close to a peak.

5. Extended examples of convergence bounds

5.1. k-colorings of a degree d graph

(This is pretty ad hoc. See MitzenmacherUpfal §11.5 for a better analysis that only requires 2d colors, or this paper for even more sophisticated results and a history of the problem.)

Consider the following chain on proper k-colorings of a graph with degree d, where k is substantially larger than d (we need at least k ≥ d+1 to guarantee that a coloring exists, but we will assume k somewhat bigger to show fast convergence). At each step, we choose one of the n nodes and one of the k colors, and recolor the node to the new color if and only if no neighbor already has that color.

We'll think of colorings as vectors. Given two colorings x and y, let ||x-y|| be the Hamming distance between them, which is the number of nodes assigned different colorings by x and y. To show convergence, we will construct a coupling that showing that ||X_t-Y_t|| converges to 0 over time starting from arbitrary initial points X₀ and Y₀. The easiest way to do so is to use a variant of coupling called path coupling.

5.1.1. Path coupling

Instead of just looking at X_t and Y_t, consider a path of intermediate states X_t = Z_0,tZ_1,tZ_2,t...Z_m,t = Y_t, where each pair Z_i,tZ_i+1,t is adjacent, i.e., satisfies ||Z_i,t-Z_i+1,t|| = 1. (In general, we'll look at distance according to some natural metric for whatever Markov chain we've got; a typical choice is just the minimum number of steps needed to reach one state from another.) We now construct a coupling only for adjacent nodes that reduces their distance on average. The idea is that d(X_t,Y_t) ≤ ∑ ||Z_i,tZ_i+1,t||, so if the distance between each adjacent pair shrinks on average, so does the total length of the path.

A complication for graph coloring is that it's not immediately evident that d(X_t,Y_t) = ||X_t-Y_t||. The problem is that it may not be possible to transform X_t into Y_t one node at a time without producing improper colorings. With enough colors, we can explicitly construct a short path between X_t and Y_t that uses only proper colorings; but for this particular process it is easier to simply extend the Markov chain to allow improper colorings, and show that our coupling works anyway. This also allows us to start with an improper coloring for X₀ if we are particularly lazy.

For the graph coloring problem, the natural coupling to consider given adjacent X_t and Y_t is to pick the same node and the same new color for both. If we pick the one node on which they differ, and choose a color that is not used by any neighbor (which will be the same for both copies of the process, since all the neighbors have the same colors), then we get X_t+1 = Y_t+1; this even occurs with probability at least (k-d)/kn. If we pick a node that is neither the different node nor adjacent to it, then the distance between X and Y doesn't change; either both get a new identical color or both don't. If we pick a node adjacent to the different node, then there is a possibility that the distance increases; if X_t assigns color c to this node and Y_t color c', then choosing either c or c' for the neighbor will cause one copy to stay the same (because the new color isn't feasible) and the other to change. This even occurs with probability at most 2d/kn = 2/n if we sum over all neighbors. So E[||X_t+1-Y_t+1|| | ||X_t-Y_t|| = 1] ≤ 1-(k-d)/kn + 2d/kn = 1 - (k-3d)/kn. This is less than 1 if k ≥ 3d+1.

Applying this bound to the entire path, we get for general X_t, Y_t that E[||X_t+1-Y_t+1|| | ||X_t-Y_t||] ≤ (1-(k-3d)/kn) ||X_t-Y_t|| ≤ exp(-(k-3d)/kn) ||X_t-Y_t||. A simple induction then gives Pr[X_t ≠ Y_t] ≤ E[||X_t-Y_t||] ≤ exp(-t(k-3d)/kn) E[||X₀-Y₀||] ≤ exp(-t(k-3d)/kn) n.

Setting this bound equal to ε and solving for t gives τ(ε) ≤ kn log(n/ε) / (k-3d).

5.1.2. Counting by sampling

Basic idea is to nail down colors one at a time and use sampling to approximate ratios between (# of colorings with m fixed colors)/(# of colorings with m+1 fixed colors). See MitzenmacherUpfal for more details.

5.1.3. Finding a cheap coloring using Metropolis-Hastings

(I'm not going to do this in class, because the result is not very good, but I will leave the notes here in case anybody is interested.)

Let's see what happens if we try doing simulated annealing with a fixed temperature on the graph coloring chain. The idea is that we will assign cost i to color i=0..k-1, and try to find a proper coloring with low total cost by running a Markov chain with stationary distribution proportional to exp(-α ∑ x_i). The chain is easily constructed using Metropolis-Hastings; when we attempt to switch a node from color i to color j, we accept the transition with probability min(1, exp(-αj)/exp(-αi)) = min(1, exp(-α(j-i))). The question then is how quickly this Markov chain converges for reasonable values of k and α.

Let's adapt the path-coupling analysis we did before. Given adjacent states x and y with x_i≠y_i, we will couple their next transition by having both copies of the process choose the same node and same color. We now have to worry about whether the two transitions will be accepted. Given probabilities p_x ≤ p_y of acceptance, we have both copies accept the transition with probability p_x, and y alone accept with probability p_y-p_x; the situation with p_x ≥ p_y is symmetric.

Now we have to consider what happens to ||x-y|| as the result of a transition. There are two classes of interesting interactions:

Node i chooses a new color c that is not a color of any neighbor of i. Then x_i and y_i both switch to c with probability min(1, exp(-α(c-(min(x_i,y_i))))). In the worst case, we have min(x_i,y_i) = 0, and the smallest nonzero color that does not create a monochromatic edge is d+1. Summing over all possible values of c gives a probability of coalescence of (1/nk) (1 + ∑_c=d+1..k exp(-αc)) = (1/nk) (1 + (exp(-α(d+1)) - exp(-α(k+1)))/(1-exp(-α))).
Some neighbor j of i chooses a new color c that is equal to x_i or y_i. In this case the distance between x and y increases by 1, assuming the transition is accepted. But if i's neighbors all have colors greater than both x_i and y_i, then they will accept the transition with probability 1. So the worst-case probability that this event occurs is 2d/nk, as in the original process.

Ignore the 1/nk denominator for the moment, our distance drops if

$\begin{align*} 2d & < 1 + \frac{e^{-\alpha(d+1)} - e^{-\alpha(k+1)}}{1-e^{-\alpha}} \end{align*}$

Our only real hope of getting the RHS bigger than the LHS is to make the fraction denominator small, which means making α very small. We have exp(-α) ≥ 1-α ⇒ 1-exp(-α) ≤ α ⇒ 1/(1-exp(-α)) ≥ 1/α. So the above inequality holds provided

$\begin{align*} 2d-1 < \frac{e^{-\alpha(d+1)} - e^{-\alpha(k+1)}}{\alpha}. \end{align*}$

We can't do much about the numerator, but if α(d+1) is a small constant and α(k+1) is a much larger constant, we should be happy. Let's let α = β/(d+1) and k+1 = γ(d+1); then the RHS becomes (d+1)(exp(-β)-exp(-γβ))/β. This will easily exceed 2d (giving us an extra 1/nk expected drop) if exp(-β) - exp(-γβ) ≥ 2β, which occurs if exp(-γβ) ≤ exp(-β) - 2β, or -γβ ≤ log(exp(-β) - 2β), or γ ≥ -log(exp(-β) - 2β)/β. For example, setting β = 1/5 and γ = 5 works (this can be determined numerically).

So with α = 1/(5(d+1)) and k ≥ 5d+4, we get E[||X_t+1-Y_t+1||] ≤ (1-1/nk) E[X_t-Y_t]. So the Metropolis-Hastings chain converges to within ε of its stationary distribution (in total variation distance) starting from an arbitrary state after at most nk log(1/ε) steps.

So what does this mean in terms of finding a good coloring? As in the old joke about the field service engineer, it depends on how many bad colorings you bring with you. And in fact most colorings are pretty bad, which means that our exp(-αc) multiplier isn't actually going to push us too far away from them.

Quick calculation: For a single node with no neighbors, the stationary distribution on its color is π(i) = exp(-αi)/∑_i=0..k-1 exp(-αi). The expected value is ∑_i i exp(-αi) / ∑_i exp(-αi). Taking the limit as k→∞ gives E[color] = exp(-α)/(1-exp(-α)) which for small α is in the neighborhood of 1/α, e.g. 5(d+1). Even for k ≅ 1/α (as in the case above), we have π(k-1) ≅ e^-1π(0), which doesn't leave a lot of room for pressure toward 0. So we can reasonably expect the chain to be pretty far from optimal almost all of the time for α this small.

I guess this is why people resort to simulated annealing. I will try to come up with a different example for class.

5.2. Another example: Independent sets

Let's try something we know is hard, so we won't feel bad when it doesn't work. We'd like to sample independent sets of vertices on a graph, with a bias toward large sets.

By analogy with the graph coloring problem, a natural way to set up the random walk is to represent each potentially independent set as a bit vector, where 1 indicates membership in the set, and at each step pick one of the n bits uniformly at random and set it to 0 or 1 with probability 1/2 each, assuming that the resulting set is independent.

Easy to see this is a reversible Markov chain with π(x) = 1/|S| where S is the set of all independent sets. Also easy to see that the distance between x and y is exactly |x-y|₁, since we can alway remove all the extra ones from x and put back the extra ones in y.

Obvious coupling: Pick the same position and value for both copies of the chain. If x and y are adjacent, then they coalesce with probability 1/n (both probability 1/2n transitions are feasible for both copies, since the neighboring nodes always have the same state). What is the probability that they diverge? We can only be prevented from picking a value if the value is 1 and some neighbor is 1. So the bad case is when x_i = 1, y_i = 0, and we attempt to set some neighbor of i to 1; in the worst case, this happens d/2n of the time, which is at least 1/n when d ≥ 2. No coalescence for you!

We now have two choices. We can try to come up with a more clever random walk (see MitzenmacherUpfal §11.6 for proof that a more sophisticated algorithm converges when d≤4), or we can try to persevere with our dumb random walk, a more clever analysis, and possibly a more restricted version of the problem where it will miraculously work even though it shouldn't really. Let's start with d=2. Here the path coupling argument gives no expected change in |X_t-Y_t|, so we have some hope that with enough wandering around they do in fact collide.

How do we prove this? Given arbitrary X_t, Y_t, we know from the path coupling argument above that on average they don't move any farther away. We also know that that there is a probability of at least 1/n that they move closer if they are not already identical (this is slightly tricky to see in the general case; basically we always have a 1/2n chance of removing an extra 1 from one or the other, and either we also have an extra 1 in the other process, we get another 1/2n, and if we don't, we can put in a missing 1 and get 1/2n that way instead). And we know that any change resulting from our not-very-smart coupling will change the distance by ±1. So if we sample only at times when a move has just occurred, we see a random walk (with a possible downward bias) with a reflecting barrier at n and an absorbing barrier at 0: this converges in n² steps. But since our random walk steps may take an expected n real steps each, we get a bound of n³ on the total number of steps to converge.

But if d=3, we seem to be doomed. Also if we try to bias the walk in favor of adding vertices: since our good case is a 1→0 transition, decreasing its probability breaks the very weak balance we have even with d=2 (for d=1, no problem!) So maybe we should see what we can do with the sneakier random walk from MitzenmacherUpfal.

Here the idea is that we pick a random edge uv, and then try to do one of the following operations, all with equal probability:

Set u=v=0.
Set u=0 and v=1.
Set u=1 and v=0.

In each case, if the result would be a non-independent set, we instead do nothing.

Verifying that this has a uniform stationary distribution is mildly painful if we are not careful, since there may be several different transitions that move from some state x to the same state y. But for each transition (occuring with probability 1/(3|E|)), we can see that there is a reverse transition that occurs with equal probability; so we get detailed balance etc. and win.

To make things more exciting, we will assume that we accept changes that decrease the number of elements of the independent set only with probability p; this will turn out to be a bad idea, but we can always recover the original process by setting p=1. This changes the stationary distribution so that π(x) is proportional to p^-|x|.

So now what happens if we run two coupled copies of this process, where the copies differ on exactly one vertex i?

First, every neighbor of i is 0 in both processes. A transition that doesn't involve any neighbors of i will have the same effect on both processes. So we need to consider all choices of edges where one of the endpoints is either i or a neighbor j of i. In the case where the other endpoint isn't i, we'll call it k; there may be several such k.

The tables below adds up the effect of all the transitions that involve some particular neighbor j of i. Each contribution should be multiplied by 1/(3|E|) to get the actual effect on the expectation. We consider several cases, depending on how many neighbors k there are that are set to 1, which affects when j can be set to 1. The values in the left-hand columns are the new values of i, j, and k. For the notes, we assume that x_i=0 and y_i=1; the reverse case is symmetric.

We'll start with the s=0 case:

i	j	k	contribution to \|x-y\|	notes
0	0		-p	Sets x_i=y_i, but only if both processes move.
0	1		-1	Sets x_i=y_i always.
1	0		-1	Sets x_i=y_i always.
	0	0	0	Any change to x also appears in y.
	0	1	0	Any change to x also appears in y.
	1	0	+d(j)-1	Causes divergence between x and y, since we can only move in x.

So in this case, each node j contributes d(j)-3-p to the expected change in distance between x and y.

For s=1, we have:

i	j	k	contribution to \|x-y\|	notes
0	0		-p	Sets x_i=y_i, but only if both processes move.
0	1		0	No effect; can't set j=1 because some k=1.
1	0		-1	Sets x_i=y_i always.
	0	0	0	Any change to x also appears in y.
	0	1	0	Any change to x also appears in y.
	1	0	+2	Diverges at both j and k, but only for one choice of k.

The total contribution is now 1-p.

For s=2, we have:

i	j	k	contribution to \|x-y\|	notes
0	0		-p	Sets x_i=y_i, but only if both processes move.
0	1		0	No effect; can't set j=1 because some k=1.
1	0		-1	Sets x_i=y_i always.
	0	0	0	Any change to x also appears in y.
	0	1	0	Any change to x also appears in y.
	1	0	0	No effect; can't set j=1 because some other k'=1.

Total contribution: -1-p.

Here the fact that j has two nonzero neighbors (besides i) means that we can't set it to 1 no matter what.

To calculate the total shift, we have to multiply by d(i), so that we cover all j. The resulting expected change is at most (1/(3|E|)) d(i) max(d(j)-3-p, 1-p, -1-p) ≤ (d/3|E|) max(d-3-p, 1-p). This is never negative, but we can drive it to 0 by setting d≤4 and p=1.

If we could come up with a variant of this process that converged in polynomial time (for d=3, say) even for small p, miraculous things would follow. In particular, we could use it to find near-maximum independent sets in cubic graphs (graphs where every node has degree 3) in polynomial time. The argument is that in a degree-3 graph, there is always a maximum independent set of size n/4 or greater. If we call an independent set good if it has size within (1-δ) of optimal, then the bad sets are each assigned weight less than p^(δn/4) times the weight of the optimal set, and even if there are 2ⁿ of them, their total weight is still less than 2ⁿp^(δn/4) times the optimal set. So if p ≤ 2^4/δ, a single optimal set is sampled at least as often as all the bad sets put together, and if there are other good sets besides the optimal ones, it only increases the probability of getting a good set in the stationary distribution. But if there is a polynomial-time randomized algorithm that finds a (1-δ)-approximation to maximum independent set in cubic graphs with non-trivial probability, then RP=NP, because of this paper: alimanti-mis-cubic.pdf. So if RP≠NP (as is commonly believed), then for any chain we build, there is some value of p for which it starts failing to converge.

CategoryRandomizedAlgorithmsNotes

To put it more strongly, the Aldous-Fill manuscript is the Necronomicon of Markov chain convergence proofs. You can't truly join the cult without reading it. You should stop reading these notes right now and read the Aldous-Fill manuscript instead. Iä! Iä! Markov fhtagn! (1)
This is a little tricky; we don't know from this bound alone how fast the probability converges as a function of n, so to do this right we need to look into the bound in more detail. (2)
For aperiodic chains and chains with period 2, all the eigenvalues are real; the period-2 case includes an eigenvalue of -1. Chains with periods greater than 2 have pairs of complex-valued eigenvalues that are roots of unity, which happen to cancel out to only produce real probabilities in vP^t. (3)
MitzenmacherUpfal §10.4.1 concentrates on this case. (4)
You might reasonably ask if such functions g exist. One example is g(x) = (x₁⊕x₂) + ∑_i>2 x_i. (5)