For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.
In a shared memory model, it may be possible to solve some problems using wait-free protocols, in which any process can finish the protocol in a bounded number of steps, no matter what the other processes are doing (see ObstructionFreedom for more on this and some variants).
The wait-free hierarchy hrm classifies AsynchronousSharedMemory object types T by consensus number, where a type T has consensus number n if with objects of type T and atomic registers (all initialized to appropriate values) it is possible to solve wait-free consensus (i.e., agreement, validity, wait-free termination) for n processes but not for n+1 processes. The consensus number of any type is at least 1, since 1-process consensus requires no interaction, and may range up to ∞ for particularly powerful objects.
The wait-free hierarchy was suggested by work by Maurice Herlihy Wait-free synchronization, TOPLAS 13(1):124-149 that classified many common (and some uncommon) shared-memory objects by consensus number, and showed that an unbounded collection of objects with consensus number n together with atomic registers gives a wait-free implementation of any object in an n-process system. Various subsequent authors noticed that this did not give a "robust" hierarchy in the sense that combining two types of objects with consensus number n could solve wait-free consensus for larger n, and the hierarchy hrm was proposed by Prasad Jayanti Jayanti, Robust wait-free hierarchies, JACM 44(4):592-614, 1997 as a way of classifying objects that might be robust: an object is at level n of the hrm hierarchy if having unboundedly many objects plus unboundedly many registers solves n-process wait-free consensus but not (n+1)-process wait-free consensus. Whether or not the resulting hierarchy is in fact robust for arbitrary deterministic objects is still open, but Eric Ruppert Ruppert, Determining consensus numbers, SICOMP 30(4):1156-1168, 2000 subsequently showed that it is robust for RMW registers and objects with a read operation that returns the current state, and there is a paper by Borowsky, Gafni, and Afek (Consensus power makes (some) sense! (extended abstract), PODC 1994) that sketches a proof based on a topological characterization of computability that hrm is robust for deterministic objects that don't discriminate between processes (unlike, say, single-writer registers). So for well-behaved shared-memory objects (i.e., deterministic, symmetrically accessible, with read operations, etc.), consensus number appears to give a real classification that allows us to say for example that any collection of read-write registers (consensus number 1), fetch-and-increments (2), test-and-set bits (2), and queues (2) is not enough to build a compare-and-swap (∞).
We won't do the full robustness proofs of Borowsky et al or Ruppert that let us get away with this. Instead, we'll concentrate on Herlihy's original results and show that specific objects have specific consensus numbers when used in isolation. The procedure in each case will be to show an upper bound on the consensus number using a variant of FischerLynchPaterson (made easier because we are wait-free and don't have to worry about fairness) and then show a matching lower bound (for non-trivial upper bounds) by exhibiting an n-process consensus protocol for some n. Essentially everything below is taken from Herlihy's paper, so reading that may make more sense than reading these notes.
1. Classification by consensus number
Here's the quick table:
Read + interfering no-return RMW
Registers, counters, generalized counters
Interfering RMW; queue-like structures
Fetch-and-write, fetch-and-add, queues, process-to-memory swap
m-process consensus objects
atomic m-register write
First write-like operation wins
Queue with peek, fetch-and-cons, sticky bits, compare-and-swap, memory-to-memory swap, memory-to-memory copy
1.1. Level 1: atomic registers, counters, other interfering RMW registers that don't return the old value
First observe that any type has consensus number at least 1, since 1-process consensus is trivial.
We'll argue that a large class of particularly weak objects has consensus number exactly 1, by running FischerLynchPaterson with 2 processes. Recall that in FischerLynchPaterson we classify states as bivalent or univalent depending on whether both decision values are still possible, and that with at least one failure we can always start in a bivalent state (this doesn't depend on what objects we are using, since it depends only on having invisible inputs). Since the system is wait-free there is no constraint on adversary scheduling, and so if any bivalent state has a bivalent successor we can just do it. So to solve consensus we have to reach a bivalent configuration C that has only univalent successors, and in particular has a 0-valent and a 1-valent successor produced by applying operations x and y of processes px and py.
Assuming objects don't interact with each other behind the scenes, x and y must be operations of the same object. Otherwise Cxy = Cyx and we get a contradiction.
Now let's suppose we are looking at atomic registers, and consider cases:
- x and y are both reads, Then x and y commute: Cxy = Cyx ⇒ contradiction.
x is a read and y is a write. Then py can't tell the difference between Cyx and Cxy, so running py to completion gives the same decision value from both Cyx and Cxy, another contradiction.
x and y are both writes. Now py can't tell the difference between Cxy and Cy, so we get the same decision value for both, again constradicting that Cx is 0-valent and Cy is 1-valent.
There's a pattern to these cases that generalizes to other objects. Suppose that an object has a read operation that returns its state and one or more read-modify-write operations that don't return anything (perhaps we could call them "modify-write" operations). We'll say that the MW operations are interfering if for any two operations x and y either:
x and y commute: Cxy = Cyx.
One of x and y overwrites the other: Cxy = Cy or Cyx = Cx.
Then no pair of read or modify-write operations can get us out of a bivalent state, because reads commute, the non-reader can't tell which of a read and a MW operation happened first, and for any two MW operations either they commute or the overwriter can't detect that the first operation happened. So any MW object with uninformative, interfering MW operations has consensus number 1. For example, consider a counter that supports operations read, increment, decrement, and write: a write overwrites any other operation, and increments and decrements commute with each other, so the counter has consensus number 1. The same applies to a generalized counter that supports an atomic x←x+a operation; as long as this operation doesn't return the old value, it still commutes with other atomic increments.
1.2. Level 2: interfering RMW objects that return the old value, queues (without peek)
Suppose now that we have a RMW object that returns the old value, and suppose that it is non-trivial in the sense that it has at least one RMW operation where the embedded function f that determines the new value is not the identity (otherwise RMW is just read). Then there is some value v such that f(v) ≠ v. To solve two-process consensus, have each process pi first write its preferred value to a register ri, then execute the non-trivial RMW operation on the RMW object initialized to v. The first process in sees v and decides its own value. The second process sees f(v) and decides the first process's value (which it reads from the register). It follows that non-trivial RMW object has consensus number at least 2.
In many cases, this is all we get. Suppose that the operations of some RMW type T are interfering in a way analogous to the previous definition, where now we say that x and y commute if they leave the object in the same state (regardless of what values are returned) and that y overwrites x if the object is always in the same state after both x and xy (again regardless of what is returned). The two processes that carry out x and y know what happenened, but a third process z doesn't. So if we run z to completion we get the same decision value after both Cx and Cy, which means that Cx and Cy can't be 0-valent and 1-valent. It follows that no collection of RMW registers with interfering operations can solve 3-process consensus, and thus all such objects have consensus number 2.
There are some other objects with consensus number 2 that don't fit this pattern. Define a wait-free queue as an object with enqueue and dequeue operations (like normal queues), where dequeue returns empty if the queue is empty (instead of blocking). To solve 2-process consensus with a wait-free queue, initialize the queue with a single value (it doesn't matter what the value is). We can then treat the queue as a non-trivial RMW register where a process wins if it successfully dequeues the initial value and loses if it gets empty.
However, enqueue operations are non-interfering: if px enqueues vx and py enqueues vy, then any third process can detect which happened first; similarly we can distinguish enq(x) deq() from deq() enq(x). So to show we can't do three process consensus we do something sneakier: given a bivalent state C with allegedly 0- and 1-valent successors C enq(x) and C enq(y), consider both C enq(x) enq(y) and C enq(y) enq(x) and run x until it does a deq() (which it must, because otherwise it can't tell what to decide) and then stop it. Now run y until it also does a deq() and then stop it. We've now destroyed the evidence of the split and poor hapless z is stuck. In the case of C deq() enq(x) and C enq(x) deq() on a non-empty queue we can kill the initial dequeuer immediately and then kill whoever dequeues x or the value it replaced, and if the queue is empty only the dequeuer knows. In either case we reach indistinguishable states after killing only 2 witnesses, and the queue has number ≤ 2.
Similar arguments work on stacks, deques, and so forth—these all have consensus number exactly 2.
1.3. Level ∞: queue with peek, compare-and-swap, various memory-to-memory operations, fetch-and-cons, sticky bits
Here are a bunch of level-∞ objects:
- Queue with peek
- Has operations enq(x) and peek(), which returns the first value enqueued. (Maybe also deq(), but we don't use it). Protocol is to enq my input and then peek and return the first value into the queue.
- Returns old cdr and adds new car on to the head of a list. Use preceding protocol where peek() = tail(car::cdr).
- Sticky bits
- Has write operation that fails unless register is in the initial ⊥ state. Protocol is to write my input and then return result of a read.
- has CAS(old, new) operation that writes new only if previous value = old. Use it to build a sticky bit.
- Memory-to-memory swap
Has swap(ri, rj) operation that atomically swaps contents of ri with rj, as well as the usual read and write operations for all registers. Use to implement fetch-and-cons. Alternatively, use two registers inputi and victoryi for each process, where victoryi is initialized to 0, and a single central register token, initialized to 1. To execute consensus, write your input to inputi, then swap victoryi with token. The winning value is obtained by scanning all the victory registers for the one that contains a 1, then returning the corresponding input value.)
- Memory-to-memory copy
Has copy(ri, rj) operation that copies ri to rj atomically. Idea is that each process has registers rp1 and rp2 and stakes its claim by copying rp1 to rp2. It then writes 0 to all rp'1 for p' > p, and works backward looking for last non-zero rp'2, which is the winner (because it staked its claim before anybody else could shut it down). No better claim will come up at this point because all the zeroes in rp'1 for p' > p mean that there can be no further changes in any rp'2 for p' > p, and we don't care about rp'2 for p' < p because nobody will get that far during their backward scan without hitting rp2 = 1 first.
1.4. Level 2m-2: simultaneous m-register write
Here we have a (large) collection of atomic registers augmented by an m-register write operation that performs all the writes simultaneously. The intuition for why this is helpful is that if p1 writes r1 and rshared while p2 writes r2 and rshared then any process can look at the state of r1, r2 and rshared and tell which write happened first:
If the process reads r1 = r2 = ⊥, then we don't care which went first, because the reader (or somebody else) already won.
If the process reads r1 = 1 and then r2 = ⊥, then p1 went first.
If the process reads r2 = 2 and then r1 = ⊥, then p2 went first. (This requires at least one more read after checking the first case.)
Otherwise the process saw r1 = 1 and r2 = 2. Now read pshared: if it's 1, p2 went first, and vice versa.
This requires 2-register writes, and will give us a protocol for 2 processes (since the reader above has to participate somewhere to make the first case work). For m processes, we can do the same thing with m-register writes. We have a register rpq = rqp for each pair of distinct processes p and q, plus a register rpp for each p; this gives a total of m(m+1)/2 = O(m2) registers. All registers are initialized to ⊥. Process p then writes its initial preference to some single-writer register prefp and then simultaneously writes p to rpq for all q (including rpp). It then attempts to figure out the first writer by applying the above test for each q to rpq (standing in for rshared), rpp (= r1) and rqq (= r2). If it won against all the other processes, it decides its own value. If not, it repeats the test recursively for some p' that beat it until it finds a process that beat everybody, and returns its value. So m-register writes solve m-process wait-free consensus.
A further tweak gets 2m-2: run two copies of an m-1 process protocol using separate arrays of registers to decide a winner for each group. Then add a second phase where each process has one register sp, in which each process p from group 1 writes the winning id for its group simultaneously into sp and sq for each q in the other group. To figure out who won in the end, build a graph of all victories, where there is an edge from p to q iff p beat q in phase 1 or p's id was written before q's id in phase 2. The winner is the (unique) process with at least one outgoing edge and no incoming edges, which will be the process that won its own group (by writing first) and whose value was written first in phase 2.
1.4.1. Matching impossibility result
It would seem that the technique used to boost from m-process consensus to (2m-2)-process consensus could be repeated to get up to at least Θ(m2), but this turns out not to be the case. The essential idea is to show that in order to escape bivalence, we have to get to a configuration C where every process is about to do an m-register write leading to a univalent configuration (since reads don't help for the usual reasons, and normal writes can be simulated by m-register writes with an extra m-1 dummy registers), and then argue that these writes can't overlap too much. So suppose we are in such a configuration, and suppose that Cx is 0-valent and Cy is 1-valent, and we also have many other operations z1...zk that lead to univalent states. Following Herlihy, we argue in two steps:
There is some register that is written to by x alone out of all the pending operations. Proof: Suppose not. Then the 0-valent configuration Cxyz1...zk is indistinguishable from the 1-valent configuration Cyz1...zk by any process except px, and we're in trouble.
There is some register that is written to by x and y but not by any of the zi. Proof:: Suppose not; then Cxyz1...zk is indistinguishable from Cyxz1...zk for any process other than px and py, and we're still in trouble.
Now suppose we have 2m-1 processes. The first part says that each of the pending operations (x, y, all of the zi) writes to 1 single-writer register and at least k two-writer registers where k is the number of processes leading to a different univalent value. This gives k+1 total registers simultaneously written by this operation. Now observe that with 2m-1 process, there is some set of m processes whose operations all lead to a b-valent state; so for any process to get to a (¬b)-valent state, it must write m+1 registers simultaneously. It follows that with only m simultaneous writes we can only do (2m-2)-consensus.
1.5. Level m: m-process consensus objects
An m-process consensus object has a single consensus operation that, the first m times it is called, returns the input value in the first operation, and thereafter returns only ⊥. Clearly this solves m-process consensus. To show that it doesn't solve (m+1)-process consensus even when augmented with registers, run a bivalent initial configuration to a configuration C where any further operation yields a univalent state. By an argument similar to the m-register write case we can show that the pending operations in C must all be consensus operations on the same consensus object (anything else commutes or overwrites). Now run Cxyz1...zk and Cyxz1...zk, where x and y lead to 0- and 1-valent states, and observe that pk can't distinguish the resulting configurations because all it got was ⊥. (Note: this works even if the consensus object isn't in its initial state, since we know that before x or y the configuration is still bivalent.)
So the m-process consensus object has consensus number m. This shows that hrm is nonempty at each level.
A natural question at this point is whether the inability of m-process consensus objects to solve (m+1)-process consensus implies robustness of the hierarchy. One might consider the following argument: given any object at level m, we can simulate it with an m-process consensus object, and since we can't combine m-process consensus objects to boost the consensus number, we can't combine any objects they can simulate either. The problem here is that while m-process consensus objects can simulate any object in a system with m processes (see below), it may be that some objects can do more in a system with m+1 objects while still not solving (m+1)-process consensus. A simple way to see this would be to imagine a variant of the m-process consensus object that doesn't fail completely after m operations; for example, it might return one of the first two inputs given to it instead of ⊥. This doesn't help with solving consensus, but it might (or might not) make it too powerful to implement using standard m-process consensus objects.
2. Universality of consensus
This says that any type that can implement n-process consensus can, together with atomic registers, give a wait-free implementation of any object in a system with n processes.
See Herlihy paper for the full result, which does a lot of extra work to handle garbage collection and reduce overhead. Simplified profligate version goes like this: the processes repeatedly use consensus to decide between candidate histories of the simulated object, and a process successfully completes an operation when its operation (tagged to distinguish it from other similar operations) appears in a winning history. In slightly more detail:
- Have a separate n-process consensus protocol for each of a series of phases 0, 1, 2, ... . These protocols should allow processes as input values instead of just 0 or 1 (it's possible to build such objects out of binary consensus objects by doing tournaments).
- Processes post a list of (a) the operation they want to do and (b) the last phase they've participated in.
- To do an operation, process i:
- Posts the operation to its register.
- Reads all the last-phase values and takes their max.
- Runs the consensus protocol for the max phase to get the history decided on up to that phase.
- If the max phase history includes the process's pending operation, returns the result that operation would have had in the winning history.
- Otherwise, constructs a new history by appending all announced operations to the previous history, and tries to win with that history in phase max+1.
- Returns to step (b) if its operation doesn't make it into the winning history.
This terminates because even if process i doesn't get its value into the winning history, eventually some other process will pick up the announced value and include it.