For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.
Here we are describing the work of Aspnes, Attiya, and Censor from PODC 2009. See http://www.cs.yale.edu/homes/aspnes/max-registers-abstract.html for the full story.
Basic idea of a max register: read operation returns the largest value previously written, as opposed to the last value previously written. So after writes of 0, 3, 5, 2, 6, 11, 7, 1, 9, a read operation will return 11.
These are perturbable objects in the sense of the JayantiTanToueg bound, so in the worst case a max-register read will have to read at least n-1 distinct atomic registers, giving an n-1 lower bound on both individual work and space. But we can get around this by considering bounded max registers (which only hold values in some range 0..m-1); these are not perturbable because once one hits its upper bound we can no longer insert new operations to change the value returned by a read.
1. Implementing bounded max registers
For m=1, the implementation is trivial: write does nothing and read always returns 0.
For larger m, we'll show how to paste together two max registers left and right with m0 and m1 values together to get a max register r with m0+m1 values. We'll think of each value stored in the max register as a bit-vector, with bit-vectors ordered lexicographically. In addition to left and right, we will need a 1-bit atomic register switch used to choose between them. The read procedure is straightforward:
- if switch = 0:
return 0 . Readleft()
return 1 . Readright()
- if switch = 0:
Here we use the Perlism '.' for concatenation.
For write operations, we have two somewhat asymmetrical cases depending on whether the value we are writing starts with a 0 bit or a 1 bit:
- if switch = 0:
- do nothing
switch <- 1
- if switch = 0:
The intuition is that the max register is really a big tree of switch variables, and we store a particular bit-vector in the max register by setting to 1 the switches needed to make Read() follow the path corresponding to that bit-vector. The procedure for writing 0x tests switch first, because once switch gets set to 1, any 0x values are smaller than the largest value, and we don't want them getting written to left where they might confuse particularly slow readers into returning a value we can't linearize. The procedure for writing 1x sets switch second, because (a) it doesn't need to test switch, since 1x always beats 0x, and (b) it's not safe to send a reader down into right until some value has actually been written there.
It's easy to see that Read and Write operations both require exactly one operation per bit of the value read or written. To show that we get linearizability, we give an explicit linearization ordering (see the paper for a full proof that this works):
- All operations that read 0 from switch go in the first pile.
- Within this pile, we sort operations using the linearization ordering for left.
- All operations that read 1 from switch or write 1 to switch go in the second pile, which is ordered after the first pile.
- Within this pile, operations that touch right are ordered using the linearization ordering for right. Operations that don't (which are the "do nothing" writes for 0x values) are placed consistently with the actual execution order.
To show that this gives a valid linearization, we have to argue first that any Read operation returns the largest earlier Write argument and that we don't put any non-concurrent operations out of order.
For the first part, any Read in the 0 pile returns 0 . Readleft(), and Readleft returns (assuming left is a linearizable max register) the largest value previously written to left, which will be the largest value linearized before the Read, or the all-0 vector if there is no such value; in either case we are happy. Any Read in the 1 pile returns 1 . Readright. Here we have to guard against the possibility of getting an all-0 vector if no Write operations linearize before the Read. But any Write operation that writes 1x doesn't set switch to 1 until after it writes to right, so no Read operation ever starts Readright until after at least one Writeright has completed, implying that that Writeright linearizes before the Readright. So in this case as well all the second-pile operations linearize.
2. Encoding the set of values
If we structure our max register as a balanced tree of depth k, we are essentially encoding the values 0..2k-1 in binary, and the cost of performing a read or write operation on an m-valued register is exactly k = ⌈lg m⌉. But if we are willing to build an unbalanced tree, any prefix code will work.
The paper describes a method of building a max register where the cost of each operation that writes or reads a value v is O(log v). The essential idea is to build a tree consisting of a rightward path with increasingly large left subtrees hanging off of it, where each of these left subtrees is twice as big as the previous. This means that after following a path encoded as 1k0, we hit a 2k-valued max register. The value returned after reading some v' this max register is v' + (2k-1), where the 2k-1 term takes into account all the values represented by earlier max registers in the chain. Formally, this is equivalent to encoding values using an Elias gamma code, tweaked slightly (by changing the prefixes from 0k1 to 1k0) to get the ordering right.
3. Unbounded max registers
While the unbalanced-tree construction could be used to get an unbounded max register, it is possible that read operations might not terminate (if enough writes keep setting 1 bits on the right path before the read gets to them) and for very large values the cost even of terminating reads becomes higher than what we can get out of a simple collect.
Here is the collect-based method: if each process writes its own contribution to the max register to a single-writer register, then we can read the max register with n-1 atomic register reads, by reading everybody else's register and returning the maximum of the other values and our own. (It is not hard to show that this is linearizable.) This gives an unbounded max register with read cost n-1 and write cost 1. So by choosing this in preference to the balanced tree when m is large, the read cost for a max register is min(⌈lg m⌉, n-1).
We can combine this with the unbalanced tree by terminating the right path with a collect-based register. This gives a cost for reads and writes of values v of O(min(log v, n)).
4. Lower bound
The min(⌈lg m⌉, n-1) cost of a max register read turns out to be exactly optimal. Intuitively, we can show by a covering argument that once some process attempts to write to a particular atomic register, then any subsequent writes convey no additional information (because they can be overwritten by the first delayed write)—so in effect, no algorithm can use get more than one bit of information out of each atomic register.
For the lower bound proof, we consider solo-terminating executions in which n-1 writers do any number of max-register writes in some initial prefix Λ, followed by a single max-register read Π by process n. Let T(m,n) be the optimal reader cost for executions with this structure with m values, and let r be the first register read by process n, assuming it is running an algorithm optimized for this class of executions (we do not even require it to be correct for other executions).
We are now going split up our set of values based on which will cause a write to write to r. Let Sk be the set of all sequences of writes that only write values ≤ k. Let t be the smallest value such that some execution in St writes to r (there must be some such t, or our reader can omit reading r ⇒ it isn't optimal).
- Case 1
Since t is smallest, no execution in St-1 writes to r. If we restrict writes to values ≤ t-1, we can omit reading r, giving T(t,n) ≤ T(m,n) - 1 or T(m,n) ≥ T(t,n) + 1.
- Case 2
Let α be some execution in St that writes to r.
Split α as α'δβ where δ is the first write to r by some process pi.
- Construct a new execution α'ν by letting max-register writes except the one performing δ finish.
Now consider any execution α'νγδ, where γ is any sequence of max-register writes with values ≥ t that excludes pi and pn. Then pn always sees the same value in r following these executions, but otherwise (starting after α'ν) we have an (n-1)-process max-register with values t through m-1.
- Omit read of r to get T(m,n) ≥ T(m-t, n-1) + 1.
We've shown the recurrence T(m,n) ≥ mint(max(T(t,n), T(m-t,n))) + 1, with base cases T(1,n) = 0 and T(m,1) = 0. The solution to this recurrence is exactly min(⌈lg m⌉, n-1), the same as the upper bound we got by choosing between a balanced tree for small m and a collect for m ≥ 2n-1. For small m, the recursive split we get is also the same as in the tree-based algorithm: call the r register switch and you can extract a tree from whatever algorithm somebody gives you. So this says that the tree-based algorithm is (up to choice of the tree) essentially the unique optimal bounded max register implementation for m ≤ 2n-1.
Turn any monotone circuit into a data structure by embedding max registers everywhere; for example, we can implement a counter as a tree of 2-input adders with single-writer registers at the bottom. Anybody who updates an input to the circuit is responsible for updating all the outputs to gates on a path from that input to the root, and a reader just reads the max register at the root.
For a counter, an increment operation decomposes recursively like this:
- inc(left) [or inc(right) as appropriate]
output <- left + right
where read and write operations on left, right, and output are all max register operations, to avoid the possibility of lost updates from delayed incrementers. The cost of an increment to an m-valued counter is then O(log n log m), while the cost of a read is O(log m). These reduce to O(log2 n) and O(log n) if we assume m is polynomial in n.
To prove this is linearizable, pretend that all the constituent counters and max registers are atomic (which we can do if they are themselves linearizable) and construct an explicit linearization ordering as follows:
- Assign each increment the value left+right just after it increments left or right, and order the increments by these values. (Note that this is automatically consistent with the real-time ordering.)
- Assign each read the value it reads, and order the reads by the time at which they read output (also consistent with real-time ordering).
Interleave the reads and increments by putting all reads that return v just after the increment assigned v. To show this is consistent with real-time ordering, first suppose some increment is assigned v and some read starts after the increment finishes. Then the increment (because it finished) read values l' and r' from left and right that were at least as large as the value l and r when it finished its increment, and writes l'+r' >= v to output before the reader reads it. This implies the reader is ordered later. Alternatively, if a reader finishes before some increment starts, then any value stored in output before the increment starts is less than l+r (since left and right are non-decreasing and at least one is smaller than it is after the incrementer's first increment), so the reader's value is also less than l+r and it gets ordered first.
This proof breaks down for a generalized counter that allows arbitrary non-negative increments; it is possible to construct an execution where two processes increment left and right by +1 and +2 but readers see all values 0, 1, 2, and 3 (we need a third process to sneak in and read 0 from the left long before reading 2 from right, even though some faster process sees a 1 in left and a 0 in right before the +2 increment happens). For generalized counters and monotone circuits in general, we get a weaker condition called monotone consistency that is similar to the requirements for lattice agreement (see the paper for details).
An open problem is whether a stronger construction could transform any monotone circuit like this into a linearizable data structure while keeping the polylogarithmic cost bounds. One way to do this would be to find a cheap snapshot for two max registers, and use it to read the values of left and right atomically. Unfortunately, we currently don't know how to build a 2-max-register snapshot object. It may be that doing so is inherently hard, but we don't know that either.