The problem: we want to know when we can simulate message passing using shared memory and vice versa. This will let us apply results from either class of systems to the other.
1. Message passing from shared memory
This is the easy direction. We can build a reliable FIFO channel from single-writer single-reader registers using polling. The naive approach is that for each edge uv in the message-passing system, we create a (very big) register ruv, and u writes the entire sequence of every message it has ever sent to v to ruv every time it wants to do a new send. To receive messages, v polls all of its incoming registers periodically and delivers any messages in the histories that it hasn't processed yet.
This can be improved by adding in an acknowledgment mechanism in a separate register ackvu; the idea is that u will only write one message at a time to ruv, and will queue subsequent messages until v writes in ackvu that the message in ruv has been received. With some tinkering it is possible to knock ruv down to only 3 possible states (sending 0, sending 1, and reset) and ackvu down to a single bit (value-received, reset-received), but that's probably overkill for most applications.
Process failures don't affect any of these protocols, except that the dead process stops sending and receiving.
2. Shared memory from message passing
In the absence of process failures, we can just assign each register to some process, and implement both read and write operations by remote procedure calls to the process (in fact, this works for arbitrary shared-memory objects). With process failures, we need to make enough copies of the register that failures can't destroy all of them. We'll assume that our system is asynchronous, that the network is complete, and that we are only dealing with f < n/2 crash failures. We'll also assume we only want to build single-writer registers. Our goal is to simulate a single atomic register, in the sense that no external observing seeing the sequence of invoke and response actions can distinguish the execution from the execution of an atomic register. In particular, we will need to guarantee that if two register operations don't overlap at all that their results are consistent with the first happening first.
Here's the algorithm, which is due to Attiya, Bar-Noy, and Dolev Sharing memory robustly in message-passing systems, JACM 42(1):124-142, 1995; see also LynchBook §17.1.3. (AttiyaWelch §9.3 gives an equivalent algorithm, but the details are buried in an implementation of totally-ordered broadcast). We'll make n copies of the register, one on each process. Each process's copy will hold a pair (value, timestamp) where timestamps are (unbounded) integer values. Initially, everybody starts with (default, 0). A process updates its copy with new values (v,t) upon receiving write(v,t) from any other process p, provided t is greater than the process's current timestamp. It then responds to p with ack(v,t) whether or not it updated its local copy. A process will also respond to a message read(u) with a response ack(value, timestamp, u); here u is a nonce used to distinguish between different read operations so that a process can't be confused by out-of-date acknowledgments.
Now to write a value, the writer increments its timestamp, updates its value and sends write(value, timestamp) to all other processes. The write operation terminates only when the writer has received acknowledgments containing the new timestamp value from a majority of processes.
To read a value, a reader does two steps:
- It sends read(u) to all processes (where u is any value it hasn't used before) and waits to receive acknowledgments from a majority of the processes. It takes the value associated v with the maximum timestamp t as its return value (no matter how many processes sent it).
- It then sends write(v,t) to all processes, and waits for a response ack(v,t) from a majority of the processes. Only then does it return.
(Any extra messages, messages with the wrong nonce, etc. are discarded.)
Intuition: Nobody can return from a write or a read until they are sure that subsequent reads will return the same (or a later) value. A process can only be sure of this if it knows that the values collected by a read will include at least one copy of the value written or read. But since majorities overlap, if a majority of the processes have a current copy of v, then the majority read quorum will include it. Sending write(v,t) to all processes and waiting for acknowledgments from a majority is just a way of ensuring that a majority do in fact have timestamps that are at least t.
Both reads and writes cost Θ(n) messages (Θ(1) per process).
2.2. Proof of linearizability
Of course, we have to prove this. In particular, we want to show that for any trace T of the ABD protocol, there is an trace of an atomic register object that gives the same sequence of invoke and response events. The usual way to do this is to find a linearization of the read and write operations: a total order that extends the observed order in T where op1 < op2 in T if and only if op1 ends before op2 starts. Sometimes it's hard to construct such an order, but in this case it's easy: we can just use the timestamps associated with the values written or read in each operation. Specifically, we define the timestamp of a write or read operation as the timestamp used in the write(v,t) messages sent out during the implementaiton of that operation, and we put op1 before op2 if:
- op1 has a lower timestamp than op2, or
- op1 has the same timestamp as op2, op1 is a write, and op2 is a read, or
op1 has the same timestamp as op2 and op1 <T op2, or
- none of the other cases applies, and we feel like putting op1 first.
The intent is that we pick some total ordering that is consistent with both <T and the timestamp ordering (with writes before reads when timestamps are equal). To make this work we have to show (a) that these two orderings are in fact consistent, and (b) that the resulting ordering produces values consistent with an atomic register: in particular, that each read returns the value of the last preceding write.
Part (b) is easy: since timestamps only increase in response to writes, each write is followed by precisely those reads with the same timestamp ⇒ those that returned the value written.
For part (a), suppose that op1 <T op2. The first case is when op2 is a read. Then before the end of op1, a set S of more than n/2 processes send the op1 process an ack(v1,t1) message. Since local timestamps only increase, from this point on any ack(v2,t2,u) message sent by a process in S has t2 ≥ t1. Let S' of processes sending ack(v2,t2,u) messages processed by op2. Since |S| > n/2 and |S'| > n/2 we have S∩S' is nonempty and so S' contains some ack(v2,t2) with t2 ≥ t1. So op2 is serialized after op1. The second case is when op2 is a write; but then op1 returns a timestamp that precedes the writer's increment in op2, and so again is serialized first.
2.3. Proof that f < n/2 is necessary
This is pretty much the standard partition argument that f < n/2 is necessary to do anything useful in a message-passing system. Split the processes into two sets S and S' of size n/2 each. Suppose the writer is in S. Consider an execution where the writer does a write operation, but all messages between S and S' are delayed; since the writer can't tell if the S' processes are slow or dead, it eventually returns. Now let some reader in S' attempt to read the simulated register, again delaying all messages between S and S'; now the reader is forced to return some value without knowing whether the S processes are slow or dead. If the reader doesn't return the value written, we lose. If by some miracle it does, then we lose in the execution where the write didn't happen and all the processes in S really were dead.
2.4. Multiple writers
So far we have assumed a single writer. The main advantage of this approach is that we don't have to do much to manage timestamps: the single writer can just keep track of its own. With multiple writers we can use essentially the same algorithm, but each write needs to perform an initial round of gathering timestamps so that it can pick a new timestamp bigger than those that have come before. We also extend the timestamps to be of the form (count, id), lexicographically ordered, so that two timestamps with the same count field are ordered by process id. The modified write algorithm is:
- Send read(u) to all processes and wait to receive acknowledgments from a majority of the processes.
- Set my timestamp to t = (max count + 1, id) where the max is taken over all the count fields in the acknowledgments I received. Note that this is a two-field timestamp that is compared lexicographically, with the id field used to prevent duplicate timestamps.
- Send write(v,t) to all processes, a wait for a response ack(v,t) from a majority of processes.
This increases the cost of a write by a constant factor, but in the end we still have only a linear number of messages. The proof of linearizability is essentially the same as for the single-writer algorithm, except now we must consider the case of two write operations by different processes. Here we have that if op1 <T op2, then op1 gets acknowledgments of its write with timestamp t1 from a majority of processes before op2 starts its read phase; since op2 waits of acknowledgments from a majority of processes as well, these majorities overlap, so op2's timestamp t2 must exceed t1. So the linearization ordering previously defined still works.