[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.

Basic idea: assign a timestamp to all events in an AsynchronousMessagePassing system that simulates real time, thereby allowing timing-based algorithms to run despite asynchrony.

1. Foundation: Causal ordering

The underlying notion of a logical clock is causal ordering, a partial order on events that describes when one event e provably occurs before some other event e'.

Given two schedules S and S', let S and S' be similar if S|p = S'|p for all processes p. We can define a causal ordering on the events of some schedule S implicitly by considering all schedules S' similar to S, and declare that e < e' if e precedes e' in all such S. But it is usually more useful to make this ordering explicit.

Following AttiyaWelch §6.1.1, define the happens-before relation ⇒S on a schedule S obtained by including:

  1. All pairs (e, e') where e precedes e' in S and e and e' are events of the same process.
  2. All pairs (e, e') where e is a send event and e' is the receive event for the same message.
  3. All pairs (e, e') where there exists a third event e'' such that e ⇒S e'' and e'' ⇒S e'. (In other words, we take the TransitiveClosure of the relation defined by the previous two cases.)

It is not terribly hard to show that this gives a partial order; the main observation is that if e ⇒S e', then e precedes e' in S. So ⇒S is a subset of the total order <S given by the order of events in S.

A causal shuffle S' of a schedule S is a permutation of S that is consistent with the happens-before relation on S; that is, if e happens-before e' in S, then e precedes e' in S'. The importance of the happens-before relation follows from this lemma:

Lemma 1
Let S' be a permutation of the events in S. Then the following two statements are equivalent:
  1. S' is a causal shuffle of S.
  2. S' is the schedule of an execution fragment of a message-passing system with S|p = S'|p for all S'.
Proof
  • (1 ⇒ 2). We need to show both similarity and that S' corresponds to some execution fragment. We'll show similarity first. Pick some p; then every event at p in S also occurs in S', and they must occur in the same order by the first case of the definition of the happens-before relation. This gets us halfway to showing S' is the schedule of some execution fragment, since it says that any events initiated by p are consistent with p's programming. To get the rest of the way, observe that any other events are receive events. For each receive event e' in S, there must be some matching send event e also in S; thus e and e' are both in S' and occur in the right order by the second case of the definition of happens-before.
  • (2 ⇒ 1). First observe that since every event e in S' occurs at some process p, if S'|p = S|p for all p, then there is a one-to-one correspondence between events in S' and S, and thus S' is a permutation of S. Now we need to show that S' is consistent with ⇒S. Let e ⇒S e'. There are three cases.

    1. e and e' are events of the same process p and e <S e'. But then e <S' e' because S|p = S'|p.

    2. e is a send event and e' is the corresponding receive event. Then e <S' e' because S' is the schedule of an execution fragment.

    3. e ⇒S e' by transitivity. Then each step in the chain connecting e to e' uses one of the previous cases, and e <S' e' by transitivity of <S'.

What this means: if I tell you ⇒S, then you know everything there is to know about the order of events in S that you can deduce from reports from each process together with the fact that messages don't travel back in time. But ⇒S is a pretty big relation (O(n²) bits in the worst case), and seems to require global knowledge of <S to compute. So we can ask if there is some simpler, easily computable description that works almost as well. This is where logical clocks come in.

2. Lamport's logical clocks

Lamport's logical clock runs on top of some other message-passing protocol, adding additional state at each process and additional content to the messages (which is invisible to the underlying protocol). Every process maintains a local logical clock. When a process sends a message or executes an internal step, it sets clock ← clock + 1 and assigns the resulting value as the clock value of the event. If it sends a message, it piggybacks the resulting clock value on the message. When a process receives a message, it sets clock ← max(clock, message timestamp)+1; the resulting clock value is taken as the time of receipt of the message. (To make life easier, we assume messages are received one at a time.)

Claim
If we order all events by clock value, we get an execution of the underlying protocol that is locally indistinguishable from the original execution.
Proof

Let e <L e' if e has a lower clock value than e'. If e and e' are two events of the same process, then e <L e'. If e and e' are send and receive events of the same message, then again e <L e'. So for any events e, e', if e ⇒S e', then e <L e'. Now apply Lemma 1.

3. Welch's logical clocks

Lamport's clock has the advantage of requiring no changes in the behavior of the underlying protocol, but has the disadvantage that clocks are entirely under the control of the logical-clock protocol and may as a result make huge jumps when a message is received. If this is unacceptable—perhaps the protocol needs to do some unskippable maintenance task every 1000 clock ticks—then an alternative approach due to Welch can be used.

Method: Each process maintains its own clock, which it increments whenever it feels like it. To break ties, the process extends the clock value to (clock, process id, event count) where the event count is a count of send and receive events (and possibly local computation steps). As in Lamport's clock, each message in the underlying protocol is timestamped with the current extended clock value. Because the protocol can't change the clock values on its own, when a message is received with a timestamp later than the current clock value, its delivery is delayed until the clock exceeds the message timestamp, at which point the receive event is assigned the extended clock value of the time of delivery.

Claim
If we order all events by clock value, we get an execution of the underlying protocol that is locally indistinguishable from the original execution.
Proof
Again, we have that (a) all events at the same process occur in increasing order (since the event count rises even if the clock value doesn't, and we assume that the clock value doesn't drop) and (b) all receive events occur later than the corresponding send event (since we force them to). So Lemma 1 applies.

The advantage of Welch's clock is that it doesn't impose any assumptions on the clock values, so it is possible to use a real-time clock at each process and nonetheless have a causally-consistent ordering of timestamps even if the local clocks are not perfectly synchronized. If some process's clock is too far off, it will have trouble getting its messages delivered quickly (if its clock is ahead) or receiving messages (if its clock is behind)—the net effect is to add a round-trip delay to that process equal to the difference between its clock and the clock of its interlocutor. But the protocol works well when the processes' clocks are closely synchronized, which has become a plausible assumption in the last 10-15 years thanks to the Network Time Protocol, cheap GPS receivers, and clock synchronization mechanisms built into most cellular phone networks.1

4. Vector clocks

Logical clocks give a superset of the happens-before relation: if e ⇒S e', then e <L e' (or conversely, if e ≮L e', then it is not the case that e ⇒S e'). This is good enough for most applications, but what if we want to compute ⇒S exactly?

Here we can use a vector clock. Instead of a single clock value, each event is stamped with a vector of values, one for each process. When a process executes a local event or a send event, it increments only its own component xp of the vector. When it receives a message, it increments xp and sets each xq to the max of its previous value and the value of xq piggybacked on the message. We define VC(e) ≤ VC(e'), where VC(e) is the value of the vector clock for e, if VC(e)i ≤ VC(e')i for all i.

Claim

Fix a schedule S; then for any e, e', VC(e) < VC(e') if and only if e ⇒S e'.

Proof

The if part follows immediately from the update rules for the vector clock. For the only if part, suppose e does not happen-before e'. Then e and e' are events of distinct processes p and p'. For VC(e) < VC(e') to hold, we must have VC(e)p < VC(e')p; but this can occur only if the value of VC(e)p is propagated to p' by some sequence of messages starting at p and ending at p' at or before e' occurs. In this case we have e ⇒S e'.

5. Consistent snapshots

A consistent snapshot of a message-passing computation is a description of the states of the processes (and possibly messages in transit, but we can reduce this down to just states by keeping logs of messages sent and received) that gives the global configuration at some instant of a schedule that is a consistent reordering of the real schedule (a consistent cut in the terminology of AttiyaWelch §6.1.2). Without shutting down the protocol before taking a snapshot this is the about the best we can hope for in a message-passing system.

Logical time can be used to obtain consistent snapshots: pick some logical time and have each process record its state at this time (i.e. immediately after its last step before the time or immediately before its first step after the time). We have already argued that logical time gives a consistent reordering of the original schedule, so the set of values recorded is just the configuration at the end of an appropriate prefix of this reordering ⇒ it's a consistent snapshot.

If we aren't building logical clocks anyway, there is a simpler consistent snapshot algorithm due to Chandy and Lamport. Here some central initiator broadcasts a snap message, and each process records its state and immediately forwards the snap message to all neighbors when it first receives a snap message. To show that the resulting configuration is a configuration of some consistent reordering, observe that (with FIFO channels) no process receives a message before receiving snap that was sent after the sender sent snap: thus causality is not violated by lining up all the pre-snap operations before all the post-snap ones.

The full Chandy-Lamport algorithm adds a second marker message that is used to sweep messages in transit out of the communications channels, which avoids the need to keep logs if we want to reconstruct what messages are in transit (this can also be done with the logical clock version). The idea is that when a process records its state after receiving the snap message, it issues a marker message on each outgoing channel. For incoming channels, the process all records all messages received between the snapshot and receiving a marker message on that channel (or nothing if it receives marker before receiving snap). A process only reports its value when it has received a marker on each channel. The marker and snap messages can also be combined if the broadcast algorithm for snap resends it on all channels anyway, and a further optimization is often to piggyback both on messages of the underlying protocol if the underlying protocol is chatty enough.

Note that Chandy-Lamport is equivalent to the logical-time snapshot using Lamport clocks, if the snap message is treated as a message with a very large timestamp. For Welch clocks, we get an algorithm where processes spontaneously decide to take snapshots (since Welch clocks aren't under the control of the snapshot algorithm) and delay post-snapshot messages until the local snapshot has been taken. This can be implemented as in Chandy-Lamport by separating pre-snapshot messages from post-snapshot messages with a marker message, and essentially turns into Chandy-Lamport if we insist that a process advance its clock to the snapshot time when it receives a marker.

5.1. Property testing

Consistent snapshots are in principle useful for debugging (since one can gather a consistent state of the system without being able to talk to every process simultaneously), and in practice are mostly used for detecting stable properties of the system. Here a stable property is some predicate on global configurations that remains true in any successor to a configuration in which it is true, or (bending the notion of properties a bit) functions on configurations whose values don't change as the protocol runs. Typical examples are quiescence and its evil twin, deadlock. More exotic examples include total money supply in a banking system that cannot create or destroy money, or the fact that every process has cast an irrevocable vote in favor of some proposal or advanced its Welch-style clock past some threshold.

The reason we can test such properties using consistent snapshot is that when the snapshot terminates with value C in some configuration C', even though C may never have occurred during the actual execution of the protocol, there is an execution which leads from C to C'. So if P holds in C, stability means that it holds in C'.

Naturally, if P doesn't hold in C, we can't say much. So in this case we re-run the snapshot protocol and hope we win next time. If P eventually holds, we will eventually start the snapshot protocol after it holds and obtain a configuration (which again may not correspond to any global configuration that actually occurs) in which P holds.

5.2. Replicated state machines

The main application suggested by Lamport in his original paper was building a replicated state machine, a simulated shared-memory object that is replicated across all processes. Any process can at any time issue an operation on the object by broadcasting it with an attached timestamp. When a process receives an operation, it buffers it in a priority queue ordered by increasing timestamp. It can apply the first operation in the queue only when it can detect that no earlier operation will arrive, which it can do if it sees a message from each other process with a later timestamp (or after a timeout, if we have some sort of clock synchronization guarantee). It is not terribly hard to show that this guarantees linearizability: see LynchBook §18.3.3 for the details. If the processes spam each other regularly with their current clock values, this will occur after at most two message delays (with Lamport clocks) if the clocks are not very well synchronized and after approximately one message delay (with Lamport or Welch clocks) if they are. A process can also execute read operations on its own copy immediately without notifying other processes (if it is willing to give up linearizability for sequential consistency).

Note that replicated state machines assume no failures, so for poorly-synchronized clocks or systems in which sequentially-consistent reads are not good enough, replicated state machines are no better than simply keeping one copy of the object on a single process and having all operations go through that process: 2 message delays + 2 messages per operation for the single copy beats 2 message delays + many messages for full replication. But replicated state machines take less time under good conditions, and when augmented with more powerful tools like consensus or atomic broadcast are the basis of most fault-tolerant implementations of general shared-memory objects.


CategoryDistributedComputingNotes

  1. As I write this, my computer reports that its clock is an estimated 289 microseconds off from the timeserver it is synchronized to, which is less than a tenth of the round-trip delay to machines on the same local-area network and a tiny fraction of the round-trip delay to machines outside the immediate vicinity. (1)


2014-06-17 11:58