Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

There are three main ways to look at distributed systems, each of which provides a slightly different perspective from the others.

From a very theoretical perspective, we can build a model of a distributed system in terms of e.g. communicating automata and ask what tasks we can do in this model under various assumptions about scheduling, failures, and so forth. Much of the research in this area sits right on the edge of impossibility, where dropping a single beneficial assumption makes what you want to do impossible and adding a few more beneficial assumptions makes what you want to do easy (or at least not hard). (To find out more about this, take CS425.)

From a more practical perspective, we can say that we don't mind assuming properties of our system that the real world actually has (like timeouts or reliable clocks). This gives us a more useful model and allows the use of standard coordination algorithms of the sort described in SGG Chapter 18.

From a still more practical perspective, we can adopt the principle that "If we build this right, we won't need any algorithms"1, and aim for distributed systems that don't require much coordination or global consistency. Most of the distributed systems we actually use fall in this last category.

# 1. Examples of distributed systems

Here are the four main distributed systems normal people actually use:

1. The vast collection of routing mechanisms that make up the Internet. Most of these are very local, in the form of routing tables maintained by humans or by (fairly straightforward) route-discovery algorithms.
2. The domain name system (DNS). Essentially a large hierarchical database for translating domain names like www.cs.yale.edu into IP addresses like 128.36.229.30. The underlying mechanism consists of the user propagating RPC-like queries up through a tree of nameservers, to find first the nameserver for .edu, then for .yale.edu, then for .cs.yale.edu, and then finally to find the target host. Fault-tolerance is obtained through replication, scalability through caching, and sound economics through carefully assigning most of the cost of a query to an organization that cares about the user getting the answer (i.e. Yale ITS runs the .yale.edu nameserver). Note that very little coordination is required: domain records don't change very fast, and can be published in a central location.

3. The World Wide Web. Structurally not all that different from the bottom layer of DNS, with webservers replacing nameservers.
4. The SMTP-based email system. This is to packet routing what the web is to DNS: a store-and-forward system implemented at a high level in the protocol stack rather than down in the routers. Reliability is important (TwoGenerals comes up here, as my mail server can't drop an outgoing message until it is sure that your mail server has picked it up), but for the most part there is no need to do any sort of global coordination or consistency enforcement.

The common pattern in each case: to build a successful Internet-scale distributed system, it appears to be that case that you need an application that doesn't require centralized coordination, that allows newcomers to join easily, and that scales in economic terms by lining up costs and benefits so that they are roughly balanced for most potential users. This also characterizes other large-scale distributed systems like UUCP (in the old days) or various peer-to-peer content distribution systems (more recently).

# 2. Distributed coordination

For some distributed systems, it's necessary to get more coordination between components. The classic example is banking: we want to ensure that if I send you \$100, both our banks agree that the transaction took place. Such coordination turns out to be surprisingly difficult if we are not careful.

In the absence of failures, the problem of distributed coordination is essentially just the problem of ConcurrencyControl in a distributed setting, and we can adapt the approaches we've already seen for multiprocessors, primarily mutual exclusion. But we need to be careful to make sure that whatever we use works when our only underlying communication mechanism is message passing. We also have to be careful in that we no longer have as much control over the ordering of events: where a write to a memory location can be treated as atomic operation (assuming our hardware is set up right), transmitting a message across a network necessarily takes time. This can lead to confusion between widely-distributed processors about when particular events happen, or even the order in which they happen.

# 3. Timestamps

We can clear up the confusion somewhat by assigning our own synthetic times, or timestamps to all events. We'd like this assignment to be consistent with what we observe: in particular, the timestamps of successive events at the same machine should be increasing (a property called monotonicity) and the timestamp at which a message is received should exceed the timestamp for when it is sent (consistency). One way to do this would be to use very carefully synchronized clocks. However, there is a simpler approach—known as a logical clock—that just uses counters.

Suppose each machine has a counter for the number of events it has processed; this counter always rises by 1 at each event, so we get the monotonicity property. Whenever we send a message, we attach the timestamp at the sender; the receiver updates its local clock to max(message_timestamp, local_clock)+1. This process also preserves monotonicity (since the new value of local clock is at least as big as the old value plus 1) but in addition gives us consistency (since the message is received later than it is sent). The only real downside of this system is that the local clocks may advance very quickly if some process goes nuts and starts sending out huge timestamps, but we can probably defend against this by dropping messages if the timestamps are implausibly large.

# 4. Distributed mutual exclusion

So now we'd like to block off exclusive access to some resource for some interval of time (where time is now a potentially very squishy logical time). There are several options:

Centralized coordinator

To obtain a lock, a process sends a request message to the coordinator. The coordinator marks the lock as acquired and responds with a reply message. When the initial process is done with the lock, it releases it with a release message. Bad things happen if any of these messages are lost or either process fails (we can use retransmissions and timeouts to work around this). An advantage is that the coordinator can implement any scheduling algorithm it likes to ensure fairness or other desirable guarantees, and that only 3 messages are sent per entry into the critical section.

Token passing
We give a unique token to some process initially. When that process is done with the lock (or if it didn't need to acquire it in the first place), it passes the token on to some other process. Repeat forever. Advantage: no central coordinator. Disadvantage: have to organize the processes into a ring to avoid starvation; token can be lost (or worse, duplicated!); if nobody needs the lock the token still spins through the ring. This is a still a pretty common algorithm for human collaborators not blessed with good distributed version control systems.
Timestamp algorithm

This is similar to the ticket machine approach used in delicatessens (and bakeries, although that risks confusion with the similar Bakery algorithm for shared-memory). To enter a critical section, a process p generates a timestamp t and sends request(p, t) to all of the other processes in the system. Each process q sends back a reply(p, t) message provided (a) q is not already in a critical section, and (b) either q is idle (not attempting to enter a critical section) or q has a higher timestamp than p. If q doesn't send back a reply immediately, it queues p's request until it can. This algorithm has the property of ensuring mutual exclusion since for each pair of conflicting machines p and q, the one with the smaller timestamp won't send a reply to the other until it is done. It guarantees deadlock-freedom and fairness because the process with the smallest timestamp always wins (in the absence of cheating, which we have to assume here). It uses 2(n-1) messages per critical section, which is more expensive than a centralized approach, but could be cheaper than token-passing. The main difficulty with the algorithm is that it doesn't scale well: each process needs to know the identity of all the other processes, so it works best for small, stable groups.

# 5. Distributed transactions

If we are worried about failures, we may want to go beyond a simple mutual exclusion approach and provide actual atomic transactions. Here atomicity means that the transaction either occurs in full (i.e. every participant updates its local state) or not at all (no local state changes). Mutexes are not enough to guarantee this because the process holding the critical section might fail in the middle—and if this occurs, there may be no way to recover a consistent state even if we can break the lock and restore access to the underlying data.

Instead we need a distributed commit protocol that allows all the processors participating in a transaction to agree when the transaction has completed—if this protocol fails, we will do a rollback of each processor's state to what it was before the transaction started (this requires keeping enough history information around to do this).

The simplest distributed commit protocol is known as two-phase commit and uses a central coordinator. However, it does not require that the coordinator survive the full execution of the protocol to ensure atomicity (but recall that aborting the transaction ensures atomicity). It assumes the existence of stable storage (e.g. disk drives) whose contents survive crashes.

The algorithm proceeds as follows, when committing a transaction T (see SGG §18.3.1 for more details).

Phase 1
1. Coordinator writes prepare(T) to its log.

2. Coordinator sends prepare(T) message to all the participants in T.

3. Each participant replies by writing fail(T) or ready(T) to its log.

4. Each participant then sends a message fail(T) or ready(T).

Phase 2
1. Coordinator waits to receive replies from all participants or until a timeout expires.
2. If it gets ready(T) from every participant, it may commit the transaction by writing commit(T) to its log and sending commit(T) to all participants; otherwise, it writes and sends abort(T).

3. Each participant records the message it received from the coordinator in its log. In the case of an abort, it also undoes any changes it made to its state as part of the transaction.

Failure of sites is handled through a recovery process. A non-coordinator can detect whether the transaction committed or aborted from its log entries except if the log contains only a ready(T) record. In this case it must either ask the coordinator what it did (assuming the coordinator has come up), or wait for some other site to tell it that it has a commit(T) or abort(T) record in its log.

Failure of the coordinator is trickier. Temporary failures are not too bad; the coordinator can consult its log when it recovers to decide if T was committed or not. Permanent failures are trickier. Here in the worst case each participant in T has a ready message only in its log, and it is impossible to detect whether the transaction committed without waiting for the coordinator to recover. It is possible to demonstrate theoretically that under certain plausible assumptions, any distributed commit protocol has this property, that the permanent failure of some process in an asynchronous system may cause the protocol itself to fail (this is the well-known Fischer-Lynch-Paterson impossibility result).

# 6. Agreement protocols

There are various ways to get around the FLP impossibility result; the most practical use mechanisms that in the theoretical literature are modeled as abstract FailureDetectors but that in practice tend to specifically involve using timeouts. The problem usually solved is the problem of agreement, where we want all the participants to agree on some value (the agreement condition) after some finite amount of time (the termination condition), where the value is one initially proposed by at least one of the participants (the validity condition). It is easy to see that if we have such an agreement protocol we can solve the distributed commit problem (run the protocol to agree on committing vs aborting). In some circumstances we may even be able to use an agreement protocol more directly to agree on what update was performed to the shared data structure.

There are many protocols for implementing agreement. The best practical protocol for systems with crash failures may be Paxos, a voting-based protocol due to Leslie Lamport. More sophisticated protocols are needed if nodes in the system can misbehave in an attempt to subvert the protocol, a condition known as a Byzantine fault. We won't talk about these in CS422, but you can read about them on the ByzantineAgreement page from CS425.

1. Attributed to Rick_Rashid. (1)

2014-06-17 11:58