TheoryOfDistributedComputing

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.

Distributed computing systems are characterized by their structure: a typical distributed computing system will consist of some large number of interacting devices that each run their own programs but that are affected by receiving messages or observing shared-memory updates from other devices. Examples of distributed computing systems range from simple systems in which a single client talks to a single server to huge amorphous networks like the Internet as a whole.

As distributed systems get larger, it becomes harder and harder to predict or even understand their behavior. Part of the reason for this is that we as programmers have not yet developed the kind of tools for managing complexity (like subroutines or objects with narrow interfaces, or even simple structured programming mechanisms like loops or if/then statements) that are standard in sequential programming. Part of the reason is that large distributed systems bring with them large amounts of inherent nondeterminism—unpredictable events like delays in message arrivals, the suddent failure of components, or in extreme cases the nefarious actions of faulty or malicious machines opposed to the goals of the system as a whole. Because of the unpredictability and scale of large distributed systems, it can often be difficult to test or simulate them adequately. Thus there is a need for theoretical tools that allow us to prove properties of these systems that will let us use them with confidence.

The first task of any theory of distributed systems is modeling: defining a mathematical structure that abstracts out all relevant properties of a large distributed system. There are many foundational models for distributed systems, but for CS425 we will follow AttiyaWelch and use simple automaton-based models. Here we think of the system as a whole as passing from one global state to another in response to events, e.g. local computation at some processor, an operation on shared memory, or the delivery of a message by the network. The details of the model will depend on what kind of system we are trying to represent:

MessagePassing models correspond to systems where processes communicate by sending messages through a network. In synchronous message-passing, every process sends out messages at time t that are delivered at time t+1, at which point more messages are sent out that are delivered at time t+2, and so on: the whole system runs in lockstep, marching forward in perfect synchrony. Such systems are difficult to build when the components become too numerous or too widely dispersed, but they are often easier to analyze than asynchronous systems, where messages are delivered eventually after some unknown delay. Variants on these models include semi-synchronous systems, where message delays are unpredictable but bounded, and various sorts of timed systems. Further variations come from restricting which processes can communicate with which others, by allowing various sorts of failures (crash failures that stop a process dead, Byzantine failures that turn a process evil, or omission failures that drop messages in transit), or—on the helpful side—by supplying additional communications tools like ReliableBroadcast or FailureDetectors.
SharedMemory models correspond to systems where processes communicate by executing operations on shared objects that in the simplest case are typically simple memory cells supporting read and write operations, but which could be more complex hardware primitives like CompareAndSwap, LoadLinkedStoreConditional, atomic queues, or more exotic objects from the seldom-visited theoretical depths. Practical shared-memory systems may be implemented as DistributedSharedMemory on top of a message-passing system in various ways. Shared-memory systems must also deal with issues of asynchrony and failures, both in the processes and in the shared objects.
Other specialized models emphasize particular details of distributed systems, such as the labeled-graph models used for analyzing routing or the topological models used to represent some specialized agreement problems.

We'll see many of these at some point in CS425, and examine which of them can simulate each other under various conditions.

Properties we might want to prove about a model include:

Safety properties, of the form "nothing bad ever happens" or more precisely "there are no bad reachable states of the system." These include things like "at most one of the traffic lights at the intersection of Busy and Main is ever green." Such properties are typically proved using invariants, properties of the state of the system that are true initially and that are preserved by all transitions; this is essentially a disguised InductionProof.
Liveness properties, of the form "something good eventually happens." An example might be "my email is eventually either delivered or returned to me." These are not properties of particular states (I can unhappily await the eventual delivery of my email for decades without violating the liveness property just described), but of executions, where the property must hold starting at some finite time. Liveness properties are generally proved either from other liveness properties (e.g., "all messages in this message-passing system are eventually delivered") or from a combination of such properties and some sort of timer argument where some progress metric improves with every transition and guarantees the desirable state when it reaches some bound (also a disguised InductionProof).
Fairness properties are a strong kind of liveness property of the form "something good eventually happens to everybody." Such properties exclude starvation, where most of the kids are happily chowing down at the orphanage ("some kid eventually eats something" is a liveness property) but poor Oliver Twist is dying for lack of gruel in the corner.
Simulations show how to build one kind of system from another, such as a reliable message-passing system built on top of an unreliable system (TCP), a shared-memory system built on top of a message-passing system (DistributedSharedMemory), or a synchronous system build on top of an asynchronous system (Synchronizers).
Impossibility results describe things we can't do. For example, the classic TwoGenerals impossibility result says that it's impossible to guarantee agreement between two processes across an unreliable message-passing channel if even a single message can be lost. Other results characterize what problems can be solved if various fractions of the processes are unreliable, or if asynchrony makes timing assumptions impossible. These results, and similar lower bounds that describe things we can't do quickly, include some of the most technically sophisticated results in distributed computing. They stand in contrast to the situation with sequential computing, where the reliability and predictability of the underlying hardware makes proving lower bounds extremely difficult.

CategoryDistributedComputingNotes