For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.
Usual sketchy description here. See AttiyaWelch §10.3 or LynchBook §13.3 for the full story. There is also a reasonably recent survey (2005) by Faith Ellen Fich on upper and lower bounds for the problem at http://www.springerlink.com/content/0w288tqvwmvd0dev/ (or this local copy) .
- Atomic snapshots
- The basic trick: two identical collects equals a snapshot
- The Gang of Six algorithm
- Snapshots with bounded registers
- Faster snapshots using lattice agreement
- Practical snapshots using LL/SC
- Lower bounds
1. Atomic snapshots
An atomic snapshot object acts like a collection of n single-writer multi-reader atomic registers with a special snapshot operation that returns (what appears to be) the state of all n registers at the same time. This is easy without failures: we simply lock the whole register file, read them all, and unlock them to let all the starving writers in. But it gets harder if we want a protocol that is wait-free, where any process can finish its own snapshot or write even if all the others lock up.
2. The basic trick: two identical collects equals a snapshot
Let's tag any value written with a sequence number, so that each value written has a seqno field attached to it that increases over time. We can now detect if a new write has occurred between two reads of the same variable. Suppose now that we repeatedly perform collects---reads of all n registers---until two successive collects return exactly the same vector of values and sequence numbers. We can then conclude that precisely these values were present in the registers at some time in between the two collects. This gives us a very simple algorithm for snapshot. Unfortunately, it doesn't terminate if there are a lot of writers around.1 So we need some way to slow the writers down, or at least get them to do snapshots for us.
3. The Gang of Six algorithm
This is the approach taken by Afek and his five illustrious co-authors in Atomic Snapshots of Shared Memory, JACM 40(4):873-890, 1993, also described in LynchBook §13.3.2: before a process can write to its register, it first has to complete a snapshot and leave the results behind with its write.2 This means that if some slow process (including a slow writer, since now writers need to do snapshots too) is prevented from doing the two-collect snapshot because too much writing is going on, eventually it can just grab and return some pre-packaged snapshot gathered by one of the many successful writers.
Specifically, if a process executing a single snapshot operation sees values written by a single process i with three different sequence numbers s1, s2, and s3, then it can be assured that the snapshot gathered by the write with sequence number s3 started no earlier than s2 was written (and thus no earlier than the snapshot started, since the snapshot read s1 after it started) and ended no later than s3 was written (and thus no later than the snapshot ended). It follows that the snapshot can safely return the vector collected by the s3 write, since that vector represents some time inside the s3 write's interval and thus some time inside the snapshot's interval.
So a snapshot repeatedly does collects until either (a) it gets two identical collects, in which case it can return the results, or (b) it sees three different values from the same process, in which case it can take the snapshot collected by the third write. Amazingly, despite the fact that everybody is trying to do snapshots all the time, a snapshot operation of a single process is guaranteed to terminate after at most n+1 collects. The reason is that in order to prevent case (a) from holding, the adversary has to supply at least one new value in each collect after the first. But it can only supply 1 new value for each of the n-1 processes that aren't doing collects before case (b) is triggered. Adding up all the collects gives 1 + (n-1) + 1 = n+1 collects before one of the cases holds. Since each collect takes n-1 read operations (assuming the process is smart enough not to read its own register), a snapshot operation terminates after at most n2-1 reads.
For a write operation, a process first performs a snapshot, then writes the new value, the new sequence number, and the result of the snapshot to its register (these are very wide registers). The total cost is n2-1 read operations for the snapshot plus 1 write operation.
We now need to argue that the snapshot vectors returned by the Afek et al. algorithm really work, i.e. that between each matching invoke-snapshot and respond-snapshot there was some actual time where the registers in the array contained precisely the values returned in the respond-snapshot action. We do so by assigning a serialization point to each snapshot vector, a time at which it appears in the registers (which for correctness of the protocol had better lie within the interval between the snapshot invocation and response). For snapshots obtained through case (a), take any time between the two collects. For snapshots obtained through case (b), take the serialization point already assigned to the snapshot vector provided by the third write. In the latter case we argue by induction on termination times that the serialization point lies inside the snapshot's interval.
Note that this means that all snapshots were ultimately collected by two successive collects returning identical values, since any case-(b) snapshot sits on top of a finite regression of case-(b) snapshots that must end with a case-(a) snapshot. In practice what this means is that if there are many writers, eventually all of them will stall waiting for a case-(a) snapshot to complete, which happens because all the writers are stuck. So effectively the process of requiring writers to do snapshots first almost gives us a form of locking, but without the vulnerability to failures of a real lock. (In fact, crash failures just mean that there are fewer writers to screw things up, allowing snapshots to finish faster.)
4. Snapshots with bounded registers
The simple version of the Afek et al. algorithm requires unbounded registers (since sequence numbers may grow forever). One of the reasons why this algorithm required so many smart people was to get rid of this assumption: the paper describes a (rather elaborate) mechanism for recycling sequence numbers that prevents unbounded growth (see also LynchBook §13.3.3). In practice, unbounded registers are probably not really an issue once one has accepted very large registers, but getting rid of them is an interesting theoretical problem.
It turns out that with a little cleverness we can drop the sequence numbers entirely. The idea is that we just need a mechanism to detect when somebody has done a lot of writes while a snapshot is in progress. A naive approach would be to have sequence numbers wrap around mod m for some small constant modulus m; this fails because if enough snapshots happen between two of my collects, I may notice none of them because all the sequence numbers wrapped around all the way. But we can augment mod-m sequence numbers with a second handshaking mechanism that detects when a large enough number of snapshots have occurred; this acts like the guard bit on an automobile odometer, than signals when the odometer has overflowed to prevent odometer fraud by just running the odometer forward an extra million miles or so.
The result is the full version of Afek et al. (Our presentation here follows §10.3 of AttiyaWelch.) The key mechanism for detecting odometer fraud is a handshake, a pair of single-writer bits used by two processes to signal each other that they have done something. Call the processes S (for same) and D (for different), and supposed we have handshake bits hS and hD. We then provide operations try-handshake (signal that something is happening) and check-handshake (check if something happened) for each process; these operations are asymmetric. The code is:
hS ← hD (make the two bits the same)
hD ← ¬hS (make the two bits different)
return hS ≠ hD (return true if D changed its bit)
return hS = hD (return true if S changed its bit)
The intent is that check-handshake returns true if the other process called try-handshake after I did. The situation is a bit messy, however, since try-handshake involves two register operations (reading the other bit and then writing my own). So in fact we have to look at the ordering of these read and write events. Let's assume that check-handshake is called by S (so it returns true iff it sees different values). Then we have two cases:
check-handshake(S) returns true. Then S reads a different value in hD from the value it read during its previous call to try-handshake(S). It follows that D executed a write as part of a try-handshake(D) operation in between S's previous read and its current read.
check-handshake(S) returns false. Then S reads the same value in hD as it read previously. This does not necessarily mean that D didn't write hD during this interval—it is possible that D is just very out of date, and did a write that didn't change the register value—but it does mean that D didn't perform both a read and a write since S's previous read.
How do we use this in a snapshot algorithm? The idea is that before performing my two collects, I will execute try-handshake on my end of a pair of handshake bits for every other process. After performing my two collects, I'll execute check-handshake. I will also assume each update (after performing a snapshot) toggles a mod-2 sequence number bit on the value stored in its segment of the snapshot array. The hope is that between the toggle and the handshake, I detect any changes. (See AttiyaWelch Algorithm 30 for the actual code.)
Does this work? Let's look at cases:
The toggle bit for some process q is unchanged between the two snapshots taken by p. Since the bit is toggled with each update, this means that an even number of updates to q's segment occurred during the interval between p's writes. If this even number is 0, we are happy: no updates means no call to try-handshake by q, which means we don't see any change in q's segment, which is good, because there wasn't any. If this even number is 2 or more, then we observe that [p's call to try-handshake] < [p's first read] < [q's first write] < [q's call to try-handshake at the start of its second scan] < [q's second write] < [p's second read] < [p's call to check-handshake]. It follows that q both reads and writes the handshake bits in between p's calls to try-handshake and check-handshake, so p correctly sees that q has updated its segment.
- The toggle bit for q has changed. Then q did an odd number of updates (i.e., at least one), and p correctly detects this fact.
What does p do with this information? Each time it sees that q has done a scan, it updates a count for q. If the count reaches 3, then p can determine that q's last scanned value is from a scan that is contained completely within the time interval of p's scan. Either this is a direct scan, where q actually performs two collects with no changes between them, or it's an indirect scan, where q got its value from some other scan completely contained within q's scan. In the first case p is immediately happy; in the second, we observe that this other scan is also contained within the interval of p's scan, and so (after chasing down a chain of at most n-1 indirect scans) we eventually reach a direct scan contained within it that provided the actual value. In either case p returns the value of pair of adjacent collects with no changes between them that occurred during the execution of its scan operation, which gives us linearizability.
5. Faster snapshots using lattice agreement
The Afek et al. algorithm and its contemporaries all require O(n²) operations for each snapshot. It is possible to get this bound down to O(n) using a more clever algorithm. The usual approach is by reduction to a related problem called lattice agreement.
We'll describe some of the results in Attiya, Herlihy, and Rachman, Atomic snapshots using lattice agreement, Distributed Computing 8:121–132, 1995, which gives randomized linear-work algorithms for snapshots based on this reduction.
A lattice is a partial order in which every pair of elements x, y has a least upper bound x∨y called the join and a greatest lower bound x∧y called the meet (see Relations). In the lattice agreement problem, each process starts with an input xi and produces an output yi, where both are elements of some lattice. The requirements of the problem are:
yi ≤ yj or yj ≤ yi for all i, j.
xi ≤ yi for all i.
yi ≤ x1∨x2∨x3∨...∨xn for all i.
Note that if we are really picky, we can observe that we don't actually need meets; a semi-lattice that provides only joins is enough. In practice we almost always end up with a full-blown lattice, because (a) we are working with finite sets, and (b) we generally want to include a bottom element ⊥ that is less than all the other elements (to represent the "empty" state of our data structure). But any finite join-semi-lattice with a bottom element turns out to be a lattice, since we can define x ∧ y as the join of all elements z such that z ≤ x and z ≤ y. We don't use the fact that we are in a lattice anywhere, but it does save us two syllables not to have to say "semilattice agreement."
For the snapshot algorithm, we also demand wait-freedom: each process terminates after a bounded number of its own steps, even if other processes fail.
These requirements are analogous to the requirements for consensus. Comparability acts like agreement, demanding that the non-equal views returned by the lattice-agreement protocol are totally ordered. Upward-validity acts like validity: you can't return anything in a view not implied by an input, while downward-validity says that each process will incorporate its own input into the output.
Where this maps to snapshot is that we can imagine each writer generates a sequence of increasing timestamps r1, r2, ..., and a snapshot corresponds to some vector of timestamps [t1, t2 ... tn] where ti indicates the most recent write by pi that is included in the snapshot (in other words, we are doing vector clocks yet again). Now define v≤v' if vi≤v'i for all i; the resulting partial order is a lattice, and in particular we can compute x∨y by the rule (x∨y)i = xi∨yi.
Suppose now that we have a bunch of snapshots that satisfy the comparability requirement; i.e., they are totally ordered. Then we can construct a sequential execution by ordering the snapshots in increasing order with each update operation placed before the first snapshot that includes it. This sequential execution is not necessarily a linearization of the original execution, and a single lattice agreement object won't support more than one operation for each process, but the idea is that we can nonetheless use lattice agreement objects to enforce comparability between concurrent executions of snapshot, while doing some other tricks (exploiting, among other things, the validity properties of the lattice agreement objects) to get linearizability over the full execution.
5.1. Reduction to lattice agreement
So here is the actual algorithm. It uses an array of registers Ri to hold round numbers (timestamps); an array Si to hold values to scan; an unboundedly humongous array Vir to hold views obtained by each process in some round; and a collection of lattice-agreement objects LAr, one for each round.
The algorithm makes two attempts to obtain a snapshot. In both cases, the algorithm computes a new round number r equal to the max of all previous round numbers and its own previous round number plus 1, attempts a collect, and then runs lattice-agreement to try to get a consistent view. If after getting its first view it finds that some other process has already advanced to a later round, it makes a second attempt at a new, higher round r' and uses some view that it obtains in this second round, either directly from lattice agreement, or (if it discovers that it has again fallen behind), it uses an indirect view from some speedier process.
- First try
Ri ← r ← max(R1...Rn, Ri+1)
collect ← read(S1...Sn)
view ← LAr(collect)
if max(R1...Rn) > Ri:
- Goto second try
Vir ← view
- Second try
Ri ← r ← max(R1...Rn, Ri+1)
collect ← read(S1...Sn)
view ← LAr(collect)
if max(R1...Rn) > Ri:
Vir ← some nonempty Vjr
Vir ← view
The update operation is the usual update-and-scan procedure:
Si ← (new value, Si.seqno + 1)
- return scan()
5.2. Why this works
We need to show three facts:
- All views returned by the scan operation are comparable; that is, there exists a total order on the set of views (which can be extended to a total order on scan operations by breaking ties using the execution order).
- The view returned by an update operation includes the update (this implies that future views will also include the update, giving the correct behavior for snapshot).
The total order on views respects the execution order: if scan1 <S scan2, then view1 ≤ view2. (This gives us linearization.)
Let's start with comparability. First observe that any view returned is either a direct view (obtained from LAr) or an indirect view (obtained from Vjr for some other process j). In the latter case, following the chain of indirect views eventually reaches some direct view. So all views returned for a given round are ultimately outputs of LAr and thus satisfy comparability etc.
But what happens with views from different rounds? The lattice-agreement objects only operate within each round, so we need to ensure that any view returned in round r is included in any subsequent rounds. This is where checking round numbers after calling LAr comes in.
Suppose some process i returns a direct view; that is, it sees no higher round number in either line 4 of its first try or line 4 of its second try. Then at the time it starts executing line 4, no process has yet written a round number higher than the round number of i's view (otherwise i would have seen it). So no process with a higher round number has yet executed the collect in line 2 (of either try). When it does so, it obtains values that are at least as current as those fed into LAr, and i's round-r view is less than or equal to the vector of these values by upward-validity of LAr and thus less than or equal to the vector of values returned by LAr' (r' > r) by upward-validity. So we have comparability of all direct views, which implies comparability of all indirect views as well.
To show that each view returned by scan includes the preceding update, we observe that either a process returns its first-try scan (which includes the update by downward-validity) or it returns the results of a scan in the second-try round (which includes the update by downward-validity in the later round, since any collect in the second-try round starts after the update occurs). So no updates are missed.
Now let's consider two scan operations where scan1 precedes scan2 in the execution. We want to show that view1 ≤ view2. From the comparability property, the only way this can fail is if view2 < view1 that is, there is some update included in view2 that is not included in view1. But this can't happen; if scan2 starts after scan1 finishes, it starts after any update scan1 sees is already in one of the Sj registers, and so scan2 will include this update in its initial collect.
5.3. Implementing lattice agreement
There are many known algorithms for implementing lattice agreement. The best of them (assuming multi-writer registers) is Inoue et al's linear-time lattice agreement protocol from WDAG 1994. (See http://www.springerlink.com/content/pr00614344543t41/ for the original paper.)
The intuition is to implement lattice agreement using divide-and-conquer. The processes are organized into a tree, with each leaf in the tree corresponding to some process's input. Internal nodes of the tree hold data structures that will report increasingly large subsets of the inputs under them as they become available. At each internal node, a double-collect snapshot is used to ensure that the value stored at that node is always the union of two values that appear in its children at the same time. This is used to guarantee that, so long as each child stores an increasing sequence of sets of inputs, the parent does so also.
Each process ascends the tree updating nodes as it goes to ensure that its value is included in the final result. A rather clever data structure is used to ensure that out-of-date smaller sets don't overwrite larger ones at any node, and the cost of using this data structure and carrying out the double-collect snapshot at a node with m leaves below it is shown to be O(m). So the total cost of a snapshot is O(n + n/2 + n/4 + ... 1) = O(n), giving the linear time bound.
Let's now look at the details of this protocol. There are two main components: the Union algorithm used to compute a new value for each node of the tree, and the ReadSet and WriteSet operations used to store the data in the node. These are both rather specialized algorithms and depend on the details of the other, so it is not trivial to describe them in isolation from each other; but with a little effort we can describe exactly what each component demands from the other, and show that it gets it.
The Union algorithm proceeds as follows:
Perform ReadSet on both children. This returns a set of leaf values.
Perform ReadSet on both children again.
If the values obtained the same in both collects, call WriteSet on the current node to store the union of the two sets and proceed to the parent node. Otherwise repeat the preceding step.
The requirement of the Union algorithm is that calling ReadSet on a given node returns a non-decreasing sequence of sets of values; that is, if ReadSet returns some set S at a particular time and later returns S', then S⊆S'. We also require that the set returned by ReadSet is a superset of the set written by any WriteSet that precedes it, and that it is equal to some such set. This last property only works if we guarantee that the values stored by WriteSet are all comparable (which is shown by induction on the behavior of Union at lower levels of the tree).
Suppose that all these conditions hold; we want to show that the values written by successive calls to Union are all comparable, that is, for any values S, S' written by union we have S⊆S' or S'⊆S. Observe that S = L∪R and S' = L'∪R' where L, R and L', R' are sets read from the children. Suppose that the Union operation producing S completes its snapshot before the operation producing S'. Then L⊆L' (by the induction hypothesis) and R⊆R', giving S⊆S'.
We now show how to implement the ReadSet and WriteSet operations. The main thing we want to avoid is the possibility that some large set gets overwritten by a smaller, older one. The solution is to have m registers a[1..m], and write a set of size s to every register in a[1..s] (each register gets a copy of the entire set). Because register a[s] gets only sets of size s or larger, there is no possibility that our set is overwritten by a smaller one.
Naively, one might think that we could just write directly to a[s] and skip the previous ones, but this makes it harder for a reader to detect that a[s] is occupied. By writing all the previous registers, we make it easy to tell if there is a set of size s or bigger in the sequence, and so a reader can start at the beginning and scan forward until it reaches an empty register, secure in the knowledge that no larger value has been written. Since we want to guarantee that no reader every spends more that O(m) operations on an array of m registers (even if it does multiple calls to ReadSet), we also have it remember the last location read in each call to ReadSet and start there again on its next call. Since we want to guarantee that if I return a value, later readers see it as well, we adopt the same trick as in the ABD algorithm and have readers pick up for dead writers, writing the largest set they encounter so far into any new registers it fits in. So the actual code looks like this:
- p ← |max| (max is initially either ∅ or the last set written by this process)
- while p ≤ |max| do
- for j ← p+1 to |max| do write(max, a[p])
- p ← |max|
- if p = m then return max
- p ← p+1
- temp = read(a[p])
if |temp| > |max| then max ← temp
until |max| > p or temp = ∅ or p = m
- return max
Note that each register is written at most once by each process, so there are only O(n) total write operations. Similarly, with the exception of one extra read per call to ReadSet to see that nothing has changed, each register is read at most once per process, giving O(n) total read operations. The overhead of repeatedly executing ReadSet is bounded by observing that the loop in Union only repeats the collect if some set changes, which can only occur O(m) times at a node above m leaves before there is no room for further changes. So summing this extra overhead over the tree gives an additional O(n) cost, giving a total of O(n) operations per call to lattice-agreement. Getting an O(n) snapshot then follows from the reduction of Attiya et al.
The payoff: unless we do more updates that snapshots, don't want to assume multi-writer registers, have a beef with huge registers, or care about constant factors, it costs no more to do a snapshot than a collect. So in theory we can get away with assuming snapshots pretty much wherever we need them.
6. Practical snapshots using LL/SC
Though atomic registers are enough for snapshots, it is possible to get a much more efficient snapshot algorithm using stronger synchronization primitives. An algorithm of Riany, Shavit, and Touitou (Towards a practical snapshot algorithm, Theoretical Computer Science, 269(1–2):163–201, 2001) uses LoadLinkedStoreConditional objects to build an atomic snapshot protocol with linear-time snapshots and constant-time updates using small registers. We'll give a sketch of this algorithm here.
The RST algorithm involves two basic ideas: the first is a snapshot algorithm for a single scanner (i.e., only one process can do snapshots) in which each updater maintains two copies of its segment, a high copy (that may be more recent than the current scan) and a low copy (that is guaranteed to be no more recent than the current scan). The idea is that when a scan is in progress, updaters ensure that the values in memory at the start of the scan are not overwritten before the scan is completed, by copying them to the low registers, while the high registers allow new values to be written without waiting for the scan to complete. Unbounded sequence numbers, generated by the scanner, are used to tell which values are recent or not.
As long as there is only one scanner, nothing needs to be done to ensure that all scans are consistent. But extending the algorithm to multiple scanners is tricky. A simple approach would be to keep a separate low register for each concurrent scan—however, this would require up to n low registers and greatly increase the cost of an update. Instead, the authors devise a mechanism, called a coordinated collect, that allows the scanners collectively to implement a sequence of virtual scans that do not overlap. Each virtual scan is implemented using the single-scanner algorithm, with its output written to a common view array that is protected from inconsistent updates using LL/SC operations. A scanner participates in virtual scans until it obtains a virtual scan that is useful to it (this means that the virtual scan has to take place entirely within the interval of the process's actual scan operation); the simplest way to arrange this is to have each scanner perform two virtual scans and return the value obtained by the second one.
The paper puts a fair bit of work into ensuring that only O(n) view arrays are needed, which requires handling some extra special cases where particularly slow processes don't manage to grab a view before it is reallocated for a later virtual scan. We avoid this complication by simply assuming an unbounded collection of view arrays; see the paper for how to do this right.
A more recent paper by Fatourou and Kallimanis from PODC 2007 gives improved time and space complexity using the same basic technique.
6.1. Details of the single-scanner snapshot
The single-scanner snapshot is implemented using a shared curr_seq variable (incremented by the scanner but used by all processes) and an array memory of n snapshot segments, each of which is divided into a high and low component consisting of a value and a timestamp. Initially, curr_seq is 0, and all memory locations are initialized to (⊥, 0).
The scan algorithm copies memory to an array view and proceeds as follows:
procedure Scan(): curr_seq ← curr_seq + 1 for j ← 0 to n-1: h ← memory[j].high if h.seq < curr_seq: view[j] ← h.value else: view[j] ← memory[j].low.value
Essentially, Scan copies the first of memory[j].high or memory[j].low that has a sequence number less than the current sequence number.
The update operation for process i proceeds as follows:
procedure Update(value): seq ← curr_seq h ← memory[i].high if seq ≠ h.seq: memory[i].low ← h memory[i].high ← (value, seq)
The idea here is that Update always writes its value to memory[i].high, but preserves the previous value in memory[i].low if it sequence number indicates that it may have been present at the start of the most recent call to Scan.
To show this actually works, we need to show that there is a linearization of the scans and updates that has each scan return precisely those values whose corresponding updates are linearized before it. The ordering is based on when Scan operation S increments curr_seq and when each Update operation U reads it; specifically:
If U reads curr_seq after S increments it, then S < U.
If U reads curr_seq before S increments it and S reads memory[i].high (where i is the process carrying out U) before U writes it, then S < U.
If U reads curr_seq before S increments it, but S reads memory[i].high after U writes it, then U < S.
Updates are ordered based on intervening scans (i.e., U₁ < U₂ if U₁<S and S<U₂ by the above rules), or by the order in which they read curr_seq if there is no intervening scan.
To show this is a linearization, we need first to show that it extends the ordering between operations in the original schedule. Each of the above rules has op₁ < op₂ only if some low-level operation of op₁ precedes some low-level operation of op₂, with the exception of the transitive ordering of two update events with an intervening scan. But in this last case we observe that if U₁<S, then U₁ writes memory[i].high before S reads it, so if U₁ precedes U₂ in the actual execution, U₂ must write memory[i].high after S reads it, implying S<U₂.
Now we show that the values returned by Scan are consistent with the linearization ordering; that, is, for each i, Scan copies to view[i] the value in the last Update by process i in the linearization. Examining the code for Scan, we see that a Scan operation S takes memory[i].high if its sequence number is less than curr_seq, i.e. if the Update operation U that wrote it read curr_seq before S incremented it and wrote memory[i].high before S read it; this gives U<S. Alternatively, if Scan takes memory[i].low, then memory[i].low was copied by some update operation U' from the value written to memory[i].high by some update U that read curr_seq before S incremented it. Here U' must have written memory[i].high before S read it (otherwise S would have taken the old value left by U) and since U precedes U' (being an operation of the same process) it must therefor also have written memory[i].high before S read it. So again we get the first case of the linearization ordering and U<S.
So far we have only shown that S obtains values that were linearized before it, but not that it ignores values that were linearized after it. So now let's consider some U with S < U. Then one of two cases holds:
U reads curr_seq after S increments it. Then U writes a seq value in memory[i].high that is greater than or equal to the curr_seq value used by S; so S returns memory[i].low instead, which can't have a sequence number equal to curr_seq and thus can't be U's value either.
U reads curr_seq before S increments it but writes memory[i].high after S reads it. Now S won't return U's value from memory[i].high (it didn't read it), and won't get it from memory[i].low either (because the value that is in memory[i].high will have seq < curr_seq, and so S will take that instead).
So in either case, if S<U, then S doesn't return U's value. This concludes the proof of correctness.
6.2. Extension to multiple scanners
See the paper for details.
The essential idea: view now represents a virtual scan viewr generated cooperatively by all the scanners working together in some asynchronous round r. To avoid conflicts, we update viewr using LL/SC or compare-and-swap (so that only the first scanner to write wins), and pretend that reads of memory[i] by losers didn't happen. When viewr is full, start a new virtual scan and advance to the next round (and thus the next viewr+1).
7.1. Multi-writer registers from single-writer registers
One application of atomic snapshot is building multi-writer registers from single-writer registers. The idea is straightforward: to perform a write, a process does a snapshot to obtain the maximum sequence number, tags its own value with this sequence number plus one, and then writes it. A read consists of a snapshot followed by returning the value associated with the largest sequence number (breaking ties by process id). (See LynchBook §13.5 for the proof that this actually works.) This turns out to be overkill in practice; there are simpler algorithms that give O(n) cost for reads and writes based on timestamps (see AttiyaWelch §10.2.3).
With additional work it is even possible to eliminate the requirement of multi-reader registers, and get a simulation of multi-writer multi-reader registers that goes all the way down to single-writer single-read registers, or even single-writer single-reader bits. See AttiyaWelch §10.2.1–10.2.2 or LynchBook §13.4 for details.
7.2. Counters and accumulators
Given atomic snapshots, it's easy to build a counter (supporting increment, decrement, and read operations); or, in more generality, an accumulator (supporting increments by arbitrary amounts); or, in even more generality, an object supporting any collection of commutative update operations. The idea is that each process stores in its segment the total of all operations it has performed so far, and a read operation is implemented using a snapshot followed by summing the results. This is a case where it is reasonable to consider multi-writer registers in building the snapshot implementation, because there is not necessarily any circularity in doing so.
7.3. Resilient snapshot objects
The previous examples can be generalized to objects with operations that either read the current state of the object but don't update it or update the state but return nothing, provided the update operations either overwrite each other (so that Cxy = Cy or Cyx = Cx) or commute (so that Cxy = Cyx). This was shown by Aspnes and Herlihy in their SPAA 1990 paper Wait-free data structures in the asynchronous PRAM model and improved on by Anderson and Moir in their WDAG 1993 paper Towards a necessary and sufficient condition for wait-free synchronization by eliminating unbounded space usage (this paper also defined the terms snapshot objects for those with separate read and update operations and resilience for the property that all operations commute or overwrite). The basic idea underneath both of these papers is to use the multi-writer register construction given above, but break ties among operations with the same sequence numbers by first placing overwritten operations before overwriting operations and only then using process ids.
This almost shows that snapshots can implement any object with consensus number 1 where update operations return nothing, because an object that violates the commute-or-overwrite condition in some configuration has consensus number at least 2 (see WaitFreeHierarchy). It doesn't quite work (as observed in the Anderson-Moir paper), because the tie-breaking procedure assumes a static ordering of operations, so that given operations x and y where y overwrites x, y overwrites x in any configuration. But there may be objects with dynamic ordering, where y overwrites x in some configuration, x overwrites y in another, and perhaps even the two operations commute in yet another. This prevents us from achieving consensus, but also breaks the tie-braking technique.
8. Lower bounds
There is a lower bound of n-1 on the worst-case cost of a scan operation, even for single-scanner snapshots: see JayantiTanToueg.
This isn't always a problem, since there may be external factors that keep the writers from showing up too much. Maurice Herlihy and I got away with using exactly this snapshot algorithm in this ancient paper. (1)
Historically, this was one of three independent solutions to the problem that appeared simultaneously; a similar algorithm for composite registers was given by James Anderson and a somewhat different algorithm for consistent scan was given by Aspnes and Herlihy. The Afek et al. algorithm had the advantage of using bounded registers (in its full version), and so it and its name for atomic snapshot prevailed over its competitors. (2)