*For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.*

Contents

The gold standard for shared-memory objects is **wait-freedom**: I can finish my operation in a bounded number of steps no matter what anybody else does. Like the gold standard in real life, this can be overly constraining. So researchers have since developed several weaker progress guarantees that are nonetheless useful. The main ones are:

- Lock-freedom
An implementation is

**lock-free**if infinitely many operations finish in any infinite execution. In simpler terms, somebody always makes progress, but maybe not you. (Also called**non-blocking**.)- Obstruction-freedom
An implementation is

**obstruction-free**if, starting from any reachable configuration, any process can finish in a bounded number of steps if all of the other processes stop. This definition was proposed in 2003 by Herlihy, Luchangco, and Moir. In lower bounds (e.g., JayantiTanToueg) this is often called**solo-terminating**.

Both of these properties exclude traditional lock-based algorithms, where some process grabs a lock, updates the data structure, and then release the lock; if this process halts, no more operations finish. Both properties are also weaker than wait-freedom. It is not hard to show that lock-free is a stronger condition that obstruction-freedom: given a lock-free implementation, if we can keep some single process running forever in isolation, we get an infinite execution with only finitely many completed operations. So we have a hierarchy wait-free > lock-free > obstruction-free > locking.

# 1. Why build obstruction-free algorithms?

The pitch is similar to the pitch for building locking algorithms: an obstruction-free algorithm might be simpler to design, implement, and reason about than a more sophisticated algorithm with stronger properties. Unlike locking algorithms, an obstruction-free algorithm won't fail because some process dies holding the lock; instead, it fails if more than one process runs the algorithm at the same time. This possibility may be something we can avoid by building a **contention manager**, a high-level protocol that detects contention and delays some processes to avoid it (say, using randomized exponential back-off).

# 2. Examples

## 2.1. Lock-free implementations

Pretty much anything using compare-and-swap or LL/SC ends up being lock-free. A simple example would be a counter, where an increment operation does

- x ← load-linked(c)
- store-conditional(c, x+1)

This is lock-free (the only way to prevent a store-conditional from succeeding is if some other store-conditional succeeds, giving infinitely many successful increments) but not wait-free (I can starve). It's also obstruction-free, but since it's already lock-free we don't care about that.

Similarly, suppose we are doing AtomicSnapshots. We know that there exist wait-free implementations of atomic snapshots, but they are subtle and confusing. So we want to do something simpler, and hope that we at least get obstruction-freedom.

If we do double-collects, that is, we have updates just write to a register and have snapshots repeatedly collect until they get two collects in a row with the same values, then any snapshot that finishes is correct (assuming no updaters ever write the same value twice, which we can enforce with nonces). This isn't wait-free, because we can keep a snapshot going forever by doing a lot of updates. It *is* lock-free, because we have to keep doing updates to make this happen.

We can make this merely obstruction-free if we work hard (there is no reason to do this, but it illustrates the difference between lock-freedom—good—and obstruction-freedom—not so good). Suppose that every process keeps a count of how many collects it has done in a register that is included in other process's collects (but not its own). Then two concurrent scans can stall each other forever (the implementation is not lock-free), but if only one is running it completes two collects in O(n) operations without seeing any changes (it is obstruction-free).

Similar things happen with SoftwareTransactionalMemory. Suppose that I have an implementation of multiword compare-and-swap, and I want to carry out a transaction. I read all the values I need, then execute an MCAS operation that only updates if these values have not changed. The resulting algorithm is lock-free (if my transaction fails, it's because some update succeeded). If however I am not very clever and allow some values to get written outside of transactions, then I might only be obstruction-free.

So far we don't have any good examples of why we would want to be obstruction-free if our algorithm is based on CAS. So let's describe the case Herlihy et al. suggested.

## 2.2. An obstruction-free deque

A *deque* is a generalized queue that supports push and pop at both ends (thus it can be used as either a queue or a stack, or both). A classic problem in shared-memory objects is to build a deque where operations at one end of the deque don't interfere with operations at the other end. While there exist lock-free implementation with this property, there is a particularly simple implementation using CAS that is only obstruction-free.

Here's the idea: we represent the deque as an infinitely-long array of compare-and-swap registers (this is a simplification from the paper, which gives a bounded implementation of a bounded deque). The middle of the deque holds the actual contents. To the right of this region is an infinite sequence of *right null* (RN) values, which are assumed never to appear as a pushed value. To the left is a similar infinite sequence of *left null* (LN) values. Some magical external mechanism (called an *oracle* in the paper) allows processes to quickly find the first null value at either end of the non-null region; the correctness of the protocol does not depend on the properties of the oracle, except that it has to point to the right place at least some of the time in a solo execution. We also assume that each cell holds a version number whose only purpose is to detect when somebody has fiddled with the cell while we aren't looking (if we use LL/SC, we can drop this).

Here is the code for right-push and right-pop (the code for left-push and left-pop is symmetric):

- right-push(v)
- while true do:
- k ← oracle(right)
- prev ← a[k-1]
- next ← a[k]
- if prev.value ≠ RN and next.value = RN:
- if CAS(a[k-1], prev, [prev.value, prev.version+1]):
- if CAS(a[k], next, [v, next.version+1]):
- we win, go home

- if CAS(a[k], next, [v, next.version+1]):

- if CAS(a[k-1], prev, [prev.value, prev.version+1]):
- else keep trying

- while true do:
- right-pop()
- while true do:
- k ← oracle(right)
- cur ← a[k-1]
- next ← a[k]
- if cur.value ≠ RN and next.value = RN:
- if cur.value = LN and A[k-1] = cur:
- return empty

- else if CAS(a[k], next, [RN, next.version+1]):
- if CAS(a[k-1], cur, [RN, cur.version+1]):
- return cur.value

- if CAS(a[k-1], cur, [RN, cur.version+1]):

- if cur.value = LN and A[k-1] = cur:
- else keep trying

- while true do:

It's easy to see that in a solo execution, if the oracle doesn't lie, either operation finishes and returns a plausible value after O(1) operations. So the implementation is obstruction-free. But is it also correct?

To show that it is, we need to show that any execution leaves the deque in a sane state, in particular that it preserves the invariant that the deque consists of left-nulls followed by zero or more values followed by right-nulls, and that the sequence of values in the queue is what it should be.

This requires a detailed case analysis of which operations interfere with each other, which can be found in the original paper. But we can give some intuition here. The two CAS operations in right-push or right-pop succeed only if neither register was modified between the preceding read and the CAS. If both registers are unmodified at the time of the second CAS, then the two CAS operations act like a single two-word CAS, which replaces the previous values (top, RN) with (top, value) in right-push or (top, val) with (top, RN) in right-pop; in either case the operation preserves the invariant. So the only way we get into trouble is if, for example, a right-push does a CAS on a[k-1] (verifying that it is unmodified and incrementing the version number), but then some other operation changes a[k-1] before the CAS on a[k]. If this other operation is also a right-push, we are happy, because it must have the same value for k (otherwise it would have failed when it saw a non-null in a[k-1]), and only one of the two right-pushes will succeed in applying the CAS to a[k]. If the other operation is a right-pop, then it can only change a[k-1] after updating a[k]; but in this case the update to a[k] prevents the original right-push from changing a[k]. With some more tedious effort we can similarly show that any interference from a left-push or left-pop either causes the interfering operation or the original operation to fail. This covers 4 of the 16 cases we need to consider. The remaining cases will be brushed under the carpet to avoid further suffering.

# 3. Boosting obstruction-freedom to wait-freedom

Naturally, having an obstruction-free implementation of some object is not very helpful if we can't guarantee that some process eventually gets its unobstructed solo execution. In general, we can't expect to be able to do this without additional assumptions; for example, if we could, we could solve consensus using the racing-counters algorithm with no randomization at all.^{1} So we need to make some sort of assumption about timing, or find somebody else who has already figured out the right assumption to make.

Those somebodies turn out to be Faith Ellen Fich, Victor Luchangco, Mark Moir, and Nir Shavit, who show how to boost obstruction-freedom to wait-freeom in their paper Obstruction-free algorithms can be practically wait-free from DISC 2005. The timing assumption is **unknown-bound semisynchrony**, which means that in any execution there is some maximum ratio R between the shortest and longest time interval between any two consecutive steps of the same non-faulty process, but the processes don't know what this ratio is.^{2} In particular, if I can execute more than R steps without you doing anything, I can reasonably conclude that you are dead—the semisynchrony assumption thus acts as a kind of weak failure detector.

The fact that R is unknown might seem to be an impediment to using this failure detector, but we can get around this. The idea is to start with a small guess for R; if a process is suspected but then wakes up again, we increment the guess. Eventually, the guessed value is larger than the correct value, so no live process will be falsely suspected after this point.

To get solo execution, when a process detects a conflict (because their operation didn't finish quickly), it enters into a "panic mode" where processes take turns trying to finish unmolested. A fetch-and-add register is used as a timestamp generator, and only the process with the smallest timestamp gets to proceed. However, if this process is too sluggish, other processes may give up and overwrite its low timestamp with ∞, temporarily ending its turn. If the sluggish process is in fact alive, it can restore its low timestamp and kill everybody else, allowing it to make progress until some other process declares it dead again.

The simulation works because eventually the mechanism for detecting dead processes stops suspecting live ones (using the technique described above), so the live process with the winning timestamp finishes its operation without interference. This allows the next process to proceed, and eventually all live processes complete any operation they start, giving the wait-free property.

Here is the actual code. It's a rather long algorithm but most of the details are just bookkeeping.

- invoke(op)
- if ¬PANIC:
- execute up to B steps of the underlying algorithm
- if we are done:
- return

PANIC ← true

*# enter panic mode*- my-timestamp ← fetch-and-increment()
A[i] ← 1

*# reset my activity counter*- while true:
- T[i] ← my-timestamp
- min-timestamp ← my-timestamp
- winner ← i
for j ← 1..n, j≠i:

*# look for winner, slaughter non-winners*- other-timestamp ← T[j]
if other-timestamp < min-timestamp:

T[winner] ← ∞

*# not looking so winning any more*- min-timestamp ← other-timestamp
- winner ← j

- else
if other-timestamp < ∞:

- T[j] ← ∞

- if i = winner:
- repeat:
- execute up to b steps of the underlying algorithm
- if we are done:
- T[i] ← ∞
- PANIC ← false
- return

- else
- A[i] ← A[i] + 1
- PANIC ← true

- until while T[i] = ∞:

- repeat:
- else:
- repeat:
- a ← A[winner]
- wait a steps
- winner-timestamp ← T[winner]

- until a = A[winner] or winner-timestamp ≠ min-timestamp
- if winner-timestamp = min-timestamp:
T[winner] ← ∞

*# left loop because a didn't change; kill winner for inactivity*

- repeat:

- if ¬PANIC:

The preamble before entering PANIC mode is a fast-path computation that allows a process that actually is running in isolation to skip testing any timestamps or doing any extra work (except for the one register read of PANIC). The assumption is that the constant B is set high enough that any process generally will finish its operation in B steps without interference. If there is interference, then the timestamp-based mechanism kicks in: we grab a timestamp out of the convenient fetch-and-add register and start slugging it out with the other processes.

(A side note: while the algorithm as presented in the paper assumes a fetch-and-add register, any timestamp generator that delivers increasing values over time will work. So if we want to limit ourselves to registers, we could generate timestamps by taking snapshots of previous timestamps, adding 1, and appending process ids for tie-breaking.)

Once I have a timestamp, I try to knock all the higher-timestamp processes out of the way (by writing ∞ to their timestamp registers). If I see a smaller timestamp than my own, I'll drop out myself (T[i] ← ∞), and fight on behalf of its owner instead. At the end of the j loop, either I've decided I am the winner, in which case I try to finish my operation (periodically checking T[i] to see if I've been booted), or I've decided somebody else is the winner, in which case I watch them closely and try to shut them down if they are too slow (T[winner] ← ∞). I detect slow processes by inactivity in A[winner]; similarly, I signal my own activity by incrementing A[i]. The value in A[i] is also used as an increasing guess for the time between increments of A[i]; eventually this exceeds the R(b+O(1)) operations that I execute between incrementing it.

We still need to prove that this all works. The essential idea is to show that whatever process has the lowest timestamp finishes in a bounded number of steps. To do so, we need to show that other processes won't be fighting it in the underlying algorithm. Call a process *active* if it is in the loop guarded by the "if i = winner" statement. Lemma 1 from the paper states:

- Lemma 1
- If processes i and j are both active, then T[i] = ∞ or T[j] = ∞.
- Proof
Assume without loss of generality that i last set T[i] to my-timestamp in the main loop after j last set T[j]. In order to reach the active loop, i must read T[j]. Either T[j] = ∞ at this time (and we are done, since only j can set T[j] < ∞), or T[j] is greater than i's timestamp (or else i wouldn't think it's the winner). In the second case, i sets T[j] = ∞ before entering the active loop, and again the claim holds.

The next step is to show that if there is some process i with a minimum timestamp that executes infinitely many operations, it increments A[i] infinitely often (thus eventually making the failure detector stop suspecting it). This gives us Lemma 2 from the paper:

- Lemma 2
- Consider the set of all processes that execute infinitely many operations without completing an operation. Suppose this set is non-empty, and let i hold the minimum timestamp of all these processes. Then i is not active infinitely often.
- Proof
Suppose that from some time on, i is active forever, i.e., it never leaves the active loop. Then T[i] < ∞ throughout this interval (or else i leaves the loop), so for any active j, T[j] = ∞ by the preceding lemma. It follows that any active T[j] leaves the active loop after b+O(1) steps of j (and thus at most R(b+O(1)) steps of i). Can j re-enter? If j's timestamp is less than i's, then j will set T[i] = ∞, contradicting our assumption. But if j's timestamp is greater than i's, j will not decide it's the winner and will not re-enter the active loop. So now we have i alone in the active loop. It may still be fighting with processes in the initial fast path, but since i sets PANIC every time it goes through the loop, and no other process resets PANIC (since no other process is active), no process enters the fast path after some bounded number of i's steps, and every process in the fast path leaves after at most R(B+O(1)) of i's steps. So eventually i is in the loop alone forever—and obstruction-freedom means that it finishes its operation and leaves. This contradicts our initial assumption that i is active forever.

So now we want to argue that our previous assumption that there exists a bad process that runs forever without winning leads to a contradiction, by showing that the particular i from the Lemma actually finishes (note that the Lemma doesn't quite do this—we only show that i finishes if it stays active long enough, but maybe it doesn't stay active).

Suppose i is as in the preceding Lemma. The Lemma says that i leaves the active loop infinitely often. So in particular it increments A[i] infinitely often. After some finite number of steps, A[i] exceeds the limit R(b+O(1)) on how many steps some other process can take between increments of A[i]. For each other process j, either j has a lower timestamp than i and thus finishes in a finite number of steps (from the premise of the choice of i), or j has a higher timestamp than i. Once we have cleared out all the lower-timestamp processes, we follow the same logic as in the proof of the Lemma to show that eventually (a) i sets T[i] < ∞ and PANIC = true, (b) each remaining j observes T[i] < ∞ and PANIC = true and reaches the waiting loop, (c) all such j wait long enough (since A[i] is now very big) that i can finish its operation. This contradicts the assumption that i never finishes the operation and completes the proof.

## 3.1. Cost

If the parameters are badly tuned, the potential cost of this construction is quite bad. For example, the slow increment process for A[i] means that the time a process spends in the active loop even after it has defeated all other processes can be as much as the square of the time it would normally take to complete an operation alone—and every other process may pay R times this cost waiting. This can be mitigated to some extent by setting b high enough that a winning process is likely to finish in its first unmolested pass through the loop (recall that it doesn't detect that the other processes have reset T[i] until after it makes its attempt to finish). An alternative might be to double A[i] instead of incrementing it at each pass through the loop. However, it is worth noting (as the authors do in the paper) that nothing prevents the underlying algorithm from incorporating its own contention management scheme to ensure that most operations complete in B steps and PANIC mode is rarely entered. So we can think of the real function of the construction as serving as a backstop to some more efficient heuristic approach that doesn't necessarily guarantee wait-free behavior in the worst case.

# 4. Lower bounds for lock-free protocols

So far we have seen that obstruction-freedom buys us an escape from the impossibility results that plague wait-free constructions, while still allowing practical implementations of useful objects under plausible timing assumptions. Yet all is not perfect: it is still possible to show non-trivial lower bounds on the costs of these implementations in the right model. We will present one of these lower bounds, the linear-contention lower bound of Faith Ellen Fich, Danny Hendler, Nir Shavit. Linear lower bounds on real-world implementations of concurrent objects, FOCS 2005. First we have to define what is meant by contention.

## 4.1. Contention

A limitation of real shared-memory systems is that physics generally won't permit more than one process to do something useful to a shared object at a time. This limitation is often ignored in computing the complexity of a shared-memory distributed algorithm (and one can make arguments for ignoring it in systems where communication costs dominate update costs in the shared-memory implementation), but it is useful to recognize it if we can't prove lower bounds otherwise. Complexity measures that take the cost of simultaneous access into account go by the name of **contention**.

The particular notion of contention used in the Fich et al. paper is an adaptation of the contention measure of Dwork et al., Contention in shared-memory algorithms, JACM 1997. The idea is that if I access some shared object, I pay a price in **memory stalls** for all the other processes that are trying to access it at the same time but got in first. In the original definition, given an execution Aφ_{1}φ_{2}...φ_{k}φA', where all operations φ_{i} are applied to the same object as φ, and the last operation in A is not, then φ_{k} incurs k memory stalls. Fich et al. modify this to only count sequences of *non-trivial* operations where an operation is non-trivial if it changes the state of the object in some state (e.g. writes, increments, compare-and-swap—but not reads). Note that this change only strengthens the bound they eventually prove, which shows that in the worst case, obstruction-free implementations of operations on objects in a certain class incur a linear number of memory stalls (possibly spread across multiple base objects).

## 4.2. The class G

The Fich et al. bound is designed to be as general as possible, so the authors define a class *G* of objects to which it applies. As is often the case in mathematics, the underlying meaning of G is "a reasonably large class of objects for which this particular proof works," but the formal definition is given in terms of when certain operations of the implemented object are affected by the presence or absence of other operations—or in other words, when those other operations need to act on some base object in order to let later operations know they occurred.

An object is in the class G if it has some operation Op and initial state s such that for any two processes p and q and every sequence of operations AφA', where (a) φ is an instance of Op executed by p, (b) no operation in A or A' is executed by p, (c) no operation in A' is executed by q,<^{3}> and (d) no two operations in A' are executed by the same process; then there exists a sequence of operations Q by q such that for every sequence HφH' where (a) HH' is an interleaving of Q and the sequences AA'|r for each process r, (b) H' contains no operations of q, and (c) no two operations in H' are executed by the same process; then the return value of φ to p changes depending on whether it occurs after Aφ or Hφ.

This is where "makes the proof work" starts looking like a much simpler definition. The intuition is that deep in the guts of the proof, we are going to be injecting some operations of q into an existing execution (hence adding Q), and we want to do it in a way that forces q to operate on some object that p is looking at (hence the need for Aφ to return a different value than Hφ), without breaking anything else that is going on (all the rest of the conditions). The reason for pulling all of these conditions out of the proof into a separate definition is that we also want to be able to show that particular classes of real objects satisfy the conditions required by the proof, without having to put a lot of special cases into the proof itself.

- Lemma
- A mod-m fetch-and-increment object (with m ≥ n) is in G.
- Proof
- This is a classic proof-by-unpacking-the-definition. Pick some execution AφA' satisfying all the conditions, and let a be the number of fetch-and-increments in A and a' the number in A'. Note a' ≤ n-2, since all operations in A' are by different processes. Now let Q be a sequence of n-a'-1 fetch-and-increments by q, and let HH' be an interleaving of Q and the sequences AA'|r for each r, where H' includes no two operation of the same process and no operations at all of q. Let h, h' be the number of fetch-and-increments in H, H', respectively. Then h+h' = a+a'+(n-a'-1) = n+a-1 and h' ≤ n-2 (since H' contains at most one fetch-and-increment for each process other than p and q). This gives h ≥ (n+a+1)-(n-2) = a+1 and h ≤ n+a-1, and the return value of φ after Hφ is somewhere in this range mod m. But none of these values is equal to a mod m (that's why we specified m ≥ n, although as it turns out m ≥ n-1 would have been enough), so we get a different return value from Hφ than from Aφ.

As a corollary, we also get stock fetch-and-increment registers, since we can build mod-m registers from them by taking the results mod m.

A second class of class-G objects is obtained from snapshot:

- Lemma
Single-writer snapshot objects are in G.

^{4}- Proof
- Let AφA' be as in the definition, where φ is a scan operation. Let Q consist of a single update operation by q that changes its segment. Then in the interleaved sequence HH', this update doesn't appear in H' (it's forbidden), so it must be in H. Nobody can overwrite the result of the update (single-writer!), so it follows that Hφ returns a different snapshot from Aφ.

## 4.3. The lower bound proof

- Theorem
For any obstruction-free implementation of some object in class G from RMW base objects, there is an execution in which some operation incurs n-1 stalls.

^{5}

We can't do better than n-1, because it is easy to come up with implementations of counters (for example) that incur at most n-1 stalls. Curiously, we can even spread the stalls out in a fairly arbitrary way over multiple objects, while still incurring at most n-1 stalls. For example, a counter implemented using a single counter (which is a RMW object) gets exactly n-1 stalls if n-1 processes try to increment it at the same time, delaying the remaining process. At the other extreme, a counter implemented by doing a collect over n-1 single-writer registers (also RMW objects) gets at least n-1 stalls—distributed as one per register—if each register has a write delivered to it while the reader waiting to read it during its collect. So we have to allow for the possibility that stalls are concentrated or scattered or something in between, as long as the total number adds up at least n-1.

The proof supposes that the theorem is not true and then shows how to boost an execution with a maximum number k < n-1 stalls to an execution with k+1 stalls, giving a contradiction. (Alternatively, we can read the proof as giving a mechanism for generating an (n-1)-stall execution by repeated boosting, starting from the empty execution.)

This is pretty much the usual trick: we assume that there is a class of bad executions, then look for an extreme member of this class, and show that it isn't as extreme as we thought. In doing so, we can restrict our attention to particularly convenient bad executions, so long as the existence of some bad execution implies the existence of a convenient bad execution.

Formally, the authors define a *k-stall execution* for process p as an execution Eσ_{1}...σ_{i} where E and σ_{i} are sequence of operations such that:

- p does nothing in E,
Sets of processes S

_{j}, j = 1..i, whose union S=∪S_{j}has size k, are each covering objects O_{j}after E with pending non-trivial operations,Each σ

_{j}consists of p applying events by itself until it is about to apply an event to O_{j}, after which each process in S_{j}accesses O_{j}, after which p accesses O_{j}.- All processes not in S are idle after E,
p starts at most one operation of the implemented object in σ

_{1}...σ_{i}, andIn every extension of E in which p and the processes in S don't take steps, no process applies a non-trivial event to any base object accessed in σ

_{1}...σ_{i}. (We will call this the**weird condition**below.)

So this definition includes both the fact that p incurs k stalls and some other technical details that make the proof go through. The fact that p incurs k stalls follows from observing that it incurs |S_{j}| stalls in each segment σ_{j}, since all processes in S_{j} access O_{j} just before p does.

We'll now show that if a k-stall execution exists, for k < n-2, then a (k+k')-stall execution exists for some k' > 0. Note that some k-stall execution exists, because the empty execution is a 0-stall execution (with i=0) by the definition.

Start with some k-stall execution Eσ_{1}...σ_{i},. Extend this execution by a sequence of operations σ in which p runs in isolation until it finishes its operation φ (which it may start in σ if it hasn't done so already), then each process in S runs in isolation until it completes its operation. Now linearize the high-level operations completed in Eσ_{1}...σ_{i}σ and factor them as AφA' as in the definition of class G.

Let q be some process not equal to p or contained in any S_{j}. Then there is some sequence of high-level operations Q of q such that Hφ does not return the same value as Aφ for any interleaving HH' of Q with the sequences of operations in AA' satisfying the conditions in the definition. We want to use this fact to shove at least one more memory stall into Eσ_{1}...σ_{i}σ, without breaking any of the other conditions that would make the resulting execution a (k+k')-stall execution.

Consider the extension τ of E where q runs alone until it finishes every operation in Q. Then τ applies no nontrivial events to any base object accessed in σ_{1}...σ_{k}, (from the weird condition on k-stall executions) and the value of each of these base objects is the same after E and Eτ, and thus is also the same after Eσ_{1}...σ_{k} and Eτσ_{1}...σ_{k}.

Now let σ' be the extension of Eτσ_{1}...σ_{k} defined analogously to σ: p finishes, then each process in each S_{j} finishes. Let HφH' factor the linearization of Eτσ_{1}...σ_{i}σ'. Observe that HH' is an interleaving of Q and the high-level operations in AA', that H' contains no operations by q (they all finished in τ, before φ started), and that H' contains no two operations by the same process (no new high-level operations start after φ finishes, so there is at most one pending operation per process in S that can be linearized after φ).

Now observe that q does some non-trivial operation in τ to some base object accessed by p in σ. If not, then p sees the same responses in σ' and in σ, and returns the same value, contradicting the definition of class G.

So does q's operation in τ cause a stall in σ? Not necessarily: there may be other operations in between. Instead, we'll use the existence of q's operation to demonstrate the existence of at least one operation, possibly by some other process we haven't even encountered yet, that does cause a stall. We do this by considering the set F of all finite extensions of E that are free of p and S operations, and look for an operation that stalls p somewhere in this infinitely large haystack.

Let O_{i+1} be the first base object accessed by p in σ that is also accessed by some non-trivial event in some sequence in F. We will show two things: first, that O_{i+1} exists, and second, that O[i+1] is distinct from the objects O_{1}...O_{i}. The first part follows from the fact that τ∈F and we have just shown that τ contains a non-trivial operation (by q) on a base object accessed by p in σ. For the second part, we use the weird condition on k-stall executions again: since every extension of E in F is (p∪S)-free, no process applies a non-trivial event to any base object accessed in σ_{1}..σ_{i}, i.e. to any object O_{1}..O_{i}.

You've probably guessed that we are going to put our stalls in on O_{i+1}. We choose some extension X from F that maximizes the number of processes with simultaneous pending non-trivial operations on O_{i+1} (we'll call this set of processes S_{i+1} and let |S_{i+1}| be the number k'>0 we've been waiting for), and let E' be the minimum prefix of X such that these pending operations are still pending after EE'.

We now look at the properties of EE'. We have:

- EE' is p-free (follows from E being p-free and E'∈F, where everything in F is p-free).
Each process in S

_{j}has a pending operation on O_{j}after EE' (it did after E, and didn't do anything in E').

In particular, this means that we can construct an execution EE'σ_{1}...σ_{i}σ_{i+1} that includes k+k' memory stalls, by sending in the same sequences σ_{1}...σ_{i} as before, then appending a new sequence of events where (a) p does all of its operations in σ up to its first operation on O_{i+1}; then (b) all the processes in the set S_{i+1} of processes with pending events on O_{i+1} execute their pending events on O_{i+1}; then (c) p does its first access to O_{i+1} from σ. Note that in addition to giving us k+k' memory stalls, σ_{i+1} also has the right structure for a (k+k')-stall execution. But there is one thing missing: we have to show that the weird condition on further extensions still holds.

In particular, letting S' = S∪S_{i+1}, we need to show that any (p∪S')-free extension α of EE' includes a non-trivial access to a base object accessed in σ_{1}...σ_{i+1}. Observe first that since α is (p∪S')-free, then E'α is (p∪S)-free, and so it's in F: in particular, by the weird condition on Eσ_{1}...σ_{i}, E'α doesn't have any non-trivial accesses to any object with a non-trivial access in σ_{1}...σ_{i}. So we only need to squint very closely at σ_{i+1} to make sure it doesn't get any object in there either.

Recall that σ_{i+1} consists of (a) a sequence of accesses by p to objects O_{1}...O_{i} (already excluded); (b) an access of p to O_{i+1}; and (c) a bunch of accesses by processes in S_{i+1} to O_{i+1}. So we only need to show that α includes no non-trivial accesses to O_{i+1}. Suppose that it does: then there is some process that eventually has a pending non-trivial operation on O_{i+1} somewhere in α. If we stop after this initial prefix α' of α, we get k'+1 processes with pending operations on O_{i+1} in EE'α'. But then E'α' is an extension of E with k'+1 processes with a simultaneous pending operation on O_{i+1}. This contradicts the choice of X to maximize k'. So if our previous choice was in fact maximal, the weird condition still holds, and we have just constructed a (k+k')-stall execution. This concludes the proof.

## 4.4. Consequences

We've just shown that counters and snapshots have (n-1)-stall executions, because they are in the class G. A further, rather messy argument (given in the Fich et al. paper) extends the result to stacks and queues, obtaining a slightly weaker bound of n total stalls and operations for some process in the worst case. In both cases, we can't expect to get a sublinear worst-case bound on time under the reasonable assumption that both a memory stall and an actual operation takes at least one time unit. This puts an inherent bound on how well we can handle hot spots for many practical objects, and means that in an asynchronous system, we can't solve contention at the object level in the worst case (though we may be able to avoid it in our applications).

But there might be a way out for some restricted classes of objects. We saw before that we could escape from the JayantiTanToueg lower bound by considering bounded objects (see MaxRegisters). Something similar may happen here: the Fich-Herlihy-Shavit bound on fetch-and-increments requires executions with n(n-1)^{d}+n increments to show n-1 stalls for some fetch-and-increment if each fetch-and-increment only touches d objects, and even for d = log n this is already superpolynomial. The standard max registers construction of a counter doesn't help here, since everybody hits the switch bit at the top of the max register, giving n-1 stalls if they all hit it at the same time. But there might be some better construction that avoids this.

## 4.5. More lower bounds

There are many more lower bounds one can prove on lock-free implementations, many of which are based on previous lower bounds for stronger models. We won't present these in class, but if you are interested, a good place to start is

# 5. Practical considerations

This paper is also beyond the scope of what we can do, but it gives some nice examples of the practical trade-offs in choosing between multi-register CAS and various forms of software transaction memory in implementing lock-free data structures:

CategoryDistributedComputingNotes

This fact was observed by Herlihy et al. in their original obstruction-free paper; it also implies that there exists a universal obstruction-free implementation of anything based on Herlihy's universal construction. (1)

This is a much older model, which goes back to a famous paper of Dwork, Lynch, and Stockmeyer from JACM 1988. (2)

This part of the definition is slightly different from the definition in the paper, which excludes q from A as well. Unfortunately, that causes problems later in the proof. The revised definition given here is based on a personal communication from Danny Hendler and reflects an updated version of the proof in the (still unpublished as of this writing) journal version of the paper. (3)

For the purposes of this lemma, "single-writer" means that each segment can be written to by only one process, not that there is only one process that can execute update operations. (4)

This appears in the paper as Theorem 6. (5)