1. Software transactional memory
Goes back to Shavit and Touitou (Shavit and Touitou, Software transactional memory, PODC 1995) based on earlier proposals for hardware support for transactions by Herlihy and Moss. Recently very popular in programming language circles.
We start with the basic idea of a transaction. I read a bunch of registers and update their values, with the usual database ACID properties: the transaction either happens completely or not at all, and it serializes with other transactions as if each occurred instantaneously. Our goal is to implement this with minimal hardware support, and use it for everything.
Generally we only consider static transactions where the set of memory locations accessed is known in advance, as opposed to dynamic transactions where it may vary depending on what we read (e.g. if we have to follow pointers). Static transactions are easier because we can treat them as multi-word RMW.
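As a reference point, the semantics of a static transaction can be written down as a sequential specification. Here is a sketch in Python, with a single global lock standing in for atomicity; the lock is exactly the thing the implementations below try to get rid of, and the names (`static_rmw`, `memory`) are mine, not from any paper:

```python
import threading

# Shared memory as an array of cells. The single lock is only a sequential
# specification of what a multi-word RMW should do atomically.
memory = [0] * 8
_spec_lock = threading.Lock()

def static_rmw(addresses, f):
    """Atomically read memory[a] for each a in addresses (known in
    advance, hence 'static'), apply f to the vector of old values,
    and write back the new values. Returns the old values."""
    with _spec_lock:
        old = [memory[a] for a in addresses]
        for a, v in zip(addresses, f(old)):
            memory[a] = v
        return old
```

For example, moving $5 from account 0 to account 1 is `static_rmw([0, 1], lambda v: [v[0] - 5, v[1] + 5])`.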
Implementations are usually non-blocking: in any infinite execution, some infinite stream of transactions succeeds, but not necessarily yours. This rules out simple methods based on acquiring locks, since we have to keep going even if a lock-holder crashes, but it is weaker than wait-freedom, since individual transactions can starve.
2. Some selling points for STM
- We get atomic operations without having to use our brains much. Unlike hand-coded atomic snapshots, counters, queues, etc., we have a universal construction that converts any sequential data structure built on top of stock memory into a concurrent data structure. This is useful since most programmers don't have very big brains. We also avoid burdening the programmer with having to remember to lock things.
- We can build large shared data structures with the possibility of concurrent access. For example, we can implement atomic snapshots so that concurrent updates don't interfere with each other, or an atomic queue where enqueues and dequeues can happen concurrently so long as the queue always has a few elements in it to separate the enqueuers and dequeuers.
- We can execute atomic operations on distinct data structures, even if the data structures weren't originally designed to work together, provided they are all implemented using the STM mechanism. This is handy in all the classic database-like settings, as when we want to take $5 from my bank account and put it in yours.
On the other hand, we now have to deal with the possibility that operations may fail. There is a price to everything.
3. Basic approaches
- Locking (not non-blocking). Acquire either a single lock for all of memory (doesn't allow much concurrency) or a separate lock for each memory location accessed. The second approach can lead to deadlock if we aren't careful, but we can prove that if every transaction acquires locks in the same order (e.g. by increasing memory address), then we never get stuck: we can order the processes by the highest lock acquired, and somebody comes out on top. Note that acquiring locks in increasing order means that I have to know which locks I want before I acquire any of them, which may rule out dynamic transactions.
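The ordered-locking discipline can be sketched as follows, in Python with one lock per memory location; the sorted acquisition order is what rules out deadlock (the function name and layout are mine):

```python
import threading

NLOCKS = 16
locks = [threading.Lock() for _ in range(NLOCKS)]
memory = [0] * NLOCKS

def locked_transaction(addresses, f):
    """Acquire the lock for each address in increasing order, so that
    any two transactions acquire their common locks in the same order
    and cannot deadlock; apply f; release in reverse order."""
    order = sorted(set(addresses))
    for a in order:
        locks[a].acquire()
    try:
        old = [memory[a] for a in addresses]
        for a, v in zip(addresses, f(old)):
            memory[a] = v
        return old
    finally:
        for a in reversed(order):
            locks[a].release()
```

Note that building `order` up front is where the restriction to static transactions shows up: we must know every address before acquiring anything.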
- Single-pointer compare-and-swap (known as Herlihy's method). Based on Herlihy's universal construction. All access to the data structure goes through a pointer in a CAS. To execute a transaction, I make my own copy of the data structure, update it, and then attempt to redirect the pointer. Advantages: trivial to prove that the result is linearizable (the pointer swing is an atomic action) and non-blocking (somebody wins the CAS), allows dynamic transactions (since you can do anything you want to your copy). Disadvantages: high overhead of the many copies, single-pointer bottleneck limits concurrency even when two transactions use disjoint parts of memory.
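A minimal sketch of the single-pointer approach in Python, with a lock-simulated CAS cell standing in for the hardware instruction (the names `CASCell` and `transact` are mine, not from Herlihy's paper):

```python
import threading

class CASCell:
    """A single word supporting compare-and-swap, simulated with a lock.
    On real hardware this would be one atomic CAS instruction."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

# Root pointer to the current (treated-as-immutable) version of the structure.
root = CASCell({'queue': ()})

def transact(update):
    """Copy the structure, apply the sequential update to the private
    copy, and try to swing the root pointer. Retry on failure; some
    transaction always wins the CAS, so the scheme is non-blocking."""
    while True:
        old = root.read()
        new = update(dict(old))   # update sees only its own copy
        if root.cas(old, new):
            return new
```

Since `update` can do anything it likes to the copy, dynamic transactions come for free; the cost is copying everything and funneling all transactions through one CAS.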
- Multiword RMW: This is the approach suggested by Shavit and Touitou, which most subsequent work follows. As usually implemented, it only works for static transactions. The idea is that I write down what registers I plan to update and what I plan to do to them. I then attempt to acquire all the registers. If I succeed, I update all the values, store the old values, and go home. If I fail, it's because somebody else already acquired one of the registers. Since I need to make sure that somebody makes progress (I may be the only process left alive), I'll help that other process finish its transaction if possible. Advantages: allows concurrency between disjoint transactions. Disadvantages: requires implementing multi-word RMW—in particular, requires that any process be able to understand and simulate any other process's transactions. Subsequent work often simplifies this to implementing multi-word CAS, which is sufficient to do non-blocking multi-word RMW since I can read all the registers I need (without any locking) and then do a CAS to update them (which fails only if somebody else succeeded).
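The read-then-CASN recipe at the end of the last paragraph looks roughly like this in Python. The `casn` below fakes atomicity with a lock, since actually building multi-word CAS is the subject of the next section; the names are mine:

```python
import threading

memory = [0] * 8
_mem_lock = threading.Lock()   # stands in for a real CASN primitive

def casn(addresses, expected, new):
    """Multi-word CAS: atomically check memory[a] == expected[i] for
    every address and, if all match, install the new values. Atomicity
    is faked here with one lock; Shavit and Touitou build it from
    LL/SC, Harris et al. from single-word CAS."""
    with _mem_lock:
        if all(memory[a] == e for a, e in zip(addresses, expected)):
            for a, v in zip(addresses, new):
                memory[a] = v
            return True
        return False

def rmw(addresses, f):
    """Non-blocking multi-word RMW from CASN: read without locking,
    compute the new values, then CASN. A failed CASN means some other
    transaction succeeded in between, so just retry."""
    while True:
        old = [memory[a] for a in addresses]
        if casn(addresses, old, f(old)):
            return old
```
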
4. Implementing multi-word RMW
We'll describe Shavit and Touitou's method, which essentially follows the locking approach but allows other processes to help dead ones so that locks are always released.
The synchronization primitive used is LL/SC: LL (load-linked) reads a register and leaves our id attached to it; SC (store-conditional) writes a register only if our id is still attached, and clears any other ids that might also be attached. It's easy to build 1-register CAS (CAS1) out of this, though S&T exploit some additional power of LL/SC.
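Here is one way to simulate LL/SC and derive CAS1 from it (a Python sketch with invented names; unlike real hardware, plain writes don't exist in this toy register, so links only get cleared by a successful SC, which is enough for the CAS1 construction):

```python
import threading

class LLSCRegister:
    """Register with load-linked/store-conditional, simulated with a
    lock. LL records which process ids currently hold a link; a
    successful SC clears all links, failing everyone else's pending SC."""
    def __init__(self, value):
        self._value = value
        self._links = set()
        self._lock = threading.Lock()

    def ll(self, pid):
        with self._lock:
            self._links.add(pid)
            return self._value

    def sc(self, pid, new):
        with self._lock:
            if pid in self._links:
                self._value = new
                self._links.clear()   # invalidate everyone's links
                return True
            return False

def cas1(reg, pid, expected, new):
    """CAS1 from LL/SC: link, check the value, then store conditionally.
    An SC failure means somebody wrote in between, which is exactly
    when a CAS should fail."""
    return reg.ll(pid) == expected and reg.sc(pid, new)
```
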
Overview of a transaction (no contention or failures):
- The process initializes the local transaction record rec: an operation field describes what it wants to do, Addresses lists the relevant memory addresses, OldValues (initially null) provides space for storing the old values from the R part of the RMW, and Status (initially null) records whether the transaction succeeds or fails. A version number distinguishes reuses of the same storage, and a Stable bit is set at the end to tell other processes that all fields have been initialized.
- The process acquires ownership of the registers in Addresses; see the AcquireOwnerships code in the paper. The basic idea is to use LL/SC to set Owner[address] to the transaction record only if it is currently Nobody; however, between the LL and the SC the process checks that the record's status is still null and that the version number hasn't changed. The status check is done with an overlapping LL/SC: LL(status), LL(owner), SC(status), SC(owner). If both SCs succeed, go on to the next register. If either fails, attempt to fail the transaction (by doing SC(Failure, bad-address) on the status field) and return if the failure is successful. Otherwise try again.
- Do an LL on status to see if AcquireOwnerships succeeded. If so, update the memory, store the old values in OldValues, and release the ownerships. If it failed on bad address j, release the ownership and then attempt to carry out the transaction that owns j, provided we are the initiator of the current transaction.
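The fields of the transaction record might be laid out as follows (a data-layout sketch only; the field names loosely follow the description above, not the paper's exact code):

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class TransactionRecord:
    operation: Any                      # what the transaction wants to do
    addresses: List[int]                # registers touched, fixed in advance
    old_values: Optional[list] = None   # filled in by the R part of the RMW
    status: Any = None                  # null until Success or (Failure, bad_addr)
    version: int = 0                    # distinguishes reuses of this storage
    stable: bool = False                # set last, once all fields are initialized
```
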
With failures or concurrency, a second process may come in and attempt to acquire ownership of a register we've already grabbed. In this case, it will fail its own transaction and then start helping ours, attempting to acquire more registers for our transaction. If it succeeds before we do, it will update the registers and store the previous values in OldValues.
- Linearizability follows from the serializability of the locking protocol: acquiring ownership is equivalent to grabbing a lock, and updates occur only when all registers are locked.
- Complications come from (a) two or more processes trying to complete the same transaction and (b) some process trying to complete an old transaction that has already terminated. For the first part we just make sure that the processes don't interfere with each other, e.g. I am happy when trying to acquire a location if somebody else acquires it for the same transaction. For the second part we have to check rec.status and rec.version before doing just about anything.
To show that the protocol is non-blocking we must show that if an unbounded number of transactions are attempted, one eventually succeeds. First observe that in order to fail, a transaction must be blocked by another transaction that acquired ownership of a higher-address location than it did; eventually we run out of higher-address locations, so there is some transaction that doesn't fail. Of course, this transaction may not succeed (e.g., if its initiator dies), but either (a) it blocks some other transaction, and that transaction's initiator will complete it or die trying, or (b) it blocks no future transactions. In the second case we can repeat the argument for the n-1 surviving processes to show that some of them complete transactions, ignoring the stalled transaction from case (b).
One downside of S&T is that it uses LL/SC very aggressively (e.g. with overlapping LL/SC operations) and uses non-trivial (though bounded, if you ignore the ever-increasing version numbers) amounts of extra space. Subsequent work has aimed at knocking these down; for example, Harris et al. (A practical multi-word compare-and-swap, DISC 2002) build multi-register CAS out of single-register CAS with O(1) extra bits per register. The proofs of these later results can be quite involved; Harris et al., for example, base their algorithm on an implementation of 2-register CAS whose correctness has been verified only by machine (which may be a plus in some views).