
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Concurrency control is the process of preventing threads running in the same address space from interfering with each other.

1. The need for concurrency control

1.1. Example: Updating the web hits counter

Let's suppose we have a multi-threaded web server that keeps track of how many times it serves a particular page (since we get paid $0.0002 per impression, say). Naively, we might try using the following code:

    void
    update_hit_counter(void)
    {
        hit_counter = hit_counter + 1;
    }

On an IA-32 machine, gcc (with no optimization flags) will turn this into

        movl    hit_counter, %eax
        addl    $1, %eax
        movl    %eax, hit_counter

I.e., read hit_counter, add 1 to it, and write the result back.

Suppose we have two threads P and Q both trying to increment hit_counter, which starts at 0. If P is interrupted after reading hit_counter but before writing it back, P and Q will both read 0 from hit_counter and both write 1 back. We just lost $0.0002!
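
For example, here is one possible bad interleaving of the three instructions above (each thread has its own copy of %eax; memory is shared):

        P: movl    hit_counter, %eax    ; P reads 0
        Q: movl    hit_counter, %eax    ; Q reads 0
        Q: addl    $1, %eax
        Q: movl    %eax, hit_counter    ; Q writes 1
        P: addl    $1, %eax             ; P still has 0 in its register
        P: movl    %eax, hit_counter    ; P also writes 1: one hit is lost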

1.2. Race conditions

Any situation where the order of operations between two threads affects the outcome of the computation is called a race condition. Race conditions create nondeterminism and are thus bad. We'd like to get rid of race conditions when we can.

2. Mutual exclusion

Suppose we can ensure that nobody else touches hit_counter during the critical section: the three machine-language instructions it takes to update it. Then the update_hit_counter procedure will appear to execute instantaneously: it will be atomic. So how do we enforce this?

2.1. Naive approach

    void
    update_hit_counter(void)
    {
        while(hit_counter_busy);  /* wait for other process to finish */

        hit_counter_busy = 1;

        /* start of critical section */
        hit_counter = hit_counter + 1;
        /* end of critical section */

        hit_counter_busy = 0;
    }

Here we have an attempt at a spin-lock: a busy bit that warns the other threads that we are updating the hit counter. This particular implementation doesn't work:

  1. P and Q both read 0 from hit_counter_busy and exit their loops.

  2. P and Q both run the rest of the procedure together.

This does work if we can test if hit_counter_busy is 0 and set it to 1 as a single atomic operation (called a test-and-set). Some hardware provides this operation. But it is in fact possible to do mutual exclusion even without test-and-set.
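
If the hardware does provide test-and-set, the naive approach can be repaired. Here is a sketch, assuming an atomic test_and_set(location, value) that writes value and returns the location's previous contents (the same interface used in the spinlock code below):

    extern int test_and_set(int *location, int value);

    void
    update_hit_counter(void)
    {
        /* spin until we are the thread that changes busy from 0 to 1 */
        while(test_and_set(&hit_counter_busy, 1) != 0)
            ;

        /* start of critical section */
        hit_counter = hit_counter + 1;
        /* end of critical section */

        hit_counter_busy = 0;
    }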

2.2. Another naive approach

    /* For two threads 0 and 1 */
    int thread_waiting[2];  /* initially zero */

    void
    update_hit_counter(int me)
    {
        /* try to grab the lock */
        thread_waiting[me] = 1;

        /* did I win? */
        for(;;) {
            if(thread_waiting[!me] == 0) break;
        }

        /* start of critical section */
        hit_counter = hit_counter + 1;
        /* end of critical section */

        thread_waiting[me] = 0;
    }

This does guarantee mutual exclusion. One of the threads sets thread_waiting first, and the second thread will read it and spin until the first thread finishes (since it does its read after setting thread_waiting for itself). Unfortunately it also allows for deadlock: if each thread sets its thread_waiting flag before looking at the other's, both will spin. So we are still losing.

2.3. Peterson's algorithm

    /* For two threads 0 and 1 */
    int thread_waiting[2];  /* initially zero */
    int turn;               /* initially arbitrary */

    void
    update_hit_counter(int me)
    {
        /* try to grab the lock */
        thread_waiting[me] = 1;
        turn = me;

        /* did I win? */
        for(;;) {
            if(thread_waiting[!me] == 0) break;
            if(turn != me) break;
        }

        /* start of critical section */
        hit_counter = hit_counter + 1;
        /* end of critical section */

        thread_waiting[me] = 0;
    }

Intuition: We use the same argument as the previous solution to show that at most one thread wins the thread_waiting race and executes the first break statement. If neither thread wins the thread_waiting race, then only one thread is the first to write to turn. The first thread to write to turn escapes as soon as the second thread writes to turn. (Think of turn as saying whose turn it is to wait, not whose turn it is to go.) So we avoid the previous deadlock, since the winner eventually completes the critical section and sets its thread_waiting flag back to zero, letting the other thread out. (See MutualExclusion for a more detailed proof and some other algorithms.)

Problem: only works for two threads. To handle more threads, we need to build a tree of these, which gets very complex very fast, or use a different algorithm. Even with a smarter algorithm we provably need to allocate—and have each thread read—at least one memory location per thread that might access the mutex, which gets expensive.

2.4. Preventing pre-emption

On a uniprocessor (or inside a single process), we may be able to implement a critical section by temporarily preventing pre-emption.

    void
    update_hit_counter(void)
    {
        turn_off_interrupts();

        /* start of critical section */
        hit_counter = hit_counter + 1;
        /* end of critical section */

        turn_on_interrupts();
    }

This is a pretty good strategy for threads inside the kernel, where we have direct access to e.g. CLI and STI instructions. It also works well with co-operative multitasking (just remember not to call yield inside the critical section). But it doesn't help with multiple processors, since our two threads might genuinely be running in parallel!
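
As a sketch, on IA-32 these procedures could be thin wrappers around CLI and STI; this assumes we are running in kernel mode (ring 0), where the instructions are permitted:

    /* sketch: uniprocessor, kernel mode only */
    static inline void
    turn_off_interrupts(void)
    {
        __asm__ __volatile__("cli");  /* clear the interrupt-enable flag */
    }

    static inline void
    turn_on_interrupts(void)
    {
        __asm__ __volatile__("sti");  /* set the interrupt-enable flag */
    }

A real kernel would typically save the old interrupt flag and restore it afterwards (e.g. using pushf/popf), so that nested critical sections don't re-enable interrupts too early.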

2.5. Hardware support for locking

For a multi-processor machine, we pretty much want some sort of hardware support, e.g. the xchg operation in IA-32, which performs a hardware-level lock on the shared-memory bus and then exchanges a value with memory, effectively acting as a test-and-set.
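
As an illustration, a test-and-set might be built from xchg like this (a sketch using gcc inline assembly; xchg with a memory operand asserts the bus lock automatically):

    /* atomically swap value with *location and return the old contents */
    static inline int
    test_and_set(int *location, int value)
    {
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r" (value), "+m" (*location));
        return value;
    }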

Usually the details of the particular operation are hidden under some higher-level abstraction, such as a lock (also called a mutex). A lock is a data structure that supports acquire and release operations, with the rule that at most one process can successfully complete an acquire operation before somebody executes release. Further abstractions can then be built on top of the lock.

3. Spinlocks

A spinlock is a lock where blocked processes "busy wait," checking the lock over and over again until it becomes free. The selling point for a spinlock is that the code can be made very simple and the cost of acquiring and releasing the lock is very very small, at most a few memory-bus clock cycles. The disadvantage is that a processor spinning on a lock doesn't do anything else while it's waiting.

An example of an implementation of a spinlock for x86 using the xchg operation can be found at Spinlock (warning: Intel assembler syntax).

One way to get around the problem of busy-waiting is to have the waiting process sleep if it can't acquire the lock quickly, as in the following code, which assumes a (hardware-dependent) test_and_set procedure defined elsewhere.

    /* atomically write a new value to location and return the old value */
    extern int test_and_set(int *location, int value);

    #define LOCK_TRY_TIMES (1024)

    void
    acquire_lock(int *lock)
    {
        int count;

        for(;;) {
            for(count = LOCK_TRY_TIMES; count > 0; count--) {
                if(test_and_set(lock, 1) == 0) {
                    /* got it */
                    return;
                }
            }

            /* didn't get it, go to sleep */
            yield();
        }
    }

    void
    release_lock(int *lock)
    {
        test_and_set(lock, 0);
    }
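
With these in hand, the hit counter update might look like this (assuming a lock variable that starts at zero):

    int hit_counter_lock = 0;  /* 0 = free, 1 = held */

    void
    update_hit_counter(void)
    {
        acquire_lock(&hit_counter_lock);
        hit_counter = hit_counter + 1;
        release_lock(&hit_counter_lock);
    }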

The value of LOCK_TRY_TIMES should be set to minimize the total cost of (1) waiting too long before giving up, and (2) the context switch involved in calling yield. If nobody else is waiting for the lock, then it doesn't matter what we set it to, so the question is how long we expect to wait when there is high contention (many processors trying to acquire the lock at once). A reasonable strategy if we are just trying to minimize overhead might be to set LOCK_TRY_TIMES so that the waiting time is roughly equal to the time to perform two thread context switches (since we will have to swap the thread back in later); this way we pay at most twice the overhead that we would pay if we knew to swap out immediately. This assumes that we mostly care about minimizing overhead and not other factors (like proceeding as quickly as possible once the lock becomes available). If we want to acquire the lock as soon as possible we might be willing to set LOCK_TRY_TIMES higher and accept the busy-waiting overhead. At the other extreme, trying the lock just once minimizes busy-waiting, but if nobody holds the lock for very long we may be better off waiting for it.

Another limitation of this simple spinlock is that it doesn't involve the thread scheduler beyond calling yield. So it may be that when the scheduler swaps us back in, we still can't get the lock. This is bad not only because we may waste time checking, but also because we may miss a chance to get in because the scheduler didn't know to schedule us while the lock was free and somebody else snaffled it up while we were waiting. Such a situation is called starvation.

4. Semaphores

Semaphores were invented by Edsger Dijkstra precisely to deal with the problem of waiting for bounded supplies of resources. The basic idea of a semaphore is to provide a counter with two operations, wait and signal (called P and V in Dijkstra's original paper, from Dutch proberen = test and verhogen = increment), where wait waits for the counter value to be positive and then decrements it, and signal increments the counter.

The simple version of the code looks like this:

    void
    wait(int *sem, int *lock)
    {
        /* spin until the counter is positive, then decrement it */
        for(;;) {
            acquire_lock(lock);
            if(*sem > 0) {
                (*sem)--;
                release_lock(lock);
                return;
            }
            release_lock(lock);
        }
    }

    void
    signal(int *sem, int *lock)
    {
        acquire_lock(lock);
        (*sem)++;
        release_lock(lock);
    }

Note the use of a mutex to protect the semaphore from lost updates. With additional hardware support (e.g. built-in atomic semaphore operations) it may be possible to avoid this.

4.1. Applications

We can use semaphores to protect a resource or collection of resources. A semaphore that is initially set to 1 acts like a lock: only one thread can acquire it at a time. A semaphore initially set to a higher value will allow multiple threads in simultaneously. We can also use signal to indicate that some resource is available; for example, if we have a group of worker threads waiting for things to do (e.g., kernel threads waiting to run user threads in a many-to-many thread model), we can have them all execute wait on a semaphore that is initially zero and use signal to indicate that some new job has shown up. Similarly in a producer-consumer model we can use signal whenever the producer adds a new item to the buffer and wait when the consumer is ready to remove one; this blocks the consumer when there is nothing to do. (With a second semaphore that tracks empty space in the buffer we can also block the producer when the buffer is full.)
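
As a sketch of the producer-consumer case, using the simple wait and signal above (this assumes one producer and one consumer; with more of either, the buffer indices would need a lock of their own):

    #define BUFFER_SIZE 16

    int buffer[BUFFER_SIZE];
    int in = 0, out = 0;             /* next slot to fill / to empty */

    int items = 0;                   /* counts filled slots */
    int items_lock = 0;
    int spaces = BUFFER_SIZE;        /* counts empty slots */
    int spaces_lock = 0;

    void
    produce(int item)
    {
        wait(&spaces, &spaces_lock); /* block while the buffer is full */
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
        signal(&items, &items_lock); /* announce the new item */
    }

    int
    consume(void)
    {
        int item;

        wait(&items, &items_lock);   /* block while the buffer is empty */
        item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        signal(&spaces, &spaces_lock);
        return item;
    }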

4.2. Implementation

Semaphores are integrated with scheduling by including a wait queue along with the counter value. When a process sees a zero or negative value in the semaphore, instead of spinning it adds itself to the end of the wait queue and blocks (informs the scheduler that it is no longer ready to run). When a process increments the counter, it will wake a blocked process at the front of the queue. A standard optimization is to use negative values in the semaphore counter to indicate the number of processes waiting; this avoids having to test separately if the queue is empty or not.

The code might look something like this:

    typedef struct semaphore {
        int value;
        ProcessQueue *q;   /* queue of processes, implemented elsewhere */
    } Semaphore;

    extern int GlobalSchedulerLock;

    void
    wait(Process *p, Semaphore *s)
    {
        acquire_lock(&GlobalSchedulerLock);

        /* see if we can get in */
        s->value--;

        if(s->value < 0) {
            enqueue(s->q, p);
            block(p);  /* mark process p as unschedulable */
        }

        release_lock(&GlobalSchedulerLock);
    }

    void
    signal(Semaphore *s)
    {
        acquire_lock(&GlobalSchedulerLock);

        s->value++;

        if(s->value <= 0) {
            wake(dequeue(s->q));     /* mark process as runnable */
        }

        release_lock(&GlobalSchedulerLock);
    }

Both operations must be implemented atomically—including changes to the scheduler state. The wait operation must also be implemented so that the blocked process isn't the one responsible for releasing the scheduler lock. Typically this involves implementing semaphores as system calls.

One decision that will affect how processes interact with semaphores is how we implement the wait list. The fastest and simplest approach is to implement it as a stack (since linked lists make pushing and popping at the front cheap). This, however, may lead to starvation: a thread near the bottom of the stack may stay there indefinitely because later arrivals keep grabbing the semaphore first. Implementing it as a queue guarantees no starvation (assuming no deadlock), so it may be a better choice if fairness is an issue. But the definition of a semaphore is agnostic about which thread wakes up in response to a signal, so unless you know that the semaphore you are using guarantees fairness, you can't count on getting it.

On POSIX machines semaphores are available using sem_init, sem_wait, sem_post (signal), etc.
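
For example, the hit counter protected by a POSIX semaphore acting as a lock might look like this (a sketch with error checking omitted):

    #include <semaphore.h>

    sem_t hit_counter_sem;
    int hit_counter;

    void
    init_hit_counter(void)
    {
        /* pshared = 0: shared among threads of this process;
           initial value 1 makes the semaphore act as a lock */
        sem_init(&hit_counter_sem, 0, 1);
    }

    void
    update_hit_counter(void)
    {
        sem_wait(&hit_counter_sem);
        hit_counter = hit_counter + 1;
        sem_post(&hit_counter_sem);
    }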

5. Monitors

Both mutexes and semaphores behave badly if mistreated. To prevent foolish programmers from failing to release locks, from releasing them more than once (if implemented as semaphores), or from attempting to reacquire a lock that they already hold, many programming languages provide explicit monitors, a language construct that indicates that some block of code should be executed only while in possession of a particular lock. Examples are synchronized methods and synchronized statements in Java. For example, our hit counter could use a synchronized method like this:

    public synchronized void update_hit_counter() {
        hits++;
    }

and the run-time would take care of locking the object to avoid trouble.

We could also synchronize on an explicit lock object:

    public void update_hit_counter() {
        do_something_expensive_where_we_do_not_need_to_lock();

        synchronized(this) {
            hits++;
        }
    }

Note the use of this as a lock; in Java all objects have a lock attached to them to support synchronized methods, which is a reasonable default to choose unless there is some reason to use something else.

One advantage of monitors is that they allow the system to choose whatever locking mechanism is most efficient (or even compile out locks if it can prove that they are not needed).

6. Condition variables

A weaker relative of a semaphore is a condition variable, which acts like the wait list in a semaphore without the counter. Executing signal on a condition variable unblocks exactly one waiting process; if no process is waiting, nothing happens.

One thing to watch out for with condition variables is a missed connection, where process P starts waiting for a condition just after process Q signals it, leading to deadlock. The usual way to avoid this is to tie a condition variable to a mutex, so that P unlocks the mutex and waits on the condition variable as a single atomic operation, and Q locks the mutex before executing the signal (and unlocks it afterwards). Most implementations of condition variables (e.g. pthread_cond_wait) provide such an atomic unlock-and-wait operation, which is impossible to implement directly from a separate mutex and simple condition variable since there is otherwise a gap between the unlock and the wait that allows the signal to be missed. (The other order doesn't work because the thread blocks on the wait and never gets to the unlock.)
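
Here is a sketch of the usual pattern with POSIX threads. The waiter re-tests the condition in a loop, since some other thread may consume the item between the signal and the wakeup:

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t item_available = PTHREAD_COND_INITIALIZER;
    int items = 0;

    void
    get_item(void)
    {
        pthread_mutex_lock(&lock);
        while(items == 0) {
            /* atomically releases lock and blocks;
               reacquires lock before returning */
            pthread_cond_wait(&item_available, &lock);
        }
        items--;
        pthread_mutex_unlock(&lock);
    }

    void
    put_item(void)
    {
        pthread_mutex_lock(&lock);
        items++;
        pthread_cond_signal(&item_available);
        pthread_mutex_unlock(&lock);
    }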

7. Deadlock detection and avoidance

See Deadlock.

