[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. What's wrong with comparison-based sorting

The standard quicksort routine is an example of a comparison-based sorting algorithm. This means that the only information the algorithm uses about the items it is sorting is the return value of the compare routine. This has a rather nice advantage of making the algorithm very general, but has the disadvantage that the algorithm can extract only one bit of information from every call to compare. Since there are n! possible ways to reorder the input sequence, this means we need at least log(n!) = O(n log n) calls to compare to finish the sort. If we are sorting something like strings, this can get particularly expensive, because calls to strcmp may take time linear in the length of the strings being compared. In the worst case, sorting n strings of length m each could take O(n m log n) time.

2. Bucket sort

The core idea of radix sort is that if we want to sort values from a small range, we can do it by making one bucket for each possible value and throw any object with that value into the corresponding bucket. In the old days, when Solitaire was a game played with physical pieces of cardboard, a player who suspected that that one of these "cards" had fallen under the couch might sort the deck by dividing it up into Spades, Hearts, Diamonds, and Club piles and then sorting each pile recursively. The same trick works in a computer, but there the buckets are typically implemented as an array of some convenient data structure.

If the number of possible values is too big, we may still be able to use bucket sort digit-by-digit. The resulting algorithms are known generally as radix sort. These are a class of algorithms designed for sorting strings in lexicographic order—the order used by dictionaries where one string is greater than another if the first character on which they differ is greater. One particular variant, most-significant-byte radix sort or MSB radix sort, has the beautiful property that its running time is not only linear in the size of the input in bytes, but is also linear in the smallest number of characters in the input that need to be examined to determine the correct order. This algorithm is so fast that it takes not much more time to sort data than it does to read the data from memory and write it back. But it's a little trickier to explain that the original least-significant-byte radix sort or LSB radix sort.

3. Classic LSB radix sort

This is the variant used for punch cards, and works well for fixed-length strings. The idea is to sort on the least significant position first, then work backwards to the most significant position. This works as long as each sort is stable, meaning that it doesn't reorder values with equal keys. For example, suppose we are sorting the strings:

sat
bat
bad

The first pass sorts on the third column, giving:

bad
sat
bat

The second pass sorts on the second column, producing no change in the order (all the characters are the same). The last pass sorts on the first column. This moves the s after the two bs, but preserves the order of the two words starting with b. The result is:

bad
bat
sat

There are three downsides to LSB radix sort:

  1. All the strings have to be the same length (this is not necessarily a problem if they are really fixed-width data types like ints).

  2. The algorithm used to sort each position must be stable, which may require more effort than most programmers would like to take.
  3. It may be that the late positions in the strings don't affect the order, but we have to sort on them anyway. If we are sorting aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa and baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, we spend a lot of time matching up as against each other.

4. MSB radix sort

For these reasons, MSB radix sort is used more often. This is basically the radix sort version of QuickSort, where instead of partitioning our input data into two piles based on whether each element is less than or greater than a random pivot, we partition the input into 256 piles, one for each initial byte. We can then sort each pile recursively using the same algorithm, taking advantage of the fact that we know that the first byte (or later, the first k bytes) are equal and so we only need to look at the next one. The recursion stops when we get down to 1 value, or in practice where we get down to a small enough number of values that the cost of doing a different sorting algorithm becomes lower than the cost of creating and tearing down the data structures for managing the piles.

4.1. Issues with recursion depth

The depth of recursion for MSB radix sort is equal to the length of the second-longest string in the worst case. Since strings can be pretty long, this creates a danger of blowing out the stack. The solution (as in QuickSort) is to use tail recursion for the largest pile. Now any pile we recurse into with an actual procedure call is at most half the size of the original pile, so we get stack depth at most O(log n).

4.2. Implementing the buckets

There is a trick we can do analagous to the Dutch flag algorithm where we sort the array in place. The idea is that we first count the number of elements that land in each bucket and assign a block of the array for each bucket, keeping track in each block of an initial prefix of values that belong in the bucket with the rest not yet processed. We then walk through the buckets swapping out any elements at the top of the good prefix to the bucket they are supposed to be in. This procedure puts at least one element in the right bucket for each swap, so we reorder everything correctly in at most n swaps and O(n) additional work.

To keep track of each bucket, we use two pointers bucket[i] for the first element of the bucket and top[i] for the first unused element. We could make these be integer array indices, but this slows the code down by about 10%. This seems to be a situation where our use of pointers is complicated enough that the compiler can't optimize out the array lookups.

4.3. Further optimization

Since we are detecting the heaviest bucket anyway, there is a straightforward optimization that speeds the sort up noticeably on inputs with a lot of duplicates: if everything would land in the same bucket, we can skip the bucket-sort and just go directly to the next character.

4.4. Sample implementation

Here is an implementation of MSB radix sort using the ideas above:

   1 #include <assert.h>
   2 #include <limits.h>
   3 #include <string.h>
   4 
   5 #include "radixsort.h"
   6 
   7 /* in-place MSB radix sort for null-terminated strings */
   8 
   9 /* helper routine for swapping */
  10 static void
  11 swapStrings(const char **a, const char **b)
  12 {
  13     const char *temp;
  14 
  15     temp = *a;
  16     *a = *b;
  17     *b = temp;
  18 }
  19 
  20 /* this is the internal routine that assumes all strings are equal for the
  21  * first k characters */
  22 static void
  23 radixSortInternal(int n, const char **a, int k)
  24 {
  25     int i;
  26     int count[UCHAR_MAX+1];  /* number of strings with given character in position k */
  27     int mode;                /* most common position-k character */
  28     const char **bucket[UCHAR_MAX+1]; /* position of character block in output */
  29     const char **top[UCHAR_MAX+1];    /* first unused index in this character block */
  30 
  31     /* loop implements tail recursion on most common character */
  32     while(n > 1) {
  33 
  34         /* count occurrences of each character */
  35         memset(count, 0, sizeof(int)*(UCHAR_MAX+1));
  36 
  37         for(i = 0; i < n; i++) {
  38             count[(unsigned char) a[i][k]]++;
  39         }
  40 
  41         /* find the most common nonzero character */
  42         /* we will handle this specially */
  43         mode = 1;
  44         for(i = 2; i < UCHAR_MAX+1; i++) {
  45             if(count[i] > count[mode]) {
  46                 mode = i;
  47             }
  48         }
  49 
  50         if(count[mode] < n) {
  51 
  52             /* generate bucket and top fields */
  53             bucket[0] = top[0] = a;
  54             for(i = 1; i < UCHAR_MAX+1; i++) {
  55                 top[i] = bucket[i] = bucket[i-1] + count[i-1];
  56             }
  57 
  58             /* reorder elements by k-th character */
  59             /* this is similar to dutch flag algorithm */
  60             /* we start at bottom character and swap values out until everything is in place */
  61             /* invariant is that for all i, bucket[i] <= j < top[i] implies a[j][k] == i */
  62             for(i = 0; i < UCHAR_MAX+1; i++) {
  63                 while(top[i] < bucket[i] + count[i]) {
  64                     if((unsigned char) top[i][0][k] == i) {
  65                         /* leave it in place, advance bucket */
  66                         top[i]++;
  67                     } else {
  68                         /* swap with top of appropriate block */
  69                         swapStrings(top[i], top[(unsigned char) top[i][0][k]]++);
  70                     }
  71                 }
  72             }
  73 
  74             /* we have now reordered everything */
  75             /* recurse on all but 0 and mode */
  76             for(i = 1; i < UCHAR_MAX+1; i++) {
  77                 if(i != mode) {
  78                     radixSortInternal(count[i], bucket[i], k+1);
  79                 }
  80             }
  81 
  82             /* tail recurse on mode */
  83             n = count[mode];
  84             a = bucket[mode];
  85             k = k+1;
  86 
  87         } else {
  88 
  89             /* tail recurse on whole pile */
  90             k = k+1;
  91         }
  92     }
  93 }
  94 
  95 void
  96 radixSort(int n, const char **a)
  97 {
  98     radixSortInternal(n, a, 0);
  99 }
radixsort.c

Some additional files: radixsort.h, test_radixsort.c, Makefile, sortInput.c. The last is a program that sorts lines on stdin and writes the result to stdout, similar to the GNU sort utility. When compiled with -O3 and run on my machine, this runs in about the same time as the standard sort program when run on a 4.7 million line input file consisting of a random shuffle of 20 copies of /usr/share/dict/words, provided sort is run with LANG=C sort < /usr/share/dict/words to keep it from having to deal with locale-specific collating issues. On other inputs, sort is faster. This is not bad given how thoroughly sort has been optimized, but is a sign that further optimization is possible.


CategoryProgrammingNotes


2014-06-17 11:58