# 1. What's wrong with comparison-based sorting

The standard `quicksort` routine is an example of a **comparison-based sorting algorithm**. This means that the only information the algorithm uses about the items it is sorting is the return value of the `compare` routine. This has the rather nice advantage of making the algorithm very general, but has the disadvantage that the algorithm can extract only one bit of information from every call to `compare`. Since there are n! possible ways to reorder the input sequence, this means we need at least log(n!) = Ω(n log n) calls to `compare` to finish the sort. If we are sorting something like strings, this can get particularly expensive, because calls to `strcmp` may take time linear in the length of the strings being compared. In the worst case, sorting n strings of length m each could take O(n m log n) time.

# 2. Bucket sort

The core idea of bucket sort is that if we want to sort values from a small range, we can do it by making one bucket for each possible value and throwing any object with that value into the corresponding bucket. In the old days, when Solitaire was a game played with physical pieces of cardboard, a player who suspected that one of these "cards" had fallen under the couch might sort the deck by dividing it up into Spades, Hearts, Diamonds, and Clubs piles and then sorting each pile recursively. The same trick works in a computer, but there the buckets are typically implemented as an array of some convenient data structure.

If the number of possible values is too big, we may still be able to use bucket sort digit-by-digit. The resulting algorithms are known generally as **radix sort**. These are a class of algorithms designed for sorting strings in lexicographic order, the order used by dictionaries, where one string is greater than another if the first character on which they differ is greater. One particular variant, **most-significant-byte radix sort** or MSB radix sort, has the beautiful property that its running time is not only linear in the size of the input in bytes, but is also linear in the smallest number of characters in the input that need to be examined to determine the correct order. This algorithm is so fast that it takes not much more time to sort data than it does to read the data from memory and write it back. But it's a little trickier to explain than the original **least-significant-byte radix sort** or LSB radix sort.

# 3. Classic LSB radix sort

This is the variant used for punch cards, and works well for fixed-length strings. The idea is to sort on the least significant position first, then work backwards to the most significant position. This works as long as each sort is *stable*, meaning that it doesn't reorder values with equal keys. For example, suppose we are sorting the strings:

`sat bat bad`

The first pass sorts on the third column, giving:

`bad sat bat`

The second pass sorts on the second column, producing no change in the order (all the characters are the same). The last pass sorts on the first column. This moves the `s` after the two `b`s, but preserves the order of the two words starting with `b`. The result is:

`bad bat sat`

There are three downsides to LSB radix sort:

- All the strings have to be the same length (this is not necessarily a problem if they are really fixed-width data types like `int`s).
- The algorithm used to sort each position must be stable, which may require more effort than most programmers would like to take.
- It may be that the late positions in the strings don't affect the order, but we have to sort on them anyway. If we are sorting `aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa` and `baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`, we spend a lot of time matching up `a`s against each other.

# 4. MSB radix sort

For these reasons, MSB radix sort is used more often. This is basically the radix sort version of QuickSort, where instead of partitioning our input data into two piles based on whether each element is less than or greater than a random pivot, we partition the input into 256 piles, one for each initial byte. We can then sort each pile recursively using the same algorithm, taking advantage of the fact that we know that the first byte (or later, the first k bytes) are equal and so we only need to look at the next one. The recursion stops when we get down to 1 value, or, in practice, when we get down to a small enough number of values that the cost of doing a different sorting algorithm becomes lower than the cost of creating and tearing down the data structures for managing the piles.

## 4.1. Issues with recursion depth

The depth of recursion for MSB radix sort is equal to the length of the second-longest string in the worst case. Since strings can be pretty long, this creates a danger of blowing out the stack. The solution (as in QuickSort) is to use tail recursion for the largest pile. Now any pile we recurse into with an actual procedure call is at most half the size of the original pile, so we get stack depth at most O(log n).

## 4.2. Implementing the buckets

There is a trick we can do analogous to the Dutch flag algorithm where we sort the array in place. The idea is that we first count the number of elements that land in each bucket and assign a block of the array for each bucket, keeping track in each block of an initial prefix of values that belong in the bucket, with the rest not yet processed. We then walk through the buckets, swapping out any elements at the top of the good prefix to the bucket they are supposed to be in. This procedure puts at least one element in the right bucket for each swap, so we reorder everything correctly in at most n swaps and O(n) additional work.

To keep track of each bucket, we use two pointers: `bucket[i]` for the first element of the bucket and `top[i]` for the first unused element. We could make these integer array indices instead, but that slows the code down by about 10%. This seems to be a situation where the index arithmetic is complicated enough that the compiler can't optimize away the extra array lookups.

## 4.3. Further optimization

Since we are detecting the heaviest bucket anyway, there is a straightforward optimization that speeds the sort up noticeably on inputs with a lot of duplicates: if everything would land in the same bucket, we can skip the bucket-sort and just go directly to the next character.

## 4.4. Sample implementation

Here is an implementation of MSB radix sort using the ideas above:

```
#include <assert.h>
#include <limits.h>
#include <string.h>

#include "radixsort.h"

/* in-place MSB radix sort for null-terminated strings */

/* helper routine for swapping */
static void
swapStrings(const char **a, const char **b)
{
    const char *temp;

    temp = *a;
    *a = *b;
    *b = temp;
}

/* this is the internal routine that assumes all strings are equal for the
 * first k characters */
static void
radixSortInternal(int n, const char **a, int k)
{
    int i;
    int count[UCHAR_MAX+1];            /* number of strings with given character in position k */
    int mode;                          /* most common position-k character */
    const char **bucket[UCHAR_MAX+1];  /* position of character block in output */
    const char **top[UCHAR_MAX+1];     /* first unused index in this character block */

    /* loop implements tail recursion on most common character */
    while(n > 1) {

        /* count occurrences of each character */
        memset(count, 0, sizeof(int)*(UCHAR_MAX+1));

        for(i = 0; i < n; i++) {
            count[(unsigned char) a[i][k]]++;
        }

        /* find the most common nonzero character */
        /* we will handle this specially */
        mode = 1;
        for(i = 2; i < UCHAR_MAX+1; i++) {
            if(count[i] > count[mode]) {
                mode = i;
            }
        }

        if(count[mode] < n) {

            /* generate bucket and top fields */
            bucket[0] = top[0] = a;
            for(i = 1; i < UCHAR_MAX+1; i++) {
                top[i] = bucket[i] = bucket[i-1] + count[i-1];
            }

            /* reorder elements by k-th character */
            /* this is similar to the Dutch flag algorithm */
            /* we start at the bottom character and swap values out until everything is in place */
            /* invariant is that for all i, bucket[i] <= j < top[i] implies a[j][k] == i */
            for(i = 0; i < UCHAR_MAX+1; i++) {
                while(top[i] < bucket[i] + count[i]) {
                    if((unsigned char) top[i][0][k] == i) {
                        /* leave it in place, advance top */
                        top[i]++;
                    } else {
                        /* swap with top of appropriate block */
                        swapStrings(top[i], top[(unsigned char) top[i][0][k]]++);
                    }
                }
            }

            /* we have now reordered everything */
            /* recurse on all but 0 and mode */
            for(i = 1; i < UCHAR_MAX+1; i++) {
                if(i != mode) {
                    radixSortInternal(count[i], bucket[i], k+1);
                }
            }

            /* tail recurse on mode */
            n = count[mode];
            a = bucket[mode];
            k = k+1;

        } else {

            /* tail recurse on whole pile */
            k = k+1;
        }
    }
}

void
radixSort(int n, const char **a)
{
    radixSortInternal(n, a, 0);
}
```

Some additional files: radixsort.h, test_radixsort.c, Makefile, sortInput.c. The last is a program that sorts lines on `stdin` and writes the result to `stdout`, similar to the GNU `sort` utility. When compiled with `-O3` and run on my machine, this runs in about the same time as the standard `sort` program on a 4.7 million line input file consisting of a random shuffle of 20 copies of `/usr/share/dict/words`, provided `sort` is run with `LANG=C` to keep it from having to deal with locale-specific collating issues. On other inputs, `sort` is faster. This is not bad given how thoroughly `sort` has been optimized, but is a sign that further optimization is possible.