[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

These are notes on practical implementations of suffix arrays, which are a data structure for searching quickly for substrings of a given large string. Some of these notes are adapted from the StringAlgorithms page from CS365.

1. Why do we want to do this?

2. String search algorithms

Without preprocessing, searching an n-character string for an m-character substring can be done using various sophisticated algorithms, the worst of which run in time O(nm) (run strncmp on each position in the big string), and best of which run in time O(n+m) Boyer-Moore string search algorithm. But we are interested in the case where we can preprocess our big string into a data structure that will let us do lots of searches for cheap.

3. Suffix trees and suffix arrays

Suffix trees and suffix arrays are data structures for representing texts that allow substring queries like "where does this pattern appear in the text" or "how many times does this pattern occur in the text" to be answered quickly. Both work by storing all suffixes of a text, where a suffix is a substring that runs to the end of the text. Of course, storing actual copies of all suffixes of an n-character text would take O(n2) space, so instead each suffix is represented by a pointer to its first character.

A suffix array stores all the suffixes sorted in dictionary order. For example, the suffix array of the string abracadabra is shown below. The actual contents of the array are the indices in the left-hand column; the right-hand shows the corresponding suffixes.

11  \0
10  a\0
 7  abra\0
 0  abracadabra\0
 3  acadabra\0
 5  adabra\0
 8  bra\0
 1  bracadabra\0
 4  cadabra\0
 6  dabra\0
 9  ra\0
 2  racadabra\0

A suffix tree is similar, but instead using an array, we use some sort of tree data structure to hold the sorted list. A common choice given an alphabet of some fixed size k is a trie (see RadixSearch), in which each node at depth d represents a string of length d, and its up to k children represent all (d+1)-character extensions of the string. The advantage of using a suffix trie is that searching for a string of length m takes O(m) time, since we can just walk down the trie at the rate of one node per character in m. A further optimization is to replace any long chain of single-child nodes with a compressed edge labeled with the concatenation all the characters in the chain. Such compressed suffix tries can not only be searched in linear time but can also be constructed in linear time with a sufficiently clever algorithm. Of course, we could also use a simple balanced binary tree, which might be preferable if the alphabet is large.

The disadvantage of suffix trees over suffix arrays is that they generally require more space to store all the internal pointers in the tree. If we are indexing a huge text (or collection of texts), this extra space may be too expensive.

3.1. Building a suffix array

A straightforward approach to building a suffix array is to run any decent comparison-based sorting algorithm on the set of suffixes (represented by pointers into the text). This will take O(n log n) comparisons, but in the worst case each comparison will take O(n) time, for a total of O(n2 log n) time. This is the approach used in the sample code below.

The original suffix array paper by Manber and Myers ("Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing 22(5):935-948, 1993) gives an O(n log n) algorithm, somewhat resembling radix sort, for building suffix arrays in place with only a small amount of additional space. They also note that for random text, simple radix sorting works well, since most suffixes become distinguishable after about logk n characters (where k is the size of the alphabet). Assuming random data would also give an O(n log2 n) running time for a comparison-based sort.

The fastest approach is to build a suffix tree in O(n) time and extract the suffix array by traversing the tree. The only complication is that we need the extra space to build the tree, although we get it back when we throw the tree away.

3.2. Searching a suffix array

Suppose we have a suffix array corresponding to an n-character text and we want to find all occurrences in the text of an m-character pattern. Since the suffixes are ordered, the easiest solution is to do binary search for the first and last occurrences of the pattern (if any) using O(log n) comparisons. (The code below does something even lazier than this, searching for some match and then scanning linearly for the first and last maches.) Unfortunately, each comparison may take as much as O(m) time, since we may have to check all m characters of the pattern. So the total cost will be O(m log n) in the worst case.

By storing additional information about the longest common prefix of regisions of contiguous suffixes, it is possible to avoid having to re-examine every character in the pattern for every comparison, reducing the search cost to O(m + log n). With a sufficiently clever algorithm, this information can be computed in linear time, and can also be used to solve quickly such problems as finding the longest duplicate substrings, or most frequently occurring strings (GusfieldBook ยง7.14.4).

Using binary search on the suffix array, most searching tasks are now easy:

4. Burrows-Wheeler transform

Closely related to suffix arrays is the Burrows-Wheeler transform (Burrows and Wheeler, A Block-Sorting Lossless Data Compression Algorithm, DEC Systems Research Center Technical Report number 124, 1994), which is the basis for some of the best currently known algorithms for text compression (it's the technique that is used, for example, in bzip2).

The idea of the Burrows-Wheeler Transform is to construct an array whose rows are all cyclic shifts of the input string in dictionary order, and return the last column of the array. The last column will tend to have long runs of identical characters, since whenever some substring (like the appears repeatedly in the input, shifts that put the first character t in the last column will put the rest of the substring he in the first columns, and the resulting rows will tend to be sorted together. The relative regularity of the last column means that it will compress well with even very simple compression algorithms like run-length encoding.

Below is an example of the Burrows-Wheeler transform in action, with $ marking end-of-text. The transformed value of abracadabra$ is $drcraaaabba, the last column of the sorted array; note the long run of a's (and the shorter run of b's).

abracadabra$     abracadabra$
bracadabra$a     abra$abracad
racadabra$ab     acadabra$abr
acadabra$abr     adabra$abrac
cadabra$abra     a$abracadabr
adabra$abrac     bracadabra$a
dabra$abraca --> bra$abracada
abra$abracad     cadabra$abra
bra$abracada     dabra$abraca
ra$abracadab     racadabra$ab
a$abracadabr     ra$abracadab
$abracadabra     $abracadabra

The most useful property of the Burrows-Wheeler transform is that it can be inverted; this distinguishes it from other transforms that produce long runs like simply sorting the characters. We'll describe two ways to do this; the first is less efficient, but more easily grasped, and involves rebuilding the array one column at a time, starting at the left. Observe that the leftmost column is just all the characters in the string in sorted order; we can recover it by sorting the rightmost column, which we have to start off with. If we paste the rightmost and leftmost columns together, we have the list of all 2-character substrings of the original text; sorting this list gives the first two columns of the array. (Remember that each copy of the string wraps around from the right to the left.) We can then paste the rightmost column at the beginning of these two columns, sort the result, and get the first three columns. Repeating this process eventually reconstructs the entire array, from which we can read off the original string from any row. The initial stages of this process for abracadabra$ are shown below:

$    a       $a    ab       $ab    abr
d    a       da    ab       dab    abr
r    a       ra    ac       rac    aca
c    a       ca    ad       cad    ada
r    a       ra    a$       ra$    a$a
a    b       ab    br       abr    bra
a -> b       ab -> br       abr -> bra
a    c       ac    ca       aca    cad
a    d       ad    da       ada    dab
b    r       br    ra       bra    rac
b    r       br    ra       bra    ra$
a    $       a$    $a       a$a    $ab

Rebuilding the entire array in this fashion takes O(n2) time and O(n2) space. In their paper, Burrows and Wheeler showed that one can in fact reconstruct the original string from just the first and last columns in the array in O(n) time.

Here's the idea: Suppose that all the characters were distinct. Then after reconstructing the first column we would know all pairs of adjacent characters. So we could just start with the last character $ and regenerate the string by appending at each step the unique successor to the last character so far. If all characters were distinct, we would never get confused about which character comes next.

The problem is what to do with pairs with duplicate first characters, like ab and ac in the example above. We can imagine that each a in the last column is labeled in some unique way, so that we can talk about the first a or the third a, but how do we know which a is the one that comes before b or d?

The trick is to look closely at how the original sort works. Look at the rows in the original transformation. If we look at all rows that start with a, the order they are sorted in is determined by the suffix after a. These suffixes also appear as the prefixes of the rows that end with a, since the rows that end with a are just the rows that start with a rotated one position. It follows that all instances of the same letter occur in the same order in the first and last columns. So if we use a stable sort to construct the first column, we will correctly match up instances of letters.

This method is shown in action below. Each letter is annotated uniquely with a count of how many identical letters equal or precede it. Sorting recovers the first column, and combining the last and first columns gives a list of unique pairs of adjacent annotated characters. Now start with $1 and construct the full sequence $1 a1 b1 r1 a3 c1 a4 d1 a2 b2 r2 a5 $1. The original string is obtained by removing the end-of-string markers and annotations: abracadabra.

$1     a1
d1     a2
r1     a3
c1     a4
r2     a5
a1     b1
a2 --> b2
a3     c1
a4     d1
b1     r1
b2     r2
a5     $1

Because we are only sorting single characters, we can perform the sort in linear time using counting sort. Extracting the original string also takes linear time if implemented reasonably.

4.1. Suffix arrays and the Burrows-Wheeler transform

A useful property of the Burrows-Wheeler transform is that each row of the sorted array is essentially the same as the corresponding row in the suffix array, except for the rotated string prefix after the $ marker. This means, among other things, that we can compute the Burrows-Wheeler transform in linear time using suffix trees. Ferragina and Manzini (http://www.imc.pi.cnr.it/~manzini/papers/focs00.html) have further exploited this correspondence (and some very clever additional ideas) to design compressed suffix arrays that compress and index a text at the same time, so that pattern searches can be done directly on the compressed text in time close to that needed for suffix array searches.

5. Sample implementation

As mentioned above, this is a pretty lazy implementation of suffix arrays, that doesn't include many of the optimizations that would be necessary to deal with huge source texts.

   1 /* we expose this so user can iterate through it */
   2 
   3 struct suffixArray {
   4     size_t n;               /* length of string INCLUDING final null */
   5     const char *string;     /* original string */
   6     const char **suffix;    /* suffix array of length n */
   7 };
   8 
   9 typedef struct suffixArray *SuffixArray;
  10 
  11 /* construct a suffix array */
  12 /* it is a bad idea to modify string before destroying this */
  13 SuffixArray suffixArrayCreate(const char *string);
  14 
  15 /* destructor */
  16 void suffixArrayDestroy(SuffixArray);
  17 
  18 /* return number of occurrences of substring */
  19 /* if non-null, index of first occurrence is place in first */
  20 size_t
  21 suffixArraySearch(SuffixArray, const char *substring, size_t *first);
  22 
  23 /* return the Burrows-Wheeler transform of the underlying string 
  24  * as malloc'd data of length sa->n */
  25 /* note that this may have a null in the middle somewhere */
  26 char *suffixArrayBWT(SuffixArray sa);
  27 
  28 /* invert BWT of null-terminated string, returning a malloc'd copy of original */
  29 char *inverseBWT(size_t len, const char *s);
suffixArray.h

   1 #include <stdlib.h>
   2 #include <assert.h>
   3 #include <string.h>
   4 #include <limits.h>
   5 
   6 #include "suffixArray.h"
   7 
   8 static int
   9 saCompare(const void *s1, const void *s2)
  10 {
  11     return strcmp(*((const char **) s1), *((const char **) s2));
  12 }
  13 
  14 SuffixArray
  15 suffixArrayCreate(const char *s)
  16 {
  17     size_t i;
  18     SuffixArray sa;
  19 
  20     sa = malloc(sizeof(*sa));
  21     assert(sa);
  22 
  23     sa->n = strlen(s) + 1;
  24     sa->string = s;
  25 
  26     sa->suffix = malloc(sizeof(*sa->suffix) * sa->n);
  27     assert(sa->suffix);
  28 
  29     /* construct array of pointers to suffixes */
  30     for(i = 0; i < sa->n; i++) {
  31         sa->suffix[i] = s+i;
  32     }
  33 
  34     /* this could be a lot more efficient */
  35     qsort(sa->suffix, sa->n, sizeof(*sa->suffix), saCompare);
  36 
  37     return sa;
  38 }
  39 
  40 void
  41 suffixArrayDestroy(SuffixArray sa)
  42 {
  43     free(sa->suffix);
  44     free(sa);
  45 }
  46 
  47 size_t
  48 suffixArraySearch(SuffixArray sa, const char *substring, size_t *first)
  49 {
  50     size_t lo;
  51     size_t hi;
  52     size_t mid;
  53     size_t len;
  54     int cmp;
  55 
  56     len = strlen(substring);
  57 
  58     /* invariant: suffix[lo] <= substring < suffix[hi] */
  59     lo = 0;
  60     hi = sa->n;
  61 
  62     while(lo + 1 < hi) {
  63         mid = (lo+hi)/2;
  64         cmp = strncmp(sa->suffix[mid], substring, len);
  65 
  66         if(cmp == 0) {
  67             /* we have a winner */
  68             /* search backwards and forwards for first and last */
  69             for(lo = mid; lo > 0 && strncmp(sa->suffix[lo-1], substring, len) == 0; lo--);
  70             for(hi = mid; hi < sa->n && strncmp(sa->suffix[hi+1], substring, len) == 0; hi++);
  71 
  72             if(first) {
  73                 *first = lo;
  74             }
  75 
  76             return hi - lo + 1;
  77         } else if(cmp < 0) {
  78             lo = mid;
  79         } else {
  80             hi = mid;
  81         }
  82     }
  83 
  84     return 0;
  85 }
  86 
  87 char *
  88 suffixArrayBWT(SuffixArray sa)
  89 {
  90     char *bwt;
  91     size_t i;
  92 
  93     bwt = malloc(sa->n);
  94     assert(bwt);
  95 
  96     for(i = 0; i < sa->n; i++) {
  97         if(sa->suffix[i] == sa->string) {
  98             /* wraps around to nul */
  99             bwt[i] = '\0';
 100         } else {
 101             bwt[i] = sa->suffix[i][-1];
 102         }
 103     }
 104 
 105     return bwt;
 106 }
 107 
 108 char *
 109 inverseBWT(size_t len, const char *s)
 110 {
 111     /* basic trick: stable sort of s gives successor indices */
 112     /* then we just thread through starting from the nul */
 113 
 114     size_t *successor;
 115     int c;
 116     size_t count[UCHAR_MAX+1];
 117     size_t offset[UCHAR_MAX+1];
 118     size_t i;
 119     char *ret;
 120     size_t thread;
 121 
 122     successor = malloc(sizeof(*successor) * len);
 123     assert(successor);
 124 
 125     /* counting sort */
 126     for(c = 0; c <= UCHAR_MAX; c++) {
 127         count[c] = 0;
 128     }
 129 
 130     for(i = 0; i < len; i++) {
 131         count[(unsigned char) s[i]]++;
 132     }
 133 
 134     offset[0] = 0;
 135 
 136     for(c = 1; c <= UCHAR_MAX; c++) {
 137         offset[c] = offset[c-1] + count[c-1];
 138     }
 139 
 140     for(i = 0; i < len; i++) {
 141         successor[offset[(unsigned char) s[i]]++] = i;
 142     }
 143 
 144     /* find the nul */
 145     for(thread = 0; s[thread]; thread++);
 146 
 147     /* thread the result */
 148     ret = malloc(len);
 149     assert(ret);
 150 
 151     for(i = 0, thread = successor[thread]; i < len; i++, thread = successor[thread]) {
 152         ret[i] = s[thread];
 153     }
 154 
 155     return ret;
 156 }
suffixArray.c

Here is a Makefile and test code: Makefile, testSuffixArray.c.

The output of make test shows all occurences of a target string, the Burrows-Wheeler transform of a the source string (second-to-last line), and its inversion (last line, which is just the original string):

$ make test
/bin/echo -n abracadabra-abracadabra-shmabracadabra | ./testSuffixArray abra
Count: 6
abra
abra-abr
abra-shm
abracada
abracada
abracada
aaarrrdddm\x00-rrrcccaaaaaaaaaaaashbbbbbb-
abracadabra-abracadabra-shmabracadabra


CategoryProgrammingNotes


2014-06-17 11:58