These are notes on practical implementations of suffix arrays, which are a data structure for searching quickly for substrings of a given large string. Some of these notes are adapted from the StringAlgorithms page from CS365.
1. Why do we want to do this?
- Answer from the old days: Fast string searching is the key to dealing with mountains of information. Why, a modern (c. 1960) electronic computer can search the equivalent of hundreds of pages of text in just a few hours...
- More recent answer:
We still need to search enormous corpora of text (see http://www.google.com).
Algorithms for finding long repeated substrings or patterns can be useful for data compression (see Data_compression) or detecting plagiarism.
Finding all occurrences of a particular substring in some huge corpus is the basis of statistical machine translation.
- We are made out of strings over a particular finite alphabet GATC. String search is a central tool in computational biology.
2. String search algorithms
Without preprocessing, searching an n-character string for an m-character substring can be done using various algorithms, the simplest of which (run strncmp at each position in the big string) takes O(nm) time in the worst case, and the best of which, such as the Boyer-Moore string search algorithm, run in O(n+m) time. But we are interested in the case where we can preprocess our big string into a data structure that will let us do lots of searches for cheap.
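To make the no-preprocessing baseline concrete, here is a small sketch of the simplest O(nm) approach described above; the function name is invented for illustration:

```c
#include <string.h>

/* count occurrences of needle in haystack by running strncmp at
 * every starting position: O(nm) worst case, no preprocessing */
size_t
naiveCount(const char *haystack, const char *needle)
{
    size_t n = strlen(haystack);
    size_t m = strlen(needle);
    size_t count = 0;
    size_t i;

    if(m == 0 || m > n) {
        return 0;
    }

    for(i = 0; i + m <= n; i++) {
        if(strncmp(haystack + i, needle, m) == 0) {
            count++;
        }
    }

    return count;
}
```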
3. Suffix trees and suffix arrays
Suffix trees and suffix arrays are data structures for representing texts that allow substring queries like "where does this pattern appear in the text" or "how many times does this pattern occur in the text" to be answered quickly. Both work by storing all suffixes of a text, where a suffix is a substring that runs to the end of the text. Of course, storing actual copies of all suffixes of an n-character text would take O(n^2) space, so instead each suffix is represented by a pointer to its first character.
A suffix array stores all the suffixes sorted in dictionary order. For example, the suffix array of the string abracadabra is shown below. The actual contents of the array are the indices in the left-hand column; the right-hand column shows the corresponding suffixes.
11 \0
10 a\0
 7 abra\0
 0 abracadabra\0
 3 acadabra\0
 5 adabra\0
 8 bra\0
 1 bracadabra\0
 4 cadabra\0
 6 dabra\0
 9 ra\0
 2 racadabra\0
A suffix tree is similar, but instead of using an array, we use some sort of tree data structure to hold the sorted list. A common choice given an alphabet of some fixed size k is a trie (see RadixSearch), in which each node at depth d represents a string of length d, and its up to k children represent all (d+1)-character extensions of the string. The advantage of using a suffix trie is that searching for a string of length m takes O(m) time, since we can just walk down the trie at the rate of one node per character in m. A further optimization is to replace any long chain of single-child nodes with a compressed edge labeled with the concatenation of all the characters in the chain. Such compressed suffix tries can not only be searched in linear time but can also be constructed in linear time with a sufficiently clever algorithm. Of course, we could also use a simple balanced binary tree, which might be preferable if the alphabet is large.
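An uncompressed suffix trie and the O(m) walk down it can be sketched as below. This is not part of the original notes: the names and the first-child/next-sibling representation (which keeps node size independent of the alphabet size, at the cost of scanning a child list at each step) are illustrative choices.

```c
#include <stdlib.h>

/* a node in an uncompressed suffix trie; children are kept in a
 * first-child/next-sibling list so node size does not depend on
 * the alphabet size */
struct trieNode {
    char c;
    struct trieNode *child;   /* first child */
    struct trieNode *sibling; /* next child of this node's parent */
};

static struct trieNode *
findChild(struct trieNode *t, char c)
{
    struct trieNode *p;

    for(p = t->child; p != 0; p = p->sibling) {
        if(p->c == c) {
            return p;
        }
    }

    return 0;
}

/* add one suffix to the trie rooted at root */
void
trieInsertSuffix(struct trieNode *root, const char *s)
{
    for(; *s; s++) {
        struct trieNode *next = findChild(root, *s);

        if(next == 0) {
            next = calloc(1, sizeof(*next));
            next->c = *s;
            next->sibling = root->child;
            root->child = next;
        }

        root = next;
    }
}

/* walk down the trie one node per pattern character: O(m) steps,
 * each costing a scan of one child list */
int
trieSearch(struct trieNode *root, const char *pattern)
{
    for(; *pattern; pattern++) {
        root = findChild(root, *pattern);
        if(root == 0) {
            return 0;
        }
    }

    return 1;
}
```

Inserting all n suffixes of an n-character text this way takes O(n^2) time and space in the worst case; the linear-time constructions mentioned above are considerably cleverer.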
The disadvantage of suffix trees over suffix arrays is that they generally require more space to store all the internal pointers in the tree. If we are indexing a huge text (or collection of texts), this extra space may be too expensive.
3.1. Building a suffix array
A straightforward approach to building a suffix array is to run any decent comparison-based sorting algorithm on the set of suffixes (represented by pointers into the text). This will take O(n log n) comparisons, but in the worst case each comparison will take O(n) time, for a total of O(n^2 log n) time. This is the approach used in the sample code below.
The original suffix array paper by Manber and Myers ("Suffix arrays: a new method for on-line string searches," SIAM Journal on Computing 22(5):935-948, 1993) gives an O(n log n) algorithm, somewhat resembling radix sort, for building suffix arrays in place with only a small amount of additional space. They also note that for random text, simple radix sorting works well, since most suffixes become distinguishable after about log_k n characters (where k is the size of the alphabet). Assuming random data would also give an O(n log^2 n) running time for a comparison-based sort.
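A sketch of the related prefix-doubling idea (not Manber and Myers' exact algorithm; the function names here are invented): after round number r, every suffix has a rank determined by its first 2^r characters, so sorting pairs of ranks doubles the sorted prefix length each round. With qsort doing the per-round sorting this is O(n log^2 n) overall.

```c
#include <stdlib.h>
#include <string.h>

/* state shared with the qsort comparator */
static const int *gRank;
static size_t gK, gN;

/* compare suffixes i and j by (rank of first k chars, rank of next k) */
static int
doubleCompare(const void *a, const void *b)
{
    size_t i = *(const size_t *) a;
    size_t j = *(const size_t *) b;
    int ri, rj;

    if(gRank[i] != gRank[j]) {
        return gRank[i] - gRank[j];
    }

    /* rank of the second half; -1 stands for "past the end" */
    ri = i + gK < gN ? gRank[i + gK] : -1;
    rj = j + gK < gN ? gRank[j + gK] : -1;

    return ri - rj;
}

/* build the suffix array of s by prefix doubling: O(log n) rounds,
 * each sorting by pairs of ranks from the previous round */
size_t *
suffixArrayDouble(const char *s)
{
    size_t n = strlen(s);
    size_t *sa = malloc(sizeof(*sa) * n);
    int *rank = malloc(sizeof(*rank) * n);
    int *newRank = malloc(sizeof(*newRank) * n);
    size_t i, k;

    if(n == 0) {
        free(rank);
        free(newRank);
        return sa;
    }

    for(i = 0; i < n; i++) {
        sa[i] = i;
        rank[i] = (unsigned char) s[i];
    }

    for(k = 1; ; k *= 2) {
        gRank = rank;
        gK = k;
        gN = n;

        qsort(sa, n, sizeof(*sa), doubleCompare);

        /* suffixes that compare equal share the new rank */
        newRank[sa[0]] = 0;
        for(i = 1; i < n; i++) {
            newRank[sa[i]] = newRank[sa[i-1]]
                + (doubleCompare(&sa[i-1], &sa[i]) != 0);
        }
        memcpy(rank, newRank, sizeof(*rank) * n);

        if(rank[sa[n-1]] == (int) n - 1) {
            break;  /* all ranks distinct: fully sorted */
        }
    }

    free(rank);
    free(newRank);

    return sa;
}
```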
The fastest approach is to build a suffix tree in O(n) time and extract the suffix array by traversing the tree. The only complication is that we need the extra space to build the tree, although we get it back when we throw the tree away.
3.2. Searching a suffix array
Suppose we have a suffix array corresponding to an n-character text and we want to find all occurrences in the text of an m-character pattern. Since the suffixes are ordered, the easiest solution is to do binary search for the first and last occurrences of the pattern (if any) using O(log n) comparisons. (The code below does something even lazier than this, searching for some match and then scanning linearly for the first and last matches.) Unfortunately, each comparison may take as much as O(m) time, since we may have to check all m characters of the pattern. So the total cost will be O(m log n) in the worst case.
By storing additional information about the longest common prefix of regions of contiguous suffixes, it is possible to avoid having to re-examine every character in the pattern for every comparison, reducing the search cost to O(m + log n). With a sufficiently clever algorithm, this information can be computed in linear time, and can also be used to quickly solve such problems as finding the longest duplicate substrings, or most frequently occurring strings (GusfieldBook §7.14.4).
Using binary search on the suffix array, most searching tasks are now easy:
- Finding if a substring appears in the array uses binary search directly.
- Finding all occurrences requires two binary searches, one for the first occurrence and one for the last. If we only want to count the occurrences and not return their positions, this takes O(m + log n) time. If we want to return their positions, it takes O(m + log n + k) time, where k is the number of times the pattern occurs.
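The two binary searches for first and last occurrence can be sketched as below; this is tighter than the lazy linear scan in the sample code further down. The function names are invented for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* strcmp on the pointed-to suffixes, for qsort */
static int
suffixCompare(const void *a, const void *b)
{
    return strcmp(*(const char **) a, *(const char **) b);
}

/* count occurrences of pat among the n sorted suffixes; if first is
 * non-null and there is at least one match, the index of the first
 * matching suffix is stored there */
size_t
sortedSuffixCount(const char **suffix, size_t n, const char *pat,
                  size_t *first)
{
    size_t m = strlen(pat);
    size_t lo, hi, mid, start;

    /* lower bound: first suffix whose length-m prefix is >= pat */
    lo = 0;
    hi = n;
    while(lo < hi) {
        mid = lo + (hi - lo) / 2;
        if(strncmp(suffix[mid], pat, m) < 0) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    start = lo;

    /* upper bound: first suffix whose length-m prefix is > pat */
    hi = n;
    while(lo < hi) {
        mid = lo + (hi - lo) / 2;
        if(strncmp(suffix[mid], pat, m) <= 0) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }

    if(first != 0 && lo > start) {
        *first = start;
    }

    return lo - start;
}
```

Each binary search does O(log n) strncmp calls of cost O(m), giving the O(m log n) bound from the text.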
Finding duplicate substrings of length m or more can be done by looking for adjacent entries in the array with long common prefixes, which takes O(mn) time in the worst case if done naively (and O(n) time if we have already computed longest common prefix information; see GusfieldBook).
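As a rough illustration of the naive adjacent-entries approach (this sketch rebuilds the suffix array itself with qsort, so it does not achieve the quoted bounds; the function name is invented):

```c
#include <stdlib.h>
#include <string.h>

static int
suffixPointerCompare(const void *a, const void *b)
{
    return strcmp(*(const char **) a, *(const char **) b);
}

/* return a malloc'd copy of the longest substring occurring at least
 * twice in s (the empty string if there is none): sort the suffixes,
 * then take the longest common prefix of adjacent sorted entries */
char *
longestRepeat(const char *s)
{
    size_t n = strlen(s);
    size_t best = 0, bestAt = 0;
    const char **suf;
    char *ret;
    size_t i;

    ret = malloc(n + 1);
    ret[0] = '\0';
    if(n < 2) {
        return ret;
    }

    suf = malloc(sizeof(*suf) * n);
    for(i = 0; i < n; i++) {
        suf[i] = s + i;
    }
    qsort(suf, n, sizeof(*suf), suffixPointerCompare);

    for(i = 1; i < n; i++) {
        size_t k = 0;

        /* common prefix of two adjacent sorted suffixes */
        while(suf[i-1][k] != '\0' && suf[i-1][k] == suf[i][k]) {
            k++;
        }
        if(k > best) {
            best = k;
            bestAt = i;
        }
    }

    memcpy(ret, suf[bestAt], best);
    ret[best] = '\0';
    free(suf);

    return ret;
}
```

Any repeated substring is a prefix of two different suffixes, and those two suffixes end up adjacent (or separated only by suffixes sharing the same prefix) in sorted order, which is why scanning adjacent pairs suffices.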
4. Burrows-Wheeler transform
Closely related to suffix arrays is the Burrows-Wheeler transform (Burrows and Wheeler, A Block-Sorting Lossless Data Compression Algorithm, DEC Systems Research Center Technical Report number 124, 1994), which is the basis for some of the best currently known algorithms for text compression (it's the technique that is used, for example, in bzip2).
The idea of the Burrows-Wheeler transform is to construct an array whose rows are all cyclic shifts of the input string in dictionary order, and return the last column of the array. The last column will tend to have long runs of identical characters, since whenever some substring (like "the") appears repeatedly in the input, the shifts that put its first character "t" in the last column will put the rest of the substring "he" in the first columns, and the resulting rows will tend to be sorted together. The relative regularity of the last column means that it will compress well with even very simple compression algorithms like run-length encoding.
Below is an example of the Burrows-Wheeler transform in action, with $ marking end-of-text. The transformed value of abracadabra$ is $drcraaaabba, the last column of the sorted array; note the long run of a's (and the shorter run of b's).
abracadabra$     abracadabra$
bracadabra$a     abra$abracad
racadabra$ab     acadabra$abr
acadabra$abr     adabra$abrac
cadabra$abra     a$abracadabr
adabra$abrac     bracadabra$a
dabra$abraca --> bra$abracada
abra$abracad     cadabra$abra
bra$abracada     dabra$abraca
ra$abracadab     racadabra$ab
a$abracadabr     ra$abracadab
$abracadabra     $abracadabra
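The forward transform can be sketched directly from this definition by building and sorting all the rotations. This is an illustrative O(n^2)-space version, not the sample code's approach (which derives the BWT from a suffix array); the '$'-sorts-last convention and the function name are choices made to match the example above.

```c
#include <stdlib.h>
#include <string.h>

/* rank characters so that the '$' end marker sorts after every
 * letter, matching the convention in the example above */
static int
charRank(char c)
{
    return c == '$' ? 256 : (unsigned char) c;
}

static int
rotationCompare(const void *a, const void *b)
{
    const char *x = *(const char **) a;
    const char *y = *(const char **) b;

    while(*x != '\0' && *x == *y) {
        x++;
        y++;
    }

    return charRank(*x) - charRank(*y);
}

/* Burrows-Wheeler transform of s (which should end with a unique
 * '$') by explicitly building and sorting all n rotations: O(n^2)
 * space, fine for a demonstration but not for real inputs */
char *
bwtBySorting(const char *s)
{
    size_t n = strlen(s);
    char **rot = malloc(sizeof(*rot) * n);
    char *out = malloc(n + 1);
    size_t i, j;

    for(i = 0; i < n; i++) {
        rot[i] = malloc(n + 1);
        for(j = 0; j < n; j++) {
            rot[i][j] = s[(i + j) % n];
        }
        rot[i][n] = '\0';
    }

    qsort(rot, n, sizeof(*rot), rotationCompare);

    for(i = 0; i < n; i++) {
        out[i] = rot[i][n-1];   /* last column of the sorted array */
        free(rot[i]);
    }
    out[n] = '\0';
    free(rot);

    return out;
}
```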
The most useful property of the Burrows-Wheeler transform is that it can be inverted; this distinguishes it from other transforms that produce long runs like simply sorting the characters. We'll describe two ways to do this; the first is less efficient, but more easily grasped, and involves rebuilding the array one column at a time, starting at the left. Observe that the leftmost column is just all the characters in the string in sorted order; we can recover it by sorting the rightmost column, which we have to start off with. If we paste the rightmost and leftmost columns together, we have the list of all 2-character substrings of the original text; sorting this list gives the first two columns of the array. (Remember that each copy of the string wraps around from the right to the left.) We can then paste the rightmost column at the beginning of these two columns, sort the result, and get the first three columns. Repeating this process eventually reconstructs the entire array, from which we can read off the original string from any row. The initial stages of this process for abracadabra$ are shown below:
$    a    $a    ab    $ab    abr
d    a    da    ab    dab    abr
r    a    ra    ac    rac    aca
c    a    ca    ad    cad    ada
r    a    ra    a$    ra$    a$a
a    b    ab    br    abr    bra
a -> b    ab -> br    abr -> bra
a    c    ac    ca    aca    cad
a    d    ad    da    ada    dab
b    r    br    ra    bra    rac
b    r    br    ra    bra    ra$
a    $    a$    $a    a$a    $ab
Rebuilding the entire array in this fashion takes O(n^2) time and O(n^2) space. In their paper, Burrows and Wheeler showed that one can in fact reconstruct the original string from just the first and last columns in the array in O(n) time.
Here's the idea: Suppose that all the characters were distinct. Then after reconstructing the first column we would know all pairs of adjacent characters. So we could just start with the last character $ and regenerate the string by appending at each step the unique successor to the last character so far. If all characters were distinct, we would never get confused about which character comes next.
The problem is what to do with pairs with duplicate first characters, like ab and ac in the example above. We can imagine that each a in the last column is labeled in some unique way, so that we can talk about the first a or the third a, but how do we know which a is the one that comes before b or d?
The trick is to look closely at how the original sort works. Look at the rows in the original transformation. If we look at all rows that start with a, the order they are sorted in is determined by the suffix after a. These suffixes also appear as the prefixes of the rows that end with a, since the rows that end with a are just the rows that start with a rotated one position. It follows that all instances of the same letter occur in the same order in the first and last columns. So if we use a stable sort to construct the first column, we will correctly match up instances of letters.
This method is shown in action below. Each letter is annotated uniquely with a count of how many identical letters equal or precede it. Sorting recovers the first column, and combining the last and first columns gives a list of unique pairs of adjacent annotated characters. Now start with $1 and construct the full sequence $1 a1 b1 r1 a3 c1 a4 d1 a2 b2 r2 a5 $1. The original string is obtained by removing the end-of-string markers and annotations: abracadabra.
$1     a1
d1     a2
r1     a3
c1     a4
r2     a5
a1     b1
a2 --> b2
a3     c1
a4     d1
b1     r1
b2     r2
a5     $1
Because we are only sorting single characters, we can perform the sort in linear time using counting sort. Extracting the original string also takes linear time if implemented reasonably.
4.1. Suffix arrays and the Burrows-Wheeler transform
A useful property of the Burrows-Wheeler transform is that each row of the sorted array is essentially the same as the corresponding row in the suffix array, except for the rotated string prefix after the $ marker. This means, among other things, that we can compute the Burrows-Wheeler transform in linear time using suffix trees. Ferragina and Manzini (http://www.imc.pi.cnr.it/~manzini/papers/focs00.html) have further exploited this correspondence (and some very clever additional ideas) to design compressed suffix arrays that compress and index a text at the same time, so that pattern searches can be done directly on the compressed text in time close to that needed for suffix array searches.
5. Sample implementation
As mentioned above, this is a pretty lazy implementation of suffix arrays that doesn't include many of the optimizations that would be necessary to deal with huge source texts.
/* we expose this so user can iterate through it */

#include <stddef.h>  /* for size_t */

struct suffixArray {
    size_t n;            /* length of string INCLUDING final null */
    const char *string;  /* original string */
    const char **suffix; /* suffix array of length n */
};

typedef struct suffixArray *SuffixArray;

/* construct a suffix array */
/* it is a bad idea to modify string before destroying this */
SuffixArray suffixArrayCreate(const char *string);

/* destructor */
void suffixArrayDestroy(SuffixArray);

/* return number of occurrences of substring */
/* if non-null, index of first occurrence is placed in first */
size_t
suffixArraySearch(SuffixArray, const char *substring, size_t *first);

/* return the Burrows-Wheeler transform of the underlying string
 * as malloc'd data of length sa->n */
/* note that this may have a null in the middle somewhere */
char *suffixArrayBWT(SuffixArray sa);

/* invert BWT of null-terminated string, returning a malloc'd copy of original */
char *inverseBWT(size_t len, const char *s);
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <limits.h>

#include "suffixArray.h"

static int
saCompare(const void *s1, const void *s2)
{
    return strcmp(*((const char **) s1), *((const char **) s2));
}

SuffixArray
suffixArrayCreate(const char *s)
{
    size_t i;
    SuffixArray sa;

    sa = malloc(sizeof(*sa));
    assert(sa);

    sa->n = strlen(s) + 1;
    sa->string = s;

    sa->suffix = malloc(sizeof(*sa->suffix) * sa->n);
    assert(sa->suffix);

    /* construct array of pointers to suffixes */
    for(i = 0; i < sa->n; i++) {
        sa->suffix[i] = s+i;
    }

    /* this could be a lot more efficient */
    qsort(sa->suffix, sa->n, sizeof(*sa->suffix), saCompare);

    return sa;
}

void
suffixArrayDestroy(SuffixArray sa)
{
    free(sa->suffix);
    free(sa);
}

size_t
suffixArraySearch(SuffixArray sa, const char *substring, size_t *first)
{
    size_t lo;
    size_t hi;
    size_t mid;
    size_t len;
    int cmp;

    len = strlen(substring);

    /* invariant: suffix[lo] <= substring < suffix[hi] */
    lo = 0;
    hi = sa->n;

    while(lo + 1 < hi) {
        mid = (lo+hi)/2;
        cmp = strncmp(sa->suffix[mid], substring, len);

        if(cmp == 0) {
            /* we have a winner */
            /* search backwards and forwards for first and last */
            for(lo = mid; lo > 0 && strncmp(sa->suffix[lo-1], substring, len) == 0; lo--);
            for(hi = mid; hi + 1 < sa->n && strncmp(sa->suffix[hi+1], substring, len) == 0; hi++);

            if(first) {
                *first = lo;
            }

            return hi - lo + 1;
        } else if(cmp < 0) {
            lo = mid;
        } else {
            hi = mid;
        }
    }

    return 0;
}

char *
suffixArrayBWT(SuffixArray sa)
{
    char *bwt;
    size_t i;

    bwt = malloc(sa->n);
    assert(bwt);

    for(i = 0; i < sa->n; i++) {
        if(sa->suffix[i] == sa->string) {
            /* wraps around to nul */
            bwt[i] = '\0';
        } else {
            bwt[i] = sa->suffix[i][-1];
        }
    }

    return bwt;
}

char *
inverseBWT(size_t len, const char *s)
{
    /* basic trick: stable sort of s gives successor indices */
    /* then we just thread through starting from the nul */

    size_t *successor;
    int c;
    size_t count[UCHAR_MAX+1];
    size_t offset[UCHAR_MAX+1];
    size_t i;
    char *ret;
    size_t thread;

    successor = malloc(sizeof(*successor) * len);
    assert(successor);

    /* counting sort */
    for(c = 0; c <= UCHAR_MAX; c++) {
        count[c] = 0;
    }

    for(i = 0; i < len; i++) {
        count[(unsigned char) s[i]]++;
    }

    offset[0] = 0;

    for(c = 1; c <= UCHAR_MAX; c++) {
        offset[c] = offset[c-1] + count[c-1];
    }

    for(i = 0; i < len; i++) {
        successor[offset[(unsigned char) s[i]]++] = i;
    }

    /* find the nul */
    for(thread = 0; s[thread]; thread++);

    /* thread the result */
    ret = malloc(len);
    assert(ret);

    for(i = 0, thread = successor[thread]; i < len; i++, thread = successor[thread]) {
        ret[i] = s[thread];
    }

    free(successor);

    return ret;
}
Here is a Makefile and test code: Makefile, testSuffixArray.c.
The output of make test shows all occurrences of a target string, the Burrows-Wheeler transform of the source string (second-to-last line), and its inversion (last line, which is just the original string):
$ make test
/bin/echo -n abracadabra-abracadabra-shmabracadabra | ./testSuffixArray abra
Count: 6
abra
abra-abr
abra-shm
abracada
abracada
abracada
aaarrrdddm\x00-rrrcccaaaaaaaaaaaashbbbbbb-
abracadabra-abracadabra-shmabracadabra