[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Radix search refers to a variety of data structures that support searching for strings considered as sequences of digits in some large base (or radix). These are generally faster than simple BinarySearchTrees because they usually only require examining one byte or less of the target string at each level of the tree, as compared to every byte in the target in a full string comparison. In many cases, the best radix search trees are even faster than HashTables, because they only need to look at a small part of the target string to identify it.

We'll describe several radix search trees, starting with the simplest and working up.

1. Tries

A trie is a binary tree (or more generally, a k-ary tree where k is the radix) where the root represents the empty bit sequence and the two children of a node representing sequence x represent the extended sequences x0 and x1 (or generally x0, x1, ... x(k-1)). So a key is not stored at a particular node but is instead represented bit-by-bit (or digit-by-digit) along some path. Typically a trie assumes that the set of keys is prefix-free, i.e. that no key is a prefix of another; in this case there is a one-to-one correspondence between keys and leaves of the trie. If this is not the case, we can mark internal nodes that also correspond to the ends of keys, getting a slightly different data structure known as a digital search tree. For null-terminated strings as in C, the null terminator ensures that any set of strings is prefix-free.

Given this simple description, a trie storing a single long key would have a very large number of nodes. A standard optimization is to chop off any path with no branches in it, so that each leaf corresponds to the shortest unique prefix of a key. This requires storing the key in the leaf so that we can distinguish different keys with the same prefix.

The name trie comes from the phrase "information retrieval." Despite the etymology, trie is now almost always pronounced like try instead of tree to avoid confusion with other tree data structures.

1.1. Searching a trie

Searching a trie is similar to searching a binary search tree, except that instead of doing a comparison at each step we just look at the next bit in the target. The time to perform a search is proportional to the number of bits in the longest path in the tree matching a prefix of the target. This can be very fast for search misses if the target is wildly different from all the keys in the tree.

1.2. Inserting a new element into a trie

Insertion is more complicated for tries than for binary search trees. The reason is that a new element may add more than one new node. There are essentially two cases:

  1. (The simple case.) In searching for the new key, we reach a null pointer leaving a non-leaf node. In this case we can simply add a new leaf. The cost of this case is essentially the same as for search plus O(1) for building the new leaf.
  2. (The other case.) In searching for the new key, we reach a leaf, but the key stored there isn't the same as the new key. Now we have to generate a new path for as long as the old key and the new key have the same bits, branching out to two different leaves at the end. The cost of this operation is within a constant factor of the cost for searching for the new leaf after it is inserted, since that's how long the newly-built search path will be.

In either case, the cost is bounded by the length of the new key, which is about the best we can hope for in the worst case for any data structure.

1.3. Implementation

A typical trie implementation in C might look like this. It uses a GET_BIT macro similar to the one from C/BitExtraction, except that we reverse the bits within each byte to get the right sorting order for keys.

   1 typedef struct trie_node *Trie;
   2 
   3 #define EMPTY_TRIE (0)
   4 
   5 /* returns 1 if trie contains target */
   6 int trie_contains(Trie trie, const char *target);
   7 
   8 /* add a new key to a trie */
   9 /* and return the new trie */
  10 Trie trie_insert(Trie trie, const char *key);
  11 
  12 /* free a trie */
  13 void trie_destroy(Trie);
  14 
  15 /* debugging utility: print all keys in trie */
  16 void trie_print(Trie);
trie.h

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <string.h>
   4 #include <assert.h>
   5 
   6 #include "trie.h"
   7 
   8 #define BITS_PER_BYTE (8)
   9 
  10 /* extract the n-th bit of x */
  11 /* here we process bits within bytes in MSB-first order */
  12 /* this sorts like strcmp */
  13 #define GET_BIT(x, n) ((((x)[(n) / BITS_PER_BYTE]) & (0x1 << (BITS_PER_BYTE - 1 - (n) % BITS_PER_BYTE))) != 0)
  14 
  15 #define TRIE_BASE (2)
  16 
  17 struct trie_node {
  18     char *key;
  19     struct trie_node *kids[TRIE_BASE];
  20 };
  21 
  22 #define IsLeaf(t) ((t)->kids[0] == 0 && (t)->kids[1] == 0)
  23 
  24 /* returns 1 if trie contains target */
  25 int
  26 trie_contains(Trie trie, const char *target)
  27 {
  28     int bit;
  29 
  30     for(bit = 0; trie && !IsLeaf(trie); bit++) {
  31         /* keep going */
  32         trie = trie->kids[GET_BIT(target, bit)];
  33     }
  34 
  35     if(trie == 0) {
  36         /* we lost */
  37         return 0;
  38     } else {
  39         /* check that leaf really contains the target */
  40         return !strcmp(trie->key, target);
  41     }
  42 }
  43 
  44 /* gcc -pedantic kills strdup! */
  45 static char *
  46 my_strdup(const char *s)
  47 {
  48     char *s2;
  49 
  50     s2 = malloc(strlen(s) + 1);
  51     assert(s2);
  52 
  53     strcpy(s2, s);
  54     return s2;
  55 }
  56 
  57 
  58 /* helper functions for insert */
  59 static Trie
  60 make_trie_node(const char *key)
  61 {
  62     Trie t;
  63     int i;
  64 
  65     t = malloc(sizeof(*t));
  66     assert(t);
  67 
  68     if(key) {
  69         t->key = my_strdup(key);
  70         assert(t->key);
  71     } else {
  72         t->key = 0;
  73     }
  74 
  75     for(i = 0; i < TRIE_BASE; i++) t->kids[i] = 0;
  76 
  77     return t;
  78 }
  79 
  80 /* add a new key to a trie */
  81 /* and return the new trie */
  82 Trie
  83 trie_insert(Trie trie, const char *key)
  84 {
  85     int bit;
  86     int bitvalue;
  87     Trie t;
  88     Trie kid;
  89     const char *oldkey;
  90 
  91     if(trie == 0) {
  92         return make_trie_node(key);
  93     }
  94     /* else */
  95     /* first we'll search for key */
  96     for(t = trie, bit = 0; !IsLeaf(t); bit++, t = kid) {
  97         kid = t->kids[bitvalue = GET_BIT(key, bit)];
  98         if(kid == 0) {
  99             /* woohoo! easy case */
 100             t->kids[bitvalue] = make_trie_node(key);
 101             return trie;
 102         }
 103     }
 104 
 105     /* did we get lucky? */
 106     if(!strcmp(t->key, key)) {
 107         /* nothing to do */
 108         return trie;
 109     }
 110 
 111     /* else */
 112     /* hard case---have to extend the @#!$ trie */
 113     oldkey = t->key;
 114 #ifdef EXCESSIVE_TIDINESS
 115     t->key = 0;      /* not required but makes data structure look tidier */
 116 #endif
 117 
 118     /* walk the common prefix */
 119     while(GET_BIT(oldkey, bit) == (bitvalue = GET_BIT(key, bit))) {
 120         kid = make_trie_node(0);
 121         t->kids[bitvalue] = kid;
 122         bit++;
 123         t = kid;
 124     }
 125 
 126     /* then split */
 127     t->kids[bitvalue] = make_trie_node(key);
 128     t->kids[!bitvalue] = make_trie_node(oldkey);
 129 
 130     return trie;
 131 }
 132 
 133 /* kill it */
 134 void
 135 trie_destroy(Trie trie)
 136 {
 137     int i;
 138 
 139     if(trie) {
 140         for(i = 0; i < TRIE_BASE; i++) {
 141             trie_destroy(trie->kids[i]);
 142         } 
 143 
 144         if(IsLeaf(trie)) {
 145             free(trie->key);
 146         }
 147 
 148         free(trie);
 149     }
 150 }
 151 
 152 static void
 153 trie_print_internal(Trie t, int bit)
 154 {
 155     int i;
 156     int kid;
 157 
 158     if(t != 0) {
 159         if(IsLeaf(t)) {
 160             for(i = 0; i < bit; i++) putchar(' ');
 161             puts(t->key);
 162         } else {
 163             for(kid = 0; kid < TRIE_BASE; kid++) {
 164                 trie_print_internal(t->kids[kid], bit+1);
 165             }
 166         }
 167     }
 168 }
 169 
 170 void
 171 trie_print(Trie t)
 172 {
 173     trie_print_internal(t, 0);
 174 }
trie.c

Here is a short test program that demonstrates how to use it:

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 
   4 #include "trie.h"
   5 
   6 /* test for trie.c */
   7 /* reads lines from stdin and echoes lines that haven't appeared before */
   8 
   9 /* read a line of text from stdin
  10  * and return it (without terminating newline) as a freshly-malloc'd block.
  11  * Caller is responsible for freeing this block.
  12  * Returns 0 on error or EOF.
  13  */
  14 char *
  15 getline(void)
  16 {
  17     char *line;         /* line buffer */
  18     int n;              /* characters read */
  19     int size;           /* size of line buffer */
  20     int c;
  21 
  22     size = 1;
  23     line = malloc(size);
  24     if(line == 0) return 0;
  25     
  26     n = 0;
  27 
  28     while((c = getchar()) != '\n' && c != EOF) {
  29         while(n >= size - 1) {
  30             size *= 2;
  31             line = realloc(line, size);
  32             if(line == 0) return 0;
  33         }
  34         line[n++] = c;
  35     }
  36 
  37     if(c == EOF && n == 0) {
  38         /* got nothing */
  39         free(line);
  40         return 0;
  41     } else {
  42         line[n++] = '\0';
  43         return line;
  44     }
  45 }
  46 
  47 int
  48 main(int argc, char **argv)
  49 {
  50     Trie t;
  51     char *line;
  52 
  53     t = EMPTY_TRIE;
  54 
  55     while((line = getline()) != 0) {
  56         if(!trie_contains(t, line)) {
  57             puts(line);
  58         }
  59 
  60         /* try to insert it either way */
  61         /* this tests that insert doesn't blow up on duplicates */
  62         t = trie_insert(t, line);
  63 
  64         free(line);
  65     }
  66 
  67     puts("===");
  68 
  69     trie_print(t);
  70 
  71     trie_destroy(t);
  72 
  73     return 0;
  74 }
test_trie.c

2. Patricia trees

Tries perform well when all keys are short (or are distinguished by short prefixes), but can grow very large if one inserts two keys that have a long common prefix. The reason is that a trie has to store an internal node for every bit of the common prefix until the two keys become distinguishable, leading to long chains of internal nodes each of which has only one child. An optimization (described in this paper) known as a Patricia tree eliminates these long chains by having each node store the number of the bit to branch on, like this:

   1 struct patricia_node {
   2     char *key;
   3     int bit;
   4     struct patricia_node *kids[2];
   5 };
   6 
   7 typedef struct patricia_node *Patricia;

Now when searching for a key, instead of using the number of nodes visited so far to figure out which bit to look at, we just read the bit out of the table. This means in particular that we can skip over any bits that we don't actually branch on. We do however have to be more careful to make sure we don't run off the end of our target key, since it is possible that when skipping over intermediate bits we might skip over some that distinguish our target from all keys in the table, including longer keys. For example, a Patricia tree storing the strings abc and abd will first test bit position 22, since that's where abc and abd differ. This can be trouble if we are looking for x.

Here's the search code:

   1 int
   2 patricia_contains(Patricia t, const char *key)
   3 {
   4     int key_bits;
   5 
   6     key_bits = BITS_PER_BYTE * (strlen(key)+1);   /* +1 for the nul */
   7 
   8     while(t && !IsLeaf(t)) {
   9         if(t->bit >= key_bits) {
  10             /* can't be there */
  11             return 0;
  12         } else {
  13             t = t->kids[GET_BIT(key, t->bit)];
  14         }
  15     }
  16 
  17     return t && !strcmp(t->key, key);
  18 }

The insertion code is similar in many respects to the insertion code for a trie. The differences are that we never construct a long chain of internal nodes when splitting a leaf (although we do have to scan through both the old and new keys to find the first bit position where they differ), but we may sometimes have to add a new internal node between two previously existing nodes if a new key branches off at a bit position that was previously skipped over.

In the worst case Patricia trees are much more efficient than tries, in both space (linear in the number of keys instead of linear in the total size of the keys) and time complexity, often needing to examine only a very small number of bits for misses (hits still require a full scan in strcmp to verify the correct key). The only downside of Patricia trees is that since they work on bits, they are not quite perfectly tuned to the byte or word-oriented structure of modern CPUs.

3. Ternary search trees

Ternary search trees were described by Jon Bentley and Bob Sedgewick in an article in the April 1988 issue of Dr. Dobb's Journal, available here.

The basic idea is that each node in the tree stores one character from the key and three child pointers lt, eq, and gt. If the corresponding character in the target is equal to the character in the node, we move to the next character in the target and follow the eq pointer out of the node. If the target is less, follow the lt pointer but stay at the same character. If the target is greater, follow the gt pointer and again stay at the same character. When searching for a string, we walk down the tree until we either reach a node that matches the terminating nul (a hit), or follow a null pointer (a miss).

A TST acts a bit like a 256-way trie, except that instead of storing an array of 256 outgoing pointers, we build something similar to a small binary search tree for the next character. Note that no explicit balancing is done within these binary search trees. From a theoretical perspective, the worst case is that we get a 256-node deep linked-list equivalent at each step, multiplying our search time by 256 = O(1). In practice, only those characters that actual appear in some key at this stage will show up, so the O(1) is likely to be a small O(1), especially if keys are presented in random order.

TSTs are one of the fastest known data structures for implementing dictionaries using strings as keys, beating both hash tables and tries in most cases. They can be slower than Patricia trees if the keys have many keys with long matching prefixes; however, a Patricia-like optimization can be applied to give a compressed ternary search tree that works well even in this case. In practice, regular TSTs are usually good enough.

Here is a simple implementation of an insert-only TST. The C code includes two versions of the insert helper routine; the first is the original recursive version and the second is an iterative version generated by eliminating the tail recursion from the first.

   1 typedef struct tst_node *TST;
   2 
   3 #define EMPTY_TST (0)
   4 
   5 /* returns 1 if t contains target */
   6 int tst_contains(TST t, const char *target);
   7 
   8 /* add a new key to a TST */
   9 /* and return the new TST */
  10 TST tst_insert(TST t, const char *key);
  11 
  12 /* free a TST */
  13 void tst_destroy(TST);
tst.h

   1 #include <stdio.h>
   2 #include <stdlib.h>
   3 #include <assert.h>
   4 
   5 #include "tst.h"
   6 
   7 struct tst_node {
   8     char key;                   /* value to split on */
   9     struct tst_node *lt;        /* go here if target[index] < value */
  10     struct tst_node *eq;        /* go here if target[index] == value */
  11     struct tst_node *gt;        /* go here if target[index] > value */
  12 };
  13 
  14 /* returns 1 if t contains key */
  15 int
  16 tst_contains(TST t, const char *key)
  17 {
  18     assert(key);
  19 
  20     while(t) {
  21         if(*key < t->key) {
  22             t = t->lt;
  23         } else if(*key > t->key) {
  24             t = t->gt;
  25         } else if(*key == '\0') {
  26             return 1;
  27         } else {
  28             t = t->eq;
  29             key++;
  30         }
  31     }
  32 
  33     return 0;
  34 }
  35 
  36 /* original recursive insert */
  37 static void
  38 tst_insert_recursive(TST *t, const char *key)
  39 {
  40     if(*t == 0) {
  41         *t = malloc(sizeof(**t));
  42         assert(*t);
  43         (*t)->key = *key;
  44         (*t)->lt = (*t)->eq = (*t)->gt = 0;
  45     }
  46 
  47     /* now follow search */
  48     if(*key < (*t)->key) {
  49         tst_insert_recursive(&(*t)->lt, key);
  50     } else if(*key > (*t)->key) {
  51         tst_insert_recursive(&(*t)->gt, key);
  52     } else if(*key == '\0') {
  53         /* do nothing, we are done */
  54         ;
  55     } else {
  56         tst_insert_recursive(&(*t)->eq, key+1);
  57     }
  58 }
  59 
  60 /* iterative version of above, since somebody asked */
  61 /* This is pretty much standard tail-recursion elimination: */
  62 /* The whole function gets wrapped in a loop, and recursive
  63  * calls get replaced by assignment */
  64 static void
  65 tst_insert_iterative(TST *t, const char *key)
  66 {
  67     for(;;) {
  68         if(*t == 0) {
  69             *t = malloc(sizeof(**t));
  70             assert(*t);
  71             (*t)->key = *key;
  72             (*t)->lt = (*t)->eq = (*t)->gt = 0;
  73         }
  74 
  75         /* now follow search */
  76         if(*key < (*t)->key) {
  77             t = &(*t)->lt;
  78         } else if(*key > (*t)->key) {
  79             t = &(*t)->gt;
  80         } else if(*key == '\0') {
  81             /* do nothing, we are done */
  82             return;
  83         } else {
  84             t = &(*t)->eq;
  85             key++;
  86         }
  87     }
  88 }
  89 
  90 
  91 /* add a new key to a TST */
  92 /* and return the new TST */
  93 TST
  94 tst_insert(TST t, const char *key)
  95 {
  96     assert(key);
  97 
  98 #ifdef USE_RECURSIVE_INSERT
  99     tst_insert_recursive(&t, key);
 100 #else
 101     tst_insert_iterative(&t, key);
 102 #endif
 103     return t;
 104 }
 105 
 106 /* free a TST */
 107 void
 108 tst_destroy(TST t)
 109 {
 110     if(t) {
 111         tst_destroy(t->lt);
 112         tst_destroy(t->eq);
 113         tst_destroy(t->gt);
 114         free(t);
 115     }
 116 }
tst.c

And here is some test code, almost identical to the test code for tries: test_tst.c.

The Dr. Dobb's article contains additional code for doing deletions and partial matches, plus some optimizations for inserts.

4. More information


CategoryProgrammingNotes CategoryAlgorithmNotes


2014-06-17 11:58