1. Randomized search trees
Idea: Build a binary search tree, but balance using randomization instead of bookkeeping.
2. Binary search tree with random insertions
Suppose we insert n elements into an initially-empty binary search tree in random order with no rebalancing. Then each element is equally likely to be the root, and all the elements less than the root end up in the left subtree, while all the elements greater than the root end up in the right subtree, where they are further partitioned recursively. This is exactly what happens in QuickSort, so the structure of the tree will exactly mirror the structure of an execution of QuickSort. So, for example, we can immediately observe from our previous analysis of QuickSort that the total path length—the sum of the depths of the nodes—is Θ(n log n), since the depth of each node is equal to 1 plus the number of comparisons it participates in as a non-pivot.
In fact, we can show that the height of the tree is O(log n) with high probability (which also shows that QuickSort takes O(n log n) comparisons with high probability, since each layer of the tree has at most n elements in it). Consider a single element x, and let T(d) be the number of elements not equal to x in the subtree rooted at x's depth-d ancestor, or zero if x has no depth-d ancestor. Then T(1) = n-1 ≤ n, and for d > 1 we can compute E[T(d) | T(d-1)] ≤ (3/4) T(d), if we assume that x lands in the larger subtree of its depth d-1 ancestor, since if we condition on e.g. the left subtree being larger all sizes ≥ (T(d)-1)/2 are equally likely. This gives that Pr[depth(x) > d] ≤ (3/4)d-1n, or a probability of at most n-c that depth(x) > 1 + (c+1) log(n)/log(4/3) ≅ 1 + 6.952 ln n for c = 1, which we need to apply the union bound. The right answer for the actual height in the limit (see Devroye 1988) is ~ 4.31107 ln n, so this bound is actually pretty close.1 It's still nearly a factor of three worse that a completely balanced tree, which has max depth bounded by 1 + lg n ≅ 1 + 1.44269 ln n.
The problem with this approach is that we don't have any guarantees that the input will be supplied in random order, and in the worst case we end up with a linked list.
3. Skip lists
A skip list is a randomized tree-like data structure based on linked lists. It consists of a level 0 list that is an ordinary sorted linked list, together with higher-level lists that contain a random sampling of the elements at lower levels. When inserted into the level i list, an element flips a coin that tells it with probability p to insert itself in the level i+1 list as well.
Searches in a skip list are done by starting in the highest-level list and searching forward for the last element whose key is smaller than the target; the search then continues in the same way on the next level down. To bound the expected running time of a search, it helps to look at this process backwards; the reversed search path starts at level 0 and continues going backwards until it reaches the first element that is also in a higher level; it then jumps to the next level up and repeats the process. We can analyze this process by tracking the number of nodes in the current level that are earlier than the current node. If we move left, this drops by 1; if up, this drops to p times its previous value on average. So the expected number of such nodes T(k) after k steps satisfies T(k+1) = (1-p)(T(k)-1) + p(pT(k)) < (1-p+p2) T(k), and in general we get T(k) < (1-p+p2)k n. (We can substitute the rank of the starting node for n if we want a slightly tighter bound.) This is minimized at p = 1/2, giving T(k) < (3/4)k n, suspiciously similar to the bound we computed before for random binary search trees.
The space per element of a skip list also depends on p. Every element has at least one outgoing pointer (on level 0), and on average has exactly 1/(1-p) expected pointers. So the space cost can also be adjusted by adjusting p. For example, if space is at a premium, setting p = 1/10 produces 10/9 pointers per node on average—not much more than in a linked list—but still gives O(log n) search times.
3.1. More about skip lists
The original skip list paper, published by William Pugh in Communications of the ACM in 1990, can be found at ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf.
A Java applet that demonstrates skip lists can be found at http://iamwww.unibe.ch/~wenger/DA/SkipList/.
4. Skip graphs
For applications where we want to avoid hot-spots, we can split the nodes into multiple high-level lists instead of having some of them drop out. This gives a data structure known as a skip graph (Aspnes and Shah) or SkipNet (Harvey et al.). See http://www.cs.yale.edu/homes/aspnes/skip-graphs-abstract.html.
In class I suggested that we could knock the 3/4 down to 2/3 by conditioning on which side x lands in (which is more likely to be in the larger side), but this only works if x is uniformly distributed in the interval as well. If x is the median value then it will almost always end up in the larger subtree, giving the 3/4 constant. (1)