Suppose we have a database that stores various information about Yale alumni, such as their names, majors, dates of graduation if any, and total amount donated. We may wish to perform aggregate queries on this database, where we ask for the total donations (or average donation) of some subset of the donors. An example of such a query might be "what is the average donation of the top 100 donors?" or "what is the total donation of all alumni whose names come between Bush, George H.W. and Bush, George W.?" We can easily answer such questions in O(n) time by scanning the entire database, but if we ask the same kind of questions often enough it may be cost-effective to precompute enough statistical information to answer the questions faster. Typically this is done by building an AugmentedDataStructure based on a BalancedTree.
Maintaining summary statistics
For this particular problem, we start with a BalancedTree (e.g., a RedBlackTree) that stores keys (alumni names) and values (donations) in each node. We add to each node a new field sum that keeps track of the total donation of every descendant of the node (including the node itself); the invariant on the tree is extended to include the requirement that this sum field is correct for all nodes whenever any operation completes.
We can easily enforce the invariant by providing a procedure FixSum that updates the sum field in a particular node under the assumption that the sum fields in its children are correct:
FixSum(node): node.sum = node.value if node.left is not null: node.sum = node.sum + node.left.value if node.right is not null: node.sum = node.sum + node.right.value
In principle, we could use this recursively to fix all the sum fields:
FixAllSums(root): if root.left is not null: FixAllSums(root.left) if root.right is not null: FixAllSums(root.right) FixSum(root)
Unfortunately, running FixAllSums after any insert or delete, or change in the value of any node, will take Θ(n) time. So instead we will be slightly more clever, and run it only on nodes whose descendants have changed. These include:
- Any ancestor of a newly-inserted node.
- Any ancestor of a newly-deleted node.
- Any node that is involved in a rotation.
Since each insert or delete only produces O(log n) nodes in either of the first two categories and O(1) nodes (in a RedBlackTree) in the third, we can maintain the invariant by calling FixSum on all affected nodes at cost O(log n), which is swallowed by the O(log n) cost of the basic insert or delete. Note that this property only requires that FixSum run in O(1) time, so it extends to any augmented tree where the summary information for a node can be recomputed quickly from its children; for example, we could keep track of the number of descendants of each node or the maximum donation in each subtree (or both) with only a constant-factor increase in running time.
Using the summary statistics
Assuming we have a correct sum field in each node, we can easily compute the total donation of all alumni, just by looking at root.sum. Similarly, we can easily computer the total donation of all alumni in a given subtree. But which nodes appear in which subtree is somewhat arbitrary; we know that each subtree contains some contiguous range of keys, but the boundaries between subtrees will shift depending on how the tree is rebalanced as nodes are inserted and deleted. So how do we find out how much money we collected from alumni to the right of Bush, George H.W. and to the left of Bush, George W.?
For this problem we can use a simple DecreaseAndConquer algorithm that constructs an interval that does not necessarily correspond to a single subtree out of O(log n) subtrees. The pseudocode is simpler if we consider only the question of computing the sum over all nodes whose keys are larger than some input key; we can compute sums over more complicated ranges by adding and subtracting the somes over such one-way ranges. To compute the sum of all nodes with keys greater than or equal to x, we consider three cases, depending on the relationship betweeen x and root.key. Pseudcode is given below.
SumAtLeast(root, x): if root is null: return 0 else if x < root.key: return SumAtLeast(root.left) else if x = root.key: return SumAtLeastroot.left) + root.value else if x > root.key: if root.left is not null: return root.left.sum + root.value + SumAtLeast(root.right, x) else: return root.value + SumAtLeast(root.right, x)
Since this algorithm recurses on only one subtree for each node, its total running time is O(depth of tree) = O(log n) for a typical balanced tree.