UnionFind

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

A union-find data structure starts with n elements each of which is in a set by itself. It supports the union operation, which takes two element names as arguments and merges the two sets containing those elements, and the find operation, which takes an element name and returns the identity of the set that contains that element. Because this identity is arbitrary, and may change as union operations occur, the find operation is mostly use to determine if two elements are currently in the same set or not.

There is a very nice method, described on page 317 of LevitinBook, for doing union-find that carries out m operations on n elements in O(m α(n)) time, where α(n) is the inverse of the Ackermann_function (see also http://www.nist.gov/dads/HTML/inverseAckermann.html). The Ackermann function grows so astonishingly quickly that α(n) is not likely to exceed 6 for any number you can easily define. But it's not quite constant, so the amortized cost per operation of this algorithm is not quite O(1).

A slightly less efficient approach is easier to describe. We'll keep around a record for each surviving set that includes a linked list of all of its members and a size field. We'll also store for each element a pointer to the set that contains it. When we take the union of two sets, we move all the elements of the smaller set to the larger set. This costs O(1) time per element moved, for O(n) time in the worst case. Since each element always points directly to its containing set, find operations are always O(1).

How much does it cost to do m unions? Let T(m) be the worst-case total cost of constructing a set with m elements. Then we have

T(m) ≤ T(k) + T(m-k) + c min(k, m-k) ≤ T(k) + T(m-k) + cm/2.

Let's guess that T(m) ≤ a m lg m for small m. Then for larger m, we have

T(m) ≤ T(k) + T(m-k) + cm/2
≤ a k lg k + a (m-k) lg (m-k) + cm/2
≤ 2a (m/2) lg (m/2) + cm/2
≤ a m lg m - am + cm/2
≤ a m lg m,

for sufficiently large a.

There's a hidden convexity argument in the middle of that pile of inequalities (at the step where we assert k lg k + (m-k) lg (m-k) is maximized when k = m/2), which makes it a little bit trickier than many recurrence solutions. So it's probably easier to use the following standard trick instead. Imagine you are one of the elements. How many times do you get moved to a new set? Each time you move to the union of your current set and another set that is at least as big, so the size of the set that contains you doubles at every move. It can do so at most lg n times before it includes everybody, so you move O(lg n) times. Adding up this bound over all n elements gives the O(n log n) bound.

CategoryAlgorithmNotes