|
The k-means method is one of the most popular algorithms for
clustering high-dimensional data into a given number k of
classes. However, its good performance in praxis is at stark contrast
to its theoretical performance: In the worst case, the running-time
can be exponential. The goal of smoothed analysis is to provide an
explanation for the good practical performance of algorithms with poor
worst-case performance.
We present a smoothed analysis of the k-means method: An adversary
specifies a point set, and then the points are perturbed with a
Gaussian with standard deviation \sigma. We prove two bounds on the
smoothed running-time of the k-means algorithm: Our first bound is
polynomial in n^{\sqrt k} and 1/\sigma, where n is the number of
points. Our second bound is k^{kd} poly(n, 1/\sigma), where d is the
dimension of the instance. This yields a polynomial bound if both k
and d are small compared to n.
Finally, we generalize the analysis of k-means to a large class of
other distance measures. The exponential lower bound for squared
Euclidean distances (Arthur, Vassilvitskii 2006) can be transferred to
most Bregman divergences. The smoothed analysis can also be adapted to
(almost) arbitrary Bregman divergences. Examples of Bregman
divergences are Kullback-Leibler divergence or Mahalanobis distances,
the latter includes squared Euclidean distances.
Based on joint work with Heiko Roeglin.
|