How slow is the k-means method

1. How slow is the k-means method? David Arthur and Sergei Vassilvitskii, Stanford University

2. The k-means Problem: Given an integer k and n data points in $\mathbb{R}^d$, partition the points into k clusters. Choose k centers and partition the points according to the closest center. Try to minimize $f = \sum_{x} \|x - c(x)\|^2$, where c(x) is the center closest to x.
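As a minimal sketch of the objective on this slide, the function below evaluates f for a given set of centers. The names `squared_distance` and `kmeans_cost` are illustrative and not from the talk; points and centers are assumed to be tuples of numbers of equal length.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """k-means objective: each point pays the squared distance to its closest center."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)
```

For example, `kmeans_cost([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 1), (10, 1)])` sums four squared distances of 1 each.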

3. Lloyd's Algorithm (1982): Simply called the "k-means method". Choose k starting centers (usually uniformly at random). Repeat until stable: assign each point to the closest center; set each center to be the center of mass of the points assigned to it.
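The two repeated steps can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function names, the tie-breaking rule (lowest center index wins), the handling of empty clusters (the old center is kept), and the choice of starting centers among the data points are all assumptions made here.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_start(points, k, seed=None):
    """Pick k distinct data points uniformly at random as starting centers (assumed scheme)."""
    return random.Random(seed).sample(list(points), k)


def lloyd(points, centers, max_iters=10_000):
    """Run the k-means method from the given centers.

    Returns (centers, assignment, iterations), where iterations counts the
    rounds in which the assignment changed, including the first round.
    """
    centers = [tuple(float(v) for v in c) for c in centers]
    assignment = None
    for iteration in range(1, max_iters + 1):
        # Assignment step: closest center, ties broken by lowest index (assumption).
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
            for x in points
        ]
        if new_assignment == assignment:
            return centers, assignment, iteration - 1  # stable: nothing moved
        assignment = new_assignment
        # Update step: move each center to the center of mass of its points.
        for j in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old center (assumption)
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment, max_iters
```

Returning the iteration count alongside the clustering makes it easy to measure how many rounds a given starting configuration needs, which is the quantity the rest of the talk bounds.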

4. Example

5. About k-means: It always terminates, since each step decreases f and there are at most $k^n$ configurations. It can stop with arbitrarily bad clusterings.
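One standard way to see the last point (an illustration chosen here, not taken from the slides) is four points at the corners of a w-by-1 rectangle with k = 2. Centers at the midpoints of the two long sides form a stable configuration with cost w^2, while centers at the midpoints of the two short sides give cost 1. The sketch below checks this; all function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))


def cost(points, centers):
    return sum(squared_distance(x, centers[closest(x, centers)]) for x in points)


def is_stable(points, centers, tol=1e-12):
    """True if every center already sits at the center of mass of its cluster."""
    for j, c in enumerate(centers):
        members = [x for x in points if closest(x, centers) == j]
        if not members:
            continue
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        if squared_distance(centroid, c) > tol:
            return False
    return True


def rectangle_example(w):
    """Corners of a w-by-1 rectangle, a stuck configuration, and a better one."""
    points = [(0.0, 0.0), (0.0, 1.0), (w, 0.0), (w, 1.0)]
    stuck = [(w / 2, 0.0), (w / 2, 1.0)]   # splits the points into bottom and top
    better = [(0.0, 0.5), (w, 0.5)]        # splits the points into left and right
    return points, stuck, better
```

Because the stuck configuration is a fixed point of both steps, the method stops there, and the ratio between the two costs, w^2, grows without bound as the rectangle is stretched.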

6. About k-means: Widely used because it is fast. Usually far fewer than n iterations. How do you formalize this? Just look at worst-case performance?

7. k-means (Worst case # iterations): Counting the number of configurations: already showed $O(k^n)$; Inaba et al. (SOCG 94): $O(n^{kd})$. One dimension: Dasgupta (COLT 03): $O(n)$; Har-Peled, Sadri (SODA 05): $O(n\Delta^2)$, where $\Delta$ = ratio of largest distance to smallest.

8. Our Main Result: Worst case = $2^{\Omega(\sqrt{n})}$. k-means is superpolynomial!

9. Proof (High Level): Start with a configuration M with n points, which requires T iterations. Add O(1) clusters and O(k) points; these reset the initial configuration M. M stabilizes to M'. The new clusters and points reset M' to M. M now has to stabilize to M' again. The construction now requires at least 2T iterations.

10. Proof (High Level): Repeat the reset construction m times: $O(m^2)$ points, $O(m)$ clusters, $2^m$ iterations.
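The step from these counts to a bound in terms of n is short and can be written out as follows. This is a sketch of the arithmetic only; the symbol $T_m$ (iterations needed after m resets) is notation introduced here, and the constants hidden in the O-notation are not tracked.

```latex
\begin{align*}
T_m &\ge 2\,T_{m-1} \;\Longrightarrow\; T_m \ge 2^{m}\,T_0, \\
n &= O(m^2) \;\Longrightarrow\; m = \Omega(\sqrt{n}), \\
T_m &\ge 2^{m} = 2^{\Omega(\sqrt{n})}.
\end{align*}
```

Each reset adds O(k) points while k itself grows to O(m), which is why the total point count is quadratic in m rather than linear.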

11. Main Construction (Overview)



14. Main Construction (Zoomed in)

15. Main Construction (t=0)

16. Main Construction (t=0...T)

17. Main Construction (t=T+1): Reassigning points to clusters



20. Main Construction (t=T+1): Recomputing centers








28. Summary: Some configurations take $2^{\Omega(\sqrt{n})}$ iterations. (Yes, we have actually implemented this!) What now? The lower bound is too precise to arise in practice. How do you formalize that?

29. The Big Question: How to guarantee good speed? Choose initial centers randomly? Nope: can force the starting configuration w.h.p. Har-Peled, Sadri [SODA 05]: poly spread? Nope: can make spread = n by adding 1 dimension. Low dimension? Open: we conjecture poly only if d = 1. But k-means is fast in practice even in high dimensions.

30. The Big Question: How to guarantee good speed? We suggest the smoothed analysis of Spielman and Teng: perturb each data point using a normal distribution. We recently showed: $O(n^k)$ and $O(2^{n/d})$. Recall the worst-case bound: $O(n^{kd})$. Still open!
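The perturbation step can be sketched as below: each coordinate of each point gets independent Gaussian noise with standard deviation sigma, and the iteration count is averaged over several perturbed copies of the same input. The names `perturb` and `smoothed_estimate`, the `trials` and `seed` parameters, and the `count_iterations` callback (any function that runs k-means on a point set and returns its iteration count) are assumptions for illustration, not part of the talk.

```python
import random


def perturb(points, sigma, rng):
    """Add independent N(0, sigma^2) noise to every coordinate of every point."""
    return [tuple(coord + rng.gauss(0.0, sigma) for coord in x) for x in points]


def smoothed_estimate(points, sigma, count_iterations, trials=20, seed=0):
    """Average iteration count over `trials` randomly perturbed copies of `points`.

    `count_iterations` is a caller-supplied function (hypothetical here) that
    takes a list of points and returns the number of k-means iterations used.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_iterations(perturb(points, sigma, rng))
    return total / trials
```

Smoothed analysis asks for the worst case, over all inputs, of this expected value; the sketch only estimates the expectation for one fixed input.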

31. Thanks for listening!
