1. How slow is the k-means method?
David Arthur, Sergei Vassilvitskii
Stanford University
2. The k-means Problem
Given an integer k and n data points in $\mathbb{R}^d$
Partition points into k clusters
Choose k centers and partition points according to closest center
Try to minimize
$f = \sum_x \|x - c(x)\|^2$
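A minimal sketch of the objective on this slide, reading f as the sum over all data points x of the squared distance to the closest center c(x). The names `sq_dist` and `kmeans_cost` and the sample points are assumptions made for illustration; they do not come from the talk.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance ||p - q||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: List[Point], centers: List[Point]) -> float:
    """Potential f: every point pays the squared distance to its closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Hypothetical data: two tight pairs of points, one center per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(points, centers))  # each point is at distance 1 from its center
```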
3. Lloyd's Algorithm (1982)
Simply called the "k-means method"
Choose k starting centers
Uniformly at random usually
Repeat until stable:
Assign each point to the closest center
Set each center to be center of mass of points assigned to it
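The steps on this slide can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `lloyd`, the parameters `init`, `seed` and `max_iter`, the choice to draw starting centers from the data points, the lowest-index tie-breaking, and the rule that an empty cluster keeps its old center are all choices made here.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance ||p - q||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(x: Point, centers: List[Point]) -> int:
    """Index of the center nearest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))


def lloyd(points: List[Point], k: int, seed: int = 0,
          init: Optional[List[Point]] = None, max_iter: int = 10_000):
    """Run the k-means method; return (centers, assignment, iterations)."""
    rng = random.Random(seed)
    # Assumption: random starting centers are k distinct data points.
    centers = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for iteration in range(1, max_iter + 1):
        # Step 1: assign each point to the closest center.
        new_assignment = [closest(x, centers) for x in points]
        if new_assignment == assignment:
            return centers, assignment, iteration - 1  # stable: nothing moved
        assignment = new_assignment
        # Step 2: move each center to the center of mass of its points.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment, max_iter


# Hypothetical 1-D example with fixed starting centers, so the run is deterministic.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers, labels, iters = lloyd(pts, 2, init=[(0.0,), (1.0,)])
print(centers, labels, iters)  # [(0.5,), (10.5,)] [0, 0, 1, 1] 2
```

Passing `init` explicitly makes a run reproducible; leaving it out falls back to the random start described on the slide.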
4. Example
5. About k-means
It always terminates
Each step decreases f
At most $k^n$ configurations
It can stop with arbitrarily bad clusterings
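The claim that the method can stop at an arbitrarily bad clustering can be checked on a tiny instance. The sketch below uses a standard textbook-style example (four corners of a long, thin rectangle), which is an assumption here and not necessarily the example shown in the talk; the helper names and the width `w` are hypothetical. With centers at the midpoints of the long sides, one assign-and-recenter step changes nothing, yet the cost is w^2 times the cost of the left/right split.

```python
def sq_dist(p, q):
    """Squared Euclidean distance ||p - q||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in points]


def recenter(points, labels, k):
    """Center of mass of each cluster (assumes no cluster is empty)."""
    out = []
    for j in range(k):
        members = [x for x, a in zip(points, labels) if a == j]
        out.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return out


def cost(points, centers):
    """Potential f: sum of squared distances to the closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


w = 100.0  # hypothetical rectangle width; the height is 1
points = [(0.0, 0.0), (0.0, 1.0), (w, 0.0), (w, 1.0)]
bad = [(w / 2, 0.0), (w / 2, 1.0)]   # midpoints of the two long sides
good = [(0.0, 0.5), (w, 0.5)]        # midpoints of the two short sides

labels = assign(points, bad)
is_fixed_point = recenter(points, labels, 2) == bad
print(is_fixed_point, cost(points, bad), cost(points, good))  # True 10000.0 1.0
```

Since the ratio of the two costs is w^2, increasing `w` makes the stable clustering as bad as desired relative to the better one.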
6. About k-means
Widely used because it is fast
Usually far fewer than n iterations
How do you formalize this?
Just look at worst-case performance?
7. k-means (Worst-case # iterations)
Counting number of configurations:
Already showed: $O(k^n)$
Inaba et al. (SOCG 94): $O(n^{kd})$
One dimension:
Dasgupta (COLT 03): $O(n)$
Har-Peled, Sadri (SODA 05): $O(n\Delta^2)$
$\Delta$ = ratio of largest distance to smallest
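The spread used in the last bound is the ratio of the largest to the smallest pairwise distance. It can be computed directly, as in this small sketch; the function name `spread` is an assumption, and the code assumes at least two points, all distinct, so that the smallest distance is nonzero.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def spread(points):
    """Ratio of the largest pairwise distance to the smallest one.

    Assumes at least two points and no duplicates.
    """
    distances = [dist(p, q) for p, q in combinations(points, 2)]
    return max(distances) / min(distances)


# Hypothetical 1-D point set: pairwise distances are 1, 3 and 4.
print(spread([(0.0,), (1.0,), (4.0,)]))  # 4.0
```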
8. Our Main Result
Worst case = $2^{\Omega(\sqrt{n})}$
k-means is superpolynomial!
9. Proof: High Level
Start with a configuration M with n points, which requires T iterations
Add O(1) clusters, O(k) points
These reset the initial configuration M
M stabilizes to M'
New clusters and points reset M' to M
M now has to stabilize to M' again
Now requires at least 2T iterations
10. Proof: High Level
Repeat reset construction m times:
$O(m^2)$ points
$O(m)$ clusters
$2^m$ iterations
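The counts on this slide combine into the main bound as follows. This is a short worked derivation, not taken from the slides themselves: the symbols $T_i$, $k_i$ and $n_i$ (iterations, clusters and points after $i$ resets) are labels introduced here, and it assumes the base configuration has constant size and needs $T_0 \ge 1$ iterations.

```latex
\begin{align*}
T_m &\ge 2\,T_{m-1} \ge \dots \ge 2^{m}\,T_0 \ge 2^{m}
      && \text{each reset at least doubles the iterations} \\
k_m &= k_0 + m \cdot O(1) = O(m)
      && \text{each reset adds } O(1) \text{ clusters} \\
n_m &= n_0 + \sum_{i=1}^{m} O(k_i) = \sum_{i=1}^{m} O(i) = O(m^2)
      && \text{each reset adds } O(k) \text{ points} \\
m &= \Omega\!\left(\sqrt{n_m}\right)
      \;\Longrightarrow\; T_m = 2^{\Omega(\sqrt{n_m})}
\end{align*}
```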
11. Main Construction (Overview)
12. Main Construction (Overview)
13. Main Construction (Overview)
14. Main Construction (Zoomed in)
15. Main Construction (t=0)
16. Main Construction (t=0 to T)
17. Main Construction (t=T+1): Reassigning points to clusters
18. Main Construction (t=T+1): Reassigning points to clusters
19. Main Construction (t=T+1): Reassigning points to clusters
20. Main Construction (t=T+1): Recomputing centers
21. Main Construction (t=T+2): Reassigning points to clusters
22. Main Construction (t=T+2): Reassigning points to clusters
23. Main Construction (t=T+2): Recomputing centers
24. Main Construction (t=T+3): Reassigning points to clusters
25. Main Construction (t=T+3): Recomputing centers
26. Main Construction (t=T+4): Reassigning points to clusters
27. Main Construction (t=T+4): Recomputing centers
28. Summary
Some configurations take $2^{\Omega(\sqrt{n})}$ iterations
(Yes, we have actually implemented this!)
What now?
Lower bound is too precise to arise in practice
How do you formalize that?
29. The Big Question
How to guarantee good speed?
Choose initial centers randomly?
Nope: we can force the starting configuration w.h.p.
Polynomial spread (Har-Peled, Sadri, SODA 05)?
Nope: we can make the spread = n by adding 1 dimension
Low dimension?
Open: we conjecture polynomial only if d = 1
But k-means is fast in practice even in high dimensions
30. The Big Question
How to guarantee good speed?
We suggest smoothed analysis of Spielman and Teng
Perturb each data point using normal distribution
We recently showed: $O(n^k)$ and $O(2^{n/d})$
Recall worst-case bound: $O(n^{kd})$
Still open!
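The perturbation step in the smoothed-analysis setting on this slide can be sketched as below: each coordinate of each data point receives independent normal noise before the k-means method is run. The function name `perturb`, the parameters `sigma` and `seed`, and the use of the same standard deviation for every coordinate are assumptions made for illustration; the slide only says that points are perturbed using a normal distribution.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def perturb(points: List[Point], sigma: float, seed: int = 0) -> List[Point]:
    """Add independent N(0, sigma^2) noise to every coordinate of every point."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


# Hypothetical input: three points in the plane, perturbed with sigma = 0.1.
original = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
noisy = perturb(original, sigma=0.1)
print(noisy)
```

Smoothed analysis then asks for the expected number of iterations on the perturbed input, with the adversary choosing only the unperturbed points.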
31. Thanks for listening!