How slow is the k-means method


MikeCarlo

Presentation Transcript


    1. How slow is the k-means method? David Arthur, Sergei Vassilvitskii (Stanford University)

    2. The k-means Problem: Given an integer k and n data points in $\mathbb{R}^d$, partition the points into k clusters. Choose k centers and partition the points according to the closest center. Try to minimize $f = \sum_{x} \|x - c(x)\|^2$, where c(x) is the center closest to x.
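As a minimal sketch of the objective on this slide, the function below evaluates f for a given set of centers by charging each point the squared distance to its closest center. The name `kmeans_cost` and the sample points are illustrative assumptions, not part of the talk.

```python
from math import dist

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)

# Hypothetical data: two tight pairs of points, one center per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(points, centers))
```

Every point here lies at distance 1 from its nearest center, so the four squared distances add up to 4.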

    3. Lloyd’s Algorithm (1982): Simply called the “k-means method”. Choose k starting centers (usually uniformly at random). Repeat until stable: assign each point to the closest center; set each center to be the center of mass of the points assigned to it.
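The two repeated steps can be sketched in a few lines of plain Python. This is an illustrative version, not the authors' implementation: the function names are made up, the starting centers are sampled from the data points, ties go to the lowest-indexed center, and an empty cluster simply keeps its old center. All of these are assumptions the slide does not specify.

```python
import random
from math import dist

def closest(point, centers):
    """Index of the center nearest to point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: dist(point, centers[j]))

def mean(cluster):
    """Center of mass of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def lloyd(points, k, seed=0):
    """Run Lloyd's method; return (centers, assignment, iterations)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: seed centers from the data
    assignment = None
    iterations = 0
    while True:
        # Step 1: assign each point to its closest center.
        new_assignment = [closest(p, centers) for p in points]
        if new_assignment == assignment:
            return centers, assignment, iterations  # stable: nothing moved
        assignment = new_assignment
        iterations += 1
        # Step 2: move each center to the center of mass of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean(members)

# Hypothetical data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment, iterations = lloyd(data, 2)
print(assignment, iterations)
```

The `iterations` counter records how many assignment rounds changed something, which is the quantity the rest of the talk bounds.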

    4. Example

    5. About k-means: It always terminates: each step decreases f, and there are at most $k^n$ configurations. It can stop with arbitrarily bad clusterings.
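A standard way to see the "arbitrarily bad" claim, offered here as an illustration rather than as the example used in the talk, is a long thin rectangle. If the two centers start at the midpoints of the long sides, one Lloyd step changes nothing, yet the cost is far from optimal, and the gap grows with the width. The width `w` and the helper names below are assumptions.

```python
from math import dist

def cost(points, centers):
    """k-means objective: squared distance to the closest center, summed."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)

def lloyd_step(points, centers):
    """One assign-and-recompute round; returns the new list of centers."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[j].append(p)
    # An empty cluster keeps its old center (an assumption of this sketch).
    return [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else old
            for pts, old in zip(clusters, centers)]

w = 100.0  # hypothetical rectangle width; the height is fixed at 1
points = [(0.0, 0.0), (0.0, 1.0), (w, 0.0), (w, 1.0)]
bad = [(w / 2, 0.0), (w / 2, 1.0)]   # bottom pair vs. top pair
good = [(0.0, 0.5), (w, 0.5)]        # left pair vs. right pair

stable = lloyd_step(points, bad) == bad
ratio = cost(points, bad) / cost(points, good)
print(stable, ratio)
```

In this setup the bad configuration costs w² while the good one costs 1, so increasing `w` makes the stable-but-bad clustering as poor as desired.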

    6. About k-means: Widely used because it is fast. Usually far fewer than n iterations. How do you formalize this? Just look at worst-case performance?

    7. k-means (worst-case number of iterations): Counting the number of configurations: already showed $O(k^n)$; Inaba et al. (SOCG 94): $O(n^{kd})$. One dimension: Dasgupta (COLT 03): $O(n)$; Har-Peled, Sadri (SODA 05): $O(n\Delta^2)$, where $\Delta$ = ratio of largest distance to smallest.

    8. Our Main Result: Worst case = $2^{\Omega(\sqrt{n})}$. k-means is superpolynomial!

    9. Proof: High Level. Start with a configuration M with n points, which requires T iterations. Add O(1) clusters and O(k) points; these reset the initial configuration M. M stabilizes to M’. The new clusters and points reset M’ to M. M now has to stabilize to M’ again. The whole run now requires at least 2T iterations.

    10. Proof: High Level. Repeat the reset construction m times: $O(m^2)$ points, $O(m)$ clusters, $2^m$ iterations.
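The arithmetic behind these counts can be written out as follows. The notation $T_i$ (iterations needed after i resets) is introduced here for illustration, and $T_0 \ge 1$ is assumed. Each reset adds O(1) clusters and O(k) points, so after m resets there are O(m) clusters and $O(1 + 2 + \dots + m) = O(m^2)$ points.

```latex
T_{i+1} \ge 2\,T_i
  \;\Longrightarrow\; T_m \ge 2^m\,T_0 \ge 2^m,
\qquad
n = O(m^2)
  \;\Longrightarrow\; m = \Omega(\sqrt{n})
  \;\Longrightarrow\; T_m = 2^{\Omega(\sqrt{n})}.
```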

    11. Main Construction (Overview)

    12. Main Construction (Overview)

    13. Main Construction (Overview)

    14. Main Construction (Zoomed in)

    15. Main Construction (t=0)

    16. Main Construction (t=0…T)

    17. Main Construction (t=T+1) Reassigning points to clusters

    18. Main Construction (t=T+1) Reassigning points to clusters

    19. Main Construction (t=T+1) Reassigning points to clusters

    20. Main Construction (t=T+1) Recomputing centers

    21. Main Construction (t=T+2) Reassigning points to clusters

    22. Main Construction (t=T+2) Reassigning points to clusters

    23. Main Construction (t=T+2) Recomputing centers

    24. Main Construction (t=T+3) Reassigning points to clusters

    25. Main Construction (t=T+3) Recomputing centers

    26. Main Construction (t=T+4) Reassigning points to clusters

    27. Main Construction (t=T+4) Recomputing centers

    28. Summary: Some configurations take $2^{\Omega(\sqrt{n})}$ iterations. (Yes, we have actually implemented this!) What now? The lower bound is too precise to arise in practice. How do you formalize that?

    29. The Big Question: How to guarantee good speed? Choose initial centers randomly? Nope: can force the starting configuration w.h.p. Har-Peled, Sadri [SODA 05]: poly spread? Nope: can make spread = n by adding 1 dimension. Low dimension? Open: we conjecture poly only if d = 1. But k-means is fast in practice even in high dimensions.

    30. The Big Question: How to guarantee good speed? We suggest the smoothed analysis of Spielman and Teng: perturb each data point using a normal distribution. We recently showed: $O(n^k)$ and $O(2^{n/d})$. Recall the worst-case bound: $O(n^{kd})$. Still open!
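The perturbation step in a smoothed-analysis experiment might look like the sketch below: every coordinate of every point receives independent Gaussian noise before the k-means method is run. The function name `perturb`, the parameter `sigma`, and the sample points are assumptions made for illustration; the slide does not fix a noise scale.

```python
import random

def perturb(points, sigma, seed=0):
    """Add independent N(0, sigma^2) noise to each coordinate of each point."""
    rng = random.Random(seed)
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]

# Hypothetical adversarial input, jittered slightly before clustering.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
noisy = perturb(points, sigma=0.01)
print(noisy)
```

The intuition is that a worst-case construction depends on exact point positions, so even a small `sigma` would be expected to break it.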

    31. Thanks for listening!
