Clustering with k-means: faster, smarter, cheaper

Presentation Transcript


  1. Clustering with k-means: faster, smarter, cheaper Charles Elkan University of California, San Diego April 24, 2004

  2. Acknowledgments • Funding from Sun Microsystems, with sponsor Dr. Kenny Gross. • Advice from colleagues and students, especially Sanjoy Dasgupta (UCSD), Greg Hamerly (Baylor University starting Fall ‘04), Doug Turnbull.

  3. Clustering is difficult! Source: Patrick de Smet, University of Ghent

  4. The standard k-means algorithm • Input: n points, a distance function d(), and the number k of clusters to find. • Step 1: Start with k centers. • Step 2: Compute d(each point x, each center c). • Step 3: For each x, find the closest center c(x). (“ALLOCATE”) • Step 4: If no point has changed “owner” c(x), stop. • Step 5: Set each center c to the mean of the points it owns. (“LOCATE”) • Step 6: Repeat from Step 2.
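
For reference, the loop above can be written in a few lines; this is a minimal sketch assuming Euclidean distance and NumPy arrays, with a function name and signature of our own choosing.

```python
# A minimal sketch of the standard k-means loop above, assuming Euclidean
# distance and NumPy arrays. Function name and signature are illustrative.
import numpy as np

def kmeans(X, centers, max_iter=100):
    """X: (n, d) data matrix; centers: (k, d) initial centers."""
    owner = None
    for _ in range(max_iter):
        # Step 2: distances from every point to every center.
        D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3 ("ALLOCATE"): each point is owned by its closest center.
        new_owner = D.argmin(axis=1)
        # Step 4: stop when no point changes owner.
        if owner is not None and np.array_equal(new_owner, owner):
            break
        owner = new_owner
        # Step 5 ("LOCATE"): move each center to the mean of its points.
        centers = np.array([
            X[owner == c].mean(axis=0) if np.any(owner == c) else centers[c]
            for c in range(len(centers))
        ])
    return owner, centers
```

The within-class squared distortion of slide 6 is then simply ((X - centers[owner]) ** 2).sum().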

  5. A typical k-means result

  6. Observations • Theorem: If d() is Euclidean, then k-means converges monotonically to a local minimum of within-class squared distortion: Σ_x d(c(x), x)^2 • Many variants, complex history since 1956, over 100 papers per year currently • Iterative, related to expectation-maximization (EM) • # of iterations to converge grows slowly with n, k, d • No accepted method exists to discover k.

  7. We want to … • (1) … make the algorithm faster. • (2) … find lower-cost local minima. • (Finding the global optimum is NP-hard.) • (3) … choose the correct k intelligently. • With success at (1), we can try more alternatives for (2). • With success at (2), comparisons for different k are less likely to be misleading.

  8. Is this clustering better?

  9. Or is this better?

  10. Standard initialization methods • Forgy initialization: choose k points at random as starting center locations. • Random partitions: divide the data points randomly into k subsets. • Both these methods are bad. • E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
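
Both standard initializations are a few lines each; the sketch below is illustrative only (function names are ours), assuming the data points are rows of a NumPy array.

```python
# Sketches of the two standard initializations; function names are ours.
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: pick k distinct data points at random as the starting centers.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random partition: assign each point to a random cluster, then take the
    # mean of each subset (assumes every cluster receives at least one point).
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == c].mean(axis=0) for c in range(k)])
```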

  11. Forgy initialization

  12. k-means result

  13. Smarter initialization • The “furthest-first” algorithm (FF): • Pick the first center randomly. • Next is the point furthest from the first center. • Third is the point furthest from both previous centers. • In general: the next center is argmax_x min_c d(x,c) • D. Hochbaum, D. Shmoys. A best possible heuristic for the k-center problem, Mathematics of Operations Research, 10(2):180-184, 1985.
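
The FF rule translates directly into code; a sketch assuming Euclidean distance, with a function name of our choosing:

```python
# A sketch of furthest-first (FF) initialization; the function name is ours.
import numpy as np

def furthest_first_init(X, k, rng):
    centers = [X[rng.integers(len(X))]]              # first center: random point
    min_d = np.linalg.norm(X - centers[0], axis=1)   # min_c d(x, c) so far
    for _ in range(k - 1):
        nxt = int(min_d.argmax())                    # argmax_x min_c d(x, c)
        centers.append(X[nxt])
        min_d = np.minimum(min_d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)
```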

  14. Furthest-first initialization (FF)

  15. Subset furthest-first (SFF) • FF finds outliers, by definition not good cluster centers! • Can we choose points far apart and typical of the dataset? • Idea:  A random sample includes many representative points, but few outliers. • But: How big should the random sample be? • Lemma:  Given k equal-size sets and c >1, with high probability ck log k random points intersect each set.
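
Per the lemma, SFF simply runs FF on a random sample of roughly c·k·log k points; a sketch reusing furthest_first_init from above, with c = 2 as on the next slide:

```python
# A sketch of subset furthest-first (SFF): run FF on a random sample of about
# c * k * log k points. Reuses furthest_first_init from the sketch above.
import numpy as np

def subset_furthest_first_init(X, k, rng, c=2):
    m = int(np.ceil(c * k * np.log(k)))        # sample size from the lemma
    m = max(k, min(len(X), m))                 # keep at least k, at most n points
    sample = X[rng.choice(len(X), size=m, replace=False)]
    return furthest_first_init(sample, k, rng)
```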

  16. Subset furthest-first c = 2

  17. Comparing initialization methods • (Table of results not reproduced here.) An entry of 218 means 218% worse than the best clustering known. Lower is better.

  18. How to find lower-cost local minima • Random restarts, even initialized well, are inadequate. • The “central limit catastrophe”: almost all local minima are only of average quality. • K. D. Boese, A. B. Kahng, S. Muddu. A new adaptive multi-start technique for combinatorial global optimizations, Operations Research Letters, 16:101-113, 1994. • The art of designing a local search algorithm: defining a neighborhood rich in improving candidate moves.

  19. Our local search method • k-means alternates two guaranteed-improvement steps: “allocate” and “locate.” • Unfortunately, we know of no other guaranteed-improvement steps. • So we use non-guaranteed “jump” operations: delete an existing center and create a new center at a data point. • After each “jump,” run k-means to convergence, starting with an “allocate” step.

  20. Example jump: add a center below; remove a center at left.

  21. Theory versus practice • Theorem:  Let C be a set of centers such that no “jump” operation improves the value of C.  Then C is at most 25 times worse than the global optimum. • T. Kanungo et al. A local search approximation algorithm for clustering, ACM Symposium on Computational Geometry, 2002. • Our aim: Find heuristics to identify “jump” steps that are likely to be good. • Experiments indicate we can solve problems with up to 2000 points and 20 centers optimally.

  22. An upper bound ... • Lemma 1: The maximum loss from removing center c. • Proof: • Suppose b is the center closest to c; let B and C be the subsets owned by b and c, with m = |B| and n = |C|. • If B and C merge, the new center is b' = (mb + nc)/(m + n). • Because c is the mean of C, for any z, Σ_{x in C} d(z,x)^2 = Σ_{x in C} d(c,x)^2 + n·d(z,c)^2. • So the loss from the merge is n·d(b',c)^2 + m·d(b',b)^2. • This computation is cheap, so we do it for every center.
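
As a sketch (assuming Euclidean distance; the function name is ours), the merge-loss bound is a constant amount of arithmetic per center:

```python
# Upper bound on the loss from removing center c: merge its cluster C into the
# cluster B of the nearest other center b (Lemma 1 above). Name is ours.
import numpy as np

def removal_loss(b, c, m, n):
    """b, c: center vectors; m = |B|, n = |C|."""
    b_new = (m * b + n * c) / (m + n)       # mean of the merged cluster
    return n * np.sum((b_new - c) ** 2) + m * np.sum((b_new - b) ** 2)
```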

  23. … and a lower bound • Suppose we add a new center at point z. • Lemma 2: The gain from adding a center at z is at least • Σ_{x : d(x,c(x)) > d(x,z)} [ d(x,c(x))^2 - d(x,z)^2 ]. • This computation is more expensive, so we do it for only 2k log k random candidates z.
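
The matching gain bound for a candidate center z, again as a sketch with names of our choosing:

```python
# Lower bound on the gain from adding a center at z: every point closer to z
# than to its current owner saves at least d(x,c(x))^2 - d(x,z)^2.
import numpy as np

def addition_gain(X, owner_dist, z):
    """X: (n, d) points; owner_dist[x] = d(x, c(x)); z: candidate center."""
    d_z = np.linalg.norm(X - z, axis=1)
    closer = d_z < owner_dist
    return float(np.sum(owner_dist[closer] ** 2 - d_z[closer] ** 2))
```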

  24. Sometimes a jump should only be a jiggle • How to use Lemmas 1 and 2: • delete the center with smallest maximum loss, • make new center at point with greatest minimum gain. • This procedure identifies good global improvements. • Small-scale improvements come from “jiggling” the center of an existing cluster: moving the center to a point inside the same cluster.

  25. jj-means: the smarter k-means algorithm • Run k-means with SFF initialization. • Repeat: • while there is improvement, try the best jump according to Lemmas 1 and 2; • until there is improvement, try a random jiggle. • “Try” means: run k-means to convergence afterwards. • Insert random jumps to satisfy the theorem.
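
The control flow above might look like the following structural sketch. It reuses kmeans and subset_furthest_first_init from the earlier sketches; propose_jump and propose_jiggle are hypothetical callables (built, for example, from removal_loss and addition_gain above) that return a modified set of centers; they are not code from the talk.

```python
# A hedged structural sketch of the jj-means loop; helper names are ours.

def distortion(X, centers, owner):
    # Within-class squared distortion (the slide 6 objective).
    return float(((X - centers[owner]) ** 2).sum())

def jj_means(X, k, propose_jump, propose_jiggle, rng, max_jiggles=20):
    owner, centers = kmeans(X, subset_furthest_first_init(X, k, rng))
    best = distortion(X, centers, owner)
    while True:
        # "Try" a move = apply it, then run k-means to convergence.
        owner2, centers2 = kmeans(X, propose_jump(X, centers, owner))
        if distortion(X, centers2, owner2) < best:
            owner, centers = owner2, centers2
            best = distortion(X, centers, owner)
            continue                          # keep jumping while it improves
        for _ in range(max_jiggles):          # otherwise, try random jiggles
            owner2, centers2 = kmeans(X, propose_jiggle(X, centers, owner, rng))
            if distortion(X, centers2, owner2) < best:
                owner, centers = owner2, centers2
                best = distortion(X, centers, owner)
                break                         # an improving jiggle: jump again
        else:
            return owner, centers             # no jump or jiggle improved: stop
```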

  26. Results with 1000 points, 8 dimensions, 10 centers • Conclusion: Running 10x longer is faster and better than restarting 10x.

  27. Goal: Make k-means faster, but with the same answer • Allow any black-box d() and any initialization method. • In later iterations, there is little movement of centers. • Distance calculations use the most time. • Geometrically, these are mostly redundant. Source: D. Pelleg.

  28. Let x be a point, c(x) its owner, and c a different center. • If we already know d(x,c) ≥ d(x,c(x)), • then computing d(x,c) precisely is not necessary. • Strategy: use the triangle inequality d(x,z) ≤ d(x,y) + d(y,z) to get sufficient conditions for d(x,c) ≥ d(x,b). • kd-trees are useful up to about 10 dimensions. • Distance-based data structures can be better. • Our approach is adaptive.

  29. Lemma 1: Let x be a point, and let b and c be centers. • If d(b,c) ≥ 2·d(x,b) then d(x,c) ≥ d(x,b). • Proof: We know d(b,c) ≤ d(b,x) + d(x,c). So d(b,c) - d(x,b) ≤ d(x,c). Now d(b,c) - d(x,b) ≥ 2·d(x,b) - d(x,b) = d(x,b). So d(x,b) ≤ d(x,c). • (Diagram: point x and centers b, c.)

  30. Lemma 2: Let x be a point, and let b and c be centers. • Then d(x,c) ≥ max [ 0, d(x,b) - d(b,c) ]. • Proof: We know d(x,b) ≤ d(x,c) + d(b,c), • so d(x,c) ≥ d(x,b) - d(b,c). • Also d(x,c) ≥ 0. • (Diagram: point x and centers b, c.)

  31. How to use Lemma 1 • Let c(x) be the owner of point x, and c' another center: • compute d(x,c') only if • d(x,c(x)) > ½·d(c(x),c'). • If we know an upper bound u(x) ≥ d(x,c(x)): • compute d(x,c') and d(x,c(x)) only if • u(x) > ½·d(c(x),c'). • If u(x) ≤ ½·min_{c' ≠ c(x)} d(c(x), c'): • eliminate all distance calculations for x.

  32. How to use Lemma 2 • Let x be any point, let c be any center, • and let c' be c at the previous iteration. • Assume a previous lower bound d(x,c') ≥ l'. • Then we get a new lower bound for the current iteration: • d(x,c) ≥ max [ 0, d(x,c') - d(c,c') ] ≥ max [ 0, l' - d(c,c') ]. • If l' is a good approximation and the center only moves slightly, then we get a good updated approximation.

  33. Pick initial centers c. • For all x and c, compute d(x,c). • Initialize lower bounds l(x,c) ← d(x,c). • Initialize upper bounds u(x) ← min_c d(x,c). • Initialize ownership c(x) ← argmin_c d(x,c). • Repeat until convergence: • Find all x s.t. u(x) ≤ ½·min_{c' ≠ c(x)} d(c(x), c'). • For each remaining x and c ≠ c(x) s.t. • u(x) > l(x,c) and • u(x) > ½·d(c(x), c): • compute d(x,c) and d(x,c(x)); • if d(x,c) < d(x,c(x)) then change owner c(x) ← c; • update l(x,c) ← d(x,c) and u(x) ← d(x,c(x)). • For each c, m(c) ← mean of points owned by c. • For each x and c, update l(x,c) ← max [ 0, l(x,c) - d(m(c),c) ]. • For each x, update u(x) ← u(x) + d(c(x), m(c(x))). • Update each center c ← m(c).
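
Assuming Euclidean distance, the procedure above might be sketched in NumPy as follows; the function and variable names are ours, and a production implementation would add further refinements (e.g. vectorizing over all points, as slide 34 suggests).

```python
# A compact NumPy sketch of the accelerated algorithm above, assuming
# Euclidean distance. Function and variable names are ours, not from the talk.
import numpy as np

def elkan_kmeans(X, centers, max_iter=100):
    """X: (n, d) array of points; centers: (k, d) array of initial centers."""
    n, k = len(X), len(centers)
    # Obvious initialization: one full pass of O(nk) distance calculations.
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    lower = D.copy()                      # l(x, c): lower bound on d(x, c)
    owner = D.argmin(axis=1)              # c(x): index of the closest center
    upper = D[np.arange(n), owner]        # u(x): upper bound on d(x, c(x))

    for _ in range(max_iter):
        # Inter-center distances and s(c) = 1/2 * min_{c' != c} d(c, c').
        cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
        np.fill_diagonal(cc, np.inf)
        s = 0.5 * cc.min(axis=1)

        for x in range(n):
            if upper[x] <= s[owner[x]]:
                continue                  # all centers pruned for this point
            tight = False                 # does upper[x] equal d(x, c(x))?
            for c in range(k):
                if c == owner[x]:
                    continue
                # Prune with the lower bound or with Lemma 1.
                if upper[x] <= lower[x, c] or upper[x] <= 0.5 * cc[owner[x], c]:
                    continue
                if not tight:             # recompute d(x, c(x)) exactly once
                    upper[x] = np.linalg.norm(X[x] - centers[owner[x]])
                    lower[x, owner[x]] = upper[x]
                    tight = True
                d_xc = np.linalg.norm(X[x] - centers[c])
                lower[x, c] = d_xc
                if d_xc < upper[x]:
                    owner[x], upper[x] = c, d_xc

        # m(c): new means; update bounds by how far each center moved (Lemma 2).
        moved = np.array([X[owner == c].mean(axis=0) if np.any(owner == c)
                          else centers[c] for c in range(k)])
        shift = np.linalg.norm(moved - centers, axis=1)
        lower = np.maximum(lower - shift[None, :], 0.0)
        upper = upper + shift[owner]
        centers = moved
        if np.allclose(shift, 0.0):
            break
    return owner, centers
```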

  34. Notes on the new algorithm • Empirical issue: which checks to do in which order. • Implement “for each remaining x and c” by looping over c, with vectorized code processing all x together. • Or, sequentially scan x and l(x,c) from disk. • Obvious initialization computes O(nk) distances. Faster methods give inaccurate l(x,c) and u(x), hence may do more distance calculations later.

  35. Experimental observations • Natural clusters are found while computing the distance between each point and each center less than once! • We find k = 100 clusters in n = 150,000 covtype points with 7,353,400 < nk = 15,000,000 distance calculations. • The number of distance calculations grows sublinearly with the number of iterations, because later iterations compute very few distances.

  36. Current limitations • Computing distances is no longer the dominant cost. • Reason: After each iteration, we • update nk lower bounds l(x,c) • use O(kd) time to recompute k means • use O(k2d) time to recompute all inter-center distances • Moreover, we can approximate distances in o(d) time, by considering the largest dimensions first.

  37. Deeper questions • What is the minimum # of distance calculations needed? • Adversary argument? If some calculations are omitted, an opponent can choose their values to make any clustering algorithm’s output incorrect. • Can we extend to clustering with general Bregman divergences? • Can we extend to soft-assignment clustering? Via lower and upper bounds on weights?
