The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy, University of Waterloo
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), Leonard Schulman (Caltech)
The k-means Problem
X ⊆ R^d: point set with |X| = n; d(·,·): L2 distance
Given: n points in d-dimensional space
• partition X into k clusters X1,…, Xk
• assign each point in Xi to a common center ci ∈ R^d
Goal: Minimize ∑i ∑x∈Xi d(x, ci)²
k-means (contd.)
• Given the ci's, the best clustering assigns each point to its nearest center: Xi = {x ∈ X : ci is the center nearest to x}
• Given the Xi's, the best choice of centers is ci = center of mass of Xi = ctr(Xi) = ∑x∈Xi x / |Xi|
⇒ An optimal solution satisfies both properties
The problem is NP-hard even for k = 2 (n, d not fixed)
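The two best-response properties above translate directly into code. Below is a minimal sketch (not from the slides; the function names are illustrative) of the k-means objective and the two update steps, in Python with NumPy.

```python
# Minimal illustrative sketch of the k-means objective and the two
# best-response steps described above; assumes NumPy and that no cluster
# ends up empty. Function names are hypothetical.
import numpy as np

def kmeans_cost(X, centers, labels):
    """Sum of squared L2 distances from each point to its assigned center."""
    return float(np.sum((X - centers[labels]) ** 2))

def assign_to_nearest(X, centers):
    """Best clustering for fixed centers: each point goes to its nearest center."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def update_centers(X, labels, k):
    """Best centers for a fixed clustering: the center of mass of each cluster."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```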
Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms ≡ algorithms with provable guarantees
• PTASs with varying runtime dependence on n, d, k: poly/linear in n, could be exponential in d and/or k
   • Matousek (poly(n), exp(d, k))
   • Kumar, Sabharwal & Sen (KSS04) (lin(n, d), exp(k))
• O(1)-approximation algorithms for k-median: any point set, any metric, runtime poly(n, d, k); guarantees also translate to k-means
   • Charikar, Guha, Tardos & Shmoys
   • Arya et al. + Kanungo et al.: (9+ε)-approximation
b) Heuristics: Lloyd's method was invented in 1957 and remains an extremely popular heuristic even today.
1) Start with k initial / "seed" centers c1,…, ck.
2) Iterate the following Lloyd step (a runnable sketch follows below):
   • Assign each point to its nearest center ci to obtain clustering X1,…, Xk.
   • Update ci ← ctr(Xi) = ∑x∈Xi x / |Xi|.
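As a concrete illustration of the Lloyd step above, here is a hedged sketch of the full iteration, reusing the helper functions from the earlier snippet (again illustrative, not the authors' code).

```python
# Hedged sketch of Lloyd's method: alternate assignment and recentering until
# the centers stop moving. `seeds` is a (k, d) array of initial centers; how
# to choose good seeds is the subject of the rest of the talk. Assumes every
# cluster stays non-empty.
def lloyd(X, seeds, max_iters=100, tol=1e-9):
    centers = np.asarray(seeds, dtype=float).copy()
    k = len(centers)
    for _ in range(max_iters):
        labels = assign_to_nearest(X, centers)       # Lloyd step, part 1
        new_centers = update_centers(X, labels, k)   # Lloyd step, part 2
        if np.allclose(new_centers, centers, atol=tol):
            break
        centers = new_centers
    return centers, labels
```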
Lloyd's method: What's known?
• Some bounds on the number of iterations of Lloyd-type methods: Inaba-Katoh-Imai; Har-Peled-Sadri; Arthur-Vassilvitskii ('06)
• Performance is very sensitive to the choice of seed centers; there is a lot of literature on finding "good" seeding methods for Lloyd
• But almost no analysis proves performance guarantees on the quality of the final solution for arbitrary k and dimension
Our Goal: analyze Lloyd and try to prove rigorous performance guarantees for Lloyd-type methods
Our Results
Main Theorem: If the data has a "meaningful k-clustering", then there is a simple, efficient seeding method such that Lloyd-type methods return a near-optimal solution.
• Introduce a clusterability or separation condition.
• Give a novel, efficient sampling process for seeding Lloyd's method with initial centers.
• Show that if the data satisfies our clusterability condition:
   • seeding + 1 Lloyd step yields a constant-factor approximation in time linear in n and d, poly(k): potentially faster than Lloyd variants that require multiple reseedings
   • seeding + KSS04-sampling gives a PTAS: the algorithm is faster and simpler than the PTAS in KSS04.
"Meaningful k-Clustering"
Settings where one would NOT consider the data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning; it provides an ambiguous classification.
2) If the cost of the best k-clustering ≈ the cost of the best (k-1)-clustering, then a k-clustering yields only marginal benefit over the best (k-1)-clustering; one should use a smaller value of k here.
Example: k = 3
We formalize 2). Let Δk²(X) = cost of the best k-clustering of X.
X is ε-separated for k-means iff Δk²(X) / Δk-1²(X) ≤ ε².
• Simple condition. The drop in k-clustering cost is already used by practitioners to choose the right k (a hedged sketch of this check follows below).
• Can show that (roughly): X is ε-separated for k-means ⇔ any two low-cost k-clusterings disagree only on a small fraction of the data.
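As a rough illustration of using the cost drop to choose k, here is a hedged sketch. Computing Δk²(X) exactly is NP-hard, so the sketch uses the best of several randomly seeded Lloyd runs as a stand-in for the optimum and therefore only approximates the ratio in the definition; all names are illustrative.

```python
# Illustrative check of the epsilon-separation condition: estimate
# Delta_k^2(X) / Delta_{k-1}^2(X) using the best of several Lloyd runs as a
# heuristic proxy for the optimal cost. A small ratio suggests a k-clustering
# is much better than any (k-1)-clustering.
def approx_best_cost(X, k, runs=10, rng=np.random.default_rng()):
    best = np.inf
    for _ in range(runs):
        seeds = X[rng.choice(len(X), size=k, replace=False)]
        centers, labels = lloyd(X, seeds)
        best = min(best, kmeans_cost(X, centers, labels))
    return best

def separation_ratio(X, k):
    return approx_best_cost(X, k) / approx_best_cost(X, k - 1)
```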
The 2-means problem (k=2)
Assume X is ε-separated for 2-means.
X*1, X*2: optimal clusters; c*i = ctr(X*i); D* = d(c*1, c*2); ni = |X*i|
(r*i)² = ∑x∈X*i d(x, c*i)² / ni = Δ1²(X*i)/ni = avg. squared distance in cluster X*i
Lemma: For i = 1, 2, (r*i)² ≤ (ε²/(1-ε²))·D*².
Proof: Δ2²(X)/ε² ≤ Δ1²(X) = Δ2²(X) + (n1n2/n)·D*²; rearranging, and using ni(r*i)² ≤ Δ2²(X) and n1n2/n ≤ ni, gives the bound.
The 2-means algorithm
Assume: X is ε-separated for 2-means.
1) Sampling-based seeding procedure:
   • Pick two seed centers c1, c2 by randomly picking the pair x, y ∈ X with probability proportional to d(x, y)².
2) Lloyd step or simpler "ball k-means step":
   • For each ci, let Bi = {x ∈ X : d(x, ci) ≤ d(c1, c2)/3}.
   • Update ci ← ctr(Bi); return these as the final centers.
The sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time. (A hedged sketch is given below.)
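A hedged sketch of this 2-means procedure follows. For simplicity the pair sampling below is the naive O(n²d) version rather than the O(nd) implementation mentioned on the slide, and the helper names are illustrative.

```python
# Sketch of the 2-means algorithm above: (1) sample a seed pair with
# probability proportional to its squared distance, (2) do one ball k-means
# step. Uses naive O(n^2 d) pair sampling; the slides note O(nd) is possible.
def two_means(X, rng=np.random.default_rng()):
    n = len(X)
    # 1) Seeding: P[pick pair (x, y)] proportional to d(x, y)^2.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    weights = np.triu(sq, k=1).ravel()                 # each unordered pair once
    idx = rng.choice(n * n, p=weights / weights.sum())
    i, j = np.unravel_index(idx, (n, n))
    centers = np.array([X[i], X[j]], dtype=float)
    # 2) Ball k-means step: recenter each seed on the points within
    #    d(c1, c2)/3 of it.
    radius = np.linalg.norm(centers[0] - centers[1]) / 3.0
    for t in range(2):
        ball = X[np.linalg.norm(X - centers[t], axis=1) <= radius]
        if len(ball) > 0:                              # guard against an empty ball
            centers[t] = ball.mean(axis=0)
    return centers
```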
2-means: Analysis
Let core(X*i) = {x ∈ X*i : d(x, c*i)² ≤ (r*i)²/ρ}, where ρ = Θ(ε²) < 1.
Seeding lemma: With probability 1 – O(ρ), c1, c2 lie in the cores of X*1, X*2.
Proof: |core(X*i)| ≥ (1-ρ)ni for i = 1, 2 (by Markov's inequality).
Let A = ∑x∈core(X*1), y∈core(X*2) d(x, y)² ≈ (1-ρ)²·n1n2·D*²,
and B = ∑{x,y}⊆X d(x, y)² = n·Δ1²(X) ≈ n1n2·D*².
Probability = A/B ≈ (1-ρ)² = 1 – O(ρ).
2-means analysis (contd.)
Recall that Bi = {x ∈ X : d(x, ci) ≤ d(c1, c2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*i) ⊆ Bi ⊆ X*i. Therefore d(ctr(Bi), c*i)² ≤ ρ(r*i)²/(1–ρ).
Intuitively, since Bi ⊆ X*i and Bi contains almost all of the mass of X*i, ctr(Bi) must be close to ctr(X*i) = c*i.
2-means analysis (contd.)
Theorem: With probability 1 – O(ρ), the cost of the final clustering is at most Δ2²(X)/(1–ρ), i.e., we get a (1/(1–ρ))-approximation algorithm.
Since ρ = O(ε²), the approximation ratio → 1 as ε → 0, and the probability of success → 1 as ε → 0.
Arbitrary k
• The algorithm and analysis follow the same outline as in 2-means.
• If X is ε-separated for k-means, one can again show that all clusters are well separated, that is,
   • cluster radius << inter-cluster distance: r*i = O(ε)·d(c*i, c*j) ∀ i, j
• 1) Seeding stage: we choose k initial centers and ensure that they lie in the "cores" of the k optimal clusters.
   • exploits the fact that the clusters are well separated
   • after the seeding stage, each optimal center has a distinct seed center very "near" it
• 2) Now we can run either a Lloyd step or a ball-k-means step.
• Theorem: If X is ε-separated for k-means, then one can obtain an α(ε)-approximation algorithm where α(ε) → 1 as ε → 0.
Schematic of the entire algorithm
Routes to obtaining k well-placed seeds:
• Simple sampling: success probability = exp(-k)
• Greedy deletion: O(n³d)
• Oversampling + deletion: sample O(k) centers, then greedily delete till k remain; O(1) success probability, O(nkd + k³d)
From k well-placed seeds:
• Ball k-means or Lloyd step: gives an O(1)-approximation
• KSS04-sampling: gives a PTAS
Simple sampling: Pick k centers as follows (a hedged sketch follows below).
• first pick 2 centers c1, c2 as in 2-means
• to pick center ci+1, pick x ∈ X with probability proportional to minj≤i d(x, cj)²
Greedy deletion: Start with n centers and keep deleting the center that causes the least cost increase till k centers remain.
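Here is a hedged sketch of the "simple sampling" route in the schematic: after the first two seeds (picked as in 2-means), each subsequent seed is drawn with probability proportional to its squared distance to the nearest seed chosen so far. It reuses the earlier illustrative helpers.

```python
# Sketch of the "simple sampling" seeding for general k: distance-squared
# sampling starting from the 2-means seed pair. Reuses two_means() from above.
def seed_k_centers(X, k, rng=np.random.default_rng()):
    centers = list(two_means(X, rng))                       # first two seeds
    d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
    for _ in range(k - 2):
        # P[pick x] proportional to min_{j <= i} d(x, c_j)^2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

These seeds could then be fed into lloyd() or a ball k-means step, matching the right-hand side of the schematic.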
Open Questions
• Deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
• PTAS for k-means with polynomial-time dependence on n, k and d? Is it APX-hard in the geometric setting?
• PTAS for k-means under our separation condition?
• Other applications of the separation condition?