A Fast PTAS for k-Means Clustering Dan Feldman, Tel Aviv University, Morteza Monemizadeh, Christian Sohler, Universität Paderborn
Simple Coresets for Clustering Problems: Overview • Introduction • Weak Coresets • Definition • Intuition • The construction • A sketch of the analysis • The k-means PTAS • Conclusions
Introduction: Clustering • Clustering • Partition the input into sets (clusters), such that: objects in the same cluster are similar; objects in different clusters are dissimilar • Goal • Simplification • Discovery of patterns • Procedure • Map objects to Euclidean space => point set P • Points in the same cluster are close • Points in different clusters are far away from each other
Introduction: k-Means Clustering • Clustering with prototypes • One prototype (center) for each cluster • k-Means clustering • k clusters C_1, …, C_k • One center c_i for each cluster C_i • Minimize Σ_{i=1..k} Σ_{p ∈ C_i} d(p, c_i)²
Introduction: Simplification / Lossy Compression • [Figure: example colors (128,59,88) and (218,181,163)]
Introduction: Properties of k-Means • Optimal solution, if • Centers are given: assign each point to the nearest center • Clusters are given: take the centroid (mean) of each cluster • Notation: cost(P,C) denotes the cost of the solution defined this way
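The cost function and the two optimality properties can be made concrete in a few lines. Below is a minimal sketch (not from the talk; the function names are ours) that computes cost(P,C) and performs one step that uses both properties.

```python
import numpy as np

def cost(P, C):
    """cost(P, C): sum of squared distances from each point to its nearest center."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
    return d2.min(axis=1).sum()

def lloyd_step(P, C):
    """One step using both optimality properties:
    centers given -> assign each point to its nearest center;
    clusters given -> replace each center by the centroid (mean) of its cluster."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return np.array([P[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                     for j in range(len(C))])
```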
Weak Coresets: Centroid Sets • Definition (ε-approx. centroid set) • A set S is called an ε-approximate centroid set if • it contains a subset C ⊆ S s.t. cost(P,C) ≤ (1+ε)·cost(P,Opt) • Lemma [KSS04] • The centroid of a random set of 2/ε points is, with constant probability, a (1+ε)-approximation of the optimal center of P. • Corollary • The set of all centroids of subsets of 2/ε points is an ε-approx. centroid set.
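As an illustration of the [KSS04] lemma for a single cluster, here is a small sketch (our own, not the paper's code): with constant probability, the centroid of a uniform sample of about 2/ε points has 1-mean cost at most (1+ε) times that of the true centroid.

```python
import numpy as np

def one_mean_cost(P, c):
    """Cost of a single center c on point set P."""
    return ((P - c) ** 2).sum()

def sampled_center(P, eps, rng=np.random.default_rng()):
    """Centroid of a uniform sample of ~2/eps points (drawn with repetition)."""
    m = max(1, int(np.ceil(2 / eps)))
    S = P[rng.choice(len(P), size=m, replace=True)]
    return S.mean(axis=0)

# With constant probability:
#   one_mean_cost(P, sampled_center(P, eps)) <= (1 + eps) * one_mean_cost(P, P.mean(axis=0))
```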
Weak Coresets: Definition • Definition (weak ε-coreset for k-means) • A pair (K,S) is called a weak ε-coreset for P if, for every set C of k centers from the ε-approx. centroid set S, we have • (1-ε)·cost(P,C) ≤ cost(K,C) ≤ (1+ε)·cost(P,C) • [Figure: point set P (light blue); set of solutions S (yellow); a possible coreset with weights 3, 4, 5, 5, 4 (red); the coreset approximates the cost of k centers (violet) from S]
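A small sketch (our own helper names) of what the definition requires: the weighted cost of the coreset K must stay within a (1 ± ε) factor of the cost of the full point set P, for every candidate solution C taken from the centroid set S.

```python
import numpy as np

def weighted_cost(K, w, C):
    """Cost of centers C on the weighted coreset (K, w)."""
    d2 = ((K[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return (w * d2.min(axis=1)).sum()

def satisfies_weak_coreset_property(P, K, w, solutions, eps):
    """Check (1-eps)*cost(P,C) <= cost(K,C) <= (1+eps)*cost(P,C)
    for every k-center candidate C in `solutions` (drawn from the centroid set S)."""
    for C in solutions:
        full = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()
        core = weighted_cost(K, w, C)
        if not ((1 - eps) * full <= core <= (1 + eps) * full):
            return False
    return True
```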
Weak Coresets: Ideal Sampling • Problem • Given n numbers a_1, …, a_n > 0 • Task: approximate A := Σ a_i by random sampling • Ideal sampling • Assign weights w_1, …, w_n to the numbers • w_j = A / a_j • Pr[x = j] = a_j / A • Estimator: w_x·a_x • Properties of the estimator: (1) w_x·a_x = A (zero variance) (2) The expected weight of number j is 1 • Only problem: weights can be very large
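A minimal sketch of the ideal sampling estimator above (our own names): a single index drawn with probability proportional to its value, weighted by A / a_j, returns A exactly.

```python
import numpy as np

def ideal_sample(a, rng=np.random.default_rng()):
    """Ideal sampling for A = sum(a):
    Pr[x = j] = a[j] / A and weight w_j = A / a[j],
    so the estimator w_x * a_x equals A (zero variance)
    and the expected weight of every index is 1."""
    a = np.asarray(a, dtype=float)
    A = a.sum()
    j = rng.choice(len(a), p=a / A)
    w = A / a[j]
    return j, w, w * a[j]          # w * a[j] == A

# The catch exploited later: if a[j] is tiny, the weight A / a[j] is huge.
```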
Weak Coresets: Construction • Step 1: Compute a constant factor approximation
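One standard way to obtain a rough initial solution is D²-sampling (k-means++ style seeding). The sketch below is our own and not necessarily the routine the paper uses; it gives an O(log k)-approximation in expectation and is shown only as a stand-in for the constant factor approximation required in Step 1.

```python
import numpy as np

def d2_seeding(P, k, rng=np.random.default_rng()):
    """D^2-sampling (k-means++ seeding): pick each new center with probability
    proportional to the squared distance to the centers chosen so far."""
    centers = [P[rng.integers(len(P))]]
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        if d2.sum() == 0:          # all points already coincide with a chosen center
            break
        centers.append(P[rng.choice(len(P), p=d2 / d2.sum())])
    return np.array(centers)
```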
Weak Coresets: Construction • Step 2: Consider each cluster separately • Main idea: apply ideal sampling to each cluster C with center c • Pr[p_i is taken] = dist(p_i, c)² / cost(C,c) • w(p_i) = cost(C,c) / dist(p_i, c)² • But what about high weights?
Weak Coresets: Construction • Step 3: A little twist • Uniform sampling from a small ball around the center; radius = (average distance) / ε • Ideal sampling from the 'outliers' outside the ball
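The sketch below (our own simplification, not the paper's exact construction) shows the spirit of Steps 2 and 3 for a single cluster: points inside a ball of radius (average distance)/ε around the center are sampled uniformly, while the 'outliers' outside the ball are sampled ideal-style, proportionally to their cost contribution, which keeps all weights bounded.

```python
import numpy as np

def sample_cluster(C_pts, c, eps, m_in, m_out, rng=np.random.default_rng()):
    """Coreset points and weights for one cluster C_pts with center c (a sketch).
    Inside the ball of radius avg_dist/eps: uniform sampling, weight |inside|/m.
    Outside the ball ('outliers'): sampling proportional to squared distance,
    weight (total outlier cost) / (m * squared distance)."""
    d2 = ((C_pts - c) ** 2).sum(axis=1)
    r = np.sqrt(d2).mean() / eps                   # radius = average distance / eps
    in_mask = np.sqrt(d2) <= r
    inside, outside = C_pts[in_mask], C_pts[~in_mask]
    pts, wts = [], []
    if len(inside) > 0:
        idx = rng.choice(len(inside), size=min(m_in, len(inside)), replace=True)
        pts.append(inside[idx])
        wts.append(np.full(len(idx), len(inside) / len(idx)))
    if len(outside) > 0:
        out_cost = d2[~in_mask]
        idx = rng.choice(len(outside), size=min(m_out, len(outside)),
                        replace=True, p=out_cost / out_cost.sum())
        pts.append(outside[idx])
        wts.append(out_cost.sum() / (len(idx) * out_cost[idx]))
    return np.vstack(pts), np.concatenate(wts)
```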
Weak Coresets: Analysis • Fix an arbitrary set of centers K • Case (a): the nearest center is 'far away' • At least a (1-ε)-fraction of the points lies inside the ball, by choice of the radius • The weight of the samples from the outliers is at most ε|C|, so we can forget about the outliers • If D is the distance to the nearest center and the ball radius is at most εD, it does not matter where the points lie inside the ball
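The calculation behind the last point, filled in here for completeness: if every point p of the ball is at distance between (1-ε)D and (1+ε)D from the nearest center in K, then

```latex
(1-\varepsilon)^2 D^2 \;\le\; d(p,K)^2 \;\le\; (1+\varepsilon)^2 D^2
\qquad\Longrightarrow\qquad
d(p,K)^2 = (1 \pm 3\varepsilon)\, D^2 \quad \text{for } \varepsilon \le 1,
```

so replacing the points inside the ball by any sample of the same total weight changes the cost contribution of the ball by at most a (1 ± 3ε) factor.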
Weak Coresets: Analysis • Fix an arbitrary set of centers K • Case (b): the nearest center is 'near' • Almost ideal sampling: expectation is cost(C,K), low variance
Weak Coresets: Result • The centroid set • S is the set of all centroids of 2/ε points (with repetition) from our sample set K • Can show that K approximates all solutions from S • Can show that S is an ε-approx. centroid set w.h.p. • Theorem • One can compute in O(nkd) time a weak ε-coreset (K,S). The size of K is poly(k, 1/ε). S is the set of all centroids of subsets of K of size 2/ε.
Weak Coresets: Applications • Fast-k-Means-PTAS(P, k) • Compute a weak coreset K • Project K onto a poly(1/ε, k)-dimensional space • Exhaustively search for the best solution in the (projection of the) centroid set • Return the centroids of the points that create C • Running time: O(nkd + (k/ε)^Õ(k/ε))
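A high-level sketch of the exhaustive search step (our own Python, omitting the dimension reduction and the final lifting back to the original space): build the centroid set S from all multisets of 2/ε coreset points, then try every choice of k candidates and keep the one that is cheapest on the weighted coreset.

```python
from itertools import combinations, combinations_with_replacement
import numpy as np

def exhaustive_search(K, w, k, eps):
    """Search the centroid set S of the weak coreset (K, w):
    S = centroids of all multisets of 2/eps points of K;
    the cheapest k candidates (measured on the weighted coreset)
    give a (1+eps)-approximate set of centers."""
    m = max(1, int(np.ceil(2 / eps)))
    S = np.array([K[list(idx)].mean(axis=0)
                  for idx in combinations_with_replacement(range(len(K)), m)])
    best, best_cost = None, np.inf
    for choice in combinations(range(len(S)), k):
        C = S[list(choice)]
        d2 = ((K[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        c = (w * d2.min(axis=1)).sum()
        if c < best_cost:
            best, best_cost = C, c
    return best, best_cost
```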
Summary • Weak coresets of size independent of n and d • Fast PTAS for k-means • First PTAS for kernel k-means (if the kernel maps into a finite dimensional space)
Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh