This paper introduces grid-based coresets for clustering problems to simplify the discovery of patterns by mapping objects to a Euclidean space and minimizing distances between points in the same cluster. It discusses properties of k-means clustering and presents a method for constructing coresets efficiently. Related works and coreset definitions are also explored in the context of clustering algorithms. Additionally, the construction process and analysis of coresets are detailed, emphasizing the reduction of computational complexity in clustering tasks.
Grid-based Coresets for Clustering Problems Christian Sohler Universität Paderborn (joint work with Gereon Frahling)
Introduction: Clustering
• Clustering
  - Partition the input into sets (clusters) such that objects in the same cluster are similar and objects in different clusters are dissimilar
• Goal
  - Simplification
  - Discovery of patterns
• Procedure
  - Map objects to Euclidean space => point set P
  - Points in the same cluster are close
  - Points in different clusters are far away from each other
Introduction: k-means clustering
• Clustering with prototypes
  - One prototype (center) for each cluster
• k-median clustering
  - k clusters $C_1, \dots, C_k$
  - One center $c_i$ for each cluster $C_i$
  - Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)$
Introduction: Simplification / Lossy Compression
[Figure: example with the labeled points (128,59,88) and (218,181,163)]
Introduction: Properties of k-means
• Simple property of k-median
  - Point set P
  - Set of centers C
  - Best clustering: assign each point to its nearest center
• Notation: cost(P,C) denotes the cost of the solution defined this way
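The nearest-center property makes cost(P,C) easy to evaluate directly. The following is a minimal sketch, not code from the talk: the function name `cost`, the optional `weights` argument (useful later for weighted point sets), and the sample points are illustrative assumptions.

```python
import math

def cost(points, centers, weights=None):
    """k-median cost: every point pays its distance to the nearest center.

    `weights` is an assumption added for weighted point sets;
    unweighted points count with weight 1.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

# Hypothetical toy instance: two points near the first center, one on the second.
P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
C = [(0.0, 1.0), (10.0, 0.0)]
print(cost(P, C))  # distances 1 + 1 + 0
```

Because the assignment is implied by the centers, the function needs no explicit cluster labels.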
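Introduction: Coresets
• Definition (Coreset for k-median) [HM04]
  - A weighted point set S is called an ε-coreset for P if for every set C of k centers we have
    $(1-\varepsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon)\,\mathrm{cost}(P,C)$
• Replace the point set by a few weighted points (red)
[Figure: point set with weighted representatives; weights 3, 4, 5, 5, 4]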
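To make the definition concrete, the sketch below checks the two-sided bound for one fixed center set. It is only a spot check under stated assumptions: the definition quantifies over every set C of k centers, which a finite test cannot establish, and the names `satisfies_coreset_bound`, `P`, `S`, and `w` are hypothetical.

```python
import math

def cost(points, centers, weights=None):
    """Weighted k-median cost with nearest-center assignment."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_coreset_bound(P, S, S_weights, C, eps):
    """Check (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) for ONE center set C."""
    full = cost(P, C)
    approx = cost(S, C, S_weights)
    return (1 - eps) * full <= approx <= (1 + eps) * full

# Three nearby points are replaced by one representative of weight 3.
P = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
S = [(0.0, 0.0), (5.0, 5.0)]
w = [3, 1]
C = [(1.0, 1.0)]
print(satisfies_coreset_bound(P, S, w, C, eps=0.05))
```

For this particular C the weighted set overestimates the cost by roughly 1.4 percent, so the check passes for ε = 0.05 and fails for ε = 0.01.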
Introduction: Related work
• Coresets for clustering problems
  - k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]: existence of coresets, size independent of dimension
  - Projective clustering [Har-Peled, Varadarajan, 2002]: existence of coresets for projective clustering, faster algorithms
  - k-median, k-means [Har-Peled, Mazumdar, 2004]: faster algorithms, data streaming, different definition of coresets
  - k-median, k-means [Har-Peled, Kushal, 2004]: coresets of constant size
  - k-median [Chen, 2005]: coreset with size polynomial in dimension
  - k-median, k-means, MaxCut [Frahling, S., 2005]: 'oblivious' coreset construction, dynamic data streams
  - k-means [Frahling, Sohler, 2006]: efficient implementation
  - k-line median [Fiat, Feldman, Sharir, 2006]: coresets for low dimensions
  - k-median, k-means [Feldman, Monemizadeh, Sohler, 2006]: weak coresets; size independent of n and d
Coreset construction: First try
• Our approach
  - Partition the input space into regions
  - For each region R:
    - Count the number w(R) of points in R
    - Choose one representative point p from R
    - Assign weight w(R) to p
    - Remove all other points from R
• Analysis
  - Moving a point by distance d changes cost(P,C) by at most d
  - Sum up the movement over all regions
  - Show: overall movement is at most ε·cost(P,C)
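The first analysis step follows from the triangle inequality. As a short worked derivation (the notation $p'$ for the moved copy of $p$, $P'$ for the moved point set, and $d(p,C) = \min_{c \in C} d(p,c)$ is introduced here for illustration and is not from the slides):

```latex
\begin{align*}
d(p', C) &= \min_{c \in C} d(p', c)
          \le \min_{c \in C} \bigl( d(p', p) + d(p, c) \bigr)
          = d(p, p') + d(p, C), \\
d(p, C)  &\le d(p, p') + d(p', C) \quad \text{(same argument with $p$ and $p'$ swapped)}, \\
\bigl| \mathrm{cost}(P', C) - \mathrm{cost}(P, C) \bigr|
         &\le \sum_{p \in P} \bigl| d(p', C) - d(p, C) \bigr|
          \le \sum_{p \in P} d(p, p').
\end{align*}
```

So it suffices to bound the total movement $\sum_{p \in P} d(p, p')$ by $\varepsilon \cdot \mathrm{cost}(P,C)$.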
Coreset construction: First try
• Only question: how to find the regions?
• First try: regular grid with cell width W
  - Error per cell: O(W · #points in cell)
  - Requires W ≤ ε·cost(P,C)/n
  - Too many cells
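The regular-grid attempt can be sketched in a few lines. This is an illustrative sketch only: the names `grid_coreset` and `cost` are hypothetical, and picking the first point seen in a cell as its representative is an assumption (the slides only say "one representative point").

```python
import math
from collections import defaultdict

def cost(points, centers, weights=None):
    """Weighted k-median cost with nearest-center assignment."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def grid_coreset(points, width):
    """Group points by regular grid cell; keep one weighted representative per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / width) for x in p)  # index of the cell containing p
        cells[key].append(p)
    reps = [pts[0] for pts in cells.values()]        # assumed choice: first point seen
    weights = [len(pts) for pts in cells.values()]   # w(R) = number of points in the cell
    return reps, weights

P = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (3.5, 3.5)]
reps, weights = grid_coreset(P, width=1.0)
print(reps, weights)
```

Each point moves by at most the cell diameter W·√d (with d the dimension), so the cost changes by at most n·W·√d. Making that small relative to cost(P,C) forces a very fine grid, which is the "too many cells" problem.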
Coreset construction: Second try
• Second try: refine the grid until every cell contains at most R points
  - Error per cell: O(cell width · R)
  - There can be a point at distance Opt
  - R ε [relation between R and ε not recoverable from the source]
  - Too many cells
Coreset construction: Some definitions
• Assumptions
  - The cost Opt of an optimal k-median solution is known
  - Grid i has cell width $\mathrm{Opt}/2^i$
  - O(log n) levels
• Definition
  - A cell in grid i is called heavy if it contains more than $\delta 2^i$ points
  - A cell that is not heavy is light
• Observation
  - "Movement cost" for light cells is $O(\delta\,\mathrm{Opt})$
• Construction
  - Put a coreset point in every light cell whose parent cell is heavy
Coreset construction: The algorithm
• Computation of coreset points
[Figure sequence: the grid is refined step by step, starting from cell width Opt; coreset points are labeled with their weights (values shown: 1, 2, 3, 5)]
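The heavy/light construction can be sketched as a level-by-level refinement. This is a minimal sketch under stated assumptions, not the authors' implementation: Opt is passed in as a known value `opt`, grids are aligned at the origin so that cells nest, the representative is the first point seen in a cell, and a light cell on level 0 (which has no parent) also receives a coreset point. The function names are hypothetical.

```python
import math
from collections import defaultdict

def cell_key(p, width):
    """Index of the grid cell of the given width that contains p."""
    return tuple(math.floor(x / width) for x in p)

def heavy_light_coreset(points, opt, delta):
    """One weighted point per light cell whose parent cell is heavy.

    Grid i has cell width opt / 2**i; a cell is heavy if it holds more than
    delta * 2**i points. Requires delta > 0 so the threshold eventually
    exceeds the number of points and the loop terminates.
    """
    reps, weights = [], []
    active = list(points)  # points that lie in heavy cells of the previous level
    level = 0
    while active:
        width = opt / 2 ** level
        threshold = delta * 2 ** level
        cells = defaultdict(list)
        for p in active:
            cells[cell_key(p, width)].append(p)
        active = []
        for pts in cells.values():
            if len(pts) > threshold:
                active.extend(pts)        # heavy: refine on the next level
            else:
                reps.append(pts[0])       # light: emit one representative ...
                weights.append(len(pts))  # ... weighted by the cell's point count
        level += 1
    return reps, weights

P = [(0.5, 0.5), (0.6, 0.6), (0.7, 0.7), (7.5, 7.5)]
reps, weights = heavy_light_coreset(P, opt=8.0, delta=1.0)
print(reps, weights)
```

On this toy input the isolated point becomes a coreset point on level 1, and the three clustered points collapse into one point of weight 3 on level 2. The weights always sum to the number of input points.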
Coreset construction: Analysis
• Coreset size ≤ $2^d \cdot$ #heavy cells
[Figure label: $(1/\varepsilon) \cdot$ cell width]
• Number of "inner" heavy cells per grid: $k/\varepsilon^d$ (volume argument)
• Contribution of an "outer" heavy cell: ≥ $(\delta/\varepsilon)\,\mathrm{cost}(P,C)$
• Number of outer heavy cells per grid: $\varepsilon/\delta$
• Coreset size: $O(\log n\,(\varepsilon/\delta + k/\varepsilon^d))$
• Inner cells
  - Cost per cell: $\delta\,\mathrm{Opt}$
  - #inner cells: $k/\varepsilon^d$
  - $\delta = \varepsilon^{d+1}/\log n$
• Outer cells
  - Movement can be charged to the contribution
  - Overall cost ≤ $\varepsilon\,\mathrm{cost}(P,C)$
Coreset: Summary
• Theorem
  - Our construction gives a coreset of size $O(k \log n / \varepsilon^d)$
• Dynamic geometric data streams
  - Stream of Insert(p)/Delete(p) operations; $p \in \{1,\dots,\Delta\}^d$
  - Stream is consistent: no Delete(p) if p is not in the current set
• Algorithm
  - Output: set of k centers
  - Maintains a coreset
  - Computes centers from the coreset using a (1+ε)-approximation algorithm
Streaming: Coreset maintenance
• How to maintain the coreset
  - A (1+ε)-approximation of the number of points in heavy cells is sufficient
  - For grids with cell width $\mathrm{Opt}/2^i$ we need an approximation for all cells with more than $\delta 2^i$ points
• Solution
  - Uniform random sampling will do
• Reason
  - The size of a grid cell imposes a restriction on the distribution
  - The sample hits only few cells
  - So small space suffices
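The sampling idea can be illustrated with a small counter that survives deletions. This is a sketch under explicit assumptions and is not the data structure from the talk: sampling is done by hashing each point, so that Insert(p) and Delete(p) make the same keep-or-skip decision, the points in the current set are assumed distinct, and the class name `SampledCellCounter` and its parameters are hypothetical.

```python
import hashlib
import math
from collections import Counter

class SampledCellCounter:
    """Estimate per-cell point counts under Insert/Delete from a hash-based sample."""

    def __init__(self, width, rate):
        self.width = width       # cell width of the grid being tracked
        self.rate = rate         # fraction of points kept in the sample
        self.counts = Counter()  # sampled points per cell

    def _sampled(self, p):
        # Deterministic pseudo-random decision: the same p always gets the same answer.
        digest = hashlib.blake2b(repr(p).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2 ** 64 < self.rate

    def _cell(self, p):
        return tuple(math.floor(x / self.width) for x in p)

    def insert(self, p):
        if self._sampled(p):
            self.counts[self._cell(p)] += 1

    def delete(self, p):
        if self._sampled(p):
            self.counts[self._cell(p)] -= 1

    def estimate(self, cell):
        """Scale the sampled count back up to an estimate of the true count."""
        return self.counts[cell] / self.rate

counter = SampledCellCounter(width=100, rate=0.25)
for x in range(100):
    for y in range(100):
        counter.insert((x, y))
for x in range(50, 100):
    for y in range(100):
        counter.delete((x, y))
print(counter.estimate((0, 0)))  # true count is 5000; the estimate should be close
```

The estimate is only reliable for cells with many points, which matches the requirement above that approximate counts are needed for heavy cells only.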
Conclusions
• Summary
  - Streaming algorithm for insertions and deletions
  - Maintains a coreset
  - Computes a (1+ε)-approximation from the coreset
• Some more progress on…
  - High-dimensional dynamic data streams
  - Sliding window model (low-dimensional)
Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh