500 likes | 526 Views
Grid-based Coresets for Clustering Problems. Christian Sohler Universität Paderborn (joint work with Gereon Frahling). Introduction Clustering. Clustering Partition input in sets (cluster), such that - Objects in same cluster are similar - Objects in different clusters are dissimilar
E N D
Grid-based Coresets for Clustering Problems Christian Sohler Universität Paderborn (joint work with Gereon Frahling)
IntroductionClustering • Clustering • Partition input in sets (cluster), such that- Objects in same cluster are similar - Objects in different clusters are dissimilar • Goal • Simplification • Discovery of patterns • Procedure • Map objects to Euclidean space => point set P • Points in same cluster are close • Points in different clusters are far away from eachother
Introductionk-means clustering • Clustering with Prototypes • One prototyp (center) for each cluster • k-Median Clustering • k clusters C ,…,C • One center c for each cluster C • Minimize S S d(p,c ) 1 k i i i pC i i
Introductionk-means clustering • Clustering with Prototypes • One prototyp (center) for each cluster • k-Median Clustering • k clusters C ,…,C • One center c for each cluster C • Minimize S S d(p,c ) 1 k i i i pC i i
Introductionk-means clustering • Clustering with Prototypes • One prototyp (center) for each cluster • k-Median Clustering • k clusters C ,…,C • One center c for each cluster C • Minimize S S d(p,c ) 1 k i i i pC i i
(128,59,88) (218,181,163) IntroductionSimplification / Lossy Compression
IntroductionProperties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to nearest center
IntroductionProperties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to nearest center
IntroductionProperties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to nearest center
IntroductionProperties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to nearest center
IntroductionProperties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: Assign each point to nearest center Notation: cost(P,C) denotes the cost of the solution defined this way
IntroductionCoresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called e-coreset for P, if for every set C of k centers we have • (1-e) cost(P,C) cost(S,C) (1+e) cost(P,C)
IntroductionCoresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called e-coreset for P, if for every set C of k centers we have • (1-e) cost(P,C) cost(S,C) (1+e) cost(P,C) • Replace point set by few weighted points(red) 3 4 5 5 4
IntroductionCoresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called e-coreset for P, if for every set C of k centers we have • (1-e) cost(P,C) cost(S,C) (1+e) cost(P,C) 3 4 5 5 4
IntroductionCoresets • Definition (Coreset for k-median) [HM04] • A weighted point set S is called e-coreset for P, if for every set C of k centers we have • (1-e) cost(P,C) cost(S,C) (1+e) cost(P,C) 3 4 5 5 4
IntroductionRelated work • Coresets for Clustering Problems • k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]existence of coresets, size independent of dimension • Projective clustering [Har-Peled, Varadarajan, 2002] • existence of coresets for projective clustering, faster algorithms • k-median, k-means [Har-Peled, Mazumdar, 2004]faster algorithms, data streaming, different definition of coresets • k-median, k-means [Har-Peled, Kushal, 2004]coresets of constant size • k-median [Chen, 2005]coreset with size polynomial in dimension • K-median, k-means, MaxCut [Frahling, S., 2005]‚oblivious‘ coreset construction, dynamic data streams
Coresets for clustering problems • k-means [Frahling, Sohler, 2006]efficient implementation • k-line median [Fiat, Feldman, Sharir, 2006]coresets for low dimensions • k-median, k-means [Feldman, Momemizadeh, Sohler, 2006]weak coresets; size independent of n and d
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Analysis • Moving a point by distance d changes cost(P,C) by at most d • Sum up movement for all regions • Show: Overall movement is at most ecost(P,C)
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R Only question: How to find regions?
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R First try: Regular grid with width W
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R First try: Regular grid with width W
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R First try: Regular grid with width W • Error per cell: • O(W #points in cell) • W e cost(P,C)/n • Too many cells
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R Second try:Refine grid till cells have at most R points
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R Second try:Refine grid till cells have at most R points per cell
Coreset constructionFirst try • Our Approach • Partition the input space into regions • For each region R • Count number w(R) of points in R • choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R Second try:Refine grid till cells have at most R points per cell • Error per cell: • O(Cell width R) • There can be point at distance Opt • Re • Too many cells
Coreset constructionSome definitions • Assumptions • Cost Opt of optimal k-median solution is known • Grid i has cell width Opt / 2 • O(log n) levels • Definition: • Cell in grid i is called heavy, if it contains more than d2 points. • A cell that is not heavy is light. • Observation: • „Movement cost“ for light cells is O(dOpt) • Construction: • Put coreset point in every light cell whose parent cell is heavy i i
Coreset constructionThe algorithm Computation of coreset points Opt
Coreset constructionThe algorithm Computation of coreset points
Coreset constructionThe algorithm Computation of coreset points
Coreset constructionThe algorithm Computation of coreset points 1
Coreset constructionThe algorithm Computation of coreset points 1
Coreset constructionThe algorithm Computation of coreset points 1 1 1 1 3 1 3 1 1 1 1
Coreset constructionThe algorithm Computation of coreset points 1 1 1 1 3 1 3 1 1 1 1
Coreset constructionThe algorithm Computation of coreset points 1 1 1 5 5 1 5 2 3 1 3 1 1 1 1
Coreset constructionAnalysis d • Coreset size 2 #heavy cells 1/ecell width
Coreset constructionAnalysis d • Coreset size 2 #heavy cells 1/ecell width • Number of „inner“ heavy cells per grid: • k/e (volume argument) d d
Coreset constructionAnalysis d • Coreset size 2 #heavy cells Contribution of „outer“ heavy cell ≥d/e cost(P,C) Number of outer heavy cells per grid e/d 1/ecell width • Number of „inner“ heavy cells per grid: • k/e (volume argument) d
Coreset constructionAnalysis • Coreset size O(log n (e/d + k/e )) d Contribution of „outer“ heavy cell ≥d/e cost(P,C) Number of outer heavy cells per grid e/d 1/ecell width • Number of „inner“ heavy cells: • k/e (volume argument) d
Coreset constructionAnalysis • Coreset size O(log n (e/d + k/e )) d 1/ecell width
Coreset constructionAnalysis • Coreset size O(log n (e/d + k/e )) d 1/ecell width Outer cells: Movement can be charged to contribution Overall cost e cost(P,C)
Coreset constructionAnalysis • Coreset size O(log n (e/d + k/e )) d Inner cells: Cost per cell d Opt 1/ecell width Outer cells: Movement can be charged to contribution Overall cost e cost(P,C)
Coreset constructionAnalysis • Coreset size O(log n (e/d + k/e )) d Inner cells: Cost per cell d Opt 1/ecell width #inner cells k/e d=e / log n Outer cells: Movement can be charged to contribution Overall cost e cost(P,C) d+1
Coreset Summary • Theorem • Our construction gives a coreset of size O(k log n / e ) • Dynamic geometric data streams • Stream of Insert(p)/Delete(p) operations; p {1,…,D} • Stream consistent: no Delete(p), if p is not in current set • Algorithm • Output: Set of k centers • Maintains Coreset • Compute centers from coreset using (1+e)-approx. algorithm d d
StreamingCoreset maintenance • How to maintain coreset • (1+e)-approx. of number of points in heavy cells sufficient • For grids with cell width Opt/2 we need approximation for all cells with more than d 2 points • Solution • Uniform random sampling will do • Reason • Size of grid cell imposes restriction on distribution • Sample hits only few cells • So, small space suffices i i
Conclusions • Summary • Streaming algorithm for insertions and deletions • Maintains coreset • Computes (1+e)-approximation from coreset • Some more progress on… • High dimensional dynamic data streams • Sliding window model (low dimensional)
Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh