Efficient Grid-based Coresets for Clustering Problems

Grid-based Coresets for Clustering Problems • Christian Sohler, Universität Paderborn (joint work with Gereon Frahling)

Introduction: Clustering • Clustering • Partition the input into sets (clusters) such that - objects in the same cluster are similar - objects in different clusters are dissimilar • Goal • Simplification • Discovery of patterns • Procedure • Map objects to Euclidean space => point set P • Points in the same cluster are close • Points in different clusters are far away from each other

Introduction: k-means clustering • Clustering with prototypes • One prototype (center) for each cluster • k-median clustering • k clusters $C_1, \dots, C_k$ • One center $c_i$ for each cluster $C_i$ • Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)$

Introduction: Simplification / Lossy Compression [Figure labels: (128,59,88), (218,181,163)]

Introduction: Properties of k-means • Simple property of k-median • Point set P • Set of centers C • Best clustering: assign each point to its nearest center • Notation: cost(P,C) denotes the cost of the solution defined this way
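
The notation cost(P,C) can be illustrated with a minimal Python sketch. The function name `cost`, the optional `weights` argument, and the sample points are assumptions made for this example; they are not taken from the slides.

```python
import math

def cost(points, centers, weights=None):
    """k-median cost: every point pays its distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted point set
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

# Hypothetical example data: two small groups of points, one center per group.
P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
C = [(0.0, 1.0), (10.0, 2.0)]
print(cost(P, C))  # 1 + 1 + 2 + 2 = 6.0
```

The optional weights make the same function usable for a weighted point set S, which is what the coreset definition needs.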

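Introduction: Coresets • Definition (coreset for k-median) [HM04] • A weighted point set S is called an $\varepsilon$-coreset for P if for every set C of k centers we have • $(1-\varepsilon) \cdot \mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon) \cdot \mathrm{cost}(P,C)$ • Replace the point set by a few weighted points (red) [Figure: weighted points with weights 3, 4, 5, 5, 4]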
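
As a rough illustration of the definition, the following Python sketch checks the two-sided inequality for one given center set C. The definition quantifies over every set of k centers, so passing this check for a single C does not prove that S is a coreset. The function names and the data are hypothetical.

```python
import math

def weighted_cost(points, weights, centers):
    """Weighted k-median cost: weight times distance to the nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_coreset_bound(P, S, S_weights, centers, eps):
    """Check (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) for ONE center set C."""
    cost_P = weighted_cost(P, [1.0] * len(P), centers)
    cost_S = weighted_cost(S, S_weights, centers)
    return (1 - eps) * cost_P <= cost_S <= (1 + eps) * cost_P

# Hypothetical data: four points replaced by two weighted points.
P = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
S = [(0.05, 0.0), (5.05, 0.0)]
C = [(1.0, 0.0), (4.0, 0.0)]
print(satisfies_coreset_bound(P, S, [2.0, 2.0], C, eps=0.1))
```

With weights that do not sum to the number of input points (for example `[1.0, 1.0]`), the same check fails for this C.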

Introduction: Related work • Coresets for clustering problems • k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]: existence of coresets, size independent of dimension • Projective clustering [Har-Peled, Varadarajan, 2002]: existence of coresets for projective clustering, faster algorithms • k-median, k-means [Har-Peled, Mazumdar, 2004]: faster algorithms, data streaming, different definition of coresets • k-median, k-means [Har-Peled, Kushal, 2004]: coresets of constant size • k-median [Chen, 2005]: coreset with size polynomial in dimension • k-median, k-means, MaxCut [Frahling, Sohler, 2005]: 'oblivious' coreset construction, dynamic data streams

Coresets for clustering problems • k-means [Frahling, Sohler, 2006]: efficient implementation • k-line median [Fiat, Feldman, Sharir, 2006]: coresets for low dimensions • k-median, k-means [Feldman, Monemizadeh, Sohler, 2006]: weak coresets; size independent of n and d

Coreset construction: First try • Our approach • Partition the input space into regions • For each region R • Count the number w(R) of points in R • Choose one representative point p from R • Assign weight w(R) to p • Remove all other points from R • Analysis • Moving a point by distance d changes cost(P,C) by at most d • Sum up the movement over all regions • Show: the overall movement is at most $\varepsilon \cdot \mathrm{cost}(P,C)$
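
The first analysis step (moving a point changes the cost by at most the distance moved) follows from the triangle inequality. A short worked version, writing $d(p,C) = \min_{c \in C} d(p,c)$ and using $p'$ for the moved point (symbols introduced here for illustration):

```latex
% p is moved to p'; c is the center in C nearest to p.
\[
  d(p', C) \;\le\; d(p', c) \;\le\; d(p', p) + d(p, c) \;=\; d(p', p) + d(p, C)
\]
% By symmetry, d(p, C) <= d(p, p') + d(p', C), hence
\[
  \bigl| d(p', C) - d(p, C) \bigr| \;\le\; d(p, p').
\]
```

Summing this bound over all points gives the total change in cost(P,C) as at most the total movement.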

Coreset construction: First try • Only question: how to find the regions? • First try: regular grid with cell width W • Error per cell: $O(W \cdot \#\text{points in cell})$ • $W \le \varepsilon \cdot \mathrm{cost}(P,C)/n$ • Too many cells
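
The regular-grid attempt can be sketched in a few lines of Python. This is only an illustration of the approach on the slide; the function name, the choice of the first point seen as representative, and the sample data are assumptions.

```python
import math
from collections import defaultdict

def uniform_grid_coreset(points, width):
    """Bucket points into a regular grid; keep one weighted representative per cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / width) for x in p)  # index of the cell containing p
        cells[key].append(p)
    # Representative: first point seen in the cell; weight w(R): number of points in the cell.
    return [(pts[0], len(pts)) for pts in cells.values()]

# Hypothetical data: three points in one unit cell, two in another.
P = [(0.1, 0.2), (0.3, 0.4), (0.9, 0.9), (2.5, 0.1), (2.6, 0.3)]
S = uniform_grid_coreset(P, width=1.0)
print(S)  # [((0.1, 0.2), 3), ((2.5, 0.1), 2)]
```

Each point moves by at most the cell diameter, which is why the error per cell scales with W times the number of points in the cell.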

Coreset construction: First try • Second try: refine the grid until each cell contains at most R points • Error per cell: $O(\text{cell width} \cdot R)$ • There can be a point at distance Opt • $R \le \varepsilon$ • Too many cells

Coreset construction: Some definitions • Assumptions • The cost Opt of an optimal k-median solution is known • Grid i has cell width $\mathrm{Opt}/2^i$ • $O(\log n)$ levels • Definition • A cell in grid i is called heavy if it contains more than $\delta \cdot 2^i$ points • A cell that is not heavy is light • Observation • The "movement cost" for light cells is $O(\delta \cdot \mathrm{Opt})$ • Construction • Put a coreset point in every light cell whose parent cell is heavy
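
The definitions above suggest a top-down procedure, sketched here in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: grids are aligned at the origin so that cells nest, level-0 cells are treated as if their parent were heavy, the representative is the first point seen in a cell, and cells that are still heavy at the hypothetical cut-off `max_level` also emit a point.

```python
import math
from collections import defaultdict

def grid_coreset(points, opt, delta, max_level):
    """Refine heavy cells level by level; emit one weighted point per light cell."""
    coreset = []
    active = list(points)  # points lying in heavy cells of the previous level
    for i in range(max_level + 1):
        width = opt / 2 ** i                  # grid i has cell width Opt / 2^i
        cells = defaultdict(list)
        for p in active:
            cells[tuple(math.floor(x / width) for x in p)].append(p)
        active = []
        for pts in cells.values():
            heavy = len(pts) > delta * 2 ** i  # heavy: more than delta * 2^i points
            if heavy and i < max_level:
                active.extend(pts)             # refine this cell on the next grid
            else:
                coreset.append((pts[0], len(pts)))  # light cell (or cut-off reached)
    return coreset

# Hypothetical data and parameters.
P = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (20.0, 20.0)]
S = grid_coreset(P, opt=8.0, delta=1.0, max_level=3)
print(S)  # [((20.0, 20.0), 1), ((1.0, 1.0), 2), ((5.0, 5.0), 1)]
```

Because only heavy cells are refined, every emitted light cell has a heavy parent (apart from the level-0 convention noted above), and the weights always sum to the number of input points.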

Coreset construction: The algorithm • Computation of coreset points [Figure sequence: grid with label Opt; cells annotated with the numbers 1, 2, 3 and 5]

Coreset construction: Analysis • Coreset size $\le 2^d \cdot \#\text{heavy cells}$ • Number of "inner" heavy cells per grid: $k/\varepsilon^d$ (volume argument) • Contribution of an "outer" heavy cell: $\ge (\delta/\varepsilon) \cdot \mathrm{cost}(P,C)$ • Number of outer heavy cells per grid: $\varepsilon/\delta$ • Coreset size: $O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$ • Outer cells: movement can be charged to the contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$ • Inner cells: cost per cell $\delta \cdot \mathrm{Opt}$; #inner cells $\le k/\varepsilon^d$ $\Rightarrow$ $\delta = \varepsilon^{d+1}/\log n$ • [Figure label: $(1/\varepsilon) \cdot$ cell width]
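
The count of outer heavy cells per grid can be read off from the contribution bound. A short worked step, assuming $\mathrm{cost}(P,C) > 0$ and using that the cells of one grid are disjoint, so each point's cost is counted at most once ($N_{\mathrm{out}}$ is a symbol introduced here for the number of outer heavy cells in one grid):

```latex
\[
  N_{\mathrm{out}} \cdot \frac{\delta}{\varepsilon}\,\mathrm{cost}(P,C)
  \;\le\; \sum_{\text{outer heavy cells } Q} \mathrm{contribution}(Q)
  \;\le\; \mathrm{cost}(P,C)
  \quad\Longrightarrow\quad
  N_{\mathrm{out}} \;\le\; \frac{\varepsilon}{\delta}.
\]
```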

Coreset Summary • Theorem • Our construction gives a coreset of size $O(k \log n / \varepsilon^d)$ • Dynamic geometric data streams • Stream of Insert(p)/Delete(p) operations; $p \in \{1,\dots,\Delta\}^d$ • Stream is consistent: no Delete(p) if p is not in the current set • Algorithm • Output: set of k centers • Maintains a coreset • Computes the centers from the coreset using a $(1+\varepsilon)$-approximation algorithm

Streaming: Coreset maintenance • How to maintain the coreset • A $(1+\varepsilon)$-approximation of the number of points in heavy cells is sufficient • For grids with cell width $\mathrm{Opt}/2^i$ we need an approximation for all cells with more than $\delta \cdot 2^i$ points • Solution • Uniform random sampling will do • Reason • The size of a grid cell imposes a restriction on the distribution • The sample hits only a few cells • So small space suffices
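
To make the Insert/Delete model concrete, here is a minimal Python sketch that keeps per-cell counts for every grid level under a consistent stream. It stores exact counters in dictionaries, so it does not achieve the small space that the talk obtains through uniform random sampling; it only illustrates which quantities have to be tracked. The class and method names are hypothetical.

```python
import math
from collections import Counter

class GridCounts:
    """Exact per-cell point counts for grids 0..levels-1 (illustration only)."""

    def __init__(self, opt, levels):
        self.opt = opt
        self.counts = [Counter() for _ in range(levels)]

    def _cell(self, p, i):
        width = self.opt / 2 ** i  # grid i has cell width Opt / 2^i
        return tuple(math.floor(x / width) for x in p)

    def insert(self, p):
        for i, c in enumerate(self.counts):
            c[self._cell(p, i)] += 1

    def delete(self, p):
        # Assumes a consistent stream: p is currently in the point set.
        for i, c in enumerate(self.counts):
            key = self._cell(p, i)
            c[key] -= 1
            if c[key] == 0:
                del c[key]  # drop empty cells so memory tracks occupied cells only

    def heavy_cells(self, i, delta):
        """Cells of grid i with more than delta * 2^i points."""
        return {cell for cell, n in self.counts[i].items() if n > delta * 2 ** i}

# Hypothetical stream of operations.
g = GridCounts(opt=8.0, levels=2)
for p in [(1, 1), (2, 2), (3, 3), (9, 9)]:
    g.insert(p)
g.delete((3, 3))
print(g.heavy_cells(0, delta=1.0))  # {(0, 0)}
print(g.heavy_cells(1, delta=1.0))  # set()
```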

Conclusions • Summary • Streaming algorithm for insertions and deletions • Maintains a coreset • Computes a $(1+\varepsilon)$-approximation from the coreset • Some more progress on… • High-dimensional dynamic data streams • Sliding window model (low-dimensional)

Thank you! Christian Sohler Heinz Nixdorf Institut & Institut für Informatik Universität Paderborn Fürstenallee 11 33102 Paderborn, Germany Tel.: +49 (0) 52 51/60 64 27 Fax: +49 (0) 52 51/62 64 82 E-Mail: csohler@upb.de http://www.upb.de/cs/ag-madh
