RIC: Parameter-Free Noise-Robust Clustering. Presenter: Shu-Ya Li. Authors: CHRISTIAN BÖHM, CHRISTOS FALOUTSOS, JIA-YU PAN, CLAUDIA PLANT. TKDD, 2007.
Outline • Motivation • Objective • Methodology • Experiments and Results • Conclusion • Personal Comments
Motivation • How can we find a natural clustering of a real-world point set that • contains an unknown number of clusters with different shapes, and • has clusters that may be contaminated by noise?
Objectives • MDL for classification, VAC for clustering • Find a natural clustering in a dataset • Goodness of a clustering • We use Volume after Compression (VAC) to quantify the 'goodness' of a grouping. • Efficient algorithm for good clustering • Robust Fitting • Cluster Merging
VAC (Volume after Compression) • VAC • Tells which grouping is better • Lower VAC => better grouping • Formula using decorrelation matrix • Computing VAC • Compute the covariance matrix of cluster C • Compute PCA and obtain the decorrelation matrix • Compute VAC from the matrix
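The first two steps of "Computing VAC" can be sketched as follows. This is a minimal illustration, assuming NumPy and a cluster stored as an (n, d) array; the function name `decorrelate` and the sample data are hypothetical and not taken from the paper. The final step (turning the decorrelated coordinates into a VAC value) is left out because the slides do not give the formula.

```python
import numpy as np

def decorrelate(points):
    """Center a cluster and rotate it onto its principal axes (hypothetical helper)."""
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)        # covariance matrix of cluster C
    eigvals, eigvecs = np.linalg.eigh(cov)    # PCA: cov = V diag(eigvals) V^T
    rotated = (points - center) @ eigvecs     # coordinates in the eigenvector basis
    return rotated, eigvals, eigvecs

# Illustrative data only: a correlated 2-D Gaussian cluster.
rng = np.random.default_rng(0)
cluster = rng.multivariate_normal([0, 0], [[4.0, 1.9], [1.9, 1.0]], size=500)
rotated, eigvals, eigvecs = decorrelate(cluster)

# After the rotation the coordinates are uncorrelated: the covariance is diagonal.
rotated_cov = np.cov(rotated, rowvar=False)
```

Once the coordinates are uncorrelated, each one can be modelled and coded on its own, which is presumably why the decorrelation matrix appears in the VAC formula.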
Computing VAC • VAC (volume after compression) counts • The bytes to record the type of each distribution (Gaussian, uniform, ...) • The bytes to record the number of clusters k • The bytes to describe the parameters of each distribution (e.g., mean, variance, covariance, slope, intercept) and then the location of each point • Cluster model example: 2.3 + 4.3 = 6.6 bits, stat = (μi, σi, lbi, ubi, ...)
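The idea of paying bits for the location of each point under a chosen distribution type can be sketched like this. It is a simplified illustration, not the paper's exact cost: the `resolution` grid width is an assumption, only Gaussian and uniform models are tried, and the bits for the model type, the parameters, and k are omitted. A value x coded under a density p costs about -log2(p(x) * resolution) bits, and the costs of separate coordinates add up, which appears to be what the 2.3 + 4.3 = 6.6 bits example on the slide illustrates.

```python
import numpy as np

def gaussian_bits(x, resolution=0.01):
    """Bits to code the values in x under a Gaussian fitted to x."""
    mu, sigma = x.mean(), x.std()
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(-np.log2(pdf * resolution).sum())

def uniform_bits(x, resolution=0.01):
    """Bits to code the values in x under a uniform model on [min(x), max(x)]."""
    lb, ub = x.min(), x.max()
    return float(len(x) * -np.log2(resolution / (ub - lb)))

def cheapest_model(x):
    """Return the name and cost of the model that needs the fewest bits."""
    costs = {"gaussian": gaussian_bits(x), "uniform": uniform_bits(x)}
    best = min(costs, key=costs.get)
    return best, costs[best]

# Illustrative data only: one flat and one bell-shaped coordinate.
rng = np.random.default_rng(0)
flat = rng.uniform(0.0, 1.0, size=1000)
bell = rng.normal(0.0, 1.0, size=1000)
flat_choice = cheapest_model(flat)
bell_choice = cheapest_model(bell)
```

With this sample data the uniform model is cheaper for `flat` and the Gaussian model is cheaper for `bell`, so the coding cost itself selects the distribution type.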
Methodology – RIC framework • Robust Fitting • Mahalanobis distance defined by Λ and V, obtained from PCA (Σ = VΛVᵀ) • Conventional estimation: the covariance matrix uses the mean (μ) • Robust estimation: the covariance matrix uses the median (μR) • The median is less affected by outliers than the mean
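The difference between the two estimates can be sketched as below. This is one possible reading of the slide, assuming the robust variant replaces the mean by the coordinate-wise median as the center of the covariance computation; the helper names and the sample data are hypothetical, and the paper's robust estimator may differ in detail.

```python
import numpy as np

def centered_covariance(points, center):
    """Covariance-style scatter matrix computed around a given center."""
    diffs = points - center
    return diffs.T @ diffs / (len(points) - 1)

def mahalanobis(x, center, cov):
    """Mahalanobis distance expressed through the PCA factors of cov."""
    eigvals, eigvecs = np.linalg.eigh(cov)    # cov = V Λ V^T
    projected = (x - center) @ eigvecs        # rotate onto the principal axes
    return float(np.sqrt(np.sum(projected ** 2 / eigvals)))

# Illustrative data only: a tight cluster plus a few far-away outliers.
rng = np.random.default_rng(1)
core = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.full((10, 2), 50.0)
data = np.vstack([core, outliers])

mean_center = data.mean(axis=0)            # conventional center, pulled toward outliers
median_center = np.median(data, axis=0)    # robust center, stays near the core

conventional_cov = centered_covariance(data, mean_center)
robust_cov = centered_covariance(data, median_center)
```

In this example the ten outliers drag the mean well away from the core of the cluster, while the median barely moves, which is the effect the last bullet describes.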
Methodology – RIC framework • Cluster Merging • Merge Ci and Cj only if the combined VAC decreases • If savedCost > 0, then merge Ci and Cj • Greedy search to maximize savedCost, hence minimize VAC
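The greedy merging loop can be sketched as follows. The loop structure follows the bullets above; the cost function `toy_vac` is only a stand-in (Gaussian coding bits per coordinate plus an assumed fixed model overhead `MODEL_BITS`), not the paper's VAC, and all names and data are hypothetical.

```python
import itertools
import numpy as np

MODEL_BITS = 32.0  # assumed fixed cost of describing one cluster's parameters

def toy_vac(points):
    """Stand-in cost: Gaussian coding bits per coordinate plus model overhead.

    The grid-resolution term is dropped because it cancels in savedCost.
    """
    variances = points.var(axis=0)
    bits_per_point = 0.5 * np.log2(2 * np.pi * np.e * variances).sum()
    return len(points) * bits_per_point + MODEL_BITS

def greedy_merge(clusters, cost=toy_vac):
    """Repeatedly merge the pair with the largest positive savedCost."""
    clusters = list(clusters)
    while len(clusters) > 1:
        best_saving, best_pair = 0.0, None
        for i, j in itertools.combinations(range(len(clusters)), 2):
            merged = np.vstack([clusters[i], clusters[j]])
            saved_cost = cost(clusters[i]) + cost(clusters[j]) - cost(merged)
            if saved_cost > best_saving:
                best_saving, best_pair = saved_cost, (i, j)
        if best_pair is None:  # no merge lowers the total cost
            break
        i, j = best_pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Illustrative data only: two fragments of one cluster and one distant cluster.
rng = np.random.default_rng(2)
part_a = rng.normal(0.0, 1.0, size=(100, 2))
part_b = rng.normal(0.0, 1.0, size=(100, 2))
far = rng.normal(20.0, 1.0, size=(100, 2))
result = greedy_merge([part_a, part_b, far])
sizes = sorted(len(c) for c in result)
```

Here the two fragments drawn from the same distribution are stitched together because one shared model is cheaper than two, while the distant cluster stays separate because a single wide model would make every point more expensive to code.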
Experiments • Results on Synthetic Data
Experiments • Performance on Real Data
Experiments • Comparison of the result of filterOpt with the result of filterDist.
Conclusion • The contributions of this work are the answers to the two questions, organized in our RIC framework. • (Q1) Goodness Measure. • We propose the VAC criterion using information-theory concepts, and specifically the volume after compression. • (Q2) Efficiency. • Robust fitting (RF) algorithm, which carefully avoids outliers. • Cluster merging (CM) algorithm, which stitches clusters together if the stitching gives a better VAC score.
Personal Comments • Advantage • Detailed description • Many pictures and examples • Drawback • The black-and-white pictures are difficult to read. • Application • Clustering