Robust Information-theoretic Clustering By C. Bohm, C. Faloutsos, J-Y. Pan, and C. Plant Presenter: Niyati Parikh
Objective • Find a natural clustering in a dataset • Two questions: • How to measure the goodness of a clustering • How to find a good clustering efficiently
Define “goodness” • Goodness = ability to describe the clusters succinctly • Adopt VAC (Volume After Compression) as the measure • Bytes needed to record the number of clusters k • Bytes needed to record the type of each cluster (Gaussian, uniform, ...) • Bytes needed for the compressed location of each point
VAC • Tells which of two groupings is better • Lower VAC => better grouping • Computed by a formula that uses the cluster's decorrelation matrix • Decorrelation matrix = matrix whose columns are the eigenvectors of the cluster's covariance matrix
Computing VAC • Steps: • Compute covariance matrix of cluster C • Compute PCA and obtain eigenvector matrix • Compute VAC from the matrix
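The first two steps above can be sketched with NumPy. The slides do not state the VAC formula itself, so `gaussian_code_length_bits` below is only a stand-in assumption: it charges each point the negative log2 density of a Gaussian fitted along each decorrelated axis. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def decorrelate(points):
    """Steps 1 and 2: covariance matrix of the cluster, then its eigenvectors."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)        # covariance matrix of cluster C
    eigvals, eigvecs = np.linalg.eigh(cov)    # PCA: columns of eigvecs are eigenvectors
    rotated = (points - center) @ eigvecs     # coordinates in the decorrelated basis
    return rotated, eigvals, eigvecs

def gaussian_code_length_bits(points):
    """Stand-in for step 3 (assumption, NOT the paper's VAC formula).

    Sums -log2 of a per-axis Gaussian density over all points. This is a
    differential code length, so it can be negative for very tight clusters.
    """
    rotated, _, _ = decorrelate(points)
    sigma = rotated.std(axis=0) + 1e-12       # small floor avoids division by zero
    log2_density = (-0.5 * np.log2(2 * np.pi * sigma**2)
                    - (rotated**2) / (2 * sigma**2) * np.log2(np.e))
    return float(-log2_density.sum())
```

Under this stand-in, a cluster whose points lie close together along its principal axes gets a lower cost than a spread-out one, which matches the "lower VAC => better grouping" reading.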
Efficient algorithm • Take an initial clustering produced by any algorithm • Refine that clustering to remove outliers/noise • Output a better clustering through post-processing
Refining Clusters • Use VAC to refine existing clusters • Remove outliers from a given cluster C • Define Core and Out as the sets of core points and outliers in C • Initially Out contains all points in C • Sort the points in ascending order of their distance from the cluster center • Compute VAC • Move the closest remaining point from Out to Core • Compute the new VAC • If the new VAC is higher, stop; otherwise pick the next closest point and repeat
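The refinement loop on this slide can be sketched as follows. The cost function is passed in as a callable `cost(core, out)` because the slides do not give the VAC formula; `toy_cost` is a purely illustrative stand-in (squared distance for core points, a flat charge per outlier). Using the coordinate-wise median as the default center is also an assumption.

```python
import numpy as np

def refine_cluster(points, cost, center=None):
    """Greedily move points from Out to Core until the cost goes up."""
    points = np.asarray(points, dtype=float)
    if center is None:
        center = np.median(points, axis=0)    # assumption: robust center
    # Sort by ascending distance from the center.
    order = np.argsort(np.linalg.norm(points - center, axis=1))
    ranked = points[order]

    n_core = 0                                # initially Out holds every point
    best = cost(ranked[:0], ranked)
    while n_core < len(ranked):
        # Tentatively move the closest remaining point from Out to Core.
        candidate = cost(ranked[:n_core + 1], ranked[n_core + 1:])
        if candidate > best:                  # cost increased: stop
            break
        best = candidate
        n_core += 1
    return ranked[:n_core], ranked[n_core:]   # (Core, Out)

def toy_cost(core, out, flat_bits=9.0):
    """Hypothetical stand-in for VAC, measured around the origin:
    core points pay their squared norm, outliers pay a flat charge."""
    return float((core**2).sum() + flat_bits * len(out))

# Usage: three points near the origin and two far away.
pts = [[1, 0], [0, 0.5], [10, 0], [0, 2], [-20, 0]]
core, out = refine_cluster(pts, toy_cost, center=np.zeros(2))
```

With the toy cost, a point joins Core only while its squared distance stays below the flat outlier charge, so the two distant points remain in Out.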
VAC and Robust estimation • Conventional estimation: the covariance matrix is computed using the mean • Robust estimation: the covariance matrix is computed using the median • The median is less affected by outliers than the mean
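A small sketch of why the median helps. The slides only say that the robust covariance "uses median"; the version below, which replaces both the center and the averaging step by a coordinate-wise median, is one possible reading and should be treated as an assumption. Note that a matrix built this way is not guaranteed to be positive semidefinite.

```python
import numpy as np

def conventional_cov(points):
    """Center = mean; covariance = mean of outer products of deviations."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    dev = points - center
    return center, (dev.T @ dev) / len(points)

def median_cov(points):
    """Assumed robust variant: center = median; each entry is the
    median (instead of the mean) of the products of deviations."""
    points = np.asarray(points, dtype=float)
    center = np.median(points, axis=0)
    dev = points - center
    products = dev[:, :, None] * dev[:, None, :]   # shape (n, dim, dim)
    return center, np.median(products, axis=0)

# Usage: 200 well-behaved points plus one extreme outlier.
rng = np.random.default_rng(1)
clean = rng.normal(size=(200, 2))
dirty = np.vstack([clean, [[1000.0, 1000.0]]])
mean_center, mean_cov = conventional_cov(dirty)
med_center, med_cov = median_cov(dirty)
```

The single outlier drags the mean-based center and variance far from the values of the clean data, while the median-based estimates barely move.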
Sample result • Imperfect clusters formed by K-Means affect the purifying process • May result in redundant clusters that could be merged
Cluster Merging • Merge Ci and Cj only if the combined VAC decreases • savedCost(Ci, Cj) = VAC(Ci) + VAC(Cj) – VAC(Ci ∪ Cj) • If savedCost > 0, then merge Ci and Cj • Greedy search to maximize savedCost, hence minimize VAC
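The merging rule can be sketched as a greedy loop over cluster pairs. The cost function is passed in as a callable `vac`, since the slides do not give its formula; `toy_vac` is a hypothetical stand-in for 1-D clusters (a fixed per-cluster overhead plus a per-point charge that grows with the cluster's spread).

```python
import itertools
import numpy as np

def saved_cost(ci, cj, vac):
    """savedCost(Ci, Cj) = VAC(Ci) + VAC(Cj) - VAC(Ci U Cj)."""
    return vac(ci) + vac(cj) - vac(np.concatenate([ci, cj]))

def greedy_merge(clusters, vac):
    """Repeatedly merge the pair with the largest positive savedCost."""
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    while len(clusters) > 1:
        best_gain, (i, j) = max(
            (saved_cost(clusters[a], clusters[b], vac), (a, b))
            for a, b in itertools.combinations(range(len(clusters)), 2)
        )
        if best_gain <= 0:                    # no merge lowers the total cost
            break
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

def toy_vac(cluster, overhead_bits=8.0):
    """Hypothetical stand-in for VAC on 1-D data."""
    spread = cluster.max() - cluster.min()
    return overhead_bits + len(cluster) * np.log2(1.0 + spread)

# Usage: two overlapping clusters and one far-away cluster.
result = greedy_merge([[0, 1, 2], [0.5, 1.5, 2.5], [100, 101, 102]], toy_vac)
```

With this toy cost, merging the two overlapping clusters saves the per-cluster overhead at little extra per-point cost, while merging either with the distant cluster would raise the total, so the loop stops with two clusters.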
Thank You • Questions?