CURE: An Efficient Clustering Algorithm for Large Databases Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim Presentation by: Vuk Malbasa For CIS664 Prof. Vasilis Megalooekonomou
Overview • Introduction • Previous Approaches • Drawbacks of previous approaches • CURE: Approach • Enhancements for Large Datasets • Conclusions
Introduction • Clustering problem: given a set of points, separate them into clusters so that points within a cluster are more similar to each other than to points in different clusters. • Traditional clustering techniques either favor clusters with spherical shapes and similar sizes or are fragile in the presence of outliers. • CURE is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. • Each cluster is represented by a fixed number of well-scattered points.
Introduction • CURE is a hierarchical clustering technique where each partition is nested into the next partition in the sequence. • CURE is an agglomerative algorithm: disjoint clusters are successively merged until only the desired number of clusters remains.
Previous Approaches • At each step of agglomerative clustering, the two clusters that are merged are the ones for which some distance metric is minimized. • This distance metric can be: • The distance between the means of the clusters, d_mean • The average distance between all pairs of points, one from each cluster, d_ave • The maximal distance between points in the two clusters, d_max • The minimal distance between points in the two clusters, d_min
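The four metrics listed above can be written down directly. The following is a minimal sketch, assuming points are tuples of floats and distances are Euclidean; the function names are chosen here for illustration and do not come from the slides.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def pairwise(a, b):
    """All point-to-point distances with one point from each cluster."""
    return [dist(p, q) for p, q in product(a, b)]


def d_mean(a, b):
    """Distance between the means of the two clusters."""
    return dist(centroid(a), centroid(b))


def d_ave(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(pairwise(a, b)) / (len(a) * len(b))


def d_max(a, b):
    """Largest cross-cluster distance."""
    return max(pairwise(a, b))


def d_min(a, b):
    """Smallest cross-cluster distance."""
    return min(pairwise(a, b))


# Two small illustrative clusters (made-up data).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(d_mean(a, b), d_ave(a, b), d_max(a, b), d_min(a, b))
```

An agglomerative algorithm would evaluate one of these functions for every pair of current clusters and merge the pair with the smallest value.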
Drawbacks of previous approaches • When clusters vary in size, the d_ave, d_max and d_mean distance metrics will split large clusters into parts. • Non-spherical clusters will be split by d_mean. • Clusters linked by a chain of outliers will be merged if the d_min metric is used. • None of the stated approaches works well in the presence of non-spherical clusters or outliers.
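The chaining problem of the minimal-distance metric can be reproduced on a toy dataset. The sketch below is a naive agglomerative loop that merges by minimal point-to-point distance; the data, the function name `single_link` and the target of two clusters are assumptions made for illustration. Two groups on a line are joined by a few bridging outliers, and one distant noise point is added.

```python
from math import dist


def single_link(points, k):
    """Naive agglomerative clustering that merges by minimal distance."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift cluster i
        clusters[i].extend(clusters.pop(j))
    return clusters


left = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
bridge = [(3.5, 0.0), (5.0, 0.0), (6.5, 0.0)]   # outliers between the groups
right = [(8.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
noise = [(5.0, 50.0)]                            # one far-away point

result = single_link(left + bridge + right + noise, k=2)
print(sorted(len(c) for c in result))
```

With this data the two groups end up in a single cluster because every gap along the bridge is smaller than the distance to the noise point, so the second cluster is just the isolated point.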
CURE: Approach • CURE is positioned between the centroid-based (d_mean) and all-point (d_min) extremes. • A constant number of well-scattered points is used to capture the shape and extent of a cluster. • The points are shrunk towards the centroid of the cluster by a factor α. • These well-scattered and shrunk points are used as the representatives of the cluster.
CURE: Approach • The scattered-points approach alleviates the shortcomings of both the centroid-based and the all-point approaches. • Since multiple representatives are used, the splitting of large clusters is avoided. • Multiple representatives allow for the discovery of non-spherical clusters. • The shrinking phase affects outliers more than other points: outliers lie farther from the centroid, so they are moved a larger distance towards it.
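The claim about outliers follows from the shrink rule given in the slides, r = p + α*(mean-p). As a short worked derivation, writing m for the cluster mean and assuming 0 ≤ α ≤ 1:

```latex
\begin{align*}
r &= p + \alpha\,(m - p) \\
\lVert r - p \rVert &= \alpha\,\lVert m - p \rVert \\
\lVert r - m \rVert &= (1 - \alpha)\,\lVert p - m \rVert
\end{align*}
```

The displacement is proportional to the point's distance from the mean. For example, with α = 0.5 a point at distance 2 from the mean moves by 1, while an outlier at distance 10 moves by 5.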
CURE: Approach • Initially, since all points are in separate clusters, each cluster is represented by its single point. • Until a cluster contains at least c points, all of its points serve as scattered points. • The first scattered point in a cluster is the one which is farthest away from the cluster's centroid. • Other scattered points are chosen so that their distance from the previously chosen scattered points is maximal. • Once c well-scattered points have been chosen, they are shrunk by some factor α (r = p + α*(mean-p)). • After clusters have c representatives, the distance between two clusters is the distance between the closest pair of representatives, one from each cluster. • Every time two clusters are merged, their representatives are re-calculated.
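The steps above can be sketched as a small, naive program. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the default values c=4 and alpha=0.3, and the brute-force search for the closest pair of clusters are all choices made here, and representatives are re-selected from all points of the merged cluster.

```python
from math import dist


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def scattered_points(points, c):
    """Pick up to c well-scattered points by farthest-point selection."""
    mean = centroid(points)
    chosen = []
    for _ in range(min(c, len(points))):
        if not chosen:
            # first pick: the point farthest from the centroid
            best = max(points, key=lambda p: dist(p, mean))
        else:
            # later picks: the point farthest from its nearest chosen point
            best = max(points, key=lambda p: min(dist(p, q) for q in chosen))
        chosen.append(best)
    return chosen


def representatives(points, c, alpha):
    """Scattered points shrunk towards the mean: r = p + alpha * (mean - p)."""
    mean = centroid(points)
    return [
        tuple(pi + alpha * (mi - pi) for pi, mi in zip(p, mean))
        for p in scattered_points(points, c)
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between the closest pair of representatives."""
    return min(dist(p, q) for p in reps_a for q in reps_b)


def cure(points, k, c=4, alpha=0.3):
    """Merge the closest clusters until k remain (brute-force search)."""
    clusters = [[p] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        _, i, j = min(
            (cluster_distance(reps[i], reps[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
        reps.pop(j)
        reps[i] = representatives(clusters[i], c, alpha)  # re-calculate
    return clusters


square = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
print(representatives(square, c=4, alpha=0.5))

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
print(cure(data, k=2))
```

The brute-force pair search keeps the sketch short; it recomputes every cluster-to-cluster distance at each merge, which would be too slow for anything but small inputs.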
Enhancements for Large Datasets • Random sampling: filters out outliers and allows the dataset to fit into memory. • Partitioning: first cluster within each partition, then merge the partial clusters across partitions. • Labeling data on disk: the final labeling phase can be done by nearest-neighbor (NN) assignment to the already chosen cluster representatives. • Handling outliers: outliers are partially eliminated and spread out by random sampling; the remaining ones are identified because they belong to small clusters that grow slowly.
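Three of these enhancements are simple enough to sketch. The helpers below are hypothetical names introduced for illustration, and the fixed seed and the `min_size` threshold are assumptions rather than values from the slides: drawing a random sample, dropping very small clusters as suspected outliers, and labeling a point by its nearest cluster representative.

```python
import random
from math import dist


def draw_sample(points, s, seed=0):
    """Uniform random sample of s points (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(points, s)


def drop_small_clusters(clusters, min_size):
    """Treat clusters with fewer than min_size points as outliers."""
    return [cl for cl in clusters if len(cl) >= min_size]


def label(point, cluster_reps):
    """Index of the cluster that owns the representative nearest to point."""
    return min(
        range(len(cluster_reps)),
        key=lambda i: min(dist(point, r) for r in cluster_reps[i]),
    )


# Illustrative representatives for two clusters (made-up values).
cluster_reps = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(10.0, 10.0), (11.0, 11.0)],
]
print(label((2.0, 2.0), cluster_reps))   # nearest representative is (1.0, 1.0)
print(label((9.0, 9.0), cluster_reps))   # nearest representative is (10.0, 10.0)
```

Because labeling only needs the representatives, the full dataset can be streamed from disk one point at a time without holding it in memory.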
Conclusions • CURE can identify non-spherical clusters, for example ellipsoidal ones. • CURE is robust to outliers. • CURE correctly clusters data with large differences in cluster size. • For a low-dimensional dataset with s points, the running time is O(s^2). • Using partitioning and sampling, CURE can be applied to large datasets.