
Clustering Large Datasets in Arbitrary Metric Space



  1. Clustering Large Datasets in Arbitrary Metric Space by Muralikrishna Achari COSC6341 Information Retrieval Project Presentation

  2. Contents • Introduction to Clustering • Problems in Traditional Clustering • Clustering Large Datasets • BIRCH* • BUBBLE • BUBBLE-FM • Scalability • Conclusion

  3. Traditional Clustering • Unsupervised learning. • A process of grouping similar objects into groups. • Distance between objects is the common metric used to assess similarity.

  4. Types of Clustering Algorithms • Hierarchical clustering • Minimal Spanning Tree Method, BIRCH, BUBBLE • Partition based clustering • K-means, CLARANS

  5. Hierarchical Clustering • Starts with a crude division of instances into groups at the top level; each of these groups is refined further, perhaps all the way down to the individual instances.
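The slide describes the top-down view; hierarchical clustering is more often implemented bottom-up (agglomerative). A minimal single-linkage sketch in Python, assuming only a pairwise distance function `d` (illustrative, not the paper's algorithm):

```python
def agglomerative(points, k, d):
    """Bottom-up hierarchical clustering: repeatedly merge the two closest
    clusters (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(d(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters
```

Running the full hierarchy down to `k = 1` and recording each merge yields the dendrogram the slide alludes to.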

  6. Partition based clustering • A desired number of clusters is assumed at the start, and instances are allocated among the clusters so that a particular clustering criterion is optimized (e.g. minimization of the variability within clusters).
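K-means, named on the previous slide, is the classic instance of this scheme. A toy sketch (random initialization, fixed iteration count; 2-d points as tuples) of the assign/update loop that minimizes within-cluster variability:

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Toy k-means: assign each point to its nearest centroid, then recompute."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            j = min(range(k), key=lambda j: sqdist(p, centroids[j]))
            clusters[j].append(p)
        for j, c in enumerate(clusters):      # update step: mean of each cluster
            if c:
                centroids[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters
```

Note that the centroid computation requires vector operations, which is exactly why such algorithms do not carry over to arbitrary metric spaces, as the later slides argue.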

  7. Applications • Marketing: finding groups of customers with similar behavior. • Landscapes: characterizing different regions. • Biology: classification of plants and animals given their features. • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones. • WWW: document classification; clustering weblog data to discover groups of similar access patterns.

  8. Problem with Traditional Clustering Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.

  9. Requirements for a Good Clustering Algorithm • Scalability. • Dealing with different types of attributes. • Discovering clusters with arbitrary shape. • Minimal requirements for domain knowledge to determine input parameters. • Ability to deal with noise and outliers.

  10. Clustering Large Datasets

  11. Clustering Large Datasets • CLARANS • Assumes all objects fit in main memory; sensitive to input order. • Uses R*-trees to improve efficiency. • BIRCH • Minimizes memory usage and scans the data only once from disk. • Uses cluster representatives instead of actual data points. • The first algorithm proposed in the database area that addresses outliers. • DBSCAN • Uses a density-based notion of clusters to discover clusters of arbitrary shape. • Sensitive to its input parameters and incurs substantial I/O cost.

  12. Drawbacks • Both BIRCH and CLARANS work well for clusters of spherical or convex shape and uniform size, and are unsuitable when clusters have different sizes or non-spherical shapes. • All three algorithms rely on vector operations, which are defined only in coordinate space, and so are unsuitable for datasets in distance space.

  13. Proposed Approach • Two algorithms for clustering large datasets, based on the BIRCH* framework: • BUBBLE • BUBBLE-FM

  14. BIRCH* • Balanced Iterative Reducing and Clustering using Hierarchies • BIRCH* is a generalized framework for incremental clustering algorithms. • BIRCH* components can be instantiated to generate concrete clustering algorithms.

  15. BIRCH* Components • Cluster Feature (CF*) • A summarized representation of a cluster. • Cluster Feature Tree (CF*-tree) • A height-balanced tree of CF*s.

  16. Clustering Feature • CFs are summarized representations of clusters. • Requirements: • Incrementally maintainable when a new object is inserted. • Contain sufficient information to compute distances between clusters and objects.

  17. CF*-Tree • A height-balanced tree. • Two parameters: • 1. Branching Factor, B • 2. Threshold, T • A non-leaf node has at most B entries: [CFi, childi], i = 1..B, • where CFi is the CF of the subcluster represented by the ith child • and childi is a pointer to the ith child.

  18. CF*-tree • Leaf node • Satisfies threshold T, which controls its tightness and quality: diameter or radius < T. • Tree size is a function of T: as T increases, tree size decreases.
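As a concrete illustration of B and T, here is a toy node layout for the original vector-space BIRCH (1-d points for brevity; the names `CF` and `Node` are this sketch's, not the paper's):

```python
from dataclasses import dataclass, field

B = 4      # branching factor: maximum entries per node
T = 0.5    # threshold: maximum radius of a leaf cluster

@dataclass
class CF:
    """Cluster feature for 1-d points: count, linear sum, sum of squares."""
    n: int = 0
    ls: float = 0.0
    ss: float = 0.0

    def add(self, x: float):
        """CFs are incrementally maintainable: absorbing a point is O(1)."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def radius(self) -> float:
        """Root-mean-square distance of the members to the centroid."""
        mean = self.ls / self.n
        return max(self.ss / self.n - mean * mean, 0.0) ** 0.5

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)  # CFs (leaf) or (CF, child) pairs
```

A leaf entry is admissible while its `radius()` stays below T; a node with more than B entries must split, as the insertion slide below describes.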

  19. CF Tree

  20. Functionality of CF*-tree • Directs a new object, O, to the cluster closest to it. • Non-leaf node: exists to guide new objects to the appropriate leaf clusters. • Leaf node: absorbs the new object.

  21. BIRCH*: Mechanism • Starts with an initial T. • Scans the data and inserts the objects into the tree. • During the scan, existing clusters are updated and new clusters are formed. • If it runs out of memory, M, it increases T and rebuilds a smaller CF*-tree from the old leaf entries. • After re-inserting the old leaf entries, it resumes the scan from the point at which it was interrupted.

  22. CF*-tree insertion The CF*-tree insertion mechanism is the same as that of B+-trees. Each new object, O: • Is directed down to a leaf node, L. • Is inserted into the closest cluster, C, if the threshold T is not violated; otherwise it forms a new cluster. • If there is not enough space in L, L is split into two leaf nodes and the entries are distributed between them. • As in a B+-tree, node splits may propagate up to the root. • The path from the root to the leaf is updated to reflect the insertion.

  23. BIRCH*: Instantiation Summary • Cluster features at leaf and non-leaf levels. • Incremental maintenance of cluster features at leaf and non-leaf nodes. • Distance measures between a CF* and an object, and between CF*s. • Threshold requirement.

  24. BUBBLE • BIRCH* instantiated in distance space. • No concept of centroids. • For a given set of objects O = {O1, …, On}, defines: • RowSum(O) = Σ i=1..n d(O, Oi)², for an object O in O. • Clustroid: the object Ô in O with the least RowSum value. • Radius, r(O) = sqrt(RowSum(Ô) / n). • Clustroid distance, D0(O1, O2) = d(Ô1, Ô2), the distance between the clustroids of the two clusters.
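These definitions translate directly to code. A sketch assuming only a black-box distance function `d` (the function names are this sketch's, not the paper's):

```python
def rowsum(o, objects, d):
    """RowSum(O): sum of squared distances from o to every object in the cluster."""
    return sum(d(o, p) ** 2 for p in objects)

def clustroid(objects, d):
    """Clustroid: the member with the least RowSum, a distance-space
    stand-in for the centroid (which needs vector operations)."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def cluster_radius(objects, d):
    """r(O) = sqrt(RowSum(clustroid) / n)."""
    c = clustroid(objects, d)
    return (rowsum(c, objects, d) / len(objects)) ** 0.5
```

Because the clustroid is always an actual member of the cluster, only the metric `d` is ever needed, never coordinates.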

  25. BUBBLE: CF at leaf nodes For a set of objects O = {O1, …, On} forming cluster C, the CF is a five-tuple (n, Ô, R, RS, r): • n: number of objects in C. • Ô: clustroid of C. • R: representatives of the cluster C (R ⊆ C). • RS: the RowSum values of all the representatives. • r: radius of the cluster C.

  26. BUBBLE: CF at non-leaf node • A set of sample objects, S(NLi), randomly collected from the subtree rooted at NLi, forms its CFi. • The CF at NL is the union of the S(NLi). • Each child node has at least one representative in S(NL). • If the child NLi is a leaf node, then S(NLi) is randomly picked from the clustroids of its CFs.

  27. BUBBLE: Incremental Maintenance of CF at leaf • Types of Insertions • Type I: Insertion of a single object. • Type II: Insertion of a cluster of objects.

  28. Type I Insertion Inserting an object into the leaf: • If |C| is small, maintain all the cluster's objects and compute the new clustroid exactly. • If |C| is large, maintain only a subset of C, of size |R|, consisting of objects close to the clustroid.

  29. Type II Insertion Inserting a cluster of objects: • C1 and C2 must be non-overlapping but close clusters. • The location of the new clustroid lies between the two old clustroids. • By maintaining a few objects far away from the clustroids of C1 and C2, the new clustroid can be calculated.

  30. Incremental Maintenance of CF at non-leaf • The sample objects at a non-leaf entry are updated whenever its child node splits. • The distribution of clusters changes significantly whenever a node splits. • To reflect changes in the distribution at all child nodes, the sample objects at all entries of NL are updated.

  31. Drawbacks of BUBBLE • BUBBLE computes distances between sample objects, which can be expensive. • E.g. edit distance on strings.
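To make the expense concrete: edit (Levenshtein) distance costs O(|a|·|b|) per pair via dynamic programming, so repeated pairwise distance calls add up quickly. A standard implementation (illustrative, not from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time.
    Each call is O(len(a) * len(b)), which makes many pairwise calls costly."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Reducing the number of such calls is precisely the motivation for BUBBLE-FM on the next slide.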

  32. BUBBLE-FM • Uses FastMap to transform the distance space into an approximate vector space. • Maintains all the sample objects at each non-leaf node in vector space. • For a new object O, transforms O to vector space and uses the Euclidean distance metric. • Doesn't use the transformation at leaf nodes.
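A single axis of the FastMap transform can be sketched as follows: pick two far-apart pivot objects, then place every object on the line through them using only pairwise distances (the law of cosines). The full method recurses on residual distances to obtain more coordinates; the function name is this sketch's, not the paper's:

```python
def fastmap_axis(objects, d):
    """One FastMap axis: choose two far-apart pivots a and b, then project
    every object onto the line through them using only pairwise distances."""
    a = objects[0]
    b = max(objects, key=lambda o: d(a, o))   # heuristic: farthest from a
    a = max(objects, key=lambda o: d(b, o))   # then farthest from b
    dab = d(a, b)
    if dab == 0:                              # all objects coincide
        return [0.0] * len(objects)
    # Law of cosines: x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 d(a,b))
    return [(d(a, o) ** 2 + dab ** 2 - d(b, o) ** 2) / (2 * dab)
            for o in objects]
```

After mapping, distance comparisons at non-leaf nodes become cheap Euclidean computations instead of calls to the expensive metric `d`.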

  33. Scalability (performance graphs)

  34. Scalability (performance graphs)

  35. Conclusion • Presented the BIRCH* framework for scalable incremental pre-clustering algorithms. • BUBBLE, for datasets in arbitrary metric spaces. • BUBBLE-FM, which uses FastMap to reduce the number of calls to an expensive distance function.

  36. References Primary Source: • Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, "Clustering Large Datasets in Arbitrary Metric Spaces" (1999). Secondary Sources: • Tian Zhang, Raghu Ramakrishnan, Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases" (1996). • Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases" (1998).
