
Clustering Large Datasets in Arbitrary Metric Space



  1. Clustering Large Datasets in Arbitrary Metric Space by Muralikrishna Achari COSC6341 Information Retrieval Project Presentation

  2. Contents • Introduction to Clustering • Problems in Traditional Clustering • Clustering Large Datasets • BIRCH* • BUBBLE • BUBBLE-FM • Scalability • Conclusion

  3. Traditional Clustering • Unsupervised learning. • A process of grouping similar objects into groups. • Distance between objects is the common metric used to assess similarity.

  4. Types of Clustering Algorithms • Hierarchical clustering • Minimal Spanning Tree Method, BIRCH, BUBBLE • Partition based clustering • K-means, CLARANS

  5. Hierarchical Clustering • Starts with a crude division of instances into groups at the top level; each of these groups is refined further, perhaps all the way down to the individual instances.
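The slide describes the top-down view; hierarchical clustering is more often implemented bottom-up (agglomerative). A minimal single-linkage sketch in Python, assuming only a pairwise distance function `d` (illustrative, not the paper's algorithm):

```python
def agglomerative(points, k, d):
    """Bottom-up hierarchical clustering: repeatedly merge the two closest
    clusters (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(d(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters
```

Running the full hierarchy down to `k = 1` and recording each merge yields the dendrogram the slide alludes to.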

  6. Partition based clustering • A desired number of clusters is assumed at the start, and instances are allocated among the clusters so that a particular clustering criterion is optimized (e.g. minimization of the variability within clusters).
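K-means, named on the previous slide, is the classic instance of this scheme. A toy sketch (random initialization, fixed iteration count; 2-d points as tuples) of the assign/update loop that minimizes within-cluster variability:

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Toy k-means: assign each point to its nearest centroid, then recompute."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            j = min(range(k), key=lambda j: sqdist(p, centroids[j]))
            clusters[j].append(p)
        for j, c in enumerate(clusters):      # update step: mean of each cluster
            if c:
                centroids[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters
```

Note that the centroid computation requires vector operations, which is exactly why such algorithms do not carry over to arbitrary metric spaces, as the later slides argue.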

  7. Applications • Marketing: finding groups of customers with similar behavior. • Landscapes: characterizing different regions. • Biology: classification of plants and animals given their features. • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones. • WWW: document classification; clustering weblog data to discover groups of similar access patterns.

  8. Problem with Traditional Clustering Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.

  9. Requirements for a Good Clustering Algorithm • Scalability. • Dealing with different types of attributes. • Discovering clusters with arbitrary shape. • Minimal requirements for domain knowledge to determine input parameters. • Ability to deal with noise and outliers.

  10. Clustering Large Datasets

  11. Clustering Large Datasets • CLARANS • Assumes all objects fit in main memory; sensitive to input order. • Uses R*-trees to improve efficiency. • BIRCH • Minimizes memory usage and scans the data only once from disk. • Uses cluster representatives instead of actual data points. • The first algorithm proposed in the database area that addresses outliers. • DBSCAN • Uses a density-based notion of clusters to discover clusters of arbitrary shape. • Sensitive to its input parameters and incurs substantial I/O cost.

  12. Drawbacks • Both BIRCH and CLARANS work well for clusters of spherical or convex shape and uniform size, and are unsuitable when clusters have different sizes or non-spherical shapes. • All three algorithms rely on vector operations, which are defined only in coordinate space, and so are unsuitable for datasets in distance space.

  13. Proposed Approach • Two algorithms for clustering large datasets, based on the BIRCH* framework: • BUBBLE • BUBBLE-FM

  14. BIRCH* • Balanced Iterative Reducing and Clustering using Hierarchies • BIRCH* is a generalized framework for incremental clustering algorithms. • BIRCH* components can be instantiated to generate concrete clustering algorithms.

  15. BIRCH* Components • Cluster Feature (CF*) • A summarized representation of a cluster. • Cluster Feature Tree (CF*-tree) • A height-balanced tree of CF*s.

  16. Clustering Feature • CFs are summarized representations of clusters. • Requirements: • Incrementally maintainable when a new object is inserted. • Contain sufficient information to compute distances between clusters and objects.

  17. CF*-Tree • A height-balanced tree. • Two parameters: • 1. Branching Factor, B • 2. Threshold, T • A non-leaf node has at most B entries: [CFi, childi], i = 1..B, • where CFi is the CF of the subcluster represented by the ith child • and childi is a pointer to the ith child.

  18. CF*-tree • Leaf node • Satisfies threshold T, which controls its tightness and quality: diameter or radius < T. • Tree size is a function of T: as T increases, tree size decreases.
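As a concrete illustration of B and T, here is a toy node layout for the original vector-space BIRCH (1-d points for brevity; the names `CF` and `Node` are this sketch's, not the paper's):

```python
from dataclasses import dataclass, field

B = 4      # branching factor: maximum entries per node
T = 0.5    # threshold: maximum radius of a leaf cluster

@dataclass
class CF:
    """Cluster feature for 1-d points: count, linear sum, sum of squares."""
    n: int = 0
    ls: float = 0.0
    ss: float = 0.0

    def add(self, x: float):
        """CFs are incrementally maintainable: absorbing a point is O(1)."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def radius(self) -> float:
        """Root-mean-square distance of the members to the centroid."""
        mean = self.ls / self.n
        return max(self.ss / self.n - mean * mean, 0.0) ** 0.5

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)  # CFs (leaf) or (CF, child) pairs
```

A leaf entry is admissible while its `radius()` stays below T; a node with more than B entries must split, as the insertion slide below describes.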

  19. CF Tree

  20. Functionality of CF*-tree • Directs a new object, O, to the cluster closest to it. • Non-leaf node: exists to guide new objects to the appropriate leaf clusters. • Leaf node: absorbs the new object.

  21. BIRCH*: Mechanism • Starts with an initial T. • Scans the data and inserts the objects into the tree. • During the scan, existing clusters are updated and new clusters are formed. • If it runs out of memory, M, it increases T and rebuilds a smaller CF*-tree from the old leaf entries. • After re-inserting the old leaf entries, it resumes the scan from the point at which it was interrupted.

  22. CF*-tree insertion The CF*-tree insertion mechanism is the same as that of B+-trees. Each new object, O: • Is directed down to a leaf node, L. • Is inserted into the closest cluster, C, if the threshold T is not violated; otherwise it forms a new cluster. • If there is not enough space in L, L is split into two leaf nodes and the entries are distributed between them. • As in a B+-tree, node splits may propagate up to the root. • The path from the root to the leaf is updated to reflect the insertion.

  23. BIRCH*: Instantiation Summary • Cluster features at leaf and non-leaf levels. • Incremental maintenance of cluster features at leaf and non-leaf nodes. • Distance measures between a CF* and an object, and between CF*s. • Threshold requirement.

  24. BUBBLE • BIRCH* instantiated in distance space. • No concept of centroids. • For a given set of objects O = {O1, …, On}, defines: • RowSum(O) = Σ i=1..n d(O, Oi)², for an object O in O. • Clustroid: the object Ô in O with the least RowSum value. • Radius, r(O) = sqrt(RowSum(Ô) / n). • Clustroid distance, D0(O1, O2) = d(Ô1, Ô2), the distance between the clustroids of the two clusters.
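These definitions translate directly to code. A sketch assuming only a black-box distance function `d` (the function names are this sketch's, not the paper's):

```python
def rowsum(o, objects, d):
    """RowSum(O): sum of squared distances from o to every object in the cluster."""
    return sum(d(o, p) ** 2 for p in objects)

def clustroid(objects, d):
    """Clustroid: the member with the least RowSum, a distance-space
    stand-in for the centroid (which needs vector operations)."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def cluster_radius(objects, d):
    """r(O) = sqrt(RowSum(clustroid) / n)."""
    c = clustroid(objects, d)
    return (rowsum(c, objects, d) / len(objects)) ** 0.5
```

Because the clustroid is always an actual member of the cluster, only the metric `d` is ever needed, never coordinates.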

  25. BUBBLE: CF at leaf nodes For a set of objects O = {O1, …, On} forming cluster C, the CF is a five-tuple (n, Ô, R, RS, r): • n: number of objects in C. • Ô: clustroid of C. • R: representatives of the cluster C (R ⊆ C). • RS: the RowSum values of all the representatives. • r: radius of the cluster C.

  26. BUBBLE: CF at non-leaf node • A set of sample objects, S(NLi), randomly collected from the subtree rooted at NLi, forms its CFi. • The CF at NL is the union of the S(NLi). • Each child node has at least one representative in S(NL). • If the child NLi is a leaf node, then S(NLi) is randomly picked from the clustroids of its CFs.

  27. BUBBLE: Incremental Maintenance of CF at leaf • Types of Insertions • Type I: Insertion of a single object. • Type II: Insertion of a cluster of objects.

  28. Type I Insertion Inserting an object into the leaf: • If |C| is small, maintain all the cluster's objects and compute the new clustroid exactly. • If |C| is large, maintain only a subset of C, of size |R|, consisting of objects close to the clustroid.

  29. Type II Insertion Inserting a cluster of objects: • C1 and C2 must be non-overlapping but close clusters. • The location of the new clustroid lies between the two old clustroids. • By maintaining a few objects far away from the clustroids of C1 and C2, the new clustroid can be calculated.

  30. Incremental Maintenance of CF at non-leaf • The sample objects at a non-leaf entry are updated whenever its child node splits. • The distribution of clusters changes significantly whenever a node splits. • To reflect changes in the distribution at all child nodes, the sample objects at all entries of NL are updated.

  31. Drawbacks of BUBBLE • BUBBLE computes distances between sample objects, which can be expensive. • E.g. edit distance on strings.
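To make the expense concrete: edit (Levenshtein) distance costs O(|a|·|b|) per pair via dynamic programming, so repeated pairwise distance calls add up quickly. A standard implementation (illustrative, not from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time.
    Each call is O(len(a) * len(b)), which makes many pairwise calls costly."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

Reducing the number of such calls is precisely the motivation for BUBBLE-FM on the next slide.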

  32. BUBBLE-FM • Uses FastMap to transform the distance space into an approximate vector space. • Maintains all the sample objects at each non-leaf node in vector space. • For a new object O, transforms O to vector space and uses the Euclidean distance metric. • Doesn't use the transformation at leaf nodes.
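A single axis of the FastMap transform can be sketched as follows: pick two far-apart pivot objects, then place every object on the line through them using only pairwise distances (the law of cosines). The full method recurses on residual distances to obtain more coordinates; the function name is this sketch's, not the paper's:

```python
def fastmap_axis(objects, d):
    """One FastMap axis: choose two far-apart pivots a and b, then project
    every object onto the line through them using only pairwise distances."""
    a = objects[0]
    b = max(objects, key=lambda o: d(a, o))   # heuristic: farthest from a
    a = max(objects, key=lambda o: d(b, o))   # then farthest from b
    dab = d(a, b)
    if dab == 0:                              # all objects coincide
        return [0.0] * len(objects)
    # Law of cosines: x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 d(a,b))
    return [(d(a, o) ** 2 + dab ** 2 - d(b, o) ** 2) / (2 * dab)
            for o in objects]
```

After mapping, distance comparisons at non-leaf nodes become cheap Euclidean computations instead of calls to the expensive metric `d`.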

  33. Scalability (performance graphs)

  34. Scalability (performance graphs)

  35. Conclusion • Presented the BIRCH* framework for scalable incremental pre-clustering algorithms. • BUBBLE, for datasets in arbitrary metric spaces. • BUBBLE-FM, which uses FastMap to reduce the number of calls to an expensive distance function.

  36. References Primary Source: • Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, "Clustering Large Datasets in Arbitrary Metric Spaces" (1999). Secondary Sources: • Tian Zhang, Raghu Ramakrishnan, Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases" (1996). • Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases" (1998).
