BIRCH: An Efficient Data Clustering Method for Very Large Databases

BIRCH: An Efficient Data Clustering Method for Very Large Databases. Tian Zhang, Raghu Ramakrishnan, Miron Livny, University of Wisconsin-Madison. Presented by Zhirong Tao

Outline of the Paper • Background • Clustering Feature and CF Tree • The BIRCH Clustering Algorithm • Performance Studies

Background • A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. • The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Background (Contd) • Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N, the centroid X0, radius R and diameter D of the cluster are defined as:
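The three quantities can be written out as follows. This is a reconstruction using the standard definitions from the BIRCH paper, where squaring a vector difference means taking its squared Euclidean norm:

```latex
\[
X_0 = \frac{\sum_{i=1}^{N} X_i}{N}, \qquad
R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}, \qquad
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}
\]
```

R is the average distance from member points to the centroid, and D is the average pairwise distance within the cluster.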

Background (Contd) Given the centroids of two clusters: X01 and X02, • The centroid Euclidean distance D0: • The centroid Manhattan distance D1:
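The two distances can be written out as follows. This is a reconstruction using the standard definitions from the BIRCH paper; the superscript (i) for the i-th coordinate of a d-dimensional centroid is notation assumed here:

```latex
\[
D_0 = \left( (X_{01} - X_{02})^2 \right)^{1/2}, \qquad
D_1 = \left| X_{01} - X_{02} \right| = \sum_{i=1}^{d} \left| X_{01}^{(i)} - X_{02}^{(i)} \right|
\]
```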

BIRCH: Hierarchical Method • A distance-based approach: • Assume there is a distance measurement between any two instances. • Represent clusters by some kind of ‘center’ measure. • A hierarchical clustering: • A sequence of partitions in which each partition is nested into the next partition in the sequence.

Clustering Feature Definition Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N, CF = (N, LS, SS) N is the number of data points in the cluster, LS is the linear sum of the N data points, SS is the square sum of the N data points.

CF Additive Theorem • Assume that CF1 = (N1, LS1, SS1), and CF2 = (N2 ,LS2, SS2) are the CF entries of two disjoint subclusters. • The CF entry of the subcluster formed by merging the two disjoint subclusters is: CF1 + CF2 = (N1 + N2 , LS1 + LS2, SS1 + SS2) • The CF entries can be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
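A minimal Python sketch of a CF entry and the additive theorem, for illustration only. The class name `CF` and its method names are hypothetical, and SS is assumed to be stored as a single scalar (the sum of squared norms). The centroid, radius and diameter are computed from (N, LS, SS) alone, which is why the raw points never need to be kept.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """Clustering Feature (N, LS, SS); names here are illustrative."""
    n: int                    # N: number of points in the subcluster
    ls: Tuple[float, ...]     # LS: per-dimension linear sum of the points
    ss: float                 # SS: sum of squared norms (assumed scalar)

    @staticmethod
    def from_point(x: Sequence[float]) -> "CF":
        return CF(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other: "CF") -> "CF":
        # Additive theorem: merging disjoint subclusters adds componentwise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self) -> Tuple[float, ...]:
        return tuple(v / self.n for v in self.ls)

    def radius(self) -> float:
        # R^2 = SS/N - ||LS/N||^2
        c2 = sum(v * v for v in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self) -> float:
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)); defined as 0 for N < 2
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        return sqrt(max((2 * self.n * self.ss - 2 * ls2)
                        / (self.n * (self.n - 1)), 0.0))


# Merge the subcluster {(0,0), (2,0)} with the single point (4,0).
merged = CF.from_point((0.0, 0.0)) + CF.from_point((2.0, 0.0)) + CF.from_point((4.0, 0.0))
print(merged.n, merged.ls, merged.ss, merged.centroid())
```

The `max(..., 0.0)` guards only protect against tiny negative values from floating-point rounding.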

CF-Tree • A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf nodes and L for leaf nodes) and threshold T. • The entry in each nonleaf node has the form [CFi, childi]. • The entry in each leaf node is a CF; each leaf node has two pointers, `prev' and `next', which chain all leaf nodes together. • Threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.

BIRCH Algorithm Overview

Phase 1

Insertion Algorithm • Identifying the appropriate leaf • Modifying the leaf: assume the closest leaf entry is Li: • If Li can `absorb' `Ent' without violating the threshold T, update the CF of Li • Otherwise, if the leaf has space, add a new entry for `Ent' to the leaf • Otherwise, split the leaf node • Modifying the path to the leaf: • If the parent has space for the new entry, add it there • Otherwise, split the parent, and so on up to the root
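The leaf-level step of the insertion algorithm can be sketched as below. This is a simplified illustration, not the paper's implementation: a leaf is a plain Python list of CF tuples (N, LS, SS), the closest entry is chosen by centroid Euclidean distance D0, the diameter is used for the threshold test, and the function only reports that a split is needed instead of performing it. All function names are hypothetical.

```python
from math import sqrt


def cf_of(x):
    """CF tuple (N, LS, SS) of a single point."""
    return (1, tuple(x), sum(v * v for v in x))


def cf_merge(a, b):
    """Additive theorem: componentwise sum of two CF tuples."""
    return (a[0] + b[0], tuple(p + q for p, q in zip(a[1], b[1])), a[2] + b[2])


def cf_centroid(cf):
    return tuple(v / cf[0] for v in cf[1])


def cf_diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ls2 = sum(v * v for v in ls)
    return sqrt(max((2 * n * ss - 2 * ls2) / (n * (n - 1)), 0.0))


def d0(a, b):
    """Centroid Euclidean distance between two CF entries."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(cf_centroid(a), cf_centroid(b))))


def insert_into_leaf(leaf, x, T, L):
    """Try to place point x in `leaf` (a list of CF tuples, at most L entries).

    Returns "absorbed", "added", or "split"; on "split" the leaf is left
    unchanged and the caller would have to split it and update the path.
    """
    ent = cf_of(x)
    if leaf:
        i = min(range(len(leaf)), key=lambda k: d0(leaf[k], ent))
        merged = cf_merge(leaf[i], ent)
        if cf_diameter(merged) < T:      # closest entry can absorb Ent
            leaf[i] = merged
            return "absorbed"
    if len(leaf) < L:                    # room for a new entry
        leaf.append(ent)
        return "added"
    return "split"


leaf = []
for point in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (10.0, 10.0)]:
    print(point, insert_into_leaf(leaf, point, T=0.5, L=2))
```

A full CF-tree would also descend from the root to find the leaf and propagate CF updates and splits upward; those steps are omitted here.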

Phase 3: Global Clustering • Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes. • This way we can overcome Anomaly 1: • Anomaly 1: Depending upon the order of data input and the degree of skew, it is also possible that two subclusters that should not be in one cluster are kept in the same node.
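As one possible illustration of Phase 3, the sketch below runs a simple agglomerative procedure over leaf entries, treating each CF tuple (N, LS, SS) as a single weighted object and repeatedly merging the pair with the smallest centroid distance D0 until k clusters remain. The slide only says "an existing global or semi-global algorithm", so the choice of algorithm, the function names, and the parameter `k` are assumptions made for this example.

```python
from math import sqrt


def cf_merge(a, b):
    """Additive theorem: componentwise sum of two CF tuples (N, LS, SS)."""
    return (a[0] + b[0], tuple(p + q for p, q in zip(a[1], b[1])), a[2] + b[2])


def cf_centroid(cf):
    return tuple(v / cf[0] for v in cf[1])


def d0(a, b):
    """Centroid Euclidean distance between two CF entries."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(cf_centroid(a), cf_centroid(b))))


def global_cluster(entries, k):
    """Merge leaf entries (from any nodes) until only k clusters remain."""
    clusters = list(entries)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = d0(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = cf_merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Four leaf entries, given directly as CF tuples; they may come from different nodes.
leaf_entries = [
    (2, (0.0, 0.0), 0.0),      # centroid (0, 0)
    (1, (1.0, 0.0), 1.0),      # centroid (1, 0)
    (3, (30.0, 0.0), 300.0),   # centroid (10, 0)
    (1, (11.0, 0.0), 121.0),   # centroid (11, 0)
]
for cf in global_cluster(leaf_entries, k=2):
    print(cf[0], cf_centroid(cf))
```

Because the input is the set of leaf entries rather than the tree nodes, entries that ended up in the same node can still be assigned to different final clusters.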

Comparison of BIRCH and CLARANS on a synthetically generated dataset:

Summary • Compared with previous distance-based approaches (e.g., K-Means and CLARANS), BIRCH is appropriate for very large datasets. • BIRCH can work with any given amount of memory, and the I/O complexity is a little more than one scan of the data.
