BIRCH: An Efficient Data Clustering Method for Very Large Databases (SIGMOD '96)
Introduction • Balanced Iterative Reducing and Clustering using Hierarchies • For multi-dimensional datasets • Minimized I/O cost (linear: one or two scans of the data) • Full utilization of available memory • Hierarchical indexing method
Terminology • Properties of a cluster • Given N d-dimensional data points: • Centroid • Radius • Diameter (definitions below)
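These three quantities appeared as formulas in the original slides; for reference, the standard definitions from the BIRCH paper are:

```latex
% Given N d-dimensional points X_1 ... X_N:
\[
\vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}
\qquad \text{(centroid)}
\]
\[
R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2}{N} \right)^{1/2}
\qquad \text{(radius: average distance from points to the centroid)}
\]
\[
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2}
\qquad \text{(diameter: average pairwise distance within the cluster)}
\]
```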
Terminology • Distance between 2 clusters: • D0 – Euclidean distance between centroids • D1 – Manhattan distance between centroids • D2 – average inter-cluster distance • D3 – average intra-cluster distance • D4 – variance increase distance
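Written out, for clusters with N_1 points {X_i} and N_2 points {Y_j} (centroids X_0 and Y_0, merged points denoted {Z}). D4 measures the increase in total squared deviation caused by merging; whether a square root is then taken varies by presentation, so it is given here only up to that transform:

```latex
\[
D0 = \left\| \vec{X_0} - \vec{Y_0} \right\|_2
\qquad
D1 = \sum_{k=1}^{d} \left| X_0^{(k)} - Y_0^{(k)} \right|
\]
\[
D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (\vec{X_i} - \vec{Y_j})^2}{N_1 N_2} \right)^{1/2}
\qquad
D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{Z_i} - \vec{Z_j})^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}
\]
\[
D4 \;\propto\; \sum_{k=1}^{N_1+N_2} (\vec{Z_k} - \vec{Z_0})^2
 \;-\; \sum_{i=1}^{N_1} (\vec{X_i} - \vec{X_0})^2
 \;-\; \sum_{j=1}^{N_2} (\vec{Y_j} - \vec{Y_0})^2
\]
```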
Clustering Feature • To calculate the centroid, radius, diameter, and D0–D4, not all points are needed • 3 values are stored to represent a cluster (its CF): • N – number of points in the cluster • LS – linear sum of the points in the cluster • SS – square sum of the points in the cluster • CFs are additive: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (see the sketch below)
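A minimal sketch (not the paper's code) showing how all of these statistics fall out of the triple (N, LS, SS), where LS is the per-dimension linear sum and SS the scalar sum of squared norms, matching the slide's example (1, (3,4), 25):

```python
import math


class CF:
    """Clustering Feature: CF = (N, LS, SS)."""

    def __init__(self, n=0, ls=None, ss=0.0, dim=2):
        self.n = n
        self.ls = list(ls) if ls is not None else [0.0] * dim
        self.ss = ss

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, other):
        # Additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2).
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # R^2 = (SS - |LS|^2 / N) / N, derived by expanding the definition of R.
        sq = sum(v * v for v in self.ls)
        return math.sqrt(max(0.0, (self.ss - sq / self.n) / self.n))

    def diameter(self):
        # D^2 = 2*(N*SS - |LS|^2) / (N*(N-1)); needs at least two points.
        if self.n < 2:
            return 0.0
        sq = sum(v * v for v in self.ls)
        return math.sqrt(max(0.0, 2 * (self.n * self.ss - sq) / (self.n * (self.n - 1))))


def d0(a, b):
    # D0: Euclidean distance between centroids, computed from CFs alone.
    return math.dist(a.centroid(), b.centroid())


cf = CF.from_point((3, 4))
print(cf.n, cf.ls, cf.ss)   # 1 [3, 4] 25 -- matches the slide's example
```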
CF Tree • Similar to a B+-tree or R-tree • Parameters: • B – branching factor • T – threshold • Leaf node – contains at most L CF entries; each entry must satisfy D < T or R < T • Non-leaf node – contains at most B CF entries, one per child • Each node must fit into one page
BIRCH • Phase 1: Scan the dataset once, build a CF tree in memory • Phase 2: (Optional) Condense the CF tree into a smaller CF tree • Phase 3: Global clustering • Phase 4: (Optional) Cluster refining (requires one more scan of the dataset)
Building CF Tree (Phase 1) • CF of a single data point (3,4) is (1, (3,4), 25) • Insert a point into the tree: • Find the path (at each non-leaf node, descend into the child whose CF is closest under the chosen metric D0–D4) • Modify the leaf: find the closest leaf entry (again by D0–D4) and check whether it can "absorb" the new point without violating the threshold • Modify the path to the leaf: update the CFs of all nodes along the path • Splitting: if the leaf node is full, split it into two leaf nodes and add one more entry in the parent (splits can propagate up to the root) • A sketch of this logic follows the next slide
Building CF Tree (Phase 1) • Non-leaf node entry: the sum of the CFs (N, LS, SS) of all its children • Leaf node entry: a CF (N, LS, SS) satisfying the threshold condition D < T or R < T
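A simplified sketch of the Phase-1 insertion logic, under several stated assumptions: toy values for B, L, and T; D0 as the closeness metric with R < T as the absorb test (the paper permits any of D0–D4 and D < T as well); and a split-in-half heuristic standing in for the paper's farthest-pair redistribution. CFs are plain (N, LS, SS) tuples here:

```python
import math

B, L, T = 4, 4, 0.5   # assumed toy values: branching factor, leaf capacity, threshold


def merge(a, b):
    # CF additivity: (N, LS, SS) entries add component-wise.
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def radius(cf):
    n, ls, ss = cf
    sq = sum(v * v for v in ls)
    return math.sqrt(max(0.0, (ss - sq / n) / n))


def d0(a, b):
    return math.dist(centroid(a), centroid(b))


class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.children = []    # non-leaf only: child Nodes
        self.cfs = []         # leaf: entry CFs; non-leaf: per-child summary CFs

    def summary(self):
        # A node is summarized by the sum of its entries' CFs.
        total = self.cfs[0]
        for cf in self.cfs[1:]:
            total = merge(total, cf)
        return total

    def insert(self, cf):
        """Insert one CF; return a new sibling Node if this node had to split."""
        i = min(range(len(self.cfs)), key=lambda k: d0(self.cfs[k], cf)) if self.cfs else -1
        if self.leaf:
            # Absorb into the closest entry if the threshold still holds (R < T).
            if i >= 0 and radius(merge(self.cfs[i], cf)) < T:
                self.cfs[i] = merge(self.cfs[i], cf)
                return None
            self.cfs.append(cf)
        else:
            sibling = self.children[i].insert(cf)
            self.cfs[i] = self.children[i].summary()   # update CFs along the path
            if sibling is not None:                    # child split: add an entry
                self.children.append(sibling)
                self.cfs.append(sibling.summary())
        if len(self.cfs) <= (L if self.leaf else B):
            return None
        # Overflow: split in half (the paper splits around the farthest entry pair).
        new = Node(self.leaf)
        half = len(self.cfs) // 2
        new.cfs, self.cfs = self.cfs[half:], self.cfs[:half]
        if not self.leaf:
            new.children, self.children = self.children[half:], self.children[:half]
        return new


root = Node(leaf=True)
for p in [(3, 4), (3.1, 4.2), (10, 10)]:
    sibling = root.insert((1, list(p), sum(v * v for v in p)))
    if sibling is not None:                 # root split: grow the tree upward
        new_root = Node(leaf=False)
        new_root.children = [root, sibling]
        new_root.cfs = [root.summary(), sibling.summary()]
        root = new_root
```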
Condensing CF Tree (Phase 2) • Choose a larger threshold T • Consider the entries in the leaf nodes • Reinsert the CF entries into the new tree • If the new "path" comes "before" the original "path", move the entry to the new "path" • If the new "path" is the same as the original "path", leave the entry unchanged • A simplified sketch of the re-absorption idea follows
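The path-comparison rule above is how the paper avoids duplicating nodes during the rebuild; as a much simpler illustration of the core idea, a greedy pass can re-absorb leaf CF entries under the larger threshold (this flat sketch ignores the tree structure entirely and is not the paper's algorithm):

```python
import math


def merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq = sum(v * v for v in ls)
    return math.sqrt(max(0.0, 2 * (n * ss - sq) / (n * (n - 1))))


def condense(leaf_cfs, new_t):
    """Re-absorb leaf CF entries under a larger threshold new_t (D < new_t)."""
    out = []
    for cf in leaf_cfs:
        best = min(range(len(out)),
                   key=lambda k: math.dist(centroid(out[k]), centroid(cf)),
                   default=None)
        if best is not None and diameter(merge(out[best], cf)) < new_t:
            out[best] = merge(out[best], cf)   # absorbed into an existing entry
        else:
            out.append(cf)                     # starts a new, coarser entry
    return out
```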
Global Clustering (Phase 3) • Consider the CF entries in the leaf nodes only • Use the centroid as the representative of each sub-cluster • Apply a traditional clustering algorithm (e.g., agglomerative hierarchical clustering (complete link, i.e. D2), k-means, or CL…) • Cluster the CFs instead of the data points
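One way to realize this phase (a sketch, not necessarily the paper's choice): run k-means over the leaf-entry centroids, weighting each centroid by its entry's N so that large sub-clusters count proportionally. The function name and parameters here are illustrative:

```python
import math
import random


def weighted_kmeans(centroids, weights, k, iters=50, seed=0):
    """k-means over CF-entry centroids, each weighted by its point count N."""
    rng = random.Random(seed)
    means = rng.sample(centroids, k)          # naive initialization for the sketch
    for _ in range(iters):
        # Assign each CF centroid to its nearest mean.
        groups = [[] for _ in range(k)]
        for c, w in zip(centroids, weights):
            j = min(range(k), key=lambda i: math.dist(c, means[i]))
            groups[j].append((c, w))
        # Recompute each mean as a weighted average; empty clusters stay in place.
        for j, grp in enumerate(groups):
            total = sum(w for _, w in grp)
            if total:
                dim = len(means[j])
                means[j] = [sum(c[i] * w for c, w in grp) / total for i in range(dim)]
    return means


# Usage: pass the leaf entries' centroids (LS/N) and their N values.
seeds = weighted_kmeans([[3.05, 4.1], [10.0, 10.0], [9.8, 10.3]], [2, 1, 1], k=2)
```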
Cluster Refining (Phase 4) • Requires one more scan of the dataset • Use the clusters found in Phase 3 as seeds • Redistribute the data points to their closest seeds to form the new clusters (sketched below) • Allows removal of outliers • Yields membership information for each data point
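A minimal sketch of this redistribution pass; the outlier_radius knob is an assumption added for illustration (the paper's outlier handling differs in detail):

```python
import math


def refine(points, seeds, outlier_radius=None):
    """Phase-4 sketch: one more scan, assigning each point to its closest seed."""
    clusters = [[] for _ in seeds]
    outliers = []
    labels = []                               # membership information per point
    for p in points:
        j = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
        if outlier_radius is not None and math.dist(p, seeds[j]) > outlier_radius:
            outliers.append(p)                # too far from every seed: outlier
            labels.append(-1)
        else:
            clusters[j].append(p)
            labels.append(j)
    return clusters, outliers, labels
```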
Conclusion • A clustering algorithm that takes I/O cost and memory limitations into account • Utilizes local information (each clustering decision is made without scanning all data points) • Not every data point is equally important for the clustering result