CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams

Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang
RIDE-SDMA'05
Advisor: Jia-Ling Koh
Speaker: Tsui-Feng Yen

Introduction
• Partitioning
  - k-means and k-medians algorithms are not designed to find arbitrarily shaped clusters in data streams
• Density-based
  - DBSCAN can find arbitrarily shaped clusters, but it needs to scan the database more than once
• Cell-based (grid-based)
  - CLIQUE has three problems
  - high complexity
  - high memory usage
  - poor accuracy under limited memory for changing data streams

Problem Definition
• Domain: A = {A1, A2, …, Ak}
• S = A1 × A2 × … × Ak is a k-dimensional numerical space, with A1, A2, …, Ak as the dimensions (attributes) of S
• A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = <xi1, xi2, …, xik> and xij, the j-th component of xi, is drawn from domain Aj

Definition
• Sliding window model on data stream X
  - B1 is the most recent bucket, and Bu is the oldest
  - The window slides by creating a new bucket and discarding the oldest one
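The bucket-based window can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SlidingWindow`, the parameter names, and the choice of fixed-size (count-based) buckets are assumptions, since the slides do not say how a bucket is filled.

```python
from collections import deque


class SlidingWindow:
    """Window of at most `num_buckets` buckets (hypothetical sketch).

    buckets[0] plays the role of B1 (most recent) and buckets[-1]
    the role of Bu (oldest). Fixed-size buckets are an assumption.
    """

    def __init__(self, num_buckets, bucket_size):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.buckets = deque()

    def add(self, point):
        """Add a point; return the discarded oldest bucket, or None."""
        expired = None
        # Open a new bucket when there is none yet or B1 is full.
        if not self.buckets or len(self.buckets[0]) >= self.bucket_size:
            self.buckets.appendleft([])
            # Slide the window: drop the oldest bucket if over capacity.
            if len(self.buckets) > self.num_buckets:
                expired = self.buckets.pop()
        self.buckets[0].append(point)
        return expired
```

Returning the expired bucket lets the caller subtract its points from any per-cell counts it maintains.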

Definition cont.
• Partition P of data stream X
  - P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
  - Each cell C is the intersection of one interval from each dimension, represented in the form {c1, c2, …, ck}
  - A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
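As a rough sketch of how a point maps to its cell coordinate under an equal-length partition: the function name `cell_coordinate`, the explicit per-dimension bounds, and the 0-based interval numbering are assumptions made for illustration.

```python
def cell_coordinate(point, lower, upper, intervals):
    """Return the cell coordinate (cNO1, ..., cNOk) of a point.

    Assumptions: dimension i spans [lower[i], upper[i]] and is split
    into `intervals` equal-length intervals, numbered from 0.
    """
    coord = []
    for x, lo, hi in zip(point, lower, upper):
        width = (hi - lo) / intervals
        idx = int((x - lo) // width)
        # Clamp so that a point on the upper bound falls in the last interval.
        coord.append(min(max(idx, 0), intervals - 1))
    return tuple(coord)
```

For example, with bounds [0, 8] on both dimensions and 4 intervals per dimension, each interval has length 2, so the point (5, 4) lands in cell (2, 2).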

Definition cont.
• Selectivity pc of cell C
  - The number of points that belong to C defines the selectivity pc of cell C
• Cell-based clustering of data stream X in a sliding window
  - If the selectivity of a cell is larger than a threshold τ, the cell is called dense
  - A cluster is a maximal set of dense cells that are connective with one another
  - Two cells C1 and C2 are connective when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
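The definitions above can be illustrated with a plain breadth-first search over dense cells. This sketch does not use the CDS-Tree index. It assumes that "neighboring" means differing by exactly 1 in a single dimension, and it treats connectivity as transitive through chains of neighboring dense cells; both are interpretations, and the name `cluster_dense_cells` is hypothetical.

```python
from collections import Counter, deque


def cluster_dense_cells(cell_coords, tau):
    """Group dense cells into clusters of connected cells.

    cell_coords: iterable of cell coordinates, one per point in the window.
    tau: density threshold; a cell is dense if its selectivity > tau.
    """
    selectivity = Counter(cell_coords)
    dense = {c for c, n in selectivity.items() if n > tau}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), set()
        while queue:
            cell = queue.popleft()
            cluster.add(cell)
            # Visit the face-adjacent cells in every dimension.
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(cluster)
    return clusters
```

Because clusters are built from cells rather than from distances to a centroid, their shape is limited only by the grid resolution.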

CDS-Tree
• Example: incoming data stream (2,3), (5,4), (6,5)
[Figure: CDS-Tree example; labels: root-node, mid, leaf, total-num-list]

Related Algorithms of CDS-Tree • CDS-Tree building algorithm

Related Algorithms of CDS-Tree • Clustering algorithm based on CDS-Tree.

Granularity Adjustment
• The finer the partition, the higher the accuracy, but the more cells are created
• If the current memory cost Mp is far less than Mmax, a finer-granularity partition can be used for higher accuracy
• If the current memory cost Mp is close to Mmax, a coarser partition should be used to avoid memory overflow

Granularity Adjustment cont.
• Safety factors (to guard against exhausting memory)
  - λ: keeps the required memory from exceeding the memory limit Mmax when the granularity becomes finer; it is set larger than 1
  - η: decides when to adjust the granularity; it is set less than 1. For example, if η is set to 0.1, then when the remaining memory is less than 10% of Mmax, the algorithm makes the granularity coarser to save memory
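One way to read the two safety factors is as a simple decision rule. The sketch below is an assumption-laden illustration, not the algorithm from the paper: the function name, the `est_finer_cost` input (an estimate of memory needed after refining), and the exact comparison against λ are hypothetical; only the η rule (coarsen when the remaining memory drops below η · Mmax) follows the slide directly.

```python
def adjust_granularity(m_p, m_max, est_finer_cost, lam=1.5, eta=0.1):
    """Decide whether to make the partition 'coarser', 'finer', or 'keep' it.

    m_p: current memory cost Mp.
    m_max: memory limit Mmax.
    est_finer_cost: hypothetical estimate of memory after refining.
    lam: safety factor lambda (> 1); eta: safety factor eta (< 1).
    """
    if lam <= 1 or not 0 < eta < 1:
        raise ValueError("expected lam > 1 and 0 < eta < 1")
    # Remaining memory below eta * Mmax: coarsen to avoid overflow.
    if m_max - m_p < eta * m_max:
        return "coarser"
    # Refine only if the estimate, inflated by lambda, still fits (assumed rule).
    if lam * est_finer_cost <= m_max:
        return "finer"
    return "keep"
```

With Mmax = 100 and η = 0.1, a current cost of 95 leaves only 5 units free, which is below the 10-unit margin, so the rule chooses a coarser partition.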

Granularity Adjustment Algorithm

Experimental Results
• OS: Microsoft Windows 2000
• CPU: 2.5GHz
• RAM: 512MB
• Two datasets:
  - KDD-CUP-99 Network Intrusion Detection stream dataset
  - Image Fourier Coefficient dataset

Experimental Results
