CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams
Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang
RIDE-SDMA'05
Advisor: Jia-Ling Koh
Speaker: Tsui-Feng Yen
Introduction
• Partitioning
  - k-means and k-medians algorithms are not designed to find clusters of arbitrary shape in data streams
• Density-based
  - DBSCAN can find clusters of arbitrary shape, but it needs to scan the database more than once
• Cell-based (grid-based)
  - CLIQUE has three problems:
    - high complexity
    - high memory consumption
    - poor accuracy under limited memory when the data stream changes
Problem Definition
• Domain: A = {A1, A2, …, Ak}
• S = A1 × A2 × … × Ak is a k-dimensional numerical space, with A1, A2, …, Ak as the dimensions (attributes) of S
• A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = <xi1, xi2, …, xik> and xij, the j-th component of xi, is drawn from domain Aj
Definition
• Sliding window model on data stream X
  - B1 is the most recent bucket, and Bu is the oldest
  - The window slides by creating a new bucket and discarding the oldest one
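The bucket-based window above can be illustrated with a minimal sketch. The slide does not say how a bucket is filled, so this assumes a fixed number of points per bucket; the names `SlidingWindow`, `u`, and `bucket_size` are hypothetical and not taken from the paper.

```python
from collections import deque


class SlidingWindow:
    """Window of at most u buckets: buckets[0] is B1 (newest), buckets[-1] is Bu (oldest)."""

    def __init__(self, u, bucket_size):
        self.u = u                      # maximum number of buckets (assumed parameter)
        self.bucket_size = bucket_size  # points per bucket (assumed parameter)
        self.buckets = deque()          # completed buckets, newest first
        self.current = []               # points collected for the next bucket

    def add(self, point):
        """Add a point. Return the discarded bucket if the window slid, else None."""
        self.current.append(point)
        if len(self.current) < self.bucket_size:
            return None
        # The new bucket is complete and becomes B1.
        self.buckets.appendleft(self.current)
        self.current = []
        if len(self.buckets) > self.u:
            # The window slides: drop the oldest bucket.
            return self.buckets.pop()
        return None
```

Returning the discarded bucket lets a caller subtract its points from whatever per-cell counts it maintains.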
Definition cont.
• Partition P of data stream X
  - P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
  - Each cell C is the intersection of one interval from each dimension, represented in the form {c1, c2, …, ck}
  - A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
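As a rough illustration of the cell coordinate, the sketch below maps a point to its interval numbers. It assumes known lower and upper bounds per dimension, the same number of intervals on every dimension, and 0-based interval numbering; none of these details are stated on the slide, and `cell_coordinate` is a hypothetical helper.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return (cNO1, ..., cNOk) for a point under an equal-width partition."""
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals        # equal interval length on this dimension
        idx = int((x - lo) // width)         # interval number, starting at 0
        idx = min(max(idx, 0), intervals - 1)  # clamp points on or beyond the bounds
        coord.append(idx)
    return tuple(coord)


# Points from a 2-dimensional stream over [0, 8] x [0, 8], 4 intervals per dimension.
for p in [(2, 3), (5, 4), (6, 5)]:
    print(p, cell_coordinate(p, (0, 0), (8, 8), 4))
```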
Definition cont.
• Selectivity pc of cell C
  - The number of points that belong to C defines the selectivity pc of cell C
• Cell-based clustering of data stream X in a sliding window
  - If the selectivity of a cell is larger than a threshold τ, the cell is called dense
  - A cluster is the largest set of cells that are adjacent and dense
  - Two cells C1 and C2 are connective when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
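The definitions above can be sketched as a flat (non-indexed) procedure: count points per cell, keep the cells whose selectivity exceeds τ, and group dense cells that are linked through neighboring dense cells. Two assumptions are made here that the slide does not confirm: "neighboring" means the coordinates differ by 1 in exactly one dimension, and connectivity is followed transitively through any number of dense cells. This is a plain breadth-first search, not the CDS-Tree clustering algorithm.

```python
from collections import Counter, deque


def dense_cells(cell_coords, tau):
    """Cells whose selectivity (point count) is larger than tau."""
    counts = Counter(cell_coords)
    return {cell for cell, p in counts.items() if p > tau}


def neighbors(cell):
    """Cells differing by 1 in exactly one dimension (assumed notion of neighboring)."""
    for i in range(len(cell)):
        for d in (-1, 1):
            yield cell[:i] + (cell[i] + d,) + cell[i + 1:]


def clusters(dense):
    """Group dense cells into maximal sets linked by neighboring dense cells."""
    seen = set()
    result = []
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            cell = queue.popleft()
            component.add(cell)
            for n in neighbors(cell):
                if n in dense and n not in seen:
                    seen.add(n)
                    queue.append(n)
        result.append(component)
    return result
```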
CDS-Tree
• [Figure: CDS-Tree built from the incoming data stream points (2,3), (5,4), (6,5); labeled parts: root-node, mid, leaf, total-num-list]
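The figure for this slide did not survive, so the actual node layout of the CDS-Tree is not shown here. As a stand-in, the following sketch uses a plain prefix tree (trie) over cell coordinates, which stores only non-empty cells and keeps a point count at each leaf. It assumes one tree level per dimension; the class `CellTrie` and its methods are hypothetical, and the structure should not be read as the authors' CDS-Tree (it has no counterpart to the total-num-list, for example).

```python
class CellTrie:
    """Prefix tree over cell coordinates; leaves hold the cell's point count."""

    def __init__(self):
        self.root = {}

    def insert(self, coord):
        """Record one point falling into the cell with this coordinate."""
        node = self.root
        for c in coord[:-1]:
            node = node.setdefault(c, {})   # inner level: one per dimension
        node[coord[-1]] = node.get(coord[-1], 0) + 1  # leaf level: count

    def selectivity(self, coord):
        """Number of points recorded for the cell, 0 if the cell is empty."""
        node = self.root
        for c in coord[:-1]:
            node = node.get(c)
            if node is None:
                return 0
        return node.get(coord[-1], 0)
```

Because empty cells are never materialized, memory grows with the number of occupied cells rather than with the full grid size.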
Related Algorithms of CDS-Tree • CDS-Tree building algorithm
Related Algorithms of CDS-Tree • Clustering algorithm based on CDS-Tree.
Granularity Adjustment
• The finer the partition, the higher the accuracy, but the more cells are created
  - If the current memory cost Mp is far less than Mmax, a finer partition can be used for higher accuracy
  - If the current memory cost Mp is close to Mmax, a coarser partition should be used to avoid memory overflow
Granularity Adjustment cont.
• Safety factors (to guard against exhausting memory)
  - λ: keeps the required memory from exceeding the memory limit Mmax when the granularity becomes finer; it is set larger than 1
  - η: decides when to adjust the granularity; it is set less than 1. For example, with η = 0.1, when the remaining memory is less than 10% of Mmax, the algorithm switches to a coarser granularity to save memory
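One way to read the two safety factors is as a simple decision rule, sketched below. The slide does not give the exact formula, so the use of λ here is an assumption: the partition is refined only if λ times an estimate of the memory needed after refinement still fits in Mmax. The function `adjust_granularity` and the parameter `m_finer_estimate` are hypothetical.

```python
def adjust_granularity(m_p, m_max, m_finer_estimate, lam=1.5, eta=0.1):
    """Return 'coarser', 'finer', or 'keep' for the current memory cost m_p.

    m_finer_estimate is an assumed estimate of the memory cost after refining.
    """
    if lam <= 1 or not 0 < eta < 1:
        raise ValueError("expected lam > 1 and 0 < eta < 1")
    # Remaining memory below eta * Mmax: switch to a coarser partition.
    if m_max - m_p < eta * m_max:
        return "coarser"
    # Refine only if the padded estimate still fits in memory (assumed rule).
    if lam * m_finer_estimate <= m_max:
        return "finer"
    return "keep"


print(adjust_granularity(95, 100, 400))  # 5 units left, below 10% of Mmax
print(adjust_granularity(20, 100, 60))   # 1.5 * 60 = 90 fits within 100
```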
Experimental Results
• OS: Microsoft Windows 2000
• CPU: 2.5 GHz
• RAM: 512 MB
• Two datasets:
  - KDD-CUP-99 Network Intrusion Detection stream dataset
  - Image Fourier Coefficient dataset