HE-Tree: a framework for detecting changes in clustering structure for categorical data streams Keke Chen · Ling Liu VLDB, Vol.18, 2009, pp. 1241–1260 Presenter : Wei-Shen Tai 2010/8/4
Outline • Introduction • Entropy-based categorical clustering • BKPlot for determining the “Best K” for categorical clustering • HE-Tree: capturing cluster entropy of the categorical data stream • A monitoring framework based on the HE-Tree • Experiments • Conclusion • Comments
Motivation • Problems of clustering categorical data streams • No existing work addresses the problem of monitoring changes in the clustering structure of categorical data streams. • Most methods assume a fixed number of clusters in the data stream.
Objective • Hierarchical Entropy Tree structure (HE-Tree) • It captures the entropy characteristics of clusters in a data stream and detects changes in the Best K.
Entropy-based categorical clustering • Classical entropy definition • Optimal partition • Minimizes the weighted entropy of the clusters Ck • Incremental entropy (IE) • After merging two clusters in a partition, the expected entropy is never reduced. • Clustering criterion: minimize the expected entropy
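The formulas on this slide did not survive extraction, so the following is only a minimal sketch of the three quantities it names. It assumes that a cluster's entropy is the sum of its per-attribute entropies, that the expected entropy weights each cluster by its size, and that the incremental entropy is the increase in size-weighted entropy caused by a merge. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from math import log2

def cluster_entropy(records):
    """Sum of per-attribute entropies (in bits) of a list of equal-length tuples."""
    n = len(records)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*records):          # one column per categorical attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total

def expected_entropy(partition):
    """Size-weighted average entropy of the clusters in a partition."""
    n = sum(len(c) for c in partition)
    return sum(len(c) * cluster_entropy(c) for c in partition) / n

def incremental_entropy(cp, cq):
    """Increase in size-weighted entropy caused by merging cp and cq (assumed form of IE)."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - len(cp) * cluster_entropy(cp)
            - len(cq) * cluster_entropy(cq))

cp = [("a", "x"), ("a", "x")]
cq = [("b", "y"), ("b", "y")]
print(expected_entropy([cp, cq]))     # two pure clusters
print(expected_entropy([cp + cq]))    # the same records as one mixed cluster
print(incremental_entropy(cp, cq))    # cost of merging dissimilar clusters
```

Under these assumptions, merging two clusters with identical records costs nothing, while merging dissimilar clusters yields a strictly positive incremental entropy. That is the property the slide relies on when it says a merge never reduces the expected entropy.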
BKPlot for determining the “Best K” for categorical clustering • BKPlot method • Determines the candidate best K for static datasets. • Investigates the entropy difference between any two optimal neighboring partitions. • Second-order difference • ACE (entropy-based agglomerative hierarchical clustering) • Generates high-quality approximate BKPlots.
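The slide does not give the BKPlot formula, so the sketch below shows one plausible reading of the "second-order difference". It takes the expected entropy of the best partition for each K, forms the entropy difference between neighboring partitions, and differences that sequence twice. The indexing convention and the name `bkplot` are assumptions made for illustration, not the paper's definition.

```python
def bkplot(expected_entropy_by_k):
    """Second-order difference of the entropy gaps between neighboring partitions.

    expected_entropy_by_k[i] is the expected entropy of the (i + 1)-cluster partition.
    Returns {K: B(K)}; peaks are treated as candidate best K values.
    """
    h = expected_entropy_by_k
    # I(K) = H(K) - H(K+1): entropy difference between neighboring partitions
    inc = {k: h[k - 1] - h[k] for k in range(1, len(h))}
    # dI(K) = I(K) - I(K+1): first-order difference of I
    d = {k: inc[k] - inc[k + 1] for k in range(1, len(h) - 1)}
    # B(K) = dI(K-1) - dI(K): second-order difference (assumed indexing)
    return {k: d[k - 1] - d[k] for k in range(2, len(h) - 1)}

# Hypothetical expected entropies for K = 1..5: a large drop from K=1 to K=2, then a plateau.
curve = [4.0, 2.0, 1.8, 1.7, 1.65]
b = bkplot(curve)
print(max(b, key=b.get))   # K with the highest peak
```

With these made-up values the highest peak falls at K = 2, which matches the intuition that splitting beyond two clusters buys very little additional entropy reduction.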
ACE • IE (incremental entropy) • It is a natural inter-cluster similarity measure, ready for constructing a hierarchical clustering algorithm. • Summary table • For conveniently counting occurrences of values • M-table • For bookkeeping M(Cp, Cq) of any pair of clusters Cp and Cq. • M-heap • For maintaining the minimum M value in each step.
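The three structures on this slide can be sketched in a few lines. This is not the authors' implementation: it assumes M(Cp, Cq) is the incremental entropy of merging Cp and Cq, represents a summary table as one value counter per attribute, and lets a single `heapq` heap with lazy deletion stand in for both the M-table and the M-heap. All names are illustrative.

```python
import heapq
from collections import Counter
from itertools import combinations
from math import log2

def summarize(record):
    """Summary table of one record: a value counter per attribute."""
    return [Counter([v]) for v in record]

def merge_tables(a, b):
    return [ca + cb for ca, cb in zip(a, b)]

def weighted_entropy(table, n):
    """n * H(C), computed from the summary-table counts alone."""
    return -sum(c * log2(c / n) for col in table for c in col.values())

def ace(records):
    """Agglomerative merging; returns {K: expected entropy of the K-cluster partition}."""
    clusters = {i: (summarize(r), 1) for i, r in enumerate(records)}
    total = len(records)

    def m_value(p, q):
        (tp, n_p), (tq, n_q) = clusters[p], clusters[q]
        merged = merge_tables(tp, tq)
        return (weighted_entropy(merged, n_p + n_q)
                - weighted_entropy(tp, n_p) - weighted_entropy(tq, n_q))

    # Heap of (M, p, q); entries that mention a merged-away cluster are skipped later.
    heap = [(m_value(p, q), p, q) for p, q in combinations(clusters, 2)]
    heapq.heapify(heap)
    next_id = total
    current = 0.0                       # singleton clusters have zero entropy
    curve = {len(clusters): 0.0}
    while len(clusters) > 1:
        m, p, q = heapq.heappop(heap)
        if p not in clusters or q not in clusters:
            continue                    # stale pair
        (tp, n_p), (tq, n_q) = clusters.pop(p), clusters.pop(q)
        clusters[next_id] = (merge_tables(tp, tq), n_p + n_q)
        for other in clusters:
            if other != next_id:
                heapq.heappush(heap, (m_value(other, next_id), other, next_id))
        next_id += 1
        current += m                    # each merge adds its M to the weighted entropy
        curve[len(clusters)] = current / total
    return curve

print(ace([("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]))
```

The returned curve of expected entropies per K is the kind of input a BKPlot computation would consume. The initial pairwise pass is quadratic in the number of records, so a sketch like this only makes sense on small inputs or on pre-aggregated sub-clusters.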
HE-Tree: capturing cluster entropy of the categorical data stream • Find the most similar sub-tree to sample e • Growing stage • If M(e, ei) = 0 then e is merged into entry ei • Else • If the leaf node has an empty entry then e is assigned to an empty one • Else split the leaf node • Absorbing stage • e is merged into the entry ei with the minimum M(e, ei)
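The leaf-level decision rule on this slide can be sketched as follows. The sketch covers a single leaf only: descending the tree to the most similar sub-tree and the actual node split are left out, and the leaf merely reports that a split would be required. It assumes M is the incremental entropy between two entries; the `Leaf` class and its `capacity` parameter are hypothetical names.

```python
from collections import Counter
from math import log2

EPS = 1e-12  # tolerance for treating M as zero

def weighted_entropy(table, n):
    """n * H(entry), computed from per-attribute value counts."""
    return -sum(c * log2(c / n) for col in table for c in col.values())

def merge(a, b):
    (ta, na), (tb, nb) = a, b
    return [ca + cb for ca, cb in zip(ta, tb)], na + nb

def m_value(a, b):
    """Assumed form of M: incremental entropy of merging entries a and b."""
    merged_table, n = merge(a, b)
    return weighted_entropy(merged_table, n) - weighted_entropy(*a) - weighted_entropy(*b)

class Leaf:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []              # each entry is (summary table, record count)

    def insert(self, record, growing):
        e = ([Counter([v]) for v in record], 1)
        best = min(((m_value(e, ei), i) for i, ei in enumerate(self.entries)),
                   default=None)
        # Growing stage: merge only when M is zero. Absorbing stage: always merge.
        if best is not None and (not growing or best[0] <= EPS):
            self.entries[best[1]] = merge(self.entries[best[1]], e)
            return "merged"
        if len(self.entries) < self.capacity:
            self.entries.append(e)     # an empty entry is still available
            return "added"
        return "split"                 # the caller would split this leaf (not shown)

leaf = Leaf(capacity=2)
for rec in [("a", "x"), ("a", "x"), ("b", "y"), ("c", "z")]:
    print(rec, leaf.insert(rec, growing=True))
print(("c", "z"), leaf.insert(("c", "z"), growing=False))
```

In the growing stage only exact duplicates are merged, so the tree keeps fine-grained entries while it still has room. In the absorbing stage every record goes to the entry with the smallest M, so the structure stops growing.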
A monitoring framework based on the HE-Tree • Time-decaying HE-Tree • Let the decaying rate λ, 0 < λ < 1, represent the proportion of the information that is preserved from the last window (record number, summary table and M-table). • Extended ACE • It takes sub-clusters as input and consecutively merges pairs of clusters.
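One way to picture the decay step is to scale an entry's record number and summary-table counts by λ at each window boundary. The sketch below assumes exactly that; the paper's actual update may differ. The helper names are illustrative. The check at the end shows why the M-table can be decayed by the same factor: the size-weighted entropy, and therefore any M value built from it, scales linearly with λ.

```python
from math import log2

def weighted_entropy(table, n):
    """n * H(entry), computed from per-attribute value counts (possibly fractional)."""
    return -sum(c * log2(c / n) for col in table for c in col.values())

def decay(table, n, lam):
    """Keep a proportion lam of the previous window's counts and record number."""
    if not 0 < lam < 1:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return [{v: lam * c for v, c in col.items()} for col in table], lam * n

# Hypothetical entry: 4 records over two attributes.
table = [{"a": 3, "b": 1}, {"x": 2, "y": 2}]
n = 4
decayed_table, decayed_n = decay(table, n, lam=0.5)

before = weighted_entropy(table, n)
after = weighted_entropy(decayed_table, decayed_n)
print(abs(after - 0.5 * before) < 1e-9)   # weighted entropy scales by lam
```

Because the value proportions inside an entry are unchanged by the scaling, old windows lose weight relative to new records without distorting the entry's own entropy.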
Conclusion • HE-Tree • Detects the change of clustering structure in categorical data streams. • A time-decaying HE-Tree makes the framework more sensitive to recently emerging clustering structures.
Comments • Advantage • The proposed scheme provides a solution for detecting changes in categorical data streams. • The entropy-based HE-Tree and its decaying idea are intuitive. • Drawback • Because the summary table cannot handle mixed-type data at the same time, the proposed method was applied only to categorical data streams. • Is the decaying process still necessary once the fixed-interval window is changed to a moving window? • Application • Categorical data stream clustering