This research paper presents a clustering algorithm for analyzing transactional data efficiently using weighted coverage density. It aims to achieve fast, memory-efficient, and scalable clustering without the need for manual parameter settings. The authors also introduce the SCALE framework for sampling, clustering structure assessment, clustering, and domain-specific evaluation. Experimental results demonstrate the effectiveness of the proposed approach in generating high-quality clustering results for a wide range of transactional datasets.
Efficiently clustering transactional data with weighted coverage density Advisor: Dr. Hsu Presenter: Hsin-Yi Huang Authors: Hua Yan, Keke Chen, Ling Liu 2006.CIKM.10
Outline • Motivation • Objective • WCD (Weighted Coverage Density) • LISR (Large Item Size Ratio) • AMI (Average pair-clusters Merging Index) • SCALE • Experiment • Conclusion • Comments 2007/7/31
Motivation • Two key features of transactional datasets: • Large volume • High dimensionality • Existing algorithms require manually tuning at least one or two parameters • Cluster validation methods are lacking
Objective • A clustering algorithm for analyzing transactional data • Fast • Memory-efficient • Scalable • Without manual parameter settings • Domain-specific cluster validity measures • SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation)
CD (Coverage Density) • The percentage of filled cells in the whole rectangle area • Ex. 1: a simple transactional dataset {abcd, bcd, ac, de, def} • Vertical axis: transaction IDs • Horizontal axis: items
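The CD definition above can be checked on the example dataset with a few lines of Python. This is a minimal sketch: the function name `coverage_density` is hypothetical, and each transaction is assumed to be a string of single-character items.

```python
def coverage_density(transactions):
    """CD = filled cells / (number of transactions * number of distinct items)."""
    items = set().union(*transactions)               # columns of the rectangle
    filled = sum(len(set(t)) for t in transactions)  # one filled cell per (transaction, item)
    return filled / (len(transactions) * len(items))


dataset = ["abcd", "bcd", "ac", "de", "def"]
print(coverage_density(dataset))      # whole dataset: 14 filled cells / (5 * 6)
print(coverage_density(dataset[:3]))  # {abcd, bcd, ac}: 9 / (3 * 4)
print(coverage_density(dataset[3:]))  # {de, def}: 5 / (2 * 3)
```

Splitting the dataset into the two groups raises the density of each part, which is the intuition behind using CD as a clustering criterion.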
CD (Coverage Density) • Both transaction and item contributions are uniform
WCD (Weighted Coverage Density)
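The formula on this slide did not survive in the text. The sketch below assumes the usual reading of WCD, in which each item column is weighted by its share of all item occurrences in the cluster, so that WCD = Σ occur(item)² / (S × N), with S the total number of item occurrences and N the number of transactions. Both that formula and the name `weighted_coverage_density` are assumptions, not taken from the slide.

```python
from collections import Counter


def weighted_coverage_density(transactions):
    """Assumed WCD: sum of squared item occurrences / (S * N)."""
    occur = Counter(item for t in transactions for item in set(t))
    total = sum(occur.values())   # S: total item occurrences in the cluster
    n = len(transactions)         # N: number of transactions in the cluster
    return sum(c * c for c in occur.values()) / (total * n)


print(weighted_coverage_density(["abcd", "bcd", "ac"]))  # 21 / (9 * 3)
print(weighted_coverage_density(["de", "def"]))          # 9 / (5 * 2)
```

Under this reading, frequent items contribute more than rare ones, unlike CD where every filled cell counts equally.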
Comparing CD and WCD • WCD will prefer the cluster having the higher Var(X) • Maximizing Var(X) is remotely related to minimizing the entropy criterion
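The preference for higher variance can be made explicit with a short derivation. It assumes that X denotes the per-item occurrence counts o_1, ..., o_M of a cluster with N transactions and S = Σ o_j total occurrences, and that WCD = Σ o_j² / (S·N); both are assumptions, since the slide's formulas are not in the text.

```latex
\begin{align*}
\mathrm{CD}  &= \frac{S}{N M}, \qquad
\mathrm{WCD} = \frac{\sum_{j=1}^{M} o_j^{2}}{S N}, \qquad
\mathrm{Var}(X) = \frac{1}{M}\sum_{j=1}^{M} o_j^{2} - \Bigl(\frac{S}{M}\Bigr)^{2} \\
\sum_{j=1}^{M} o_j^{2} &= M\,\mathrm{Var}(X) + \frac{S^{2}}{M} \\
\mathrm{WCD} &= \frac{M\,\mathrm{Var}(X)}{S N} + \frac{S}{N M}
             = \mathrm{CD} + \frac{M\,\mathrm{Var}(X)}{S N}
\end{align*}
```

For two clusters with the same N, M and S (hence the same CD), the one with the larger Var(X) has the larger WCD. For {abcd, bcd, ac}: CD = 0.75, Var(X) = 0.1875, and 0.75 + 4·0.1875/27 = 21/27.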
EWCD (Expected Weighted Coverage Density) • WCD-based clustering criterion
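As a rough illustration of the criterion, the sketch below assumes EWCD is the size-weighted average of the per-cluster WCD values, EWCD = Σ_k (N_k / N) × WCD(C_k), which simplifies to (1/N) Σ_k (Σ_j occur_kj² / S_k). The formula and the function name `ewcd` are assumptions, because the slide's equation is missing from the text.

```python
from collections import Counter


def ewcd(clusters):
    """Assumed EWCD: (1/N) * sum over clusters of (sum of squared occurrences / S_k)."""
    total_n = sum(len(cluster) for cluster in clusters)
    score = 0.0
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in set(t))
        size = sum(occur.values())                        # S_k
        score += sum(o * o for o in occur.values()) / size
    return score / total_n


one_cluster = [["abcd", "bcd", "ac", "de", "def"]]
two_clusters = [["abcd", "bcd", "ac"], ["de", "def"]]
print(ewcd(one_cluster))   # (38 / 14) / 5
print(ewcd(two_clusters))  # (21 / 9 + 9 / 5) / 5
```

The two-cluster partition scores higher than keeping everything in one cluster, so a search that maximizes this quantity would favor the split.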
WCD Clustering Algorithm (1)
WCD Clustering Algorithm (2)
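The pseudocode on these two slides is not preserved in the text. The following is only a sketch of one way a WCD-driven partitioning could work, not the authors' pseudocode. It assumes a two-phase scheme (a single allocation pass, then iterative refinement), a fixed number of clusters `k`, and the simplified objective Σ_k (Σ_j occur_kj² / S_k), which is proportional to EWCD. The names `Cluster`, `wcd_cluster` and `max_iter` are hypothetical.

```python
from collections import Counter


class Cluster:
    """Per-cluster summary: item counts, S_k, and the sum of squared counts."""

    def __init__(self):
        self.occur = Counter()
        self.size = 0   # S_k: total item occurrences
        self.sq = 0     # sum of occur^2

    def score(self):
        # Contribution to N * EWCD (0 for an empty cluster).
        return self.sq / self.size if self.size else 0.0

    def gain(self, items, sign):
        # Score change if `items` were added (sign=+1) or removed (sign=-1).
        sq = self.sq + sum(2 * sign * self.occur[i] + 1 for i in items)
        size = self.size + sign * len(items)
        return (sq / size if size else 0.0) - self.score()

    def update(self, items, sign):
        for i in items:
            self.sq += 2 * sign * self.occur[i] + 1
            self.occur[i] += sign
        self.size += sign * len(items)


def wcd_cluster(transactions, k, max_iter=10):
    data = [frozenset(t) for t in transactions]
    clusters = [Cluster() for _ in range(k)]
    labels = []
    # Phase 1 (assumed): place each transaction where the objective grows most.
    for t in data:
        best = max(range(k), key=lambda c: clusters[c].gain(t, +1))
        clusters[best].update(t, +1)
        labels.append(best)
    # Phase 2 (assumed): move transactions while a move improves the objective.
    for _ in range(max_iter):
        moved = False
        for idx, t in enumerate(data):
            cur = labels[idx]
            loss = clusters[cur].gain(t, -1)
            best, best_delta = cur, 1e-12
            for c in range(k):
                if c != cur:
                    delta = loss + clusters[c].gain(t, +1)
                    if delta > best_delta:
                        best, best_delta = c, delta
            if best != cur:
                clusters[cur].update(t, -1)
                clusters[best].update(t, +1)
                labels[idx] = best
                moved = True
        if not moved:
            break
    return labels


print(wcd_cluster(["abcd", "bcd", "ac", "de", "def"], k=2))
```

On the five-transaction example the allocation pass first puts `bcd` in the second cluster, and the refinement pass moves it back, ending with {abcd, bcd, ac} and {de, def}. Only the per-cluster summaries and one label per transaction are kept in memory, which is the kind of design that makes a memory-efficient implementation possible.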
WCD Complexity Analysis • Space complexity • Time complexity
LISR (Large Item Size Ratio) • Large Item • Measuring the preservation of frequent itemsets
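The LISR formula is not preserved on this slide. As a hedged illustration only, the sketch below assumes that an item is "large" in a cluster when its occurrence count reaches a minimum-support fraction of the cluster's transactions, and that LISR is the size-weighted share of item occurrences that belong to large items. The exact definition, the `min_support` parameter and the function name `lisr` are all assumptions.

```python
from collections import Counter


def lisr(clusters, min_support):
    """Assumed LISR: sum_k (N_k / N) * (occurrences of large items in C_k / S_k)."""
    total_n = sum(len(cluster) for cluster in clusters)
    score = 0.0
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in set(t))
        size = sum(occur.values())                 # S_k
        threshold = min_support * len(cluster)     # minimum count to be "large"
        large = sum(o for o in occur.values() if o >= threshold)
        score += len(cluster) / total_n * large / size
    return score


partition = [["abcd", "bcd", "ac"], ["de", "def"]]
print(lisr(partition, min_support=0.6))  # 0.6 * 9/9 + 0.4 * 4/5
print(lisr(partition, min_support=1.0))  # 0.6 * 3/9 + 0.4 * 4/5
```

A value close to 1 means most item occurrences in each cluster come from items that are frequent within that cluster, which is one way to read "preservation of frequent itemsets".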
AMI (Average pair-clusters Merging Index) • Measuring inter-dissimilarity of clusters Simplifying the above formula, we get
SCALE • (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation) • Eliminating parameter setting/tuning: • WCD • A few candidate best Ks or the best K • AMI • LISR • Four steps: • Sampling • Clustering structure assessment • Clustering • Domain-specific evaluation
SCALE • Hierarchical algorithm • BKPlots • Hierarchical tree • Cluster seeds
Experiment • CLOPE: uses the ratio of the average item occurrences to the number of distinct items
Experiment • Datasets • Tc30a6r1000_2L • TxI4Dx Series • Zoo Real dataset • Mushroom Real dataset
Experiment • Results on Tc30a6r1000_2L
Experiment
Experiment • Results on Zoo and Mushroom
Experiment
Experiment • Performance Evaluation on Mushroom100k
Experiment • Performance Evaluation on TxI4Dx Series Time complexity:
Conclusion • The WCD approach can generate high quality clustering results in a fully automated manner with much higher efficiency for wider collections of transactional datasets.
Comments • Advantage • … • Drawback • … • Application