
Efficiently clustering transactional data with weighted coverage density

This research paper presents a clustering algorithm for analyzing transactional data efficiently using weighted coverage density. It aims to achieve fast, memory-efficient, and scalable clustering without the need for manual parameter settings. The authors also introduce the SCALE framework for sampling, clustering structure assessment, clustering, and domain-specific evaluation. Experimental results demonstrate the effectiveness of the proposed approach in generating high-quality clustering results for a wide range of transactional datasets.


Presentation Transcript


  1. Efficiently clustering transactional data with weighted coverage density • Advisor: Dr. Hsu • Presenter: Hsin-Yi Huang • Authors: Hua Yan, Keke Chen, Ling Liu • 2006.CIKM.10

  2. Outline • Motivation • Objective • WCD (Weighted Coverage Density) • LISR (Large Item Size Ratio) • AMI (Average pair-clusters Merging Index) • SCALE • Experiment • Conclusion • Comments 2007/7/31

  3. Motivation • Two key features of transactional datasets: • Large volume • High dimensionality • Existing algorithms require manually tuning at least one or two parameters • Cluster validation methods are lacking

  4. Objective • A clustering algorithm for analyzing transactional data that is: • Fast • Memory-efficient • Scalable • Free of manual parameter settings • A domain-specific cluster validity measure • SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation)

  5. CD (Coverage Density) • The percentage of filled cells in the whole rectangle area • Ex1. A simple transactional dataset {abcd, bcd, ac, de, def} • Vertical axis: transaction IDs • Horizontal axis: items

  6. CD (Coverage Density) • Both transactional and item contributions are uniform
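
As a worked check of the CD definition on Ex1, here is a minimal Python sketch. The function name `coverage_density` is illustrative and not taken from the paper. It counts the filled cells and divides by the rectangle area, that is, the number of transactions times the number of distinct items.

```python
def coverage_density(transactions):
    """Fraction of filled cells in the transaction-by-item grid."""
    # Distinct items across the whole cluster form the horizontal axis.
    items = set().union(*transactions)
    # Each transaction fills one cell per distinct item it contains.
    filled = sum(len(set(t)) for t in transactions)
    return filled / (len(transactions) * len(items))


# Ex1 from the slide: each string is one transaction, each letter one item.
example = ["abcd", "bcd", "ac", "de", "def"]
cd = coverage_density(example)
print(round(cd, 3))
```

Ex1 has 14 filled cells in a grid of 5 transactions by 6 items, so CD = 14/30, roughly 0.467.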

  7. WCD (Weighted Coverage Density)
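
The WCD formula on this slide did not survive in the transcript. The sketch below therefore rests on an assumption: each item column is weighted by its share of all item occurrences in the cluster, which gives WCD = (sum of squared item occurrences) / (S × N), where S is the total number of item occurrences and N the number of transactions. The function name is hypothetical.

```python
from collections import Counter


def weighted_coverage_density(transactions):
    """WCD under the assumed weighting: weight(item) = occur(item) / S."""
    # Count in how many transactions each item occurs.
    occur = Counter(item for t in transactions for item in set(t))
    total = sum(occur.values())   # S: total item occurrences in the cluster
    n = len(transactions)         # N: number of transactions
    # Frequent items contribute quadratically, so they dominate the score.
    return sum(c * c for c in occur.values()) / (total * n)


example = ["abcd", "bcd", "ac", "de", "def"]
wcd = weighted_coverage_density(example)
print(round(wcd, 3))
```

On Ex1 the item occurrences are a:2, b:2, c:3, d:4, e:2, f:1, so under this assumption WCD = 38/70, roughly 0.543. Unlike CD, the frequent item d counts for more than the rare item f.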

  8. Comparing CD and WCD • WCD prefers the cluster with the higher Var(X) • Maximizing Var(X) is remotely related to minimizing the entropy criterion

  9. EWCD (Expected Weighted Coverage Density) • WCD-based clustering criterion
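
The slide's EWCD formula is also missing from the transcript. As a rough illustration only, the sketch below assumes that EWCD is the size-weighted average of the per-cluster WCD values and that a partition with a higher EWCD is preferred. The helper names `wcd` and `ewcd` are hypothetical.

```python
from collections import Counter


def wcd(cluster):
    """Assumed WCD: sum of squared item occurrences / (S * N)."""
    occur = Counter(item for t in cluster for item in set(t))
    s = sum(occur.values())
    return sum(c * c for c in occur.values()) / (s * len(cluster))


def ewcd(clusters):
    """Assumed EWCD: each cluster's WCD weighted by its share of transactions."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * wcd(c) for c in clusters)


# Two candidate partitions of the Ex1 dataset {abcd, bcd, ac, de, def}.
one_cluster = [["abcd", "bcd", "ac", "de", "def"]]
two_clusters = [["abcd", "bcd", "ac"], ["de", "def"]]
score_one = ewcd(one_cluster)
score_two = ewcd(two_clusters)
print(score_two > score_one)
```

Under these assumptions the two-cluster split scores higher than keeping all five transactions together, because each part has fewer distinct items and more concentrated item occurrences.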

  10. WCD Clustering Algorithm (1)

  11. WCD Clustering Algorithm (2)

  12. WCD Complexity Analysis • Space complexity • Time complexity

  13. LISR (Large Item Size Ratio) • Large Item • Measuring the preservation of frequent itemsets

  14. AMI (Average pair-clusters Merging Index) • Measuring inter-dissimilarity of clusters Simplifying the above formula, we get

  15. SCALE • (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation) • Eliminating parameter setting/tuning: • WCD • A few candidate best Ks or the best K • AMI • LISR • Four steps: • Sampling • Clustering structure assessment • Clustering • Domain-specific evaluation

  16. SCALE • Hierarchical algorithm • BKPlots • Hierarchical tree • Cluster seeds

  17. Experiment • CLOPE: clusters by the ratio of the average item occurrences to the number of distinct items.

  18. Experiment • Datasets • Tc30a6r1000_2L • TxI4Dx Series • Zoo Real dataset • Mushroom Real dataset

  19. Experiment • Results on Tc30a6r1000_2L

  20. Experiment

  21. Experiment • Results on Zoo and Mushroom

  22. Experiment

  23. Experiment • Performance Evaluation on Mushroom100k

  24. Experiment • Performance Evaluation on TxI4Dx Series Time complexity:

  25. Conclusion • The WCD approach can generate high quality clustering results in a fully automated manner with much higher efficiency for wider collections of transactional datasets.

  26. Comments • Advantage • … • Drawback • … • Application
