Efficiently clustering transactional data with weighted coverage density

Advisor: Dr. Hsu
Presenter: Hsin-Yi Huang
Authors: Hua Yan, Keke Chen, Ling Liu
2006.CIKM.10

Outline
• Motivation
• Objective
• WCD (Weighted Coverage Density)
• LISR (Large Item Size Ratio)
• AMI (Average pair-clusters Merging Index)
• SCALE
• Experiment
• Conclusion
• Comments
2007/7/31

Motivation
• Two key features of transactional datasets:
  • Large volume
  • High dimensionality
• Existing algorithms require manual tuning of at least one or two parameters
• Lack of cluster validation methods

Objective
• A clustering algorithm for analyzing transactional data
  • Fast
  • Memory-efficient
  • Scalable
  • Without manual parameter settings
• Domain-specific cluster validity measures
• SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation)

CD (Coverage Density)
• The percentage of filled cells in the whole rectangle area
• Ex1. A simple transactional dataset: {abcd, bcd, ac, de, def}
  • Vertical axis: transaction IDs
  • Horizontal axis: items

CD (Coverage Density)
• Both transaction and item contributions are uniform (every filled cell counts equally)
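To make the CD definition concrete, here is a minimal Python sketch that computes it for Ex1. The function name `coverage_density` and the encoding of each transaction as a string of single-letter items are assumptions made for illustration only.

```python
def coverage_density(transactions):
    """Fraction of filled cells in the transactions-by-items grid."""
    # Distinct items form the horizontal axis of the grid.
    items = set().union(*transactions)
    # Each (transaction, item) occurrence fills exactly one cell.
    filled = sum(len(set(t)) for t in transactions)
    return filled / (len(transactions) * len(items))

# Ex1 from the slide: one string per transaction, one letter per item.
ex1 = ["abcd", "bcd", "ac", "de", "def"]
cd = coverage_density(ex1)  # 14 filled cells over a 5 x 6 grid
print(round(cd, 4))
```

Because every filled cell adds the same amount, swapping a frequent item for a rare one leaves CD unchanged, which is the uniformity noted on the slide.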

WCD (Weighted Coverage Density)
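The sketch below illustrates one common statement of WCD, taken here as an assumption rather than quoted from the slide: each item in cluster C_k is weighted by its share of the cluster's total item occurrences S_k, giving WCD(C_k) = Σ_j occur(I_kj)² / (S_k · N_k), where N_k is the number of transactions. The function name is hypothetical.

```python
from collections import Counter

def weighted_coverage_density(transactions):
    """WCD under the assumed definition: sum_j occur_j^2 / (S_k * N_k)."""
    # Count in how many transactions each item appears.
    occur = Counter(item for t in transactions for item in set(t))
    s_k = sum(occur.values())   # total item occurrences in the cluster
    n_k = len(transactions)     # number of transactions in the cluster
    return sum(c * c for c in occur.values()) / (s_k * n_k)

# Same toy dataset as Ex1, treated as a single cluster.
ex1 = ["abcd", "bcd", "ac", "de", "def"]
wcd = weighted_coverage_density(ex1)
print(round(wcd, 4))
```

Under this definition frequent items contribute more than rare ones, so the uniform-contribution property of CD no longer holds.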

Comparing CD and WCD
• WCD will prefer the cluster having the higher Var(X)
• Maximizing Var(X) is remotely related to minimizing the entropy criterion
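A small numeric check can illustrate the preference for higher variance. It assumes WCD = Σ_j occur_j² / (S · N) and takes X to be the per-item occurrence count; both are assumptions for this sketch. With N, M and S held fixed, that definition implies WCD = CD + M · Var(X) / (S · N), so two clusters with equal CD are separated by their variance.

```python
from statistics import pvariance

def cd_from_counts(occur, n_transactions):
    """CD from per-item occurrence counts: S / (N * M)."""
    return sum(occur) / (n_transactions * len(occur))

def wcd_from_counts(occur, n_transactions):
    """WCD under the assumed definition: sum(occur^2) / (S * N)."""
    return sum(c * c for c in occur) / (sum(occur) * n_transactions)

n = 4                   # transactions in each hypothetical cluster
uniform = [2, 2, 2, 2]  # every item occurs equally often
skewed = [4, 2, 1, 1]   # same total occurrences, higher variance

for name, occur in (("uniform", uniform), ("skewed", skewed)):
    print(name, cd_from_counts(occur, n), wcd_from_counts(occur, n), pvariance(occur))
```

Both hypothetical clusters fill the same number of cells, so CD cannot tell them apart, while WCD ranks the skewed one higher.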

EWCD (Expected Weighted Coverage Density)
• WCD-based clustering criterion
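As an illustration of how such a criterion can score a partition, the sketch below assumes EWCD is the size-weighted average of per-cluster WCD, EWCD = Σ_k (N_k / N) · WCD(C_k), with WCD(C_k) = Σ_j occur(I_kj)² / (S_k · N_k). The formula, the function names and the example partition are assumptions for illustration.

```python
from collections import Counter

def wcd(cluster):
    """Assumed WCD of one cluster: sum_j occur_j^2 / (S_k * N_k)."""
    occur = Counter(item for t in cluster for item in set(t))
    return sum(c * c for c in occur.values()) / (sum(occur.values()) * len(cluster))

def ewcd(clusters):
    """Assumed EWCD: average of WCD weighted by cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * wcd(c) for c in clusters)

# Two candidate clusterings of the Ex1 transactions.
split = [["abcd", "bcd", "ac"], ["de", "def"]]
merged = [["abcd", "bcd", "ac", "de", "def"]]
score_split = ewcd(split)
score_merged = ewcd(merged)
print(round(score_split, 4), round(score_merged, 4))
```

A single-transaction cluster always has WCD equal to 1 under this definition, so the criterion is only meaningful when the number of clusters is bounded.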

WCD Clustering Algorithm (1)

WCD Clustering Algorithm (2)
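As a rough illustration of how an EWCD-driven allocation pass might work, the following is a simplified sketch and not the authors' pseudocode. It assumes EWCD = (1/N) · Σ_k (Σ_j occur(I_kj)² / S_k) and a caller-supplied upper bound `k` on the number of clusters; the names `allocate` and `cluster_score` are hypothetical. Since N is the same whichever cluster receives the next transaction, maximizing EWCD reduces to maximizing the gain in the receiving cluster's term.

```python
from collections import Counter

def cluster_score(counts):
    """One cluster's term of N * EWCD: sum_j occur_j^2 / S_k (0 if empty)."""
    s = sum(counts.values())
    return sum(c * c for c in counts.values()) / s if s else 0.0

def allocate(transactions, k):
    """Single greedy pass: put each transaction where the EWCD gain is largest."""
    clusters = []  # one Counter of item occurrences per cluster
    labels = []
    for t in transactions:
        items = Counter(set(t))
        candidates = list(range(len(clusters)))
        if len(clusters) < k:
            candidates.append(len(clusters))  # option: open a new cluster
        best_idx, best_gain = None, None
        for idx in candidates:
            current = clusters[idx] if idx < len(clusters) else Counter()
            gain = cluster_score(current + items) - cluster_score(current)
            if best_gain is None or gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx == len(clusters):
            clusters.append(Counter())
        clusters[best_idx].update(items)
        labels.append(best_idx)
    return labels

labels = allocate(["abcd", "bcd", "ac", "de", "def"], k=2)
print(labels)
```

The result of a single greedy pass depends on the order in which transactions arrive, and this sketch makes no attempt to revisit earlier assignments.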

WCD Complexity Analysis
• Space complexity
• Time complexity

LISR (Large Item Size Ratio)
• Large Item
• Measuring the preservation of frequent itemsets
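A hedged sketch of how a measure of this kind can be computed: it assumes that a large item is one whose support within its cluster is at least a minimum support threshold, and that LISR is the size-weighted share of item occurrences belonging to large items, LISR = Σ_k (N_k / N) · (occurrences of large items in C_k / S_k). The formula, the threshold value and the function name `lisr` are assumptions for illustration, not taken from the slide.

```python
from collections import Counter

def lisr(clusters, min_support):
    """Assumed LISR: weighted share of occurrences owned by large items."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in set(t))
        s_k = sum(occur.values())
        # Occurrences of items whose in-cluster support reaches the threshold.
        large = sum(c for c in occur.values() if c / len(cluster) >= min_support)
        total += (len(cluster) / n) * (large / s_k)
    return total

split = [["abcd", "bcd", "ac"], ["de", "def"]]
merged = [["abcd", "bcd", "ac", "de", "def"]]
lisr_split = lisr(split, min_support=0.7)
lisr_merged = lisr(merged, min_support=0.7)
print(round(lisr_split, 4), round(lisr_merged, 4))
```

With this toy data the split clustering keeps a larger share of occurrences in large items than the single merged cluster does.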

AMI (Average pair-clusters Merging Index)
• Measuring inter-cluster dissimilarity
Simplifying the above formula, we get

SCALE
• (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation)
• Eliminating parameter setting/tuning:
  • WCD
  • A few candidate best Ks or the best K
  • AMI
  • LISR
• Four steps:
  • Sampling
  • Clustering structure assessment
  • Clustering
  • Domain-specific evaluation

SCALE
• Hierarchical algorithm
• BKPlots
• Hierarchical tree
• Cluster seeds

Experiment
• CLOPE: uses the ratio of the average item occurrences to the number of distinct items
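The ratio described for CLOPE can be sketched as follows, assuming the usual histogram view of a cluster: width W is the number of distinct items, size S is the total number of item occurrences, and height H = S / W is the average item occurrence, so the ratio is H / W = S / W². The function name `height_to_width_ratio` is hypothetical.

```python
def height_to_width_ratio(transactions):
    """Average item occurrence (height) divided by distinct item count (width)."""
    width = len(set().union(*transactions))        # W: distinct items
    size = sum(len(set(t)) for t in transactions)  # S: total item occurrences
    height = size / width                          # H: average item occurrence
    return height / width

ex1 = ["abcd", "bcd", "ac", "de", "def"]
ratio = height_to_width_ratio(ex1)  # S = 14, W = 6
print(round(ratio, 4))
```

A cluster whose transactions share many items has a tall, narrow histogram and therefore a high ratio.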

Experiment
• Datasets
  • Tc30a6r1000_2L
  • TxI4Dx series
  • Zoo (real dataset)
  • Mushroom (real dataset)

Experiment
• Results on Tc30a6r1000_2L

Experiment

Experiment
• Results on Zoo and Mushroom

Experiment

Experiment
• Performance evaluation on Mushroom100k

Experiment
• Performance evaluation on TxI4Dx series
Time complexity:

Conclusion
• The WCD approach can generate high-quality clustering results in a fully automated manner, with much higher efficiency, for a wider range of transactional datasets.

Comments
• Advantage
  • …
• Drawback
  • …
• Application
