Efficiently Clustering Transactional Data with Weighted Coverage Density Hua Yan, Keke Chen, and Ling Liu Proceedings of the 15th International Conference on Information and Knowledge Management (ACM CIKM), 2006 Presenter: 吳建良
Outline • Motivation • SCALE Framework • BkPlot Method • WCD Clustering Algorithm • Cluster Validity Evaluation • Experimental Results
Motivation • Transactional data is a special kind of categorical data • t1={milk, bread, beer}, t2={milk, bread} • Can be transformed into a row-by-column table with Boolean values • The large volume and high dimensionality of the transformed data make existing algorithms inefficient • Transactional clustering algorithms: LargeItem, CLOPE, CCCD • Require users to manually tune at least one or two parameters • Suitable settings for these parameters differ from dataset to dataset
SCALE Framework • ACE & BkPlot (SSDBM’05) • ACE: Agglomerative Categorical clustering with Entropy criterion • BkPlot: • Examines the entropy difference between the clustering structures as K varies • Reports the Ks at which the clustering structure changes dramatically • Evaluation Metrics • LISR: Large Item Size Ratio • AMI: Average pair-clusters Merging Index
ACE Algorithm • Bottom-up process • Initially, each record is a cluster • Iteratively, find the most similar pair of clusters Cp and Cq, and then merge them • Incremental entropy: Im(Cp, Cq) = (Np+Nq)·Ĥ(Cp∪Cq) − Np·Ĥ(Cp) − Nq·Ĥ(Cq), where Ĥ is the expected entropy of a cluster • The most similar pair of clusters: the pair whose Im is minimum among all possible pairs • I(K) denotes the Im value in forming the K-cluster partition from the (K+1)-cluster partition
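A minimal Python sketch of the incremental-entropy step, assuming the reconstructed form Im(Cp, Cq) = (Np+Nq)·Ĥ(Cp∪Cq) − Np·Ĥ(Cp) − Nq·Ĥ(Cq); the function names and toy records are mine, not from the paper:

    from collections import Counter
    from math import log2

    def expected_entropy(cluster):
        # Sum of per-attribute entropies over a cluster of categorical
        # records; each record is a tuple of attribute values.
        n = len(cluster)
        total = 0.0
        for column in zip(*cluster):        # one pass per attribute
            for count in Counter(column).values():
                p = count / n
                total -= p * log2(p)
        return total

    def incremental_entropy(cp, cq):
        # Im(Cp, Cq): entropy increase caused by merging clusters cp and cq.
        merged = cp + cq
        return (len(merged) * expected_entropy(merged)
                - len(cp) * expected_entropy(cp)
                - len(cq) * expected_entropy(cq))

    # ACE merges, at each step, the pair of clusters with the smallest Im.
    cp = [("milk", "bread"), ("milk", "beer")]
    cq = [("milk", "bread")]
    print(incremental_entropy(cp, cq))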
BkPlot • Increasing rate of entropy: I(K), the increase in expected entropy caused by merging the (K+1)-cluster partition into the K-cluster partition • N: total records, d: columns • Small increasing rate • Merging does not introduce any impurity into the clusters • Clustering structure is not significantly changed • Large increasing rate • Introduces considerable impurity into the partitions • Clustering structure can be changed significantly
BkPlot (contd.) • Relative changes • Use relative changes to determine whether a globally significant clustering structure emerges: I(K) ≈ I(K+1), but I(K−1) > I(K)
BkPlot (contd.) Second-order differential of ECG: Entropy Characteristic Graph (ECG)
WCD Clustering Algorithm • Notations • D: transactional dataset • N: size of the dataset • I={I1, I2,…, Im}: the set of items • tj={Ij1, Ij2,…, Ijl}: a transaction • A transactional clustering result C^K={C1, C2,…, CK} is a partition of D, where C1∪C2∪…∪CK = D and Ci∩Cj = ∅ for i ≠ j
Intra-cluster Similarity Measure • Coverage Density (CD) • Given a cluster Ck • Mk: number of distinct items in Ck • I(Ck): the item set of Ck • Nk: number of transactions in Ck • Sk: sum of the occurrences of all items in Ck • CD(Ck) = Sk / (Nk × Mk), the fraction of filled cells in the Nk × Mk transaction-item grid • CD↑, compactness↑
Intra-cluster Similarity Measure (contd.) • Drawback of CD • Insufficient to measure the density of frequent itemsets • Each item has an equal contribution in a cluster • Two clusters may have the same CD but different filled-cell distributions • (Example figure: two clusters over items {a, b, c} with identical CD but differently distributed filled cells)
Intra-cluster Similarity Measure (contd.) • Weighted Coverage Density (WCD) • Focus on high-frequency items • Define the weight of item Ij in Ck as Wj = occur(Ij) / Sk • WCD(Ck) = Σj occur(Ij)·Wj / Nk = Σj occur(Ij)² / (Sk × Nk) • (Example figure: the two {a, b, c} clusters from the previous slide have equal CD but different WCD)
Clustering Criterion • Expected Weighted Coverage Density (EWCD) • EWCD(C^K) = Σk (Nk/N) × WCD(Ck) • The clustering algorithm tries to maximize the EWCD • When every individual transaction is considered as a cluster, EWCD reaches its maximum value of 1 • Use the BkPlot method to generate a set of candidate “best Ks” • (CD, WCD, and EWCD are sketched below)
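A minimal Python sketch of the three measures, under the definitions reconstructed on the last two slides (per-item weight Wj = occur(Ij)/Sk); the helper names and toy data are mine:

    from collections import Counter

    def cd(cluster):
        # Coverage density: filled cells over all cells of the Nk x Mk grid.
        occ = Counter(item for t in cluster for item in t)
        return sum(occ.values()) / (len(cluster) * len(occ))

    def wcd(cluster):
        # Weighted coverage density: each item weighted by occur(Ij) / Sk.
        occ = Counter(item for t in cluster for item in t)
        sk = sum(occ.values())
        return sum(o * o for o in occ.values()) / (sk * len(cluster))

    def ewcd(clusters):
        # Expected WCD: cluster-size-weighted average of per-cluster WCDs.
        n = sum(len(c) for c in clusters)
        return sum(len(c) / n * wcd(c) for c in clusters if c)

    c1 = [("a", "b", "c"), ("a", "b"), ("a",)]
    print(cd(c1), wcd(c1), ewcd([c1, [("c",)]]))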
WCD Clustering Algorithm
Input: dataset D, number of clusters K, initial K seeds
Output: K clusters
/* Phase 1 – Initialization */
the K seeds form the initial K clusters;
while not end of D do
  read one transaction t from D;
  add t into the cluster Ci that maximizes EWCD;
  write <t, i> back to D;
/* Phase 2 – Iteration */
moveMark = true;
while moveMark = true do
  moveMark = false;
  randomly generate the access sequence R;
  while not all transactions have been checked do
    read <t, i>;
    if moving t to cluster Cj increases EWCD and i ≠ j then
      moveMark = true;
      write <t, j> back to D;
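A runnable Python sketch of the two-phase procedure above, reusing the ewcd helper from the previous sketch; keeping all clusters in memory instead of writing <t, i> pairs back to D is my simplification:

    import random

    def wcd_clustering(transactions, seeds):
        # seeds: K transactions forming the initial K clusters
        clusters = [[s] for s in seeds]
        assign = {}

        def best_cluster(t):
            # index of the cluster where adding t yields the highest EWCD
            best, best_i = float("-inf"), 0
            for i, c in enumerate(clusters):
                c.append(t)
                score = ewcd(clusters)
                c.pop()
                if score > best:
                    best, best_i = score, i
            return best_i

        # Phase 1 - Initialization: one scan, greedy assignment
        for idx, t in enumerate(transactions):
            i = best_cluster(t)
            clusters[i].append(t)
            assign[idx] = i

        # Phase 2 - Iteration: move transactions until no move raises EWCD
        moved = True
        while moved:
            moved = False
            order = list(assign)
            random.shuffle(order)            # random access sequence R
            for idx in order:
                t, i = transactions[idx], assign[idx]
                clusters[i].remove(t)        # tentatively take t out of Ci
                j = best_cluster(t)          # may choose Ci again
                clusters[j].append(t)
                if j != i:
                    assign[idx] = j
                    moved = True
        return clusters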
Cluster Validity Evaluation • LISR (Large Item Size Ratio) • Measures the preservation of frequent itemsets • LISR = Σk LSk / Σk Sk, where LSk is the total occurrences of the large items in Ck • High concurrence of items implies a high possibility of finding more frequent itemsets at the user-specified minimum support
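A sketch under the reading above, assuming a “large item” is one whose in-cluster support reaches the user-specified minimum support and that LSk sums the occurrences of those items (the min_support parameter and helper name are mine):

    from collections import Counter

    def lisr(clusters, min_support):
        # Occurrences of large items over occurrences of all items, summed
        # across clusters; closer to 1 means frequent itemsets are preserved.
        large, total = 0, 0
        for c in clusters:
            occ = Counter(item for t in c for item in t)
            total += sum(occ.values())
            large += sum(o for o in occ.values() if o / len(c) >= min_support)
        return large / total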
Cluster Validity Evaluation (contd.) • Inter-cluster dissimilarity between Ci and Cj, in simplified form: d(Ci, Cj) = 2 − (Mi + Mj)/Mij, where Mij is the number of distinct items after merging the two clusters; thus Mij ≥ max{Mi, Mj} • Because Mij ≤ Mi + Mj and Mij ≥ max{Mi, Mj}, d(Ci, Cj) is a real number between 0 and 1
Cluster Validity Evaluation (contd.) • Example • If Mi = Mj = Mij (Ci and Cj cover the same item set, e.g. both {a, b, c}), then d(Ci, Cj) = 0 • If Ci covers {a, b, c} and Cj covers {c, d, e}, then Mi = Mj = 3, Mij = 5, and d(Ci, Cj) = 2 − 6/5 = 0.8
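The measure in code, assuming the simplified form d(Ci, Cj) = 2 − (Mi + Mj)/Mij reconstructed above; it reproduces the slide's example:

    def dissimilarity(ci, cj):
        # Mi, Mj: distinct items per cluster; Mij: distinct items after merging
        items_i = {item for t in ci for item in t}
        items_j = {item for t in cj for item in t}
        mij = len(items_i | items_j)
        return 2.0 - (len(items_i) + len(items_j)) / mij

    ci = [("a", "b"), ("b", "c")]   # covers {a, b, c}
    cj = [("c", "d", "e")]          # covers {c, d, e}
    print(dissimilarity(ci, cj))    # 2 - 6/5 = 0.8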
Cluster Validity Evaluation (contd.) • AMI (Average pair-clusters Merging Index) • Evaluates the overall inter-cluster dissimilarity of a clustering result having K clusters: AMI = (2 / (K(K−1))) × Σ_{i<j} d(Ci, Cj) • The larger the AMI, the better the clustering quality
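And AMI as the average of d over all cluster pairs, reusing the dissimilarity helper above (the averaging form follows the reconstructed formula):

    from itertools import combinations

    def ami(clusters):
        # Average pairwise inter-cluster dissimilarity; larger means the
        # K clusters cover more distinct item sets, i.e. better separation.
        pairs = list(combinations(clusters, 2))
        return sum(dissimilarity(ci, cj) for ci, cj in pairs) / len(pairs)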
Experiments • Datasets • Tc30a6r1000 • 1000 records, 30 columns, 6 possible attribute values per column • Zoo • 101 records, 18 attributes • Mushroom • 8124 instances, 22 attributes • Mushroom100k • Samples the Mushroom data with duplicates • 100,000 instances • TxI4Dx • Generated with the IBM Data Generator
Experimental Results • Tc30a6r1000 • The repulsion parameter r of CLOPE controls the number of clusters • (Result figures for 5 clusters and 9 clusters)
Experimental Results (contd.) • Zoo: K=7 is the best • (Result figures for 2, 4, and 7 clusters)
Experimental Results (contd.) • Mushroom: K=19 is the best
Experimental Results (contd.) • Performance evaluation on Mushroom100k • (Figures: CLOPE with r = 0.5–4.0; comparison at r = 2.0)
Experimental Results (contd.) • Performance evaluation on TxI4Dx • (Figures: T10I4Dx and TxI4D100k)