Frequent Item Based Clustering
M.Sc. Student: Homayoun Afshar
Supervisor: Martin Ester
Contents
• Introduction and motivation
• Frequent item sets
• Text data as transactional data
• Cluster set definition
• Our approach
• Test data set, results, challenges
• Related works
• Conclusion
Introduction and Motivation
• Huge amount of information online
• Much of this information is in text format
  • E.g. emails, web pages, newsgroup postings, …
• Need to group related documents
  • Nontrivial task
Frequent Item Sets
• Given a dataset D = {t1, t2, …, tn}
  • Each ti is a transaction
  • ti ⊆ I, where I is the set of all items
• Given a threshold min_sup
  • i ⊆ I such that |{t | i ⊆ t and t ∈ D}| > min_sup
  • i is a frequent item set with respect to minimum support min_sup
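The definition above can be illustrated with a minimal sketch. It enumerates candidate item sets by brute force, which is only practical for toy data; the function names, the `max_size` limit, and the four sample transactions are assumptions made for illustration, not part of the slides. The strict `>` comparison follows the definition as written.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup, max_size=2):
    """Brute-force search for item sets whose support exceeds min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s > min_sup:  # strict inequality, as in the definition
                result[candidate] = s
    return result

# Hypothetical toy dataset D with four transactions.
D = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
)]
freq = frequent_itemsets(D, min_sup=1)
```

With this toy data, `{"a", "b"}` occurs in two transactions and is frequent, while `{"d"}` occurs in only one and is not.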
Text Data as Transactional Data
• Treat each word as an item
• And each document as a transaction
• Using a minimum support, find frequent item sets (frequent word sets)
• Frequent word sets correspond to frequent item sets
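As a rough sketch of this mapping, each document can be reduced to the set of distinct words it contains, after which single-word supports are simple document counts. The tokenizer, the three sample sentences, and the threshold below are illustrative assumptions; the slides do not specify how words are extracted.

```python
import re

def to_transaction(document):
    """Turn a document into the set of distinct lowercase words it contains."""
    return frozenset(re.findall(r"[a-z0-9]+", document.lower()))

# Hypothetical mini-corpus.
docs = [
    "Net profit rose this year",
    "Net loss this year",
    "Profit and loss statement",
]
transactions = [to_transaction(d) for d in docs]

# Support of each single word = number of documents containing it.
word_support = {}
for t in transactions:
    for w in t:
        word_support[w] = word_support.get(w, 0) + 1

min_sup = 1  # absolute document count, chosen for the toy corpus
frequent_words = {w for w, s in word_support.items() if s > min_sup}
```

Larger frequent word sets would be built from these frequent words in the usual level-wise fashion.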
Cluster Set Definition
• f = {X1, X2, …, Xn} is the set of all frequent item sets with respect to some minimum support
• c = {C1, C2, …, Cm} is a cluster set, where each Ci is the set of documents covered by some Xk ∈ f
• And …
Cluster Set Definition …
• An optimal cluster set must:
  • Cover the whole data set
  • Minimize the mutual overlap between its clusters
  • Have clusters of roughly the same size
Our Approach: Frequent-Item Based Clustering …
• Find all the frequent word sets
• Form cluster sets with just one cluster
  • Overlap is zero
  • Coverage is the support of the frequent item set representing the cluster
• Form cluster sets with two clusters
  • Find the overlap and coverage
Our Approach: Frequent-Item Based Clustering …
• Prune the candidate list for cluster sets
  • If Cov(ci) ⊆ Cov(cj) and Overlap(ci) > Overlap(cj)
    • ci and cj are candidates in the same level
  • Remove ci if Overlap(ci) ≥ |Cov(ci)|
• Generate the next level
  • Find overlap and coverage, prune
• Stop when there are no more candidates left
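The level-wise loop can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: it assumes a candidate is dropped when another candidate of the same level covers at least the same documents with strictly less overlap, or when its overlap is at least as large as its coverage. It also assumes overlap counts the documents covered by two or more clusters. The labels in `docs_of` and all function names are hypothetical.

```python
def coverage(cluster_set, docs_of):
    """Documents covered by at least one cluster of the cluster set."""
    return set().union(*(docs_of[x] for x in cluster_set))

def overlap(cluster_set, docs_of):
    """Number of documents covered by two or more clusters."""
    seen, twice = set(), set()
    for x in cluster_set:
        twice |= seen & docs_of[x]
        seen |= docs_of[x]
    return len(twice)

def prune(level, docs_of):
    """Drop candidates under the assumed reading of the pruning rule."""
    stats = {c: (coverage(c, docs_of), overlap(c, docs_of)) for c in level}
    kept = []
    for ci in level:
        cov_i, ov_i = stats[ci]
        if ov_i >= len(cov_i):
            continue  # every covered document lies in an overlap
        dominated = any(
            cov_i <= stats[cj][0] and ov_i > stats[cj][1]
            for cj in level if cj != ci
        )
        if not dominated:
            kept.append(ci)
    return kept

def next_level(level, all_sets):
    """Extend each surviving cluster set by one more frequent word set."""
    grown = {c | {x} for c in level for x in all_sets if x not in c}
    return sorted(grown, key=sorted)

# Hypothetical frequent word sets (by label) and the documents they cover.
docs_of = {"net": {1, 2, 3}, "loss": {3, 4}, "year": {1, 2, 3, 4}}

history = {}
level = [frozenset({x}) for x in docs_of]
size = 1
while level:  # stop when there are no more candidates left
    level = prune(level, docs_of)
    history[size] = level
    level = next_level(level, docs_of)
    size += 1
```

On this toy input all three single-cluster sets survive, only `{net, loss}` survives at level two, and the single three-cluster candidate is removed because all four covered documents overlap.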
Our Approach: Coverage and Overlap …
• Using a bit matrix
  • Each column is a document
  • Each row is a frequent word set
• Coverage: OR, counting the 1s
• Overlap: XOR, OR, AND, counting the 1s
Our Approach: Coverage and Overlap …
Coverage:
    10110010 (1st)
    10001010 (2nd)
    10101100 (3rd)
    ------------
    OR all = 10111110
    count 1s -> coverage = 6
cost = 2 ORs + counting 1s
cost for counting 1s = 8 (shifts, ANDs, adds)
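The coverage example can be reproduced with integer bit operations. This is a minimal sketch: the three rows are taken from the slide, while the helper names are assumptions. The bit counter uses shifts, ANDs, and adds, in the spirit of the cost note above.

```python
from functools import reduce
from operator import or_

# Rows of the bit matrix from the example; each bit is one document.
rows = [0b10110010, 0b10001010, 0b10101100]

def popcount(x):
    """Count the 1 bits using shifts, ANDs and adds."""
    count = 0
    while x:
        count += x & 1
        x >>= 1
    return count

def coverage_mask(rows):
    """OR all rows together: a bit is set if any cluster covers the document."""
    return reduce(or_, rows)

covered = coverage_mask(rows)
coverage = popcount(covered)
```

For three rows, `reduce` performs exactly the two ORs mentioned in the cost estimate.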
Our Approach: Coverage and Overlap …
Overlap:
    10110010 (1st)
    10001010 (2nd)
    ------------
    AND first two = 10000010 (i)
    XOR first two = 00111000 (ii)
    10101100 (3rd)
    ------------
    AND 3rd with (ii) = 00101000 (iii)
    ------------
    OR (i) and (iii) = 10101010
    now count 1s for overlap -> overlap = 4
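The overlap steps can be traced in code as below. The first part follows the slide's three-row example literally. The `overlap_mask` function is an assumed generalization to any number of rows (it marks documents covered by two or more rows) and is not taken from the slides; all variable names are illustrative.

```python
first, second, third = 0b10110010, 0b10001010, 0b10101100

# Steps from the example.
both = first & second             # (i)   documents in both of the first two rows
exactly_one = first ^ second      # (ii)  documents in exactly one of the first two
third_hits = third & exactly_one  # (iii) of those, documents also in the 3rd row
overlap_bits = both | third_hits  # documents covered by at least two rows

def overlap_mask(rows):
    """Assumed generalization: bits set in two or more rows."""
    seen = 0   # documents covered so far
    twice = 0  # documents covered at least twice so far
    for row in rows:
        twice |= seen & row
        seen |= row
    return twice

overlap = bin(overlap_bits).count("1")
```

Both routes give the same mask for the example, so the overlap of 4 does not depend on which formulation is used.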
Test Data, Results, Challenges
• Test data set
  • Reuters 21578
  • 21578 Reuters news documents
  • 8655 of them have exactly one topic
  • Remove stop words
  • Stem all the words
• Number of frequent word sets
  • min_sup = 5%: 10678
  • min_sup = 10%: 1217
  • min_sup = 20%: 78
Test Data, Results, Challenges
• With 20% minimum support
  • Sample 2-cluster candidate set
    • {(said,reuter)(line,ct,vs)}
    • Overlap = 1
    • Coverage = 5259
  • Sample 5-cluster candidate set
    • {(reuter)(vs)(net)(line,ct,net)(vs,net,shr)}
    • Overlap = 3303
    • Coverage = 8609
Test Data, Results, Challenges
• More results, with min_sup = 10%
  • 6-cluster cluster set
    • {(reuter)(includ)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
    • Coverage = 8616
    • Overlap = 2553
  • 7-cluster cluster set
    • {(reuter)(loss)(profit)(year,1986)(mln,profit)(year,ct)(year,mln,net)}
    • Coverage = 8611
    • Overlap = 2705
  • 8-cluster cluster set
    • {(reuter)(loss)(profit)(year,1986)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
    • Coverage = 8616
    • Overlap = 3033
Test Data, Results, Challenges
• Lower support values
  • Pruning is very slow
• 2-cluster set with min_sup = 20%
  • Creating = 0.010 seconds
  • Updating = 1.853 seconds (overlap and coverage)
  • Pruning = 11.767 seconds
  • Sorting = 0.000 seconds
• Number of candidates
  • Before pruning = 3003
  • After pruning = 73
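A quick arithmetic check on these figures: the candidate count before pruning equals the number of unordered pairs that can be formed from the 78 frequent word sets reported for 20% minimum support. That is consistent with every pair being generated as a 2-cluster candidate, although the slides do not state this explicitly. The variable names below are illustrative.

```python
import math

frequent_word_sets = 78  # reported for min_sup = 20%
pairs = math.comb(frequent_word_sets, 2)

# Reported timings in seconds for the 2-cluster level.
timings = {"creating": 0.010, "updating": 1.853, "pruning": 11.767, "sorting": 0.000}
total = sum(timings.values())
pruning_share = timings["pruning"] / total

kept_fraction = 73 / pairs  # candidates surviving the prune
```

Pruning accounts for roughly 86% of the reported time at this level and keeps about 2.4% of the candidates.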
Test Data, Results, Challenges
• Hierarchical clustering
• Clustering quality
  • In our test data set, entropy
  • In real data sets, classes are not known
• Test the pruning more efficiently
  • Defining an upper threshold
  • Using the following ratios to prune candidates
    • [ratio formulas not preserved in this transcript]
  • Using only max item sets
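The slides do not say how entropy is computed. One common choice, assumed here, is the size-weighted average of the class-label entropy within each cluster, where lower values mean purer clusters. The function names and the toy topic labels are hypothetical, and with overlapping clusters this sketch simply counts a document once per cluster it belongs to.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Shannon entropy (base 2) of the class labels inside one cluster."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def clustering_entropy(clusters):
    """Average of the per-cluster entropies, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical clusters, each given as the known topics of its documents.
clusters = [
    ["earn", "earn", "earn", "earn"],  # pure cluster, entropy 0
    ["acq", "acq", "earn", "earn"],    # evenly mixed, entropy 1 bit
]
quality = clustering_entropy(clusters)
```

This kind of measure needs known classes, which is why it applies to the labeled test set but not to real data sets without labels.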
Related Works
• Similar idea
  • Frequent Term-Based Text Clustering [BEX02]
  • Florian Beil, Martin Ester, Xiaowei Xu
  • Focuses on finding one optimal clustering set (non-overlapping): FTC
  • Hierarchical clustering (overlapping): HFTC
Conclusion
• To get optimal clustering
  • Reduce minimum support
  • Reduce number of frequent items
    • Introduce maximum support
    • Use only max item sets
  • Better pruning (speed)
  • Hierarchical clustering
References
[AS94] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 487-499, Santiago, Chile, Sept. 1994.
[BEX02] F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering.
J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.