Department of Information & Computer Education, NTNU
SmartMiner: A Depth First Algorithm Guided by Tail Information for Mining Maximal Frequent Itemsets
Qinghua Zou, Wesley W. Chu, and Baojing Lu, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), 9-12 Dec. 2002, pp. 570–577.
Advisor: Jia-Ling Koh
Speaker: Chen-Yi Lin
Outline
• Introduction
• The strategy of SmartMiner
• Experimental Results
• Conclusions
Introduction (1/5): The problem of mining frequent patterns
Dataset (MinSup=2):
id: item set
1: a b c d e
2: a b c d
3: b c d
4: b e
5: c d e
• What itemsets are frequent itemsets (FI)? a, b, c, d, e, ab, ac, ad, bc, bd, be, cd, ce, de, abc, abd, acd, bcd, cde, abcd
• Maximal frequent itemset (MFI): a frequent itemset with no frequent superset. abcd, be, cde
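The FI and MFI lists for the example dataset can be checked by brute force. The following is a minimal sketch for this five-transaction dataset only; the function names (`support`, `frequent_itemsets`, `maximal`) are illustrative and not taken from the paper, and enumerating every candidate itemset is exactly what becomes infeasible on real data.

```python
from itertools import combinations

# Example dataset from the slide: transaction id -> set of items.
DATASET = {
    1: set("abcde"),
    2: set("abcd"),
    3: set("bcd"),
    4: set("be"),
    5: set("cde"),
}
MIN_SUP = 2


def support(itemset, dataset=DATASET):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in dataset.values() if set(itemset) <= items)


def frequent_itemsets(dataset=DATASET, min_sup=MIN_SUP):
    """Enumerate every non-empty itemset and keep the frequent ones."""
    items = sorted(set().union(*dataset.values()))
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            if support(combo, dataset) >= min_sup:
                result.append(frozenset(combo))
    return result


def maximal(itemsets):
    """Keep only itemsets that have no proper superset in the list."""
    return [x for x in itemsets if not any(x < y for y in itemsets)]


fi = frequent_itemsets()
mfi = maximal(fi)
print(len(fi))                                   # 20
print(sorted("".join(sorted(x)) for x in mfi))   # ['abcd', 'be', 'cde']
```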
Introduction (2/5): Current status and techniques – Why MFI, not FI
• Mining FI is infeasible when long frequent itemsets exist.
• E.g., suppose we have a 20-item frequent set a1 a2 … a20. All of its subsets are frequent, i.e., 2^20 = 1,048,576 itemsets (counting the empty set).
• Given an unknown large dataset, mining MFI is fast and gives us an overview of the characteristics of the dataset.
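The subset count for the 20-item example is plain arithmetic. A short sketch (the helper name `subset_count` is illustrative):

```python
from math import comb


def subset_count(n):
    """Number of subsets of an n-item set, including the empty set."""
    return 2 ** n


n = 20
total = subset_count(n)
# Summing the number of k-item subsets over all k gives the same total.
by_size = sum(comb(n, k) for k in range(n + 1))
print(total)       # 1048576
print(total - 1)   # 1048575 non-empty subsets
```

A single maximal frequent itemset of length 20 therefore stands in for over a million frequent itemsets.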
Introduction (3/5)
• Enumeration tree:
• Each node has a head and a tail representing a state.
• The head is a candidate while the tail contains items to form new heads.
An enumeration tree for abcde for the given order of a, b, c, d, e (each node is written head:tail):
Level 0: :abcde
Level 1: a:bcde b:cde c:de d:e e:
Level 2: ab:cde ac:de ad:e ae: bc:de bd:e be: cd:e ce: de:
Level 3: abc:de abd:e abe: acd:e ace: ade: bcd:e bce: bde: cde:
Level 4: abcd:e abce: abde: acde: bcde:
Level 5: abcde:
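The head/tail structure of the enumeration tree can be generated mechanically: a node's children are formed by moving one tail item into the head and keeping only the items that come after it. A minimal sketch, with an illustrative function name:

```python
def enumeration_tree(items):
    """Yield (depth, head, tail) for every node, in depth-first order."""
    def visit(head, tail, depth):
        yield depth, head, tail
        for i, item in enumerate(tail):
            # The child's head gains `item`; its tail is what follows `item`.
            yield from visit(head + item, tail[i + 1:], depth + 1)

    yield from visit("", items, 0)


nodes = list(enumeration_tree("abcde"))
for depth, head, tail in nodes[:6]:
    print("  " * depth + f"{head}:{tail}")
print(len(nodes))  # 32 nodes, one per subset of {a, b, c, d, e}
```

The first six nodes printed follow the leftmost path of the tree: `:abcde`, `a:bcde`, `ab:cde`, `abc:de`, `abcd:e`, `abcde:`.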
Introduction (4/5): Current status and techniques – Mafia: an example
[Figure: Mafia search tree on the example dataset (id: item set; 1: a b c d e, 2: a b c d, 3: b c d, 4: b e, 5: c d e; MinSup=2), with superset checking against the MFI found so far.]
• Root :a b c d e, |D|=5, item supports a=2, b=4, c=4, d=4, e=3; items reordered to :a e b c d.
• |Da|=2, supports e=1, b=2, c=2, d=2.
• |De|=3, supports b=2, c=2, d=2.
• |Deb|=2, supports c=1, d=1.
• |Dec|=2, support d=2.
• MFI found, in order: abcd, eb, ecd.
Answer: MFI = abcd, be, cde
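To make the role of superset checking concrete, here is a minimal depth-first sketch on the same dataset and item order (a, e, b, c, d). It is not Mafia: it omits Mafia's pruning techniques and bitmap representation, and the function name and `stats` counters are assumptions introduced for illustration. Each node that cannot be extended is tested against every MFI found so far, which is the repeated check the next slide criticizes.

```python
DATASET = [set("abcde"), set("abcd"), set("bcd"), set("be"), set("cde")]
MIN_SUP = 2


def support(itemset):
    """Number of transactions containing `itemset`."""
    return sum(1 for t in DATASET if itemset <= t)


def dfs_with_superset_check(order="aebcd"):
    mfi = []
    stats = {"nodes": 0, "superset_checks": 0}

    def visit(head, tail):
        stats["nodes"] += 1
        # Keep only tail items that extend the head to a frequent itemset.
        tail = [x for x in tail if support(head | {x}) >= MIN_SUP]
        for i, item in enumerate(tail):
            visit(head | {item}, tail[i + 1:])
        if not tail:
            # The head cannot grow further: it is maximal unless an
            # already-found MFI contains it.
            stats["superset_checks"] += 1
            if not any(head <= m for m in mfi):
                mfi.append(head)

    visit(frozenset(), list(order))
    return mfi, stats


mfi, stats = dfs_with_superset_check()
print(["".join(sorted(m)) for m in mfi])  # ['abcd', 'be', 'cde']
print(stats)
```

Most of the superset checks in this run reject a candidate (for example abd, acd, bcd), so the work spent on them produces no new MFI.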
Introduction (5/5): Current status and techniques – the limitations
• Constant superset checking.
  • A study shows that the CPU spends 40% of its time on superset checking.
• The size of the search tree is too large.
  • It can be reduced.
• The number of support-counting operations is large.
  • Counting support is expensive.
The strategy of SmartMiner (1/2)
[Figure: two search trees rooted at A1. (a) Previous approach: children B1, B2, …, Bn; B2 is created before B1 is explored. (b) SmartMiner strategy: children B1 and B'; B' is created after B1 is explored, using information from B1 to prune the space at B'.]
• SmartMiner takes advantage of the information from previous steps.
The strategy of SmartMiner (2/2)
[Figure: SmartMiner search on the example dataset (id: item set; 1: a b c d e, 2: a b c d, 3: b c d, 4: b e, 5: c d e; MinSup=2). Each node is annotated with its state (S0, S1), its tail information (Inf0, Inf1), and the MFI it returns (Mfi).]
• Node labels in the figure include :aebcd, :ebcd, :bcd, a:ebcd, e:bcd, b:cd, c:d.
• Tail-information labels in the figure include nil, bcd, and bcd,b,cd.
• Projected dataset sizes: |D|=5, |Da|=2, |De|=3, |Deb|=2, |Dec|=2.
Answer: abcd, eb, ecd (MFI = abcd, be, cde)
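The idea of passing tail information down the tree can be sketched as follows. This is a reconstruction from the slides, not the authors' implementation: the function name `mine`, the way `inf` is projected, and the fixed item order a, e, b, c, d are assumptions, and the paper's item-selection heuristics are left out. Each call receives `inf`, a list of itemsets within the tail that are already covered by known results, and returns only maximal itemsets not contained in any of them. The first child is explored fully before the second is created, and what the first child found is added to the second child's `inf`.

```python
DATASET = [set("abcde"), set("abcd"), set("bcd"), set("be"), set("cde")]
MIN_SUP = 2


def support(itemset):
    """Number of transactions containing `itemset`."""
    return sum(1 for t in DATASET if itemset <= t)


def mine(head, tail, inf):
    """Return maximal X within tail such that head | X is frequent and
    X is not contained in any itemset of `inf` (the tail information)."""
    # Drop tail items that do not extend the head to a frequent itemset.
    tail = [x for x in tail if support(head | {x}) >= MIN_SUP]
    # If the whole tail is already covered, nothing new can be found here.
    if any(set(tail) <= known for known in inf):
        return []
    if not tail:
        return [frozenset()]  # the head itself is a new maximal itemset
    first, rest = tail[0], tail[1:]
    # First child: itemsets that contain `first`. Only tail information
    # mentioning `first` is relevant, with `first` removed.
    inf0 = [known - {first} for known in inf if first in known]
    found0 = [x | {first} for x in mine(head | {first}, rest, inf0)]
    # Second child: itemsets without `first`, created after the first child
    # is explored and guided by everything the first child found.
    inf1 = [known - {first} for known in inf + found0]
    found1 = mine(head, rest, inf1)
    return found0 + found1


mfi = mine(frozenset(), list("aebcd"), [])
print(["".join(sorted(m)) for m in mfi])  # ['abcd', 'be', 'cde']
```

In this sketch no candidate is compared against a global MFI list; the only test is whether the current tail lies inside the tail information passed to the node. On the example, after a and then e are explored at the root, the remaining node :bcd receives the tail information bcd, b, cd and is pruned immediately.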
Experimental Results (1/4): Running time on Mushroom
Experimental Results (2/4): Search tree size on Mushroom
Experimental Results (3/4): The number of support counting on Mushroom
Experimental Results (4/4): Running time on Connect
Conclusions
• The SmartMiner algorithm takes advantage of the information gathered from previous steps to search for MFI.
• Compared with Mafia and GenMax, SmartMiner generates a smaller search tree, requires fewer support-counting operations, and does not require superset checking.