CSCI6405 class project. Implementation and comparison of three AR mining algorithms. Xuehai Wang, Xiaobo Chen, Shen chen. Outline. Motivation Dataset Apriori based hash tree algorithm FP-tree algorithm Conclusion Reference. Motivation.
CSCI6405 class project Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen AR mining
Outline
• Motivation
• Dataset
• Apriori-based hash tree algorithm
• FP-tree algorithm
• Conclusion
• References
Motivation
• Make the time to generate rules as short as possible!
• Understand the three algorithms:
  • Apriori algorithm
  • Apriori with hash tree algorithm
  • FP-tree algorithm
• Learn how to improve an algorithm
Dataset
• IBM dataset generator
  • Can set the number of items
  • Can set the minimal support
  • Can set the dataset size
• Sample rows (Tid: items)
  1: 2 5 8 9
  2: 3 4 6 7 12
Apriori principle
• A candidate generation-and-test approach [4]
• Every subset of a frequent itemset must be frequent
• If an itemset is infrequent, its supersets are not generated or tested
• But some parts can still be improved:
  • Counting the support
  • Number of I/O scans
Apriori Hash Tree Alg
• The number of candidate k-itemsets is l
• There are n transactions
• The average transaction size is m
• Cost of calculating the support count:
  • Original Apriori Alg: [formula missing]
  • With hash tree: $O\left(n \cdot \log(l) \cdot \binom{m}{k}\right)$
Apriori Hash Tree Alg
• Candidates are stored in a hash tree structure
[Figure: 1-itemset candidate hash tree; node labels 1(2), 1(1), 1(2), 2(1), 3(1), 2(1), 3(1)]
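A minimal sketch of storing candidates in a hash tree and counting support through it. Everything here is an assumption for illustration rather than the project's implementation: the `HashTree` class, integer item ids, modulo hashing with a fanout of 3, leaves that split once they hold more than two candidates, and the made-up transactions. Each k-subset of a transaction is looked up by walking a single hash path instead of being compared against every candidate.

```python
from itertools import combinations


class HashTree:
    """Toy hash tree for candidate k-itemsets (all parameters are illustrative)."""

    def __init__(self, k, fanout=3, leaf_cap=2):
        self.k = k
        self.fanout = fanout
        self.leaf_cap = leaf_cap
        self.root = self._new_leaf()

    @staticmethod
    def _new_leaf():
        # A leaf has children=None and maps candidate tuple -> support count.
        return {"counts": {}, "children": None}

    def insert(self, candidate):
        self._insert(self.root, tuple(sorted(candidate)), 0, 0)

    def _insert(self, node, cand, depth, count):
        if node["children"] is not None:
            # Interior node: hash on the item at this depth.
            bucket = cand[depth] % self.fanout
            child = node["children"].setdefault(bucket, self._new_leaf())
            self._insert(child, cand, depth + 1, count)
            return
        node["counts"][cand] = count
        if len(node["counts"]) > self.leaf_cap and depth < self.k:
            # Overfull leaf: turn it into an interior node and redistribute.
            stored, node["counts"], node["children"] = node["counts"], {}, {}
            for c, n in stored.items():
                self._insert(node, c, depth, n)

    def count(self, transaction):
        """Increment the count of every candidate contained in the transaction."""
        for subset in combinations(sorted(set(transaction)), self.k):
            node, depth = self.root, 0
            while node is not None and node["children"] is not None:
                node = node["children"].get(subset[depth] % self.fanout)
                depth += 1
            if node is not None and subset in node["counts"]:
                node["counts"][subset] += 1

    def supports(self):
        """Collect candidate -> count from all leaves."""
        out, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            if node["children"] is None:
                out.update(node["counts"])
            else:
                stack.extend(node["children"].values())
        return out


# Made-up transactions and candidates, not the project's dataset.
transactions = [[1, 2, 3], [1, 2, 4], [2, 3], [1, 3, 4]]
tree = HashTree(k=2)
for cand in combinations([1, 2, 3, 4], 2):
    tree.insert(cand)
for t in transactions:
    tree.count(t)
print(tree.supports())
```

Enumerating the k-subsets of each transaction keeps the lookup simple; a fuller implementation would usually prune subsets while descending the tree.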
Apriori Hash Tree Alg
1-itemsets, min support = 2, shown as item(count):
1(3) 2(4) 3(3) 4(1) 5(1) 6(3)
Apriori Hash Tree Alg
2-itemsets, min support = 2, shown as itemset(count):
{2, 3}(2) {3, 6}(2) {2, 6}(1) {1, 2}(2) {1, 3}(2) {1, 6}(1)
3-itemsets, min support = 2:
{1, 2, 3}(1)
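The step from the frequent 2-itemsets to the single 3-itemset candidate on this slide can be sketched as a join followed by Apriori pruning. The function name `apriori_gen` and the frozenset representation are assumptions for illustration; the input is the set of 2-itemsets whose count on the slide reaches the minimum support of 2.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Join frequent (k-1)-itemsets into k-itemset candidates, then prune
    every candidate that has an infrequent (k-1)-subset."""
    prev = set(frequent_prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            # Apriori principle: all (k-1)-subsets must themselves be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# 2-itemsets from the slide with count >= 2.
frequent_2 = {frozenset(p) for p in [(2, 3), (3, 6), (1, 2), (1, 3)]}
print(apriori_gen(frequent_2))
```

{2, 3, 6} and {1, 3, 6} are never counted because {2, 6} and {1, 6} are infrequent, which leaves only {1, 2, 3} to test against the data.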
FP-tree
• The dataset to be mined is usually very large, so it is impossible to read all transactions into memory at once.
• But I/O scans are very time-consuming.
• The FP-tree algorithm tries to fit all the information from the dataset into memory, so it needs only two I/O scans.
FP-tree
• FP-tree algorithm and implementation
• By Xiaobo Chen
FP-tree (Frequent Pattern Tree)
• Mines frequent patterns without candidate generation
• Divide-and-conquer methodology: decompose mining tasks into smaller ones
FP-tree (Merits of FP-tree algorithm)
• Makes the most of common shared prefixes
• Complete and compact
  • All information of a transaction is stored in a path
  • The size is constrained by the dataset; consequently, the longest path corresponds to the longest pattern
  • The compression ratio can be over 100
FP-tree (Construction of FP-tree)
min_support = 3

TID   Frequent items bought (in frequency order)
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

Tree after inserting TID 100:
root → f:1 → c:1 → a:1 → m:1 → p:1
FP-tree (Construction, cont'd)
Tree after inserting TID 200 {f, c, a, b, m} (both branches share the prefix f:2 → c:2 → a:2):
root → f:2 → c:2 → a:2 → m:1 → p:1
root → f:2 → c:2 → a:2 → b:1 → m:1
FP-tree (Construction, cont'd)
min_support = 3

Header table (each entry's head pointer links to that item's nodes in the tree)
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

Final tree after inserting all five transactions (shared prefixes are repeated on each branch):
root → f:4 → c:3 → a:3 → m:2 → p:2
root → f:4 → c:3 → a:3 → b:1 → m:1
root → f:4 → b:1
root → c:1 → b:1 → p:1
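The construction walked through on these slides can be sketched as follows. `FPNode`, `build_fp_tree`, and `leaf_paths` are hypothetical names, and the header table is simplified to a plain list of nodes per item instead of linked node pointers. The item order f, c, a, b, m, p is passed in explicitly because ties between equal frequencies (f and c, or a, b, m, p) can be broken arbitrarily, and the slides use this particular order.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, min_support, order):
    # Scan 1: count item frequencies and keep only the frequent items.
    counts = Counter(item for t in transactions for item in set(t))
    rank = {item: r for r, item in enumerate(order) if counts[item] >= min_support}
    root = FPNode(None)
    header = {item: [] for item in rank}  # stands in for the node-link chains
    # Scan 2: insert each transaction with its frequent items in the fixed order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def leaf_paths(node, prefix=()):
    """Yield each root-to-leaf branch as a string of item:count labels."""
    label = prefix + ((f"{node.item}:{node.count}",) if node.item is not None else ())
    if not node.children:
        yield " ".join(label)
    for child in node.children.values():
        yield from leaf_paths(child, label)


transactions = [list("fcamp"), list("fcabm"), list("fb"), list("cpb"), list("fcamp")]
root, header = build_fp_tree(transactions, min_support=3, order="fcabmp")
for path in sorted(leaf_paths(root)):
    print(path)
```

The two passes over `transactions` correspond to the two I/O scans: one to count items, one to insert the ordered transactions.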
FP-tree (Mining Frequent Patterns Using the FP-tree)
• General idea (divide-and-conquer)
  • Recursively grow frequent pattern paths using the FP-tree
• Method
  • For each item, construct its conditional pattern base, and then its conditional FP-tree
  • Repeat the process on each newly created conditional FP-tree
  • Stop when the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Start with the last item in order (i.e., p).
• Follow node pointers and traverse only the paths containing p.
• Accumulate all transformed prefix paths of that item to form a conditional pattern base.
Paths containing p:
root → f:4 → c:3 → a:3 → m:2 → p:2
root → c:1 → b:1 → p:1
Conditional pattern base for p: fcam:2, cb:1
Constructing a new FP-tree from this pattern base leads to only one branch, c:3. Thus we derive only one frequent pattern containing p besides p itself: cp.
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Move to the next least frequent item in order, i.e., m.
• Follow node pointers and traverse only the paths containing m.
• Accumulate all transformed prefix paths of that item to form a conditional pattern base.
Paths containing m:
root → f:4 → c:3 → a:3 → m:2
root → f:4 → c:3 → a:3 → b:1 → m:1
Conditional pattern base for m: fca:2, fcab:1
Constructing a new FP-tree from this pattern base leads to the single path fca:3. From this we derive the frequent patterns fcam, fcm, fam, cam, fm, cm, am.
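When a conditional FP-tree is a single path, its patterns are all non-empty combinations of the nodes on that path, each extended by the current suffix. A small sketch for the path fca:3 with suffix m follows; `single_path_patterns` is a hypothetical helper name. Enumerating every combination of f, c, a with m gives seven patterns: fm, cm, am, fcm, fam, cam, fcam.

```python
from itertools import combinations


def single_path_patterns(path, suffix):
    """path: list of (item, count) along one branch; suffix: tuple of items.
    Yields (pattern, support) for every non-empty combination of path nodes."""
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = tuple(item for item, _ in combo) + suffix
            # The support of a combination is the smallest count among its nodes.
            support = min(count for _, count in combo)
            yield items, support


for items, support in single_path_patterns([("f", 3), ("c", 3), ("a", 3)], ("m",)):
    print("".join(items), support)
```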
FP-tree (Conditional Pattern-Bases for the example)

Item  Conditional pattern-base     Conditional FP-tree
p     {(fcam:2), (cb:1)}           {(c:3)}|p
m     {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}      Empty
a     {(fc:3)}                     {(f:3, c:3)}|a
c     {(f:3)}                      {(f:3)}|c
f     Empty                        Empty
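The table can be reproduced with a short recursive sketch. As a simplifying assumption, each conditional FP-tree is represented only by its pattern base (a list of prefix paths with counts) rather than as a linked tree, so this follows the same recursion as FP-growth without the tree data structure. `conditional_base` and `pattern_growth` are hypothetical names, and the input transactions are the ordered frequent-item lists from the construction slides.

```python
from collections import Counter


def conditional_base(pattern_base, item):
    """Prefix paths that precede `item`, with equal prefixes merged."""
    merged = Counter()
    for path, n in pattern_base:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                merged[prefix] += n
    return merged


def pattern_growth(pattern_base, min_support, suffix=()):
    """Yield (pattern, support) pairs. Every path must list its items in the
    same global order (here f, c, a, b, m, p)."""
    pattern_base = list(pattern_base)
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = (item,) + suffix
        yield pattern, support
        # Recurse on the conditional pattern base of the grown pattern.
        yield from pattern_growth(
            conditional_base(pattern_base, item).items(), min_support, pattern
        )


base = [(tuple(t), 1) for t in ["fcamp", "fcabm", "fb", "cbp", "fcamp"]]
print(dict(conditional_base(base, "p")))
for pattern, support in sorted(pattern_growth(base, 3)):
    print("".join(pattern), support)
```

Under these assumptions, `conditional_base(base, "p")` gives fcam:2 and cb:1, matching the first row of the table, and b contributes no pattern beyond itself because no item in its pattern base reaches the minimum support of 3.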
FP-tree (Why is Frequent Pattern Growth fast?)
• Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning:
  • No candidate generation, no candidate test
  • Uses a compact data structure
  • Eliminates repeated database scans
  • Basic operations are counting and FP-tree building
FP-tree: Expected result
[Figure: FP-growth vs. Apriori, scalability with the support threshold]
Conclusion
• FP-tree is faster than the other two algorithms.
• Apriori and the hash tree algorithm are easier to implement.
• We can easily combine them with other methods or tools (e.g., distributed or parallel computing).
• The parameters of the dataset are very important too:
  • Density, size, min support, …
References
• [1] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
• [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. ACM SIGMOD, 2000.
• [3] N. Mamoulis. Advanced Database Technologies (slides).
• [4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.