The FP-Growth/Apriori Debate
Jeffrey R. Ellis
CSE 300 – 01
April 11, 2002
Presentation Overview
• FP-Growth Algorithm
  • Refresher & Example
  • Motivation
• FP-Growth Complexity
  • Vs. Apriori Complexity
  • Saving calculation or hiding work?
• Real World Application
  • Datasets are not created equal
  • Results of real-world implementation
FP-Growth Algorithm
• Association Rule Mining
  • Generate Frequent Itemsets
    • Apriori generates candidate sets
    • FP-Growth uses specialized data structures (no candidate sets)
  • Find Association Rules
    • Outside the scope of both FP-Growth & Apriori
• Therefore, FP-Growth is a competitor to Apriori
FP-Growth Example

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree:
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
FP-Growth Example (conditional FP-trees)

Item | Conditional pattern-base  | Conditional FP-tree
p    | {(fcam:2), (cb:1)}        | {(c:3)}|p
m    | {(fca:2), (fcab:1)}       | {(f:3, c:3, a:3)}|m
b    | {(fca:1), (f:1), (c:1)}   | Empty
a    | {(fc:3)}                  | {(f:3, c:3)}|a
c    | {(f:3)}                   | {(f:3)}|c
f    | Empty                     | Empty
FP-Growth Algorithm: FP-CreateTree

Input: DB, min_support
Output: FP-Tree

  Scan DB & count all frequent items.
  Create null root & set as current node.
  For each transaction T:
    Sort T's items by descending frequency.
    For each sorted item I:
      Insert I into tree as a child of current node.
      Connect new tree node to header list.

• Two passes through DB.
• Tree creation work is proportional to the number of items in DB.
• Complexity of CreateTree is O(|DB|).
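A minimal Python sketch of this two-pass construction (the names FPNode and build_fptree are illustrative, not from the original paper; the alphabetical tie-break on equal counts is an arbitrary choice):

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item, a count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> FPNode

def build_fptree(db, min_support):
    # Pass 1: count item frequencies and keep only frequent items.
    counts = defaultdict(int)
    for tx in db:
        for item in tx:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}

    # Pass 2: insert each transaction with its frequent items sorted
    # by descending frequency (ties broken alphabetically).
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> node-link list
    for tx in db:
        node = root
        for item in sorted((i for i in tx if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```

The header table maps each item to its node-link list, which the mining step that follows walks to collect pattern bases.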
FP-Growth Algorithm: FP-Growth

Input: FP-Tree, f_is (frequent itemset suffix)
Output: All freq_patterns

  if FP-Tree contains a single path P then
    for each combination of nodes in P
      generate pattern (f_is ∪ combination)
  else
    for each item i in header
      generate pattern (f_is ∪ i)
      construct pattern's conditional FP-tree (cFP-Tree)
      if (cFP-Tree ≠ null) then
        call FP-Growth(cFP-Tree, pattern)

• Recursive algorithm creates FP-Tree structures and calls FP-Growth.
• Claims no candidate generation.
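Continuing the sketch above, a simplified version of the recursive mining step (the single-path shortcut is omitted, and the conditional pattern base is expanded naively by repeating each prefix path count times; a real implementation would carry counts instead):

```python
def fp_growth(header, min_support, suffix, results):
    """Mine all frequent itemsets containing `suffix` via the header table."""
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        if support < min_support:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = support

        # Conditional pattern base: walk from each node up to the root.
        cond_db = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * n.count)    # naive: repeat per count

        # Build the conditional FP-tree and recurse while it is non-empty.
        _, cheader = build_fptree(cond_db, min_support)
        if cheader:
            fp_growth(cheader, min_support, pattern, results)

# Usage on the five-transaction example above (min_support = 3):
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
_, header = build_fptree(db, 3)
patterns = {}
fp_growth(header, 3, frozenset(), patterns)
print(patterns[frozenset("cp")])     # 3, matching {(c:3)}|p in the table
```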
Two-Part Algorithm: Single Path

  if FP-Tree contains a single path P then
    for each combination of nodes in P
      generate pattern (f_is ∪ combination)

• e.g., single path A → B → C → D
• p = |pattern| = 4
• Patterns generated: AD, BD, CD, ABD, ACD, BCD, ABCD
• Count: Σ (n = 1 to p−1) of C(p−1, n) = 2^(p−1) − 1 = 7
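A quick way to sanity-check the count (single_path_patterns is a hypothetical helper, not deck code):

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    # Every non-empty combination of path nodes, joined with the suffix.
    for n in range(1, len(path) + 1):
        for combo in combinations(path, n):
            yield frozenset(combo) | frozenset(suffix)

# The slide's example: path A-B-C with suffix D gives
# 2**3 - 1 = 7 patterns: AD, BD, CD, ABD, ACD, BCD, ABCD.
pats = sorted("".join(sorted(p)) for p in single_path_patterns("ABC", "D"))
print(len(pats), pats)   # 7 ['ABCD', 'ABD', 'ACD', 'AD', 'BCD', 'BD', 'CD']
```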
Two-Part Algorithm: Multiple Paths

  else
    for each item i in header
      generate pattern (f_is ∪ i)
      construct pattern's conditional cFP-Tree
      if (cFP-Tree ≠ null) then
        call FP-Growth(cFP-Tree, pattern)

[Figure: FP-tree containing multiple paths over items A–E]

• e.g., items { A B C D E } with f_is = D:
  • i = A: pattern base = (ABCD), (ACD), (AD)
  • i = B: pattern base = (ABCD)
  • i = C: pattern base = (ABCD), (ACD)
  • i = E: pattern base = null
• Pattern bases are generated by following each header item's node-links to every tree node holding that item, then walking each such node's path up to the tree root (the cond_db loop in the sketch above).
FP-Growth Complexity
• Each path in the tree is at least partially traversed once per header item, and the traversal cost per path is proportional to its depth.
• Searching through all paths is then bounded by O(header_count² × tree depth).
• Creation of a new cFP-Tree also occurs at each recursive step.
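Written out, with P the number of tree paths, |H| the number of header items, and d the maximum path depth, the slide's bound reads as follows (note it implicitly assumes the number of paths is on the order of the header count):

```latex
\text{search cost} \;=\; O\!\left(P \cdot |H| \cdot d\right)
\;=\; O\!\left(|H|^2 \cdot d\right) \quad \text{if } P = O(|H|)
```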
Sample Data FP-Tree
[Figure: a sample FP-tree rooted at null, with root branches A … J and repeated K, L, M nodes along many paths, illustrating how a single item can occur in many tree paths.]
Algorithm Results (in English)
• Candidate generation sets are exchanged for FP-Trees.
• You MUST take into account all paths that contain an itemset with a test item.
• You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
• Trivial examples hide these assertions, leading to a belief that FP-Growth operates more efficiently.
FP-Growth Mining Example
• Null (single-item) transactions: A (49), B (44), F (39), G (34), H (30), I (20)
• Multi-item transactions: ABGM, BGIM, FGIM, GHIM
• Header: A B F G H I M

FP-tree:
null
├─ A:50 ─ B:1 ─ G:1 ─ M:1
├─ B:45 ─ G:1 ─ I:1 ─ M:1
├─ F:40 ─ G:1 ─ I:1 ─ M:1
├─ G:35 ─ H:1 ─ I:1 ─ M:1
├─ H:30
└─ I:20
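For readers following along in code, this dataset can be fed to the earlier sketch (build_fptree / fp_growth are the illustrative helpers from above; note that storing results in a dict deduplicates the redundant patterns the hand trace below records):

```python
# 49 A-only, 44 B-only, ... transactions, plus the four multi-item ones.
null_tx = ([["A"]] * 49 + [["B"]] * 44 + [["F"]] * 39 +
           [["G"]] * 34 + [["H"]] * 30 + [["I"]] * 20)
db = null_tx + [list("ABGM"), list("BGIM"), list("FGIM"), list("GHIM")]

_, header = build_fptree(db, 2)
patterns = {}
fp_growth(header, 2, frozenset(), patterns)
multi = {"".join(sorted(k)): v for k, v in patterns.items() if len(k) > 1}
print(multi)   # expect BM, GM, IM, GI, BG, BGM, GIM (each with support >= 2)
```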
FP-Growth Mining Example (freq_itemset = { M }, full FP-tree above)
• i = A: pattern = { A M }, support = 1 → pattern base cFP-Tree = null
• i = B: pattern = { B M }, support = 2 → pattern base cFP-Tree built below
Conditional FP-tree for freq_itemset = { B M }: single path G:2 → B:2 → M:2 (header: G B M)
• Patterns mined: BM, GM, BGM
• All patterns so far: BM, GM, BGM
• i = F: pattern = { F M }, support = 1 → pattern base cFP-Tree = null
• i = G: pattern = { G M }, support = 4 → pattern base cFP-Tree built below
Conditional FP-tree for freq_itemset = { G M } (header: I B G M; nodes I:3, B:1, B:1, G:2, G:1, G:1, M:2, M:1, M:1)
• i = I: pattern = { I G M }, support = 3 → conditional tree is the single path I:3 → G:3 → M:3 (header: I G M)
  • Patterns mined: IM, GM, IGM
  • All patterns so far: BM, GM, BGM, IM, GM, IGM
• i = B: pattern = { B G M }, support = 2 → conditional tree is the single path B:2 → G:2 → M:2 (header: B G M)
  • Patterns mined: BM, GM, BGM
  • All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM
Back in the full FP-tree (freq_itemset = { M }):
• i = H: pattern = { H M }, support = 1 → pattern base cFP-Tree = null
FP-Growth Mining Example
• Complete pass for { M }
• Move on to { I }, { H }, etc.
• Final frequent sets:
  • L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  • L3: { BGM, GIM, BGM, GIM }
  • L4: None
FP-Growth Redundancy vs. Apriori Candidate Sets
• FP-Growth (support = 2) generates:
  • L2: 10 sets (5 distinct)
  • L3: 4 sets (2 distinct)
  • Total: 14 sets
• Apriori (support = 2) generates:
  • C2: 21 sets
  • C3: 2 sets
  • Total: 23 sets
• What about support = 1?
  • Apriori: 23 sets
  • FP-Growth: 28 sets in { M } with i = A, B, F alone!
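The Apriori C2 figure checks out directly: at support = 2 all seven 1-itemsets (A, B, F, G, H, I, M) are frequent, and candidate generation pairs them:

```latex
|C_2| \;=\; \binom{7}{2} \;=\; 21
```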
FP-Growth vs. Apriori
• Apriori visits each transaction when generating a new candidate set; FP-Growth does not
  • Data structures can be used to reduce the transaction list
• FP-Growth traces the set of concurrent items; Apriori generates candidate sets
• FP-Growth uses more complicated data structures & mining techniques
Algorithm Analysis Results
• FP-Growth IS NOT inherently faster than Apriori
  • Intuitively, it appears to condense data
  • Its mining scheme requires new work to replace candidate set generation
  • Recursion obscures the additional effort
• FP-Growth may run faster than Apriori in some circumstances
• Complexity analysis alone cannot tell us which algorithm is more efficient
Improvements to FP-Growth
• Little currently reported, aside from MLFPT
• MLFPT (Multiple Local Frequent Pattern Tree)
  • New algorithm based on FP-Growth
  • Distributes FP-Trees among processors
  • No reported complexity analysis or accuracy comparison against FP-Growth
Real World Applications
• Zheng, Kohavi, Mason – "Real World Performance of Association Rule Algorithms"
• Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus
• Tested the implementations against 1 artificial and 3 real data sets
• Generated time-based comparisons
Apriori & FP-Growth
• Apriori
  • Implementation by Christian Borgelt (GNU Public License)
  • Written in C
  • Entire dataset loaded into memory
• FP-Growth
  • Implementation from creators Han & Pei
  • Version: February 5, 2001
Other Algorithms
• CHARM
  • Based on the concept of closed itemsets
  • e.g., a closed itemset { A, B, C } stands in for ABC, AB, AC, BC, etc.
• CLOSET
  • Han & Pei's closed-itemset implementation
• MagnumOpus
  • Generates rules directly through a search-and-prune technique
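For reference, "closed" here has its standard meaning: an itemset is closed if no proper superset has the same support. A tiny hypothetical checker, usable with the patterns dict from the earlier sketch:

```python
def is_closed(itemset, support, all_supports):
    """An itemset is closed iff no proper superset has equal support."""
    return not any(s > itemset and sup == support
                   for s, sup in all_supports.items())
```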
Datasets
• IBM-Artificial
  • Generated at IBM Almaden (T10I4D100K)
  • Often used in association rule mining studies
• BMS-POS
  • Years of point-of-sale data from a retailer
• BMS-WebView-1 & BMS-WebView-2
  • Months of clickstream traffic from e-commerce web sites
Experimental Considerations
• Hardware: dual 550MHz Pentium III Xeon processors, 1GB memory
• Support thresholds: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
• Confidence = 0%
• No other applications running (second processor handles system processes)
[Charts: runtime vs. support threshold for the IBM-Artificial, BMS-POS, BMS-WebView-1, and BMS-WebView-2 datasets.]
Study Results – Real Data
• At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
• At support < 0.20%, Apriori completes whenever FP-Growth completes
  • Exception: BMS-WebView-2 @ 0.01%
• Even when 2 million rules are generated, Apriori finishes in 10 minutes or less
• Proposed: the bottleneck is NOT the rule algorithm, but rule analysis
Study Results – Artificial Data
• At support < 0.40%, FP-Growth performs MUCH faster than Apriori
• At support ≥ 0.40%, FP-Growth and Apriori are comparable
Real-World Study Conclusions
• FP-Growth (and other non-Apriori algorithms) perform better on artificial data
• On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets
• FP-Growth may be suitable when low support, large result counts, and fast generation are needed
• Future research may best be directed toward analyzing association rules
Research Conclusions
• FP-Growth does not have better complexity than Apriori
  • Intuition suggests it should run faster, but the analysis does not bear this out
• FP-Growth does not always have a better running time than Apriori
  • Support threshold and dataset characteristics appear more influential
• FP-Trees are very complex structures (Apriori's are simple)
• Location of data (memory vs. disk) is a non-factor in comparing the algorithms
To Use or Not To Use?
• Question: Should FP-Growth be used in favor of Apriori? Consider:
  • Difficulty to code
  • High performance in extreme cases
  • Personal preference
• More relevant questions:
  • What kind of data is it?
  • What kind of results do I want?
  • How will I analyze the resulting rules?
References
• Han, Jiawei, Pei, Jian, and Yin, Yiwen. "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000.
• Orlando, Salvatore. "High Performance Mining of Short and Long Patterns". 2001.
• Pei, Jian, Han, Jiawei, and Mao, Runying. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
• Webb, Geoffrey I. "Efficient Search for Association Rules". In Proc. of the Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 99–107, 2000.
• Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul. "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 Int. Conf. on Data Mining (ICDM '01), San Jose, CA, USA, November 29–December 2, 2001.
• Zaki, Mohammed J. "Generating Non-Redundant Association Rules". In Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, 2000.
• Zheng, Zijian, Kohavi, Ron, and Mason, Llew. "Real World Performance of Association Rule Algorithms". In Proc. of the Seventh ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.