Intelligent Database Laboratory, NCKU

Mining Frequent Patterns without Candidate Generation
Jiawei Han, Jian Pei and Yiwen Yin, SIGMOD 2000
Advisor: Prof. Vincent S. Tseng
Presenters: Ying-Chun Lin, Zhi-Hong Yang, Jun-Han Lin and Yi-Ting Cheng
Intelligent Database Laboratory, NCKU
2014/10/23

Outline • Introduction • FP-tree construction • Mining frequent pattern using FP-tree • Comparative analysis and performance evaluation • Conclusion • Discussion

Introduction

Introduction • Frequent pattern mining plays an important role in data mining • Apriori-like candidate set generation-and-test approach • Candidate set generated () • Check the frequencies of the candidate set • Anti-monotone Apriori heuristic: if a length-k pattern is not frequent, none of its length-(k+1) super-patterns can be frequent • Problem • Costly to handle a huge number of candidate sets • Repeatedly scans the database

Frequent Pattern Tree (FP-tree) • Properties • Extended prefix-tree structure to store frequent patterns • Nodes are frequent length-1 items • More frequently occurring nodes will have better chances of sharing nodes

Frequent Pattern Mining • FP-tree-based pattern fragment growth mining method • Examine only the conditional pattern base • Construct the conditional FP-tree • Mine frequent patterns recursively using the conditional FP-tree • Not Apriori-like restricted generation-and-test but restricted test only • The search technique is a partitioning-based, divide-and-conquer method

FP-tree Construction • Definition • First Scan of Database • Construct FP-tree • Analysis of FP-tree Construction • Completeness & Compactness of FP-tree

Definition • Set of items • Transaction database • Support or occurrence frequency of a pattern A • The number of transactions containing A • Frequent pattern • If A’s support is no less than a predefined minimum support threshold

First Scan of Database Step 1: Scan DB for the first time to generate the set of frequent items F and their supports. Sort F in support-descending order as L, the list of frequent items. Minimum support: ξ = 3
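Step 1 can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' code: the five transactions are assumed to be the running example of the FP-growth paper (the slides show only part of it), the name `first_scan` is hypothetical, and ties between equally frequent items are broken alphabetically here, whereas the slides use the equally valid order f, c, a, b, m, p.

```python
from collections import Counter

# Assumed running example: the five-transaction DB of the FP-growth paper.
DB = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
MIN_SUPPORT = 3  # the slides' minimum support threshold


def first_scan(db, min_support):
    """Return (supports, order): supports of the frequent items and the list L."""
    # Count each item once per transaction that contains it.
    counts = Counter(item for trans in db for item in set(trans))
    supports = {item: n for item, n in counts.items() if n >= min_support}
    # Descending support; alphabetical tie-break is an arbitrary assumption.
    order = sorted(supports, key=lambda item: (-supports[item], item))
    return supports, order


supports, L = first_scan(DB, MIN_SUPPORT)
```

Any tie-breaking rule gives a valid L, as long as the same order is then used for every transaction in the second scan.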

Construct FP-tree Step 2: Create the root of the FP-tree, T. For each transaction Trans in DB, do the following: select and sort the frequent items in Trans according to the order of L, then insert the selected and sorted Trans into the FP-tree. Example: the transaction {f, a, c, d, g, i, m, p} in DB is selected and sorted into {f, c, a, m, p}.

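Construct FP-tree (cont.) [Figure: the FP-tree after inserting the first transaction {f, c, a, m, p}: a single path {} → f:1 → c:1 → a:1 → m:1 → p:1, next to the header table (Item, head) with items f, c, a, b, m, p.]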

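Construct FP-tree (cont.) [Figure: inserting {f, c, a, b, m}. The shared prefix is incremented to f:2, c:2, a:2, and a new branch b:1 → m:1 is added under a:2 beside the existing m:1 → p:1. Header table (Item, head) with items f, c, a, b, m, p.]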

Construct FP-tree (cont.) [Figure: inserting {f, b}, {c, b, p} and {f, c, a, m, p} in turn. {f, b} raises f to f:3 and adds b:1 under f; {c, b, p} adds a new branch c:1 → b:1 → p:1 under the root; {f, c, a, m, p} raises the first path to f:4, c:3, a:3, m:2, p:2.]

Final FP-tree
Header table (Item, head): f, c, a, b, m, p
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
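The second scan can be sketched as follows. This is an illustrative sketch under stated assumptions: the database is assumed to be the paper's five-transaction running example, the order L = f, c, a, b, m, p is taken from the slides' header table, and `Node`, `build_fp_tree` and `show` are hypothetical names. Node-links are kept as plain Python lists per item rather than as linked pointers.

```python
# Assumed running example: the five-transaction DB of the FP-growth paper.
DB = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
L = ["f", "c", "a", "b", "m", "p"]  # frequent items in the slides' order


class Node:
    """One FP-tree node: an item, its count, its parent and its children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(db, order):
    """Insert every transaction, filtered and sorted by `order`, into a tree."""
    rank = {item: i for i, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}  # item -> list of its nodes
    for trans in db:
        # Select the frequent items and sort them according to L.
        items = sorted((i for i in set(trans) if i in rank), key=rank.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # no shared prefix here: open a new branch
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def show(node, depth=0):
    """Return the tree below `node` as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


root, header = build_fp_tree(DB, L)
tree_lines = show(root)
```

Printing `"\n".join(tree_lines)` gives an indented view of the tree that can be compared with the final FP-tree on the slide.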

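Analysis of FP-tree Construction • Exactly two scans of the transaction DB • First scan: collect the frequent items • Second scan: insert transactions into the FP-tree • Cost of inserting a transaction Trans: O(|Trans|), where |Trans| is the number of frequent items in Trans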

Analysis of FP-tree Construction (Cont.) • Each transaction contributes at most one path to the FP-tree. • Sharing of frequent items usually makes the tree smaller than the original database. • The height of the tree is bounded by the maximal number of frequent items in any transaction in the DB.

Completeness & Compactness of FP-tree • The FP-tree contains the complete information of DB relevant to frequent pattern mining • The FP-tree is a highly compact structure • Record only frequent items in the FP-tree • Register the count • Start at the root, • The tree is not always constructed to maximal compactness

Compactness of FP-tree

Mining Frequent Patterns Using FP-tree • Node-link Property • Prefix Path Property • Fragment Growth • Pattern Growth • Single FP-tree Path Pattern Generation • Conditional FP-tree

Node-link Property [Figure: the final FP-tree and its header table (Item, head: f, c, a, b, m, p), shown twice; the node-links starting at each header entry chain together all tree nodes that carry the same item.]

Prefix Path Property [Figure: the two FP-tree paths that reach p: {} → f:4 → c:3 → a:3 → m:2 → p:2 and {} → c:1 → b:1 → p:1, with the header table (Item, head: f, c, a, b, m, p).] p's conditional pattern base: fcam:2, cb:1

Fragment Growth - cond. pattern base: cdgimp:1, bclmo:1, celpmn:1 Lemma (Fragment growth): Let α be an itemset in DB, B be α’s conditional pattern base, and β be an itemset in B. Then the support of α ∪ β in DB is equivalent to the support of β in B.

Pattern Growth - cond. pattern base: fcam:2, cb:1 Lemma (Pattern growth): Let α be a frequent itemset in DB, B be α’s conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB iff β is frequent in B.
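The two lemmas can be checked numerically for single-item α. The sketch below is an illustration under assumptions, not the presenters' code: the database is assumed to be the paper's five-transaction running example, L = f, c, a, b, m, p is the slides' order, and the helper names are hypothetical. It derives each conditional pattern base directly from the sorted transactions instead of walking an FP-tree, which yields the same prefix paths.

```python
from collections import Counter
from itertools import combinations

# Assumed running example: the five-transaction DB of the FP-growth paper.
DB = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
L = ["f", "c", "a", "b", "m", "p"]  # frequent items in the slides' order


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for trans in db if set(itemset) <= set(trans))


def conditional_pattern_base(item, db, order):
    """Prefix paths of `item`: the frequent items ranked before it, with counts."""
    rank = {x: i for i, x in enumerate(order)}
    base = Counter()
    for trans in db:
        if item in trans:
            prefix = tuple(sorted(
                (x for x in set(trans) if x in rank and rank[x] < rank[item]),
                key=rank.get))
            if prefix:
                base[prefix] += 1
    return base


def support_in_base(itemset, base):
    """Support of itemset inside a conditional pattern base."""
    return sum(n for prefix, n in base.items() if set(itemset) <= set(prefix))


def lemma_holds(db, order):
    """Check support(alpha + beta in DB) == support(beta in B) for single-item alpha."""
    for pos, alpha in enumerate(order):
        base = conditional_pattern_base(alpha, db, order)
        for size in range(1, pos + 1):
            for beta in combinations(order[:pos], size):
                if support(beta + (alpha,), db) != support_in_base(beta, base):
                    return False
    return True


p_base = conditional_pattern_base("p", DB, L)
```

For α = {p} and β = {c}, both sides of the fragment growth lemma equal 3, so cp is frequent at ξ = 3, in line with the pattern growth lemma.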

Single FP-tree Path Pattern Generation [Figure: an FP-tree with the single path {} → f:4 → c:3 → a:3.] Frequent patterns: fca:3, fc:3, fa:3, ca:3, f:4, c:3, a:3 Lemma (Single FP-tree path pattern generation): Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all combinations of the subpaths of P, with the support of each pattern being the minimum support of the items contained in the subpath.
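The single-path lemma translates almost directly into code. A minimal sketch, assuming the path is given as a list of (item, count) pairs from the root downwards; the function name `single_path_patterns` is hypothetical.

```python
from itertools import combinations


def single_path_patterns(path):
    """Enumerate every non-empty combination of nodes on a single path.

    Each pattern's support is the minimum count among its nodes.
    """
    patterns = {}
    for size in range(1, len(path) + 1):
        for combo in combinations(path, size):
            name = "".join(item for item, _ in combo)
            patterns[name] = min(count for _, count in combo)
    return patterns


# The single path from the slide: {} -> f:4 -> c:3 -> a:3
patterns = single_path_patterns([("f", 4), ("c", 3), ("a", 3)])
```

A path with n nodes yields 2^n - 1 patterns without any further tree traversal, which is why the single-path case ends the recursion.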

Growth
Conditional pattern bases, read from the final FP-tree and its header table (f, c, a, b, m, p):
item   cond. pattern base
p      fcam:2, cb:1
m      fca:2, fcab:1
b      fca:1, f:1, c:1
a      fc:3
c      f:3
f      { }

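Conditional FP-tree of “p” • p's conditional pattern base: fcam:2, cb:1 • p-conditional FP-tree: {} → c:3, written (c:3); only c reaches the minimum support of 3 in the base (f, a and m occur 2 times, b once) [Figure: the FP-tree paths that reach p, the derived p-conditional FP-tree, and the resulting frequent pattern.]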

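Conditional FP-tree of “m” • Header table with supports: f 4, c 4, a 3, b 3, m 3, p 3 • m's conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3; b occurs only once in the base and is dropped [Figure: the FP-tree paths that reach m and the derived m-conditional FP-tree.]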
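Putting the pieces together, the recursive mining procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the database is assumed to be the paper's five-transaction running example, `Node`, `build_tree` and `fp_growth` are hypothetical names, ties in the item order are broken alphabetically (so the trees may be laid out differently from the slides, while the mined patterns are the same), and the single-path shortcut is omitted for brevity.

```python
from collections import defaultdict

# Assumed running example: the five-transaction DB of the FP-growth paper.
DB = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_paths, min_support):
    """Build an FP-tree from (items, count) pairs; return (header, support)."""
    support = defaultdict(int)
    for items, count in weighted_paths:
        for item in set(items):
            support[item] += count
    support = {i: s for i, s in support.items() if s >= min_support}
    root = Node(None, None)
    header = defaultdict(list)  # item -> nodes carrying it (the node-links)
    for items, count in weighted_paths:
        # Keep frequent items only, sorted by descending support.
        kept = sorted((i for i in set(items) if i in support),
                      key=lambda i: (-support[i], i))
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += count
            node = child
    return header, support


def fp_growth(weighted_paths, min_support, suffix=frozenset()):
    """Return {frozenset(pattern): support} for all frequent patterns."""
    header, support = build_tree(weighted_paths, min_support)
    patterns = {}
    for item, nodes in header.items():
        grown = suffix | {item}
        patterns[grown] = support[item]
        # Conditional pattern base: the prefix path of every node of `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional FP-tree built from that base.
        patterns.update(fp_growth(base, min_support, grown))
    return patterns


patterns = fp_growth([(trans, 1) for trans in DB], min_support=3)
```

With ξ = 3 the result contains, for example, cp with support 3 from the p-conditional tree and every combination of f, c, a with m from the m-conditional tree.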

Comparative Analysis and Performance Evaluation • TreeProjection • Analysis of TreeProjection • TreeProjection and FP-growth methods • Performance study

TreeProjection • Minimum support: ξ = 3 • Item supports: p:3, m:3, b:3, a:3, c:4, f:4 • Items are taken in the reverse order of the FP-tree • All the transactions in the database are projected to the root node. • Building the matrix to find nodes [Figure: a lexicographic tree whose nodes hold projected transactions such as p,m,a,c,f and m,b,a,c,f.]

Analysis of TreeProjection • The efficiency of TreeProjection • The support counting in a relatively small search space -Transaction projection • The flexibility of lexicographical tree -Tree generation phase -Transaction projection phase • Suffer from problems • Efficiency, scalability, and implementation complexity.

TreeProjection and FP-growth methods • TreeProjection • The size of the matrices is huge • Transaction projection becomes a nontrivial cost • Needs to generate candidate 2-itemsets • One node for each frequent itemset • FP-growth • Avoids the generation of any candidate k-itemsets • Only needs the frequent 1-itemsets • Highly compact form

Performance study • Data sets • Features of data sets [Figure: two panels labelled "Numerous frequent itemsets" and "Length frequent itemsets", each plotted against support threshold (%).]

Performance study (cont.) • Comparison of FP-growth and Apriori • Scalability with threshold [Figure: two panels of run time (sec.) versus support threshold (%); series: D1 FP-growth, D1 Apriori, D2 FP-growth, D2 Apriori.]

Performance study (cont.) • Comparison of D1 and D2 • Run time per itemset versus support threshold • Running memory requirements of FP-growth [Figure: two panels with series D1 and D2: run time per itemset (sec.) versus support threshold (%), and running memory versus support threshold (%).]

Performance study (cont.) • Data: D2 • Support threshold: 1.5% • Scalability with number of transactions [Figure: run time (sec.) versus number of transactions (K); series: FP-growth, Apriori.]

Performance study (cont.) • Comparison of FP-growth and TreeProjection • Scalability with support threshold [Figure: two panels of run time (sec.) versus support threshold (%); series: D1 FP-growth, D1 TreeProjection, D2 FP-growth, D2 TreeProjection.]

Conclusion

Conclusion • Proposed the frequent pattern tree (FP-tree) • Stores compressed and crucial information about the frequent patterns • Highly compact data structure, which saves the cost of database scans • Developed a pattern growth method (FP-growth) • Mines frequent patterns in large databases • Avoids costly candidate generation • The partitioning-based divide-and-conquer method reduces the size of the conditional pattern bases and conditional FP-trees

Discussion

Discussion • Advantages • Generates frequent patterns without candidate generation • The FP-tree is a complete and compact representation of the original database • Only two scans of the database • Disadvantages • The FP-tree may not fit in memory • Execution time can be large due to the complex compact data structure • The structure may not work well for certain types of databases

Thank you for listening.
