Frequent-Pattern Tree
Bottleneck of Frequent-Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
• To find the frequent itemset i1 i2 … i100:
• # of scans: 100
• # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
• Bottleneck: candidate generation and test
• Can we avoid candidate generation?
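As a quick check of that arithmetic (a small Python snippet, not part of the original slides):

```python
import math

# All non-empty subsets of 100 items are potential candidates.
candidates = sum(math.comb(100, k) for k in range(1, 101))
assert candidates == 2**100 - 1
print(f"{candidates:.3e}")  # about 1.268e+30
```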
Mining Frequent Patterns Without Candidate Generation
• Grow long patterns from short ones using local frequent items
• "abc" is a frequent pattern
• Get all transactions containing "abc": DB|abc (the projected database on "abc")
• If "d" is a local frequent item in DB|abc, then "abcd" is a frequent pattern
• Get all transactions containing "abcd" (the projected database on "abcd") and find longer itemsets
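A minimal sketch of this projected-database idea (the transaction list mirrors the example used later; the function names are my own, not from the slides):

```python
# Project the database on a prefix pattern, then look for locally frequent items.

DB = [
    {'f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'},
    {'a', 'b', 'c', 'f', 'l', 'm', 'o'},
    {'b', 'f', 'h', 'j', 'o'},
    {'b', 'c', 'k', 's', 'p'},
    {'a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'},
]

def project(db, pattern):
    """Return DB|pattern: the transactions that contain every item of `pattern`."""
    return [t for t in db if pattern <= t]

def local_frequent_items(projected_db, pattern, min_count):
    """Items (outside the pattern itself) that are frequent within the projected database."""
    counts = {}
    for t in projected_db:
        for item in t - pattern:
            counts[item] = counts.get(item, 0) + 1
    return {item for item, c in counts.items() if c >= min_count}

db_fc = project(DB, {'f', 'c'})                      # transactions containing "fc"
print(local_frequent_items(db_fc, {'f', 'c'}, 3))    # {'a', 'm'}: fca and fcm can be grown further
```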
Mining Frequent Patterns Without Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
• Highly condensed, but complete for frequent pattern mining
• Avoids costly database scans
• Develop an efficient FP-tree-based frequent pattern mining method
• A divide-and-conquer methodology: decompose mining tasks into smaller ones
• Avoid candidate generation: examine the sub-database (conditional pattern base) only!
Construct FP-tree from a Transaction DB (min_sup = 50%)

TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Steps:
• Scan the DB once and find the frequent 1-itemsets (single-item patterns)
• Order the frequent items in frequency-descending order: f, c, a, b, m, p (the L-order)
• Process the DB based on the L-order
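A sketch of this first scan and L-ordering in Python (variable names are illustrative):

```python
from collections import Counter

# First DB scan: count single items, keep those meeting min_sup, and fix the
# frequency-descending L-order used to reorder every transaction.
DB = [
    ['f','a','c','d','g','i','m','p'],
    ['a','b','c','f','l','m','o'],
    ['b','f','h','j','o'],
    ['b','c','k','s','p'],
    ['a','f','c','e','l','p','m','n'],
]
MIN_COUNT = 3   # min_sup = 50% of 5 transactions

counts = Counter(item for t in DB for item in t)
l_order = [item for item, c in counts.most_common() if c >= MIN_COUNT]
rank = {item: i for i, item in enumerate(l_order)}

ordered_db = [sorted((x for x in t if x in rank), key=rank.get) for t in DB]
print(l_order)        # f, c, a, b, m, p, up to tie-breaking among equally frequent items
print(ordered_db[0])  # ['f', 'c', 'a', 'm', 'p']
```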
Construct FP-tree from a Transaction DB (insertion steps)
Start from the initial FP-tree: a root {} and a header table listing f, c, a, b, m, p, each with frequency 0 and a nil node-link.
• Insert {f, c, a, m, p}: new path {} → f:1 → c:1 → a:1 → m:1 → p:1
• Insert {f, c, a, b, m}: shares prefix f, c, a (now f:2, c:2, a:2); branches into b:1 → m:1
• Insert {f, b}: shares prefix f (now f:3); branches into b:1
• Insert {c, b, p}: shares only the root; new path c:1 → b:1 → p:1
• Insert {f, c, a, m, p}: shares the whole existing path (now f:4, c:3, a:3, m:2, p:2)
Final header table frequencies: f:4, c:4, a:3, b:3, m:3, p:3, with each header entry's node-link chaining all tree nodes for that item.
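A compact sketch of the tree structure and insertion loop described above (class and field names are my own, not the paper's):

```python
# Nodes keep (item, count, parent, children); a header table chains all nodes
# of the same item via node-links.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}      # item -> FPNode
        self.link = None        # next node carrying the same item (node-link)

def build_fptree(ordered_db):
    root = FPNode(None, None)
    header = {}                 # item -> first node in its node-link chain
    for transaction in ordered_db:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                child.link = header.get(item)   # prepend to the item's node-link chain
                header[item] = child
            child.count += 1
            node = child
    return root, header

# Transactions already restricted to frequent items and sorted in L-order
ordered_db = [
    ['f','c','a','m','p'],
    ['f','c','a','b','m'],
    ['f','b'],
    ['c','b','p'],
    ['f','c','a','m','p'],
]
root, header = build_fptree(ordered_db)
print(root.children['f'].count)   # 4
```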
Benefits of the FP-tree Structure
• Completeness:
• Preserves the complete DB information needed for frequent pattern mining (given a prior min support)
• Each transaction is mapped to one FP-tree path, with counts stored at each node
• Compactness:
• One FP-tree path may correspond to multiple transactions; the tree is never larger than the original database (not counting node-links and counts)
• Irrelevant information is reduced: infrequent items are gone
• Frequency-descending ordering: more frequent items are closer to the top of the tree and more likely to be shared
How Effective Is FP-tree? Dataset: Connect-4 (a dense dataset)
Mining Frequent Patterns Using the FP-tree
• General idea (divide and conquer): recursively grow frequent patterns along FP-tree paths
• Frequent patterns can be partitioned into subsets according to the L-order (f-c-a-b-m-p):
• Patterns containing p
• Patterns containing m but not p
• Patterns containing b but neither m nor p
• …
• Patterns containing c but none of a, b, m, p
• The pattern f
Mining Frequent Patterns Using the FP-tree
• Step 1: Construct the conditional pattern base for each item in the header table
• Step 2: Construct the conditional FP-tree from each conditional pattern base
• Step 3: Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
• If a conditional FP-tree contains a single path, simply enumerate all its patterns
Step 1: Construct the Conditional Pattern Base
• Start at the header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
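Continuing the FPNode/header sketch above, Step 1 can be expressed as a walk along each item's node-link chain that collects prefix paths with their counts (a sketch, not the paper's exact pseudocode):

```python
def prefix_path(node):
    """Items on the path from this node's parent up to (but excluding) the root."""
    path = []
    parent = node.parent
    while parent is not None and parent.item is not None:
        path.append(parent.item)
        parent = parent.parent
    path.reverse()
    return path

def conditional_pattern_base(header, item):
    base = []                        # list of (prefix_path, count)
    node = header.get(item)
    while node is not None:          # follow the node-link chain
        path = prefix_path(node)
        if path:
            base.append((path, node.count))
        node = node.link
    return base

# With the example tree: conditional_pattern_base(header, 'm')
# yields [(['f','c','a','b'], 1), (['f','c','a'], 2)] in node-link order.
```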
Step 2: Construct the Conditional FP-tree
• For each conditional pattern base:
• Accumulate the count of each item in the base
• Construct an FP-tree over the frequent items of the pattern base
• Example (min_sup = 50%, 5 transactions): p's conditional pattern base is fcam:2, cb:1; only c reaches the required count of 3, so p's conditional FP-tree is the single node c:3
Mining Frequent Patterns by Creating Conditional Pattern Bases

Item   Conditional pattern base     Conditional FP-tree
p      {(fcam:2), (cb:1)}           {(c:3)}|p
m      {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}      Empty
a      {(fc:3)}                     {(f:3, c:3)}|a
c      {(f:3)}                      {(f:3)}|c
f      Empty                        Empty
Step 3: Recursively Mine the Conditional FP-trees

Collect all patterns that end with p:
• Suffix p(3): frequent pattern p(3); conditional pattern base fcam:2, cb:1; conditional FP-tree c(3)
  • Suffix cp(3): frequent pattern cp(3); conditional pattern base nil, so the recursion stops

Collect all patterns that end with m:
• Suffix m(3): frequent pattern m(3); conditional pattern base fca:2, fcab:1; conditional FP-tree f(3) → c(3) → a(3)
  • Suffix fm(3): frequent pattern fm(3); conditional pattern base nil
  • Suffix cm(3): frequent pattern cm(3); conditional pattern base f:3; conditional FP-tree f(3)
    • Suffix fcm(3): frequent pattern fcm(3); conditional pattern base nil
  • Suffix am(3): frequent pattern am(3); conditional pattern base fc:3; conditional FP-tree f(3) → c(3)
    • Suffix cam(3): frequent pattern cam(3); conditional pattern base f:3; conditional FP-tree f(3)
      • Suffix fcam(3): frequent pattern fcam(3); conditional pattern base nil
    • Suffix fam(3): frequent pattern fam(3); conditional pattern base nil
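Putting the three steps together, here is a sketch of the recursion (it reuses build_fptree and conditional_pattern_base from the earlier sketches; a simplified illustration, not the original FP-growth pseudocode):

```python
def fp_growth(header, suffix, min_count, results):
    for item in list(header):
        # support of suffix extended by `item` = sum of counts along its node-links
        support = 0
        node = header[item]
        while node is not None:
            support += node.count
            node = node.link
        if support < min_count:
            continue
        pattern = [item] + suffix
        results[tuple(pattern)] = support
        # build the conditional FP-tree from the conditional pattern base and recurse
        base = conditional_pattern_base(header, item)
        cond_db = []
        for path, count in base:
            cond_db.extend([path] * count)   # replicate each prefix path `count` times (fine for a sketch)
        _, cond_header = build_fptree(cond_db)
        if cond_header:
            fp_growth(cond_header, pattern, min_count, results)

results = {}
fp_growth(header, [], 3, results)
print(results.get(('c', 'p')))   # 3: "cp" is frequent, matching the trace above
```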
FP-growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K
Why Is Frequent-Pattern Growth Fast?
• Performance studies show that FP-growth is an order of magnitude faster than Apriori
• Reasons:
• No candidate generation, no candidate test
• Uses a compact data structure
• Eliminates repeated database scans
• The basic operations are counting and FP-tree building
Weaknesses of FP-growth
• Support-dependent: cannot accommodate a dynamic support threshold
• Cannot accommodate incremental DB updates
• Mining requires recursive operations
Maximal Patterns and the Border
(Itemset lattice over {A, B, C, D, E}, from the empty set up to ABCDE, with the border separating the frequent itemsets from the infrequent ones.)
• Maximal patterns: a frequent itemset X is maximal if none of its supersets is frequent
• In the example there are 20 frequent patterns, but only 3 maximal patterns!
• Those 3 maximal patterns can represent all 20 patterns
• Maximal patterns = {AD, ACE, BCDE}
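A small sketch of how the maximal patterns compress the full collection (the 20 frequent itemsets are generated here as the subsets of the three maximal ones; helper names are mine):

```python
from itertools import combinations

# Rebuild the 20 frequent itemsets of the example as all non-empty subsets
# of the maximal patterns AD, ACE, BCDE, then recover the maximal ones.
maximal_seed = [frozenset('AD'), frozenset('ACE'), frozenset('BCDE')]
frequent = {frozenset(s)
            for m in maximal_seed
            for r in range(1, len(m) + 1)
            for s in combinations(sorted(m), r)}
print(len(frequent))          # 20

def maximal_patterns(freq):
    # keep only itemsets with no frequent proper superset
    return {x for x in freq if not any(x < y for y in freq)}

print(sorted(''.join(sorted(m)) for m in maximal_patterns(frequent)))
# ['ACE', 'AD', 'BCDE']
```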
Closed Itemsets
• An itemset X is closed if there exists no item y (y ∉ X) such that every transaction containing X also contains y
• Example:
• AC is not closed, since every transaction containing AC also contains W
• CDW is closed, since the transaction with Tid = 2 contains no other item
Frequent Closed Patterns
• For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern
• Example (min_sup = 2): "acdf" is a frequent closed pattern
• A concise representation of the frequent patterns: reduces the number of patterns and rules
• N. Pasquier et al., ICDT'99
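A sketch of the closed-pattern filter given itemset supports (the support counts below are hypothetical, chosen only to mirror the AC/W relationship mentioned above, not taken from the slides' dataset):

```python
# An itemset is closed if no proper superset has exactly the same support.
# `freq` maps frequent itemsets to hypothetical support counts.
freq = {
    frozenset('A'): 4, frozenset('C'): 4, frozenset('W'): 5,
    frozenset('AC'): 4, frozenset('AW'): 4, frozenset('CW'): 4,
    frozenset('ACW'): 4,
}

def closed_patterns(freq):
    return {x: s for x, s in freq.items()
            if not any(x < y and freq[y] == s for y in freq)}

for itemset, support in sorted(closed_patterns(freq).items(), key=lambda kv: sorted(kv[0])):
    print(''.join(sorted(itemset)), support)
# ACW 4   (AC is not closed: ACW has the same support)
# W 5
```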