Chapter 5: Mining Association Rules with FP Tree
Dr. Bernard Chen, Ph.D.
University of Central Arkansas, Fall 2010
Mining Frequent Itemsets without Candidate Generation • In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gains. • However, it suffers from two nontrivial costs: • It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets may generate more than 10^7 candidate 2-itemsets) • It may need to scan the database many times
Association Rules with Apriori • Minimum support = 2/9 • Minimum confidence = 70%
Bottleneck of Frequent-pattern Mining • Multiple database scans are costly • Mining long patterns needs many passes of scanning and generates lots of candidates • To find the frequent itemset i1i2…i100: • # of scans: 100 • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27*10^30 ! • Bottleneck: candidate generation and test • Can we avoid candidate generation?
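Both counts quoted above are easy to verify. A quick arithmetic check in plain Python (math.comb requires Python 3.8+):

```python
from math import comb

# 10^4 frequent 1-itemsets give ~5*10^7 candidate 2-itemsets (> 10^7):
print(comb(10**4, 2))                      # 49995000

# To find the 100-item pattern, Apriori enumerates every non-empty subset:
total = sum(comb(100, k) for k in range(1, 101))
print(total == 2**100 - 1)                 # True
print(f"{total:.3e}")                      # 1.268e+30, i.e. ~1.27*10^30
```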
Mining Frequent Patterns Without Candidate Generation • Grow long patterns from short ones using local frequent items • “abc” is a frequent pattern • Get all transactions having “abc”: DB|abc • “d” is a local frequent item in DB|abc ⇒ “abcd” is a frequent pattern
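A minimal sketch of this divide-and-conquer idea on a toy database; the names (project, local_frequent_items) are illustrative, not part of any library:

```python
# Illustrative sketch of pattern growth by database projection.
# 'db' is a list of transactions (sets of items); 'pattern' is a frozenset.

def project(db, pattern):
    """DB|pattern: the transactions that contain every item of 'pattern'."""
    return [t for t in db if pattern <= t]

def local_frequent_items(db, pattern, min_sup):
    """Items (outside 'pattern') that are frequent within DB|pattern."""
    counts = {}
    for t in project(db, pattern):
        for item in t - pattern:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= min_sup}

db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"b", "d"}]
abc = frozenset({"a", "b", "c"})
# "d" appears in 2 of the 3 transactions of DB|abc, so "abcd" is
# frequent whenever min_sup <= 2:
print(local_frequent_items(db, abc, min_sup=2))   # {'d'}
```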
Process of FP-growth • Scan DB once, find frequent 1-itemsets (single-item patterns) • Sort frequent items in frequency-descending order • Scan DB again, construct the FP-tree (these steps are sketched after the example below)
Association Rules • Let’s have an example • T100 1,2,5 • T200 2,4 • T300 2,3 • T400 1,2,4 • T500 1,3 • T600 2,3 • T700 1,3 • T800 1,2,3,5 • T900 1,2,3
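A compact sketch of FP-tree construction on these nine transactions (min support = 2), following the three steps on the previous slide. The nested-dict node layout is a simplification; a full FP-tree also keeps a header table and node-links:

```python
# Sketch: build an FP-tree from the nine transactions above (min support = 2).
from collections import Counter

transactions = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
    [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
]
min_sup = 2

# Scan 1: frequent 1-itemsets.
counts = Counter(i for t in transactions for i in t)
frequent = {i for i, c in counts.items() if c >= min_sup}

# Scan 2: sort each transaction's frequent items in frequency-descending
# order and insert the ordered transaction into a prefix tree.
# Each node is {item: [count, children_dict]}.
root = {}
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i)):
        entry = node.setdefault(item, [0, {}])
        entry[0] += 1
        node = entry[1]

print(counts)   # item counts: 2:7, 1:6, 3:6, 4:2, 5:2 (all frequent here)
print(root)     # shared prefixes: item 2 heads the paths of 7 transactions
```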
Benefits of the FP-tree Structure • Completeness • Preserves complete information for frequent pattern mining • Never breaks a long pattern of any transaction • Compactness • Reduces irrelevant info—infrequent items are gone • Items in frequency-descending order: the more frequently occurring, the more likely to be shared • Never larger than the original database (if node-links and count fields are not counted) • For the Connect-4 DB, the compression ratio could be over 100
Exercise • A dataset has five transactions; let min_sup = 60% and min_conf = 80% • Find all frequent itemsets using the FP Tree
Association Rules with Apriori • 1-itemset counts: K:5, E:4, M:3, O:3, Y:3 • Candidate 2-itemset counts: KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2 • Frequent 2-itemsets (support ≥ 3): KE, KM, KO, KY, EO • Frequent 3-itemset: KEO
Association Rules with FP Tree • Frequent items sorted in descending frequency: K:5, E:4, M:3, O:3, Y:3
Association Rules with FP Tree • Item Y — conditional pattern base: {KEMO:1, KEO:1, KM:1}; conditional FP-tree: ⟨K:3⟩; frequent pattern: KY • Item O — conditional pattern base: {KEM:1, KE:2}; conditional FP-tree: ⟨KE:3⟩; frequent patterns: KO, EO, KEO • Item M — conditional pattern base: {KE:2, K:1}; conditional FP-tree: ⟨K:3⟩; frequent pattern: KM • Item E — conditional pattern base: {K:4}; conditional FP-tree: ⟨K:4⟩; frequent pattern: KE
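The table above can be checked mechanically. A small sketch (the function name is illustrative) that counts items in a conditional pattern base, weighted by path counts, and keeps those meeting min support = 3:

```python
# Sketch: derive each item's conditional FP-tree contents from its
# conditional pattern base (prefix paths with counts), min support = 3.
from collections import Counter

def conditional_items(pattern_base, min_sup=3):
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    return {i: c for i, c in counts.items() if c >= min_sup}

# Y's conditional pattern base from the table above:
y_base = [("KEMO", 1), ("KEO", 1), ("KM", 1)]
print(conditional_items(y_base))   # {'K': 3} -> frequent pattern KY

# O's conditional pattern base:
o_base = [("KEM", 1), ("KE", 2)]
print(conditional_items(o_base))   # {'K': 3, 'E': 3} -> KO, EO, KEO
```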
FP-Growth vs. Apriori: Scalability with the Support Threshold (figure; data set T25I20D10K)
Why Is FP-Growth the Winner? • Divide-and-conquer: • decomposes both the mining task and the DB according to the frequent patterns obtained so far • leads to focused search of smaller databases • Other factors • no candidate generation, no candidate test • compressed database: the FP-tree structure • no repeated scan of the entire database • basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
Strong Association Rules Are Not Necessarily Interesting
Dr. Bernard Chen, Ph.D.
University of Central Arkansas, Fall 2010
Example 5.8: Misleading “Strong” Association Rule • Of the 10,000 transactions analyzed, the data show that • 6,000 of the customer transactions included computer games, • 7,500 included videos, • and 4,000 included both computer games and videos
Misleading “Strong” Association Rule • For this example: • Support (Game & Video) = 4,000 / 10,000 = 40% • Confidence (Game ⇒ Video) = 4,000 / 6,000 = 66% • Suppose these pass our minimum support and confidence thresholds (30% and 60%, respectively)
Misleading “Strong” Association Rule • However, the truth is that computer games and videos are negatively associated • This means the purchase of one of these items actually decreases the likelihood of purchasing the other • (How do we reach this conclusion?)
Misleading “Strong” Association Rule • Under normal circumstances: • 60% of customers buy the game • 75% of customers buy the video • If the two purchases were independent, we would expect 60% × 75% = 45% of customers to buy both • That equals 4,500, which is more than 4,000 (the actual value)
From Association Analysis to Correlation Analysis • Lift is a simple correlation measure, given as follows • The occurrence of itemset A is independent of the occurrence of itemset B if P(A∪B) = P(A)P(B) • Otherwise, itemsets A and B are dependent and correlated as events • Lift(A,B) = P(A∪B) / (P(A)P(B)), where P(A∪B) denotes the probability that a transaction contains both A and B • If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B • If the value is greater than 1, then A and B are positively correlated
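Applying this measure to the game/video numbers from the example above:

```python
# Lift for the game/video example: P(both) / (P(game) * P(video)).
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))   # 0.89 < 1: games and videos are negatively correlated
```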
Mining Multiple-Level Association Rules • Items often form hierarchies
Mining Multiple-Level Association Rules • Flexible support settings: items at lower levels are expected to have lower support • Uniform support: min_sup = 5% at both Level 1 and Level 2 • Reduced support: min_sup = 5% at Level 1, 3% at Level 2 • Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]
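A tiny sketch, using only the supports and thresholds from this slide, of which items survive each setting:

```python
# Which milk items pass min support under uniform vs. reduced support?
supports = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
levels   = {"Milk": 1, "2% Milk": 2, "Skim Milk": 2}

uniform_min_sup = {1: 0.05, 2: 0.05}   # same threshold at every level
reduced_min_sup = {1: 0.05, 2: 0.03}   # lower threshold at lower levels

for name, scheme in [("uniform", uniform_min_sup), ("reduced", reduced_min_sup)]:
    passing = [i for i, s in supports.items() if s >= scheme[levels[i]]]
    print(name, passing)
# uniform ['Milk', '2% Milk']                 -- Skim Milk (4%) is pruned
# reduced ['Milk', '2% Milk', 'Skim Milk']
```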
Multi-level Association: Redundancy Filtering • Some rules may be redundant due to “ancestor” relationships between items • Example • milk ⇒ wheat bread [support = 8%, confidence = 70%] • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%] • We say the first rule is an ancestor of the second rule • The second rule is redundant if its support and confidence are about what the ancestor rule already predicts (a sketch of this test follows)
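One way to make the redundancy test concrete: compare the descendant rule's support with the value expected from the ancestor rule. The share of milk sales that are 2% milk is not given on this slide, so the 0.25 below is a hypothetical assumption chosen only to illustrate the test:

```python
# Sketch of the redundancy test for the two rules above.
ancestor_support   = 0.08   # milk => wheat bread
descendant_support = 0.02   # 2% milk => wheat bread
share_2pct_milk    = 0.25   # HYPOTHETICAL: fraction of milk sales that are 2% milk

# Expected support of the descendant rule, given the ancestor rule:
expected = ancestor_support * share_2pct_milk   # 0.02
# The descendant's support matches its expected value, so it adds no
# information beyond the ancestor rule and can be filtered as redundant.
print(abs(descendant_support - expected) < 0.005)   # True
```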