
Ch5 Mining Frequent Patterns, Associations, and Correlations



Presentation Transcript


  1. Ch5 Mining Frequent Patterns, Associations, and Correlations Dr. Bernard Chen Ph.D. University of Central Arkansas

  2. Outline • Association Rules • Association Rules with FP tree • Misleading Rules • Multi-level Association Rules

  3. What Is Frequent Pattern Analysis? • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

  4. What Is Frequent Pattern Analysis? • Motivation: Finding inherent regularities in data • What products were often purchased together? bread and milk? • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

  5. Association Rules

  6. Association Rules • Support, s: the probability that a transaction contains X ∪ Y • Confidence, c: the conditional probability that a transaction containing X also contains Y

  7. Association Rules • Let’s have an example • T100 1,2,5 • T200 2,4 • T300 2,3 • T400 1,2,4 • T500 1,3 • T600 2,3 • T700 1,3 • T800 1,2,3,5 • T900 1,2,3

  8. Association Rules with Apriori • Minimum support = 2/9 • Minimum confidence = 60%

  9. The Apriori Algorithm • Pseudo-code:
       Ck: candidate itemsets of size k
       Lk: frequent itemsets of size k
       L1 = {frequent items};
       for (k = 1; Lk != ∅; k++) do begin
           Ck+1 = candidates generated from Lk;
           for each transaction t in database do
               increment the count of all candidates in Ck+1 that are contained in t
           Lk+1 = candidates in Ck+1 with min_support
       end
       return ∪k Lk;
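The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own implementation: the function name `apriori` and the `min_count` parameter (an absolute support count, so 2 corresponds to the 2/9 threshold) are assumptions, and the transactions are the nine from the running example.

```python
from itertools import combinations

# Transactions T100..T900 from the running example; items are integers.
TRANSACTIONS = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    # L1: count single items and keep those that meet the minimum count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            # Prune step: every k-subset of a candidate must be frequent.
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Count step: one pass over the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


frequent_itemsets = apriori(TRANSACTIONS, min_count=2)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the candidate sets small: {1, 3, 5} is never counted because its subset {3, 5} is not frequent.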

  10. Strong Association Rule • Strong association rules are rules derived from frequent itemsets that also satisfy the minimum confidence. • For example, take the frequent itemset {I1, I2} with minimum confidence 60% • Confidence(I1->I2) = 4/6 ≈ 67% (strong association rule!) • Confidence(I2->I1) = 4/7 ≈ 57% (not strong)
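As a quick check of these numbers, the sketch below recomputes both confidences from the nine example transactions. The helper names `support_count` and `confidence` are illustrative assumptions, and exact fractions are used so the comparison against the 60% threshold involves no rounding.

```python
from fractions import Fraction

# Transactions T100..T900 from the running example.
TRANSACTIONS = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence(X -> Y) = support_count(X ∪ Y) / support_count(X)."""
    both = support_count(antecedent | consequent, transactions)
    return Fraction(both, support_count(antecedent, transactions))


MIN_CONFIDENCE = Fraction(60, 100)

conf_1_to_2 = confidence({1}, {2}, TRANSACTIONS)  # 4 of the 6 transactions with item 1
conf_2_to_1 = confidence({2}, {1}, TRANSACTIONS)  # 4 of the 7 transactions with item 2

print("1 -> 2:", conf_1_to_2, "strong" if conf_1_to_2 >= MIN_CONFIDENCE else "not strong")
print("2 -> 1:", conf_2_to_1, "strong" if conf_2_to_1 >= MIN_CONFIDENCE else "not strong")
```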

  11. Exercise • A dataset has five transactions; let min_support = 60% and min_confidence = 80% • Find all frequent itemsets using Apriori and all strong association rules

  12. Association Rules with Apriori • 1-itemsets: K:5, E:4, M:3, O:3, Y:3 • Candidate 2-itemsets: KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2 • Frequent 2-itemsets: KE, KM, KO, KY, EO • Frequent 3-itemset: KEO

  13. Outline • Association Rules • Association Rules with FP tree • Misleading Rules • Multi-level Association Rules

  14. Mining Frequent Itemsets without Candidate Generation • In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to a good performance gain. • However, it suffers from two nontrivial costs: • It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets lead to more than 10^7 candidate 2-itemsets) • It may need to scan the database many times

  15. Association Rules with Apriori • Minimum support = 2/9 • Minimum confidence = 70%

  16. Bottleneck of Frequent-pattern Mining • Multiple database scans are costly • Mining long patterns needs many passes of scanning and generates lots of candidates • To find frequent itemset i1i2…i100 • # of scans: 100 • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30! • Bottleneck: candidate-generation-and-test • Can we avoid candidate generation?
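The candidate count on this slide is easy to verify with exact integer arithmetic. The short check below sums the binomial coefficients directly; the variable name `total` is only illustrative.

```python
from math import comb

# Every non-empty subset of a 100-item frequent pattern is itself a candidate.
total = sum(comb(100, k) for k in range(1, 101))

print(total == 2**100 - 1)  # the closed form quoted on the slide
print(f"{total:.3e}")       # order of magnitude
```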

  17. Mining Frequent Patterns Without Candidate Generation • Grow long patterns from short ones using local frequent items • “abc” is a frequent pattern • Get all transactions having “abc”: DB|abc • “d” is a local frequent item in DB|abc → abcd is a frequent pattern
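The pattern-growth idea can be sketched with plain projected databases, before any tree structure is introduced. Note the assumptions: this is not FP-growth itself (no FP-tree is built), the function name `pattern_growth` is hypothetical, and items are extended in increasing order so that each pattern is produced exactly once.

```python
# Transactions T100..T900 from the running example.
TRANSACTIONS = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


def pattern_growth(transactions, min_count, prefix=()):
    """Yield (pattern, count) pairs, growing each pattern from its projected DB."""
    # Count the items that occur in this (projected) database.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):
        if counts[item] < min_count:
            continue  # not locally frequent, so no longer pattern can use it
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # DB|pattern: transactions containing `item`, restricted to larger items.
        projected = [{x for x in t if x > item} for t in transactions if item in t]
        yield from pattern_growth(projected, min_count, pattern)


patterns = dict(pattern_growth(TRANSACTIONS, min_count=2))
for pattern, count in patterns.items():
    print(pattern, count)
```

No candidate set is ever generated here: an item is appended to a pattern only when it is already known to be frequent in that pattern's projected database.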

  18. Process of FP growth • Scan DB once, find frequent 1-itemset (single item pattern) • Sort frequent items in frequency descending order • Scan DB again, construct FP-tree
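The two scans can be sketched as follows for the nine-transaction example with a minimum support count of 2. The class `FPNode` and function `build_fp_tree` are illustrative names, node-links and the header table are omitted for brevity, and breaking frequency ties by item value is an assumption (the slide does not specify a tie-break).

```python
from collections import Counter

# Transactions T100..T900 from the running example.
TRANSACTIONS = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Sort by descending frequency; ties broken by item value (assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Scan 2: insert each transaction's frequent items in rank order,
    # sharing prefixes with transactions inserted earlier.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, order


def show(node, depth=0):
    """Print the tree with indentation proportional to depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


root, order = build_fp_tree(TRANSACTIONS, min_count=2)
show(root)
```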

  19. Association Rules • Let’s have an example • T100 1,2,5 • T200 2,4 • T300 2,3 • T400 1,2,4 • T500 1,3 • T600 2,3 • T700 1,3 • T800 1,2,3,5 • T900 1,2,3

  20. FP Tree

  21. Mining the FP tree

  22. Benefits of the FP-tree Structure • Completeness • Preserves complete information for frequent pattern mining • Never breaks a long pattern of any transaction • Compactness • Reduces irrelevant information: infrequent items are gone • Items are in frequency-descending order: the more frequently an item occurs, the more likely its node is to be shared • Never larger than the original database (not counting node-links and count fields) • For the Connect-4 DB, the compression ratio could be over 100

  23. Exercise • A dataset has five transactions, let min-support=60% and min_confidence=80% • Find all frequent itemsets using FP Tree

  24. Association Rules with FP Tree K:5 E:4 M:3 O:3 Y:3

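  25. Association Rules with FP Tree (item: conditional pattern base → conditional FP-tree → frequent patterns) • Y: KEMO:1, KEO:1, KY:1 → K:3 → KY • O: KEM:1, KE:2 → KE:3 → KO, EO, KEO • M: KE:2, K:1 → K:3 → KM • E: K:4 → KE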

  26. FP-Growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K

  27. Why Is FP-Growth the Winner? • Divide-and-conquer: • decomposes both the mining task and the DB according to the frequent patterns obtained so far • leads to a focused search of smaller databases • Other factors • no candidate generation, no candidate test • compressed database: FP-tree structure • no repeated scan of the entire database • basic operations are counting local frequent items and building sub FP-trees, with no pattern search and matching

  28. Outline • Association Rules • Association Rules with FP tree • Misleading Rules • Multi-level Association Rules

  29. Example 5.8 Misleading “Strong” Association Rule • Of the 10,000 transactions analyzed, the data show that • 6,000 of the customer transactions included computer games, • while 7,500 included videos, • and 4,000 included both computer games and videos

  30. Misleading “Strong” Association Rule • For this example: • Support (Game & Video) = 4,000 / 10,000 = 40% • Confidence (Game => Video) = 4,000 / 6,000 ≈ 66.7% • Suppose the rule passes our minimum support and confidence thresholds (30% and 60%, respectively)

  31. Misleading “Strong” Association Rule • However, the truth is: “computer games and videos are negatively associated” • This means the purchase of one of these items actually decreases the likelihood of purchasing the other. • (How do we reach this conclusion?)

  32. Misleading “Strong” Association Rule • Consider what independence would predict: • 60% of customers buy the game • 75% of customers buy the video • If the two purchases were independent, 60% * 75% = 45% of customers would buy both • That equals 4,500 transactions, which is more than the 4,000 actually observed

  33. From Association Analysis to Correlation Analysis • Lift is a simple correlation measure, defined as follows (here P(A ∪ B) is the probability that a transaction contains every item in A and in B) • The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B) • Otherwise, itemsets A and B are dependent and correlated as events • Lift(A,B) = P(A ∪ B) / (P(A)P(B)) • If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B • If the value is greater than 1, then A and B are positively correlated • If the value equals 1, A and B are independent
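Applying the lift definition to the games-and-videos figures from Example 5.8 makes the negative correlation concrete. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

N = 10_000                       # transactions analyzed
p_game = Fraction(6_000, N)      # P(game)
p_video = Fraction(7_500, N)     # P(video)
p_both = Fraction(4_000, N)      # P(game ∪ video): both in one transaction

# Count expected under independence versus the count actually observed.
expected_both = p_game * p_video * N
lift = p_both / (p_game * p_video)

print("expected under independence:", expected_both)
print("lift:", lift, "=", float(lift))
print("negatively correlated" if lift < 1 else "not negatively correlated")
```

A lift below 1 agrees with the earlier observation that fewer transactions contain both items than independence would predict, even though the rule passes the support and confidence thresholds.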

  34. Outline • Association Rules • Association Rules with FP tree • Misleading Rules • Multi-level Association Rules

  35. Mining Multiple-Level Association Rules • Items often form hierarchies

  36. Mining Multiple-Level Association Rules • Items often form hierarchies

  37. Mining Multiple-Level Association Rules • Flexible support settings • Items at the lower level are expected to have lower support • Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% • Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% • Level 1: Milk [support = 10%] • Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
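The effect of the two threshold schemes on the milk hierarchy can be sketched as a simple per-level comparison. The dictionary layout and the function name `frequent_items` are assumptions made for illustration; the support values and thresholds are the ones on the slide.

```python
# item -> (hierarchy level, support), taken from the slide
SUPPORT = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}

# level -> minimum support under each scheme
UNIFORM = {1: 0.05, 2: 0.05}
REDUCED = {1: 0.05, 2: 0.03}


def frequent_items(min_sup_by_level):
    """Items whose support meets the threshold set for their own level."""
    return [
        name
        for name, (level, support) in SUPPORT.items()
        if support >= min_sup_by_level[level]
    ]


uniform_result = frequent_items(UNIFORM)
reduced_result = frequent_items(REDUCED)
print("uniform:", uniform_result)
print("reduced:", reduced_result)
```

Under uniform support, Skim Milk (4%) is dropped at level 2; lowering the level-2 threshold to 3% keeps it.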

  38. Multi-level Association: Redundancy Filtering • Some rules may be redundant due to “ancestor” relationships between items. • Example • milk → wheat bread [support = 8%, confidence = 70%] • 2% milk → wheat bread [support = 2%, confidence = 72%] • We say the first rule is an ancestor of the second rule.
