1 / 42

Association Rule Mining (II)

Association Rule Mining (II). Instructor: Qiang Yang Thanks: J.Han and J. Pei. Bottleneck of Frequent-pattern Mining. Multiple database scans are costly Mining long patterns needs many passes of scanning and generates lots of candidates To find frequent itemset i 1 i 2 …i 100

zubin
Download Presentation

Association Rule Mining (II)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Association Rule Mining (II) Instructor: Qiang Yang Thanks: J.Han and J. Pei

  2. Bottleneck of Frequent-pattern Mining • Multiple database scans are costly • Mining long patterns needs many passes of scanning and generates lots of candidates • To find frequent itemset i1i2…i100 • # of scans: 100 • # of Candidates: (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 ! • Bottleneck: candidate-generation-and-test • Can we avoid candidate generation? Frequent-pattern mining methods

  3. FP-growth: Frequent-pattern Mining Without Candidate Generation • Heuristic: let P be a frequent itemset, S be the set of transactions contain P, and x be an item. If x is a frequent item in S, {x} P must be a frequent itemset • No candidate generation! • A compact data structure, FP-tree, to store information for frequent pattern mining • Recursive mining algorithm for mining complete set of frequent patterns Frequent-pattern mining methods

  4. Example Min Support = 3 Frequent-pattern mining methods

  5. Scan the database • List of frequent items, sorted: (item:support) • <(f:4), (c:4), (a:3),(b:3),(m:3),(p:3)> • The root of the tree is created and labeled with “{}” • Scan the database • Scanning the first transaction leads to the first branch of the tree: <(f:1),(c:1),(a:1),(m:1),(p:1)> • Order according to frequency Frequent-pattern mining methods

  6. Scanning TID=100 root {} Transaction Database TID Items 100 f,a,c,d,g,i,m,p Header Table Node Item count head f 1 c 1 a 1 m 1 p 1 f:1 c:1 a:1 m:1 p:1 Frequent-pattern mining methods

  7. Frequent Single Items: F1=<f,c,a,b,m,p> TID=200 Possible frequent items: Intersect with F1: f,c,a,b,m Along the first branch of <f,c,a,m,p>, intersect: <f,c,a> Generate two children <b>, <m> Scanning TID=200 Frequent-pattern mining methods

  8. Scanning TID=200 root {} Transaction Database TID Items 200 f,c,a,b,m Header Table Node Item count head f 1 c 1 a 1 b 1 m 2 p 1 f:2 c:2 a:2 m:1 b:1 p:1 m:1 Frequent-pattern mining methods

  9. The final FP-tree {} Transaction Database TID Items 100 f,a,c,d,g,i,m,p 200 a,b,c,f,l,m,o 300 b,f,h,j,o 400 b,c,k,s,p 500 a,f,c,e,l,p,m,n Min support = 3 Header Table Node Item count head f 1 c 2 a 1 b 3 m 2 p 2 f:4 c:1 c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1 Frequent 1-items in frequency descending order: f,c,a,b,m,p Frequent-pattern mining methods

  10. FP-Tree Construction • Scans the database only twice • Subsequent mining: based on the FP-tree Frequent-pattern mining methods

  11. How to Mine an FP-tree? • Step 1: form conditional pattern base • Step 2: construct conditional FP-tree • Step 3: recursively mine conditional FP-trees Frequent-pattern mining methods

  12. Conditional Pattern Base • Let {I} be a frequent item • A sub database which • consists of the set of prefix paths in the FP-tree • With item {I} as a co-occurring suffix pattern • Example: • {m} is a frequent item • {m}’s conditional pattern base: • <f,c,a>: support =2 • <f,c,a,b>: support = 1 • Mine recursively on such databases {} f:4 c:1 c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1 Frequent-pattern mining methods

  13. Conditional Pattern Tree • Let {I} be a suffix item, {DB|I} be theconditional pattern base • The frequent pattern tree TreeI is knownas the conditional pattern tree • Example: • {m} is a frequent item • {m}’s conditional pattern base: • <f,c,a>: support =2 • <f,c,a,b>: support = 1 • {m}’s conditional pattern tree {} f:4 c:3 a:3 m:2 Frequent-pattern mining methods

  14. Composition of patterns a and b • Let a be a frequent item in DB, B be a’s conditional pattern base, and b be an itemset in B. Then a + b is frequent in DB if and only if b is frequent in B. • Example: • Starting with a={p} • {p}’s conditional pattern base (from the tree) B= • (f,c,a,m): 2 (c,b): 1 • Let b be {c}. • Then a+b={p,c}, with support = 3. Frequent-pattern mining methods

  15. Let P be a single path FP tree Let {I1, I2, …Ik} be an itemset in the tree Let Ij have the lowest support Then the support({I1, I2, …Ik})=support(Ij) Example: Single path tree {} f:4 c:1 c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1 Frequent-pattern mining methods

  16. FP_growth Algorithm Fig 6.10 • Recursive Algorithm • Input: A transaction database, min_supp • Output: The complete set of frequent patterns • 1. FP-Tree construction • 2. Mining FP-Tree by calling FP_growth(FP_tree, null) • Key Idea: consider single path FP-tree and multi-path FP-tree separately • Continue to split until get single-path FP-tree Frequent-pattern mining methods

  17. FP_Growth (tree, a) • If tree contains a single path P, then • For each combination (denoted as b) of the nodes in the path P, then • Generate pattern b+a with support = min_supp of nodes in b • Else for each a in the header of tree, do { • Generate pattern b = a + a with support = a.support; • Construct • (1) b’s conditional pattern base and • (2) b’s conditional FP-tree Treeb • If Treeb is not empty, then • Call FP-growth(Treeb, b); • } Frequent-pattern mining methods

  18. FP-Growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K Frequent-pattern mining methods

  19. FP-Growth vs. Tree-Projection: Scalability with the Support Threshold Data set T25I20D100K Frequent-pattern mining methods

  20. Why Is FP-Growth the Winner? • Divide-and-conquer: • decompose both the mining task and DB according to the frequent patterns obtained so far • leads to focused search of smaller databases • Other factors • no candidate generation, no candidate test • compressed database: FP-tree structure • no repeated scan of entire database • basic ops—counting and FP-tree building, not pattern search and matching Frequent-pattern mining methods

  21. Implications of the Methodology: Papers by Han, et al. • Mining closed frequent itemsets and max-patterns • CLOSET (DMKD’00) • Mining sequential patterns • FreeSpan (KDD’00), PrefixSpan (ICDE’01) • Constraint-based mining of frequent patterns • Convertible constraints (KDD’00, ICDE’01) • Computing iceberg data cubes with complex measures • H-tree and H-cubing algorithm (SIGMOD’01) Frequent-pattern mining methods

  22. Visualization of Association Rules: Pane Graph Frequent-pattern mining methods

  23. Visualization of Association Rules: Rule Graph Frequent-pattern mining methods

  24. Mining Various Kinds of Rules or Regularities • Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity • Classification, clustering, iceberg cubes, etc. Frequent-pattern mining methods

  25. uniform support reduced support Level 1 min_sup = 5% Milk [support = 10%] Level 1 min_sup = 5% Level 2 min_sup = 5% 2% Milk [support = 6%] Skim Milk [support = 4%] Level 2 min_sup = 3% Multiple-level Association Rules • Items often form hierarchy • Flexible support settings: Items at the lower level are expected to have lower support. • Transaction database can be encoded based on dimensions and levels • explore shared multi-level mining Frequent-pattern mining methods

  26. Quantitative Association Rules • Numeric attributes are dynamically discretized • Such that the confidence or compactness of the rules mined is maximized. • 2-D quantitative association rules: Aquan1 Aquan2  Acat • Cluster “adjacent” association rules to form general rules using a 2-D grid. • Example: age(X,”34-35”)  income(X,”30K - 50K”)  buys(X,”high resolution TV”) Frequent-pattern mining methods

  27. Redundant Rules [SA95] • Which rule is redundant? • milk  wheat bread, [support = 8%, confidence = 70%] • “skim milk”  wheat bread, [support = 2%, confidence = 72%] • The first rule is more general than the second rule. • A rule is redundant if its support is close to the “expected” value, based on a general rule, and its confidence is close to that of the general rule. Frequent-pattern mining methods

  28. INCREMENTAL MINING [CHNW96] • Rules in DB were found and a set of new tuples db is added to DB, • Task: to find new rules in DB + db. • Usually, DB is much larger than db. • Properties of Itemsets: • frequent in DB + db if frequent in both DB and db. • infrequent in DB + db if also in both DB and db. • frequent only in DB, then merge with counts in db. • No DB scan is needed! • frequent only in db, then scan DB once to update their itemset counts. • Same principle applicable to distributed/parallel mining.

  29. CORRELATION RULES • Association does not measure correlation [BMS97, AY98]. • Among 5000 students • 3000 play basketball, 3750 eat cereal, 2000 do both • play basketball  eat cereal [40%, 66.7%] • Conclusion: “basketball and cereal are correlated” is misleading • because the overall percentage of students eating cereal is 75%, higher than 66.7%. • Confidence does not always give correct picture! Frequent-pattern mining methods

  30. P(A^B)=P(B)*P(A), if A and B are independent events A and B negatively correlated the value is less than 1; Otherwise A and B positively correlated. P(B|A)/P(B) is known as the lift of rule BA If less than one, then B and A are negatively correlated. BasketballCereal 2000/(3000*3750/5000)=2000*5000/3000*3750<1 Correlation Rules Frequent-pattern mining methods

  31. Chi-square Correlation [BMS97] • The cutoff value at 95% significance level is 3.84 > 0.9 • Thus, we do not reject the independence assumption. Frequent-pattern mining methods

  32. Constraint-based Data Mining • Finding all the patterns in a database autonomously? — unrealistic! • The patterns could be too many but not focused! • Data mining should be an interactive process • User directs what to be mined using a data mining query language (or a graphical user interface) • Constraint-based mining • User flexibility: provides constraints on what to be mined • System optimization: explores such constraints for efficient mining—constraint-based mining Frequent-pattern mining methods

  33. Constraints in Data Mining • Knowledge type constraint: • classification, association, etc. • Data constraint— using SQL-like queries • find product pairs sold together in stores in Vancouver in Dec.’00 • Dimension/level constraint • in relevance to region, price, brand, customer category • Rule (or pattern) constraint • small sales (price < $10) triggers big sales (sum > $200) • Interestingness constraint • strong rules: min_support  3%, min_confidence  60% Frequent-pattern mining methods

  34. Constrained Mining vs. Constraint-Based Search • Constrained mining vs. constraint-based search/reasoning • Both are aimed at reducing search space • Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI • Constraint-pushing vs. heuristic search • It is an interesting research problem on how to integrate them • Constrained mining vs. query processing in DBMS • Database query processing requires to find all • Constrained pattern mining shares a similar philosophy as pushing selections deeply in query processing Frequent-pattern mining methods

  35. Constrained Frequent Pattern Mining: A Mining Query Optimization Problem • Given a frequent pattern mining query with a set of constraints C, the algorithm should be • sound: it only finds frequent sets that satisfy the given constraints C • complete: all frequent sets satisfying the given constraints C are found • A naïve solution • First find all frequent sets, and then test them for constraint satisfaction • More efficient approaches: • Analyze the properties of constraints comprehensively • Push them as deeply as possible inside the frequent pattern computation. Frequent-pattern mining methods

  36. Anti-Monotonicity in Constraint-Based Mining TDB (min_sup=2) • Anti-monotonicity • intemset S satisfies the constraint, so does any of its subset • sum(S.Price)  v is anti-monotone • sum(S.Price)  v is not anti-monotone • Example. C: range(S.profit)  15 is anti-monotone • Itemset ab violates C • So does every superset of ab Frequent-pattern mining methods

  37. Which Constraints Are Anti-Monotone? Frequent-pattern mining methods

  38. Monotonicity in Constraint-Based Mining TDB (min_sup=2) • Monotonicity • When an intemset S satisfies the constraint, so does any of its superset • sum(S.Price)  v is monotone • min(S.Price)  v is monotone • Example. C: range(S.profit)  15 • Itemset ab satisfies C • So does every superset of ab Frequent-pattern mining methods

  39. Which Constraints Are Monotone? Frequent-pattern mining methods

  40. Succinctness, Convertible, Inconvertable Constraints in Book • We will not consider these in this course. Frequent-pattern mining methods

  41. Associative Classification • Mine association possible rules in form of itemset  class • Itemset: a set of attribute-value pairs • Class: class label • Build Classifier • Organize rules according to decreasing precedence based on confidence and support • B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD’98 Frequent-pattern mining methods

  42. Classification by Aggregating Emerging Patterns • Emerging pattern (EP): A pattern frequent in one class of data but infrequent in others. • Age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no” • Rule: age<=30  buys computer • G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99 Frequent-pattern mining methods

More Related