
Data Mining Toon Calders


Presentation Transcript


  1. Data Mining - Toon Calders

  2. Why Data mining? • Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Major sources of abundant data

  3. Why Data mining? • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets • [Chart: the data gap between total new disk (TB) since 1995 and the number of analysts]

  4. What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

  5. Current Applications • Data analysis and decision support • Market analysis and management • Risk analysis and management • Fraud detection and detection of unusual patterns (outliers) • Other Applications • Text mining (newsgroups, email, documents) and Web mining • Stream data mining • Bioinformatics and bio-data analysis

  6. Ex. 3: Process Mining • Process mining can be used for: • Process discovery (What is the process?) • Delta analysis (Are we doing what was specified?) • Performance analysis (How can we improve?)

  7. Ex. 3: Process Mining • Example event log, in order of occurrence: case 1: task A, case 2: task A, case 3: task A, case 3: task B, case 1: task B, case 1: task C, case 2: task C, case 4: task A, case 2: task B, case 2: task D, case 5: task E, case 4: task C, case 1: task D, case 3: task C, case 3: task D, case 4: task B, case 5: task F, case 4: task D
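Process discovery starts from exactly this kind of log. As a minimal, hypothetical sketch (not the discovery algorithm used in the lecture), the following Python snippet groups the events above into one trace per case and extracts the directly-follows pairs that discovery algorithms build on:

    from collections import defaultdict

    # Event log from the slide: (case, task), in order of occurrence.
    log = [(1, "A"), (2, "A"), (3, "A"), (3, "B"), (1, "B"), (1, "C"),
           (2, "C"), (4, "A"), (2, "B"), (2, "D"), (5, "E"), (4, "C"),
           (1, "D"), (3, "C"), (3, "D"), (4, "B"), (5, "F"), (4, "D")]

    # Group the events into one trace (task sequence) per case.
    traces = defaultdict(list)
    for case, task in log:
        traces[case].append(task)

    # Directly-follows relation: x is immediately followed by y in some trace.
    follows = set()
    for trace in traces.values():
        follows.update(zip(trace, trace[1:]))

    print(dict(traces))     # cases 1 and 3: A,B,C,D; cases 2 and 4: A,C,B,D; case 5: E,F
    print(sorted(follows))  # ('A','B'), ('A','C'), ('B','C'), ('B','D'), ('C','B'), ('C','D'), ('E','F')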

  8. Data Mining Tasks • Previous lectures: • Classification [Predictive] • Clustering [Descriptive] • This lecture: • Association Rule Discovery [Descriptive] • Sequential Pattern Discovery [Descriptive] • Other techniques: • Regression [Predictive] • Deviation Detection [Predictive]

  9. Outline of today’s lecture • Association Rule Mining • Frequent itemsets and association rules • Algorithms: Apriori and Eclat • Sequential Pattern Mining • Mining frequent episodes • Algorithms: WinEpi and MinEpi • Other types of patterns • strings, graphs, … • process mining

  10. Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat • Association Rule Mining

  11. Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction • [Table: market-basket transactions] • Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} • Implication means co-occurrence, not causality!

  12. Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support (s) • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold

  13. Definition: Association Rule • Association Rule • An implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions that contain X
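The two metrics are easy to check in code. The following minimal sketch is an illustration, not part of the original slides; the five transactions are assumed to be the usual market-basket example, chosen because they reproduce the counts quoted above (σ({Milk, Bread, Diaper}) = 2, and s = 0.4, c = 0.67 for {Milk, Diaper} → {Beer}):

    # Assumed market-basket transactions; they match the support counts on the slides.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions containing every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        """s(X): fraction of transactions containing X."""
        return support_count(itemset, transactions) / len(transactions)

    def confidence(X, Y, transactions):
        """c(X -> Y) = sigma(X union Y) / sigma(X)."""
        return support_count(X | Y, transactions) / support_count(X, transactions)

    print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...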

  14. Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds ⇒ Computationally prohibitive!

  15. Mining Association Rules • Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67), {Milk, Beer} → {Diaper} (s=0.4, c=1.0), {Diaper, Beer} → {Milk} (s=0.4, c=0.67), {Beer} → {Milk, Diaper} (s=0.4, c=0.67), {Diaper} → {Milk, Beer} (s=0.4, c=0.5), {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements

  16. Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support ≥ minsup • Rule Generation • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive
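A minimal sketch of the second step, assuming the frequent itemsets and their support counts have already been computed (for example with the support_count helper above): enumerate the binary partitions of each frequent itemset and keep the high-confidence rules. A real miner would prune this enumeration more cleverly.

    from itertools import combinations

    def generate_rules(frequent, minconf):
        """frequent: dict mapping frozenset -> support count. By the Apriori principle,
        every subset of a frequent itemset is also frequent, so every left-hand side X
        is already present in the dict."""
        rules = []
        for itemset in frequent:
            if len(itemset) < 2:
                continue
            for k in range(1, len(itemset)):
                for lhs in combinations(itemset, k):
                    X = frozenset(lhs)
                    Y = itemset - X
                    conf = frequent[itemset] / frequent[X]   # c(X -> Y)
                    if conf >= minconf:
                        rules.append((X, Y, conf))
        return rules

Rules generated from the same itemset share its support, so only confidence has to be rechecked here; that is exactly the decoupling noted on slide 15.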

  17. Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat • Association Rule Mining

  18. Frequent Itemset Generation Given d items, there are 2^d possible candidate itemsets

  19. Frequent Itemset Generation • Brute-force approach: • Each itemset in the lattice is a candidate frequent itemset • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇒ expensive since M = 2^d!
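In code, the brute-force approach is only a few lines. The sketch below (an illustration, not from the slides) enumerates every non-empty candidate itemset and scans all N transactions for each one, which is exactly the O(NMw) behaviour that makes it infeasible once d grows:

    from itertools import combinations

    def brute_force_frequent(transactions, minsup_count):
        """transactions: list of sets of items; returns {frozenset: support count}."""
        items = sorted(set().union(*transactions))        # the d distinct items
        frequent = {}
        for k in range(1, len(items) + 1):                # all 2^d - 1 non-empty candidates
            for candidate in combinations(items, k):
                c = frozenset(candidate)
                # One pass over the N transactions per candidate.
                count = sum(1 for t in transactions if c <= t)
                if count >= minsup_count:
                    frequent[c] = count
        return frequent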

  20. Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Complete search: M = 2^d • Use pruning techniques to reduce M • Reduce the number of transactions (N) • Reduce size of N as the size of itemset increases • Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction

  21. Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support

  22. Illustrating Apriori Principle • [Itemset lattice: one itemset is found to be infrequent, and all of its supersets are pruned]

  23. Illustrating Apriori Principle • Minimum Support = 3 • Items (1-itemsets), then pairs (2-itemsets), then triplets (3-itemsets); no need to generate candidates involving Coke or Eggs • If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates • With support-based pruning: 6 + 6 + 1 = 13 candidates

  24. Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat • Association Rule Mining

  25. Apriori (minsup = 2) • DB: {B,C}, {B,C}, {A,C,D}, {A,B,C,D}, {B,D} • Candidate 1-itemsets, counts before the scan: A:0, B:0, C:0, D:0

  26. Apriori • After the first transaction {B,C}: A:0, B:1, C:1, D:0

  27. Apriori • After the second transaction {B,C}: A:0, B:2, C:2, D:0

  28. Apriori • After {A,C,D}: A:1, B:2, C:3, D:1

  29. Apriori • After {A,B,C,D}: A:2, B:3, C:4, D:2

  30. Apriori • After {B,D}: A:2, B:4, C:4, D:3; all four items are frequent

  31. Apriori • Candidate 2-itemsets generated from the frequent items: AB, AC, AD, BC, BD, CD

  32. Apriori • Counts after one scan: AB:1, AC:2, AD:2, BC:3, BD:2, CD:2; AB is infrequent

  33. Apriori • Candidate 3-itemsets (every 2-subset frequent): ACD, BCD; ABC and ABD are pruned because AB is infrequent

  34. Apriori • Counts after one scan: ACD:2, BCD:1; only ACD is frequent

  35. Apriori Algorithm • Apriori Algorithm:
    k := 1
    C1 := { {A} | A is an item }
    Repeat until Ck = {}:
        Count the support of each candidate in Ck in one scan over DB
        Fk := { I ∈ Ck : I is frequent }
        Generate new candidates: Ck+1 := { I : |I| = k+1 and all J ⊂ I with |J| = k are in Fk }
        k := k + 1
    Return ∪i=1..k-1 Fi
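A runnable Python sketch of this pseudocode, written as an illustration rather than the course's reference implementation: it performs the same levelwise candidate generation, one counting pass per level, and prunes any candidate that has an infrequent k-subset.

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """transactions: list of sets of items; returns {frozenset: support count}."""
        items = sorted(set().union(*transactions))
        candidates = [frozenset([a]) for a in items]        # C1
        frequent, k = {}, 1
        while candidates:
            # Count the support of each candidate in Ck in one scan over the DB.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Fk = {c: n for c, n in counts.items() if n >= minsup_count}
            frequent.update(Fk)
            # Ck+1: unions of frequent k-itemsets whose k-subsets are all in Fk.
            candidates = []
            for a in Fk:
                for b in Fk:
                    union = a | b
                    if (len(union) == k + 1 and union not in candidates
                            and all(frozenset(s) in Fk for s in combinations(union, k))):
                        candidates.append(union)
            k += 1
        return frequent

    # The toy database from slides 25-34, minsup = 2.
    db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
    print(apriori(db, 2))   # frequent sets: A, B, C, D, AC, AD, BC, BD, CD, ACD

On this toy database the sketch reproduces the counts traced on slides 25-34.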

  36. Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat • Association Rule Mining

  37. Depth-first strategy • Recursive procedure • FSET(DB) = frequent sets in DB • Based on divide-and-conquer • Count frequency of all items • let D be a frequent item • FSET(DB) = Frequent sets with item D + Frequent sets without item D

  38. Depth-first strategy • DB: 1: B,C | 2: B,C | 3: A,C,D | 4: A,B,C,D | 5: B,D • Frequent items: A, B, C, D • Frequent sets with D: • remove transactions without D, and D itself, from DB • Count frequent sets in the projection: A, B, C, AC • Append D: AD, BD, CD, ACD • Frequent sets without D: • remove D from all transactions in DB • Find frequent sets: AC, BC

  39. Depth-First Algorithm (minsup = 2) • DB: 1: B,C | 2: B,C | 3: A,C,D | 4: A,B,C,D | 5: B,D

  40. Depth-First Algorithm • Item counts in DB: A:2, B:4, C:4, D:3

  41. Depth-First Algorithm • Project on D (keep only transactions containing D, then drop D): DB[D] = 3: A,C | 4: A,B,C | 5: B • Item counts in DB[D]: A:2, B:2, C:2

  42. Depth-First Algorithm • Project DB[D] on C: DB[CD] = 3: A | 4: A,B • Item counts in DB[CD]: A:2

  43. Depth-First Algorithm • A is frequent in DB[CD], so AC is frequent in DB[D]: AC:2

  44. Depth-First Algorithm • Backtrack from DB[CD] to DB[D]

  45. Depth-First Algorithm • Project DB[D] on B: DB[BD] = 4: A • Item count A:1, infrequent

  46. Depth-First Algorithm • Backtrack from DB[BD] to DB[D]

  47. Depth-First Algorithm • Frequent sets found under DB[D], with D appended: AD:2, BD:2, CD:2, ACD:2

  48. Depth-First Algorithm • Backtrack to DB; item D is fully processed

  49. Depth-First Algorithm • Project on C (with D already handled): DB[C] = 1: B | 2: B | 3: A | 4: A,B • Item counts in DB[C]: A:2, B:3

  50. Depth-First Algorithm • Project DB[C] on B: DB[BC] = 2: (empty) | 4: A • Item count A:1, infrequent
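A compact Python sketch of this depth-first (Eclat-style) strategy, given as an illustration of slides 37-50 rather than the lecture's exact code: recursively project the database on one frequent item at a time, mine the projection, and append the item to everything found there. Unlike the slides it processes items in sorted order instead of starting with D, but it finds the same frequent sets.

    def fset(transactions, minsup_count, suffix=frozenset()):
        """Depth-first frequent-itemset mining; transactions is a list of sets.
        Returns {frozenset: support count}. suffix holds the items projected on so far."""
        counts = {}
        for t in transactions:                  # count item frequencies in this projection
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        frequent = {}
        items = sorted(a for a, n in counts.items() if n >= minsup_count)
        for i, item in enumerate(items):
            found = suffix | {item}
            frequent[found] = counts[item]
            # DB[item]: keep transactions containing item; drop item and the items already processed.
            projected = [t - {item} - set(items[:i]) for t in transactions if item in t]
            frequent.update(fset(projected, minsup_count, found))
        return frequent

    db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
    print(fset(db, 2))   # same result as Apriori: A, B, C, D, AC, AD, BC, BD, CD, ACD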
