Data Mining
Toon Calders
Why Data Mining? • Explosive growth of data: from terabytes to petabytes • Data collection and data availability • Major sources of abundant data
Why Data Mining? • We are drowning in data, but starving for knowledge! • "Necessity is the mother of invention": data mining, the automated analysis of massive data sets [Chart: The Data Gap, total new disk storage (TB) since 1995 vs. the number of analysts]
What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Current Applications • Data analysis and decision support • Market analysis and management • Risk analysis and management • Fraud detection and detection of unusual patterns (outliers) • Other applications • Text mining (newsgroups, email, documents) and Web mining • Stream data mining • Bioinformatics and bio-data analysis
Ex. 3: Process Mining • Process mining can be used for: • Process discovery (What is the process?) • Delta analysis (Are we doing what was specified?) • Performance analysis (How can we improve?)
Ex. 3: Process Mining. Example event log (one entry per event, in order of occurrence): case 1: task A; case 2: task A; case 3: task A; case 3: task B; case 1: task B; case 1: task C; case 2: task C; case 4: task A; case 2: task B; case 2: task D; case 5: task E; case 4: task C; case 1: task D; case 3: task C; case 3: task D; case 4: task B; case 5: task F; case 4: task D. Grouping the events by case gives the traces A, B, C, D (cases 1 and 3), A, C, B, D (cases 2 and 4), and E, F (case 5).
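To make the discovery step concrete, here is a minimal sketch (my own illustration, not code from the lecture) that groups the event log above into per-case traces and extracts the direct-follows relation, a usual starting point for process discovery; all names in it are assumptions.

```python
from collections import defaultdict

# Event log from the slide: (case, task) pairs in order of occurrence.
events = [(1, "A"), (2, "A"), (3, "A"), (3, "B"), (1, "B"), (1, "C"),
          (2, "C"), (4, "A"), (2, "B"), (2, "D"), (5, "E"), (4, "C"),
          (1, "D"), (3, "C"), (3, "D"), (4, "B"), (5, "F"), (4, "D")]

def direct_follows(events):
    """Group events into per-case traces, then collect every pair
    (x, y) where task y directly follows task x in some trace."""
    traces = defaultdict(list)
    for case, task in events:          # events are already time-ordered
        traces[case].append(task)
    relation = set()
    for trace in traces.values():
        relation.update(zip(trace, trace[1:]))
    return dict(traces), relation

traces, rel = direct_follows(events)
print(traces)       # {1: ['A','B','C','D'], 2: ['A','C','B','D'], ...}
print(sorted(rel))  # [('A','B'), ('A','C'), ('B','C'), ('B','D'), ...]
```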
Data Mining Tasks • Previous lectures: • Classification [Predictive] • Clustering [Descriptive] • This lecture: • Association Rule Discovery [Descriptive] • Sequential Pattern Discovery [Descriptive] • Other techniques: • Regression [Predictive] • Deviation Detection [Predictive]
Outline of today’s lecture • Association Rule Mining • Frequent itemsets and association rules • Algorithms: Apriori and Eclat • Sequential Pattern Mining • Mining frequent episodes • Algorithms: WinEpi and MinEpi • Other types of patterns • strings, graphs, … • process mining
Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat
Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-basket transactions:
TID 1: Bread, Milk
TID 2: Bread, Diaper, Beer, Eggs
TID 3: Milk, Diaper, Beer, Coke
TID 4: Bread, Milk, Diaper, Beer
TID 5: Bread, Milk, Diaper, Coke
Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality!
Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support (s) • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent itemset • An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule • Association rule • An implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} • Rule evaluation metrics • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions that contain X
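As a quick check of these definitions, the sketch below (my own; the transaction list matches the market-basket example above) computes the support and confidence of {Milk, Diaper} → {Beer}:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)  # fraction containing X and Y
c = support_count(X | Y) / support_count(X)   # how often Y appears given X
print(f"s = {s:.2f}, c = {c:.2f}")            # s = 0.40, c = 0.67
```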
Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds ⇒ Computationally prohibitive!
Mining Association Rules. Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
• Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements
Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support ≥ minsup • Rule Generation • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below) • Frequent itemset generation is still computationally expensive
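A minimal sketch of the rule-generation step, assuming the frequent itemsets and their support counts are already available (helper and variable names are mine, not from the lecture):

```python
from itertools import combinations

def generate_rules(itemset, support_count, minconf):
    """Emit all rules X -> Y that binary-partition a frequent itemset,
    keeping those whose confidence reaches minconf."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):            # size of the antecedent X
        for X in map(frozenset, combinations(items, r)):
            Y = items - X
            conf = support_count[items] / support_count[X]
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
    return rules

# Support counts from the market-basket example (relevant subset only).
sc = {frozenset({"Milk", "Diaper", "Beer"}): 2,
      frozenset({"Milk", "Diaper"}): 3,
      frozenset({"Milk", "Beer"}): 2,
      frozenset({"Diaper", "Beer"}): 3,
      frozenset({"Milk"}): 4,
      frozenset({"Diaper"}): 4,
      frozenset({"Beer"}): 3}

for X, Y, c in generate_rules({"Milk", "Diaper", "Beer"}, sc, minconf=0.6):
    print(X, "->", Y, f"(c = {c:.2f})")
```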
Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat • Association Rule Mining
Frequent Itemset Generation. Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation • Brute-force approach: • Each itemset in the lattice is a candidate frequent itemset • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw), with N transactions, M candidates, and w the maximum transaction width ⇒ Expensive since M = 2^d!!!
Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Complete search: M = 2^d • Use pruning techniques to reduce M • Reduce the number of transactions (N) • Reduce the size of N as the size of the itemset increases • Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction
Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support
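Written out in standard notation (the formula is implied by the slide rather than printed on it):

```latex
\forall X, Y :\ (X \subseteq Y) \implies s(X) \ge s(Y)
```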
Illustrating Apriori Principle [Lattice diagram: an itemset found to be infrequent has all of its supersets pruned from the search]
Illustrating Apriori Principle (minimum support = 3)
Items (1-itemsets): Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1
Pairs (2-itemsets): {Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3, {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3 (no need to generate candidates involving Coke or Eggs)
Triplets (3-itemsets): {Bread, Milk, Diaper}
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates. With support-based pruning: 6 + 6 + 1 = 13.
Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat
Apriori: worked example (minsup = 2)
DB: 1: B, C; 2: B, C; 3: A, C, D; 4: A, B, C, D; 5: B, D
Level 1: candidates {A}, {B}, {C}, {D}; one scan over the DB gives A: 2, B: 4, C: 4, D: 3, so all four items are frequent.
Level 2: candidates AB, AC, AD, BC, BD, CD; counting gives AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, so AB is pruned and the other five pairs are frequent.
Level 3: candidates ACD and BCD (the only 3-itemsets whose 2-subsets are all frequent); counting gives ACD: 2, BCD: 1, so ACD is frequent, BCD is pruned, and no level-4 candidates remain.
Apriori Algorithm:
k := 1
C1 := { {A} | A is an item }
Repeat until Ck = {}:
  Count the support of each candidate in Ck in one scan over the DB
  Fk := { I ∈ Ck : I is frequent }
  Generate new candidates: Ck+1 := { I : |I| = k+1 and all J ⊂ I with |J| = k are in Fk }
  k := k+1
Return ∪ i=1..k-1 Fi
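The pseudocode translates directly into the runnable sketch below (my own translation, not code from the lecture); on the five-transaction example DB it reproduces the trace shown above.

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent itemset mining, following the pseudocode.
    db: list of sets of items; returns {frozenset: support count}."""
    items = {i for t in db for i in t}
    candidates = [frozenset({i}) for i in items]     # C1
    frequent = {}
    k = 1
    while candidates:
        # One scan over the DB counts all candidates of size k.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        fk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(fk)
        # Ck+1: all (k+1)-sets whose k-subsets are all frequent.
        candidates = [
            frozenset(i)
            for i in combinations(sorted({x for f in fk for x in f}), k + 1)
            if all(frozenset(s) in fk for s in combinations(i, k))
        ]
        k += 1
    return frequent

db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"},
      {"A", "B", "C", "D"}, {"B", "D"}]
print(apriori(db, minsup=2))
# includes A:2, B:4, C:4, D:3, AC:2, AD:2, BC:3, BD:2, CD:2, ACD:2
```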
Association Rule Mining • Definition • Frequent itemsets • Association rules • Frequent itemset mining • breadth-first Apriori • depth-first Eclat
Depth-first strategy • Recursive procedure FSET(DB) = frequent sets in DB • Based on divide-and-conquer: • Count the frequency of all items • Let D be a frequent item • FSET(DB) = frequent sets with item D + frequent sets without item D
Depth-first strategy. DB: 1: B, C; 2: B, C; 3: A, C, D; 4: A, B, C, D; 5: B, D • Frequent items: A, B, C, D • Frequent sets with D: remove transactions without D, and D itself, from the DB; count the frequent sets in this projection: A, B, C, AC; append D: AD, BD, CD, ACD • Frequent sets without D: remove D from all transactions in the DB; find the frequent sets: AC, BC • A runnable sketch of this recursion follows.
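Here is a minimal recursive implementation of this divide-and-conquer scheme (my own sketch; the processing order and all names are assumptions, and practical Eclat implementations use vertical tid-lists rather than explicit projected databases):

```python
def fset(db, minsup, suffix=()):
    """Depth-first frequent itemset mining by database projection.
    db: list of transactions (sets); suffix: items already fixed on
    this branch. Returns {frozenset: support count}."""
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    items = sorted(i for i, n in counts.items() if n >= minsup)
    frequent = {}
    # Process frequent items one at a time, last item first as on the slides.
    for i, item in enumerate(reversed(items)):
        frequent[frozenset((item,) + suffix)] = counts[item]
        # DB[item]: keep transactions containing item, then drop item
        # itself and every item already processed at this level.
        processed = set(list(reversed(items))[:i + 1])
        projection = [t - processed for t in db if item in t]
        frequent.update(fset(projection, minsup, (item,) + suffix))
    return frequent

db = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"},
      {"A", "B", "C", "D"}, {"B", "D"}]
print(fset(db, minsup=2))
# yields A:2, B:4, C:4, D:3, AC:2, BC:3, AD:2, BD:2, CD:2, ACD:2
```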
Depth-First Algorithm: trace (minsup = 2)
DB: 1: B, C; 2: B, C; 3: A, C, D; 4: A, B, C, D; 5: B, D. Item counts: A: 2, B: 4, C: 4, D: 3.
Project on D. DB[D] (transactions containing D, with D removed): 3: A, C; 4: A, B, C; 5: B. Counts: A: 2, B: 2, C: 2.
Project DB[D] on C. DB[CD]: 3: A; 4: A, B. Counts: A: 2, B: 1, so within DB[D] the pair AC is frequent (AC: 2).
Project DB[D] on B. DB[BD]: 4: A. Counts: A: 1; nothing frequent.
Appending D to the frequent sets of DB[D] yields AD: 2, BD: 2, CD: 2, ACD: 2.
Project on C (with D already processed and removed). DB[C]: 1: B; 2: B; 3: A; 4: A, B. Counts: A: 2, B: 3.
Project DB[C] on B. DB[BC]: 4: A (transactions 1 and 2 become empty). Counts: A: 1; nothing frequent, so this branch contributes AC: 2 and BC: 3.