What is an Association Rule? • An association rule is a model that identifies specific types of data associations. • Association rules are built on frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database. • Frequent pattern mining: finding regularities in data, e.g.: • What products are often purchased together? • What are the subsequent purchases after buying a car? • Can we automatically profile customers?
Need for Association Rules • Foundation for many data mining tasks • - Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cubes, … • Broad applications • - Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (clickstream) analysis, …
Basics • Itemset: a set of items, e.g., acm = {a, c, m} • Support of an itemset: the number of transactions containing it, e.g., sup(acm) = 3 • Given min_sup = 3, acm is a frequent pattern • Frequent pattern mining: find all frequent patterns in a database • (Transaction database table lost in extraction)
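The support computation above can be sketched in Python. The small transaction database below is an illustrative assumption (chosen so that sup(acm) = 3, matching the slide):

```python
# Sketch: counting itemset support over a small transaction database.
# The transactions are assumed for illustration, not taken from the slides.
TDB = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]

def support(itemset, tdb):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in tdb if itemset <= t)

print(support({"a", "c", "m"}, TDB))  # acm occurs in transactions 1, 2 and 5
```

With min_sup = 3, the itemset {a, c, m} qualifies as a frequent pattern here.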
Applications • Correlation and causality analysis, mining interesting rules • Max-patterns and frequent closed itemsets • Constraint-based mining • Sequential patterns • Periodic patterns • Computing iceberg cubes
Frequent Pattern Mining Methods • Apriori and its variations/improvements • Mining frequent patterns without candidate generation • Mining max-patterns and closed itemsets • Mining multi-dimensional, multi-level frequent patterns with flexible support constraints • Interestingness: correlation and causality
Apriori: Candidate Generation-and-Test • Any subset of a frequent itemset must also be frequent — an anti-monotone property • - A transaction containing {ATM card, PAN card, debit card} also contains {ATM card, PAN card} • - So if {ATM card, PAN card, debit card} is frequent, {ATM card, PAN card} must also be frequent • No superset of an infrequent itemset should be generated or tested • - Many item combinations can be pruned
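The anti-monotone property translates directly into a pruning check. A minimal sketch, with the card itemsets assumed for illustration:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Prune test: if any (k-1)-subset of `candidate` is missing from the
    frequent (k-1)-itemsets, the candidate itself cannot be frequent."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets
L2 = {frozenset(p) for p in [("ATM card", "PAN card"),
                             ("ATM card", "debit card"),
                             ("PAN card", "debit card")]}
c = {"ATM card", "PAN card", "debit card"}
print(has_infrequent_subset(c, L2))  # every 2-subset is frequent, so keep c
```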
Apriori-based Mining • Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and • Test the candidates against the DB
Apriori Algorithm • A level-wise, candidate-generation-and-test approach (Agrawal and Srikant, 1994) • Flow: Database D → scan D → 1-candidates → frequent 1-itemsets → 2-candidates → scan D → frequent 2-itemsets → 3-candidates → scan D → frequent 3-itemsets → …
Steps: • Ck: candidate itemsets of size k • Lk: frequent itemsets of size k • L1 = {frequent items} • for (k = 1; Lk ≠ ∅; k++) do • - Ck+1 = candidates generated from Lk • - for each transaction t in the database, increment the count of all candidates in Ck+1 that are contained in t • - Lk+1 = candidates in Ck+1 with at least min_support • return ∪k Lk
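The steps above can be sketched end-to-end in Python. This is a minimal, unoptimized version; the toy database and the frozenset encoding are assumptions:

```python
from itertools import combinations

def apriori(tdb, min_sup):
    """Level-wise candidate-generation-and-test, following the steps above.
    Transactions are sets; itemsets are frozensets (an assumed encoding)."""
    support = lambda c: sum(1 for t in tdb if c <= t)
    items = {i for t in tdb for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    result, k = set(Lk), 1
    while Lk:
        # C(k+1): self-join Lk, keeping only unions of size k+1
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune candidates that have an infrequent k-subset, then count support
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in Ck1 if support(c) >= min_sup}
        result |= Lk
        k += 1
    return result

tdb = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(sorted(map(sorted, apriori(tdb, min_sup=3))))
```

With min_sup = 3, all singletons and all pairs are frequent here, but {a, b, c} (support 2) is not.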
Details of Apriori • How to generate candidates? • - Step 1: self-join Lk-1 • - Step 2: pruning • How to count supports of candidates?
How to Generate Candidates? • Suppose the items in each itemset are listed in lexicographic order • Step 1: self-join Lk-1 • INSERT INTO Ck • SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1 • FROM Lk-1 p, Lk-1 q • WHERE p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1 • Step 2: pruning • - For each itemset c in Ck do • - For each (k-1)-subset s of c do: if s is not in Lk-1, then delete c from Ck
Example • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • - abcd from abc and abd • - acde from acd and ace • Pruning: • - acde is removed because ade is not in L3 • C4 = {abcd}
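The join-and-prune example above can be reproduced in code. A sketch of the SQL-style generation, with itemsets kept as sorted tuples so the join condition (equal first k-2 items, last item of p less than last item of q) applies directly:

```python
from itertools import combinations

def apriori_gen(Lk_minus_1, k):
    """Self-join L(k-1) with itself, then prune candidates that have
    an infrequent (k-1)-subset."""
    Ck = set()
    prev = set(Lk_minus_1)
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # join condition from the SQL on the previous slide
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                c = p + (q[k - 2],)
                # prune: every (k-1)-subset of c must be in L(k-1)
                if all(s in prev for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(apriori_gen(L3, 4))  # abcd survives; acde is pruned since ade is not in L3
```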
How to Count Supports of Candidates? • Why is counting supports of candidates a problem? • - The total number of candidates can be huge • - One transaction may contain many candidates • Method: • - Candidate itemsets are stored in a hash tree • - A leaf node of the hash tree contains a list of itemsets and counts • - An interior node contains a hash table • - Subset function: finds all the candidates contained in a transaction
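A flat version of the subset function can illustrate the idea. This sketch enumerates k-subsets of the transaction directly instead of walking a hash tree (the tree avoids enumerating subsets that cannot match); the candidate itemsets are taken for illustration:

```python
from itertools import combinations

def subset_count(transaction, candidates, counts, k):
    """Simplified subset function: enumerate all k-subsets of the
    transaction and bump the count of those that are candidates."""
    for s in combinations(sorted(transaction), k):
        if s in candidates:
            counts[s] = counts.get(s, 0) + 1
    return counts

# Hypothetical candidate 3-itemsets, matched against transaction {1,2,3,5,6}
candidates = {(1, 2, 4), (1, 3, 6), (3, 5, 6)}
counts = subset_count({1, 2, 3, 5, 6}, candidates, {}, 3)
print(counts)  # only (1,3,6) and (3,5,6) are contained in the transaction
```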
Example: Subset Function • A hash tree over candidate 3-itemsets, branching at each level by a hash on the item (1, 4, 7 / 2, 5, 8 / 3, 6, 9); the transaction {1, 2, 3, 5, 6} is matched against the tree to find all candidates it contains. (Hash-tree diagram lost in extraction.)
Challenges in Frequent Pattern Mining • Challenges: • - Multiple scans of the transaction database • - Huge number of candidates • - Tedious workload of support counting for candidates • Improving Apriori: general ideas • - Reduce the number of transaction database scans • - Shrink the number of candidates • - Facilitate support counting of candidates
Reduce the Number of Candidates • A hash bucket whose count < min_sup means every candidate hashed to that bucket is infrequent • Candidates: a, b, c, d, e • Hash entries: {ab, ad, ae}, {bd, be, de}, … • Frequent 1-itemsets: a, b, d, e • - If the summed count of bucket {ab, ad, ae} < min_sup, then ab need not be a candidate 2-itemset
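The hash-based pruning idea can be sketched as follows. The bucket function and the toy data are assumptions; real hash-based pruning (as in DHP) uses the same principle at scale:

```python
from itertools import combinations

def surviving_pairs(tdb, min_sup, n_buckets=7):
    """While scanning for 1-itemsets, hash every 2-itemset of each
    transaction into a bucket; a pair whose bucket total is below
    min_sup can never be a frequent candidate 2-itemset."""
    bucket_of = lambda p: (p[0] * 10 + p[1]) % n_buckets  # assumed hash
    buckets = [0] * n_buckets
    seen = set()
    for t in tdb:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
            seen.add(pair)
    return {p for p in seen if buckets[bucket_of(p)] >= min_sup}

tdb = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 4}]
print(surviving_pairs(tdb, min_sup=2))  # (2,3) and (2,4) are pruned early
```

Note the pruning is conservative: colliding pairs share a bucket, so a surviving pair still needs its true support counted.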
Partition: Scan Database Only Twice • Partition the database into n partitions • If itemset X is frequent, X is frequent in at least one partition • - Scan 1: partition the database and find local frequent patterns • - Scan 2: consolidate global frequent patterns
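The key property behind partitioning — a globally frequent itemset (under relative support) must be locally frequent in at least one partition — can be checked directly. The partitions below are illustrative:

```python
def support(itemset, tdb):
    """Number of transactions in `tdb` containing `itemset`."""
    return sum(1 for t in tdb if itemset <= t)

def locally_frequent_somewhere(itemset, partitions, min_sup_ratio):
    """If sup(X)/|DB| >= ratio, then by pigeonhole sup_i(X)/|P_i| >= ratio
    in at least one partition P_i -- so scan 1 cannot miss a globally
    frequent itemset, and scan 2 only filters local false positives."""
    return any(support(itemset, p) >= min_sup_ratio * len(p)
               for p in partitions)

parts = [[{"a", "b"}, {"a"}], [{"b"}, {"c"}]]
print(locally_frequent_somewhere({"a"}, parts, 0.5))  # globally 2/4 frequent
```

The converse does not hold: a locally frequent itemset such as {c} here may fail the global threshold, which is exactly what the second scan eliminates.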
Sampling for Frequent Patterns • Select a sample of the original database and mine frequent patterns within the sample using Apriori • Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked • - Example: check abcd instead of ab, ac, …, etc. • - Scan the database again to find missed frequent patterns
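The sample-then-verify procedure can be sketched like this. Two simplifying assumptions: a deterministic slice stands in for a random sample, and only pairs are mined rather than running full Apriori with a border check:

```python
from itertools import combinations

def support(itemset, tdb):
    """Number of transactions in `tdb` containing `itemset`."""
    return sum(1 for t in tdb if itemset <= t)

def sample_and_verify(tdb, min_sup, sample):
    """Mine candidate frequent pairs in the sample with a proportionally
    lowered threshold, then make one full scan to keep only the pairs
    that are truly frequent in the whole database."""
    lowered = max(1, min_sup * len(sample) // len(tdb))
    items = sorted({i for t in sample for i in t})
    cand = {frozenset(p) for p in combinations(items, 2)
            if support(frozenset(p), sample) >= lowered}
    # the single verification scan over the full database
    return {c for c in cand if support(c, tdb) >= min_sup}

tdb = [{1, 2}, {1, 2}, {1, 3}, {2, 3}, {1, 2}, {1, 3}]
print(sample_and_verify(tdb, min_sup=3, sample=tdb[::2]))
```

Patterns the sample misses entirely would require the second scan mentioned on the slide; this sketch covers only the verification step.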