(Some material adapted from: Mining Sequential Patterns by Karuna Pande Joshi) Association Rule Mining
Terminology • Transaction • Item • Itemset
Association Rules Let U be a set of items and let X, Y ⊆ U, with X ∩ Y = ∅. An association rule is an expression of the form X → Y, whose meaning is: if the elements of X occur in some context, then so do the elements of Y.
Quality Measures Let T be the set of all transactions. The following statistical quantities are relevant to association rule mining:
• support(X) = |{t ∈ T : X ⊆ t}| / |T| — the percentage of all transactions containing itemset X
• support(X → Y) = |{t ∈ T : X ∪ Y ⊆ t}| / |T| — the percentage of all transactions containing both itemsets X and Y
• confidence(X → Y) = |{t ∈ T : X ∪ Y ⊆ t}| / |{t ∈ T : X ⊆ t}| — the percentage of transactions containing itemset X that also contain itemset Y, i.e., how good itemset X is at predicting itemset Y
Learning Associations The purpose of association rule learning is to find “interesting” rules, i.e., rules that meet the following two user-defined conditions:
• support(X → Y) ≥ MinSupport
• confidence(X → Y) ≥ MinConfidence
Itemsets
• Frequent itemset: an itemset whose support is at least MinSupport (denoted Lk, where k is the size of the itemset) — i.e., a high percentage of transactions contain the full itemset
• Candidate itemset: a potentially frequent itemset (denoted Ck, where k is the size of the itemset)
Basic Idea Generate all frequent itemsets satisfying the condition on minimum support Build all possible rules from these itemsets and check them against the condition on minimum confidence All the rules above the minimum confidence threshold are returned for further evaluation
AprioriAll (I) • L1 ← ∅ • For each item Ij ∈ I • count({Ij}) ← |{Ti : Ij ∈ Ti}| (the number of transactions containing item Ij) • If count({Ij}) ≥ MinSupport × m • L1 ← L1 ∪ {({Ij}, count({Ij}))} (if the count is big enough, add the item and its count to the list L1) • k ← 2 • While Lk-1 ≠ ∅ • Lk ← ∅ • For each (l1, count(l1)) ∈ Lk-1 • For each (l2, count(l2)) ∈ Lk-1 • If (l1 = {j1, …, jk-2, x} ∧ l2 = {j1, …, jk-2, y} ∧ x ≠ y) • l ← {j1, …, jk-2, x, y} • count(l) ← |{Ti : l ⊆ Ti}| • If count(l) ≥ MinSupport × m • Lk ← Lk ∪ {(l, count(l))} • k ← k + 1 • Return L1 ∪ L2 ∪ … ∪ Lk-1
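The pseudocode above can be rendered as a minimal, unoptimized Python sketch. It follows the same structure — count single items, then repeatedly join (k−1)-itemsets and re-count — but represents itemsets as frozensets rather than sorted lists; the sample transactions are an assumption for illustration.

```python
def apriori(transactions, min_support):
    """Sketch of the AprioriAll frequent-itemset loop above.
    Returns a dict mapping each frequent itemset (frozenset) to its count."""
    m = len(transactions)
    min_count = min_support * m

    # L1: count single items, keep those meeting the support threshold
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of (k-1)-itemsets that differ in one item
        prev = list(frequent)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev))
                      for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Count each candidate against all transactions; keep the frequent ones
        frequent = {}
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= min_count:
                frequent[cand] = count
        result.update(frequent)
        k += 1
    return result

# Hypothetical transaction database (assumed for illustration)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
L = apriori(transactions, min_support=0.6)
```

With MinSupport = 0.6 (count ≥ 3 of 5), all four single items survive, and {bread, milk} is the only frequent 2-itemset.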
Rule Generation • Look at the set {a,d,e} • It has six candidate association rules: • {a} → {d,e}, confidence: support({a,d,e}) / support({a}) = 0.571 • {d,e} → {a}, confidence: support({a,d,e}) / support({d,e}) = 1.000 • {d} → {a,e}, confidence: support({a,d,e}) / support({d}) = 0.667 • {a,e} → {d}, confidence: support({a,d,e}) / support({a,e}) = 0.667 • {e} → {a,d}, confidence: support({a,d,e}) / support({e}) = 0.571 • {a,d} → {e}, confidence: support({a,d,e}) / support({a,d}) = 0.800
Rule Generation • Look at the set {a,d,e}. Let MinConfidence = 0.800 • Of its six candidate association rules: • {d,e} → {a}, confidence: support({a,d,e}) / support({d,e}) = 1.000 • {a,e} → {d}, confidence: support({a,d,e}) / support({a,e}) = 0.667 • {a,d} → {e}, confidence: support({a,d,e}) / support({a,d}) = 0.800 • {d} → {a,e}, confidence: support({a,d,e}) / support({d}) = 0.667 • Selected rules: • {d,e} → {a} and {a,d} → {e}
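The rule-generation step above can be sketched as follows. The support counts are a hypothetical assignment chosen to reproduce the confidences on the slide (e.g. 4/7 ≈ 0.571, 4/5 = 0.800); the original transaction database is not shown in the slides.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_counts, min_confidence):
    """Enumerate every X -> Y split of a frequent itemset, as on the slide
    above, keeping rules whose confidence meets min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            # confidence = support(full itemset) / support(left-hand side)
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= min_confidence:
                rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical support counts, assumed so the ratios match the slide
support_counts = {
    frozenset("a"): 7, frozenset("d"): 6, frozenset("e"): 7,
    frozenset("ad"): 5, frozenset("ae"): 6, frozenset("de"): 4,
    frozenset("ade"): 4,
}
selected = rules_from_itemset("ade", support_counts, min_confidence=0.8)
# selected holds the two rules from the slide: {a,d} -> {e} and {d,e} -> {a}
```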
Summary Apriori is a rather simple algorithm that discovers useful and interesting patterns It is widely used It has been extended to create collaborative filtering algorithms to provide recommendations
References
• R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 1994
• R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. 1993 ACM SIGMOD Int. Conf. on Management of Data, 1993
• P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Education Inc., 2006, Chapter 6