
Presentation Transcript


  1. Data Science and Big Data Analytics – Chapter 5: Advanced Analytical Theory and Methods: Association Rules – Charles Tappert, Seidenberg School of CSIS, Pace University

  2. Chapter Sections • 5.1 Overview • 5.2 Apriori Algorithm • 5.3 Evaluation of Candidate Rules • 5.4 Applications of Association Rules • 5.5 Example: Transactions in a Grocery Store • 5.6 Validation and Testing • 5.7 Diagnostics

  3. 5.1 Overview • Association rules method • Unsupervised learning method • Descriptive (not predictive) method • Used to find hidden relationships in data • The relationships are represented as rules • Questions association rules might answer • Which products tend to be purchased together? • What products do similar customers tend to buy?

  4. 5.1 Overview • Example – general logic of association rules

  5. 5.1 Overview • Rules have the form X -> Y • When X is observed, Y also tends to be observed • Itemset • Collection of items or entities • k-itemset = {item 1, item 2, …, item k} • Examples • Items purchased in one transaction • Set of hyperlinks clicked by a user in one session (see the sketch below)
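A minimal sketch (not from the slides) of what itemsets look like as data in R, using the arules package introduced in Section 5.5; the basket contents are hypothetical:

library(arules)

# Three hypothetical transactions; each one is an itemset
baskets <- list(
  c("milk", "bread"),          # a 2-itemset
  c("milk", "eggs", "bread"),  # a 3-itemset
  c("eggs", "bread")           # a 2-itemset
)
trans <- as(baskets, "transactions")  # coerce to arules' transactions class
inspect(trans)                        # prints the items in each transaction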

  6. 5.1 Overview – Apriori Algorithm • Apriori is the most fundamental algorithm • Given itemset L, support of L is the percent of transactions that contain L (computed in the sketch below) • Frequent itemset – items appear together “often enough” • Minimum support defines “often enough” (% transactions) • If an itemset is frequent, then any subset is frequent
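A quick sketch of the support computation in base R (the basket contents are made up):

# Support of an itemset L = (# transactions containing L) / (# transactions)
baskets <- list(
  c("milk", "bread"),
  c("milk", "eggs", "bread"),
  c("eggs", "bread"),
  c("milk")
)
L <- c("milk", "bread")
support <- mean(sapply(baskets, function(t) all(L %in% t)))
support  # milk and bread co-occur in 2 of 4 transactions -> 0.5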

  7. 5.1 Overview – Apriori Algorithm • If {B,C,D} frequent, then all subsets frequent

  8. 5.2 Apriori Algorithm – Frequent = minimum support • Bottom-up iterative algorithm • Identify the frequent (min support) 1-itemsets • Frequent 1-itemsets are paired into 2-itemsets, and the frequent 2-itemsets are identified, etc. (see the sketch below) • Definitions for next slide • D = transaction database • d = minimum support threshold • N = maximum length of itemset (optional parameter) • Ck = set of candidate k-itemsets • Lk = set of k-itemsets with minimum support
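A compact base-R sketch of the level-wise loop in the slide's notation (D, d, N, Ck, Lk); this is illustrative only, not the arules implementation:

# D: list of transactions; d: minimum support; N: maximum itemset length
apriori_sketch <- function(D, d, N) {
  supp <- function(I) mean(sapply(D, function(t) all(I %in% t)))
  items <- sort(unique(unlist(D)))
  Lk <- Filter(function(I) supp(I) >= d, lapply(items, c))  # L1: frequent 1-itemsets
  result <- Lk
  k <- 1
  while (length(Lk) > 0 && k < N) {
    k <- k + 1
    # Ck: extend each frequent (k-1)-itemset by one new item
    Ck <- unique(do.call(c, lapply(Lk, function(I)
      lapply(setdiff(items, I), function(i) sort(c(I, i))))))
    Lk <- Filter(function(I) supp(I) >= d, Ck)  # keep candidates meeting d
    result <- c(result, Lk)
  }
  result
}

apriori_sketch(list(c("a","b","c"), c("a","b"), c("b","c"), c("a","c")),
               d = 0.5, N = 3)  # returns every itemset with support >= 0.5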

  9. 5.2 Apriori Algorithm [Figure: the Apriori algorithm pseudocode, using the definitions from the previous slide]

  10. 5.3 Evaluation of Candidate Rules – Confidence • Frequent itemsets can form candidate rules • Confidence measures the certainty of a rule: Confidence(X -> Y) = Support(X and Y) / Support(X) (see the sketch below) • Minimum confidence – predefined threshold • Problem with confidence • Given a rule X -> Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y • Cannot tell if a rule contains true implication
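A numeric sketch of the confidence formula, reusing the counts from the lift example on the next slide:

# Confidence(X -> Y) = Support(X and Y) / Support(X)
# From the next slide: {milk, eggs} in 300 of 1000 transactions, {milk} in 500
support_xy <- 300 / 1000
support_x  <- 500 / 1000
support_xy / support_x  # 0.6: 60% of baskets with milk also contain eggs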

  11. 5.3 Evaluation of Candidate Rules – Lift • Lift measures how much more often X and Y occur together than expected if statistically independent • Lift = 1 if X and Y are statistically independent • Lift > 1 indicates the degree of usefulness of the rule • Example – in 1000 transactions, • If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk -> eggs) = 0.3/(0.5*0.4) = 1.5 • If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk -> bread) = 0.4/(0.5*0.4) = 2.0
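The same arithmetic as a small R helper, making Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y)) explicit:

lift <- function(support_xy, support_x, support_y) {
  support_xy / (support_x * support_y)
}
lift(0.3, 0.5, 0.4)  # milk -> eggs: 1.5
lift(0.4, 0.5, 0.4)  # milk -> bread: 2.0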

  12. 5.3 Evaluation of Candidate Rules – Leverage • Leverage measures the difference in the probability of X and Y appearing together compared to statistical independence • Leverage = 0 if X and Y are statistically independent • Leverage > 0 indicates degree of usefulness of rule • Example – in 1000 transactions, • If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk -> eggs) = 0.3 - 0.5*0.4 = 0.1 • If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk -> bread) = 0.4 - 0.5*0.4 = 0.2
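And the corresponding helper for Leverage(X -> Y) = Support(X and Y) - Support(X) * Support(Y):

leverage <- function(support_xy, support_x, support_y) {
  support_xy - support_x * support_y
}
leverage(0.3, 0.5, 0.4)  # milk -> eggs: 0.1
leverage(0.4, 0.5, 0.4)  # milk -> bread: 0.2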

  13. 5.4 Applications of Association Rules • The term market basket analysis refers to a specific implementation of association rules • For better merchandising – products to include/exclude from inventory each month • Placement of products within related products • Association rules also used for • Recommender systems – Amazon, Netflix • Clickstream analysis from web usage log files • Website visitors to page X click on links A,B,C more than on links D,E,F

  14. 5.5 Example: Grocery Store Transactions – 5.5.1 The Groceries Dataset
Install the packages from RStudio's menu (Packages -> Install -> arules, arulesViz) or directly from the console:
> install.packages(c("arules", "arulesViz"))
> library('arules')
> library('arulesViz')
> data(Groceries)
> summary(Groceries) # indicates 9835 transactions
The dataset Groceries has class transactions, containing 3 slots
• transactionInfo # data frame with vectors of the same length as the number of transactions
• itemInfo # data frame storing item labels
• data # binary incidence matrix of item labels in transactions
> Groceries@itemInfo[1:10,]
> apply(Groceries@data[,10:20], 2, function(r) paste(Groceries@itemInfo[r,"labels"], collapse=", "))
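One optional exploration step (not in the slides): arules provides itemFrequencyPlot() to show the most common individual items before mining:

# Bar chart of the 10 most frequent items, as relative support
itemFrequencyPlot(Groceries, topN = 10, type = "relative")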

  15. 5.5 Example: Grocery Store Transactions – 5.5.2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below runs each iteration separately. Assume a minimum support threshold of 0.02 (0.02 * 9835 ≈ 197 transactions); the three passes yield 59 + 61 + 2 = 122 itemsets in total.
First, get itemsets of length 1
> itemsets <- apriori(Groceries, parameter=list(minlen=1, maxlen=1, support=0.02, target="frequent itemsets"))
> summary(itemsets) # found 59 itemsets
> inspect(head(sort(itemsets, by="support"), 10)) # lists top 10 by support
Second, get itemsets of length 2
> itemsets <- apriori(Groceries, parameter=list(minlen=2, maxlen=2, support=0.02, target="frequent itemsets"))
> summary(itemsets) # found 61 itemsets
> inspect(head(sort(itemsets, by="support"), 10)) # lists top 10 by support
Third, get itemsets of length 3
> itemsets <- apriori(Groceries, parameter=list(minlen=3, maxlen=3, support=0.02, target="frequent itemsets"))
> summary(itemsets) # found 2 itemsets
> inspect(head(sort(itemsets, by="support"), 10)) # lists both (only 2 found)
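The per-length calls above are for illustration; a single call with minlen = 1 and maxlen = 3 should return all 122 itemsets at once:

itemsets <- apriori(Groceries,
                    parameter = list(minlen = 1, maxlen = 3,
                                     support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)  # 59 + 61 + 2 = 122 frequent itemsets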

  16. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
The Apriori algorithm will now generate rules. Set the minimum support threshold to 0.001 (allowing more rules, presumably for the scatterplot) and the minimum confidence threshold to 0.6 to generate 2,918 rules.
> rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.6, target="rules"))
> summary(rules) # finds 2918 rules
> plot(rules) # displays scatterplot
The scatterplot shows that the highest lift occurs at low support and low confidence.
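The default scatterplot puts support and confidence on the axes with lift as the shading; arulesViz lets these be swapped, e.g. (a sketch of the plot options):

# Put lift on the y-axis and shade points by confidence instead
plot(rules, measure = c("support", "lift"), shading = "confidence")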

  17. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization [Figure: scatterplot of the 2,918 rules from plot(rules)]

  18. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Get a scatterplot matrix to compare the support, confidence, and lift of the 2,918 rules
> plot(rules@quality) # displays scatterplot matrix
Lift is proportional to confidence, with several linear groupings. Note that Lift = Confidence/Support(Y), so when the support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).

  19. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization [Figure: scatterplot matrix from plot(rules@quality)]

  20. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Compute 1/Support(Y), which is the slope
> slope <- sort(round(rules@quality$lift / rules@quality$confidence, 2))
Display the number of times each slope appears in the dataset
> unlist(lapply(split(slope, f=slope), length))
Display the top 10 rules sorted by lift
> inspect(head(sort(rules, by="lift"), 10))
The rule {Instant food products, soda} -> {hamburger meat} has the highest lift, about 19 (page 154)

  21. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Find the rules with confidence above 0.9
> confidentRules <- rules[quality(rules)$confidence > 0.9]
> confidentRules # set of 127 rules
Plot a matrix-based visualization of the LHS vs. RHS of the rules
> plot(confidentRules, method="matrix", measure=c("lift","confidence"), control=list(reorder=TRUE))
The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds

  22. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization [Figure: matrix-based visualization of the 127 high-confidence rules]

  23. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization
Visualize the top 5 rules with the highest lift
> highLiftRules <- head(sort(rules, by="lift"), 5)
> plot(highLiftRules, method="graph", control=list(type="items"))
In the graph, an arrow always points from an item on the LHS to an item on the RHS. For example, the arrows connecting ham, processed cheese, and white bread suggest the rule {ham, processed cheese} -> {white bread}. The size of a circle indicates support, and its shade represents lift.

  24. 5.5 Example: Grocery Store Transactions – 5.5.3 Rule Generation and Visualization [Figure: graph visualization of the top 5 rules by lift]

  25. 5.6 Validation and Testing • The frequent and high-confidence itemsets are found using pre-specified minimum support and minimum confidence levels • Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones (see the sketch below) • However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions • E.g., rules like {paper} -> {pencil} are not interesting/meaningful • Incorporating subjective knowledge requires domain experts • Good rules provide valuable insights for institutions to improve their business operations
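By default apriori() reports support, confidence, and lift; leverage (and many other measures) can be added with arules' interestMeasure(). A sketch:

# Append leverage to each rule's quality measures
quality(rules)$leverage <- interestMeasure(rules, measure = "leverage",
                                           transactions = Groceries)
inspect(head(sort(rules, by = "leverage"), 5))  # top 5 rules by leverage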

  26. 5.7 Diagnostics • Although minimum support is pre-specified in Phases 3 and 4 (model planning and model building), this level can be adjusted to target the desired number of rules – variants/improvements of Apriori are available • For large datasets the Apriori algorithm can be computationally expensive – efficiency improvements include • Partitioning • Sampling (see the sketch below) • Transaction reduction • Hash-based itemset counting • Dynamic itemset counting
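A minimal sketch of the sampling idea (the sample size of 2,000 is arbitrary): mine a random subset of the transactions first to gauge thresholds before a full run:

set.seed(1)  # for reproducibility
# Mine a 2,000-transaction random sample instead of all 9,835
sampled <- Groceries[sample(length(Groceries), 2000)]
rulesSample <- apriori(sampled,
                       parameter = list(support = 0.001, confidence = 0.6,
                                        target = "rules"))
summary(rulesSample)  # rule counts and measures will vary with the sample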
