210 likes | 406 Views
Association rule discovery ( Market basket-analysis). MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining . http://www-users.cs.umn.edu/~kumar/dmbook/. What is Association Mining?.
E N D
Association rule discovery(Market basket-analysis) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.http://www-users.cs.umn.edu/~kumar/dmbook/
What is Association Mining? • Discovering interesting relationships between variables in large databases (http://en.wikipedia.org/wiki/Association_rule_learning) • Find out which items predict the occurrence of other items • Also known as “affinity analysis” or “market basket” analysis
Examples of Association Mining • Market basket analysis/affinity analysis • What products are bought together? • Where to place items on grocery store shelves? • Amazon’s recommendation engine • “People who bought this product also bought…” • Telephone calling patterns • Who do a set of people tend to call most often? • Social network analysis • Determine who you “may know”
Market-Basket Analysis • Large set of items • e.g., things sold in a supermarket • Large set of baskets • e.g., things one customer buys in one visit • Supermarket chains keep terabytes of this data • Informs store layout • Suggests “tie-ins” • Place spaghetti sauce in the pasta aisle • Put diapers on sale and raise the price of beer
Market-Basket Transactions Association Rules from these transactions X Y (antecedent consequent) {Diapers} {Beer},{Milk, Bread} {Diapers} {Beer, Bread} {Milk},{Bread} {Milk, Diapers}
Core idea: The itemset • Itemset: A group of items of interest{Milk, Beer, Diapers} • This itemset is a “3 itemset” because itcontains…3 items! • An association rule expresses related itemsets • X Y, where X and Y are two itemsets • {Milk, Diapers} {Beer} means“when you have milk and diapers, you also have beer)
Support • Support count () • Frequency of occurrence of an itemset • {Milk, Beer, Diapers} = 2 (i.e., it’s in baskets 4 and 5) • Support (s) • Fraction of transactions that contain allitemsets in the relationship X Y • s({Milk, Diapers, Beer}) = 2/5 = 0.4 • You can calculate support for both X and Y separately • Support for X = 3/5 = 0.6; Support for Y = 3/5 = 0.6 X Y
Confidence • Confidence is the strength of the association • Measures how often items in Y appear in transactions that contain X c must be between 0 and 11 is a complete association 0 is no association This says 67% of the times when you have milk and diapers in the itemset you also have beer!
Some sample rules All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer}
But don’t blindly follow the numbers • Rules originating from the same itemset (XY) will have identical support • Since the total elements (XY) are always the same • But they can have different confidence • Depending on what is contained in X and Y • High confidence suggests a strong association • But this can be deceptive • Consider {Bread}{Diapers} • Support for the total itemset is 0.6 (3/5) • And confidence is 0.75 (3/4) – pretty high • But is this just because both are frequently occurring items (s=0.8)? • You’d almost expect them to show up in the same baskets by chance
Lift • Takes into account how co-occurrence differs from what is expected by chance • i.e., if items were selected independently from one another • Based on the support metric Support for total itemset X and Y Support for X times support for Y
Lift Example • What’s the lift of the association rule{Milk, Diapers} {Beer} • So X = {Milk, Diapers} and Y = {Beer}s({Milk, Diapers, Beer}) = 2/5 = 0.4s({Milk, Diapers}) = 3/5 = 0.6s({Beer}) = 3/5 = 0.6So When Lift > 1, the occurrence of X Y together is more likely than what you would expect by chance
Another example Are people more likely to have a checking account if they have a savings account? Support ({Savings} {Checking}) = 5000/10000 = 0.5 Support ({Savings}) = 6000/10000 = 0.6 Support ({Checking}) = 8500/10000 = 0.85 Confidence ({Savings} {Checking}) = 5000/6000 = 0.83 Answer: No In fact, it’s slightly less than what you’d expect by chance!
Selecting the rules • We know how to calculate the measures for each rule • Support • Confidence • Lift • Then we set up thresholds for the minimum rule strength we want to accept • For support – called minsup • For confidence – called minconf
But this can be overwhelming • Imagine all the combinations possiblein your local grocery store • Every product matched with every combination of other products • Tens of thousands of possible rule combinations So where do you start?
Once you are confident in a rule, take action {Milk, Diapers} {Beer}