CS 349: Market Basket Data Mining All about beer and diapers.
Overview • What is Data Mining? • Market Baskets • How fast does it run? • What does it do?
What is Data Mining? • Statistics • Data Analysis • Machine Learning • Databases
Types of Data that can be Mined • market basket • classification • time series • text
Applications of Market Basket • supermarkets • data with boolean attributes • census data: single vs married • word occurrence
Some Measures of the Data • number of baskets: N • number of items: M • average number of items per basket: W (width)
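A minimal sketch of these measures, assuming baskets are stored as a list of Python sets of item names; the item names and representation are illustrative, not part of the course material.

    baskets = [{"beer", "diapers"}, {"diapers", "milk", "bread"}, {"beer", "chips"}]

    N = len(baskets)                        # number of baskets
    M = len(set().union(*baskets))          # number of distinct items
    W = sum(len(b) for b in baskets) / N    # average basket width
    print(N, M, W)                          # 3 5 2.333...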
Aspects of Market Basket Mining • What is interesting? • How do you make it run fast?
What is Interesting? (first try) • Itemset I = a set of items • association rule: A -> B • support(I) = fraction of baskets that contain I • confidence(A -> B) = probability that a basket contains B given that it contains A (a small example follows below)
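A minimal sketch of these two definitions, again representing baskets as sets of item names; the function names and example data are illustrative, not from the lecture.

    def support(itemset, baskets):
        """Fraction of baskets that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= basket for basket in baskets) / len(baskets)

    def confidence(A, B, baskets):
        """Estimated P(B | A): support of A union B divided by support of A."""
        return support(set(A) | set(B), baskets) / support(A, baskets)

    baskets = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "chips"},
    ]
    print(support({"beer", "diapers"}, baskets))       # 0.5
    print(confidence({"diapers"}, {"beer"}, baskets))  # 0.666...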
How do you find Itemsets with high support? • Apriori algorithm, Agrawal & Srikant (1994) • Find all itemsets with support > s • 1-itemset = itemset with 1 item, …, k-itemset = itemset with k items • large itemset = itemset with support > s • candidate itemset = itemset that may have support > s
Apriori Algorithm • start with all 1-itemsets • go through the data, count their support, and find all “large” 1-itemsets • combine them to form “candidate” 2-itemsets • go through the data, count their support, and find all “large” 2-itemsets • combine them to form “candidate” 3-itemsets, and so on (a sketch follows below)
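A small in-memory sketch of this level-wise loop, ignoring the counting and candidate-generation optimizations of the real algorithm; `apriori` and `min_support` are illustrative names, and baskets are again sets of items.

    from itertools import combinations

    def apriori(baskets, min_support):
        """Return {itemset: support} for all itemsets with support > min_support."""
        N = len(baskets)
        # Pass 1: count 1-itemsets and keep the "large" ones.
        counts = {}
        for basket in baskets:
            for item in basket:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        large = {s: c / N for s, c in counts.items() if c / N > min_support}
        result = dict(large)
        k = 2
        while large:
            # Combine large (k-1)-itemsets into candidate k-itemsets.
            prev = list(large)
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            # Prune any candidate with a small (k-1)-subset (the Apriori property).
            candidates = {c for c in candidates
                          if all(frozenset(s) in large for s in combinations(c, k - 1))}
            # One more pass over the data to count candidate support.
            counts = {c: 0 for c in candidates}
            for basket in baskets:
                for c in candidates:
                    if c <= basket:
                        counts[c] += 1
            large = {c: n / N for c, n in counts.items() if n / N > min_support}
            result.update(large)
            k += 1
        return result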
Run Time • k passes over the data, where k is the size of the largest candidate itemset • memory chunking (partitioning) algorithm ==> 2 passes over the data on disk, but multiple passes in memory • Toivonen 1996 gives a statistical sampling technique: 1 + ε passes (but more memory) • Brin et al. 1997, Dynamic Itemset Counting: 1 + ε passes (less memory)
But what is really interesting? • A -> B • Support = P(AB) • Confidence = P(B|A) • Interest = P(AB) / (P(A)P(B)) • Implication Strength = P(A)P(~B) / P(A~B) (an example computation follows below)
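A sketch of these rule measures estimated by relative frequency from basket data; interest is often called lift, and implication strength corresponds to what Brin et al. call conviction. The function name and example values are illustrative.

    def measures(A, B, baskets):
        """Return (support, confidence, interest, implication strength) for A -> B."""
        N = len(baskets)
        A, B = set(A), set(B)
        p_a  = sum(A <= b for b in baskets) / N
        p_b  = sum(B <= b for b in baskets) / N
        p_ab = sum((A | B) <= b for b in baskets) / N
        support    = p_ab
        confidence = p_ab / p_a
        interest   = p_ab / (p_a * p_b)
        p_a_notb   = p_a - p_ab                      # P(A and not B)
        strength   = (p_a * (1 - p_b)) / p_a_notb if p_a_notb else float("inf")
        return support, confidence, interest, strength

    baskets = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "chips"},
    ]
    print(measures({"diapers"}, {"beer"}, baskets))  # approx (0.5, 0.67, 0.89, 0.75)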
But what is really really interesting? • Causality • Surprise
Summary • What is Data Mining? • Market Baskets • Finding Itemsets with high support • Finding Interesting Rules