CS 349: Market Basket Data Mining All about beer and diapers.
Overview • What is Data Mining? • Market Baskets • How fast does it run? • What does it do?
What is Data Mining? • Statistics • Data Analysis • Machine Learning • Databases
Types of Data that can be Mined • market basket • classification • time series • text
Applications of Market Basket • supermarkets • data with boolean attributes • census data: single vs married • word occurrence
Some Measures of the Data • number of baskets: N • number of items: M • average number of items per basket: W (width)
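A minimal sketch of these measures, assuming baskets are stored as a list of Python sets of item names; the item names and representation are illustrative, not part of the course material.

    baskets = [{"beer", "diapers"}, {"diapers", "milk", "bread"}, {"beer", "chips"}]

    N = len(baskets)                        # number of baskets
    M = len(set().union(*baskets))          # number of distinct items
    W = sum(len(b) for b in baskets) / N    # average basket width
    print(N, M, W)                          # 3 5 2.333...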
Aspects of Market Basket Mining • What is interesting? • How do you make it run fast?
What is Interesting? (first try) • Itemset I = a set of items • association rule: A -> B • support(I) = fraction of baskets that contain I • confidence(A -> B) = probability that a basket contains B given that it contains A (a small example follows below)
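A minimal sketch of these two definitions, again representing baskets as sets of item names; the function names and example data are illustrative, not from the lecture.

    def support(itemset, baskets):
        """Fraction of baskets that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= basket for basket in baskets) / len(baskets)

    def confidence(A, B, baskets):
        """Estimated P(B | A): support of A union B divided by support of A."""
        return support(set(A) | set(B), baskets) / support(A, baskets)

    baskets = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "chips"},
    ]
    print(support({"beer", "diapers"}, baskets))       # 0.5
    print(confidence({"diapers"}, {"beer"}, baskets))  # 0.666...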
How do you find Itemsets with high support? • Apriori algorithm, Agrawal & Srikant (1994) • Find all itemsets with support > s • 1-itemset = itemset with 1 item, …, k-itemset = itemset with k items • large itemset = itemset with support > s • candidate itemset = itemset that may have support > s
Apriori Algorithm • start with all 1-itemsets • go through the data, count their support, and find all “large” 1-itemsets • combine them to form “candidate” 2-itemsets • go through the data, count their support, and find all “large” 2-itemsets • combine them to form “candidate” 3-itemsets, and so on (a sketch follows below)
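A small in-memory sketch of this level-wise loop, ignoring the counting and candidate-generation optimizations of the real algorithm; `apriori` and `min_support` are illustrative names, and baskets are again sets of items.

    from itertools import combinations

    def apriori(baskets, min_support):
        """Return {itemset: support} for all itemsets with support > min_support."""
        N = len(baskets)
        # Pass 1: count 1-itemsets and keep the "large" ones.
        counts = {}
        for basket in baskets:
            for item in basket:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        large = {s: c / N for s, c in counts.items() if c / N > min_support}
        result = dict(large)
        k = 2
        while large:
            # Combine large (k-1)-itemsets into candidate k-itemsets.
            prev = list(large)
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            # Prune any candidate with a small (k-1)-subset (the Apriori property).
            candidates = {c for c in candidates
                          if all(frozenset(s) in large for s in combinations(c, k - 1))}
            # One more pass over the data to count candidate support.
            counts = {c: 0 for c in candidates}
            for basket in baskets:
                for c in candidates:
                    if c <= basket:
                        counts[c] += 1
            large = {c: n / N for c, n in counts.items() if n / N > min_support}
            result.update(large)
            k += 1
        return result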
Run Time • k passes over the data, where k is the size of the largest candidate itemset • memory chunking (partitioning) algorithm ==> 2 passes over the data on disk, but multiple passes in memory • Toivonen 1996 gives a statistical sampling technique: 1 + ε passes (but more memory) • Brin et al. 1997, Dynamic Itemset Counting: 1 + ε passes (less memory)
But what is really interesting? • A -> B • Support = P(AB) • Confidence = P(B|A) • Interest = P(AB) / (P(A)P(B)) • Implication Strength = P(A)P(~B) / P(A~B) (an example computation follows below)
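A sketch of these rule measures estimated by relative frequency from basket data; interest is often called lift, and implication strength corresponds to what Brin et al. call conviction. The function name and example values are illustrative.

    def measures(A, B, baskets):
        """Return (support, confidence, interest, implication strength) for A -> B."""
        N = len(baskets)
        A, B = set(A), set(B)
        p_a  = sum(A <= b for b in baskets) / N
        p_b  = sum(B <= b for b in baskets) / N
        p_ab = sum((A | B) <= b for b in baskets) / N
        support    = p_ab
        confidence = p_ab / p_a
        interest   = p_ab / (p_a * p_b)
        p_a_notb   = p_a - p_ab                      # P(A and not B)
        strength   = (p_a * (1 - p_b)) / p_a_notb if p_a_notb else float("inf")
        return support, confidence, interest, strength

    baskets = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "chips"},
    ]
    print(measures({"diapers"}, {"beer"}, baskets))  # approx (0.5, 0.67, 0.89, 0.75)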
But what is really really interesting? • Causality • Surprise
Summary • What is Data Mining? • Market Baskets • Finding Itemsets with high support • Finding Interesting Rules