CS 349: Market Basket Data Mining

andrew

Presentation Transcript


  1. CS 349: Market Basket Data Mining • All about beer and diapers.

  2. Overview • What is Data Mining? • Market Baskets • How fast does it run? • What does it do?

  3. What is Data Mining? • Statistics • Data Analysis • Machine Learning • Databases

  4. Types of Data that can be Mined • market basket • classification • time series • text

  5. Applications of Market Basket • supermarkets • data with boolean attributes • census data: single vs married • word occurrence

  6. Some Measures of the Data • number of baskets: N • number of items: M • average number of items per basket: W (width)
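The three measures on this slide can be computed directly from a list of baskets. The following is a minimal sketch, assuming each basket is stored as a Python set of item names; the toy data and the function name `basket_measures` are made up for illustration and are not from the slides.

```python
# Hypothetical toy data: each basket is a set of item names.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]


def basket_measures(baskets):
    """Return (N, M, W): number of baskets, number of distinct items,
    and average number of items per basket (width)."""
    n = len(baskets)
    # Union of all baskets gives the set of distinct items.
    m = len(set().union(*baskets))
    # Average width; defined as 0.0 for an empty data set.
    w = sum(len(b) for b in baskets) / n if n else 0.0
    return n, m, w


N, M, W = basket_measures(baskets)
print(f"N={N}, M={M}, W={W}")
```

In real market basket data M is typically large while W stays small, which is why baskets are usually stored as sets of items rather than as length-M boolean vectors.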

  7. Aspects of Market Basket Mining • What is interesting? • How do you make it run fast?

  8. What is Interesting? (first try) • itemset I = set of items • association rule: A -> B • support(I) = fraction of baskets that contain I • confidence(A -> B) = probability that a basket contains B given that it contains A
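The definitions of support and confidence on this slide translate almost word for word into code. This is a small sketch under the assumption that baskets are Python sets; the toy data and the function names `support` and `confidence` are illustrative only.

```python
# Hypothetical toy data: each basket is a set of item names.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not baskets:
        return 0.0
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(a, b, baskets):
    """Estimate P(basket contains B | basket contains A) as
    support(A union B) / support(A). Returns 0.0 if A never occurs."""
    supp_a = support(a, baskets)
    if supp_a == 0:
        return 0.0
    return support(set(a) | set(b), baskets) / supp_a


print(support({"beer", "diapers"}, baskets))
print(confidence({"beer"}, {"diapers"}, baskets))
```

Note that confidence is not symmetric: confidence(A -> B) and confidence(B -> A) share the same numerator but divide by different supports.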

  9. How do you find Itemsets with high support? • Apriori algorithm, Agrawal et al. (1993) • Find all itemsets with support > s • 1-itemset = itemset with 1 item • … • k-itemset = itemset with k items • large itemset = itemset with support > s • candidate itemset = itemset that may have support > s

  10. Apriori Algorithm • start with all 1-itemsets • go through data and count their support and find all “large” 1-itemsets • combine them to form “candidate” 2-itemsets • go through data and count their support and find all “large” 2-itemsets • combine them to form “candidate” 3-itemsets …
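The level-by-level loop on this slide can be sketched as follows. This is a simplified in-memory illustration, not the authors' implementation: it assumes baskets are Python sets, uses a naive candidate-counting loop, and prunes a candidate (k+1)-itemset unless all of its k-item subsets are large. The function name `apriori` and the toy data are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each basket is a set of item names.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]


def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset with support > min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)
    # Start with all 1-itemsets as candidates.
    candidates = {frozenset([item]) for b in baskets for item in b}
    large = {}
    k = 1
    while candidates:
        # One pass over the data: count the support of each candidate.
        counts = {c: 0 for c in candidates}
        for b in baskets:
            for c in candidates:
                if c <= b:
                    counts[c] += 1
        large_k = {c: cnt / n for c, cnt in counts.items() if cnt / n > min_support}
        large.update(large_k)
        # Combine large k-itemsets into candidate (k+1)-itemsets, keeping
        # only those whose k-item subsets are all large.
        k += 1
        candidates = set()
        for x, y in combinations(large_k, 2):
            union = x | y
            if len(union) == k and all(
                frozenset(s) in large_k for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return large


for itemset, supp in apriori(baskets, min_support=0.4).items():
    print(sorted(itemset), supp)
```

The pruning step relies on the fact that support can only shrink as items are added, so an itemset with a non-large subset cannot itself be large. Each iteration of the `while` loop corresponds to one pass over the data.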

  11. Run Time • k passes over the data, where k is the size of the largest candidate itemset • Memory chunking algorithm: 2 passes over the data on disk, but multiple passes in memory • Toivonen (1996): statistical technique, 1 + ε passes (but more memory) • Brin (1997): Dynamic Itemset Counting, 1 + ε passes (less memory)

  12. But what is really interesting? • A -> B • Support = P(AB) • Confidence = P(B|A) • Interest = P(AB) / (P(A)P(B)) • Implication Strength = P(A)P(~B) / P(A~B)
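The two extra measures on this slide can be estimated from basket counts in the same way as support and confidence. The sketch below assumes baskets are Python sets and reads P(A~B) as the fraction of baskets that contain A but not all of B; the function name `rule_measures`, the toy data, and the handling of zero denominators are choices made here for illustration.

```python
# Hypothetical toy data: each basket is a set of item names.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]


def rule_measures(a, b, baskets):
    """Return (interest, implication_strength) for the rule A -> B.

    interest             = P(AB) / (P(A) P(B))
    implication strength = P(A) P(~B) / P(A~B)
    """
    if not baskets:
        raise ValueError("need at least one basket")
    a, b = frozenset(a), frozenset(b)
    n = len(baskets)
    p_a = sum(1 for x in baskets if a <= x) / n
    p_b = sum(1 for x in baskets if b <= x) / n
    p_ab = sum(1 for x in baskets if a <= x and b <= x) / n
    p_a_not_b = sum(1 for x in baskets if a <= x and not b <= x) / n

    # Interest is undefined when A or B never occurs (assumption: report NaN).
    interest = p_ab / (p_a * p_b) if p_a * p_b > 0 else float("nan")
    # A rule with no counterexamples gets infinite strength (assumption).
    if p_a_not_b > 0:
        strength = p_a * (1 - p_b) / p_a_not_b
    else:
        strength = float("inf")
    return interest, strength


print(rule_measures({"beer"}, {"diapers"}, baskets))
```

An interest of 1 means A and B occur together exactly as often as independence would predict. On this toy data the rule beer -> diapers has interest below 1, even though its confidence is above one half, which is the kind of case where support and confidence alone can mislead.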

  13. But what is really really interesting? • Causality • Surprise

  14. Summary • What is Data Mining? • Market Baskets • Finding Itemsets with high support • Finding Interesting Rules
