1. Market Basket Analysis and Advanced Data Mining Professor Amit Basu
abasu@smu.edu
3. Examples Rule form: LHS → RHS
IF a customer buys diapers, THEN they also buy beer
diapers → beer
“Transactions that purchase bread and butter also purchase milk”
bread ∧ butter → milk
Customers who purchase maintenance agreements are very likely to purchase large appliances
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
4. Representations What’s the difference between these patterns?
(a) Risk = 0.3 * sin(numcards * dem1^0.25) + 0.83 * (pastdef - dem2) * cos(employed + dem1)^2
(b) Risk = 0.93 * priordefault + 0.23 * num_cards – 1.3 * employed – 0.734
(c) IF person has a good credit rating THEN they have fewer accidents
5. Evaluation Support : a measure of how often the items in an association occur together, as a percentage of all transactions
In 2% of purchases at a hardware store, both a pick and a shovel were bought
support = #tuples(LHS, RHS)/N
Confidence : the confidence of rule “B given A” measures how likely B is to occur in transactions where A has occurred
100% means that B always occurs when A has occurred
confidence = #tuples(LHS, RHS) / #tuples(LHS)
Example: bread ∧ butter → milk [90%, 1%]
Rules originating from the same itemset have identical support but can have different confidence
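To make these measures concrete, here is a minimal Python sketch on made-up transactions (the item names and data are illustrative assumptions, not from the slides):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # fraction of all transactions that contain every item in itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(LHS and RHS together) divided by support(LHS)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

lhs, rhs = {"bread", "butter"}, {"milk"}
print(support(lhs | rhs, transactions))   # 0.4 (2 of 5 transactions)
print(confidence(lhs, rhs, transactions)) # 0.666... (2 of the 3 LHS transactions)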
6. The association rules mining problem Generate all association rules from the given dataset that have
support greater than a specified minimum
and
confidence greater than a specified minimum
7. Examples Rule form:
LHS → RHS [confidence, support]
diapers → beer [60%, 0.5%]
“90% of transactions that purchase bread and butter also purchase milk”
bread ∧ butter → milk [90%, 1%]
8. Example
9. How Good is an Association Rule? Are support and confidence enough?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
Lift = P(LHS ∧ RHS) / (P(LHS) × P(RHS))
When lift > 1 then the rule is better at predicting the result than guessing
When lift < 1, the rule does worse than informed guessing; in that case the negative rule (LHS → NOT RHS) predicts better than guessing
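A minimal sketch of lift on the same kind of made-up data; note how a rule can have reasonable confidence yet lift below 1 when the RHS item is very common:

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def p(itemset):
    # estimated probability: fraction of transactions containing itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

lhs, rhs = {"bread", "butter"}, {"milk"}
print(p(lhs | rhs) / (p(lhs) * p(rhs)))  # ~0.83: milk appears in 80% of
# transactions, so this rule predicts milk worse than simply guessing "milk"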
10. Computational Complexity Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules: 3^d - 2^(d+1) + 1
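As a quick worked check of these formulas: with d = 6 items there are 2^6 = 64 itemsets and 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602 possible rules.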
11. The Problem of Lots of Data A fast-food restaurant could have 100 items on its menu
How many combinations of 3 different menu items are there? C(100, 3) = 161,700!
A supermarket may stock 10,000 or more unique items:
~50 million 2-item combinations
over 166 billion 3-item combinations (the sketch below verifies these counts)
Use of product hierarchies (groupings) helps address this common issue
The number of transactions in a given time period can also be huge (hence expensive to analyze)
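The counts above can be verified directly with the standard library; a minimal Python check (math.comb requires Python 3.8+):

from math import comb

print(comb(100, 3))     # 161700 three-item combinations on a 100-item menu
print(comb(10_000, 2))  # 49995000, about 50 million 2-item combinations
print(comb(10_000, 3))  # 166616670000, about 167 billion 3-item combinations
d = 6
print(3**d - 2**(d + 1) + 1)  # 602 possible rules for d = 6 unique items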
12. Preparing Data for MBA Determining the scope of the dataset (one store or many, what time period, etc.)
Converting transaction data to itemsets
Generalizing items to appropriate level
Depends on objective of model
Rolling up rare items to get adequate support
13. Search Approach Two sub-problems in discovering all association rules:
Find all sets of items (itemsets) that have transaction support above minimum support
Itemsets that qualify are called large itemsets, and all others small itemsets.
Generate, from each large itemset, rules that use items from that itemset:
Given a large itemset Y and a subset X of Y
Compute c = support(Y) / support(X)
If c is at least minconf, then the rule X → (Y - X) holds with confidence c
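A minimal Python sketch of this rule-generation step, assuming the itemset supports were already computed in the first phase (the support values below are made up for illustration):

from itertools import combinations

supports = {
    frozenset({"bread"}): 0.6,
    frozenset({"butter"}): 0.5,
    frozenset({"milk"}): 0.8,
    frozenset({"bread", "butter"}): 0.4,
    frozenset({"bread", "milk"}): 0.35,
    frozenset({"butter", "milk"}): 0.3,
    frozenset({"bread", "butter", "milk"}): 0.3,
}

def gen_rules(Y, supports, minconf):
    Y = frozenset(Y)
    for r in range(1, len(Y)):              # every proper, non-empty subset X of Y
        for X in map(frozenset, combinations(Y, r)):
            c = supports[Y] / supports[X]   # confidence of X -> (Y - X)
            if c >= minconf:
                yield set(X), set(Y - X), c

for lhs, rhs, c in gen_rules({"bread", "butter", "milk"}, supports, 0.7):
    print(lhs, "->", rhs, f"confidence={c:.2f}")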
14. Reducing Number of Candidates Apriori principle:
If an itemset is large, then all of its subsets must also be large
Support of an itemset never exceeds the support of its subsets
15. The Apriori Algorithm Progressively identifies large itemsets of different sizes
Exploits the property that any subset of a large itemset is also a large itemset
Also, any superset of a small itemset is also small
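A minimal Apriori sketch in Python over made-up transactions; the prune step is where "any superset of a small itemset is small" saves work, since candidates with any small subset are discarded before counting:

from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
minsup = 0.6  # minimum support as a fraction of all transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# L1: large 1-itemsets
items = {i for t in transactions for i in t}
large = [{frozenset({i}) for i in items if support(frozenset({i})) >= minsup}]

k = 2
while large[-1]:
    prev = large[-1]
    # join: build size-k candidates from unions of large (k-1)-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # prune: every (k-1)-subset of a surviving candidate must itself be large
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    large.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in large:
    for itemset in sorted(level, key=sorted):
        print(set(itemset), support(itemset))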
16. Extending MBA Dissociation rules
Combining transaction data with complementary data
Shopper characteristics
Store characteristics
Seasonal factors
Analyzing patterns over time
Patterns that span multiple occasions
Need to “sessionize” data
Need to recognize shoppers across sessions
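As a sketch of what "sessionizing" might look like, assuming timestamped shopper events and a 30-minute inactivity timeout (the field names and data are assumptions, not from the slides):

from datetime import datetime, timedelta

events = [  # (shopper_id, timestamp), assumed sorted by shopper then time
    ("s1", datetime(2024, 1, 1, 9, 0)),
    ("s1", datetime(2024, 1, 1, 9, 10)),
    ("s1", datetime(2024, 1, 2, 14, 0)),
    ("s2", datetime(2024, 1, 1, 11, 0)),
]
timeout = timedelta(minutes=30)

sessions, last_seen = {}, {}
for shopper, ts in events:
    # start a new session on a shopper's first event or after a long gap
    if shopper not in last_seen or ts - last_seen[shopper] > timeout:
        sessions.setdefault(shopper, []).append([])
    sessions[shopper][-1].append(ts)
    last_seen[shopper] = ts

print({s: len(v) for s, v in sessions.items()})  # {'s1': 2, 's2': 1}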
17. Usability of Association Rules
18. Advanced Data Mining Text Mining
Mining non-textual data
Image and video data (Multimedia)
Spatial data
GIS
Temporal data
Time series
Behavioral patterns
Web Mining
Web usage
Web content
19. Mining Image Data Traditional pattern recognition
Neural networks
Supervised learning
Discovering patterns
Unsupervised learning
Clustering
20. Mining Spatial Data Spatial databases typically use special data structures
Extensions of tree-structured indexes
Quad trees, R-trees, k-D trees, etc.
Relationships based on spatial descriptors
Overlapping, disjoint, contains, etc.
Distance-based clustering
Feature extraction
Association rules
IF location is near a lake, THEN pollution is low
21. Web Mining Mining data that is obtained from the Web
Web Content mining
Web Usage mining
22. Web Content Mining Search engines
Spiders and Crawlers
Metacrawlers
A major challenge is the unstructured form of the data
Lack of high-level standards
Abuse of descriptors (meta-information)
23. Web Usage Mining Mining Web logs
Data is relatively structured
Data is highly dynamic
Problems with user identification and location
The inherently non-linear aspects of Web usage behavior
Tracking both forward and backward links
Dynamic personalization
24. Issues and Trends Mining across multiple data sources and sets
Online mining – what are the patterns right now?
Concerns about privacy and other ethical questions
Property
Accuracy