Mining Generalized Association Rules R. Srikant & R. Agrawal (IBM) Presentation by: Colin Cherry
Objectives • What are generalized association rules? • Why do we care? • How can we get them efficiently? • How can we reduce rule redundancy? • Is the efficient method any good?
Motivation • Association rule mining finds rules of the form: • X ⇒ Y, where X and Y are sets of items • What if there is structure over your items? • Structure can be used to generalize
Hierarchy Example [Figure: item hierarchy. Beverage > Soft Drink > Cola > {Pepsi, Coke}; other branches elided]
Hierarchy Example [Figure: a second hierarchy with roots On Sale and Not On Sale; lower levels elided] • Goal of this paper: • Given hierarchies over items: • Capture interesting rules at all levels of multiple hierarchies
Simple Fix • Just add all ancestors (not only direct parents) to each transaction. • {Coke, 7-up, ranch Doritos, bananas} would become: {Coke, 7-up, ranch Doritos, bananas, Doritos, cola, clear pop, soft drink, chips, junk food, fruit, produce}
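A minimal sketch of the simple fix, assuming each item has a single parent. The `PARENT` table and the function names are hypothetical; the item names follow the example on this slide.

```python
# Hypothetical toy taxonomy (child -> parent), built from the slide's example.
# Assumption: every item has at most one parent.
PARENT = {
    "Coke": "cola",
    "7-up": "clear pop",
    "cola": "soft drink",
    "clear pop": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
    "bananas": "fruit",
    "fruit": "produce",
}


def ancestors(item, parent=PARENT):
    """Return all ancestors of an item by walking parent links to the root."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def extend_transaction(transaction, parent=PARENT):
    """Add every ancestor of every item; using a set removes duplicates."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended


basket = {"Coke", "7-up", "ranch Doritos", "bananas"}
extended = extend_transaction(basket)
print(sorted(extended))
```

The 4-item basket grows to 12 items, which is why the next slide worries about counting cost.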
Fix Cont’d • Run Apriori on the expanded database • Redefine association rules. Make sure that, for X ⇒ Y: • X ∩ Y = ∅ • Y contains no ancestors of any item in X
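The two conditions can be checked directly. This is a sketch under an assumed single-parent taxonomy; `PARENT` and `is_generalized_rule` are hypothetical names. The ancestor condition matters because a rule such as Coke ⇒ cola holds in every expanded transaction and says nothing.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "Coke": "cola",
    "cola": "soft drink",
    "Doritos": "chips",
}


def ancestors(item, parent=PARENT):
    """Return the set of all ancestors of an item."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result


def is_generalized_rule(x, y, parent=PARENT):
    """Check X => Y: both sides non-empty, disjoint, Y has no ancestor of an X item."""
    x, y = set(x), set(y)
    if not x or not y or (x & y):
        return False
    ancestors_of_x = set()
    for item in x:
        ancestors_of_x |= ancestors(item, parent)
    return not (y & ancestors_of_x)


print(is_generalized_rule({"Coke"}, {"chips"}))  # allowed
print(is_generalized_rule({"Coke"}, {"cola"}))   # rejected: cola is an ancestor of Coke
```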
Problems with the fix • Counting may slow down • Total number of items & average transaction size will grow • Could get a lot of redundant rules • Milk ⇒ Cereal (70%) • Skim Milk ⇒ Cereal (70%) Do we care?
An Efficient Algorithm • “Cumulate” • Filtering ancestors added to transactions • Hierarchy-aware itemset pruning • For more complicated, speculative algorithms, see paper
Filtering Ancestors • Not counting soft drink? Don’t add it. • Only add ancestors that are in at least one of the candidate itemsets • Delete any items we are not counting • Not counting Doritos? Replace with chips • Each iteration: • Pre-compute the ancestors for each item
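A sketch of the filtering step for one pass, assuming candidates are given as frozensets and the taxonomy is single-parent. All names here (`PARENT`, `precompute_ancestors`, `filter_transaction`) are hypothetical and not the paper's code.

```python
# Hypothetical toy taxonomy (child -> parent).
PARENT = {
    "Coke": "cola",
    "cola": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "bananas": "fruit",
}


def precompute_ancestors(parent):
    """Build item -> list of ancestors once, so lookups during counting are cheap."""
    table = {}
    for item in set(parent) | set(parent.values()):
        chain, current = [], item
        while current in parent:
            current = parent[current]
            chain.append(current)
        table[item] = chain
    return table


def filter_transaction(transaction, candidates, ancestor_table):
    """Keep only items and ancestors that occur in at least one candidate itemset."""
    counted = set().union(*candidates)
    kept = set()
    for item in transaction:
        if item in counted:
            kept.add(item)
        # Add an ancestor only if some candidate actually contains it.
        kept.update(a for a in ancestor_table.get(item, []) if a in counted)
    return kept


ANCESTORS = precompute_ancestors(PARENT)
candidates = [frozenset({"Coke", "chips"}), frozenset({"cola", "chips"})]
filtered = filter_transaction({"Coke", "ranch Doritos", "bananas"}, candidates, ANCESTORS)
print(sorted(filtered))
```

Here ranch Doritos is dropped but its ancestor chips is kept, bananas disappears entirely, and soft drink is never added because no candidate contains it.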
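Itemset Pruning • No sense counting both {coke,cola,chips} and {coke,chips}: every transaction that contains coke also contains cola, so they will always have the same support • Remove {coke,cola} when counting itemsets of size 2, and no larger candidate containing both an item and its ancestor will ever be generated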
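The pruning rule can be sketched as a filter over size-2 candidates, again assuming a hypothetical single-parent `PARENT` table and hypothetical function names.

```python
# Hypothetical toy taxonomy (child -> parent).
PARENT = {
    "Coke": "cola",
    "cola": "soft drink",
    "Doritos": "chips",
}


def has_item_with_ancestor(itemset, parent=PARENT):
    """True if the itemset contains some item together with one of its ancestors."""
    items = set(itemset)
    for item in items:
        current = item
        while current in parent:
            current = parent[current]
            if current in items:
                return True
    return False


def prune_candidates(candidates, parent=PARENT):
    """Drop candidates whose support is already determined by a smaller itemset."""
    return [c for c in candidates if not has_item_with_ancestor(c, parent)]


size_two = [
    frozenset({"Coke", "cola"}),        # pruned: cola is an ancestor of Coke
    frozenset({"Coke", "chips"}),
    frozenset({"cola", "chips"}),
    frozenset({"Coke", "soft drink"}),  # pruned: soft drink is an ancestor of Coke
]
kept = prune_candidates(size_two)
print(kept)
```

Because Apriori builds size-3 candidates only from surviving size-2 itemsets, pruning at size 2 is enough to keep such itemsets out of every later pass.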
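Reducing Redundancy • Milk ⇒ Cereal (8% sup, 70% conf) • Skim Milk ⇒ Cereal (2% sup, 70% conf) • If Skim Milk accounts for 1/4 of Milk sales, then the 2nd rule is redundant: its expected support is 1/4 × 8% = 2% and its expected confidence is 70%, exactly what was observed • Expected support and confidence (with respect to the hierarchy) will define what is interesting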
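A small sketch of the expected-support arithmetic, assuming the expectation is the ancestor rule's support scaled by each specialized item's share of its ancestor. The function name and the `shares` parameter are hypothetical.

```python
def expected_support(ancestor_support, shares):
    """Scale the ancestor's support by each specialized item's share of its ancestor.

    shares: one ratio support(item) / support(ancestor item) per replaced item.
    """
    expected = ancestor_support
    for share in shares:
        expected *= share
    return expected


# Milk => Cereal has 8% support; Skim Milk is 1/4 of Milk sales.
exp_sup = expected_support(0.08, [0.25])
print(exp_sup)
```

The consequent (Cereal) is the same in both rules, so under this assumption the expected confidence of the specialized rule is simply the ancestor's 70%.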
Close Ancestors • An itemset Z’ is an ancestor of Z if: • Z’ = Z with some items replaced by ancestors • Z’ has the same number of items as Z • Z’ is a close ancestor of Z if: • No ancestor of Z has Z’ as an ancestor • Example: take {coke,bananas} as Z • Z’={cola,bananas} is a close ancestor • Z’={soft drink,bananas} is not close • Z’={cola,fruit} is not close
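The two definitions translate into a brute-force sketch, assuming a hypothetical single-parent `PARENT` table. Enumerating every ancestor itemset is fine for a toy example but would not scale to real taxonomies.

```python
from itertools import product

# Hypothetical toy taxonomy (child -> parent).
PARENT = {
    "coke": "cola",
    "cola": "soft drink",
    "bananas": "fruit",
}


def item_ancestors(item, parent=PARENT):
    """Return the ancestors of a single item, nearest first."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def itemset_ancestors(itemset, parent=PARENT):
    """All itemsets obtained by replacing one or more items with an ancestor."""
    items = sorted(itemset)
    options = [[i] + item_ancestors(i, parent) for i in items]
    result = set()
    for combo in product(*options):
        candidate = frozenset(combo)
        # Same size as Z, and at least one item actually replaced.
        if len(candidate) == len(items) and candidate != frozenset(items):
            result.add(candidate)
    return result


def close_ancestors(itemset, parent=PARENT):
    """Ancestors of Z that are not themselves an ancestor of another ancestor of Z."""
    all_ancestors = itemset_ancestors(itemset, parent)
    return {
        z for z in all_ancestors
        if not any(z in itemset_ancestors(other, parent) for other in all_ancestors)
    }


close = close_ancestors({"coke", "bananas"})
print(close)
```

With this toy taxonomy, {coke,fruit} also comes out as a close ancestor, alongside the slide's {cola,bananas}.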
Interestingness • A rule X ⇒ Y is interesting if, for all interesting close ancestors X’ ⇒ Y’: Sup({X,Y}) > R × ExpSup({X,Y} | {X’,Y’}) or: Conf(X ⇒ Y) > R × ExpConf(X ⇒ Y | X’ ⇒ Y’) • R is defined by the user
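A sketch of the test as stated on this slide, assuming the expected values for each interesting close ancestor have already been computed. The function name, the argument layout, and the value R = 1.1 are illustrative assumptions.

```python
def is_interesting(support, confidence, expectations, r):
    """Apply the slide's test against every interesting close ancestor.

    expectations: list of (expected_support, expected_confidence) pairs,
    one per interesting close ancestor. An empty list means the rule has
    no ancestors, so it counts as interesting.
    """
    return all(
        support > r * exp_sup or confidence > r * exp_conf
        for exp_sup, exp_conf in expectations
    )


R = 1.1  # illustrative user-chosen threshold

# Skim Milk => Cereal: observed (2%, 70%), expected (2%, 70%) from Milk => Cereal.
print(is_interesting(0.02, 0.70, [(0.02, 0.70)], R))
# A rule with twice its expected support passes the test.
print(is_interesting(0.04, 0.70, [(0.02, 0.70)], R))
```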
Putting it all together • #1 is interesting - has no ancestor • #2 is interesting - twice expected support • #3 is not interesting • Has exactly expected support according to closest ancestor (#2)
Experiments • Lots of experiments on artificial data in paper. • We’ll look at the results of using Cumulate on real data • Compare to the quick fix - just adding in ancestors to transactions
Interestingness Results • Hierarchical Interestingness pruning: • R = 25% resulted in pruning roughly 40% of the rules • R = 50% resulted in pruning roughly 50% of the rules • Pruning had a significant impact!
Objectives Revisited • What are generalized association rules? • Rules aware of hierarchies over items • Why do we care? • Support can be low for individual items • How can we get them efficiently? • Cumulate algorithm: hierarchy-aware counting • How can we reduce rule redundancy? • Check surprise with respect to ancestors • Is the efficient method any good? • Yep!
Hierarchy Example [Figure: another hierarchy; node labels Impulse, Fridge, Beverage, Cans, Bottles; tree layout not recoverable]
Pros • Rules over items low in the tree may not have minimum support • Can raise min support • Shoot for fewer, more general rules • BUT: You can catch rules at any level of the hierarchy
Data Sets • Supermarket: • 500,000 items • 1.5 million transactions • Hierarchy has 4 levels, 118 roots • Department Store: • 200,000 items • 500,000 transactions • Hierarchy has 7 levels, 89 roots
Summary • Nothing ground-breaking in this paper • But, it provides a solid, efficient method for working with hierarchies • Generalization is a powerful tool to have available in association rules