Mining Generalized Association Rules

Mining Generalized Association Rules R. Srikant & R. Agrawal (IBM) Presentation by: Colin Cherry

Objectives • What are generalized association rules? • Why do we care? • How can we get them efficiently? • How can we reduce rule redundancy? • Is the efficient method any good?

Motivation • Association rules find rules of the form: • X ⇒ Y, where X and Y are sets of items • What if there is structure over your items? • Structure can be used to generalize

Hierarchy Example • Beverage → Soft Drink → Cola → Pepsi, Coke • Sibling branches at each level are elided (…)

Hierarchy Example On Sale Not On Sale … … • Goal of this paper: • Given hierarchies over items: • Capture interesting rules at all levels of multiple hierarchies

Simple Fix • Just add parents to each transaction. • {Coke, 7-up, ranch Doritos, bananas} would become: {Coke, 7-up, ranch Doritos, bananas, Doritos, cola, clear pop, soft drink, chips, junk food, fruit, produce}
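A minimal sketch of this expansion step, in Python. The child-to-parent map `PARENT` and the helper names are assumptions chosen only to reproduce the example above; the paper allows items with several parents, which this single-parent sketch does not cover.

```python
# Hypothetical child -> parent map, chosen to match the slide's example.
PARENT = {
    "Coke": "cola",
    "cola": "soft drink",
    "7-up": "clear pop",
    "clear pop": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
    "bananas": "fruit",
    "fruit": "produce",
}


def ancestors(item, parent=PARENT):
    """Return every ancestor of `item`, excluding the item itself."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def extend_transaction(transaction, parent=PARENT):
    """Return the transaction plus all ancestors of each of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parent)
    return extended


basket = {"Coke", "7-up", "ranch Doritos", "bananas"}
extended = extend_transaction(basket)
```

Shared ancestors such as soft drink are added only once because the transaction is a set.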

Fix Cont’d • Run Apriori on expanded database • Redefine association rules: Make sure: • X ⇒ Y • X ∩ Y = {} • Y contains no ancestors of any item in X
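The two conditions on a rule can be checked mechanically. The sketch below assumes a hypothetical single-parent map `PARENT`; the item names are illustrative only. The ancestor condition matters because a rule such as skim milk ⇒ milk holds trivially once ancestors are added to every transaction.

```python
# Hypothetical child -> parent map (an assumption for illustration).
PARENT = {"skim milk": "milk", "milk": "dairy"}


def ancestors(item, parent):
    """Return every ancestor of `item`, excluding the item itself."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def is_valid_generalized_rule(x, y, parent):
    """Check that X and Y are disjoint and Y holds no ancestor of an item in X."""
    x, y = set(x), set(y)
    if x & y:
        return False
    ancestors_of_x = set()
    for item in x:
        ancestors_of_x |= ancestors(item, parent)
    return not (y & ancestors_of_x)


ok = is_valid_generalized_rule({"skim milk"}, {"cereal"}, PARENT)
trivial = is_valid_generalized_rule({"skim milk"}, {"milk"}, PARENT)
overlapping = is_valid_generalized_rule({"milk"}, {"milk"}, PARENT)
```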

Problems with the fix • Counting may slow down • Total number of items & average transaction size will grow • Could get a lot of redundant rules • Milk ⇒ Cereal (70%) • Skim Milk ⇒ Cereal (70%) Do we care?

An Efficient Algorithm • “Cumulate” • Filtering ancestors added to transactions • Hierarchy-aware itemset pruning • For more complicated, speculative algorithms, see paper

Filtering Ancestors • Not counting soft drink? Don’t add it. • Only add ancestors that are in at least one of the candidate itemsets • Delete any items we are not counting • Not counting Doritos? Replace with chips • Each iteration: • Pre-compute the ancestors for each item
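The filtering idea can be sketched as follows. This is an illustration under assumptions, not the paper's code: `PARENT`, `filter_and_extend`, and the candidate itemsets are hypothetical names and data. The transaction is expanded with ancestors, then everything that appears in no candidate itemset is dropped.

```python
# Hypothetical child -> parent map (an assumption for illustration).
PARENT = {
    "Coke": "cola",
    "cola": "soft drink",
    "ranch Doritos": "Doritos",
    "Doritos": "chips",
    "chips": "junk food",
}


def ancestors(item, parent):
    """Return every ancestor of `item`, excluding the item itself."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def filter_and_extend(transaction, candidates, parent):
    """Expand a transaction, keeping only items that occur in some candidate."""
    counted = set().union(*candidates)  # every item we are actually counting
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parent)
    return extended & counted


candidates = [frozenset({"Coke", "chips"}), frozenset({"cola", "chips"})]
filtered = filter_and_extend({"Coke", "ranch Doritos"}, candidates, PARENT)
```

In this run soft drink is never added because no candidate contains it, and ranch Doritos and Doritos are dropped, leaving chips in their place.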

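Itemset Pruning • No sense counting both {coke, cola, chips} and {coke, chips}: every transaction that contains coke also contains cola, so their supports are always equal • Prune {coke, cola} when counting itemsets of size 2, and no larger candidate containing both an item and its ancestor will ever be generated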
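A minimal sketch of this pruning step at size 2, assuming a hypothetical single-parent map `PARENT` and illustrative helper names. Any candidate that contains both an item and one of its ancestors is removed before counting.

```python
from itertools import combinations

# Hypothetical child -> parent map (an assumption for illustration).
PARENT = {"coke": "cola", "cola": "soft drink"}


def ancestors(item, parent):
    """Return every ancestor of `item`, excluding the item itself."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def has_item_with_ancestor(itemset, parent):
    """True if the itemset contains some item together with one of its ancestors."""
    return any(a in itemset for item in itemset for a in ancestors(item, parent))


def prune_candidates(candidates, parent):
    """Drop candidates that pair an item with its own ancestor."""
    return [c for c in candidates if not has_item_with_ancestor(c, parent)]


frequent_items = ["coke", "cola", "chips"]
size_two = [frozenset(pair) for pair in combinations(frequent_items, 2)]
pruned = prune_candidates(size_two, PARENT)
```

Because Apriori builds larger candidates only from surviving smaller ones, removing {coke, cola} here means {coke, cola, chips} is never generated.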

Reducing Redundancy • Milk ⇒ Cereal (8% sup, 70% conf) • Skim Milk ⇒ Cereal (2% sup, 70% conf) • If Skim Milk accounts for 1/4 of Milk sales, then the 2nd rule is redundant • Expected support and confidence (wrt hierarchy) will define interesting
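The arithmetic behind the milk example can be worked through in a few lines. The sketch assumes that expected support scales the ancestor rule's support by each specialized item's share of its ancestor, and that expected confidence is scaled only by shares of items replaced in the consequent; the function names are hypothetical.

```python
def expected_support(ancestor_support, shares):
    """Scale the ancestor rule's support by each specialized item's share."""
    expected = ancestor_support
    for share in shares:
        expected *= share
    return expected


def expected_confidence(ancestor_confidence, consequent_shares):
    """Scale the ancestor rule's confidence by shares of specialized consequent items."""
    expected = ancestor_confidence
    for share in consequent_shares:
        expected *= share
    return expected


# Milk => Cereal has 8% support and 70% confidence; Skim Milk is 1/4 of Milk.
exp_sup = expected_support(0.08, [0.25])
# Only the antecedent was specialized, so no consequent shares apply.
exp_conf = expected_confidence(0.70, [])
```

Under these assumptions the expected support is 2% and the expected confidence is 70%, exactly what Skim Milk ⇒ Cereal shows, which is why the rule adds no information.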

Close Ancestors • An itemset Z’ is an ancestor of Z if: • Z’ = Z with some items replaced by ancestors • Z’ has the same number of items as Z • Z’ is a close ancestor of Z if: • No ancestor of Z has Z’ as an ancestor • Example: take Z = {coke, bananas} • Z’ = {cola, bananas} is a close ancestor • Z’ = {soft drink, bananas} is not close • Z’ = {cola, fruit} is not close
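These definitions translate fairly directly into code. The sketch below is a brute-force illustration, assuming a hypothetical single-parent map `PARENT` that matches the example; it enumerates every itemset ancestor and keeps those that no other ancestor of Z has as its own ancestor.

```python
from itertools import product

# Hypothetical child -> parent map (an assumption for illustration).
PARENT = {"coke": "cola", "cola": "soft drink", "bananas": "fruit"}


def ancestor_chain(item, parent):
    """Return the ancestors of `item`, nearest first."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def itemset_ancestors(itemset, parent):
    """All same-size itemsets obtained by replacing some items with ancestors."""
    items = sorted(itemset)
    options = [[item] + ancestor_chain(item, parent) for item in items]
    result = set()
    for combo in product(*options):
        candidate = frozenset(combo)
        if len(candidate) == len(items) and candidate != frozenset(items):
            result.add(candidate)
    return result


def close_ancestors(itemset, parent):
    """Ancestors of Z that no other ancestor of Z has as an ancestor."""
    all_ancestors = itemset_ancestors(itemset, parent)
    return {
        z1 for z1 in all_ancestors
        if not any(z1 in itemset_ancestors(z2, parent) for z2 in all_ancestors)
    }


close = close_ancestors({"coke", "bananas"}, PARENT)
```

With this map, {coke, fruit} is also a close ancestor of Z, since it replaces a single item by its direct parent.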

Interestingness • A rule X ⇒ Y is interesting if for all interesting, close ancestors X’ ⇒ Y’: Sup({X,Y}) > R*ExpSup({X,Y}|{X’,Y’}) or: Conf(X ⇒ Y) > R*ExpConf(X ⇒ Y | X’ ⇒ Y’) • R is defined by the user
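The test itself is a short predicate. In the sketch below, the function name, the tuple layout for each close ancestor's expectations, and the value R = 1.1 are all assumptions for illustration; the expected values are taken as already computed.

```python
def is_interesting(support, confidence, ancestor_expectations, r):
    """Apply the slide's test against every interesting close ancestor.

    `ancestor_expectations` holds one (expected_support, expected_confidence)
    pair per interesting close ancestor. A rule with no such ancestors passes.
    """
    return all(
        support > r * exp_sup or confidence > r * exp_conf
        for exp_sup, exp_conf in ancestor_expectations
    )


R = 1.1  # hypothetical user-chosen threshold

# A rule that matches its ancestor's expectations exactly: not interesting.
as_expected = is_interesting(0.02, 0.70, [(0.02, 0.70)], R)
# A rule with twice the expected support: interesting.
surprising = is_interesting(0.04, 0.70, [(0.02, 0.70)], R)
# A rule with no ancestors is interesting by default.
no_ancestors = is_interesting(0.08, 0.70, [], R)
```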

Putting it all together • #1 is interesting - has no ancestor • #2 is interesting - twice expected support • #3 is not interesting • Has exactly expected support according to closest ancestor (#2)

Experiments • Lots of experiments on artificial data in paper. • We’ll look at the results of using Cumulate on real data • Compare to the quick fix - just adding in ancestors to transactions

Supermarket

Department Store

Interestingness Results • Hierarchical Interestingness pruning: • R = 25% resulted in pruning roughly 40% of the rules • R = 50% resulted in pruning roughly 50% of the rules • Pruning had a significant impact!

Objectives Revisited • What are generalized association rules? • Rules aware of hierarchies over items • Why do we care? • Support can be low for individual items • How can we get them efficiently? • Cumulate algorithm - hierarchy-aware counting • How can we reduce rule redundancy? • Check surprise with respect to ancestors • Is the efficient method any good? • Yes!

Questions?

Hierarchy Example Impulse Fridge … Beverage … Cans Bottles … …

Pros • Rules over items low in the tree may not have minimum support • Can raise min support • Shoot for fewer, more general rules • BUT: You can catch rules at any level of the hierarchy

Data Sets • Supermarket: • 500,000 items • 1.5 million transactions • Hierarchy has 4 levels, 118 roots • Department Store: • 200,000 items • 500,000 transactions • Hierarchy has 7 levels, 89 roots

Summary • Nothing ground-breaking in this paper • But, it provides a solid, efficient method for working with hierarchies • Generalization is a powerful tool to have available in association rules
