Fast Algorithms for Mining Association Rules

Rakesh Agrawal, Ramakrishnan Srikant
Slides from Ofer Pasternak

Data Mining Seminar 2003
Introduction
• Bar-code technology
• Mining association rules over basket data (93)
• Tires ∧ accessories ⇒ automotive service
• Cross-marketing, attached mailing
• Very large databases

Notation
• Items: I = {i1, i2, …, im}
• Transaction: a set of items
• Items within each transaction are kept sorted lexicographically
• TID: unique identifier for each transaction

Notation
• Association rule: X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅

Confidence and Support
• Association rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Association rule X ⇒ Y has support s if s% of the transactions in D contain both X and Y (that is, X ∪ Y).
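The two definitions above can be checked on a toy database. The following is a minimal sketch, not code from the paper: the function names `support` and `confidence` and the four-transaction database `D` are hypothetical, and the values are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction; otherwise the
    division below raises ZeroDivisionError.
    """
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical basket data, invented for illustration only.
D = [
    frozenset({"tires", "accessories", "service"}),
    frozenset({"tires", "accessories"}),
    frozenset({"tires", "service"}),
    frozenset({"accessories"}),
]

# Rule: tires ∧ accessories ⇒ service
print(support({"tires", "accessories", "service"}, D))       # 0.25
print(confidence({"tires", "accessories"}, {"service"}, D))  # 0.5
```

Note that the support of the rule is the support of X ∪ Y, while the confidence divides that by the support of X alone.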

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Discovering all Association Rules
• Find all large itemsets: itemsets with support above minimum support.
• Use the large itemsets to generate the rules.

General idea
• Say ABCD and AB are large itemsets.
• Compute conf = support(ABCD) / support(AB).
• If conf >= minconf, the rule AB ⇒ CD holds.
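The idea above extends to every large itemset and every non-empty proper subset of it. The sketch below shows that straightforward enumeration; it is not the optimized rule-generation procedure from the paper. The name `generate_rules` and the support counts are hypothetical, and the dictionary is assumed to hold a count for every large itemset (and therefore for every subset of one).

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for every rule that holds.

    `supports` maps each large itemset (a frozenset) to its support count.
    """
    rules = []
    for itemset, sup in supports.items():
        # Try every non-empty proper subset as the antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = sup / supports[a]  # support(l) / support(a)
                if conf >= minconf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts, invented for illustration only.
supports = {
    frozenset("A"): 4, frozenset("B"): 3, frozenset("C"): 3,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 2,
    frozenset("ABC"): 2,
}

for a, c, conf in generate_rules(supports, minconf=0.7):
    print(sorted(a), "=>", sorted(c), round(conf, 2))
```

With these counts, four rules pass the 0.7 threshold: A ⇒ B, B ⇒ A, AC ⇒ B, and BC ⇒ A.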

Discovering Large Itemsets
• Multiple passes over the data
• First pass: count the support of individual items.
• Each subsequent pass:
  • Generate candidates using the previous pass's large itemsets.
  • Go over the data and check the actual support of the candidates.
• Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large. Therefore, to find the large k-itemsets:
• Create candidates by combining large (k-1)-itemsets.
• Delete those that contain any subset that is not large.

Algorithm Apriori
• Count item occurrences.
• Generate new candidate k-itemsets.
• Find the support of all the candidates.
• Take only those with support over minsup.
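The four steps above can be sketched as a level-wise loop. This is an illustrative sketch under stated assumptions, not the paper's pseudocode: the name `apriori` is hypothetical, minimum support is given as an absolute count and treated as inclusive (`>=`), candidates are formed by a simplified pairwise union rather than the prefix-based join, and support is counted by a plain scan instead of a hash-tree.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count item occurrences.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    large = {s: n for s, n in counts.items() if n >= minsup_count}
    result = dict(large)

    k = 2
    while large:
        # Generate candidate k-itemsets (simplified): union two large
        # (k-1)-itemsets and keep the result only if it has k items and
        # all of its (k-1)-subsets are large.
        candidates = set()
        for a, b in combinations(list(large), 2):
            c = a | b
            if len(c) == k and all(
                frozenset(s) in large for s in combinations(c, k - 1)
            ):
                candidates.add(c)

        # Find the support of all the candidates with one pass over the data.
        counts = Counter(c for t in transactions for c in candidates if c <= t)

        # Take only those that reach the minimum support.
        large = {s: n for s, n in counts.items() if n >= minsup_count}
        result.update(large)
        k += 1
    return result


# Small illustrative database with a minimum support of 2 transactions.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, n in sorted(apriori(D, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)
```

On this database the loop stops after the third pass, with {2 3 5} as the only large 3-itemset.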

Candidate generation
• Join step: p and q are two large (k-1)-itemsets that are identical in their first k-2 items. Join them by adding the last item of q to p.
• Prune step: check all the (k-1)-subsets of each candidate and remove any candidate that has a “small” subset.

Example
L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
{1 3 4 5} is pruned because {1 4 5} and {3 4 5} are not in L3.
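The join and prune steps can be reproduced on this example. The sketch below assumes itemsets are represented as sorted tuples; the function name `apriori_gen` follows the paper's name for the procedure, but the Python code itself is an illustration rather than the authors' implementation.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a collection of sorted tuples, all of length k-1.
    """
    large_prev = set(large_prev)
    candidates = set()
    for p in large_prev:
        for q in large_prev:
            # Join step: p and q agree on the first k-2 items, and the
            # last item of p is smaller than the last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must be large.
                if all(s in large_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

The candidate (1, 3, 4, 5) is produced by the join but dropped by the prune step, matching the example.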

Correctness
Show that C_k ⊇ L_k:
• Any subset of a large itemset must also be large.
• The join is equivalent to extending L_{k-1} with all items and removing those itemsets whose (k-1)-subsets are not in L_{k-1}.
• The join's ordering condition (last item of p < last item of q) prevents duplicate candidates.

Subset Function
• Candidate itemsets Ck are stored in a hash-tree.
• Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
• Total time O(max(k, size(t))).
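To illustrate what the subset function computes, the sketch below stores the candidates in a plain prefix tree built from nested dictionaries and walks it with the sorted transaction. This is a swapped-in, simpler structure and not the hash-tree described in the paper, so it does not claim the time bounds above. The names `build_trie` and `subset` are hypothetical, and items are assumed to be sortable and never `None`.

```python
def build_trie(candidates):
    """Store sorted candidate tuples in a nested-dict prefix tree."""
    root = {}
    for c in candidates:
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node[None] = c  # the key None marks the end of a complete candidate
    return root


def subset(trie, transaction):
    """Return every stored candidate that is contained in `transaction`."""
    t = sorted(transaction)
    found = []

    def walk(node, start):
        if None in node:
            found.append(node[None])
        # Only items at or after `start` can extend the current prefix,
        # because both candidates and the transaction are sorted.
        for i in range(start, len(t)):
            child = node.get(t[i])
            if child is not None:
                walk(child, i + 1)

    walk(trie, 0)
    return found


C3 = [(1, 2, 3), (1, 3, 5), (2, 3, 5), (2, 4, 5)]
trie = build_trie(C3)
print(sorted(subset(trie, {1, 2, 3, 5})))  # [(1, 2, 3), (1, 3, 5), (2, 3, 5)]
```

The support count of each returned candidate would then be incremented once per transaction.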
