Data Mining Association Rules Jen-Ting Tony Hsiao Alexandros Ntoulas CS 240B May 21, 2002
Outline • Overview of data mining association rules • Algorithm Apriori • FP-Trees
Data Mining Association Rules • Data mining is motivated by the decision-support problem faced by most large retail organizations • Bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data • Successful organizations view such databases as important pieces of the marketing infrastructure • Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies
Data Mining Association Rules • For example, 98% of customers that purchase tires and auto accessories also get automotive services done • Finding all such rules is valuable for cross-marketing and attached mailing applications • Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns • The databases involved in these applications are very large, so it is imperative to have fast algorithms for this task
Formal Definition of Problem • Formal statement of the problem [AIS93]: • Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. • A transaction T contains X, a set of some items in I, if X ⊆ T. • An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø. • The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. • The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
Example of Confidence and Support • Given a table of transactions such as the hypothetical one used in the sketch below: • 1 ⇒ 2 has 100% confidence, with 100% support • 3 ⇒ 4 has 100% confidence, with 33% support • 2 ⇒ 3 has 33% confidence, with 33% support
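Support and confidence can be checked mechanically. Below is a minimal Python sketch; since the slide's original table is not reproduced in this text, it uses a hypothetical three-transaction database chosen to be consistent with the percentages above.

```python
# Hypothetical transactions consistent with the percentages above.
transactions = [{1, 2}, {1, 2}, {1, 2, 3, 4}]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(confidence({1}, {2}), support({1, 2}))   # 1.0,  1.0   (rule 1 => 2)
print(confidence({3}, {4}), support({3, 4}))   # 1.0,  0.33  (rule 3 => 4)
print(confidence({2}, {3}), support({2, 3}))   # 0.33, 0.33  (rule 2 => 3)
```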
Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. • Use the large itemsets to generate the desired rules. If ABCD and AB are large itemsets, then we can determine if the rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, then the rule holds. (The rule will surely have minimum support because ABCD is large.)
Discovering Large Itemsets • Algorithms for discovering large itemsets make multiple passes over the data • In the first pass, we count the support of individual items and determine which of them are large (i.e., have minimum support) • In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass, use this seed set to generate new potentially large itemsets, called candidate itemsets, and count the actual support of these candidate itemsets during the pass over the data • At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass • This process continues until no new large itemsets are found
Notation • The number of items in an itemset is its size; an itemset of size k is called a k-itemset • Items within an itemset are kept in lexicographic order
Algorithm Apriori • L1 = {large 1-itemsets}; • for ( k = 2; Lk-1 ≠ Ø; k++ ) do begin • Ck = apriori-gen(Lk-1); // New candidates • forall transactions t ∈ D do begin • Ct = subset(Ck, t); // Candidates contained in t • forall candidates c ∈ Ct do • c.count++; • end • Lk = {c ∈ Ck | c.count ≥ minsup} • end • Answer = ∪k Lk;
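The pseudocode above can be rendered as a short Python sketch (a minimal illustration, not the paper's implementation). Here `min_sup` is an absolute support count, itemsets are kept as sorted tuples, `apriori_gen` anticipates the join-and-prune candidate generation described on the next slides, and the `subset(Ck, t)` function is replaced by a direct containment test.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Join then prune, as in the apriori-gen step on the following slides."""
    prev = set(prev_frequent)              # (k-1)-itemsets as sorted tuples
    k = len(next(iter(prev))) + 1
    candidates = set()
    for p in prev:
        for q in prev:
            # Join step: p and q agree on the first k-2 items, p's last < q's last.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be large.
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                 # first pass: large 1-itemsets
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_sup}
    answer = dict(frequent)
    while frequent:                        # level-wise passes
        candidates = apriori_gen(frequent)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:           # stand-in for subset(Ck, t)
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        answer.update(frequent)
    return answer

# Toy usage with an absolute minimum support of 2.
print(apriori([{1, 2, 3}, {1, 2, 4}, {1, 2}, {2, 3}], min_sup=2))
```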
Apriori Candidate Generation • The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets. • First, in the join step, we join Lk-1 with Lk-1: insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 • Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1: forall itemsets c ∈ Ck do forall (k-1)-subsets s of c do if (s ∉ Lk-1) then delete c from Ck;
Apriori-gen Example • Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}} • In the join step: • {1 2 3} joins with {1 2 4} to produce {1 2 3 4} • {1 3 4} joins with {1 3 5} to produce {1 3 4 5} • After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}} • In the prune step: • {1 2 3 4} is tested for the existence of its 3-item subsets in L3, namely {1 2 3}, {1 3 4}, and {2 3 4} • {1 3 4 5} is tested for {1 3 4}, {1 4 5}, and {3 4 5}; {1 4 5} is not found, so this itemset is deleted • We are then left with only {1 2 3 4} in C4
Discovering Rules • For every large itemset l, we find all non-empty subsets of l • For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf • We consider all subsets of l to generate rules with multiple consequents
Simple Algorithm • We can improve the preceding procedure by generating the subsets of a large itemset in a recursive depth-first fashion • For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc. • If a subset a of a large itemset l does not generate a rule, then the subsets of a need not be considered for generating rules using l • For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds
Simple Algorithm • forall large itemsets lk, k ≥ 2 do • call genrules(lk, lk); • Procedure genrules(lk: large k-itemset, am: large m-itemset) • A = {(m-1)-itemsets am-1 | am-1 ⊂ am}; • forall am-1 ∈ A do begin • conf = support(lk)/support(am-1); • if (conf ≥ minconf) then begin • output the rule am-1 ⇒ (lk - am-1), with confidence = conf and support = support(lk); • if (m - 1 > 1) then • call genrules(lk, am-1); // to generate rules with subsets of am-1 as the antecedents • end • end
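A minimal Python sketch of this recursive procedure (an illustration, not the paper's code). It assumes `support` maps frozensets of items to support counts, for example obtained by converting the keys of the Apriori sketch shown earlier; rules are kept in a dict so that a rule revisited through different recursion branches is only stored once.

```python
from itertools import combinations

def genrules(l_k, a_m, support, min_conf, rules):
    """Emit rules a -> (l_k - a) for (m-1)-subsets a of a_m, pruning as above."""
    for a in combinations(sorted(a_m), len(a_m) - 1):
        a = frozenset(a)
        conf = support[l_k] / support[a]
        if conf >= min_conf:
            rules[(a, l_k - a)] = (conf, support[l_k])
            if len(a) > 1:
                # Only antecedents whose supersets produced a rule are explored.
                genrules(l_k, a, support, min_conf, rules)

# Hypothetical support counts, for illustration only.
support = {
    frozenset('A'): 5, frozenset('B'): 4, frozenset('C'): 4,
    frozenset('AB'): 4, frozenset('AC'): 3, frozenset('BC'): 3,
    frozenset('ABC'): 3,
}
rules = {}
for itemset in (s for s in support if len(s) >= 2):
    genrules(itemset, itemset, support, min_conf=0.75, rules=rules)
for (antecedent, consequent), (conf, sup) in rules.items():
    print(sorted(antecedent), '->', sorted(consequent), round(conf, 2), sup)
```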
Faster Algorithm • Earlier we used the fact that if a rule does not hold, then no rule obtained by moving items from its antecedent into its consequent can hold either • For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either • Read in the other direction: if a rule holds, then every rule obtained by moving items from its consequent back into its antecedent also holds • For example, if AB ⇒ CD has enough confidence, ABC ⇒ D will also hold
Faster Algorithm • forall large itemsets lk, k ≥ 2 do begin • H1 = { consequents of rules derived from lk with one item in the consequent }; • call ap-genrules(lk, H1); • end • Procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents) • if (k > m + 1) then begin • Hm+1 = apriori-gen(Hm); • forall hm+1 ∈ Hm+1 do begin • conf = support(lk)/support(lk - hm+1); • if (conf ≥ minconf) then • output the rule (lk - hm+1) ⇒ hm+1 with confidence = conf and support = support(lk); • else • delete hm+1 from Hm+1; • end • call ap-genrules(lk, Hm+1); • end
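And a sketch of the faster variant, under the same assumptions as the previous sketch (`support` maps frozensets to counts; the function names are invented for the sketch). The candidate (m+1)-item consequents are built with a simplified stand-in for apriori-gen applied to the surviving m-item consequents.

```python
def ap_genrules(l_k, h_m, support, min_conf, rules):
    """h_m: m-item consequents of confident rules already derived from l_k."""
    m = len(next(iter(h_m)))
    if len(l_k) <= m + 1:
        return
    # Candidate (m+1)-item consequents: unions of two surviving m-item ones
    # (a simplified stand-in for apriori-gen applied to h_m).
    h_next = {a | b for a in h_m for b in h_m if len(a | b) == m + 1}
    survivors = set()
    for h in h_next:
        conf = support[l_k] / support[l_k - h]
        if conf >= min_conf:
            rules[(l_k - h, h)] = (conf, support[l_k])
            survivors.add(h)
    if survivors:
        ap_genrules(l_k, survivors, support, min_conf, rules)

def generate_rules(support, min_conf):
    rules = {}
    for l_k in (s for s in support if len(s) >= 2):
        # H1: one-item consequents with enough confidence.
        h1 = {frozenset({i}) for i in l_k
              if support[l_k] / support[l_k - {i}] >= min_conf}
        for h in h1:
            rules[(l_k - h, h)] = (support[l_k] / support[l_k - h], support[l_k])
        if h1:
            ap_genrules(l_k, h1, support, min_conf, rules)
    return rules

# Usage: generate_rules(support, 0.75) with a support dict like the one in the
# previous sketch yields the same rules while testing fewer candidate consequents.
```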
Faster Algorithm • For example, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item consequent rules with minimum confidence. • In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold. • The first of these rules cannot hold, because E ⊂ BE and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons. • The only two-item consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item consequent rules. This is the only rule that will be tested by the faster algorithm.
Mining Frequent Itemsets • If we use candidate generation • Need to generate a huge number of candidate sets. • Need to repeatedly scan the database and check a large set of candidates by pattern matching. • Can we avoid that? • FP-Trees (Frequent Pattern Trees)
FP-Tree • General Idea • Divide and Conquer • Outline of the procedure • For each item create its conditional pattern base • Then expand it and create its conditional FP-Tree • Repeat the process recursively on each constructed FP-Tree • Until FP-Tree is either empty or contains only one path
Example
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}
• For minimum support = 50%
• Step 1: Scan the DB and find the frequent 1-itemsets
• Step 2: Order them in descending order of frequency
• Step 3: Scan the DB again and construct the FP-Tree
Frequent items and their frequencies: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
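The three steps can be sketched in Python as follows (a minimal illustration of the construction only; names such as `FPNode` and `build_fp_tree` are invented for this sketch, and transactions are assumed not to contain duplicate items).

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item label, a count, a parent, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                 # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Step 1: scan the DB and count item frequencies.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Step 2: keep frequent items, ordered by descending frequency.
    frequent = sorted((i for i in counts if counts[i] >= min_sup),
                      key=lambda i: -counts[i])
    rank = {item: r for r, item in enumerate(frequent)}
    # Step 3: scan the DB again and insert each transaction's frequent items,
    # most frequent first, sharing prefixes with earlier transactions.
    root = FPNode(None)
    header = defaultdict(list)             # item -> node-links into the tree
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

# The example database above, with minimum support 50% of 5 transactions (count 3).
db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'),
      list('afcelpmn')]
root, header = build_fp_tree(db, min_sup=3)
# The resulting tree should match the one on the next slide (ties between the
# equally frequent items f and c may be ordered either way).
```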
Example – Conditional Pattern Base
• Start from the frequent-item header table
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate the prefix paths of that item to form its conditional pattern base
Header Table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
FP-tree built from the example database (children indented under their parent):
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Conditional pattern bases
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Example – Conditional FP-Tree • For every conditional pattern base: • Accumulate the frequency of each item • Construct an FP-Tree from the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: the single path {root} → f:3 → c:3 → a:3 • All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Mining The FP-Tree • Construct the conditional pattern base for each item in the initial FP-Tree • Construct a conditional FP-Tree from each conditional pattern base • Recursively mine the conditional FP-Trees and grow the frequent patterns obtained so far • If a conditional FP-Tree contains only a single path, enumerate all combinations of its items
Example – Mining The FP-Tree
Item   Conditional pattern base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)} | p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}         NULL
a      {(fc:3)}                        {(f:3, c:3)} | a
c      {(f:3)}                         {(f:3)} | c
f      NULL                            NULL
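The table above can be reproduced with a compact sketch of the recursive mining step (an illustration only). Instead of materializing each conditional FP-tree as explicit nodes, this simplified version represents every conditional pattern base as a mapping from prefix tuples, ordered most-frequent-first, to counts, which is all the recursion needs; `fp_growth` and the other names are invented for the sketch.

```python
from collections import defaultdict

def fp_growth(pattern_base, suffix, min_sup, results):
    """pattern_base: {prefix tuple (most frequent first): count} for `suffix`."""
    # Accumulate the frequency of each item in this conditional pattern base.
    counts = defaultdict(int)
    for prefix, cnt in pattern_base.items():
        for item in prefix:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < min_sup:
            continue
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = cnt
        # Conditional pattern base for item: the part of each prefix above it.
        cond = defaultdict(int)
        for prefix, c in pattern_base.items():
            if item in prefix:
                cond[prefix[:prefix.index(item)]] += c
        fp_growth(dict(cond), new_suffix, min_sup, results)

db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'),
      list('afcelpmn')]
min_sup = 3
# Order each transaction's frequent items by descending global frequency and
# treat the whole database as the initial (unconditional) pattern base.
item_counts = defaultdict(int)
for t in db:
    for i in t:
        item_counts[i] += 1
rank = {i: r for r, i in enumerate(sorted(
    (i for i in item_counts if item_counts[i] >= min_sup),
    key=lambda i: -item_counts[i]))}
base = defaultdict(int)
for t in db:
    base[tuple(sorted((i for i in t if i in rank), key=rank.get))] += 1

results = {}
fp_growth(dict(base), (), min_sup, results)
# The patterns containing m come out as m, fm, cm, am, fcm, fam, cam, fcam,
# each with count 3, matching the earlier slide.
```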
Example – Mining The FP-Tree • Recursively visit the conditional FP-Trees • m-conditional FP-tree: {root} → f:3 → c:3 → a:3 • am-conditional FP-tree: {root} → f:3 → c:3 • cm-conditional FP-tree: {root} → f:3 • cam-conditional FP-tree: {root} → f:3
Example – Mining The FP-Tree • If a conditional FP-tree consists of a single path, the frequent itemsets can be generated by enumerating all combinations of the nodes on that path • m-conditional FP-tree: {root} → f:3 → c:3 → a:3 • Enumeration of all patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
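The single-path case is just an enumeration of item combinations, as in this small sketch (the path and counts are hard-coded from the m-conditional FP-tree above).

```python
from itertools import combinations

# Single path of the m-conditional FP-tree: f:3 -> c:3 -> a:3, suffix m (count 3).
path = [('f', 3), ('c', 3), ('a', 3)]
suffix, suffix_count = ('m',), 3

for r in range(len(path) + 1):
    for combo in combinations(path, r):
        items = tuple(i for i, _ in combo) + suffix
        # The support of a combination is the smallest count among the chosen
        # nodes and the suffix (here everything is 3).
        print(items, min([c for _, c in combo] + [suffix_count]))
```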
Advantages of FP-Tree • Completeness: • Never breaks a long pattern of any transaction • Preserves complete information for frequent-pattern mining • Compactness: • Reduces irrelevant information: infrequent items are removed • Frequency-descending ordering: more frequent items are more likely to be shared • Never larger than the original database (not counting node-links and counts)
Comparison of FP-Tree Method and Apriori • Data set: T25I20D10K
References • R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pages 207-216, May 1993. • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), 1994. • J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), pages 1-12, 2000.