Data Mining Association Rules Jen-Ting Tony Hsiao Alexandros Ntoulas CS 240B May 21, 2002
Outline • Overview of data mining association rules • Algorithm Apriori • FP-Trees
Data Mining Association Rules • Data mining is motivated by the decision-support problem faced by most large retail organizations • Bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data • Successful organizations view such databases as important pieces of the marketing infrastructure • Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies
Data Mining Association Rules • For example, 98% of customers that purchase tires and auto accessories also get automotive services done • Finding all such rules is valuable for cross-marketing and attached mailing applications • Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns • The databases involved in these applications are very large, so it is imperative to have fast algorithms for this task
Formal Definition of Problem • Formal statement of the problem [AIS93]: • Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. • A transaction T contains X, a set of some items in I, if X ⊆ T. • An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø. • The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. • The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
Example of Confidence and Support • Given a table of transactions such as the hypothetical one used in the sketch below: • 1 ⇒ 2 has 100% confidence, with 100% support • 3 ⇒ 4 has 100% confidence, with 33% support • 2 ⇒ 3 has 33% confidence, with 33% support
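Support and confidence can be checked mechanically. Below is a minimal Python sketch; since the slide's original table is not reproduced in this text, it uses a hypothetical three-transaction database chosen to be consistent with the percentages above.

```python
# Hypothetical transactions consistent with the percentages above.
transactions = [{1, 2}, {1, 2}, {1, 2, 3, 4}]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(confidence({1}, {2}), support({1, 2}))   # 1.0,  1.0   (rule 1 => 2)
print(confidence({3}, {4}), support({3, 4}))   # 1.0,  0.33  (rule 3 => 4)
print(confidence({2}, {3}), support({2, 3}))   # 0.33, 0.33  (rule 2 => 3)
```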
Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. • Use the large itemsets to generate the desired rules. If ABCD and AB are large itemsets, then we can determine if the rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, then the rule holds. (The rule will surely have minimum support because ABCD is large.)
Discovering Large Itemsets • Algorithms for discovering large itemsets make multiple passes over the data • In the first pass, we count the support of individual items and determine which of them are large (i.e., have minimum support) • In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass, use this seed set to generate new potentially large itemsets, called candidate itemsets, and count the actual support of these candidate itemsets during the pass over the data • At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass • This process continues until no new large itemsets are found
Notation • The number of items in an itemset is its size; an itemset of size k is called a k-itemset • Items within an itemset are kept in lexicographic order
Algorithm Apriori • L1 = {large 1-itemsets}; • for ( k = 2; Lk-1 ≠ Ø; k++ ) do begin • Ck = apriori-gen(Lk-1); // New candidates • forall transactions t ∈ D do begin • Ct = subset(Ck, t); // Candidates contained in t • forall candidates c ∈ Ct do • c.count++; • end • Lk = {c ∈ Ck | c.count ≥ minsup} • end • Answer = ∪k Lk;
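The pseudocode above can be rendered as a short Python sketch (a minimal illustration, not the paper's implementation). Here `min_sup` is an absolute support count, itemsets are kept as sorted tuples, `apriori_gen` anticipates the join-and-prune candidate generation described on the next slides, and the `subset(Ck, t)` function is replaced by a direct containment test.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Join then prune, as in the apriori-gen step on the following slides."""
    prev = set(prev_frequent)              # (k-1)-itemsets as sorted tuples
    k = len(next(iter(prev))) + 1
    candidates = set()
    for p in prev:
        for q in prev:
            # Join step: p and q agree on the first k-2 items, p's last < q's last.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be large.
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                 # first pass: large 1-itemsets
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_sup}
    answer = dict(frequent)
    while frequent:                        # level-wise passes
        candidates = apriori_gen(frequent)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:           # stand-in for subset(Ck, t)
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        answer.update(frequent)
    return answer

# Toy usage with an absolute minimum support of 2.
print(apriori([{1, 2, 3}, {1, 2, 4}, {1, 2}, {2, 3}], min_sup=2))
```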
Apriori Candidate Generation • The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets. • First, in the join step, we join Lk-1 with Lk-1: insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 • Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1: forall itemsets c ∈ Ck do forall (k-1)-subsets s of c do if (s ∉ Lk-1) then delete c from Ck;
Apriori-gen Example • Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}} • In the join step: • {1 2 3} joins with {1 2 4} to produce {1 2 3 4} • {1 3 4} joins with {1 3 5} to produce {1 3 4 5} • After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}} • In the prune step: • {1 2 3 4} is tested for the existence of its 3-item subsets in L3, namely {1 2 3}, {1 3 4}, and {2 3 4} • {1 3 4 5} is tested for {1 3 4}, {1 4 5}, and {3 4 5}; {1 4 5} is not found, so this itemset is deleted • We are then left with only {1 2 3 4} in C4
Discovering Rules • For every large itemset l, we find all non-empty subsets of l • For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf • We consider all subsets of l to generate rules with multiple consequents
Simple Algorithm • We can improve the preceding procedure by generating the subsets of a large itemset in a recursive depth-first fashion • For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc. • If a subset a of a large itemset l does not generate a rule, then the subsets of a need not be considered for generating rules using l • For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds
Simple Algorithm • forall large itemsets lk, k ≥ 2 do • call genrules(lk, lk); • Procedure genrules(lk: large k-itemset, am: large m-itemset) • A = {(m-1)-itemsets am-1 | am-1 ⊂ am}; • forall am-1 ∈ A do begin • conf = support(lk)/support(am-1); • if (conf ≥ minconf) then begin • output the rule am-1 ⇒ (lk - am-1), with confidence = conf and support = support(lk); • if (m - 1 > 1) then • call genrules(lk, am-1); // to generate rules with subsets of am-1 as the antecedents • end • end
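A minimal Python sketch of this recursive procedure (an illustration, not the paper's code). It assumes `support` maps frozensets of items to support counts, for example obtained by converting the keys of the Apriori sketch shown earlier; rules are kept in a dict so that a rule revisited through different recursion branches is only stored once.

```python
from itertools import combinations

def genrules(l_k, a_m, support, min_conf, rules):
    """Emit rules a -> (l_k - a) for (m-1)-subsets a of a_m, pruning as above."""
    for a in combinations(sorted(a_m), len(a_m) - 1):
        a = frozenset(a)
        conf = support[l_k] / support[a]
        if conf >= min_conf:
            rules[(a, l_k - a)] = (conf, support[l_k])
            if len(a) > 1:
                # Only antecedents whose supersets produced a rule are explored.
                genrules(l_k, a, support, min_conf, rules)

# Hypothetical support counts, for illustration only.
support = {
    frozenset('A'): 5, frozenset('B'): 4, frozenset('C'): 4,
    frozenset('AB'): 4, frozenset('AC'): 3, frozenset('BC'): 3,
    frozenset('ABC'): 3,
}
rules = {}
for itemset in (s for s in support if len(s) >= 2):
    genrules(itemset, itemset, support, min_conf=0.75, rules=rules)
for (antecedent, consequent), (conf, sup) in rules.items():
    print(sorted(antecedent), '->', sorted(consequent), round(conf, 2), sup)
```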
Faster Algorithm • Earlier we used the fact that if a rule does not hold, then no rule obtained by moving items from its antecedent into its consequent can hold either • For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either • Read in the other direction: if a rule holds, then every rule obtained by moving items from its consequent back into its antecedent also holds • For example, if AB ⇒ CD has enough confidence, ABC ⇒ D will also hold
Faster Algorithm • forall large itemsets lk, k ≥ 2 do begin • H1 = { consequents of rules derived from lk with one item in the consequent }; • call ap-genrules(lk, H1); • end • Procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents) • if (k > m + 1) then begin • Hm+1 = apriori-gen(Hm); • forall hm+1 ∈ Hm+1 do begin • conf = support(lk)/support(lk - hm+1); • if (conf ≥ minconf) then • output the rule (lk - hm+1) ⇒ hm+1 with confidence = conf and support = support(lk); • else • delete hm+1 from Hm+1; • end • call ap-genrules(lk, Hm+1); • end
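And a sketch of the faster variant, under the same assumptions as the previous sketch (`support` maps frozensets to counts; the function names are invented for the sketch). The candidate (m+1)-item consequents are built with a simplified stand-in for apriori-gen applied to the surviving m-item consequents.

```python
def ap_genrules(l_k, h_m, support, min_conf, rules):
    """h_m: m-item consequents of confident rules already derived from l_k."""
    m = len(next(iter(h_m)))
    if len(l_k) <= m + 1:
        return
    # Candidate (m+1)-item consequents: unions of two surviving m-item ones
    # (a simplified stand-in for apriori-gen applied to h_m).
    h_next = {a | b for a in h_m for b in h_m if len(a | b) == m + 1}
    survivors = set()
    for h in h_next:
        conf = support[l_k] / support[l_k - h]
        if conf >= min_conf:
            rules[(l_k - h, h)] = (conf, support[l_k])
            survivors.add(h)
    if survivors:
        ap_genrules(l_k, survivors, support, min_conf, rules)

def generate_rules(support, min_conf):
    rules = {}
    for l_k in (s for s in support if len(s) >= 2):
        # H1: one-item consequents with enough confidence.
        h1 = {frozenset({i}) for i in l_k
              if support[l_k] / support[l_k - {i}] >= min_conf}
        for h in h1:
            rules[(l_k - h, h)] = (support[l_k] / support[l_k - h], support[l_k])
        if h1:
            ap_genrules(l_k, h1, support, min_conf, rules)
    return rules

# Usage: generate_rules(support, 0.75) with a support dict like the one in the
# previous sketch yields the same rules while testing fewer candidate consequents.
```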
Faster Algorithm • For example, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item consequent rules with minimum confidence. • In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold. • The first of these rules cannot hold, because E ⊂ BE and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons. • The only two-item consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item consequent rules. This is the only rule that will be tested by the faster algorithm.
Mining Frequent Itemsets • If we use candidate generation • Need to generate a huge number of candidate sets. • Need to repeatedly scan the database and check a large set of candidates by pattern matching. • Can we avoid that? • FP-Trees (Frequent Pattern Trees)
FP-Tree • General Idea • Divide and Conquer • Outline of the procedure • For each item create its conditional pattern base • Then expand it and create its conditional FP-Tree • Repeat the process recursively on each constructed FP-Tree • Until FP-Tree is either empty or contains only one path
Example
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}
• For minimum support = 50%
• Step 1: Scan the DB and find the frequent 1-itemsets
• Step 2: Order them in descending order of frequency
• Step 3: Scan the DB again and construct the FP-Tree
Frequent items and their frequencies: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
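The three steps can be sketched in Python as follows (a minimal illustration of the construction only; names such as `FPNode` and `build_fp_tree` are invented for this sketch, and transactions are assumed not to contain duplicate items).

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item label, a count, a parent, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                 # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Step 1: scan the DB and count item frequencies.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Step 2: keep frequent items, ordered by descending frequency.
    frequent = sorted((i for i in counts if counts[i] >= min_sup),
                      key=lambda i: -counts[i])
    rank = {item: r for r, item in enumerate(frequent)}
    # Step 3: scan the DB again and insert each transaction's frequent items,
    # most frequent first, sharing prefixes with earlier transactions.
    root = FPNode(None)
    header = defaultdict(list)             # item -> node-links into the tree
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

# The example database above, with minimum support 50% of 5 transactions (count 3).
db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'),
      list('afcelpmn')]
root, header = build_fp_tree(db, min_sup=3)
# The resulting tree should match the one on the next slide (ties between the
# equally frequent items f and c may be ordered either way).
```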
Example – Conditional Pattern Base
• Start from the frequent-item header table
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate the prefix paths of that item to form its conditional pattern base
Header Table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
FP-tree built from the example database (children indented under their parent):
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Conditional pattern bases
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Example – Conditional FP-Tree • For every conditional pattern base: • Accumulate the frequency of each item • Construct an FP-Tree from the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: the single path {root} → f:3 → c:3 → a:3 • All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Mining The FP-Tree • Construct the conditional pattern base for each item in the initial FP-Tree • Construct a conditional FP-Tree from each conditional pattern base • Recursively mine the conditional FP-Trees and grow the frequent patterns obtained so far • If a conditional FP-Tree contains only a single path, enumerate all combinations of its items
Example – Mining The FP-Tree
Item   Conditional pattern base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)} | p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}         NULL
a      {(fc:3)}                        {(f:3, c:3)} | a
c      {(f:3)}                         {(f:3)} | c
f      NULL                            NULL
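The table above can be reproduced with a compact sketch of the recursive mining step (an illustration only). Instead of materializing each conditional FP-tree as explicit nodes, this simplified version represents every conditional pattern base as a mapping from prefix tuples, ordered most-frequent-first, to counts, which is all the recursion needs; `fp_growth` and the other names are invented for the sketch.

```python
from collections import defaultdict

def fp_growth(pattern_base, suffix, min_sup, results):
    """pattern_base: {prefix tuple (most frequent first): count} for `suffix`."""
    # Accumulate the frequency of each item in this conditional pattern base.
    counts = defaultdict(int)
    for prefix, cnt in pattern_base.items():
        for item in prefix:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < min_sup:
            continue
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = cnt
        # Conditional pattern base for item: the part of each prefix above it.
        cond = defaultdict(int)
        for prefix, c in pattern_base.items():
            if item in prefix:
                cond[prefix[:prefix.index(item)]] += c
        fp_growth(dict(cond), new_suffix, min_sup, results)

db = [list('facdgimp'), list('abcflmo'), list('bfhjo'), list('bcksp'),
      list('afcelpmn')]
min_sup = 3
# Order each transaction's frequent items by descending global frequency and
# treat the whole database as the initial (unconditional) pattern base.
item_counts = defaultdict(int)
for t in db:
    for i in t:
        item_counts[i] += 1
rank = {i: r for r, i in enumerate(sorted(
    (i for i in item_counts if item_counts[i] >= min_sup),
    key=lambda i: -item_counts[i]))}
base = defaultdict(int)
for t in db:
    base[tuple(sorted((i for i in t if i in rank), key=rank.get))] += 1

results = {}
fp_growth(dict(base), (), min_sup, results)
# The patterns containing m come out as m, fm, cm, am, fcm, fam, cam, fcam,
# each with count 3, matching the earlier slide.
```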
Example – Mining The FP-Tree • Recursively visit the conditional FP-Trees • m-conditional FP-tree: {root} → f:3 → c:3 → a:3 • am-conditional FP-tree: {root} → f:3 → c:3 • cm-conditional FP-tree: {root} → f:3 • cam-conditional FP-tree: {root} → f:3
Example – Mining The FP-Tree • If a conditional FP-tree consists of a single path, the frequent itemsets can be generated by enumerating all combinations of the nodes on that path • m-conditional FP-tree: {root} → f:3 → c:3 → a:3 • Enumeration of all patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
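The single-path case is just an enumeration of item combinations, as in this small sketch (the path and counts are hard-coded from the m-conditional FP-tree above).

```python
from itertools import combinations

# Single path of the m-conditional FP-tree: f:3 -> c:3 -> a:3, suffix m (count 3).
path = [('f', 3), ('c', 3), ('a', 3)]
suffix, suffix_count = ('m',), 3

for r in range(len(path) + 1):
    for combo in combinations(path, r):
        items = tuple(i for i, _ in combo) + suffix
        # The support of a combination is the smallest count among the chosen
        # nodes and the suffix (here everything is 3).
        print(items, min([c for _, c in combo] + [suffix_count]))
```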
Advantages of FP-Tree • Completeness: • Never breaks a long pattern of any transaction • Preserves complete information for frequent-pattern mining • Compactness: • Reduces irrelevant information: infrequent items are removed • Frequency-descending ordering: more frequent items are more likely to be shared • Never larger than the original database (not counting node-links and counts)
Comparison of FP-Tree Method and Apriori • Data set: T25I20D10K
References • R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pages 207-216, May 1993. • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), 1994. • J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), pages 1-12, 2000.