Association rule mining Prof. Navneet Goyal CSIS Department, BITS-Pilani
Association Rule Mining • Find all rules of the form Itemset1 → Itemset2 having: • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds • Computationally prohibitive!
Association Rule Mining • 2 step process • FI generation • Rule generation
FI Generation • Brute-force approach: • Each itemset in the lattice is a candidate FI • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw) for N transactions, M candidates and maximum transaction width w => expensive, since M = 2^d!
Computational Complexity • Given d unique items: • Total number of itemsets = 2^d • Total number of possible association rules: R = 3^d − 2^(d+1) + 1 • If d = 6, R = 729 − 128 + 1 = 602 rules
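The rule count above can be checked numerically. The sketch below counts rules X → Y directly (X and Y non-empty and disjoint) and compares the result with the closed form; the function names are illustrative, not from the slides.

```python
from math import comb

def count_rules_by_enumeration(d: int) -> int:
    """Count rules X -> Y with X, Y non-empty and disjoint, drawn from d items."""
    total = 0
    for k in range(1, d):                 # size of the antecedent X
        for j in range(1, d - k + 1):     # size of the consequent Y
            total += comb(d, k) * comb(d - k, j)
    return total

def count_rules_closed_form(d: int) -> int:
    """Closed form R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

print(count_rules_by_enumeration(6), count_rules_closed_form(6))
```

For d = 6 both functions give 602, matching the slide.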
Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Complete search: M = 2^d • Use pruning techniques to reduce M • Reduce the number of transactions (N) • Reduce size of N as the size of itemset increases • Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction
Reducing Number of Candidates • Apriori Principle • If an itemset is frequent, then all its subsets must be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support
Illustrating Apriori Principle [Figure: itemset lattice showing an itemset found to be infrequent and its pruned supersets]
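A minimal sketch of level-wise candidate generation with Apriori pruning, assuming support is given as an absolute count. The five-transaction grocery database is hypothetical toy data (the slides do not list the transactions), and the function name `apriori` is illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support_count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database (assumption, not taken from the slides)
db = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
frequent_itemsets = apriori(db, min_count=2)
```

Because {Jelly} is infrequent here, no superset of {Jelly} is ever generated as a candidate, which is the pruning the lattice figure illustrates.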
Sampling Algorithm • The transaction DB can get very big! • Sample the DB and apply Apriori to the sample • Use a reduced minsup (smalls) • Find large (frequent) itemsets from the sample using smalls • Call this set of large itemsets Potentially Large (PL) • Find the negative border (BD-) of PL: the minimal set of itemsets which are not in PL, but whose subsets are all in PL.
Negative Border Example Let Items = {A,…,F} and there are itemsets: {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F} The whole negative border is: {{B,C}, {B,F}, {D}, {E}}
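The negative border in this example can be computed by brute force. The sketch below assumes the given collection is downward closed (every subset of a member is also a member), so checking immediate subsets is enough; the function name is illustrative.

```python
from itertools import combinations

def negative_border(items, collection):
    """Itemsets not in `collection` whose immediate subsets are all in it."""
    coll = {frozenset(s) for s in collection}
    coll.add(frozenset())                      # the empty set always counts as present
    max_size = max(len(s) for s in coll) + 1   # border itemsets are at most one item larger
    border = set()
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            cand = frozenset(combo)
            if cand not in coll and all(cand - {x} in coll for x in cand):
                border.add(cand)
    return border

items = "ABCDEF"
itemsets = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
border = negative_border(items, itemsets)
print(sorted("".join(sorted(s)) for s in border))
```

Running it on the itemsets above yields {B,C}, {B,F}, {D} and {E}, as stated on the slide.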
Sampling Algorithm: Example • Sample Database = {t1, t2} • Smalls = 20% & Min_sup = 40% • PL = {{Br},{PB},{J},{Br, J},{Br, PB},{J,PB}} • BD- (PL) = {{M},{Be}} • C1 = PL U BD- (PL) = {{Br},{PB},{J},{M},{Be},{Br, J},{Br, PB},{J,PB}} • First scan of the DB to find L with min_sup = 40% (itemset must appear in 2 txs.)
Sampling Algorithm: Example • L = {{Br},{PB},{M},{Be},{Br, PB}} • Set C2 = L • BD- (C2) = {{Br, M}, {Br, Be}, {PB,M}, {PB,Be},{M,Be}} (ignore those itemsets which we know are not large, e.g. {J} and its supersets) • C3 = C2 U BD- (C2) = {{Br},{PB},{M},{Be},{Br, PB}, {Br, M}, {Br, Be}, {PB,M}, {PB,Be},{M,Be}}
Sampling Algorithm: Example • Now again find the negative border of C3 • BD- (C3) = {{Br, PB,M}, {Br, M, Be}, {Br, PB, Be}, {PB,M,Be}} • C4 = C3 U BD- (C3) = {{Br},{PB},{M},{Be},{Br, PB}, {Br, M}, {Br, Be}, {PB,M}, {PB,Be},{M,Be},{Br, PB,M}, {Br, M, Be}, {Br, PB, Be}, {PB,M,Be}} • BD- (C4) = {{Br, PB, M, Be}}
Sampling Algorithm: Example • So finally C5 = {{Br},{PB},{M},{Be},{Br, PB}, {Br, M}, {Br, Be}, {PB,M}, {PB,Be},{M,Be},{Br, PB,M}, {Br, M, Be}, {Br, PB, Be}, {PB,M,Be},{Br, PB, M, Be}} • Now it is easy to see that BD- (C5) = ∅ • Do the scan of the DB (second scan) to find the frequent itemsets. While doing this scan you need not check the itemsets in L • Final L = {{Br},{PB},{M},{Be},{Br, PB}}
Toivonen’s Algorithm • Start as in the simple algorithm, but lower the threshold slightly for the sample. • Example: if the sample is 1% of the baskets, use 0.008 × s as the sample's support threshold rather than 0.01 × s, where s is the support threshold for the full set of baskets. • Goal is to avoid missing any itemset that is frequent in the full set of baskets.
Toivonen’s Algorithm (contd.) • Add to the itemsets that are frequent in the sample the negative border of these itemsets. • An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are. • Example: ABCD is in the negative border if and only if it is not frequent, but all of ABC, BCD, ACD, and ABD are.
Toivonen’s Algorithm (contd.) • In a second pass, count all candidate frequent itemsets from the first pass, and also count the negative border. • If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
Toivonen’s Algorithm (contd.) • What if we find something in the negative border is actually frequent? • We must start over again! • But by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass low enough for main-memory.
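One round of the procedure described above can be sketched as follows. This is a sketch under stated assumptions: the sample is mined with a brute-force stand-in for Apriori (suitable only for a handful of items), thresholds are fractions, and the five-transaction database is hypothetical toy data. All function names are illustrative.

```python
import math
from itertools import combinations

def frequent_itemsets(transactions, min_count, items):
    """Brute-force stand-in for Apriori; stops at the first empty level."""
    found = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if sum(1 for t in transactions if set(c) <= t) >= min_count}
        if not level:
            break
        found |= level
    return found

def negative_border(items, collection):
    """Itemsets not in `collection` whose immediate subsets are all in it."""
    coll = set(collection) | {frozenset()}
    top = max(len(s) for s in coll) + 1
    return {frozenset(c)
            for k in range(1, top + 1)
            for c in combinations(items, k)
            if frozenset(c) not in coll
            and all(frozenset(c) - {x} in coll for x in c)}

def toivonen_round(db, sample, min_support, lowered_support):
    """Return the frequent itemsets of db, or None if this round fails."""
    items = sorted(set().union(*db))
    sample_min = max(1, math.ceil(lowered_support * len(sample)))
    candidates = frequent_itemsets(sample, sample_min, items)
    border = negative_border(items, candidates)
    full_min = math.ceil(min_support * len(db))

    def count(itemset):
        return sum(1 for t in db if itemset <= t)

    # Second pass: a frequent border itemset means the sample missed something
    if any(count(s) >= full_min for s in border):
        return None
    return {s for s in candidates if count(s) >= full_min}

# Hypothetical toy database (assumption, not taken from the slides)
db = [
    frozenset({"Bread", "Jelly", "PeanutButter"}),
    frozenset({"Bread", "PeanutButter"}),
    frozenset({"Bread", "Milk", "PeanutButter"}),
    frozenset({"Beer", "Bread"}),
    frozenset({"Beer", "Milk"}),
]
failed = toivonen_round(db, db[:2], min_support=0.4, lowered_support=0.2)
succeeded = toivonen_round(db, db, min_support=0.4, lowered_support=0.2)
```

With the first two transactions as the sample, {Beer} and {Milk} land in the negative border and turn out to be frequent in the full data, so the round fails and a new sample is needed. Using the whole database as the "sample" trivially succeeds.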
Conclusions • Advantages: Reduced failure probability, while keeping candidate-count low enough for memory • Disadvantages: Potentially large number of candidates in second pass
Partitioning • Divide database into partitions D1,D2,…,Dp • Apply Apriori to each partition • Any large itemset must be large in at least one partition • DO YOU AGREE? • Let’s do the proof! • Remember proof by contradiction
Partitioning Algorithm • Divide D into partitions D1,D2,…,Dp; • For i = 1 to p do • Li = Apriori(Di); • C = L1 ∪ … ∪ Lp; • Count C on D to generate L; • Do we need to count? • Is C=L?
Partitioning Example • S = 10% • D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}} • D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}
Partitioning • Advantages: • Adapts to available main memory • Easily parallelized • Maximum number of database scans is two. • Disadvantages: • May have many candidates during second scan.
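A sketch of the two-scan partitioning scheme, assuming a brute-force stand-in for Apriori on each partition. The transactions in `d1` and `d2` are hypothetical, chosen to be consistent with the L1 and L2 listed in the example at S = 10%; function names are illustrative.

```python
import math
from itertools import combinations

def local_frequent(partition, min_frac):
    """Brute-force stand-in for Apriori(Di): itemsets frequent within one partition."""
    need = math.ceil(min_frac * len(partition))
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                key = frozenset(c)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, n in counts.items() if n >= need}

def partition_mine(partitions, min_frac):
    """Return (C, L): the global candidates and the globally frequent itemsets."""
    # Scan 1: the union of locally frequent itemsets forms the candidate set C
    candidates = set().union(*(local_frequent(d, min_frac) for d in partitions))
    # Scan 2: count every candidate on the whole database
    db = [t for d in partitions for t in d]
    need = math.ceil(min_frac * len(db))
    frequent = {c for c in candidates if sum(1 for t in db if c <= t) >= need}
    return candidates, frequent

# Hypothetical partitions (assumption, not listed in the slides)
d1 = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}]
d2 = [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
C, L = partition_mine([d1, d2], min_frac=0.4)
```

At a 40% threshold this toy data gives 9 candidates but only 5 globally frequent itemsets, so C ≠ L in general and the second scan is needed. L is always a subset of C.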
AR Generation from FIs • So far we have seen algorithms for finding FIs • Let's now look at how we can generate the ARs from FIs • FIs are concerned only with “support” • Time to bring in the concept of “confidence” • For each FI l, generate all non-empty proper subsets of l • For each such subset s of l, output the rule s → (l − s) if support(l) / support(s) ≥ min_conf
AR Generation from FIs • For each k-FI, Y, we can have up to 2^k − 2 ARs. • Ignore empty antecedents/consequents • Partition Y into 2 non-empty subsets X & Y − X, such that X → Y − X satisfies min_conf • We need not worry about min_sup! • Y = {1,2,3} • 6 candidate ARs: {1,2} → {3}, {1,3} → {2}, {2,3} → {1}, {1} → {2,3}, {2} → {1,3}, {3} → {1,2} • Do we need any additional scans to find confidence? • For {1,2} → {3}, the confidence is support({1,2,3}) / support({1,2}) • {1,2,3} is frequent, therefore {1,2} is also frequent. So no need to find support counts again
Rule Generation • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement • If {A,B,C,D} is a frequent itemset, candidate rules: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB • If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
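A minimal sketch of rule generation from stored support counts, assuming the counts of all frequent itemsets (and therefore of all their subsets) are already available. It enumerates every candidate rule by brute force rather than pruning the rule lattice, and the support counts shown are hypothetical.

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """support: {frozenset: count} for every frequent itemset.

    Returns a list of (antecedent, consequent, confidence) triples.
    """
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = count / support[lhs]   # dictionary lookup, no database scan
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical support counts (assumption), consistent with the database
# {1,2,3}, {1,2,3}, {1,2}, {1}, {3}
support = {
    frozenset({1}): 4, frozenset({2}): 3, frozenset({3}): 3,
    frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({1, 2, 3}): 2,
}
all_rules = generate_rules(support, min_conf=0.0)
strong_rules = generate_rules(support, min_conf=0.75)
```

With `min_conf=0.0` the 3-itemset {1,2,3} contributes its 2^3 − 2 = 6 candidate rules, and the confidence of {1,2} → {3} comes out as 2/3.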
Rule Generation • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D) • But confidence of rules generated from the same itemset has an anti-monotone property • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
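The ordering of confidences for rules from the same itemset follows directly from the anti-monotone property of support. In the short derivation below, σ(·) denotes the support count (notation assumed here, not used on the slide); all three rules share the numerator σ(ABCD), so only the denominators differ.

```latex
c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)}, \qquad
c(AB \to CD) = \frac{\sigma(ABCD)}{\sigma(AB)}, \qquad
c(A \to BCD) = \frac{\sigma(ABCD)}{\sigma(A)}

\sigma(ABC) \le \sigma(AB) \le \sigma(A)
\;\Longrightarrow\;
c(ABC \to D) \ge c(AB \to CD) \ge c(A \to BCD)
```

So if ABC → D already fails min_conf, every rule obtained by moving more items to the right-hand side fails as well and can be pruned.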
Rule Generation for Apriori Algorithm [Figure: lattice of rules showing a low-confidence rule and the pruned rules]
Next Class • Time Complexity of algorithms for finding FIs • Efficient Counting for FIs using hash tree • PCY algorithm for FI
Computational Complexity • Factors affecting computational complexity of Apriori: • Min_sup • No. of items (dimensionality) • No. of transactions • Average transaction width
Computational Complexity • No. of items (dimensionality) • More space for storing support counts of items • If the no. of FIs grows with dimensionality, the computation & I/O costs will increase because of the large no. of candidates generated by the algorithm • No. of transactions • Apriori makes repeated passes over the transaction DB • Run time increases as a result • Average transaction width • Max. size of FIs increases as the average transaction width increases • More itemsets need to be examined during candidate generation and support counting • As width increases, more itemsets are contained in each transaction, which increases the number of hash tree traversals
Support Counting • Compare each tx. against every candidate itemset & update the support count of candidates contained in the tx. • Computationally expensive when the no. of txs. & no. of candidates are large • How to make it efficient? • Enumerate all itemsets contained in a tx. & use them to update support counts of their respective CIs • T1 has {1,2,3,5,6}. C(5,3) = 10 itemsets of size 3. Some of these 10 will correspond to candidates in C3; the others are ignored. • How to make the matching operation efficient? • Use a HASH TREE!
Support Counting Given a transaction t, what are the possible subsets of size 3?
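The enumeration asked for above can be sketched with `itertools.combinations`. The candidate list passed to the counting function is a hypothetical three-itemset sample, and the function name is illustrative.

```python
from itertools import combinations

def count_by_enumeration(transactions, candidates, k):
    """Update support counts by enumerating the k-subsets of each transaction."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:          # subsets that are not candidates are ignored
                counts[key] += 1
    return counts

t1 = {1, 2, 3, 5, 6}
three_subsets = list(combinations(sorted(t1), 3))
print(len(three_subsets))

# Hypothetical candidates (assumption): two contained in t1, one not
counts = count_by_enumeration([t1], [(1, 2, 5), (1, 3, 6), (4, 5, 7)], k=3)
```

The transaction {1,2,3,5,6} has C(5,3) = 10 subsets of size 3, and only those that are also candidates get their counts incremented.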
Hash Tree • Partition CIs into different buckets and store them in a hash tree • During support counting, itemsets in each tx. are also hashed into their appropriate buckets using the same hash function
Hash Tree • Example: 3-itemset • All candidate 3-itemsets are hashed • Enumerate all the 3-itemsets of the tx. • All 3-itemsets contained in a transaction are also hashed • Comparison of a 3-itemset of tx. with all candidate 3-itemsets is avoided • Comparison is required to be done only in the appropriate bucket • Saves time
Hash Tree • For each internal node use hash fn. h(p) = p mod 3 • All candidate itemsets are stored at the leaf nodes of the hash tree • Suppose you have 15 candidate 3-itemsets: • {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} [Figure: candidate hash tree for these 15 itemsets; the hash function groups items into the branches 1,4,7 / 2,5,8 / 3,6,9]
Hash Tree [Figure: candidate hash tree with the branch for hashing on 1, 4 or 7 highlighted]
Hash Tree [Figure: candidate hash tree with the branch for hashing on 2, 5 or 8 highlighted]
Association Rule Discovery: Hash Tree [Figure: candidate hash tree with the branch for hashing on 3, 6 or 9 highlighted]
Subset Operation Using Hash Tree [Figure: at the root, transaction {1 2 3 5 6} is split by its first item into 1 + {2 3 5 6}, 2 + {3 5 6} and 3 + {5 6}, and each part is hashed into the candidate hash tree]
Subset Operation Using Hash Tree [Figure: at the second level, 1 + {2 3 5 6} is split further into 1 2 + {3 5 6}, 1 3 + {5 6} and 1 5 + {6}]
Subset Operation Using Hash Tree [Figure: the leaves reached by the traversal] • Match transaction against 11 out of 15 candidates
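The hash tree and the subset operation can be sketched as below, using h(p) = p mod 3 and the 15 candidates from the slides. The leaf capacity (`max_leaf=3`) and the class and function names are assumptions, so the tree shape and the number of candidates actually compared may differ from the figures.

```python
class Node:
    def __init__(self):
        self.children = {}    # hash value -> Node (used by internal nodes)
        self.itemsets = []    # candidates stored here (used by leaf nodes)
        self.is_leaf = True

def h(item):
    return item % 3

def insert(node, itemset, depth=0, max_leaf=3):
    """Insert a sorted candidate tuple, splitting overfull leaves."""
    if node.is_leaf:
        node.itemsets.append(itemset)
        # Split only while there is still an unhashed item position left
        if len(node.itemsets) > max_leaf and depth < len(itemset):
            node.is_leaf = False
            pending, node.itemsets = node.itemsets, []
            for s in pending:
                child = node.children.setdefault(h(s[depth]), Node())
                insert(child, s, depth + 1, max_leaf)
    else:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1, max_leaf)

def matches(node, transaction, start=0, found=None):
    """Return the candidates contained in the sorted transaction tuple."""
    found = set() if found is None else found
    if node.is_leaf:
        items = set(transaction)
        # Only candidates in the leaves actually reached are compared
        found.update(s for s in node.itemsets if items.issuperset(s))
    else:
        for i in range(start, len(transaction)):
            child = node.children.get(h(transaction[i]))
            if child is not None:
                matches(child, transaction, i + 1, found)
    return found

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

contained = matches(root, (1, 2, 3, 5, 6))
```

Collecting matches in a set guards against counting a candidate twice when the same leaf is reached along more than one path through the tree.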
Compact Representation of FIs • Generally, the no. of FIs generated by a tx. DB can be very large • Good if we could identify a small representative set of FIs from which all other FIs could be generated • 2 such representations: • Maximal FIs • Closed FIs
Maximal Frequent Itemset • A frequent itemset is maximal if none of its immediate supersets is frequent [Figure: itemset lattice showing the maximal itemsets along the border between frequent and infrequent itemsets]
Maximal FIs • Maximal FIs are the smallest set of itemsets from which all the FIs can be derived • Maximal FIs do not contain support information of their subsets • An additional scan of the DB is needed to determine the support count of the non-maximal FIs
Closed Itemset • An itemset is closed if none of its immediate supersets has the same support as the itemset
Maximal vs Closed Itemsets [Figure: itemset lattice annotated with transaction IDs; itemsets not supported by any transaction are marked]
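Both compact representations can be extracted from a table of frequent itemsets and their support counts. The sketch below uses hypothetical support counts and restricts the closedness check to frequent supersets, which suffices for frequent itemsets because an infrequent superset has strictly lower support.

```python
def maximal_and_closed(support):
    """support: {frozenset: count} holding every frequent itemset."""
    maximal, closed = set(), set()
    for s, n in support.items():
        # Immediate frequent supersets: exactly one extra item
        supersets = [t for t in support if len(t) == len(s) + 1 and s < t]
        if not supersets:
            maximal.add(s)               # no immediate superset is frequent
        if all(support[t] < n for t in supersets):
            closed.add(s)                # no immediate superset has the same support
    return maximal, closed

# Hypothetical support counts (assumption), consistent with the database
# {1,2,3}, {1,2,3}, {1,2}, {1}, {3}
support = {
    frozenset({1}): 4, frozenset({2}): 3, frozenset({3}): 3,
    frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({1, 2, 3}): 2,
}
maximal, closed = maximal_and_closed(support)
```

In this toy example the only maximal frequent itemset is {1,2,3}, while the closed itemsets are {1}, {3}, {1,2} and {1,2,3}. Every maximal frequent itemset is also closed.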