340 likes | 637 Views
Chapter 5 Frequent Patterns and Association Rule Mining. Outline. Frequent Itemsets and Association Rule APRIORI Efficient Mining Post-processing Applications. Transactional Data. Definitions: An item : an article in a basket, or an attribute-value pair
E N D
Chapter 5Frequent Patterns and Association Rule Mining CISC6930
Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications CISC6930
Transactional Data • Definitions: • An item: an article in a basket, or an attribute-value pair • A transaction: set of items purchased in a basket; it may have TID (transaction ID) • A transactionaldataset: A set of transactions CISC6930
Itemsets & Frequent Itemsets • itemset: A set of one or more items • k-itemset X = {x1, …, xk} • (absolute) support, or, support count of X: Frequency or occurrence of an itemset X • (relative)support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X’s support is no less than a minsup threshold CISC6930
Association Rule • Find all the rules X Ywith minimum support and confidence • support, s, probability that a transaction contains X Y • s = P(X Y) = support count (X Y) / number of all transactions • confidence, c,conditional probability that a transaction having X also contains Y • c = P(X|Y) = support count (X Y) / support count (X) CISC6930
Customer buys both Customer buys diaper Customer buys beer Example • Let minsup = 50%, minconf = 50% • Number of all transactions = 5 & Min. support count = 5*50%=2.5 ⇒ 3 • Items: Beer, Nuts, Diaper, Coffee, Eggs, Milk • Freq. Pat.: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3 • Association rules (support, confidence): • Beer Diaper (60%, 100%) • Diaper Beer (60%, 75%) CISC6930
Use of Association Rules • Association rules do not necessarily represent causality or correlation between the two itemsets. • X Y does not mean X causes Y, noCausality • X Y can be different from Y X, unlike correlation • Association rules assist in Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. • What products were often purchased together?— Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? CISC6930
Computational Complexity of Frequent Itemset Mining • How many itemsets are potentially to be generated in the worst case? • The number of frequent itemsets to be generated is senstive to the minsup threshold • When minsup is low, there exist potentially an exponential number of frequent itemsets • The worst case is close to MN where M: # distinct items, and N: max length of transactions when M is large. • The worst case complexty vs. the expected probability • Ex. Suppose Walmart has 104 kinds of products • The chance to pick up one product 10-4 • The chance to pick up a particular set of 10 products: ~10-40 • What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? CISC6930
Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications CISC6930
Association Rule Mining • Major steps in association rule mining • Frequent itemsets computation • Rule derivation • Use of support and confidence to measure strength CISC6930
The Downward Closure Property • The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} CISC6930
ABC ABD ACD BCD AB AC AD BC BD CD A B C D Apriori: A Candidate Generation & Test Approach • A frequent (used to be called large) itemset is an itemset whose support (S) is ≥ minSup. • Apriori pruning principle: • If there is any itemset which is infrequent, its superset should not be generated/tested! CISC6930
APRIORI • Method: • Initially, scan DB once to get frequent 1-itemset • Generatelength (k+1)candidateitemsets from length kfrequentitemsets • Test the candidates against DB • Terminate when no frequent or candidate set can be generated CISC6930
The Apriori Algorithm—An Example Supmin = 2 Database TDB L1 C1 1st scan C2 C2 L2 2nd scan L3 C3 3rd scan CISC6930
The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for(k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end returnkLk; CISC6930
Implementation of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd} CISC6930
Self-joining • Lk is generated by joining Lk-1 with itself • Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. • l1 from Lk-1 = {l1[1], l1[2],…., l1[k-2], l1[k-1] } • l2 from Lk-1 = {l2[1], l2[2],…., l2[k-2], l2[k-1] } • Only when first k-2 items of l1 and l2 are in common, and l1[k-1] < l2[k-1], these two itemsets are joinable CISC6930
Further Improvement of the Apriori Method • Major computational challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink number of candidates • Facilitate support counting of candidates • Completeness: any association rule mining algorithm should get the same set of frequent itemsets. CISC6930
Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Scan 1: partition database and find local frequent patterns • Since the sub-database is relatively smaller than original database, all frequent itemsets are tested in one scan. • Scan 2: consolidate global frequent patterns • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95 CISC6930
DHP: Reduce the Number of Candidates • A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Candidates: a, b, c, d, e • Frequent 1-itemset: a, b, d, e • Using some hash function, get hash entries: {ab, ad, ae} with bucket count as 3, {bd, bd, be, de} with bucket count as 4… • ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95 CISC6930
Transaction Reduction • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. • Such a transaction could be marked or removed from further consideration of (k+1)-itemsets. • Less support counting CISC6930
Sampling for Frequent Patterns • Select a sample of original database, mine frequent patterns within sample using Apriori with a lower support count threshold than min. support. • Scan database once to verify frequent itemsets found in sample, only bordersof closure of frequent patterns are checked • Example: check abcd instead of ab, ac, …, etc. • Scan database again to find missed frequent patterns • H. Toivonen. Sampling large databases for association rules. In VLDB’96 CISC6930
Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications CISC6930
Closed Patterns and Max-Patterns • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! • Solution: Mine closed patterns and max-patterns instead • An itemset Xis closed frequent if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) • Closed pattern is a lossless compression of freq. patterns • Reducing the # of patterns and rules • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) CISC6930
Closed Patterns and Max-Patterns • Exercise. DB = {<a1, …, a100>, < a1, …, a50>} • Min_sup = 1. • What is the set of closed itemset? • <a1, …, a100>: 1 • < a1, …, a50>: 2 • What is the set of max-pattern? • <a1, …, a100>: 1 • What is the set of all patterns? • !! • Max-pattern loses the information of actual support counts. CISC6930
MaxMiner: Mining Max-patterns • 1st scan: find frequent items • A, B, C, D, E • 2nd scan: find support for • AB, AC, AD, AE, ABCDE • BC, BD, BE, BCDE • CD, CE, CDE, DE, • Max-Miner uses pruning based on subset infrequency, as does Apriori, but it also uses pruning based on superset frequency. • Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan • R. Bayardo. Efficiently mining long patterns from databases. SIGMOD’98 Potential max-patterns CISC6930
Complete set-enumeration tree { } 3 1 2 1,2 1,3 2,3 1,2,3 CISC6930
Mining Closed Frequent Itemsets (CFI) • Strategies • Item merging: every transaction has X and Y but no proper superset of Y, then X⋃Y is a CFI. There is no need to find CFI having X but no Y. • Sub-itemset pruning: if X is a proper subset of CFI Y, and support(X)=support(Y), then X and all X’s descendants could be pruned. • Perform superset-checking and subset-checking CISC6930
Database: Min_sup = 2, Min_conf = 50% Using Apriori algorithm, 20 frequent itemsets are found, out of them {acdf}(2), {cef}(3), {ae}(2), {cf}(4) {a}(3), {e}(4) J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00. Mining Frequent Closed Patterns: CLOSET Min_sup=2 CISC6930
Flist: list of all frequent items in support ascending order Flist: d(2)-a(3)-f(4)-e(4)-c(4) Divide search space Patterns having d(2) {10:cefa}, {40:cfa} Every transaction having d(2) also has cfa(2) {cfad}(2) is a frequent closed pattern Patterns having a(3) but no d, {10:cef},{20:e},{40:cf} {a}(3) is a frequent closed pattern since none of above has a. Find frequent closed pattern recursively: {ea}(2) Patterns having f but no a, d, {10,30,50:ce}, {40:c} {cf}(4) is a frequent closed pattern since all above have it. {cef}(3) is also a frequent closed pattern. Patterns having e but not a, d, f, {10,30,50:c} {e}(4) is a frequent closed patterns. Patterns having only c. Mining Frequent Closed Patterns: CLOSET Min_sup=2 CISC6930
Data Format • Horizontal data format • TID-itemset {TID: itemset} • Vertical data format • Item-TID-set {item: TID_set} • Mining could be performed by intersecting the TID_set of every pair of frequent items. CISC6930