Chapter 5 Frequent Patterns and Association Rule Mining

Chapter 5Frequent Patterns and Association Rule Mining CISC6930

Outline • Frequent Itemsets and Association Rule • APRIORI • Efficient Mining • Post-processing • Applications CISC6930

Transactional Data • Definitions: • An item: an article in a basket, or an attribute-value pair • A transaction: set of items purchased in a basket; it may have TID (transaction ID) • A transactionaldataset: A set of transactions CISC6930

Itemsets & Frequent Itemsets • itemset: A set of one or more items • k-itemset X = {x1, …, xk} • (absolute) support, or, support count of X: Frequency or occurrence of an itemset X • (relative)support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X’s support is no less than a minsup threshold CISC6930

Association Rule • Find all the rules X  Ywith minimum support and confidence • support, s, probability that a transaction contains X  Y • s = P(X  Y) = support count (X  Y) / number of all transactions • confidence, c,conditional probability that a transaction having X also contains Y • c = P(X|Y) = support count (X  Y) / support count (X) CISC6930

Customer buys both Customer buys diaper Customer buys beer Example • Let minsup = 50%, minconf = 50% • Number of all transactions = 5 & Min. support count = 5*50%=2.5 ⇒ 3 • Items: Beer, Nuts, Diaper, Coffee, Eggs, Milk • Freq. Pat.: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3 • Association rules (support, confidence): • Beer  Diaper (60%, 100%) • Diaper  Beer (60%, 75%) CISC6930

Use of Association Rules • Association rules do not necessarily represent causality or correlation between the two itemsets. • X  Y does not mean X causes Y, noCausality • X  Y can be different from Y  X, unlike correlation • Association rules assist in Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. • What products were often purchased together?— Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? CISC6930

Computational Complexity of Frequent Itemset Mining • How many itemsets are potentially to be generated in the worst case? • The number of frequent itemsets to be generated is senstive to the minsup threshold • When minsup is low, there exist potentially an exponential number of frequent itemsets • The worst case is close to MN where M: # distinct items, and N: max length of transactions when M is large. • The worst case complexty vs. the expected probability • Ex. Suppose Walmart has 104 kinds of products • The chance to pick up one product 10-4 • The chance to pick up a particular set of 10 products: ~10-40 • What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions? CISC6930

Association Rule Mining • Major steps in association rule mining • Frequent itemsets computation • Rule derivation • Use of support and confidence to measure strength CISC6930

The Downward Closure Property • The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} CISC6930

ABC ABD ACD BCD AB AC AD BC BD CD A B C D Apriori: A Candidate Generation & Test Approach • A frequent (used to be called large) itemset is an itemset whose support (S) is ≥ minSup. • Apriori pruning principle: • If there is any itemset which is infrequent, its superset should not be generated/tested! CISC6930

APRIORI • Method: • Initially, scan DB once to get frequent 1-itemset • Generatelength (k+1)candidateitemsets from length kfrequentitemsets • Test the candidates against DB • Terminate when no frequent or candidate set can be generated CISC6930

The Apriori Algorithm—An Example Supmin = 2 Database TDB L1 C1 1st scan C2 C2 L2 2nd scan L3 C3 3rd scan CISC6930

The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for(k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end returnkLk; CISC6930

Implementation of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd} CISC6930

Self-joining • Lk is generated by joining Lk-1 with itself • Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. • l1 from Lk-1 = {l1[1], l1[2],…., l1[k-2], l1[k-1] } • l2 from Lk-1 = {l2[1], l2[2],…., l2[k-2], l2[k-1] } • Only when first k-2 items of l1 and l2 are in common, and l1[k-1] < l2[k-1], these two itemsets are joinable CISC6930

Further Improvement of the Apriori Method • Major computational challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink number of candidates • Facilitate support counting of candidates • Completeness: any association rule mining algorithm should get the same set of frequent itemsets. CISC6930

Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Scan 1: partition database and find local frequent patterns • Since the sub-database is relatively smaller than original database, all frequent itemsets are tested in one scan. • Scan 2: consolidate global frequent patterns • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95 CISC6930

DHP: Reduce the Number of Candidates • A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Candidates: a, b, c, d, e • Frequent 1-itemset: a, b, d, e • Using some hash function, get hash entries: {ab, ad, ae} with bucket count as 3, {bd, bd, be, de} with bucket count as 4… • ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95 CISC6930

Transaction Reduction • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. • Such a transaction could be marked or removed from further consideration of (k+1)-itemsets. • Less support counting CISC6930

Sampling for Frequent Patterns • Select a sample of original database, mine frequent patterns within sample using Apriori with a lower support count threshold than min. support. • Scan database once to verify frequent itemsets found in sample, only bordersof closure of frequent patterns are checked • Example: check abcd instead of ab, ac, …, etc. • Scan database again to find missed frequent patterns • H. Toivonen. Sampling large databases for association rules. In VLDB’96 CISC6930

Closed Patterns and Max-Patterns • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! • Solution: Mine closed patterns and max-patterns instead • An itemset Xis closed frequent if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) • Closed pattern is a lossless compression of freq. patterns • Reducing the # of patterns and rules • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) CISC6930

Closed Patterns and Max-Patterns • Exercise. DB = {<a1, …, a100>, < a1, …, a50>} • Min_sup = 1. • What is the set of closed itemset? • <a1, …, a100>: 1 • < a1, …, a50>: 2 • What is the set of max-pattern? • <a1, …, a100>: 1 • What is the set of all patterns? • !! • Max-pattern loses the information of actual support counts. CISC6930

MaxMiner: Mining Max-patterns • 1st scan: find frequent items • A, B, C, D, E • 2nd scan: find support for • AB, AC, AD, AE, ABCDE • BC, BD, BE, BCDE • CD, CE, CDE, DE, • Max-Miner uses pruning based on subset infrequency, as does Apriori, but it also uses pruning based on superset frequency. • Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan • R. Bayardo. Efficiently mining long patterns from databases. SIGMOD’98 Potential max-patterns CISC6930

Complete set-enumeration tree { } 3 1 2 1,2 1,3 2,3 1,2,3 CISC6930

Mining Closed Frequent Itemsets (CFI) • Strategies • Item merging: every transaction has X and Y but no proper superset of Y, then X⋃Y is a CFI. There is no need to find CFI having X but no Y. • Sub-itemset pruning: if X is a proper subset of CFI Y, and support(X)=support(Y), then X and all X’s descendants could be pruned. • Perform superset-checking and subset-checking CISC6930

Database: Min_sup = 2, Min_conf = 50% Using Apriori algorithm, 20 frequent itemsets are found, out of them {acdf}(2), {cef}(3), {ae}(2), {cf}(4) {a}(3), {e}(4) J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00. Mining Frequent Closed Patterns: CLOSET Min_sup=2 CISC6930

Flist: list of all frequent items in support ascending order Flist: d(2)-a(3)-f(4)-e(4)-c(4) Divide search space Patterns having d(2) {10:cefa}, {40:cfa} Every transaction having d(2) also has cfa(2)  {cfad}(2) is a frequent closed pattern Patterns having a(3) but no d, {10:cef},{20:e},{40:cf} {a}(3) is a frequent closed pattern since none of above has a. Find frequent closed pattern recursively: {ea}(2) Patterns having f but no a, d, {10,30,50:ce}, {40:c} {cf}(4) is a frequent closed pattern since all above have it. {cef}(3) is also a frequent closed pattern. Patterns having e but not a, d, f, {10,30,50:c} {e}(4) is a frequent closed patterns. Patterns having only c. Mining Frequent Closed Patterns: CLOSET Min_sup=2 CISC6930

Data Format • Horizontal data format • TID-itemset {TID: itemset} • Vertical data format • Item-TID-set {item: TID_set} • Mining could be performed by intersecting the TID_set of every pair of frequent items. CISC6930

Chapter 5 Frequent Patterns and Association Rule Mining

Chapter 5 Frequent Patterns and Association Rule Mining

Presentation Transcript

Association Rule Mining Constraint Based Association Rule Mining

Association Rule Mining

Mining Frequent Patterns and Association Rules

Association Rule Mining

Mining Frequent Patterns

Data Mining: Concepts and Techniques — Chapter 5 — Mining Frequent Patterns

Association Rule Mining

Mining Frequent Patterns, Association, and Correlations Pertemuan 05

Association Rule Mining

Mining Frequent Patterns, Association, and Correlations (cont.) Pertemuan 06

Association Rule Mining

Association Rule Mining

Chapter 5: Mining Frequent Patterns, Association and Correlations

Chapter 5: Mining Frequent Patterns, Association and Correlations

Association Rule Mining

Association Rule Mining

Association Rule Mining

Association rule mining

Association Rule Mining

Chapter 5: Mining Frequent Patterns, Association and Correlations

Chapter 5: Mining Frequent Patterns, Association and Correlations