Technologies for Mining Frequent Patterns in Large Databases Jiawei Han Intelligent Database Systems Research Lab. Simon Fraser University, Canada http://www.cs.sfu.ca/~han
Tutorial Outline • What is frequent pattern mining? • Frequent pattern mining algorithms • Apriori and its variations • A multi-dimensional view of frequent pattern mining • Constraint-based frequent pattern mining • Recent progress on efficient mining methods • Mining frequent patterns without candidate generation • CLOSET: Efficient mining of frequent closed itemsets • FreeSpan: Towards efficient sequential pattern mining
Part I: What Is Frequent Pattern Mining? • What is a frequent pattern? • Why frequent pattern mining? • Challenges in frequent pattern mining
What Is Frequent Pattern Mining? • What is a frequent pattern? • A pattern (a set of items, a sequence, etc.) whose elements occur together frequently in a database [AIS93] • Frequent patterns: an important form of regularity • What products are often purchased together? — beer and diapers! • What are the consequences of a hurricane? • What is the next target after buying a PC?
Application Examples • Market basket analysis • Maintenance agreements: what should the store do to boost maintenance agreement sales? • Home electronics: what other products should the store stock up on if it has a sale on home electronics? • Attached mailing in direct marketing • Detecting "ping-pong"ing of patients — transaction: a patient; item: a doctor/clinic visited by the patient; support of a rule: the number of common patients
Frequent Pattern Mining — A Cornerstone in Data Mining • Association analysis • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis • Correlation or causality analysis • Clustering • Classification • Association-based classification analysis • Sequential pattern analysis • Web log sequences, DNA analysis, etc. • Partial periodicity, cyclic/temporal associations
Association Rule Mining • Given • A database of customer transactions • Each transaction is a list of items (purchased by a customer in one visit) • Find all rules that correlate the presence of one set of items with that of another set of items • Example: 98% of people who purchase tires and auto accessories also get automotive services done • Any number of items may appear in the antecedent/consequent of a rule • Possible to specify constraints on rules (e.g., find only rules involving home laundry appliances)
Basic Concepts • Rule form: "A → B [support s, confidence c]" • Support: usefulness of discovered rules • Confidence: certainty of the detected association • Rules that satisfy both min_sup and min_conf are called strong • Examples: • buys(x, "diapers") → buys(x, "beers") [0.5%, 60%] • age(x, "30-34") ∧ income(x, "42K-48K") → buys(x, "high-resolution TV") [2%, 60%] • major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
Rule Measures: Support and Confidence • Find all rules X ∧ Y → Z with minimum support and confidence • support, s: probability that a transaction contains {X, Y, Z} • confidence, c: conditional probability that a transaction containing {X, Y} also contains Z • (Figure: Venn diagram — customers who buy beer, customers who buy diapers, and customers who buy both) • With minimum support 50% and minimum confidence 50%, we have • A → C (50%, 66.6%) • C → A (50%, 100%)
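To make the two measures concrete, here is a minimal Python sketch. The four-transaction database is hypothetical, chosen only so that it reproduces the A → C and C → A numbers quoted above:

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"A", "C"}))        # 0.5   -> both rules have 50% support
print(confidence({"A"}, {"C"}))   # 0.666 -> A → C (50%, 66.6%)
print(confidence({"C"}, {"A"}))   # 1.0   -> C → A (50%, 100%)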
Part II: Frequent Pattern Mining Methods — Apriori and Its Variations • The Apriori algorithm • Improvements of Apriori • Incremental, parallel, and distributed methods • Different measures in association mining
An Influential Mining Methodology — The Apriori Algorithm • The Apriori method: • Proposed by Agrawal & Srikant 1994 • A similar level-wise algorithm by Mannila et al. 1994 • Major idea: • Any subset of a frequent itemset must be frequent • E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any subset is infrequent, its supersets cannot be frequent! • A powerful, scalable candidate-set pruning technique: • It reduces the number of candidate k-itemsets dramatically (for k > 2)
Mining Association Rules — Example • Min. support 50%, min. confidence 50% • For rule A → C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6% • The Apriori principle: any subset of a frequent itemset must be frequent
Procedure of Mining Association Rules • Find the frequent itemsets: the sets of items that have minimum support (Apriori) • A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets) • Use the frequent itemsets to generate association rules
The Apriori Algorithm • Join step: Ck is generated by joining Lk-1 with itself • Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence should be removed • (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)
Apriori — Pseudocode • Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
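The same loop as a minimal runnable Python sketch. Transactions are assumed to be Python sets, and min_count is an absolute support threshold; gen_candidates is the join-and-prune helper sketched after the candidate-generation slide below:

def apriori(transactions, min_count):
    """Level-wise Apriori sketch. Returns {frozenset(itemset): support count}."""
    counts = {}
    for t in transactions:                      # first scan: count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)
    while L:
        C = gen_candidates(set(L))              # join + prune (sketched below)
        counts = {c: 0 for c in C}
        for t in transactions:                  # one database scan per level
            for c in C:
                if c <= t:
                    counts[c] += 1
        L = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(L)
    return frequent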
The Apriori Algorithm — Example • (Figure: scan database D to count C1 and prune to L1; join L1 to form C2, scan D to count, prune to L2; join L2 to form C3, scan D, prune to L3)
How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates? • Why is counting the supports of candidates a problem? • The total number of candidates can be huge • One transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash tree • A leaf node of the hash tree contains a list of itemsets and counts • An interior node contains a hash table • Subset function: finds all the candidates contained in a transaction
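What the subset function computes can be sketched with an ordinary hash-table probe standing in for the hash tree: enumerate the transaction's k-subsets and look each one up among the candidates. A real hash tree serves the same purpose while pruning most of this enumeration for long transactions:

from itertools import combinations

def subsets_contained(transaction, candidates, k):
    """Return the candidate k-itemsets contained in one transaction.
    candidates: a set (or dict) of frozensets; a plain dict/set probe
    here is a simplified stand-in for the hash tree."""
    found = []
    for combo in combinations(sorted(transaction), k):
        key = frozenset(combo)
        if key in candidates:
            found.append(key)
    return found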
Example of Generating Candidates • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}
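The join and prune steps as runnable Python; this is the gen_candidates helper referenced in the Apriori sketch above, and it reproduces C4 = {abcd} for this example:

def gen_candidates(L_prev):
    """Apriori-gen: self-join L_{k-1} with itself, then prune any candidate
    that has an infrequent (k-1)-subset. L_prev: set of frozensets of size k-1."""
    sorted_sets = [tuple(sorted(s)) for s in L_prev]
    C = set()
    for p in sorted_sets:
        for q in sorted_sets:
            # join: first k-2 items equal, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = frozenset(p + (q[-1],))
                # prune: every (k-1)-subset of the candidate must be frequent
                if all(cand - {x} in L_prev for x in cand):
                    C.add(cand)
    return C

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(gen_candidates(L3))   # {frozenset({'a','b','c','d'})}: acde pruned (ade not in L3)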
Example: Counting Supports of Candidates • (Figure: a hash tree over candidate 3-itemsets; interior nodes hash on item mod 3, with branches 1,4,7 / 2,5,8 / 3,6,9, and leaves hold candidate lists such as 1 4 5, 1 2 4, 3 5 6; the subset function matches transaction 1 2 3 5 6 by recursively splitting it: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6)
Generating Strong Association Rules • Confidence(A → B) = Prob(B | A) = support(A ∪ B) / support(A) • Example: L3 = {2, 3, 5} • 2 ∧ 3 → 5, confidence = 2/2 = 100% • 2 ∧ 5 → 3, confidence = 2/3 = 67% • 3 ∧ 5 → 2, confidence = 2/2 = 100% • 2 → 3 ∧ 5, confidence = 2/3 = 67% • 3 → 2 ∧ 5, confidence = 2/3 = 67% • 5 → 3 ∧ 2, confidence = 2/3 = 67%
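A sketch of this rule enumeration, reusing the {itemset: count} map returned by the apriori sketch above. Every antecedent of a frequent itemset is itself frequent, so its count is always available for the confidence ratio:

from itertools import combinations

def strong_rules(frequent, min_conf):
    """Enumerate rules A → B with A ∪ B frequent, A ∩ B = ∅, and
    confidence = support(A ∪ B) / support(A) >= min_conf."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[ante]
                if conf >= min_conf:
                    rules.append((set(ante), set(itemset - ante), conf))
    return rules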
Efficient Implementation of Apriori in SQL • S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98 • Implementations based on pure SQL-92 • It is hard to get good performance out of pure SQL-based approaches alone • Making use of object-relational extensions like UDFs, BLOBs, table functions, etc. • yields orders-of-magnitude improvements
Improvements of Apriori • General ideas • Scan the transaction database in as few passes as possible • Reduce the number of candidates • Facilitate the support counting of candidates
DIC: Reduce the Number of Scans • S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97 • Basic idea • Count the itemsets at the boundary of a lattice • Push the boundary dynamically • Use a trie structure to keep track of counters, and reorder items to reduce counting costs
Example of DIC • Once all (k-1)-subsets of a k-itemset are frequent, the counting of the k-itemset can begin • No superset of an infrequent itemset needs to be counted • (Figure: itemset lattice over items A, B, C, D, from {} up to ABCD; Apriori advances the counting boundary one full level per scan — 1-itemsets, then 2-itemsets, then 3-itemsets — while DIC pushes the boundary dynamically during a scan)
DIC: Pros and Cons • Number of scans • Can be reduced in some cases • But what about non-homogeneous data and high-support settings? • Item reordering • "Item reordering did not work as well as we had hoped" • Performance • ~30% gain at the low-support end • ~30% loss at the high-support end
DHP: Reduce the Number of Candidates • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95 • Major features • Efficient generation of candidate itemsets • Effective reduction of the transaction database size
DHP: Efficient Generation of Candidates • In the k-th pass, while counting the supports of k-candidates, also hash the (k+1)-itemsets occurring in each transaction into entries of a hash table • A (k+1)-itemset in Lk * Lk is qualified as a (k+1)-candidate only if it passes the hash filtering, i.e., it is hashed into an entry whose count is no less than the support threshold • Example • Items: a, b, c, d, e • Hash entries: {ab, ad, ae}, {bd, be, de}, … • Frequent 1-itemsets: a, b, d, e • ab is not a candidate 2-itemset if the count of the hash bucket holding {ab, ad, ae} is below the support threshold
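A minimal sketch of the hash filtering for 2-itemsets. The bucket count and Python's built-in hash are illustrative choices here, not DHP's actual hash function:

from itertools import combinations

NUM_BUCKETS = 1024      # illustrative size; real DHP sizes the table to fit memory

def pass_one_with_hashing(transactions):
    """DHP pass 1: count 1-itemsets and, in the same scan, hash every
    2-itemset of each transaction into a bucket counter array."""
    item_counts, buckets = {}, [0] * NUM_BUCKETS
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % NUM_BUCKETS] += 1
    return item_counts, buckets

def passes_hash_filter(pair, buckets, min_count):
    """A 2-itemset survives only if its bucket count could reach min_count."""
    return buckets[hash(tuple(sorted(pair))) % NUM_BUCKETS] >= min_count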
DHP: Effective Reduction of Database Size • An item in transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets contained in t • Examples • Transaction acd can be discarded if only ac is frequent • Transaction bce must be kept if bc, be, and ce are frequent
Partition: Scan Database Only Twice • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95 • Mines all frequent itemsets while scanning the transaction database only twice
Scan One in Partition • Divide the database into n partitions • A globally frequent itemset must be frequent in at least one partition • Process one partition in main memory at a time; for each partition • generate the local frequent itemsets using the Apriori algorithm • also form a tidlist for each itemset to facilitate counting in the merge phase • tidlist: the IDs of all transactions that contain the itemset within a given partition
Scan Two in Partition • Merge the local frequent itemsets to generate the set of all potentially frequent itemsets • Count their actual supports • Supports can be computed from the tidlists
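Both scans as a compact sketch, reusing the apriori function from the sketch above as the local miner. The tidlist optimization is omitted for brevity, so scan two simply recounts the union of local frequent itemsets against the whole database:

def partition_mine(transactions, n, min_sup):
    """Partition sketch; min_sup is a support ratio, transactions a list of sets."""
    parts = [transactions[i::n] for i in range(n)]
    potential = set()
    for part in parts:                                   # scan 1: local mining
        local_min = max(1, int(min_sup * len(part)))
        potential |= set(apriori(part, local_min))
    counts = {s: sum(s <= t for t in transactions)       # scan 2: global counts
              for s in potential}
    global_min = min_sup * len(transactions)
    return {s: c for s, c in counts.items() if c >= global_min}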
Partition: Pros and Cons • Achieves both CPU and I/O improvements over Apriori • The number of distinct local frequent itemsets may be very large • The tidlists to be maintained can be huge
Sampling for Mining Frequent Itemsets • H. Toivonen. Sampling large databases for association rules. In VLDB'96 • Select a sample of the original database, and mine frequent itemsets within the sample using Apriori • Scan the database once to verify the frequent itemsets found in the sample; only the border of the closure of the frequent itemsets is checked • Example: check abcd instead of ab, ac, …, etc. • Scan the database again to find missed frequent itemsets
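A sketch of the sample-then-verify idea, reusing the apriori function above. The sample fraction and slack factor are illustrative, and the border check that detects itemsets missed by the sample is omitted:

import random

def sample_then_verify(transactions, min_sup, frac=0.1, slack=0.8):
    """Mine a random sample at a slightly lowered threshold (slack < 1
    reduces false misses), then verify with one scan of the full database."""
    sample = random.sample(transactions, max(1, int(frac * len(transactions))))
    local_min = max(1, int(slack * min_sup * len(sample)))
    candidates = set(apriori(sample, local_min))
    counts = {s: sum(s <= t for t in transactions) for s in candidates}
    return {s: c for s, c in counts.items()
            if c >= min_sup * len(transactions)}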
Challenges for the Sampling Method • How to sample a large database? • When the support threshold is low, sampling may not generate good enough results
Incremental Association Mining • Given: a transaction database and a set of frequent itemsets already mined • Given: a set of update transactions (insertions and deletions) for the transaction database • How to update the frequent itemsets for the updated transaction database? • (Figure: transaction database + update transactions → what are the updated frequent itemsets?)
FUP: Incremental Update of Discovered Rules • D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE'96 • View the database as: original DB ∪ incremental db • A k-itemset (for any k) is • frequent in DB ∪ db if it is frequent in both DB and db • infrequent in DB ∪ db if it is infrequent in both DB and db • For itemsets frequent only in DB, merge in the corresponding counts from db • For itemsets frequent only in db, scan DB to update their itemset counts • Similar methods can be adopted for data removal and updates, and for distributed/parallel mining
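The FUP case analysis as a sketch for insertions only, reusing the apriori function above. The level-by-level structure of the real algorithm is collapsed for brevity, and count_in_DB is a hypothetical callback standing in for the expensive scan of the original DB:

def fup_insertions(freq_DB, size_DB, db, min_sup, count_in_DB):
    """freq_DB: {itemset: count in DB}; db: list of new transactions;
    count_in_DB(s): scans the original DB for one itemset -- invoked only
    for itemsets that are newly frequent in db."""
    min_count = min_sup * (size_DB + len(db))
    updated = {}
    for s, c in freq_DB.items():            # old frequent: just add db counts
        c_new = c + sum(s <= t for t in db)
        if c_new >= min_count:
            updated[s] = c_new
    local_min = max(1, int(min_sup * len(db)))
    for s, c_db in apriori(db, local_min).items():   # new in db: one DB probe
        if s not in freq_DB:
            c_new = c_db + count_in_DB(s)
            if c_new >= min_count:
                updated[s] = c_new
    return updated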
Parallel and Distributed Association Mining • D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS'96 • M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogenous PC cluster systems. In VLDB'99 • E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In SIGMOD'97 • M. Zaki, S. Parthasarathy, and M. Ogihara. Parallel algorithms for discovery of association rules. In Data Mining and Knowledge Discovery, 1(4), 1997
Interestingness Measures • Objective measures — two popular measures: • support • confidence • Subjective measures (Silberschatz & Tuzhilin, KDD'95) — a rule (pattern) is interesting if • it is unexpected (surprising to the user), and/or • it is actionable (the user can do something with it)
Criticism of Support and Confidence • Example 1 (Aggarwal & Yu, PODS'98): • Among 5000 students • 3000 play basketball • 3750 eat cereal • 2000 both play basketball and eat cereal • play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7% • play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence
Criticism of Support and Confidence (Cont.) • Example 2: • X and Y: positively correlated • X and Z: negatively correlated • Yet the support and confidence of X → Z dominate
Other Interestingness Measures: Interest • Interest (lift): interest(A → B) = P(A ∧ B) / (P(A) × P(B)) • Takes both P(A) and P(B) into consideration • P(A ∧ B) = P(A) × P(B) if A and B are independent events • A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
Other Interestingness Measures: Conviction • Conviction: conviction(A → B) = P(A) × P(¬B) / P(A ∧ ¬B) • Derived from the implication A → B ≡ ¬(A ∧ ¬B) • Factors in both P(A) and P(B), and has value 1 when the relevant items are completely unrelated (confidence does not) • Rules that hold 100% of the time have the highest possible value, ∞ (interest does not)
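Plugging the basketball/cereal numbers from the earlier example into both measures shows how they correct confidence's blind spot:

# Numbers from the basketball/cereal example above (5000 students)
p_a  = 3000 / 5000          # P(play basketball)
p_b  = 3750 / 5000          # P(eat cereal)
p_ab = 2000 / 5000          # P(basketball ∧ cereal)

lift = p_ab / (p_a * p_b)                     # ≈ 0.89 < 1: negative correlation
conviction = p_a * (1 - p_b) / (p_a - p_ab)   # P(A)P(¬B)/P(A ∧ ¬B) = 0.75 < 1
print(lift, conviction)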
Collective Strength • Collective strength is a number between 0 and ∞, with 1 as the break-even point: C(I) = ((1 − v(I)) / (1 − E[v(I)])) × (E[v(I)] / v(I)), where v(I) is the violation ratio of itemset I • An itemset I is said to be in violation of a transaction if some of its items are present in the transaction and others are not; v(I) equals the fraction of transactions that contain a proper non-null subset of I
Collective Strength (2) • Let I be a set of items {i1, i2, …, ik}, and let pr denote the frequency of item ir in the database • The probability that itemset I occurs in a transaction: p1 × p2 × … × pk • The probability that none of the items in I occurs in a transaction: (1 − p1) × (1 − p2) × … × (1 − pk) • The expected fraction of transactions that contain at least one item of I while missing at least one item of I: E[v(I)] = 1 − p1 p2 … pk − (1 − p1)(1 − p2) … (1 − pk)
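Putting the formulas together as a small sketch; the item frequencies and observed violation ratio below are hypothetical, chosen only to illustrate the computation:

def collective_strength(p, v):
    """C(I) = ((1 - v) / (1 - E[v])) * (E[v] / v), where E[v] is computed
    from the item frequencies p under the independence assumption above."""
    prod_all = prod_none = 1.0
    for p_r in p:
        prod_all *= p_r
        prod_none *= 1.0 - p_r
    ev = 1.0 - prod_all - prod_none     # expected violation ratio
    return ((1.0 - v) / (1.0 - ev)) * (ev / v)

# Hypothetical: P(X) = P(Y) = 0.5, observed violation ratio v({X, Y}) = 0.3
print(collective_strength([0.5, 0.5], 0.3))   # ≈ 2.33 > 1: positively correlated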
Summary • Frequent pattern mining is an important data mining task • Apriori is an influential frequent pattern mining methodology • A family of Apriori-like mining methods has been developed since 1994 • Interestingness measures are important for discovering truly interesting rules