Data Mining Presented By: Kevin Seng CS632 - Data Mining
Papers Rakesh Agrawal and Ramakrishnan Srikant: • Fast algorithms for mining association rules. • Mining sequential patterns.
Outline… For each paper… • Present the problem. • Describe the algorithms. • Intuition • Design • Performance.
Market Basket Introduction • Retailers are able to collect massive amounts of sales data (basket data) • Bar-code technology • E-commerce • Sales data generally includes customer id, transaction date and items bought.
Market Basket Problem • It would be useful to find association rules between transactions. • e.g. 75% of the people who buy spaghetti also buy tomato sauce. • Given a set of basket data, how can we efficiently find the set of association rules?
Formal Definition (1) • L = {i1, i2, …, im} set of items. • Database D is a set of transactions. • Transaction T is a set of items such that T ⊆ L. • A unique identifier, TID, is associated with each transaction.
Formal Definition (2) • T contains X, a set of some items in L, if X ⊆ T. • Association rule, X ⇒ Y, where X ⊆ L, Y ⊆ L, and X ∩ Y = ∅. • Confidence – % of transactions which contain X that also contain Y. • Support – % of transactions in D which contain X ∪ Y.
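The two measures can be made concrete in a few lines of Python; a minimal sketch over toy basket data echoing the spaghetti example, not the paper's implementation:

```python
def support(itemset, transactions):
    """Fraction of transactions in D that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Toy basket data (hypothetical).
D = [frozenset(t) for t in [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "tomato sauce", "cheese"},
    {"spaghetti", "bread"},
    {"bread", "butter"},
]]
s = support({"spaghetti", "tomato sauce"}, D)        # 0.5
c = confidence({"spaghetti"}, {"tomato sauce"}, D)   # 2/3
```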
Formal Definition (3) • Given a set of transactions D, we want to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).
Problem Decomposition Two sub-problems: • Find all itemsets that have transaction support above minsup. • These itemsets are called large itemsets. • From all the large itemsets, generate the set of association rules that have confidence above minconf.
Second Sub-problem Straightforward approach: • For every large itemset l, find all non-empty subsets of l. • For every such subset a, output a rule of the form a ⇒ (l – a) if the ratio of support(l) to support(a) is at least minconf.
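The straightforward approach above can be sketched directly in Python; the `supp` dictionary mapping each itemset to its support is a hypothetical stand-in for the counts from the first sub-problem:

```python
from itertools import combinations

def gen_rules(large_itemsets, supp, minconf):
    """For each large itemset l and non-empty proper subset a, emit the
    rule a => (l - a) when support(l) / support(a) >= minconf.
    `supp` maps frozenset -> support."""
    rules = []
    for l in large_itemsets:
        for r in range(1, len(l)):
            for a in map(frozenset, combinations(l, r)):
                conf = supp[l] / supp[a]
                if conf >= minconf:
                    rules.append((set(a), set(l - a), conf))
    return rules

# Hypothetical supports for {A, B} and its subsets.
supp = {frozenset("A"): 0.6, frozenset("B"): 0.5, frozenset("AB"): 0.4}
rules = gen_rules([frozenset("AB")], supp, minconf=0.6)   # both rules pass
```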
Discovering Large Itemsets • Done with multiple passes over the data. • First pass, find all individual items that are large (have minimum support). • Subsequent passes, using large itemsets found in the previous pass: • Generate candidate itemsets. • Count support for each candidate itemset. • Eliminate itemsets that do not have min support.
Algorithm
L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
  Ck = apriori-gen(Lk-1);            // New candidates
  forall transactions t ∈ D do       // Counting support
    Ct = subset(Ck, t);              // Candidates in t
    forall candidates c ∈ Ct do
      c.count++;
  end
  Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
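The pass structure above, sketched in Python. Both simplifications here are mine, not the paper's: `apriori_gen` is passed in as a parameter, and counting checks each candidate against each transaction directly instead of using the hash-tree subset function described later:

```python
from collections import Counter

def apriori(transactions, min_count, apriori_gen):
    """Level-wise loop: pass 1 counts single items; each later pass
    generates candidates from the previous large itemsets, counts them,
    and keeps those meeting the minimum support count."""
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    L = {c for c, n in counts.items() if n >= min_count}
    answer = set(L)
    k = 2
    while L:
        Ck = apriori_gen(L, k)
        counts = Counter()
        for t in transactions:
            for c in Ck:
                if c <= t:            # candidate contained in transaction
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= min_count}
        answer |= L
        k += 1
    return answer

# Toy run with a naive join (no prune) standing in for apriori-gen.
D = [frozenset(t) for t in
     [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
naive_gen = lambda L, k: {a | b for a in L for b in L if len(a | b) == k}
large = apriori(D, min_count=3, apriori_gen=naive_gen)   # 3 singletons + 3 pairs
```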
Candidate Generation AIS and SETM algorithms: • Use the transactions in the database to generate new candidates. • But… this generates a lot of candidates which we know beforehand are not large!
Apriori Algorithms • Generate candidates using only large itemsets found in previous pass without considering the database. Intuition: • Any subset of a large itemset must be large.
Apriori Candidate Generation • Takes in Lk-1 and returns Ck. Two steps: • Join large itemsets Lk-1 with Lk-1. • Prune out all itemsets in the joined result which contain a (k-1)-subset not found in Lk-1.
Candidate Generation (Join)
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
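A sketch of apriori-gen in Python, using frozensets rather than the sorted-tuple relational form above; the union-based join is order-free, so the prune step does all the filtering. The five 3-itemsets are the paper's running example:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join: union two (k-1)-itemsets that agree on k-2 items.
    Prune: drop any candidate with a (k-1)-subset missing from L_prev."""
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# The paper's running example: L3 = {abc, abd, acd, ace, bcd}.
L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = apriori_gen(L3, 4)   # only {a, b, c, d} survives the prune
```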
Candidate Gen. (Example) [figure: join and prune steps]
Counting Support • Need to count the number of transactions which support a given itemset. • For efficiency, use a hash-tree. • Subset Function
Subset Function (Hash-tree) • Candidate itemsets are stored in a hash-tree. • Leaf node – contains a list of itemsets. • Interior node – contains a hash table. • Each bucket of the hash table points to another node. • Root is at depth 1. • Interior nodes at depth d point to nodes at depth d+1.
Hash-tree Example (1) [figure: hash-tree holding the candidate itemsets {2 3 4}, {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}]
Using the hash-tree • If we are at a leaf – find all itemsets contained in the transaction. • If we are at an interior node – hash on each remaining element in the transaction. • Root node – hash on all elements in the transaction.
Hash-tree Example (2) [figure: traversing the hash-tree for a transaction]
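A compact Python sketch of the hash-tree and its subset function. The leaf capacity, bucket count, and 0-based depth (the slide counts the root as depth 1) are my choices for the sketch; a leaf can be reached along several hash paths, so the caller should collect results into a set:

```python
class HashTreeNode:
    """Leaf nodes hold candidate itemsets (sorted tuples); interior nodes
    hash the item at their depth to pick a child bucket."""
    MAX_LEAF, BUCKETS = 2, 4

    def __init__(self, depth=0):
        self.depth = depth
        self.itemsets = []     # used while this node is a leaf
        self.children = {}     # bucket -> child node, once interior

    def insert(self, itemset):
        if self.children:
            self._child(itemset).insert(itemset)
        else:
            self.itemsets.append(itemset)
            if len(self.itemsets) > self.MAX_LEAF and self.depth < len(itemset):
                for s in self.itemsets:          # overflow: become interior
                    self._child(s).insert(s)
                self.itemsets = []

    def _child(self, itemset):
        bucket = hash(itemset[self.depth]) % self.BUCKETS
        if bucket not in self.children:
            self.children[bucket] = HashTreeNode(self.depth + 1)
        return self.children[bucket]

    def subsets_in(self, transaction, start=0, _full=None):
        """Yield stored itemsets contained in the (sorted) transaction:
        at a leaf, test containment; at an interior node, hash on each
        remaining element and recurse."""
        full = set(transaction) if _full is None else _full
        if not self.children:
            for s in self.itemsets:
                if set(s) <= full:
                    yield s
        else:
            for i in range(start, len(transaction)):
                bucket = hash(transaction[i]) % self.BUCKETS
                if bucket in self.children:
                    yield from self.children[bucket].subsets_in(
                        transaction, i + 1, full)

# Candidates from the slide's example, queried with transaction (1 2 3 4).
tree = HashTreeNode()
for c in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]:
    tree.insert(c)
found = set(tree.subsets_in((1, 2, 3, 4)))
```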
AprioriTid (1) • Does not use the transactions in the database for counting itemset support. • Instead stores transactions as sets of possible large itemsets, Ck. • Each member of Ck is of the form <TID, {Xk}>, where Xk is a possible large itemset.
AprioriTid (2) Advantage of Ck • If a transaction does not contain any candidate k-itemset then it will have no entry in Ck. • Number of entries in Ck may be less than the number of transactions in D. • Especially for large k. • Speeds up counting!
AprioriTid (3) However… • For small k, each entry in Ck may be larger than its corresponding transaction. • The usual space vs. time trade-off.
AprioriTid (4) Example [figure]
Observation • When Ck does not fit in main memory we see a large jump in execution time. • AprioriTid beats Apriori only when Ck can fit in main memory.
AprioriHybrid • It is not necessary to use the same algorithm for all the passes. • Combine the two algorithms! • Start with Apriori. • When Ck can fit in main memory, switch to AprioriTid.
Performance (1) • Measured performance by running the algorithms on generated synthetic data. • Used the following parameters:
Performance (2) [figure]
Performance (3) [figure]
Mining Sequential Patterns (1) • Sequential patterns are ordered lists of itemsets. • Market basket example: • Customers typically rent “Star Wars”, then “Empire Strikes Back”, then “Return of the Jedi”. • “Fitted sheets and pillow cases”, then “comforter”, then “drapes and ruffles”.
Mining Sequential Patterns (2) • Looks at sequences of transactions as opposed to a single transaction. • Groups transactions based on customer ID. • Customer sequence.
Formal Definition (1) • Given a database D of customer transactions. • Each transaction consists of: customer id, transaction-time, items purchased. • No customer has more than one transaction with the same transaction-time.
Formal Definition (2) • Itemset i: (i1 i2 … im), where each ij is an item. • Sequence s: ⟨s1 s2 … sn⟩, where each sj is an itemset. • Sequence ⟨a1 a2 … an⟩ is contained in ⟨b1 b2 … bm⟩ if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin. • A sequence s is maximal if it is not contained in any other sequence.
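The containment test can be implemented greedily, matching each itemset of a against the earliest later itemset of b that contains it; the examples are from the paper:

```python
def contains(a, b):
    """True if sequence `a` (list of sets) is contained in sequence `b`:
    each itemset of `a` must be a subset of an itemset of `b`, at
    strictly increasing positions."""
    j = 0
    for ai in a:
        while j < len(b) and not set(ai) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

# <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>.
ok = contains([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])   # True
# <(3) (5)> is not contained in <(3 5)>: the order needs two itemsets.
no = contains([{3}, {5}], [{3, 5}])                                     # False
```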
Formal Definition (3) • A customer supports a sequence s if s is contained in the customer sequence for this customer. • Support of a sequence – % of customers who support the sequence. • For mining association rules, support was % of transactions.
Formal Definition (4) • Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support. • Sequences that have support above minsup are large sequences.
Algorithm: Sort Phase • Customer ID – Major key • Transaction-time – Minor key Converts the original transaction database into a database of customer sequences.
Algorithm: Litemset Phase (1) Litemset Phase: • Find all large itemsets. Why? • Because each itemset in a large sequence has to be a large itemset.
Algorithm: Litemset Phase (2) • To get all large itemsets we can use the Apriori algorithms discussed earlier. • Need to modify support counting. • For sequential patterns, support is measured by the fraction of customers.
Algorithm: Litemset Phase (3) • Each large itemset is then mapped to a set of contiguous integers. • Used to compare two large itemsets in constant time.
Algorithm: Transformation (1) • Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. • Represent transactions as sets of large itemsets. • Customer sequence now becomes a list of sets of itemsets.
Algorithm: Transformation (2) [figure: transformation example]
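A sketch of the transformation step, using the litemset-to-integer mapping from the previous slide; transactions that contain no large itemset are dropped (the assignment of ids by sorted order is my choice for the sketch):

```python
def transform(customer_seq, large_itemsets):
    """customer_seq: transactions (frozensets) in time order.
    Each transaction becomes the set of integer ids of the large
    itemsets it contains; transactions containing none are dropped."""
    ids = {l: n for n, l in
           enumerate(sorted(large_itemsets, key=lambda s: tuple(sorted(s))))}
    transformed = []
    for t in customer_seq:
        contained = {ids[l] for l in large_itemsets if l <= t}
        if contained:
            transformed.append(contained)
    return transformed

# Hypothetical litemsets, mapped to ids: {1} -> 0, {1 2} -> 1, {2} -> 2.
large = {frozenset({1}), frozenset({2}), frozenset({1, 2})}
seq = [frozenset({1, 2, 3}), frozenset({3}), frozenset({2})]
out = transform(seq, large)   # [{0, 1, 2}, {2}]
```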
Algorithm: Sequence Phase (1) • Use the set of large itemsets to find the desired sequences. • Similar structure to Apriori algorithms used to find large itemsets. • Use seed set to generate candidate sequences. • Count support for each candidate. • Eliminate candidate sequences which are not large.
Algorithm: Sequence Phase (2) Two types of algorithms: • Count-all: counts all large sequences, including non-maximal sequences. • AprioriAll • Count-some: try to avoid counting non-maximal sequences by counting longer sequences first. • AprioriSome • DynamicSome
Algorithm: Maximal Phase (1) • Find the maximal sequences among the set of large sequences. • Let S be the set of all large sequences.
Algorithm: Maximal Phase (2) • Use hash-tree to find all subsequences of sk in S. • Similar to subset function used in finding large itemsets. • S is stored in hash-tree.
AprioriAll (1) [figure: AprioriAll pseudocode]
AprioriAll (2) • Hash-tree is used for counting. • Candidate generation is similar to candidate generation in finding large itemsets. • Except that order matters, and therefore we don’t have the condition: p.itemk-1 < q.itemk-1
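A sketch of the sequence join and prune in Python, over tuples of litemset ids. Dropping the ordering condition means each matching pair joins in both orders, since ⟨1 2⟩ and ⟨2 1⟩ are different sequences; the input below is the paper's example:

```python
def seq_candidates(L_prev):
    """Join two large (k-1)-sequences that share their first k-2
    elements, appending the second's last element; then prune candidates
    with a (k-1)-subsequence (delete one element) not in L_prev."""
    joined = {p + (q[-1],) for p in L_prev for q in L_prev
              if p != q and p[:-1] == q[:-1]}
    def subseqs(c):   # the k subsequences obtained by deleting one element
        return [c[:i] + c[i + 1:] for i in range(len(c))]
    return {c for c in joined if all(s in L_prev for s in subseqs(c))}

# The paper's example: the join yields <1 2 3 4>, <1 2 4 3>, <1 3 4 5>,
# and <1 3 5 4>; only <1 2 3 4> survives the prune.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = seq_candidates(L3)
```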
AprioriAll (3) Example of candidate generation: [figure]