
Data Mining

Data Mining. Presented By: Kevin Seng. Papers. Rakesh Agrawal and Ramakrishnan Srikant: Fast algorithms for mining association rules. Mining sequential patterns. Outline…. For each paper… Present the problem. Describe the algorithms. Intuition Design Performance.


Presentation Transcript


  1. Data Mining Presented By: Kevin Seng CS632 - Data Mining

  2. Papers Rakesh Agrawal and Ramakrishnan Srikant: • Fast algorithms for mining association rules. • Mining sequential patterns.

  3. Outline… For each paper… • Present the problem. • Describe the algorithms. • Intuition • Design • Performance.

  4. Market Basket Introduction • Retailers are able to collect massive amounts of sales data (basket data) • Bar-code technology • E-commerce • Sales data generally includes customer id, transaction date and items bought.

  5. Market Basket Problem • It would be useful to find association rules between sets of items in transactions. • e.g. 75% of the people who buy spaghetti also buy tomato sauce. • Given a set of basket data, how can we efficiently find the set of association rules?

  6. Formal Definition (1) • L = {i1, i2, …, im} is a set of items. • Database D is a set of transactions. • Transaction T is a set of items such that T ⊆ L. • A unique identifier, TID, is associated with each transaction.

  7. Formal Definition (2) • T contains X, a set of some items in L, if X ⊆ T. • Association rule: X ⇒ Y, where X ⊂ L, Y ⊂ L, X ∩ Y = ∅. • Confidence – % of transactions containing X that also contain Y. • Support – % of transactions in D that contain X ∪ Y.
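The two measures can be sketched directly from these definitions. This is a minimal illustration, not code from the papers: the function names `support` and `confidence` and the basket data are made up for the example, and both measures are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


# Made-up basket data, for illustration only.
baskets = [
    frozenset({"spaghetti", "tomato sauce", "bread"}),
    frozenset({"spaghetti", "tomato sauce"}),
    frozenset({"spaghetti", "cheese"}),
    frozenset({"bread", "milk"}),
]

print(support({"spaghetti", "tomato sauce"}, baskets))       # 0.5
print(confidence({"spaghetti"}, {"tomato sauce"}, baskets))  # about 0.667
```

Here the rule spaghetti ⇒ tomato sauce holds in 2 of 4 baskets (support) and in 2 of the 3 baskets that contain spaghetti (confidence).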

  8. Formal Definition (3) • Given a set of transactions D, we want to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).

  9. Problem Decomposition Two sub-problems: • Find all itemsets that have transaction support above minsup. • These itemsets are called large itemsets. • From all the large itemsets, generate the set of association rules that have confidence above minconf.

  10. Second Sub-problem Straightforward approach: • For every large itemset l, find all non-empty subsets of l. • For every such subset a, output a rule of the form a ⇒ (l – a) if the ratio of support(l) to support(a) is at least minconf.
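The straightforward approach can be sketched as follows. The function name `generate_rules` and the support values are hypothetical; the sketch assumes the input dictionary holds the support of every large itemset (so every subset of a large itemset can be looked up), and it only considers proper subsets so that the right-hand side is never empty.

```python
from itertools import combinations


def generate_rules(large_itemsets, minconf):
    """Return (a, l - a, confidence) for every rule that meets minconf.

    large_itemsets maps each large itemset (a frozenset) to its support.
    """
    rules = []
    for l, support_l in large_itemsets.items():
        for size in range(1, len(l)):  # non-empty proper subsets of l
            for a in combinations(sorted(l), size):
                a = frozenset(a)
                conf = support_l / large_itemsets[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules


# Hypothetical supports, for illustration only.
supports = {
    frozenset({"A"}): 0.75,
    frozenset({"B"}): 0.5,
    frozenset({"A", "B"}): 0.5,
}
for a, rest, conf in generate_rules(supports, minconf=0.7):
    print(set(a), "=>", set(rest), conf)  # {'B'} => {'A'} 1.0
```

With these numbers A ⇒ B is rejected (0.5 / 0.75 is below 0.7) while B ⇒ A is kept.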

  11. Discovering Large Itemsets • Done with multiple passes over the data. • First pass: find all individual items that are large (have minimum support). • Each subsequent pass, using the large itemsets found in the previous pass: • Generate candidate itemsets. • Count support for each candidate itemset. • Eliminate itemsets that do not have minimum support.

  12. Algorithm
      L1 = {large 1-itemsets};
      for (k = 2; Lk-1 ≠ ∅; k++) do begin
          Ck = apriori-gen(Lk-1);            // New candidates
          forall transactions t ∈ D do begin // Counting support
              Ct = subset(Ck, t);            // Candidates in t
              forall candidates c ∈ Ct do
                  c.count++;
          end
          Lk = {c ∈ Ck | c.count ≥ minsup}
      end
      Answer = ∪k Lk;
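A compact Python sketch of this loop is given below. It is an illustration under stated assumptions, not the authors' implementation: itemsets are represented as sorted tuples, `minsup_count` is an absolute transaction count rather than a percentage, the subset function is replaced by a naive scan over all candidates, and the four-transaction database is sample data.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then drop candidates with a non-large (k-1)-subset."""
    prev = sorted(prev_large)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:  # same first k-2 items; sorting gives p[-1] < q[-1]
                c = p + (q[-1],)
                if all(s in prev_large for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup_count):
    """Map every large itemset (a sorted tuple) to its support count."""
    counts = {}
    for t in transactions:  # first pass: count individual items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:  # naive stand-in for subset(Ck, t)
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
        k += 1
    return answer


db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, minsup_count=2)
print(result[(2, 3, 5)])  # 2
```

On this sample the passes find four large 1-itemsets, four large 2-itemsets and one large 3-itemset, after which no candidates remain and the loop stops.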

  13. Candidate Generation AIS and SETM algorithms: • Use the transactions in the database to generate new candidates. • But… this generates a lot of candidates which we know beforehand are not large!

  14. Apriori Algorithms • Generate candidates using only large itemsets found in previous pass without considering the database. Intuition: • Any subset of a large itemset must be large.

  15. Apriori Candidate Generation • Takes in Lk-1 and returns Ck. Two steps: • Join large itemsets Lk-1 with Lk-1. • Prune out all itemsets in the joined result which contain a (k-1)-subset not found in Lk-1.

  16. Candidate Generation (Join)
      insert into Ck
      select p.item1, p.item2, …, p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  17. Candidate Gen. (Example) [Figure: join step, then prune step]
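The join and prune steps can be sketched separately, mirroring the SQL-style join. This is an illustrative sketch: the function names `join` and `prune` are made up, itemsets are assumed to be sorted tuples, and the five 3-itemsets are sample input rather than a reconstruction of the slide's figure.

```python
from itertools import combinations


def join(prev_large):
    """Pair (k-1)-itemsets that agree on their first k-2 items."""
    joined = []
    for p in prev_large:
        for q in prev_large:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    return joined


def prune(joined, prev_large):
    """Keep only candidates whose (k-1)-subsets are all in L(k-1)."""
    prev = set(prev_large)
    return [
        c for c in joined
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]


l3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
c4 = join(l3)
print(c4)             # [(1, 2, 3, 4), (1, 3, 4, 5)]
print(prune(c4, l3))  # [(1, 2, 3, 4)]
```

The candidate (1, 3, 4, 5) is pruned because its subset (1, 4, 5) is not among the large 3-itemsets, so it cannot be large.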

  18. Counting Support • Need to count the number of transactions which support a given itemset. • For efficiency, use a hash-tree. • Subset Function

  19. Subset Function (Hash-tree) • Candidate itemsets are stored in a hash-tree. • Leaf node – contains a list of itemsets. • Interior node – contains a hash table. • Each bucket of the hash table points to another node. • Root is at depth 1. • Interior nodes at depth d point to nodes at depth d+1.

  20. Hash-tree Example (1) [Figure: hash-tree annotated with depth and t=2; leaves hold the itemsets {2 3 4}, {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}]

  21. Using the hash-tree • If we are at a leaf – find all itemsets contained in transaction. • If we are at an interior node – hash on each remaining element in transaction. • Root node – hash on all elements in transaction.

  22. Hash-tree Example (2) [Figure: hash-tree whose leaves hold the itemsets {2 3 4}, {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}]
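A simplified hash-tree with a subset function might look like the sketch below. Several details are assumptions made for the example and are not taken from the slides: the class names, the modulo hash function, the bucket count, the leaf capacity that triggers a split, and the 0-based depth. Itemsets are stored as sorted tuples.

```python
class Node:
    def __init__(self):
        self.itemsets = []    # used while the node is a leaf
        self.children = None  # dict bucket -> Node once the node is interior


class HashTree:
    def __init__(self, k, buckets=3, leaf_capacity=2):
        self.k = k
        self.buckets = buckets
        self.leaf_capacity = leaf_capacity
        self.root = Node()

    def _bucket(self, item):
        return hash(item) % self.buckets

    def insert(self, itemset, node=None, depth=0):
        node = node or self.root
        if node.children is not None:  # interior: hash on the item at this depth
            bucket = self._bucket(itemset[depth])
            child = node.children.setdefault(bucket, Node())
            self.insert(itemset, child, depth + 1)
            return
        node.itemsets.append(itemset)
        # Split an overflowing leaf unless all k items were already hashed.
        if len(node.itemsets) > self.leaf_capacity and depth < self.k:
            stored, node.itemsets, node.children = node.itemsets, [], {}
            for s in stored:
                self.insert(s, node, depth)

    def subset(self, transaction):
        """Return the stored itemsets that are contained in the transaction."""
        t = tuple(sorted(transaction))
        found = set()
        self._search(self.root, t, 0, frozenset(t), found)
        return found

    def _search(self, node, t, start, items, found):
        if node.children is None:  # leaf: check each stored itemset
            found.update(c for c in node.itemsets if items.issuperset(c))
            return
        for i in range(start, len(t)):  # interior: hash each remaining item
            child = node.children.get(self._bucket(t[i]))
            if child is not None:
                self._search(child, t, i + 1, items, found)


tree = HashTree(k=3)
for c in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]:
    tree.insert(c)
print(sorted(tree.subset({1, 2, 3, 5})))  # [(1, 2, 3), (1, 3, 5)]
```

Results are collected in a set because two items of a transaction can hash to the same bucket, so the same leaf may be reached along more than one path.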

  23. AprioriTid (1) • Does not use the transactions in the database for counting itemset support. • Instead stores transactions as sets of possible large itemsets, Ck. • Each member of Ck is of the form <TID, {Xk}>, where Xk is a possible large itemset.

  24. AprioriTid (2) Advantage of Ck • If a transaction does not contain any candidate k-itemset then it will have no entry in Ck. • Number of entries in Ck may be less than the number of transactions in D. • Especially for large k. • Speeds up counting!

  25. AprioriTid (3) However… • For small k each entry in Ck may be larger than its corresponding transaction. • The usual space vs. time trade-off.

  26. AprioriTid (4) Example
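One AprioriTid pass can be sketched as below. Here `prev_entries` plays the role of the per-transaction structure the slides call Ck (the list of <TID, {Xk}> entries, as opposed to the candidate set). The function name, the tuple representation, and the sample data are assumptions for the example. The membership test used is that a candidate k-itemset is present in a transaction when the two (k-1)-itemsets obtained by dropping its last item and its second-to-last item are both recorded for that transaction.

```python
def aprioritid_pass(candidates, prev_entries):
    """Count candidate k-itemsets using the previous pass's entries only.

    candidates: iterable of sorted k-tuples.
    prev_entries: list of (tid, set of (k-1)-tuples present in that transaction).
    Returns (support counts, entries for the next pass).
    """
    counts = dict.fromkeys(candidates, 0)
    next_entries = []
    for tid, prev_sets in prev_entries:
        present = {
            c for c in counts
            if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets
        }
        for c in present:
            counts[c] += 1
        if present:  # transactions without any candidate are dropped
            next_entries.append((tid, present))
    return counts, next_entries


# Sample data: pass 1 entries list each transaction's items as 1-itemsets.
entries1 = [
    (100, {(1,), (3,), (4,)}),
    (200, {(2,), (3,), (5,)}),
    (300, {(1,), (2,), (3,), (5,)}),
    (400, {(2,), (5,)}),
]
c2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts2, entries2 = aprioritid_pass(c2, entries1)
counts3, entries3 = aprioritid_pass([(2, 3, 5)], entries2)
print(counts3)                       # {(2, 3, 5): 2}
print(len(entries2), len(entries3))  # 4 2
```

In this sample the entry list shrinks from four transactions to two by the third pass, which is the effect the slides describe for larger k. The entry for transaction 300 in the second pass holds six 2-itemsets, more than the four items of the original transaction, which illustrates the space cost for small k.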

  27. Observation • When Ck does not fit in main memory, there is a large jump in execution time. • AprioriTid beats Apriori only when Ck can fit in main memory.

  28. AprioriHybrid • It is not necessary to use the same algorithm for all the passes. • Combine the two algorithms! • Start with Apriori. • When Ck can fit in main memory, switch to AprioriTid.

  29. Performance (1) • Measured performance by running algorithms on generated synthetic data. • Used the following parameters:

  30. Performance (2)

  31. Performance (3)

  32. Mining Sequential Patterns (1) • Sequential patterns are ordered lists of itemsets. • Market basket example: • Customers typically rent “Star Wars”, then “Empire Strikes Back”, then “Return of the Jedi” • “Fitted sheets and pillow cases”, then “comforter”, then “drapes and ruffles”

  33. Mining Sequential Patterns (2) • Looks at sequences of transactions as opposed to a single transaction. • Groups transactions based on customer ID. • Customer sequence.

  34. Formal Definition (1) • Given a database D of customer transactions. • Each transaction consists of: customer id, transaction-time, items purchased. • No customer has more than one transaction with the same transaction-time.

  35. Formal Definition (2) • Itemset i = (i1 i2 … im), where ij is an item. • Sequence s = ⟨s1 s2 … sn⟩, where sj is an itemset. • Sequence ⟨a1 a2 … an⟩ is contained in ⟨b1 b2 … bm⟩ if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin. • A sequence s is maximal if it is not contained in any other sequence.
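The containment test can be sketched with a single left-to-right scan: match each itemset of the shorter sequence against the earliest later itemset that includes it. The function name `is_contained` and the sample sequences are chosen for the example; sequences are assumed to be lists of Python sets.

```python
def is_contained(a, b):
    """True if sequence a (a list of itemsets) is contained in sequence b."""
    j = 0
    for a_set in a:
        # Advance to the next itemset of b that is a superset of a_set.
        while j < len(b) and not set(a_set) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1  # later itemsets of a must match strictly later itemsets of b
    return True


long_seq = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_contained([{3}, {4, 5}, {8}], long_seq))  # True
print(is_contained([{3}, {5}], [{3, 5}]))          # False
```

The second call returns False because items 3 and 5 bought in one transaction do not form the sequence "3, then 5"; order across transactions is what a sequence records.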

  36. Formal Definition (3) • A customer supports a sequence s if s is contained in the customer sequence for this customer. • Support of a sequence - % of customers who support the sequence. • For mining association rules, support was % of transactions.

  37. Formal Definition (4) • Given a database D of customer transactions find the maximal sequences among all sequences that have a certain user-specified minimum support. • Sequences that have support above minsup are large sequences.

  38. Algorithm: Sort Phase • Customer ID – Major key • Transaction-time – Minor key Converts the original transaction database into a database of customer sequences.
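The sort phase amounts to a two-key sort followed by grouping. A minimal sketch, assuming each transaction is a `(customer_id, transaction_time, items)` tuple; the function name and the sample rows are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def customer_sequences(transactions):
    """Group transactions into one time-ordered list of itemsets per customer."""
    # Customer id is the major key, transaction time the minor key.
    ordered = sorted(transactions, key=itemgetter(0, 1))
    return {
        customer: [frozenset(items) for _, _, items in group]
        for customer, group in groupby(ordered, key=itemgetter(0))
    }


rows = [
    (2, 15, {30}),
    (1, 30, {90}),
    (2, 10, {10, 20}),
    (1, 25, {30}),
    (2, 20, {40, 60, 70}),
]
for customer, sequence in customer_sequences(rows).items():
    print(customer, [sorted(s) for s in sequence])
# 1 [[30], [90]]
# 2 [[10, 20], [30], [40, 60, 70]]
```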

  39. Algorithm: Litemset Phase (1) Litemset Phase: • Find all large itemsets. Why? • Because each itemset in a large sequence has to be a large itemset.

  40. Algorithm: Litemset Phase (2) • To get all large itemsets we can use the Apriori algorithms discussed earlier. • Need to modify support counting. • For sequential patterns, support is measured by fraction of customers.

  41. Algorithm: Litemset Phase (3) • Each large itemset is then mapped to a set of contiguous integers. • Used to compare two large itemsets in constant time.
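The modified support count and the integer mapping can be sketched as follows. The function names, the numbering order (by size, then by items) and the five customer sequences are assumptions for the example. The key difference from association-rule mining is that a customer counts at most once, however many of that customer's transactions contain the itemset.

```python
def customer_support(itemset, sequences):
    """Fraction of customers with at least one transaction containing itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for seq in sequences if any(itemset <= t for t in seq))
    return hits / len(sequences)


def number_litemsets(litemsets):
    """Map each large itemset to an integer 1..n (the order is arbitrary)."""
    ordered = sorted(litemsets, key=lambda s: (len(s), sorted(s)))
    return {frozenset(s): i for i, s in enumerate(ordered, start=1)}


# Sample customer sequences, one list of transactions per customer.
sequences = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
print(customer_support({30}, sequences))      # 0.8
print(customer_support({40, 70}, sequences))  # 0.4

ids = number_litemsets([{30}, {40}, {70}, {40, 70}, {90}])
print(ids[frozenset({40, 70})])  # 5
```

Once itemsets are replaced by their integers, testing whether two large itemsets are equal is a single integer comparison.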

  42. Algorithm: Transformation (1) • Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. • Represent transactions as sets of large itemsets. • Customer sequence now becomes a list of sets of itemsets.

  43. Algorithm: Transformation (2)
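The transformation can be sketched as replacing every transaction by the set of large-itemset ids it contains. The function name, the id mapping and the sample sequence are hypothetical. Dropping transactions that contain no large itemset is an assumption of this sketch, since the slides do not say how such transactions are handled.

```python
def transform(sequence, litemset_ids):
    """Replace each transaction by the ids of the large itemsets it contains."""
    transformed = []
    for transaction in sequence:
        ids = {i for itemset, i in litemset_ids.items() if itemset <= transaction}
        if ids:  # assumption: transactions with no large itemset are dropped
            transformed.append(ids)
    return transformed


# Hypothetical id mapping for five large itemsets.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
customer = [{10, 20}, {30}, {40, 60, 70}]
print(transform(customer, litemset_ids))  # [{1}, {2, 3, 4}]
```

The first transaction disappears because neither 10 nor 20 belongs to any large itemset, and item 60 leaves no trace in the last one for the same reason.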

  44. Algorithm: Sequence Phase (1) • Use the set of large itemsets to find the desired sequences. • Similar structure to Apriori algorithms used to find large itemsets. • Use seed set to generate candidate sequences. • Count support for each candidate. • Eliminate candidate sequences which are not large.

  45. Algorithm: Sequence Phase (2) Two types of algorithms: • Count-all: counts all large sequences, including non-maximal sequences. • AprioriAll • Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first. • AprioriSome • DynamicSome

  46. Algorithm: Maximal Phase (1) • Find the maximal sequences among the set of large sequences. • Set of all large subsequences S

  47. Algorithm: Maximal Phase (2) • Use hash-tree to find all subsequences of sk in S. • Similar to subset function used in finding large itemsets. • S is stored in hash-tree.
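The goal of the maximal phase can be illustrated with a naive filter. This sketch deliberately swaps the hash-tree lookup for a quadratic pairwise comparison, so it shows what is computed rather than how the slides compute it. The function names and the nine sample large sequences are made up for the example; sequences are tuples of frozensets.

```python
def is_contained(a, b):
    """True if sequence a is contained in sequence b (order-preserving)."""
    remaining = iter(b)
    # any() consumes `remaining` up to the first match, so matches stay in order.
    return all(any(x <= y for y in remaining) for x in a)


def maximal_sequences(large):
    """Keep the sequences that are not contained in any other large sequence."""
    return [
        s for s in large
        if not any(s != other and is_contained(s, other) for other in large)
    ]


def seq(*itemsets):
    return tuple(frozenset(i) for i in itemsets)


large = [
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
]
for s in maximal_sequences(large):
    print([sorted(i) for i in s])
# [[30], [90]]
# [[30], [40, 70]]
```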

  48. AprioriAll (1)

  49. AprioriAll (2) • Hash-tree is used for counting. • Candidate generation similar to candidate generation in finding large itemsets. • Except that order matters and therefore we don’t have the condition: p.itemk-1 < q.itemk-1

  50. AprioriAll (3) Example of candidate generation:
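Candidate generation for sequences can be sketched as the same join and prune with the ordering condition removed. This is an illustration with assumed details: the function name is made up, sequences are tuples of large-itemset ids, the input is sample data rather than a reconstruction of the slide's figure, and the sketch lets a sequence join with itself because the slides only say that the ordering condition is dropped.

```python
def sequence_candidates(prev_large):
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    prev = set(prev_large)
    # Join: same first k-2 elements, with no ordering condition on the last.
    joined = [
        p + (q[-1],)
        for p in prev_large
        for q in prev_large
        if p[:-1] == q[:-1]
    ]
    # Prune: every (k-1)-subsequence (drop one element) must be large.
    return [
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    ]


l3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(sequence_candidates(l3))  # [(1, 2, 3, 4)]
```

Because order matters, the join produces both (1, 2, 3, 4) and (1, 2, 4, 3); the latter is pruned since its subsequence (2, 4, 3) is not large.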
