
Fast Algorithms for Discovering the Maximum Frequent Set
Dao-I Tony Lin

Department of Computer Science, Courant Institute of Mathematical Sciences, New York University




Presentation Transcript


  1. Fast Algorithms for Discovering the Maximum Frequent Set • Dao-I Tony Lin • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

  2. Association Rule Examples • Based on supermarket databases, it may be interesting to discover that “90% of the customers who bought Bread and Butter also bought Tea” • Based on the alarm signals in telecommunication databases, it may be interesting to discover that “90% of the time when alarm X and alarm Y occurred within an interval of time, alarm Z also occurred within that same time interval” • Based on the stock market trading databases, it may be interesting to discover that “90% of the time during the last month when the prices of stock X and stock Y went up, the price of stock Z also went up”

  3. Basic Definitions • Basic terms: • {1, 2, ..., n}: the set of all items • e.g., items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets • Transaction: a set of items • e.g., items I bought yesterday in a supermarket, alarm signals occurring on Thursday, or stocks whose prices increased during the last hour • Database: a set of transactions • Itemset: a collection (set) of items • Support: the percentage of the transactions containing the itemset • Frequent itemset: Support(itemset) ≥ Min-Support • Confidence of a rule {1,2} → {3}: Support({1,2,3}) / Support({1,2}) • Interesting association rules: rules having at least the minimum confidence and the minimum support (support and confidence are sketched in code below)
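A minimal Python sketch of these two definitions; the function names and the representation of the database as a list of frozensets are illustrative choices, not from the talk:

    def support(database, itemset):
        """Fraction of the transactions that contain every item of itemset."""
        itemset = frozenset(itemset)
        return sum(1 for t in database if itemset <= t) / len(database)

    def confidence(database, antecedent, consequent):
        """Confidence of the rule antecedent -> consequent."""
        both = frozenset(antecedent) | frozenset(consequent)
        return support(database, both) / support(database, antecedent)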

  4. Association Rule Example • Database of 5 items: items 1 through 5 • Transactions: {1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4} • User sets Min-Support to 50% and Min-Confidence to 75% • Some itemsets are frequent, e.g. {1}, {1,4}, and {1,2,3,4}, since each occurs in at least 2 of the 4 transactions • Some association rules are interesting: {1,4} → {2,3} has confidence support({1,2,3,4}) / support({1,4}) = 50% / 50% = 100% ≥ 75%, while {1,2} → {3,4} has confidence support({1,2,3,4}) / support({1,2}) = 50% / 75% ≈ 67% < 75% and is not interesting (both computations are checked in the sketch below) [Lattice diagram; blue: frequent itemsets, black: infrequent itemsets]
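Running the functions sketched under slide 3 on this database reproduces the slide's numbers (the variable name db is mine):

    db = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 3}),
          frozenset({1, 2}), frozenset({1, 2, 3, 4})]

    support(db, {1, 4})             # 0.50 >= 50% min-support: frequent
    confidence(db, {1, 4}, {2, 3})  # 0.50 / 0.50 = 1.00 >= 0.75: interesting
    confidence(db, {1, 2}, {3, 4})  # 0.50 / 0.75 ~ 0.67 <  0.75: not interesting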

  5. Association Rule Discovery • Step 1: Discovering the set of all frequent itemsets • This step is very time consuming • Step 2: Generating the association rules • This step is straightforward and fast

  6. The Key Problem • The key problem for finding the association rules is discovering the frequent set ( = the set of all frequent itemsets) • Given a (large) database and a user-defined min-support threshold, what is the frequent set? • What is the maximum frequent set?

  7. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  8. The Importance of the Maximum Frequent Set • Maximum frequent set (MFS): • The set of all maximal frequent itemsets • A concise way of describing the entire frequent set • Maximal frequent itemset: • A frequent itemset that has no proper frequent superset • Of course: • An itemset is frequent iff it is a subset of a maximal frequent itemset • The maximum frequent set uniquely determines the entire frequent set, since the frequent set is exactly the union of the power sets of the MFS members (see the sketch below) • Discovering the maximum frequent set is a key problem in many data mining applications: • Not only in the discovery of association rules but also in the discovery of theories, strong rules, episodes, and minimal keys
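Because the frequent set is exactly the set of all subsets of MFS members, it can be reconstructed from the MFS alone. A small sketch (the name expand_mfs is mine):

    from itertools import combinations

    def expand_mfs(mfs):
        """Recover the entire frequent set from the maximum frequent set."""
        frequent = set()
        for maximal in mfs:
            for k in range(1, len(maximal) + 1):
                frequent.update(frozenset(c) for c in combinations(maximal, k))
        return frequent

    # MFS = {{1,2,3,4}} (next slide) expands to 2^4 - 1 = 15 frequent itemsets
    assert len(expand_mfs([frozenset({1, 2, 3, 4})])) == 15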

  9. MFS Example • Database of 5 items: items 1 through 5 • Transactions: {1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4} • User sets Min-Support to 50% • The frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since each occurs in at least 2 of the 4 transactions • The maximum frequent set is {{1,2,3,4}} [Lattice diagram; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets]

  10. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  11. Two Closure Properties of Frequent Itemsets • Let A and B be two itemsets with A ⊆ B • Property 1: A infrequent ⇒ B infrequent (a transaction that does not contain A cannot contain B) • Property 2: B frequent ⇒ A frequent (a transaction that contains B must contain A) [Lattice diagram with an A/B pair marked for each property]

  12. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  13. Bottom-Up Search Algorithms • Examples: AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Property 1 is used to exclude the infrequent itemsets (supersets of infrequent itemsets are pruned), as the sketch below illustrates [Lattice diagram; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets]
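A compact bottom-up sketch in the Apriori style, reusing the support function sketched under slide 3; Property 1 appears as the prune step. This illustrates the level-wise idea only, not the exact algorithm of any paper cited above:

    from itertools import combinations

    def bottom_up(database, items, min_support):
        """Level-wise search; returns the set of all frequent itemsets."""
        frequent = set()
        level = {frozenset({i}) for i in items}
        while level:
            level = {c for c in level if support(database, c) >= min_support}
            frequent |= level
            # join frequent k-itemsets into (k+1)-candidates, then apply
            # Property 1: drop any candidate with an infrequent k-subset
            joined = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
            level = {c for c in joined
                     if all(frozenset(s) in frequent
                            for s in combinations(c, len(c) - 1))}
        return frequent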

  14. Top-Down Search Algorithms • Examples: TopDown (ZPOL97), guess-and-correct (MT97) • Property 2 is used to include frequent itemsets (subsets of a frequent itemset need no further checking), as the sketch below illustrates [Lattice diagram; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets]
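A symmetric top-down sketch: a frequent candidate is recorded as maximal and not expanded, since Property 2 guarantees all of its subsets are frequent; an infrequent candidate is replaced by its immediate subsets. Again an illustration under the same assumptions, not a specific published algorithm:

    def top_down(database, items, min_support):
        """Pure top-down search; returns the maximal frequent itemsets."""
        maximal = set()
        level = {frozenset(items)}
        while level:
            next_level = set()
            for c in level:
                if support(database, c) >= min_support:
                    maximal.add(c)                      # frequent: record as maximal
                else:
                    next_level |= {c - {e} for e in c}  # expand immediate subsets
            # Property 2: subsets of a recorded maximal itemset are frequent
            # and cannot be maximal, so they are dropped
            level = {c for c in next_level
                     if c and not any(c <= m for m in maximal)}
        return maximal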

  15. Complexity of One-Way Searches • For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined) [Two lattice diagrams, one per search direction; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets]

  16. Problems with the Traditional One-Way Search Algorithms • The traditional approach for discovering the maximum frequent set uses: • either a bottom-up search (Property 1) • or a top-down search (Property 2) • Bottom-up search is good when ALL maximal frequent itemsets are short • Top-down search is good when ALL maximal frequent itemsets are long • What to do when some maximal frequent itemsets are long and some are short? • What to do in general cases?

  17. Naïve Two-Way Search

  18. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  19. Our Two-Way Search Approach: Pincer-Search • Synergistically run both the bottom-up search and the top-down search at the same time • Use information gathered in the bottom-up search to help prune candidates in the top-down search • Use Property 1 to eliminate candidates in the top-down search • Use information gathered in the top-down search to help prune candidates in the bottom-up search • Use Property 2 to eliminate candidates in the bottom-up search

  20. Pincer Search: Combining Top-Down and Bottom-Up Searches • In the figure, one region of the lattice is eliminated in the top-down search by using Property 1, and another is eliminated in the bottom-up search by using Property 2 • This example shows how combining both searches can dramatically reduce • the number of candidates examined • the number of passes over the database [Lattice diagram; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets, green: itemsets not examined]

  21. Pincer-Search Algorithm
    L0 := ∅; k := 1; C1 := {{i} | i ∈ {1, 2, ..., n}}
    MFCS := {{1, 2, ..., n}}; MFS := ∅
    while Ck ≠ ∅
        read database and count supports for Ck and MFCS
        remove frequent itemsets from MFCS and add them to MFS
        determine frequent set Lk and infrequent set Sk
        use Sk to update MFCS
        generate new candidate set Ck+1 (join, recover, and prune)
        k := k + 1
    return MFS
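A Python sketch of this loop, reusing support from slide 3 and the update_mfcs routine sketched after slide 24 below. Two simplifications are mine, not the paper's: the join/recover/prune candidate generation is abridged to a plain Apriori join, and infrequent MFCS members are fed straight back into MFCS-Gen so the top-down side converges on its own:

    def pincer_search(database, items, min_support):
        """Simplified two-way search in the spirit of Pincer-Search."""
        mfs = set()
        mfcs = {frozenset(items)}
        level = {frozenset({i}) for i in items}              # C1
        while mfcs or level:
            # one database pass counts the bottom-up candidates and the MFCS
            found = {m for m in mfcs if support(database, m) >= min_support}
            mfs |= found
            lk = {c for c in level if support(database, c) >= min_support}
            # infrequent itemsets from both directions drive the MFCS splits
            sk = (level - lk) | (mfcs - found)
            mfcs = update_mfcs(mfcs - found, sk)
            joined = {a | b for a in lk for b in lk if len(a | b) == len(a) + 1}
            # Property 2: a candidate inside a known maximal frequent itemset
            # is frequent but cannot be maximal, so it is not counted again
            level = {c for c in joined if not any(c <= m for m in mfs)}
        return {m for m in mfs if not any(m < x for x in mfs)}  # maximal only

On the slide-9 database, pincer_search(db, {1, 2, 3, 4, 5}, 0.5) should return {frozenset({1, 2, 3, 4})}, matching the MFS found there.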

  22. MFCS: A New Data Structure • For the bottom-up search: the candidate set • For the top-down search: a new, dynamically maintained data structure, the maximum frequent candidate set (MFCS) • MFCS: the set of all maximal itemsets that are not known to be infrequent • The MFCS supports efficient coordination between the bottom-up and top-down searches

  23. MFCS Example [Lattice diagram over items {1, 2, ..., 20}; green MFCS members: {1,2,3,4,5} and {2,4,5,6}; blue: frequent itemsets, black: infrequent itemsets]

  24. MFCS-Gen Algorithm
    Input: old MFCS, new infrequent set Sk found in pass k
    Output: new MFCS
    for all itemsets s in Sk
        for all itemsets m in MFCS
            if s is a subset of m
                MFCS := MFCS \ {m}
                for all items e in itemset s
                    if m \ {e} is not a subset of any itemset in MFCS
                        MFCS := MFCS ∪ {m \ {e}}
    return MFCS
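The same routine in Python, mirroring the pseudocode line for line (the function name and the frozenset representation are illustrative):

    def update_mfcs(mfcs, infrequent):
        """MFCS-Gen: split every MFCS member containing a new infrequent itemset."""
        mfcs = set(mfcs)
        for s in infrequent:
            for m in list(mfcs):
                if s <= m:                       # s is a subset of m
                    mfcs.discard(m)
                    for e in s:
                        candidate = m - {e}      # drop one item of s from m
                        if not any(candidate <= other for other in mfcs):
                            mfcs.add(candidate)  # keep only maximal itemsets
        return mfcs

For example, update_mfcs({frozenset({1,2,3,4,5,6})}, [frozenset({1,6})]) yields {{1,2,3,4,5}, {2,3,4,5,6}}, the first split in the example on the next slide.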

  25. MFCS-Gen Example • Start: MFCS = {{1, 2, ..., 20}} • Removing the infrequent items 7 through 20 gives MFCS = {{1,2,3,4,5,6}} • The infrequent itemset {1,6} splits {1,2,3,4,5,6} into {1,2,3,4,5} and {2,3,4,5,6} • The infrequent itemset {3,6} splits {2,3,4,5,6} into {2,4,5,6} ({2,3,4,5} is discarded as a subset of {1,2,3,4,5}), leaving MFCS = {{1,2,3,4,5}, {2,4,5,6}} [Lattice diagram; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets, green: MFCS]

  26. Pincer-Search: Search Path

  27. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  28. Experiments on Synthetic Databases • The benchmark databases are generated by a popular synthetic data generation program from the IBM Quest project • Parameters: • n is the number of different items (set to 1000) • |T| is the average transaction size • |I| is the average size of the maximal frequent itemsets • |D| is the number of transactions • |L| is the number of maximal frequent itemsets

  29. Concentrated and Scattered Distributions • A distribution is scattered when the frequent itemsets are grouped in a WIDE and SHORT shape: many SHORT maximal frequent itemsets • A distribution is concentrated when the frequent itemsets are grouped in a NARROW and TALL shape: a few LONG maximal frequent itemsets [Diagram: two lattices over {1, 2, ..., n}; the scattered one is wide and short, the concentrated one narrow and tall]

  30. Scattered Distributions (lower is better)

  31. Scattered Distributions (lower is better)

  32. Some Observations • T5.I2.D100K experiment • Pincer-Search used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS) • Pincer-Search still performed better, since the I/O time saved compensated for the extra cost • T10.I4.D100K experiment • Pincer-Search spent work maintaining the MFCS but did not prune enough candidates to cover the extra cost • For instance, Pincer-Search performed slightly worse than Apriori when min-support is 0.75%

  33. Concentrated Distributions (lower is better)

  34. Concentrated Distributions (lower is better)

  35. Non-Monotone Property of the MFS • Both the number of candidates and the number of frequent itemsets increase as min-support decreases • This is NOT true for the number of maximal frequent itemsets • Assume the MFS is {{1,2}, {1,3}, {2,3}} when min-support is 9% • If min-support decreases to 6%, the MFS could become {{1,2,3}} • This property will NOT help bottom-up search algorithms • However, it may help the Pincer-Search algorithm

  36. Some Observations • Pincer-Search is good for discovering the frequent itemsets of concentrated distributions • The improvements can reach several orders of magnitude • For instance, the improvements exceed 2 orders of magnitude in the T20.I15.D100K experiments with min-supports of 7% and 6% • One can expect even greater improvements when some maximal frequent itemsets are longer

  37. NYSE Databases (lower is better)

  38. Census Data (lower is better)

  39. Experiments on Real-Life Databases • Some preliminary experiments on NYSE stock market databases also show promising results • The Pincer-Search algorithm performed quite well in the experiments on the PUMS database, which contains Public Use Microdata Samples • This database is similar to the one used in DIC (BMUT97) and Max-Miner (R98)

  40. Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions

  41. Conclusions • Pincer-Search is good for concentrated distributions • In general, an Adaptive Pincer-Search can be used • Other distributions of the frequent itemsets need to be studied • More experiments on real-life databases are needed • The lattice can be partitioned for parallel and distributed algorithms

  42. General Assumptions: Big n and a Large Database • There are 2^n subsets of {1, 2, ..., n}, and the database is large relative to main memory [Diagram: the itemset lattice over {1, 2, ..., n} beside a large database and a small main memory]

  43. Techniques to Reduce the I/O Time • Partition (SON95), Sampling (T96) [Diagram: only selected subsets of {1, 2, ..., n} are kept in main memory while the large database is scanned]
