Fast Algorithms for Discovering the Maximum Frequent Set • Dao-I Tony Lin • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University • http://www/?
Association Rule Examples • Based on supermarket databases, it may be interesting to discover that “90% of the customers who bought Bread and Butter also bought Tea” • Based on the alarm signals in telecommunication databases, it may be interesting to discover that “90% of the time when alarm X and alarm Y occurred within an interval of time, alarm Z also occurred within that same time interval” • Based on the stock market trading databases, it may be interesting to discover that “90% of the time during the last month when the prices of stock X and stock Y went up, the price of stock Z also went up”
Basic Definitions • Basic terms: • {1, 2, …, n}: The set of all items • e.g., items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets • Transaction: A set of items • e.g., items I bought yesterday in a supermarket, alarm signals occurring on Thursday, or stocks whose prices increased during the last hour • Database: A set of transactions • Itemset: A collection (set) of items • Support: The percentage of the transactions containing the itemset • Frequent itemset: Support(itemset) ≥ Min-Support • Confidence of a rule {1,2} → {3}: Support({1,2,3}) / Support({1,2}) • Interesting association rules: rules having at least minimum confidence and minimum support
Association Rule Example • Database of 5 items: items 1, 2, 3, 4, 5 • Transactions: {1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4} • User sets: Min-Support to 50% and Min-Confidence to 75% • Some itemsets are frequent, e.g. {1}, {1,4}, and {1,2,3,4}, since they occur in at least 2 out of 4 transactions • Some association rules are interesting: {1,4} → {2,3}: Support({1,2,3,4}) / Support({1,4}) = 50 / 50 = 100% ≥ 75% • Some are not: {1,2} → {3,4}: Support({1,2,3,4}) / Support({1,2}) = 50 / 75 ≈ 67% < 75% • Figure: itemset lattice over items 1 to 5, from the singletons {1}, {2}, {3}, {4}, {5} up to {1,2,3,4,5}; blue: frequent itemsets, black: infrequent itemsets
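The support and confidence arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the helper names `support` and `confidence` are chosen here for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Database from the slide: four transactions over items 1..5.
DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]

def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in database if itemset <= t)
    return Fraction(hits, len(database))

def confidence(antecedent, consequent, database):
    """Support(antecedent and consequent together) / Support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, database) / support(antecedent, database)

MIN_CONFIDENCE = Fraction(3, 4)

conf_a = confidence({1, 4}, {2, 3}, DATABASE)  # 50% / 50%
conf_b = confidence({1, 2}, {3, 4}, DATABASE)  # 50% / 75%
print(conf_a >= MIN_CONFIDENCE, conf_b >= MIN_CONFIDENCE)  # True False
```

Only the first rule clears the 75% confidence threshold; the second reaches 2/3.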
Association Rule Discovery • Step 1: Discovering the set of all frequent itemsets • This step is very time consuming • Step 2: Generating the association rules • This step is straightforward and fast
The Key Problem • The key problem for finding the association rules is discovering the frequent set ( = the set of all frequent itemsets) • Given a (large) database and a user-defined min-support threshold, what is the frequent set? • What is the maximum frequent set?
Road Map • Introduction • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and real-life databases • Conclusions
The Importance of the Maximum Frequent Set • Maximum frequent set (MFS): • The set of all maximal frequent itemsets • A concise way of describing the entire frequent set • Maximal (length) frequent itemset: • A frequent itemset that has no proper frequent superset • Of course: • An itemset is frequent iff it is a subset of a maximal frequent itemset • The maximum frequent set uniquely determines the entire frequent set, since the frequent set is exactly the collection of all subsets of its members • Discovering the maximum frequent set is a key problem in many data mining applications: • Not only in the discovery of association rules but also in the discovery of theories, strong rules, episodes, and minimal keys
MFS Example • Database of 5 items: items 1 through 5 • Transactions: {1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4} • User sets: Min-Support to 50% • The frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since they occur in at least 2 out of 4 transactions • The maximum frequent set is { {1,2,3,4} } • Figure: itemset lattice over items 1 to 5; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets
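For a database this small, the maximum frequent set can be found by brute force, which also confirms that expanding the MFS recovers the whole frequent set. The sketch below is illustrative only: it enumerates all 2^n - 1 non-empty itemsets, which is exactly what the algorithms in this talk are designed to avoid.

```python
from itertools import combinations

DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
ITEMS = [1, 2, 3, 4, 5]
MIN_COUNT = 2  # 50% of 4 transactions

def is_frequent(itemset):
    """True if at least MIN_COUNT transactions contain the itemset."""
    return sum(1 for t in DATABASE if itemset <= t) >= MIN_COUNT

# Enumerate every non-empty itemset (feasible only for tiny n).
frequent = [frozenset(c)
            for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)
            if is_frequent(frozenset(c))]

# Maximal frequent itemsets: no proper superset is frequent.
mfs = [a for a in frequent if not any(a < b for b in frequent)]

# Expanding the MFS: every non-empty subset of a maximal itemset.
recovered = {frozenset(c)
             for m in mfs
             for k in range(1, len(m) + 1)
             for c in combinations(sorted(m), k)}

print(len(frequent), [sorted(m) for m in mfs])  # 15 [[1, 2, 3, 4]]
print(recovered == set(frequent))               # True
```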
Two Closure Properties of Frequent Itemsets • Let A and B be two itemsets and A ⊆ B • Property 1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B) • Property 2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A) • Figure: two lattice fragments, one with A = {5} and its supersets B up to {1,2,3,4,5}, the other with B = {1,2,3,4} and its subsets A
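Both properties follow from one observation: enlarging an itemset can only shrink the set of transactions that contain it. A short derivation, writing D for the database and σ for the min-support threshold (both symbols are notation introduced here, not taken from the slides):

```latex
\begin{align*}
A \subseteq B
  &\;\Longrightarrow\; \{\,T \in D : B \subseteq T\,\} \subseteq \{\,T \in D : A \subseteq T\,\} \\
  &\;\Longrightarrow\; \mathrm{Support}(B) \le \mathrm{Support}(A) \\[4pt]
\text{Property 1:}\quad
  & \mathrm{Support}(A) < \sigma \;\Longrightarrow\; \mathrm{Support}(B) \le \mathrm{Support}(A) < \sigma \\
\text{Property 2:}\quad
  & \mathrm{Support}(B) \ge \sigma \;\Longrightarrow\; \mathrm{Support}(A) \ge \mathrm{Support}(B) \ge \sigma
\end{align*}
```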
Bottom-Up Search Algorithms • Examples: AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Property 1 is used to exclude the infrequent itemsets (supersets) • Figure: lattice searched from the singletons {1}, {2}, {3}, {4}, {5} up to {1,2,3,4}; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets
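A level-wise bottom-up search in the style of Apriori can be sketched as follows. This is a simplified illustration under stated assumptions, not any of the cited implementations: the whole database is held in memory, and the join step simply unions pairs of frequent itemsets rather than using the prefix-based join of the published algorithms. The function name `bottom_up` and the `examined` counter are introduced here.

```python
from itertools import combinations

DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]

def count(itemset, database):
    """Number of transactions containing the itemset."""
    return sum(1 for t in database if itemset <= t)

def bottom_up(database, min_count):
    """Return (all frequent itemsets, number of candidates counted)."""
    items = sorted(set().union(*database))
    candidates = [frozenset([i]) for i in items]
    frequent, examined = set(), 0
    while candidates:
        examined += len(candidates)
        level = {c for c in candidates if count(c, database) >= min_count}
        frequent |= level
        k = len(candidates[0]) + 1
        # Join frequent itemsets into k-itemsets, then apply Property 1:
        # drop any candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent, examined

frequent, examined = bottom_up(DATABASE, min_count=2)
print(len(frequent), examined)  # 15 16
```

On the running example the search counts 16 itemsets: all 15 frequent ones plus the infrequent singleton {5}.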
Top-Down Search Algorithms • Examples: TopDown (ZPOL97), guess-and-correct (MT97) • Property 2 is used to include frequent itemsets (subsets) • Figure: lattice searched from {1,2,3,4,5} down to {5}; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets
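The mirror-image top-down search can be sketched in the same way. Again this is an illustrative simplification rather than the TopDown or guess-and-correct algorithms themselves: each infrequent itemset is split into its subsets one size smaller, and Property 2 is used to skip any subset of an itemset already found frequent. The name `top_down` is introduced here.

```python
DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]

def top_down(database, min_count):
    """Return (maximal frequent itemsets, number of itemsets counted)."""
    def count(itemset):
        return sum(1 for t in database if itemset <= t)

    level = {frozenset().union(*database)}  # start from the full itemset
    mfs, examined = [], 0
    while level:
        examined += len(level)
        next_level = set()
        for c in level:
            if count(c) >= min_count:
                mfs.append(c)  # frequent and not under a known frequent set
            else:
                # Infrequent: examine its subsets that are one item smaller.
                next_level |= {c - {e} for e in c if len(c) > 1}
        # Property 2: subsets of a frequent itemset need not be counted.
        level = {c for c in next_level if not any(c <= m for m in mfs)}
    return mfs, examined

mfs, examined = top_down(DATABASE, min_count=2)
print([sorted(m) for m in mfs], examined)  # [[1, 2, 3, 4]] 17
```

On the running example the search counts 17 itemsets: the 16 infrequent ones (every itemset containing item 5) plus {1,2,3,4}.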
Complexity of One-Way Searches • For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined) • Figures: the bottom-up lattice from {1}, {2}, {3}, {4}, {5} up to {1,2,3,4} and the top-down lattice from {1,2,3,4,5} down to {5}; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets
Problems with the Traditional One-Way Search Algorithms • The traditional approach for discovering the maximum frequent set uses: • either a bottom-up search (Property 1) • or a top-down search (Property 2) • Bottom-up search is good when ALL maximal frequent itemsets are short • Top-down search is good when ALL maximal frequent itemsets are long • What to do when some maximal frequent itemsets are long and some are short? • What to do in general cases?
Our Two-Way Search Approach: Pincer-Search • Synergistically run both bottom-up search and top-down search at the same time • Use information gathered in the bottom-up search to help prune candidates in the top-down search • Use Property 1 to eliminate candidates in the top-down search • Use information gathered in the top-down search to help prune candidates in the bottom-up search • Use Property 2 to eliminate candidates in the bottom-up search
Pincer-Search: Combining Top-Down and Bottom-Up Searches • Eliminated in the top-down search by using Property 1 • Eliminated in the bottom-up search by using Property 2 • This example shows how combining both searches could dramatically reduce • the number of candidates examined • the number of passes of reading the database • Figure: itemset lattice over items 1 to 5; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets, green: itemsets not examined
Pincer-Search Algorithm
L0 := ∅; k := 1; C1 := {{i} | i ∈ {1, 2, ..., n}}
MFCS := {{1, 2, ..., n}}; MFS := ∅
while Ck ≠ ∅
    read database and count supports for Ck and MFCS
    remove frequent itemsets from MFCS and add them to MFS
    determine frequent set Lk and infrequent set Sk
    use Sk to update MFCS
    generate new candidate set Ck+1 (join, recover, and prune)
    k := k + 1
return MFS
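The loop above can be turned into a small runnable sketch. It is a simplification and should not be read as the author's implementation. Three assumptions are made here and are not in the slides: (1) candidates that are subsets of an MFS member are kept in the level as known-frequent without being counted, which sidesteps the recovery procedure; (2) new MFCS members are also checked against the MFS so that no redundant subsets are added; (3) the loop runs one extra round when MFCS members remain uncounted after the candidates run out. The names `pincer_search`, `known_infrequent`, and `counted` are hypothetical.

```python
from itertools import combinations

def pincer_search(database, min_count):
    """Simplified Pincer-Search sketch; returns (MFS, itemsets counted)."""
    def frequent(itemset):
        return sum(1 for t in database if itemset <= t) >= min_count

    items = frozenset().union(*database)
    candidates = {frozenset([i]) for i in items}  # C1
    mfcs, mfs = {items}, set()
    known_infrequent = set()  # MFCS members already counted as infrequent
    counted, k = 0, 1
    while candidates or mfcs - known_infrequent:
        # Top-down half: count MFCS members that were not counted before.
        for m in mfcs - known_infrequent:
            counted += 1
            if frequent(m):
                mfs.add(m)
            else:
                known_infrequent.add(m)
        mfcs -= mfs
        # Bottom-up half: Property 2 spares every subset of an MFS member.
        level, infrequent_k = set(), set()
        for c in candidates:
            if any(c <= m for m in mfs):
                level.add(c)  # known frequent, no database counting
            else:
                counted += 1
                if frequent(c):
                    level.add(c)
                else:
                    infrequent_k.add(c)
        # Property 1: split every MFCS member containing an infrequent set.
        for s in infrequent_k:
            for m in [m for m in mfcs if s <= m]:
                mfcs.discard(m)
                for e in s:
                    sub = m - {e}
                    if sub and not any(sub <= x for x in mfcs | mfs):
                        mfcs.add(sub)
        # Candidate generation: join, then prune by infrequent subsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
    return mfs, counted

DATABASE = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
mfs, counted = pincer_search(DATABASE, min_count=2)
print([sorted(m) for m in mfs], counted)  # [[1, 2, 3, 4]] 7
```

On the running example only 7 itemsets are counted against the database: {1,2,3,4,5}, the five singletons, and {1,2,3,4}. Every larger candidate is a subset of {1,2,3,4} and is never counted.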
MFCS: A New Data Structure Maintained • For bottom-up search: Candidate set • For top-down search: Use a new dynamically maintained data structure: maximum frequent candidate set (MFCS) • MFCS: The set of all maximal itemsets that are not known to be infrequent • MFCS supports efficient coordination between bottom-up and top-down searches
MFCS Example • Figure: itemset lattice over items 1 to 20, showing the singletons {1} through {20}, the pairs over items 1 to 6, and the itemsets {2,4,5,6}, {1,2,3,4,5}, {1,2,3,4,5,6}, and {1,2,3,4,5,6,7,…,20}; blue: frequent itemsets, black: infrequent itemsets, green: MFCS
MFCS-Gen Algorithm
Input: old MFCS, new infrequent set Sk found in pass k
Output: new MFCS
for all itemsets s in Sk
    for all itemsets m in MFCS
        if s is a subset of m
            MFCS := MFCS \ {m}
            for all items e in itemset s
                if m \ {e} is not a subset of any itemset in the MFCS
                    MFCS := MFCS ∪ {m \ {e}}
return MFCS
MFCS-Gen Example • {1,2,3,4,5,6,7,…,20} becomes {1,2,3,4,5,6} by removing infrequent items 7 to 20 • {1,2,3,4,5,6} becomes {1,2,3,4,5} and {2,3,4,5,6} by infrequent itemset {1,6} • {2,3,4,5,6} becomes {2,4,5,6} by infrequent itemset {3,6}; the other split, {2,3,4,5}, is dropped because it is a subset of {1,2,3,4,5} • Figure: itemset lattice over items 1 to 20; blue: frequent itemsets, black: infrequent itemsets, red: maximal frequent itemsets, green: MFCS
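The MFCS-Gen update replays this example directly. The following sketch transcribes the pseudocode into Python; the function name `mfcs_gen` and the use of `frozenset` for itemsets are choices made here, not taken from the slides.

```python
def mfcs_gen(mfcs, infrequent):
    """Update the MFCS with the infrequent itemsets found in one pass."""
    mfcs = set(mfcs)
    for s in infrequent:
        # Every member that contains s is now known to be infrequent.
        for m in [m for m in mfcs if s <= m]:
            mfcs.discard(m)
            # Replace m by its subsets that avoid s, keeping only maximal ones.
            for e in s:
                candidate = m - {e}
                if not any(candidate <= other for other in mfcs):
                    mfcs.add(candidate)
    return mfcs

start = {frozenset(range(1, 21))}  # {1, 2, ..., 20}
step0 = mfcs_gen(start, [frozenset({i}) for i in range(7, 21)])
step1 = mfcs_gen(step0, [frozenset({1, 6})])
step2 = mfcs_gen(step1, [frozenset({3, 6})])

print(sorted(sorted(m) for m in step2))  # [[1, 2, 3, 4, 5], [2, 4, 5, 6]]
```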
Experiments on Synthetic Databases • The benchmark databases are generated by a popular synthetic data generation program from the IBM Quest project • Parameters: • n is the number of different items (set to 1000) • |T| is the average transaction size • |I| is the average size of the maximal frequent itemsets • |D| is the number of transactions • |L| is the number of maximal frequent itemsets
Concentrated and Scattered Distributions • A distribution is scattered when the frequent itemsets are grouped in a WIDE and SHORT shape: many SHORT maximal frequent itemsets • A distribution is concentrated when the frequent itemsets are grouped in a NARROW and TALL shape: a few LONG maximal frequent itemsets • Figure: two lattices from {1}{2}{3}…{n} up to {1,2,…,n}, one labeled scattered and one labeled concentrated
Some Observations • T5.I2.D100K experiment • Pincer-Search used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS) • Pincer-Search still performed better, since the I/O time saved compensated for the extra cost • T10.I4.D100K experiment • Pincer-Search spent work maintaining the MFCS but did not prune enough candidates to cover the extra cost • For instance, Pincer-Search performed slightly worse than Apriori when min-support is 0.75%
Non-monotone Property of the MFS • Both the number of candidates and the number of frequent itemsets increase as the min-support decreases • NOT true for the number of maximal frequent itemsets • Assume MFS is {{1,2},{1,3},{2,3}} when min-support is 9% • If min-support decreases to 6% then MFS could become {{1,2,3}} • This property will NOT help bottom-up search algorithms • However, this property may help the Pincer-Search algorithm
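A tiny database makes the non-monotone behavior concrete. The five transactions below are hypothetical, chosen only to reproduce the shape of the example, and the thresholds are absolute counts (3 and 2) rather than the 9% and 6% on the slide. The helper `frequent_and_maximal` is a brute-force enumeration introduced here.

```python
from itertools import combinations

def frequent_and_maximal(database, min_count):
    """Brute force: all frequent itemsets and the maximal ones among them."""
    items = sorted(set().union(*database))
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if sum(1 for t in database if set(c) <= t) >= min_count]
    maximal = [a for a in frequent if not any(a < b for b in frequent)]
    return frequent, maximal

# Hypothetical database: every pair occurs 3 times, the triple only twice.
DATABASE = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 3}, {2, 3}]

freq_hi, max_hi = frequent_and_maximal(DATABASE, min_count=3)
freq_lo, max_lo = frequent_and_maximal(DATABASE, min_count=2)

print(len(freq_hi), len(max_hi))  # 6 3
print(len(freq_lo), len(max_lo))  # 7 1
```

Lowering the threshold raises the number of frequent itemsets from 6 to 7 while the number of maximal frequent itemsets falls from 3 to 1.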
Some Observations • Pincer-Search is good for discovering the frequent itemsets with concentrated distributions • The improvements can be up to several orders of magnitude • For instance, the improvements are more than 2 orders of magnitude on the T20.I15.D100K database when min-support is 7% or 6% • One can expect even greater improvements when some maximal frequent itemsets are longer
Experiments on Real-Life Databases • Some preliminary experiments on NYSE stock market databases also show promising results • Pincer-Search performed quite well in the experiments on the PUMS database, which contains Public Use Microdata Samples • This database is similar to the database used in DIC (BMUT97) and Max-Miner (R98)
Conclusions • Pincer-Search is good for concentrated distributions • In general, can use Adaptive Pincer-Search • Need to study other distributions of the frequent itemsets • More experiments on real-life databases needed • Partition the lattice for parallel and distributed algorithms
General Assumptions: Big n and Large Database • 2^n subsets of {1,2,…,n}, from {1}{2}{3}…{n} up to {1,2,…,n} • Figure labels: Large Database, Main Memory
Techniques to Reduce the I/O Time • Partition (SON95), Sampling (T96) • Subsets of {1,2,…,n}, from {1}{2}{3}…{n} up to {1,2,…,n} • Figure labels: ?, Large Database, Main Memory