Pincer-Search*: An Efficient Algorithm for Discovering the Maximum Frequent Set • Dao-I Lin and Zvi M. Kedem • * Appeared in Advances in Database Technology - EDBT’98, Proceedings, LNCS Vol. 1377, Springer, pp. 105-119, March 1998.
Dao-I Lin and Zvi M. Kedem • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
Applications • Association rule applications: • Based on supermarket databases, one might be interested to know that “95% of the customers who bought pasta and ground meat also bought spaghetti sauce” • Based on the alarm signals in telecommunication databases, one might be interested to know that “one can have 90% confidence that alarm C will occur within some interval of time if alarm A and alarm B have occurred in that interval of time” • Based on the stock market trading databases, one might be interested to know that “90% of the time during the last month when the prices of stock A and stock B went up, the price of stock C also went up”
Setting • Basic terms: • {1, 2, …, n}: The set of all items • e.g. items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets • Transaction: A set of items • e.g. items purchased in a supermarket, alarm signals occurring within an interval of time, or stocks whose prices went up during the last hour • Database: A set of transactions • User-defined threshold (min-support): A number in [0,1] • Frequent itemset: A collection of items (an itemset) that occurs in at least a min-support fraction of the transactions in the database • The problem: • Given a large database of sets of items and a user-defined min-support threshold, what are the frequent itemsets?
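The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `support` and `is_frequent` and the toy supermarket database are assumptions made for this example, not part of the original slides.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], database: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in database if items <= t) / len(database)


def is_frequent(itemset: Iterable[str], database: List[Transaction],
                min_support: float) -> bool:
    """An itemset is frequent if its support reaches the min-support threshold."""
    return support(itemset, database) >= min_support


# Hypothetical toy database of four supermarket transactions.
database = [
    frozenset({"pasta", "meat", "sauce"}),
    frozenset({"pasta", "sauce"}),
    frozenset({"meat", "bread"}),
    frozenset({"pasta", "meat", "sauce"}),
]

print(support({"pasta", "sauce"}, database))           # 0.75
print(is_frequent({"meat", "bread"}, database, 0.5))   # False
```

Scanning the whole database for each itemset is the simplest possible counting strategy; the algorithms discussed later differ mainly in which itemsets they choose to count.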
The Importance of the Maximum Frequent Set • Maximal frequent itemsets: • The frequent itemsets such that no proper superset of them is frequent • Maximum frequent set: • The set of all maximal frequent itemsets • Fact: • An itemset is frequent if and only if it is a subset of a maximal frequent itemset • The maximum frequent set uniquely determines the entire frequent set, since the frequent set is exactly the collection of all subsets of the maximal frequent itemsets • Discovering the maximum frequent set is a key problem in many data mining applications: • Such as the discovery of association rules, theories, strong rules, episodes, and minimal keys
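To illustrate the fact above, the following sketch expands a maximum frequent set into the full frequent set by enumerating subsets. The function name `frequent_set_from_mfs` and the sample input are assumptions for illustration. Only non-empty subsets are listed, which is a choice made here.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def frequent_set_from_mfs(mfs: Iterable[Iterable[int]]) -> Set[FrozenSet[int]]:
    """Collect every non-empty subset of every maximal frequent itemset."""
    frequent: Set[FrozenSet[int]] = set()
    for maximal in mfs:
        items = sorted(maximal)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                frequent.add(frozenset(combo))
    return frequent


# A small, hypothetical maximum frequent set with two maximal itemsets.
frequent = frequent_set_from_mfs([{1, 2, 3}, {1, 5}])
print(len(frequent))  # 9
print(sorted(sorted(s) for s in frequent))
```

The expansion is exponential in the length of the longest maximal itemset, which is one reason to represent the result by the maximum frequent set instead.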
An Example • Database: • Transaction 1: {1,2,3,5} • Transaction 2: {1,5} • Transaction 3: {1,2} • Transaction 4: {1,2,3} • Min-support is 0.5 • Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3},{1,5}} • (Figure: itemset lattice for this example, from {1,2,3,4,5} at the top down to the single items)
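The example can be checked by brute force. The sketch below enumerates every itemset over items 1 to 5, keeps the frequent ones, and then filters out those with a frequent proper superset. Exhaustive enumeration is used here only because the example is tiny. It is not one of the search strategies discussed in these slides.

```python
from itertools import combinations

# The four transactions of the example, with min-support 0.5.
database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
items = range(1, 6)
min_support = 0.5


def support(itemset: frozenset) -> float:
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in database if itemset <= t) / len(database)


# Every non-empty itemset whose support reaches the threshold.
frequent = [
    frozenset(c)
    for size in range(1, 6)
    for c in combinations(items, size)
    if support(frozenset(c)) >= min_support
]

# Maximal frequent itemsets: no proper superset is frequent.
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print(len(frequent))                     # 9
print(sorted(sorted(m) for m in maximal))  # [[1, 2, 3], [1, 5]]
```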
Two Closure Properties • Let A and B be two itemsets with A ⊆ B • Property 1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B) • Property 2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A) • (Figure: two itemset lattices illustrating the properties, each marking an itemset A and a superset B)
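Both properties follow from a single inequality on supports. The notation below is introduced here for illustration and does not appear on the slides: D is the database, T(X) is the set of transactions containing itemset X, supp(X) = |T(X)|/|D|, and σ is the min-support threshold.

```latex
\[
A \subseteq B
\;\Longrightarrow\; T(B) \subseteq T(A)
\;\Longrightarrow\; \operatorname{supp}(B) = \frac{|T(B)|}{|D|}
\le \frac{|T(A)|}{|D|} = \operatorname{supp}(A)
\]
\[
\text{Property 1: } \operatorname{supp}(A) < \sigma \Rightarrow \operatorname{supp}(B) < \sigma
\qquad
\text{Property 2: } \operatorname{supp}(B) \ge \sigma \Rightarrow \operatorname{supp}(A) \ge \sigma
\]
```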
Traditional One-Way Search Approaches • The traditional approach to discovering the maximum frequent set uses either a bottom-up search or a top-down search • Bottom-up search is good when ALL maximal frequent itemsets are short • Top-down search is good when ALL maximal frequent itemsets are long • One-way search can only make use of ONE of the two closure properties to prune candidates
One-Way Search Algorithms • Property 1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Property 2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97) • (Figure: itemset lattices over {1,2,3,4,5} for the bottom-up and top-down searches. Legend: Blue: frequent itemsets; Red: maximal frequent itemsets; Black: infrequent itemsets)
Complexity of One-Way Searches • For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined) • (Figure: itemset lattices over {1,2,3,4,5} for the bottom-up and top-down searches. Legend: Blue: frequent itemsets; Red: maximal frequent itemsets; Black: infrequent itemsets)
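The cost of a bottom-up search can be seen in a minimal level-wise sketch in the style of Apriori. The database below is an assumption chosen to mimic the slide's example, with {1,2,3,4} as the only maximal frequent itemset and {5} infrequent. The function names are illustrative.

```python
from itertools import combinations


def apriori_gen(frequent_k: set, k: int) -> set:
    """Join frequent k-itemsets into (k+1)-candidates and prune with
    Property 1: every k-subset of a candidate must itself be frequent."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(union, k)
            ):
                candidates.add(union)
    return candidates


def bottom_up(database: list, min_support: float):
    """Return (maximum frequent set, candidates examined, database passes)."""
    n = len(database)
    items = sorted(set().union(*database))
    candidates = {frozenset([i]) for i in items}
    all_frequent, examined, passes, k = set(), 0, 0, 1
    while candidates:
        passes += 1                      # one database pass per level
        examined += len(candidates)
        frequent_k = {
            c for c in candidates
            if sum(c <= t for t in database) >= min_support * n
        }
        all_frequent |= frequent_k
        candidates = apriori_gen(frequent_k, k)
        k += 1
    mfs = {f for f in all_frequent if not any(f < g for g in all_frequent)}
    return mfs, examined, passes


# Hypothetical database: {1,2,3,4} occurs three times, {5} once.
database = [frozenset({1, 2, 3, 4})] * 3 + [frozenset({5})]
mfs, examined, passes = bottom_up(database, 0.5)
print(sorted(sorted(m) for m in mfs), examined, passes)  # [[1, 2, 3, 4]] 16 4
```

All 15 frequent itemsets plus the infrequent {5} are counted, over four passes, before the single maximal itemset is confirmed.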
Our Two-Way Search Approach: Pincer-Search • Run both bottom-up search and top-down search at the same time • Use information gathered in the bottom-up search to help prune candidates in the top-down search • Use Property 1 to eliminate candidates in the top-down search • Use information gathered in the top-down search to help prune candidates in the bottom-up search • Use Property 2 to eliminate candidates in the bottom-up search • Can efficiently discover both long and short maximal frequent itemsets
Pincer Search: Combining Top-down and Bottom-up Searches • This example shows how combining both searches could dramatically reduce • the number of candidates examined • the number of passes over the database • (Figure: itemset lattice over {1,2,3,4,5}, with callouts marking the itemsets eliminated in the top-down search by using Property 1 and the itemsets eliminated in the bottom-up search by using Property 2. Legend: Blue: frequent itemsets; Red: maximal frequent itemsets; Black: infrequent itemsets; Green: itemsets not examined)
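The two-way idea can be sketched as follows. This is a simplified illustration, not the authors' exact procedure. It keeps a maximum frequent candidate set (MFCS), splits its elements whenever the bottom-up search finds an infrequent itemset (Property 1), and skips counting any bottom-up candidate already covered by a known maximal frequent itemset (Property 2). Covered candidates are still generated so that plain Apriori-style candidate generation stays complete, which is a shortcut taken here. All function names and the toy database are assumptions.

```python
from itertools import combinations


def apriori_gen(frequent_k: set, k: int) -> set:
    """(k+1)-candidates whose k-subsets are all frequent."""
    out = set()
    for a in frequent_k:
        for b in frequent_k:
            u = a | b
            if len(u) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(u, k)
            ):
                out.add(u)
    return out


def update_mfcs(mfcs: set, infrequent: set) -> None:
    """Property 1, top-down: no MFCS element may contain an infrequent set.
    Each offending element m is replaced by m minus one item of that set."""
    for s in infrequent:
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                reduced = m - {e}
                if reduced and not any(reduced <= o for o in mfcs):
                    mfcs.add(reduced)


def pincer_search(database: list, min_support: float):
    """Return (maximum frequent set, itemsets counted, database passes)."""
    n = len(database)
    all_items = frozenset().union(*database)
    cache = {}                               # counted itemset -> frequent?
    mfcs, mfs = {all_items}, set()
    level, k, passes = {frozenset([i]) for i in all_items}, 1, 0
    while level:
        # Property 2, bottom-up: subsets of known maximal sets need no count.
        uncovered = {c for c in level if not any(c <= m for m in mfs)}
        to_count = {s for s in uncovered | mfcs if s not in cache}
        if to_count:
            passes += 1                      # both directions share one pass
            for s in to_count:
                cache[s] = sum(s <= t for t in database) >= min_support * n
        infrequent = {c for c in uncovered if not cache[c]}
        update_mfcs(mfcs, infrequent)
        mfs |= {m for m in mfcs if cache.get(m)}
        level = apriori_gen(level - infrequent, k)
        k += 1
    frequent = mfs | {s for s, ok in cache.items() if ok}
    maximal = {f for f in frequent if not any(f < g for g in frequent)}
    return maximal, len(cache), passes


# Hypothetical database: {1,2,3,4} occurs three times, {5} once.
database = [frozenset({1, 2, 3, 4})] * 3 + [frozenset({5})]
mfs, counted, passes = pincer_search(database, 0.5)
print(sorted(sorted(m) for m in mfs), counted, passes)  # [[1, 2, 3, 4]] 13 2
```

On this toy input the sketch counts 13 itemsets in 2 passes, and the four 3-itemsets are never counted because {1,2,3,4} is confirmed frequent in the second pass. A purely bottom-up level-wise search on the same input would count 16 itemsets in 4 passes.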
Performance: Observations and Experiments • Non-monotone property of the maximum frequent set • Both the number of candidates and the number of frequent itemsets increase as the min-support decreases • This is NOT true for the number of maximal frequent itemsets • If MFS is {{1,2},{1,3},{2,3}} when min-support is 9% • If min-support decreases to 6% then MFS could become {{1,2,3}} • This property will NOT help bottom-up search algorithms • However, this property may help the Pincer-Search algorithm • Concentrated and scattered distributions • Concentrated: For the same number of frequent itemsets, the frequent itemsets are grouped in a NARROW and TALL shape; a few LONG maximal frequent itemsets • Scattered: For the same number of frequent itemsets, the frequent itemsets are grouped in a WIDE and SHORT shape; many SHORT maximal frequent itemsets
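The non-monotone property can be reproduced on a small, made-up database. In the sketch below, lowering the threshold increases the number of frequent itemsets while the number of maximal frequent itemsets drops from three to one. The database, the thresholds 0.4 and 0.1, and the function name `summarize` are assumptions chosen for illustration. Brute-force enumeration is used only because the input is tiny.

```python
from itertools import combinations

# Hypothetical database: each pair is common, the full triple is rare.
database = [frozenset(t) for t in (
    {1, 2, 3},
    {1, 2}, {1, 2},
    {1, 3}, {1, 3},
    {2, 3}, {2, 3},
)]


def summarize(min_support: float):
    """Return (number of frequent itemsets, maximum frequent set)."""
    n = len(database)
    items = sorted(set().union(*database))
    frequent = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if sum(frozenset(c) <= t for t in database) >= min_support * n
    ]
    maximal = {f for f in frequent if not any(f < g for g in frequent)}
    return len(frequent), maximal


for threshold in (0.4, 0.1):
    count, maximal = summarize(threshold)
    print(threshold, count, sorted(sorted(m) for m in maximal))
# 0.4 6 [[1, 2], [1, 3], [2, 3]]
# 0.1 7 [[1, 2, 3]]
```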
Experiments on Scattered Distributions • The benchmark databases are generated by a well-known synthetic data generation program from the IBM Quest project • |T| is the average transaction size, |I| is the average size of the maximal frequent itemsets, |D| is the number of transactions, and |L| is the number of maximal frequent itemsets • The experiment on T5.I2.D100K shows that although the Pincer-Search algorithm used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS), it still performed better, since the I/O time saved compensated for the extra cost • The experiment on T10.I4.D100K shows that it is also possible for the Pincer-Search algorithm to spend effort on maintaining the MFCS without pruning enough candidates to cover the extra cost • For instance, the Pincer-Search algorithm performed slightly worse than the Apriori algorithm when min-support is 0.75%
Experiments on Concentrated Distributions • These experiments show that the Pincer-Search algorithm is good for discovering the maximum frequent set with concentrated distributions • The improvements can be up to several orders of magnitude • For instance, the improvements are more than 2 orders of magnitude in the experiment on the T20.I15.D100K database when min-support is 7% and 6% • One can expect even greater improvements when some maximal frequent itemsets are longer
Experiments on Real-Life Databases and Conclusions • The Pincer-Search algorithm performed quite well in the experiments on the PUMS database, which contains Public Use Microdata Samples • Some preliminary experiments on NYSE stock market databases also show promising results • Conclusions: • Pincer-Search is good for concentrated distributions • In general, can use Adaptive Pincer-Search • Delay the use of the two-way search approach until a later pass • More experiments on real-life databases are in progress