Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set • Dao-I Lin and Zvi M. Kedem • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University • http://www/?
Overview • The importance of the maximum frequent set • Structural properties • Traditional one-way search algorithms • Pincer-Search algorithm • Experiments on synthetic and census databases • Conclusions
Setting • Basic terms: • {1, 2, …, n}: The set of all items • Transaction: A set of items • Database: A set of transactions • User-defined threshold (suppmin): A number in [0,1] • Frequent itemset: A combination of items (an itemset) occurring in at least a suppmin fraction of the transactions in the database • Maximum frequent set • Maximal frequent itemset: A frequent itemset with no frequent proper superset • An itemset is frequent if and only if it is a subset of a maximal frequent itemset • Maximum frequent set: The set of all maximal frequent itemsets • Discovering the maximum frequent set is a key problem in many data mining applications • Association rules, strong rules, episodes, and minimal keys
An Example • Database (transaction: itemset): 1: {1,2,3,5}; 2: {1,5}; 3: {1,2}; 4: {1,2,3} • Set suppmin to 0.5 • Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3},{1,5}} • [Figure: itemset lattice for this database]
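The example can be checked by brute force. The following is a minimal sketch, not part of the original slides: the function names `support`, `frequent_itemsets`, and `maximum_frequent_set` are illustrative, and the exhaustive enumeration is only practical for a handful of items.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

def frequent_itemsets(database, supp_min):
    """Enumerate all non-empty itemsets and keep those meeting the threshold."""
    items = sorted(set().union(*database))
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), database) >= supp_min
    ]

def maximum_frequent_set(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return [a for a in frequent if not any(a < b for b in frequent)]

# The four transactions from the example slide.
database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
frequent = frequent_itemsets(database, 0.5)
mfs = maximum_frequent_set(frequent)
print(len(frequent))                  # 9
print(sorted(sorted(s) for s in mfs)) # [[1, 2, 3], [1, 5]]
```

The nine frequent itemsets and the two maximal ones match the lists given on the slide.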
An Example • Database (transaction: itemset): 1: {1,2,3,4,5}; 2: {1,3}; 3: {1,2}; 4: {1,2,3,4} • Set suppmin to 0.5 • Frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3,4}} • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]
Two Observations • Let A and B be two itemsets with A ⊆ B • Observation-1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B) • Observation-2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A) • [Figure: itemset lattice over items 1 to 5 with two itemsets labeled A and B]
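Both observations follow from one fact: if A ⊆ B, then A occurs in at least as many transactions as B. A small sketch that checks this exhaustively on the four-transaction database over items 1 to 5 from the example ({1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4}); the helper name `count` is illustrative.

```python
from itertools import combinations

database = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in database if set(itemset) <= t)

# All 31 non-empty itemsets over items 1..5.
itemsets = [c for k in range(1, 6) for c in combinations(range(1, 6), k)]

# A pair (A, B) with A a subset of B violates the observations
# only if A occurs in fewer transactions than B.
violations = [
    (a, b)
    for a in itemsets
    for b in itemsets
    if set(a) <= set(b) and count(a) < count(b)
]
print(violations)  # []
```

Since the support count never increases when items are added, an infrequent A forces every superset B to be infrequent, and a frequent B forces every subset A to be frequent.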
Computing the Maximum Frequent Set • Observation-1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Observation-2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97) • [Figure: two itemset lattices over items 1 to 5, one per search direction. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]
Complexity of One-Way Search • For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined) • [Figure: two itemset lattices over items 1 to 5, one per search direction. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]
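To make the bottom-up cost concrete, here is a hedged sketch of a generic level-wise search in the style of the bottom-up algorithms listed earlier. It is not the code of any one of them; `bottom_up` and its return values are illustrative. It is run on the example database whose maximum frequent set is {{1,2,3,4}}.

```python
from itertools import combinations

def bottom_up(database, supp_min):
    """Level-wise search: count C1, C2, ... until no candidates remain."""
    min_count = supp_min * len(database)
    candidates = {frozenset([i]) for i in set().union(*database)}
    frequent, examined, passes, k = set(), 0, 0, 1
    while candidates:
        passes += 1                      # one database read per level
        examined += len(candidates)
        level = {c for c in candidates
                 if sum(1 for t in database if c <= t) >= min_count}
        frequent |= level
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then prune with
        # Observation-1: drop any candidate with an infrequent subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
    maximal = {a for a in frequent if not any(a < b for b in frequent)}
    return maximal, examined, passes

database = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
maximal, examined, passes = bottom_up(database, 0.5)
print(examined, passes)  # 16 4
```

On this database the search counts 16 candidates (5 + 6 + 4 + 1) over 4 passes, because every subset of {1,2,3,4} has to be confirmed frequent before {1,2,3,4} itself is reached.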
Pincer-Search: Combining Top-down and Bottom-up Searches • Use Observation-1 to eliminate candidates in the top-down search • Use Observation-2 to eliminate candidates in the bottom-up search • This example shows how combining both searches could dramatically reduce • the number of candidates examined • the number of passes over the database • [Figure: itemset lattice over items 1 to 5. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets; green: itemsets not examined]
MFCS: A New Data Structure Maintained • For bottom-up search: Candidate set (as usual) • For top-down search: Use a new dynamically maintained data structure: maximum frequent candidate set (MFCS) • MFCS is a set of itemsets: • The subsets of its elements include all known frequent itemsets • The subsets of its elements include no currently known infrequent itemset • It is of minimum cardinality • MFCS supports efficient coordination between bottom-up and top-down searches
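One way to maintain these three properties is to split every MFCS element that contains a newly found infrequent itemset. The sketch below is an assumption about how such an update could look, written from the properties listed above; the name `update_mfcs` is illustrative and the slides do not give this code.

```python
def update_mfcs(mfcs, infrequent):
    """Return a new MFCS in which no element contains an infrequent itemset.

    `mfcs` is a set of frozensets; `infrequent` is an iterable of frozensets.
    """
    mfcs = set(mfcs)
    for s in infrequent:
        # Every element that contains s is replaced by its subsets
        # obtained by dropping one item of s at a time.
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                smaller = m - {e}
                # Keep the cardinality minimal: skip itemsets already
                # covered by another element.
                if not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs

start = {frozenset({1, 2, 3, 4, 5})}
bad = [frozenset({2, 5}), frozenset({3, 5}), frozenset({4, 5})]
result = update_mfcs(start, bad)
print(sorted(sorted(m) for m in result))  # [[1, 2, 3, 4], [1, 5]]
```

Starting from {{1,2,3,4,5}}, the infrequent pairs {2,5}, {3,5}, and {4,5} leave {{1,2,3,4},{1,5}}: every itemset that avoids all three pairs is still a subset of one of the two remaining elements.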
Pincer-Search: Search Path • [Figure: search path on the itemset lattice over items 1 to 5. Labels show how MFCS elements are split by infrequent itemsets: {1,2,3,4,5} by {2,5} into {1,2,3,4} and {1,3,4,5}; {1,3,4,5} by {3,5} into {1,3,4} and {1,4,5}; {1,4,5} by {4,5}]
Pincer-Search Algorithm
L0 := ∅; k := 1; C1 := {{i} | i ∈ {1, 2, ..., n}}
MFCS := {{1, 2, ..., n}}; MFS := ∅
while Ck ≠ ∅
    read database and count supports for Ck and MFCS
    MFS := MFS ∪ {frequent itemsets in MFCS}
    determine frequent set Lk and infrequent set Sk
    use Sk to update MFCS
    generate new candidate set Ck+1 (join, recover, and prune)
    k := k + 1
return MFS
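The loop above can be illustrated with a simplified Python sketch. It is not the authors' implementation and deviates from the pseudocode in two labeled ways: candidates at level k are enumerated as the k-subsets of MFCS elements not already covered by MFS (a stand-in for the join, recover, and prune steps), and the loop runs until every MFCS element is known to be frequent rather than until Ck is empty. All function and variable names are illustrative.

```python
from itertools import combinations

def update_mfcs(mfcs, infrequent):
    """Split every MFCS element that contains an infrequent itemset."""
    mfcs = set(mfcs)
    for s in infrequent:
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                smaller = m - {e}
                if not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs

def pincer_search(database, supp_min):
    database = [frozenset(t) for t in database]
    min_count = supp_min * len(database)
    counts = {}  # itemset -> support count; each itemset is counted once

    def is_frequent(itemset):
        if itemset not in counts:
            counts[itemset] = sum(1 for t in database if itemset <= t)
        return counts[itemset] >= min_count

    mfcs = {frozenset().union(*database)}  # start with one all-items itemset
    mfs = set()
    k = 0
    while mfcs != mfs:  # assumption: stop once all MFCS elements are frequent
        k += 1
        # Top-down: a frequent MFCS element is a maximal frequent itemset.
        mfs |= {m for m in mfcs if is_frequent(m)}
        # Bottom-up: k-itemsets inside unresolved MFCS elements, skipping
        # those already known frequent through MFS (Observation-2).
        candidates = {frozenset(c)
                      for m in mfcs - mfs
                      for c in combinations(sorted(m), k)}
        candidates = {c for c in candidates
                      if not any(c <= m for m in mfs)}
        infrequent = [c for c in candidates if not is_frequent(c)]
        mfcs = update_mfcs(mfcs, infrequent)
    return mfs, len(counts), k

database = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
mfs, examined, passes = pincer_search(database, 0.5)
print(sorted(sorted(m) for m in mfs), examined, passes)  # [[1, 2, 3, 4]] 7 2
```

On this database the sketch counts 7 itemsets in 2 passes: the first pass finds {5} infrequent and shrinks MFCS to {{1,2,3,4}}, and the second pass confirms {1,2,3,4} as frequent, so none of its proper subsets of size 2 or 3 is ever counted. The enumeration of k-subsets is exponential in general, so this is only meant to show the interplay of the two directions on small inputs.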
Performance: Observations and Experiments • Non-monotone property of the maximum frequent set • Both the number of candidates and the number of frequent itemsets increase as suppmin decreases • This is NOT true for the number of maximal frequent itemsets • If MFS is {{1,2},{1,3},{2,3}} when suppmin is 9% • If suppmin decreases to 6% then MFS could become {{1,2,3}} (the three pairs stay frequent but are now subsets of a single maximal itemset) • This property will NOT help bottom-up search algorithms • However, this property may help the Pincer-Search algorithm • Concentrated and scattered distributions • Concentrated: on each level, the frequent itemsets have many common items; the frequent items tend to cluster (narrow and tall) • Scattered: the frequent itemsets do not have many common items (wide and flat)
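The non-monotone property can be reproduced on a tiny database. The four transactions below are hypothetical, chosen only for illustration, and the thresholds 0.5 and 0.25 stand in for the percentages on the slide; `maximal_frequent` is an illustrative brute-force helper.

```python
from itertools import combinations

def maximal_frequent(database, supp_min):
    """Brute-force frequent itemsets and the maximal ones among them."""
    items = sorted(set().union(*database))
    min_count = supp_min * len(database)
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sum(1 for t in database if set(c) <= t) >= min_count
    ]
    maximal = [a for a in frequent if not any(a < b for b in frequent)]
    return frequent, maximal

# Hypothetical database: each pair occurs twice, the triple only once.
database = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}]

freq_hi, max_hi = maximal_frequent(database, 0.5)
freq_lo, max_lo = maximal_frequent(database, 0.25)
print(len(freq_hi), len(max_hi))  # 6 3
print(len(freq_lo), len(max_lo))  # 7 1
```

Lowering the threshold raises the number of frequent itemsets from 6 to 7 but lowers the number of maximal frequent itemsets from 3 to 1.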
Conclusions • Pincer-Search is good for concentrated distributions • In general, can use Adaptive Pincer-Search • More experiments on real-life databases needed