Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set Dao-I Lin and Zvi M. Kedem

Dao-I Lin and Zvi M. Kedem • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

Overview
• The importance of the maximum frequent set
• Structural properties
• Traditional one-way search algorithms
• Pincer-Search algorithm
• Experiments on synthetic and census databases
• Conclusions

Setting
• Basic terms:
  • {1, 2, …, n}: The set of all items
  • Transaction: A set of items
  • Database: A set of transactions
  • User-defined threshold (suppmin): A number in [0,1]
  • Frequent itemset: A combination of items (an itemset) occurring in at least a suppmin fraction of the transactions in the database
• Maximum frequent set
  • Maximal frequent itemset: A frequent itemset that has no frequent proper superset
  • An itemset is frequent if and only if it is a subset of a maximal frequent itemset
  • Maximum frequent set: The set of all maximal frequent itemsets
• Discovering the maximum frequent set is a key problem in many data mining applications
  • Association rules, strong rules, episodes, and minimal keys

An Example
• Database

  Transaction   Itemset
  1             {1,2,3,5}
  2             {1,5}
  3             {1,2}
  4             {1,2,3}

• Set suppmin to 0.5
• Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions
• Maximum frequent set is {{1,2,3},{1,5}}
[Figure: itemset lattice showing {1,2,3,4,5}; {1,2,3}; {1,2}, {1,3}, {2,3}, {1,5}; {1}, {2}, {3}, {4}, {5}]
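The example above can be checked by brute force. The following is a minimal sketch, not part of the original slides: it tests every combination of items against the four transactions, which is only practical for tiny databases. The function names `support`, `frequent_itemsets`, and `maximal` are illustrative choices.

```python
from itertools import combinations

# Example database from the slide, one set per transaction.
transactions = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
items = [1, 2, 3, 4, 5]
supp_min = 0.5

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, items, supp_min):
    """Brute force: test every non-empty combination of items."""
    result = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if support(candidate, transactions) >= supp_min:
                result.append(candidate)
    return result

def maximal(itemsets):
    """Keep only the itemsets that have no proper superset in the collection."""
    return {a for a in itemsets if not any(a < b for b in itemsets)}

frequent = frequent_itemsets(transactions, items, supp_min)
mfs = maximal(frequent)
print(sorted(sorted(s) for s in frequent))
print(sorted(sorted(s) for s in mfs))
```

Brute force examines all 2^n - 1 non-empty itemsets, which is why the search strategies discussed in the rest of the talk matter.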

An Example
• Database

  Transaction   Itemset
  1             {1,2,3,4,5}
  2             {1,3}
  3             {1,2}
  4             {1,2,3,4}

• Set suppmin to 0.5
• Frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since they occur in at least 2 out of 4 transactions
• Maximum frequent set is {{1,2,3,4}}
[Figure: itemset lattice from {1,2,3,4,5} down to the single items {1}, {2}, {3}, {4}, {5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets.]

Two Observations
• Let A and B be two itemsets with A ⊆ B
• Observation-1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B)
• Observation-2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A)
[Figure: itemset lattice over {1,2,3,4,5}. One itemset labeled A, {5}, is shown with its supersets up to {1,2,3,5}, {1,2,4,5}, {1,3,4,5}, {2,3,4,5}; one itemset labeled B, {1,2,3,4}, is shown with its subsets down to the single items.]
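To make the two observations concrete, the sketch below lists what each one settles without counting, using the itemsets {5} and {1,2,3,4} from the figure. This is an illustration under my own naming (`proper_supersets` and `proper_subsets` are hypothetical helpers, not from the slides).

```python
from itertools import combinations

def proper_supersets(a, items):
    """All proper supersets of `a` within `items`.

    By Observation-1 these are all infrequent once `a` is known to be infrequent.
    """
    rest = [i for i in items if i not in a]
    return [frozenset(a) | frozenset(extra)
            for k in range(1, len(rest) + 1)
            for extra in combinations(rest, k)]

def proper_subsets(b):
    """All non-empty proper subsets of `b`.

    By Observation-2 these are all frequent once `b` is known to be frequent.
    """
    return [frozenset(sub)
            for k in range(1, len(b))
            for sub in combinations(sorted(b), k)]

items = [1, 2, 3, 4, 5]
pruned_by_obs1 = proper_supersets({5}, items)       # settled if {5} is infrequent
implied_by_obs2 = proper_subsets({1, 2, 3, 4})      # settled if {1,2,3,4} is frequent
print(len(pruned_by_obs1), len(implied_by_obs2))
```

With five items, a single infrequent item rules out 15 larger itemsets, and a single frequent 4-itemset confirms 14 smaller ones.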

Computing the Maximum Frequent Set
• Observation-1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97)
• Observation-2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97)
[Figure: two itemset lattices over {1,2,3,4,5}, one per search direction. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets.]

Complexity of One-Way Search
• For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined)
• For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined)
[Figure: the same two itemset lattices over {1,2,3,4,5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets.]
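The cost of each one-way search can be measured on the second example database. The sketch below is a simplified level-wise version of each direction, written for illustration only; it is not the code of any of the cited algorithms, and `bottom_up`, `top_down`, and `min_count` (the absolute support threshold) are names assumed here.

```python
from itertools import combinations

def is_frequent(itemset, transactions, min_count):
    return sum(1 for t in transactions if itemset <= t) >= min_count

def bottom_up(transactions, items, min_count):
    """Level-wise search from single items upward; returns (MFS, itemsets examined)."""
    level = {frozenset([i]) for i in items}
    frequent, examined, k = set(), 0, 1
    while level:
        examined += len(level)
        found = {c for c in level if is_frequent(c, transactions, min_count)}
        frequent |= found
        # Observation-1: keep a (k+1)-candidate only if all its k-subsets are frequent.
        level = {a | b for a, b in combinations(found, 2) if len(a | b) == k + 1}
        level = {c for c in level
                 if all(frozenset(s) in found for s in combinations(c, k))}
        k += 1
    mfs = {a for a in frequent if not any(a < b for b in frequent)}
    return mfs, examined

def top_down(transactions, items, min_count):
    """Level-wise search from the full itemset downward; returns (MFS, itemsets examined)."""
    level = {frozenset(items)}
    mfs, examined = set(), 0
    while level:
        examined += len(level)
        infrequent = set()
        for c in level:
            if is_frequent(c, transactions, min_count):
                mfs.add(c)
            else:
                infrequent.add(c)
        # Observation-2: subsets of a known frequent itemset need no counting.
        level = {c - {i} for c in infrequent if len(c) > 1 for i in c}
        level = {c for c in level if not any(c <= m for m in mfs)}
    return mfs, examined

transactions = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
items = [1, 2, 3, 4, 5]
print(bottom_up(transactions, items, 2)[1], top_down(transactions, items, 2)[1])
```

On this database the bottom-up search examines the 15 frequent itemsets plus {5}, and the top-down search examines the 16 infrequent itemsets plus {1,2,3,4}.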

Pincer-Search: Combining Top-down and Bottom-up Searches
• Use Observation-1 to eliminate candidates in the top-down search
• Use Observation-2 to eliminate candidates in the bottom-up search
• This example shows how combining both searches could dramatically reduce
  • the number of candidates examined
  • the number of passes over the database
[Figure: itemset lattice over {1,2,3,4,5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets. Green: itemsets not examined.]

MFCS: A New Data Structure Maintained
• For bottom-up search: Candidate set (as usual)
• For top-down search: Use a new dynamically maintained data structure: maximum frequent candidate set (MFCS)
• MFCS is a set of itemsets:
  • Union of its subsets contains all known frequent itemsets
  • Union of its subsets does not contain any currently known infrequent itemsets
  • It is of minimum cardinality
• MFCS supports efficient coordination between bottom-up and top-down searches
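One way to maintain the three MFCS properties is to split every MFCS element that contains a newly found infrequent itemset. The sketch below is my reading of that update step, not code from the talk; `update_mfcs` is a hypothetical name, and skipping the empty itemset is an assumption made here for convenience.

```python
def update_mfcs(mfcs, infrequent_itemsets):
    """Return a new MFCS whose elements contain none of `infrequent_itemsets`.

    `mfcs` and `infrequent_itemsets` are collections of frozensets.
    """
    mfcs = set(mfcs)
    for s in infrequent_itemsets:
        # Every element that contains s can no longer be frequent.
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            # Removing any one item of s from m yields an itemset without s.
            for e in s:
                smaller = m - {e}
                # Keep it only if it is non-empty and not already covered.
                if smaller and not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs

start = {frozenset({1, 2, 3, 4, 5})}
step1 = update_mfcs(start, [frozenset({2, 5})])
step3 = update_mfcs(start, [frozenset({2, 5}), frozenset({3, 5}), frozenset({4, 5})])
print(sorted(sorted(m) for m in step1))
print(sorted(sorted(m) for m in step3))
```

Starting from {1,2,3,4,5}, the infrequent itemset {2,5} yields {1,2,3,4} and {1,3,4,5}, which agrees with the first step of the search path slide.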

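Pincer-Search: Search Path
[Figure: search path on the itemset lattice over {1,2,3,4,5}. Starting from {1,2,3,4,5}: by {2,5}, the itemsets {1,2,3,4} and {1,3,4,5}; by {3,5}, the itemsets {1,3,4} and {1,4,5}; by {4,5}, a further split. Lower levels shown: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,5}, {2,5}, {3,5}, {4,5} and {1}, {2}, {3}, {4}, {5}.]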

Pincer-Search Algorithm
L0 := ∅; k := 1; C1 := {{i} | i ∈ {1,2,...,n}}
MFCS := {{1,2,...,n}}; MFS := ∅
while Ck ≠ ∅
    read database and count supports for Ck and MFCS
    MFS := MFS ∪ {frequent itemsets in MFCS}
    determine frequent set Lk and infrequent set Sk
    use Sk to update MFCS
    generate new candidate set Ck+1 (join, recover, and prune)
    k := k + 1
return MFS
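The loop above can be sketched in Python. This is a simplified variant under stated assumptions, not the authors' implementation: candidates that are subsets of a known maximal frequent itemset are kept as frequent without being counted, so the recover and prune steps of candidate generation are not needed; supports are counted one itemset at a time instead of in one database pass; and a final check of MFCS is added after the loop. All function names and `min_count` (an absolute support threshold) are hypothetical.

```python
from itertools import combinations

def count(itemset, transactions):
    """Number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def update_mfcs(mfcs, infrequent):
    """Split every MFCS element that contains a newly found infrequent itemset."""
    mfcs = set(mfcs)
    for s in infrequent:
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                smaller = m - {e}
                if smaller and not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs

def pincer_search(transactions, items, min_count):
    """Simplified two-way search; returns (MFS, number of support counts)."""
    candidates = {frozenset([i]) for i in items}
    mfcs, mfs = {frozenset(items)}, set()
    k, counted = 1, 0
    while candidates:
        # Top-down step: count MFCS elements not yet known to be frequent.
        for m in mfcs - mfs:
            counted += 1
            if count(m, transactions) >= min_count:
                mfs.add(m)
        # Bottom-up step: classify the current candidates.
        frequent, infrequent = set(), set()
        for c in candidates:
            if any(c <= m for m in mfs):
                frequent.add(c)  # Observation-2: no counting needed
                continue
            counted += 1
            if count(c, transactions) >= min_count:
                frequent.add(c)
            else:
                infrequent.add(c)
        mfcs = update_mfcs(mfcs, infrequent)
        # Join frequent k-itemsets; keep unions whose k-subsets are all frequent.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        k += 1
    # Final check of MFCS elements produced by the last update.
    for m in mfcs - mfs:
        counted += 1
        if count(m, transactions) >= min_count:
            mfs.add(m)
    return mfs, counted

transactions = [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}]
print(pincer_search(transactions, [1, 2, 3, 4, 5], 2))
```

On the second example database this variant counts 7 itemsets: {1,2,3,4,5}, the five single items, and {1,2,3,4}. Once {1,2,3,4} is found frequent, none of its subsets needs counting.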

Performance: Observations and Experiments
• Non-monotone property of the maximum frequent set
  • Both the number of candidates and the number of frequent itemsets increase as suppmin decreases
  • This is NOT true for the number of maximal frequent itemsets
  • If MFS is {{1,2},{1,3},{2,3}} when suppmin is 9%
  • If suppmin decreases to 6% then MFS could become {{1,2,3}}
  • This property will NOT help bottom-up search algorithms
  • However, this property may help the Pincer-Search algorithm
• Concentrated and scattered distributions
  • Concentrated: on each level, the frequent itemsets have many common items; the frequent items tend to cluster (narrow and tall)
  • Scattered: the frequent itemsets do not have many common items (wide and flat)
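The non-monotone property can be reproduced on a toy database. The five transactions below are hypothetical, made up for this illustration and not taken from the experiments; `maximal_frequent` is a brute-force helper assumed here, and thresholds are given as absolute counts rather than percentages.

```python
from itertools import combinations

def maximal_frequent(transactions, items, min_count):
    """Brute-force maximum frequent set for a small database."""
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if sum(1 for t in transactions if set(c) <= t) >= min_count]
    return {a for a in frequent if not any(a < b for b in frequent)}

# Hypothetical toy database (not from the slides).
transactions = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 3}, {2, 3}]
items = [1, 2, 3]

high = maximal_frequent(transactions, items, min_count=3)  # stricter threshold
low = maximal_frequent(transactions, items, min_count=2)   # looser threshold
print(len(high), len(low))
```

Lowering the threshold turns three maximal frequent itemsets into one, even though the number of frequent itemsets grows from six to seven.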

Scattered Distributions

Concentrated Distributions

Census Data

Conclusions
• Pincer-Search is good for concentrated distributions
• In general, Adaptive Pincer-Search can be used
• More experiments on real-life databases are needed
