GenMax

GenMax
From: “Efficiently Mining Maximal Frequent Itemsets” by Karam Gouda & Mohammed J. Zaki
Zeev Dvir – dvirzeev@post.tau.ac.il

The Problem
• Given a large database of item transactions, find all frequent itemsets.
• A frequent itemset is a set of items that occurs in at least a user-specified percentage of the database.
• We call this percentage min_sup (for minimum support).

• A Maximal Frequent Itemset is a frequent itemset that has no frequent proper superset.
• FI := frequent itemsets; MFI := maximal frequent itemsets
• Fact: |MFI| << |FI|
• GenMax is an algorithm that finds the exact MFI.
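To make the FI / MFI distinction concrete, here is a minimal brute-force sketch. It is not GenMax; it simply enumerates every itemset. The toy `transactions` table and the absolute threshold `MIN_SUP = 3` (a count rather than a percentage) are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy database: transaction id -> items in that transaction.
transactions = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumed absolute support threshold (a count, not a percentage)


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


all_items = sorted(set().union(*transactions.values()))

# FI: every non-empty itemset whose support reaches MIN_SUP.
fi = [
    frozenset(combo)
    for size in range(1, len(all_items) + 1)
    for combo in combinations(all_items, size)
    if support(frozenset(combo)) >= MIN_SUP
]

# MFI: frequent itemsets with no frequent proper superset.
mfi = [x for x in fi if not any(x < y for y in fi)]

print(len(fi), "frequent itemsets")
print(sorted("".join(sorted(m)) for m in mfi))
```

Even on this tiny input the MFI is much smaller than the FI, which is the point of the "Fact" bullet above.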

Example
Min_sup = 3
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D

Some Useful Definitions
• The combine-set of an itemset I is the set of items that can be added to I to create a frequent itemset.
• In the previous example, the combine-set of the itemset {A} is {B, C}.
• The combine-set of the empty itemset is called F1; it is the set of frequent itemsets of size 1.
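The definition can be sketched directly in code. The function name `combine_set`, the toy `transactions` table, and the absolute threshold of 3 are assumptions for illustration; this follows the definition literally and is not an optimized routine.

```python
# Hypothetical toy database: transaction id -> items in that transaction.
transactions = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def combine_set(itemset, candidates, min_sup):
    """Items from `candidates` that keep `itemset` frequent when added."""
    return [
        x for x in candidates
        if x not in itemset and support(itemset | {x}) >= min_sup
    ]


items = ["A", "B", "C", "D"]
f1 = combine_set(frozenset(), items, min_sup=3)            # combine-set of the empty itemset
combine_a = combine_set(frozenset({"A"}), items, min_sup=3)
print(f1, combine_a)
```

With this data the combine-set of {A} comes out as B and C, and F1 contains all four items.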


Improvement
• At each level, sort the combine-set (C) in increasing order of support.
• An itemset with low support has a smaller chance of producing a large combine-set at the next level.
• The sooner we prune the tree, the more work we save.
• This heuristic was first used in MaxMiner.
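The following is a simplified sketch of a backtracking search for the MFI that applies this ordering. It is an illustration under assumptions, not the paper's pseudo-code: the function name `mine_mfi`, the toy `tidsets` table, and the tie-break on item name are all hypothetical. Each candidate carries the transaction-id set of the itemset it would produce, and a branch is skipped when everything it could reach is already covered by a known maximal itemset.

```python
def mine_mfi(tidsets, min_sup):
    """Return the maximal frequent itemsets as a list of frozensets."""
    mfi = []

    def extend(itemset, candidates):
        # candidates: (item, tidset of itemset + item) pairs, all frequent.
        # Heuristic from the slide: try low-support extensions first.
        candidates = sorted(candidates, key=lambda pair: (len(pair[1]), pair[0]))
        for i, (x, tx) in enumerate(candidates):
            new_itemset = itemset | {x}
            # Combine-set of new_itemset, drawn from the items ordered after x.
            combine = [
                (y, tx & ty)
                for y, ty in candidates[i + 1:]
                if len(tx & ty) >= min_sup
            ]
            # Largest itemset this branch could possibly reach.
            reachable = new_itemset | {y for y, _ in combine}
            if any(reachable <= m for m in mfi):
                continue  # the whole branch is covered by a known maximal set
            if combine:
                extend(new_itemset, combine)
            else:
                mfi.append(new_itemset)  # no extension and no known superset

    f1 = [(x, t) for x, t in tidsets.items() if len(t) >= min_sup]
    extend(frozenset(), f1)
    return mfi


# Hypothetical toy input: item -> ids of the transactions containing it.
tidsets = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
result = mine_mfi(tidsets, min_sup=3)
print([sorted(m) for m in result])
```

In this run the lowest-support item D is tried first and closes immediately; once ABC is found, the branches rooted at B and C are skipped without any further intersections.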

Bottlenecks
1. Superset checking: the best algorithms for superset checking give an amortized bound of […] per operation. That is bad if we have many itemsets in the MFI.
2. Frequency testing: how can we make frequency testing faster?

Optimizing Superset Checking
• A technique called “Progressive Focusing” narrows down the group of potential supersets as the recursive calls are made.
• LMFI := Local MFI
• Before each recursive call, we construct the LMFI for the next call from the current LMFI and the newly added item.
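A minimal sketch of the narrowing step, assuming the next LMFI is formed by keeping only those members of the current LMFI that contain the newly added item. The helper names `narrow_lmfi` and `has_superset` and the sample itemsets are hypothetical, and the step that passes newly found maximal itemsets back up to the caller is omitted.

```python
def narrow_lmfi(lmfi, new_item):
    """Keep only the maximal itemsets that contain the newly added item."""
    return [m for m in lmfi if new_item in m]


def has_superset(itemset, lmfi):
    """Superset check against the (already narrowed) local list only."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets found so far.
mfi_so_far = [frozenset("FGHI"), frozenset("FGHJ"), frozenset("ABK")]

# Descend F -> G -> I, narrowing before each recursive call.
lmfi_f = narrow_lmfi(mfi_so_far, "F")
lmfi_fg = narrow_lmfi(lmfi_f, "G")
lmfi_fgi = narrow_lmfi(lmfi_fg, "I")

print(len(mfi_so_far), len(lmfi_f), len(lmfi_fg), len(lmfi_fgi))
print(has_superset(frozenset("FGI"), lmfi_fgi))
```

The deeper the recursion goes, the shorter the list each superset check has to scan.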

LMFI Example
FGHI FGHJ …
FGH FGI …
FG …


Frequency Testing Optimization
• GenMax uses a “vertical database format”:
• For each item, we keep the set of all transactions containing that item.
• This set is called a tidset (Transaction ID set).
• This makes support computations easier, because we don’t have to scan the entire database.

Vertical Database
A {1, 3, 4, 5}
B {1, 3, 4, 6}
C {1, 2, 3, 4, 7}
D {2, 4, 6}
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
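The tidset of an itemset is the intersection of the tidsets of its items, so the numbers on this slide can be reproduced with plain set operations. The helper names `tidset` and `supp` are assumptions for illustration.

```python
from functools import reduce

# Vertical database from the slide: item -> tidset.
tidsets = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def tidset(itemset):
    """t(I): ids of the transactions containing every item of a non-empty I."""
    return reduce(set.intersection, (tidsets[item] for item in itemset))


def supp(itemset):
    """supp(I) = |t(I)|."""
    return len(tidset(itemset))


print(tidset("AC"), supp("AC"))
```

Only the tidsets of the items involved are touched; the rest of the database is never read.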

AB → ABC, ABD, ABE
Combine-set of AB = {C, E}, with tidsets t(ABC), t(ABE)
Each item y in the combine-set of an itemset I actually represents the itemset I ∪ {y}, and stores the tidset associated with it.

Additional Optimization
• Diffsets: don’t store the entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”).
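A small sketch of the diffset idea, using the usual definitions: for a prefix P and an item X, d(PX) = t(P) − t(PX) and supp(PX) = supp(P) − |d(PX)|; one level down, d(PXY) = d(PY) − d(PX). The toy tidsets and variable names are assumptions for illustration, with P = C, X = A, Y = D.

```python
# Hypothetical toy tidsets: item -> ids of the transactions containing it.
t = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}

# First level: diffsets relative to the prefix C are built from tidsets.
d_ca = t["C"] - t["A"]            # transactions with C but without A
d_cd = t["C"] - t["D"]            # transactions with C but without D
sup_ca = len(t["C"]) - len(d_ca)  # supp(CA) = supp(C) - |d(CA)|

# Next level: only diffsets are needed, no tidsets.
d_cad = d_cd - d_ca               # d(CAD) = d(CD) - d(CA)
sup_cad = sup_ca - len(d_cad)     # supp(CAD) = supp(CA) - |d(CAD)|

print(d_ca, d_cd, d_cad, sup_ca, sup_cad)
```

The result agrees with direct tidset intersection, where t(ACD) contains a single transaction.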

Experimental Results
• GenMax is compared with MaxMiner, MAFIA, and MAFIA-PP.
• MaxMiner and MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI.
• The databases used in the experiments are grouped according to their MFI length distribution.

Type I Datasets

Type II Datasets

Type III Datasets

Type IV Datasets

The End
