
GenMax

GenMax. From: “Efficiently Mining Maximal Frequent Itemsets” by Karam Gouda & Mohammed J. Zaki. The Problem: given a large database of item transactions, find all frequent itemsets.


Presentation Transcript


  1. GenMax • From: “Efficiently Mining Maximal Frequent Itemsets” by Karam Gouda & Mohammed J. Zaki • Zeev Dvir – dvirzeev@post.tau.ac.il

  2. The Problem • Given a large database of item transactions, find all frequent itemsets • A frequent itemset is a set of items that occurs in at least a user-specified percentage of the database • We call this percentage min_sup (for minimum support).

  3. A maximal frequent itemset is a frequent itemset that has no frequent proper superset • FI := the set of frequent itemsets • MFI := the set of maximal frequent itemsets • Fact: |MFI| << |FI| • GenMax is an algorithm that finds the exact MFI
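To make the FI/MFI distinction concrete, here is a minimal brute-force sketch (not GenMax itself). The seven-transaction database `DB` and the absolute threshold `MIN_SUP = 3` are assumptions for illustration; the transactions were chosen to agree with the tidsets listed on the Vertical Database slide.

```python
from itertools import combinations

# Hypothetical toy database: transaction id -> items in that transaction.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumed absolute support threshold (a count, not a percentage)


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, min_sup):
    """Enumerate every non-empty itemset and keep the frequent ones."""
    items = sorted(set().union(*db.values()))
    result = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if support(candidate, db) >= min_sup:
                result.append(candidate)
    return result


def maximal(fi):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [x for x in fi if not any(x < y for y in fi)]


FI = frequent_itemsets(DB, MIN_SUP)
MFI = maximal(FI)
print(len(FI), sorted("".join(sorted(m)) for m in MFI))  # 8 ['ABC', 'D']
```

Even on this tiny input the MFI (2 itemsets) is much smaller than the FI (8 itemsets), and every frequent itemset is a subset of some member of the MFI.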

  4. Example • Min_sup = 3 • Itemset lattice over the items A, B, C, D, level by level: ABCD • ABC, ABD, ACD, BCD • AB, AC, AD, BC, BD, CD • A, B, C, D

  5. Some Useful Definitions • The combine-set of an itemset I is the set of items that can be added to I to create a frequent itemset. • For instance, in the previous example, the combine-set of the itemset {A} is {B, C}. • The combine-set of the empty itemset is called F1; it is the set of frequent items (frequent itemsets of size 1).
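The definition can be sketched directly in code. The transcript does not preserve the example database, so the sketch below assumes the tidsets from the Vertical Database slide and an absolute `MIN_SUP = 3`; the helper names `tidset` and `combine_set` are hypothetical.

```python
# Tidsets taken from the Vertical Database slide: item -> ids of the
# transactions that contain the item.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
MIN_SUP = 3  # assumed absolute support threshold


def tidset(itemset):
    """Ids of the transactions containing every item of `itemset`."""
    tids = set().union(*TIDSETS.values())  # every transaction id
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def combine_set(itemset):
    """Items whose addition to `itemset` still yields a frequent itemset."""
    return sorted(
        x
        for x in TIDSETS
        if x not in itemset and len(tidset(set(itemset) | {x})) >= MIN_SUP
    )


print(combine_set(set()))   # F1: ['A', 'B', 'C', 'D']
print(combine_set({"A"}))   # ['B', 'C']
```

Under these assumptions the combine-set of {A} comes out as {B, C}, which agrees with the example given on the slide.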

  8. Improvement • At each level, sort the combine-set (C) in increasing order of support • An itemset with low support has a smaller chance of producing a large combine-set in the next level • The sooner we prune the tree, the more work we save • This heuristic was first used in MaxMiner
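A minimal sketch of how this ordering fits into a backtracking search for the MFI. This is an illustration under assumptions, not the authors' pseudocode: the data is the tidset table from the Vertical Database slide, `MIN_SUP = 3` is an assumed absolute threshold, and the function names are hypothetical. The pruning step uses a plain linear scan over the maximal itemsets found so far.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
MIN_SUP = 3  # assumed absolute support threshold


def backtrack(itemset, tids, candidates, mfi):
    """Extend `itemset` (a frozenset with tidset `tids`) using `candidates`."""
    # Combine-set: candidates that keep the extended itemset frequent,
    # each stored with the tidset of the extended itemset.
    combine = []
    for x in candidates:
        new_tids = tids & TIDSETS[x]
        if len(new_tids) >= MIN_SUP:
            combine.append((x, new_tids))

    if not combine:
        # No frequent extension is left: keep the itemset unless a
        # maximal itemset found earlier already contains it.
        if itemset and not any(itemset <= m for m in mfi):
            mfi.append(itemset)
        return

    # Heuristic from the slide: increasing order of support
    # (ties broken by item name to keep the run deterministic).
    combine.sort(key=lambda pair: (len(pair[1]), pair[0]))

    for i, (x, new_tids) in enumerate(combine):
        remaining = [y for y, _ in combine[i + 1:]]
        # Largest itemset this branch could possibly reach.
        reach = itemset | {x} | set(remaining)
        if any(reach <= m for m in mfi):
            # This branch and all later ones (which can only reach
            # subsets of `reach`) are covered by a known maximal itemset.
            return
        backtrack(itemset | {x}, new_tids, remaining, mfi)


def mine_mfi():
    all_tids = set().union(*TIDSETS.values())
    mfi = []
    backtrack(frozenset(), all_tids, sorted(TIDSETS), mfi)
    return mfi


MFI = mine_mfi()
print(sorted("".join(sorted(m)) for m in MFI))  # ['ABC', 'D']
```

On this input the low-support item D is tried first and its branch ends immediately, because D has an empty combine-set. Once ABC has been found, the remaining branches are cut by the superset check without any further tidset intersections.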

  9. Bottlenecks • 1. Superset checking: the best algorithms for superset checking give an amortized bound of […] per operation, which is bad if we have many itemsets in the MFI. • 2. Frequency testing: how can we make frequency testing faster?

  10. Optimizing Superset Checking • A technique called “Progressive Focusing” is used to narrow down the group of potential supersets as the recursive calls are made • LMFI := Local MFI • Before each recursive call, we construct the LMFI for the next call, based on the current LMFI and the new item added.
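One plausible reading of this construction, sketched below, is that the LMFI for the next call keeps only those members of the current LMFI that contain the newly added item. The slide does not spell out the rule, so treat the filter as an assumption; the function names and the sample itemsets are hypothetical.

```python
def project_lmfi(lmfi, item):
    """LMFI for the next call: members of the current LMFI containing `item`."""
    return [m for m in lmfi if item in m]


def has_superset(itemset, lmfi):
    """Superset check restricted to the (usually much smaller) local list."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets found so far.
mfi = [frozenset("ABCD"), frozenset("ABEF"), frozenset("CDG"), frozenset("BEG")]

# Descend along the path A -> AB -> ABE, narrowing the list at each step.
lmfi_a = project_lmfi(mfi, "A")        # ABCD, ABEF
lmfi_ab = project_lmfi(lmfi_a, "B")    # ABCD, ABEF
lmfi_abe = project_lmfi(lmfi_ab, "E")  # ABEF

print(len(mfi), len(lmfi_a), len(lmfi_ab), len(lmfi_abe))  # 4 2 2 1
print(has_superset(frozenset("ABE"), lmfi_abe))            # True
```

After projecting on every item of the current itemset, each surviving member already contains that itemset, so a superset check deeper in the recursion only has to look at the few remaining candidates instead of the whole MFI.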

  11. LMFI Example FGHI FGHJ … FGH FGI … FG …

  13. Frequency Testing Optimization • GenMax uses a “vertical database format”: for each item, we store the set of all transactions containing that item. • This set is called a tidset (transaction ID set). • This format makes support computations easier, because we don’t have to go over the entire database.

  14. Vertical Database • A: {1, 3, 4, 5} • B: {1, 3, 4, 6} • C: {1, 2, 3, 4, 7} • D: {2, 4, 6} • t(A) = {1, 3, 4, 5} • t(AC) = t(A) ∩ t(C) = {1, 3, 4} • supp(I) = |t(I)|
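The support computation on this slide amounts to a set intersection. A minimal sketch using the slide's own tidsets; the helper names `tidset` and `supp` are chosen here for illustration.

```python
# Vertical database from the slide: item -> tidset.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def tidset(itemset):
    """t(I): intersection of the tidsets of the items in a non-empty itemset."""
    return set.intersection(*(TIDSETS[item] for item in itemset))


def supp(itemset):
    """supp(I) = |t(I)|"""
    return len(tidset(itemset))


print(sorted(tidset("A")))   # [1, 3, 4, 5]
print(sorted(tidset("AC")))  # [1, 3, 4]
print(supp("AC"))            # 3
```

Only the tidsets of the items involved are touched, which is why no pass over the full transaction list is needed.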

  15. ABC ABD ABE • AB … = {C, E} • t(ABC), t(ABE) • Each item y in the combine-set of an itemset I actually represents the itemset I ∪ {y}, and stores the tidset associated with it.

  16. Additional Optimization • Diffsets: don’t store the entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”)
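A small sketch of the diffset idea on the tidsets from the Vertical Database slide. The slide gives no formulas, so the recurrences in the comments are the commonly cited diffset formulation and should be read as this sketch's assumption; the code checks them against plain tidset intersection.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}

# First level, starting from tidsets:
#   d(XY) = t(X) - t(Y)   and   supp(XY) = supp(X) - |d(XY)|
supp_a = len(TIDSETS["A"])             # 4
d_ab = TIDSETS["A"] - TIDSETS["B"]     # {5}
d_ad = TIDSETS["A"] - TIDSETS["D"]     # {1, 3, 5}
supp_ab = supp_a - len(d_ab)           # 3
supp_ad = supp_a - len(d_ad)           # 1

# Deeper levels, using diffsets only:
#   d(PXY) = d(PY) - d(PX)   and   supp(PXY) = supp(PX) - |d(PXY)|
d_abd = d_ad - d_ab                    # {1, 3}
supp_abd = supp_ab - len(d_abd)        # 1

# Cross-check against ordinary tidset intersection.
t_abd = TIDSETS["A"] & TIDSETS["B"] & TIDSETS["D"]  # {4}
print(supp_ab, supp_ad, supp_abd, len(t_abd))       # 3 1 1 1
```

The diffsets stored here ({5}, {1, 3, 5}, {1, 3}) hold only the transactions lost at each step, which tends to save the most space when the tidsets are long and overlap heavily.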

  17. Experimental Results • GenMax is compared with MaxMiner, MAFIA, and MAFIA-PP • MaxMiner and MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI • The databases used in the experiments are grouped according to their MFI length distribution

  18. Type I Datasets

  19. Type II Datasets

  20. Type III Datasets

  21. Type IV Datasets

  22. The End
