
LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets

Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 1/Nov/2004, Frequent Itemset Mining Implementations '04.





Presentation Transcript


  1. LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets. Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 1/Nov/2004, Frequent Itemset Mining Implementations '04

  2. Our approach / Typical approach / Summary

  3. Frequent Itemset Mining • Almost all computation time is spent on frequency counting ⇒ How to reduce • the number of frequent itemsets (#FI) to be checked • the cost of frequency counting

  4. Hypercube Decomposition [from ver.1] • Reduce the number of FI's to be checked: 1. Decompose the set of all FI's into hypercubes, each of which is included in an equivalence class. 2. Enumerate the maximal and minimal itemsets of each hypercube (with frequency counting). 3. Generate the other FI's between the maximal and minimal itemsets (without frequency counting), as in the sketch below. Efficient when the support is small.
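A minimal Python sketch of step 3: generating every itemset between the minimal and maximal itemsets of one hypercube without further frequency counting, since all of them share the same denotation. The function name and data layout are illustrative, not LCM's actual code.

```python
from itertools import combinations

def hypercube_itemsets(minimal, maximal):
    """Enumerate every itemset between `minimal` and `maximal` (inclusive).

    Inside one hypercube all of these itemsets have the same denotation,
    so their frequency equals that of `minimal` and no counting is needed.
    """
    free = sorted(set(maximal) - set(minimal))     # items we may add freely
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield tuple(sorted(set(minimal) | set(extra)))

# Tiny example: minimal {1,2}, maximal {1,2,3,4} gives the 4 itemsets
# (1,2), (1,2,3), (1,2,4), (1,2,3,4), all with the support of {1,2}.
for itemset in hypercube_itemsets({1, 2}, {1, 2, 3, 4}):
    print(itemset)
```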

  5. Occurrence Deliver [ver.1] • Compute the denotations of P ∪ {i} for all i's at once, by transposing the trimmed database. • The trimmed database is composed of: the items to be added, and the transactions including P. (Figure: itemset {1,2} has denotation {A,B,C}; one scan of the trimmed database yields the denotations of {1,2,3}, {1,2,4}, and {1,2,5}.) • Linear time in the size of the trimmed database. Efficient for sparse datasets.
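A small Python sketch of occurrence deliver (illustrative only, not LCM's C implementation): one linear pass over the transactions containing P builds the denotations of P ∪ {i} for every addable item i at once. The toy trimmed database below is an assumption in the spirit of the slide's figure.

```python
from collections import defaultdict

def occurrence_deliver(occurrences, database, last_item):
    """Denotations of P ∪ {i} for all items i > last_item, in one pass.

    `occurrences` is the denotation of the current itemset P (the IDs of
    the transactions containing P); `database` maps transaction ID to a
    sorted item list.  The work is linear in the trimmed database size.
    """
    occ = defaultdict(list)
    for tid in occurrences:
        for item in database[tid]:
            if item > last_item:          # only items eligible to extend P
                occ[item].append(tid)
    return dict(occ)

# P = {1,2} has denotation {A, B, C}; the trimmed database keeps only
# those transactions and the items that may still be added.
database = {"A": [1, 2, 3, 4, 5], "B": [1, 2, 3, 4], "C": [1, 2, 3]}
print(occurrence_deliver(["A", "B", "C"], database, last_item=2))
# -> {3: ['A', 'B', 'C'], 4: ['A', 'B'], 5: ['A']}
```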

  6. Loss of Occurrence Deliver [new] • Avoiding the frequency counting of infrequent itemsets P ∪ {e} has been considered important. • However, in our experiments the computation time spent on such itemsets is about 1/3 of the total cost on average (if items are sorted by their frequency, i.e., by the size of their occurrence lists). • Occurrence deliver has the advantage of its simple structure. (Figure: an example database with the occurrence lists delivered to items 3–9.)

  7. Anytime Database Reduction [new] • Database reduction: reduce the database [fp-growth, etc.]: ◆ remove an item e if e is included in fewer than θ transactions or in all transactions; ◆ merge identical transactions into one. • Anytime database reduction: recursively apply trimming and this reduction inside the recursion, so the database becomes small at the lower levels of the recursion. In the recursion tree the lower-level iterations are exponentially more numerous than the upper-level ones ⇒ very efficient. (A sketch of one reduction round follows.)
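A plain-Python sketch of one round of the reduction (the names and the weighted-transaction representation are illustrative assumptions, not LCM's internal format): drop items below the threshold θ or present in every transaction, then merge identical transactions into weighted ones.

```python
from collections import Counter

def reduce_database(transactions, theta):
    """One round of (anytime) database reduction.

    Removes every item that occurs in fewer than `theta` transactions or
    in all of them, then merges identical reduced transactions; the weight
    of a merged transaction is the number of originals it stands for.
    LCM applies this recursively, where the database is already small.
    """
    n = len(transactions)
    support = Counter(item for t in transactions for item in set(t))
    keep = {item for item, s in support.items() if theta <= s < n}

    merged = Counter()
    for t in transactions:
        reduced = tuple(sorted(set(t) & keep))
        if reduced:
            merged[reduced] += 1          # weight = number of merged copies
    return merged

# Items 1 and 2 appear in every transaction and item 9 is infrequent, so
# all three are dropped; the two transactions that become identical merge.
db = [[1, 2, 3], [1, 2, 3, 9], [1, 2, 4, 5], [1, 2, 4, 5]]
print(reduce_database(db, theta=2))   # -> Counter({(3,): 2, (4, 5): 2})
```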

  8. Example of Anytime D. R. [new] trim → anytime database reduction → trim → anytime database reduction → …

  9. Array (reduced) vs. Trie (FP-tree) [new] • Initialization is fast (LCM: O(||T||); Trie: O(|T| log |T| + ||T||)). • A Trie can compress the trimmed database [fp-growth, etc.]. • In experiments on the FIMI instances we measured the average compression ratio of the Trie for the trimmed database over all iterations: the number of items (cells) in the Tries is 1/2 on average and 1/6 at minimum (dense cases). • But if the Trie is built as a binary tree, it needs at least 3 pointers per item (roughly 4 words per Trie cell against 1 per array cell), so memory use (and computation time) becomes about twice that of the array on average, 2/3 of it at minimum.

  10. Results

  11. Closed Itemset Mining • How to: • avoid (prune) non-closed itemsets? (existing pruning is not complete) • perform the closure operation quickly? • save memory? (existing approaches use much memory)

  12. Prefix Preserving Closure Extension [ver.1] • Prefix-preserving closure extension (PPC-extension) is a variation of closure extension. Def. The closure tail of a closed itemset P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P. Def. H = closure(P ∪ {i}) (a closure extension of P) is a PPC-extension of P ⇔ i > closure tail and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}. • No duplication occurs in a depth-first search: any closed itemset H is generated from a unique closed itemset by PPC-extension (namely, from closure(H ∩ {1,…,i-1})). A runnable sketch follows the example on the next slide.

  13. Example of PPC-extension [ver.1] Database T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }. (Figure: the closure extensions form an acyclic graph over the closed itemsets φ, {2}, {7,9}, {1,7,9}, {2,7,9}, {2,5}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}, while the PPC-extensions form a tree over the same itemsets.)
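A runnable Python sketch of PPC-extension (an illustration of the definition on the previous slide, not LCM's actual implementation), applied to the database of this example; it lists the PPC-extensions of the root closed itemset φ.

```python
def closure(itemset, transactions):
    """Closure of `itemset`: the intersection of all transactions containing
    it (callers must ensure at least one transaction contains it)."""
    occ = [set(t) for t in transactions if itemset <= t]
    return set.intersection(*occ)

def ppc_extensions(P, core_i, transactions, items, minsup=1):
    """Yield the PPC-extensions of the closed itemset P.

    `core_i` plays the role of the closure tail of P.  H = closure(P ∪ {i})
    is a PPC-extension iff i > core_i, P ∪ {i} is frequent, and H agrees
    with P on every item smaller than i; each closed itemset has exactly
    one such parent, so a depth-first search visits it exactly once.
    """
    for i in sorted(items):
        if i <= core_i or i in P:
            continue
        Q = P | {i}
        if sum(1 for t in transactions if Q <= t) < minsup:
            continue                          # infrequent, never extended
        H = closure(Q, transactions)
        if {x for x in H if x < i} == {x for x in P if x < i}:
            yield i, H                        # LCM recurses on (H, i) here

# The database of this slide; closure(φ) = φ, so the root is φ with core 0.
T = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9},
     {1, 7, 9}, {2, 7, 9}, {2}]
for i, H in ppc_extensions(set(), 0, T, items=range(1, 10)):
    print(i, sorted(H))
# -> 1 [1, 7, 9]
#    2 [2]
#    7 [7, 9]
```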

  14. Results

  15. Maximal Frequent Itemset Mining • How to: • avoid (prune) non-maximal itemsets? • check maximality quickly? • save memory? (existing maximality checks and prunings use much memory)

  16. Backtracking-based Pruning [new] • During the backtracking algorithm for FI mining, let K be the current itemset and H a known MFI including K. • Re-sort the remaining items so that the items of H are located at the end. • Then no new MFI can ever be found in the recursive calls w.r.t. items in H ⇒ omit such recursive calls (recursive calls are made only for the items outside H). We can avoid very many non-MFI's this way; see the sketch below.
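A minimal Python sketch of the pruning decision (the item numbers and the helper name are illustrative, not taken from LCM's code): given the candidate items of the current itemset K and one known MFI H ⊇ K, re-sort so the items of H come last and recurse only on the others. A branch on an item of H could only add further items of H, so everything it finds lies inside H and is never a new MFI.

```python
def prune_order(candidates, H):
    """Backtracking-based pruning for MFI mining.

    Returns the candidate items re-sorted so that items of the known MFI H
    come last, together with the number of items that still need a
    recursive call (only those outside H); items of H remain available as
    later additions inside the surviving branches.
    """
    outside = [i for i in candidates if i not in H]   # recurse on these
    inside = [i for i in candidates if i in H]        # no recursive call
    return outside + inside, len(outside)

# Current itemset K with candidates 4..10, and a known MFI H ⊇ K.
order, n_calls = prune_order([4, 5, 6, 7, 8, 9, 10], H={1, 2, 3, 6, 8, 10})
print(order)      # -> [4, 5, 7, 9, 6, 8, 10]
print(n_calls)    # -> 4 recursive calls instead of 7
```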

  17. Fast Maximality Check (CI, MFI) [new] • To reduce the computation cost of the maximality check and the closedness check, we use further database reduction. • During anytime database reduction we keep: ◆ the intersection of the merged transactions, for the closure operation; ◆ the sum of the merged transactions, as a weighted transaction database, for the maximality check. • The closure is the intersection of the occurrences, and the frequencies of the one-item-larger itemsets are sums of transaction weights in the trimmed database. Using these reduced databases, the computation time becomes short (no more than that of frequency counting); a sketch follows.
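A Python sketch of the extra bookkeeping (the data layout is an assumption for illustration, not LCM's internal format): when reduced transactions become identical and are merged, keep the intersection of the originals for the closedness check and their count as a weight, so the support of a one-item-larger itemset is just a sum of weights.

```python
from collections import defaultdict

def merge_transactions(reduced_pairs):
    """Merge transactions that became identical after reduction.

    `reduced_pairs` is a list of (reduced_transaction, original_transaction)
    pairs.  For each group of identical reduced transactions we keep
      - the intersection of the originals, used by the closure operation,
      - the group size as a weight, so frequencies of larger itemsets are
        sums of weights (what the maximality check needs).
    """
    groups = defaultdict(list)
    for reduced, original in reduced_pairs:
        groups[frozenset(reduced)].append(set(original))

    merged = {}
    for reduced, originals in groups.items():
        intersection = set.intersection(*originals)   # for closedness check
        weight = len(originals)                       # for support sums
        merged[reduced] = (intersection, weight)
    return merged

# Two transactions collapse to {2,3} after reduction: keep their
# intersection {1,2,3} for the closure and weight 2 for support counting.
pairs = [({2, 3}, {1, 2, 3, 9}), ({2, 3}, {1, 2, 3, 5}), ({2, 4}, {1, 2, 4})]
for reduced, (inter, w) in merge_transactions(pairs).items():
    print(sorted(reduced), sorted(inter), w)
# -> [2, 3] [1, 2, 3] 2
#    [2, 4] [1, 2, 4] 1
```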

  18. Results

  19. Experimental Results CPU, memory, OS: AMD Athlon XP 1600+, 224MB, Linux. Compared with: FP-growth, afopt, Mafia, PatriciaMine, kDCI (all of which scored highly at the FIMI'03 competition), on 13 datasets from the FIMI repository. • Fast at large supports for all instances of FI, CI, MFI • Fast for all instances of CI (except Accidents) • Fast for all sparse datasets of FI, CI, MFI • Slow only for Accidents and T40I10D100K on FI and MFI, and Pumsb* on MFI

  20. Summary of Results (large supports / small supports)

  21. Results

  22. Conclusion • When equivalence classes are large, PPC-extension and hypercube decomposition work well. • Anytime database reduction and occurrence deliver have advantages in initialization, in sparse cases, and in simplicity, compared to Trie and down-project approaches. • Backtracking-based pruning saves memory. • More database reduction works well, as much as memory-storing approaches do.

  23. Future Work • LCM is weak at MFI mining and on dense datasets • More efficient pruning for MFI • New data structures for dense cases • Fast radix sort for anytime database reduction • I/O optimization?

  24. List of Datasets Machine learning benchmarks: ・Chess ・Mushroom ・Pumsb ・Pumsb* ・Connect Artificial datasets: ・T10I4D100K ・T40I10D100K Real datasets: ・BMS-WebView-1 ・BMS-WebView-2 ・BMS-POS ・Retail ・Kosarak ・Accidents
