LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining. Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 20/Aug/2005, Open Source Data Mining ’05.
Computation of Pattern Mining

For frequent itemsets and closed itemsets, the enumeration methods are almost optimal:

  TIME = (time of one iteration) × (#iterations) + I/O

• #iterations is not much larger than #solutions (linearly bounded!) → linear time in #solutions, polynomial delay
• The time of one iteration is determined by coding techniques: frequency counting, data structure reconstruction, closure operation, pruning, ...

Goal: clarify the features of enumeration algorithms and of real-world data sets. For which cases (parameters) is which technique good? “Theoretical intuitions/evidences” are important. We focus on the data structure and the computation done in one iteration.
Motivation

Several data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting). Each has its own advantages and disadvantages:

1. Bitmap. Good: dense data, large support. Bad: sparse data, small support
2. Prefix Tree. Good: non-sparse data, structured data. Bad: sparse data, non-structured data
3. Array List (with deletion of duplications). Good: non-dense data. Bad: very dense data

Real datasets have both a dense part and a sparse part. How can we fit?
Observations

• Usually, databases satisfy a power law: the part of the few most frequent items is dense, and the rest is very sparse
• Using reduced conditional databases, the database is very small in almost all iterations (it shrinks as the recursion depth grows)

Quick operations on small databases are very efficient.
Idea of Combination

Use a bitmap and array lists for the dense and sparse parts, and a prefix tree of constant size for frequency counting:
• Choose a constant c
• Let F = the c items of largest frequency
• Split each transaction T into two parts: a dense part composed of the items in F, and a sparse part composed of the items not in F
• Store the dense part by bitmap, and the sparse part by array list

We can take all their advantages.
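The splitting step above can be sketched in a few lines of Python. This is an illustrative sketch, not the LCM3 source: the function name `encode`, the (bitmap, sorted array) pair representation, and the sample transactions are all assumptions.

```python
from collections import Counter

def encode(transactions, c):
    # Rank items by frequency; the c most frequent form the dense set F.
    freq = Counter(i for t in transactions for i in t)
    dense_items = [item for item, _ in freq.most_common(c)]
    rank = {item: pos for pos, item in enumerate(dense_items)}
    encoded = []
    for t in transactions:
        bits = 0        # dense part: one bit per item of F
        sparse = []     # sparse part: a plain sorted array
        for item in t:
            if item in rank:
                bits |= 1 << rank[item]
            else:
                sparse.append(item)
        encoded.append((bits, sorted(sparse)))
    return dense_items, encoded

# Four toy transactions; with c = 2 the dense items are c and a.
F, enc = encode([['a', 'b', 'c'], ['a', 'c', 'd'], ['c', 'd'], ['c', 'x']], 2)
```

Each transaction thus becomes one machine word for its dense part plus a short array for its sparse tail, so the bitmap techniques apply to the dense half and the array-list techniques to the other.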
Complete Prefix Tree

(figure: a prefix tree over items a, b, c, d)

We use a complete prefix tree: a prefix tree including all patterns.
Complete Prefix Tree

(figure: the tree of all 16 bitmaps 0000–1111 over four items)

We use a complete prefix tree: a prefix tree including all patterns. The parent of a pattern is obtained by clearing its highest bit (e.g. 0101 → 0001, 1000 → 0000, 0110 → 0010), so no pointer is needed. We construct the complete prefix tree for the dense parts of the transactions; if c is small, its size 2^c is not huge. Ex) transactions {a,b,c}, {a,c,d}, {c,d}
Any prefix tree is a subtree of the complete prefix tree. Ex) transactions {a,b,c,d}, {a}, {a,d}
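The pointer-free parent rule can be sketched directly on bitmaps (a minimal illustration; the function name `parent` and the Python encoding are assumptions, not LCM's code):

```python
def parent(pattern: int) -> int:
    # On the complete prefix tree, the parent of any nonzero pattern is
    # the pattern with its highest set bit cleared, so the tree needs no
    # parent pointers at all.
    assert pattern != 0, "the root (empty pattern) has no parent"
    highest = 1 << (pattern.bit_length() - 1)
    return pattern & ~highest

assert parent(0b0101) == 0b0001
assert parent(0b1000) == 0b0000
assert parent(0b0110) == 0b0010
```

Since the parent is a pure bit operation on the node's own index, the whole tree can be stored as a flat array of 2^c cells with no links.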
Frequency Counting

• Frequency of a pattern (vertex) = # of its descendant leaves
• Occurrences obtained by adding item i = the patterns whose i-th bit is 1

A bottom-up sweep is good: it runs in time linear in the size of the prefix tree.
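The bottom-up sweep can be sketched as follows. This is a hedged reconstruction from the slide, not the LCM3 code: `item_frequencies`, the flat cell array indexed by bitmap value, and the weight representation are assumptions. It computes, for each of the c dense items, the number of transactions whose dense part contains that item, in one descending pass over the 2^c cells.

```python
def item_frequencies(dense_bitmaps, c):
    size = 1 << c
    acc = [0] * size
    for b in dense_bitmaps:           # weight of each cell = multiplicity
        acc[b] += 1
    freq = [0] * c
    for v in range(size - 1, 0, -1):  # children of v are numerically
        h = v.bit_length() - 1        #   larger, so acc[v] is final here
        freq[h] += acc[v]             # v is the unique ancestor with
                                      #   highest bit h of its descendants
        acc[v & ~(1 << h)] += acc[v]  # push subtree weight to the parent
    return freq

# transactions {c,a}, {c,a}, {c}, {c} with bit 0 = c, bit 1 = a:
assert item_frequencies([0b11, 0b11, 0b01, 0b01], 2) == [4, 2]
```

The pass touches each cell exactly once, so it is linear in the size of the complete prefix tree, as claimed above.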
“Constant Size” Dominates

• How many iterations take a “constant size database” as input?

The small iterations dominate the computation time, and the strategy change is not a heavy task.
More Advantages

• Reconstruction of prefix trees is a heavy task → the complete prefix tree needs no reconstruction
• Coding prefix trees is not easy → the complete prefix tree is easy to code
• The radix sort used for detecting identical transactions is heavy when the data is dense → bitmaps for the dense parts accelerate the radix sort
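The duplicate-detection step can be illustrated as follows. This is an assumption-laden sketch: the real LCM uses radix sort over item lists, while this Python version uses the built-in sort with the dense bitmap as a single integer key, which is the point of the acceleration; the name `merge_identical` and the (bitmap, sparse, weight) triples are illustrative.

```python
def merge_identical(encoded):
    # encoded: list of (dense bitmap, sparse item list) pairs. The dense
    # part compares as one integer key, so dense transactions collapse
    # into a single-word comparison instead of item-by-item scanning.
    encoded = sorted(encoded, key=lambda t: (t[0], t[1]))
    merged = []                        # list of (bitmap, sparse, weight)
    for bits, sparse in encoded:
        if merged and merged[-1][0] == bits and merged[-1][1] == sparse:
            # Identical transaction: bump its weight instead of storing it.
            merged[-1] = (bits, sparse, merged[-1][2] + 1)
        else:
            merged.append((bits, sparse, 1))
    return merged
```

For example, `merge_identical([(3, ['d']), (1, ['d']), (3, ['d'])])` groups the two identical transactions into one entry of weight 2.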
For Closed/Maximal Itemsets

• Compute the closure/maximality by storing the previously obtained itemsets → no additional function is needed
• Depth-first search (closure extension type) → we need the prefix of each itemset

By taking the intersection / weighted union of the prefixes at each node of the prefix tree, we can compute them efficiently (from LCM v2).
Experiments

We applied the data structure to LCM2.
CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
Compared with: FP-growth, afopt, Mafia, PatriciaMine, kDCI, nonordfp, aim2, DCI-closed (all of which marked high scores at the FIMI04 competition)
Datasets: 14 datasets from the FIMI repository
Memory usage decreased to half for dense datasets, but not for sparse datasets.
Discussion and Future Work

• The combination of bitmaps and array lists reduces memory space efficiently for dense datasets
• Using a prefix tree over a constant number of items is sufficient to speed up frequency counting on non-sparse datasets
• The data structure is orthogonal to the other methods for closed/maximal itemset mining: maximality check, pruning, closure operations, etc.

Future work: other pattern mining problems
• Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structure admits huge variations that are hard to represent by bits and hard to share)
• Simplify the techniques so that they can be applied easily
• Stable memory allocation (no need for dynamic allocation)