LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining. Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 20/Aug/2005, Open Source Data Mining ’05.
Computation of Pattern Mining

For frequent itemsets and closed itemsets, the enumeration methods are almost optimal:

  TIME = (time of one iteration) × (#iterations) + I/O

• #iterations is not much larger than #solutions (linearly bounded!) → linear time in #solutions, polynomial delay
• The time of one iteration is determined by coding techniques: frequency counting, data structure reconstruction, closure operation, pruning, ...

Goal: clarify the features of enumeration algorithms and of real-world data sets. For which cases (parameters) is which technique good? “Theoretical intuitions/evidences” are important. We focus on the data structure and the computation done in one iteration.
Motivation

Several data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting). Each has its own advantages and disadvantages:

1. Bitmap. Good: dense data, large support. Bad: sparse data, small support
2. Prefix Tree. Good: non-sparse data, structured data. Bad: sparse data, non-structured data
3. Array List (with deletion of duplications). Good: non-dense data. Bad: very dense data

Real datasets have both a dense part and a sparse part. How can we fit?
Observations

• Usually, databases satisfy a power law: the part of the few most frequent items is dense, and the rest is very sparse
• Using reduced conditional databases, the database is very small in almost all iterations (it shrinks as the recursion depth grows)

Quick operations on small databases are very efficient.
Idea of Combination

Use a bitmap and array lists for the dense and sparse parts, and a prefix tree of constant size for frequency counting:
• Choose a constant c
• Let F = the c items of largest frequency
• Split each transaction T into two parts: a dense part composed of the items in F, and a sparse part composed of the items not in F
• Store the dense part by bitmap, and the sparse part by array list

We can take all their advantages.
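The splitting step above can be sketched in a few lines of Python. This is an illustrative sketch, not the LCM3 source: the function name `encode`, the (bitmap, sorted array) pair representation, and the sample transactions are all assumptions.

```python
from collections import Counter

def encode(transactions, c):
    # Rank items by frequency; the c most frequent form the dense set F.
    freq = Counter(i for t in transactions for i in t)
    dense_items = [item for item, _ in freq.most_common(c)]
    rank = {item: pos for pos, item in enumerate(dense_items)}
    encoded = []
    for t in transactions:
        bits = 0        # dense part: one bit per item of F
        sparse = []     # sparse part: a plain sorted array
        for item in t:
            if item in rank:
                bits |= 1 << rank[item]
            else:
                sparse.append(item)
        encoded.append((bits, sorted(sparse)))
    return dense_items, encoded

# Four toy transactions; with c = 2 the dense items are c and a.
F, enc = encode([['a', 'b', 'c'], ['a', 'c', 'd'], ['c', 'd'], ['c', 'x']], 2)
```

Each transaction thus becomes one machine word for its dense part plus a short array for its sparse tail, so the bitmap techniques apply to the dense half and the array-list techniques to the other.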
Complete Prefix Tree

(figure: a prefix tree over items a, b, c, d)

We use a complete prefix tree: a prefix tree including all patterns.
Complete Prefix Tree

(figure: the tree of all 16 bitmaps 0000–1111 over four items)

We use a complete prefix tree: a prefix tree including all patterns. The parent of a pattern is obtained by clearing its highest bit (e.g. 0101 → 0001, 1000 → 0000, 0110 → 0010), so no pointer is needed. We construct the complete prefix tree for the dense parts of the transactions; if c is small, its size 2^c is not huge. Ex) transactions {a,b,c}, {a,c,d}, {c,d}
Any prefix tree is a subtree of the complete prefix tree. Ex) transactions {a,b,c,d}, {a}, {a,d}
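The pointer-free parent rule can be sketched directly on bitmaps (a minimal illustration; the function name `parent` and the Python encoding are assumptions, not LCM's code):

```python
def parent(pattern: int) -> int:
    # On the complete prefix tree, the parent of any nonzero pattern is
    # the pattern with its highest set bit cleared, so the tree needs no
    # parent pointers at all.
    assert pattern != 0, "the root (empty pattern) has no parent"
    highest = 1 << (pattern.bit_length() - 1)
    return pattern & ~highest

assert parent(0b0101) == 0b0001
assert parent(0b1000) == 0b0000
assert parent(0b0110) == 0b0010
```

Since the parent is a pure bit operation on the node's own index, the whole tree can be stored as a flat array of 2^c cells with no links.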
Frequency Counting

• Frequency of a pattern (vertex) = # of its descendant leaves
• Occurrences obtained by adding item i = the patterns whose i-th bit is 1

A bottom-up sweep is good: it runs in time linear in the size of the prefix tree.
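The bottom-up sweep can be sketched as follows. This is a hedged reconstruction from the slide, not the LCM3 code: `item_frequencies`, the flat cell array indexed by bitmap value, and the weight representation are assumptions. It computes, for each of the c dense items, the number of transactions whose dense part contains that item, in one descending pass over the 2^c cells.

```python
def item_frequencies(dense_bitmaps, c):
    size = 1 << c
    acc = [0] * size
    for b in dense_bitmaps:           # weight of each cell = multiplicity
        acc[b] += 1
    freq = [0] * c
    for v in range(size - 1, 0, -1):  # children of v are numerically
        h = v.bit_length() - 1        #   larger, so acc[v] is final here
        freq[h] += acc[v]             # v is the unique ancestor with
                                      #   highest bit h of its descendants
        acc[v & ~(1 << h)] += acc[v]  # push subtree weight to the parent
    return freq

# transactions {c,a}, {c,a}, {c}, {c} with bit 0 = c, bit 1 = a:
assert item_frequencies([0b11, 0b11, 0b01, 0b01], 2) == [4, 2]
```

The pass touches each cell exactly once, so it is linear in the size of the complete prefix tree, as claimed above.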
“Constant Size” Dominates

• How many iterations take a “constant size database” as input?

The small iterations dominate the computation time, and the strategy change is not a heavy task.
More Advantages

• Reconstruction of prefix trees is a heavy task → the complete prefix tree needs no reconstruction
• Coding prefix trees is not easy → the complete prefix tree is easy to code
• The radix sort used for detecting identical transactions is heavy when the data is dense → bitmaps for the dense parts accelerate the radix sort
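The duplicate-detection step can be illustrated as follows. This is an assumption-laden sketch: the real LCM uses radix sort over item lists, while this Python version uses the built-in sort with the dense bitmap as a single integer key, which is the point of the acceleration; the name `merge_identical` and the (bitmap, sparse, weight) triples are illustrative.

```python
def merge_identical(encoded):
    # encoded: list of (dense bitmap, sparse item list) pairs. The dense
    # part compares as one integer key, so dense transactions collapse
    # into a single-word comparison instead of item-by-item scanning.
    encoded = sorted(encoded, key=lambda t: (t[0], t[1]))
    merged = []                        # list of (bitmap, sparse, weight)
    for bits, sparse in encoded:
        if merged and merged[-1][0] == bits and merged[-1][1] == sparse:
            # Identical transaction: bump its weight instead of storing it.
            merged[-1] = (bits, sparse, merged[-1][2] + 1)
        else:
            merged.append((bits, sparse, 1))
    return merged
```

For example, `merge_identical([(3, ['d']), (1, ['d']), (3, ['d'])])` groups the two identical transactions into one entry of weight 2.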
For Closed/Maximal Itemsets

• Compute the closure/maximality by storing the previously obtained itemsets → no additional function is needed
• Depth-first search (closure extension type) → we need the prefix of each itemset

By taking the intersection / weighted union of the prefixes at each node of the prefix tree, we can compute them efficiently (from LCM v2).
Experiments

We applied the data structure to LCM2.
CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
Compared with: FP-growth, afopt, Mafia, PatriciaMine, kDCI, nonordfp, aim2, DCI-closed (all of which marked high scores at the FIMI04 competition)
Datasets: 14 datasets from the FIMI repository
Memory usage decreased to half for dense datasets, but not for sparse datasets.
Discussion and Future Work

• The combination of bitmaps and array lists reduces memory space efficiently for dense datasets
• Using a prefix tree over a constant number of items is sufficient to speed up frequency counting on non-sparse datasets
• The data structure is orthogonal to the other methods for closed/maximal itemset mining: maximality check, pruning, closure operations, etc.

Future work: other pattern mining problems
• Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structure admits huge variations that are hard to represent by bits and hard to share)
• Simplify the techniques so that they can be applied easily
• Stable memory allocation (no need for dynamic allocation)