Verifying and Mining Frequent Patterns from Large Windows over Data Streams

Verifying and Mining Frequent Patterns from Large Windows over Data Streams • Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo • ICDE 2008

Outline • Introduction and motivation • SWIM algorithm • DTV and DFV algorithms • Experiments • Conclusion

Introduction and motivation • Conditional counting • Verifiers: DTV and DFV verify the frequency of previously frequent itemsets over newly arriving windows • Fast verifiers applied to incremental frequent itemset mining: Sliding Window Incremental Miner (SWIM)

SWIM algorithm • The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency in the whole window is not known, since the pattern wasn't frequent in the previous n-1 slides • W: window • PT (pattern tree): a superset of the frequent patterns over W • aux_array: stores the frequency of a pattern for each window in which its frequency is still unknown • p.fi: the frequency of p in the i-th slide • p.freq: p's cumulative frequency in the current window

SWIM algorithm (cont.) • Example: slides … S2 S3 S4 S5 S6 S7 …, windows W4 W5 W6 W7 • W4: aux_array = <p.f4, p.f4>, p.freq = p.f4 • W5: aux_array = <p.f2+p.f4, p.f4+p.f5>, p.freq = p.f4+p.f5 • W6: aux_array = <p.f2+p.f3+p.f4, p.f3+p.f4+p.f5>, p.freq = p.f4+p.f5+p.f6 • W7: p.freq = p.f5+p.f6+p.f7

Analysis of SWIM algorithm • Delay: a pattern is reported late when its frequency over the whole window turns out to be larger than the minimum support only after the counts for the older slides are filled in • Maximum delay: n-1 slides (n: number of slides per window) • Bottleneck: counting frequencies of itemsets over a given dataset (delay = L, n-L+1 slides)

Conditional counting • Goal: verify counts for a given set of patterns • 1. Return p's true frequency in D if p has occurred at least min_freq times • 2. Otherwise, report that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is less than min_freq)

Conditional counting (cont.) • Verification • Given a set of transactions T, a set of patterns P and a threshold s • Goal: find the exact frequency of each p ∈ P w.r.t. T iff its frequency is ≥ s • If s = 0, verification = counting, but if s > 0 extra computation can be avoided • Proposed fast verifiers • DTV, DFV, hybrid
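To make the verification contract concrete, here is a deliberately naive Python baseline. It is not DTV or DFV: it scans the transactions once per pattern and abandons a pattern as soon as the threshold can no longer be reached. The function name `verify_naive` and the sample data are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Sequence

Itemset = FrozenSet[str]


def verify_naive(
    transactions: Sequence[Itemset], patterns: Iterable[Itemset], s: int
) -> Dict[Itemset, Optional[int]]:
    """Conditional counting by brute force.

    For each pattern, returns its exact frequency if that frequency is >= s,
    and None otherwise (meaning only "below s"; the exact count is not computed).
    """
    total = len(transactions)
    result: Dict[Itemset, Optional[int]] = {}
    for p in patterns:
        count = 0
        abandoned = False
        for seen, t in enumerate(transactions, start=1):
            if p <= t:  # pattern is a subset of the transaction
                count += 1
            # Early abandon: even if every remaining transaction matched,
            # the pattern could not reach s. This never fires when s = 0,
            # so verification degenerates to plain counting in that case.
            if count + (total - seen) < s:
                abandoned = True
                break
        result[p] = None if abandoned or count < s else count
    return result


sample_T = [frozenset("ab"), frozenset("abc"), frozenset("ac"),
            frozenset("b"), frozenset("abd")]
sample_P = [frozenset("ab"), frozenset("ac"), frozenset("d")]

print(verify_naive(sample_T, sample_P, 3))
print(verify_naive(sample_T, sample_P, 0))
```

With s = 3 only {a, b} receives an exact count, and {d} is abandoned after three transactions; with s = 0 every pattern is counted in full. The saving here comes from stopping early per pattern, whereas the verifiers on the following slides work on an fp-tree and a pattern tree.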

Double-Tree Verifier (DTV) • [Figure: worked example over items a to h. FP-tree panels: original fp-tree; fp-tree conditionalized on g; fp-tree | g conditionalized on d. Pattern-tree panels: initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against the FP-tree; filling the original pattern tree using reverse pointers]

Double-Tree Verifier (DTV) • For very small min_freq values, it becomes impossible to run FP-growth due to the exponential number of paths • Advantage: DTV is useful when the minimum support decreases

Depth-First Verifier (DFV) • Ancestor Failure: if a path in the fp-tree has already been shown not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property) • Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it does (or does not) contain p itself too • Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p

Hybrid Version • Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV • Trees are small: DFV is faster than DTV • Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV

Experiments

Experiments (cont.) • transactions = 100k

Conclusion • Verifiers can speed up many other applications: • incremental mining (SWIM) • enhancing static algorithms (counting phase) • privacy-preserving techniques (long transactions) • monitoring / concept shift detection • Hybrid: there is no exact point at which to switch from DTV to DFV
