Verifying and mining frequent patterns from large windows over data streams. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008.
Outline
• Introduction and motivation
• SWIM algorithm
• DTV and DFV algorithms
• Experiments
• Conclusion
Introduction and motivation
• Conditional counting
• Verifiers: DTV and DFV verify the frequency of previously frequent itemsets over newly arriving windows
• A fast verifier enables incremental frequent itemset mining: the Sliding Window Incremental Miner (SWIM)
SWIM algorithm
• The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency over the whole window is not known, because the pattern wasn't frequent in the previous n-1 slides
• W: the window
• PT (pattern tree): a superset of the frequent patterns over W
• aux_array: stores the frequency of a pattern for each window in which that frequency is still unknown
• p.fi: the frequency of p in the i-th slide (for example, p.f4 for slide S4)
• p.freq: p's cumulative frequency in the current window
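SWIM algorithm (cont.)
• Example: slides … S2 S3 S4 S5 S6 S7 …, windows W4 W5 W6 W7 (each window spans three slides, so W4 = S2 S3 S4); p is first added to the pattern tree at S4
• W4: aux_array = <p.f4, p.f4>, p.freq = p.f4
• W5: aux_array = <p.f2+p.f4, p.f4+p.f5>, p.freq = p.f4+p.f5
• W6: aux_array = <p.f2+p.f3+p.f4, p.f3+p.f4+p.f5>, p.freq = p.f4+p.f5+p.f6
• W7: p.freq = p.f5+p.f6+p.f7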
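The bookkeeping in the example above can be traced with a short Python sketch. It is an illustration only, not the authors' implementation: the function name `swim_trace`, the dictionary `f` of per-slide frequencies, and the sample numbers are all assumptions made for the example. It assumes that each time the window advances, the expiring slide is counted for p and the result is added to the aux_array entries of the windows that contain that slide.

```python
def swim_trace(f, t, n, last_window):
    """Trace aux_array and p.freq for a pattern first added at slide t.

    f           -- hypothetical dict: slide index -> frequency of p in that slide
    t           -- index of the slide in which p first became frequent
    n           -- number of slides per window
    last_window -- index of the last window to trace
    """
    # Entry j tracks window W_{t+j}; only slide t is known at first.
    aux = [f[t]] * (n - 1)
    freq = f[t]
    trace = {t: (list(aux), freq)}
    for w in range(t + 1, last_window + 1):
        k = w - t                  # slides elapsed since p was added
        expired = w - n            # slide leaving the window
        freq += f[w]               # the new slide always counts
        if expired >= t:
            freq -= f[expired]     # slide was counted before, so remove it
        if k <= n - 1:
            for j in range(n - 1):
                if k <= j:         # new slide S_w lies inside W_{t+j}
                    aux[j] += f[w]
                if j + 1 <= k:     # expiring (older) slide lies inside W_{t+j}
                    aux[j] += f[expired]
            trace[w] = (list(aux), freq)
        else:
            trace[w] = (None, freq)  # aux_array no longer needed
    return trace


# Made-up per-slide frequencies for slides S2..S7.
f = {2: 1, 3: 2, 4: 10, 5: 20, 6: 30, 7: 40}
for window, (aux, freq) in swim_trace(f, t=4, n=3, last_window=7).items():
    print(f"W{window}: aux_array={aux} p.freq={freq}")
```

With these numbers the W6 row shows both aux_array entries complete (f2+f3+f4 and f3+f4+f5), which is the point at which the delayed frequencies for W4 and W5 become known.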
Analysis of SWIM algorithm
• Delay: a pattern's frequency in a window may turn out to be larger than the minimum support only after the earlier slides have been counted, so the pattern is reported late
• Maximum delay: n-1 slides (n: number of slides per window)
• Bottleneck: counting the frequencies of itemsets over a given dataset (delay = L, n-L+1 slides)
Conditional counting
• Goal: verify the counts for a given set of patterns. For each pattern p, the verifier either
• 1. reports p's true frequency in D, if p has occurred at least min_freq times, or
• 2. reports that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is below min_freq)
Conditional counting (cont.)
• Verification
• Given: a set of transactions T, a set of patterns P and a threshold s
• Goal: find the exact frequency of each p ∈ P w.r.t. T, iff its frequency is ≥ s
• If s = 0, verification = counting, but if s > 0 extra computation can be avoided
• Proposed fast verifiers
• DTV, DFV, hybrid
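The verification contract above can be made concrete with a deliberately naive Python sketch. This is not DTV or DFV: it scans the transactions pattern by pattern and only shows where a threshold s > 0 saves work, by abandoning a pattern as soon as the remaining transactions cannot lift it to s. The function name `verify` and the sample data are assumptions made for the example.

```python
def verify(transactions, patterns, s):
    """Naive conditional counting.

    Returns a dict mapping each pattern (as a frozenset) to its exact
    frequency if that frequency is >= s, and to None otherwise.
    """
    txns = [frozenset(t) for t in transactions]
    result = {}
    for p in patterns:
        items = frozenset(p)
        count = 0
        for seen, t in enumerate(txns, start=1):
            if items <= t:              # pattern is contained in transaction
                count += 1
            remaining = len(txns) - seen
            if count + remaining < s:   # s is out of reach: stop early
                break
        result[items] = count if count >= s else None
    return result


T = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "d"]]
P = [("a", "b"), ("d",), ("c",), ("a", "d")]
print(verify(T, P, s=2))  # exact counts only for patterns reaching s
print(verify(T, P, s=0))  # s = 0 degenerates to plain counting
```

With s = 0 the early exit never fires, which matches the remark that verification then equals counting.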
Double-Tree Verifier (DTV)
[Figure. FP-tree side: original fp-tree; conditionalized fp-tree on g; conditionalized fp-tree |g on d. Pattern-tree side: initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against the FP-tree; filling the original pattern tree using reverse pointers.]
Double-Tree Verifier (DTV)
• For very small min_freq values, it becomes impossible to run FP-growth because of the exponential number of paths
• Advantage: DTV is useful when the minimum support decreases
Depth-First Verifier (DFV)
• Ancestor Failure: if a path in the fp-tree has already been proved not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property)
• Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it does (or does not) contain p itself too
• Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p
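The first rule above (Ancestor Failure) can be illustrated in isolation with a small Python sketch. It is a simplification: the path is a plain set of items rather than a marked fp-tree path, the pattern tree is a nested dictionary, and the other two rules are not modelled because they depend on marking details not given in these slides. The name `patterns_contained` and the sample trees are assumptions made for the example.

```python
def patterns_contained(path_items, pattern_tree, prefix=()):
    """Depth-first walk of a pattern tree against one path.

    pattern_tree -- nested dict: item -> subtree of longer patterns
    Returns the patterns (as tuples) that the path contains.
    """
    found = []
    for item, subtree in pattern_tree.items():
        if item not in path_items:
            # Ancestor failure: the path lacks this prefix, so every
            # longer pattern below it is skipped without being checked.
            continue
        pattern = prefix + (item,)
        found.append(pattern)
        found.extend(patterns_contained(path_items, subtree, pattern))
    return found


# Hypothetical pattern tree holding a, ab, abc, ad, b, bd.
tree = {"a": {"b": {"c": {}}, "d": {}}, "b": {"d": {}}}
print(patterns_contained({"a", "b", "d"}, tree))
```

For a path without the item a, the whole a-subtree (ab, abc, ad) is pruned after a single membership test.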
Hybrid Version
• Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV
• Trees are small: DFV is faster than DTV
• Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV
Experiments (cont.) transaction=100k
Conclusion
• Verifiers can speed up many other applications:
• incremental mining (SWIM)
• enhancing static algorithms (counting phase)
• privacy-preserving techniques (long transactions)
• monitoring / concept shift detection
• Hybrid: there is no exact point at which to switch from DTV to DFV