Verifying and Mining Frequent Patterns from Large Windows over Data Streams

This paper presents algorithms for efficiently mining frequent patterns from data streams, specifically focusing on mining windows over streams. The proposed algorithms address challenges related to computation, storage, real-time response, customization, and integration with the data stream management system (DSMS).



  1. Verifying and Mining Frequent Patterns from Large Windows over Data Streams Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo ICDE 2008, Cancun, Mexico

  2. Finding Frequent Patterns for Association Rule Mining • Given a set of transactions T and a support threshold s, find all patterns with support >= s • Apriori [Agrawal’ 94], FP-growth [Han’ 00] • Fast & light algorithms for data streams • More than 30 proposals [Jiang’ 06] • For mining windows over streams • In particular DSMSs divide windows into panes, a.k.a. slides • As in our Stream Mill Miner system
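A minimal sketch of the problem statement above, assuming transactions are sets of items and that a pattern's support is the number of transactions containing it; the brute-force enumeration is only meant to make the definition concrete, not to stand in for the mining algorithms discussed in these slides.

```python
from itertools import combinations

def frequent_patterns(transactions, s):
    """Illustrative brute force: return every itemset whose support
    (number of transactions containing it) is >= s."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= s:
                result[candidate] = support
    return result

# With s = 2, the frequent patterns are ('a',), ('b',) and ('a', 'b').
print(frequent_patterns([{'a', 'b'}, {'a', 'b', 'c'}, {'b', 'd'}], 2))
```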

  3. Moment (Maintaining Closed Frequent Itemsets over a Stream Sliding Window) • Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz • Collaboration of UCLA + IBM

  4. Closed Enumeration Tree (CET) • Very similar to an FP-tree, except that it keeps a dynamic set of itemsets: • Closed freq itemsets • Boundary itemsets

  5. Moment Algorithm (I) • Hope: in the absence of concept drifts, there are not many changes in status • Maintains two types of boundary nodes: • Freq / non-freq • Closed / non-closed • Takes specific actions to maintain the shifting boundary whenever a concept shift occurs

  6. Moment Algorithm (II) • Infreq gateway nodes • Infreq + its parent freq + result of a candidate join • Unpromising gateway nodes • Freq + prefix of a closed w/ same support • Intermediate nodes • Freq + has a child w/ same supp + not unpromising • Closed nodes • Closed freq

  7. Moment Algorithm (III) Increments: • Add/Delete to/from the CET upon arrival/expiration of each transaction Downside: • Batch operations not applicable; performance suffers for big slide sizes Advantage: • Efficient for small slides

  8. CanTree [Leung’ 05] • Use a fixed canonical order according to decreasing single-item frequency • Use a single-round version of FP-growth Algorithm: upon each window move: • Add/Remove new/expired transactions to/from the FP-tree (using the same item order) • Run FP-growth! (Without any pruning)
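A minimal sketch of the CanTree idea on this slide: transactions are inserted into, and removed from, a prefix tree using one fixed canonical item order, so window maintenance never restructures the tree. The class and method names are illustrative, and plain sorted order stands in for the canonical order.

```python
class CanTreeNode:
    def __init__(self):
        self.count = 0
        self.children = {}        # item -> CanTreeNode

class CanTree:
    """Prefix tree over a fixed canonical item order (plain sorted order here)."""
    def __init__(self):
        self.root = CanTreeNode()

    def _ordered(self, transaction):
        return sorted(transaction)            # the fixed canonical order

    def add(self, transaction):               # a new transaction enters the window
        node = self.root
        for item in self._ordered(transaction):
            node = node.children.setdefault(item, CanTreeNode())
            node.count += 1

    def remove(self, transaction):            # an expired transaction leaves the window
        node = self.root
        for item in self._ordered(transaction):
            node = node.children[item]
            node.count -= 1

tree = CanTree()
tree.add({'a', 'c', 'b'})
tree.add({'a', 'b'})
tree.remove({'a', 'b'})
# On every window move, FP-growth is then run over this tree (without pruning).
```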

  9. CanTree (cont.) • Pros: • Very efficient for large slides • Cons: • Inefficient for small slides • Not scalable for large windows • Needs memory for the entire window

  10. Frequent Pattern Mining over Data Streams [Figure: a window sliding over the stream, divided into slides S4–S7; the oldest slide expires as a new one arrives, moving from window W4 to W5] • Challenges • Computation • Storage • Real-time response • Customization • Integration with the DSMS

  11. Frequent Pattern Mining over Data Streams • Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …] • Mining each window from scratch - too expensive • Subsequent windows have many frequent patterns in common • Updating frequent patterns on every new tuple - also too expensive • SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows • Desiderata: scalability with slide size and window size

  12. SWIM (Sliding Window Incremental Miner) • If a pattern p is frequent in a window, it must be frequent in at least one of its slides -- so keep PT, the union of the frequent patterns of all slides in the window [Figure: as the window moves from W4 to W5, the new slide S7 is mined, F7 is added to PT, frequencies are counted/updated, and PT is pruned; PT changes from F4 U F5 U F6 to F5 U F6 U F7]

  13. SWIM • For each new slide Si • Find all frequent patterns in Si (using FP-growth) • Verify frequency of these new patterns in each window slide • Immediately or • With delay (< N slides) • Trade-off: max delay vs. computation. • No false negatives or false positives!
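A minimal sketch of the SWIM loop from slides 12–13, with simplified bookkeeping: PT is a dictionary mapping each pattern to its per-slide counts, a brute-force `mine` stand-in plays the role of FP-growth on a slide, and newly mined patterns are verified against the existing slides by direct counting (done immediately here, although the slides also allow delaying it). Names and the helper functions are illustrative, not the authors' implementation.

```python
from collections import deque
from itertools import combinations

def mine(slide, s):
    """Brute-force stand-in for FP-growth: patterns frequent within one slide."""
    items = sorted({i for t in slide for i in t})
    return {c for k in range(1, len(items) + 1)
              for c in combinations(items, k)
              if sum(1 for t in slide if set(c) <= t) >= s}

def count_in(pattern, slide):
    return sum(1 for t in slide if set(pattern) <= t)

class SWIM:
    """Keep PT = union of each slide's frequent patterns; a slide is a list of
    transactions (sets of items), and the window holds the last n slides."""
    def __init__(self, n_slides, s_per_slide):
        self.n, self.s = n_slides, s_per_slide
        self.slides = deque()     # slides currently in the window
        self.pt = {}              # pattern -> list of counts, one per slide in the window

    def new_slide(self, slide):
        # 1. Mine the new slide; verify genuinely new patterns against the old slides.
        for p in mine(slide, self.s):
            if p not in self.pt:
                self.pt[p] = [count_in(p, old) for old in self.slides]
        self.slides.append(slide)
        # 2. Count/update the frequency of every PT pattern in the new slide.
        for p, counts in self.pt.items():
            counts.append(count_in(p, slide))
        # 3. Expire the oldest slide and prune patterns frequent in no remaining slide.
        if len(self.slides) > self.n:
            self.slides.popleft()
            for p in list(self.pt):
                self.pt[p].pop(0)
                if max(self.pt[p]) < self.s:
                    del self.pt[p]

    def window_frequent(self, window_threshold):
        """Patterns whose total count over the current window reaches the threshold."""
        return {p for p, counts in self.pt.items() if sum(counts) >= window_threshold}
```

This mirrors the argument on slide 12: any window-frequent pattern is frequent in at least one slide, so it appears in PT with exact counts, giving neither false negatives nor false positives.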

  14. SWIM – Design Choices • Data structure for the Si’s: FP-tree [Han’ 00] • Data structure for PT: FP-tree • Mining algorithm: FP-growth • Count/Update frequencies: Naïve? Hash-tree? • Counting is the bottleneck • A new and improved counting method: Conditional Counting

  15. Conditional Counting • Verification • Given a set of transactions T, a set of patterns P, and a threshold s • Goal: find the exact frequency of each p ∈ P w.r.t. T, IF AND ONLY IF its frequency is ≥ s • If s = 0, verification = counting, but if s > 0 extra computation can be avoided • Proposed fast verifiers • DTV, DFV, hybrid
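A minimal sketch of what a verifier computes under the definition above: exact frequencies are returned only for patterns whose frequency reaches s. The early-exit test is a simple illustration of how a threshold s > 0 lets counting stop early; the DTV/DFV verifiers on the following slides exploit the threshold far more aggressively on tree structures.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact frequency} for exactly those patterns in `patterns`
    whose frequency w.r.t. `transactions` (a list of item sets) is >= s."""
    n = len(transactions)
    result = {}
    for p in patterns:
        pset, count = set(p), 0
        for i, t in enumerate(transactions):
            if pset <= t:
                count += 1
            # Even if p occurred in every remaining transaction it could not
            # reach s any more, so stop counting this pattern.
            if count + (n - i - 1) < s:
                break
        if count >= s:
            result[p] = count
    return result

trans = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c'}]
print(verify(trans, [('a', 'b'), ('a', 'b', 'c')], 2))   # {('a', 'b'): 2}
```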

  16. Conditionalization on FP-trees [Figure: an FP-tree and its conditional trees FP-tree | g and FP-tree | gd]

  17. Attempt I: DTV (Double-Tree Verifier) • Not only conditionalize the FP-tree, but also the pattern tree

  18. [Figure: DTV example -- an FP-tree with its header table and the conditional FP-tree | g, built from the conditional pattern base of “g”: (a:2,b:2,c:2,d:2), (a:1,b:1,c:1), (b:1,e:1); alongside them, the initial pattern tree, the conditional pattern tree | “g”, the pattern tree | “g” after verification against the FP-tree, and the original pattern tree filled in using reverse pointers]

  19. DTV (cont.) • Scales up well on large trees • Much pruning from conditionalization • However, for smaller trees • Less pruning • Overhead of conditionalization not always worth it

  20. Attempt II: DFV (Depth-First Verifier) • Each node n in PT corresponds to a unique pattern p_n, therefore: • For each node n in PT • Traverse the FP-tree and count the occurrences of p_n in a depth-first order • Keep the nodes marked as FAIL/OK while visiting their children • Utilize these marks for optimized execution • More efficient when both trees are small
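A minimal sketch of the depth-first counting that DFV builds on, shown for a single pattern against an FP-tree; the real DFV walks all PT nodes together and reuses FAIL/OK marks, which this illustration omits, and the node layout here is assumed rather than taken from the paper.

```python
class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_transaction):
    """Insert a transaction (already in the tree's fixed item order) into the FP-tree."""
    node = root
    for item in ordered_transaction:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def dfs_support(node, remaining):
    """Depth-first count of the pattern whose still-unseen items are `remaining`.
    Once the root path has covered the whole pattern, this node's count already
    accounts for every transaction below it, so the branch is not explored further."""
    if node.item is not None:
        remaining = remaining - {node.item}
    if not remaining:
        return node.count
    return sum(dfs_support(child, remaining) for child in node.children.values())

root = FPNode()
for t in (['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c']):
    insert(root, t)
print(dfs_support(root, frozenset({'a', 'c'})))   # 2  ('a b c' and 'a c')
```

Because prefixes are shared and the walk stops as soon as the pattern is covered, counting on the tree touches far fewer nodes than rescanning the raw transactions.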

  21. DFV (cont.)

  22. DFV (cont.)

  23. Comparing Verifiers

  24. Hybrid Verifier • Start by performing DTV recursively • Once the resulting trees are small enough, switch to DFV

  25. Comparing Verifiers

  26. Verifiers vs. Hash Trees (Counting)

  27. SWIM with Hybrid Verifier (I)

  28. SWIM with Hybrid Verifier (II)

  29. Applications of Verifiers (I) • Improving counting in static mining methods • Candidate-generation (and pruning) phase • Example: Toivonen Approach [Toivonen’ 96] • Maintain a boundary of smallest non-frequent and largest frequent patterns • Check the frequency of boundary patterns

  30. Applications of Verifiers (II) In case resources are limited • Mine once • Keep monitoring the current patterns (by verifying them) • Since verifying is computationally cheaper • Whenever a significant concept shift is detected, mine again!
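A minimal sketch of the mine-once-then-monitor loop described on this slide, reusing the brute-force `mine` and threshold-aware `verify` sketches from earlier slides (passed in as callables); flagging a concept shift when a chosen fraction of the previously mined patterns fails verification is an illustrative criterion, not the paper's detector.

```python
def monitor(slides, mine, verify, s, shift_fraction=0.5):
    """Mine the first slide once, then only verify those patterns on later slides;
    re-mine whenever too many of them stop being frequent (a concept shift)."""
    patterns = set(mine(slides[0], s))
    for slide in slides[1:]:
        still_frequent = verify(slide, patterns, s)          # cheap: verification only
        if len(still_frequent) < shift_fraction * len(patterns):
            patterns = set(mine(slide, s))                   # significant shift: mine again
        else:
            patterns = set(still_frequent)
    return patterns
```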

  31. Monitoring/Concept Shift Detection • Verification is much faster than mining (when it suffices)

  32. Privacy Preserving Applications • Random noise methods: • Add many fake items into the transactions to increase the variance [Evfimievski’ 03] • Overhead: • Long transactions (on the order of the number of items) • Lemma: the max depth of the recursion in DTV is <= the max length of the patterns to be verified • Run time is independent of the transaction length

  33. Optimization when integrated into a DSMS • Stream Mill Miner (SMM) provides integrated support for online mining algorithms via • User-Defined Aggregates (UDAs) • Definition of mining models • Constraints used for optimization • Max allowed delay • Interesting/Uninteresting items • Interesting/Uninteresting patterns • These are turned from post-conditions into pre-conditions

  34. Conclusions • SWIM for incremental mining over large windows • More efficient than existing approaches on data streams • Trade-off between real-time response, efficiency, memory, etc. • Efficient algorithms for verification/conditional counting • DTV, DFV, and Hybrid • These can be used to speed up many applications: • Incremental mining, enhancing static algorithms, privacy-preserving techniques, … • Implementations of SWIM and the verifiers are available at http://wis.cs.ucla.edu/swim/index.htm

  35. References
  [Agrawal’ 94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
  [Cheung’ 03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support constraint. In DEAS, 2003.
  [Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
  [Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
  [Han’ 00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
  [Koh’ 04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
  [Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: a tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
  [Toivonen’ 96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.

  36. Thank you! Questions?

  37. DFV (cont.)

  38. DFV (cont.)
