Continuous Data Stream Processing

• Music Virtual Channel – extensions
• Data Stream Monitoring – tree pattern mining
• Continuous Query Processing – sequence queries
Post-Excellence Project, Subproject 6
Date: 2005/10/21

[Figure: system architecture of the Music Virtual Channel and its extensions. Labeled components: peer search engine, profile database, music channel simulator, XML filtering engine, MusicXML database, clustering engine, cluster coordinator, interface, channel monitor, cluster monitor, profile monitor, favorite channel, filtering engine, music metadata, music collections, Internet, and V.C. players 1, 2, …, N]

An Extension on Virtual Channel
• After a player starts a range (or kNN) search,
  • It updates its profile periodically
  • The search results are continuously maintained
[Figure: a V.C. player (query) and a V.C. player (peer)]
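
The continuous range search described on this slide can be illustrated with a brute-force sketch. The class `RangeMonitor`, its method `update`, and the use of Euclidean distance are assumptions made for illustration; the slides do not specify the distance measure or the indexing method, and a real peer search engine would use an index rather than a linear scan.

```python
import math


class RangeMonitor:
    """Hypothetical brute-force monitor for continuous range searches.

    Every player is both a moving object and a query: results[p] holds
    the peers whose profiles lie within `radius` of p's profile.
    """

    def __init__(self, radius):
        self.radius = radius
        self.profiles = {}   # player -> profile vector
        self.results = {}    # player -> set of peers currently in range

    def update(self, player, profile):
        """Record a periodic profile update and repair affected results."""
        profile = tuple(profile)
        self.profiles[player] = profile
        self.results.setdefault(player, set())
        for other, vec in self.profiles.items():
            if other == player:
                continue
            if math.dist(profile, vec) <= self.radius:
                # The range relation is symmetric, so update both sides.
                self.results[player].add(other)
                self.results[other].add(player)
            else:
                self.results[player].discard(other)
                self.results[other].discard(player)


monitor = RangeMonitor(radius=1.0)
monitor.update("p1", (0.0, 0.0))
monitor.update("p2", (0.5, 0.0))   # p2 enters p1's range
in_range = set(monitor.results["p1"])
monitor.update("p2", (3.0, 0.0))   # p2 moves away again
out_of_range = set(monitor.results["p1"])
```

Each update costs time linear in the number of players, which is the cost an index over the feature space is meant to reduce.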

An Extension on Virtual Channel
• Compared with the clustering engine
  • A flexible definition of “clusters”
  • Updates are more natural than insertions/deletions
  • No need for parameter setting or re-clustering
  • Indexing can relieve the cost of frequent updates
• Compared with the problem of moving objects
  • Movements take place in a high-dimensional feature space
  • In most cases every object is also a query
  • Prediction of object movement is possible

An Extension on Favorite Channel
• When a music piece is played on a channel,
  • The corresponding MusicXML file can be obtained
  • A query can be a MusicXML fragment or an XQuery expression
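
As a rough illustration of matching a query against the MusicXML of the piece being played, the sketch below uses the limited XPath subset of Python's `xml.etree.ElementTree` as a stand-in for XQuery. The document `SAMPLE` is a simplified, hypothetical fragment rather than a complete MusicXML score, and the helper `matches` is an assumed name, not part of the filtering engine described in the slides.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical MusicXML-like fragment (not a full score).
SAMPLE = """<score-partwise>
  <part id="P1">
    <measure number="1">
      <note><pitch><step>C</step><octave>4</octave></pitch></note>
      <note><pitch><step>D</step><octave>4</octave></pitch></note>
    </measure>
  </part>
</score-partwise>"""


def matches(musicxml_text, path):
    """Return True if the path query selects at least one element."""
    root = ET.fromstring(musicxml_text)
    return root.find(path) is not None


# A query asking for any pitch whose step is C.
has_c = matches(SAMPLE, ".//pitch[step='C']")
# A query asking for any pitch whose step is E (absent in SAMPLE).
has_e = matches(SAMPLE, ".//pitch[step='E']")
```

Because the query runs against the XML rather than the audio, it leaves playback untouched.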

An Extension on Favorite Channel
• Compared with query segments
  • More musical semantics in a query
  • Does not interfere with music playback
  • Matching on complex tree structures
  • Common subqueries are still useful

Research Issues
• Peer Search Engine
  • An indexing method to support continuous query processing for high-dimensional moving objects
  • A prediction-based bounding mechanism to reduce the frequency of profile updates
• XML Filtering Engine
  • An online method to enable tree pattern mining over a data stream
  • An indexing mechanism to support XML filtering

Discovering Frequent Tree Patterns over Data Streams
Submitted for publication

Problem Definition
• As the query trees stream in, find the subtrees that occur more than θ·N times, where N is the number of trees received so far and 0 ≦ θ ≦ 1
[Figure: query trees T1, T2, T3 stream into STMer, which outputs the frequent tree patterns]

Problem Definition (Cont.)
• Labeled ordered tree
• Induced subtree
[Figure: a query tree and a tree pattern over node labels A, B, C, D, E, with the annotation “B differs from C D”]

An Example
• Given θ = 0.6
• Frequent Tree Patterns (occurrence > 0.6*1)
• Frequent Tree Patterns (occurrence > 0.6*2)
• Frequent Tree Patterns (occurrence > 0.6*3)
[Figure: three query trees with node labels A to F stream into STMer; the frequent tree patterns are shown after each arrival]
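
The thresholds in this example can be checked with a few lines of arithmetic. The helper `min_frequent_count` is a hypothetical name; it returns the smallest integer occurrence count that exceeds θ·N, and takes θ as a string so that the product is computed exactly rather than in floating point.

```python
from fractions import Fraction
from math import floor


def min_frequent_count(theta, n):
    """Smallest integer count c that satisfies c > theta * n."""
    return floor(Fraction(theta) * n) + 1


# With theta = 0.6, the thresholds 0.6, 1.2 and 1.8 require at least
# 1, 2 and 2 occurrences after 1, 2 and 3 trees respectively.
needed = [min_frequent_count("0.6", n) for n in (1, 2, 3)]
```

So after the first tree every subtree it contains is frequent, while after the second and third trees a pattern must occur at least twice.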

Main Difficulties
• The properties of data streams:
  • One pass → Traditional tree mining methods fail
  • Fast input rate → Efficiency is critical
  • Incremental → An incremental algorithm is required
  • Unbounded → Approximate counting is needed
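
To make "approximate counting" concrete, the sketch below applies Lossy Counting, a well-known one-pass scheme by Manku and Motwani, to sets of patterns. This is a stand-in: the slides do not state which counting scheme STMer uses. The class name `LossyCounter`, the method names, and the choice to count each pattern at most once per tree are assumptions.

```python
import math


class LossyCounter:
    """Lossy Counting over a stream of trees (one pattern set per tree)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # trees per bucket
        self.n = 0                            # trees received so far
        self.entries = {}                     # pattern -> [count, max_error]

    def add_tree(self, patterns):
        """Count every distinct pattern of one incoming tree."""
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        for p in set(patterns):
            if p in self.entries:
                self.entries[p][0] += 1
            else:
                # The pattern may have been missed in earlier buckets.
                self.entries[p] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At a bucket boundary, drop entries that cannot be frequent.
            self.entries = {p: e for p, e in self.entries.items()
                            if e[0] + e[1] > bucket}

    def frequent(self, theta):
        """Patterns whose stored count is at least (theta - epsilon) * N."""
        threshold = (theta - self.epsilon) * self.n
        return {p for p, (count, _) in self.entries.items()
                if count >= threshold}


counter = LossyCounter(epsilon=0.5)
for tree_patterns in [{"A1", "A1B2"}, {"A1"}, {"A1", "B1"}, {"C1"}]:
    counter.add_tree(tree_patterns)
result = counter.frequent(theta=0.6)
```

Memory stays bounded because rare patterns are pruned at bucket boundaries, at the price of undercounting any pattern by at most ε·N.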

An Overview of Our Method
• Subtree generation
• Subtree maintenance
[Figure: query tree T1 enters STMer; a candidate pool is maintained and answers requests on demand]

String Representation
• DFS order on T → (label, level) node sequence S
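
A minimal sketch of this encoding, assuming a simple in-memory tree: a preorder (DFS) traversal emits one (label, level) pair per node, with the root at level 1. The `Node` class and the function `to_sequence` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a labeled ordered tree (children keep their order)."""
    label: str
    children: list = field(default_factory=list)


def to_sequence(root):
    """Return the (label, level) sequence of the tree in DFS order."""
    sequence = []
    stack = [(root, 1)]
    while stack:
        node, level = stack.pop()
        sequence.append((node.label, level))
        # Push children right to left so the leftmost is visited first.
        for child in reversed(node.children):
            stack.append((child, level + 1))
    return sequence


tree = Node("A", [Node("B", [Node("D")]), Node("C")])
encoded = to_sequence(tree)
```

Since the tree is ordered and every level is recorded, the sequence determines the tree uniquely, which lets a tree arrive node by node on a stream.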

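Subtree Generation
[Figure: the data stream delivers node (A,1) at t1 and node (B,2) at t2. The buffer grows from A1 to A1B2, and the subtrees A1, B1, and A1B2 are generated. A structure labeled TD is also shown.]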

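Subtree Generation (Cont.)
[Figure: after (A,1) at t1 and (B,2) at t2, the data stream delivers node (C,2) at t3. The buffer becomes A1B2C2, and the subtrees C1, A1C2, and A1B2C2 are generated in addition to A1, B1, and A1B2. A structure labeled TD is also shown.]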
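
The subtrees in this example (A1; then B1 and A1B2; then C1, A1C2, and A1B2C2) are consistent with a simple rule: each arriving node yields itself as a single-node subtree, plus every earlier induced subtree that contains its parent, extended by the new node. The sketch below enumerates subtrees by that rule. It is a brute-force reconstruction from the example, not necessarily the procedure used by STMer, and the names `stream_subtrees` and `encode` are hypothetical.

```python
def encode(subtree, labels, levels):
    """String form of a subtree, with levels relative to its root."""
    root_level = levels[subtree[0]]
    return "".join(f"{labels[i]}{levels[i] - root_level + 1}"
                   for i in subtree)


def stream_subtrees(nodes):
    """For each arriving (label, level) node, yield the new subtrees."""
    labels, levels = [], []
    path = []          # node indices on the current rightmost path
    containing = {}    # node index -> subtrees that contain that node
    for label, level in nodes:
        idx = len(labels)
        labels.append(label)
        levels.append(level)
        del path[level - 1:]          # keep only the proper ancestors
        new = [(idx,)]                # the node on its own
        if path:
            parent = path[-1]
            # Extend every known subtree that contains the parent.
            new += [s + (idx,) for s in containing[parent]]
        path.append(idx)
        for s in new:
            for i in s:
                containing.setdefault(i, []).append(s)
        yield [encode(s, labels, levels) for s in new]


generated = list(stream_subtrees([("A", 1), ("B", 2), ("C", 2)]))
```

The number of induced subtrees can grow exponentially with the tree size, which is why efficient generation and a compact candidate pool matter.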

Subtree Generation (Cont.)
[Figure: buffer A1B2 with a structure labeled TD and an APT rooted at Φ; node labels shown include A1, B1, B2, F2, C1, C2, D1, D2, D3, E1, E2, E3, E4]

Subtree Maintenance
[Figure: buffer A1B2E2, with #query trees received = 321. The APT and GPT are rooted at Φ and hold entries such as (A1, 5, 0), (B2, 4, 1), (C3, 2, 1), and (E2, 1, 3); three +1 annotations mark incremented entries.]

Experiments on Sensitivity
[Plots: minimum support; error parameter]

Experiments on Comparison • StreamT (ICDM’02)

Conclusion
• Contribution
  • A novel technique is proposed for efficient subtree generation
  • A compact structure is employed to reduce the memory requirement of the candidate pool
• Current work
  • Mining closed frequent subtrees over data streams
[Figure: two example trees over node labels A, B, C annotated with the values 5 and 2]
