Continuous Data Stream Processing
MAKE Lab
Post-Excellence Project, Subproject 6
Date: 2006/03/07
[Figure: system architecture. Labeled components: peer search engine, profile database, cluster coordinator, cluster monitor, music channel simulator, XML filtering engine, MusicXML database, Music Virtual Channel, clustering engine, interface, channel monitor, profile monitor, favorite channel, Internet, V.C. players 1 to N, filtering engine, music metadata, music collections]
Research Directions (Streaming Data Management)
• Filtering
  • Temporal Query Processing: Sequence Query Matching; Episode Query Matching
  • Spatial Query Processing: Range Search; KNN Search
  • Aggregate Query Processing: Top-K Search
• Mining
  • Closed Tree Pattern Mining
  • Frequent Tree Pattern Mining
  • Frequent Itemset Mining (sliding window)
  • Frequent Itemset Mining (landmark model)
Sequence Query Matching
• Given a set of sequence queries (SQs), how can we continuously monitor the event stream and, as soon as a segment arrives, report it if it is an approximate answer to some query within that query's error bound?
• Event stream: <a,b,c,d><c,e><a,b,c><b,d><a,d><e,f><a,e><a,b,c><e,f><a,b,c><e><b,c,e><d,f>…
• Sequence query: <a,b,c><b,d><a,c,d><e,f><a,e>, ε = 1
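The example above can be read as a sliding-window check. The sketch below assumes, since the slides do not define it, that the error of a segment is the number of positions whose itemset differs from the query's itemset, and that candidate segments have the same length as the query. The names `monitor`, `stream`, and `query` are illustrative, not taken from the project.

```python
from collections import deque

def monitor(stream, query, eps):
    """Yield (start_index, segment) for every window within the error bound.

    Assumption: error = number of positions where the itemsets differ.
    """
    q = [frozenset(s) for s in query]
    window = deque(maxlen=len(q))  # keeps only the most recent len(q) itemsets
    for i, itemset in enumerate(stream):
        window.append(frozenset(itemset))
        if len(window) == len(q):
            mismatches = sum(1 for w, s in zip(window, q) if w != s)
            if mismatches <= eps:
                yield i - len(q) + 1, [sorted(w) for w in window]

# Each string stands for one itemset, e.g. "abcd" is <a,b,c,d>.
stream = ["abcd", "ce", "abc", "bd", "ad", "ef", "ae",
          "abc", "ef", "abc", "e", "bce", "df"]
query = ["abc", "bd", "acd", "ef", "ae"]
matches = list(monitor(stream, query, eps=1))
```

Under this reading, the only reported segment starts at the third itemset of the stream: it differs from the query in a single position (<a,d> versus <a,c,d>).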
Episode Query Matching
• Knowledge Discovery from Telecommunication Network Alarm Databases [ICDE96]
• If an alarm of type A occurs, then an alarm of type B occurs within 30 seconds with probability 0.8
• If alarms of types A and B occur within 5 seconds, then an alarm of type C occurs within 60 seconds with probability 0.7
• If an alarm of type A precedes an alarm of type B, and C precedes D, all within 15 seconds, then E will follow within 4 minutes with probability 0.6
[Figure: alarm timelines over types A, B, C, D with 5-second and 15-second windows]
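To make the first rule concrete, the following sketch estimates the probability in a rule of the form "if A occurs, then B occurs within w seconds" from a timestamped alarm log. It is an illustration only: the function name `rule_confidence` and the toy log are made up, and the cited paper's actual algorithm is not reproduced here.

```python
from bisect import bisect_right

def rule_confidence(events, antecedent, consequent, window):
    """Fraction of antecedent alarms followed by a consequent alarm
    within `window` seconds. `events` is a list of (time, alarm_type)."""
    times_c = sorted(t for t, kind in events if kind == consequent)
    times_a = [t for t, kind in events if kind == antecedent]
    if not times_a:
        return None  # the rule never fires on this log
    hits = 0
    for t in times_a:
        j = bisect_right(times_c, t)  # first consequent strictly after t
        if j < len(times_c) and times_c[j] - t <= window:
            hits += 1
    return hits / len(times_a)

# Hypothetical log: five alarms of type A, four of them followed by B in time.
log = [(0, "A"), (10, "B"), (40, "A"), (100, "B"), (120, "A"),
       (125, "B"), (200, "A"), (229, "B"), (300, "A"), (330, "B")]
confidence = rule_confidence(log, "A", "B", window=30)
```

On this toy log the estimate equals 0.8, matching the shape of the first rule; the alarm of type A at time 40 is the one miss.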
Top-K Query
• Suppose two continuous queries are already registered, and then another continuous query is registered:
• Which two web documents are the most popular across the first and second servers?
• Which two web documents are the most popular across the third and fourth servers?
• Which two web documents are the most popular across the second and third servers?
[Figure: a coordinator that receives the queries and is connected to Server 1, Server 2, Server 3, and Server 4]
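As a baseline for these three queries, a coordinator could simply sum the per-server access counts and sort. The sketch below does exactly that; it is a naive illustration, not the project's method, and it ignores communication cost entirely. The function `top_k`, the document names, and the counts are all hypothetical.

```python
from collections import Counter

def top_k(server_counts, servers, k):
    """Return the k most popular documents across the given servers."""
    total = Counter()
    for s in servers:
        total.update(server_counts[s])  # adds counts document by document
    # Sort by descending count; break ties by document name for determinism.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:k]]

# Hypothetical per-server access counts.
server_counts = {
    1: {"a.html": 5, "b.html": 3, "c.html": 1},
    2: {"a.html": 1, "b.html": 4, "c.html": 2},
    3: {"c.html": 6, "d.html": 5, "a.html": 1},
    4: {"d.html": 2, "b.html": 7},
}
q1 = top_k(server_counts, [1, 2], k=2)
q2 = top_k(server_counts, [3, 4], k=2)
q3 = top_k(server_counts, [2, 3], k=2)
```

Note that the third query reuses counts from Server 2 and Server 3, which the first two queries already needed; this overlap is where sharing information between queries can pay off.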
Main Difficulties
• Heavy Communication Cost
  • Each server reports its current data only when necessary
• Multiple Continuous Queries
  • Most papers focus on one-time top-k queries or a single continuous top-k query
  • Information sharing is necessary
Spatial Query Processing
• Continuous queries for moving objects in high-dimensional space
  • Range search
  • KNN search
[Figure labels: user profile, search engine, V.C. player, recommended channel, vote mechanism, selected channel]
Problem Definition
• Given a set of objects with their positions in an N-dimensional region (N > 20)
• The set of objects is highly dynamic: each object can move in an unrestricted fashion, i.e., we do not assume any pattern of motion
• Continuously monitor the results of each query point
  • Range Query
  • KNN Query
Main Difficulties
• Heavy Communication Cost
  • Object updates occur only when the results for some queries might change
  • Safe Region [SIGMOD05]
• Incremental Update
  • Efficiently maintain the effective results
• Multiple Continuous Queries
  • Decide the quarantine area for each query
• Mixed Types of Queries
  • Support both the range query and the KNN query
[Figure: queries Q1 and Q2]
Range Query
• Query Q: (x,y), r
• For a cell C, max and min are the maximum and minimum distance dis(query, cell) between the query point and the cell
• A: max < r
• B: min ≤ r ≤ max
• C: min > r
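The A/B/C labeling can be sketched as follows, assuming axis-aligned rectangular cells and Euclidean distance (the slides do not state either). The function names and the `lo`/`hi` corner representation of a cell are illustrative choices.

```python
import math

def min_max_dist(q, lo, hi):
    """Return (min, max) Euclidean distance from point q to the box [lo, hi]."""
    d_min_sq = d_max_sq = 0.0
    for x, l, h in zip(q, lo, hi):
        nearest = min(max(x, l), h)  # clamp x into [l, h]
        farthest = l if abs(x - l) >= abs(x - h) else h
        d_min_sq += (x - nearest) ** 2
        d_max_sq += (x - farthest) ** 2
    return math.sqrt(d_min_sq), math.sqrt(d_max_sq)

def classify_cell(q, r, lo, hi):
    """Label a cell relative to the range query (q, r)."""
    d_min, d_max = min_max_dist(q, lo, hi)
    if d_max < r:
        return "A"  # cell lies entirely inside the query range
    if d_min > r:
        return "C"  # cell lies entirely outside the query range
    return "B"      # cell straddles the boundary: min <= r <= max

labels = [
    classify_cell((0, 0), 5, (1, 1), (2, 2)),
    classify_cell((0, 0), 5, (3, 3), (5, 5)),
    classify_cell((0, 0), 5, (6, 6), (7, 7)),
]
```

Because the distances are computed per axis, the same functions work unchanged for points and cells of any dimension.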
Range Query (Cont.)
• Moving Query (MQ)
• How to maintain the result for an MQ?
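Range Query (Cont.)
• When to update?
• Cell labels for (Q1, Q2, Q3) = (A, A, A): no update and no recalculation
• Cell labels for (Q1, Q2, Q3) = (A, A, B): update and recalculate for some queries
• Cell labels for (Q1, Q2, Q3) = (A, A, C): no update and no recalculation
• We only need to consider the objects marked with B
[Figure: server holding Q1, Q2, Q3; client carrying flag = 0/1]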
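The three cases can be expressed as a small decision function. This is a sketch under assumptions: the labels A, B, C are taken to be the cell labels of the range-query slide, the object is assumed to stay within its current cell, and the names `queries_to_recalculate` and `needs_update` are hypothetical.

```python
def queries_to_recalculate(labels):
    """labels maps a query id to the label ('A', 'B' or 'C') of the
    object's cell for that query. Only 'B' cells can change a result
    while the object stays inside the cell."""
    return sorted(q for q, label in labels.items() if label == "B")

def needs_update(labels):
    """True if the object has to report its new position."""
    return bool(queries_to_recalculate(labels))

case_1 = {"Q1": "A", "Q2": "A", "Q3": "A"}
case_2 = {"Q1": "A", "Q2": "A", "Q3": "B"}
case_3 = {"Q1": "A", "Q2": "A", "Q3": "C"}
```

In the second case only Q3 is recalculated, which matches "update and recalculate for some queries".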
Range Query (Cont.)
• For a range query Q: result list (O3, O5, O7); covered cells of type A (C3, C4, C5) and of type B (C2, C7, C9)
• For a cell C: affected queries of type A (Q2, Q4, Q7) and of type B (Q3, Q6, Q9)
[Figure: query motion, with cell C2 labeled]
KNN Query
• Query Q: (x,y), 3
[Figure: effect of an object update on the result: update the order, update the order, re-computation]
KNN Query (Cont.)
• Query Q: (x,y), 3
• Query Q’: (x’,y’), r, with r = d’max
KNN Query (Cont.)
• Query Q: (x,y), 3
• Query Q’: (x’,y’), r, with r = dmax + dquery
KNN Query (Cont.)
• Query Q: (x,y), 3
• Query Q’: (x’,y’), r, with r = dmax + dcell
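The r = dmax + dquery case can be illustrated with a short sketch. It assumes, since the slides do not define the symbols, that dmax is the distance from the old query point to its farthest current neighbor and dquery is the distance the query point has moved, and that the objects themselves have not moved. By the triangle inequality the old k neighbors then lie within r of the new query point, so a range query of radius r contains the new k nearest neighbors. The dcell variant is not shown; all names are illustrative.

```python
import math

def knn(points, q, k):
    """Brute-force k nearest neighbors of q."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

def knn_after_query_move(points, old_q, old_result, new_q, k):
    """Answer the KNN query at new_q through a range query of radius
    r = dmax + dquery (assumed meanings: see the text above)."""
    d_max = max(math.dist(p, old_q) for p in old_result)
    d_query = math.dist(old_q, new_q)
    r = d_max + d_query
    candidates = [p for p in points if math.dist(p, new_q) <= r]
    return knn(candidates, new_q, k), r

points = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 1), (6, 6), (2, 5)]
old_q, new_q, k = (0, 0), (3, 2), 3
old_result = knn(points, old_q, k)
new_result, r = knn_after_query_move(points, old_q, old_result, new_q, k)
```

On this toy data the range query returns every point, so nothing is saved; the sketch only checks that the converted query yields the same neighbors as a direct KNN search at the new position.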
Tree Pattern Mining
• As the trees stream in, find the subtrees that occur more than θ·N times, where N is the number of trees received so far and 0 ≤ θ ≤ 1
[Figure: trees T1, T2, T3 stream into STMer, which outputs frequent tree patterns]
Closed Tree Pattern Mining
• Mining closed frequent subtrees over data streams
• A subtree is closed if none of its proper supertrees has the same support as it does
[Figure: example trees with node labels A to D, their frequent subtrees with support counts 2 and 3, and the closed ones marked]