Panel on Stream Query Languages: The Aurora View
Stan Zdonik, Brown University
Aurora Queries
• We do not have an SQL-like language.
• We have a GUI for dataflow diagrams.
  • Boxes = operators
  • Arrows = streams
• Rationale:
  • Common subexpression elimination (CSE) is tough for thousands of queries.
  • Workflow is more natural.
  • Easier for users to extend what’s been done.
  • Best to understand implementation first.
Aurora Operators
• Very relational in spirit.
  • Filter, Map, Union, Join, Aggregate
• Adds windows (everyone seems to agree) … with some wrinkles that we will get to.
• Adds a few operators.
  • Wsort
  • Resample
Simple Aggregation
[Figure: dataflow diagram. A stream of (A, B, C) tuples (values shown include 1,1,1; 1,1,2; 1,1,3; 1,2,1; 1,2,2) enters an Aggregate box; a second (A, B, C) tuple stream is shown on the other side.]
Aggregate: Agg(init, incr, final)
  Window(on C, size = 2, offset = 1)
  GroupBy A, B
• init: called when window opens
• incr: called for each new value
• final: called when window closes
• One or more open windows per group.
• Size and offset given in: #tuples, attribute interval, or time interval
Generalized aggregate
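The init / incr / final protocol above can be sketched in Python. This is only an illustration, not Aurora's implementation: the function name `windowed_aggregate` is hypothetical, windows are measured in tuples per group (the slide's example windows on attribute C instead), and a new window is assumed to open every `offset` tuples.

```python
from collections import defaultdict


def windowed_aggregate(stream, key, size, offset, init, incr, final):
    """Yield final(group, state) each time a count-based window closes.

    Assumption: size and offset are both numbers of tuples within a group.
    """
    seen = defaultdict(int)           # tuples seen so far, per group
    open_windows = defaultdict(list)  # per group: list of [state, tuple_count]
    for tup in stream:
        group = key(tup)
        if seen[group] % offset == 0:
            open_windows[group].append([init(), 0])  # init: window opens
        seen[group] += 1
        still_open = []
        for win in open_windows[group]:
            win[0] = incr(win[0], tup)               # incr: each new value
            win[1] += 1
            if win[1] == size:
                yield final(group, win[0])           # final: window closes
            else:
                still_open.append(win)
        open_windows[group] = still_open


# Sum of C over windows of 2 tuples, sliding by 1, grouped by (A, B).
tuples = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 1, 3), (1, 2, 2)]
sums = list(windowed_aggregate(
    tuples,
    key=lambda t: (t[0], t[1]),
    size=2,
    offset=1,
    init=lambda: 0,
    incr=lambda state, t: state + t[2],
    final=lambda group, state: (group, state),
))
print(sums)
```

With offset smaller than size, a group can have several windows open at once, which matches the "one or more open windows per group" point on the slide.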
Query 1
Generate the stream of packets whose length is greater than twice the average packet length over the last 1 hour.
Input stream: (pID, length, time)
Aggregate: agg(init, incr, final)
  Window(on time, size = 1 hr, offset = 1 tuple)
  State = (sum int, num int, endtime int)
  init = {sum := 0; num := 0}
  incr(p) = {sum := sum + p.length; num := num + 1; endtime := p.time}
  final = emit(time2 = endtime, avgLen = sum/num)
Join: Match(length > 2 * avgLen and time = time2)
Map: f(t): (t.ID, t.length, t.time)
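As a rough Python sketch of what Query 1 computes (not the Aurora box network itself): the Aggregate and Join are fused into one loop, packets are assumed to arrive in time order with `time` in seconds, and the one-hour window is assumed to end at, and include, the current packet. The name `long_packets` is hypothetical.

```python
from collections import deque


def long_packets(packets, window_secs=3600):
    """Yield (pID, length, time) for packets longer than twice the
    average length over the trailing window (current packet included)."""
    window = deque()  # (time, length) pairs currently inside the window
    total = 0         # running sum of lengths, like `sum` in the slide
    for pid, length, time in packets:
        window.append((time, length))
        total += length
        # Evict packets that fell out of the trailing window.
        while window[0][0] <= time - window_secs:
            _, old_length = window.popleft()
            total -= old_length
        avg_len = total / len(window)  # like final: avgLen = sum / num
        if length > 2 * avg_len:       # the Join's match predicate
            yield (pid, length, time)


packets = [(1, 100, 0), (2, 100, 10), (3, 100, 20), (4, 1000, 30), (5, 100, 4000)]
print(list(long_packets(packets)))
```

Keeping a running sum avoids rescanning the window for every packet, which is the same idea as the incremental `incr` function in the slide.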
Query 2
Create an alert when more than 20 type 'A' squirrels are in Jennifer's backyard. Assume squirrels report every p sec.
Input stream: (sID1, region, time)
ST: (sID2, type)
Join: Match(sID1 = sID2)
Filter: region = JWY and type = “A”
Aggregate: agg(count)
  Window(on time, size = p sec, offset = p sec)
Filter: count > 20
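A minimal Python sketch of the same pipeline, under several assumptions not stated on the slide: ST is modeled as an in-memory dict from squirrel ID to type, reports arrive in time order, each squirrel reports once per period, and windows are aligned to multiples of the period. The name `squirrel_alerts` and its parameters are hypothetical.

```python
def squirrel_alerts(reports, squirrel_types, region="JWY", kind="A",
                    period=10, threshold=20):
    """Yield (window_start, count) for each tumbling window of `period`
    seconds in which more than `threshold` matching squirrels reported."""
    current = None  # index of the window currently being filled
    count = 0
    for sid, reported_region, time in reports:
        window = time // period
        if current is not None and window != current:
            if count > threshold:             # Filter: count > threshold
                yield (current * period, count)
            count = 0
        current = window
        # Join with ST on squirrel ID, then Filter on region and type.
        if reported_region == region and squirrel_types.get(sid) == kind:
            count += 1
    if current is not None and count > threshold:
        yield (current * period, count)       # flush the last window


types = {1: "A", 2: "A", 3: "A", 4: "B"}
reports = [(1, "JWY", 0), (2, "JWY", 1), (3, "JWY", 2), (4, "JWY", 3),
           (1, "JWY", 10), (2, "PARK", 11), (3, "JWY", 12)]
print(list(squirrel_alerts(reports, types, period=10, threshold=2)))
```

Because size equals offset, the windows do not overlap, so a single counter per window is enough.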
Query 3
Stream an event each time 3 different squirrels within a pairwise distance of 5 meters from each other chirp within 10 seconds of each other.
Input streams: (sID, loc, time), feeding the join inputs labeled 1 and 2
Join: Match(1.sID not= 2.sID and dist(1.loc, 2.loc) < 5 m)
  Window(on time, size = 5 sec, offset = 1 tuple)
Join: Match(dist(1.1.loc, 2.loc) < 5 m and dist(1.2.loc, 2.loc) < 5 m and 1.1.sID not= 2.sID and 1.2.sID not= 2.sID)
  Window(on time, size = 5 sec, offset = 1 tuple)
Super-bonus Query
Create a log of flow information from a stream of packets. A flow (simple definition) from a source S to a destination D ends when no packet from S to D is seen for at least 2 minutes after the last packet from S to D. The next packet from S to D starts a new flow. The flow log contains the source, destination, count of packets, and total length of packets for each flow.
Are you kidding!!!!
Actually, it’s Pretty Easy
[Figure: packets from S to D on a timeline, with a 2 min gap marked.]
Input stream: (pID, src, dest, length, time)
Aurora Aggregate 1: Aggr = (init1, incr1, final1)
  Window(size = 2 tuples, offset = 1)
  GroupBy (src, dest)
  State1 = (flow#: int, first packet, second packet)
  init1 = {flow# := 0; first := null; second := null}
  incr1(p) = {first := second; second := p; if second.time - first.time > 2 then flow# := flow# + 1}
  final1 = emit(second.src, second.dest, second.length, second.time, flow#)
Aurora Aggregate 2: Aggr = (init2, incr2, final2)
  Window(on flow#, size = 1, offset = 1)
  GroupBy (src, dest)
  State2 = (count int, len int)
  init2 = {count := 0; len := 0}
  incr2(p) = {count := count + 1; len := len + p.length}
  final2 = emit(src, dest, len, count)
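The two aggregates can be approximated in a few lines of Python. This sketch fuses them into one loop rather than mirroring the two Aurora boxes, and it assumes packets arrive in time order with `time` in seconds. A flow is closed when the next packet for the same (src, dest) arrives more than `gap` seconds later (strictly greater, following incr1), and flows still open at end of input are flushed. The name `flow_log` is hypothetical.

```python
def flow_log(packets, gap=120):
    """Yield (src, dest, packet_count, total_length) for each flow."""
    open_flows = {}  # (src, dest) -> [last_time, count, total_length]
    for pid, src, dest, length, time in packets:
        key = (src, dest)
        flow = open_flows.get(key)
        if flow is not None and time - flow[0] > gap:
            # Silence longer than the gap: the previous flow has ended.
            yield (src, dest, flow[1], flow[2])
            flow = None
        if flow is None:
            flow = open_flows[key] = [time, 0, 0]  # start a new flow
        flow[0] = time      # remember the latest packet time
        flow[1] += 1        # count := count + 1
        flow[2] += length   # len := len + p.length
    # End of input: emit whatever is still open.
    for (src, dest), (_, count, total) in open_flows.items():
        yield (src, dest, count, total)


packets = [(1, "S", "D", 100, 0), (2, "S", "D", 50, 60),
           (3, "X", "Y", 10, 70), (4, "S", "D", 200, 300)]
print(list(flow_log(packets)))
```

Note that a flow is only reported once a later packet (or end of input) reveals the gap, so an operator like this can wait arbitrarily long for the packet that closes a flow.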
… but this is not enough!
• What if it were really important that I know about the squirrels within 1 minute of the intrusion?
=> Queries need Quality-of-Service support.
In fact, QoS is an integral part of the declarative specification of the query.
… but it gets worse!
• Networks (e.g., mobile) can arbitrarily delay or lose tuples.
=> Operators can’t block arbitrarily long while waiting.
A corollary of latency-based QoS.
… and worse!
• Tuples may not arrive at an operator in sort order.
  • The network can reorder them.
  • Operators themselves can shuffle them.
  • Priority scheduling might force them out of order.
• This complicates things.
  • windows
  • aggregates
Our Solution
• Problem has to do with when to close windows.
• Tradeoff: Latency (QoS) vs. Accuracy
• Define additional parameters on windows that determine termination.
  • might result in lost data.
Our Solution (cont.)
• For blocking (late tuples) => Timeout
• For disorder (early tuples) => Slack
[Figure: two tuple timelines, one marking a timeout interval (time) and one marking slack (#tuples).]
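One possible reading of timeout and slack, sketched in Python. The exact semantics here are assumptions, not taken from the slides: windows are tumbling and count tuples by event time; `slack` is the number of tuples belonging to later windows that an open window tolerates before it closes; `timeout` is the longest arrival-time span a window may stay open, checked only when a tuple arrives (no background timer). The name `tumbling_counts` is hypothetical.

```python
def tumbling_counts(arrivals, size, slack, timeout):
    """arrivals: iterable of (arrival_time, event_time) in arrival order.
    Yields (window_start, count, reason) as windows close."""
    open_windows = {}  # window start -> [count, later_tuples_seen, opened_at]
    closed = set()     # starts of windows that have already been emitted

    def close(start, reason):
        count = open_windows.pop(start)[0]
        closed.add(start)
        return (start, count, reason)

    for arrival, event_time in arrivals:
        # Timeout: do not keep blocking on windows that have waited too long.
        for start in sorted(open_windows):
            if arrival - open_windows[start][2] >= timeout:
                yield close(start, "timeout")
        start = (event_time // size) * size
        if start in closed:
            continue  # tuple for a closed window: this is the lost data
        if start not in open_windows:
            open_windows[start] = [0, 0, arrival]
        open_windows[start][0] += 1
        # Slack: tolerate a bounded number of tuples from later windows.
        for other in sorted(open_windows):
            if event_time >= other + size:
                open_windows[other][1] += 1
                if open_windows[other][1] > slack:
                    yield close(other, "slack")


arrivals = [(0, 1), (1, 2), (2, 11), (3, 3), (4, 12), (5, 4)]
print(list(tumbling_counts(arrivals, size=10, slack=1, timeout=100)))
```

In the example, the out-of-order tuple with event time 3 is still counted because only one later tuple had been seen, while the tuple with event time 4 is dropped because its window had already closed. That is the latency versus accuracy tradeoff in miniature: a larger slack or timeout loses fewer tuples but delays results.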
Status
• Now:
  • Users supply values for timeout and slack.
  • As in examples, not always needed.
• Goal:
  • Automatically insert / adjust these values based on QoS specs.