Window Aggregates in NiagaraST

Window Aggregates in NiagaraST Kristin Tufte, Jin Li Thanks to the NiagaraST Group @ PSU Data Streams: Lecture 10

Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10

window window t4 (a5, 47, 01:10:10) t6 (a6, 46, 01:11:02) t4 (a5, 47, 01:10:10) t5 (a6, 48, 01:10:40) Window Aggregate – Buffering SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid ( aid, max)( a5, 47 )( a6, 48 ) (aid, amt ts (hh:mm:ss)) windowMax t1 (a5, 40, 01:06:30) t2 (a6, 42, 01:07:45) t3 (a5, 45, 01:08:15) window windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 window: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 ( aid, amt, ts ) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( a5, 47, 01:10:10) Data Streams: Lecture 10

wid max aid 70 70 70 70 70 a6 a5 a5 a5 a6 max: 45 max: 42 max: 40 max: 48 max: 47 … … … … … … 74 74 74 74 74 a5 a5 a6 a6 a5 max: 42 max: 45 max: 48 max: 40 max: 47 71 a6 max: 48 70 a6 max: 48 a6 max: 48 74 a6 max: 48 74 … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … a6 75 max: 46 … … … … … … … … … … … … Window Aggregate Evaluation in NiagaraST -WID (aid, window-id, max) ( a6, 70, 48 ) Max groups on: aid, wid (aid, amt, ts,window-id) t5 ( a6, 48, 01:10:30, 70-74 ) p1 ( a6, *, *, 70 ) t4 ( a5, 47, 01:10:10, 70-74 ) t6 ( a6, 46, 01:11:02, 71-75 ) t1 ( a5, 40, 01:06:30, 70-74 ) t3 ( a5, 45, 01:08:15, 70-74 ) t2 ( a6, 42, 01:07:45, 70-74 ) Bucket Range 5 minutes Slide 1 minute Wattr ts windows: 01:06:00 – 01:11:00 01:07:00 – 01:12:00 01:08:00 – 01:13:00 (aid, amt, ts ) p1( a6, *, 01:11:00) t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t1 ( a5, 40, 01:06:30) t6 ( a6, 46, 01:11:02) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) Data Streams: Lecture 10

What’s the Difference • Window Semantics • Assumptions of data arrival order vs. Window-id • Data arrival (query answer and result production) vs. Punctuation • Query evaluation performance • Space • Time • Latency Data Streams: Lecture 10

Bucket • Bucket maps each tuple to windows ( aid, amt, ts ) windows: A 01:03:00 – 01:08:00 B 01:04:00 – 01:09:00 C 01:05:00 – 01:10:00 D 01:06:00 – 01:11:00 E 01:07:00 – 01:12:00 F 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

Window Semantics Framework in NiagaraST • T: the set of all tuples in the input stream • S: a window specification • W: a set of window-ids • In this lecture, we assume time starts at 0 • windows: (T, S)  W • Defines the set of window ids to be used, e.g., 0, 1, 2, … • extent:(T, S, w) U  T, where w  W • Specifies which tuples belong to a given window • wids:(T, S, t) V  W, where t  T • Determines the set of window-ids to which a tuple belongs • Is the dual of extent Data Streams: Lecture 10

Extent • Window 74’s content ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 72 01:04:00 – 01:09:00 73 01:05:00 – 01:10:00 7401:06:00 – 01:11:00 75 01:07:00 – 01:12:00 76 01:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

Wids • t3’s window membership ( aid, amt, ts ) windows: 71 01:03:00 – 01:08:00 7201:04:00 – 01:09:00 7301:05:00 – 01:10:00 7401:06:00 – 01:11:00 7501:07:00 – 01:12:00 7601:08:00 – 01:13:00 t3 ( a5, 45, 01:08:15) t2 ( a6, 42, 01:07:45) t6 ( a6, 46, 01:11:02) t1 ( a5, 40, 01:06:30) t5 ( a6, 48, 01:10:30) t4 ( s5, 47, 01:10:10) [Range 5 minutes Slide 1 minute Wattr ts] Data Streams: Lecture 10

windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …} windows (T, S [5, 1, ts]) = {0, 1, 2, …} extent (w, T, S[RANGE, SLIDE, WATTR]) = { t T | ((w+1) * SLIDE)-RANGE) ≤ t.WATTR < (w+1) *SLIDE} extent (w, T, S[5,1, ts]) = { t T | ((w+1) * 1) − 5) ≤ t.ts < (w+1) * 1} wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } Defining Window Semantics - sliding window Q1:SELECT aid, max(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid For t4(s5, 47, 01:10:10), wids (t4, T, S [5, 1, ts]) = {w W | t4.ts / 1 – 1 < w ≤ (t4.ts + 5) / 1) – 1 } = {w W | 69.17 < w ≤ 74.17 } = {w W | 70 ≤ w ≤ 74} where t4.ts is 01:10:10 ≈ 70.17 minute Data Streams: Lecture 10

Partitioned Window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid Q1': SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts Pattr aid] Q1 == Q1' Data Streams: Lecture 10

Defining Window Semantics - partitioned window Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids windows (T, S [RANGE, SLIDE, row-num, PATTR]) = {(i, p) | i  {0, 1, 2, …}, p  T.PATTR} extent ((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t T | t.PATTR = p, ((i+1) * SLIDE)-RANGE ) ≤rank(t, row-num, PATTR, T) < (i+1) *SLIDE} Data Streams: Lecture 10

Rank function rank ( aid, amt, ts) t1 ( a5, 40, 01:06:30) 0 1 t1 ( a5, 40, 01:06:30) t3 ( a5, 45, 01:08:15) a5 2 t2 ( a6, 42, 01:07:45) t4 ( s5, 47, 01:10:10) t3 ( a5, 45, 01:08:15) t4 ( s5, 47, 01:10:10) t2 ( a6, 42, 01:07:45) 0 t5 ( a6, 48, 01:10:30) a6 t5 ( a6, 48, 01:10:30) 1 t6 ( a6, 46, 01:11:02) t6 ( a6, 46, 01:11:02) 2 rank(t, row-num, PATTR, T): tuple t’s arrival position in partition PATTR Data Streams: Lecture 10

Defining Window Semantics - partitioned window (cont.) wids (t, T, S[RANGE, row-num, PATTR]) = {(i, p)W | t.PATTR = p, r / SLIDE – 1 i  (r + RANGE) / SLIDE –1}wherer = rank (t, row-num, PATTR, T) Q2: SELECT aid, MAX(amt ) FROM Bids [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR aid] T: the set of all tuples in the input stream S: a window specification W: a set of window-ids Data Streams: Lecture 10

Defining Window Semantics - slide-by-tuple window Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid windows (T, S [RANGE, RATTR, 1, row-num]) = {w | t T, w = t.RATTR} extent (t, T, S[RANGE, RATTR,1, row-num]) = { t T | (w+1) -RANGE ) ≤t.RATTR< (w+1)} wids (t, T, S[RANGE, RATTR,1, row-num]) = { wW | t.RATTR≤w< t.RATTR + RANGE} Data Streams: Lecture 10

Discussion: Landmark Windows • “Incrementally compute the max bid price of each auction; update the results every 1 minute.” • Assume time starts from 0 • windows (T, S [-inf, 1, ts]) = {0, 1, 2, …} Data Streams: Lecture 10

Implementation of wids functions -sliding window • No state, no delay to determine the window-ids for each tuple • Context-free Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP BY aid wids (t, T, S [RANGE, SLIDE, WATTR]) = {w W| t.WATTR / SLIDE– 1 < w ≤ (t.WATTR + RANGE) / SLIDE) – 1 } wids (t, T, S [5, 1, ts]) = {w W| t.ts / 1 – 1 < w ≤ (t.ts + 5) / 1) – 1 } Data Streams: Lecture 10

Implementation of wids functions -partitioned window • Need to maintain the number of arrived tuples for each partition • Backward Context Q2: SELECT aid, MAX(amt ) FROM Bids [Range 1000 rows Slide 100 rows Wattr row-num Pattr aid] wids (t, T, S[1000, 100, row-num, aid]) = {(i, p)W | t.aid = p, r / 100 – 1 i  (r + 1000) / 100 –1}wherer = rank (t, row-num, PATTR, T) Data Streams: Lecture 10

Implementation of wids functions -slide-by-tuple window • Forward context Q3: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Rattr ts Slide 1 row Sattr row-num] GROUP BY aid wids (t, T, S[5, ts,1, row-num]) = { wW | t.ts≤w< t.ts + 5} windows (T, S [5, ts, 1, row-num]) = {w | t T, w = t.ts} Data Streams: Lecture 10

Backward Context and Forward Context • Backward context • state to be kept • Forward context • the mapping from tuples to window-ids has to be delayed Data Streams: Lecture 10

WID vs. Buffering – execution time comparison (overview) Data Streams: Lecture 10

WID vs. Buffering – execution time comparison (zoom-in) Data Streams: Lecture 10

Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Disordered Data and Out-of-order Processing Data Streams: Lecture 10

W1 W2 Windows W3 W4 W5 … … P1 P2 P3 P4 P5 P6 P7 P8 Panes Sharing Panes Li, J. Maier, D., Tufte, K., Papadimos, V., Tucker, P. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams. http://www.cs.pdx.edu/datalab/niagara/nopanenogain.pdf Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts]GROUP BY aid • Pane size = GCD (SLIDE, RANGE) Data Streams: Lecture 10

Query Transformation Q4: SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Q5: SELECT aid, sum(cnt) FROM Q6 [Range 4 Slide 1 Wattr pid] GROUP BY aid Q6: SELECT aid, count(*) as cnt, window-id as pid FROM Bids [Range 1 minute Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10

Pane Implementation (aid, sum, window-id) ( a5, 8, 70 ) s0 sum(*) (group on window-id, aid) (aid, pane-id, count, window-id) ( a5, 70, 8, 70-74 ) m0 bucket B2 as window-id RANGE 4 SLIDE 1 WATTR pane-id Q5 (aid, pane-id, count) ( a5, 70, 8 ) m0 count (*) (group on pane-id, aid) (aid, amt, pane-id) ( a5, 47, 70-70 ) t1 ( a5, *, 70 ) p1 ( a6, 48, 70-70 ) t2 Q6 bucket B1 as pane-id RANGE 1 min SLIDE 1 min WATTR ts (aid, amt, ts ) ( a5, 47, 01:10:10 ) t1 ( a5, *, 01:11:00 ) p1 ( a6, 48, 01:10:30 ) t2 streamscan Data Streams: Lecture 10

Aggregate Functions • Distributive, Algebraic, Holistic • Distributive • f(X1), f(X2)  f(X1 U X2) • Algebraic • g(X)  f(X), g is a “synopsis function” • g(X) can be stored in constant memory • g(X1), g(X2)  g(X1 U X2) • Holistic: otherwise • Differential • g(Y), g(X)  f(Y – X) • g(Y-X), g(X)  f(Y) • g(X) can be stored in constant memory • distributive or algebraic Data Streams: Lecture 10

Pane Implementation - Another Window Sum Window operator Range 4 rows Slide 1 row Wattr row-num Tumbling Count Range 1 minute Slide 1 minute Wattr ts SELECT aid, count(*) FROM Bids [Range 4 minutes Slide 1 minute Wattr ts] GROUP BY aid Data Streams: Lecture 10

When are panes better than windows? SELECT aid, max(amt) FROM Bids [Range X rows Slide Y rows Wattr row-num] GROUP BY aid Panes are better when cost ratio is less than 1 The number of tuples per pane affects whether using panes is better Data Streams: Lecture 10

Outline • Review of Window Aggregate Evaluation • WID Window Semantics • Panes • Handling Disorder Data Streams: Lecture 10

Sources of Disorder • Sources of disorder • Merging different data sources • Various network transmission delay • Data prioritization • Query processing algorithms, e.g., shared window joins [Hammad, et al.] • Multiple possible windowing attributes, e.g., two timestamps Data Streams: Lecture 10

Example of Disorder: Band Disorder • Network flow: source IP + port  destination IP + port • Timestamp of 8th packet of network flow vs. flow start time • Passive Measurement and Analysis project, San Diego Supercomputer Center, http://pma.nlanr.net/PMA Data Streams: Lecture 10

Example of Disorder: Block-sorted Disorder Network flow start time (s) • Scatter plot of netflow records emitted by a router in the Abilene Network (http://abilene.internet2.edu/observatory) • X-axis is position of flow record in stream, y-axis is network flow start time Data Streams: Lecture 10

Handling Disorder • Sort (In-order processing) • Slack – BSort in Aurora • Heartbeat in STREAM • Sort-based Merge in Gigascope • Output buffering and sorting in a shared-window join • Space and time cost Data Streams: Lecture 10

Out-of-Order Processing • Out-of-order processing in Join • M. A. Hammad, W. G. Aref, and A. K. Elmagarmid Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. SSDBM 2005 • NiagaraST: Punctuation + Window-Id • Analogue to CPU’s out-of-order processing of instructions Data Streams: Lecture 10

wid max aid 70 70 s5 s5 max: 47 max: 52 … … … … … … 74 74 s5 s5 max: 47 max: 52 71 s6 max: 48 70 70 s6 s6 max: 48 max: 48 s6 max: 48 74 74 s6 s6 max: 48 max: 48 74 … … … … … … … … … … … … s6 75 max: 46 … … … … … … … … … … … … … … … … … … Disorder Handling - WID Q1: SELECT aid, MAX(amt) FROM Bids [Range 5 minutes Slide 1 minute Wattr ts] GROUP-BY aid (aid, window-id, max) ( a6, 70, 48 ) Max Group on: wid, aid (aid, amt, ts,window-id) p1( a6, *, *, 70 ) t7 ( a5, 52, 01:10:15, 70-74) t6 ( a6, 46, 01:11:02, 71-75) bucket (aid, amt, ts ) p1( a6, *, 01:11:00) t6 ( a6, 46, 01:11:02) t7 ( a5, 52, 01:10:15) Data Streams: Lecture 10

Sources of Punctuation • External Punctuation • Data sources, e.g., Gigascope • Internal Punctuation • Mechanisms that can be used to generate punctuations • Slack • Heartbeat Data Streams: Lecture 10

Latency vs. AccuracyBand Disorder • Compare external punctuation and two flavors of slack • As slack increases, error decreases and latency increases • External punctuation has better latency and accuracy than slack Data Streams: Lecture 10

Latency vs. AccuracyBlock-Sorted Disorder SELECT count (*) From Bids [Range 10 minutes Slide 1 minute Wattr ts] Latency vs. Accuracy Block-Sorted-Disorder (percentage of incorrect answers) Data Streams: Lecture 10

Window Aggregates in NiagaraST

Window Aggregates in NiagaraST

Presentation Transcript

Aggregates in Civil Engineering

AGGREGATES

Aggregates

Recycled Aggregates

Aggregates

Hybrid Aggregates

Money Aggregates

Aggregates’ Integration

AGGREGATES There are two types of aggregates Coarse Aggregates Fine Aggregates

Aggregates

AGGREGATES

Semantics and Evaluation Techniques for Window Aggregates in Data Streams

aggregates

Aggregates

Semantics and Evaluation Techniques for Window Aggregates in Data Stream

Key Aggregates

Key Aggregates

CONCRETE AGGREGATES

Granite Aggregates

Aggregates in Essex

AGGREGATES

Resin Aggregates