
Semantics and Evaluation Techniques for Window Aggregates in Data Streams

Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker. SIGMOD 2005. Introduction: Window aggregation is an important query capability. Evaluating window aggregate queries over streams is non-trivial.



  1. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD 2005 Semantics and Evaluation Techniques for Window Aggregates in Data Streams

  2. Introduction • Window aggregation is an important query capability. • Evaluating window aggregate queries over streams is non-trivial: • Window extents are overlapping subsets of the stream. • Window definitions are often confused with physical stream properties. • Data can arrive out of order. • All of these hurt performance, in both execution time and memory usage.

  3. Introduction • High arrival rates, huge data volumes and real-time requirements make execution time and memory requirements critical. • Bursty, out-of-order arrival of data makes it very difficult to detect window extents. • It also leads to inaccurate results and higher latencies. • Hence the need for explicit window semantics.

  4. Introduction • Current problems: • Lack of explicit semantics. • Lack of implementation efficiency with respect to execution time and memory requirements. • Most implementations keep all active input tuples in memory, which increases memory usage. • Each tuple is reprocessed multiple times, once for every extent it belongs to. • Most implementations also assume that the input stream is ordered.

  5. Techniques • Window-ID (WID): • On-the-fly processing. • Does not keep tuples in memory. • No reprocessing of tuples. • Processes out-of-order tuples on the fly without sorting them. • Does not require the data stream to be ordered. • Uses punctuations to encode whatever ordering information is available. • Punctuation: • Handles out-of-order data arrival.

  6. Example 1 • Q1: SELECT seg-id, max(speed), min(speed) FROM Traffic [RANGE 300 seconds SLIDE 60 seconds WATTR ts] GROUP BY seg-id

  7. Example 1 [figure: tuple]

  8. Window Semantics • Previous work often describes window semantics operationally, which leads to confusion with physical properties of the stream. • Example: some window query operators process window extents sequentially, but data does not always arrive in window-extent order. In such cases a sorting mechanism, such as Aurora's BSort scheme, is used to order the data, which leads to high execution time and memory usage.

  9. Window Specification • Window specification: a window type and a set of parameters that define the window to be used by a query. • Example: RANGE, SLIDE and WATTR in Q1. • Different window aggregate queries have different window specifications: • Sliding window aggregate query. • Time-based sliding window query. • Row-based query. • Slide-by-tuple query. • Partitioned window query. • Windows defined using functions.

  10. Window Specification • Similar to CQL (Continuous Query Language). • Difference: user-specified WATTR and SLIDE parameters.

  11. Sliding Window Aggregate • Time-based: • Q1 • Row-based: • RANGE and SLIDE on different attributes:

  12. Sliding Window Aggregate • Partitioned Window Aggregate: • Using a function: a variation of Q3

  13. Window Semantic Framework • Defines window semantics using three functions that map between window-ids and tuples in both directions: windows, extent and wids. • T: a set of tuples. • S: a window specification. • windows(T,S): the set of window-ids that identify the window extents to which tuples in T may belong. • extent(w,T,S): the set of tuples in T belonging to the window extent identified by w.
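The two functions above can be sketched for a time-based window such as Q1. This is a minimal illustration, not the definitions from the slides: it assumes non-negative integer timestamps and assumes that window-id w names the half-open interval [w*SLIDE - RANGE, w*SLIDE). The names Spec, windows and extent are chosen here for illustration.

```python
from typing import Iterable, List, NamedTuple


class Spec(NamedTuple):
    """Window specification (hypothetical helper type)."""
    range_: int  # RANGE, in units of the windowing attribute
    slide: int   # SLIDE, in the same units


def windows(timestamps: Iterable[int], spec: Spec) -> List[int]:
    """Window-ids whose extents may contain some tuple in T.

    Assumption: window w covers [w*slide - range, w*slide).
    """
    ts = list(timestamps)
    if not ts:
        return []
    first = min(ts) // spec.slide + 1            # smallest w with w*slide > min(ts)
    last = (max(ts) + spec.range_) // spec.slide  # largest w with w*slide - range <= max(ts)
    return list(range(first, last + 1))


def extent(w: int, timestamps: Iterable[int], spec: Spec) -> List[int]:
    """Tuples (represented by their WATTR value) that fall in window w."""
    lo, hi = w * spec.slide - spec.range_, w * spec.slide
    return [t for t in timestamps if lo <= t < hi]
```

With Spec(300, 60) and timestamps 30, 90 and 310, window 2 would contain 30 and 90, while window 6 would contain 90 and 310.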

  14. windows, extent • queries in which RANGE and SLIDE are specified on the WATTR attribute: • slide-by-tuple:

  15. slide-by-n_tuples: • slide-by-n_tuples over logical order: • partitioned tuple-based:

  16. Mapping Tuples to Window-ids • wids: function that identifies the window extents to which a tuple t belongs. • Queries in which RANGE and SLIDE are specified on the WATTR attribute: • Slide-by-tuple (and variations):

  17. Partitioned tuple-based: r=rank(t,row-num,PATTR,T)
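For the time-based case, wids can be computed from a tuple's WATTR value alone, with no knowledge of other tuples. The sketch below uses the Q1 parameters and assumes, as an illustration rather than as the slides' formula, that window w covers [w*SLIDE - RANGE, w*SLIDE) and that timestamps are non-negative integers. The helper in_extent is hypothetical and only serves to check that the two directions of the mapping agree.

```python
RANGE, SLIDE = 300, 60  # parameters of Q1, in seconds


def wids(ts: int) -> range:
    """Window-ids of all extents that contain a tuple with timestamp ts.

    Solves w*SLIDE - RANGE <= ts < w*SLIDE for integer w.
    """
    return range(ts // SLIDE + 1, (ts + RANGE) // SLIDE + 1)


def in_extent(w: int, ts: int) -> bool:
    """Membership test in the other direction (window-id to tuple)."""
    return w * SLIDE - RANGE <= ts < w * SLIDE
```

Under these assumptions each tuple belongs to RANGE / SLIDE = 5 windows; for example, a tuple with ts = 30 maps to window-ids 1 through 5.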

  18. Towards Window Query Evaluation • Backward-context • Given a tuple t, its backward-context is information about tuples that have arrived before t. • Example: partitioned tuple-based window. • Forward-context • Given a tuple t, its forward-context is information about tuples that have arrived after t. • Example: slide-by-tuple. • FCF (forward-context free) • FCA (forward-context aware)

  19. Disorder • Causes: merging unsynchronized streams, network delays. • Example: network flow records sometimes use the start time as the timestamp. • Methods for handling disorder: slack, BSort, heartbeats.
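The buffering-based methods listed above share one idea: hold back a bounded number of tuples so that they can be released in order. The following is a generic bounded reorder buffer in the spirit of slack, written as an assumption-laden sketch and not as Aurora's actual implementation. The function name slack_reorder and the policy of dropping tuples that arrive too late are choices made here for illustration.

```python
import heapq
from typing import Iterable, Iterator


def slack_reorder(stream: Iterable[int], slack: int) -> Iterator[int]:
    """Emit timestamps in non-decreasing order using a buffer of `slack` items.

    A tuple that arrives after a larger timestamp has already been
    emitted cannot be placed in order any more and is dropped.
    """
    heap: list = []
    last_emitted = None
    for ts in stream:
        if last_emitted is not None and ts < last_emitted:
            continue  # arrived too late for the chosen slack: drop it
        heapq.heappush(heap, ts)
        if len(heap) > slack:
            last_emitted = heapq.heappop(heap)
            yield last_emitted
    while heap:  # end of stream: flush the buffer in order
        yield heapq.heappop(heap)
```

The trade-off this makes visible is the one the WID approach tries to avoid: a larger slack tolerates more disorder but costs more memory and latency, and any fixed slack can still lose late tuples.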

  20. FCF Windows with WID Approach • Building blocks: the wids function and punctuation. • Punctuation: a message embedded in a data stream indicating that a certain subset of the data is complete. WID uses punctuations to signal the end of window extents.
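A rough sketch of how window-ids and punctuations can work together for Q1 follows. It is not the authors' operator: it assumes window w covers [w*SLIDE - RANGE, w*SLIDE), assumes a punctuation of the form "all tuples with ts below a bound have arrived", and uses a hypothetical class name WidAggregator. Tuples are folded into per-window partial aggregates as they arrive, so no input tuple is buffered and no sorting is needed.

```python
RANGE, SLIDE = 300, 60  # parameters of Q1, in seconds


def wids(ts: int) -> range:
    """Window-ids of the extents containing timestamp ts (assumed layout)."""
    return range(ts // SLIDE + 1, (ts + RANGE) // SLIDE + 1)


class WidAggregator:
    """Keeps one [max, min] pair per (window-id, seg-id) group."""

    def __init__(self) -> None:
        self.state: dict = {}

    def on_tuple(self, seg_id: str, speed: int, ts: int) -> None:
        # Arrival order does not matter: each tuple only updates the
        # partial aggregates of the windows it belongs to.
        for w in wids(ts):
            cur = self.state.get((w, seg_id))
            if cur is None:
                self.state[(w, seg_id)] = [speed, speed]
            else:
                cur[0] = max(cur[0], speed)
                cur[1] = min(cur[1], speed)

    def on_punctuation(self, ts_bound: int) -> list:
        """Punctuation: every tuple with ts < ts_bound has arrived.

        Windows ending at or before ts_bound are complete, so their
        results are emitted and their state is released.
        """
        closed = sorted(k for k in self.state if k[0] * SLIDE <= ts_bound)
        return [(w, seg, *self.state.pop((w, seg))) for w, seg in closed]
```

For instance, after tuples ("a", 50, ts=70), ("a", 60, ts=30) and ("a", 40, ts=130) arrive in that order, a punctuation with bound 120 would close windows 1 and 2 and leave the later windows open.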

  21. FCA Windows with WID Approach • FCB (forward-context bounded) • FCU (forward-context unbounded)

  22. Performance • Environment: • Data generator: XMark data generator, and network analysis tool. • 1. data in generated order. • 2. data in bounded-disorder • 3. data in block-sorted-disorder. • Comparison: buffering mechanism.

  23. Result • WID vs. buffering

  24. Conclusion • Continuing with the larger picture: • The issues, viewed on a broader base. • Approaches to solving the problem. • A few examples that illustrate the problems and solutions.

  25. Issues • Managing continuous data streams is a bottleneck for many systems, such as financial data systems and auction systems. • Current systems for evaluating sliding-window aggregate queries buffer each input tuple until it is no longer needed. • Each tuple is accessed multiple times, once for each window that it participates in.

  26. Issues Contd… • This approach has two problems: • The required buffer size is unbounded. • Processing each tuple multiple times leads to high computation cost.

  27. Approaches • An approach that reduces both space and computation time for query execution. • It divides the stream into disjoint panes and calculates sub-aggregates over each pane. • This gives significant performance benefits.

  28. Contd… • The new technique reduces the required buffer size by sub-aggregating the input stream, and reduces the cost of computing window aggregates by sharing those sub-aggregates.

  29. Sliding Window Tuples

  30. Semantics • To evaluate a sliding-window aggregate query using panes, the query is decomposed into two sub-queries: • A pane-level sub-query (PLQ), which is a tumbling-window aggregate query that separates the input stream into non-overlapping panes. • A window-level sub-query (WLQ), which is a sliding-window query over the result of the PLQ and returns the window aggregate.
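The PLQ/WLQ decomposition can be sketched for a max aggregate as follows. Two things here are assumptions rather than content of the slides: the pane size is taken to be gcd(RANGE, SLIDE), so that every window is an exact union of panes, and window w is taken to cover [w*SLIDE - RANGE, w*SLIDE). The function names are illustrative.

```python
from math import gcd
from typing import Dict, Iterable, Tuple


def pane_level(tuples: Iterable[Tuple[int, int]], pane: int) -> Dict[int, int]:
    """PLQ: tumbling-window max, one result per non-empty pane.

    tuples are (ts, value) pairs; pane p covers [p*pane, (p+1)*pane).
    """
    panes: Dict[int, int] = {}
    for ts, value in tuples:
        p = ts // pane
        panes[p] = max(value, panes.get(p, value))
    return panes


def window_level(panes: Dict[int, int], pane: int,
                 range_: int, slide: int) -> Dict[int, int]:
    """WLQ: sliding-window max over pane results instead of raw tuples."""
    out: Dict[int, int] = {}
    if not panes:
        return out
    first_w = min(panes) * pane // slide + 1
    last_w = (max(panes) * pane + range_) // slide
    for w in range(first_w, last_w + 1):
        lo = (w * slide - range_) // pane  # first pane of window w
        hi = w * slide // pane             # one past its last pane
        vals = [panes[p] for p in range(lo, hi) if p in panes]
        if vals:
            out[w] = max(vals)
    return out


def evaluate(tuples, range_: int, slide: int) -> Dict[int, int]:
    pane = gcd(range_, slide)  # assumed pane size
    return window_level(pane_level(tuples, pane), pane, range_, slide)
```

Each input tuple is touched once, in pane_level; the overlap between windows is paid for only at the granularity of panes.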

  31. Evaluation

  32. Details • Two types of aggregates affect the evaluation of sliding-window aggregates: • Holistic: for a sub-aggregate function L, there is no constant bound on the size of the storage needed to store the result of L. • Differential, of two types: • Full-differential: the sub-aggregate fits in bounded storage. • Pseudo-differential: the sub-aggregate cannot be stored within a constant bound, as in heavy-hitter queries.

  33. Contd… • Panes for Holistic Aggregates: • Although there is no constant bound on buffer size, in many cases panes reduce the amount of buffer space needed. • Using a hashtable in the PLQ improves overall performance: each hashtable entry is shared between multiple windows, thereby reducing computation cost.

  34. Example • The PLQ maintains a hashtable of (item-id, count) entries. • Non-empty hashtable entries are output. • The WLQ buffers each hashtable entry to update the sketches. • Using panes, the PLQ compresses all the bids into a single hashtable entry to reduce storage space.
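The hashtable-based PLQ in this example can be sketched with one counter per pane. This is an illustration only: it assumes bids are (ts, item-id) pairs, keeps exact counts rather than the sketches mentioned above, and uses hypothetical function names.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pane_counts(bids: Iterable[Tuple[int, str]], pane: int) -> Dict[int, Counter]:
    """PLQ: one (item-id -> count) hashtable per pane.

    All bids on the same item within a pane collapse into one entry.
    """
    panes: Dict[int, Counter] = {}
    for ts, item_id in bids:
        panes.setdefault(ts // pane, Counter())[item_id] += 1
    return panes


def window_counts(panes: Dict[int, Counter], last_pane: int,
                  panes_per_window: int) -> Counter:
    """WLQ: merge the hashtables of the panes that make up one window."""
    total: Counter = Counter()
    for p in range(last_pane - panes_per_window + 1, last_pane + 1):
        total.update(panes.get(p, Counter()))
    return total
```

The most frequently bid item of a window could then be read off with window_counts(...).most_common(1); the same pane hashtables are reused by every window that overlaps them.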
