PIPES: A Resource Adaptive Data Stream Management System

Bernhard Seeger, Philipps-University Marburg, Germany. Research supported by the German Research Society (DFG) grant Se 553/4-2.


Presentation Transcript


  1. PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research Society (DFG) grant Se 553/4-2

  2. Input Output Information Landscape

  3. Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions

  4. Example Application • Traffic monitoring • Data format • Continuous dataflow → streams • Variable stream rates • Time + location dependence • Schema: HighwayStream(lane, speed, length, timestamp) • Queries • Continuous, long-running • Example: “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”

  5. Data Streams • Continuously Arriving Sequence of Records • time as an integral component • Autonomous Data Sources • sensors, mobile devices, software agents, … • Important Type of Data • miniaturization of hardware • ubiquitous networks …

  6. Requirements • Declarative Query Language • Expressive like (Temporal) SQL • join of data streams according to time • combination of data streams with persistent databases • assigns meaning to data • query results as a data stream • Publish/Subscribe Paradigm • Subscribe: users register new queries • Publish: continuous report of results • Quality of Service (QoS) • e.g. at least one record per second • Scalability • number of data sources • number of subscribed queries

  7. Stream Query Processing • Similar to Traditional DBMS • Queries expressed in CQL • SQL-like query language • Logical Query Plan • algebra with "relational" operators • Query Optimization • algebraic rules • simple, but accurate cost model • Physical Query Plan • select physical operators • Processing of the Query

  8. What is special about PIPES? • PIPES provides an Infrastructure for DSMS • DSMS = Data Stream Management System • PIPES = Public Infrastructure for Processing and Exploring Data Streams • Differences to DBMS • Semantics are borrowed from Temporal Databases → Expressiveness → Query Optimization • Data-Driven Query Processing → Publish/Subscribe • Adaptive Runtime Environment • Dynamic assignment of resources at runtime → scalability, QoS • Continuous Optimization of Queries • plan migration → scalability, QoS

  9. Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions

  10. 2. Sliding Windows • Requirement of Users • no impact of outdated data on the result • integration of different streams according to time → Moving Temporal Windows • Finite subsequence of an infinite stream • Query processing is restricted to the most recent data • Important for expressive and efficient query processing • Options • Count-based windows • FIFO queue of size w • Time-based windows • t: timestamp of an element • t + w + 1: end of the validity of an element
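The two window options on this slide can be sketched in a few lines of Python. This is only an illustration, not PIPES code: the function names `count_based_window` and `validity_end` are hypothetical, and timestamps are assumed to be plain integers in the same unit as w.

```python
from collections import deque


def count_based_window(stream, w):
    """Yield the window content after each arrival (FIFO queue of size w)."""
    window = deque(maxlen=w)  # the oldest element is dropped automatically
    for element in stream:
        window.append(element)
        yield list(window)


def validity_end(t, w):
    """End of validity of an element with timestamp t in a time-based window."""
    return t + w + 1
```

With `w = 2`, the stream `a, b, c` produces the windows `[a]`, `[a, b]`, `[b, c]`. For a time-based window, an element stamped `t = 8` with `w = 900` is valid until `909`.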

  11. Problem: Determinism • Data-driven Processing • Count-based Windows • w = 2 • Non-Determinism • Result of a query depends on scheduling • Example: Symmetric Join [Figure: symmetric join of the inputs a1, a2, a3 and b1, b2, b3 with count-based windows; different scheduling orders produce different sequences of result pairs such as a3b1, a3b2, a1b3, a2b3, a3b3]
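The scheduling dependence can be reproduced with a small Python sketch. It assumes a symmetric join whose predicate is always true and in which each arriving element is first inserted into its own count-based window and then probed against the other one. The function name `symmetric_join` and the string encoding of elements ("a1", "b2", ...) are hypothetical, and the schedules are not the ones drawn on the slide.

```python
from collections import deque


def symmetric_join(schedule, w=2):
    """Join two inputs under count-based windows of size w.

    schedule: arrival order of elements such as "a1" or "b2"; the first
    character names the input the element belongs to.
    """
    windows = {"a": deque(maxlen=w), "b": deque(maxlen=w)}
    results = []
    for element in schedule:
        side = element[0]
        other = "b" if side == "a" else "a"
        windows[side].append(element)      # insertion (may evict the oldest)
        for partner in windows[other]:     # probing the opposite window
            pair = (element, partner) if side == "a" else (partner, element)
            results.append(pair)
    return results


interleaved = symmetric_join(["a1", "b1", "a2", "b2", "a3", "b3"])
batched = symmetric_join(["a1", "a2", "a3", "b1", "b2", "b3"])
```

The interleaved schedule yields eight pairs, including (a1, b1); the batched schedule yields only six, because a1 has already been evicted when the first b element arrives. The inputs are identical, only the scheduling differs.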

  12. Temporal Windows in CQL • “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?” • SELECT sectionID FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID FROM HighwayStream1 [Range 15 minutes] UNION ALL … UNION ALL SELECT AVG(speed) AS avgSpeed, 20 AS sectionID FROM HighwayStream20 [Range 15 minutes] ) WHERE avgSpeed < 15;

  13. Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions

  14. 3. Query Processing in PIPES • Data Stream Model • Input Streams • Autonomous sources • Logical Streams • Semantics • Physical Streams • Implementation of the semantics, but more expressive

  15. Input Streams • Sequence of Records • Arbitrary, but fixed schema • No limitation to the relational model • Records with timestamps • Temporally ordered • Schema: HighwayStream(short lane, float speed, float length, Timestamp timestamp) • Input Stream: (5; 18.28; 5.27; 5:00:08) (2; 21.33; 4.62; 5:01:32) (4; 19.69; 9.97; 5:02:16) …

  16. Physical Stream • PIPES: Time Intervals instead of Points • Validity of an element e • Processing of e restricted to its time interval • Removal of invalid records • Sequence of tuples (e, [tS, tE)) • Ordered by tS and tE • Transformation: input stream → physical stream: ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09)) ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33)) ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17)) …

  17. Data Stream Operators • Window Operator • Relational Operator • "relational" algebra on data streams • projection • selection • Cartesian product • union • difference • temporal extension of operators

  18. Window Operator • Purpose • Extension of the validity of an element by w time units: [tS, tS+1) becomes [tS, tS+1+w) • Overlap of windows of elements → Elements need to be processed together • Window: w = 15 minutes • Sliding window: 15 minutes: (e1, [5:00:08, 5:15:09)) (e2, [5:01:32, 5:16:33)) (e3, [5:02:16, 5:17:17)) …
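The transformation into a physical stream and the window operator can be traced on the slide's own numbers. The following Python sketch is illustrative only: the names `hms`, `to_physical` and `window` are hypothetical, timestamps are assumed to be seconds since midnight, and the timestamp is assumed to be the last field of each record.

```python
def hms(h, m, s):
    """Convert a clock time to seconds since midnight (assumed time unit)."""
    return 3600 * h + 60 * m + s


def to_physical(records):
    """Input stream -> physical stream: record with timestamp t gets [t, t + 1)."""
    return [(r, (r[-1], r[-1] + 1)) for r in records]


def window(physical, w):
    """Window operator: extend the validity [tS, tE) of each element to [tS, tE + w)."""
    return [(e, (ts, te + w)) for e, (ts, te) in physical]


highway = [
    (5, 18.28, 5.27, hms(5, 0, 8)),
    (2, 21.33, 4.62, hms(5, 1, 32)),
    (4, 19.69, 9.97, hms(5, 2, 16)),
]
windowed = window(to_physical(highway), w=15 * 60)
```

The first element of `windowed` carries the interval from 5:00:08 to 5:15:09, the second from 5:01:32 to 5:16:33, which matches the intervals listed on the slide.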

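  19. Relational Stream Operators • Snapshot-Reducibility • Snapshot • Mapping of a physical stream to a non-temporal relation • Relation comprises all valid elements at point t [Diagram: the snapshots of the input streams S1, …, Sn are the relations R1, …, Rn; the relational stream operator maps S1, …, Sn to Sout, the relational operator maps R1, …, Rn to Rout, and the snapshot of Sout is Rout]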
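Snapshot-reducibility can be checked mechanically for a simple operator. The Python sketch below uses a filter (selection) as the example; the names `snapshot`, `stream_filter` and `relational_filter` are hypothetical, and relations are modelled as lists so that duplicates are kept.

```python
def snapshot(physical, t):
    """Non-temporal relation of all elements of a physical stream valid at t."""
    return [e for e, (ts, te) in physical if ts <= t < te]


def stream_filter(physical, predicate):
    """Stream version of selection: intervals are passed through unchanged."""
    return [(e, interval) for e, interval in physical if predicate(e)]


def relational_filter(relation, predicate):
    """Ordinary selection on a non-temporal relation."""
    return [e for e in relation if predicate(e)]


stream = [(("x", 10), (0, 5)), (("y", 20), (3, 8)), (("z", 5), (6, 9))]
fast = lambda e: e[1] >= 10

# Snapshot-reducibility: filtering the stream and then taking the snapshot
# gives the same relation as taking the snapshot and then filtering it.
reducible = all(
    snapshot(stream_filter(stream, fast), t)
    == relational_filter(snapshot(stream, t), fast)
    for t in range(0, 10)
)
```

Here `reducible` evaluates to `True`. The same check could be written for other operators, which then need their own interval handling (for a join, for instance, the intersection of the input intervals).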

  20. Query Optimization • Application of Well-known Rules from Temporal Databases • Slivinskas, Jensen, Snodgrass (ICDE 2000): Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering → many rules directly applicable to streams • conventional + temporal rules → Basis for Effective Query Optimization

  21. Steps • 1) Query: “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?” SELECT sectionID FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID FROM HighwayStream1 [Range 15 minutes] UNION ALL … UNION ALL SELECT AVG(speed) AS avgSpeed, 20 AS sectionID FROM HighwayStream20 [Range 15 minutes] ) WHERE avgSpeed < 15; • 2) Logical Query Plan (from the root of the plan down to the sources) • Map: projection on sectionID • Filter: avgSpeed < 15 • Union: merge of data streams • Aggregation: average speed (avgSpeed) • Map: projection on speed, assigning sectionID • Window: w = 15 minutes • 3) Query Optimization • 4) Physical Query Plan

  22. Physical Operators • Stateless Operators • Processing of an element is independent of the previous ones • Examples: filter, map • Stateful Operators • Processing of an element depends on previous elements • Restricted to elements in the sliding window • Explicit management of status • Examples: join, aggregation

  23. Data-driven Joins • Input • streams A and B and sliding window of size w • join predicate P • Output • records ((a,b), [tS,tE)) • P(a,b) • overlapping intervals of a and b [Figure: the interval [tS, tE) of the result (a,b) is the overlap of the intervals of a and b]

  24. Methodology • Adaptation of Sweepline Technique • tA = start time of last element of A • tB = start time of last element of B • Status for each input • Status of A: elements of A with end time ≥ tB • Status of B: elements of B with end time ≥ tA • Continuous Processing • insertion • probing & reorganisation [Figure: inputs A and B feeding StatusA and StatusB]
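The join of slides 23 and 24 can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the PIPES operator: the function name `sweepline_join` is hypothetical, the two statuses are plain lists rather than efficient index structures, the window operator is assumed to have been applied already, and each input is assumed to arrive ordered by start time.

```python
def sweepline_join(schedule, predicate):
    """Data-driven join over two physical streams A and B.

    schedule: arrival order of (side, element, (ts, te)) with side "A" or "B".
    Returns results ((a, b), (ts, te)) whose interval is the overlap of the
    intervals of a and b.
    """
    status = {"A": [], "B": []}
    results = []
    for side, e, (ts, te) in schedule:
        other = "B" if side == "A" else "A"
        # Reorganisation: ts is now the latest start time of `side`, so the
        # opposite status keeps only elements with end time >= ts (slide 24).
        status[other] = [(x, (xs, xe)) for x, (xs, xe) in status[other] if xe >= ts]
        # Insertion into the own status.
        status[side].append((e, (ts, te)))
        # Probing against the opposite status.
        for x, (xs, xe) in status[other]:
            a, b = (e, x) if side == "A" else (x, e)
            lo, hi = max(ts, xs), min(te, xe)
            if lo < hi and predicate(a, b):  # non-empty overlap and P(a, b)
                results.append(((a, b), (lo, hi)))
    return results


arrivals = [("A", "a1", (0, 10)), ("B", "b1", (5, 15)), ("A", "a2", (20, 30))]
joined = sweepline_join(arrivals, lambda a, b: True)
```

In this run `joined` contains the single result (("a1", "b1"), (5, 10)). When a2 arrives with start time 20, b1 (end time 15) is removed from the status of B, so the memory held by the join stays bounded by the window size.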

  25. Runtime Environment of PIPES [Figure: a query graph inside PIPES connects the sources to the sinks]

  26. Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions

  27. 4. Plan Migration • Re-Optimization of Query Plans at Runtime • Identification of poorly performing subgraphs in the query graph • Plan Migration • Substitution of the old plan by a new one • Requirements • Preservation of snapshot reducibility • Continuous production of results • Short migration time

  28. Example [Figure: query graph with the sources R, S, T, U and the sinks C1, C2]

  29. Semantic Problems • Duplicates • Parallel insertion of new elements into both plans • Loss of Results • Exclusive insertion of new elements into the new plan

  30. Split Approach in PIPES • Assumptions • Streams A and B • Window of length w • Equivalent query plans Pold and Pnew • Earliest split time • tsplit = max{tA, tB} + w • Splitting of the input at split time tsplit

  31. Split Approach in PIPES • Production of Results • Acceptance of all results received from the old plan Pold • Selection of results received from the new plan Pnew • Acceptance only if start time > tsplit [Figure: the inputs A and B pass through Split operators that feed Pold and Pnew]
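The two rules stated on slides 30 and 31 (the earliest split time and the acceptance test for results) can be written down directly. The Python sketch below covers only these two rules. The names `split_time` and `merge_results` are hypothetical, and how the Split operators route input elements to the two plans is not modelled, since the slides do not spell it out.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: t_split = max{tA, tB} + w."""
    return max(t_a, t_b) + w


def merge_results(old_results, new_results, t_split):
    """Combine the outputs of both plans during a migration.

    Every result of the old plan is accepted; a result of the new plan is
    accepted only if its start time is greater than t_split.
    """
    accepted = list(old_results)
    accepted += [(e, (ts, te)) for e, (ts, te) in new_results if ts > t_split]
    return accepted


t_split = split_time(t_a=100, t_b=120, w=900)
merged = merge_results(
    old_results=[("r1", (1000, 1010))],
    new_results=[("r2", (1020, 1030)), ("r3", (1021, 1040))],
    t_split=t_split,
)
```

With these illustrative numbers, `t_split` is 1020 and `merged` keeps r1 and r3; r2 is rejected because its start time is not greater than the split time.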

  32. Properties • Method is broadly applicable • Arbitrary plans • Many data streams • Different window sizes • Migration Time • Worst-case: w time units

  33. Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions

  34. 5. Conclusions • Applications • Traffic management • Alarm systems • Observation of production lines • Basic ideas of stream processing in PIPES • Temporal Databases • Data-driven query processing • Adaptivity at runtime • Continuous Optimization at runtime • Dynamic Plan Migration • Broadly applicable approach

  35. Current Work • Problems • Cost models for optimization • New techniques • Strategies for adaptation • Memory • CPU • QoS • Runtime environment • Realtime applications • Real applications for DSMS • Observation of patients in hospitals • Processing of sensor data • Coupling of PIPES and commercial products

  36. Related Work • Abadi, Carney, Cetintemel et al.: Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003. • Arasu, Babu, and Widom: The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003. • Tucker, Maier, Sheard, and Fegaras: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003. • Law, Wang, and Zaniolo: Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.

  37. Papers on PIPES/XXL • Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE 2006. • Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe, Germany, March 2-4. • Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe, Germany, March 2-4. • Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005. • Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13, 2004. • Jürgen Krämer, Bernhard Seeger: PIPES - A Public Infrastructure for Processing and Exploring Data Streams, Proc. SIGMOD 2004 (Demo). • Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001.

  38. Future Work • Query optimization • Adequate cost model • Not only stream rates • Runtime statistics: delays, memory usage, etc. • Static query optimization • Multi query optimization • Subquery sharing • Dynamic query optimization • Detection of suitable subgraphs • Plan migration at runtime • Temporal aspects • Coalesce

  39. Thank you! Any questions? For more information check our website: http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES

  40. Reorganization • Which elements can be discarded in internal data structures? Why? • Restriction of memory usage • All elements where tE ≤ min{tSj} • tSj: latest start timestamp of input stream j • Ordering invariant → no temporal overlap with future stream elements
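The discard rule on this slide translates into a one-line purge. The Python sketch below is illustrative only: the function name `reorganize` is hypothetical and the operator state is assumed to be a plain list of (element, (tS, tE)) tuples.

```python
def reorganize(state, latest_starts):
    """Drop every stored element whose end time tE is <= min{tSj}.

    latest_starts: latest start timestamp seen on each input stream j.
    Because every input is ordered by start time, no future element can start
    before min(latest_starts), so the dropped elements cannot overlap with it.
    """
    bound = min(latest_starts)
    return [(e, (ts, te)) for e, (ts, te) in state if te > bound]


state = [("a", (0, 5)), ("b", (2, 9)), ("c", (4, 12))]
state = reorganize(state, latest_starts=[9, 10])
```

After the call only "c" remains, since the intervals of "a" and "b" end at or before 9, the smallest of the latest start timestamps.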

  41. Aggregation • Incremental computation • Efficient implementation • Aggregation segment-tree • Amortized logarithmic costs per element [Figure: Example: Sum; insertion of a new element into the current state (aggregates) over the time axis T, followed by reorganization; values shown: 4, 9, 5, 4, 5, 2, 3, 7]

  42. Outline • Motivation and problem definition • Query formulation • Our temporal approach • Stream types • Logical query plans • Query optimization • Physical query plans • Query execution • Exploration of Data Streams • Conclusions

  43. Exploration of Data Streams • Example • Estimation of the selectivity of continuous range queries at runtime: select * from Stream S where S.measure between min and max • Our Approach • Exploit the density p of the distribution • Represents all information about the distribution • Suitable for estimating the selectivity of multiple queries

  44. Requirements • Problem • Density is unknown • Adaptation of a non-parametric density estimation technique • Kernels • Wavelets • Sampling and CDF • Requirements • Low resource consumption (memory and CPU) • Memory and CPU adaptive • Increasing memory size → higher accuracy • Valid estimation at each point in time • Adapt to a changing distribution

  45. Reservoir Sampling • CDF is built on top of the iid samples • Disadvantages • Estimation relies on a few elements • No advantage from an increasing memory • Advantage • Low processing overhead
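A minimal Python sketch of this baseline follows, assuming the classic reservoir algorithm (Algorithm R) for the sample and an empirical CDF for the range selectivity. The names `reservoir_sample` and `estimate_selectivity` are hypothetical; this is not the implementation evaluated in the talk.

```python
import random
from bisect import bisect_left, bisect_right


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of size k over a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)           # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform over 0..i, both ends inclusive
            if j < k:
                sample[j] = x          # replace with probability k / (i + 1)
    return sample


def estimate_selectivity(sample, lo, hi):
    """Fraction of sampled values in [lo, hi], read off the empirical CDF."""
    ordered = sorted(sample)
    inside = bisect_right(ordered, hi) - bisect_left(ordered, lo)
    return inside / len(ordered)


sample = reservoir_sample(range(10_000), k=100, rng=random.Random(0))
selectivity = estimate_selectivity(sample, 2_000, 4_000)
```

Each arriving element costs constant time, which reflects the low processing overhead named on the slide. The estimate rests on only k values, however, and sorting the sample per query is a simplification made for this sketch.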

  46. Blockwise Estimation • Stream is transformed into blocks • For simplicity: blocks are of the same size • Idea • Estimation of the first k blocks is available • Compute the estimation of k+1 blocks iteratively • Example (Average) • Generalization for density functions • Straightforward Extension • Problem: Violates the requirement of limited memory
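For the average, the iterative step from k to k+1 blocks can be sketched as below. The Python function name `blockwise_average` is hypothetical, and blocks are assumed to be non-empty and of equal size, as on the slide.

```python
def blockwise_average(blocks):
    """Yield the average over the first k+1 blocks from the estimate for k blocks.

    With equal block sizes: avg_{k+1} = (k * avg_k + blockavg_{k+1}) / (k + 1).
    """
    estimate = 0.0
    for k, block in enumerate(blocks):
        block_avg = sum(block) / len(block)
        estimate = (k * estimate + block_avg) / (k + 1)
        yield estimate


running = list(blockwise_average([[1, 3], [5, 7], [9, 11]]))
```

`running` is `[2.0, 4.0, 6.0]`, the averages over the first one, two and three blocks. For the average a single number carries the whole estimate; the memory problem named on the slide concerns the generalization to density functions.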

  47. Cumulative-Compressed Estimation • Compression • Cubic splines • Weighting strategies • Amortized cost for updates • O(log M)

  48. Experimental Comparison • Streaming data from a real traffic data set • Arithmetic weights • Memory size: 5000
