PIPES: A Resource Adaptive Data Stream Management System Bernhard Seeger Philipps-University Marburg, Germany Research supported by the German Research Society (DFG) grant Se 553/4-2
Information Landscape [Figure: information landscape with input and output]
Outline • Motivation and problem definition • Sliding Windows • Query Processing in PIPES • Data Stream Model • Logical Operators • Algebraic Query Optimization • Physical Operators • Runtime Environment • Dynamic Plan Migration • Conclusions
Example Application • Traffic monitoring • Data format: HighwayStream(lane, speed, length, timestamp) • Continuous dataflow streams • Variable stream rates • Time + location dependence • Queries • Continuous, long-running • Example: “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
Data Streams • Continuously Arriving Sequence of Records • time as an integral component • Autonomous Data Sources • sensors, mobile devices, software agents, … • Important Type of Data • miniaturization of hardware • ubiquitous networks
Requirements • Declarative Query Language • Expressive like (Temporal) SQL • join of data streams according to time • combination of data streams with persistent databases • assigns meaning to data • query results as a data stream • Publish/Subscribe Paradigm • Subscribe: users register new queries • Publish: continuous report of results • Quality of Service (QoS) • e.g. at least one record per second • scalability • number of data sources • number of subscribed queries
Stream Query Processing • Similar to Traditional DBMS • Queries expressed in CQL • SQL-like query language • Logical Query Plan • algebra with “relational” operators • Query Optimization • algebraic rules • simple, but accurate cost model • Physical Query Plan • select physical operators • Processing of the Query
What is special about PIPES? • PIPES provides an Infrastructure for DSMS • DSMS = Data Stream Management System • PIPES = Public Infrastructure for Processing and Exploring Data Streams • Differences to DBMS • Semantics is borrowed from Temporal Databases → expressiveness, query optimization • Data-Driven Query Processing → publish/subscribe • Adaptive Runtime Environment • Dynamic assignment of resources at runtime → scalability, QoS • Continuous Optimization of Queries • plan migration → scalability, QoS
2. Sliding Windows • Requirement of Users • no impact of outdated data on the result • integration of different streams according to time → Moving Temporal Windows • Finite subsequence of an infinite stream • Query processing is restricted to the most recent data • Important for expressive and efficient query processing • Options • Count-based windows • FIFO queue of size w • Time-based windows • t: timestamp of an element • t + w + 1: end of the validity of an element
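The two window options can be sketched as follows. This is only an illustration, assuming integer timestamps; the names `CountWindow` and `validity` are hypothetical and not part of PIPES.

```python
from collections import deque


class CountWindow:
    """Count-based window: a FIFO queue holding the w most recent elements."""

    def __init__(self, w):
        # deque with maxlen drops the oldest element automatically
        self.items = deque(maxlen=w)

    def insert(self, element):
        self.items.append(element)
        return list(self.items)


def validity(t, w):
    """Time-based window: an element stamped t is valid in [t, t + w + 1)."""
    return (t, t + w + 1)


win = CountWindow(2)
win.insert("a1")
win.insert("a2")
contents = win.insert("a3")  # "a1" has been evicted
```

The count-based window forgets elements by arrival order, while the time-based variant attaches the expiry to the element itself, independent of when other elements are processed.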
Problem: Determinism • Data-driven Processing • Count-based Windows • w = 2 • Non-Determinism • Result of a query depends on scheduling • Example: Symmetric Join [Figure: symmetric join of inputs a1, a2, a3 and b1, b2, b3 with count-based windows; the set of result pairs differs with the order in which the elements are scheduled]
Temporal Windows in CQL • “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
    SELECT sectionID
    FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
           FROM HighwayStream1 [Range 15 minutes]
           UNION ALL
           …
           UNION ALL
           SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
           FROM HighwayStream20 [Range 15 minutes] )
    WHERE avgSpeed < 15;
3. Query Processing in PIPES • Data Stream Model • Input Streams • Autonomous Source • Logical Streams • Semantics • Physical Streams • Implementation of the Semantics, but more expressive
Input Streams • Sequence of Records • Arbitrary, but fixed schema • No limitation to the relational model • Records with timestamps • Temporally ordered • Schema: HighwayStream(short lane, float speed, float length, Timestamp timestamp) • Input Stream: (5; 18.28; 5.27; 5:00:08) (2; 21.33; 4.62; 5:01:32) (4; 19.69; 9.97; 5:02:16) …
Physical Stream • PIPES: Time Intervals instead of Points • Validity of an element e • Processing of e restricted to its time interval • Removal of invalid records • Sequence of tuples (e, [tS, tE)) • Ordered by tS and tE • Transformation: input stream → physical stream: ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09)) ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33)) ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17)) …
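The transformation from input stream to physical stream can be sketched as below. It assumes timestamps are stored as seconds since midnight, that the timestamp is the last field of each record, and that one time unit is one second; `hms` and `to_physical` are hypothetical helper names.

```python
def hms(h, m, s):
    """Seconds since midnight for a clock time h:m:s."""
    return h * 3600 + m * 60 + s


def to_physical(records):
    """Attach the validity interval [t, t + 1) to every input record.

    Assumption: the timestamp t is the last field of the record.
    """
    for record in records:
        t = record[-1]
        yield record, (t, t + 1)


input_stream = [
    (5, 18.28, 5.27, hms(5, 0, 8)),
    (2, 21.33, 4.62, hms(5, 1, 32)),
    (4, 19.69, 9.97, hms(5, 2, 16)),
]
physical = list(to_physical(input_stream))
```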
Data Stream Operators • Window Operator • Relational Operator • “relational” algebra on data streams • projection • selection • Cartesian product • union • difference • temporal extension of operators
Window Operator • Purpose • Extension of the validity of an element by w time units: [tS, tS+1) becomes [tS, tS+1+w) • Overlap of windows of elements → elements need to be processed together • Window: w = 15 minutes • Sliding window: 15 minutes: (e1, [5:00:08, 5:15:09)) (e2, [5:01:32, 5:16:33)) (e3, [5:02:16, 5:17:17)) …
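A minimal sketch of the window operator, assuming timestamps in seconds since midnight and a physical stream of (element, (tS, tE)) pairs; the function names are illustrative only.

```python
def hms(h, m, s):
    """Seconds since midnight for a clock time h:m:s."""
    return h * 3600 + m * 60 + s


def window(physical_stream, w):
    """Extend every validity interval [tS, tE) by w time units to [tS, tE + w)."""
    for element, (t_s, t_e) in physical_stream:
        yield element, (t_s, t_e + w)


w = 15 * 60  # 15 minutes in seconds
stream = [
    ("e1", (hms(5, 0, 8), hms(5, 0, 9))),
    ("e2", (hms(5, 1, 32), hms(5, 1, 33))),
]
windowed = list(window(stream, w))
```

With an input interval of [tS, tS+1), the output interval is [tS, tS+1+w), which matches the example values on the slide.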
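Relational Stream Operators • Snapshot-Reducibility • Snapshot • Mapping of a physical stream to a non-temporal relation • Relation comprises all valid elements at point t • [Figure: commuting diagram. A relational stream operator maps the streams S1, …, Sn to Sout; the corresponding relational operator maps their snapshots R1, …, Rn to Rout, which is the snapshot of Sout]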
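Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below does this for a filter only, on a small hand-made stream with integer timestamps; all names are hypothetical and the check is an illustration, not a proof.

```python
def snapshot(stream, t):
    """Non-temporal relation (duplicates kept) of all elements valid at time t."""
    return [e for e, (t_s, t_e) in stream if t_s <= t < t_e]


def stream_filter(stream, predicate):
    """Filter on a physical stream: intervals are passed through unchanged."""
    return [(e, iv) for e, iv in stream if predicate(e)]


def relational_filter(relation, predicate):
    """Ordinary selection on a non-temporal relation."""
    return [e for e in relation if predicate(e)]


def is_snapshot_reducible(stream, predicate, time_points):
    """Snapshot of the stream result must equal the relational result on the snapshot."""
    return all(
        sorted(snapshot(stream_filter(stream, predicate), t))
        == sorted(relational_filter(snapshot(stream, t), predicate))
        for t in time_points
    )


speeds = [(18.28, (0, 10)), (12.5, (5, 15)), (14.0, (8, 20))]
ok = is_snapshot_reducible(speeds, lambda v: v < 15, range(0, 21))
```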
Query Optimization • Application of Well-known Rules from Temporal Databases • Slivinskas, Jensen, Snodgrass (ICDE 2000) • Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering → many rules directly applicable to streams • conventional + temporal rules → Basis for Effective Query Optimization
Steps • 1) Query • 2) Logical Query Plan • 3) Query Optimization • 4) Physical Query Plan • Query: “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
    SELECT sectionID
    FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
           FROM HighwayStream1 [Range 15 minutes]
           UNION ALL
           …
           UNION ALL
           SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
           FROM HighwayStream20 [Range 15 minutes] )
    WHERE avgSpeed < 15;
Logical query plan, from the sink down to the sources • Map: projection on sectionID • Filter: avgSpeed < 15 • Union: merge of data streams • Aggregation: average speed (avgSpeed) • Map: projection on speed, assigning sectionID • Window: w = 15 minutes
Physical Operators • Stateless Operators • Processing of an element is independent of the previous ones • Examples: filter, map • Stateful Operators • Processing of an element depends on previous elements • Restricted to elements in the sliding window • Explicit management of state • Examples: join, aggregation
Data-driven Joins • Input • streams A and B and sliding window of size w • join predicate P • Output • records ((a,b), [tS,tE)) • P(a,b) holds • [tS,tE) is the overlap of the intervals of a and b
Methodology • Adaptation of Sweepline Technique • tA = start time of last element of A • tB = start time of last element of B • Status for each input (StatusA, StatusB) • Status of A: elements of A with end time ≥ tB • Status of B: elements of B with end time ≥ tA • Continuous Processing • insertion • probing & reorganisation
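The sweepline methodology can be sketched as a symmetric join with one status list per input. This is a simplified illustration and not the PIPES implementation: it assumes each input arrives ordered by start time, uses plain lists instead of an index, and drops elements whose end time is at or before the latest start time of the other input, since half-open intervals [tS, tE) can no longer overlap at that point. The class name `SweepJoin` is hypothetical.

```python
class SweepJoin:
    """Symmetric data-driven join over (element, (tS, tE)) pairs."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.status = {"A": [], "B": []}

    def process(self, side, element, interval):
        other = "B" if side == "A" else "A"
        t_s, t_e = interval
        # Reorganisation: elements of the other input that end at or before
        # t_s cannot overlap this or any later element of this input.
        self.status[other] = [
            (e, iv) for e, iv in self.status[other] if iv[1] > t_s
        ]
        # Insertion into the status of the element's own input.
        self.status[side].append((element, interval))
        # Probing: join with overlapping elements of the other input.
        results = []
        for o, (o_s, o_e) in self.status[other]:
            pair = (element, o) if side == "A" else (o, element)
            lo, hi = max(t_s, o_s), min(t_e, o_e)
            if lo < hi and self.predicate(*pair):
                results.append((pair, (lo, hi)))
        return results


join = SweepJoin(lambda a, b: a == b)
r1 = join.process("A", 1, (0, 10))
r2 = join.process("B", 1, (5, 15))
r3 = join.process("A", 1, (20, 30))
```

Each result carries the intersection of the two input intervals, so the output is again a physical stream.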
Runtime Environment of PIPES [Figure: query graph inside PIPES, between sources and sinks]
4. Plan Migration • Re-Optimization of Query Plans at Runtime • Identification of poorly performing subgraphs in the query graph • Plan Migration • Substitution of the old plan by a new one • Requirements • Preservation of snapshot-reducibility • Continuous production of results • Short migration time
Example [Figure: query graph between sources and sinks; labels R, S, T, U, C1, C2]
Semantic Problems • Duplicates • Parallel insertion of new elements into both plans • Loss of Results • Exclusive insertion of new elements into the new plan
Split Approach in PIPES • Assumptions • Streams A and B • Window of length w • equivalent query plans Pold and Pnew • Earliest split time • tsplit = max {tA, tB} + w • Splitting of the input at split time tsplit
Split Approach in PIPES • Production of Results • Acceptance of all results received from the old plan Pold • Selection of results received from the new plan Pnew • Acceptance only if start time > tsplit • [Figure: a Split operator routes the inputs A and B to Pold and Pnew]
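One possible reading of "splitting of the input at split time" is sketched below: each validity interval is cut at tsplit, the part before tsplit is routed to the old plan and the remainder to the new plan. This interpretation, the integer timestamps, and the names `split_time` and `split_element` are assumptions for illustration; the slides do not spell out the operator.

```python
def split_time(t_a, t_b, w):
    """Earliest split time: tsplit = max{tA, tB} + w."""
    return max(t_a, t_b) + w


def split_element(element, interval, t_split):
    """Cut [tS, tE) at t_split.

    Returns (part_for_old_plan, part_for_new_plan); either part is None
    when the interval lies entirely on the other side of t_split.
    """
    t_s, t_e = interval
    old_part = (element, (t_s, min(t_e, t_split))) if t_s < t_split else None
    new_part = (element, (max(t_s, t_split), t_e)) if t_e > t_split else None
    return old_part, new_part


t_split = split_time(100, 120, 900)
straddling = split_element("e", (1000, 1900), t_split)
before = split_element("e", (0, 10), t_split)
after = split_element("e", (2000, 2100), t_split)
```

Under this reading no snapshot is covered by both plans, which is what avoids duplicates, and no snapshot is left uncovered, which is what avoids lost results.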
Properties • Method is broadly applicable • Arbitrary plans • Many data streams • Different window sizes • Migration Time • Worst-case: w time units
5. Conclusions • Applications • Traffic management • Alarm systems • Observation of production lines • Basic ideas of stream processing in PIPES • Temporal Databases • Data-driven query processing • Adaptivity at runtime • Continuous Optimization at runtime • Dynamic Plan Migration • Broadly applicable approach
Current Work • Problems • Cost models for optimization • New techniques • Strategies for adaptation • Memory • CPU • QoS • Runtime environment • Realtime applications • Real applications for DSMS • Observation of patients in hospitals • Processing of sensor data • Coupling of PIPES and commercial products
Related Work • Abadi, Carney, Cetintemel et al. • Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003. • Arasu, Babu, and Widom • The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003. • Tucker, Maier, Sheard, and Fegaras • Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003. • Law, Wang, and Zaniolo • Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.
Papers on PIPES/XXL • Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems, will appear in Proc. ICDE 2006. • Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen, BTW 2005, Karlsruhe, Germany, March 2-4. • Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams, BTW 2005, Karlsruhe, Germany, March 2-4. • Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract), ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005. • Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen, Datenbank-Spektrum 11: 5-13 (2004). • Jürgen Krämer, Bernhard Seeger: PIPES - A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo). • Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001.
Future Work • Query optimization • Adequate cost model • Not only stream rates • Runtime statistics: delays, memory usage, etc. • Static query optimization • Multi query optimization • Subquery sharing • Dynamic query optimization • Detection of suitable subgraphs • Plan migration at runtime • Temporal aspects • Coalesce
Thank you! Any questions? For more information check our website: http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES
Reorganization • Which elements can be discarded in internal data structures? Why? • Restriction of memory usage • All elements where tE ≤ min{tSj} • tSj: latest start timestamp of input stream j • Ordering invariant → no temporal overlap with future stream elements
Aggregation • Incremental computation • Efficient implementation • Aggregation segment-tree • Amortized logarithmic costs per element • Example: Sum [Figure: timeline T showing the current state (aggregates), the insertion of a new element, and reorganization; values 4, 9, 5, 4, 5, 2, 3, 7]
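The idea of incremental temporal aggregation can be sketched with a plain list of time segments instead of a segment tree. This simplification costs linear time per insertion rather than the amortized logarithmic cost named on the slide; the function names and the integer timestamps are assumptions.

```python
def insert(segments, value, start, end):
    """Add `value`, valid in [start, end), to a temporal SUM.

    `segments` is a sorted list of disjoint (seg_start, seg_end, total)
    triples; a new list is returned with segments split at start and end.
    """
    points = sorted({start, end} | {p for s, e, _ in segments for p in (s, e)})
    updated = []
    for lo, hi in zip(points, points[1:]):
        # Partial aggregate already stored for this piece of the timeline.
        parts = [total for s, e, total in segments if s <= lo and hi <= e]
        if start <= lo and hi <= end:
            parts.append(value)
        if parts:
            updated.append((lo, hi, sum(parts)))
    return updated


def reorganize(segments, t):
    """Split off segments ending at or before t: no later element can change them."""
    final = [seg for seg in segments if seg[1] <= t]
    pending = [seg for seg in segments if seg[1] > t]
    return final, pending


state = []
state = insert(state, 4, 0, 10)
state = insert(state, 9, 5, 15)
final, state = reorganize(state, 5)
```

Here `t` is the start time of the most recent element; because the stream is ordered by start time, segments that end at or before `t` are final and can be emitted as results.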
Outline • Motivation and problem definition • Query formulation • Our temporal approach • Stream types • Logical query plans • Query optimization • Physical query plans • Query execution • Exploration of Data Streams • Conclusions
Exploration of Data Streams • Example • Estimation of selectivity during runtime of continuous range queries: select * from Stream S where S.measure between min and max • Our Approach • Exploit the density p of the distribution • Represents all information about the distribution • Suitable for estimating the selectivity of multiple queries
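The selectivity of a range query is the probability mass of the density p between min and max. A minimal sketch using a Gaussian kernel density estimate over a sample follows; the kernel choice, the fixed bandwidth, and the function names are assumptions for illustration, not the estimator used in PIPES.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selectivity(sample, lo, hi, bandwidth):
    """Estimated fraction of stream elements with lo <= measure <= hi.

    Integrates a Gaussian kernel density estimate over [lo, hi]: each
    sample point contributes the mass its kernel places in the range.
    """
    mass = sum(
        normal_cdf((hi - x) / bandwidth) - normal_cdf((lo - x) / bandwidth)
        for x in sample
    )
    return mass / len(sample)


sample = [10.0, 12.0, 14.0, 30.0]
estimate = selectivity(sample, 5.0, 20.0, bandwidth=1.0)
```

Because the estimate is derived from the density rather than from one query, the same sample can serve any number of range predicates.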
Requirement • Problem • Density is unknown • Adaptation of a non-parametric density estimation technique • Kernels • Wavelets • Sampling and CDF • Requirements • Low resource consumption (memory and CPU) • Memory and CPU adaptive • Increasing memory size → higher accuracy • Valid estimation at each point in time • Adapt to a changing distribution
Reservoir Sampling • CDF is built on top of the iid samples • Disadvantages • Estimation relies on a few elements • No advantage from an increasing memory • Advantage • Low processing overhead
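For reference, a sketch of reservoir sampling (the classic Algorithm R) with an empirical CDF on top of the sample. The function names and the optional `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)  # fill phase
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = x  # replace with probability k / n
    return reservoir


def ecdf(sample, x):
    """Empirical CDF: fraction of sampled values that are <= x."""
    return sum(1 for s in sample if s <= x) / len(sample)


def range_selectivity(sample, lo, hi):
    """Fraction of sampled values inside [lo, hi]."""
    return sum(1 for s in sample if lo <= s <= hi) / len(sample)


sample = reservoir_sample(range(1000), 50, random.Random(0))
```

Each arriving element costs one random number and at most one replacement, which is the low processing overhead noted above; the estimate, however, only ever rests on the k retained elements.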
Blockwise Estimation • Stream is transformed into blocks • For simplicity: blocks are of the same size • Idea • Estimation of the first k blocks is available • Compute the estimation of k+1 blocks iteratively • Example (Average) • Generalization for density functions • Straightforward Extension • Problem: Violates the requirement of limited memory
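The average example can be written as an iterative update: given the mean of the first k blocks, the mean of k+1 equally sized blocks needs only the new block. The block values below are arbitrary illustration data and `update_mean` is a hypothetical name.

```python
def update_mean(mean_k, k, block):
    """Combine the mean of the first k blocks with block k+1.

    Assumes all blocks have the same size, so every block mean
    receives the same weight.
    """
    block_mean = sum(block) / len(block)
    return (k * mean_k + block_mean) / (k + 1)


blocks = [[2, 4], [6, 8], [1, 3]]
mean = 0.0
for k, block in enumerate(blocks):
    mean = update_mean(mean, k, block)
```

Only the running estimate and the counter k are stored, so the average needs constant memory; the memory problem noted on the slide arises for the density extension, not for this scalar case.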
Cumulative-Compressed Estimation • Compression • Cubic splines • Weighting strategies • Amortized cost for updates • O(log M)
Experimental Comparison • Streaming data from a real traffic data set • Arithmetic weights • Memory size: 5000