480 likes | 1.11k Views
Stream Data Management System Prototypes. Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004. Outline. Motivation of DSMS Aurora (Brown, Brandeis, MIT) Model Operator Scheduling Storage/Memory Management QoS issue STREAM (Stanford) System Architecture
E N D
Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004
Outline • Motivation of DSMS • Aurora (Brown, Brandeis, MIT) • Model • Operator Scheduling • Storage/Memory Management • QoS issue • STREAM (Stanford) • System Architecture • Query Language • Query Plans and Execution • Performance Issues • Approximation Techniques • STREAM Interface • Conclusion
Motivation • HADP DAHP • Continuous data and static queries • Monitoring using sensor • Military • Traffic • Environment • Financial analysis • Object tracking
Aurora – Model • General Purpose DSMS • Continuous stream data comes • Flow through a set of operators • Output to application or materialized
Aurora – Model • Components • Storage manager • Scheduler • Load Shedder • Router • QoS Monitor • GUI
Aurora – Model • 3 kinds of query supported • Continuous • View • Ad-Hoc Query
Aurora – Model • 8 primitive operators (Box) • Windowed • Slide • Tumble • Latch • Resample • Non-windowed • Filter • Map • GroupBy • Join
Aurora – Operator Optimization • Each operator associated with • Selectivity: s(b), sel(b) • Computation time: c(b), cost(b) • General Optimization Techniques • Pushing projection upstream • Combining boxes • Reordering boxes
Aurora – Operator Optimization • Case 1 : cost of ab • c(a) + s(a)c(b) • Case 2: cost of ba • c(b) + s(b)c(a) • Criteria for switching box position • c(a)+s(a)c(b) > c(b)+s(b)c(a) a b b a
Aurora – Operator Scheduling • Scheduling by OS • One thread per box, shift the job to OS • Easier to program • Aurora Scheduler • Single thread for the scheduler • The scheduler pick a box with highest priority and call the box to consume tuples from queue • Allow finer control of resource • Scalable !
Aurora – Operator Scheduling • Problem: which box to execute next? • Min-Cost (MC) • Reduce computation cost • Min-Latency (ML) • Return result as soon as possible • Min-Memory (MM) • Reduce memory usage of queue
Aurora – Operator Scheduling • Example b4 b2 streams application b5 b3 b1 b6 Downstream
Aurora – Operator Scheduling • Min-Cost • Objective: avoid overhead of calling boxes • Min-Latency • Prefer box which can produce tuples in the output at a shorter period of time • Min-Memory • Give preference to box which will consume more tuples with less computation time • Similar to “Chain Operator Scheduling” • More at:Operator Scheduling in a Data Stream Manager, VLDB 2003
Aurora – Storage/Memory Management • Manage the queue in front of each box • 2 boxes sharing the same queue • windowed operator • The initial queue size is 128 KB • Queues are managed as a circular queue • If overflow, double the queue size, or vice versa
Aurora – Storage/Memory Management • Swap in/out between memory / disk based on priority of boxes using it • Work with Operator Scheduler to exchange box priority and buffer-state information • Connection Point Management • A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query
Aurora – QoS Issue • Different queries/applications have different QoS requirement • Stock market monitoring • Average temperature of a set of sensor • QoS Graph
Latency-based QoS Graph Critical Point QoS cost(D(b)) est(b) 0 time eol(b) latency(b) b D(b)
Aurora – QoS-driven Scheduling • Assign priority to each box based on • priority (b) = [utility (b), est (b)] • utility (b) = gradient (eol (b)) • How is the QoS degrading by the time the tuple leave the system when we process it now. • est (b) • How soon it will exhibit another performance degradation if we don’t process it now. • Performance • 200 queries/application, each with 5 boxes • Round robin - 0.43 • QoS driven scheduling – 0.85
Aurora – Current Status • Main components of a DSMS are introduced • Operator scheduler • Memory/storage management • QoS concept in stress environment • Load shedding • Implemented in C++, with Java-based GUI • Dependent on a few software/library • More? • Distributed architecture – Aurora* • Fault tolerance or disaster recovery ?
STREAM – Introduction • General-purpose prototype DSMS • Supports data streams and stored relations • Declarative language for registering continuous queries • Flexible query plans and execution strategies • Aggressive sharing of state and computation among queries
STREAM – Introduction • Designed to cope with • Stream rates that may be high, variable, bursty • Continuous query loads that may be high, volatile • Primary coping techniques • Graceful approximation as necessary • Careful resource allocation and use • Continuous self-monitoring and reoptimization
Streamed Result Stored Result Register Query Input streams Archive STREAM – System Architecture DSMS Scratch Store Stored Relations
STREAM – Query Language • Continuous Query Language – CQL • Extends SQL with • Streams as new data type • Stream: Unbounded bag of pairs <tuple, timestamp> • Relation: time-varying bags of tuples • Continuous instead of one-time semantics • Three classes of operators • Relation-to-relation • Stream-to-relation • Relation-to-stream
STREAM – CQL Operators • Relation-to-relation • SQL constructs • Stream-to-relation • Tuple-based sliding window: [Rows N], [Rows Unbounded] • Time-based sliding window: [Range ω], [Now] • Partitioned sliding window: [Partition By A1,…Ak Rows N] • Relation-to-stream • Istream: insert stream • Dstream: delete stream • Rstream: relation stream
STREAM – Example Query 1 • Two example streams: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) • Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”: Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
STREAM – Example Query 2 • Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk
STREAM – Simplified Query 2 • Result is a relation, updated as stream elements arrive: Select F.clerk, Max(O.cost) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk
STREAM – Simplified Query 2 • Result is streamed: Emits <clerk, max> stream element whenever max changes for a clerk (or new clerk): Select Istream(F.clerk, Max(O.cost)) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk
STREAM – Example Query 3 • Relation: CurPrice(stock, price) • Average price over last day for each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock • Istream provides history of CurPrice • Window on history (back to relation), group and aggregate
STREAM – Query plans and Execution • When a continuous query is registered, generate a query plan • New plan merged with existing plans • Users can also create & manipulate plans directly • Plans composed of three main components: • Operators • Flag: insertion(+), deletion (-) • Elements: tuple-timestamp-flag tuples • Streams: only + elements • Relations: both + and - elements • Queues • Enforce nondecreasing timestamps (“heartbeats”) • Mechanisms for buffering tuples • States (Synopses) • Global scheduler for plan execution
State1 ⋈ State2 STREAM – States • States (Synopses) • Summarize elements seen so far (exact or approximate) for operators requiring history • To implement windows • Example: synopsis join • Sliding-window join • Approximation of full join
STREAM – Simple Query Plan Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
STREAM – Performance Issues • Synopsis Sharing • Eliminate data redundancy • Exploiting Constraints • Selectively discard data to reduce state • Operator Scheduling • Reduce queue sizes
STREAM – Synopsis Sharing • Eliminate redundancy by • replacing the nearly identical synopses with light weight stubs • a single store to hold the actual tuples • Store tracks the progress of each stub, presents the appropriate view to each stub. • The store contains the union of its corresponding stubs
STREAM – Synopsis Sharing Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10 Select A, Max(B) From S1 [Rows 200] Group By A
STREAM – Exploiting Constraints • Specify an adherence parameter k to capture how closely a given stream or sets of streams adheres to a constraint of that type • Referential integrity k-constraint • Ordered-arrival k-constraint • Clustered-arrival k-constraint • Query execution plans reduce or eliminate sate based on k-constraints • If constraint violated, get approximate result
STREAM – Operator Scheduling • Goal: minimize total queue size for unpredictable, bursty stream arrival patterns • Chain Scheduling Algorithm: • Mark the first operator in the plan as the “current” operator • Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time. • Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains. • Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order. • Proven: within constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries • Empirical results: large savings over naive strategies for many queries • But minimizing queue sizes is at odds with minimizing latency
STREAM – Approximation • CPU-Limited Approximation • Insufficient CPU time to process each stream element due to the high data arrival rate. • load-shedding • sampling operators • Approximate by probabilistically dropping elements before they are processed • Memory-Limited Approximation • The total state required for all registered queries exceeds available memory. • The system selectively shrinks or discards synopses.
STREAM – Query Interface • View the structure of query plans the their component entities. • View the detailed properties of each entity. • Dynamically adjust entity properties. • View monitoring graphs that display time-varying entity properties plotted dynamically against time. • Queue sizes, throughput, overall memory usage, and join selectivity.
STREAM – Current Status • Version 1.0 up and running • Includes a new monitoring and adaptive query processing infrastructure – StreaMon • Executor runs query plans to produce results. • Profiler collects and maintains statistics about stream and plan characteristics. • Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics. • Web demo available at http://shark.stanford.edu:8080/ • Future Directions: • Distributed Stream Processing • Crash Recovery • Improved Approximation • Classification of Applications
Conclusion • Ideal DSMS • Well defined and flexible query language • User-friendly interface • Scalable • Operator scheduling • Storage management • Synopsis sharing • Approximation • Quality assurance • Fault tolerant
References • R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in proceedings of the 1st CIDR Conference, 2003. • S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in proceedings of SIGMOD Conference, 2002 • D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of VLDB conference, 2002. • D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of VLDB conference, 2003 • Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html • Aurora Project Website: http://www.cs.brown.edu/research/aurora