Stream Data Management System Prototypes

Ying Sheng, Richard Sia • June 1, 2004 • Professor Carlo Zaniolo • CS 240B Spring 2004

Outline • Motivation of DSMS • Aurora (Brown, Brandeis, MIT) • Model • Operator Scheduling • Storage/Memory Management • QoS issue • STREAM (Stanford) • System Architecture • Query Language • Query Plans and Execution • Performance Issues • Approximation Techniques • STREAM Interface • Conclusion

Motivation • HADP → DAHP (from Human-Active, DBMS-Passive to DBMS-Active, Human-Passive) • Continuous data and static queries • Monitoring using sensors • Military • Traffic • Environment • Financial analysis • Object tracking

Aurora

Aurora – Model • General Purpose DSMS • Continuous stream data comes • Flow through a set of operators • Output to application or materialized

Aurora – Model • Components • Storage manager • Scheduler • Load Shedder • Router • QoS Monitor • GUI

Aurora – Model • 3 kinds of query supported • Continuous • View • Ad-Hoc Query

Aurora – Model • 8 primitive operators (Box) • Windowed • Slide • Tumble • Latch • Resample • Non-windowed • Filter • Map • GroupBy • Join

Aurora – Operator Optimization • Each operator associated with • Selectivity: s(b), sel(b) • Computation time: c(b), cost(b) • General Optimization Techniques • Pushing projection upstream • Combining boxes • Reordering boxes

Aurora – Operator Optimization • Case 1: cost of running a, then b • c(a) + s(a)c(b) • Case 2: cost of running b, then a • c(b) + s(b)c(a) • Criterion for switching the box order from (a, b) to (b, a) • c(a) + s(a)c(b) > c(b) + s(b)c(a)
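
The switching criterion above can be checked numerically. The following is a minimal sketch, not Aurora code: the function names and the sample cost and selectivity values are assumptions made for illustration only.

```python
def pipeline_cost(first, second):
    """Expected per-tuple cost of running `first`, then `second`.

    Each box is a (cost, selectivity) pair, i.e. c(b) and s(b) on the slide.
    The second box only sees the fraction of tuples the first one passes on.
    """
    cost_first, sel_first = first
    cost_second, _ = second
    return cost_first + sel_first * cost_second


def should_swap(a, b):
    """True when running b before a is cheaper than running a before b."""
    return pipeline_cost(a, b) > pipeline_cost(b, a)


# Hypothetical boxes: a is expensive and passes 90% of tuples,
# b is cheap and passes 20% of tuples.
a = (4.0, 0.9)
b = (1.0, 0.2)
cost_ab = pipeline_cost(a, b)  # 4.0 + 0.9 * 1.0
cost_ba = pipeline_cost(b, a)  # 1.0 + 0.2 * 4.0
swap = should_swap(a, b)
```

With these sample values the cheap, selective box b belongs in front, since it shields the expensive box a from most tuples.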

Aurora – Operator Scheduling • Scheduling by OS • One thread per box; scheduling is left to the OS • Easier to program • Aurora Scheduler • Single thread for the scheduler • The scheduler picks the box with the highest priority and calls it to consume tuples from its queue • Allows finer control of resources • Scalable
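
The single-threaded design can be sketched as a loop that repeatedly picks the highest-priority box with pending input. This is an illustrative sketch only: the `Box` class, the static priorities and the drain-the-whole-queue policy are assumptions, not Aurora's actual interfaces.

```python
from collections import deque


class Box:
    """A hypothetical operator box with its own input queue."""

    def __init__(self, name, priority, fn, downstream=None):
        self.name = name
        self.priority = priority      # assumed static; Aurora computes it dynamically
        self.fn = fn                  # maps one input tuple to a list of output tuples
        self.queue = deque()
        self.downstream = downstream  # next box, or None if it feeds the application


def run_scheduler(boxes, output):
    """Run boxes in one thread until every queue is empty; return the run order."""
    trace = []
    while True:
        ready = [b for b in boxes if b.queue]
        if not ready:
            return trace
        box = max(ready, key=lambda b: b.priority)
        trace.append(box.name)
        # Let the chosen box consume everything currently in its queue.
        while box.queue:
            for out in box.fn(box.queue.popleft()):
                if box.downstream is None:
                    output.append(out)
                else:
                    box.downstream.queue.append(out)


sink = []
double = Box("map", priority=1, fn=lambda t: [t * 2])
keep_even = Box("filter", priority=2,
                fn=lambda t: [t] if t % 2 == 0 else [],
                downstream=double)
keep_even.queue.extend([1, 2, 3, 4])
trace = run_scheduler([keep_even, double], sink)
```

Because the scheduler, not the OS, decides which box runs next, the policy behind `priority` can be changed without touching the boxes themselves.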


Aurora – Operator Scheduling • Problem: which box to execute next? • Min-Cost (MC) • Reduce computation cost • Min-Latency (ML) • Return result as soon as possible • Min-Memory (MM) • Reduce memory usage of queue

Aurora – Operator Scheduling • Example (figure: a network of boxes b1 to b6 connecting the input streams to the application, with the downstream direction marked)

Aurora – Operator Scheduling • Min-Cost • Objective: avoid the overhead of calling boxes • Min-Latency • Prefer the box that can deliver tuples to the output soonest • Min-Memory • Prefer the box that consumes the most tuples per unit of computation time • Similar to “Chain Operator Scheduling” • More at: “Operator Scheduling in a Data Stream Manager”, VLDB 2003

Aurora – Storage/Memory Management • Manages the queue in front of each box • 2 boxes may share the same queue • windowed operator • The initial queue size is 128 KB • Queues are managed as circular queues • If a queue overflows, its size is doubled; if it is underused, it is shrunk again
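
A circular queue that doubles on overflow can be sketched as follows. This is a simplified illustration, not Aurora's storage manager: capacity is counted in tuples rather than bytes, and the shrink rule (halve when at most a quarter full) is an assumed policy, since the slide does not state when a queue shrinks.

```python
class GrowableRing:
    """Circular FIFO queue whose buffer doubles when full (capacity >= 1)."""

    def __init__(self, capacity=4):
        self.min_capacity = capacity
        self.buf = [None] * capacity
        self.head = 0   # index of the oldest element
        self.size = 0   # number of stored elements

    def _resize(self, capacity):
        # Copy the live elements, oldest first, into a fresh buffer.
        items = [self.buf[(self.head + i) % len(self.buf)] for i in range(self.size)]
        self.buf = items + [None] * (capacity - self.size)
        self.head = 0

    def push(self, item):
        if self.size == len(self.buf):
            self._resize(2 * len(self.buf))  # overflow: double the queue
        self.buf[(self.head + self.size) % len(self.buf)] = item
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty queue")
        item = self.buf[self.head]
        self.buf[self.head] = None
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        # Assumed shrink policy: halve when at most a quarter of the buffer is used.
        if len(self.buf) > self.min_capacity and self.size <= len(self.buf) // 4:
            self._resize(len(self.buf) // 2)
        return item


q = GrowableRing(capacity=2)
for i in range(5):
    q.push(i)
grown_capacity = len(q.buf)
drained = [q.pop() for _ in range(5)]
```

Shrinking only at a quarter full, rather than at half, avoids repeated grow and shrink cycles when the load hovers around a capacity boundary.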

Aurora – Storage/Memory Management • Swap in/out between memory / disk based on priority of boxes using it • Work with Operator Scheduler to exchange box priority and buffer-state information • Connection Point Management • A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query


Aurora – QoS Issue • Different queries/applications have different QoS requirements • Stock market monitoring • Average temperature of a set of sensors • QoS Graph

Latency-based QoS Graph (figure: QoS plotted against time, starting at 0, with the critical point marked; the time axis shows latency(b), cost(D(b)), eol(b) and est(b) for a box b and D(b))

Aurora – QoS-driven Scheduling • Assign a priority to each box based on • priority(b) = [utility(b), est(b)] • utility(b) = gradient(eol(b)) • How fast the QoS is degrading at the time the tuple would leave the system if we process it now • est(b) • How soon the tuple will suffer another QoS degradation if we do not process it now • Performance • 200 queries/applications, each with 5 boxes • Round robin – 0.43 • QoS-driven scheduling – 0.85
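
The priority pair can be sketched in code under a few assumptions: the QoS graph is taken to be piecewise linear, utility is taken as the magnitude of its slope at eol(b), and ties in utility are broken by the smaller est(b). The function names, the dictionary layout and all sample numbers are hypothetical.

```python
def gradient(qos_graph, latency):
    """Magnitude of the slope of a piecewise-linear QoS graph at `latency`.

    qos_graph is a list of (latency, qos) points sorted by latency.
    Outside the described range the graph is treated as flat.
    """
    for (x0, y0), (x1, y1) in zip(qos_graph, qos_graph[1:]):
        if x0 <= latency < x1:
            return abs((y1 - y0) / (x1 - x0))
    return 0.0


def pick_next(boxes):
    """Pick the box with the highest utility; break ties by the smallest est."""
    return min(boxes, key=lambda b: (-gradient(b["qos"], b["eol"]), b["est"]))


# Hypothetical QoS graph: full QoS up to the critical point at latency 2,
# then a linear drop to zero at latency 6.
graph = [(0.0, 1.0), (2.0, 1.0), (6.0, 0.0)]
boxes = [
    {"name": "x", "eol": 1.0, "est": 1.0, "qos": graph},  # still before the critical point
    {"name": "y", "eol": 3.0, "est": 1.0, "qos": graph},  # already degrading
    {"name": "z", "eol": 4.0, "est": 0.5, "qos": graph},  # degrading, more urgent
]
chosen = pick_next(boxes)["name"]
```

Box x is passed over because delaying it costs nothing yet, while y and z share the same slope and the smaller est decides between them.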

Aurora – Current Status • Main components of a DSMS are introduced • Operator scheduler • Memory/storage management • QoS concept under stress conditions • Load shedding • Implemented in C++, with a Java-based GUI • Depends on a few external software packages/libraries • More? • Distributed architecture – Aurora* • Fault tolerance or disaster recovery?

STREAM

STREAM – Introduction • General-purpose prototype DSMS • Supports data streams and stored relations • Declarative language for registering continuous queries • Flexible query plans and execution strategies • Aggressive sharing of state and computation among queries

STREAM – Introduction • Designed to cope with • Stream rates that may be high, variable, bursty • Continuous query loads that may be high, volatile • Primary coping techniques • Graceful approximation as necessary • Careful resource allocation and use • Continuous self-monitoring and reoptimization

STREAM – System Architecture (figure: input streams and registered queries enter the DSMS, which uses a Scratch Store and Stored Relations; outputs are a Streamed Result and a Stored Result, plus an Archive)

STREAM – Query Language • Continuous Query Language – CQL • Extends SQL with • Streams as new data type • Stream: Unbounded bag of pairs <tuple, timestamp> • Relation: time-varying bags of tuples • Continuous instead of one-time semantics • Three classes of operators • Relation-to-relation • Stream-to-relation • Relation-to-stream

STREAM – CQL Operators • Relation-to-relation • SQL constructs • Stream-to-relation • Tuple-based sliding window: [Rows N], [Rows Unbounded] • Time-based sliding window: [Range ω], [Now] • Partitioned sliding window: [Partition By A1,…Ak Rows N] • Relation-to-stream • Istream: insert stream • Dstream: delete stream • Rstream: relation stream
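
The three operator classes can be imitated with small generators. This is a sketch of the semantics, not the STREAM implementation: relations are modeled as `collections.Counter` bags, a stream is a list of (tuple, timestamp) pairs with distinct, increasing timestamps, and the inclusive lower bound of the time window is an assumption about the boundary case.

```python
from collections import Counter, deque


def rows_window(stream, n):
    """[Rows N]: after each arrival the relation holds the n most recent tuples."""
    window = deque(maxlen=n)
    for tup, ts in stream:
        window.append(tup)
        yield ts, Counter(window)


def range_window(stream, width):
    """[Range w]: tuples whose timestamp lies in [ts - width, ts] (assumed inclusive)."""
    window = deque()
    for tup, ts in stream:
        window.append((tup, ts))
        while window[0][1] < ts - width:
            window.popleft()
        yield ts, Counter(t for t, _ in window)


def istream(relation_states):
    """Istream: emit tuples that entered the relation since the previous state."""
    prev = Counter()
    for ts, rel in relation_states:
        for tup in (rel - prev).elements():
            yield tup, ts
        prev = rel


def dstream(relation_states):
    """Dstream: emit tuples that left the relation since the previous state."""
    prev = Counter()
    for ts, rel in relation_states:
        for tup in (prev - rel).elements():
            yield tup, ts
        prev = rel


stream = [("a", 1), ("b", 2), ("c", 3)]
inserted = list(istream(rows_window(stream, 2)))
deleted = list(dstream(rows_window(stream, 2)))
ranged = [(ts, sorted(rel)) for ts, rel in range_window(stream, 1)]
```

Rstream would simply emit every tuple of each relation state, so it is omitted here. Composing `istream` with `rows_window` shows the round trip from stream to relation and back to stream.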

STREAM – Example Query 1 • Two example streams: Orders (orderID, customer, cost) and Fulfillments (orderID, clerk) • Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:
    Select Sum(O.cost)
    From Orders O, Fulfillments F [Range 1 Day]
    Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

STREAM – Example Query 2 • Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost:
    Select F.clerk, Max(O.cost)
    From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Simplified Query 2 • Result is a relation, updated as stream elements arrive:
    Select F.clerk, Max(O.cost)
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Simplified Query 2 • Result is streamed: emits a <clerk, max> stream element whenever the max changes for a clerk (or a new clerk appears):
    Select Istream(F.clerk, Max(O.cost))
    From O, F [Rows 100]
    Where O.orderID = F.orderID
    Group By F.clerk

STREAM – Example Query 3 • Relation: CurPrice(stock, price) • Average price over the last day for each stock:
    Select stock, Avg(price)
    From Istream(CurPrice) [Range 1 Day]
    Group By stock
• Istream provides the history of CurPrice • Window on the history (back to a relation), then group and aggregate

STREAM – Query Plans and Execution • When a continuous query is registered, a query plan is generated • New plan is merged with existing plans • Users can also create & manipulate plans directly • Plans are composed of three main components: • Operators • Flag: insertion (+), deletion (-) • Elements: <tuple, timestamp, flag> triples • Streams: only + elements • Relations: both + and - elements • Queues • Enforce nondecreasing timestamps (“heartbeats”) • Mechanisms for buffering tuples • States (Synopses) • Global scheduler for plan execution

STREAM – States • States (Synopses) • Summarize elements seen so far (exact or approximate) for operators requiring history • To implement windows • Example: synopsis join (State1 ⋈ State2) • Sliding-window join • Approximation of full join

STREAM – Simple Query Plan
    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10

STREAM – Performance Issues • Synopsis Sharing • Eliminate data redundancy • Exploiting Constraints • Selectively discard data to reduce state • Operator Scheduling • Reduce queue sizes

STREAM – Synopsis Sharing • Eliminate redundancy by • replacing the nearly identical synopses with light weight stubs • a single store to hold the actual tuples • Store tracks the progress of each stub, presents the appropriate view to each stub. • The store contains the union of its corresponding stubs

STREAM – Synopsis Sharing
    Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10

    Select A, Max(B)
    From S1 [Rows 200]
    Group By A
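
For the two queries above, the S1 [Rows 1000] and S1 [Rows 200] synopses overlap, so one store can serve both. The sketch below illustrates the store-and-stub idea with smaller window sizes; the class names and the `view` method are hypothetical and do not reflect STREAM's internal interfaces, and each stub is assumed to cover at least one row.

```python
from collections import deque


class SharedStore:
    """Holds each tuple once; every stub sees a suffix of the stored tuples."""

    def __init__(self):
        self.tuples = deque()
        self.stubs = []

    def stub(self, rows):
        s = Stub(self, rows)
        self.stubs.append(s)
        return s

    def insert(self, tup):
        self.tuples.append(tup)
        # Keep only what the widest stub still needs (the union of all stubs).
        widest = max((s.rows for s in self.stubs), default=0)
        while len(self.tuples) > widest:
            self.tuples.popleft()


class Stub:
    """Lightweight stand-in for a [Rows N] synopsis; stores no tuples itself."""

    def __init__(self, store, rows):
        self.store = store
        self.rows = rows  # assumed to be at least 1

    def view(self):
        return list(self.store.tuples)[-self.rows:]


store = SharedStore()
wide = store.stub(rows=5)    # stands in for S1 [Rows 1000]
narrow = store.stub(rows=2)  # stands in for S1 [Rows 200]
for value in range(10):
    store.insert(value)
wide_view = wide.view()
narrow_view = narrow.view()
```

The store's memory use is bounded by the widest window rather than by the sum of all windows, which is the saving the slide refers to.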

STREAM – Exploiting Constraints • Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of that type • Referential integrity k-constraint • Ordered-arrival k-constraint • Clustered-arrival k-constraint • Query execution plans reduce or eliminate state based on k-constraints • If a constraint is violated, the result is approximate

STREAM – Operator Scheduling • Goal: minimize total queue size for unpredictable, bursty stream arrival patterns • Chain Scheduling Algorithm: • Step 1: Mark the first operator in the plan as the “current” operator • Step 2: Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time • Step 3: Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains • Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order • Proven: within a constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries • Empirical results: large savings over naive strategies for many queries • But minimizing queue sizes is at odds with minimizing latency
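
The chain construction steps can be sketched for a single pipeline of operators. This is an illustrative reading of the steps above, not the STREAM scheduler: each operator is described by an assumed (time per tuple, selectivity) pair with positive time, and "reduction in total queue size per unit time" is computed as the drop in remaining tuple fraction divided by the processing time spent.

```python
def chain_partition(operators):
    """Split a pipeline of (time_per_tuple, selectivity) operators into chains.

    Returns a list of (operator_indices, slope) pairs, where slope is the
    reduction in remaining tuple fraction per unit of processing time.
    """
    # points[i] = (cumulative time, remaining fraction) after the first i operators.
    points = [(0.0, 1.0)]
    for time_per_tuple, selectivity in operators:
        prev_time, prev_size = points[-1]
        points.append((prev_time + time_per_tuple, prev_size * selectivity))

    chains = []
    current = 0
    while current < len(operators):
        t0, size0 = points[current]
        # Choose the end point that gives the steepest drop from `current`.
        best_end = max(
            range(current + 1, len(points)),
            key=lambda j: (size0 - points[j][1]) / (points[j][0] - t0),
        )
        slope = (size0 - points[best_end][1]) / (points[best_end][0] - t0)
        chains.append((list(range(current, best_end)), slope))
        current = best_end
    return chains


# Hypothetical pipeline: a weak filter, a strong filter, then a slower operator.
ops = [(1.0, 0.9), (1.0, 0.1), (2.0, 0.5)]
chains = chain_partition(ops)
```

Here the weak first filter is grouped with the strong second one because together they shrink the data faster than the first does alone. At run time, a greedy scheduler would then favor the ready chain with the largest slope.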

STREAM – Approximation • CPU-Limited Approximation • Insufficient CPU time to process each stream element due to the high data arrival rate. • load-shedding • sampling operators • Approximate by probabilistically dropping elements before they are processed • Memory-Limited Approximation • The total state required for all registered queries exceeds available memory. • The system selectively shrinks or discards synopses.
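
CPU-limited approximation by probabilistic dropping can be sketched as a sampling operator placed in front of the expensive part of a plan. The function name and its parameters are assumptions for illustration; the slides do not describe how STREAM chooses the drop rate.

```python
import random


def shed(stream, keep_probability, rng=None):
    """Pass each element on with probability `keep_probability`, drop it otherwise."""
    rng = rng or random.Random()
    for element in stream:
        # random() returns a value in [0, 1), so 1.0 keeps all and 0.0 keeps none.
        if rng.random() < keep_probability:
            yield element


# A seeded generator makes the run repeatable; the seed value is arbitrary.
kept = list(shed(range(1000), keep_probability=0.1, rng=random.Random(0)))
everything = list(shed(range(10), keep_probability=1.0))
nothing = list(shed(range(10), keep_probability=0.0))
```

Dropping elements before they are processed saves the full downstream cost per dropped element, at the price of an approximate answer.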

STREAM – Query Interface • View the structure of query plans and their component entities • View the detailed properties of each entity • Dynamically adjust entity properties • View monitoring graphs that display time-varying entity properties plotted dynamically against time • Queue sizes, throughput, overall memory usage, and join selectivity

STREAM – Query Plan Monitoring

STREAM – Current Status • Version 1.0 up and running • Includes a new monitoring and adaptive query processing infrastructure – StreaMon • Executor runs query plans to produce results. • Profiler collects and maintains statistics about stream and plan characteristics. • Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics. • Web demo available at http://shark.stanford.edu:8080/ • Future Directions: • Distributed Stream Processing • Crash Recovery • Improved Approximation • Classification of Applications

Conclusion • Ideal DSMS • Well defined and flexible query language • User-friendly interface • Scalable • Operator scheduling • Storage management • Synopsis sharing • Approximation • Quality assurance • Fault tolerant

References • R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference, 2003. • S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002. • D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference, 2002. • D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003. • Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html • Aurora Project Website: http://www.cs.brown.edu/research/aurora
