
Stream Data Management System Prototypes

Stream Data Management System Prototypes. Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004. Outline. Motivation of DSMS Aurora (Brown, Brandeis, MIT) Model Operator Scheduling Storage/Memory Management QoS issue STREAM (Stanford) System Architecture


Presentation Transcript


  1. Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

  2. Outline • Motivation of DSMS • Aurora (Brown, Brandeis, MIT) • Model • Operator Scheduling • Storage/Memory Management • QoS issue • STREAM (Stanford) • System Architecture • Query Language • Query Plans and Execution • Performance Issues • Approximation Techniques • STREAM Interface • Conclusion

  3. Motivation • HADP → DAHP (human-active, DBMS-passive → DBMS-active, human-passive) • Continuous data and static queries • Monitoring using sensors • Military • Traffic • Environment • Financial analysis • Object tracking

  4. Aurora

  5. Aurora – Model • General-purpose DSMS • Continuous stream data arrives • Flows through a set of operators • Output goes to an application or is materialized

  6. Aurora – Model • Components • Storage manager • Scheduler • Load Shedder • Router • QoS Monitor • GUI

  7. Aurora – Model • 3 kinds of query supported • Continuous • View • Ad-Hoc Query

  8. Aurora – Model • 8 primitive operators (Box) • Windowed • Slide • Tumble • Latch • Resample • Non-windowed • Filter • Map • GroupBy • Join

  9. Aurora – Operator Optimization • Each operator associated with • Selectivity: s(b), sel(b) • Computation time: c(b), cost(b) • General Optimization Techniques • Pushing projection upstream • Combining boxes • Reordering boxes

  10. Aurora – Operator Optimization • Case 1: cost of running a then b • c(a) + s(a)c(b) • Case 2: cost of running b then a • c(b) + s(b)c(a) • Criterion for switching box positions (from a → b to b → a) • c(a) + s(a)c(b) > c(b) + s(b)c(a)
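
The switching criterion on this slide can be checked numerically. The sketch below is illustrative only: it assumes the two boxes commute (for example, two independent filters), represents each box as a hypothetical `(cost, selectivity)` pair, and uses made-up numbers. Rearranging the inequality gives the equivalent test (1 - s(b))/c(b) > (1 - s(a))/c(a).

```python
def pipeline_cost(first, second):
    """Expected per-tuple cost of running `first`, then `second`.

    Each box is a (cost, selectivity) pair, i.e. c(b) and s(b) on the slide.
    Only the fraction s(first) of tuples reaches the second box.
    """
    c1, s1 = first
    c2, _ = second
    return c1 + s1 * c2


def should_swap(a, b):
    """True when b -> a is cheaper than a -> b: c(a)+s(a)c(b) > c(b)+s(b)c(a)."""
    return pipeline_cost(a, b) > pipeline_cost(b, a)


# Hypothetical numbers: box a is expensive and passes most tuples,
# box b is cheap and drops most tuples.
a = (4.0, 0.9)  # c(a) = 4.0, s(a) = 0.9
b = (1.0, 0.2)  # c(b) = 1.0, s(b) = 0.2

cost_ab = pipeline_cost(a, b)  # 4.0 + 0.9 * 1.0
cost_ba = pipeline_cost(b, a)  # 1.0 + 0.2 * 4.0
```

With these numbers the order a → b costs about 4.9 per tuple and b → a about 1.8, so `should_swap(a, b)` holds: the cheap, selective box should run first.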

  11. Aurora – Operator Scheduling • Scheduling by OS • One thread per box, shift the job to the OS • Easier to program • Aurora Scheduler • Single thread for the scheduler • The scheduler picks the box with the highest priority and calls it to consume tuples from its queue • Allows finer control of resources • Scalable!

  12. Aurora – Operator Scheduling

  13. Aurora – Operator Scheduling • Problem: which box to execute next? • Min-Cost (MC) • Reduce computation cost • Min-Latency (ML) • Return result as soon as possible • Min-Memory (MM) • Reduce memory usage of queue

  14. Aurora – Operator Scheduling • Example • [Figure: network of boxes b1 to b6 between the input streams and the application, with the downstream direction marked]

  15. Aurora – Operator Scheduling • Min-Cost • Objective: avoid the overhead of calling boxes • Min-Latency • Prefer the box that can produce output tuples in the shortest time • Min-Memory • Give preference to the box that consumes more tuples with less computation time • Similar to “Chain Operator Scheduling” • More at: Operator Scheduling in a Data Stream Manager, VLDB 2003

  16. Aurora – Storage/Memory Management • Manage the queue in front of each box • 2 boxes sharing the same queue • windowed operator • The initial queue size is 128 KB • Queues are managed as circular queues • On overflow, the queue size is doubled; when a queue is underused, it is halved
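
The circular queue with doubling and halving described on this slide can be sketched as follows. This is not Aurora's implementation: the class name `GrowableRing` is hypothetical, capacity is counted in elements rather than bytes (the slide's 128 KB initial size is not modelled), and the shrink rule (halve when at most a quarter full) is an assumption.

```python
class GrowableRing:
    """Circular FIFO queue that doubles on overflow and halves when mostly empty."""

    def __init__(self, capacity=4):
        self._min_capacity = capacity
        self._buf = [None] * capacity
        self._head = 0   # index of the oldest element
        self._size = 0   # number of live elements

    @property
    def capacity(self):
        return len(self._buf)

    def _resize(self, new_capacity):
        # Copy live elements, oldest first, into a fresh buffer starting at index 0.
        old = self._buf
        items = [old[(self._head + i) % len(old)] for i in range(self._size)]
        self._buf = items + [None] * (new_capacity - self._size)
        self._head = 0

    def push(self, item):
        if self._size == len(self._buf):
            self._resize(2 * len(self._buf))          # overflow: double
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = item
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty queue")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        # Assumed shrink policy: halve when at most a quarter of the slots are used.
        if len(self._buf) > self._min_capacity and self._size <= len(self._buf) // 4:
            self._resize(len(self._buf) // 2)
        return item


q = GrowableRing(4)
for i in range(5):
    q.push(i)            # the fifth push overflows and doubles the buffer
grown = q.capacity
first_three = [q.pop() for _ in range(3)]
shrunk = q.capacity
```

Halving only at a quarter full, rather than at half full, avoids repeated grow/shrink cycles when the queue length hovers around a boundary.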

  17. Aurora – Storage/Memory Management • Swap in/out between memory / disk based on priority of boxes using it • Work with Operator Scheduler to exchange box priority and buffer-state information • Connection Point Management • A B-tree indexed on timestamp is built to support random access of tuples by ad-hoc query

  18. Aurora – Storage/Memory Management

  19. Aurora – QoS Issue • Different queries/applications have different QoS requirements • Stock market monitoring • Average temperature of a set of sensors • QoS Graph

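  20. Latency-based QoS Graph • [Figure: QoS (vertical axis) against time (horizontal axis, starting at 0), with the critical point marked; annotations for a box b and D(b): latency(b), cost(D(b)), est(b), eol(b)]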

  21. Aurora – QoS-driven Scheduling • Assign a priority to each box • priority(b) = [utility(b), est(b)] • utility(b) = gradient(eol(b)) • How steeply QoS is degrading at the time the tuple would leave the system if we process it now • est(b) • How soon the tuple will exhibit another performance degradation if we do not process it now • Performance • 200 queries/applications, each with 5 boxes • Round robin: 0.43 • QoS-driven scheduling: 0.85
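
A rough sketch of the priority pair on this slide, under several assumptions: the QoS graph is piecewise linear and given as hypothetical `(latency, qos)` points, utility is taken as the absolute slope of that graph at eol(b), and boxes are ordered by highest utility first with the smallest est(b) breaking ties. The tie-break direction and all numbers are illustrative, not taken from the slides.

```python
from bisect import bisect_right


def qos_slope(graph, latency):
    """Absolute slope of a piecewise-linear QoS graph at `latency`.

    `graph` is a list of (latency, qos) points sorted by latency.
    Latencies outside the graph are clamped to the nearest segment.
    """
    xs = [x for x, _ in graph]
    i = bisect_right(xs, latency) - 1
    i = max(0, min(i, len(graph) - 2))  # clamp to a valid segment index
    (x0, y0), (x1, y1) = graph[i], graph[i + 1]
    return abs((y1 - y0) / (x1 - x0))


def schedule_order(boxes, graph):
    """Order boxes by priority(b) = [utility(b), est(b)].

    Assumption: higher utility first, then smaller est (more urgent) first.
    """
    def priority(box):
        return (-qos_slope(graph, box["eol"]), box["est"])
    return [box["name"] for box in sorted(boxes, key=priority)]


# Hypothetical QoS graph: full QoS up to the critical point at latency 2,
# then a linear drop to zero at latency 6.
graph = [(0.0, 1.0), (2.0, 1.0), (6.0, 0.0)]

boxes = [
    {"name": "b1", "eol": 1.0, "est": 5.0},  # still on the flat part
    {"name": "b2", "eol": 4.0, "est": 3.0},  # on the falling part
    {"name": "b3", "eol": 5.0, "est": 1.0},  # on the falling part, more urgent
]

order = schedule_order(boxes, graph)
```

Here b2 and b3 share the same utility (slope 0.25), so the smaller est puts b3 first; b1 has zero utility because delaying it does not yet cost any QoS.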

  22. Aurora – Current Status • Main components of a DSMS are introduced • Operator scheduler • Memory/storage management • QoS concept in a stressed environment • Load shedding • Implemented in C++, with a Java-based GUI • Depends on a few external software libraries • More? • Distributed architecture – Aurora* • Fault tolerance or disaster recovery?

  23. STREAM

  24. STREAM – Introduction • General-purpose prototype DSMS • Supports data streams and stored relations • Declarative language for registering continuous queries • Flexible query plans and execution strategies • Aggressive sharing of state and computation among queries

  25. STREAM – Introduction • Designed to cope with • Stream rates that may be high, variable, bursty • Continuous query loads that may be high, volatile • Primary coping techniques • Graceful approximation as necessary • Careful resource allocation and use • Continuous self-monitoring and reoptimization

  26. STREAM – System Architecture • [Figure: queries are registered with the DSMS, which consumes input streams and produces streamed results and stored results; the DSMS uses a scratch store, stored relations, and an archive]

  27. STREAM – Query Language • Continuous Query Language – CQL • Extends SQL with • Streams as new data type • Stream: Unbounded bag of pairs <tuple, timestamp> • Relation: time-varying bags of tuples • Continuous instead of one-time semantics • Three classes of operators • Relation-to-relation • Stream-to-relation • Relation-to-stream

  28. STREAM – CQL Operators • Relation-to-relation • SQL constructs • Stream-to-relation • Tuple-based sliding window: [Rows N], [Rows Unbounded] • Time-based sliding window: [Range ω], [Now] • Partitioned sliding window: [Partition By A1,…Ak Rows N] • Relation-to-stream • Istream: insert stream • Dstream: delete stream • Rstream: relation stream
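
The three operator classes can be illustrated with a small model. This is a sketch of the semantics, not STREAM code: a stream is assumed to be a timestamp-ordered list of `(tuple, timestamp)` pairs, a relation at one instant is a `collections.Counter` used as a bag, and all function names are hypothetical.

```python
from collections import Counter


def rows_window(stream, n, now):
    """[Rows N]: bag of the N most recent tuples with timestamp <= now."""
    seen = [tup for tup, ts in stream if ts <= now]
    return Counter(seen[-n:] if n > 0 else [])


def range_window(stream, width, now):
    """[Range w]: bag of tuples with timestamp in [now - width, now].

    [Now] corresponds to width = 0.
    """
    return Counter(tup for tup, ts in stream if now - width <= ts <= now)


def istream(prev, curr):
    """Tuples inserted between two instants: bag difference R(t) - R(t-1)."""
    return curr - prev  # Counter subtraction drops non-positive counts


def dstream(prev, curr):
    """Tuples deleted between two instants: bag difference R(t-1) - R(t)."""
    return prev - curr


def rstream(curr):
    """All tuples of the relation at the current instant."""
    return Counter(curr)


# Hypothetical stream with integer timestamps.
stream = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]

r3 = rows_window(stream, 2, now=3)  # relation at time 3
r4 = rows_window(stream, 2, now=4)  # relation at time 4
inserted = istream(r3, r4)
deleted = dstream(r3, r4)
```

Moving the two-row window from time 3 to time 4 inserts `d` and deletes `b`, which is exactly what Istream and Dstream report.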

  29. STREAM – Example Query 1 • Two example streams: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) • Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”: Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

  30. STREAM – Example Query 2 • Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

  31. STREAM – Simplified Query 2 • Result is a relation, updated as stream elements arrive: Select F.clerk, Max(O.cost) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

  32. STREAM – Simplified Query 2 • Result is streamed: Emits <clerk, max> stream element whenever max changes for a clerk (or new clerk): Select Istream(F.clerk, Max(O.cost)) From O, F [Rows 100] Where O.orderID = F.orderID Group By F.clerk

  33. STREAM – Example Query 3 • Relation: CurPrice(stock, price) • Average price over last day for each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock • Istream provides history of CurPrice • Window on history (back to relation), group and aggregate

  34. STREAM – Query plans and Execution • When a continuous query is registered, generate a query plan • New plan merged with existing plans • Users can also create & manipulate plans directly • Plans composed of three main components: • Operators • Flag: insertion(+), deletion (-) • Elements: tuple-timestamp-flag tuples • Streams: only + elements • Relations: both + and - elements • Queues • Enforce nondecreasing timestamps (“heartbeats”) • Mechanisms for buffering tuples • States (Synopses) • Global scheduler for plan execution

  35. STREAM – States • States (Synopses) • Summarize elements seen so far (exact or approximate) for operators requiring history • To implement windows • Example: synopsis join (State1 ⋈ State2) • Sliding-window join • Approximation of full join

  36. STREAM – Simple Query Plan Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10

  37. STREAM – Performance Issues • Synopsis Sharing • Eliminate data redundancy • Exploiting Constraints • Selectively discard data to reduce state • Operator Scheduling • Reduce queue sizes

  38. STREAM – Synopsis Sharing • Eliminate redundancy by • replacing the nearly identical synopses with light weight stubs • a single store to hold the actual tuples • Store tracks the progress of each stub, presents the appropriate view to each stub. • The store contains the union of its corresponding stubs

  39. STREAM – Synopsis Sharing Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10 Select A, Max(B) From S1 [Rows 200] Group By A

  40. STREAM – Exploiting Constraints • Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of that type • Referential integrity k-constraint • Ordered-arrival k-constraint • Clustered-arrival k-constraint • Query execution plans reduce or eliminate state based on k-constraints • If a constraint is violated, the result is approximate

  41. STREAM – Operator Scheduling • Goal: minimize total queue size for unpredictable, bursty stream arrival patterns • Chain Scheduling Algorithm: • Step 1: Mark the first operator in the plan as the “current” operator • Step 2: Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time • Step 3: Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains • Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order • Proven: within a constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries • Empirical results: large savings over naive strategies for many queries • But minimizing queue sizes is at odds with minimizing latency
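
The chain-construction steps on this slide can be sketched for a single linear plan. The sketch assumes each operator is described by a hypothetical `(time_per_tuple, selectivity)` pair with positive time, and it measures "reduction in queue size per unit time" as the drop in remaining tuple fraction divided by the processing time spent. It only builds the chains; scheduling among chains is not shown.

```python
def build_chains(ops):
    """Partition a linear plan into chains.

    `ops` is a list of (time_per_tuple, selectivity) pairs in plan order.
    Returns a list of (start, end) operator index ranges, end exclusive.
    """
    # Cumulative processing time and remaining tuple fraction after each operator.
    times, sizes = [0.0], [1.0]
    for t, s in ops:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains = []
    current = 0  # Step 1: the first operator is "current"
    while current < len(ops):
        best_end, best_rate = current + 1, None
        # Step 2: find the block starting at `current` with the largest
        # reduction in queued data per unit of processing time.
        for end in range(current + 1, len(ops) + 1):
            rate = (sizes[current] - sizes[end]) / (times[end] - times[current])
            if best_rate is None or rate > best_rate:
                best_end, best_rate = end, rate
        chains.append((current, best_end))
        current = best_end  # Step 3: continue after the chosen block
    return chains


# Hypothetical plan: a pass-through operator, a selective filter, a slow operator.
plan = [(1.0, 1.0), (1.0, 0.2), (2.0, 0.5)]
chains = build_chains(plan)
```

In this example the first operator removes nothing on its own, but grouping it with the selective filter gives the steepest reduction, so the first two operators form one chain and the slow operator forms a second.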

  42. STREAM – Approximation • CPU-Limited Approximation • Insufficient CPU time to process each stream element due to the high data arrival rate. • load-shedding • sampling operators • Approximate by probabilistically dropping elements before they are processed • Memory-Limited Approximation • The total state required for all registered queries exceeds available memory. • The system selectively shrinks or discards synopses.
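
Probabilistic dropping, as described for CPU-limited approximation, can be sketched as a sampling step placed in front of an operator. The function name `shed` and the fixed drop probability are assumptions for illustration; the slides do not say how the sampling rate is chosen.

```python
import random


def shed(stream, drop_probability, rng=None):
    """Yield elements of `stream`, dropping each one independently
    with probability `drop_probability` before it is processed."""
    rng = rng or random.Random()
    for element in stream:
        # rng.random() is in [0, 1), so probability 0 keeps everything
        # and probability 1 drops everything.
        if rng.random() >= drop_probability:
            yield element


# A seeded generator keeps this illustrative run repeatable.
kept = list(shed(range(1000), 0.5, random.Random(0)))
```

Because the function is a generator, surviving elements are passed downstream one at a time and in arrival order, so it can sit directly between a queue and the operator that consumes it.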

  43. STREAM – Query Interface • View the structure of query plans and their component entities. • View the detailed properties of each entity. • Dynamically adjust entity properties. • View monitoring graphs that display time-varying entity properties plotted dynamically against time. • Queue sizes, throughput, overall memory usage, and join selectivity.

  44. STREAM – Query Plan Monitoring

  45. STREAM – Current Status • Version 1.0 up and running • Includes a new monitoring and adaptive query processing infrastructure – StreaMon • Executor runs query plans to produce results. • Profiler collects and maintains statistics about stream and plan characteristics. • Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics. • Web demo available at http://shark.stanford.edu:8080/ • Future Directions: • Distributed Stream Processing • Crash Recovery • Improved Approximation • Classification of Applications

  46. Conclusion • Ideal DSMS • Well defined and flexible query language • User-friendly interface • Scalable • Operator scheduling • Storage management • Synopsis sharing • Approximation • Quality assurance • Fault tolerant

  47. References • R. Motwani et al., “Query Processing, Approximation, and Resource Management in a Data Stream Management System”, in Proceedings of the 1st CIDR Conference, 2003. • S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in Proceedings of the SIGMOD Conference, 2002. • D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of the VLDB Conference, 2002. • D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of the VLDB Conference, 2003. • Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html • Aurora Project Website: http://www.cs.brown.edu/research/aurora

  48. End
