1 / 53

Query Processing, Resource Management, and Approximation in a Data Stream Management System

Query Processing, Resource Management, and Approximation in a Data Stream Management System. Selected subset of slides taken from talk by Jennifer Widom at NEDS. st anfordst re amdat am anager. Data Streams. Continuous, unbounded, rapid, time-varying streams of data elements

arich
Download Presentation

Query Processing, Resource Management, and Approximation in a Data Stream Management System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Processing, Resource Management, and Approximation in a Data Stream Management System Selected subset of slides taken from talk by Jennifer Widom at NEDS. stanfordstreamdatamanager

  2. Data Streams • Continuous, unbounded, rapid, time-varying streams of data elements • Occur in a variety of modern applications • Network monitoring and traffic engineering • Sensor networks • Telecom call records • Financial applications • Web logs and click-streams • Manufacturing processes • DSMS = Data Stream Management System

  3. The STREAM System • Declarative language for registering continuous queries; considering data streams and stored relations

  4. Contributions to Date • Semantics for continuous queries • Query plans • Exploiting stream constraints • Operator scheduling • Approximation techniques

  5. Streamed Result Stored Result Register Query Input streams Archive The (Simplified) Big Picture DSMS Scratch Store Stored Relations

  6. Archive (Simplified) Network Monitoring Intrusion Warnings Online Performance Metrics Register Monitoring Queries DSMS Network measurements, Packet traces Scratch Store Lookup Tables

  7. Declarative Language for Continuous Queries • A distinction between STREAM and Aurora : • Aurora users directly manipulate one large execution plan • STREAM compiles declarative queries into individual plans, system may merge plans • STREAM also supports direct entry of plans • Syntax based on SQL, additional constructs for sliding windows and sampling

  8. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

  9. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderIDAnd F.clerk = “Sue” And O.customer = “Joe”

  10. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O,Fulfillments F [Range 1 Day] Where O.orderID = F.orderIDAnd F.clerk = “Sue” And O.customer = “Joe”

  11. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderIDAnd F.clerk = “Sue” And O.customer = “Joe”

  12. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderIDAnd F.clerk = “Sue” And O.customer = “Joe”

  13. Example Query 1 Two streams, contrived for ease of examples: Orders (orderID, customer, cost) Fulfillments (orderID, clerk) Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe” Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day] Where O.orderID = F.orderIDAnd F.clerk = “Sue” And O.customer = “Joe”

  14. Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

  15. Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

  16. Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

  17. Example Query 2 Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample Where O.orderID = F.orderID Group By F.clerk

  18. Semantics of Database Languages • An often neglected topic • Traditional relational databases are in reasonable shape • Relational algebra  SQL • But triggers were a mess • The semantics of an innocent-looking continuous query over data streams may not be obvious

  19. A Nonobvious Continuous Query • Stream of stock quotes: Stocks(ticker,price) • Monitor last 10 minutes of quotes: Select  From Stocks [Range 10 minutes] • Is result a relation, a stream, or something else? • If a relation, what exactly does it contain? • If a stream, how does query differ from: Select  From Stocks [Range 1 minute] or Select  From Stocks []

  20. Our Semantics and Language for Continuous Queries • Abstract: interpretation for CQs based on certain “black boxes” • Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences • Goals • CQs over multiple streams and relations • Exploit relational semantics to the extent possible • Easy queries should be easy to write, simple queries should do what you expect

  21. Relations and Streams • Assume global, discrete, ordered time domain (more on this later) • Relation • Maps time Tto set-of-tuples R • Stream • Set of (tuple,timestamp) elements

  22. Window specification Any relational query language Special operators: Istream, Dstream, Rstream Conversions Streams Relations

  23. Conversion Definitions • Stream-to-relation • S [W] is a relation — at time T itcontains all tuples in window W applied to stream S up to T • When W = , contains all tuples in stream S up to T • Relation-to-stream • Istream(R) contains all (r,T ) where rR at time T but rR at time T–1 • Dstream(R) contains all (r,T ) where rR at time T–1 but rR at time T • Rstream(R) contains all (r,T ) where rR at time T

  24. Abstract Semantics • Take any relational query language • Can reference streams in place of relations • But must convert to relations using any window specification language ( default window = [] ) • Can convert relations to streams • For streamed results • For windows over relations (note: converts back to relation)

  25. Query Result at Time T • Use all relations at time T • Use all streams up to T, converted to relations • Compute relational result • Convert result to streams if desired

  26. Time • Easiest: global system clock • Stream elements and relation updates timestamped on entry to system • Application-defined time • Streams and relation updates contain application timestamps, may be out of order • Application generates “heartbeat” • Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress • Query results in application time

  27. Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk • Maximum-cost order fulfilled by each clerk in last 1000 fulfillments

  28. Abstract Semantics – Example 1 Select F.clerk, Max(O.cost) From O [], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk • At time T: entire stream O and last 1000 tuples of F as relations • Evaluate query, update result relation at T

  29. Abstract Semantics – Example 1 Select Istream(F.clerk, Max(O.cost)) From O [], F [Rows 1000] Where O.orderID = F.orderID Group By F.clerk • At time T: entire stream O and last 1000 tuples of F as relations • Evaluate query, update result relation at T • Streamed result: New element (<clerk,max>,T) whenever <clerk,max> changes from T–1

  30. Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock • Average price over last day for each stock

  31. Abstract Semantics – Example 2 Relation CurPrice(stock, price) Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock • Istream provides history of CurPrice • Window on history, back to relation, group and aggregate

  32. Concrete Language – CQL • Relational query language: SQL • Window spec. language derived from SQL-99 • Tuple-based, time-based, partitioned • Syntactic shortcuts and defaults • So easy queries are easy to write and simple queries do what you expect • Equivalences • Basis for query-rewrite optimizations • Includes all relational equivalences, plus new stream-based ones

  33. Two Extremely Simple CQL Examples Select  From Strm • Had better return Strm (It does) • Default  window for Strm • Default Istream for result Select  From Strm, Rel Where Strm.A = Rel.B • Often want “NOW” window for Strm • But may not want as default

  34. Query Execution • When a continuous query is registered, generation a query plan • Users can also register plans directly • Plans composed of three main components: • Operators (as in most conventional DBMS’s) • Inter-operator Queues (as in many conventional DBMS’s) • State (synopses) • Global scheduler for plan execution

  35. State1 ⋈ State2 Operators and State • State (synopses) • Summarize tuples seen so far (exact or approximate) for operators requiring history • To implement windows • Example: synopsis join • Sliding-window join • Approximation of full join

  36. Simple Query Plan Q1 Q2  State3 ⋈ State4 Scheduler State1 ⋈ State2 Stream3 Stream1 Stream2

  37. Some Issues in Query Plan Generation • Compatibility and conversions for streams and relations (+/- streams) • State sharing, incremental computation • Windowed joins: Multiway versus 2-way • Windows in general: push down, pull up, split, merge, … • Time coordination, operator-level heartbeats

  38. Memory Overhead in Query Processing • Queues + State • Continuous queries keep state indefinitely • Online requirements suggest using memory rather than disk • But we realize this assumption is shaky

  39. Memory Overhead in Query Processing • Queues + State • Continuous queries keep state indefinitely • Online requirements suggest using memory rather than disk • But we realize this assumption is shaky • Goal: minimize memory use while providing timely, accurate answers

  40. Reducing Memory Overhead Two main techniques to date • Exploit constraints on streams to reduce state • Clever operator scheduling to reduce queue sizes

  41. Exploiting Stream Constraints • For most queries, unbounded memory is required for arbitrary streams [PODS ’01]

  42. Exploiting Stream Constraints • For most queries, unbounded memory is required for arbitrary streams [PODS ’01] • But streams may exhibit constraints that reduce, bound, or even eliminate state

  43. Exploiting Stream Constraints • For most queries, unbounded memory is required for arbitrary streams [PODS ’01] • But streams may exhibit constraints that reduce, bound, or even eliminate state • Conventional database constraints • Redefined for streams • “Relaxed” for stream environment

  44. Stream Constraints • Each constraint type defines adherence parameterk • Clustered(k) for attributeS.A • Ordered(k) for attributeS.A • Referential-Integrity(k) for join S1 S2

  45. Algorithm for Exploiting Constraints • Input • Any Select-Project-Join query over streams • Any set of k-constraints • Output • Query execution plan that reduces or eliminates state based on k-constraints • If constraints violated, get approximate result

  46. Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O

  47. Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O • Clustered(k) on F.orderID • Matched O tuples discarded after k arrivals of non-matching F’s

  48. Constraint Examples Orders (orderID, cost) Fulfillments (orderID, portion, clerk) Query: Many-one join F  O • Clustered(k) on F.orderID • Matched O tuples discarded after k arrivals of non-matching F’s • Referential-Integrity(k) • F tuples retained for at most k arrivals of O tuples

  49. Operator Scheduling • Global scheduler invokes run method of query plan operators with “timeslice” parameter • Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, … • First scheduler: round-robin • Second scheduler: minimize queue sizes • Third scheduler: minimize combination of queue sizes and latency

  50. Approximation • Why approximate? • Memory requirement too high, even with constraints and clever scheduling • Can’t process streams fast enough for query load

More Related