
Consistent Streaming Through Time: A Vision for Event Stream Processing

By Jonathan Goldstein (speaker), Roger Barga, Mohamed Ali, and Mingsheng Hong, Microsoft Research.


Presentation Transcript


  1. Consistent Streaming Through Time: A Vision for Event Stream Processing, by Jonathan Goldstein (speaker), Roger Barga, Mohamed Ali, and Mingsheng Hong, Microsoft Research

  2. Are StreamSQL semantics ok?
     • Suppose we want to monitor the bandwidth of a device:
       • We create an input stream which has one field: bytes sent
       • We create an output stream which computes a windowed sum
     • What are the StreamSQL semantics when the system gets overloaded (strange question to ask)?
       • Either events must be dropped, or they must be queued at the receiver or sender for later processing
       • Since window semantics are based on system time (StreamSQL server time), if the device has constant bandwidth, apparent bandwidth will decrease!
       • In StreamSQL, the user has no reasonable way of knowing!
     • Conclusion: Something is deeply wrong with the use of time in StreamSQL query semantics!
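The overload effect described on this slide can be illustrated with a small simulation. This is a sketch under stated assumptions, not StreamSQL itself: the field names `app_time` and `system_time`, the 0.5 s per-event processing cost, and the helper `windowed_sum` are all hypothetical.

```python
from collections import defaultdict

def windowed_sum(events, time_field, window=1.0):
    """Sum bytes into tumbling windows keyed by the chosen timestamp field."""
    sums = defaultdict(int)
    for ev in events:
        sums[int(ev[time_field] // window)] += ev["bytes"]
    return dict(sums)

# Assumed device: 100 bytes every 0.25 s of application time (a constant 400 bytes/s).
events = [{"app_time": i * 0.25, "bytes": 100} for i in range(8)]

# Assumed overload: the server needs 0.5 s per event, so events queue up
# and are stamped with a later and later system time.
for i, ev in enumerate(events):
    ev["system_time"] = max(ev["app_time"], i * 0.5)

by_app = windowed_sum(events, "app_time")        # windows on application time
by_system = windowed_sum(events, "system_time")  # windows on server time

print(by_app)     # constant bandwidth per window
print(by_system)  # apparent bandwidth is halved and spread over more windows
```

With windows on application time every window reports the same sum; with windows on system time the same data appears to show a lower bandwidth, and nothing in the output tells the user why.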

  3. What’s in the paper?
     • Laundry list of CEDR features either unsupported or poorly supported in existing streaming systems (read the paper)
       • Some of these features come from event processing
       • Some come from specific scenarios which we believe to be important
     • These features are described formally through a query language description

  4. What’s in the talk (and the paper)?
     • Formal definitions of CEDR streams and operator semantics
       • Provides a clear and intuitive framework for discussing subtle semantic issues
     • Formalization of materialized view update semantics in standing queries, and a discussion of why they are inadequate in isolation
     • Definition of a non-view-update-compliant operator which can express a very wide range of seemingly disparate streaming features
       • A myriad of window types, the separation of inserts and deletes, etc.
     • Theoretical discussion of both the expression and the correct handling of data delivered out of order and of data retraction
       • Different formal notions of correctness lead to different consistency levels and associated performance tradeoffs

  5. What is a stream and a standing query?
     • A stream is a (possibly infinite) collection of events, where each event contains:
       • A payload (P)
       • A key which uniquely identifies the event (K)
       • An interval of application time for which the payload is valid, [Vs, Ve)
       • A time at which it arrives at a listener (C, for CEDR time)
     • A standing query is an operator graph, where each operator takes 0 or more input streams and produces 0 or more output streams
     Acknowledgement: This is inspired by and built on Rick Snodgrass’s temporal work
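The event model on this slide maps naturally onto a small record type. The following is a minimal sketch; the class name `Event`, its field names, and the `snapshot` helper are illustrative assumptions rather than CEDR's actual representation.

```python
import math
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Event:
    key: str          # K: uniquely identifies the event
    payload: Any      # P
    vs: float         # Vs: valid start (application time), inclusive
    ve: float         # Ve: valid end (application time), exclusive
    cedr_time: float  # C: arrival time at the listener

    def valid_at(self, t: float) -> bool:
        """True if the payload is valid at application time t, i.e. t is in [Vs, Ve)."""
        return self.vs <= t < self.ve

def snapshot(stream, t):
    """Payloads valid at application time t, by key. CEDR time plays no part."""
    return {e.key: e.payload for e in stream if e.valid_at(t)}

stream = [
    Event("k1", 10, vs=0, ve=5, cedr_time=1),
    Event("k2", 20, vs=3, ve=math.inf, cedr_time=2),
]
print(snapshot(stream, 4))
print(snapshot(stream, 5))
```

Because the interval is half-open, `k1` is no longer part of the snapshot at time 5.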

  6. What properties do operators have?
     • All operators should be well behaved:
       • Definition 6: A CEDR operator O is well behaved iff for all (combinations of) inputs to O which are logically equivalent to infinity, O’s outputs are also logically equivalent to infinity
       • Any well behaved operator, when given two sets of input streams that are identical except for CEDR time, should produce sets of output streams that are identical except for CEDR time
       • Query semantics are independent of CEDR time

  7. What properties do operators have?
     • Some operators are also view update compliant:
       • Definition 11: A unary CEDR operator O is view update compliant iff for all R, S s.t. *(R) and *(S) are identical, *(O(R)) and *(O(S)) are also identical
       • If we interpret the stream as describing a changing relation where each row’s lifetime is specified by valid time, then:
         • A view update compliant operator produces snapshot identical output for snapshot identical input

  8. What are our operators?
     • We may now happily use all our favorite relational operators:
       • Definition 9: Join $\bowtie_{\theta(P_1,P_2)}(S_1, S_2)$:
         $\bowtie_{\theta(P_1,P_2)}(S_1, S_2) = \{(V_s, V_e, (e_1.\mathit{Payload} \text{ concatenated with } e_2.\mathit{Payload})) \mid e_1 \in E(S_1),\ e_2 \in E(S_2),\ V_s = \max\{e_1.V_s, e_2.V_s\},\ V_e = \min\{e_1.V_e, e_2.V_e\},\ \text{where } V_s < V_e \text{ and } \theta(e_1.\mathit{Payload}, e_2.\mathit{Payload})\}$
     • These operators’ output streams describe the changing contents of a materialized view computed over the changing input relation(s) described by the input streams
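Definition 9 can be read directly as a nested loop over two finite streams. The sketch below assumes events are plain `(vs, ve, payload)` records with tuple payloads; the names `Ev` and `join` are hypothetical, and CEDR time is omitted because the definition does not use it.

```python
from typing import Callable, List, NamedTuple

class Ev(NamedTuple):
    vs: float       # valid start, inclusive
    ve: float       # valid end, exclusive
    payload: tuple

def join(s1: List[Ev], s2: List[Ev], theta: Callable[[tuple, tuple], bool]) -> List[Ev]:
    """Temporal join: output lifetime is the intersection of the input lifetimes."""
    out = []
    for e1 in s1:
        for e2 in s2:
            vs = max(e1.vs, e2.vs)
            ve = min(e1.ve, e2.ve)
            # Keep the pair only if the lifetimes overlap and theta holds.
            if vs < ve and theta(e1.payload, e2.payload):
                out.append(Ev(vs, ve, e1.payload + e2.payload))
    return out

s1 = [Ev(0, 10, ("a", 1))]
s2 = [Ev(5, 15, ("b", 1)),   # overlaps [5, 10) and matches
      Ev(10, 20, ("c", 1)),  # lifetimes only touch, so the intersection is empty
      Ev(2, 4, ("d", 2))]    # overlaps but fails the predicate
same_id = lambda p, q: p[1] == q[1]
result = join(s1, s2, same_id)
print(result)
```

Only the first pair survives, with lifetime [5, 10), which is exactly the period during which both input rows are in their relations.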

  9. Non-view update compliant operators
     • Moving window – all output valid end times are set to their valid start times plus the window size
     • Insert separation (CQL) – all output valid end times are set to infinity
     • The semantics of these operations plus many more can be easily captured using AlterLifetime:
       • Definition 12: AlterLifetime $\Pi_{f_{V_s}, f_\Delta}(S)$:
         $\Pi_{f_{V_s}, f_\Delta}(S) = \{(|f_{V_s}(e)|,\ |f_{V_s}(e)| + |f_\Delta(e)|,\ e.\mathit{Payload}) \mid e \in E(S)\}$
       • Allows the lifetime of input events to be recomputed
       • It is not view update compliant, but it is well behaved
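A minimal sketch of AlterLifetime, and of the two operators the slide derives from it, might look as follows. It assumes `(vs, ve, payload)` records and reads the bars in Definition 12 as absolute value; `Ev`, `alter_lifetime`, `moving_window`, `insert_separation`, and `snapshot` are illustrative names. The last lines show why the operator is not view update compliant: two snapshot-identical inputs yield different snapshots after a moving window is applied.

```python
import math
from typing import NamedTuple

class Ev(NamedTuple):
    vs: float
    ve: float
    payload: str

def alter_lifetime(stream, f_vs, f_delta):
    """Recompute each event's lifetime as [|f_vs(e)|, |f_vs(e)| + |f_delta(e)|)."""
    return [Ev(abs(f_vs(e)), abs(f_vs(e)) + abs(f_delta(e)), e.payload) for e in stream]

def moving_window(stream, size):
    """Valid end becomes valid start plus the window size."""
    return alter_lifetime(stream, lambda e: e.vs, lambda e: size)

def insert_separation(stream):
    """Valid end becomes infinity, so only the insert side of each event remains."""
    return alter_lifetime(stream, lambda e: e.vs, lambda e: math.inf)

def snapshot(stream, t):
    """Sorted payloads valid at application time t."""
    return sorted(e.payload for e in stream if e.vs <= t < e.ve)

# R and S describe the same changing relation: "x" is present throughout [0, 10).
R = [Ev(0, 10, "x")]
S = [Ev(0, 5, "x"), Ev(5, 10, "x")]

print(snapshot(R, 6), snapshot(S, 6))                                      # identical inputs
print(snapshot(moving_window(R, 2), 6), snapshot(moving_window(S, 2), 6))  # differing outputs
```

The inputs agree at every point in time, yet after a window of size 2 only `S` still has a row at time 6, because its second event restarts the window at time 5.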

  10. But is this implementable?
     • Avg(P) – The usual average operator in materialized view update compliant form
     • [Figures: Input; Correct Output]
     • But how could CEDR know it needed to wait for K2 (to produce output) when it saw K1?
     • It couldn’t have without waiting indefinitely or without some external guarantee

  11. But is this implementable?
     • We need the ability to retract previously output results in the stream
     • [Figure: one stream shown as logically equivalent to another]

  12. But is this implementable?
     • Our real definition of well behavedness: Any well behaved operator, when given logically equivalent sets of input streams, produces logically equivalent sets of output streams
     • Avg may now fully retract incorrect previous output and issue new correct output for the appropriate time period
     • We can denote operator semantics in a very clean manner even in a system with arbitrarily out of order data
     • The use of retractions to handle out of order data induces a spectrum of formally defined consistency levels for operators
     • These levels expose interesting tradeoffs between various aspects of performance and correctness (much more in the paper)
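The idea of an average that retracts and reissues its output can be sketched in a few lines. This is a deliberately simplified model, not CEDR's operator: inputs are assumed to be `(app_time, value)` pairs grouped into tumbling windows, a retraction is modeled as an explicit `("retract", ...)` record, and the names `optimistic_avg` and `net_result` are hypothetical.

```python
def optimistic_avg(arrivals, window=10):
    """Emit an average per window as soon as possible, retracting stale results."""
    totals = {}   # window index -> (sum, count)
    output = []
    for app_time, value in arrivals:
        w = app_time // window
        if w in totals:
            s, n = totals[w]
            output.append(("retract", w, s / n))  # withdraw the earlier answer
        else:
            s, n = 0, 0
        s, n = s + value, n + 1
        totals[w] = (s, n)
        output.append(("insert", w, s / n))       # issue the corrected answer
    return output

def net_result(output):
    """Apply inserts and retractions to obtain what the stream logically says."""
    live = {}
    for kind, w, avg in output:
        if kind == "insert":
            live[w] = avg
        elif live.get(w) == avg:
            del live[w]
    return live

in_order = [(1, 10), (4, 20), (12, 30)]
out_of_order = [(1, 10), (12, 30), (4, 20)]  # the event at time 4 arrives late

print(optimistic_avg(out_of_order))
print(net_result(optimistic_avg(in_order)) == net_result(optimistic_avg(out_of_order)))
```

The two arrival orders produce different physical output streams, but once retractions are applied they describe the same result, which is the sense of "logically equivalent" this sketch assumes.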

  13. Imperfections in Event Streaming
     • How do current systems cope:
       • Wait until we’re sure we have all data that affects our results up to a point in time (High Consistency)
         • High latency
         • Requires application and network guarantee
         • Requires high memory
         • Absolutely correct answers
         • Useful for standing queries that result in some expensive form of corrective or examination action:
           • A human must examine something because some aggregation (avg) or negation based alert tripped
       • Provide an answer quickly as of the current time, but ignore late arriving data (Low Consistency)
         • Low latency
         • No application or network guarantee required
         • Low memory
         • Sacrifices answer correctness
         • Useful in applications which are unable to provide guarantees about data arrival timeliness and where exact answers aren’t required:
           • E.g. aggregations in internet scale monitoring

  14. Imperfections in Event Streaming
     • With retractions:
       • Compute our output early in an optimistic fashion and retract later if necessary (Middle Consistency)
         • Low latency
         • Doesn’t require application and network guarantees
         • High memory requirements: equal to the high consistency case if we have guarantees
         • May produce more output
         • Useful in situations where we don’t want to block, but where we want eventual correctness:
           • Stock ticker data example: we want to compute real time info about stock data, but compensate when a correction is issued
           • Shared expressions between two queries, one running at the high level of consistency and one at the low
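The three consistency levels from the last two slides can be contrasted on a single late event. The sketch below is a toy model under several assumptions: the query is a per-window sum over `(app_time, value)` arrivals, a window is output optimistically once data for a later window arrives, and end of input stands in for the external guarantee that the high level waits for. The function `run` and the level names are hypothetical.

```python
def run(arrivals, level, window=10):
    """Return the (op, window, sum) records emitted at the given consistency level."""
    sums = {}        # window index -> running sum
    emitted = set()  # windows whose result has already been output
    out = []
    for app_time, value in arrivals:
        w = app_time // window
        if w in emitted:
            # Late event for a window that was already output.
            if level == "low":
                continue                          # ignore late data
            out.append(("retract", w, sums[w]))   # middle: withdraw and correct
            sums[w] += value
            out.append(("insert", w, sums[w]))
            continue
        sums[w] = sums.get(w, 0) + value
        if level != "high":
            # Low and middle output earlier windows as soon as newer data shows up.
            for old in sorted(k for k in sums if k < w and k not in emitted):
                out.append(("insert", old, sums[old]))
                emitted.add(old)
    # End of input: everything still pending can now be output safely.
    for w in sorted(k for k in sums if k not in emitted):
        out.append(("insert", w, sums[w]))
    return out

arrivals = [(1, 5), (12, 7), (3, 2)]  # the event at time 3 arrives late
for level in ("high", "low", "middle"):
    print(level, run(arrivals, level))
```

In this run the high level blocks until the end and is correct, the low level answers early but reports 5 instead of 7 for the first window, and the middle level answers early and then corrects itself at the cost of extra output.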
