Consistent Streaming Through Time: A Vision for Event Stream Processing
by Jonathan Goldstein (speaker), Roger Barga, Mohamed Ali, and Mingsheng Hong
Microsoft Research
Are StreamSQL semantics ok?
• Suppose we want to monitor the bandwidth of a device:
  • We create an input stream which has one field: bytes sent
  • We create an output stream which computes a windowed sum
• What are the StreamSQL semantics when the system gets overloaded (a strange question to ask)?
  • Either events must be dropped, or they must be queued at the receiver or sender for later processing
  • Since window semantics are based on system time (StreamSQL server time), the apparent bandwidth will decrease even if the device's actual bandwidth is constant!
  • In StreamSQL, the user has no reasonable way of knowing!
• Conclusion: Something is deeply wrong with the use of time in StreamSQL query semantics!
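The effect described on this slide can be illustrated with a minimal sketch. This is not StreamSQL; the event layout, the field names (`app_time`, `sys_time`, `bytes`), the tumbling-window choice, and the overload model are all assumptions made for illustration.

```python
from collections import defaultdict

def windowed_sum(events, window, time_field):
    """Sum bytes per tumbling window, bucketing on the chosen timestamp field."""
    sums = defaultdict(int)
    for ev in events:
        sums[ev[time_field] // window] += ev["bytes"]
    return dict(sums)

# Hypothetical device: sends 100 bytes once per second for 10 seconds.
# Hypothetical overloaded server: handles one event every 2 seconds,
# so the event sent at second t is processed at server second 2 * t.
events = [{"app_time": t, "sys_time": 2 * t, "bytes": 100} for t in range(10)]

by_app_time = windowed_sum(events, window=5, time_field="app_time")
by_sys_time = windowed_sum(events, window=5, time_field="sys_time")

print(by_app_time)  # {0: 500, 1: 500}
print(by_sys_time)  # {0: 300, 1: 200, 2: 300, 3: 200}
```

Windowing on application time reports the constant 500 bytes per window, while windowing on server time reports 200 to 300 bytes per window for the same data, and nothing in the output signals that the difference comes from queuing.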
What’s in the paper?
• Laundry list of CEDR features either unsupported or poorly supported in existing streaming systems (read the paper)
  • Some of these features come from event processing
  • Some come from specific scenarios which we believe to be important
• These features are described formally through a query language description
What’s in the talk (and the paper)?
• Formal definitions of CEDR streams and operator semantics
  • Provides a clear and intuitive framework for discussing subtle semantic issues
• Formalization of materialized view update semantics in standing queries, and a discussion of why they are inadequate in isolation
• Definition of a non-view update compliant operator which can express a very wide range of seemingly disparate streaming features
  • A myriad of window types, the separation of inserts and deletes, etc.
• A theoretical discussion of both the expression and the correct handling of data delivered out of order and of data retraction
• Different formal notions of correctness lead to different consistency levels and associated performance tradeoffs
What is a stream and a standing query?
• A stream is a (possibly infinite) collection of events, where each event contains:
  • A payload (P)
  • A key which uniquely identifies the event (K)
  • An interval of application time for which the payload is valid, [Vs, Ve)
  • A time at which it arrives at a listener (C, for CEDR time)
• A standing query is an operator graph, where each operator takes 0 or more input streams and produces 0 or more output streams
Acknowledgement: This is inspired by and built on Rick Snodgrass’s temporal work
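The four parts of an event listed above can be written down as a small record type. This is only a sketch: the class name `Event`, the field names, and the `valid_at` helper are assumptions for illustration and are not taken from CEDR.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Event:
    key: str      # K: uniquely identifies the event
    payload: Any  # P: the data carried by the event
    vs: float     # Vs: valid start time (application time, inclusive)
    ve: float     # Ve: valid end time (application time, exclusive)
    c: int        # C: CEDR time, when the event arrived at the listener

    def valid_at(self, t: float) -> bool:
        """True if the payload is valid at application time t, i.e. t in [Vs, Ve)."""
        return self.vs <= t < self.ve

# Hypothetical event: payload valid over [0, 10), observed at CEDR time 3.
e = Event(key="K1", payload={"bytes": 100}, vs=0, ve=10, c=3)
print(e.valid_at(0), e.valid_at(10))  # True False
```

Keeping valid time and CEDR time as separate fields is what lets query semantics be stated over `[Vs, Ve)` alone, without reference to when the event happened to arrive.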
What properties do operators have?
• All operators should be well behaved:
  • Definition 6: A CEDR operator O is well behaved iff for all (combinations of) inputs to O which are logically equivalent to infinity, O’s outputs are also logically equivalent to infinity
  • Any well behaved operator, when given two sets of input streams that are identical except for CEDR time, should produce sets of output streams that are identical except for CEDR time
  • Query semantics are independent of CEDR time
What properties do operators have?
• Some operators are also view update compliant:
  • Definition 11: A unary CEDR operator O is view update compliant iff for all R, S s.t. *(R) and *(S) are identical, *(O(R)) and *(O(S)) are also identical
  • If we interpret the stream as describing a changing relation where each row’s lifetime is specified by valid time, then:
    • A view update compliant operator produces snapshot identical output for snapshot identical input
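The snapshot reading of view update compliance can be sketched as follows. Events are simplified here to `(Vs, Ve, payload)` tuples, and the helper names `snapshot`, `snapshot_identical`, and `select` are assumptions for illustration; the sketch checks the snapshot interpretation stated on the slide, not the formal `*` operator of Definition 11.

```python
def snapshot(stream, t):
    """Contents of the relation described by the stream at application time t."""
    return sorted(p for (vs, ve, p) in stream if vs <= t < ve)

def snapshot_identical(r, s):
    """Compare snapshots at every lifetime boundary (snapshots only change there)."""
    times = {t for (vs, ve, _) in r + s for t in (vs, ve)}
    return all(snapshot(r, t) == snapshot(s, t) for t in times)

def select(stream, pred):
    """Relational selection: keeps lifetimes untouched, filters on payload."""
    return [(vs, ve, p) for (vs, ve, p) in stream if pred(p)]

# Two different event sets that describe the same changing relation:
R = [(0, 10, 5)]
S = [(0, 5, 5), (5, 10, 5)]

print(snapshot_identical(R, S))  # True
print(snapshot_identical(select(R, lambda p: p > 3),
                         select(S, lambda p: p > 3)))  # True
```

Selection passes the check for this pair because it never looks at how a lifetime is split into events, only at which payloads are valid at each instant.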
What are our operators?
• We may now happily use all our favorite relational operators:
  • Definition 9: Join $\bowtie_{\theta(P_1,P_2)}(S_1, S_2)$:
    $$\bowtie_{\theta(P_1,P_2)}(S_1, S_2) = \{\,(V_s, V_e, (e_1.\mathrm{Payload} \text{ concatenated with } e_2.\mathrm{Payload})) \mid e_1 \in E(S_1),\ e_2 \in E(S_2),\ V_s = \max\{e_1.V_s, e_2.V_s\},\ V_e = \min\{e_1.V_e, e_2.V_e\},\ \text{where } V_s < V_e \text{ and } \theta(e_1.\mathrm{Payload}, e_2.\mathrm{Payload})\,\}$$
• These operators’ output streams describe the changing contents of a materialized view computed over the changing input relation(s) described by the input streams
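Definition 9 translates almost directly into code. The following is a minimal nested-loop sketch, assuming events are `(Vs, Ve, payload)` tuples with tuple payloads; the function name `temporal_join` and the sample data are hypothetical.

```python
def temporal_join(s1, s2, theta):
    """Join per Definition 9: intersect lifetimes, keep pairs that satisfy theta."""
    out = []
    for (vs1, ve1, p1) in s1:
        for (vs2, ve2, p2) in s2:
            vs, ve = max(vs1, vs2), min(ve1, ve2)
            # Emit only when the lifetimes overlap and the predicate holds.
            if vs < ve and theta(p1, p2):
                out.append((vs, ve, p1 + p2))  # payload concatenation
    return out

# Hypothetical streams; payloads are (name, id) tuples joined on id.
S1 = [(0, 10, ("a", 1))]
S2 = [(5, 15, ("b", 1)), (12, 20, ("c", 1))]

result = temporal_join(S1, S2, lambda p, q: p[1] == q[1])
print(result)  # [(5, 10, ('a', 1, 'b', 1))]
```

The second event of `S2` matches on id but its lifetime does not overlap `[0, 10)`, so it contributes nothing: the output is valid exactly while both inputs are valid.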
Non-view update compliant operators
• Moving window – all output valid end times are set to their valid start times plus the window size
• Insert separation (CQL) – all output valid end times are set to infinity
• The semantics of these operations, plus many more, can be easily captured using AlterLifetime:
  • Definition 12: AlterLifetime $\Pi_{f_{V_s}, f_{\Delta}}(S)$:
    $$\Pi_{f_{V_s}, f_{\Delta}}(S) = \{\,(|f_{V_s}(e)|,\ |f_{V_s}(e)| + |f_{\Delta}(e)|,\ e.\mathrm{Payload}) \mid e \in E(S)\,\}$$
  • Allows the lifetime of input events to be recomputed
  • It is not view update compliant, but it is well behaved
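A minimal sketch of Definition 12, again assuming `(Vs, Ve, payload)` tuples. The function names and the way the two special cases are parameterized are assumptions for illustration.

```python
import math

def alter_lifetime(stream, f_vs, f_delta):
    """AlterLifetime per Definition 12: recompute each event's [Vs, Ve)."""
    out = []
    for e in stream:
        vs = abs(f_vs(e))
        out.append((vs, vs + abs(f_delta(e)), e[2]))
    return out

def moving_window(stream, w):
    # Valid end = valid start + window size.
    return alter_lifetime(stream, lambda e: e[0], lambda e: w)

def insert_separation(stream):
    # Valid end = infinity: only the insertion point of each event is kept.
    return alter_lifetime(stream, lambda e: e[0], lambda e: math.inf)

stream = [(0, 10, "x"), (3, 4, "y")]
print(moving_window(stream, 5))   # [(0, 5, 'x'), (3, 8, 'y')]
print(insert_separation(stream))  # [(0, inf, 'x'), (3, inf, 'y')]

# Why it is not view update compliant: R and S describe the same relation
# (x is valid over [0, 10) in both), yet the windowed outputs differ at t = 7.
R = [(0, 10, "x")]
S = [(0, 5, "x"), (5, 10, "x")]
print(moving_window(R, 5))  # [(0, 5, 'x')]
print(moving_window(S, 5))  # [(0, 5, 'x'), (5, 10, 'x')]
```

The output depends on how the input lifetime was split into events, which is exactly the information a view update compliant operator is not allowed to use.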
But is this implementable?
[Table: Input]
• Avg(P) – The usual average operator in materialized view update compliant form
• But how could CEDR know it needed to wait for K2 (to produce output) when it saw K1?
  • It couldn’t have without waiting indefinitely or without some external guarantee
[Table: Correct Output]
But is this implementable?
• We need the ability to retract previously output results in the stream:
[Figure: example stream] is logically equivalent to: [Figure: example stream]
But is this implementable?
• Our real definition of well behavedness: Any well behaved operator, when given logically equivalent sets of input streams, produces logically equivalent sets of output streams
• Avg may now fully retract incorrect previous output and issue new correct output for the appropriate time period
• We can denote operator semantics in a very clean manner even in a system with arbitrarily out of order data
• The use of retractions to handle out of order data induces a spectrum of formally defined consistency levels for operators
  • These levels expose interesting tradeoffs between various aspects of performance and correctness (much more in the paper)
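The retraction idea for Avg can be sketched as below. This is a deliberately naive version that recomputes the whole view on every arrival and diffs it against what was already emitted; the class name `OptimisticAvg`, the message format, and the K1/K2 values are assumptions, since the slide's tables are not available.

```python
def avg_view(events):
    """Piecewise-constant average over valid time, as (Vs, Ve, avg) rows."""
    times = sorted({t for (vs, ve, _) in events for t in (vs, ve)})
    out = []
    for lo, hi in zip(times, times[1:]):
        vals = [v for (vs, ve, v) in events if vs <= lo and hi <= ve]
        if vals:
            out.append((lo, hi, sum(vals) / len(vals)))
    return out

class OptimisticAvg:
    """Emits output immediately; retracts rows that later input invalidates."""
    def __init__(self):
        self.seen = []     # all input events so far
        self.emitted = []  # output rows currently standing

    def on_event(self, ev):
        self.seen.append(ev)
        new = avg_view(self.seen)
        msgs = [("retract", r) for r in self.emitted if r not in new]
        msgs += [("insert", r) for r in new if r not in self.emitted]
        self.emitted = new
        return msgs

# Hypothetical events (Vs, Ve, value): K1 arrives first, K2 arrives later.
K1, K2 = (0, 10, 4.0), (5, 10, 8.0)
op = OptimisticAvg()
first = op.on_event(K1)
second = op.on_event(K2)
print(first)   # [('insert', (0, 10, 4.0))]
print(second)  # [('retract', (0, 10, 4.0)), ('insert', (0, 5, 4.0)), ('insert', (5, 10, 6.0))]
```

Feeding K2 before K1 produces a different message sequence but the same standing output, which is the sense in which logically equivalent inputs yield logically equivalent outputs.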
Imperfections in Event Streaming
• How do current systems cope:
  • Wait until we’re sure we have all data that affects our results up to a point in time (High Consistency)
    • High latency
    • Requires application and network guarantees
    • Requires high memory
    • Absolutely correct answers
    • Useful for standing queries that result in some expensive form of corrective or examination action:
      • A human must examine something because some aggregation (avg) or negation based alert tripped
  • Provide an answer quickly as of the current time, but ignore late arriving data (Low Consistency)
    • Low latency
    • No application or network guarantee required
    • Low memory
    • Sacrifices answer correctness
    • Useful in applications which are unable to provide guarantees about data arrival timeliness and where exact answers aren’t required:
      • E.g. aggregations in internet scale monitoring
Imperfections in Event Streaming
• With retractions:
  • Compute our output early in an optimistic fashion and retract later if necessary (Middle Consistency)
    • Low latency
    • Doesn’t require application and network guarantees
    • High memory requirements: equal to the high consistency case if we have guarantees
    • May produce more output
    • Useful in situations where we don’t want to block, but where we want eventual correctness:
      • Stock ticker data example: we want to compute real time info about stock data, but compensate when a correction is issued
      • Shared expressions between two queries, one running at the high level of consistency and one at the low
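The three consistency levels can be contrasted on one small out-of-order input. This is a toy sketch under several assumptions: the operator is a per-bucket sum over application time, a `"guarantee"` message (a hypothetical name) promises that no event with an earlier application time will arrive, and the low and middle levels use it only to flush the last bucket.

```python
def run(arrivals, level, width=10):
    """Per-bucket sum at 'high', 'low', or 'middle' consistency."""
    sums, emitted, out = {}, {}, []
    newest = -1  # newest bucket seen so far (used by low and middle)
    for kind, t, value in arrivals:
        if kind == "guarantee":
            # Buckets that end at or before t can no longer change.
            for b in sorted(sums):
                if (b + 1) * width <= t and b not in emitted:
                    emitted[b] = sums[b]
                    out.append(("insert", b, sums[b]))
            continue
        b = t // width
        late = b < newest
        if late and level == "low":
            continue  # low: ignore late arriving data
        sums[b] = sums.get(b, 0) + value
        if level == "high":
            continue  # high: block until a guarantee arrives
        if late:
            # middle: retract the optimistic answer and issue a corrected one
            if b in emitted:
                out.append(("retract", b, emitted[b]))
            emitted[b] = sums[b]
            out.append(("insert", b, sums[b]))
        elif b > newest:
            # A newer bucket started: optimistically emit the older ones.
            for old in sorted(sums):
                if old < b and old not in emitted:
                    emitted[old] = sums[old]
                    out.append(("insert", old, sums[old]))
            newest = b
    return out

# Hypothetical arrivals: the event at application time 3 shows up late.
arrivals = [("event", 1, 5), ("event", 12, 7), ("event", 3, 2), ("guarantee", 20, None)]
for level in ("high", "low", "middle"):
    print(level, run(arrivals, level))
# high   [('insert', 0, 7), ('insert', 1, 7)]
# low    [('insert', 0, 5), ('insert', 1, 7)]
# middle [('insert', 0, 5), ('retract', 0, 5), ('insert', 0, 7), ('insert', 1, 7)]
```

High consistency emits only final answers but nothing until the guarantee, low consistency answers early and leaves bucket 0 wrong, and middle consistency answers early and ends up correct at the cost of extra output and of retaining state for old buckets.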