Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo
Blocking Operators • A blocking query operator is ‘one that is unable to produce the first tuple of the output until it has seen the entire input’ [Babcock et al. PODS02] • But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in, so blocking operators cannot be used. • Only non-blocking (nb) queries and operators can be used on data streams (i.e., those that return their results before they have detected the end of the input). • Current DBMSs make heavy use of blocking computations: (1) for operators that are intrinsically blocking, e.g., SQL aggregates, and (2) for those that are not, e.g., sort-based implementations of joins and group-by. • We only need to be concerned with case (1): find a characterization of blocking and nonblocking operators that is independent of implementation.
Partial Ordering Let S = [t1, …, tn] be a sequence and 0 ≤ k ≤ n. Then [t1, …, tk] is said to be the presequence of S of length k, denoted by S_k. We write L ⊑ S to denote that L is a presequence of S. ⊑ defines a partial order: it is reflexive, antisymmetric and transitive. It generalizes to the subset notion when order and duplicates are immaterial. The empty sequence, [ ], is a presequence of every other sequence.
employees(E#, Dept)
Traditional SQL-2 aggregates (blocking): select dept, count(E#) from employees group by dept
SQL:2003 OLAP functions (non-blocking): select dept, count(E#) over (partition by dept range unbounded preceding) from employees
Continuous count returns, for each new tuple, the count so far. Consider a sequence S of length n. At each step j ≤ n, j is returned, so the cumulative output up to step j is sum_j(S) = [1, 2, …, j], independent of whether j = n or j < n.
Traditional count: for each j < n, nothing is returned: sum_j(S) = [ ]. Only at the end: sum_n(S) = [n].
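The contrast between the two counts can be sketched with Python generators. This is only an illustration: the names `continuous_count` and `traditional_count` are hypothetical, and the per-dept partitioning is left out to keep the sketch short.

```python
from typing import Iterable, Iterator


def continuous_count(stream: Iterable) -> Iterator[int]:
    """Non-blocking: emit the count so far after every input tuple."""
    n = 0
    for _ in stream:
        n += 1
        yield n


def traditional_count(stream: Iterable) -> Iterator[int]:
    """Blocking: emit a single count only after the input has ended."""
    n = 0
    for _ in stream:
        n += 1
    yield n


tuples = ["e1", "e2", "e3"]
print(list(continuous_count(tuples)))   # [1, 2, 3]
print(list(traditional_count(tuples)))  # [3]
```

On an unbounded input, `continuous_count` keeps producing results, while `traditional_count` never reaches its `yield`.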
Examples • Traditional SQL-2 aggregates are blocking; SQL:2003 OLAP functions are not. • Selection is nonblocking. • Continuous count (i.e., the unbounded preceding count of OLAP functions) is non-blocking. • Window aggregates are also non-blocking. • In-between cases: e.g., traditional aggregates on input that is already sorted on the group-by values.
Characterization of NonBlocking (nb) • Many functions expressible by nb-computations can also be expressed by blocking ones. E.g., joins can be implemented using sorting. Ditto for projections with duplicate elimination. • But many functions implemented using blocking computation cannot be given an nb-implementation. • We must distinguish between the two kinds of functions, since one can be supported in our DSMS (via suitable nb-implementation) and the other cannot. Theorem: Queries can be expressed via nb computations iff they are monotonic w.r.t. the presequence ordering.
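The monotonicity condition in the theorem can be probed on a sample input with a small Python sketch. The helper names (`is_presequence`, `is_monotonic_on`, `select_even`, `final_count`) are hypothetical, and passing the check on one input is a necessary condition only, not a proof of monotonicity.

```python
from typing import Callable, List, Sequence


def is_presequence(left: Sequence, right: Sequence) -> bool:
    """True when `left` equals the first len(left) elements of `right`."""
    return len(left) <= len(right) and list(right[:len(left)]) == list(left)


def is_monotonic_on(query: Callable[[List], List], s: List) -> bool:
    """Check that query(S_j) is a presequence of query(S_k) for all j <= k.

    Comparing consecutive presequences is enough because the
    presequence relation is transitive.
    """
    outputs = [query(s[:k]) for k in range(len(s) + 1)]
    return all(is_presequence(a, b) for a, b in zip(outputs, outputs[1:]))


def select_even(s: List[int]) -> List[int]:
    """Selection: keeps the even values in arrival order."""
    return [t for t in s if t % 2 == 0]


def final_count(s: List[int]) -> List[int]:
    """Traditional aggregate: the answer on S is the single tuple [len(S)]."""
    return [len(s)]


print(is_monotonic_on(select_even, [1, 2, 3, 4]))  # True
print(is_monotonic_on(final_count, [1, 2, 3]))     # False
```

Selection only ever appends to its earlier output, so it can be computed without blocking; the traditional count has to retract [1] to report [2], which a stream operator cannot do.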
NB-completeness • A query language L can express a given set of functions F on its input (DB, sequences, data streams); the larger F, the greater the expressive power of L. • Non-monotonic functions are intrinsically blocking and they cannot be used on data streams. Thus, if we use L in a DSMS, we give up the non-monotonic subset of F with no regret. However, let us make sure that we do not give up anything more! • More? Yes, because for continuous queries on streams, we will normally disallow L's blocking (i.e., nonmonotonic) operators and constructs, and only allow nb (i.e., monotonic) operators. • But are ALL the monotonic functions expressible by L using the nb-operators of L? Or by disallowing blocking operators did we also lose the ability to express some monotonic queries? Definition: L is said to be nb-complete when it can express all the monotonic queries expressible by L using only its nb-operators.
Expressive Power and NB-Completeness • NB-completeness is a test that a language is as suitable for continuous queries on data streams as it is for queries on a stored database. • In a language L lacking nb-completeness, there are monotonic functions that L cannot express as continuous queries but that L could express if the stream had been stored in a database. • For instance, Relational Algebra and SQL are not nb-complete (in addition to the shortcomings they might have on DBs).
Sets versus Sequences • Sets are sequences where duplicates are allowed and order is immaterial. • Thus S1 is a subset of S2 iff S1 can be reordered into a presequence of S2. • Theorem [lifted from sequences to sets]: a function is nb iff it is monotonic. • NB = monotonic: selection, projection, and OLAP functions. • Blocking = non-monotonic: e.g., traditional aggregates. • Operators of more than one argument: • Joins are monotonic (i.e., NB) in both arguments. • R − S is monotonic on R and antimonotonic on S: i.e., it will block on S but not on R (after it has seen the whole of S, though).
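The behavior of R − S can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name, assuming S is finite and fully available before R starts to arrive.

```python
from typing import Hashable, Iterable, Iterator


def stream_difference(r_stream: Iterable[Hashable],
                      s_complete: Iterable[Hashable]) -> Iterator[Hashable]:
    """Emit tuples of R that are not in S, one at a time as R arrives."""
    s_seen = set(s_complete)   # blocks here: S must be read to its end
    for t in r_stream:         # non-blocking on R from here on
        if t not in s_seen:
            yield t


print(list(stream_difference([1, 2, 3, 4], [2, 4])))  # [1, 3]
```

Adding tuples to R can only add output (monotonic on R), whereas adding a tuple to S could invalidate output already emitted (antimonotonic on S), which is why S has to be complete first.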
Relational Algebra (RA) • Set difference can produce monotonic queries, e.g., intersection: R1 ∩ R2 = R1 − (R1 − R2) • Are these still expressible without set difference? • Intersection can be expressed as a join: product + select. • But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA. • Example: temporal domain isomorphic to the nonnegative integers. Intervals are closed on the left and open on the right:
p(0, 3). % 0, 1, and 2 are in p but 3 is not
p(2, 4). % 3 is not a hole because it is covered by this interval
p(4, 5). % 5 is a hole because it is not covered by any other interval
p(6, 8).
Coalesce p (cp) & p Until q
p(0, 3). p(2, 4). p(4, 5). p(6, 8).
cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0, 5).
cp contains intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of an interval in between is a hole.
cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
q(5, _) holds if cp has an interval that starts at 0 and contains 5.
pUntilq(yes) ← q(0, J).
pUntilq(yes) ← cp(0, I), q(J, _), I ≥ J.
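The cp, hole, and cep rules can be evaluated directly over a finite set of intervals, as in the Python sketch below. The function names are hypothetical. One assumption is made loudly: the sketch tests J1 ≤ J2 rather than the J1 < J2 written in the cp rule, because only the non-strict test reproduces the original p intervals (cp(0, 3), cp(2, 4), ...) that appear in the listed cp facts.

```python
from typing import Set, Tuple

Interval = Tuple[int, int]  # [start, end): closed on the left, open on the right


def coalesce(p: Set[Interval]) -> Set[Interval]:
    """Evaluate the cp rule over a finite set of p intervals."""
    ends = {j for _, j in p}
    # cep(K): endpoint K is covered by some interval [I, J) of p.
    cep = {k for k in ends if any(i <= k < j for i, j in p)}
    # hole(I1, J2): an uncovered endpoint K lies between J1 and I2.
    hole = {(i1, j2)
            for i1, j1 in p for i2, j2 in p for k in ends
            if j1 <= k < i2 and k not in cep}
    # cp(I1, J2): ASSUMPTION, J1 <= J2 instead of the strict J1 < J2.
    return {(i1, j2)
            for i1, j1 in p for i2, j2 in p
            if j1 <= j2 and (i1, j2) not in hole}


def p_until_q(p: Set[Interval], q: Set[Interval]) -> bool:
    """Evaluate the two pUntilq rules."""
    if any(start == 0 for start, _ in q):
        return True
    return any(i1 == 0 and end >= q_start
               for i1, end in coalesce(p) for q_start, _ in q)


p = {(0, 3), (2, 4), (4, 5), (6, 8)}
print(sorted(coalesce(p)))
# [(0, 3), (0, 4), (0, 5), (2, 4), (2, 5), (4, 5), (6, 8)]
```

The sketch uses negation (`not in`) over a stored, complete p, which is exactly what is unavailable once blocking operators are excluded on a stream.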
Relational Algebra • NonMonotonic (i.e., blocking) RA operators: set difference and division • We are left with: select, project, join, and union. Can these express all FO monotonic queries? • Some interesting temporal queries: coalesce and until • They are expressible in RA (by double negation) • They are monotonic • They cannot be expressed in nb-RA. Theorem: RA and SQL are not nb-complete. SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.
E-Bay Example • Auctions: a stream of bids on an item. bidStream(Item#, BidValue, Time) • Items for which sum of bids is > 100K SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000; • This is a monotonic query. Thus it can be expressed in a language containing suitable query operators, but not in SQL-2. SQL-2 is not nb-complete; thus it is ill-suited for continuous queries on data streams. • So SQL-2 is not nb-complete because of its blocking aggregates. What about relational algebra?
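A non-blocking evaluation of this query can be sketched in Python. The function name `items_over_threshold` is hypothetical, and the sketch assumes bid values are never negative, so that a running sum that has passed the threshold cannot fall back below it (this is what makes the query monotonic).

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Bid = Tuple[str, float, int]  # (Item#, BidValue, Time)


def items_over_threshold(bids: Iterable[Bid],
                         threshold: float = 100_000) -> Iterator[str]:
    """Emit each item once, as soon as its running sum of bids exceeds threshold."""
    totals: Dict[str, float] = {}
    reported: Set[str] = set()
    for item, value, _time in bids:
        totals[item] = totals.get(item, 0) + value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item


bid_stream = [("a", 60_000, 1), ("b", 10, 2), ("a", 50_000, 3), ("a", 1, 4)]
print(list(items_over_threshold(bid_stream)))  # ['a']
```

The result for item "a" is produced at the third bid, without waiting for the stream to end; this is the kind of operator that SQL-2's blocking SUM cannot express.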
Incompleteness of Relational QL • The coalesce and until queries • can be expressed in safe nonrecursive Datalog, thus • They are expressible in RA, • They are monotonic • They cannot be expressed in nb-RA Theorem: RA and SQL are not nb-complete. • A new limitation for DB query languages (which were already severely challenged in terms of expressive power)
Embedding SQL Queries in a PL • In DB applications, SQL can be embedded in a PL (Java, C++…) where the PL accesses the tuples returned by SQL using a Get Next of Cursor statement. • Operations that could not be expressed in SQL can then be expressed in the PL: • an effective remedy for the lack of expressive power of SQL • But cursors are a ‘pull-based’ mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them. • The DSMS can only deliver its output to the PL as a stream. This is OK to drive a GUI. But if most of the work has not been done yet, what use is the DSMS? • Contrast this with DBMSs, which are useful even with a weak QL.
Reviewing the Situation • SQL’s lack of expressive power is a major problem for database-centric applications. • These problems are significantly more serious for data streams since: • Only monotonic queries can be used, • Actually, not even all the monotonic ones, since SQL is not nb-complete, • These problems cannot really be solved by using PLs with embedded SQL statements on streams. • DSMSs will be impaired unless significant improvements can be made.
UDAs to the Rescue • Full support for UDAs with all window combinations: effective on UDAs written in SQL, PLs, and even built-ins • Support for continuous queries and ad hoc queries, under a simple and unified semantics • Turing completeness: all possible queries • nb-completeness: all monotonic queries using only non-blocking operators (e.g., window UDAs and those without TERMINATE) • Effective on a broad range of data-intensive applications: data/stream mining, approximate queries, sequential patterns (XML not there) • Making a strong case for the DB-oriented approach to data streams.
Conclusion • Language Technology: • ESL a very powerful language for data stream and DB applications • Simple semantics and unified syntax conforming to SQL:2003 standards • Strong case for the DB-oriented approach to data streams • System Technology: • Some performance-oriented techniques well-developed—e.g., buffer management for windows • For others: work is still in progress—stay tuned for latest news • Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
CQLs for DSMS • Most DSMS projects use SQL for continuous queries, for good reasons, since • Many applications span data streams and DB tables • A CQL based on SQL will be easier to learn and use • Moreover: the fewer the differences the better! • But DBMSs were designed for persistent data and transient queries, not for persistent queries on transient data • Adaptation of SQL and its enabling technology presents difficult research challenges • These combine with traditional SQL problems, such as the inability to deal with sequences, DM tasks, and other complex query tasks, i.e., lack of expressive power
Language Problems • Most DSMS projects use SQL, so queries spanning both data streams and DBs will be easier. But … • Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include: • Data Mining, and we need to mine data streams, • Sequence queries, and data streams are infinite time series! • Major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data.) • Only nonblocking operators in DSMS: blocking is forbidden • The distinction is not clear in DBMSs, which often use blocking implementations for nonblocking operators • The distinction needs to be formally characterized • and so does the resulting loss of query power of the QL.