230 likes | 241 Views
This article discusses the use of non-blocking operators for continuous queries in data stream management systems, contrasting them with blocking operators. It also explores the characterization of non-blocking queries and the expressive power of non-blocking languages.
E N D
Continuous Query Languages for DSMS CS240B Notes by Carlo Zaniolo
Blocking Operators • A blocking query operator is ‘one that is unable to produce the first tuple of the output until it has seen the entire input’ [Babcock et al. PODS02] • But continuous queries cannot wait for the end of the stream: must return results while the data is streaming in. Blocking operators cannot be used. • Only non-blocking (nb) queries and operators can be used on data streams (i.e. those that return their results before they have detected the end of the input). • Current DBMSs make heavy usage of blocking computations: • For operators that are intrinsically blocking: e.g., SQL aggregates, • And for those that are not: e.g., sort-based implementation of joins and group by • We only need to be concerned with 1: find a characterization for blocking & nonblocking independent of implementation.
Partial Ordering Let S = [ t1, ¼, tn] be a sequence and 0 £k £n. Then [t1, ¼, tk ] is said to be the presequence of S, of length k, denoted by Sk. We write L S to denote that L is a presequence of S, Defines a Partial Order: reflexive, antisymmetric and transitive. generalizes to the subset notion when order and duplicates are immaterial The empty sequence,[ ], is a subsequence of every other sequence.
employees(E#, Dept) select dept, count(E#) from employees group by dept Traditional SQL-2 aggregates: Blocking select dept, count(E#) over (partition bydeptrange unbounded preceding) from employees SQL:2003 OLAP functions: Non-Blocking Continuous countreturns, for each new tuple, the count so far. Consider a sequence of length n. At each step j<n, j is returned cumulative return up to j: sumj(S)= [1,2, …, j]independent on whether j=n or j<n. Traditional count: For each j<n --nothing: sumj(S)=[] Final: sumn(S)=[n]
Examples • Traditional SQL-2 aggregates are blocking—SQL:2003 OLAP functions are not. • Selection is nonblocking. • Continuous count (i.e., the unlimited preceding count of OLAP functions) is non-blocking • Also window aggregates are non-blocking • In between cases: e.g., traditional aggregates on input that is already sorted on group-by values.
Characterization of NonBlocking (nb) • Many functions expressible by nb-computations can also be expressed by blocking ones. E.g., joins can be implemented using sorting. Ditto for projections with duplicate elimination. • But many functions implemented using blocking computation cannot be given an nb-implementation. • We must distinguish between the two kinds of functions, since one can be supported in our DSMS (via suitable nb-implementation) and the other cannot. Theorem: Queries can be expressed via nb computations iff they are monotonic w.r.t. the presequence ordering.
NB-completeness • A query language Lcan express a given set of functions F on its input (DB, sequences, data streams)---the larger F, the greater the expressive power of L. • Non-monotonic functions are intrinsically blocking and they cannot be used on data streams. Thus, if we use L in a DSMS, we give up the non-monotonic subset of F with no regret. However, let us make sure that we do not give up anything more! • More? Yes, because for continuous queries of streams, we will normally disallow L’sblocking (i.e. nonmonotonic) operators & constructs, and only allow nb (i.e., monotonic ) operators. • But are ALL the monotonic functions expressible by L using the nb-operators of L ? Or by disallowing blocking operators did we also lose the ability of expressing some monotonic queries? Definition:L is said to be nb-complete when it can express all the monotonic queries expressible by L using only its nb-operators.
Expressive Power and NB-Completeness • NB-completeness is a test that a language is as suitable for continuous queries on data streams as it is on stored database. • In a language L lacking nb-completeness, there are monotonic functions that L cannot express as continuous queries, that L can express if the stream had been stored in a database. • For instance, Relational Algebra and SQL are not nb-complete (in addition to the shortcomings they might have on DBs).
Sets versus Sequences • Sets are sequences where duplicates are allowed and order is immaterial. • Thus S1 is a subset of S2 iff S1 can be reordered in a presequence of S2. • Theorem [Lifted from sequences to sets]. A function is is nb iff it is monotonic. • NB=monotonic: selection, projection, and OLAP functions • Blocking=Non-Monotonic: e.g. Traditional aggregates. • Operators of more than one argument: • Join are monotonic (i.e., NB) in both arguments. • R-S is monotonic on R and antimonotonic on S: i.e., will block on S but not on R (after it has seen the whole S, though)
Relational Algebra (RA) • Set difference can produce monotonic queries: Intersection: R1Ç R2= R1- (R1- R2) • Are these still expressible without set diff? • Intersection can be expressed as a joins: product+select • But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA. • Example: Temporal domain isomorfic to nonnegative integers.Intervals closed to the left but open to the right: p(0, 3). % 0,1, and 2 are in p but 3 is not p(2, 4). % 3 is not a hole because is covered by this p(4, 5). % 5 is a hole because not covered by any other interval p(6, 8).
Coalesce p (cp) & p Until q p(0, 3). p(2, 4). p(4, 5). p(6, 8). cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0,5). cp contains intervals from the start point of any p interval to the endpoint of any p interval unless the endpoint of an interval in between is a hole. cp(I1, J2) ¬ p(I1, J1), p(I2, J2), J1 < J2, Øhole(I1, J2). hole(I1, J2) ¬ p(I1, J1), p(I2, J2), p(_,K), J1 £ K, K < I2, Øcep(K). cep(K) ¬ p(_, K), p(I, J), I £ K, K < J. q(5,_) holdsif cp has an interval that starts at 0 & contains 5pUntil q(yes) ¬ q(0, J). pUntil q(yes) ¬ cp(0, I), q(J, _), I ³ J .
Relational Algebra • NonMonotonic (i.e., blocking) RA operators: set difference and division • We are left with: select, project, join, and union. Can these express all FO monotonic queries? • Some interesting temporal queries: coalesce and until • They are expressible in RA (by double negation) • They are monotonic • They cannot be expressed in nb-RA. Theorem: RA and SQL are not nb-complete. SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.
E-Bay Example • Auctions: a stream of bids on an item. bidStream(Item#, BidValue, Time) • Items for which sum of bids is > 100K SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000; • This is a monotonic query. Thus it can be expressed in a language containing suitable query operators, but not in SQL-2. SQL-2 is not nb-complete; thus it is ill-suited for continuous queries on data streams. • So SQL-2 is not nb-complete because of its blocking aggregates. What about relational algebra?
Incompleteness of Relational QL • The coalesce and until queries • can be expressed in safe nonrecursive Datalog, thus • They are expressible in RA, • They are monotonic • They cannot be expressed in nb-RA Theorem: RA and SQL are not nb-complete. • A new limitation for DB query languages (which were already severely challenged in terms of expressive power)
Embedding SQL Queries in a PL • In DB applications, SQL can be embedded in a PL (Java, C++…) where the PL accesses the tuples returned by SQL using a Get Next of Cursor statement. • Operations that could not be expressed in SQL can then be expressed in the PL: • an effective remedy for the lack of expressive power of SQL • But cursors is a ‘pull-based’ mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL request them. • The DSMS can only deliver its output to the PL as a stream—This is OK to drive a GUI. But if most of the work has not been done yet, who is the DSMS? • Contrast this to DBMS who are useful even with a weak QL.
Reviewing the Situation • SQL’s lack of expressive power is a major problem for database-centric applications. • These problems are significantly more serious for data streams since: • Only monotonic queries can be used, • Actually, not even all the monotonic ones since SQL is not nb-complete, • These problems cannot be really by using PLs with embedded SQL statements on streams • DSMS will be impaired--unless significant improvements can be made.
UDAs to the Rescue • Full support for UDAs with all window combinations—effective on UDAs written in SQL, PLs, and even built-ins • Support for continuous queries and ad hoc queries, under a simple and unified semantics • Turing completeness --all possible queries • nb-completeness all monotonic queries using only non-blocking operators (e.g., window UDAs & those without TERMINATE) • Effective on a broad range of data-intensive applications: data/stream mining, approximate queries, sequential patters (XML not there) • Making a strong case for the DB-oriented approach to data streams.
Conclusion • Language Technology: • ESL a very powerful language for data stream and DB applications • Simple semantics and unified syntax conforming to SQL:2003 standards • Strong case for the DB-oriented approach to data streams • System Technology: • Some performance-oriented techniques well-developed—e.g., buffer management for windows • For others: work is still in progress—stay tuned for latest news • Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
References [1]ATLaS user manual. http://wis.cs.ucla.edu/atlas. [2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of Lecture Notes in Computer Science. Springer, 1999. [4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries over streams and relations. Technical report, Stanford University, 2002. [5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002. [9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002. [10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995. [11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002. [12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, pages 379-390, May 2000. [13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003. [14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5-14, 2003. [15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997. [16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006 [17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in Data Stream Management Systems. ICDE 2007.
References (Cont.) [18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database Sequences and Data Streams. VLDB 2004: 492-503 [19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In SIGMOD, pages 49-61, 2002. [20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003. [21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted relational query language, 1998. [23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001. [24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001. [25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998. [26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232-239, Taipei, Taiwan, March 1995. [27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM SIGMOD 1994, pages 430-441. ACM Press, 1994. [28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996. [29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In SIGMOD, pages 321-330, 6 1992. [30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003. [31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.
DSMS Research Projects • Aurora (Brandeis/Brown/MIT) http://www.cs.brown.edu/research/aurora/ • Cougar (Cornell) http://www.cs.cornell.edu/database/cougar/ • Telegraph (Berkeley)- http://telegraph.cs.berkeley.edu • STREAM (Stanford) –http://www-db.stanford.edu/stream • Niagara (OGI/Wisconsin)-http://www.cs.wisc.edu/niagara/ • OpenCQ (Georgia Tech) – http://disl.cc.gatech.edu/CQ • Tapestry (Xerox) – electronic documents stream filtering • Hancock (AT&T) http://www.research.att.com/~kfisher/hancock/ • Cape (WPI) http://davis.wpi.edu/dsrg/CAPE/home.html • Tribeca (Bellcore) – network monitoring • Stream Mill (UCLA) – http://wis.cs.ucla.edu/stream-mill • Gigascope …
CQLs for DSMS • Most of DSMS projects use SQL for continuous queries—for good reasons, since • Many applications span data streams and DB tables • A CQL based on SQL will be easier to learn & use • Moreover: the fewer the differences the better! • But DSMS were designed for persistent data and transient queries---not for persistent queries on transient data • Adaptation of SQL and its enabling technology presents difficult research challenges • These combine with traditional SQL problem, such as inability to deal with sequences, DM tasks, and other complex query tasks---i.e., lack of expressive power
Language Problems • Most DSMS projects use SQL — queries spanning both data streams and DBs will be easier. But … • Even for persistent data, SQL is far from perfect.Important application areas poorly supported include: • Data Mining, and we need to mine data streams, • Sequence queries, and data streams are infinite time series! • Major new problems for SQL on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data) • Only NonBlocking operators in DSMS: blocking forbidden • Distinction not clear in DBMS which often use blocking implementations for nonblocking operators • The distinction needs to formally characterized • and so is the loss of query power of the QL.