Continuous Queries over Data Streams

Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet
The Hebrew University of Jerusalem, Fall 2002

Contents of the lecture
• Introduction
• Proposed Architecture of Data Stream Management System
• Research problems
• Query Optimization
• Bibliography

Data Streams vs. Data Sets
Data Sets:
• Updates infrequent
• Old data required many times
• Example: employees' personal data table
Data Streams:
• Data changes constantly (sometimes additions only)
• Mostly only the freshest data is used
• Examples: financial tickers, data feeds from sensors, network monitoring, etc.

Using Traditional Database
[Diagram labels: User/Application; Query / Result (repeated); Loader]

Data Streams Paradigm
[Diagram labels: User/Application; Register Query; Result; Stream Query Processor; Data Stream Management System (DSMS); Scratch Space (Memory and/or Disk)]

What Is a Continuous Query?
• A query that is issued once and then logically runs continuously.
• Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to hardware failure.
• More examples: continuous queries are used to support load balancing and online automatic trading at the stock exchange.

Special Challenges
• Timely online answers, even for rapid data streams
• Fast access to large portions of data
• Processing of multiple streams simultaneously

Making Things Concrete
[Diagram labels: BOB, ALICE, two Central Offices, DSMS]
• Outgoing (call_ID, caller, time, event)
• Incoming (call_ID, callee, time, event)
• event = start or end

Making Things Concrete
• Database = two streams of mobile call records
• Outgoing(connectionID, caller, start, end)
• Incoming(connectionID, callee, start, end)
• Query language = SQL; FROM clauses can refer to streams and/or relations

Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
  AND O1.call_ID = O2.call_ID
  AND O1.event = 'start'
  AND O2.event = 'end')
• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
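
The last bullet can be illustrated with a minimal Python sketch. It assumes that events arrive in non-decreasing time order, that time is measured in minutes, and that tuples have the shape (call_ID, caller, time, event); the function name long_calls is hypothetical. A call is reported as soon as any later event shows that it has been open for more than the threshold, so the matching end event is not required.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple shape: (call_ID, caller, time in minutes, event)
Event = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Event],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) once a call is known to exceed `threshold`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Any call that started more than `threshold` before the current
        # timestamp is already long enough; report it and forget it.
        overdue = [cid for cid, (_, start) in open_calls.items()
                   if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short call (or one already reported): nothing more to keep.
            open_calls.pop(call_id, None)


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 0.5, "start"),
    (2, "bob", 1.0, "end"),
    (3, "carol", 2.5, "start"),
    (1, "alice", 5.0, "end"),
]
result = list(long_calls(events))
```

Because the result is emitted as a stream, only the calls that are currently open and not yet reported are held in memory. The linear scan over open calls keeps the sketch short; a real system would index them by start time.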

Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
• Can still provide result as data stream
• Requires unbounded temporary storage …
• … unless streams are near-synchronized
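
One way to picture the temporary storage is a symmetric hash join, sketched below in Python. This is an illustration under simplifying assumptions, not the method of the lecture: each call contributes exactly one record to each stream, the two streams are presented as a single merged sequence in arrival order, and the names pair_up, waiting_callers and waiting_callees are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed record shape: (stream name, call_ID, caller or callee)
Record = Tuple[str, int, str]


def pair_up(merged: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Yield (caller, callee) as soon as both sides of a call have arrived."""
    waiting_callers: Dict[int, str] = {}  # Outgoing records without a match yet
    waiting_callees: Dict[int, str] = {}  # Incoming records without a match yet
    for stream, call_id, party in merged:
        if stream == "Outgoing":
            if call_id in waiting_callees:
                yield party, waiting_callees.pop(call_id)
            else:
                waiting_callers[call_id] = party
        else:  # "Incoming"
            if call_id in waiting_callers:
                yield waiting_callers.pop(call_id), party
            else:
                waiting_callees[call_id] = party


merged = [
    ("Outgoing", 1, "alice"),
    ("Incoming", 2, "dave"),
    ("Incoming", 1, "bob"),
    ("Outgoing", 2, "carol"),
]
pairs = list(pair_up(merged))
```

The two dictionaries are the temporary storage: if matching records arrive close together (near-synchronized streams) they stay small, while an unmatched record would have to be kept indefinitely.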

Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, SUM(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
  AND O1.event = 'start'
  AND O2.event = 'end')
GROUP BY O1.caller
• Cannot provide result in an (append-only) stream. Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
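
The first alternative, an output stream with updates, might look like the following Python sketch. It assumes tuples of the shape (call_ID, caller, time, event) and emits a new (caller, total) pair each time a call ends; a later pair for the same caller supersedes the earlier one. The function name connection_time_updates is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple shape: (call_ID, caller, time, event)
Event = Tuple[int, str, float, str]


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Yield (caller, running total) whenever one of the caller's calls ends."""
    starts: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    totals: Dict[str, float] = {}              # caller -> total connection time
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = (caller, time)
        elif event == "end" and call_id in starts:
            who, began = starts.pop(call_id)
            totals[who] = totals.get(who, 0.0) + (time - began)
            yield who, totals[who]  # an update, not an append-only answer tuple


events = [
    (1, "alice", 0.0, "start"),
    (2, "bob", 1.0, "start"),
    (1, "alice", 3.0, "end"),
    (3, "alice", 4.0, "start"),
    (2, "bob", 5.0, "end"),
    (3, "alice", 10.0, "end"),
]
updates = list(connection_time_updates(events))
```

The totals dictionary is also what the other two alternatives would rely on: it can be read on demand or kept in memory as the current answer.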

Conclusions
• Conventional DBMS technology is inadequate
• We need to reconsider all aspects of data management and processing in the presence of data streams

DBMS versus DSMS
• DBMS: persistent relations. DSMS: transient streams (and persistent relations).
• DBMS: one-time queries. DSMS: continuous queries.
• DBMS: random access. DSMS: sequential access.
• DBMS: access plan determined by query processor and physical DB design. DSMS: unpredictable data arrival and characteristics.
• DBMS: “unbounded” disk store. DSMS: bounded main memory.

Related work
• Tapestry system: content-based filtering of email messages; restricted subset of SQL; append-only query results
• Chronicle data model: append-only ordered sequences of tuples (chronicles); restricted view-definition language; does not store any chronicles
• Alert system: event-condition-action triggers in a conventional SQL DB; continuous queries over append-only "active tables"

Related work: Materialized Views
• Materialized views are queries which need to be reevaluated whenever the database changes.
• Materialized views vs. continuous queries. Continuous queries:
• May stream rather than store the result
• May deal with append-only relations
• May provide approximate answers
• Processing strategy may adapt to the characteristics of the data stream

Architecture for continuous queries
• Single stream of tuples D, single continuous query Q, and answer to the query A
• Q is issued once and operates continuously
[Diagram labels: Data Stream of tuples <A,B>; Continuous Query Q; Answer A]

Architecture for continuous queries
We consider data streams that adhere to the relational model (i.e. streams of tuples), although many of the ideas and techniques are independent of the data model being considered.

Architecture for continuous queries
Scenario 1 (simplest): data stream D is append-only, with no updates or deletions. How to handle Q?
1) Always store the current answer A to Q. D is of unbounded size, so A may be too.
2) Do not store A, but make new tuples in A available as another continuous stream. No unbounded storage is needed for A, but unbounded storage may be needed to determine new tuples in A.

Architecture for continuous queries
Scenario 2
• Input stream is append-only, but may cause updates and deletions in answer A.
• => May need to update/delete tuples in the output data stream.
Scenario 3 (most general)
• Input stream D includes updates and deletions.
• => Much of the stream's data must be stored to determine the answer.

Architecture for continuous queries
How to solve?
1) Restrict the expressiveness of Q.
2) Impose constraints on the data stream to guarantee that both the answer to Q and the amount of data needed to compute Q are bounded.
3) Provide an approximate answer.

Architecture for processing continuous queries
[Diagram labels: input Stream 1, Stream 2, …, Stream N; Stream Query Processor; Stream; Store; Scratch; Throw]

Architecture for continuous queries
• STREAM is a data stream containing the tuples appended to A. It is an append-only stream (it should not include updates or deletions).
• STREAM and STORE together define the current answer A.

Architecture for continuous queries
When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:
1) t may cause new tuples in A. If an answer tuple a will remain in A forever, send a to STREAM.
2) If a should be in A but may be removed at some later moment, add a to STORE.
3) t may cause an update or deletion of answer tuples in STORE. Answer tuples may be moved from STORE to STREAM.
4) t, or data derived from it, may need to be saved so that the query result can be computed in the future: send t to SCRATCH.
5) If t is not needed and will not be needed, send it to THROW (unless we want to archive it).
6) As a result of t, data may be moved from STORE or SCRATCH to THROW.

Architecture for continuous queries
Scenario 1: data stream D is append-only, with no updates or deletions. Always store the current answer A to Q.
• STREAM is empty
• STORE always contains A
• SCRATCH contains whatever is needed to keep the answer in STORE up to date

Architecture for continuous queries
Scenario 2: answer A is provided exclusively as a data stream.
• STREAM carries the answer A
• STORE is empty
• SCRATCH contains whatever is needed to determine new tuples in A

Architecture for continuous queries
Scenario 3: input stream is append-only; answer A may have updates and deletions.
• Example: Q is a group-by with a Min aggregation function
• Answer A is maintained in STORE
• SCRATCH is empty

Architecture for continuous queries
Scenario 4: input streams may include updates and deletions.
• Unbounded storage is required for SCRATCH to ensure that Min can always be computed
• In both Scenario 3 and Scenario 4: data is moved to STREAM only once it is known that no further updates or deletions of tuples of this group will occur
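
The difference between the last two scenarios can be sketched in Python for the group-by Min example. The class names and the (group, value) tuple shape are assumptions made for illustration. With append-only input the current minimum per group in STORE is enough; once deletions are allowed, every live value must be remembered in SCRATCH, because deleting the current minimum requires finding the next smallest value.

```python
from collections import Counter
from typing import Dict, Hashable


class MinAppendOnly:
    """Append-only input: STORE holds the answer, SCRATCH stays empty."""

    def __init__(self) -> None:
        self.store: Dict[Hashable, float] = {}  # group -> current minimum

    def insert(self, group: Hashable, value: float) -> None:
        if group not in self.store or value < self.store[group]:
            self.store[group] = value


class MinWithDeletions:
    """Input with deletions: SCRATCH keeps a multiset of live values per group."""

    def __init__(self) -> None:
        self.store: Dict[Hashable, float] = {}
        self.scratch: Dict[Hashable, Counter] = {}

    def insert(self, group: Hashable, value: float) -> None:
        self.scratch.setdefault(group, Counter())[value] += 1
        if group not in self.store or value < self.store[group]:
            self.store[group] = value

    def delete(self, group: Hashable, value: float) -> None:
        counts = self.scratch.get(group)
        if counts is None or counts[value] == 0:
            return  # deletion of a tuple that was never seen; ignored here
        counts[value] -= 1
        if counts[value] == 0:
            del counts[value]
        if counts:
            self.store[group] = min(counts)  # recompute from remaining values
        else:
            del self.scratch[group]
            del self.store[group]


append_only = MinAppendOnly()
with_deletions = MinWithDeletions()
for v in (5, 3, 7):
    append_only.insert("x", v)
    with_deletions.insert("x", v)
with_deletions.delete("x", 3)
```

After the deletion, the minimum for group "x" in the second structure falls back to 5, which is only possible because 5 and 7 were kept in SCRATCH. SCRATCH grows with the number of live tuples, which is the unbounded storage the slide refers to.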

The Architecture and Related Work
• Implementing triggers in terms of the proposed architecture (for launching triggered actions, assume the actions are performed by SQL stored procedures)
• STREAM and STORE are empty
• SCRATCH is used for the data required to monitor complex events
• Benefits: complex multi-table events and conditions can be monitored
• Trigger processing benefits from efficient data management and processing techniques (see below)

The Architecture and Related Work
Implementing materialized views in terms of the proposed architecture
• The view itself is maintained in STORE
• Base data is kept in SCRATCH
• Data expiration: used to expedite cleanup of SCRATCH
• No way to guarantee a bound on the size of STORE and SCRATCH

End of Part I

Research Problems
• Designing Query Language
• Online processing of rapid streams
• Approximation techniques
• Storage constraints vs. performance requirements
• Summarization
• Query Planning / Optimization
• Building good Query Plan
• Scheduling
• Sub-Plans Sharing
• Resource Management
• Adaptation

Research Problems: Languages for Continuous Queries
• Bounding the size of SCRATCH and STORE
• Open problem: determining, for an arbitrary SQL query, whether these boundedness properties are satisfied

Query Language
The query language allows both streams and relations.
Assumptions:
• Streams: ordered, append-only, unbounded; multiple streams allowed
• Relations: unordered; support updates and deletions

SQL Extensions for Continuous Queries
• FROM may refer to both streams and relations
• Sliding window for the FROM clause (for streams):
• Optional "Partitioning" clause
• Mandatory "Window size"
• Optional "Filtering predicate"

Window specification
• Using ROWS: ROWS 50 PRECEDING
• Using RANGE: RANGE 15 minutes PRECEDING
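
The two window kinds can be sketched in Python as follows. The class names are hypothetical, and the boundary convention of the time-based window (keep tuples with a timestamp strictly greater than now minus the span) is an assumption; the slides do not say whether the boundary is inclusive.

```python
from collections import deque
from typing import Any, Deque, Tuple


class RowsWindow:
    """Roughly ROWS n PRECEDING: keeps the n most recent tuples."""

    def __init__(self, n: int) -> None:
        self.items: Deque[Any] = deque(maxlen=n)  # oldest dropped automatically

    def add(self, tup: Any) -> None:
        self.items.append(tup)


class RangeWindow:
    """Roughly RANGE span PRECEDING: keeps tuples newer than (now - span)."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.items: Deque[Tuple[float, Any]] = deque()  # (timestamp, tuple)

    def add(self, timestamp: float, tup: Any) -> None:
        # Assumes timestamps arrive in non-decreasing order (ordered stream).
        self.items.append((timestamp, tup))
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()


rows = RowsWindow(3)
for i in range(1, 6):
    rows.add(i)

rng = RangeWindow(15)  # e.g. 15 minutes, with timestamps in minutes
for ts, name in [(0, "a"), (10, "b"), (16, "c")]:
    rng.add(ts, name)
```

A row-based window has a fixed memory footprint, whereas the size of a time-based window depends on the arrival rate of the stream.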

Example 1
[Diagram labels: clients CL1, CL2, CL3, CL4, CL5, CL7 in domains .NF, .il, .com; Internet; Web Server hosting CS web and Math web; DSMS]
Stream S (Client_id, URL, domain, time)

Example 1 (CQL): “From” with “Range”
• Stream "Requests" of requests to a web server, with attributes (client_id, URL, domain, time)
• Query counting the number of requests for pages from domain "cs.huji.ac.il" in the last day:
SELECT COUNT(*)
FROM Requests S [RANGE 1 DAY PRECEDING]
WHERE S.domain = 'cs.huji.ac.il'

Partitioning Clause
• Partitions the data into several groups
• Computes a separate window for each group
• Merges the windows into a single result
• Is syntactically the same as a GROUP BY clause

Example 2: “Partition By”
• Considering only each client's 10 most recent requests from domain CS.HUJI.AC.IL, how many were for pages served from the CS website?
SELECT COUNT(*)
FROM Requests S [PARTITION BY S.client_id
                 ROWS 10 PRECEDING
                 WHERE S.domain = 'CS.HUJI.AC.IL']
WHERE S.URL LIKE 'http://cs.huji.ac.il/%'
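
A possible reading of this query is sketched below in Python. It assumes that the filtering predicate inside the window specification is applied before a tuple enters its client's window, that domain and URL comparisons are case-insensitive, and that request tuples have the shape (client_id, URL, domain, time). The function name count_cs_pages and its parameters are hypothetical.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Iterable, Iterator, Tuple

# Assumed tuple shape: (client_id, URL, domain, time)
Request = Tuple[int, str, str, float]


def count_cs_pages(requests: Iterable[Request],
                   rows: int = 10,
                   domain: str = "cs.huji.ac.il",
                   url_prefix: str = "http://cs.huji.ac.il/") -> Iterator[int]:
    """Yield the current COUNT(*) after each arriving request."""
    # PARTITION BY client_id: one row-based window per client.
    windows: DefaultDict[int, Deque[str]] = defaultdict(
        lambda: deque(maxlen=rows))
    for client_id, url, dom, _time in requests:
        # Filtering predicate of the window: only matching tuples enter it.
        if dom.lower() == domain:
            windows[client_id].append(url)
        # Outer WHERE ... LIKE 'prefix%' applied to the merged windows.
        yield sum(1 for w in windows.values()
                  for u in w if u.lower().startswith(url_prefix))


requests = [
    (1, "http://cs.huji.ac.il/a", "cs.huji.ac.il", 1),
    (1, "http://math.huji.ac.il/b", "cs.huji.ac.il", 2),
    (2, "http://cs.huji.ac.il/c", "example.com", 3),
    (1, "http://cs.huji.ac.il/d", "cs.huji.ac.il", 4),
    (1, "http://cs.huji.ac.il/e", "cs.huji.ac.il", 5),
]
counts = list(count_cs_pages(requests, rows=2))
```

The usage above shrinks the window to 2 rows per client so that eviction is visible in a short trace. Recounting all windows on every arrival keeps the sketch simple; an incremental counter would avoid the rescan.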

Example 3: Join with relation
• Classify domains by the primary type of web content they serve:
• .ac.il: EDUCATION
• .gov.il: GOVERNMENT
• .co.il: COMMERCE
• .com: COMMERCE
• Count the number of requests from "commerce" domains out of the last 10000 records
• A 10% sample of the requests stream is used
