Data Stream Management Systems--Supporting Stream Mining Applications

Data Stream Management Systems--Supporting Stream Mining Applications Carlo Zaniolo CS240B 2018

Three Main Topics • Data Stream Management Systems (DSMS): • efficient support for continuous queries • With Quality of Service (QoS) Guarantees • Data Stream Mining Algorithms • fast & light algorithms to support classical mining tasks (e.g. classification, association, clustering) with online response • Combining these into a Data Stream Mining workbench • E.g. an online version of Weka.

Data Stream Management Systems--Supporting Data Stream Mining Applications In the age of the Internet, massive amounts of information are continuously exchanged as data streams that are then processed by on-line applications of increasing complexity. For such advanced applications, a store-now and process-later approach cannot be used because of real time (or quasi real-time) requirements and excessive data rates. Therefore, current research seeks to develop a new generation of information management systems, called Data Stream Management Systems (DSMS), that can support complex applications on massive data streams with Quality of Service (QoS) guarantees. This work has produced novel techniques, research prototypes, startup companies, and the successful deployment of DSMS in many applications, including network traffic analysis, transaction log analysis, intrusion detection, credit-card fraud detection, click stream analysis, and algorithmic trading.

DSMS Research Issues • only monotonic queries and non-blocking operators can be used. • Also, the unbounded streams must be represented by synopses, such as windows containing the most recent tuples in the streams. Thus the semantics of basic operators such as joins and aggregates must be revised for windows. • At the implementation level, we have new query optimization techniques that seek to minimize response time and memory utilization. Load shedding techniques based on samples and sketches are used to achieve QoS under overload conditions. • The first part of the course, will cover these techniques and the architectures of the main DSMS systems.

Data Streams Research Issues (cont.) The first part of the course, will cover these techniques and the architectures of the main DSMS systems. The second part of the course will focus on the data stream mining problem that represents a vibrant area of new research. Past work concentrated on devising data mining algorithms that (i) are fast and light enough for on line applications, and (ii) can cope with the concept shifts and drifts that are often present in data streams

Topic 1: The Big Picture DBMS from 9000 meters • Data Stream Management Systems (DSMS) • Overview of Applications, Systems, and commercial ventures • Main Research Challenges: • Data models and timestamps • Query constructs and execution models • Optimization and Scheduling • Approximation and Load Shedding • The bigger Picture • XML • CEP

Data Streams • Unbounded, rapid, time-varying streams of data elements, continuous flowing on the internet and broad-band communication channels • Data Stream Management Systems (DSMS) are designed to process them continuously, since a store now and process later approach will not work due to: • Response requirements: real time or quasi real-time • Streams are too massive, and also bursty. Online filtering of interesting data for in-depth analysis later. • Many applications similar to those of DBMS. Most DSMS use some form of SQL • Computing environments quite different. E.g., persistent queries on transient data, vs. transient queries on persistent data

Systems and Technology • Several Research prototypes (next slide) • High-tech ventures startups+ DB vnedors • Apama (Software AG) • Truviso: network analytics (CISCO) • SQLstream (Amazon) • AURORA,StreamBase Systems, Inc. • PIPES, webMethods Business Events • StreamInsight (Microsoft) • InfoSphere Streams (IBM) • SAS Event Stream Processing Engine • Pipeline DB • Complex Event Processing (CEP) systems: • More general functionality than continuous queries • Building on middleware/Java rather than DBs and SQL

Many Research Projects … • Amazon/Cougar (Cornell) – sensors • Aurora (Brown/MIT) – sensor monitoring, dataflow • Hancock (AT&T) – Telecom streams • Niagara (OGI/Wisconsin) – Internet DBs & XML • OpenCQ (Georgia) – triggers, view maintenance • Stream (Stanford) – general-purpose DSMS • Tapestry (Xerox) – pubish/subscribe filtering • Telegraph (Berkeley) – adaptive engine for sensors • Tribeca (Bellcore) – network monitoring • Stream Mill (UCLA) - power & extensibility • Gigascope: AT&T Labs – Network Monitoring ... All SQL-based, except Hancock

Client-Server Architecture Register Query Input streams Clients Streamed Result DSMS Server Scratch Store Stored Relations

Databases vs. DSMS DBMS Model: persistent data Table: set|bag of tuples Updates: All Query: transient Query Answer: exact Query Eval. multi-pass Operator: blocking OK Query Plan: fixed DSMS Model: transient data Infinite sequence of tuples Updates: append only Query: persistent Query Answer: Often approx Query Eval. one-pass Operators: unblocking only Query Plan: adaptive

Research Challenges: Data Models • Relational Data Streams • Each data stream consists of relational tuples • The stream can be modelled as an append-only relation • But repetitions are allowed and order is very important! • Order based on timestamps—or arrival order

Timestamps • Data streams are (basically) ordered according to their timestamps • The meaning of RA operators such as unions an joins is based on timestamps • External timestamps • Injected by data source • Model real-world event represented by tuple • Tuples may be out-of-order--but if near-ordered the DSMS can reorder them using small buffers • Internal timestamps • Introduced as special field by the DSMS • According to the approximate time of arrival • Missing (will call them latent in Stream Mill) • The system assigns no timestamp to arriving tuples, • But tuples are still processed as ordered sequences • Operators whose semantics requires timestamps can generate them as/when needed

Data Stream Query Languages(SQL-like) Continuous queries and Blocking Operators

Cascading of Streams OpenAuction ExpensiveItems Source port4445 σπ σ Sink CREATE STREAM OpenAuction (itemID int, sellerID char(10), price real, starttime timestamp) ORDER BY start time; /*external timestamps*/ SOURCE ’port4445’; CREATE STREAM expensiveItems AS SELECT itemID, sellerID, price, start time FROM OpenAuction WHERE price > 1000 SELECT itemID, price, start time FROM expensiveItems WHERE sellerID= `ABCwarehouse`

A More General Query Graph Q1 Q2 ∑ U ⋈ Stream3 Stream1 Stream2

SQL: DSMS versus DBMS • Cascading of streams similar to composition of views: optimizer can combine them as virtual views or pipeline results through queues • Syntax/semantic uniformity (between DSMS and DBMS) highly desirable, since it • Minimizes confusion/learning curve • Queries for DSMS can be tested on DBMS: actually might be a practical solution for modest arrival rates • Vice-versa might not work because of: • Three Major Complications:Order is critical • Order is critical • Memory is limited • Blocking queries disallowed • New SQL standards overcame 1: e.g. OLAP Functions with Windows and Sequence Patterns, but 2 and 3 remain.

Surviving the 3 Complications • ORDER: • Union of two or more streams becomes a merge that • Preserves the order of the incoming streams • the order of their timestamps (if these are present) • Limited Memory: • Only a synopsis of each stream can be memorized---e.g., a window containing the last 10000 tuples or those that have arrived in the last 6 minutes. The window defines a time changing table • Join (of each arriving tuple in) the data stream with a table is simple.A widow on a data stream can be viewed as a table. • Symmetric joins of data streams are thus supported by creating wi windows on each streams and computing the join on such window. • Blocking Queries Disallowed: • …next slide

Surviving the 3 Complications (cont.) • ORDER • Limited Memory: • Blocking Queries Disallowed: • If any operator part of the query is blocking the query is also blocking • Blocking operators are NonMonotonic operators—which are thus disallowed (e.g., set diff and SQL-2 aggregagtes) • Significant loss of expressive power

Join of Stream S with a Table T (where T is a DB relation or a Window on a Stream) When a new tuple z with timestamp ts(z) arrives in S, join it with all the tuples in T. • ts(z) is the timestamp of tuples so produced If T is a window on a stream S’ • E.g., 30 minutes preceding current tuple in S’ or • 499 tuples preceding current one • T must contain all the tuples up to ts(z) included. • Thus a window is a synopsis of the past.

Query Operators • Selectionandprojection—same as in relational algebra (no duplicate elimination in projection) • Union of data streams becomes a merge: idle waiting problem! • Join of data streams with tables—no problem • Joins of two or more data streams: must use windows. Idle waiting here two! • Set difference and SQL-2 aggregates are blocking and cannot be used. • Aggregates can be salvaged by using continuousaggregates or windows. More later

Blocking Query Operators • No output until the entire input has been seen—i.e., the last tuple in the input, … often detected after we hit the EOF. • Streams – input never ends: thus blocking operators cannot be used as such. • Traditional SQL aggregates are blocking. • Several SQL operators have DBMS implementations that are blocking but are not intrinsically blocking • group by, join can be implemented in blocking and nonblocking ways • but others operators are intrinsically blocking • We will need to formally characterize which is which! (and we’ll see that nonblocking operators are the monotonic ones)

Problematic Operators for Data Streams • Blocking query operators—i.e., those that must see everything in the input before they can return anything in the output • NonBlocking query operators are those that can return results now, without seeing the rest of the stream • Selection and projection are non-blocking • Set Difference, and Traditional aggregates are blocking • Continuous aggregates are not

Expressive Power Problem SQL: a language of limited expressive power • Non-blocking operator requirement further limit its expressive power when expressing continuous queries • Pull-Based computation of PL and DBMS makes it possible to embed SQL in a procedural language program • But DSMS have a push-based computation model: embedding continuous queries a PL is a problem not a general solution • OR-DBMS extenders that use BLOBs not working either • User Defined Aggregate functions remain a viable extensibility mechanism for continuous queries • However, only non blocking aggregates can be used • Blocking versus non-blocking aggregates.

Aggregates: two Forms • Traditional SQL2 blocking aggregate select X, avg(X)from S where P group by X • NB SQL:2003 OLAP Functions—aggregates on windows traffic (sourceIP, sourcePort, destIP , destPort, length, Time)select sourceIP, Time, avg(lenght)over(order by Time, partition by sourceIP 50 rows preceding) Cumulative (running) window: ... over(order by Time, partition by sourceIP unlimited preceding)

Avoiding Blocking Behavior • Windows: aggregates on a limited size window are approximate and nonblocking • DSMS make extensive use of windows of all kinds: • Sliding windows (same as SQL:2003 OLAP functions) • Tumbles: restart every new window (we can use the traditional definition) • Panes: the window is broken up into panes • Punctuation: • Assertion about future stream contents not being needed to answer correctly. • Unblocks operators, reduces state • Construct used for avoiding blocking are also useful for avoiding infinite memory

Query Plans: Optimization and Scheduling The DSMS optimization problem is quite different from the DBMS query optimization problem.

Query Optimization in DBMS • In DBMS: execution time savings by selecting operator implementation indexes and join reordering--- all measured by page swap counts • Scheduling of various query tasks might be left up to the OS.

DSMS Optimizationof Queries and Schedules In DSMS: data is in memory and execution time is mostly determined by the query graphs and the costs of tuples being processed. But many queries competing for resources: thus schedules must be optimized to minimize latency and memory (similarities with tasks scheduling in OS) Simple DBMS-like query optimization opportunities remain: e.g. pushing selection, composing views Sharing of operators and buffers also important! 29

Scheduler based on Heuristics • Round Robin (RR) is perhaps the most basic • operators in a circular queue are given a fixed time slice. • Starvation is avoided, but little adaptivity • FIFO: takes the first tuple in input and moves it through the chain • Minimal latency, poor memory • Greedy Algorithms: • Buffers with most tuples first • Tuples that waited longest first • Operators that release more memory first

Chain: a Formal Result on Memory Optimization[Babcock, Babu, Datar, Motwani 2003] Output σ1 best slope σ3 selectivity = 0.0 σ2 Net Selectivity σ2 selectivity = 0.6 σ3 σ1 selectivity = 0.2 Time Input

DSMSApproximation and Load Schedding • DSMS: online response on boundless and bursty data streams—How? • by using Approximation and Synopses, or even • Shedding load when arrival rates become impossible.

QOS via Synopses and Approximation • Synopsis:bounded-memory history-approximation • Succinct summary of old stream tuples • Like indexes/materialized-views, but base data is unavailable • Examples • Sliding Windows • Sampling and Bootstrapping • Sketching techniques • Histograms • Wavelet representation • Approximate Algorithms: e.g., median, quantiles,… • Fast and light Data Mining algorithms

Quality of Service (QoS) and Load Schedding • Data streams can be bursty. When the input stream rate exceeds system capacity the QoS will become unacceptable—even with optimal schedules • Then more drastic measures must be taken: something must be cut: Load Shedding • Introducing load shedding in a data stream manager is a challenging problem • Queries can be cut alltogether, or if possible • Some tuples might be dropped from certain streams in ways that only reduce the accuracy of the queries, by • Random shedding, or • Semantic shedding

The Bigger Picture… • XML Data Streams • Complex Event Processing (CEP)

XML Data Streams: Applications • An XML data stream is a sequence of tokens • Typical applications using XML streams involve: • integration o heterogneous data and information • Distributed monitoring of computing systems • Message-based web services • Purchase orders, retail transactions • Personalized content delivery

XML Streams: Data Model XML data: tree structure <Purchase_Doc> <PR_Number val = “50”/> <Supp_Name>ABC</Supp_Name> <Address> <City>Florham Park</City> <State>New Jersey</State> </Address> <Line_Items> <Item> <Part_Number val= “1050”/> <Quantity val=“20”/> </Item> Data stream: ~ SAX events [element Purchase_Doc anyType] [element PR_Number anyType] [attribute val anySimpleType] [chardata 50] [end-attribute] [end-element] [element Supp_Name anyType] [text ABC] [end-element] …

XML Query Languages • XML query languages • Xquery, XSLT, Xpath • Declarative matching of structured data and text • Easy restructuring to meet needs of data consumers

XML Streams: research Issue • Efficient Processing of single/multiple queries (e.g., Xfilters/Yfilters) • Blocking operators/constructs in XQuery—e.g., XQuery new function definition mechanisms are blocking • Optimization, synopses and other QoS techniques for XML data streams • Integration of relational and XML DSMS—just like relational and XML DBMS are now being integrated.

The Even Bigger Picture… XML Data Streams Complex Event Processing (CEP) 40

Research & Commercial Systems • Progress Apama: they claim to be the industry leading platform for Complex Event Processing (CEP) [a UK-trained Chicago team] • Aurora (Brandeis, Brown, MIT) • Borealis: multiprocessor processor version • StreamBase: commercial startup. • Gigascope (AT&T): used with success for network traffic management, continuing projects • STREAM (Stanford Univ) influential but short lived • Coral8: R. Motwani among startup founders • Telegraph (UCB): Quickly completed by now • The Truvisio startup • Stream Mill (UCLA): extending the power and generality of DSMS • integrates XML streams and relational streams • Support for Data Mining applications

More Tutorial Talks Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani,Jennifer Widomhttp://theory.stanford.edu/~rajeev/pods-full-talk.ppt Nick Koudas and Divesh Srivastava. Data stream query processing. Tutorial presented at International Conference on Very Large Databases (VLDB), 1149, 2003. [ PDF | talk slides (PDF) Nick Koudas et al. Matching XML Documents Approximately (with S. Yahia and D. Srivastava) Tutorial delivered at ICDE 2003 Nick Koudas et al. Stream Data Management: Research Directions and Opportunities. Invited Talk at IDEAS 2002. Nick Koudas et al. Mining Data Streams (with S. Guha) Invited Tutorial delivered at PAKDD 2003

Books Nauman A. Chaudhry, Kevin Shaw, Mahdi Abdelguerfi: Stream Data Management. Advances in Database Systems 30, Springer 2005. Charu C. Aggarwal: Data Streams - Models and Algorithms. Advances in Database Systems 31, Springer 2007. Lukasz Golab M. Tamer Ozsu, Data Stream Management, 2010, Morgan & Claypool Publishers. Charu C. Aggarwal: Data Mining - The Textbook. Springer 2015. Minos N. Garofalakis, Johannes Gehrke, Rajeev Rastogi: Data Stream Management - Processing High-Speed Data Streams. Data-Centric Systems and Applications, Springer 2016.

References Overview Papers • B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom: Models and Issues in Data StreamSystems. PODS 2002: 1-16 • Lukasz Golab and M. Tamer ¨Ozsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5–14, 2003. Application Papers 1.Cranor, Johnson, Spatscheck & Shkapenyuk. Gigascope: A Stream Databasefor Network Applications. SIGMOD 2003 2. Joseph M. Hellerstein. From Database to Dataflow: New Directions in IT.Medical Records Institute Health IT Advisory Report 3(6) (2002). 3.Lerner & Shasha. The Virtues and Challenges of Ad Hoc + Streams Queryingin Finance. IEEE Data Engineering Bulletin, March 2003. 4.Sistal, Wolfson, Chamberlain, Dao. Modeling and Querying Moving Objects.ICDE 1997. 5.Yao & Gehrke. Query Processing for Sensor Networks. CIDR 2003 CEP Papers: • Apama; CEP applications www.eventstreamprocessing.com/cepapplications.htm • Hans-Martin Brandl and David Guschakowski, Complex Event Processing in the context of Business Activity Monitoring, Thesis, University of Applied Sciences Regensburg, 2007. • Roy Schulte, Event Processing in Business Applications, Workshop on Event Processing, March 14,15 2007www.complexevents.com/slides/Gartner_Schulte.ppt • Other papers at the same Workshop on Event Processing - Presentations

More about Systems • Sankar Subramanian, Srikanth Bellamkonda, Hua-Gang Li, Vince Liang, Lei Sheng, Wayne Smith, James Terry, Tsae-Feng Yu, Andrew Witkowski: Continuous Queries in Oracle. VLDB 2007: 1173-1184 • Arvind, Shivnath Babu, Jennifer Widom: The CQL continuous query language: semantic foundations anquery execution. VLDB J. 15(2): 121-142 (2006) • Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Eduardo F. Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, Stanley B. Zdonik: Retrospective on Aurora. VLDB J. 13(4): 370-383 (2004). • Jeong-Hyon Hwang, Ugur Çetintemel, Stanley B. Zdonik: Fast and Reliable Stream Processing over Wide Area Networks. ICDE Workshops 2007: 604-613. • Jeong-Hyon Hwang, Magdalena Balazinska, Alex Rasin, Ugur Çetintemel, Michael Stonebraker, Stanley B. Zdonik: High-Availability Algorithms for Distributed Stream Processing. ICDE 2005: 779-790. • Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, Vladislav Shkapenyuk: Gigascope: A Stream Database for Network Applications. SIGMOD Conference 2003: 647-651 • Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, David Maier, Anurag Maskey, Esther Ryvkina, Michael Stonebraker, Richard Tibbetts: Linear Road: A Stream Data Management Benchmark. VLDB 2004.

More about Systems (cont.) • J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proc. of the 2000 ACMSIGMOD Intl. Conf. on Management of Data, pages 379-390. • Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. Sailesh Krishnamurthy et al.: TelegraphCQ: An Architectural Status Report. IEEE Data Engineering Bulletin, Vol 26(1), March 2003. • Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, Vijayshankar Raman: Continuously Adaptive Continuous Queries over Streams. SIGMOD 2002, 49-61. • S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002. • J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, pages 379-390, May 2000. • M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.

Data Stream Management Systems--Supporting Stream Mining Applications

Data Stream Management Systems--Supporting Stream Mining Applications

Presentation Transcript

Stream-based Data Management

More Stream-Mining

Data Stream Mining

More Stream-Mining

Stream Data

Stream Hierarchy Data Mining for Sensor Data

Data Stream Mining and Incremental Discretization

Data Stream Management

Data Stream Management Systems Checkpoint

A survey on stream data mining

MMDSS 2007 Data stream management and mining

Data Stream Mining

Extending DSMS for Data Stream Mining

STREAM: The Stanford Data Stream Management System

Data Stream Management Techniques

Stream Data Management System Prototypes

STREAM: The Stanford Data Stream Management System

Extending DSMS for Data Stream Mining

Data Stream Management Systems

Secure Dependable Stream Data Management

Data Stream Mining

Data Stream Management Systems Checkpoint