Extending DSMS for Data Stream Mining CS240B Notes by Carlo Zaniolo UCLA CSD
Data Streams • Continuous, unbounded, rapid, time-varying streams of data elements • Occur in a variety of modern applications • Network monitoring and traffic engineering • Sensor networks, RFID tags • Telecom call records • Financial applications • Web logs and click-streams • Manufacturing processes • DSMS = Data Stream Management System
Many Research Projects … • Amazon/Cougar (Cornell) – sensors • Aurora (Brown/MIT) – sensor monitoring, dataflow • Hancock (AT&T) – Telecom streams • Niagara (OGI/Wisconsin) – Internet DBs & XML • OpenCQ (Georgia) – triggers, view maintenance • STREAM (Stanford) – general-purpose DSMS • Tapestry (Xerox) – publish/subscribe filtering • Telegraph (Berkeley) – adaptive engine for sensors • Gigascope (AT&T Labs) – network monitoring • Stream Mill (UCLA) – power & extensibility
Technology Challenges • Data Models • Relational streams, but XML streams are important too • Tuple time-stamping • Order is important • Windows • Query Languages: extensions of SQL or XQuery • To support continuous (i.e., persistent) queries on transient data (a reversal of roles) • Blocking operators excluded • Query Plans • New execution models (main-memory oriented) • Optimized scheduling for response time or memory • Quality of Service (QoS) & Approximation • Synopses • Sampling • Load shedding
Commercial Developments • Several startups • StreamBase, • Coral8, • Apama, and • Truviso. • Oracle and other DBMS companies • Publish/subscribe • Complex Event Processing (CEP) • Limitations: only simple applications, e.g. continuous queries expressed in SQL • No support for data stream mining queries.
Data Stream Mining • Many applications: click stream analysis, intrusion detection,... • Many fast & light algorithms developed for stream mining. • Ensembles, Moment, SWIM, etc. • Analyst should be able to focus on high-level mining tasks. • Leaving QoS and lower-level issues to the system. • Integration of mining methods into Data Stream Management Systems (DSMS) is required • Many research challenges. • Stream Mill Miner (SMM) is the first DSMS designed for that.
Data Stream Management Systems (DSMS) • Data stream mining applications have so far been ignored by DSMS, although: A. DSMS technology is required for data stream mining • QoS, query scheduling, synopses, sampling, windows, ... B. Supporting DM applications is difficult, since current DSMS only support simple query languages based on SQL. Conclusion: either a shotgun wedding ... or a research breakthrough is needed here!
A Difficult Problem: the Inductive DBMS Experience • Initial attempts to support mining queries in relational DBMS: unsuccessful • OR-DBMS do not fare much better [Sarawagi’ 98]. • In 1996, Imielinski & Mannila took the ‘high-road’ approach and called for a quantum leap in functionality [Imielinski’ 96]: • High-level declarative languages for DM. • Extensions for query processing and optimization. • The research area of Inductive DBMS was thus born • Inspired DMQL, Mine Rule, MSQL, etc. • These suffer from limited generality and performance issues.
DBMS Vendors • Vendors have taken a ‘low-road’ approach. • A library of mining functions using a cache-mining approach • IBM DB2 Intelligent Miner • Oracle Data Miner • MS OLE DB for DM: mining models • Closed systems • Lacking in coverage and user-extensibility. • Not as popular as dedicated, stand-alone mining systems, such as Weka
Weka • A comprehensive set of mining algorithms and tools. • Generic algorithms over arbitrary data sets. • Independent of the number of columns in tables. • Open and extensible system based on Java. These are the features that we want in our SMM, starting from SQL rather than Java! Not an easy task ... why?
SMM Contributions • Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology. • Language and System Extensions: • Genericity, • Extensibility, and • Performance • A suite of stream mining algorithms. • Existing ones and • Newly developed in this project, e.g., SWIM. • High-level mining model for better • Usability • Control of mining process.
From SQL to Online Mining in SMM: Step by Step • Naïve Bayesian Classifier (NBC). • Important and frequently used. • Schema-specific NBC: simple to express in SQL using count and sum aggregates. But a generic NBC is still preferable. • Genericity: one function, independent of the number of columns involved. • Schema independence in SQL?
Genericity • Weka • Arrays of type real. • SMM • Verticalization. • Similar arrays, but in tables. • Built-in table function to reduce any table to this form. • Thus, generic UDAs work with this schema. • And further improvements are also supported in SMM
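The verticalization idea can be sketched in Python. This is only an illustration, not SMM's built-in table function: the names `verticalize` and `GenericNaiveBayes`, the tuple layout (row id, column, value, class label), and the add-one smoothing are all assumptions of this sketch. The point is that the classifier never looks at the schema, so the same code works for any number of columns.

```python
from collections import defaultdict
import math


def verticalize(row_id, row, label_column):
    """Turn one wide row (a dict) into (row_id, column, value, label) tuples."""
    label = row[label_column]
    return [(row_id, col, val, label)
            for col, val in row.items() if col != label_column]


class GenericNaiveBayes:
    """Naive Bayes over verticalized tuples; independent of the column count."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # label -> number of training rows
        self.pair_counts = defaultdict(int)    # (column, value, label) -> count
        self.values = defaultdict(set)         # column -> distinct values seen

    def learn(self, tuples):
        """Consume verticalized tuples; row ids must be unique within one call."""
        seen_rows = set()
        for row_id, col, val, label in tuples:
            if row_id not in seen_rows:        # count each training row once
                seen_rows.add(row_id)
                self.class_counts[label] += 1
            self.pair_counts[(col, val, label)] += 1
            self.values[col].add(val)

    def classify(self, row):
        """Return the most probable label for an unlabeled row (a dict)."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for col, val in row.items():
                # Add-one (Laplace) smoothing: an assumption of this sketch.
                num = self.pair_counts.get((col, val, label), 0) + 1
                den = n + len(self.values.get(col, ()))
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A training row such as `{"Outlook": "sunny", "Wind": "weak", "Play": "no"}` becomes two tuples, one per attribute, and adding a third attribute changes only the number of tuples, not the code.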
Extensibility? • Most mining tasks cannot be implemented in SQL. • Solution: define complex functions as User Defined Aggregates (UDAs) • Complex mining tasks can be viewed as aggregates • UDAs natively defined in SQL make the language computationally complete [Wang’ 04] • Turing-complete over static data • Non-blocking complete over data streams • Natural extensions to support windows and delta computations for data streams [Bai’ 06] • UDAs can also be defined in a PL, for better performance
Windowed UDA Example – Continuous Sum
For efficient differential computation:
    WINDOW AGGREGATE sum(val REAL) : REAL {
        TABLE state (tot REAL);
        INITIALIZE: { INSERT INTO state VALUES (val); }
        ITERATE: { UPDATE state SET tot = tot + val; }
        EXPIRE: { UPDATE state SET tot = tot - oldest().val; }
        /* No TERMINATE state */
    }
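The differential computation in this UDA can be mirrored in a few lines of Python. This is a sketch under assumptions, not how Stream Mill executes UDAs: the class name `WindowedSum`, the count-based window, and the choice to expire the oldest value just before adding the new one are illustrative. The running total is adjusted by one addition and at most one subtraction per tuple, instead of re-summing the whole window.

```python
from collections import deque


class WindowedSum:
    """Sum over the last `size` values, maintained differentially."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("window size must be at least 1")
        self.size = size
        self.window = deque()   # values currently inside the window
        self.tot = 0.0          # plays the role of the `state` table

    def add(self, val):
        """Process one new tuple and return the sum over the current window."""
        if len(self.window) == self.size:
            oldest = self.window.popleft()
            self.tot -= oldest          # EXPIRE: subtract the oldest value
        self.window.append(val)
        self.tot += val                 # INITIALIZE / ITERATE: add the new value
        return self.tot
```

For a window of 3, feeding 1, 2, 3, 4, 5 yields 1, 3, 6, 9, 12.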
Online Mining in SMM • UDAs are invoked with the standard SQL:2003 syntax of OLAP functions:
    SELECT learn(ts.Column, ts.Value, t.dec) OVER (ROWS 1000 PRECEDING)
    FROM trainingstream AS t,
         TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts
• Powerful framework: • Concept drifts and shifts • Association rule mining
The Slide Construct • A window can be divided into panes (called slides) • The window is a tumbling window when the size of the slide is equal to or larger than that of the window • The slide/window combination is great for data stream mining. • Simple construct added to support slides in UDAs • Allowed us to build a flexible and efficient library of data stream mining UDAs
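How the slide and window sizes interact can be shown with a small Python generator. It is a sketch that assumes count-based windows and slides; the function name `windows_with_slide` is hypothetical and is not part of SMM. A result is produced only when a full slide has arrived, and the first results may cover a window that is not yet full.

```python
from collections import deque


def windows_with_slide(stream, window_size, slide_size):
    """Yield the current window (as a list) each time a full slide has arrived."""
    buf = deque(maxlen=window_size)   # keeps only the newest window_size tuples
    for i, item in enumerate(stream, start=1):
        buf.append(item)
        if i % slide_size == 0:       # a slide boundary: report the window
            yield list(buf)
```

With `window_size=4` and `slide_size=2`, consecutive windows overlap by one slide. With `slide_size=4` the windows no longer overlap, which is the tumbling case described above.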
Association Rule Mining • SWIM [Mozafari’ 08]: maintains frequent patterns over large windows with slides. • Differentially computes frequent patterns as slides enter (or expire out of) the window. • Uses efficient ‘verifiers’ based on conditional counting. • Trade-off between delay and performance • Performance gain over existing algorithms.
SWIM (Sliding Window Incremental Miner) • If a pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT). [Figure: slides S4 to S7, with S4 expired and S7 new; windows W4 and W5; PT = F4 ∪ F5 ∪ F6 before the window moves and PT = F5 ∪ F6 ∪ F7 after. Steps shown: count/update frequencies for the expired and the new slide, mine the new slide with the mining algorithm, add F7 to PT, prune PT.]
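The property behind PT can be exercised with a much simplified Python sketch. It is not the SWIM algorithm of [Mozafari’ 08]: there are no verifiers and no delayed reporting, the per-slide miner is brute force and limited to itemsets of size 1 or 2, and new patterns are counted eagerly over the slides still in the window. All names (`frequent_patterns`, `count_in`, `SlidingPatternTree`) are hypothetical.

```python
from collections import deque
from itertools import combinations


def frequent_patterns(slide, min_support):
    """Brute-force miner: itemsets of size 1 or 2 that are frequent in one slide."""
    counts = {}
    for transaction in slide:
        items = sorted(set(transaction))
        for k in (1, 2):
            for pattern in combinations(items, k):
                counts[pattern] = counts.get(pattern, 0) + 1
    threshold = min_support * len(slide)
    return {p for p, c in counts.items() if c >= threshold}


def count_in(pattern, slide):
    """Number of transactions in the slide containing every item of the pattern."""
    needed = set(pattern)
    return sum(1 for t in slide if needed <= set(t))


class SlidingPatternTree:
    """Keeps PT (union of per-slide frequent patterns) with exact window counts."""

    def __init__(self, slides_per_window, min_support):
        self.n = slides_per_window
        self.min_support = min_support
        self.slides = deque()   # (transactions, frequent patterns) per slide
        self.counts = {}        # pattern in PT -> count over the current window

    def add_slide(self, slide):
        """Move the window by one slide; return the patterns frequent in the window."""
        if len(self.slides) == self.n:
            expired, _ = self.slides.popleft()
            for p in self.counts:               # subtract the expired slide
                self.counts[p] -= count_in(p, expired)
        new_freq = frequent_patterns(slide, self.min_support)
        for p in self.counts:                   # add the new slide
            self.counts[p] += count_in(p, slide)
        for p in new_freq.difference(self.counts):
            # New pattern: count it over the new slide and the older slides.
            self.counts[p] = count_in(p, slide) + sum(
                count_in(p, s) for s, _ in self.slides)
        self.slides.append((slide, new_freq))
        # Prune: keep only patterns frequent in at least one slide of the window.
        pt = set().union(*(f for _, f in self.slides))
        self.counts = {p: c for p, c in self.counts.items() if p in pt}
        total = sum(len(s) for s, _ in self.slides)
        return {p for p, c in self.counts.items()
                if c >= self.min_support * total}
```

Pruning is safe because a pattern that is infrequent in every slide of the window has a total count below the window threshold, so it cannot be frequent in the window.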
Concept Drifts/Shifts: Complex Processes • Ensemble-based methods. • Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06]. • Generic support, e.g. for adaptive boosting.
Built-in Online Mining Algorithms in SMM • Online classifiers: Naïve Bayesian, Decision Tree, K-nearest Neighbor • Online clustering: DBScan [Ester’ 96], IncDBScan, Windowed K-means*, DenStream* [Cao’ 06], CluStream • Association rule mining: Approximate frequent items, SWIM [Mozafari’ 08], Moment [Chi’ 04], AFPIM • Time series/sequence queries: SQL-TS [Sadri’ 01] • Many more … Legend: • Already supported • To be supported
Usability? • Invoking built-in and user-defined mining algorithms requires complex SQL queries. • An open and extensible system • Most analysts would prefer using a high-level mining language that • supports uniform invocation of built-in and user-defined mining algorithms (no SQL required) • describes the workflow of the mining process • is also open and extensible to incorporate newly defined mining algorithms.
Example: Defining a Mining Model
    CREATE MODELTYPE NaiveBayesianClassifier {
        SHAREDTABLES (DescriptorTbl),
        Learn (UDA LearnNaiveBayesian,
               WINDOW TRUE,
               PARTABLES(),     % names of param tables required by the method
               PARAMETERS()     % additional parameters to be specified for input
        ),
        Classify (UDA ClassifyNaiveBayesian,
                  WINDOW TRUE,
                  PARTABLES(),
                  PARAMETERS()
        )
    };
Example: Using a Mining Model • Creating an instance:
    CREATE MODEL INSTANCE NaiveBayesianInstance AS NaiveBayesianClassifier;
• Uniform invocation of mining tasks:
    RUN NaiveBayesianInstance.Learn WITH TrainingSet;
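The model type / model instance split can be imitated in Python to show why invocation becomes uniform. This is a loose analogy, not SMM's implementation: `ModelType`, `ModelInstance`, the `shared` dictionary standing in for shared tables, and the toy majority-class tasks are all hypothetical.

```python
from collections import Counter


class ModelType:
    """A named bundle of tasks (e.g. Learn, Classify), each backed by a callable."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = dict(tasks)


class ModelInstance:
    """One instance of a model type, with its own shared state."""

    def __init__(self, model_type):
        self.model_type = model_type
        self.shared = {}   # stands in for the shared tables of the model

    def run(self, task, data):
        """Uniform entry point: look the task up by name and apply it to the data."""
        return self.model_type.tasks[task](self.shared, data)


def learn_majority(shared, labels):
    """Toy learner: count class labels in the shared state."""
    shared.setdefault("label_counts", Counter()).update(labels)
    return shared["label_counts"]


def classify_majority(shared, rows):
    """Toy classifier: predict the most frequent label for every row."""
    majority = shared["label_counts"].most_common(1)[0][0]
    return [majority for _ in rows]


MAJORITY = ModelType("MajorityClassifier",
                     {"Learn": learn_majority, "Classify": classify_majority})
```

An analyst-facing call then looks the same for every algorithm, for example `ModelInstance(MAJORITY).run("Learn", labels)`, and a new algorithm is added by registering another `ModelType`.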
Performance • SMM vs. Weka • NBC and decision tree classifier • Datasets [UCI-MLR] • Iris: 5 attributes • Heart disease: 13 attributes • Overhead of integrating algorithms into SMM • The SWIM algorithm, standalone vs. integrated • Dataset [IBM Quest] • Transaction length 20, pattern length 5, 50K tuples
The Stream Mill System • One server, multiple clients • Server (on Linux): hosts the ESL language and manages storage and continuous queries • Client (Java based GUI): allows the user to specify streams, queries, etc.
Conclusion • SMM integrates new solutions for several difficult problems: • Usability by high-level mining models • Extensibility by user-defined mining models that call on UDAs with windows • Suite of built-in data stream mining UDAs • Generic mining UDAs by Verticalization & other techniques • Performance • SMM is the first of its kind: more and better systems will follow in its footsteps.
Future Work • Faster & lighter mining algorithms • E.g. online algorithms for clustering • Integration of other mining algorithms • Data flow in mining models • Similar solution for databases
References • [Arasu’ 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336–347, 2004. • [Babcock’ 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002. • [Bai’ 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337–346, 2006. • [Cao’ 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. To appear in Proceedings of SIAM 2006. • [Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM’04), November 2004. • [Chu’ 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004. • [Ester’ 96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996. • [Forman’ 06] George Forman. Tackling concept drift by temporal inductive transfer. In SIGIR, pages 252–259, 2006.
References • [Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996. • [Law’ 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB, pages 492–503, 2004. • [Mozafari’ 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008. • [Sadri’ 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001. • [Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998. • [UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html • [Wang’ 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.