Scalable Realtime Analytics with declarative, SQL like, Complex Event Processing Scripts • Srinath Perera • Director, Research WSO2 • Apache Member • (@srinath_perera) • srinath@wso2.com
(Batch) Analytics • Scientists have been doing this for about 25 years, with MPI (1991) on special hardware • It took off with Google’s MapReduce paper (2004), followed by Apache Hadoop, Hive, and the whole ecosystem built around them. • It was successful, so we are here!! • But processing takes time.
Value of Some Insights Degrades Fast! • For some usecases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time. • E.g. stock markets and speed of light • We need technology that can produce outputs fast • Static queries, but need very fast output (alerts, realtime control) • Dynamic and interactive queries (data exploration)
History • Realtime analytics is not new either!! • Active Databases (2000+) • Stream processing (Aurora, Borealis (2005+) and later Storm) • Distributed Streaming Operators (e.g. Database research topic around 2005) • CEP vendor roadmap (from http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)
I. Stream Processing • Program a set of processors and wire them up; data flows through the graph. • A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza) • Processors may be on the same machine or on multiple machines
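A minimal sketch of the "wire up processors" idea, in plain Python rather than Storm or Samza. Each processor is a generator, and chaining them forms a (linear) graph. The names `source`, `to_celsius`, `hot_only` and `run_pipeline` are hypothetical and chosen only for illustration; a real framework would also handle distribution and fault tolerance, which this sketch ignores.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, float]


def source(readings: Iterable[Tuple[int, float]]) -> Iterator[Event]:
    """First processor: turn raw (room, Fahrenheit) readings into events."""
    for room, temp_f in readings:
        yield {"roomNo": room, "temp": temp_f}


def to_celsius(events: Iterable[Event]) -> Iterator[Event]:
    """Transformation processor: convert the temperature to Celsius."""
    for e in events:
        yield {**e, "temp": (e["temp"] - 32.0) * 5.0 / 9.0}


def hot_only(events: Iterable[Event], threshold: float = 30.0) -> Iterator[Event]:
    """Filter processor: pass only events above the threshold."""
    for e in events:
        if e["temp"] > threshold:
            yield e


def run_pipeline(readings: Iterable[Tuple[int, float]]) -> List[Event]:
    """Wire the processors together; events flow through one at a time."""
    return list(hot_only(to_celsius(source(readings))))
```

For example, `run_pipeline([(1, 100.0), (2, 50.0)])` keeps only the reading from room 1, since 100 °F is about 37.8 °C and 50 °F is 10 °C.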
III. Micro Batch • Process data in small batches, then combine the partial results into the final result (e.g. Spark) • Works for simple aggregates, but is tricky for complex operations (e.g. event sequences) • Can be done with MapReduce as well if the deadlines are not too tight.
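The micro-batch approach can be sketched as follows for a simple aggregate (a running average). This is plain Python and not Spark code; `micro_batches` and `running_average` are hypothetical helper names, and the batch size of 3 is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Cut an event stream into consecutive small batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_average(events: Iterable[float], batch_size: int = 3) -> List[float]:
    """Aggregate each batch, then fold the partial result into the total."""
    total, count = 0.0, 0
    averages = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)   # partial sum for this batch
        count += len(batch)   # partial count for this batch
        averages.append(total / count)
    return averages
```

The sum and count combine trivially across batches, which is why simple aggregates fit this model. An event sequence that starts in one batch and ends in another has no such easy combine step.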
IV. OLAP Style In Memory Computing • Usually done to support interactive queries • Index data to make it readily accessible, so you can respond to queries fast (e.g. Apache Drill) • Tools like Druid, VoltDB and SAP Hana can do this with all data in memory to make things really fast.
Realtime Analytics Patterns • Simple counting (e.g. failure count) • Counting with Windows (e.g. failure count every hour) • Preprocessing: filtering, transformations (e.g. data cleanup) • Alerts, thresholds (e.g. alarm on high temperature) • Data Correlation, Detect missing events, detecting erroneous data (e.g. detecting failed sensors) • Joining event streams (e.g. detect a hit on soccer ball) • Merge with data in a database, collect, update data conditionally
Realtime Analytics Patterns (contd.) • Detecting Event Sequence Patterns (e.g. small transaction followed by large transaction) • Tracking: follow some related entity’s state in space, time etc. (e.g. location of airline baggage, vehicles, tracking wildlife) • Detect trends: rise, turn, fall, outliers, complex trends like triple bottom etc. (e.g. algorithmic trading, SLA, load balancing) • Learning a Model (e.g. predictive maintenance) • Predicting next value and corrective actions (e.g. automated car)
Apache Hive • A SQL like data processing language • Since many understand SQL, Hive made large scale data processing (Big Data) accessible to many • Expressive, short, and sweet. • Defines core operations that cover 90% of problems • Lets experts dig in when they like!
CEP = SQL for Realtime Analytics • Easy to follow from SQL • Expressive, short, and sweet. • Defines core operations that cover 90% of problems • Lets experts dig in when they like! Let's look at the core operations.
Operators: Filters • Assume a temperature stream • Here weather:convertFtoC() is a user defined function. Such functions are used to extend the language.
    define stream TempStream (ts long, roomNo int, temp double);
    from TempStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
    select roomNo, temp
    insert into HotRoomsStream;
• Usecases: • Alerts, thresholds (e.g. alarm on high temperature) • Preprocessing: filtering, transformations (e.g. data cleanup)
Operators: Windows and Aggregation • Supports many window types • Batch windows, sliding windows, custom windows • Usecases • Simple counting (e.g. failure count) • Counting with Windows (e.g. failure count every hour)
    from TempStream#window.time(1 min)
    select roomNo, avg(temp) as avgTemp
    group by roomNo
    insert into HotRoomsStream;
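To make the sliding time window concrete, here is a rough Python sketch of what an engine has to maintain for `avg(temp)` over the last minute. It is not Siddhi's implementation. `SlidingTimeAverage` is a hypothetical class, it tracks a single key (a per-room version would keep one instance per `roomNo`), and it assumes events arrive in timestamp order and that an event exactly one window old has already expired.

```python
from collections import deque


class SlidingTimeAverage:
    """Average of the values seen in the last `window_ms` milliseconds."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts: int, value: float) -> float:
        """Add one event and return the current windowed average."""
        self.events.append((ts, value))
        self.total += value
        # Expire events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window_ms:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)
```

Keeping a running total means each event is added once and removed once, so the cost per event stays small regardless of window length.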
Operators: Patterns • Models a followed-by relation: e.g. event A followed by event B • Very powerful tool for tracking and detecting patterns
    from every (a1 = TempStream) -> a2 = TempStream[temp > a1.temp + 5] within 1 day
    select a2.ts as ts, a2.temp - a1.temp as diff
    insert into HotDayAlertStream;
• Usecases • Detecting Event Sequence Patterns • Tracking • Detect trends
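One simplified reading of the followed-by pattern above, sketched in Python: every event opens a new pending match (the `every a1` part), and a later event that is more than 5 degrees warmer and arrives within a day completes it. This is an illustration under those assumptions, not the engine's actual matching algorithm; `detect_rises` is a hypothetical name and events are assumed to arrive in timestamp order.

```python
from typing import Iterable, List, Tuple

ONE_DAY_MS = 86_400_000


def detect_rises(
    events: Iterable[Tuple[int, float]],
    min_rise: float = 5.0,
    within_ms: int = ONE_DAY_MS,
) -> List[Tuple[int, float]]:
    """Return (ts, diff) for each a1 -> a2 match, where a2 is warmer by > min_rise."""
    pending: List[Tuple[int, float]] = []  # a1 candidates still waiting for an a2
    matches: List[Tuple[int, float]] = []
    for ts, temp in events:
        still_pending = []
        for a1_ts, a1_temp in pending:
            if ts - a1_ts > within_ms:
                continue  # the "within 1 day" limit has passed; drop the candidate
            if temp > a1_temp + min_rise:
                matches.append((ts, temp - a1_temp))  # pattern completed
            else:
                still_pending.append((a1_ts, a1_temp))
        pending = still_pending
        pending.append((ts, temp))  # "every": each event also starts a new match
    return matches
```

The pending list is the state the engine must carry between events, which is why sequence patterns are harder to fit into batch-style processing than simple aggregates.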
Operators: Joins • Join two data streams based on a condition, each over its own window • Usecases • Data Correlation, Detect missing events, detecting erroneous data • Joining event streams
    from TempStream[temp > 30.0]#window.time(1 min) as T
      join RegulatorStream[isOn == false]#window.length(1) as R
      on T.roomNo == R.roomNo
    select T.roomNo, R.deviceID, 'start' as action
    insert into RegulatorActionStream;
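A hedged Python sketch of how a windowed stream join like the one above might behave: hot temperature readings are kept for one minute, only the most recent "off" regulator event is kept (the length-1 window), and an arrival on either side is matched against the other side's window. `join_streams` and the dictionary-based event layout (with a `stream` field marking the origin) are assumptions made for this example, and the input is assumed to be a single time-ordered list.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple

Action = Tuple[int, str, str]


def join_streams(events: Iterable[Dict], window_ms: int = 60_000) -> List[Action]:
    """Emit (roomNo, deviceID, 'start') when a hot room has a switched-off regulator."""
    hot = deque()  # time window: (ts, roomNo) of recent hot readings
    last_off: Optional[Tuple[int, str]] = None  # length(1) window: (roomNo, deviceID)
    actions: List[Action] = []
    for e in events:
        ts = e["ts"]
        # Expire hot readings older than the time window.
        while hot and hot[0][0] <= ts - window_ms:
            hot.popleft()
        if e["stream"] == "temp":
            if e["temp"] > 30.0:  # filter before the window
                hot.append((ts, e["roomNo"]))
                if last_off is not None and last_off[0] == e["roomNo"]:
                    actions.append((e["roomNo"], last_off[1], "start"))
        elif e["stream"] == "regulator":
            if not e["isOn"]:  # filter before the window
                last_off = (e["roomNo"], e["deviceID"])
                for _, room in hot:
                    if room == e["roomNo"]:
                        actions.append((room, e["deviceID"], "start"))
    return actions
```

The windows bound how much state each side of the join must hold, which is what makes joining two unbounded streams feasible at all.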
Operators: Access Data from the Disk
    define stream TempStream (ts long, temp double);
    define table HistTempTable (day long, avgT double);
    from TempStream#window.length(1) join HistTempTable
      on getDayOfYear(ts) == HistTempTable.day and temp > avgT
    select ts, temp
    insert into PurchaseUserStream;
• Event tables allow users to map a database to a window and join a data stream with the window • Usecases • Merge with data in a database, collect, update data conditionally
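The event-table join can be pictured as a lookup against stored data for each incoming event. The Python sketch below assumes the intent is to keep readings that are warmer than the historical average for the same day of the year; a plain dictionary stands in for the database table, timestamps are assumed to be epoch milliseconds in UTC, and `day_of_year` and `above_historical_average` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def day_of_year(ts_ms: int) -> int:
    """Day of the year (1-366) for an epoch-millisecond timestamp, in UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).timetuple().tm_yday


def above_historical_average(
    events: Iterable[Tuple[int, float]],
    hist_table: Dict[int, float],
) -> List[Tuple[int, float]]:
    """Keep (ts, temp) events warmer than the stored average for that day."""
    out = []
    for ts, temp in events:
        avg = hist_table.get(day_of_year(ts))  # the table lookup (the "join")
        if avg is not None and temp > avg:
            out.append((ts, temp))
    return out
```

Days with no row in the table produce no output, which mirrors an inner join.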
Revisit Patterns • Simple counting • Counting with Windows • Preprocessing: filtering, transformations • Alerts, thresholds • Data Correlation, Detect missing events • Joining event streams • Tracking • Merge with data in a database, collect, update data conditionally • Detecting Event Sequence Patterns • Detect trends • Learning a Model • Predicting next value and corrective actions
Predictive Analytics • Build models and use them with WSO2 CEP, BAM and ESB using the upcoming WSO2 Machine Learner product (2015 Q2) • Build models using R, export them as PMML, and use them within WSO2 CEP • Call R scripts from CEP queries • Regression and Anomaly Detection operators in CEP
Case Study: Realtime Soccer Analysis Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
TFL Traffic Analysis • Built using TFL (Transport for London) open data feeds. http://goo.gl/04tX6k http://goo.gl/9xNiCm
Idea 1: Network of CEP Nodes • For scaling, we arrange CEP processing nodes in a graph, as in stream processing. • The graph can be implemented using a stream processing engine like Apache Storm
Idea II: Compile SQL like Queries to a Network of CEP Nodes
    from TempStream[temp > 33]
    insert into HighTempStream;
    from HighTempStream#window.time(1 hour)
    select max(temp) as max
    insert into HourlyMaxTempStream;
How do We Partition the Data to Scale up the Analysis? • Let's follow MapReduce • MapReduce does not scale a problem by itself; it asks users to break the problem into many small independent problems.
Idea III: Let the Users specify Parallelism • The language includes parallel constructs: partitions, pipelines, distributed operators • Assign each partition to a different node, and partition the data accordingly
    define partition on TempStream.region {
      from TempStream[temp > 33]
      insert into HighTempStream;
    }
    from HighTempStream#window.time(1 hour)
    select max(temp) as max
    insert into HourlyMaxTempStream;
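The partitioning idea can be sketched in Python as follows: events are routed to nodes by a hash of the partition key, the filter runs independently per node, and a single downstream step combines the outputs. The function names are hypothetical, the nodes are simulated with in-process lists rather than real machines, and the hourly window is left out to keep the sketch short.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def partition_by(events: Iterable[Dict], key: str, num_nodes: int) -> Dict[int, List[Dict]]:
    """Route each event to a node chosen by a stable hash of its partition key."""
    nodes: Dict[int, List[Dict]] = defaultdict(list)
    for e in events:
        node = zlib.crc32(str(e[key]).encode("utf-8")) % num_nodes
        nodes[node].append(e)
    return nodes


def high_temps(events: Iterable[Dict], threshold: float = 33.0) -> List[Dict]:
    """The query inside the partition: runs independently on every node."""
    return [e for e in events if e["temp"] > threshold]


def max_high_temp(events: Iterable[Dict], num_nodes: int = 2) -> Optional[float]:
    """Partitioned filter followed by a single, non-partitioned max."""
    nodes = partition_by(events, "region", num_nodes)
    high: List[Dict] = []
    for node_events in nodes.values():  # each iteration could be a separate machine
        high.extend(high_temps(node_events))
    return max((e["temp"] for e in high), default=None)
```

Because all events for one region land on the same node, any per-region state stays local, and only the much smaller filtered stream has to cross nodes for the final aggregate.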
Handling Ordering • When data is processed in parallel, output might be generated out of order. • Because there is no global time, we cannot reliably trigger windows and other time sensitive constructs • Solution: the current time needs to be propagated through the graph
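One possible way to read "propagate the current time through the graph" is sketched below in Python: a downstream operator tracks the latest timestamp reported by each upstream node and treats the minimum as its own current time, closing a window only once every upstream has moved past the window's end. This is an assumption about the approach, not a description of the actual implementation; `TimeTracker` and `closed_windows` are hypothetical names, and timestamps are assumed to be non-negative and to start near zero.

```python
from typing import Iterable, List, Tuple


class TimeTracker:
    """Tracks the latest time seen from each upstream node."""

    def __init__(self, sources: Iterable[str]) -> None:
        self.latest = {s: 0 for s in sources}

    def observe(self, source: str, ts: int) -> int:
        """Record a timestamp from one upstream; return the safe current time."""
        self.latest[source] = max(self.latest[source], ts)
        # No upstream can still send anything older than the slowest one's time.
        return min(self.latest.values())


def closed_windows(
    observations: Iterable[Tuple[str, int]],
    sources: Iterable[str],
    window_ms: int,
) -> List[int]:
    """Return the end times of the windows that can be triggered, in order."""
    tracker = TimeTracker(sources)
    next_end = window_ms
    fired: List[int] = []
    for source, ts in observations:
        now = tracker.observe(source, ts)
        while now >= next_end:  # every upstream has passed this window's end
            fired.append(next_end)
            next_end += window_ms
    return fired
```

With sources `a` and `b` and a one-second window, a window ending at 1000 fires only after both `a` and `b` have reported a time of at least 1000, even if `a` is already far ahead.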
CEP = SQL for Realtime Analytics • Easy to follow from SQL • Expressive, short, sweet and fast!! • Defines core operations that cover 90% of problems • Lets experts dig in when they like! And it scales!!
Questions? Visit us at Booth 1025 http://wso2.com/landing/strata-hadoop-world-ca-2015/