1 / 59

System.S. Stream Processing Platform Overview

Explore System.S. for energy trading, astronomy, and more. Harness stream processing benefits like real-time data analysis, information extraction, and intelligence organization. Learn about the components and promises of SPADE language for high-performance stream computing applications.

misu
Download Presentation

System.S. Stream Processing Platform Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. System S –High-Performance Stream Computing Platform Olivier Verscheure IBM T.J. Watson Research Center

  2. Outline • System S Overview • System S for Energy Trading • System S for Astronomy

  3. Stream processing will be everywhere

  4. Minimizing time to react Process data as it is continuously generated data Extracting and organizing information and intelligence Data Sources Stream Processing System What is Stream Processing? Database/data warehouse

  5. Transaction processing Batch processing Without Stream Processing? Minimizing time to react Process data as it is continuously generated Database/data warehouse data Extracting and organizing information and intelligence Data Sources

  6. Minimizing time to react Process data as it is continuously generated stream data packet Database/data warehouse data operator Extracting and organizing information and intelligence Sensor Network Stream Processing System What Makes a Stream Processing System?

  7. Tooling Developer UI Composition UI Analyst UI Application Interconnection of operators Runtime Environment Job management, resource management, content routing, programming model, object store Hardware Platform Servers, networks, storage, operating system, file system What Makes a Stream Processing System? Stream Processing System

  8. Event/Data Volume & Diversity High Volume Complex Analysis Time Sensitive Time Sensitivity Analysis Complexity Continuous Event and Stream Processing • Stream Processing enables… • high message/data rates, • low (msec-secs) latency, • advanced analysis • Today’s Complex Event Processing (CEP) solutions target… • 10K messages/sec, • secs-minutes latency, • rules-based analysis

  9. System S Stream Processing • New stream computing paradigm • Pull information from anywhere in real time • Ultra-low latency, ultra-high throughput • Scalable • Financial market uses: • Market data feed processing • Algorithmic / automated trading • Risk management • Compliance management and market surveillance

  10. System S: A Closer Look Analytics may be a combination of provided and user-developed/legacy operators System S continually adapts to new inputs, new modalities • This notional System S application… • Calculates VWAP • Calculates P/E, based earningsfrom Edgar • Refines earnings based on encumbrances identified in newsfeeds System S applications can seamlessly process structured (event) and unstructured data

  11. Correlate Transform Annotator Edge Adapters Segmenter Filter Classifier SPADE Building BlocksClassifiers, Annotators, Correlators, Filters, Aggregators

  12. Application Programming Sink Adapters Operator Repository Source Adapters • Consumable • Reusable set of operators • Connectors to external static or streaming data sources and sinks MARIO: Automated Application Composition SPADE: Stream processing dataflow scripting language Platform Optimized Compilation

  13. SPADE SPADE (Stream Processing Application Declarative Engine) is an intermediate language for streaming applications. Simplifies design of applications used by System S Hides complexities of manipulating data streams (e.g., contains generic language support for data types and building block operations) fanning out applications to distributed heterogeneous nodes transporting data through diverse computer infrastructures (ingesting external data, routing intermediate results, looping in feedback, branching, outputing the results, ...)

  14. Basic Promises of SPADE SPADE is easy to use Programmers provide descriptions of stream-based data processing tasks using SPADE’s intermediate language SPADE’s query engine comes up with an execution plan, builds it, and hands it off to System S runtime for deployment SPADE enables rapid application development Customizable operators – do not require low-level coding Support for user defined operators and legacy code SPADE is high performance Optimized code generation

  15. Functor Aggregate Source Sink A simple example [Application] SourceSink trace [Typedefs] typespace sourcesink typedef id_t Integer typedef timestamp_t Long [Program] // virtual schema declaration vstream Sensor (id : id_t, location : Double, light : Float, temperature : Float, timestamp : timestamp_t) // a source stream is generated by a Source operator – in this case tuples come from an input file stream SenSource( schemaFor(Sensor)) := Source( ) [ “file:///SenSource.dat” ] {} // this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input stream SenAggregator ( schemaFor(Sensor) ) := Aggregate( SenSource<count(100),count(1)> ) [ id . location ] { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) } // this intermediate stream is produced by a functor operator stream SenFunctor( id: Integer, location: Double, message: String ) := Functor( SenAggregator) [ log(temperature,2.0)>6.0 ] { id, location, “Node ”+toString(id)+ “ at location ”+toString(location) } // result management is done by a sink operator – in this case produced tuples are sent to a socket Null := Sink( SenFunctor) [ “cudp://192.168.0.144:5500/” ] {}

  16. Performance optimization and scalability • Split/Aggregate/Join architectural pattern • High-ingest rate input stream must be split • Aggregate: model creation • Join: correlation • Operator Fusion • Fine-granularity operators • From small parts, make a big one that fits • Code generation • Actual code must match the underlying runtime environment • Number of cores • Interconnect characteristics • Architecture-specific instructions • Compiler-based optimization • Driven by automatic profiling • Driven by incremental learning of application characteristics Logical app view Physical app view

  17. Operator Fusion - Illustration One PE per Operator Spade compiler can generate optimized operator grouping schemes Fuse all except sources and sinks A truly random partitioning

  18. Operating System Transport System S Data Fabric X86 Box X86 Blade FPGA Blade X86 Blade Cell Blade X86 Blade X86 Blade X86 Blade X86 Blade X86 Blade System S Runtime Services Optimizing scheduler assigns operators to processing nodes, and continually manages resource allocation Runs on commodity hardware – from single node to blade centers to high performance multi-rack clusters Processing Element Container Processing Element Container Processing Element Container Processing Element Container Processing Element Container

  19. Operating System Transport System S Data Fabric BG Node BG node BG node BG node BG node X86 Blade X86 Blade FPGA Blade X86 Blade X86 Blade Cell Blade X86 Blade X86 Blade X86 Blade X86 Blade System S Runtime Services Optimizing scheduler assigns operators to processing nodes, and continually manages resource allocation Adapts to changes in resources, workload, data rates Runs on commodity hardware – from single node to blade centers to high performance multi-rack clusters Processing Element Container Processing Element Container Processing Element Container Processing Element Container Processing Element Container Capable of exploiting specialized hardware

  20. Site B Site B Site C Site C Site A Site A Distributed operation

  21. Source Adapters Sink Adapters Operator Repository Automated, Optimized Composition (SPADE, MARIO) Automated, Optimized Deploy and Management (Scheduler) Advantages of Stream Processing as Parallelization Model • Streams as first class entity • Explicit task and data parallelism • More intuitive approach to multi-core exploitation • Automated composition • Query optimization over well-known operators • Inquiry optimization using semantic tagging of operators and data sources • Operator and data source profiling for better resource management • Reuse of operators across stored and live data • MapReduce is similar programming model with storage as transport

  22. System S Pilots Trading Advantage Manufacturing Improving the quality semi-conductor wafers with dynamic manufacturing tools tuning Identification of and response to opportunities in real-time market data 2.5K events / sec 10 msec latency 5 Million events / sec Millisecond latency Semiconductor Solutions Astrophysics Government World’s largest and first fully digital radio observatory for astrophysics, space and earth sciences, and radio research Detect & respond to phenomena based on large volumes of structured and unstructured information 1.5 Million events / sec

  23. Sneak Preview: IBM InfoSphere Streams Applications Data Warehouse Business Process Management Business Intelligence

  24. Status • In development in Research since 2003. First prototype release for use by USG in May 2005. Release 3.2 slated for Nov. 2008. • IBM will create a product called InfoSphere Streams, based on System S, that will be generally available in 1H 2010. • IBM will have an Early Adopter Program for a limited number of customers starting later this year. We already are engaged with some customers and will be adding more before the March 2009 EAP software release.

  25. Outline • System S Overview • System S for Energy Trading • System S for Astronomy

  26. The Energy Trading Scenario using Stream Computing • Sample application showing power of Stream Computing • Only one of many possible applications/services • Weather conditions and events drive pricing of energy futures • natural events interfering with energy supply • announcements, news stories, … • Energy traders today struggle to integrate info from multiple sources • cannot get it in real time, to inform their trade decisions • they see 8 screens, integrate manually via a spreadsheet • IBM Stream Computing assembles, deploys applications • integrates diverse sources of data • provides timely correlations, analyses

  27. Illustrative Example: Fear and Opportunity in the Gulf • News Flash: • Hurricane Dean Upgraded to Category 5 • Path Projected through Gulf • Oil Stocks Uniformly Down • If you saw it coming, because you watched for more primal data… • Like hurricane path predictions from NOAA • Or even weather satellite and/or sensor data • Real-time equities trade data • If you’ve been accumulating intelligence on the location (and value) of company assets that could be in the path… • If you could apply such analysis before the news cycle… • You could take advantage – in both directions… Same story, viewed 2 hours later. The story’s the same, but… …the live tickers tell a different story Oil company stocks down, based on early fears. Properties with significant assets in path are still down More affected companies have been identified(a bit late, no?) Others are up, showing recovery even beforethe storm hits

  28. Recommendations Based on Hurricane Forecast Capture market data (high volume) Compute portfolio market indicators (low latency) Correlate combined risk and trade VWAP to determine buy/sell recommendations Make recommendations and notify DHTML Result rendering Capture weather sensor data, analyses hurricane predicted path Dynamically updated risk assessment for assets in projected path Estimate impact on portfolios Real-time projections of hurricane path Web Zero platform System S platform

  29. Outline • System S Overview • System S for Energy Trading • System S for Astronomy • Past & current projects

  30. Past Projects • Outlier detection from single tripole • Decomposing combined DOA’s from single tripole • SPADE UDOP’s • Linking against Lapack and Blas libraries • About 50 non-trivial processing elements • Being optimized by SPADE team now • Convolutional resampling (tConvolve) on System S • Mostly built-in operators (soon built-in operators only) • Fully parametrizable using Perl; e.g., # of w planes • Does scale very well!

  31. Outlier detection from single tripole • Receive 3D electric field • Demultiplex 3D electric field • Each UDP packet contains multiplexed electric fields • Compute intensity • I(t)=|Ex(t)|2+|Ey(t)|2+|Ez(t)|2 • Detect outliers • Outlier detected if: mI(t-N:t-1) + T.I(t-N:t-1)  I(t)  mI(t-N:t-1) - T.I(t-N:t-1) • Visualize detected outliers in Matlab in real-time

  32. Field intensity Windowed statistics |Ex|2 ^2 Aggregate Avg 10c,1c Avg mI2 Data Demux |Ey|2 Barrier Seq,  Barrier mI Sqrt(mI2-mI2) UDP Source I mI Aggregate Avg 10c,1c Avg mI, I |Ez|2 mI-T.I mI+T.I U, L Filter out Seq<10 UDP sink Filter out empty lists UIL? Barrier I,{U,L} File sink Outlier detection SPADE Flow of Operators

  33. Field intensity |Ex|2 Data Demux |Ey|2 Barrier Seq,  Source I |Ez|2 Field Intensity

  34. Source Data Demux |Ez|2 |Ey|2 |Ex|2 Barrier Seq,  Field intensity I

  35. Source Data Demux |Ez|2 |Ey|2 |Ex|2 Barrier Seq,  Field intensity I

  36. Source Data Demux |Ez|2 |Ey|2 |Ex|2 Barrier Seq,  Field intensity I

  37. Source Data Demux |Ez|2 |Ey|2 |Ex|2 Barrier Seq,  Field intensity I

  38. Source Data Demux |Ez|2 |Ey|2 |Ex|2 Barrier Seq,  Field intensity I

  39. Field intensity Windowed statistics |Ex|2 ^2 Aggregate Avg 10c,1c Avg mI2 Data Demux |Ey|2 Barrier Seq,  Barrier mI Sqrt(mI2-mI2) UDP Source I mI Aggregate Avg 10c,1c Avg mI, I |Ez|2 mI-T.I mI+T.I U, L Filter out Seq<10 UDP sink Filter out empty lists UIL? Barrier I,{U,L} File sink Outlier detection SPADE Flow of Operators

  40. Windowed statistics ^2 Aggregate Avg 10c,1c Avg mI2 Barrier mI Sqrt(mI2-mI2) mI Aggregate Avg 10c,1c Avg mI, I mI-T.I mI+T.I U, L Windowed Statistics

  41. Aggregate Avg 10c,1c ^2 Avg Aggregate Avg 10c,1c Avg mI2 mI Barrier mI-T.I mI+T.I mI Sqrt(mI2-mI2) Windowed statistics U, L mI, I

  42. Aggregate Avg 10c,1c ^2 Avg Aggregate Avg 10c,1c Avg mI2 mI Barrier mI-T.I mI+T.I mI Sqrt(mI2-mI2) Windowed statistics U, L mI, I

  43. Aggregate Avg 10c,1c ^2 Avg Aggregate Avg 10c,1c Avg mI2 mI Barrier mI-T.I mI+T.I mI Sqrt(mI2-mI2) Windowed statistics U, L mI, I

  44. Field intensity Windowed statistics |Ex|2 ^2 Aggregate Avg 10c,1c Avg mI2 Data Demux |Ey|2 Barrier Seq,  Barrier mI Sqrt(mI2-mI2) UDP Source I mI Aggregate Avg 10c,1c Avg mI, I |Ez|2 mI-T.I mI+T.I U, L Filter out Seq<10 UDP sink Filter out empty lists UIL? Barrier I,{U,L} File sink Outlier detection SPADE Flow of Operators

  45. Filter out Seq<10 UDP sink Filter out empty lists UIL? Barrier I,{U,L} File sink Outlier detection Outlier Detection Intensity U, L

  46. U, L Outlier detection Barrier I,{U,L} UIL? Filter out empty lists Filter out Seq<10 Intensity UDP sink File sink

  47. U, L Outlier detection Barrier I,{U,L} UIL? Filter out empty lists Filter out Seq<10 Intensity UDP sink File sink

  48. Decomposing combined DOA’s from single tripole

  49. Direction of Arrival (DOA) • Simple case • Two orthogonal waves

  50. Pseudo-Orbital MomentumTwo-wave case, same frequencies

More Related