230 likes | 332 Views
Customizable Parallel Execution of Scientific Stream Queries. Milena Ivanova and Tore Risch Uppsala Database Laboratory http://user.it.uu.se/~udbl/. Motivation. Scientific instruments and simulators generate high-volume streams Information exploration by wide range analyses
E N D
Customizable Parallel Execution of Scientific Stream Queries Milena Ivanova and Tore Risch Uppsala Database Laboratory http://user.it.uu.se/~udbl/
Motivation • Scientific instruments and simulators generate high-volume streams • Information exploration by wide range analyses • Off-line hard-coded analyses on data stored on disks • Need for on-line, flexible, and scalable stream processing Milena Ivanova and Tore Risch, UDBL
http://www.lois-space.net/ • Very high data volume and rate • Complex numerical data • Relatively expensive user-defined computations Milena Ivanova and Tore Risch, UDBL
Grid Stream Database Manager (GSDM) • Main-memory object-relational DSMS • Distributed and parallel architecture provides scalability for data and computations • Customizable distributed execution strategies for CQs by high-level Data Flow Distribution Templates • PCC (Partition-Compute-Combine) generic template for lattice-shaped partitioned parallelism • Two customizable stream partitioning strategies: Window Split and Window Distribute Milena Ivanova and Tore Risch, UDBL
Related Work • Parallel stream processing Flux (Berkeley), Borealis (MIT, Brown, Brandeis): load balancing and fault tolerance • GSDM: Customizable templates and window split stream partitioning • DSMSs: Aurora (MIT, Brown, Brandeis), Gigascope (AT&T), STREAM (Stanford), TelegraphCQ (Berkeley): centralized architecture and cheap operators • GSDM: distributed and user-defined numerical functions on large stream data items • Conventional data partitioning: Round Robin, hash, range. • GSDM: query based customizable stream partitioning Milena Ivanova and Tore Risch, UDBL
Outline • Motivation • Related Work • GSDM Usage Scenario • Continuous Query Specification: • Stream Query Functions • Data Flow Distribution Templates • Customizable Parallel Strategies: • Window Split • Window Distribute • Experimental Results • Conclusions and Future Work Milena Ivanova and Tore Risch, UDBL
GSDM Scenario GSDM Scenario Milena Ivanova and Tore Risch, UDBL
Stream Query Functions (SQFs) • Declarative query computing a logical window in a result stream given one or several input streams • create function fft3(Radiosignal s) -> • RadioWindow • as select radioWindow(ts(v),fft(x(v)), • fft(y(v)),fft(z(v))) • from RadioWindow v • where v = currentWindow(s); Milena Ivanova and Tore Risch, UDBL
Data Flow Distribution Templates • Define customizable distibuted strategies • CQs as distributed compositions of SQFs • Logical site assignment for each SQF • Template constructor creates data flow graph • Generic template PCC (Partition-Compute-Combine) for customizable partitioned parallelism. Milena Ivanova and Tore Risch, UDBL
CQ Specification and Execution • Distributed execution strategy through template: • set wd= PCC(4,"S-Distribute","RRpart", • "fft3","S-Merge",0.1); • Input and output streams: • set s1 = register_input_stream( • "Radiosignal","1.2.3.4","UDP"); • set s2 = register_result_stream( • "1.2.3.5","Visualize"); • Compilation: • compile(wd,{s1},{s2},"hagrid.it.uu.se"); • Execution: • run(wd); Milena Ivanova and Tore Risch, UDBL
Partitioned Parallelism for Streams • Requirements for stream partitioning strategies: • SQF-semantics preserving • Order-preserving • Good load balancing • Examples: • Window Distribute and Window Split Milena Ivanova and Tore Risch, UDBL
Window Distribute • Distributes several logical windows among different partitions • SQF-independent • Parameterized on partitioning function • set wd= PCC(2,"S-Distribute","RRpart", • "fft3","S-Merge",0.1); Milena Ivanova and Tore Risch, UDBL
Window Distribute (cont.) • Round Robin partitioning: • create function RRpart(Stream s, • Integer ptot, Integer pno) -> Window • as select w[pno] • from Vector of Window w • where w = slidingWindow(s,ptot,ptot); Milena Ivanova and Tore Risch, UDBL
Window Split • Splits a single logical window into sub-windows • Needs knowledge about SQF semantics • SQF-dependent parameters for window splitting and combining • set ws= PCC(2,"OS-Split","fft3part", • "fft3","OS-Join","fft3combine"); Milena Ivanova and Tore Risch, UDBL
Window Split (cont.) • FFT-dependent window split based on FFT- Radix K algorithm: • create function fft3part(RadioWindoww, • Integer ptot,Integer pno)->RadioWindow • as selectradioWindow(ts(w), • fftpart(x(w),ptot,pno), • fftpart(y(w),ptot,pno), • fftpart(z(w),ptot,pno)); Milena Ivanova and Tore Risch, UDBL
Experimental Results • Strategies: Window Split (WS), Window Distribute(WD), and Central – all expressed as templates • Scalability metrics: • Total maximum throughput • Scaled logical window sizes: 256-16384 samples • Impact of computation and communication by two experimental sets (slow and fast implementation of FFT) • Tree-structured partitioning and combining • set wd-tree = PCC(2,"S-Distribute","RRpart","PCC", • {2,"S-Distribute","RRpart","fft3", "S-Merge",0.1}, • "S-Merge",0.1); Milena Ivanova and Tore Risch, UDBL
Experimental ResultsSlow Set, Degree 2 • Load of FFT computing nodes dominates • WS2 more efficient than WD2 Milena Ivanova and Tore Risch, UDBL
Experimental Results – Slow Set, Degree 4 Size 512 WS4 better for window size > = 1024 Milena Ivanova and Tore Risch, UDBL
Experimental Results – Fast Set, Degree 4 Size 2048 Size 8192 Milena Ivanova and Tore Risch, UDBL
Communication Costs Elapsed processing and communication time in partition, compute, and combine phases for WS2 and WD2, fast set, size 8192 Milena Ivanova and Tore Risch, UDBL
Speed-up – slow set Milena Ivanova and Tore Risch, UDBL
Conclusions • Parallel execution of CQs with expensive user-defined computations • High-level Data Flow Distribution Templatesspecify distributed execution strategies for CQs • PCC (Partition-Compute-Combine) template for lattice-shaped partitioned parallelism • Two customizable stream partitioning strategies by templates: Window Split • knowledge about application semantics for more efficient computing • better when expensive CQs execute on limited resources Window Distribute • application independent customizable partitioning • better for non-expensive operations, small window sizes, or more resources Milena Ivanova and Tore Risch, UDBL
Ongoing and Future Work • Optimized data flow graphs using profiled runs in training mode • More complex CQs and other distribution strategies • Adaptation • Resource allocation through Grid infrastructure Milena Ivanova and Tore Risch, UDBL