StreamX10: A Stream Programming Framework on X10
Haitao Wei, 2012-06-14
School of Computer Science, Huazhong University of Sci&Tech
Outline
1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work
Background and Motivation
• Stream Programming
  • A high-level programming model that has been applied productively
  • Usually depends on a specific architecture, which makes stream programs difficult to port between platforms
• X10
  • A productive parallel programming environment
  • Isolates the details of different architectures
  • Provides a flexible parallel programming abstraction layer for stream programming
• StreamX10: tries to make stream programs portable by building on X10
COStream Language
• stream
  • A FIFO queue connecting operators
• operator
  • The basic functional unit: an actor node in the stream graph
  • May have multiple inputs and multiple outputs
  • Window: accessed with pop, peek, and push operations
  • Has an init function and a work function
• composite
  • A group of connected operators: a subgraph of actors
  • A stream program is composed of composites
COStream and Stream Graph
[Figure: a stream graph for a moving-average pipeline. The composite connects three operators, Source → Averager → Sink, through streams. Source has push=1; Averager has peek=10, pop=1, push=1; Sink has pop=1.]
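To make the window operations concrete, here is a minimal X10 sketch of one Averager firing (illustrative only, not COStream source or generated code; the class name, sample data, and console output are assumptions, and a recent X10 release is assumed):

// One Averager firing with peek=10, pop=1, push=1: read a 10-item window,
// push the average, then the runtime slides the window forward by one item.
class AveragerSketch {
    public static def main(args:Rail[String]) {
        val window = new Rail[Double](10);                     // the 10-item peek window
        for (i in 0..(window.size-1)) window(i) = 1.0;         // fill with sample data
        var sum:Double = 0.0;
        for (i in 0..(window.size-1)) sum += window(i);        // peek the buffered items
        Console.OUT.println("pushed " + (sum / 10.0));         // push = 1 averaged item
        // pop = 1: one item would now be removed from the head of the window
    }
}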
The Execution Framework
• The nodes of the stream graph are partitioned among the places
• Each node is mapped to an X10 activity
• The nodes run in pipeline fashion to exploit parallelism
• Local and global FIFO buffers carry the data between nodes
Work Partition
[Figure: an example stream graph with total work 30 is split into three inter-place partitions, each with computation work = 10; speedup = 30/10 = 3, with an inter-place communication cost of 2.]
• Objective: minimize communication while balancing the load (using METIS)
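StreamX10 uses METIS for the actual partitioning; purely to illustrate the load-balance half of the objective, the following is a much simpler greedy assignment sketch (a stand-in for METIS, not what StreamX10 does; the per-node work values and place count are made up, and communication cost is ignored):

// Toy heuristic: put each node on the currently least-loaded place.
// METIS additionally minimizes the inter-place communication edges.
class GreedyPartitionSketch {
    public static def main(args:Rail[String]) {
        val work = [10.0, 5.0, 5.0, 5.0, 2.0, 2.0, 1.0];   // assumed per-node work estimates
        val load = new Rail[Double](3);                     // three places in this toy example
        val owner = new Rail[Long](work.size);              // node -> place assignment
        for (n in 0..(work.size-1)) {
            var best:Long = 0;
            for (p in 1..(load.size-1)) if (load(p) < load(best)) best = p;
            owner(n) = best;
            load(best) += work(n);
        }
        for (p in 0..(load.size-1))
            Console.OUT.println("place " + p + ": load " + load(p));
    }
}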
Global FIFO Implementation
• Each producer/consumer has its own local buffer
• The producer uses the push operation to store data into its local buffer
• The consumer uses the peek/pop operations to fetch data from its local buffer
• When the local buffer becomes full (producer side) or empty (consumer side), the data are copied automatically between the local buffer and the global FIFO
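To sketch how the copy between a local buffer and the global FIFO might look (a hedged illustration only: the buffer sizes, names, and Rail.copy-based flush are assumptions, not the StreamX10 runtime's actual code), a producer can ship a full local buffer to the consumer's place through a GlobalRef:

// Flush a producer's full local buffer into a consumer-side buffer that
// lives at another place, using a GlobalRef as the cross-place handle.
class FifoFlushSketch {
    public static def main(args:Rail[String]) {
        val consumerBuf = new Rail[Double](16);                    // consumer-side buffer at place 0
        val remote = GlobalRef[Rail[Double]](consumerBuf);         // handle the producer can carry
        var producerPlace:Place = here;                            // pick another place if one exists
        for (p in Place.places()) if (p != here) { producerPlace = p; break; }
        at (producerPlace) {
            val local = new Rail[Double](4, (i:Long) => (i + 1) as Double); // "full" local buffer
            at (remote.home) {                                     // ship the data to the consumer's place
                Rail.copy(local, 0, remote(), 0, local.size);      // copy into the global buffer
            }
        }
        Console.OUT.println("first item received: " + consumerBuf(0));
    }
}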
X10 Code in the Back-end
The back-end generates X10 code that:
• Defines the work function of each operator
• Calls the work function in the initial and steady schedules
• Spawns an activity for each node at its place, according to the partition
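As a concrete illustration, here is a minimal sketch of the kind of X10 code the back-end could emit (a hedged sketch only: the Node class, the node-to-place mapping, and the 1000-iteration steady schedule are illustrative assumptions, not the compiler's actual output):

// Spawn one activity per stream node at the place chosen by the partitioner;
// each activity runs the initial schedule once, then loops the steady schedule.
class Node {
    def initWork() { /* initial-schedule firing(s): pre-fill the windows */ }
    def work()     { /* one steady-schedule firing: pop/peek inputs, push outputs */ }
}
class BackendSketch {
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {         // in reality, several nodes may map to one place
            at (p) async {                         // one activity per node at its assigned place
                val node = new Node();
                node.initWork();                   // work called in the initial schedule
                for (i in 1..1000) node.work();    // work called repeatedly in the steady schedule
            }
        }
    }
}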
Experimental Platform and Benchmarks
• Platform
  • Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
  • Red Hat EL5 with Linux 2.6.18
  • X10 compiler and runtime version 2.2.0
• Benchmarks
  • 11 benchmarks rewritten from StreamIt
Throughput Comparison
• Throughput of 4 different configurations (NPLACE * NTHREAD = 8)
• Normalized to 1 place with 8 threads
• For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for benchmarks with a low computation/communication ratio
• The benefit is small, or performance even degrades, when the number of places increases from 4 to 8
Observation and Analysis
• Throughput goes up as the number of places increases, because multiple places raise CPU utilization
• Multiple places expose parallelism but also bring more communication overhead
• Benchmarks with a large computation workload, such as DES and Serpent_full, can still benefit as the number of places increases
Conclusion
• We proposed and implemented StreamX10, a stream programming language and compilation system on X10
• A basic partitioning optimization was proposed to exploit parallelism on top of the X10 execution model
• Preliminary experiments were conducted to study the performance
Future Work
• Automatically choose the best configuration (number of places and number of threads) for each benchmark
• Reduce thread-switching overhead by mapping multiple nodes to a single activity
Acknowledgment
• X10 Innovation Award funding support
• Qiming Teng, Haibo Lin, and David P. Grove at IBM for their help with this research