StreamX10: A Stream Programming Framework on X10
Haitao Wei, 2012-06-14
School of Computer Science, Huazhong University of Sci&Tech
Outline
1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work
Background and Motivation
• Stream Programming
  • A high-level programming model that has been applied productively
  • Usually depends on a specific architecture, which makes stream programs difficult to port between platforms
• X10
  • A productive parallel programming environment
  • Isolates the details of different architectures
  • Provides a flexible parallel programming abstraction layer for stream programming
• StreamX10: tries to make stream programs portable by building on X10
COStream Language
• stream
  • A FIFO queue connecting operators
• operator
  • The basic functional unit: an actor node in the stream graph
  • May have multiple inputs and multiple outputs
  • Window: accessed with pop, peek, and push operations
  • Has an init function and a work function
• composite
  • A group of connected operators: a subgraph of actors
  • A stream program is composed of composites
COStream and Stream Graph
[Figure: a stream graph for a moving-average pipeline. The composite connects three operators, Source → Averager → Sink, through streams. Source has push=1; Averager has peek=10, pop=1, push=1; Sink has pop=1.]
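To make the window operations concrete, here is a minimal X10 sketch of one Averager firing (illustrative only, not COStream source or generated code; the class name, sample data, and console output are assumptions, and a recent X10 release is assumed):

// One Averager firing with peek=10, pop=1, push=1: read a 10-item window,
// push the average, then the runtime slides the window forward by one item.
class AveragerSketch {
    public static def main(args:Rail[String]) {
        val window = new Rail[Double](10);                     // the 10-item peek window
        for (i in 0..(window.size-1)) window(i) = 1.0;         // fill with sample data
        var sum:Double = 0.0;
        for (i in 0..(window.size-1)) sum += window(i);        // peek the buffered items
        Console.OUT.println("pushed " + (sum / 10.0));         // push = 1 averaged item
        // pop = 1: one item would now be removed from the head of the window
    }
}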
The Execution Framework
• The nodes of the stream graph are partitioned among the places
• Each node is mapped to an X10 activity
• The nodes run in pipeline fashion to exploit parallelism
• Local and global FIFO buffers carry the data between nodes
Work Partition
[Figure: an example stream graph with total work 30 is split into three inter-place partitions, each with computation work = 10; speedup = 30/10 = 3, with an inter-place communication cost of 2.]
• Objective: minimize communication while balancing the load (using METIS)
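StreamX10 uses METIS for the actual partitioning; purely to illustrate the load-balance half of the objective, the following is a much simpler greedy assignment sketch (a stand-in for METIS, not what StreamX10 does; the per-node work values and place count are made up, and communication cost is ignored):

// Toy heuristic: put each node on the currently least-loaded place.
// METIS additionally minimizes the inter-place communication edges.
class GreedyPartitionSketch {
    public static def main(args:Rail[String]) {
        val work = [10.0, 5.0, 5.0, 5.0, 2.0, 2.0, 1.0];   // assumed per-node work estimates
        val load = new Rail[Double](3);                     // three places in this toy example
        val owner = new Rail[Long](work.size);              // node -> place assignment
        for (n in 0..(work.size-1)) {
            var best:Long = 0;
            for (p in 1..(load.size-1)) if (load(p) < load(best)) best = p;
            owner(n) = best;
            load(best) += work(n);
        }
        for (p in 0..(load.size-1))
            Console.OUT.println("place " + p + ": load " + load(p));
    }
}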
Global FIFO Implementation
• Each producer/consumer has its own local buffer
• The producer uses the push operation to store data into its local buffer
• The consumer uses the peek/pop operations to fetch data from its local buffer
• When the local buffer becomes full (producer side) or empty (consumer side), the data are copied automatically between the local buffer and the global FIFO
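To sketch how the copy between a local buffer and the global FIFO might look (a hedged illustration only: the buffer sizes, names, and Rail.copy-based flush are assumptions, not the StreamX10 runtime's actual code), a producer can ship a full local buffer to the consumer's place through a GlobalRef:

// Flush a producer's full local buffer into a consumer-side buffer that
// lives at another place, using a GlobalRef as the cross-place handle.
class FifoFlushSketch {
    public static def main(args:Rail[String]) {
        val consumerBuf = new Rail[Double](16);                    // consumer-side buffer at place 0
        val remote = GlobalRef[Rail[Double]](consumerBuf);         // handle the producer can carry
        var producerPlace:Place = here;                            // pick another place if one exists
        for (p in Place.places()) if (p != here) { producerPlace = p; break; }
        at (producerPlace) {
            val local = new Rail[Double](4, (i:Long) => (i + 1) as Double); // "full" local buffer
            at (remote.home) {                                     // ship the data to the consumer's place
                Rail.copy(local, 0, remote(), 0, local.size);      // copy into the global buffer
            }
        }
        Console.OUT.println("first item received: " + consumerBuf(0));
    }
}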
X10 Code in the Back-end
The back-end generates X10 code that:
• Defines the work function of each operator
• Calls the work function in the initial and steady schedules
• Spawns an activity for each node at its place, according to the partition
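As a concrete illustration, here is a minimal sketch of the kind of X10 code the back-end could emit (a hedged sketch only: the Node class, the node-to-place mapping, and the 1000-iteration steady schedule are illustrative assumptions, not the compiler's actual output):

// Spawn one activity per stream node at the place chosen by the partitioner;
// each activity runs the initial schedule once, then loops the steady schedule.
class Node {
    def initWork() { /* initial-schedule firing(s): pre-fill the windows */ }
    def work()     { /* one steady-schedule firing: pop/peek inputs, push outputs */ }
}
class BackendSketch {
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {         // in reality, several nodes may map to one place
            at (p) async {                         // one activity per node at its assigned place
                val node = new Node();
                node.initWork();                   // work called in the initial schedule
                for (i in 1..1000) node.work();    // work called repeatedly in the steady schedule
            }
        }
    }
}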
Experimental Platform and Benchmarks
• Platform
  • Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
  • Red Hat EL5 with Linux 2.6.18
  • X10 compiler and runtime version 2.2.0
• Benchmarks
  • 11 benchmarks rewritten from StreamIt
Throughput Comparison
• Throughput of 4 different configurations (NPLACE * NTHREAD = 8)
• Normalized to 1 place with 8 threads
• For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for benchmarks with a low computation/communication ratio
• The benefit is small, or performance even degrades, when the number of places increases from 4 to 8
Observation and Analysis
• Throughput goes up as the number of places increases, because multiple places raise CPU utilization
• Multiple places expose parallelism but also bring more communication overhead
• Benchmarks with a large computation workload, such as DES and Serpent_full, can still benefit as the number of places increases
Conclusion
• We proposed and implemented StreamX10, a stream programming language and compilation system on X10
• A basic partitioning optimization was proposed to exploit parallelism on top of the X10 execution model
• Preliminary experiments were conducted to study the performance
Future Work
• Automatically choose the best configuration (number of places and number of threads) for each benchmark
• Reduce thread-switching overhead by mapping multiple nodes to a single activity
Acknowledgment
• X10 Innovation Award funding support
• Qiming Teng, Haibo Lin, and David P. Grove at IBM for their help with this research