Orchestrating the Execution of Stream Programs on Multicore Platforms Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Lab. University of Michigan
Cores are the New Gates (Shekhar Borkar, Intel): chart of # cores/chip versus year (1975–2010), plotting unicore, homogeneous multicore, and heterogeneous multicore designs — from the 4004, 8086, and Pentium line (1 core) through Core2Duo/Core2Quad, Opteron, Power4/Power6, Niagara, Xbox 360, RAW, Cell, and AMD Fusion, up toward RAZA XLR, Cavium, CISCO CSR1, NVIDIA G80, Larrabee, PicoChip, and AMBRIC at 64–512 cores. Courtesy: Gordon’06. Stream programming languages and frameworks riding this trend: CUDA, X10, Peakstream, Fortress, Accelerator, Ct, C T M, Rstream, Rapidmind (alongside C/C++/Java).
Stream Programming • Programming style • Embedded domain • Audio/video (H.264), wireless (WCDMA) • Mainstream • Continuous query processing (IBM SystemS), Search (Google Sawzall) • Stream • Collection of data records • Kernels/Filters • Functions applied to streams • Input/Output are streams • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03]
Compiling Stream Programs: how do we map a stream program onto a multicore system (Core 1–Core 4, each with its own Mem)? The compiler must do the heavy lifting: • Equal work distribution • Communication • Synchronization
Stream Graph Modulo Scheduling (SGMS) • Coarse grain software pipelining • Equal work distribution • Communication/computation overlap • Target: Cell processor — a PPE (Power PC) plus SPE0–SPE7, each SPU with a 256 KB local store (LS) and an MFC (DMA) engine, connected over the EIB to DRAM • Cores with disjoint address spaces • Explicit copy to access remote data • DMA engine independent of PEs • Filters = operations, cores = function units
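To make "explicit copy to access remote data" and the independent DMA engine concrete, here is a minimal double-buffering sketch for an SPE, assuming the standard Cell SDK spu_mfcio.h intrinsics; the chunk size, the effective address in_ea, and the filter_work function are illustrative assumptions, not part of the Streamroller-generated code.

  #include <spu_mfcio.h>
  #include <stdint.h>

  #define CHUNK 4096                              /* illustrative transfer size (bytes) */

  /* Double buffer in the 256 KB local store, aligned for DMA. */
  volatile char buf[2][CHUNK] __attribute__((aligned(128)));

  extern void filter_work(volatile char *data, int n);   /* hypothetical filter work function */

  /* Stream nchunks of data from main memory (effective address in_ea),
     overlapping each DMA transfer with computation on the previous chunk. */
  void process_stream(uint64_t in_ea, int nchunks)
  {
      int cur = 0;
      mfc_get(buf[cur], in_ea, CHUNK, /*tag=*/cur, 0, 0);      /* start first transfer */

      for (int i = 0; i < nchunks; i++) {
          int nxt = cur ^ 1;
          if (i + 1 < nchunks)                                  /* prefetch the next chunk */
              mfc_get(buf[nxt], in_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

          mfc_write_tag_mask(1 << cur);                         /* wait only for the current chunk */
          mfc_read_tag_status_all();

          filter_work(buf[cur], CHUNK);                         /* compute while the prefetch runs */
          cur = nxt;
      }
  }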
Streamroller Software Overview: front ends — SPEX (C++ with stream extensions), StreamIt, and Stylized C — are lowered to a common Stream IR. Streamroller runs Profiling, Analysis, Scheduling, and Code Generation on this IR, targeting multicore (Cell, Core2Quad, Niagara), SODA (low power multicore processor for SDR), and a custom application engine. Focus of this talk: scheduling and code generation for the multicore (Cell) target.
Preliminaries • Synchronous Data Flow (SDF) [Lee ’87] • StreamIt [Thies ’02] • Filters push and pop items from input/output FIFOs

Stateless filter:

  int->int filter FIR(int N, int wgts[N]) {
    work pop 1 push 1 {
      int i, sum = 0;
      for (i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
      push(sum);
      pop();
    }
  }

Stateful variant: the filter instead keeps int wgts[N] as internal state and updates it on each firing with wgts = adapt(wgts);
SGMS Overview: schedule diagram showing the prologue, the steady state, and the epilogue across PE0–PE3, with DMA operations interleaved between filter executions (the diagram notes T1 ≈ 4 T4) so communication overlaps computation.
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Processor Assignment • Assign filters to processors • Goal: equal work distribution • Graph partitioning? • Bin packing?
Example with two PEs: the original stream program has filters A (work 5), B (40), C (10), and D (5); the best two-way assignment leaves B alone on one PE, so speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 (with split S and join J added), the load balances better, and speedup = 60/32 ≈ 2. A minimal greedy sketch of this balancing problem follows below.
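As a point of reference for the "bin packing?" question, here is a minimal sketch of a greedy longest-processing-time heuristic that assigns filters to PEs by estimated work; it only illustrates the load-balancing problem (the work estimates mirror the A/B/C/D example above) and is not the integrated ILP that Streamroller actually uses.

  #include <stdio.h>

  #define NFILTERS 4
  #define NPES     2

  /* Work estimates for filters A, B, C, D from the example above. */
  static const int work[NFILTERS] = {5, 40, 10, 5};

  int main(void)
  {
      int load[NPES] = {0};
      int assign[NFILTERS];
      /* Greedy LPT: visit filters in decreasing work order (B, C, A, D here)
         and place each on the currently least-loaded PE. */
      int order[NFILTERS] = {1, 2, 0, 3};

      for (int k = 0; k < NFILTERS; k++) {
          int f = order[k], best = 0;
          for (int p = 1; p < NPES; p++)
              if (load[p] < load[best]) best = p;
          assign[f] = best;
          load[best] += work[f];
      }

      for (int f = 0; f < NFILTERS; f++)
          printf("filter %d -> PE %d\n", f, assign[f]);
      printf("max load = %d\n", load[0] > load[1] ? load[0] : load[1]);
      return 0;
  }

With these numbers the heuristic still ends up with a max load of 40 (speedup 1.5), which is exactly why fission of the heavy filter B is needed before assignment.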
Filter Fission Choices: diagram of candidate ways to split the filters across PE0–PE3 — which fission choice achieves speedup ≈ 4?
Split/Join overhead factored in … Integrated Fission + PE Assign • Exact solution based on Integer Linear Programming (ILP) • Objective function- Maximal load on any PE • Minimize • Result • Number of times to “split” each filter • Filter → processor mapping
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Forming the Software Pipeline • To achieve speedup: • All chunks should execute concurrently • Communication should be overlapped • Processor assignment alone is insufficient information
Timeline sketch: with A on PE0 and B on PE1, simply running Ai and then Bi leaves the PEs idle in turn; the software pipeline instead runs A1, A2, A3, … back to back on PE0, issues the A→B DMA for each iteration, and overlaps Ai+2 on PE0 with Bi on PE1.
Stage Assignment • Data flow traversal of the stream graph • For each producer i feeding consumer j: • Same PE: Sj ≥ Si (preserve causality — producer-consumer dependence) • Different PEs (PE 1 → PE 2): SDMA > Si and Sj = SDMA + 1 (communication-computation overlap) • Assign stages using the above two rules (a small sketch of the traversal follows below)
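A minimal sketch of applying these two rules during a topological (data flow) traversal, assuming a toy edge-list graph and a precomputed filter → PE map (proc[]); this is only an illustration of the rules, not the Streamroller implementation.

  #include <stdio.h>

  #define NF 4                                   /* toy chain A -> B -> C -> D */

  static const int src[] = {0, 1, 2};            /* producer of each edge */
  static const int dst[] = {1, 2, 3};            /* consumer of each edge */
  static const int proc[NF] = {0, 1, 1, 0};      /* hypothetical filter -> PE map */

  int main(void)
  {
      int stage[NF] = {0};

      /* Edges are visited in topological order of their producers. */
      for (int e = 0; e < 3; e++) {
          int i = src[e], j = dst[e];
          if (proc[i] == proc[j]) {
              /* Same PE: Sj >= Si (causality). */
              if (stage[j] < stage[i]) stage[j] = stage[i];
          } else {
              /* Different PEs: insert a DMA stage after Si, and set
                 Sj = S_DMA + 1 so the transfer overlaps other work. */
              int s_dma = stage[i] + 1;
              if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
          }
      }

      for (int f = 0; f < NF; f++)
          printf("filter %d: PE %d, stage %d\n", f, proc[f], stage[f]);
      return 0;
  }

On this toy chain the result mirrors the example on the next slide: the first filter lands in stage 0, its outgoing DMA in stage 1, the middle filters in stage 2, the next DMA in stage 3, and the last filter in stage 4.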
Stage Assignment Example: for the fissed example graph split across PE 0 and PE 1, A, S, and B1 fall in Stage 0, their outgoing DMA transfers in Stage 1, C, B2, and J in Stage 2, the DMAs feeding D in Stage 3, and D in Stage 4.
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Code Generation for Cell • Target the Synergistic Processing Elements (SPEs) • PS3 – up to 6 SPEs • QS20 – up to 16 SPEs • One thread / SPE • Challenge • Making a collection of independent threads implement a software pipeline • Adapt kernel-only code schema of a modulo schedule
Complete Example: the code generated for SPE1 for the example graph; each stage predicate turns on one iteration after the previous one, so in steady state SPE1's A, S, B1, and D overlap with the DMA transfers (AtoC, StoB2, B1toJ, JtoD, CtoD) and with B2, C, and J running on SPE2.

  void spe1_work() {
    char stage[5] = {0};
    stage[0] = 1;
    for (int i = 0; i < MAX; i++) {
      if (stage[0]) { A(); S(); B1(); }
      if (stage[1]) { }
      if (stage[2]) { JtoD(); CtoD(); }
      if (stage[3]) { }
      if (stage[4]) { D(); }
      barrier();
    }
  }
Experiments • StreamIt benchmarks • Signal processing – DCT, FFT, FMRadio, radar • MPEG2 decoder subset • Encryption – DES • Parallel sort – Bitonic • Range of 30 to 90 filters/benchmark • Platform • QS20 blade server – 2 Cell chips, 16 SPEs • Software • StreamIt to C – gcc 4.1 • IBM Cell SDK 2.1
Evaluation on QS20: speedup results chart for the benchmarks. • Split/join overhead reduces the benefit from fission • Barrier synchronization: 1 barrier per iteration of the stream graph
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06): comparison chart. • Solver time < 30 seconds for 16 processors
Conclusions • Streamroller • Efficient mapping of stream programs to multicore • Coarse grain software pipelining • Performance summary • 14.7x speedup on 16 cores • Up to 35% better than greedy solution (11% on average) • Scheduling framework • Tradeoff memory space vs. load balance • Memory constrained (embedded) systems • Cache based system
Questions? http://cccp.eecs.umich.edu http://www.eecs.umich.edu/~sdrg
Naïve Unfolding: diagram contrasting a completely stateless stream program — unfolded so that PE1–PE4 each run a full copy of the graph (filters A–F) with only stream data dependences between filters — against a stream program with stateful nodes, where state data dependences additionally link successive copies of the stateful filters across PE1–PE4 and require DMA transfers between them.
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06) • Highly dependent on graph structure • DMA overlapped in both ILP and Greedy • Speedup drops with exposed DMA • Solver time < 30 seconds for 16 processors
Pros and Cons of Unfolding • Pros • Simple • Good speedups for mostly stateless programs • Cons • Speedup affected by filters with large state (their copies are serialized by Sync + DMA, and all input data must be available before an iteration can begin) • Long latency for one iteration • Not suitable for real time scheduling
SGMS Assumptions • Data independent filter work functions • Static schedule • High bandwidth interconnect • EIB 96 Bytes/cycle • Low overhead DMA, low observed latency • Cores can run MIMD efficiently • Not (directly) suitable for GPUs
Speedup Upper Bounds: per-benchmark bound chart. Example: 15% of the work in one stateless filter ⇒ max speedup = 100/15 ≈ 6.7
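As a sketch of the Amdahl-style argument behind the bound: if the largest filter that is not split carries a fraction s of the total work W, then

  \[
    \text{speedup} \;\le\; \frac{W}{sW} \;=\; \frac{1}{s}, \qquad s = 0.15 \;\Rightarrow\; \text{speedup} \le \frac{1}{0.15} \approx 6.7 .
  \]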
Summary • 14.7x speedup on 16 cores • Preprocessing steps (fission) necessary • Hiding communication important • Extensions • Scaling to more cores • Constrained memory system • Other targets • SODA, FPGA, GPU
Stream Programming • Algorithmic style suitable for • Audio/video processing • Signal processing • Networking (packet processing, encryption/decryption) • Wireless domain, software defined radio • Characteristics • Independent filters • Segregated address space • Independent threads of control • Explicit communication • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03, ’08]
Example filter graph (MPEG Decoder): Bitstream parser, Zigzag unorder, Motion decode, Inverse Quant AC, Inverse Quant DC, 1D-DCT (row), Saturate, 1D-DCT (column), Bounded saturate.
Scheduling via Unfolding: Evaluation • IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server • StreamIt benchmark characteristics (table)
ILP Fission Constraints (backup): for a filter with fission options "original" (node 0_0), "fissed 2x" (nodes 1_0 … 1_3), and "fissed 3x" (nodes 2_0 … 2_4), b_{1,k} selects option k, a_{1,k,j,i} places copy j of option k on processor i, and M is a large constant:

  \begin{align}
  \text{Original actor:}\quad & \sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0 \\
  \text{Fissed 2x:}\quad & \sum_{j=0}^{3}\sum_{i=1}^{P} a_{1,1,j,i} - b_{1,1} \le M\, b_{1,1} \\
  & \sum_{j=0}^{3}\sum_{i=1}^{P} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M\, b_{1,1} \\
  \text{Fissed 3x:}\quad & \sum_{j=0}^{4}\sum_{i=1}^{P} a_{1,2,j,i} - b_{1,2} \le M\, b_{1,2} \\
  & \sum_{j=0}^{4}\sum_{i=1}^{P} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M\, b_{1,2} \\
  \text{Exactly one option:}\quad & b_{1,0} + b_{1,1} + b_{1,2} = 1
  \end{align}
SPE Code Template

  void spe_work() {
    char stage[N] = {0, 0, ..., 0};            /* bit mask to control active stages */
    stage[0] = 1;                              /* activate stage 0 */
    for (i = 0; i < max_iter + N - 1; i++) {   /* go through all input items */
      if (stage[N-1]) { ... }                  /* bit mask controls what gets executed */
      if (stage[N-2]) { ... }
      ...
      if (stage[0]) {
        Start_DMA();                           /* start DMA operation */
        FilterA_work();                        /* call filter work function */
      }
      if (i == max_iter - 1)
        stage[0] = 0;
      for (j = N-1; j >= 1; j--)               /* left shift bit mask (activate more stages) */
        stage[j] = stage[j-1];
      wait_for_dma();                          /* poll for completion of all outstanding DMAs */
      barrier();                               /* barrier synchronization */
    }
  }