Orchestrating the Execution of Stream Programs on Multicore Platforms Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Lab. University of Michigan
Cores are the New Gates (Shekhar Borkar, Intel): chart of # cores/chip versus year (1975–2010), plotting unicore, homogeneous multicore, and heterogeneous multicore designs — from the 4004, 8086, and Pentium line (1 core) through Core2Duo/Core2Quad, Opteron, Power4/Power6, Niagara, Xbox 360, RAW, Cell, and AMD Fusion, up toward RAZA XLR, Cavium, CISCO CSR1, NVIDIA G80, Larrabee, PicoChip, and AMBRIC at 64–512 cores. Courtesy: Gordon’06. Stream programming languages and frameworks riding this trend: CUDA, X10, Peakstream, Fortress, Accelerator, Ct, C T M, Rstream, Rapidmind (alongside C/C++/Java).
Stream Programming • Programming style • Embedded domain • Audio/video (H.264), wireless (WCDMA) • Mainstream • Continuous query processing (IBM SystemS), Search (Google Sawzall) • Stream • Collection of data records • Kernels/Filters • Functions applied to streams • Input/Output are streams • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03]
Compiling Stream Programs: how do we map a stream program onto a multicore system (Core 1–Core 4, each with its own Mem)? The compiler must do the heavy lifting: • Equal work distribution • Communication • Synchronization
Stream Graph Modulo Scheduling (SGMS) • Coarse grain software pipelining • Equal work distribution • Communication/computation overlap • Target: Cell processor — a PPE (Power PC) plus SPE0–SPE7, each SPU with a 256 KB local store (LS) and an MFC (DMA) engine, connected over the EIB to DRAM • Cores with disjoint address spaces • Explicit copy to access remote data • DMA engine independent of PEs • Filters = operations, cores = function units
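To make "explicit copy to access remote data" and the independent DMA engine concrete, here is a minimal double-buffering sketch for an SPE, assuming the standard Cell SDK spu_mfcio.h intrinsics; the chunk size, the effective address in_ea, and the filter_work function are illustrative assumptions, not part of the Streamroller-generated code.

  #include <spu_mfcio.h>
  #include <stdint.h>

  #define CHUNK 4096                              /* illustrative transfer size (bytes) */

  /* Double buffer in the 256 KB local store, aligned for DMA. */
  volatile char buf[2][CHUNK] __attribute__((aligned(128)));

  extern void filter_work(volatile char *data, int n);   /* hypothetical filter work function */

  /* Stream nchunks of data from main memory (effective address in_ea),
     overlapping each DMA transfer with computation on the previous chunk. */
  void process_stream(uint64_t in_ea, int nchunks)
  {
      int cur = 0;
      mfc_get(buf[cur], in_ea, CHUNK, /*tag=*/cur, 0, 0);      /* start first transfer */

      for (int i = 0; i < nchunks; i++) {
          int nxt = cur ^ 1;
          if (i + 1 < nchunks)                                  /* prefetch the next chunk */
              mfc_get(buf[nxt], in_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

          mfc_write_tag_mask(1 << cur);                         /* wait only for the current chunk */
          mfc_read_tag_status_all();

          filter_work(buf[cur], CHUNK);                         /* compute while the prefetch runs */
          cur = nxt;
      }
  }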
Streamroller Software Overview: front ends — SPEX (C++ with stream extensions), StreamIt, and Stylized C — are lowered to a common Stream IR. Streamroller runs Profiling, Analysis, Scheduling, and Code Generation on this IR, targeting multicore (Cell, Core2Quad, Niagara), SODA (low power multicore processor for SDR), and a custom application engine. Focus of this talk: scheduling and code generation for the multicore (Cell) target.
Preliminaries • Synchronous Data Flow (SDF) [Lee ’87] • StreamIt [Thies ’02] • Filters push and pop items from input/output FIFOs

Stateless filter:

  int->int filter FIR(int N, int wgts[N]) {
    work pop 1 push 1 {
      int i, sum = 0;
      for (i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
      push(sum);
      pop();
    }
  }

Stateful variant: the filter instead keeps int wgts[N] as internal state and updates it on each firing with wgts = adapt(wgts);
SGMS Overview: schedule diagram showing the prologue, the steady state, and the epilogue across PE0–PE3, with DMA operations interleaved between filter executions (the diagram notes T1 ≈ 4 T4) so communication overlaps computation.
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Processor Assignment • Assign filters to processors • Goal: equal work distribution • Graph partitioning? • Bin packing?
Example with two PEs: the original stream program has filters A (work 5), B (40), C (10), and D (5); the best two-way assignment leaves B alone on one PE, so speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 (with split S and join J added), the load balances better, and speedup = 60/32 ≈ 2. A minimal greedy sketch of this balancing problem follows below.
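As a point of reference for the "bin packing?" question, here is a minimal sketch of a greedy longest-processing-time heuristic that assigns filters to PEs by estimated work; it only illustrates the load-balancing problem (the work estimates mirror the A/B/C/D example above) and is not the integrated ILP that Streamroller actually uses.

  #include <stdio.h>

  #define NFILTERS 4
  #define NPES     2

  /* Work estimates for filters A, B, C, D from the example above. */
  static const int work[NFILTERS] = {5, 40, 10, 5};

  int main(void)
  {
      int load[NPES] = {0};
      int assign[NFILTERS];
      /* Greedy LPT: visit filters in decreasing work order (B, C, A, D here)
         and place each on the currently least-loaded PE. */
      int order[NFILTERS] = {1, 2, 0, 3};

      for (int k = 0; k < NFILTERS; k++) {
          int f = order[k], best = 0;
          for (int p = 1; p < NPES; p++)
              if (load[p] < load[best]) best = p;
          assign[f] = best;
          load[best] += work[f];
      }

      for (int f = 0; f < NFILTERS; f++)
          printf("filter %d -> PE %d\n", f, assign[f]);
      printf("max load = %d\n", load[0] > load[1] ? load[0] : load[1]);
      return 0;
  }

With these numbers the heuristic still ends up with a max load of 40 (speedup 1.5), which is exactly why fission of the heavy filter B is needed before assignment.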
Filter Fission Choices: diagram of candidate ways to split the filters across PE0–PE3 — which fission choice achieves speedup ≈ 4?
Split/Join overhead factored in … Integrated Fission + PE Assign • Exact solution based on Integer Linear Programming (ILP) • Objective function- Maximal load on any PE • Minimize • Result • Number of times to “split” each filter • Filter → processor mapping
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Forming the Software Pipeline • To achieve speedup: • All chunks should execute concurrently • Communication should be overlapped • Processor assignment alone is insufficient information
Timeline sketch: with A on PE0 and B on PE1, simply running Ai and then Bi leaves the PEs idle in turn; the software pipeline instead runs A1, A2, A3, … back to back on PE0, issues the A→B DMA for each iteration, and overlaps Ai+2 on PE0 with Bi on PE1.
Stage Assignment • Data flow traversal of the stream graph • For each producer i feeding consumer j: • Same PE: Sj ≥ Si (preserve causality — producer-consumer dependence) • Different PEs (PE 1 → PE 2): SDMA > Si and Sj = SDMA + 1 (communication-computation overlap) • Assign stages using the above two rules (a small sketch of the traversal follows below)
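A minimal sketch of applying these two rules during a topological (data flow) traversal, assuming a toy edge-list graph and a precomputed filter → PE map (proc[]); this is only an illustration of the rules, not the Streamroller implementation.

  #include <stdio.h>

  #define NF 4                                   /* toy chain A -> B -> C -> D */

  static const int src[] = {0, 1, 2};            /* producer of each edge */
  static const int dst[] = {1, 2, 3};            /* consumer of each edge */
  static const int proc[NF] = {0, 1, 1, 0};      /* hypothetical filter -> PE map */

  int main(void)
  {
      int stage[NF] = {0};

      /* Edges are visited in topological order of their producers. */
      for (int e = 0; e < 3; e++) {
          int i = src[e], j = dst[e];
          if (proc[i] == proc[j]) {
              /* Same PE: Sj >= Si (causality). */
              if (stage[j] < stage[i]) stage[j] = stage[i];
          } else {
              /* Different PEs: insert a DMA stage after Si, and set
                 Sj = S_DMA + 1 so the transfer overlaps other work. */
              int s_dma = stage[i] + 1;
              if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
          }
      }

      for (int f = 0; f < NF; f++)
          printf("filter %d: PE %d, stage %d\n", f, proc[f], stage[f]);
      return 0;
  }

On this toy chain the result mirrors the example on the next slide: the first filter lands in stage 0, its outgoing DMA in stage 1, the middle filters in stage 2, the next DMA in stage 3, and the last filter in stage 4.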
Stage Assignment Example: for the fissed example graph split across PE 0 and PE 1, A, S, and B1 fall in Stage 0, their outgoing DMA transfers in Stage 1, C, B2, and J in Stage 2, the DMAs feeding D in Stage 3, and D in Stage 4.
SGMS Phases: (1) Fission + processor assignment — load balance; (2) Stage assignment — causality, DMA overlap; (3) Code generation
Code Generation for Cell • Target the Synergistic Processing Elements (SPEs) • PS3 – up to 6 SPEs • QS20 – up to 16 SPEs • One thread / SPE • Challenge • Making a collection of independent threads implement a software pipeline • Adapt kernel-only code schema of a modulo schedule
Complete Example: the code generated for SPE1 for the example graph; each stage predicate turns on one iteration after the previous one, so in steady state SPE1's A, S, B1, and D overlap with the DMA transfers (AtoC, StoB2, B1toJ, JtoD, CtoD) and with B2, C, and J running on SPE2.

  void spe1_work() {
    char stage[5] = {0};
    stage[0] = 1;
    for (int i = 0; i < MAX; i++) {
      if (stage[0]) { A(); S(); B1(); }
      if (stage[1]) { }
      if (stage[2]) { JtoD(); CtoD(); }
      if (stage[3]) { }
      if (stage[4]) { D(); }
      barrier();
    }
  }
Experiments • StreamIt benchmarks • Signal processing – DCT, FFT, FMRadio, radar • MPEG2 decoder subset • Encryption – DES • Parallel sort – Bitonic • Range of 30 to 90 filters/benchmark • Platform • QS20 blade server – 2 Cell chips, 16 SPEs • Software • StreamIt to C – gcc 4.1 • IBM Cell SDK 2.1
Evaluation on QS20: speedup results chart for the benchmarks. • Split/join overhead reduces the benefit from fission • Barrier synchronization: 1 barrier per iteration of the stream graph
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06): comparison chart. • Solver time < 30 seconds for 16 processors
Conclusions • Streamroller • Efficient mapping of stream programs to multicore • Coarse grain software pipelining • Performance summary • 14.7x speedup on 16 cores • Up to 35% better than greedy solution (11% on average) • Scheduling framework • Tradeoff memory space vs. load balance • Memory constrained (embedded) systems • Cache based system
Questions? http://cccp.eecs.umich.edu http://www.eecs.umich.edu/~sdrg
Naïve Unfolding: diagram contrasting a completely stateless stream program — unfolded so that PE1–PE4 each run a full copy of the graph (filters A–F) with only stream data dependences between filters — against a stream program with stateful nodes, where state data dependences additionally link successive copies of the stateful filters across PE1–PE4 and require DMA transfers between them.
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06) • Highly dependent on graph structure • DMA overlapped in both ILP and Greedy • Speedup drops with exposed DMA • Solver time < 30 seconds for 16 processors
Pros and Cons of Unfolding • Pros • Simple • Good speedups for mostly stateless programs • Cons • Speedup affected by filters with large state (their copies are serialized by Sync + DMA, and all input data must be available before an iteration can begin) • Long latency for one iteration • Not suitable for real time scheduling
SGMS Assumptions • Data independent filter work functions • Static schedule • High bandwidth interconnect • EIB 96 Bytes/cycle • Low overhead DMA, low observed latency • Cores can run MIMD efficiently • Not (directly) suitable for GPUs
Speedup Upper Bounds: per-benchmark bound chart. Example: 15% of the work in one stateless filter ⇒ max speedup = 100/15 ≈ 6.7
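As a sketch of the Amdahl-style argument behind the bound: if the largest filter that is not split carries a fraction s of the total work W, then

  \[
    \text{speedup} \;\le\; \frac{W}{sW} \;=\; \frac{1}{s}, \qquad s = 0.15 \;\Rightarrow\; \text{speedup} \le \frac{1}{0.15} \approx 6.7 .
  \]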
Summary • 14.7x speedup on 16 cores • Preprocessing steps (fission) necessary • Hiding communication important • Extensions • Scaling to more cores • Constrained memory system • Other targets • SODA, FPGA, GPU
Stream Programming • Algorithmic style suitable for • Audio/video processing • Signal processing • Networking (packet processing, encryption/decryption) • Wireless domain, software defined radio • Characteristics • Independent filters • Segregated address space • Independent threads of control • Explicit communication • Coarse grain dataflow • Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03, ’08]
Example filter graph (MPEG Decoder): Bitstream parser, Zigzag unorder, Motion decode, Inverse Quant AC, Inverse Quant DC, 1D-DCT (row), Saturate, 1D-DCT (column), Bounded saturate.
Scheduling via Unfolding: Evaluation • IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server • StreamIt benchmark characteristics (table)
ILP Fission Constraints (backup): for a filter with fission options "original" (node 0_0), "fissed 2x" (nodes 1_0 … 1_3), and "fissed 3x" (nodes 2_0 … 2_4), b_{1,k} selects option k, a_{1,k,j,i} places copy j of option k on processor i, and M is a large constant:

  \begin{align}
  \text{Original actor:}\quad & \sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0 \\
  \text{Fissed 2x:}\quad & \sum_{j=0}^{3}\sum_{i=1}^{P} a_{1,1,j,i} - b_{1,1} \le M\, b_{1,1} \\
  & \sum_{j=0}^{3}\sum_{i=1}^{P} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M\, b_{1,1} \\
  \text{Fissed 3x:}\quad & \sum_{j=0}^{4}\sum_{i=1}^{P} a_{1,2,j,i} - b_{1,2} \le M\, b_{1,2} \\
  & \sum_{j=0}^{4}\sum_{i=1}^{P} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M\, b_{1,2} \\
  \text{Exactly one option:}\quad & b_{1,0} + b_{1,1} + b_{1,2} = 1
  \end{align}
SPE Code Template

  void spe_work() {
    char stage[N] = {0, 0, ..., 0};            /* bit mask to control active stages */
    stage[0] = 1;                              /* activate stage 0 */
    for (i = 0; i < max_iter + N - 1; i++) {   /* go through all input items */
      if (stage[N-1]) { ... }                  /* bit mask controls what gets executed */
      if (stage[N-2]) { ... }
      ...
      if (stage[0]) {
        Start_DMA();                           /* start DMA operation */
        FilterA_work();                        /* call filter work function */
      }
      if (i == max_iter - 1)
        stage[0] = 0;
      for (j = N-1; j >= 1; j--)               /* left shift bit mask (activate more stages) */
        stage[j] = stage[j-1];
      wait_for_dma();                          /* poll for completion of all outstanding DMAs */
      barrier();                               /* barrier synchronization */
    }
  }