EECS 583 – Class 20: Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
Announcements & Reading Material
• Exams graded and will be returned in Wednesday's class
• This class
  • "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008
• Next class – Research Topic 3: Security
  • "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005
Stream Graph Modulo Scheduling (SGMS)
• Coarse grain software pipelining
  • Equal work distribution
  • Communication/computation overlap
  • Synchronization costs
• Target: Cell processor
  • Cores with disjoint address spaces
  • Explicit copy to access remote data
    • DMA engine independent of PEs
• Filters = operations, cores = function units
[Figure: Cell block diagram – eight SPEs (SPE0–SPE7), each with an SPU, 256 KB local store, and MFC (DMA engine), connected by the EIB to the PPE (PowerPC) and DRAM]
Preliminaries
• Synchronous Data Flow (SDF) [Lee '87]
• StreamIt [Thies '02]
• Filters push and pop items from input/output FIFOs

Stateless filter (the weights are read-only parameters):

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

Stateful variant (the weights become mutable filter state):

    int wgts[N];
    wgts = adapt(wgts);
SGMS Overview
[Figure: execution timeline on PE0–PE3 – a prologue fills the pipeline, DMA transfers overlap computation in the steady state, and an epilogue drains it; the pipelined schedule approaches T1 ≈ 4 T4, i.e., near-linear speedup on 4 PEs]
SGMS Phases
• Fission + processor assignment (load balance)
• Stage assignment (causality, DMA overlap)
• Code generation
Processor Assignment: Maximizing Throughput
• Assign each filter to a processor, choosing the assignment (over all filters i = 1, …, N and PEs j = 1, …, P) that minimizes the initiation interval II
[Figure: stream graph with six filters A–F and workloads W(A)=20, W(B)=20, W(C)=20, W(D)=30, W(E)=50, W(F)=30]
• On a single PE, everything is serialized: T1 = 170
• On four PEs, a balanced workload gives minimum II = 50 (T2 = 50), i.e., maximum throughput
• Speedup = T1/T2 = 170/50 = 3.4
Need More Than Just Processor Assignment
• Assign filters to processors
  • Goal: equal work distribution
• Graph partitioning?
• Bin packing?
[Figure: original stream program A → B → C, with B's workload (40) dominating – the best two-PE mapping puts B alone on one PE, so speedup = 60/40 = 1.5. The modified program fissions B into B1 and B2 with split (S) and join (J) nodes, allowing a balanced mapping with speedup = 60/32 ≈ 1.9]
Filter Fission Choices
[Figure: candidate fissions of the stream graph across PE0–PE3]
• Speedup ~ 4?
Integrated Fission + PE Assign
• Exact solution based on Integer Linear Programming (ILP)
• Objective function: minimize the maximal load on any PE
• Split/join overhead factored in
• Result
  • Number of times to "split" each filter
  • Filter → processor mapping
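As a rough sketch of the processor-assignment core of such a formulation (illustrative only; the paper's actual ILP also encodes fission counts and split/join overhead), with binary variables x_ij meaning "filter i runs on PE j" and w_i the work of filter i:

\begin{aligned}
\text{minimize } \; & II \\
\text{subject to } \; & \sum_{j=1}^{P} x_{ij} = 1 \quad \text{for all filters } i = 1,\dots,N \quad \text{(each filter on exactly one PE)} \\
& \sum_{i=1}^{N} w_i\, x_{ij} \le II \quad \text{for all PEs } j = 1,\dots,P \quad \text{(no PE's load exceeds } II\text{)} \\
& x_{ij} \in \{0,1\}
\end{aligned}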
Step 2: Forming the Software Pipeline
• To achieve speedup
  • All chunks should execute concurrently
  • Communication should be overlapped
• Processor assignment alone is insufficient information
[Figure: with A on PE0 and B on PE1, a naive schedule serializes A, the A→B transfer, and B each iteration; pipelining lets PE0 compute A_{i+2} while the DMA moves A→B data and PE1 computes B_i]
Stage Assignment
• Data flow traversal of the stream graph
• Rule 1 – preserve causality (producer-consumer dependence): for a producer filter i feeding a consumer filter j, Sj ≥ Si
• Rule 2 – communication-computation overlap: if i (on PE 1) and j (on PE 2) are on different processors, give the DMA transfer its own stage with SDMA > Si, and set Sj = SDMA + 1
• Assign stages using the above two rules
Stage Assignment Example
[Figure: the fissioned example graph assigned to PE 0 and PE 1 –
  Stage 0: A, S, B1
  Stage 1: DMA transfers
  Stage 2: C, B2, J
  Stage 3: DMA transfers
  Stage 4: D]
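A minimal sketch of the two rules in code (illustrative names, not the paper's implementation), assuming edges are visited in topological order and pe[f] holds each filter's processor from the assignment phase:

```cuda
/* Sketch of stage assignment via the two rules above (illustrative). */
void assign_stages(int nfilters, int nedges,
                   const int src[], const int dst[], const int pe[],
                   int stage[]) {
    for (int f = 0; f < nfilters; f++)
        stage[f] = 0;

    /* Visit edges in topological order of their destination filters. */
    for (int e = 0; e < nedges; e++) {
        int i = src[e], j = dst[e];
        if (pe[i] == pe[j]) {
            /* Rule 1 (causality): consumer stage >= producer stage. */
            if (stage[j] < stage[i])
                stage[j] = stage[i];
        } else {
            /* Rule 2 (overlap): a DMA stage strictly after the producer;
             * the consumer starts one stage after the DMA. */
            int dma = stage[i] + 1;            /* S_DMA > S_i */
            if (stage[j] < dma + 1)
                stage[j] = dma + 1;            /* S_j = S_DMA + 1 */
        }
    }
}
```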
Step 3: Code Generation for Cell
• Target the Synergistic Processing Elements (SPEs)
  • PS3 – up to 6 SPEs
  • QS20 – up to 16 SPEs
• One thread / SPE
• Challenge
  • Making a collection of independent threads implement a software pipeline
  • Adapt the kernel-only code schema of a modulo schedule
Complete Example
Each SPE runs one loop; stage predicates select which stages are active in a given iteration (stages 1 and 3 carry no compute for SPE1):

    void spe1_work() {
      char stage[5] = {0};
      stage[0] = 1;
      for (int i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { /* DMA stage: AtoC, StoB2, B1toJ */ }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { /* DMA stage */ }
        if (stage[4]) { D(); }
        barrier();
      }
    }

[Figure: pipeline timeline across SPE1, DMA1, SPE2, DMA2 – iterations of A, S, B1 on SPE1 overlap with the AtoC, StoB2, B1toJ transfers, C, B2, J on SPE2, the JtoD and CtoD transfers, and D downstream]
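The loop above omits how the stage predicates evolve over time. In a kernel-only schema they shift each iteration so one loop body serves as prologue, steady-state kernel, and epilogue. A hedged sketch (illustrative, not the generated Streamroller code; barrier() stands in for the cross-SPE synchronization on the slide):

```cuda
/* Sketch: kernel-only stage-predicate management. */
#define STAGES 5

static void barrier(void) { /* stand-in for the runtime's cross-SPE barrier */ }

void spe_work_skeleton(int max_iters) {
    char stage[STAGES] = {0};
    for (int i = 0; i < max_iters + STAGES - 1; i++) {
        /* Shift predicates: stage s turns on one iteration after s-1,
         * filling the pipeline (prologue) implicitly. */
        for (int s = STAGES - 1; s > 0; s--)
            stage[s] = stage[s - 1];
        /* Stop feeding new iterations at the end, draining the
         * pipeline (epilogue) implicitly. */
        stage[0] = (i < max_iters);

        if (stage[0]) { /* stage-0 work, e.g. A(); S(); B1(); */ }
        if (stage[2]) { /* stage-2 work, e.g. JtoD(); CtoD(); */ }
        if (stage[4]) { /* stage-4 work, e.g. D(); */ }
        barrier();
    }
}
```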
SGMS (ILP) vs. Greedy (MIT method, ASPLOS '06)
• Solver time < 30 seconds for 16 processors
SGMS Conclusions
• Streamroller
  • Efficient mapping of stream programs to multicore
  • Coarse grain software pipelining
• Performance summary
  • 14.7x speedup on 16 cores
  • Up to 35% better than the greedy solution (11% on average)
• Scheduling framework
  • Tradeoff: memory space vs. load balance
  • Memory constrained (embedded) systems
  • Cache based systems
Discussion Points
• Is it possible to convert stateful filters into stateless ones?
• What if the application does not behave as you expect?
  • Filters change execution time?
  • Memory faster/slower than expected?
• Could this be adapted for a more conventional multiprocessor with caches?
• Can C code be automatically streamized?
• Now you have seen 3 forms of software pipelining:
  • 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
  • Where else can it be used?
Why GPUs?
Efficiency of GPUs

                          GTX 285         i7
  Flop rate (SP)          1062 GFLOPS     102 GFLOPS
  DP flop rate            88.5 GFLOPS     51 GFLOPS    (GTX 480: 168 GFLOPS)
  Flops per watt          5.2 GFLOP/W     0.78 GFLOP/W
  Flops per dollar        3.54 GFLOP/$    0.36 GFLOP/$
  Memory bandwidth        159 GB/s        32 GB/s
GPU Architecture
[Figure: GPU block diagram – 30 streaming multiprocessors (SM 0 – SM 29), each with 8 scalar cores (0–7), shared memory, and registers; the SMs connect through an interconnection network to global (device) memory, and to the host CPU and host memory over a PCIe bridge]
CUDA
• "Compute Unified Device Architecture"
• General purpose programming model
  • User kicks off batches of threads on the GPU
• Advantages of CUDA
  • Interface designed for compute – graphics-free API
  • Orchestration of on-chip cores
  • Explicit GPU memory management
  • Full support for integer and bitwise operations
Programming Model
[Figure: host/device timeline – the host launches Kernel 1, which executes on the device as Grid 1; the host then launches Kernel 2, which executes as Grid 2]
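A minimal CUDA sketch of this model (names are illustrative): the host manages device memory explicitly, launches a grid of thread blocks, and each thread handles one element.

```cuda
// Minimal CUDA example (illustrative): each thread increments one element.
#include <cstdio>

__global__ void add_one(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (idx < n)
        data[idx] += 1;
}

int main() {
    const int n = 1024;
    int h[n], *d;
    for (int i = 0; i < n; i++) h[i] = i;

    cudaMalloc(&d, n * sizeof(int));                   // explicit GPU memory management
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<n / 256, 256>>>(d, n);                   // grid of 4 blocks, 256 threads each

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[0] = %d\n", h[0]);                       // prints 1
    return 0;
}
```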
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the SMs; each SM executes its assigned blocks on its 8 cores using its shared memory and registers]
Warp Generation
[Figure: blocks 0–3 assigned to SM0; the threads of a block are partitioned by thread id into 32-thread warps – threads 0–31 form Warp 0, threads 32–63 form Warp 1]
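The warp a thread lands in follows directly from its id within the block; a small sketch (assuming the usual 32-thread warp size):

```cuda
// Sketch: computing a thread's warp and lane within its block.
__global__ void warp_info() {
    int tid  = threadIdx.x;   // linear thread id within the block
    int warp = tid / 32;      // threads 0–31 -> warp 0, 32–63 -> warp 1, ...
    int lane = tid % 32;      // position within the warp
    (void)warp; (void)lane;   // placeholders; a real kernel would use these
}
```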
Memory Hierarchy
• Per-thread: registers (int RegisterVar) and local memory (int LocalVarArray[10])
• Per-block: shared memory (__shared__ int SharedVar)
• Per-application, accessible from host and device:
  • Global memory (__device__ int GlobalVar)
  • Constant memory (__constant__ int ConstVar)
  • Texture memory (texture<float, 1, ReadMode> TextureVar)
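A small sketch showing these memory spaces together in one kernel (illustrative; the legacy texture reference API is omitted):

```cuda
// Sketch: one variable in each CUDA memory space.
__device__   int GlobalVar;            // global memory, per application
__constant__ int ConstVar;             // constant memory, read-only on device

__global__ void spaces_demo() {
    int RegisterVar = threadIdx.x;     // register, per thread
    int LocalVarArray[10];             // local memory, per thread
    __shared__ int SharedVar;          // shared memory, per block

    LocalVarArray[0] = RegisterVar;
    if (threadIdx.x == 0)
        SharedVar = ConstVar;          // one thread initializes the block's copy
    __syncthreads();                   // make SharedVar visible to the block
    if (threadIdx.x == 0)
        GlobalVar = SharedVar + LocalVarArray[0];
}
```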