Adaptive Input-aware Compilation for Graphics Engines
Mehrzad Samadi (1), Amir Hormati (2), Mojtaba Mehrara (3), Janghaeng Lee (1) and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor, (2) Microsoft Research, (3) NVIDIA Research
GPU Performance Gap
• High performance at low cost
• Peak performance is difficult to achieve
[Chart: peak vs. in-practice performance across GPU generations - GeForce 7800 GTX, 8800 GTX, GTX 280, GTX 480, GTX 590, GTX 680]
TMV Performance on Various Input
[Chart: transposed matrix-vector multiplication performance on a square matrix and two rectangular matrices]
GPU Execution Model
• A grid of thread blocks is distributed across the streaming multiprocessors (SM 0 - SM 7)
• Each SM executes the threads of its assigned blocks and has its own registers and shared memory
[Diagram: Grid 1 mapped onto 8 SMs, each with a register file and shared memory, each running threads 0-7]
Transposed Matrix-Vector Multiplication (4 x 1M)
• One block per matrix row: only 4 blocks (threads 0-15 each) are launched
• Blocks 0-3 occupy SM 0 - SM 3, while SM 4 - SM 7 sit IDLE
[Diagram: 4 blocks on 8 SMs, half the machine idle]
Transposed Matrix-Vector Multiplication (1M x 4)
• 1,000,000 blocks are launched: 125,000 blocks per SM
• Blocks 0-7 and 8-15 start on SM 0 - SM 7; the remaining blocks, up to block 1,000,000, queue behind them
[Diagram: 1M blocks oversubscribing 8 SMs]
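The imbalance on the two slides above can be sketched with a small back-of-the-envelope calculation. This is our own illustration (not from the talk), assuming one thread block per matrix row and an 8-SM GPU with round-robin block scheduling:

```python
# Hypothetical sketch: estimate SM utilization for a kernel that maps
# one thread block to each matrix row, on a GPU with `num_sms` SMs.
def blocks_per_sm(num_rows, num_sms=8):
    """Return (busy_sms, blocks_queued_on_each_busy_sm)."""
    busy = min(num_rows, num_sms)
    per_sm = -(-num_rows // num_sms)  # ceiling division
    return busy, per_sm

# 4 x 1M matrix: only 4 blocks, so 4 of the 8 SMs sit idle.
print(blocks_per_sm(4))          # (4, 1)
# 1M x 4 matrix: 1,000,000 blocks, 125,000 queued per SM.
print(blocks_per_sm(1_000_000))  # (8, 125000)
```

The same code thus under-utilizes the machine on one input shape and oversubscribes it on another, which is the input-portability problem Adaptic targets.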
GPU Programming Challenge - Portability
• GPU architectures evolve quickly: 240 cores (2008), 512 cores (2011), 1536 cores (2012)
• Goal: the fastest matrix-vector multiplication for any GPU and any input size
Adaptic
• Adaptive Input-aware Compilation for GPUs
• Device-portable
• Input-portable
• Programmers can focus on the algorithm without worrying about low-level details
• Streaming language
• Higher level of abstraction
• Separates memory accesses from the algorithm
• e.g., StreamIt
StreamIt
• Higher level of abstraction
• Decouples computation from memory accesses
• Coarse-grained exposed parallelism and exposed communication
• Streaming actors use buffers to communicate
• Much recent work extends the portability of streaming applications
[Diagram: a stream graph - a splitter feeding parallel actors (Actors 2-5), merged by a joiner into Actor 6]
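The actor model above can be sketched in a few lines. This is our own Python illustration of the pop/push firing semantics, not StreamIt itself; the `A[i, j]` notation (an actor that pops i items and pushes j per firing) follows the later slides:

```python
# Minimal model of a streaming actor: each firing pops `pop` items from
# an input FIFO, runs the work function, and pushes `push` items out.
from collections import deque

def run_actor(work, pop, push, inbuf, outbuf):
    """Fire repeatedly while the input buffer holds at least `pop` items."""
    while len(inbuf) >= pop:
        popped = [inbuf.popleft() for _ in range(pop)]
        out = work(popped)
        assert len(out) == push  # actor must honor its declared push rate
        outbuf.extend(out)

inbuf = deque(range(8))
outbuf = deque()
# An A[2, 1] actor: pops two values, pushes their sum.
run_actor(lambda xs: [sum(xs)], 2, 1, inbuf, outbuf)
print(list(outbuf))  # [1, 5, 9, 13]
```

Because communication is confined to these buffers, a compiler like Adaptic is free to decide where each buffer lives (registers, shared memory, or global memory) and how actors map to threads and blocks.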
Compilation Flow in Adaptic
• Offline compilation turns StreamIt code into several CUDA kernels (Kernel 0 - Kernel 3), each optimized for a different input range from smallest to largest, guided by a performance model
• Memory access optimization (input-unaware) - why? Global memory accesses have large latency. Optimizations: memory restructuring, coalesced access, neighboring access, data reuse
• Actor segmentation (large inputs) - splits actors so more blocks are generated, alleviating resource under-utilization. Optimizations: stream reduction, intra-actor parallelization
• Actor integration (small inputs) - merges several actors into one, alleviating high resource contention. Optimizations: vertical integration, horizontal integration
• At run time, the executable launches the kernel matching the target GPU and the input range
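The run-time step of the flow above is a simple dispatch: several pre-compiled kernels, one picked by input size. A hedged sketch, with made-up kernel names and thresholds purely for illustration:

```python
# Illustrative runtime dispatch: ADAPTIC-style executables carry several
# kernels compiled offline, each tuned for an input-size range.
def pick_kernel(input_size, table):
    """table: list of (upper_bound, kernel_name), sorted by bound ascending."""
    for bound, kernel in table:
        if input_size <= bound:
            return kernel
    return table[-1][1]  # fall back to the largest-input kernel

# Hypothetical table produced by offline compilation.
KERNELS = [(1 << 10, "kernel0_small"),
           (1 << 16, "kernel1_medium"),
           (1 << 22, "kernel2_large"),
           (float("inf"), "kernel3_largest")]

print(pick_kernel(512, KERNELS))        # kernel0_small
print(pick_kernel(5_000_000, KERNELS))  # kernel3_largest
```

The thresholds themselves would come from the performance model, not from fixed constants as here.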
Memory Optimization
• Global memory has large access latency
• Threads do not access the words in sequence: no coalescing
• Example: threads 0-3 each run actor A[4,4] (A[i, j] denotes an actor with i pops and j pushes); each thread reads its own contiguous chunk, so consecutive threads touch words 0, 4, 8, 12 - a strided, uncoalesced pattern
[Diagram: threads 0-3 running A[4,4] issuing strided global memory accesses]
Memory Optimization
• After memory restructuring, the data for the A[4,4] actors is interleaved so that on each pop, threads 0-3 access consecutive words 0, 1, 2, 3
• Consecutive threads now access the words in sequence, so the accesses coalesce into a single memory transaction
[Diagram: the same A[4,4] actors after restructuring, with coalesced global memory accesses]
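The before/after access patterns on these two slides can be reproduced with a short sketch. This is our own illustration, assuming 4 threads each running an A[4,4] actor as in the example:

```python
# Addresses touched by each thread on its first pop, before and after
# a memory-restructuring transformation like ADAPTIC's.
def first_pop_addresses(num_threads, pops, interleaved):
    """Strided layout: thread t starts at t*pops (uncoalesced).
    Interleaved layout: thread t starts at word t (coalesced)."""
    if interleaved:
        return [t for t in range(num_threads)]
    return [t * pops for t in range(num_threads)]

before = first_pop_addresses(4, 4, interleaved=False)
after = first_pop_addresses(4, 4, interleaved=True)
print(before)  # [0, 4, 8, 12] -> strided, four separate transactions
print(after)   # [0, 1, 2, 3]  -> consecutive, one coalesced transaction
```

On real hardware the coalesced version lets the memory controller service one wide transaction per warp instead of several, which is why this optimization is applied regardless of input size.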
Actor Segmentation
• 4 x 1M transposed matrix-vector multiplication: Actor 0 is split into four actors, Actor 0 - Actor 3
• Instead of blocks 0-3, the segmented kernel launches blocks 0-127 (blocks 0-31, 32-63, 64-95, 96-127), generating enough blocks to fill all SMs
[Diagram: one actor on 4 blocks vs. four segmented actors spread over 128 blocks]
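Segmentation of a reduction-style actor can be sketched as splitting its iteration space into partial actors plus a small joiner. Our own illustration, not the paper's algorithm:

```python
# Illustrative actor segmentation: one summation actor over N elements
# is split into `segments` partial actors (each mappable to its own
# thread block), and a joiner combines the partial results.
def segmented_sum(data, segments):
    n = len(data)
    step = -(-n // segments)  # ceiling division: elements per segment
    partials = [sum(data[i:i + step]) for i in range(0, n, step)]
    return sum(partials), len(partials)  # (joined result, blocks used)

total, blocks = segmented_sum(list(range(1000)), 32)
print(total, blocks)  # 499500 32
```

The result is unchanged, but the work is now expressed as 32 independent blocks instead of one, which is exactly what a large input needs to keep every SM busy.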
Actor Integration
• Merges several actors or threads to balance the threads' workloads
• Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory
• Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions
[Diagram: a stream graph before and after fusion - the splitter's branches collapse into Fused Actor 0 and Fused Actor 1, followed by Actor 6 after the joiner]
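Vertical integration amounts to fusing two pipelined actors so the intermediate stream never round-trips through global memory. A minimal sketch with two made-up actors (ours, purely for illustration):

```python
# Two pipelined actors, each a separate kernel in the unfused version:
def actor_scale(xs):   # first actor: scale each element
    return [2 * x for x in xs]

def actor_offset(xs):  # second actor: add a constant
    return [x + 1 for x in xs]

# Vertically integrated actor: one pass, the intermediate value stays
# in a register / shared memory instead of being written back out.
def fused(xs):
    return [2 * x + 1 for x in xs]

data = [0, 1, 2, 3]
assert fused(data) == actor_offset(actor_scale(data))
print(fused(data))  # [1, 3, 5, 7]
```

For small inputs, where each actor alone cannot fill the machine, this fusion trades kernel-launch and off-chip-traffic overhead for a single denser kernel.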
Experimental Setup
• CPU: Intel Xeon X5650
• GPUs
• NVIDIA Tesla C2050, 3GB GDDR5
• NVIDIA GTX 285, 2GB GDDR3
• Benchmarks
• CUBLAS Library 3.2
• NVIDIA SDK 3.1
Results (Speedup)
[Chart: speedup vs. input size]
Results (BiCGSTAB)
[Chart: BiCGSTAB performance, Adaptic vs. input-unaware compilation]
Summary
• GPU performance is affected by the GPU model and the input
• The CUDA / OpenCL programming model lacks architecture and input portability
• Scientific applications use irregular inputs, making optimized performance hard to achieve
• Proposed Adaptic
• Architecture- and input-portable compilation with a streaming language
• Showed speedup over CUBLAS / SDK across a wide range of input sizes