Adaptive Input-aware Compilation for Graphics Engines
Mehrzad Samadi (1), Amir Hormati (2), Mojtaba Mehrara (3), Janghaeng Lee (1) and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor, (2) Microsoft Research, (3) NVIDIA Research
GPU Performance Gap
• High performance at low cost
• Peak performance is difficult to achieve
[Chart: peak vs. in-practice performance across GPU generations - GeForce 7800 GTX, 8800 GTX, GTX 280, GTX 480, GTX 590, GTX 680]
TMV Performance on Various Input
[Chart: transposed matrix-vector multiplication performance on a square matrix and two rectangular matrices]
GPU Execution Model
• A grid of thread blocks is distributed across the streaming multiprocessors (SM 0 - SM 7)
• Each SM executes the threads of its assigned blocks and has its own registers and shared memory
[Diagram: Grid 1 mapped onto 8 SMs, each with a register file and shared memory, each running threads 0-7]
Transposed Matrix-Vector Multiplication (4 x 1M)
• One block per matrix row: only 4 blocks (threads 0-15 each) are launched
• Blocks 0-3 occupy SM 0 - SM 3, while SM 4 - SM 7 sit IDLE
[Diagram: 4 blocks on 8 SMs, half the machine idle]
Transposed Matrix-Vector Multiplication (1M x 4)
• 1,000,000 blocks are launched: 125,000 blocks per SM
• Blocks 0-7 and 8-15 start on SM 0 - SM 7; the remaining blocks, up to block 1,000,000, queue behind them
[Diagram: 1M blocks oversubscribing 8 SMs]
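The imbalance on the two slides above can be sketched with a small back-of-the-envelope calculation. This is our own illustration (not from the talk), assuming one thread block per matrix row and an 8-SM GPU with round-robin block scheduling:

```python
# Hypothetical sketch: estimate SM utilization for a kernel that maps
# one thread block to each matrix row, on a GPU with `num_sms` SMs.
def blocks_per_sm(num_rows, num_sms=8):
    """Return (busy_sms, blocks_queued_on_each_busy_sm)."""
    busy = min(num_rows, num_sms)
    per_sm = -(-num_rows // num_sms)  # ceiling division
    return busy, per_sm

# 4 x 1M matrix: only 4 blocks, so 4 of the 8 SMs sit idle.
print(blocks_per_sm(4))          # (4, 1)
# 1M x 4 matrix: 1,000,000 blocks, 125,000 queued per SM.
print(blocks_per_sm(1_000_000))  # (8, 125000)
```

The same code thus under-utilizes the machine on one input shape and oversubscribes it on another, which is the input-portability problem Adaptic targets.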
GPU Programming Challenge - Portability
• GPU architectures evolve quickly: 240 cores (2008), 512 cores (2011), 1536 cores (2012)
• Goal: the fastest matrix-vector multiplication for any GPU and any input size
Adaptic
• Adaptive Input-aware Compilation for GPUs
• Device-portable
• Input-portable
• Programmers can focus on the algorithm without worrying about low-level details
• Streaming language
• Higher level of abstraction
• Separates memory accesses from the algorithm
• e.g., StreamIt
StreamIt
• Higher level of abstraction
• Decouples computation from memory accesses
• Coarse-grained exposed parallelism and exposed communication
• Streaming actors use buffers to communicate
• Much recent work extends the portability of streaming applications
[Diagram: a stream graph - a splitter feeding parallel actors (Actors 2-5), merged by a joiner into Actor 6]
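The actor model above can be sketched in a few lines. This is our own Python illustration of the pop/push firing semantics, not StreamIt itself; the `A[i, j]` notation (an actor that pops i items and pushes j per firing) follows the later slides:

```python
# Minimal model of a streaming actor: each firing pops `pop` items from
# an input FIFO, runs the work function, and pushes `push` items out.
from collections import deque

def run_actor(work, pop, push, inbuf, outbuf):
    """Fire repeatedly while the input buffer holds at least `pop` items."""
    while len(inbuf) >= pop:
        popped = [inbuf.popleft() for _ in range(pop)]
        out = work(popped)
        assert len(out) == push  # actor must honor its declared push rate
        outbuf.extend(out)

inbuf = deque(range(8))
outbuf = deque()
# An A[2, 1] actor: pops two values, pushes their sum.
run_actor(lambda xs: [sum(xs)], 2, 1, inbuf, outbuf)
print(list(outbuf))  # [1, 5, 9, 13]
```

Because communication is confined to these buffers, a compiler like Adaptic is free to decide where each buffer lives (registers, shared memory, or global memory) and how actors map to threads and blocks.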
Compilation Flow in Adaptic
• Offline compilation turns StreamIt code into several CUDA kernels (Kernel 0 - Kernel 3), each optimized for a different input range from smallest to largest, guided by a performance model
• Memory access optimization (input-unaware) - why? Global memory accesses have large latency. Optimizations: memory restructuring, coalesced access, neighboring access, data reuse
• Actor segmentation (large inputs) - splits actors so more blocks are generated, alleviating resource under-utilization. Optimizations: stream reduction, intra-actor parallelization
• Actor integration (small inputs) - merges several actors into one, alleviating high resource contention. Optimizations: vertical integration, horizontal integration
• At run time, the executable launches the kernel matching the target GPU and the input range
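The run-time step of the flow above is a simple dispatch: several pre-compiled kernels, one picked by input size. A hedged sketch, with made-up kernel names and thresholds purely for illustration:

```python
# Illustrative runtime dispatch: ADAPTIC-style executables carry several
# kernels compiled offline, each tuned for an input-size range.
def pick_kernel(input_size, table):
    """table: list of (upper_bound, kernel_name), sorted by bound ascending."""
    for bound, kernel in table:
        if input_size <= bound:
            return kernel
    return table[-1][1]  # fall back to the largest-input kernel

# Hypothetical table produced by offline compilation.
KERNELS = [(1 << 10, "kernel0_small"),
           (1 << 16, "kernel1_medium"),
           (1 << 22, "kernel2_large"),
           (float("inf"), "kernel3_largest")]

print(pick_kernel(512, KERNELS))        # kernel0_small
print(pick_kernel(5_000_000, KERNELS))  # kernel3_largest
```

The thresholds themselves would come from the performance model, not from fixed constants as here.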
Memory Optimization
• Global memory has large access latency
• Threads do not access the words in sequence: no coalescing
• Example: threads 0-3 each run actor A[4,4] (A[i, j] denotes an actor with i pops and j pushes); each thread reads its own contiguous chunk, so consecutive threads touch words 0, 4, 8, 12 - a strided, uncoalesced pattern
[Diagram: threads 0-3 running A[4,4] issuing strided global memory accesses]
Memory Optimization
• After memory restructuring, the data for the A[4,4] actors is interleaved so that on each pop, threads 0-3 access consecutive words 0, 1, 2, 3
• Consecutive threads now access the words in sequence, so the accesses coalesce into a single memory transaction
[Diagram: the same A[4,4] actors after restructuring, with coalesced global memory accesses]
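The before/after access patterns on these two slides can be reproduced with a short sketch. This is our own illustration, assuming 4 threads each running an A[4,4] actor as in the example:

```python
# Addresses touched by each thread on its first pop, before and after
# a memory-restructuring transformation like ADAPTIC's.
def first_pop_addresses(num_threads, pops, interleaved):
    """Strided layout: thread t starts at t*pops (uncoalesced).
    Interleaved layout: thread t starts at word t (coalesced)."""
    if interleaved:
        return [t for t in range(num_threads)]
    return [t * pops for t in range(num_threads)]

before = first_pop_addresses(4, 4, interleaved=False)
after = first_pop_addresses(4, 4, interleaved=True)
print(before)  # [0, 4, 8, 12] -> strided, four separate transactions
print(after)   # [0, 1, 2, 3]  -> consecutive, one coalesced transaction
```

On real hardware the coalesced version lets the memory controller service one wide transaction per warp instead of several, which is why this optimization is applied regardless of input size.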
Actor Segmentation
• 4 x 1M transposed matrix-vector multiplication: Actor 0 is split into four actors, Actor 0 - Actor 3
• Instead of blocks 0-3, the segmented kernel launches blocks 0-127 (blocks 0-31, 32-63, 64-95, 96-127), generating enough blocks to fill all SMs
[Diagram: one actor on 4 blocks vs. four segmented actors spread over 128 blocks]
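Segmentation of a reduction-style actor can be sketched as splitting its iteration space into partial actors plus a small joiner. Our own illustration, not the paper's algorithm:

```python
# Illustrative actor segmentation: one summation actor over N elements
# is split into `segments` partial actors (each mappable to its own
# thread block), and a joiner combines the partial results.
def segmented_sum(data, segments):
    n = len(data)
    step = -(-n // segments)  # ceiling division: elements per segment
    partials = [sum(data[i:i + step]) for i in range(0, n, step)]
    return sum(partials), len(partials)  # (joined result, blocks used)

total, blocks = segmented_sum(list(range(1000)), 32)
print(total, blocks)  # 499500 32
```

The result is unchanged, but the work is now expressed as 32 independent blocks instead of one, which is exactly what a large input needs to keep every SM busy.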
Actor Integration
• Merges several actors or threads to balance the threads' workloads
• Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory
• Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions
[Diagram: a stream graph before and after fusion - the splitter's branches collapse into Fused Actor 0 and Fused Actor 1, followed by Actor 6 after the joiner]
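Vertical integration amounts to fusing two pipelined actors so the intermediate stream never round-trips through global memory. A minimal sketch with two made-up actors (ours, purely for illustration):

```python
# Two pipelined actors, each a separate kernel in the unfused version:
def actor_scale(xs):   # first actor: scale each element
    return [2 * x for x in xs]

def actor_offset(xs):  # second actor: add a constant
    return [x + 1 for x in xs]

# Vertically integrated actor: one pass, the intermediate value stays
# in a register / shared memory instead of being written back out.
def fused(xs):
    return [2 * x + 1 for x in xs]

data = [0, 1, 2, 3]
assert fused(data) == actor_offset(actor_scale(data))
print(fused(data))  # [1, 3, 5, 7]
```

For small inputs, where each actor alone cannot fill the machine, this fusion trades kernel-launch and off-chip-traffic overhead for a single denser kernel.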
Experimental Setup
• CPU: Intel Xeon X5650
• GPUs
• NVIDIA Tesla C2050, 3GB GDDR5
• NVIDIA GTX 285, 2GB GDDR3
• Benchmarks
• CUBLAS Library 3.2
• NVIDIA SDK 3.1
Results (Speedup)
[Chart: speedup vs. input size]
Results (BiCGSTAB)
[Chart: BiCGSTAB performance, Adaptic vs. input-unaware compilation]
Summary
• GPU performance is affected by the GPU model and the input
• The CUDA / OpenCL programming model lacks architecture and input portability
• Scientific applications use irregular inputs, making optimized performance hard to achieve
• Proposed Adaptic
• Architecture- and input-portable compilation with a streaming language
• Showed speedup over CUBLAS / SDK across a wide range of input sizes