280 likes | 512 Views
GPUDet: A Deterministic GPU Architecture. Hadi Jooybar 1 , Wilson Fung 1 , Mike O’Connor 2, Joseph Devietti 3 , Tor M. Aamodt 1. 1 The University of British Columbia 2 AMD Research 3 University of Washington . GPUs are … Fast Energy efficient Commodity hardware. But…
E N D
GPUDet: A Deterministic GPU Architecture Hadi Jooybar1, Wilson Fung1, Mike O’Connor2, Joseph Devietti3,Tor M. Aamodt1 1The University of British Columbia 2AMD Research 3University of Washington
GPUs are … • Fast • Energy efficient • Commodity hardware • But… • Mostly use for certain range of applications • Why? • Communication among concurrent threads 1000s of Threads
V0 V0 V0 Motivation 0 __global__ voidBFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 cost[id] = cost[tid] + 1; 7 active[id] = true; 8 *over = true; 9 } } } } V1 V1 V1 V2 V2 V2 Cost = - Active = - Cost = 1 Active = 1 Cost = 1 Active = 1 Cost = - Active = - Cost = 1 Active = 1 Cost = 2 Active = 1 BFS algorithm Published in HiPC 2007
Motivation • What about debuggers?! • The bug may appear occasionally or in different places in each run. I will debug it this time OMG! Where was that bug?!
GPUDet • Strong Determinism (hardware proposal) • Same Outputs • Same Execution Path • Makes the program easier to • Debug • Test
V0 Motivation 0 __global__ voidBFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 cost[id] = cost[tid] + 1; 7 active[id] = true; 8 *over = true; 9 } } } } V1 V2 Cost = 1 Active = 1 Cost = 2 Active = 1 BFS algorithm Published in HiPC 2007
GPUDet • Strong Determinism • Same Outputs • Same Execution Path • Makes the program easier to • Debug • Test • There is no free lunch • Performance Overhead • Our goal is to provide Deterministic Execution on GPU architectures with acceptable performance overhead
GPU Architecture DRAM CPU DRAM L2 Cache Kernel launch x = input[threadID]; y= func(x); output[threadID] = y; Compute Unit L1 Cache workgroup 0 workgroup 2 workgroup 1 Memory Unit ALU ALU ALU Workgroups
Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion
Deterministic GPU Execution Challenges … Isolation Isolation Quantum 0 Quantum n T0 … T1 • Isolation mechanism • Provide method to pause execution of a thread T2 T0 T0 T0 T0 T3 T1 T1 T1 T1 Normal Execution T2 T2 T2 T2 T3 T3 T3 T3 Communication Communication
Deterministic GPU Execution Challenges • Isolation mechanism • Lack of private caches • Lack of cache coherency • Provide method to pause execution of a thread • Single Instruction Multiple Threads (SIMT) • Potential deadlock condition • Major changes in control flow hardware • Performance overhead … workgroupn wavefront
Deterministic GPU Execution Challenges • Very large number of threads • Expensive global synchronization • Expensive serialization • Different program properties • Large number of short running threads • Frequent workgroup synchronization • Less locality in intra thread memory accesses
Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion
Deterministic Execution of a Wavefront T15 T0 T2 T1 if (tid < 16) x[tid%2] = tid; … x[0] = 0 x[1] = 1 x[0] = 2 x[1] = 15 Execution of one wavefront is deterministic Coalescing Unit Address x Mask v v - - - - - - … - Data Race Data 14 15 - - - - - - … - x[0] = 14 Not modified x[1] = 15 To memory
Deterministic GPU Execution Challenges Isolation Isolation … wavefront granularity • Isolation mechanism • Provide method to pause execution of a thread T0 T0 T1 T1 not a challenge anymore T2 T2 T3 T3 Communication Communication
Read Only • GPUDet-Basic Global Memory Commit Reaching Quantum Boundary Load Op Atomic Op Store Buffers Wavefronts Instruction Count Atomic Operations Memory Fences Workgroup Barriers Execution Complete … Local Memory
Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion
Workgroup-Aware Quantum Formation • Extra global synchronizations Reducing number of synchronizations Avoid unnecessary quantum termination Load Imbalance
Workgroup-Aware Quantum Formation All reach a workgroup barrier Continue execution in the parallel mode Workgroup-Aware Decision Making Quanta are finished by workgroup barriers
Workgroup-Aware Quantum Formation Workgroup-Aware Decision Making Finish execution of the Kernel function Deterministic workgroup partitioning
Deterministic Parallel Commit using the Z-Buffer Unit • Z-Buffer Unit Depth Buffer Store Buffer Contents ≈ Color Values Wavefront ID ≈ Depth Values
Compute Unit Level Serialization • GPUs preserve Point to Point Ordering Serialization is only among compute units
Outline • Introduction • GPU Architecture • Challenges • Deterministic Execution with GPUDet • GPUDet Optimizations • Workgroup-Aware Quantum Formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization • Results and Conclusion
Results • GPGPU-Sim 3.0.2 Applications with atomic operations 2x Slowdown
Quantum Formation 19% Performance Improvementfor application with small kernel functions 20% Performance Improvementfor application with barriers
Deterministic Parallel Commit using the Z-Buffer Unit 60% Performance Improvement on Average
Compute Unit Level Serialization 6.1x Performance Improvement in Serial Mode
Conclusion • Encourages programmers to use GPUs in broader range of applications • Exploits GPU characteristics to reduce performance overhead • Deterministic execution within a wavefront • Workgroup-aware quantum formation • Deterministic parallel commit using Z-Buffer Unit • Compute Unit level serialization Questions?