Ping Xiang, Yi Yang, Huiyang Zhou
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-20), Orlando, Florida, USA
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
Overview of GPU Architecture
[Figure: a GPU with multiple SMs sharing DRAM; each SM has control logic, a register file, SIMD ALU lanes, shared memory, and a cache. Threads are grouped into warps, and warps into thread blocks (TBs).]
Motivation: Resource Fragmentation
• Typically large TB size (e.g., 512 threads)
• More efficient data sharing/communication within a TB
• Limited total TB number per SM
[Figure: register file partitioned among TBs, leaving a fragment of unused registers too small to host another full TB.]
Motivation: Warp-Level Divergence
• Warps within the same TB do not finish at the same time
• Resources held by finished warps cannot be released promptly
[Figure: some warps of a TB have finished while others are still running, so the TB's resources stay allocated and part of them sits unused.]
Outline
• Background
• Motivation
• Characterization
• Mitigation: WarpMan
• Experiments
• Conclusions
Characterization
• Spatial resource underutilization: leftover resources (e.g., registers) too small to host another full TB
• Temporal resource underutilization: resources of finished warps held until the whole TB completes
[Figure: register file holding two TBs; two finished warps' registers remain allocated and unused.]
Spatial Resource Underutilization
• Register resource as an example
[Chart: fraction of registers left unallocated across applications; values shown include 46%, 28%, and 17%.]
Temporal Resource Underutilization
• RTRU: ratio of temporal resource underutilization (one plausible formalization follows)
• Case study: ray tracing, 6 warps per TB, TB0 as an example
• Measured RTRU = 49.7%
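The slide does not reproduce the formula; one plausible way to formalize RTRU (our notation, not necessarily the paper's exact definition): let warp i of an N-warp TB hold resources R_i and run for time t_i from TB start, with the TB finishing at the latest warp. Then

\[
\mathrm{RTRU} \;=\; \frac{\sum_{i=1}^{N} \left(t_{\mathrm{TB}} - t_i\right) R_i}{t_{\mathrm{TB}} \sum_{i=1}^{N} R_i},
\qquad t_{\mathrm{TB}} = \max_i t_i
\]

i.e., the fraction of the TB's resource-time product held by already-finished warps. Under this reading, an RTRU of 49.7% means nearly half of TB0's allocated resource-time was wasted.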
Why Is There Temporal Resource Underutilization?
• Input-dependent workload imbalance: same code, different input, e.g. if (a < 123)
• Program-dependent workload imbalance: code like if (tid < 32)
• Memory divergence: some warps experience more cache hits than others
• Warp scheduling policy: the scheduler prioritizes certain warps over others
(A minimal CUDA sketch of the first two sources follows.)
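A minimal CUDA sketch of input- and program-dependent imbalance (the kernel and data are illustrative assumptions, not from the paper); warps that take the short path finish early but still occupy their TB's resources until the slowest warp completes:

    // Hypothetical kernel illustrating two sources of warp-level divergence.
    __global__ void imbalance_kernel(const int *a, int *out, int n) {
        int tid = threadIdx.x;                            // thread id within the TB
        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        if (gid >= n) return;

        int acc = 0;
        // Input-dependent imbalance: only elements with a[gid] < 123 run the
        // long loop, so warps over "small" data run far longer than siblings.
        if (a[gid] < 123) {
            for (int i = 0; i < 100000; ++i) acc += a[gid] ^ i;
        }
        // Program-dependent imbalance: only the first warp of each TB works here.
        if (tid < 32) {
            acc += tid;
        }
        out[gid] = acc;
    }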
Outline
• Background
• Motivation
• Characterization: Micro-benchmarking
• Mitigation: WarpMan
• Experiments
• Conclusions
Micro-benchmark
• Code runs on both a GTX 480 and a GTX 680

    __global__ void TB_resource_kernel(..., bool call = false) {
        if (call) bloatOccupancy(start, size);   // inflate per-TB resource usage
        ...
        clock_t start_clock = clock();
        if (tid < 32) {                          // tid is the thread id within a TB
            clock_offset = 0;
            while (clock_offset < clock_count) { // only the first warp spins
                clock_offset = clock() - start_clock;
            }
        }
        clock_t end_clock = clock();
        d_o[index] = start_clock;                // index is the global thread id
        d_e[index] = end_clock;
    }
Micro-benchmarking Results

    > Using CUDA device [0]: GeForce GTX 480
    > Detected Compute SM 3.0 hardware with 8 multi-processors.
    ...
    CTA 250 Warp 0: start 80, end 81
    CTA 269 Warp 0: start 80, end 81
    CTA 272 Warp 0: start 80, end 81
    CTA 283 Warp 0: start 80, end 81
    CTA 322 Warp 0: start 80, end 81
    CTA 329 Warp 0: start 80, end 81
    ...

Warp 0 of many later CTAs starts (and ends) at the same time, consistent with resources on current hardware being released only at whole-TB granularity rather than as individual warps finish.
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
WarpMan: Warp-Level Resource Management
[Figure: under TB-level resource management, an SM runs TB0 and TB1; warps 0-2 of TB0 have finished, but their resources stay allocated and TB2 must wait until all of TB0 completes. The timeline shows TB2 starting only after TB0 fully ends.]
WarpMan: Warp-Level Resource Management
[Figure: with WarpMan, as warps 0-2 of TB0 finish, their resources are released immediately and warps 0-2 of TB2 are dispatched into them before TB0 fully completes. The timeline shows TB2's warps starting earlier, saving cycles relative to TB-level management.]
WarpMan Design
• Dispatch logic
  • Traditional TB-level dispatching logic
  • Added partial-TB dispatch logic
• Workload buffer
  • Stores dispatched but not-yet-running partial TBs
Dispatching
[Figure: the dispatcher first performs a TB-level resource check (registers, shared memory, warp entries, TB entries) against the resources required for a full TB; if that fails, a warp-level resource check dispatches a partial TB using the resources required per warp. Shared memory is still allocated at the TB level. A sketch of this two-level check follows.]
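A C++-flavored sketch of the two-level check (structure, names, and fields are our assumptions; the actual mechanism is hardware logic, not software):

    // Hypothetical model of WarpMan's dispatch decision.
    struct FreeResources { int regs, smemBytes, warpSlots, tbSlots; };
    struct TBRequest     { int regsPerWarp, smemBytesPerTB, numWarps; };

    // Traditional TB-level check: everything the full TB needs must fit.
    bool fitsFullTB(const FreeResources &f, const TBRequest &r) {
        return f.regs      >= r.regsPerWarp * r.numWarps &&
               f.smemBytes >= r.smemBytesPerTB &&
               f.warpSlots >= r.numWarps &&
               f.tbSlots   >= 1;
    }

    // WarpMan's warp-level check: shared memory is still reserved for the
    // whole TB, but registers and warp slots are granted warp by warp.
    // Returns how many leading warps of the TB can start now (0 = none).
    int fitsPartialTB(const FreeResources &f, const TBRequest &r) {
        if (f.smemBytes < r.smemBytesPerTB || f.tbSlots < 1) return 0;
        int byRegs  = f.regs / r.regsPerWarp;
        int byWarps = f.warpSlots;
        int n = byRegs < byWarps ? byRegs : byWarps;
        return n < r.numWarps ? n : r.numWarps;
    }

The warps that do not fit are recorded in the workload buffer (next slide) and dispatched as running warps release their resources.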
Workload Buffer
• Stores the dispatched but not-yet-running part of a TB
• Entry format (40 bits total):
  • Hardware TB id (assigned by the hardware): 3 bits
  • Software TB id (defined by the software): 26 bits
  • Start warp id: 5 bits
  • End warp id: 5 bits
  • Valid bit: 1 bit
(A bitfield sketch of this entry follows.)
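The entry could be sketched as a packed bitfield (a hedged illustration; field order and names are our assumptions, only the widths come from the slide):

    #include <cstdint>

    // One 40-bit workload-buffer entry, per the widths on the slide.
    struct WorkloadBufferEntry {
        uint64_t hwTBId    : 3;   // hardware TB id, assigned by the hardware
        uint64_t swTBId    : 26;  // software TB id, defined by the software
        uint64_t startWarp : 5;   // first warp of the TB not yet running
        uint64_t endWarp   : 5;   // last warp of the TB
        uint64_t valid     : 1;   // set while the entry holds a pending partial TB
    };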
Workload Buffer Example
[Figure: TB120 is partially dispatched: warps 0 and 1 run in resources freed by finished warps, while the workload buffer entry records TB number 120 with the start/end warp ids of the remaining warps (e.g., warp 2) and the valid bit set; once the last warp is dispatched, the valid bit is cleared.]
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
Methodology
• GPUWattch for both timing and energy evaluation
• Baseline architecture (GTX 480):
  • 15 SMs, SIMD width of 32, running at 1.4 GHz
  • Max 8 TBs per SM, max 1536 threads per SM
  • Scheduling policy: round-robin / two-level
  • 16 KB L1 cache, 48 KB shared memory, 128 KB register file per SM
• Applications from:
  • NVIDIA CUDA SDK
  • Rodinia benchmark suite
  • GPGPU-Sim
Performance Results
• temp: early-finished warps release their resources for new warps
• temp + spatial: resources are allocated and released at warp level
• Performance improvements as high as 71% (temp) / 76% (temp + spatial)
• 15.3% improvement on average
Energy Results
• Energy savings of over 20% at best, and 6% on average
A Software Alternative: a Smaller TB Size
• Change the software to use a smaller TB size
• Change the hardware to enable more concurrent TBs
• Drawbacks:
  • Inefficient shared memory usage / synchronization
  • Decreased data locality
  • More as we proceed to the experimental results…
(A minimal launch-configuration sketch follows.)
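Concretely, the software alternative is just a launch-configuration change, as in this minimal CUDA sketch (kernel and sizes are our illustration, not the paper's workloads):

    // Hypothetical kernel; the point is only the launch configuration.
    __global__ void scale(float *x, int n) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) x[gid] *= 2.0f;
    }

    // Original: large 512-thread TBs; fewer concurrent TBs per SM.
    //   scale<<<(n + 511) / 512, 512>>>(d_x, n);
    // Alternative: 128-thread TBs, 4x as many. This needs hardware support
    // for more concurrent TBs per SM, and it shrinks the scope of
    // __syncthreads() and of shared-memory sharing within a TB.
    //   scale<<<(n + 127) / 128, 128>>>(d_x, n);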
Comparison to the Software Alternative
• CT and ST: the software approach decreases L1 locality
• NN and BT: the software approach reduces the total number of threads
• On average: 25% improvement (WarpMan) vs. 48% degradation (software alternative)
Related Work
• Resource underutilization due to branch divergence or thread-level divergence has been well studied.
• Yi Yang et al. [PACT-21] target shared-memory resource management; their scheme is complementary to our proposed WarpMan.
• D. Tarjan et al. [US Patent, 2009] propose using a virtual register table to manage the physical register file and enable more concurrent TBs.
Conclusions
• We highlight the limitations of TB-level resource management
• We characterize warp-level divergence and reveal the fundamental reasons for such divergent behavior
• We propose WarpMan and show that it can be implemented with minor hardware changes
• We show that the proposed solution is highly effective, achieving significant performance improvements and energy savings

Questions?