Performance in GPU Architectures: Potentials and Distances
Amirali Baniasadi, ECE, University of Victoria
Ahmad Lashgar, ECE, University of Tehran
WDDD-9, June 5, 2011
This Work
Goal: Investigate GPU performance for general-purpose workloads
How: Study the isolated impact of
• Memory divergence
• Branch divergence
• Context-keeping resources
Key finding: Memory has the biggest impact; branch-divergence solutions need to take memory into account.
A. Lashgar and A. Baniasadi. Performance in GPU architectures: Potentials and Distances.
Outline
• Background
• Performance Impacting Parameters
• Machine Models
• Performance Potentials
• Performance Distances
• Sensitivity Analysis
• Conclusion
GPU Architecture
[Diagram: 10 TPCs of 3 SMs each, connected through an interconnection network to the memory controllers and DRAM channels. Each SM contains 32 PEs, a register file, shared memory, a thread pool (TID, CTAID, program counter per thread), and L1 data/constant/texture caches.]
• Number of concurrent CTAs per SM is limited by the size of 3 shared resources:
  • Thread Pool
  • Register File
  • Shared Memory
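The CTA limit described above can be sketched as a minimum over the three shared resources. This is my own illustration, not code from the talk; the capacity numbers are hypothetical, not taken from any specific GPU.

```python
# Sketch: concurrent CTAs per SM = the tightest of the three resource limits.
# All capacities below are illustrative placeholders.
def max_concurrent_ctas(threads_per_cta, regs_per_thread, smem_per_cta,
                        pool_size=1024, regfile_regs=16384, smem_bytes=16384):
    by_pool = pool_size // threads_per_cta                      # thread pool slots
    by_regs = regfile_regs // (regs_per_thread * threads_per_cta)  # register file
    by_smem = smem_bytes // smem_per_cta                        # shared memory
    return min(by_pool, by_regs, by_smem)

# Example: 256-thread CTAs, 32 registers per thread, 4 KB shared memory per CTA.
print(max_concurrent_ctas(256, 32, 4096))  # 2 (register file is the bottleneck)
```

Fewer concurrent CTAs means fewer warps available to hide memory latency, which is why these resources matter for performance.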
Branch Divergence
• An SM is a SIMD processor
• A group of threads (a warp) executes the same instruction across the lanes
• A branch instruction can diverge a warp into two groups:
  • Threads with taken outcome
  • Threads with not-taken outcome

A: // pre-divergence
if (CONDITION) {
  B: // NT path
} else {
  C: // T path
}
D: // reconvergence point
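The efficiency loss from divergence can be sketched numerically. This is a simplified model of my own (assuming the two paths are simply serialized, as under a PDOM-style scheme), not the paper's simulator:

```python
# Sketch: SIMD efficiency of one warp at a two-way branch, assuming the
# hardware serializes the taken and not-taken paths with lane masking.
def simd_efficiency(outcomes, warp_size=32):
    """outcomes: per-thread branch outcomes (True = taken) for one warp."""
    taken = sum(outcomes)
    not_taken = len(outcomes) - taken
    if taken == 0 or not_taken == 0:
        return 1.0  # no divergence: a single pass with all lanes active
    # two serialized passes; each pass activates only its own group's lanes
    return (taken + not_taken) / (2 * warp_size)

print(simd_efficiency([True] * 32))                 # 1.0: uniform branch
print(simd_efficiency([True] * 8 + [False] * 24))   # 0.5: diverged warp
```

However the 32 threads split, a diverged warp never exceeds 50% utilization over the two passes, which is what the control-flow mechanisms on the next slides try to recover.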
Control-flow mechanisms
• Control-flow solutions address branch divergence. Previous solutions:
  • Post-dominator Reconvergence (PDOM): masking and serializing the diverging paths, finally reconverging all paths
  • Dynamic Warp Formation (DWF): regrouping the threads on the diverging paths into new warps
PDOM
[Figure: SIMD utilization over time under PDOM; a per-warp reconvergence stack (TOS = top of stack) masks and serializes the diverged paths.]
Dynamically regrouping diverged threads that are on the same path would increase utilization.
DWF
[Figure: SIMD utilization over time under DWF; diverged threads are kept in a warp pool and merged into new warps when a merge possibility exists.]
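The difference between the two figures can be captured in a toy issue-slot count. This is my own sketch (small 4-wide warps for readability, counting only the diverged region and ignoring DWF's merge constraints):

```python
# Toy comparison of issue slots spent on a diverged branch:
# PDOM serializes each warp's two paths independently; DWF packs threads
# on the same path, drawn from a warp pool, into new (fuller) warps.
import math

def pdom_slots(warps):
    # one pass per non-empty path, per warp
    return sum((w['T'] > 0) + (w['NT'] > 0) for w in warps)

def dwf_slots(warps, warp_size=4):
    # same-path threads across all warps are packed into full warps
    taken = sum(w['T'] for w in warps)
    not_taken = sum(w['NT'] for w in warps)
    return math.ceil(taken / warp_size) + math.ceil(not_taken / warp_size)

# Two 4-wide warps, each split 2/2 at the branch:
warps = [{'T': 2, 'NT': 2}, {'T': 2, 'NT': 2}]
print(pdom_slots(warps))  # 4 half-empty passes
print(dwf_slots(warps))   # 2 full passes
```

The sketch shows DWF's upside; the paper's point is that this upside must be weighed against memory behavior, which regrouping can disturb.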
Performance-impacting parameters
• Memory divergence: memory pressure increases with uncoalesced memory accesses
• Branch divergence: SIMD efficiency decreases with intra-warp diverging branches
• Workload parallelism: CTA-limiting resources bound the memory-latency-hiding capability
  • Concurrent CTAs share 3 CTA-limiting resources: shared memory, register file, thread pool
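Memory divergence can be illustrated by counting the transactions a warp's access pattern generates. A minimal sketch, assuming a 128-byte transaction granularity (an assumption on my part, not a figure from the slides):

```python
# Sketch: number of memory transactions for one warp's addresses, assuming
# accesses falling in the same 128-byte segment coalesce into one transaction.
def num_transactions(addresses, segment=128):
    return len({addr // segment for addr in addresses})

coalesced = [tid * 4 for tid in range(32)]   # 32 consecutive 4-byte words
scattered = [tid * 256 for tid in range(32)] # 256-byte stride per thread
print(num_transactions(coalesced))  # 1
print(num_transactions(scattered))  # 32
```

A 32x difference in transaction count per load is the kind of pressure the memory system faces under divergent access patterns.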
Machine Models
• Isolate the impact of each parameter. Models are named XX-YY-ZZ:
  • XX (resources): LR = Limited Resources, UR = Unlimited Resources
  • YY (control flow): DC = DWF Control-flow, PC = PDOM Control-flow, IC = Ideal Control-flow (MIMD)
  • ZZ (memory): M = Real Memory, IM = Ideal Memory
Machine Models continued…
• Limited per-SM resources, real memory: LR-DC-M, LR-PC-M, LR-IC-M
• Limited per-SM resources, ideal memory: LR-DC-IM, LR-PC-IM, LR-IC-IM
• Unlimited per-SM resources, real memory: UR-DC-M, UR-PC-M, UR-IC-M
• Unlimited per-SM resources, ideal memory: UR-DC-IM, UR-PC-IM, UR-IC-IM
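The 12 models above are simply the cross product of the three axes; a quick sketch of the naming scheme:

```python
# Enumerate the 12 machine models from the XX-YY-ZZ naming scheme.
from itertools import product

resources = ['LR', 'UR']        # Limited / Unlimited per-SM resources
control = ['DC', 'PC', 'IC']    # DWF / PDOM / Ideal (MIMD) control-flow
memory = ['M', 'IM']            # Real / Ideal memory

models = ['-'.join(parts) for parts in product(resources, control, memory)]
print(len(models))  # 12
print(models[0])    # LR-DC-M
```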
Methodology
• GPGPU-sim v2.1.1b
• 13 benchmarks from the Rodinia benchmark suite and CUDA SDK 2.3
Performance Potentials
• The speedup that can be reached if one impacting parameter is idealized
• 3 potentials (per control-flow mechanism):
  • Memory potential: speedup due to ideal memory
  • Control potential: speedup due to a free-of-divergence architecture
  • Resource potential: speedup due to infinite CTA-limiting resources per SM
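As a sketch of how a potential could be computed from two machine-model runs: the speedup of the model with one parameter idealized over the baseline, minus one. The cycle counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Sketch: a potential as the speedup gained by idealizing one parameter.
def potential(baseline_cycles, idealized_cycles):
    return baseline_cycles / idealized_cycles - 1.0  # 0.59 means +59%

# Hypothetical numbers: memory potential of a PDOM machine,
# i.e. LR-PC-M (real memory) vs LR-PC-IM (ideal memory).
print(f"{potential(1_590_000, 1_000_000):.0%}")  # 59%
```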
Performance Potentials continued…
Memory Potentials
[Chart: per-benchmark memory potentials; on average 61% under DWF and 59% under PDOM.]
Resource Potentials
[Chart: per-benchmark resource potentials; on average 8.6% under DWF and 9.4% under PDOM.]
Control Potentials
[Chart: per-benchmark control potentials; on average -7% under PDOM and 2% under DWF.]
Performance Distances
• How much an otherwise-ideal GPU is distanced from the ideal due to one parameter
• 3 distances:
  • Memory distance: distance from the ideal GPU due to real memory
  • Resource distance: distance from the ideal GPU due to limited resources
  • Control distance: distance from the ideal GPU due to branch divergence
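On my reading of the definition, a distance compares the fully ideal machine (UR-IC-IM) to a machine that is ideal except for the one parameter under study. A sketch with hypothetical cycle counts, expressing the distance as the fraction of ideal performance lost:

```python
# Sketch: distance of an otherwise-ideal GPU from the fully ideal one,
# as the fraction of performance lost to the single non-ideal parameter.
def distance(ideal_cycles, one_param_real_cycles):
    return 1.0 - ideal_cycles / one_param_real_cycles

# Hypothetical numbers: memory distance, i.e. UR-IC-M vs UR-IC-IM.
print(f"{distance(600_000, 1_000_000):.0%}")  # 40%
```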
Performance Distances continued…
Memory Distance
[Chart: per-benchmark memory distance; 40% on average.]
Resource Distance
[Chart: per-benchmark resource distance; 2% on average.]
Control Distances
[Chart: per-benchmark control distance; on average 8% under PDOM and 15% under DWF.]
Sensitivity Analysis
• Validating the findings under aggressive configurations:
  • Aggressive-Memory: 2x L1 caches, 2x number of memory controllers
  • Aggressive-Resource: 2x CTA-limiting resources
• The study is limited to performance potentials
Aggressive-memory
• Memory potentials [chart]: 28% for DWF, 28% for PDOM
• Control potentials [chart]: -0.4% for DWF, -8% for PDOM
• Resource potentials [chart]: ~0% for DWF, 8% for PDOM
Aggressive-resource
• Memory potentials [chart]: 52% for DWF, 51% for PDOM
• Control potentials [chart]: 2% for DWF, -8% for PDOM
• Resource potentials [chart]: 3% for DWF, 4% for PDOM
Conclusion
• Performance in GPUs:
  • Potentials: improvement by idealizing one parameter
    • Memory: 59% and 61% for PDOM and DWF
    • Control: -7% and 2% for PDOM and DWF
    • Resource: 9.4% and 8.6% for PDOM and DWF
  • Distances: distance from the ideal system due to one non-ideal factor
    • Memory: 40%
    • Control: 8% and 15% for PDOM and DWF
    • Resource: 2%
• Findings:
  • Memory has the biggest impact among the 3 factors
  • Improving the control-flow mechanism has to take memory pressure into account
  • The same trend holds under aggressive memory and context-keeping resources
Thank you. Questions?
Why 32 PEs per SM?
• GPGPU-sim v2.1.1b coalesces the memory accesses of each SIMD-width slice of a warp separately, similar to pre-Fermi GPUs
  • Example: warp size = 32, 8 PEs per SM → 4 independent coalescing domains per warp
• We used 32 PEs per SM at ¼ clock rate to model warp-wide coalescing, similar to Fermi GPUs
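The slice-wise vs warp-wide coalescing difference above can be sketched as follows; the 128-byte segment size is my assumption, not a figure from the slides:

```python
# Sketch: transactions per warp when coalescing is done per SIMD-width slice,
# assuming accesses in the same 128-byte segment coalesce within a slice.
def transactions(addresses, slice_width, segment=128):
    total = 0
    for i in range(0, len(addresses), slice_width):
        chunk = addresses[i:i + slice_width]      # one coalescing domain
        total += len({a // segment for a in chunk})
    return total

addrs = [tid * 4 for tid in range(32)]  # fully coalescible word accesses
print(transactions(addrs, 8))    # 4: one transaction per 8-wide slice
print(transactions(addrs, 32))   # 1: whole warp coalesces (Fermi-like)
```

With 8-wide slices, even a perfectly coalescible pattern issues 4 transactions (one per independent domain), which is why the 32-PE configuration was used to model Fermi-style warp-wide coalescing.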