Online Performance Projection for Clusters with Heterogeneous GPUs
Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA); Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)
Diversity in Accelerators
[Figure: performance share of accelerators in Top500 systems, Nov 2008 vs. Nov 2013. Source: top500.org]
Heterogeneity “Among” Nodes
• Clusters are deploying different accelerators
• Different accelerators suit different tasks
• Example clusters:
  • “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
  • “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
  • “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
• However, one unified programming model covers “all” of these accelerators: OpenCL (CPUs, GPUs, FPGAs, DSPs) — see the enumeration sketch below
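To make the unified-model point concrete, here is a minimal sketch (assuming the pyopencl Python bindings, which are not part of this work) of how a runtime can see every OpenCL platform and device on a node through one API:

    import pyopencl as cl  # assumed installed; standard Python OpenCL bindings

    # One API exposes every accelerator on the node, whatever the vendor.
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            print(platform.name, device.name,
                  cl.device_type.to_string(device.type))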
Affinity of Tasks to Processors
• Peak performance doesn’t necessarily translate into actual device performance.
[Figure: an OpenCL program faces a choice among several candidate devices.]
Challenges for Runtime Systems
• It is crucial for heterogeneous runtime systems to embrace the different accelerators in a cluster w.r.t. performance and power
• Examples of OpenCL runtime systems: SnuCL, VOCL, SOCL
• Challenges:
  • Efficiently choose the right device for the right task
  • Keep the decision-making overhead minimal
Our Contributions
• An online workload characterization technique for OpenCL kernels
• A model that projects the relative ranking of different devices with little overhead
• An end-to-end evaluation of our technique for multiple architectural families of AMD and NVIDIA GPUs
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Design
• Goal:
  • Rank accelerators for a given OpenCL workload
  • Accurately AND efficiently: decision making with minimal overhead
• Choices:
  • Static code analysis:
    • Fast
    • Inaccurate, as it does not account for dynamic properties: input data dependence, memory access patterns, dynamic instruction counts
  • Dynamic code analysis:
    • Higher accuracy
    • Executes either on the actual device or through an emulator
    • Not always feasible to run on actual devices: data transfer costs, clusters are “busy”
    • Full-kernel emulators are very slow
Design – Workload Profiling
• “Mini-emulation”: feed the OpenCL kernel to a mini-emulator and emulate a single workgroup
• Collect dynamic characteristics:
  • Instruction traces (instruction mix)
  • Global and local memory transactions and access patterns, including bank conflicts
• In typical data-parallel workloads, workgroups exhibit similar runtime characteristics
• Asymptotically lower overhead than full-kernel emulation (see the sketch below)
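As an illustration of what mini-emulation yields, the following Python sketch (all names hypothetical; this is not the authors' implementation) models the per-workgroup counts the emulator collects and their scaling to the full kernel under the stated assumption that workgroups behave alike:

    from dataclasses import dataclass

    @dataclass
    class WorkloadProfile:
        """Dynamic characteristics of a single emulated workgroup."""
        insts: int           # dynamic instructions executed by one workgroup
        gmem_tx: int         # global memory transactions (reflects access patterns)
        lmem_tx: int         # local memory transactions (reflects bank conflicts)
        num_workgroups: int  # workgroups launched by the full kernel

        def full_kernel(self):
            """Scale one workgroup's counts to the whole kernel; valid when
            workgroups exhibit similar runtime characteristics, as in
            typical data-parallel workloads."""
            n = self.num_workgroups
            return self.insts * n, self.gmem_tx * n, self.lmem_tx * n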
Design – Device Profiling
[Diagram: instruction and memory microbenchmarks run on GPU 1 … GPU N produce per-device throughput profiles.]
Design – Device Profiling
• Build device throughput profiles:
  • Modified SHOC microbenchmarks to obtain hardware throughput at varying occupancy
  • Collect throughputs for instructions, global memory, and local memory
  • Built only once per device
[Figure: global and local memory profile of the AMD 7970]
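The throughput profile can be pictured as a small per-device lookup table keyed by occupancy. A hypothetical sketch, not the actual SHOC-based harness:

    import bisect

    class DeviceProfile:
        """Measured throughput at a few occupancy levels, built once."""
        def __init__(self, occupancies, throughputs):
            self.occ = occupancies   # sorted, e.g. [0.25, 0.5, 0.75, 1.0]
            self.tput = throughputs  # ops/s or transactions/s at each level

        def throughput_at(self, occupancy):
            # Nearest measured level at or below the request; a real
            # profile might interpolate between measurements instead.
            i = max(bisect.bisect_right(self.occ, occupancy) - 1, 0)
            return self.tput[i]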
Design – Find Performance Limiter
• Single-workgroup dynamic characteristics → full-kernel characteristics, with device occupancy as the scaling factor
• Compute projected theoretical times for instructions, global memory, and local memory
• GPUs aggressively try to hide the latencies of these components, so the performance limiter is max(t_local, t_global, t_compute)*
• Compare the normalized projected times and choose the best device (worked sketch below)
*Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
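Combining the workload profile with a device profile, the projection itself reduces to three divisions and a max, following the limiter model of Zhang et al. cited above; the names below are illustrative:

    def project_time(workload, inst_tput, gmem_tput, lmem_tput):
        """Projected kernel time on one device. Each resource's time is its
        total work divided by its effective throughput; because the GPU
        overlaps the latencies of the other components, the slowest
        resource (the performance limiter) dominates."""
        insts, gmem_tx, lmem_tx = workload.full_kernel()
        t_compute = insts / inst_tput
        t_global = gmem_tx / gmem_tput
        t_local = lmem_tx / lmem_tput
        return max(t_local, t_global, t_compute)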
Design – End to End
• Static profiling (built once): instruction and memory benchmarks on GPU 1 … GPU N → device profiles
• Dynamic profiling (per kernel): the mini-emulator runs a single workgroup of the GPU kernel → instruction mix, memory patterns, bank conflicts
• Performance projection: combine the two into effective instruction throughput, effective global memory bandwidth, and effective local memory bandwidth; the performance limiter determines relative GPU performance (ranking sketch below)
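End to end, selecting a device is then one pass over the offline-built profiles, reusing the hypothetical helpers sketched above:

    def rank_devices(workload, profiles, occupancy):
        """Order devices best-first by projected time. `profiles` maps a
        device name to its (instruction, global memory, local memory)
        DeviceProfile triple built offline by the microbenchmarks."""
        projected = {
            name: project_time(workload,
                               inst_p.throughput_at(occupancy),
                               gmem_p.throughput_at(occupancy),
                               lmem_p.throughput_at(occupancy))
            for name, (inst_p, gmem_p, lmem_p) in profiles.items()
        }
        # Only the relative ordering matters for device selection.
        return sorted(projected, key=projected.get)

Because only the ordering is used, the projected times can be normalized per device without affecting the choice.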
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Experimental Setup
• Accelerators:
  • AMD 7970: scalar ALUs, cache hierarchy
  • AMD 5870: VLIW ALUs
  • NVIDIA C2050: Fermi architecture, cache hierarchy
  • NVIDIA C1060: Tesla architecture
• Simulators:
  • Multi2Sim v4.1 for AMD devices and GPGPU-Sim v3.0 for NVIDIA devices
  • Methodology is agnostic to the specific emulator
• Applications: [table of benchmark applications]
Application Boundedness: AMD GPUs
[Figure: projected time (normalized) per application; each bar is labeled with its performance limiter: compute, lmem (local memory), or gmem (global memory).]
Application Boundedness Summary
Accuracy of Performance Projection
Emulation Overhead – Reduction Kernel
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Conclusion
• We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels
• The approach is shown to be sufficiently accurate for relative performance projection
• The approach has asymptotically lower overhead than projection using full-kernel emulation
• Our technique is shown to work well with multiple architectural families of AMD and NVIDIA GPUs
• With the increasing diversity in accelerators (towards 10x10*), our methodology only becomes more relevant
*S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011
Thank You
Backup
Evolution of Microprocessors: 90/10 Paradigm
• Derive common cases from applications (the 90%)
• Broad focus on application workloads
• Architectural improvements target 90% of cases
• Design one aggregated, generic “core”
• Less customizability for individual applications
Figure credit: A. Chien, Salishan Conference 2010
90/10 Paradigm -> 10x10 Paradigm
• Simple, specialized tools (“accelerators”) customized for different purposes (“applications”)
• Narrower focus per application class (10% each)
• A simplified, specialized accelerator for each classification
• Why? 10x lower power and 10x faster -> 100x energy efficiency
Figure credit: A. Chien, Salishan Conference 2010
Application Boundedness: NVIDIA GPUs
[Figure: projected time (normalized) per application; each bar is labeled with its performance limiter: compute, lmem, or gmem.]
Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction