
Online Performance Projection for Clusters with Heterogeneous GPUs


Presentation Transcript


  1. Online Performance Projection for Clusters with Heterogeneous GPUs. Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA); Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)

  2. Diversity in Accelerators [Chart: performance share of accelerators in Top500 systems, Nov 2008 through Nov 2013. Source: top500.org]

  3. Heterogeneity “Among” Nodes
  • Clusters are deploying different accelerators
  • Different accelerators for different tasks
  • Example clusters:
    • “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
    • “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
    • “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs

  4. Heterogeneity “Among” Nodes
  • Clusters are deploying different accelerators
  • Different accelerators for different tasks
  • Example clusters:
    • “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
    • “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
    • “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
  • However, there is a unified programming model for “all” accelerators: OpenCL
    • CPUs, GPUs, FPGAs, DSPs (see the device-enumeration snippet below)
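
Because all of these devices sit behind the same OpenCL API, they can be discovered uniformly at runtime. The snippet below is a minimal sketch, assuming the pyopencl bindings are installed; it is not part of the presented work.

```python
# Minimal sketch: list every OpenCL platform and device visible on this node.
# Assumes the pyopencl bindings are installed; not part of the presented work.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        # device_type.to_string renders CPU, GPU, ACCELERATOR, etc.
        print(f"{platform.name}: {device.name} ({cl.device_type.to_string(device.type)})")
```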

  5. Affinity of Tasks to Processors • Peak performance doesn’t necessarily translate into actual device performance.

  6. Affinity of Tasks to Processors • Peak performance doesn’t necessarily translate into actual device performance. [Diagram: to which device should a given OpenCL program be mapped?]

  7. Challenges for Runtime Systems
  • It is crucial for heterogeneous runtime systems to embrace the different accelerators in a cluster with respect to performance and power
  • Examples of OpenCL runtime systems: SnuCL, VOCL, SOCL
  • Challenges:
    • Efficiently choose the right device for the right task
    • Keep the decision-making overhead minimal

  8. Our Contributions
  • An online workload-characterization technique for OpenCL kernels
  • A model that projects the relative ranking of different devices with little overhead
  • An end-to-end evaluation of the technique on multiple architectural families of AMD and NVIDIA GPUs

  9. Outline • Introduction • Motivation • Contributions • Design • Evaluation • Conclusion

  10. Design
  • Goal: rank accelerators for a given OpenCL workload
    • Accurately AND efficiently
    • Decision making with minimal overhead

  11. Design
  • Goal: rank accelerators for a given OpenCL workload
    • Accurately AND efficiently
    • Decision making with minimal overhead
  • Choices:
    • Static code analysis:
      • Fast
      • Inaccurate, as it does not account for dynamic properties: input data dependence, memory access patterns, dynamic instruction counts

  12. Design
  • Goal: rank accelerators for a given OpenCL workload
    • Accurately AND efficiently
    • Decision making with minimal overhead
  • Choices:
    • Static code analysis:
      • Fast
      • Inaccurate, as it does not account for dynamic properties: input data dependence, memory access patterns, dynamic instruction counts
    • Dynamic code analysis:
      • Higher accuracy
      • Execute either on the actual device or through an “emulator”
      • Not always feasible to run on the actual devices: data-transfer costs, and clusters are “busy”
      • Emulators are very slow

  13. Design – Workload Profiling [Diagram: an emulator executes the OpenCL kernel and extracts its instruction mix, memory access patterns, and bank conflicts]

  14. Design – Workload Profiling [Diagram: a mini-emulator executes the OpenCL kernel and extracts its instruction mix, memory access patterns, and bank conflicts]
  • “Mini-emulation”: emulate a single workgroup only
  • Collect dynamic characteristics:
    • Instruction traces
    • Global and local memory transactions and access patterns
  • In typical data-parallel workloads, workgroups exhibit similar runtime characteristics
  • Asymptotically lower overhead than full-kernel emulation (see the sketch after this slide)
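
As a rough illustration of what mini-emulation produces, the sketch below models the per-workgroup profile as a small record that is later scaled by the number of workgroups. The names (WorkgroupProfile, scale_to_full_kernel) are ours, not the authors'; the scaling assumes every workgroup behaves like the emulated one.

```python
# Illustrative sketch (not the authors' code) of a single-workgroup profile
# gathered by mini-emulation, and of scaling it to the full kernel under the
# assumption that all workgroups behave alike.
from dataclasses import dataclass, field

@dataclass
class WorkgroupProfile:
    instruction_mix: dict = field(default_factory=dict)  # e.g. {"fp32": 120000, "int": 30000}
    global_mem_transactions: int = 0   # coalesced global-memory transactions
    local_mem_transactions: int = 0    # local (shared) memory transactions
    bank_conflicts: int = 0            # serialized local-memory accesses

def scale_to_full_kernel(wg: WorkgroupProfile, num_workgroups: int) -> WorkgroupProfile:
    """Extrapolate one workgroup's dynamic behavior to the whole kernel."""
    return WorkgroupProfile(
        instruction_mix={op: n * num_workgroups for op, n in wg.instruction_mix.items()},
        global_mem_transactions=wg.global_mem_transactions * num_workgroups,
        local_mem_transactions=wg.local_mem_transactions * num_workgroups,
        bank_conflicts=wg.bank_conflicts * num_workgroups,
    )
```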

  15. Design – Device Profiling [Diagram: instruction and memory microbenchmarks run on GPU 1 … GPU N to produce per-device throughput profiles]

  16. Design – Device Profiling
  • Build device throughput profiles:
    • Modified SHOC microbenchmarks to obtain hardware throughput at varying occupancy
    • Collect throughputs for instructions, global memory, and local memory
  • Built only once per device (see the sketch after this slide)
  [Figure: global and local memory throughput profile of the AMD Radeon HD 7970]
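
The device profile can be pictured as a lookup table from occupancy to measured throughput, built once per GPU. The sketch below is illustrative only; the class name, the lookup rule, and the example numbers are our assumptions, not data from the paper or from SHOC.

```python
# Illustrative sketch of a per-device throughput profile built once from
# microbenchmarks. Class name, lookup rule, and example numbers are assumptions.
from bisect import bisect_right

class DeviceProfile:
    def __init__(self, name, inst_gops, gmem_gbps, lmem_gbps):
        # Each dict maps an occupancy level in (0, 1] to a measured throughput.
        self.name = name
        self.inst_gops = inst_gops   # giga-instructions per second
        self.gmem_gbps = gmem_gbps   # global-memory bandwidth, GB/s
        self.lmem_gbps = lmem_gbps   # local-memory bandwidth, GB/s

    @staticmethod
    def _lookup(profile, occupancy):
        # Use the measurement at the highest benchmarked occupancy <= the request.
        points = sorted(profile)
        i = max(bisect_right(points, occupancy) - 1, 0)
        return profile[points[i]]

    def throughputs(self, occupancy):
        return (self._lookup(self.inst_gops, occupancy),
                self._lookup(self.gmem_gbps, occupancy),
                self._lookup(self.lmem_gbps, occupancy))

# Example with made-up measurement points at 25%, 50%, and 100% occupancy.
hd7970 = DeviceProfile("AMD HD 7970",
                       inst_gops={0.25: 900, 0.5: 1800, 1.0: 3500},
                       gmem_gbps={0.25: 120, 0.5: 210, 1.0: 260},
                       lmem_gbps={0.25: 900, 0.5: 1700, 1.0: 3400})
```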

  17. Design – Find Performance Limiter [Diagram: the device profile is combined with the workload profile (instruction mix, memory access patterns, bank conflicts)]

  18. Design – Find Performance Limiter
  • Single-workgroup dynamic characteristics -> full-kernel characteristics, using device occupancy as the scaling factor
  • Compute projected theoretical times for instructions (t_compute), global memory (t_global), and local memory (t_local)
  • GPUs aggressively try to hide the latencies of these components, so the performance limiter is max(t_local, t_global, t_compute)*
  • Compare the normalized projected times across devices and choose the best one (see the sketch after this slide)
  *Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
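
The limiter rule on this slide translates almost directly into code. The sketch below reuses the WorkgroupProfile and DeviceProfile sketches above and is only a schematic of the projection step; the transaction size and the ranking helper are our assumptions, not the authors' implementation.

```python
# Schematic of the projection and ranking step, reusing the sketches above.
BYTES_PER_TRANSACTION = 128   # assumed memory-transaction size; device dependent

def project_time(kernel, device, occupancy):
    """Return (projected time in seconds, limiting component) for one device."""
    inst_gops, gmem_gbps, lmem_gbps = device.throughputs(occupancy)
    t_compute = sum(kernel.instruction_mix.values()) / (inst_gops * 1e9)
    t_global = kernel.global_mem_transactions * BYTES_PER_TRANSACTION / (gmem_gbps * 1e9)
    t_local = ((kernel.local_mem_transactions + kernel.bank_conflicts)
               * BYTES_PER_TRANSACTION / (lmem_gbps * 1e9))
    # The GPU overlaps the three components, so the slowest one dominates.
    return max((t_compute, "compute"), (t_global, "gmem"), (t_local, "lmem"))

def rank_devices(kernel, devices, occupancies):
    """Order candidate GPUs by projected kernel time, best first."""
    projected = {d.name: project_time(kernel, d, occupancies[d.name]) for d in devices}
    return sorted(projected.items(), key=lambda kv: kv[1][0])
```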

  19. Design [Diagram, static profiling stage: instruction and memory benchmarks run on GPU 1 … GPU N to build the device profiles]

  20. Design [Diagram, static plus dynamic profiling: device profiles from the benchmarks, and a mini-emulator that runs a single workgroup of the GPU kernel to extract its instruction mix, memory access patterns, and bank conflicts]

  21. Design [Diagram, complete pipeline: the device profiles and the mini-emulation results are combined into effective instruction throughput, effective global memory bandwidth, and effective local memory bandwidth; the performance limiter then determines the relative GPU performances]
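
Putting the pieces together, a hypothetical end-to-end use of the sketches above might look like the following; all device objects and numbers are invented for illustration.

```python
# Hypothetical end-to-end use of the sketches above (all numbers invented).
c2050 = DeviceProfile("NVIDIA C2050",
                      inst_gops={0.25: 350, 0.5: 700, 1.0: 1000},
                      gmem_gbps={0.25: 70, 0.5: 110, 1.0: 140},
                      lmem_gbps={0.25: 400, 0.5: 700, 1.0: 1300})

wg = WorkgroupProfile(instruction_mix={"fp32": 120_000, "int": 30_000},
                      global_mem_transactions=4_096,
                      local_mem_transactions=8_192,
                      bank_conflicts=512)
kernel = scale_to_full_kernel(wg, num_workgroups=1024)

ranking = rank_devices(kernel, devices=[hd7970, c2050],
                       occupancies={"AMD HD 7970": 0.80, "NVIDIA C2050": 0.67})
best_name, (best_time, limiter) = ranking[0]
print(best_name, best_time, limiter)   # the predicted best GPU and its bottleneck
```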

  22. Outline • Introduction • Motivation • Contributions • Design • Evaluation • Conclusion

  23. Experimental Setup
  • Accelerators:
    • AMD Radeon HD 7970: scalar ALUs, cache hierarchy
    • AMD Radeon HD 5870: VLIW ALUs
    • NVIDIA C2050: Fermi architecture, cache hierarchy
    • NVIDIA C1060: Tesla architecture
  • Simulators: Multi2Sim v4.1 for AMD devices and GPGPU-Sim v3.0 for NVIDIA devices; the methodology is agnostic to the specific emulator
  • Applications: [Table: benchmark applications]

  24. Application Boundedness: AMD GPUs [Chart: projected time (normalized) per application, with the limiting component (compute, lmem, or gmem) marked for each application; most applications are gmem-bound]

  25. Application Boundedness Summary

  26. Accuracy of Performance Projection

  27. Accuracy of Performance Projection

  28. Accuracy of Performance Projection

  29. Emulation Overhead – Reduction Kernel

  30. Outline • Introduction • Motivation • Contributions • Design • Evaluation • Conclusion

  31. 90/10 Paradigm -> 10x10 Paradigm
  • Simplified and specialized tools (“accelerators”) customized for different purposes (“applications”)
    • Narrower focus on applications (10%)
    • A simplified, specialized accelerator for each classification
  • Why? 10x lower power and 10x faster -> 100x more energy efficient
  Figure credit: A. Chien, Salishan Conference 2010

  32. Conclusion
  • We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels
  • The approach is shown to be sufficiently accurate for relative performance projection
  • The approach has asymptotically lower overhead than projection using full-kernel emulation
  • Our technique is shown to work well with multiple architectural families of AMD and NVIDIA GPUs
  • With the increasing diversity in accelerators (towards 10x10*), our methodology only becomes more relevant
  *S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011

  33. Thank You

  34. Backup

  35. Evolution of Microprocessors: 90/10 Paradigm
  • Derive common cases for applications (90%)
    • Broad focus on application workloads
    • Architectural improvements for 90% of cases
  • Design an aggregated, generic “core”
  • Less customizability for applications
  Figure credit: A. Chien, Salishan Conference 2010

  36. 90/10 Paradigm -> 10x10 Paradigm
  • Simplified and specialized tools (“accelerators”) customized for different purposes (“applications”)
    • Narrower focus on applications (10%)
    • A simplified, specialized accelerator for each classification
  • Why? 10x lower power and 10x faster -> 100x more energy efficient
  Figure credit: A. Chien, Salishan Conference 2010

  37. Application Boundedness: NVIDIA GPUs [Chart: projected time (normalized) per application, with the limiting component (compute, lmem, or gmem) marked for each application]

  38. Evaluation: Projection Accuracy (Relative to C1060)

  39. Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication

  40. Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication

  41. Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction
