
PGI Compilers & Tools Update - March 2018

Check out the March 2018 PGI Compilers & Tools update for news about the HPC SDK, OpenACC, CUDA, and more.

nvidia

Presentation Transcript


  1. PGI COMPILERS & TOOLS UPDATE PGI Compilers for Heterogeneous Supercomputing, March 2018

  2. PGI - THE NVIDIA HPC SDK. Fortran, C & C++ Compilers: Optimizing, SIMD Vectorizing, OpenMP. Accelerated Computing Features: OpenACC Directives, CUDA Fortran. Multi-Platform Solution: Multicore x86-64 and OpenPOWER CPUs, NVIDIA Tesla GPUs; Supported on Linux, macOS, Windows. MPI/OpenMP/OpenACC Tools: Debugger, Performance Profiler; Interoperable with DDT, TotalView.

  3. OPENACC FOR EVERYONE: PGI Community Edition Now Available. The slide compares three editions, the first of which is FREE. PROGRAMMING MODELS: OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran Compilers and Tools. PLATFORMS: X86, OpenPOWER, NVIDIA GPU. UPDATES: 1-2 times a year | 6-9 times a year | 6-9 times a year. SUPPORT: User Forums | PGI Support | PGI Premier Services. LICENSE: Annual | Perpetual | Volume/Site.

  4. Latest CPUs Support: Intel Skylake, AMD Zen, IBM POWER9. Full OpenACC 2.6. OpenMP 4.5 for multicore CPUs. AVX-512 code generation. Integrated CUDA 9.1 toolkit/libraries. New fastmath intrinsics library. Partial C++17 support. Optional LLVM-based x86 code generator. pgicompilers.com/whats-new

  5. SPEC ACCEL 1.2 BENCHMARKS. [Bar charts, panels "OpenACC" and "OpenMP 4.5"; vertical axis: GEOMEAN seconds, 0 to 200; compilers: Intel 2018 and PGI 18.1; systems: 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), 2-socket Broadwell (40 cores / 80 threads), 1x Volta V100; annotation: 4.4x speed-up.] Performance measured February, 2018. Skylake: Two 20 core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: Two 24 core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: Two 20 core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. Volta: NVIDIA DGX1 system with two 20 core Intel Xeon E5-2698 v4 CPUs @ 2.20GHz, 256GB memory, one NVIDIA Tesla V100-SXM2-16GB GPU @ 1.53GHz. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).

  6. SPEC CPU 2017 FP SPEED BENCHMARKS. [Bar chart; vertical axis: GEOMEAN seconds, 0 to 200; compilers: Intel 2018 and PGI 18.1; systems: 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), 2-socket Broadwell (40 cores / 80 threads).] Performance measured February, 2018. Skylake: Two 20 core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: Two 24 core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: Two 20 core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).

  7. OPENACC UPDATE

  8. OPENACC DIRECTIVES. Incremental • Single source base • Interoperable • Performance portable • CPU, GPU, Manycore
      #pragma acc data copyin(a,b) copyout(c)    /* Manage Data Movement */
      {
        ...
        #pragma acc parallel                     /* Initiate Parallel Execution */
        {
          #pragma acc loop gang vector           /* Optimize Loop Mappings */
          for (i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
            ...
          }
        }
        ...
      }
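The three directive roles on this slide can be put into one compilable function. The following is a minimal sketch, not code from the presentation: the function name `vec_add` is hypothetical, and because it takes pointers rather than fixed-size arrays, the data clauses use explicit array sections such as `a[0:n]` instead of the bare `a,b` shown on the slide. A compiler without OpenACC support ignores the pragmas and runs the loop serially.

```c
#include <assert.h>

/* Hypothetical helper: element-wise c = a + b using the three
   directive roles named on the slide. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, int n)
{
    assert(n >= 0);

    /* Manage data movement: copy a and b to the device on entry,
       copy c back to the host on exit. */
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        /* Initiate parallel execution. */
        #pragma acc parallel
        {
            /* Optimize loop mapping: spread iterations across
               gangs and vector lanes. */
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                c[i] = a[i] + b[i];
            }
        }
    }
}
```

Calling `vec_add(a, b, c, n)` from ordinary host code is all that is needed; the source stays the same whether or not an accelerator is targeted.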

  9. OPENACC IS FOR MULTICORE CPUS & GPUS
       98 !$ACC KERNELS
       99 !$ACC LOOP INDEPENDENT
      100 DO k=y_min-depth,y_max+depth
      101 !$ACC LOOP INDEPENDENT
      102   DO j=1,depth
      103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
      104   ENDDO
      105 ENDDO
      106 !$ACC END KERNELS
      CPU:
      % pgfortran -ta=multicore -fast -Minfo=acc -c \
          update_tile_halo_kernel.f90
      . . .
      100, Loop is parallelizable
           Generating Multicore code
           100, !$acc loop gang
      102, Loop is parallelizable
      GPU:
      % pgfortran -ta=tesla -fast -Minfo=acc -c \
          update_tile_halo_kernel.f90
      . . .
      100, Loop is parallelizable
      102, Loop is parallelizable
           Accelerator kernel generated
           Generating Tesla code
           100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
           102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
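For readers working in C, the same pattern (a `kernels` region around a loop nest whose loops are asserted independent) can be sketched as below. This is an illustrative analogue, not CloverLeaf code: the function name, the row-major layout, and the `pitch` parameters are assumptions made for the example. The first `depth` columns of each row of `density` are treated as the halo and are filled from the right-hand edge of the neighbouring tile.

```c
#include <assert.h>

/* Hypothetical C analogue of the left-halo update on the slide.
   density:      rows x pitch, columns 0..depth-1 are the halo
   left_density: rows x left_pitch, neighbour tile without halo */
void update_left_halo(double *restrict density,
                      const double *restrict left_density,
                      int rows, int pitch, int left_pitch, int depth)
{
    assert(depth <= pitch && depth <= left_pitch);

    #pragma acc kernels copy(density[0:rows*pitch]) \
                        copyin(left_density[0:rows*left_pitch])
    {
        #pragma acc loop independent
        for (int k = 0; k < rows; ++k) {
            #pragma acc loop independent
            for (int j = 1; j <= depth; ++j) {
                /* Halo cell j to the left of the interior takes the
                   j-th cell counted from the neighbour's right edge. */
                density[k * pitch + (depth - j)] =
                    left_density[k * left_pitch + (left_pitch - j)];
            }
        }
    }
}
```

Assuming the PGI C compiler `pgcc` accepts the same options the slide passes to `pgfortran`, the file would be built with `-ta=multicore` for a CPU target or `-ta=tesla` for a GPU target, with no change to the source.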

  10. CLOVERLEAF. AWE Hydrodynamics mini-App, bm32 data set. http://uk-mac.github.io/CloverLeaf [Bar chart: speedup vs a single Haswell core for PGI 18.1 OpenACC and Intel 2018 OpenMP on multicore Haswell, multicore Broadwell, multicore Skylake, Kepler, Pascal, and Volta V100, with 1x, 2x, and 4x groupings; bar labels range from 7.6x to 142x.] Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01); Broadwell server, eight V100s (dgx07); Skylake: 2x20 core Xeon Gold server (sky-4). Compilers: Intel 2018.0.128, PGI 18.1. Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC). Data compiled by PGI February 2018.

  11. OPENACC UPTAKE IN HPC. Areas: Hackathons, Training, Community, Applications. • 3 of Top 5 HPC Apps: ANSYS Fluent & Gaussian released; VASP in development • 5 Events in 2017: all hackathons are initiated by users • OpenACC and DLI: 10 new modules; instructor certification • User Group: SC17 participation up 27% vs SC16 • 6–9 Events in 2018 • Slack Channel: 2x growth in last 6 months • 18+ 2018 workshops: ECMWF, KAUST, PSC, CESGA • 5 ORNL CAAR Codes: GTC, XGC, ACME, FLASH, LSDalton • 94 Codes GPU accelerated to date • Online Courses: 5K+ attended over last 3 years • Downloads: PGI Community Edition quarterly downloads up 136% in 2017 • 109 Apps total being tracked • Expertise: 94 mentors registered • Online Labs: 4.3K+ taken over last 3 years

  12. GAUSSIAN 16

      Parallelization Strategy. Within Gaussian 16, GPUs are used for a small fraction of code that consumes a large fraction of the execution time. The implementation of GPU parallelism conforms to Gaussian’s general parallelization strategy. Its main tenets are to avoid changing the underlying source code and to avoid modifications which negatively affect CPU performance. For these reasons, OpenACC was used for GPU parallelization.

      The Gaussian approach to parallelization relies on environment-specific parallelization frameworks and tools: OpenMP for shared-memory, Linda for cluster and network parallelization across discrete nodes, and OpenACC for GPUs. The process of implementing GPU support involved many different aspects:
      • Identifying places where GPUs could be beneficial. These are a subset of areas which are parallelized for other execution contexts because using GPUs requires fine grained parallelism.
      • Understanding and optimizing data movement/storage at a high level to maximize GPU efficiency.

      PGI Accelerator Compilers with OpenACC. PGI compilers fully support the current OpenACC standard as well as important extensions to it. PGI is an important contributor to the ongoing development of OpenACC. OpenACC enables developers to implement GPU parallelism by adding compiler directives to their source code, often eliminating the need for rewriting or restructuring. For example, the following Fortran compiler directive identifies a loop which the compiler should parallelize:
          !$acc parallel loop
      Other directives allocate GPU memory, copy data to/from GPUs, specify data to remain on the GPU, combine or split loops and other code sections, and generally provide hints for optimal work distribution management, and more.

      The OpenACC project is very active, and the specifications and tools are changing fairly rapidly. This has been true throughout the lifetime of this project. Indeed, one of its major challenges has been using OpenACC in the midst of its development. The talented people at PGI were instrumental in addressing issues that arose in one of the very first uses of OpenACC for a large commercial software package. PGI’s sophisticated profiling and performance evaluation tools were vital to the success of the effort.

      Specifying GPUs to Gaussian 16. The GPU implementation in Gaussian 16 is sophisticated and complex but using it is simple and straightforward. GPUs are specified with 1 additional Link 0 command (or equivalent Default.Route file entry/command line option). For example, the following commands tell Gaussian to run the calculation using 24 compute cores plus 8 GPUs+8 controlling cores (32 cores total):
          %CPU=0-31          Request 32 CPUs for the calculation: 24 cores for computation, and 8 cores to control GPUs (see below).
          %GPUCPU=0-7=0-7    Use GPUs 0-7 with CPUs 0-7 as their controllers.
      Detailed information is available on our website.

      Mike Frisch, Ph.D., President and CEO, Gaussian, Inc.: "Using OpenACC allowed us to continue development of our fundamental algorithms and software capabilities simultaneously with the GPU-related work. In the end, we could use the same code base for SMP, cluster/network and GPU parallelism. PGI's compilers were essential to the success of our efforts."

      Project Contributors: Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian), Brent Leback (NVIDIA/PGI), Giovanni Scalmani (Gaussian).

      Gaussian, Inc., 340 Quinnipiac St. Bldg. 40, Wallingford, CT 06492 USA. custserv@gaussian.com. Gaussian is a registered trademark of Gaussian, Inc. All other trademarks and registered trademarks are the properties of their respective holders. Specifications subject to change without notice. Copyright © 2017, Gaussian, Inc. All rights reserved.

  13. ANSYS FLUENT. Sunil Sathe, Lead Software Developer, ANSYS Fluent: "We’ve effectively used OpenACC for heterogeneous computing in ANSYS Fluent with impressive performance. We’re now applying this work to more of our models and new platforms."

  14. VASP. Prof. Georg Kresse, Computational Materials Physics, University of Vienna: "For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts. We’re excited to collaborate with NVIDIA and PGI as an early adopter of CUDA Unified Memory."

  15. MPAS-A. Richard Loft, Director, Technology Development, NCAR: "Our team has been evaluating OpenACC as a pathway to performance portability for the Model for Prediction Across Scales (MPAS) atmospheric model. Using this approach on the MPAS dynamical core, we have achieved performance on a single P100 GPU equivalent to 2.7 dual socketed Intel Xeon nodes on our new Cheyenne supercomputer." Image courtesy: NCAR

  16. NUMECA FINE/Open. David Gutzwiller, Lead Software Developer, NUMECA: "Porting our unstructured C++ CFD solver FINE/Open to GPUs using OpenACC would have been impossible two or three years ago, but OpenACC has developed enough that we’re now getting some really good results."

  17. COSMO. Dr. Oliver Fuhrer, Senior Scientist, Meteoswiss: "OpenACC made it practical to develop for GPU-based hardware while retaining a single source for almost all the COSMO physics code."

  18. GAMERA FOR GPU. Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Lalith Wijerathne, The University of Tokyo: "With OpenACC and a compute node based on NVIDIA's Tesla P100 GPU, we achieved more than a 14X speed up over a K Computer node running our earthquake disaster simulation code." Map courtesy University of Tokyo

  19. QUANTUM ESPRESSO. Filippo Spiga, Head of Research Software Engineering, University of Cambridge: "CUDA Fortran gives us the full performance potential of the CUDA programming model and NVIDIA GPUs. !$CUF KERNELS directives give us productivity and source code maintainability. It’s the best of both worlds."

  20. OPENACC AND CUDA UNIFIED MEMORY

  21. Programming GPU-Accelerated Systems: CUDA Unified Memory for Dynamically Allocated Data. [Diagram: the GPU developer view, with System Memory and GPU Memory connected over PCIe, next to the GPU developer view with CUDA Unified Memory, which shows a single Unified Memory.]

  22. PGI OpenACC and CUDA Unified Memory. Compiling with the -ta=tesla:managed option. C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory. [Diagram: GPU Developer View With CUDA Unified Memory.]
      #pragma acc data copyin(a,b) copyout(c)
      {
        ...
        #pragma acc parallel
        {
          #pragma acc loop gang vector
          for (i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
            ...
          }
        }
        ...
      }

  23. PGI OpenACC and CUDA Unified Memory. Compiling with the -ta=tesla:managed option. C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory. [Diagram: GPU Developer View With CUDA Unified Memory.] The same code with the data directive removed:
      ...
      #pragma acc parallel
      {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
          c[i] = a[i] + b[i];
          ...
        }
      }
      ...
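A self-contained version of the no-data-directive variant might look like the sketch below. It is an illustration under stated assumptions rather than code from the slides: the function name `vec_add_managed` and the self-check are hypothetical. The arrays come from plain `malloc`, and the compute region carries no data clauses, so on a GPU target it relies on the `-ta=tesla:managed` option described above to place the allocations in CUDA Unified Memory. Built without OpenACC, the pragmas are ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical example: add two malloc'd vectors with no OpenACC data
   directives. Returns the number of wrong elements (0 on success),
   or -1 if allocation fails. */
int vec_add_managed(int n)
{
    assert(n > 0);
    float *a = malloc((size_t)n * sizeof *a);
    float *b = malloc((size_t)n * sizeof *b);
    float *c = malloc((size_t)n * sizeof *c);
    int errors = -1;

    if (a && b && c) {
        for (int i = 0; i < n; ++i) {
            a[i] = (float)i;
            b[i] = 2.0f * (float)i;
        }

        /* No copyin/copyout: data placement is left to unified memory. */
        #pragma acc parallel
        {
            #pragma acc loop gang vector
            for (int i = 0; i < n; ++i) {
                c[i] = a[i] + b[i];
            }
        }

        /* Verify on the host: each element should equal 3*i. */
        errors = 0;
        for (int i = 0; i < n; ++i) {
            if (c[i] != 3.0f * (float)i) {
                ++errors;
            }
        }
    }

    free(a);
    free(b);
    free(c);
    return errors;
}
```

The trade-off is the one the slides point at: the source is shorter and closer to the serial original, while explicit data directives remain available when finer control over transfers is wanted.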

  24. GTC: An OpenACC Production Application. Being ported for runs on the ORNL Summit supercomputer. The gyrokinetic toroidal code (GTC) is a massively parallel, particle-in-cell production code for turbulence simulation in support of the burning plasma experiment ITER, the crucial next step in the quest for fusion energy. http://phoenix.ps.uci.edu/gtc_group

  25. GTC Performance using OpenACC. OpenPOWER | NVLink | Unified Memory | P100 | V100. [Bar chart: speed-up, axis 2x to 16x, for 20-core P8, P8+2xP100 UM, P8+2xP100 Data Directives, P8+4xP100 UM, P8+4xP100 Data Directives, and x64+4xV100 Data Directives; bar labels 5.9X, 6.1X, 12X, 12.1X, and 16.5X.] P8: IBM POWER8NVL, 2 sockets, 20 cores, NVLINK. UM: No Data Directives in sources, compiled with -ta=tesla:managed.

  26. OPENACC RESOURCES. Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow. Resources: https://www.openacc.org/resources. Success Stories: https://www.openacc.org/success-stories. Compilers and Tools: https://www.openacc.org/tools. Events: https://www.openacc.org/events. Support Options: Slack at www.openacc.org/community#slack; PGI User Forums at pgicompilers.com/userforum; Stack Overflow at stackoverflow.com/questions/tagged/openacc.
