
Achievements and challenges running GPU-accelerated Quantum ESPRESSO on heterogeneous clusters



  1. Achievements and challenges running GPU-accelerated Quantum ESPRESSO on heterogeneous clusters. Filippo Spiga 1,2 <fs395@cam.ac.uk>. 1 HPCS, University of Cambridge. 2 Quantum ESPRESSO Foundation.

  2. What is Quantum ESPRESSO? • Quantum ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves • "ESPRESSO" stands for opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization • Quantum ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide • Quantum ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development

  3. What can Quantum ESPRESSO do?
  • ground-state calculations
    • Kohn-Sham orbitals and energies, total energies and atomic forces
    • finite as well as infinite systems
    • any crystal structure or supercell
    • insulators and metals (different schemes of BZ integration)
    • structural optimization (many minimization schemes available)
    • transition states and minimum-energy paths (via NEB or string dynamics)
    • electronic polarization via Berry's phase
    • finite electric fields via saw-tooth potential or electric enthalpy
    • norm-conserving as well as ultra-soft and PAW pseudo-potentials
    • many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
    • scalar-relativistic as well as fully relativistic (spin-orbit) calculations
    • magnetic systems, including non-collinear magnetism
    • Wannier interpolations
  • ab-initio molecular dynamics
    • Car-Parrinello (many ensembles and flavors)
    • Born-Oppenheimer (many ensembles and flavors)
    • QM-MM (interface with LAMMPS)
  • linear response and vibrational dynamics
    • phonon dispersions, real-space interatomic force constants
    • electron-phonon interactions and superconductivity
    • effective charges and dielectric tensors
    • third-order anharmonicities and phonon lifetimes
    • infrared and (off-resonance) Raman cross sections
    • thermal properties via the quasi-harmonic approximation
  • electronic excited states
    • TDDFT for very large systems (both real-time and "turbo-Lanczos")
    • MBPT for very large systems (GW, BSE)
  ... plus several post-processing tools!

  4. Quantum ESPRESSO in numbers • 350,000+ lines of FORTRAN/C code • 46 registered developers • 1600+ registered users • 5700+ downloads of the latest 5.x.x version • 2 websites (quantum-espresso.org & qe-forge.org) • 1 official user mailing list, 1 official developer mailing list • 24 international schools and training courses (1000+ participants)

  5. PWscf in a nutshell: program flow
  [Flow diagram: the SCF cycle alternates steps dominated by 3D-FFT + GEMM + LAPACK, 3D-FFT, and 3D-FFT + GEMM kernels.]

  6. Spoiler! • Only PWscf ported to GPU • Serial performance (full socket vs full socket + GPU): 3x ~ 4x • Parallel performance (best MPI+OpenMP vs ... + GPU): 2x ~ 3x • Designed to run best on a low number of nodes (efficiency is not high) • Spin magnetization and non-collinear calculations not yet ported (work in progress) • I/O kept low on purpose • NVIDIA Kepler GPUs not yet exploited at their best (work in progress)

  7. Achievement: smart and selective BLAS. phiGEMM: CPU+GPU GEMM operations
  • A drop-in library won't work as expected; control is needed
  • overcomes the limit of the GPU memory
  • flexible interface (C on the HOST, C on the DEVICE)
  • dynamic workload adjustment (SPLIT), heuristic
  • call-by-call profiling capabilities
  [Diagram: C = A × B is split so that the GPU computes C1 = A1 × B (with H2D/D2H copies) while the CPU computes C2 = A2 × B; an unbalanced split leaves one side idle. A minimal sketch of the idea follows.]
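
A minimal sketch of the split idea, not the actual phiGEMM code: the GPU handles the first n_gpu columns of C via cuBLAS while the host BLAS works on the remaining columns, and the gpu_ratio argument stands in for phiGEMM's runtime heuristic. The function name and interface are hypothetical.

```c
/* Hypothetical sketch of a phiGEMM-style CPU+GPU split of
 * C = alpha*A*B + beta*C (column-major).  The GPU computes the first
 * n_gpu columns of C, the host BLAS the rest, concurrently. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cblas.h>

void split_dgemm(int m, int n, int k, double alpha,
                 const double *A, const double *B, double beta, double *C,
                 double gpu_ratio /* fraction of columns sent to the GPU */)
{
    int n_gpu = (int)(n * gpu_ratio);   /* columns handled by the GPU  */
    int n_cpu = n - n_gpu;              /* columns handled by the host */

    cublasHandle_t handle;
    cublasCreate(&handle);

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(double) * m * k);
    cudaMalloc((void **)&dB, sizeof(double) * k * n_gpu);
    cudaMalloc((void **)&dC, sizeof(double) * m * n_gpu);

    /* H2D: whole A, plus the first n_gpu columns of B and C */
    cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * k * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * m * n_gpu, cudaMemcpyHostToDevice);

    /* GPU part (asynchronous launch): C(:,1:n_gpu) */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* CPU part runs on the remaining columns while the GPU computes */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n_cpu, k,
                alpha, A, m, B + (size_t)k * n_gpu, k,
                beta,  C + (size_t)m * n_gpu, m);

    /* D2H: bring the GPU block of C back (synchronizes with the GPU GEMM) */
    cudaMemcpy(C, dC, sizeof(double) * m * n_gpu, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```

If gpu_ratio is badly chosen, one side waits on the other, which is exactly the "unbalance" pictured on the slide.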

  8. Challenge: rectangular GEMM, bad shape means poor performance
  • Issues:
    • A and B can be larger than the GPU memory
    • A and B matrices are "badly" rectangular (one dominant dimension), the common case due to data distribution
  • Solutions: ~ +15% performance
    • tiling approach: tiles not too big, not too small
    • GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
    • handling the "SPECIAL-K" case: add beta × C once, then accumulate alpha × Ai × Bi over the k tiles (see the sketch below)
  Optimizations included in phiGEMM (version > 1.9)
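
A hedged sketch of the "SPECIAL-K" tiling described above (not the phiGEMM implementation): A (m × k) and B (k × n) are streamed to the GPU in k-tiles, beta × C is applied only with the first tile, and later tiles accumulate with beta = 1. The function name, the k_tile size, and the assumption that C already resides on the GPU are illustrative.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

void special_k_dgemm(cublasHandle_t handle, int m, int n, int k, int k_tile,
                     double alpha, const double *A, const double *B,
                     double beta, double *dC /* C resident on GPU, m x n */)
{
    double *dA, *dB;
    cudaMalloc((void **)&dA, sizeof(double) * m * k_tile);
    cudaMalloc((void **)&dB, sizeof(double) * k_tile * n);

    for (int k0 = 0; k0 < k; k0 += k_tile) {
        int kc = (k - k0 < k_tile) ? (k - k0) : k_tile;

        /* k-slice of A: columns k0..k0+kc-1 are contiguous in column-major */
        cudaMemcpy(dA, A + (size_t)m * k0, sizeof(double) * m * kc,
                   cudaMemcpyHostToDevice);
        /* k-slice of B: rows k0..k0+kc-1, copied as a strided 2D block */
        cudaMemcpy2D(dB, sizeof(double) * kc,
                     B + k0, sizeof(double) * k,
                     sizeof(double) * kc, n, cudaMemcpyHostToDevice);

        /* beta is applied once; subsequent tiles accumulate into C */
        double b = (k0 == 0) ? beta : 1.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, kc,
                    &alpha, dA, m, dB, kc, &b, dC, m);
    }
    cudaFree(dA); cudaFree(dB);
}
```

In a production code the H2D copies would be double-buffered on streams so that each tile's transfer hides behind the previous tile's GEMM; they are synchronous here for clarity.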

  9. Challenge: parallel 3D-FFT
  • 3D-FFT burns up to 40%~45% of the total SCF run-time
  • roughly 90% of the 3D-FFTs in PWscf are inside vloc_psi (the "wave" grid)
  • each 3D-FFT is "small": < 300³ COMPLEX DP
  • the 3D-FFT grid need not be a cube
  • in serial a 3D-FFT is called as it is; in parallel, 3D-FFT = Σ 1D-FFT
  • in serial the data layout is straightforward; in parallel it is not
  • MPI communication becomes the big issue for many-node problems
  • GPU FFT is mainly memory bound → group and batch the 3D-FFTs (see the batching sketch below)
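
A minimal sketch of the grouping/batching idea using cuFFT's batched plans: several same-size 3D transforms are submitted as one launch, which helps a memory-bound GPU FFT amortize launch and bandwidth costs. The grid size and batch count below are illustrative, not taken from QE.

```c
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int nx = 128, ny = 128, nz = 96;   /* non-cubic grid is allowed   */
    const int batch = 8;                     /* e.g. several bands at once  */
    int n[3] = { nx, ny, nz };

    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data,
               sizeof(cufftDoubleComplex) * (size_t)nx * ny * nz * batch);

    /* One plan transforms `batch` contiguous 3D grids in a single call */
    cufftHandle plan;
    cufftPlanMany(&plan, 3, n,
                  NULL, 1, nx * ny * nz,    /* input layout: packed grids  */
                  NULL, 1, nx * ny * nz,    /* output layout: packed grids */
                  CUFFT_Z2Z, batch);

    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);  /* G -> R space */
    /* ... multiply by V(r) in real space, then transform back ...      */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);  /* R -> G space */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```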

  10. Challenge: FFT data layout, it is all about sticks & planes
  • A single 3D-FFT is divided into independent 1D-FFTs
  • There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4 Ecut)
  • Data are not contiguous and not "trivially" distributed across processors
  • Zeros are not transformed; the different cut-offs preserve accuracy

  11. Challenge: parallel 3D-FFT, Optimization #1
  • CUDA-enabled MPI for P2P (within the socket)
  • Overlap FFT computation with MPI communication (a sketch follows)
  • MPI communication >>> FFT computation on many nodes
  [Timeline diagram: synchronization, memcpy H2D, MPI exchange, memcpy D2H.]
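
A hedged sketch of the overlap, not the actual QE code: the FFT and device-to-host copy of chunk c run asynchronously on a CUDA stream while the host exchanges the previously finished chunk over MPI. With a CUDA-aware MPI the staging copy could be dropped and device pointers passed to MPI directly, as the slide's "CUDA-enabled MPI" bullet suggests. The chunking scheme, function name, and counts are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufft.h>

void fft_exchange_overlapped(cufftHandle plan, cufftDoubleComplex *d_buf,
                             cufftDoubleComplex *h_buf, /* pinned host buffer */
                             int nchunks, size_t chunk_elems, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);
    int count = (int)(chunk_elems / nranks);  /* elements exchanged per rank */

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cufftSetStream(plan, stream);

    for (int c = 0; c < nchunks; ++c) {
        cufftDoubleComplex *d_chunk = d_buf + (size_t)c * chunk_elems;
        cufftDoubleComplex *h_chunk = h_buf + (size_t)c * chunk_elems;

        /* Launch FFT + D2H copy of chunk c asynchronously on the stream */
        cufftExecZ2Z(plan, d_chunk, d_chunk, CUFFT_INVERSE);
        cudaMemcpyAsync(h_chunk, d_chunk,
                        chunk_elems * sizeof(cufftDoubleComplex),
                        cudaMemcpyDeviceToHost, stream);

        /* Overlap: exchange chunk c-1, which is already on the host */
        if (c > 0)
            MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                         h_buf + (size_t)(c - 1) * chunk_elems, count,
                         MPI_C_DOUBLE_COMPLEX, comm);

        cudaStreamSynchronize(stream);  /* chunk c is now ready on the host */
    }
    /* The last chunk still needs its exchange */
    MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 h_buf + (size_t)(nchunks - 1) * chunk_elems, count,
                 MPI_C_DOUBLE_COMPLEX, comm);

    cudaStreamDestroy(stream);
}
```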

  12. Challenge: parallel 3D-FFT
  Optimization #2. Observation: overlapping the D2H copy is limited by the MPI communication
  • pinned (page-locked) host memory is needed (!!!)
  • stream the D2H copy to hide the CPU copy and the FFT computation
  Optimization #3. Observation: MPI "packets" become small for many nodes
  • re-order data before communication
  • batch the MPI_Alltoallv communications
  Optimization #4. Idea: reduce the data transmitted (risky...)
  • perform FFTs and GEMMs in DP, truncate the data to SP before communication (see the sketch below)
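
A small sketch of Optimization #4: demote the double-precision FFT data to single precision just for the MPI exchange (halving the bytes on the wire), then promote it back afterwards. The kernel names are illustrative, not part of QE.

```c
#include <cuda_runtime.h>
#include <cuComplex.h>

/* DP -> SP before communication */
__global__ void demote_z2c(const cuDoubleComplex *in, cuComplex *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_cuComplex((float)in[i].x, (float)in[i].y);
}

/* SP -> DP after communication */
__global__ void promote_c2z(const cuComplex *in, cuDoubleComplex *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_cuDoubleComplex((double)in[i].x, (double)in[i].y);
}
```

The demoted buffer would be passed to MPI_Alltoallv in place of the DP one; the numerical impact has to be validated, which is why the slide marks this optimization as "risky".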

  13. Achievements: parallel 3D-FFT, miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)
  • Optimization #1: +37% improvement in communication
  • Optimization #2: gain depends on proper stream management (see chart)
  • Optimization #3: +10% improvement in communication
  • Optimization #4 (SP vs DP): +52% (!!!) improvement in communication
  • The gains are lower in the full PWscf!

  14. Challenge: parallel 3D-FFT
  [Diagram comparing (1) copying all FFT data of all FFTs back to host memory with (2) re-ordering data on the GPU before GPU-GPU communication. Image courtesy of D. Stoic.]

  15. Challenge: H*psi, compute/update H * psi
  • compute kinetic and non-local terms (in G space), complexity: Ni × (N × Ng + Ng × N × Np)
  • loop over the (not yet converged) bands:
    • FFT(psi) to R space, complexity: Ni × Nb × FFT(Nr)
    • compute V * psi, complexity: Ni × Nb × Nr
    • FFT(V * psi) back to G space, complexity: Ni × Nb × FFT(Nr)
  • compute Vexx, complexity: Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))
  Legend: N = 2 × Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may differ from Nk). A compact transcription of these cost estimates follows.
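
The slide's cost estimates, written compactly in the notation of the legend above (a transcription, not a new derivation):

```latex
\begin{align*}
  C_{\mathrm{kin+nl}} &\sim N_i \left( N\,N_g + N_g\,N\,N_p \right) \\
  C_{\mathrm{FFT}}    &\sim N_i\,N_b\,\mathrm{FFT}(N_r)
      \quad \text{(once per direction, } G\to R \text{ and } R\to G\text{)} \\
  C_{V\psi}           &\sim N_i\,N_b\,N_r \\
  C_{V_{\mathrm{exx}}} &\sim N_i\,N_c\,N_q\,N_b \left( 5\,N_r + 2\,\mathrm{FFT}(N_r) \right)
\end{align*}
```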

  16. Challenge: H*psi, the non-converged electronic bands dilemma. The number of bands still to converge, and hence the number of FFTs, is not predictable across the SCF iterations.

  17. Challenge: parallel 3D-FFT, the orthogonal approach
  Considerations:
  • memory on the GPU → ATLAS K40 (12 GByte)
  • (still) too much communication → GPU Direct capability needed
  • enough 3D-FFTs to batch → not predictable in advance
  • benefits the CPU-only code too!
  [Diagram: the distributed PSI is gathered ("MPI_Allgatherv") onto multiple local grids, transformed G→R with CUFFT, the products with the potential are computed locally, transformed back R→G, and the HPSI contributions are redistributed ("MPI_Allscatterv"); overlapping is possible. A rough sketch follows.]
  Not ready for production yet
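
A very rough, hypothetical sketch of this "orthogonal" approach: instead of distributing each 3D FFT across all ranks, whole bands are gathered onto single ranks, transformed locally on the GPU as complete 3D grids, and the results redistributed afterwards. The real code must also map plane-wave coefficients onto the dense grid; that step, the counts, the displacements, and the band-to-rank assignment are all omitted or placeholders here.

```c
#include <complex.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufft.h>

void local_band_fft(MPI_Comm comm, int nx, int ny, int nz,
                    const double _Complex *psi_local, int sendcount,
                    const int *recvcounts, const int *displs,
                    double _Complex *psi_full /* dense grid of a band owned here */)
{
    /* 1. Gather the distributed coefficients of "my" band (slide: MPI_Allgatherv) */
    MPI_Allgatherv(psi_local, sendcount, MPI_C_DOUBLE_COMPLEX,
                   psi_full, (int *)recvcounts, (int *)displs,
                   MPI_C_DOUBLE_COMPLEX, comm);

    /* 2. Run the full 3D FFT locally on the GPU: no inter-node traffic here */
    size_t grid = (size_t)nx * ny * nz;
    cufftDoubleComplex *d_grid;
    cudaMalloc((void **)&d_grid, grid * sizeof(cufftDoubleComplex));
    cudaMemcpy(d_grid, psi_full, grid * sizeof(cufftDoubleComplex),
               cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);
    cufftExecZ2Z(plan, d_grid, d_grid, CUFFT_INVERSE);   /* G -> R */
    /* ... multiply by V(r), transform back with CUFFT_FORWARD, then
     * redistribute the result (the slide's "MPI_Allscatterv" step) ... */

    cufftDestroy(plan);
    cudaFree(d_grid);
}
```

The attraction, as the slide notes, is that the communication pattern becomes a pair of collectives around purely local FFTs, which can be overlapped with the GPU work.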

  18. Challenge: eigen-solvers, which library?
  • LAPACK → MAGMA (ICL, University of Tennessee)
    • hybrid approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
    • single and multi-GPU, not memory-distributed (yet)
    • some (inevitable) numerical "discrepancies"
  • ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)
    • ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK
    • ELPA-GPU proof-of-concept based on CUDA FORTRAN
    • effective results below expectation
  • Lanczos diagonalization with tridiagonal QR algorithm (Penn State)
    • simple (too simple?) and designed to be GPU friendly
    • takes advantage of GPU Direct
    • experimental, needs testing and validation

  19. HPC Machines
  WILKES (HPCS) [DELL]:
  • 128 nodes, dual-socket
  • dual 6-core Intel Ivy Bridge
  • dual NVIDIA K20c per node
  • dual Mellanox Connect-IB FDR
  • #2 Green500 Nov 2013 (~3632 MFlops/W)
  TITAN (ORNL) [CRAY]:
  • 18688 nodes, single-socket
  • single 16-core AMD Opteron
  • one NVIDIA K20x per node
  • Gemini interconnect
  • #2 Top500 Jun 2013 (~17.59 PFlops Rmax)

  20. Achievement: save power (serial multi-threaded CPU vs single GPU, NVIDIA Fermi generation)
  [Chart: power/energy savings of 54% to 58% with speed-ups of 3.1x to 3.67x. Tests run early 2012 @ ICHEC.]

  21. Achievement: improved time-to-solution
  [Chart: speed-ups between roughly 2.1x and 3.5x across the serial and parallel test cases. Parallel tests run on Wilkes; serial tests run on the SBN machine.]

  22. Challenge: running on the CRAY XK7
  Key differences...
  • AMD Bulldozer architecture, 2 cores share the same FPU pipeline → aprun -j 1
  • NUMA locality matters a lot, for both CPU-only and CPU+GPU → aprun -cc numa_node
  • GPU Direct over RDMA is not supported (yet?) → the GPU 3D-FFT path does not work
  • "unfriendly" scheduling policy → the input has to be really big
  Performance below expectation (< 2x). Tricks: many-pw.x, __USE_3D_FFT

  23. Challenge: educate users
  • The performance-portability myth
  • "configure, compile, run", the same as the CPU version
  • All dependencies (MAGMA, phiGEMM) are compiled by QE-GPU
  • No more than 2 MPI processes per GPU
  • Hyper-Q does not work automatically; an additional running daemon is needed
  • Forget about 1:1 output comparison
  • QE-GPU can run on every GPU, but some GPUs are better than others...

  24. Lessons learnt: being "heterogeneous" today and tomorrow
  • The GPU does not really improve code scalability, only time-to-solution
  • Re-think data distribution for massively parallel architectures
  • Deal with uncontrolled "numerical fluctuations" (the GPU magnifies these)
  • The "data movement" constraint will soon disappear → new Intel Xeon Phi "Knights Landing" and NVIDIA Project Denver expected by 2015
  • Looking for true alternatives, new algorithms
    • not easy: extensive validation _plus_ module dependencies
  • Performance is a function of human effort
  • Follow the mantra «Do what you are good at.»

  25. Links: • http://hpc.cam.ac.uk • http://www.quantum-espresso.org/ • http://foundation.quantum-espresso.org/ • http://qe-forge.org/gf/project/q-e/ • http://qe-forge.org/gf/project/q-e-gpu/ Thank you for your attention!
