Molecular Dynamics GPU Applications Catalog
Higher Ed & Research
May 8, 2012
Sections Included *
• Molecular Dynamics Applications Overview
• AMBER
• NAMD
• GROMACS
• LAMMPS
* In fullscreen mode, click on a link to view a particular module. Click on the NVIDIA logo on each slide to return to this page.
Molecular Dynamics (MD) Applications
GPU performance compared against a multi-core x86 CPU socket. GPU performance is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.
New/Additional MD Applications Ramping
GPU performance compared against a multi-core x86 CPU socket. GPU performance is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.
GPU Value to Molecular Dynamics
What
• Study disease & discover drugs
• Predict drug and protein interactions
Why
• Speed of simulations is critical
• Enables study of longer timeframes, larger systems, and more simulations
How
• GPUs increase throughput & accelerate simulations
GPU-Ready Applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD
GPU Test Drive pre-configured applications: AMBER 11, NAMD 2.8
AMBER 11: 4.6x performance increase with 2 GPUs at only 54% added cost*
* AMBER 11 Cellulose NPT on 2x Xeon X5670 CPUs + 2x Tesla M2090 GPUs (per node) vs. 2x Xeon X5670 CPUs (per node). Cost of a CPU node is assumed to be $9,333; cost of adding two M2090s to a single node is assumed to be $5,333.
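To make the price/performance claim above concrete, here is a minimal back-of-the-envelope sketch (Python; not part of the original deck) using the 4.6x speedup and 54% added-cost figures from the slide. The throughput-per-dollar framing is only an illustration.

```python
# A minimal sketch of the price/performance arithmetic above.
# The 4.6x speedup and 54% added cost come from the slide; the
# throughput-per-dollar framing is illustrative, not from the deck.

speedup = 4.6        # AMBER 11 Cellulose NPT: GPU node vs. CPU-only node
added_cost = 0.54    # the GPU node costs 54% more than the CPU-only node

perf_per_dollar_gain = speedup / (1.0 + added_cost)
print(f"Throughput per dollar improves by about {perf_per_dollar_gain:.1f}x")  # ~3.0x
```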
All Key MD Codes Are GPU Ready
• AMBER, NAMD, GROMACS, LAMMPS
• Life and materials sciences
• Great multi-GPU performance
• Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue
• Focus: scaling to large numbers of GPUs
Run AMBER Faster: Up to 5x Speed-Up with GPUs
Chart: GPU node vs. CPU-only node and CPU supercomputers, DHFR (NVE) benchmark, 23,558 atoms.
“…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind.”
Axel Kohlmeyer, Temple University
AMBER: Make Research More Productive with GPUs
Chart: 318% higher performance with GPU vs. no GPU, for 54% additional expense.
Base node configuration: dual Xeon X5670 CPUs and dual Tesla M2090 GPUs per node.
• Adding two M2090 GPUs to a node yields a > 4x performance increase
Run NAMD Faster: Up to 7x Speed-Up with GPUs
Benchmarks: STMV (1,066,628 atoms) and ApoA-1 (92,224 atoms).
Test platform: 1 node, dual Tesla M2090 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speed-up results, configuration, and test models.
STMV benchmark: NAMD 2.8b1 + unreleased patch. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs; performance numbers are for 2 M2070s + 8 cores (GPU+CPU) vs. 8 cores (CPU).
NAMD: Make Research More Productive with GPUs
Chart: 250% higher performance with GPU vs. no GPU, for 54% additional expense.
• Get up to a 250% performance increase (STMV, 1,066,628 atoms)
GROMACS Partnership Overview
• Erik Lindahl, David van der Spoel, and Berk Hess are the lead authors and project leaders; Szilárd Páll is a key GPU developer
• 2010: single-GPU support (OpenMM library in GROMACS 4.5)
• NVIDIA Dev Tech resources allocated to the GROMACS code
• 2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters
GROMACS 4.6 Release Features
• GROMACS multi-GPU support expected in April 2012
• GPU acceleration is a main focus: the majority of features will be accelerated in 4.6 in a transparent fashion
• PME simulations get special attention, and most of the effort will go into making these algorithms well load-balanced
• Reaction-field and cut-off simulations also run accelerated
• The list of features that are not GPU-accelerated will be quite short
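As a reminder of what the reaction-field and cut-off kernels mentioned above compute, the standard reaction-field pair potential (textbook form, not taken from the slides) modifies the Coulomb interaction inside the cut-off radius r_c:

\[
V_{\mathrm{rf}}(r_{ij}) = \frac{q_i q_j}{4\pi\varepsilon_0\,\varepsilon_r}\left(\frac{1}{r_{ij}} + k_{\mathrm{rf}}\, r_{ij}^{2} - c_{\mathrm{rf}}\right),
\qquad
k_{\mathrm{rf}} = \frac{\varepsilon_{\mathrm{rf}} - \varepsilon_r}{r_c^{3}\,\bigl(2\varepsilon_{\mathrm{rf}} + \varepsilon_r\bigr)},
\qquad
c_{\mathrm{rf}} = \frac{1}{r_c} + k_{\mathrm{rf}}\, r_c^{2},
\]

where ε_rf is the dielectric constant assumed beyond the cut-off; setting k_rf = c_rf = 0 recovers the plain cut-off Coulomb interaction. PME, by contrast, splits the electrostatics into a short-range real-space part and a long-range mesh part, which is why load balancing between the two gets special attention in 4.6.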
GROMACS 4.6 Alpha Release: Absolute Performance
• Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown
• Benchmark systems: RNase in water, 24,040 atoms in a cubic box and 16,816 atoms in a truncated dodecahedron box
• Settings: electrostatics cut-off auto-tuned >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Strong Scaling
Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on:
• Up to 40 cluster nodes with 80 GPUs
• Benchmark system: water box with 1.5M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR InfiniBand 40 Gb/s
GROMACS 4.6 Alpha Release: PME Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Reaction-Field Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Weak Scaling
• Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs: perfect weak scaling, with challenges remaining for strong scaling
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075
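For reference on the terminology used in the preceding scaling slides: strong scaling keeps the total problem size fixed while adding GPUs, whereas weak scaling grows the problem with the GPU count. A small illustrative sketch follows, with invented timings rather than the GROMACS results above.

```python
# Illustrative strong- vs. weak-scaling efficiency calculations.
# The timings below are invented for the example; they are NOT the
# GROMACS benchmark results shown in the preceding slides.

def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: ideal time on n GPUs is t1 / n."""
    return (t1 / tn) / n

def weak_scaling_efficiency(t1, tn):
    """Per-GPU problem size held constant: ideal time stays at t1."""
    return t1 / tn

# Strong scaling: the same box on 1 GPU vs. 8 GPUs (hypothetical times)
print(strong_scaling_efficiency(t1=100.0, tn=15.0, n=8))   # ~0.83 -> 83% efficient

# Weak scaling: 8x more particles on 8x more GPUs (hypothetical times)
print(weak_scaling_efficiency(t1=100.0, tn=104.0))         # ~0.96 -> near-ideal
```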
LAMMPS: Released GPU Features and Future Plans
LAMMPS, August 2009
• First GPU-accelerated support
LAMMPS, Aug. 22, 2011
• Selected accelerated non-bonded short-range potentials (SP, MP, DP support):
  • Lennard-Jones (several variants, with & without coulombic interactions)
  • Morse
  • Buckingham
  • CHARMM
  • Tabulated
  • Coarse grain SDK
  • Anisotropic Gay-Berne
  • RE-squared
  • “Hybrid” combinations (GPU-accelerated & non-GPU-accelerated)
• Particle-Particle Particle-Mesh (SP or DP)
• Neighbor list builds
Longer Term*
• Improve performance on smaller particle counts (the neighbor list is the problem)
• Improve long-range performance (the MPI/Poisson solve is the problem)
• Additional pair potential support (including expensive advanced force fields); see the “Tremendous Opportunity for GPUs” slide*
• Performance improvements focused on specific science problems
* Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs
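To illustrate what the accelerated non-bonded short-range kernels listed above compute, here is a minimal NumPy sketch of a truncated Lennard-Jones energy/force evaluation. It is a conceptual reference only, not LAMMPS code: it uses a plain O(N²) pair loop, whereas the accelerated pair styles rely on GPU-built neighbor lists.

```python
import numpy as np

# Conceptual sketch of a truncated Lennard-Jones energy/force evaluation,
# the kind of non-bonded short-range work offloaded to the GPU.
# Plain O(N^2) pair loop for clarity; production code uses neighbor lists
# to skip distant pairs entirely.

def lj_energy_forces(pos, epsilon=1.0, sigma=1.0, r_cut=2.5):
    n = len(pos)
    forces = np.zeros_like(pos)
    energy = 0.0
    for i in range(n - 1):
        rij = pos[i + 1:] - pos[i]          # vectors from particle i to each later particle j
        r2 = np.sum(rij * rij, axis=1)
        mask = r2 < r_cut ** 2              # apply the short-range cut-off
        sr6 = (sigma ** 2 / r2[mask]) ** 3  # (sigma / r)^6
        energy += np.sum(4.0 * epsilon * (sr6 ** 2 - sr6))
        fmag = 24.0 * epsilon * (2.0 * sr6 ** 2 - sr6) / r2[mask]
        fij = fmag[:, None] * rij[mask]     # force on particle j due to i
        forces[i + 1:][mask] += fij
        forces[i] -= fij.sum(axis=0)        # Newton's third law
    return energy, forces

# Example usage: 64 random particles in a 5x5x5 box
e, f = lj_energy_forces(np.random.rand(64, 3) * 5.0)
```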
LAMMPS: 8.6x Speed-Up with GPUs
W.M. Brown, “GPU Acceleration in LAMMPS”, 2011 LAMMPS Workshop
LAMMPS: 4x Faster on a Billion Atoms
Chart: billion-atom Lennard-Jones benchmark, 103 seconds, 288 GPUs + CPUs vs. 1,920 x86 CPUs.
Test platform: NCSA Lincoln cluster with attached S1070 1U GPU servers; CPU-only cluster: Cray XT5
LAMMPS
• 4x-15x speedups: Gay-Berne and RE-squared potentials
• From the August 2011 LAMMPS Workshop; courtesy of W. Michael Brown, ORNL
LAMMPS Conclusions
• Runs on individual multi-GPU nodes as well as on GPU clusters
• Outstanding raw performance: 3x-40x higher than the equivalent CPU code
• Impressive linear strong scaling
• Good weak scaling; scales to a billion particles
• Tremendous opportunity to GPU-accelerate other force fields