Molecular Dynamics GPU Applications Catalog
Higher Ed & Research
May 8, 2012
Sections Included *
• Molecular Dynamics Applications Overview
• AMBER
• NAMD
• GROMACS
• LAMMPS
* In fullscreen mode, click on a link to view a particular module. Click on the NVIDIA logo on each slide to return to this page.
Molecular Dynamics (MD) Applications
GPU performance compared against a multi-core x86 CPU socket. GPU performance is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.
New/Additional MD Applications Ramping
GPU performance compared against a multi-core x86 CPU socket. GPU performance is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.
GPU Value to Molecular Dynamics
What
• Study disease & discover drugs
• Predict drug and protein interactions
Why
• Speed of simulations is critical
• Enables study of longer timeframes, larger systems, and more simulations
How
• GPUs increase throughput & accelerate simulations
GPU-Ready Applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD
GPU Test Drive pre-configured applications: AMBER 11, NAMD 2.8
AMBER 11: 4.6x performance increase with 2 GPUs at only 54% added cost*
* AMBER 11 Cellulose NPT on 2x Xeon X5670 CPUs + 2x Tesla M2090 GPUs (per node) vs. 2x Xeon X5670 CPUs (per node). Cost of a CPU node is assumed to be $9,333; cost of adding two M2090s to a single node is assumed to be $5,333.
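To make the price/performance claim above concrete, here is a minimal back-of-the-envelope sketch (Python; not part of the original deck) using the 4.6x speedup and 54% added-cost figures from the slide. The throughput-per-dollar framing is only an illustration.

```python
# A minimal sketch of the price/performance arithmetic above.
# The 4.6x speedup and 54% added cost come from the slide; the
# throughput-per-dollar framing is illustrative, not from the deck.

speedup = 4.6        # AMBER 11 Cellulose NPT: GPU node vs. CPU-only node
added_cost = 0.54    # the GPU node costs 54% more than the CPU-only node

perf_per_dollar_gain = speedup / (1.0 + added_cost)
print(f"Throughput per dollar improves by about {perf_per_dollar_gain:.1f}x")  # ~3.0x
```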
All Key MD Codes Are GPU Ready
• AMBER, NAMD, GROMACS, LAMMPS
• Life and materials sciences
• Great multi-GPU performance
• Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue
• Focus: scaling to large numbers of GPUs
Run AMBER Faster: Up to 5x Speed-Up with GPUs
Chart: GPU node vs. CPU-only node and CPU supercomputers, DHFR (NVE) benchmark, 23,558 atoms.
“…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind.”
Axel Kohlmeyer, Temple University
AMBER: Make Research More Productive with GPUs
Chart: 318% higher performance with GPU vs. no GPU, for 54% additional expense.
Base node configuration: dual Xeon X5670 CPUs and dual Tesla M2090 GPUs per node.
• Adding two M2090 GPUs to a node yields a > 4x performance increase
Run NAMD Faster: Up to 7x Speed-Up with GPUs
Benchmarks: STMV (1,066,628 atoms) and ApoA-1 (92,224 atoms).
Test platform: 1 node, dual Tesla M2090 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speed-up results, configuration, and test models.
STMV benchmark: NAMD 2.8b1 + unreleased patch. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs; performance numbers are for 2 M2070s + 8 cores (GPU+CPU) vs. 8 cores (CPU).
NAMD: Make Research More Productive with GPUs
Chart: 250% higher performance with GPU vs. no GPU, for 54% additional expense.
• Get up to a 250% performance increase (STMV, 1,066,628 atoms)
GROMACS Partnership Overview
• Erik Lindahl, David van der Spoel, and Berk Hess are the lead authors and project leaders; Szilárd Páll is a key GPU developer
• 2010: single-GPU support (OpenMM library in GROMACS 4.5)
• NVIDIA Dev Tech resources allocated to the GROMACS code
• 2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters
GROMACS 4.6 Release Features
• GROMACS multi-GPU support expected in April 2012
• GPU acceleration is a main focus: the majority of features will be accelerated in 4.6 in a transparent fashion
• PME simulations get special attention, and most of the effort will go into making these algorithms well load-balanced
• Reaction-field and cut-off simulations also run accelerated
• The list of features that are not GPU-accelerated will be quite short
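As a reminder of what the reaction-field and cut-off kernels mentioned above compute, the standard reaction-field pair potential (textbook form, not taken from the slides) modifies the Coulomb interaction inside the cut-off radius r_c:

\[
V_{\mathrm{rf}}(r_{ij}) = \frac{q_i q_j}{4\pi\varepsilon_0\,\varepsilon_r}\left(\frac{1}{r_{ij}} + k_{\mathrm{rf}}\, r_{ij}^{2} - c_{\mathrm{rf}}\right),
\qquad
k_{\mathrm{rf}} = \frac{\varepsilon_{\mathrm{rf}} - \varepsilon_r}{r_c^{3}\,\bigl(2\varepsilon_{\mathrm{rf}} + \varepsilon_r\bigr)},
\qquad
c_{\mathrm{rf}} = \frac{1}{r_c} + k_{\mathrm{rf}}\, r_c^{2},
\]

where ε_rf is the dielectric constant assumed beyond the cut-off; setting k_rf = c_rf = 0 recovers the plain cut-off Coulomb interaction. PME, by contrast, splits the electrostatics into a short-range real-space part and a long-range mesh part, which is why load balancing between the two gets special attention in 4.6.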
GROMACS 4.6 Alpha Release: Absolute Performance
• Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown
• Benchmark systems: RNase in water, 24,040 atoms in a cubic box and 16,816 atoms in a truncated dodecahedron box
• Settings: electrostatics cut-off auto-tuned >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Strong Scaling
Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on:
• Up to 40 cluster nodes with 80 GPUs
• Benchmark system: water box with 1.5M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR InfiniBand 40 Gb/s
GROMACS 4.6 Alpha Release: PME Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Reaction-Field Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release: Weak Scaling
• Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs: perfect weak scaling, with challenges remaining for strong scaling
• Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
• Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps
• Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075
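For reference on the terminology used in the preceding scaling slides: strong scaling keeps the total problem size fixed while adding GPUs, whereas weak scaling grows the problem with the GPU count. A small illustrative sketch follows, with invented timings rather than the GROMACS results above.

```python
# Illustrative strong- vs. weak-scaling efficiency calculations.
# The timings below are invented for the example; they are NOT the
# GROMACS benchmark results shown in the preceding slides.

def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: ideal time on n GPUs is t1 / n."""
    return (t1 / tn) / n

def weak_scaling_efficiency(t1, tn):
    """Per-GPU problem size held constant: ideal time stays at t1."""
    return t1 / tn

# Strong scaling: the same box on 1 GPU vs. 8 GPUs (hypothetical times)
print(strong_scaling_efficiency(t1=100.0, tn=15.0, n=8))   # ~0.83 -> 83% efficient

# Weak scaling: 8x more particles on 8x more GPUs (hypothetical times)
print(weak_scaling_efficiency(t1=100.0, tn=104.0))         # ~0.96 -> near-ideal
```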
LAMMPS: Released GPU Features and Future Plans
LAMMPS, August 2009
• First GPU-accelerated support
LAMMPS, Aug. 22, 2011
• Selected accelerated non-bonded short-range potentials (SP, MP, DP support):
  • Lennard-Jones (several variants, with & without coulombic interactions)
  • Morse
  • Buckingham
  • CHARMM
  • Tabulated
  • Coarse grain SDK
  • Anisotropic Gay-Berne
  • RE-squared
  • “Hybrid” combinations (GPU-accelerated & non-GPU-accelerated)
• Particle-Particle Particle-Mesh (SP or DP)
• Neighbor list builds
Longer Term*
• Improve performance on smaller particle counts (the neighbor list is the problem)
• Improve long-range performance (the MPI/Poisson solve is the problem)
• Additional pair potential support (including expensive advanced force fields); see the “Tremendous Opportunity for GPUs” slide*
• Performance improvements focused on specific science problems
* Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs
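To illustrate what the accelerated non-bonded short-range kernels listed above compute, here is a minimal NumPy sketch of a truncated Lennard-Jones energy/force evaluation. It is a conceptual reference only, not LAMMPS code: it uses a plain O(N²) pair loop, whereas the accelerated pair styles rely on GPU-built neighbor lists.

```python
import numpy as np

# Conceptual sketch of a truncated Lennard-Jones energy/force evaluation,
# the kind of non-bonded short-range work offloaded to the GPU.
# Plain O(N^2) pair loop for clarity; production code uses neighbor lists
# to skip distant pairs entirely.

def lj_energy_forces(pos, epsilon=1.0, sigma=1.0, r_cut=2.5):
    n = len(pos)
    forces = np.zeros_like(pos)
    energy = 0.0
    for i in range(n - 1):
        rij = pos[i + 1:] - pos[i]          # vectors from particle i to each later particle j
        r2 = np.sum(rij * rij, axis=1)
        mask = r2 < r_cut ** 2              # apply the short-range cut-off
        sr6 = (sigma ** 2 / r2[mask]) ** 3  # (sigma / r)^6
        energy += np.sum(4.0 * epsilon * (sr6 ** 2 - sr6))
        fmag = 24.0 * epsilon * (2.0 * sr6 ** 2 - sr6) / r2[mask]
        fij = fmag[:, None] * rij[mask]     # force on particle j due to i
        forces[i + 1:][mask] += fij
        forces[i] -= fij.sum(axis=0)        # Newton's third law
    return energy, forces

# Example usage: 64 random particles in a 5x5x5 box
e, f = lj_energy_forces(np.random.rand(64, 3) * 5.0)
```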
LAMMPS: 8.6x Speed-Up with GPUs
W.M. Brown, “GPU Acceleration in LAMMPS”, 2011 LAMMPS Workshop
LAMMPS: 4x Faster on a Billion Atoms
Chart: billion-atom Lennard-Jones benchmark, 103 seconds, 288 GPUs + CPUs vs. 1,920 x86 CPUs.
Test platform: NCSA Lincoln cluster with attached S1070 1U GPU servers; CPU-only cluster: Cray XT5
LAMMPS
• 4x-15x speedups: Gay-Berne and RE-squared potentials
• From the August 2011 LAMMPS Workshop; courtesy of W. Michael Brown, ORNL
LAMMPS Conclusions
• Runs on individual multi-GPU nodes as well as on GPU clusters
• Outstanding raw performance: 3x-40x higher than the equivalent CPU code
• Impressive linear strong scaling
• Good weak scaling; scales to a billion particles
• Tremendous opportunity to GPU-accelerate other force fields