Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/
Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/
The UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers. Theoretical and Computational Biophysics Group
Biomolecular simulations are our computational microscope • Ribosome: synthesizes proteins from genetic information, a target for antibiotics • Silicon nanopore: a bionanodevice for sequencing DNA efficiently
Our goal for NAMD is practical supercomputing for NIH researchers • 44,000 users can’t all be computer experts. • 11,700 have downloaded more than one version. • 2,300 citations of NAMD reference papers. • One program for all platforms. • Desktops and laptops – setup and testing • Linux clusters – affordable local workhorses • Supercomputers – free allocations on TeraGrid • Blue Waters – sustained petaflop/s performance • User knowledge is preserved. • No change in input or output files. • Run any simulation on any number of cores. • Available free of charge to all. Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.
NAMD uses a hybrid force-spatial parallel decomposition Kale et al., J. Comp. Phys. 151:283-312, 1999. • Spatially decompose data and communication. • Separate but related work decomposition. • “Compute objects” enable an iterative, measurement-based load balancing system.
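A minimal sketch of the decomposition idea described above, not NAMD source: atoms are binned into cubic "patches" sized by the cutoff, and a separate "compute object" is created for each pair of neighboring patches, which is why the number of work units grows much faster than the number of patches. The grid size and struct names are illustrative.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Patch { int x, y, z; };          // spatial cell holding atoms
struct Compute { int patchA, patchB; }; // force work between two patches

int main() {
    const int dim = 8;                  // hypothetical 8x8x8 patch grid
    std::vector<Patch> patches;
    for (int x = 0; x < dim; ++x)
        for (int y = 0; y < dim; ++y)
            for (int z = 0; z < dim; ++z)
                patches.push_back({x, y, z});

    // One compute object per unordered pair of patches that are within one
    // patch of each other in every dimension (26 neighbors plus self).
    std::vector<Compute> computes;
    for (int a = 0; a < (int)patches.size(); ++a)
        for (int b = a; b < (int)patches.size(); ++b) {
            const Patch &p = patches[a], &q = patches[b];
            if (std::abs(p.x - q.x) <= 1 && std::abs(p.y - q.y) <= 1 &&
                std::abs(p.z - q.z) <= 1)
                computes.push_back({a, b});
        }

    std::printf("%zu patches -> %zu compute objects\n",
                patches.size(), computes.size());
    return 0;
}
```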
Charm++ overlaps NAMD algorithms Phillips et al., SC2002. Objects are assigned to processors, queued as data arrives, and executed in priority order.
NAMD adjusts grainsize to match parallelism to processor count • Tradeoff between parallelism and overhead • Maximum patch size is based on the cutoff • Ideally one or more patches per processor • To double, split patches in the x, y, z dimensions • The number of computes grows much faster! • Hard to automate completely • Also need to select the number of PME pencils • Computes are partitioned in the outer atom loop • Old: heuristic based on distance and atom count • New: measurement-based compute partitioning
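A hedged sketch of the contrast in the last two bullets, not NAMD code: the old heuristic splits a compute's outer atom loop by atom count, whereas measurement-based partitioning splits it by measured per-atom cost so each partition does roughly equal work. The cost values and function name below are illustrative.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Split cost[i] (measured time charged to outer atom i) into `parts`
// contiguous ranges of roughly equal total cost; returns the end indices.
std::vector<size_t> splitByCost(const std::vector<double>& cost, int parts) {
    double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    double target = total / parts, acc = 0.0;
    std::vector<size_t> ends;
    for (size_t i = 0; i < cost.size(); ++i) {
        acc += cost[i];
        if (acc >= target && (int)ends.size() < parts - 1) {
            ends.push_back(i + 1);
            acc = 0.0;
        }
    }
    ends.push_back(cost.size());
    return ends;
}

int main() {
    // Hypothetical measured costs: later outer atoms are cheaper because more
    // of their pair interactions fall outside the cutoff.
    std::vector<double> cost = {9, 8, 8, 7, 5, 4, 3, 2, 1, 1};
    for (size_t end : splitByCost(cost, 3))
        std::printf("partition ends at outer atom %zu\n", end);
    return 0;
}
```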
Measurement-based grainsize tuning enables scalable implicit solvent simulation [panels: Before, heuristic (256 cores); After, measurement-based (512 cores)]
Target is still ~100 atoms per thread • 2002 Gordon Bell Award: ATP synthase, 300K atoms, on PSC Lemieux, 3,000 cores • Now: chromatophore, 100M atoms, on Blue Waters, 300,000 cores, 1.2M threads
Scale brings other challenges • Limited memory per core • Limited memory per node • Finicky parallel filesystems • Limited inter-node bandwidth • Long load balancer runtimes Which is why we collaborate with PPL!
Challenges in 100M-atom Biomolecule Simulation • How to overcome the sequential bottleneck? • Initialization • Output of trajectory & restart data • How to achieve good strong-scaling results? • Charm++ Runtime
Loading Data into System (1) • Traditionally done on a single core • Fine while the molecule is small • Result for the 100M-atom system • Memory: 40.5 GB! • Time: 3301.9 sec!
Loading Data into System (2) • Compression scheme • Atom “signature” representing common attributes of an atom • Supports more science simulation parameters • However, still not enough • Memory: 12.8 GB! • Time: 125.5 sec!
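A minimal sketch of the atom-signature compression idea, not NAMD's actual data structures: attributes shared by whole classes of atoms are stored once in a signature table, and each per-atom record keeps only an index plus truly per-atom data. Field names and the example values are hypothetical.

```cpp
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

struct AtomSignature {            // attributes common to a whole class of atoms
    int vdwType;
    double charge, mass;
    bool operator<(const AtomSignature& o) const {
        return std::tie(vdwType, charge, mass) <
               std::tie(o.vdwType, o.charge, o.mass);
    }
};

struct CompressedAtom {           // kept for every one of the ~100M atoms
    int signatureId;              // index into the shared signature table
    float x, y, z;                // per-atom data that cannot be shared
};

int main() {
    std::vector<AtomSignature> table;       // deduplicated signatures
    std::map<AtomSignature, int> seen;      // signature -> table index
    std::vector<CompressedAtom> atoms;

    // Hypothetical input: three atoms, two of which share a signature.
    AtomSignature input[] = {{1, -0.8, 15.999}, {2, 0.4, 1.008}, {2, 0.4, 1.008}};
    float coords[][3] = {{0, 0, 0}, {0.96f, 0, 0}, {-0.24f, 0.93f, 0}};

    for (int i = 0; i < 3; ++i) {
        int id;
        auto it = seen.find(input[i]);
        if (it != seen.end()) {
            id = it->second;                // reuse an existing signature
        } else {
            id = (int)table.size();         // first time this signature appears
            table.push_back(input[i]);
            seen[input[i]] = id;
        }
        atoms.push_back({id, coords[i][0], coords[i][1], coords[i][2]});
    }
    std::printf("%zu atoms share %zu signatures\n", atoms.size(), table.size());
    return 0;
}
```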
Loading Data into System (3) • Parallelizing initialization • #input procs: a parameter chosen by the user or auto-computed at runtime • First, each input proc loads 1/N of all atoms • Second, atoms are shuffled with neighbor procs for the later spatial decomposition • Good enough: with e.g. 600 input procs • Memory: 0.19 GB • Time: 12.4 sec
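A hedged MPI-style sketch of the two-phase parallel input described above, not NAMD's actual loader: each of the N input processes first claims a contiguous 1/N slice of the atom records, then exchanges atoms with other input processes so ownership matches the spatial decomposition. The atom count is the nominal 100M; file I/O and the exchange itself are elided.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nInput;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nInput);

    const long totalAtoms = 100000000L;                 // hypothetical 100M atoms
    long perProc = (totalAtoms + nInput - 1) / nInput;
    long begin = rank * perProc;
    long end = std::min(totalAtoms, begin + perProc);

    // Phase 1: read my contiguous slice [begin, end) of the compressed
    // molecule file (actual file reading elided in this sketch).
    std::printf("input proc %d reads atoms [%ld, %ld)\n", rank, begin, end);

    // Phase 2: shuffle atoms with neighboring input procs so that each proc
    // ends up holding the atoms belonging to its spatial region; an
    // MPI_Alltoallv keyed by destination region would be one way (elided).
    MPI_Finalize();
    return 0;
}
```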
Output Trajectory & Restart Data (1) • At least 4.8 GB written to the file system per output step • A target of tens of ms per step makes this even more critical • Parallelizing output • Each output proc is responsible for a portion of the atoms • Output to a single file for compatibility
Output Trajectory & Restart Data (2) • Alternative: multiple independent files • Post-processing merges them into a single file
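A hedged sketch of the single-shared-file variant of parallel output, not NAMD's writer: every output process writes its block of coordinates at a computed byte offset, so no process touches another's region; the multiple-file alternative above only changes where that offset is applied. File name, sizes, and layout are illustrative.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nOutput;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nOutput);

    const long totalAtoms = 100000000L;                  // hypothetical
    long perProc = (totalAtoms + nOutput - 1) / nOutput;
    long firstAtom = rank * perProc;
    long myAtoms = std::max(0L, std::min(perProc, totalAtoms - firstAtom));

    std::vector<float> xyz(3 * myAtoms, 0.0f);           // this proc's coordinates

    // Each output proc writes its contiguous region of the shared frame file.
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "frame000.coor",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)firstAtom * 3 * sizeof(float);
    MPI_File_write_at(fh, offset, xyz.data(), (int)xyz.size(),
                      MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```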
Initial Strong Scaling on Jaguar [chart: performance at 6,720 / 53,760 / 107,520 / 224,076 cores]
Multi-threading MPI-based Charm++ Runtime • Exploits multicore nodes • Portable, as it is based on MPI • On each node: • Each “processor” is represented as a thread • N “worker” threads share 1 “communication” thread • Worker threads handle only computation • The communication thread handles only network messages
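A minimal sketch of the SMP-mode thread layout on one node, not Charm++ source: worker threads only run computation handed to them, while a single communication thread is the only one that drives the network. Thread counts, function names, and the loop bodies are placeholders.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<bool> running{true};

void workerLoop(int /*workerId*/) {
    while (running.load()) {
        // pop a compute task from this worker's queue and execute it
        // (queue and scheduler details elided in this sketch)
        std::this_thread::yield();
    }
}

void commLoop() {
    while (running.load()) {
        // progress the network: post outgoing sends, probe for incoming
        // messages, and enqueue received messages to the worker queues
        std::this_thread::yield();
    }
}

int main() {
    const int workersPerNode = 15;   // e.g. a 16-core node: 15 workers + 1 comm
    std::vector<std::thread> workers;
    for (int i = 0; i < workersPerNode; ++i)
        workers.emplace_back(workerLoop, i);
    std::thread comm(commLoop);

    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // "run"
    running = false;
    for (auto& w : workers) w.join();
    comm.join();
    std::printf("node process exited cleanly\n");
    return 0;
}
```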
Benefits of SMP Mode (1) • Intra-node communication is faster • Messages are transferred as a pointer • Program launch time is reduced • 224K cores: ~6 min → ~1 min • Transparent to application developers • A correct Charm++ program runs in both non-SMP and SMP mode
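A hedged sketch of why intra-node sends are cheap in SMP mode: because worker threads share one address space, a message is "transferred" by pushing its pointer onto the destination's inbox, with no serialization or copy as in a cross-process send. The queue class and message layout are illustrative, not Charm++'s internals.

```cpp
#include <cstdio>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

struct Message { int src; std::vector<double> payload; };

class MsgQueue {                         // per-worker inbox shared within the node
    std::queue<std::unique_ptr<Message>> q;
    std::mutex m;
public:
    void push(std::unique_ptr<Message> msg) {
        std::lock_guard<std::mutex> lk(m);
        q.push(std::move(msg));          // pointer handoff, no data copy
    }
    std::unique_ptr<Message> tryPop() {
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return nullptr;
        auto msg = std::move(q.front());
        q.pop();
        return msg;
    }
};

int main() {
    MsgQueue inbox;
    auto msg = std::make_unique<Message>();
    msg->src = 3;
    msg->payload.assign(1000, 1.0);
    inbox.push(std::move(msg));          // same-node "send" is just a push
    if (auto received = inbox.tryPop())
        std::printf("received %zu doubles from worker %d without copying\n",
                    received->payload.size(), received->src);
    return 0;
}
```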
Benefits of SMP Mode (2) • Reduces the memory footprint further • Read-only data structures are shared • The memory footprint of the MPI library is reduced • On average, a 7X reduction! • Better cache performance • Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)
Potential Bottleneck on the Communication Thread • Overlapping computation and communication alleviates the problem to some extent
Node-aware Communication • In the runtime: multicast, broadcast, etc. • E.g., a series of broadcasts at startup: 2.78X reduction • In the application: multicast tree • Incorporate knowledge of the computation to guide construction of the tree • Use the least loaded node as the intermediate node
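A simplified sketch of a node-aware multicast tree as described above, not Charm++'s actual tree code: destination processors are grouped by node so each node receives the message over the network only once, and the least-loaded destination node is chosen as the intermediate forwarder. The PE/node layout and load values are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

struct Dest { int pe; int node; };

int main() {
    // Hypothetical multicast destinations and per-node measured loads.
    std::vector<Dest> dests = {{0,0},{1,0},{4,1},{5,1},{8,2},{9,2}};
    std::map<int, double> nodeLoad = {{0, 0.9}, {1, 0.3}, {2, 0.6}};

    // Group destination PEs by node: one network message per node.
    std::map<int, std::vector<int>> byNode;
    for (const auto& d : dests) byNode[d.node].push_back(d.pe);

    // Pick the least loaded destination node as the intermediate forwarder.
    int intermediate = std::min_element(byNode.begin(), byNode.end(),
        [&](const auto& a, const auto& b) {
            return nodeLoad[a.first] < nodeLoad[b.first];
        })->first;

    std::printf("root sends once to node %d, which forwards to the other nodes\n",
                intermediate);
    for (const auto& [node, pes] : byNode)
        std::printf("node %d delivers locally to %zu PEs\n", node, pes.size());
    return 0;
}
```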
Handling Bursts of Messages (1) • A global barrier after each timestep due to the constant-pressure algorithm • Amplified further because there is only 1 communication thread per node
Handling Bursts of Messages (2) • Work flow of the communication thread • Alternates among send/release/receive modes • Dynamic flow control • Exits one mode for another • E.g., 12.3% for the 4,480-node (53,760-core) run
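A hedged sketch of the communication thread's mode cycle, not Charm++ source: the thread rotates through send, release, and receive modes, and the flow control amounts to each mode giving up its turn once it runs out of work or hits a per-visit budget, so a burst of incoming messages cannot starve sends (or vice versa). The budget, function names, and phase bodies are placeholders.

```cpp
#include <cstdio>

enum class Mode { Send, Release, Receive };

// The three phases; each would process at most `budget` items per visit.
// The actual MPI progress calls are elided in this sketch.
int sendPending(int /*budget*/)      { /* post outgoing sends          */ return 0; }
int releaseCompleted(int /*budget*/) { /* test and free sent buffers   */ return 0; }
int receiveIncoming(int /*budget*/)  { /* probe/receive, hand to workers */ return 0; }

void commThreadLoop(bool& shuttingDown) {
    Mode mode = Mode::Send;
    const int budget = 32;               // hypothetical per-visit cap
    while (!shuttingDown) {
        switch (mode) {                  // always yield to the next mode so no
            case Mode::Send:             // single mode monopolizes the thread
                sendPending(budget);      mode = Mode::Release; break;
            case Mode::Release:
                releaseCompleted(budget); mode = Mode::Receive; break;
            case Mode::Receive:
                receiveIncoming(budget);  mode = Mode::Send;    break;
        }
    }
}

int main() {
    bool shuttingDown = true;            // loop shown for structure only
    commThreadLoop(shuttingDown);
    std::printf("comm thread mode cycle sketched\n");
    return 0;
}
```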
Hierarchical Load Balancer • The centralized balancer consumes too much memory at scale • Processors are divided into groups • Load balancing is done within each group
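A hedged sketch of the hierarchical idea above, not Charm++'s actual hierarchical balancer: instead of one centralized balancer holding statistics for every processor and object, processors are divided into groups and each group runs a balance over only its own objects, keeping per-group memory small. Group sizes, loads, and the greedy strategy are illustrative.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy balance inside one group: heaviest object first, placed on the
// currently least loaded processor of that group.
std::vector<double> balanceGroup(std::vector<double> objLoads, int procsInGroup) {
    std::sort(objLoads.rbegin(), objLoads.rend());
    std::vector<double> procLoad(procsInGroup, 0.0);
    for (double w : objLoads)
        *std::min_element(procLoad.begin(), procLoad.end()) += w;
    return procLoad;
}

int main() {
    const int nGroups = 4, procsPerGroup = 4;          // hypothetical layout
    for (int g = 0; g < nGroups; ++g) {
        // Each group only ever sees its own objects' measured loads.
        std::vector<double> objLoads = {3.0, 2.5, 2.0, 1.5, 1.0, 1.0, 0.5, 0.5};
        std::vector<double> procLoad = balanceGroup(objLoads, procsPerGroup);
        std::printf("group %d: max proc load %.1f (perfect balance would be 3.0)\n",
                    g, *std::max_element(procLoad.begin(), procLoad.end()));
    }
    return 0;
}
```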
Strong Scaling on Jaguar (2) [chart: performance at 6,720 / 53,760 / 107,520 / 224,076 cores]
Weak Scaling on Intrepid (~1,466 atoms/core) [chart: 2M, 6M, 12M, 24M, 48M, and 100M-atom systems] • The 100M-atom system ONLY runs in SMP mode • Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap
Conclusion and Future Work • IO bottleneck solved by parallelization • An approach that optimizes both application and its underlying runtime • SMP mode in runtime • Continue to improve performance • PME calculation • Integrate and optimize new science codes
Acknowledgements • Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation • David Tanner for the implicit solvent work • Machines: Jaguar@NCCS, Intrepid@ANL, supported by DOE • Funding: NIH, NSF