Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/
Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/
The UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers. Theoretical and Computational Biophysics Group
Biomolecular simulations are our computational microscope • Ribosome: synthesizes proteins from genetic information, a target for antibiotics • Silicon nanopore: a bionanodevice for sequencing DNA efficiently
Our goal for NAMD is practical supercomputing for NIH researchers • 44,000 users can’t all be computer experts. • 11,700 have downloaded more than one version. • 2,300 citations of NAMD reference papers. • One program for all platforms. • Desktops and laptops – setup and testing • Linux clusters – affordable local workhorses • Supercomputers – free allocations on TeraGrid • Blue Waters – sustained petaflop/s performance • User knowledge is preserved. • No change in input or output files. • Run any simulation on any number of cores. • Available free of charge to all. Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.
NAMD uses a hybrid force-spatial parallel decomposition Kale et al., J. Comp. Phys. 151:283-312, 1999. • Spatially decompose data and communication. • Separate but related work decomposition. • “Compute objects” enable an iterative, measurement-based load balancing system.
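A minimal sketch of the decomposition idea described above, not NAMD source: atoms are binned into cubic "patches" sized by the cutoff, and a separate "compute object" is created for each pair of neighboring patches, which is why the number of work units grows much faster than the number of patches. The grid size and struct names are illustrative.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Patch { int x, y, z; };          // spatial cell holding atoms
struct Compute { int patchA, patchB; }; // force work between two patches

int main() {
    const int dim = 8;                  // hypothetical 8x8x8 patch grid
    std::vector<Patch> patches;
    for (int x = 0; x < dim; ++x)
        for (int y = 0; y < dim; ++y)
            for (int z = 0; z < dim; ++z)
                patches.push_back({x, y, z});

    // One compute object per unordered pair of patches that are within one
    // patch of each other in every dimension (26 neighbors plus self).
    std::vector<Compute> computes;
    for (int a = 0; a < (int)patches.size(); ++a)
        for (int b = a; b < (int)patches.size(); ++b) {
            const Patch &p = patches[a], &q = patches[b];
            if (std::abs(p.x - q.x) <= 1 && std::abs(p.y - q.y) <= 1 &&
                std::abs(p.z - q.z) <= 1)
                computes.push_back({a, b});
        }

    std::printf("%zu patches -> %zu compute objects\n",
                patches.size(), computes.size());
    return 0;
}
```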
Charm++ overlaps NAMD algorithms Phillips et al., SC2002. Objects are assigned to processors, queued as data arrives, and executed in priority order.
NAMD adjusts grainsize to match parallelism to processor count • Tradeoff between parallelism and overhead • Maximum patch size is based on the cutoff • Ideally one or more patches per processor • To double, split patches in the x, y, z dimensions • The number of computes grows much faster! • Hard to automate completely • Also need to select the number of PME pencils • Computes are partitioned in the outer atom loop • Old: heuristic based on distance and atom count • New: measurement-based compute partitioning
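A hedged sketch of the contrast in the last two bullets, not NAMD code: the old heuristic splits a compute's outer atom loop by atom count, whereas measurement-based partitioning splits it by measured per-atom cost so each partition does roughly equal work. The cost values and function name below are illustrative.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Split cost[i] (measured time charged to outer atom i) into `parts`
// contiguous ranges of roughly equal total cost; returns the end indices.
std::vector<size_t> splitByCost(const std::vector<double>& cost, int parts) {
    double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    double target = total / parts, acc = 0.0;
    std::vector<size_t> ends;
    for (size_t i = 0; i < cost.size(); ++i) {
        acc += cost[i];
        if (acc >= target && (int)ends.size() < parts - 1) {
            ends.push_back(i + 1);
            acc = 0.0;
        }
    }
    ends.push_back(cost.size());
    return ends;
}

int main() {
    // Hypothetical measured costs: later outer atoms are cheaper because more
    // of their pair interactions fall outside the cutoff.
    std::vector<double> cost = {9, 8, 8, 7, 5, 4, 3, 2, 1, 1};
    for (size_t end : splitByCost(cost, 3))
        std::printf("partition ends at outer atom %zu\n", end);
    return 0;
}
```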
Measurement-based grainsize tuning enables scalable implicit solvent simulation [panels: Before, heuristic (256 cores); After, measurement-based (512 cores)]
Target is still ~100 atoms per thread • 2002 Gordon Bell Award: ATP synthase, 300K atoms, on PSC Lemieux, 3,000 cores • Now: chromatophore, 100M atoms, on Blue Waters, 300,000 cores, 1.2M threads
Scale brings other challenges • Limited memory per core • Limited memory per node • Finicky parallel filesystems • Limited inter-node bandwidth • Long load balancer runtimes Which is why we collaborate with PPL!
Challenges in 100M-atom Biomolecule Simulation • How to overcome the sequential bottleneck? • Initialization • Output of trajectory & restart data • How to achieve good strong-scaling results? • Charm++ Runtime
Loading Data into System (1) • Traditionally done on a single core • Fine while the molecule is small • Result for the 100M-atom system • Memory: 40.5 GB! • Time: 3301.9 sec!
Loading Data into System (2) • Compression scheme • Atom “signature” representing common attributes of an atom • Supports more science simulation parameters • However, still not enough • Memory: 12.8 GB! • Time: 125.5 sec!
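A minimal sketch of the atom-signature compression idea, not NAMD's actual data structures: attributes shared by whole classes of atoms are stored once in a signature table, and each per-atom record keeps only an index plus truly per-atom data. Field names and the example values are hypothetical.

```cpp
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

struct AtomSignature {            // attributes common to a whole class of atoms
    int vdwType;
    double charge, mass;
    bool operator<(const AtomSignature& o) const {
        return std::tie(vdwType, charge, mass) <
               std::tie(o.vdwType, o.charge, o.mass);
    }
};

struct CompressedAtom {           // kept for every one of the ~100M atoms
    int signatureId;              // index into the shared signature table
    float x, y, z;                // per-atom data that cannot be shared
};

int main() {
    std::vector<AtomSignature> table;       // deduplicated signatures
    std::map<AtomSignature, int> seen;      // signature -> table index
    std::vector<CompressedAtom> atoms;

    // Hypothetical input: three atoms, two of which share a signature.
    AtomSignature input[] = {{1, -0.8, 15.999}, {2, 0.4, 1.008}, {2, 0.4, 1.008}};
    float coords[][3] = {{0, 0, 0}, {0.96f, 0, 0}, {-0.24f, 0.93f, 0}};

    for (int i = 0; i < 3; ++i) {
        int id;
        auto it = seen.find(input[i]);
        if (it != seen.end()) {
            id = it->second;                // reuse an existing signature
        } else {
            id = (int)table.size();         // first time this signature appears
            table.push_back(input[i]);
            seen[input[i]] = id;
        }
        atoms.push_back({id, coords[i][0], coords[i][1], coords[i][2]});
    }
    std::printf("%zu atoms share %zu signatures\n", atoms.size(), table.size());
    return 0;
}
```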
Loading Data into System (3) • Parallelizing initialization • #input procs: a parameter chosen by the user or auto-computed at runtime • First, each input proc loads 1/N of all atoms • Second, atoms are shuffled with neighbor procs for the later spatial decomposition • Good enough: with e.g. 600 input procs • Memory: 0.19 GB • Time: 12.4 sec
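A hedged MPI-style sketch of the two-phase parallel input described above, not NAMD's actual loader: each of the N input processes first claims a contiguous 1/N slice of the atom records, then exchanges atoms with other input processes so ownership matches the spatial decomposition. The atom count is the nominal 100M; file I/O and the exchange itself are elided.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nInput;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nInput);

    const long totalAtoms = 100000000L;                 // hypothetical 100M atoms
    long perProc = (totalAtoms + nInput - 1) / nInput;
    long begin = rank * perProc;
    long end = std::min(totalAtoms, begin + perProc);

    // Phase 1: read my contiguous slice [begin, end) of the compressed
    // molecule file (actual file reading elided in this sketch).
    std::printf("input proc %d reads atoms [%ld, %ld)\n", rank, begin, end);

    // Phase 2: shuffle atoms with neighboring input procs so that each proc
    // ends up holding the atoms belonging to its spatial region; an
    // MPI_Alltoallv keyed by destination region would be one way (elided).
    MPI_Finalize();
    return 0;
}
```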
Output Trajectory & Restart Data (1) • At least 4.8 GB written to the file system per output step • A target of tens of ms per step makes this even more critical • Parallelizing output • Each output proc is responsible for a portion of the atoms • Output to a single file for compatibility
Output Trajectory & Restart Data (2) • Alternative: multiple independent files • Post-processing merges them into a single file
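A hedged sketch of the single-shared-file variant of parallel output, not NAMD's writer: every output process writes its block of coordinates at a computed byte offset, so no process touches another's region; the multiple-file alternative above only changes where that offset is applied. File name, sizes, and layout are illustrative.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nOutput;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nOutput);

    const long totalAtoms = 100000000L;                  // hypothetical
    long perProc = (totalAtoms + nOutput - 1) / nOutput;
    long firstAtom = rank * perProc;
    long myAtoms = std::max(0L, std::min(perProc, totalAtoms - firstAtom));

    std::vector<float> xyz(3 * myAtoms, 0.0f);           // this proc's coordinates

    // Each output proc writes its contiguous region of the shared frame file.
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "frame000.coor",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)firstAtom * 3 * sizeof(float);
    MPI_File_write_at(fh, offset, xyz.data(), (int)xyz.size(),
                      MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```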
Initial Strong Scaling on Jaguar [chart: performance at 6,720 / 53,760 / 107,520 / 224,076 cores]
Multi-threading MPI-based Charm++ Runtime • Exploits multicore nodes • Portable, as it is based on MPI • On each node: • Each “processor” is represented as a thread • N “worker” threads share 1 “communication” thread • Worker threads handle only computation • The communication thread handles only network messages
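A minimal sketch of the SMP-mode thread layout on one node, not Charm++ source: worker threads only run computation handed to them, while a single communication thread is the only one that drives the network. Thread counts, function names, and the loop bodies are placeholders.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<bool> running{true};

void workerLoop(int /*workerId*/) {
    while (running.load()) {
        // pop a compute task from this worker's queue and execute it
        // (queue and scheduler details elided in this sketch)
        std::this_thread::yield();
    }
}

void commLoop() {
    while (running.load()) {
        // progress the network: post outgoing sends, probe for incoming
        // messages, and enqueue received messages to the worker queues
        std::this_thread::yield();
    }
}

int main() {
    const int workersPerNode = 15;   // e.g. a 16-core node: 15 workers + 1 comm
    std::vector<std::thread> workers;
    for (int i = 0; i < workersPerNode; ++i)
        workers.emplace_back(workerLoop, i);
    std::thread comm(commLoop);

    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // "run"
    running = false;
    for (auto& w : workers) w.join();
    comm.join();
    std::printf("node process exited cleanly\n");
    return 0;
}
```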
Benefits of SMP Mode (1) • Intra-node communication is faster • Messages are transferred as a pointer • Program launch time is reduced • 224K cores: ~6 min → ~1 min • Transparent to application developers • A correct Charm++ program runs in both non-SMP and SMP mode
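A hedged sketch of why intra-node sends are cheap in SMP mode: because worker threads share one address space, a message is "transferred" by pushing its pointer onto the destination's inbox, with no serialization or copy as in a cross-process send. The queue class and message layout are illustrative, not Charm++'s internals.

```cpp
#include <cstdio>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

struct Message { int src; std::vector<double> payload; };

class MsgQueue {                         // per-worker inbox shared within the node
    std::queue<std::unique_ptr<Message>> q;
    std::mutex m;
public:
    void push(std::unique_ptr<Message> msg) {
        std::lock_guard<std::mutex> lk(m);
        q.push(std::move(msg));          // pointer handoff, no data copy
    }
    std::unique_ptr<Message> tryPop() {
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return nullptr;
        auto msg = std::move(q.front());
        q.pop();
        return msg;
    }
};

int main() {
    MsgQueue inbox;
    auto msg = std::make_unique<Message>();
    msg->src = 3;
    msg->payload.assign(1000, 1.0);
    inbox.push(std::move(msg));          // same-node "send" is just a push
    if (auto received = inbox.tryPop())
        std::printf("received %zu doubles from worker %d without copying\n",
                    received->payload.size(), received->src);
    return 0;
}
```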
Benefits of SMP Mode (2) • Reduces the memory footprint further • Read-only data structures are shared • The memory footprint of the MPI library is reduced • On average, a 7X reduction! • Better cache performance • Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)
Potential Bottleneck on the Communication Thread • Overlapping computation and communication alleviates the problem to some extent
Node-aware Communication • In the runtime: multicast, broadcast, etc. • E.g., a series of broadcasts at startup: 2.78X reduction • In the application: multicast tree • Incorporate knowledge of the computation to guide construction of the tree • Use the least loaded node as the intermediate node
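A simplified sketch of a node-aware multicast tree as described above, not Charm++'s actual tree code: destination processors are grouped by node so each node receives the message over the network only once, and the least-loaded destination node is chosen as the intermediate forwarder. The PE/node layout and load values are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

struct Dest { int pe; int node; };

int main() {
    // Hypothetical multicast destinations and per-node measured loads.
    std::vector<Dest> dests = {{0,0},{1,0},{4,1},{5,1},{8,2},{9,2}};
    std::map<int, double> nodeLoad = {{0, 0.9}, {1, 0.3}, {2, 0.6}};

    // Group destination PEs by node: one network message per node.
    std::map<int, std::vector<int>> byNode;
    for (const auto& d : dests) byNode[d.node].push_back(d.pe);

    // Pick the least loaded destination node as the intermediate forwarder.
    int intermediate = std::min_element(byNode.begin(), byNode.end(),
        [&](const auto& a, const auto& b) {
            return nodeLoad[a.first] < nodeLoad[b.first];
        })->first;

    std::printf("root sends once to node %d, which forwards to the other nodes\n",
                intermediate);
    for (const auto& [node, pes] : byNode)
        std::printf("node %d delivers locally to %zu PEs\n", node, pes.size());
    return 0;
}
```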
Handling Bursts of Messages (1) • A global barrier after each timestep due to the constant-pressure algorithm • Amplified further because there is only 1 communication thread per node
Handling Bursts of Messages (2) • Work flow of the communication thread • Alternates among send/release/receive modes • Dynamic flow control • Exits one mode for another • E.g., 12.3% for the 4,480-node (53,760-core) run
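A hedged sketch of the communication thread's mode cycle, not Charm++ source: the thread rotates through send, release, and receive modes, and the flow control amounts to each mode giving up its turn once it runs out of work or hits a per-visit budget, so a burst of incoming messages cannot starve sends (or vice versa). The budget, function names, and phase bodies are placeholders.

```cpp
#include <cstdio>

enum class Mode { Send, Release, Receive };

// The three phases; each would process at most `budget` items per visit.
// The actual MPI progress calls are elided in this sketch.
int sendPending(int /*budget*/)      { /* post outgoing sends          */ return 0; }
int releaseCompleted(int /*budget*/) { /* test and free sent buffers   */ return 0; }
int receiveIncoming(int /*budget*/)  { /* probe/receive, hand to workers */ return 0; }

void commThreadLoop(bool& shuttingDown) {
    Mode mode = Mode::Send;
    const int budget = 32;               // hypothetical per-visit cap
    while (!shuttingDown) {
        switch (mode) {                  // always yield to the next mode so no
            case Mode::Send:             // single mode monopolizes the thread
                sendPending(budget);      mode = Mode::Release; break;
            case Mode::Release:
                releaseCompleted(budget); mode = Mode::Receive; break;
            case Mode::Receive:
                receiveIncoming(budget);  mode = Mode::Send;    break;
        }
    }
}

int main() {
    bool shuttingDown = true;            // loop shown for structure only
    commThreadLoop(shuttingDown);
    std::printf("comm thread mode cycle sketched\n");
    return 0;
}
```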
Hierarchical Load Balancer • The centralized balancer consumes too much memory at scale • Processors are divided into groups • Load balancing is done within each group
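A hedged sketch of the hierarchical idea above, not Charm++'s actual hierarchical balancer: instead of one centralized balancer holding statistics for every processor and object, processors are divided into groups and each group runs a balance over only its own objects, keeping per-group memory small. Group sizes, loads, and the greedy strategy are illustrative.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy balance inside one group: heaviest object first, placed on the
// currently least loaded processor of that group.
std::vector<double> balanceGroup(std::vector<double> objLoads, int procsInGroup) {
    std::sort(objLoads.rbegin(), objLoads.rend());
    std::vector<double> procLoad(procsInGroup, 0.0);
    for (double w : objLoads)
        *std::min_element(procLoad.begin(), procLoad.end()) += w;
    return procLoad;
}

int main() {
    const int nGroups = 4, procsPerGroup = 4;          // hypothetical layout
    for (int g = 0; g < nGroups; ++g) {
        // Each group only ever sees its own objects' measured loads.
        std::vector<double> objLoads = {3.0, 2.5, 2.0, 1.5, 1.0, 1.0, 0.5, 0.5};
        std::vector<double> procLoad = balanceGroup(objLoads, procsPerGroup);
        std::printf("group %d: max proc load %.1f (perfect balance would be 3.0)\n",
                    g, *std::max_element(procLoad.begin(), procLoad.end()));
    }
    return 0;
}
```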
Strong Scaling on Jaguar (2) [chart: performance at 6,720 / 53,760 / 107,520 / 224,076 cores]
Weak Scaling on Intrepid (~1,466 atoms/core) [chart: 2M, 6M, 12M, 24M, 48M, and 100M-atom systems] • The 100M-atom system ONLY runs in SMP mode • Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap
Conclusion and Future Work • IO bottleneck solved by parallelization • An approach that optimizes both application and its underlying runtime • SMP mode in runtime • Continue to improve performance • PME calculation • Integrate and optimize new science codes
Acknowledgements • Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation • David Tanner for the implicit solvent work • Machines: Jaguar@NCCS, Intrepid@ANL, supported by DOE • Funding: NIH, NSF