Scalable Molecular Dynamics for Large Biomolecular Systems. Robert Brunner, James C. Phillips, Laxmikant Kale
Overview • Context: approach and methodology • Molecular dynamics for biomolecules • Our program NAMD • Basic parallelization strategy • NAMD performance optimizations • Techniques • Results • Conclusions: summary, lessons and future work
The context • Objective: enhance performance and productivity in parallel programming • For complex, dynamic applications • Scalable to thousands of processors • Theme: • Adaptive techniques for handling dynamic behavior • Look for the optimal division of labor between the human programmer and the “system” • Let the programmer specify what to do in parallel • Let the system decide when and where to run the subcomputations • Data driven objects as the substrate
Data driven execution • [Figure: per-processor schedulers, each with its own message queue, dispatching work to data-driven objects]
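To make the execution model concrete, here is a minimal sketch in plain C++ of a per-processor scheduler draining a message queue and dispatching each message to its target object. This is an illustration of the idea only, not the Charm++ runtime; the names Scheduler and Message and the lambda-based dispatch are invented for the example.

```cpp
// Minimal sketch of data-driven execution: a per-processor scheduler
// drains a message queue and dispatches each message to its target object.
#include <cstdio>
#include <functional>
#include <queue>

struct Message {
    int targetObject;                 // which object the message is for
    std::function<void()> method;     // the (already-bound) method to invoke
};

class Scheduler {
    std::queue<Message> messageQ;     // one queue per processor
public:
    void enqueue(Message m) { messageQ.push(std::move(m)); }
    void run() {                      // dispatch until no work remains
        while (!messageQ.empty()) {
            Message m = std::move(messageQ.front());
            messageQ.pop();
            m.method();               // an object runs only when its data has arrived
        }
    }
};

int main() {
    Scheduler sched;
    for (int obj = 0; obj < 4; ++obj)
        sched.enqueue({obj, [obj] { std::printf("object %d executing\n", obj); }});
    sched.run();
    return 0;
}
```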
Charm++ • Parallel C++ with Data Driven Objects • Object Arrays and collections • Asynchronous method invocation • Object Groups: • global object with a “representative” on each PE • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
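A hedged sketch of what a Charm++ program in this style looks like: a chare array whose elements are invoked through asynchronous method calls on a proxy. The module and class names (hello, Main, Worker) are made up for illustration; building it requires the Charm++ toolchain, which generates the decl/def headers from the interface file shown in comments.

```cpp
// hello.ci -- Charm++ interface file (sketch):
// mainmodule hello {
//   readonly CProxy_Main mainProxy;
//   mainchare Main {
//     entry Main(CkArgMsg* m);
//     entry void done();
//   };
//   array [1D] Worker {
//     entry Worker();
//     entry void doWork(int step);
//   };
// };

// hello.C -- asynchronous method invocation on an object array
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int count;
public:
  Main(CkArgMsg* m) : count(0) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(8);  // create 8 array elements
    workers.doWork(0);   // broadcast: asynchronous invocation, returns immediately
  }
  void done() { if (++count == 8) CkExit(); }  // one reply per element
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}           // constructor used when an element migrates
  void doWork(int step) {
    CkPrintf("element %d on PE %d (step %d)\n", thisIndex, CkMyPe(), step);
    mainProxy.done();                    // asynchronous call back to the main chare
  }
};

#include "hello.def.h"
```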
Load balancing • Based on migratable objects • Collect timing data for several cycles • Run heuristic load balancer • Several alternative ones • Re-map and migrate objects accordingly • Registration mechanisms facilitate migration
Measurement based load balancing • Application-induced imbalances: • Abrupt, but infrequent, or • Slow, cumulative • Rarely: frequent, large changes • Principle of persistence • Extension of the principle of locality • The behavior of objects, including computational load and communication patterns, tends to persist over time • We have implemented strategies that exploit this automatically
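A small sketch of how the principle of persistence gets used (hypothetical types, not the actual Charm++ load balancing framework): each object's execution time is measured every cycle, and the recent measurement serves as the prediction the load balancer works from.

```cpp
// Sketch: per-object load measurement, used as the prediction for the next cycle.
#include <chrono>
#include <vector>

struct ObjectStats {
    double recentLoad = 0.0;            // measured execution time (seconds)
};

class LoadDatabase {
    std::vector<ObjectStats> stats;
public:
    explicit LoadDatabase(int nObjects) : stats(nObjects) {}

    // Wrap every object execution to accumulate its measured time.
    template <typename Work>
    void timedRun(int obj, Work&& work) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        stats[obj].recentLoad += std::chrono::duration<double>(t1 - t0).count();
    }

    // Principle of persistence: the measured load is the predicted load.
    double predictedLoad(int obj) const { return stats[obj].recentLoad; }
};
```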
Molecular dynamics and NAMD • MD to understand the structure and function of biomolecules • proteins, DNA, membranes • NAMD is a production quality MD program • Active use by biophysicists (science publications) • 50,000+ lines of C++ code • 1000+ registered users • Features and “accessories” such as • VMD: visualization • Biocore: collaboratory • Steered and Interactive Molecular Dynamics
NAMD Contributors • PIs: • Laxmikant Kale, Klaus Schulten, Robert Skeel • NAMD 1: • Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson • NAMD2: • M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...
Molecular Dynamics • Collection of [charged] atoms, with bonds • Newtonian mechanics • At each time-step: • Calculate forces on each atom • bonded: bonds, angles, dihedrals • non-bonded: electrostatic and van der Waals • Calculate velocities and advance positions • 1 femtosecond time-step, millions of steps needed! • Thousands of atoms (1,000 to 100,000)
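For concreteness, a toy sequential version of one time step (not NAMD code): an O(N^2) pairwise electrostatic force loop followed by a velocity and position update. Constants, bonded terms and a proper integrator are omitted.

```cpp
// Toy MD step: O(N^2) pairwise electrostatics + a simple explicit update.
// Illustrative only; real integrators (velocity Verlet) and bonded terms omitted.
#include <cmath>
#include <vector>

struct Atom { double x, y, z, vx, vy, vz, fx, fy, fz, charge, mass; };

void timeStep(std::vector<Atom>& atoms, double dt) {
    for (auto& a : atoms) a.fx = a.fy = a.fz = 0.0;

    // Non-bonded forces over every pair (this is what the cut-off later avoids).
    for (size_t i = 0; i < atoms.size(); ++i) {
        for (size_t j = i + 1; j < atoms.size(); ++j) {
            double dx = atoms[j].x - atoms[i].x;
            double dy = atoms[j].y - atoms[i].y;
            double dz = atoms[j].z - atoms[i].z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double inv_r = 1.0 / std::sqrt(r2);
            // Coulomb-like term, constants omitted: f = q_i q_j / r^3
            double f = atoms[i].charge * atoms[j].charge * inv_r * inv_r * inv_r;
            atoms[i].fx -= f * dx; atoms[i].fy -= f * dy; atoms[i].fz -= f * dz;
            atoms[j].fx += f * dx; atoms[j].fy += f * dy; atoms[j].fz += f * dz;
        }
    }

    // Advance velocities and positions (1 fs time step in the real code).
    for (auto& a : atoms) {
        a.vx += dt * a.fx / a.mass; a.vy += dt * a.fy / a.mass; a.vz += dt * a.fz / a.mass;
        a.x  += dt * a.vx;          a.y  += dt * a.vy;          a.z  += dt * a.vz;
    }
}
```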
Cut-off radius • Use of a cut-off radius (8 to 14 Å) to reduce work • Faraway charges ignored! • 80-95% of the work is non-bonded force computation • Some simulations need the faraway contributions • Periodic systems: Ewald, Particle-Mesh Ewald (PME) • Aperiodic systems: FMA • Even so, cut-off based computations remain important: • near-atom calculations are part of the above methods • multiple time-stepping is used: k cut-off steps per PME/FMA step
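A sketch of the multiple time-stepping schedule just described: cheap cut-off forces every step, the expensive long-range (PME/FMA) contribution only every k-th step. The helper functions are placeholders, not NAMD routines.

```cpp
// Multiple time-stepping (sketch): k cut-off-only steps per full long-range step.
// The four helpers are stand-ins for the real force/integration routines.
void computeBondedForces()    {}  // bonds, angles, dihedrals
void computeCutoffNonbonded() {}  // short-range electrostatics + van der Waals
void computeLongRange()       {}  // PME / FMA contribution
void integrate()              {}  // advance velocities and positions

void runMTS(int nSteps, int k) {
    for (int step = 0; step < nSteps; ++step) {
        computeBondedForces();
        computeCutoffNonbonded();               // every step
        if (step % k == 0) computeLongRange();  // only every k-th step
        integrate();
    }
}
```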
Scalability • The program should scale up to use a large number of processors • But what does that mean? • An individual simulation isn't truly scalable • Better definition of scalability: • If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
Isoefficiency • Quantifies scalability • (Work of Vipin Kumar, U. Minnesota) • How much increase in problem size is needed to retain the same efficiency on a larger machine? • Efficiency: Seq. Time / (P · Parallel Time) • parallel time = computation + communication + idle
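The same definitions written out as formulas (a restatement of this slide, not new material):

```latex
E(P, N) \;=\; \frac{T_{\text{seq}}(N)}{P \cdot T_{\text{par}}(P, N)},
\qquad
T_{\text{par}} \;=\; T_{\text{comp}} + T_{\text{comm}} + T_{\text{idle}}
```

Isoefficiency then asks how fast the problem size N must grow with P for E(P, N) to stay constant.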
Atom decomposition • Partition the atoms array across processors • Nearby atoms may not be on the same processor • Communication: O(N) per processor • Communication/Computation ratio: O(N) / (N/P) = O(P) • Again, not scalable by our definition
Force Decomposition • Distribute the force matrix to processors • Matrix is sparse, non-uniform • Each processor has one block • Communication: O(N/√P) per processor • Communication/Computation ratio: O(√P) • Better scalability in practice • (can use 100+ processors) • Plimpton • Hwang, Saltz, et al.: 6% on 32 PEs, 36% on 128 processors • Yet not scalable in the sense defined here!
Spatial Decomposition • Allocate close-by atoms to the same processor • Three variations possible: • Partitioning into P boxes, 1 per processor • Good scalability, but hard to implement • Partitioning into fixed-size boxes, each a little larger than the cut-off distance • Partitioning into smaller boxes • Communication: O(N/P) • so, scalable in principle
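A sketch of the second variation (fixed-size boxes slightly larger than the cut-off): an atom's patch follows directly from its coordinates, so all of its cut-off partners lie in its own patch or one of the 26 neighboring patches. The patchSize and origin parameters are illustrative.

```cpp
// Assign an atom to a spatial patch (box) from its coordinates (sketch).
#include <cmath>

struct PatchIndex { int ix, iy, iz; };

// patchSize is chosen slightly larger than the cut-off radius, so every atom
// within the cut-off of a given atom lies in its patch or one of the 26 neighbors.
PatchIndex patchOf(double x, double y, double z,
                   double originX, double originY, double originZ,
                   double patchSize) {
    return { static_cast<int>(std::floor((x - originX) / patchSize)),
             static_cast<int>(std::floor((y - originY) / patchSize)),
             static_cast<int>(std::floor((z - originZ) / patchSize)) };
}
```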
Spatial Decomposition in NAMD • NAMD 1 used spatial decomposition • Good theoretical isoefficiency, but for a fixed size system, load balancing problems • For midsize systems, got good speedups up to 16 processors…. • Use the symmetry of Newton’s 3rd law to facilitate load balancing
Spatial Decomposition • But the load balancing problems are still severe
FD + SD • Now, we have many more objects to load balance: • Each diamond can be assigned to any processor • Number of diamonds (3D): • 14·Number of Patches
Bond Forces • Multiple types of forces: • Bonds (2 atoms), Angles (3), Dihedrals (4), ... • Luckily, each involves atoms in neighboring patches only • Straightforward implementation: • Send a message to all neighbors, receive forces from them • 26 × 2 messages per patch!
Bonded Forces • Assume one patch per processor: • an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) • is calculated in patch (max{xi}, max{yi}, max{zi})
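The placement rule above, written out as a sketch (PatchIndex is a hypothetical type): the patch whose index is the component-wise maximum of the three atoms' patch indices computes the angle force, so each bonded term is computed exactly once.

```cpp
// Which patch computes an angle force over atoms in patches p1, p2, p3? (sketch)
#include <algorithm>

struct PatchIndex { int ix, iy, iz; };

PatchIndex angleForceOwner(PatchIndex p1, PatchIndex p2, PatchIndex p3) {
    // Component-wise maximum of the three patch indices.
    return { std::max({p1.ix, p2.ix, p3.ix}),
             std::max({p1.iy, p2.iy, p3.iy}),
             std::max({p1.iz, p2.iz, p3.iz}) };
}
```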
Implementation • Multiple objects per processor • Different types: patches, pairwise forces, bonded forces, ... • Each may have its data ready at different times • Need the ability to map and remap them • Need prioritized scheduling • Charm++ supports all of these
Load Balancing • Is a major challenge for this application • especially on a large number of processors • Unpredictable workloads • Each diamond (force object) and patch encapsulates a variable amount of work • Static estimates are inaccurate • Measurement based Load Balancing Framework • Robert Brunner's recent Ph.D. thesis • Very slow variations across timesteps
Bipartite graph balancing • Background load: • Patches (integration, ..) and bond-related forces: • Migratable load: • Non-bonded forces • Bipartite communication graph • between migratable and non-migratable objects • Challenge: • Balance Load while minimizing communication
Load balancing strategy • Greedy variant (simplified): sort compute objects (diamonds); repeat until all are assigned: S = set of all processors that are not overloaded and generate the least new communication; P = least loaded processor in S; assign the heaviest remaining compute to P • Refinement: repeat (pick a compute from the most overloaded PE; assign it to a suitable underloaded PE) until no movement
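A sketch of the greedy variant in C++, further simplified: the "least new communication" filter and the overload test are omitted, and compute ids are assumed to run from 0 to n-1. The refinement pass described above would follow as a separate step.

```cpp
// Greedy load balancing (sketch): heaviest compute first, assigned to the
// currently least-loaded processor.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Compute { int id; double load; };   // ids assumed to be 0..n-1

std::vector<int> greedyAssign(std::vector<Compute> computes, int numProcs) {
    // Sort compute objects by decreasing load.
    std::sort(computes.begin(), computes.end(),
              [](const Compute& a, const Compute& b) { return a.load > b.load; });

    // Min-heap of (current load, processor).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(computes.size(), -1);
    for (const Compute& c : computes) {
        auto [load, p] = procs.top();        // least-loaded processor
        procs.pop();
        assignment[c.id] = p;
        procs.push({load + c.load, p});
    }
    return assignment;   // a refinement pass would then move computes off
}                        // overloaded processors until no movement helps
```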
Optimizations • Series of optimizations • Examples to be covered here: • Grainsize distributions (bimodal) • Integration: message sending overheads
Grainsize and Amdahls’s law • A variant of Amdahl’s law, for objects, would be: • The fastest time can be no shorter than the time for the biggest single object! • How did it apply to us? • Sequential step time was 57 seconds • To run on 2k processors, no object should be more than 28 msecs. • Should be even shorter • Grainsize analysis via projections showed that was not so..
Grainsize analysis • Problem: a few compute objects have far more work than the rest (bimodal grainsize distribution) • Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
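A sketch of that splitting step (hypothetical structure, not the NAMD data types): any compute object whose estimated work, measured here by the number of interacting atom pairs, exceeds a threshold is replaced by two objects that each handle half of the pair list.

```cpp
// Split over-large compute objects (sketch) so no single object dominates a step.
#include <vector>

struct ComputeObj {
    int patchA, patchB;          // the pair of patches whose atoms interact
    int numAtomPairs;            // estimated work
    int part = 0, numParts = 1;  // which slice of the pair list this object handles
};

std::vector<ComputeObj> splitLargeComputes(const std::vector<ComputeObj>& in,
                                           int maxPairsPerObject) {
    std::vector<ComputeObj> out;
    for (const ComputeObj& c : in) {
        if (c.numAtomPairs <= maxPairsPerObject) { out.push_back(c); continue; }
        ComputeObj half = c;
        half.numParts = 2;
        half.numAtomPairs = c.numAtomPairs / 2;
        half.part = 0; out.push_back(half);   // first half of the pair list
        half.part = 1; out.push_back(half);   // second half
    }
    return out;
}
```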
Performance audit • Through the optimization process, an audit was kept to decide where to look to improve performance (times in seconds):

Component     Ideal   Actual
Total         57.04   86
nonBonded     52.44   49.77
Bonds          3.16    3.9
Integration    1.44    3.05
Overhead       0       7.97
Imbalance      0      10.45
Idle           0       9.25
Receives       0       1.61

Integration time doubled
Integration overhead analysis • Problem: integration time had doubled compared with the sequential run
Integration overhead example: • The Projections pictures showed that the overhead was associated with sending messages • Many cells were sending 30-40 messages • The overhead was still too large compared with the cost of the messages themselves • Code analysis: memory allocations! • An identical message was being sent to 30+ processors • Simple multicast support was added to Charm++ • It mainly eliminates memory allocations (and some copying)
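A plain C++ sketch of the effect of the optimization (this is not the Charm++ multicast API): instead of allocating and copying the identical message once per destination, the buffer is built once and handed to the transport for every destination. transportSend is a stand-in for the real send path.

```cpp
// Sketch: send one identical buffer to many destinations without re-allocating.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// Stand-in for the underlying transport; assumed not to take ownership of buf.
void transportSend(int destPe, const char* buf, std::size_t len) {
    std::printf("send %zu bytes to PE %d\n", len, destPe);  // stub
    (void)buf;
}

// Before: one allocation + copy per destination (what the profile showed).
// After: build the message once and reuse it for all 30+ destinations.
void multicast(const std::vector<int>& destPes,
               const char* payload, std::size_t len) {
    auto shared = std::make_unique<char[]>(len);    // single allocation
    std::copy(payload, payload + len, shared.get());
    for (int pe : destPes)
        transportSend(pe, shared.get(), len);       // no per-destination copy
}
```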
Lessons learned • Need to downsize objects! • Choose the smallest possible grainsize that amortizes overhead • One of the biggest challenges • was getting time for performance-tuning runs on parallel machines
Future and Planned work • Speedup on small molecules! • Interactive molecular dynamics • Increased speedups on 2k-10k processors • Smaller grainsizes • New algorithms for reducing communication impact • New load balancing strategies • Further performance improvements for PME/FMA • With multiple timestepping • Needs multi-phase load balancing
Steered MD: example picture • Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC
More information • Charm++ and associated framework: • http://charm.cs.uiuc.edu • NAMD and associated biophysics tools: • http://www.ks.uiuc.edu • Both include downloadable software
Performance: size of system • Performance data on Cray T3E