Long-Time Molecular Dynamics Simulations in Nano-Mechanics through Parallelization of the Time Domain
Ashok Srinivasan, Florida State University, http://www.cs.fsu.edu/~asriniva
Aim: Simulate for long time spans
Solution features: Use data from prior simulations to parallelize the time domain
Acknowledgements: NSF, ORNL, NERSC, NCSA
Collaborators: Yanan Yu and Namas Chandra
Outline • Background • Limitations of Conventional Parallelization • Example Application: Carbon Nanotube Tensile Test • Small Time Step Size in Molecular Dynamics Simulations • Other Time Parallelization Approaches • Data-Driven Time Parallelization • Experimental Results • Scaled efficiently to ~ 1000 processors, for a problem where conventional parallelization scales to just 2-3 processors • Conclusions
Background • Limitations of Conventional Parallelization • Example Application: Carbon Nanotube Tensile Test • Molecular Dynamics Simulations
Limitations of Conventional Parallelization • Conventional parallelization decomposes the state space across processors • It is effective for a large state space • It is not effective when the computational effort arises from a large number of time steps • … or when granularity becomes very fine due to a large number of processors
Example Application: Carbon Nanotube Tensile Test • Pull the CNT at a constant velocity • Determine the stress-strain response and yield strain (when the CNT starts breaking) using MD • The response is strain-rate dependent
A Drawback of Molecular Dynamics • Molecular dynamics • In each time step, the forces of atoms on each other are modeled using some potential • After the forces are computed, positions are updated • Repeat for the desired number of time steps • Time step size ~ 10^-15 seconds, due to physical and numerical considerations • The desired time range is much larger • A million time steps are required to reach 10^-9 s • Around a day of computing for a 3000-atom CNT • MD uses unrealistically large strain rates
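The force-then-update loop above can be sketched with a velocity Verlet integrator (a minimal one-dimensional sketch with a hypothetical harmonic force; the actual simulations use a many-body carbon potential):

```python
import numpy as np

def velocity_verlet(x, v, force, dt, n_steps, mass=1.0):
    """Advance positions x and velocities v by n_steps MD time steps."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-kick with current forces
        x = x + dt * v_half                # drift
        f = force(x)                       # recompute forces at new positions
        v = v_half + 0.5 * dt * f / mass   # second half-kick
    return x, v

# Hypothetical harmonic force, standing in for the CNT potential
harmonic = lambda x: -x
x, v = velocity_verlet(np.array([1.0]), np.array([0.0]), harmonic, 0.01, 100)
```

With a real time step of 10^-15 s, reaching 10^-9 s needs 10^6 such iterations, which is why the time domain dominates the cost.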
Other Time Parallelization Approaches • Waveform relaxation • Repeatedly solve for the entire time domain • Parallelizes well but convergence can be slow • Several variants to improve convergence • Parareal approach • Features similar to ours and to waveform relaxation • Precedes our approach • Not data-driven • Sequential phase for prediction • Not very effective in practice so far • Has much potential to be improved
Waveform Relaxation • Special case: Picard iterations • Ex: dy/dt = y, y(0) = 1 becomes • dy_{n+1}/dt = y_n(t), y_0(t) = 1 • In general • dy/dt = f(y, t), y(0) = y_0 becomes • dy_{n+1}/dt = g(y_n, y_{n+1}, t), y_0(t) = y_0, where g(u, u, t) = f(u, t) • g(y_n, y_{n+1}, t) = f(y_n, t): Picard • g(y_n, y_{n+1}, t) = f(y_{n+1}, t): converges in 1 iteration • Jacobi, Gauss-Seidel, and SOR versions of g can be defined • Many improvements • Ex: DIRM combines the above with reduced-order modeling [Figure: iterates y_N for N = 1, 2, 3, 4 against the exact solution]
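The Picard sweep for the dy/dt = y example can be sketched numerically (a minimal sketch using the trapezoid rule on a uniform grid; each sweep integrates the previous iterate over the whole time domain, which is what makes the method parallel in time):

```python
import numpy as np

def picard(f, y0, t, n_iters):
    """Picard iteration for dy/dt = f(y, t), y(0) = y0, on grid t:
    y_{n+1}(t) = y0 + integral_0^t f(y_n(s), s) ds."""
    y = np.full_like(t, y0, dtype=float)   # y_0(t) = y0
    for _ in range(n_iters):
        integrand = f(y, t)
        # cumulative trapezoid rule for the integral from 0 to each t
        integral = np.concatenate(([0.0], np.cumsum(
            0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))))
        y = y0 + integral
    return y

t = np.linspace(0.0, 1.0, 101)
y = picard(lambda y, t: y, 1.0, t, 20)   # iterates converge toward e^t
```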
Parareal Approach • Based on an "approximate-verify-correct" sequence • An example of shooting methods for time parallelization • Not yet shown to be effective in realistic situations [Figure: initial prediction, initial computed result, correction, second prediction]
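The parareal correction can be sketched on a hypothetical scalar decay problem, with one forward-Euler step as the coarse propagator and the exact solution as the fine one (the fine solves run in parallel in a real implementation):

```python
import numpy as np

def parareal(f_coarse, f_fine, u0, n_intervals, n_iters):
    """Parareal iteration: U_{n+1} <- G(U_n_new) + F(U_n_old) - G(U_n_old),
    where G is the cheap coarse propagator and F the expensive fine one."""
    # Initial prediction: sequential coarse sweep over all intervals
    U = [u0]
    for _ in range(n_intervals):
        U.append(f_coarse(U[-1]))
    for _ in range(n_iters):
        F = [f_fine(U[n]) for n in range(n_intervals)]       # parallel in practice
        G_old = [f_coarse(U[n]) for n in range(n_intervals)]
        U_new = [u0]
        for n in range(n_intervals):
            # sequential correction: new coarse + (fine - old coarse)
            U_new.append(f_coarse(U_new[-1]) + F[n] - G_old[n])
        U = U_new
    return U

coarse = lambda u: u * (1.0 - 0.1)     # one forward-Euler step of dy/dt = -y
fine = lambda u: u * np.exp(-0.1)      # exact propagator over dt = 0.1
U = parareal(coarse, fine, 1.0, 10, 10)
```

Parareal is exact at the first k interval boundaries after k iterations, so with as many iterations as intervals the toy run reproduces the fine sequential answer.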
Data-Driven Time Parallelization • Time Parallelization • Use Prior Data
Time Parallelization • Each processor simulates a different time interval • The initial state is obtained by prediction, except for processor 0 • Verify that the predicted end state is close to that computed by MD • Prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results • If the time interval is sufficiently large, then the communication overhead is small
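The predict-verify cycle above can be sketched as follows (a sequential toy sketch with a one-dimensional state and a hypothetical exact predictor; in the real scheme each md_simulate call runs on its own processor):

```python
def time_parallel_sweep(predict, md_simulate, close, init_state, n_intervals):
    """One sweep of time parallelization.
    predict(i): guess the state at the start of interval i from prior data.
    md_simulate(s): advance state s across one time interval with MD.
    close(a, b): True if two states agree within normal fluctuations."""
    guesses = [init_state] + [predict(i) for i in range(1, n_intervals)]
    results = [md_simulate(g) for g in guesses]   # parallel in practice
    accepted = 0
    for i in range(n_intervals - 1):
        if close(results[i], guesses[i + 1]):     # verify the prediction
            accepted += 1
        else:
            break  # later intervals must be redone starting from results[i]
    return results, accepted

# Toy dynamics: the state decays by 10% per interval; the predictor is exact
md = lambda s: 0.9 * s
predictor = lambda i: 0.9 ** i
close_fn = lambda a, b: abs(a - b) < 1e-9
results, accepted = time_parallel_sweep(predictor, md, close_fn, 1.0, 5)
```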
Problems with Multiple Time Scales • Fine-scale computations (such as MD) are more accurate, but more time consuming • Many of the details at the finer scale are unimportant, but some are [Figure: a simple schematic of multiple time scales]
Use Prior Data • Results for an identical simulation exist • Retrieve the results • Results for a slightly different parameter, with the same coarse-scale response, exist • Retrieve the results • Verify closeness, or pre-determine an acceptable parameter range • The current simulation behaves like different prior ones at different times • Identify similar prior results, learn the relationship, verify the prediction • Not similar to prior results • Try to identify the coarse-scale behavior, and apply dynamic iterations to improve on predictions
Experimental Results • CNT tensile test • CNT identical to prior results, but different strain rate • 1000-atom CNT, 300 K • Static and dynamic prediction • CNT identical to prior results, but different strain rate and temperature • CNT differs in size from prior results, and simulated with a different strain rate
Dimensionality Reduction • The movement of atoms in a 1000-atom CNT can be considered the motion of a point in 3000-dimensional space • Find a lower-dimensional subspace close to which the points lie • We use proper orthogonal decomposition (POD) • Find a low-dimensional affine subspace • Motion may, however, be complex in this subspace • Use results for different strain rates • Velocity = 10 m/s, 5 m/s, and 1 m/s • At five different time points • [U, S, V] = svd(shifted data) • Shifted data = U*S*V^T • States of the CNT are expressed as m + c1 u1 + c2 u2
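The POD construction above can be sketched in NumPy (with a synthetic low-rank snapshot matrix standing in for the 3000-dimensional CNT states):

```python
import numpy as np

# Snapshot matrix: each column is one recorded state (one coordinate per row)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 15)
snapshots = np.outer(np.sin(np.arange(30)), t) \
    + 0.01 * rng.standard_normal((30, 15))

m = snapshots.mean(axis=1, keepdims=True)          # mean state m
U, S, Vt = np.linalg.svd(snapshots - m, full_matrices=False)
u1, u2 = U[:, 0], U[:, 1]                          # dominant POD basis vectors

# Coefficients c1, c2 such that each state ~= m + c1*u1 + c2*u2
coeffs = U[:, :2].T @ (snapshots - m)
reconstruction = m + U[:, :2] @ coeffs
```

Because the singular values decay quickly, two modes capture nearly all of the motion, which is what makes the two-coefficient representation above usable for prediction.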
Basis Vectors from POD • CNT of ~100 Å with 1000 atoms at 300 K [Figure: u1 (blue) and u2 (red) for z; u1 (green) for x is not "significant". Blue: z; green, red: x, y]
Relate Strain Rate and Time • Coefficients of u1 • Blue: 1 m/s • Red: 5 m/s • Green: 10 m/s • Dotted line: same strain • Suggests that behavior is similar at similar strains • In general, clustering similar coefficients can give parameter-time relationships
Prediction When v Is the Only Parameter • Static predictor • Independently predict the change in each coordinate • Use precomputed results for 40 different time points each, for three different velocities • To predict for a (t, v) not in the database • Determine coefficients for nearby v at nearby strains • Fit a linear surface and interpolate/extrapolate to get coefficients c1 and c2 for (t, v) • Get the state as m + c1 u1 + c2 u2 • Dynamic predictor • Correct the above coefficients by determining the error between the previously predicted and computed states [Figure: green: 10 m/s, red: 5 m/s, blue: 1 m/s, magenta: 0.1 m/s, black: 0.1 m/s through direct prediction]
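The static predictor's interpolation step can be sketched as a linear-surface fit over the database entries (a minimal sketch with hypothetical database coefficients that happen to lie exactly on a linear surface):

```python
import numpy as np

def predict_coeff(db_v, db_strain, db_c, v, strain):
    """Static predictor: least-squares fit of a linear surface
    c = a0 + a1*v + a2*strain to nearby database coefficients,
    then interpolate/extrapolate to the requested (v, strain)."""
    A = np.column_stack([np.ones_like(db_v), db_v, db_strain])
    a, *_ = np.linalg.lstsq(A, db_c, rcond=None)
    return a[0] + a[1] * v + a[2] * strain

# Hypothetical database: coefficients at three velocities and two strains
db_v = np.array([1.0, 5.0, 10.0, 1.0, 5.0, 10.0])
db_strain = np.array([0.01, 0.01, 0.01, 0.02, 0.02, 0.02])
db_c = 2.0 + 0.3 * db_v - 5.0 * db_strain       # exactly linear toy data

# Extrapolate in velocity to v = 0.1 m/s, as in the 0.1 m/s runs
c1 = predict_coeff(db_v, db_strain, db_c, 0.1, 0.015)
```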
Verification of Prediction • Definition of equivalence of two states • Atoms vibrate around their mean positions • Consider states equivalent if the differences in position, potential energy, and temperature are within the normal range of fluctuations • Max displacement ~ 0.2 Å • Mean displacement ~ 0.08 Å • Potential energy fluctuation = 0.35% • Temperature fluctuation = 12.5 K [Figure: displacement from the mean position]
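The equivalence test can be sketched as a simple tolerance check (a sketch assuming positions in Å; the tolerance values are the fluctuation ranges quoted above):

```python
import numpy as np

# Normal fluctuation ranges for this test case (from the slide above)
MAX_DISP = 0.2        # Å, maximum per-atom displacement
MEAN_DISP = 0.08      # Å, mean displacement
PE_REL_TOL = 0.0035   # 0.35% potential-energy fluctuation
TEMP_TOL = 12.5       # K

def states_equivalent(pos_a, pos_b, pe_a, pe_b, temp_a, temp_b):
    """Treat a predicted state as verified if it differs from the
    MD-computed state by no more than normal thermal fluctuations."""
    disp = np.linalg.norm(pos_a - pos_b, axis=1)   # per-atom displacement
    return bool(disp.max() <= MAX_DISP
                and disp.mean() <= MEAN_DISP
                and abs(pe_a - pe_b) <= PE_REL_TOL * abs(pe_b)
                and abs(temp_a - temp_b) <= TEMP_TOL)

pos = np.zeros((10, 3))
ok = states_equivalent(pos, pos + 0.01, -100.0, -100.1, 300.0, 305.0)
bad = states_equivalent(pos, pos + 1.0, -100.0, -100.1, 300.0, 305.0)
```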
Stress-strain response at 0.1 m/s • Blue: Exact result • Green: Direct prediction with interpolation / extrapolation • Points close to yield involve extrapolation in velocity and strain • Red: Time parallel results
Speedup • Red line: ideal speedup • Blue: v = 0.1 m/s • Green: a different predictor • v = 1 m/s, using v = 10 m/s • CNT with 1000 atoms • Xeon/Myrinet cluster
Temperature and Velocity Vary • Use 1000-atom CNT results • Temperatures: 300 K, 600 K, 900 K, 1200 K • Velocities: 1 m/s, 5 m/s, 10 m/s • Dynamically choose the closest simulation for prediction [Figures: speedup (solid: 450 K, 2 m/s; dotted: linear) and stress-strain (blue: exact 450 K; red: 200 processors)]
CNTs of Varying Sizes • Use a 1000-atom CNT, 10 m/s, 300 K result • Parallelize 1200-, 1600-, and 2000-atom CNT runs • Observe that the dominant mode is approximately a linear function of the initial z-coordinate • Normalize coordinates to be in [0, 1] • z_{t+Δt} = z_t + z'_{t+Δt} Δt; predict z' [Figures: speedup (2000, 1600, and 1200 atoms; dotted: linear) and stress-strain (blue: exact 2000 atoms, 1 m/s; red: 200 processors)]
Predict Change in Coordinates • Express x' in terms of basis functions • Example: • x'_{t+Δt} = a_{0,t+Δt} + a_{1,t+Δt} x_t • a_{0,t+Δt} and a_{1,t+Δt} are unknown • Express the changes, y, for the base (old) simulation similarly, in terms of coefficients b, and perform a least-squares fit • Predict a_{i,t+Δt} as b_{i,t+Δt} + R_{t+Δt} • R_{t+Δt} = (1 - β) R_t + β (a_{i,t} - b_{i,t}) • Intuitively, the difference between the base coefficient and the current coefficient is predicted as a weighted combination of previous differences • We use β = 0.5 • Gives more weight to the latest results • Does not let random fluctuations affect the predictor too much • Velocities are estimated from the latest accurately known results
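The least-squares fit and the smoothed correction R can be sketched together (a toy one-dimensional example; `beta` corresponds to the weight 0.5 above, and the linear toy data is hypothetical):

```python
import numpy as np

def fit_base_coeffs(x_prev, x_change):
    """Least-squares fit of x'_{t+dt} = b0 + b1 * x_t to the base simulation."""
    A = np.column_stack([np.ones_like(x_prev), x_prev])
    b, *_ = np.linalg.lstsq(A, x_change, rcond=None)
    return b

def corrected_coeffs(b_new, a_prev, b_prev, R_prev, beta=0.5):
    """Predict a_{t+dt} = b_{t+dt} + R_{t+dt}, where the correction is
    smoothed as R_{t+dt} = (1 - beta) * R_t + beta * (a_t - b_t)."""
    R = (1.0 - beta) * R_prev + beta * (a_prev - b_prev)
    return b_new + R, R

# Toy base simulation whose coordinate change is exactly linear in position
x_prev = np.array([0.0, 1.0, 2.0, 3.0])
b = fit_base_coeffs(x_prev, 0.5 + 0.1 * x_prev)     # recovers b0, b1
a, R = corrected_coeffs(b, np.array([0.7, 0.2]), b, np.zeros(2))
```

The smoothing in `corrected_coeffs` is what keeps one noisy interval from dominating the prediction, while still tracking a systematic drift between the base and current simulations.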
Conclusions • Data-driven time parallelization shows significant improvement in speed, without significantly sacrificing accuracy, in the CNT tensile test • The 980-processor simulation attained a flop rate of ~420 Gflops • Its rate of ~420 Mflops/atom is likely the largest flop-per-atom rate in classical MD simulations • References • See http://www.cs.fsu.edu/~asriniva/research.html
Future Work • More complex problems • Better prediction • POD is good for representing data, but not necessarily for identifying patterns • Use better dimensionality reduction / reduced order modeling techniques • Better learning • Better verification