Generalized Atomic Systems: A Tool Kit for Atomistic Simulation Data Michael Waters Katie Sebeck 2/20/2013
Overview • Traditional Workflow in Molecular Dynamics • Defining the Problem • An Interchangeable Approach • Aiding Analysis • Current Usage
Basics of Atomistic Simulations • Atoms in boxes • Positions • Updated by iteratively solving F=ma according to empirical force fields • Velocity • Type, charge, etc. • System-wide data • Simulation box • Number of atoms • Temperature, energy, pair potentials…
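The "iteratively solving F=ma" step can be sketched with a velocity Verlet integrator. This is only a minimal illustration, assuming a single particle in 1D on a harmonic spring instead of an empirical force field; `velocity_verlet` and `spring_force` are hypothetical names, not part of any MD code mentioned in these slides.

```python
def velocity_verlet(x, v, mass, force, dt, n_steps):
    """Advance one particle in 1D by n_steps of size dt using velocity Verlet."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity from the old force
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
    return x, v


def spring_force(x, k=1.0):
    """Stand-in force law (assumption): a harmonic spring, F = -k x."""
    return -k * x


# Start at x = 1 with zero velocity and integrate for 10 time units.
x, v = velocity_verlet(x=1.0, v=0.0, mass=1.0, force=spring_force,
                       dt=0.01, n_steps=1000)
energy = 0.5 * v * v + 0.5 * x * x   # kinetic + potential for m = k = 1
print(x, v, energy)
```

A production code applies the same update to every atom, with forces summed from the pair and bond terms of the force field.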
ALL molecular dynamics data can be contained in ASCII text files
A Brief Guide to Atomistic File Types • pdb, xyz, mol, cfg, sfd, gro, mdl, LAMMPS read_data, ccm, xsd, cif, car…
Through a Traditional Workflow • Control file • Structure file • Format depends on program
Control file:
units real
timestep 1.0
atom_style bond
dimension 3
boundary p p p
#---------------Coordinates and Bonds --------------
lattice fcc 1.0
region 1 block -9.025 -1.805 0 70.395 0 37.905
#N=28
read_data n28lat
pair_style lj/cut 9.805
pair_coeff 1 1 0.1431 3.923
pair_coeff 2 2 0.1432 3.923
pair_coeff 3 3 4.72 2.616
pair_modify mix arithmetic
bond_style harmonic
bond_coeff 1 41.82 1.54
group alkane type 1 2
group copper type 3
neighbor 1.0 bin
thermo 1
thermo_style custom step temp pe ke etotal
#minimize 1.0e-4 1.0e-6 100 1000
fix hope all nve
run 100000
Structure file:
n=16, 500 Chains, rho=0.7918
8000 atoms
3 atom types
7500 bonds
1 bond types
0 angles
0 dihedrals
0 impropers
0 92.055 xlo xhi
0 70.395 ylo yhi
0 37.905 zlo zhi
Masses
1 14.002
2 14.002
3 63.54
Atoms
1 1 2 1.80500000000000 1.80500000000000 1.80500000000000
2 1 1 2.65313400000000 3.07841000000000 1.80500000000000
Through a Traditional Workflow • Information about simulation run in control file • Hardware, software version metadata formatting depends on system configuration • Produces output of overall run statistics
Loop time of 3515.13 on 32 procs for 50000 steps with 107008 atoms
Pair time (%) = 1108.83 (31.5444)
Bond time (%) = 78.4225 (2.231)
Neigh time (%) = 162.274 (4.61645)
Comm time (%) = 1270 (36.1294)
Outpt time (%) = 523.248 (14.8856)
Other time (%) = 372.363 (10.5931)
Nlocal: 3344 ave 8049 max 0 min
Histogram: 16 0 0 0 0 0 2 6 3 5
Nghost: 7940.66 ave 15817 max 0 min
Histogram: 8 4 4 0 0 0 0 0 8 8
Neighs: 862976 ave 2.19776e+06 max 0 min
Histogram: 16 0 0 0 0 2 2 6 2 4
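Run statistics like these usually end up being parsed by hand-written scripts. The sketch below shows one way to pull the timing breakdown into a dictionary, assuming the log uses exactly the "Name time (%) = seconds (percent)" layout shown on this slide; `parse_timing_breakdown` is a hypothetical helper, and other program versions may lay this section out differently.

```python
import re

# Matches lines such as "Pair time (%) = 1108.83 (31.5444)".
TIMING_LINE = re.compile(r"^(\w+)\s+time \(%\) = (\S+) \((\S+)\)$")


def parse_timing_breakdown(log_text):
    """Return {category: (seconds, percent)} for each timing line found."""
    breakdown = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            category, seconds, percent = match.groups()
            breakdown[category] = (float(seconds), float(percent))
    return breakdown


sample = """Loop time of 3515.13 on 32 procs for 50000 steps with 107008 atoms
Pair time (%) = 1108.83 (31.5444)
Bond time (%) = 78.4225 (2.231)
Comm time (%) = 1270 (36.1294)
"""
timings = parse_timing_breakdown(sample)
# Find the category that consumed the most wall-clock time.
slowest = max(timings, key=lambda name: timings[name][0])
print(slowest, timings[slowest])
```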
Through a Traditional Workflow • Output files generally dictated by control file • Final structure file • System properties log • Other run-time analysis outputs • HIGHLY VARIED FORMATTING! • Quantitative analysis of output by scripting, MATLAB or Excel
Through a Traditional Workflow • Output structure file may or may not be in a format which can be fed into visualization software • Many software options available: • VMD • Avogadro • POVray • VESTA • … • Analysis output may or may not be in a format which can be parsed by plotting software
An Endless Series of Parsing Problems • Input file • Convert from something you can manipulate/generate to something the code can read • Output analysis • Typically requires writing new parsing routines • Different codes require rewriting scripts • Visualizations • May require extracting data from other files manually • Most visualization code is already equipped to parse a variety of file types
Data from Legacy Code • Locally developed molecular dynamics code, FLX • Trying to port data into another code, LAMMPS • Ctrl+C, Ctrl+V and lots of manual editing… • Very time consuming for each file
Obstacles to Data Sharing and Reuse • Energy barrier of converting file formats • Example: A file downloaded directly from the Protein Data Bank (.pdb) may not be readable by an MD code (LAMMPS) • Extracting relevant quantities from available data sets • Parsing rules not always clear if unfamiliar with the format • Formats not always well documented
Problem Statement • Too much redundant work • Too little documentation or code clarity • Too much time spent manipulating data formatting • How can we fix this?
Our Approach: Interchangeable Libraries • We created a General Atomic System (GAS) class • All file read functions generate a GAS object • GAS objects are accepted by • Write file functions • Analysis functions • Manipulation functions
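The hub idea on this slide (every reader returns a GAS object, every writer accepts one) can be sketched as follows. This is not the authors' implementation: the class name `GeneralAtomicSystem`, its fields, and the `read_xyz`/`write_xyz` functions are assumptions for illustration, and only the simple XYZ layout (atom count, comment line, then "label x y z" rows) is handled.

```python
import io
from dataclasses import dataclass


@dataclass
class GeneralAtomicSystem:
    """Hypothetical minimal container: one type label and one position per atom."""
    types: list
    positions: list
    comment: str = ""


def read_xyz(handle):
    """Read an XYZ-style file object and return a GeneralAtomicSystem."""
    n_atoms = int(handle.readline())
    comment = handle.readline().strip()
    types, positions = [], []
    for _ in range(n_atoms):
        fields = handle.readline().split()
        types.append(fields[0])
        positions.append(tuple(float(value) for value in fields[1:4]))
    return GeneralAtomicSystem(types, positions, comment)


def write_xyz(system, handle):
    """Write any GeneralAtomicSystem back out in XYZ layout."""
    handle.write(f"{len(system.types)}\n{system.comment}\n")
    for label, (x, y, z) in zip(system.types, system.positions):
        handle.write(f"{label} {x:.6f} {y:.6f} {z:.6f}\n")


# Round trip through the common object using in-memory files.
source = io.StringIO("2\nwater fragment\nO 0.0 0.0 0.0\nH 0.96 0.0 0.0\n")
system = read_xyz(source)
target = io.StringIO()
write_xyz(system, target)
print(target.getvalue())
```

With this layout, supporting a new format means writing one reader and one writer against the common object, not one converter per pair of formats.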
Examining Existing Standards for Commonalities • Positions • Type • Number of atoms
Examining Existing Standards for Commonalities • Positions • Type • Number of atoms/ end of atoms section
Creating a Common Data Structure • GAS class contains • System data • Internal functions • Trivial ontology • Simplicity in data structure is flexibility • Internal functions should be as reliable as possible • Obvious and explicit naming schemes
User Time Savings • From read_data to xyz: timing comparisons • Manual copy-paste, eliminating excess columns: 2.15 minutes • Calling functions, including typing out calls: 1.05 minutes • Actual function timing: ~6 seconds
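For a sense of what the read_data-to-xyz conversion involves, here is a rough sketch. It assumes the Atoms section uses the `atom_style bond` column order (atom ID, molecule ID, atom type, x, y, z) as in the structure file shown earlier, and it writes the numeric atom type as the XYZ label. `lammps_data_to_xyz` is a hypothetical function, not the GAS library call that was timed.

```python
def lammps_data_to_xyz(text):
    """Convert the Atoms section of a LAMMPS data file (as text) to XYZ text."""
    lines = [line.strip() for line in text.splitlines()]
    # Header line of the form "8000 atoms" gives the atom count.
    n_atoms = next(int(line.split()[0]) for line in lines if line.endswith(" atoms"))
    # Atom records start after the "Atoms" section keyword.
    start = next(i for i, line in enumerate(lines) if line.split()[:1] == ["Atoms"]) + 1
    atom_lines = [line for line in lines[start:] if line][:n_atoms]
    out = [str(n_atoms), lines[0]]  # reuse the data file's title as the comment line
    for line in atom_lines:
        atom_id, mol_id, atom_type, x, y, z = line.split()[:6]
        out.append(f"{atom_type} {x} {y} {z}")  # drop the ID columns
    return "\n".join(out) + "\n"


sample = """n=16, 500 Chains, rho=0.7918

2 atoms
3 atom types

Atoms

1 1 2 1.805 1.805 1.805
2 1 1 2.653134 3.07841 1.805
"""
print(lammps_data_to_xyz(sample))
```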
Aiding Analysis • With all data in standard structure: • Write all analysis based on this format • Input format independent • Allows reuse of analysis functions • Reuse begs for optimization • Intended reuse encourages documentation • Nested analyses now possible • Modularization saves: • Time • Effort • Error
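A small sketch of format-independent, nested analysis: one function computes bond lengths from the common structure, and a second function reuses it. The `GeneralAtomicSystem` class and both function names are hypothetical stand-ins chosen for this example; the real GAS class may store bonds differently.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneralAtomicSystem:
    """Hypothetical stand-in for a GAS object."""
    positions: list  # one (x, y, z) tuple per atom
    bonds: list      # (i, j) pairs of zero-based atom indices


def bond_lengths(system):
    """Length of every bond; works no matter which file format was read."""
    return [math.dist(system.positions[i], system.positions[j])
            for i, j in system.bonds]


def mean_bond_length(system):
    """Nested analysis: built on bond_lengths instead of re-parsing anything."""
    lengths = bond_lengths(system)
    return sum(lengths) / len(lengths)


chain = GeneralAtomicSystem(
    positions=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0)],
    bonds=[(0, 1), (1, 2)],
)
print(bond_lengths(chain), mean_bond_length(chain))
```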
Traditional Scripting Problems • Scripts typically used for: • Quantitative analysis • Modifying files to be parsed by various software • Rewriting input/output handling for each script • MATLAB, sed, awk and grep are not the friendliest or fastest parsing tools • Lack of commenting • Can only be applied to specific file types or a single file
Examples of Scripting • Bash script run time: 2.5 seconds
The Python Version… • Once a function is written, it can be called in just a few lines on ANY GAS system containing sufficient information • Run time: 0.4 seconds
User Time Savings • Open source and custom function libraries instead of MATLAB allow for brute-force parallelization and shifting of load to external resources • Faster run times: • 2.5 seconds using bash versus 0.4 seconds in Python • Faster coding times • Reuse of functions without additional modifications needed • Eliminating redundant coding efforts • Use of common language promotes code reusability • Writing code for “future” self as well as others
Ways We’re Using GAS • Polymerization • Analyze pair-pair distances • Alter system topology • Automatically generate system readable file • Iterative system analysis • Quantitative analysis of a series of files • Radial distribution functions • Density profile • Bond length distributions • Automatically generates easily parsed output files • Automatic movie rendering
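As one example of the quantitative analyses listed, a density profile along a single axis can be sketched as a histogram of atom coordinates divided by the slab volume. This is an illustration under stated assumptions, not the GAS implementation: `density_profile` and its parameters are hypothetical, the box is assumed orthogonal, and atoms outside [z_lo, z_hi) are ignored.

```python
import math


def density_profile(z_coords, z_lo, z_hi, cross_section_area, n_bins):
    """Number density (atoms per unit volume) in equal-width slabs along z."""
    width = (z_hi - z_lo) / n_bins
    counts = [0] * n_bins
    for z in z_coords:
        index = math.floor((z - z_lo) / width)  # slab that contains this atom
        if 0 <= index < n_bins:                 # skip atoms outside the range
            counts[index] += 1
    slab_volume = width * cross_section_area
    return [count / slab_volume for count in counts]


# Four atoms inside a box of height 4 and cross-section 2, plus one stray atom.
profile = density_profile([0.5, 1.5, 1.7, 3.2, -0.3], z_lo=0.0, z_hi=4.0,
                          cross_section_area=2.0, n_bins=4)
print(profile)
```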
Moving Forward • More file formats • More advanced analysis methods and functions • Density functional theory support • Non-spherical particles • Collaboration with other groups • Better metadata integration
Final Thoughts • Our lives are much better • Our code is much more consistent • Future users have a hope of understanding what we did • If you want people to use it, it needs to be USEFUL and EASY