
NSF CI days @ U Kentucky, February 2010

Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale… Thomas E. Cheatham III, Associate Professor, tec3@utah.edu. Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry, College of Pharmacy, University of Utah.




Presentation Transcript


  1. Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale…
     Thomas E. Cheatham III, Associate Professor, tec3@utah.edu
     Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry, College of Pharmacy, University of Utah
     - NSF TeraGrid Science Advisory Board
     - NSF LRAC/MRAC allocations panel (~2002-2008), chair
     - NSF LRAC award since ~2001; ||-computing since 1987; ~17M hours this year on local and NSF machines
     - U Utah CI Council; Information Technology Council; CHPC
     NSF CI days @ U Kentucky, February 2010
  2. eScience = cyberinfrastructure (???)
     "The term 'e-Science' denotes the systematic development of research methods that exploit advanced computational thinking." (Professor Malcolm Atkinson, e-Science Envoy)
     "Cyberinfrastructure" consists of computing systems, data storage systems, data repositories and advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable breakthroughs not otherwise possible. (EDUCAUSE, Campus Cyberinfrastructure Working Group)
  3. "If you're a scientist, talk to a computer scientist about your challenges, and vice versa." e.g., clustering, data handling, …
  4. How do drugs bind and influence structure (and dynamics)? the tool: biomolecular simulation
  5. The tool: biomolecular simulation (energy vs. sampling)
  6. What is bio-molecular simulation? A "physics"-based atomic potential—the force field—tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …). The energy terms: bonds, angles, dihedrals, van der Waals, and electrostatics (partial charges δ+/δ-). There are many force fields, each with distinct performance characteristics… (functional form sketched below)
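     A typical additive force field has the generic textbook form below (a sketch of the functional form, not necessarily the exact potential used by any one of the codes named on the next slide):

         E(r) = \sum_{bonds} K_b (b - b_0)^2
              + \sum_{angles} K_\theta (\theta - \theta_0)^2
              + \sum_{dihedrals} (V_n / 2) [1 + \cos(n\phi - \gamma)]
              + \sum_{i<j} [ A_{ij}/r_{ij}^{12} - B_{ij}/r_{ij}^{6} + q_i q_j / (4\pi\epsilon_0 r_{ij}) ]

     The first three sums are the bonded terms (bonds, angles, dihedrals); the final sum carries the van der Waals (Lennard-Jones) and electrostatic interactions between the partial charges q_i.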
  7. What is bio-molecular simulation? A "physics"-based atomic potential—the force field—tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …). Codes and methods developed over the past ~40 years by various teams, including centers, labs and industry:
     - 80's: vectorization + early parallel architectures
     - 90's: shared-memory and distributed-memory parallelization
     - 00's: special-purpose hardware and optimized codes
     AMBER, CHARMM, Encad, NAMD, Desmond, GROMOS, Gromacs, LAMMPS, …
  8. CAPTOPRIL: ACE inhibitor (antihypertensive)
     VIRACEPT: HIV protease inhibitor (AIDS therapy). Agouron.
     CRIXIVAN: HIV protease inhibitor (AIDS therapy). Merck.
     VIAGRA: cGMP PDE type 5 inhibitor (impotence). Pfizer.
     ZOMIG: tryptamine (5-HT) receptor agonist (migraine). Zeneca.
     TEVETEN: angiotensin II receptor antagonist (hypertension)
     TRUSOPT: carbonic anhydrase inhibitor (glaucoma)
     ARICEPT: AChE inhibitor (Alzheimer's/dementia)
     COZAAR: angiotensin II receptor antagonist (hypertension)
     NOROXIN: inhibits bacterial DNA synthesis (antibacterial)
     Plus: de novo protein folding and structure prediction; computer-aided drug design; design of novel materials/properties; multi-scale modeling; simulation of time scales that are approaching relevant time scales.
  9. General goals of bio-molecular simulation research:
     - Do the simulations model reality?
     - How can we assess & validate the results?
     - Can simulations provide predictive insight?
     - How can we improve the applied methods?
  10. General goals of bio-molecular simulation research:
      - Do the simulations model reality?
      - How can we assess & validate the results?
      - Can simulations provide predictive insight?
      - How can we improve the applied methods?
      …and, increasingly, computational science (???):
      - How can we facilitate the simulation experiment?
      - How can we better disseminate the data?
      - How can we use the emerging machines?
  11. Experience required?
      - structural biology, statistical mechanics, biophysics / computational chemistry, pharmacy / organic chemistry
      - UNIX / system administration, coding ability (Fortran90, scripting, …), parallel computing, data handling, analysis, visualization
      One path: B.A. Chemistry; B.A. Mathematics & Comp. Sci.; PhD Pharm. Chem.; Programmer/Analyst, 2 yrs; + NSF centers. Still learning… interdisciplinary teams?
  12. The power of the TeraGrid (aka metacenter, PACI, xD: NSF centers)
      Education / training:
      - CM-2a, CM-5, MasPar training, ~1989
      - Summer Institute in Supercomputing at PSC, 1992
      - Scientific Computing Institute at Los Alamos, 1992 (vectorization, basic concepts of shared vs. distributed memory)
      - Heterogeneous computing at PSC, 1994 (shared memory + MPI (+ PVM, TCGMSG, …))
      - Shared memory and MPI parallelized AMBER released (PSC, SGI)
      - AMBER workshops (as teacher), 1996 & 1998
      Outreach: center brochures, literature, WWW pages, joint publications; Computerworld Smithsonian Awards Finalist (with PSC, UCSF, NIEHS)
      Cycles!!! friendly-user status, consultants, helpline, porting guides
      Allocations: ~100K hours in 1995, ~1M in 2002, ~10M in 2009, ~14M in 2010, …
  13. Curious trends, barriers and limitations in the field… $$$:
      - 1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure
      - 1R01-GM079383-01: "AMBER force field consortium"
      Research funding focuses on the NIH mission (basic science + health relevance); funding is results driven, with little reward for software optimization. NIH does not really fund (or support) supercomputing / CI… yet NIH funds the bulk of biomolecular simulation research (?)
  14. Curious trends, barriers and limitations in the field… $$$:
      - 1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure
      - 1R01-GM079383-01: "AMBER force field consortium"
      Research funding focuses on the NIH mission (basic science + health relevance); funding is results driven, with little reward for software optimization. NIH does not really fund (or support) supercomputing / CI… yet NIH funds the bulk of biomolecular simulation research (?)
      Students' PhDs tend to be in "chemistry" (no expertise in computational science). Codes are complex, legacy, and evolving…
  15. [Pie chart: NSF-center cycle usage by discipline: BIO 30%, PHY 18%, AST 14%, CHE 11%, ENG 10%, DMR 8%, IND 4%, CIS 3%, GEO 2%, DMS 0%, SBE 0%]
      Curious trends, barriers and limitations in the field…
      - NSF does not directly fund most biomolecular simulation research
      - few agencies or companies support biosimulation code development
      - bulk of cycles in the field come from NSF centers, then DOE
      - 10% cap on NIH research vs. inter-agency cooperation?
  16. [Pie chart: NSF-center cycle usage by discipline, as on the previous slide]
      Curious trends, barriers and limitations in the field…
      - NSF does not directly fund most biomolecular simulation research
      - few agencies or companies support biosimulation code development
      - bulk of cycles in the field come from NSF centers, then DOE
      - 10% cap on NIH research vs. inter-agency cooperation?
      Threats: without NSF cycles and the TeraGrid/xD, the field of biomolecular simulation would stagnate.
  17. [Pie chart: NSF-center cycle usage by discipline, as on the previous slides]
      Curious trends, barriers and limitations in the field…
      - NSF does not directly fund most biomolecular simulation research
      - few agencies or companies support biosimulation code development
      - bulk of cycles in the field come from NSF centers, then DOE
      - 10% cap on NIH research vs. inter-agency cooperation?
      Threats: without NSF cycles and the TeraGrid/xD, the field of biomolecular simulation would stagnate.
      …we are spending more and more of our time running simulations, managing workflow, and transferring data, i.e. doing computational science.
  18. Bio-molecular simulation at the meta-scale: MD simulations of ~500 ps - 3 ns, ~1994-1997.
      - simulations run for ~6 months, 16-32-way parallel, batch
      - < 100 GB data; run remotely, stored and analyzed locally
      - analysis is standard (key values vs. time; see the sketch after this list)
      Required advances (completed):
      - methods improvement (PME electrostatics)
      - optimized codes for shared memory, MPI, …
      - development of general-purpose analysis utilities ("ptraj")
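      As an illustration of "key values vs. time" analysis, here is a minimal Python/NumPy sketch of the kind of per-frame observable ptraj computes (best-fit RMSD via Kabsch superposition). The trajectory below is random stand-in data; a real workflow would read coordinates from trajectory files instead:

          import numpy as np

          def kabsch_rmsd(P, Q):
              """Best-fit RMSD of frame P onto reference Q (both N x 3 arrays)."""
              P = P - P.mean(axis=0)              # remove translation
              Q = Q - Q.mean(axis=0)
              U, S, Wt = np.linalg.svd(P.T @ Q)   # Kabsch: optimal rotation via SVD
              d = np.sign(np.linalg.det(U @ Wt))  # guard against reflections
              R = Wt.T @ np.diag([1.0, 1.0, d]) @ U.T
              return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))

          # Stand-in "trajectory": 1000 frames of 20 atoms jittering about Q.
          rng = np.random.default_rng(0)
          Q = rng.normal(size=(20, 3))
          traj = Q + 0.1 * rng.normal(size=(1000, 20, 3))
          rmsd_vs_time = [kabsch_rmsd(frame, Q) for frame in traj]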
  19. Bio-molecular simulation at the tera+ scale:
      - DNA minor groove binders: 7 drugs, 2 binding modes, 4 sequences @ ~50 ns
      - tetraloop receptor: 5 simulations @ ~200 ns
      - cyp-P450 2B4: 8 simulations @ ~150 ns
      Simulations run for ~6 months, 16-1K-way parallel, batch; ~1-5 TB per set, run remotely, stored and analyzed locally. Analysis has become rate limiting; the data are too large/slow to move… (see the arithmetic after this list)
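      The "~1-5 TB per set" figure is easy to reproduce with back-of-envelope arithmetic (the atom count and save interval below are assumptions chosen for illustration, not the actual simulation parameters):

          # Raw trajectory size: atoms x 3 coordinates x 4 bytes x frames.
          n_atoms = 100_000                     # assumed solvated system size
          n_frames = 50_000                     # 50 ns saved every 1 ps (assumed)
          tb_per_sim = n_atoms * 3 * 4 * n_frames / 1e12
          n_sims = 7 * 2 * 4                    # drugs x binding modes x sequences
          print(f"{tb_per_sim:.2f} TB/simulation, {tb_per_sim * n_sims:.1f} TB for the set")
          # -> 0.06 TB/simulation, 3.4 TB for the set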
  20. Data is complex: how to simplify? (Don't throw out the baby with the bathwater.) Vast time/size scales; granularity scales.
  21. …if we know what we want to see, analyzing and visualizing is easy… …and tools are available
  22. Force fields vs. sampling:
      - we (likely) have systematic problems with structure, or converge to an incorrect structure (a force-field problem)
      - we (likely) get trapped in meta-stable conformations (a sampling problem). Computer power?
      [Schematic: energy vs. reaction coordinate, contrasting "the good" and "the bad" energy landscapes]
  23. David E. Shaw: DESRES 16 microseconds / day !!!
  24. Funny things can and do happen… & we’re experiencing serious data overload… 500 nanosecond simulation of a DNA duplex using generalized Born implicit solvation
  25. Some problems (~2000-2008): phased A-tracts; a burrowing Mg2+ ion?; a K+, Cl-, Mg2+ crystal?
  26. Joung / Cheatham, JPCB 113, 13279 (2009)
  27. How about long DNA simulation? > 500 ns on DAPI-bound DNA duplexes, Cornell et al. force field. Špačková et al. (Cheatham, Šponer), J. Amer. Chem. Soc. (2003). (worked example below)

          site    E(complex)   E(DNA+20w)   E(DAPI)     ΔG      ΔΔG*
          ATTG     -4085.0      -3915.6     -149.7     -19.7    -2.4
          AATT     -4086.4      -3917.9     -149.7     -18.8
          ATTG     -4085.7      -3916.4     -149.7     -19.6    +1.0
          AATT     -4087.5      -3917.2     -149.7     -20.6
          ATTG     -4087.2      -3918.7     -149.7     -18.8    +1.4
          AATT     -4092.8      -3922.9     -149.7     -20.2
          * Includes entropic differences
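      Reading the table: each row's binding estimate is ΔG = E(complex) - E(DNA+20w) - E(DAPI); for the first ATTG row, -4085.0 + 3915.6 + 149.7 = -19.7. ΔΔG then compares the ATTG and AATT sites within a pair, e.g. -19.6 - (-20.6) = +1.0 for the second pair. The starred values also fold in entropic differences, which is why the first ΔΔG is not simply -19.7 + 18.8 = -0.9.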
  28. (CCAATTGG)2GG at ~350 ns (two separate simulations): …DNA duplex structure goes away and never comes back…
  29. RNA is more difficult… …but also much more interesting!
      - dynamics / flexibility; > 1 conformation
      - structure is (very) sensitive to the surroundings
      - un-validated force fields
      - very few drug-bound structures…
  30. 8 "long" MD simulations (~20-100 ns): restraints progressively violated…
      [Secondary-structure diagram of the RNA (residues numbered, including pseudouridine Ψ)]
      NOE distance statistics from the trajectory analysis:

          STATISTICS d109
          DISTANCE between atoms :9@H5 & :7@H1'
          AVERAGE: 6.8887 (2.7204 stddev)
          INITIAL: 4.2624   FINAL: 6.5966
          NOE SERIES: S < 2.9, M < 3.5, w < 5.0, blank otherwise.
          |SMMMMWMMWMWW            W    |
          NOE < 4.30 for 21.86% of the time
          NOE < 4.80 for 24.83% of the time
                      | < 2.5 | 2.5-3.5 | 3.5-4.5 | 4.5-5.5 | 5.5-6.5 | > 6.5
          %occupied   |  0.7  |  13.1   |   9.2   |   6.2   |  10.0   | 60.7
  31. peta- to exa-scale worries…
  32. Petascale science: scaling. It is hard to ||-ize (parallelize) time. MD codes scale to ~16-256 processors at > 70% efficiency (see the arithmetic after this list):
      ► getting to 1,000 is do-able (Bob Duke, UNC; Schulten, UIUC; D. E. Shaw; E. Lindahl)
      ► getting to 10,000 is hard (PetaApps)
      ► getting to 100,000: ??? (ensemble methods; not easily with embarrassingly NOT parallel MD)
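      Why the wall appears is one line of arithmetic: with Amdahl's law and an assumed (purely illustrative, not measured) serial fraction of 0.1%, parallel efficiency collapses long before 100,000 cores.

          # Amdahl's law: speedup(p) = 1 / ((1 - f) + f / p); efficiency = speedup / p
          f = 0.999                  # assumed parallelizable fraction of an MD step
          for p in (16, 256, 1_000, 10_000, 100_000):
              eff = (1.0 / ((1.0 - f) + f / p)) / p
              print(f"{p:>7} cores: {eff:6.1%} efficiency")
          # -> 16: 98.5%   256: 79.7%   1,000: 50.0%   10,000: 9.1%   100,000: 1.0%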
  33. Data management & simulation workflow are limiting… the standard means of analysis is breaking down… A typical ptraj input script:

          trajin traj.1.gz
          trajin traj.2.gz
          trajin traj.3.gz
          trajout traj.strip
          center :1-10 mass origin
          image origin center
          center :1-20 mass origin
          image origin center
          rms first mass out rms.dat :1-20
          distance d1 out d1.dat :1 :10
          grid wat.xplor 100 0.5 100 0.5 100 0.5 :3-8,13-18
          strip :WAT
          average pdb average.pdb :1-20
  34. Data management & simulation workflow are limiting… the standard means of analysis is breaking down… (same ptraj script as on the previous slide)
      New modes of operation: ENSEMBLES
      - replica-exchange (exchange test sketched below)
      - path integral
      - EVB
      - ΔG simulations
      - NEB / path sampling
      - meta-dynamics
      Essentially a set of loosely coupled 16-1K processor jobs. More data, more complicated workflow, …
      Examples: two ΔG states × 20 windows = 40 instances; × 20 temperatures = 800 instances. 256 frames on a reaction path × 16 beads per particle × … =
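      The coupling between ensemble members is typically a cheap exchange test, which is why these jobs stay loosely coupled. For temperature replica exchange, the standard Metropolis criterion looks like this (a sketch; the energies and temperatures fed in are up to the caller):

          import math, random

          K_B = 0.0019872041  # Boltzmann constant, kcal/(mol*K)

          def exchange_accepted(E_i, T_i, E_j, T_j):
              """Metropolis test for swapping configurations of replicas i and j."""
              delta = (1.0 / (K_B * T_i) - 1.0 / (K_B * T_j)) * (E_i - E_j)
              return delta >= 0 or random.random() < math.exp(delta)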
  35. Petascale science: the problem will only get worse!
      tetraloop receptor: 5 simulations @ 200 ns, > 1 TB of data.
      What if we can run 1000x longer? …or 10x bigger for 100x longer?
  36. Petascale science: the problem will only get worse!
      tetraloop receptor: 5 simulations @ 200 ns, > 1 TB of data.
      What if we can run 1000x longer? …or 10x bigger for 100x longer? > 1000 TB of data.
      …factor of 10: OK. …factor of 100: hard. …factor of 1000: ???
      …more and more time is spent moving data / managing simulations; less time spent doing science…
  37. Petascale science: the problem will only get worse! Solutions?
      - Analysis "on the fly…" [& more coarse-grained sampling] + workflow tools for ensembles (see the sketch after this list)
      - Do not move the data (?): tiered resources; persistent storage; re-running the simulations
      …what will we miss? Can we only get low-hanging fruit?
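      "Analysis on the fly" can be as simple as streaming reductions: accumulate an observable's statistics as frames are produced, so the raw per-frame values never have to be stored or moved. A generic sketch (Welford's online mean/variance; any per-frame observable could be pushed in):

          class RunningStats:
              """Online mean/variance (Welford); O(1) memory per observable."""
              def __init__(self):
                  self.n, self.mean, self.m2 = 0, 0.0, 0.0
              def push(self, x):
                  self.n += 1
                  d = x - self.mean
                  self.mean += d / self.n
                  self.m2 += d * (x - self.mean)
              @property
              def variance(self):
                  return self.m2 / (self.n - 1) if self.n > 1 else 0.0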
  38. Petascale science: worries as we move forward… Hindrances:
      - Codes have become "simpler" and will need to be restructured: intra-core vs. intra-node vs. inter-node vs. CPU type.
      - We want to retain high precision / accuracy.
      - We want to be able to enable new methods (with ease).
      - (Force fields are not yet up to the challenge!!!)
  39. What we need (data/workflow-centric) is:
      …a means to speed up & enable science…
      …a means to interact with our simulations: "steer", inspect, share, search, understand, expose (hidden correlations, meaning, data)
      …a means to manage large simulation workflows: disseminate, enable re-use.
      How do we make TBs of raw data available?
      - remote references to data
      - partial analysis, on-the-fly analysis
      - history, memory, or provenance standards (?)
      - annotation
      - automation – workflow! (see the sketch after this list)
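      A provenance record does not have to wait for standards to be useful. Here is a minimal sketch (the field names and values are illustrative, not any existing standard) of the metadata needed to find, rerun, or share a trajectory later:

          import json, time

          record = {
              "engine": "AMBER/pmemd (example)",        # code + version would go here
              "force_field": "example force field id",  # assumption, not prescriptive
              "system": "DNA duplex in explicit water",
              "length_ns": 500,
              "inputs": ["md.in", "prmtop", "inpcrd"],
              "outputs": ["traj.1.gz"],
              "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
          }
          with open("provenance.json", "w") as f:
              json.dump(record, f, indent=2)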
  40. My world reinforces Seidel's CI crises. We need:
      - educated people / teams (multidisciplinary, experts)
      - software / middleware (workflow, provenance, data handling)
      - software: code optimization / parallelization / extensions
      - ease of use
      - means to analyze data, distribute data, preserve/archive data…
      - more cycles, more disk space, …
      More science, less computational science.
  41. Hepatitis C virus IRES IRES = internal ribosome entry site (translation initiation in middle of mRNA)
  42. Why is failure important to learn about? These methods are in wide use worldwide: CADD, structure prediction, mechanisms, molecular association. Most people do not have 15M-hour allocations. Data from failure can be reused! ~500 active NIH grants with "molecular dynamics" in the abstract!
  43. home / office / CHPC
      Funding: NIH R01-GM081411-01A1, NIH R01-GM079383-01A1, NSF TG-MCA01S027