
Reliable High-Throughput Computation in Computational Chemistry and Robots

Explore the potential of high-throughput computation as an experimental resource in computational chemistry. Learn about protocol automation, error rates, variation of algorithms, and validation of results. Understand why humans must validate protocols rather than individual data, and how protocol conformance and dissemination of computations fit in. Discover tools such as Taverna, JUMBOMarker, and R for log file parsing and data analysis. Investigate molecular properties, molecular structures, and methods for evaluating accuracy. Examine the comparison of computational and experimental data, the use of InChI, and outlier detection. Gain insights into improving reproducibility and reliability in computational chemistry.


Presentation Transcript


  1. Computational Chemistry Robots. ACS, Sep 2005. J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang. jat45@cam.ac.uk

  2. Can high-throughput computation provide a reliable “experimental” resource for molecular properties? • Can protocols be automated? • Can we believe the results?

  3. Aspects of complete automation • Humans must validate protocols rather than individual data • Low rates of error must be addressed • Users should know the rates of error and degree of conformance

  4. Approaches to conformance • Explore limits of job behaviour (times, convergence, etc.) • Analyse reproducibility • Vary and analyse effects of parameters and algorithms • Compare output with other “measurements” of the same quantity

  5. The overall view • molecules • computation • dissemination

  6. The overall view • molecules • computation • dissemination • Check results

  7. Components of System • Workflow for management of jobs (Taverna) • Natural Language Processing based parsing of outputs (JUMBOMarker) • Pairwise comparison of data sets (R) • Analysis of mean and variance • Detection and analysis of outliers

  8. Computing the NCI database • MOPAC PM5 (a). (a) MOPAC PM5: collaboration with J.J.P. Stewart

  9. (Flowchart; labels as extracted) Unsuitable Data • Program Crashes • Pathological Behaviour • Inform Developer • Protocol • System Crashes • Log Files • Statistics • Science Errors • Parse • Analysis • Other Science • Disseminate Results

  10. Taverna • Workflow programs allow a series of small tasks to be linked together to develop more complex tasks • Open Source • myGRID, eScience • European Bioinformatics Institute • University of Manchester

  11. An Example Taverna Workflow

  12. Computational Chemistry Log Files • Parsing Log Files to CML: Coordinates • Calculation Type • Molecular Formula • Point Group • Total Energy • Dipole
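The parsing step named on this slide can be illustrated with a minimal sketch. It is not how JUMBOMarker works and it does not produce CML; it only shows the idea of pulling named properties out of free-text output. The log excerpt, the field names, and the regular expressions are all hypothetical, and real MOPAC or GAMESS output is laid out differently.

```python
import re

# Hypothetical log excerpt; real MOPAC or GAMESS output is laid out differently.
SAMPLE_LOG = """\
CALCULATION TYPE: GEOMETRY OPTIMISATION
MOLECULAR FORMULA: C2 H6 O
POINT GROUP: Cs
TOTAL ENERGY = -154.123456 HARTREE
DIPOLE = 1.69 DEBYE
"""

# One illustrative regular expression per property listed on the slide.
PATTERNS = {
    "calculation_type": re.compile(r"^CALCULATION TYPE:\s*(.+)$", re.M),
    "formula": re.compile(r"^MOLECULAR FORMULA:\s*(.+)$", re.M),
    "point_group": re.compile(r"^POINT GROUP:\s*(\S+)", re.M),
    "total_energy": re.compile(r"^TOTAL ENERGY\s*=\s*(-?\d+\.\d+)", re.M),
    "dipole": re.compile(r"^DIPOLE\s*=\s*(-?\d+\.\d+)", re.M),
}
NUMERIC_FIELDS = {"total_energy", "dipole"}


def parse_log(text):
    """Return the fields found in the log text; missing fields map to None."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            result[name] = None
            continue
        value = match.group(1).strip()
        result[name] = float(value) if name in NUMERIC_FIELDS else value
    return result


parsed = parse_log(SAMPLE_LOG)
print(parsed)
```

Reporting a missing field as `None` rather than raising keeps a crashed or truncated job visible to the later analysis steps instead of silently dropping it.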

  13. Parsers CompChem Output CML File CMLCore CMLCore CMLComp CMLSpect Input/jobControl General Coordinates Coordinates Energy Levels Energy Level Vibrations Vibration

  14. Dissemination of results (diagram; labels as extracted) LOG FILE • CML FILE • HUMAN DISPLAY • JUMBOMarker: NLP-based log file parser • Outside world • WWMM* Server and DSpace. * WWMM: World Wide Molecular Matrix

  15. InChI: IUPAC International Chemical Identifier • A non-proprietary unique identifier for the representation of chemical structures. • A normalised, canonicalised and serialised form of a chemical connection table. • InChI FAQ: http://wwmm.ch.cam.ac.uk/inchifaq/

  16. Proteus molecules* • JUNK Cured by MOPAC Calculation. * Proteus was a shape-changing ocean deity

  17. Proteus molecules Input JUNK Calculation

  18. How do we know our results are valid? • Computational Method 1 • Computational Method 2 • Experiment

  19. J.J.P. Stewart’s example • Calculated ΔHf – Expt ΔHf

  20. GAMESS • MOPAC results • GAMESS (a), 6-31G*, B3LYP • Log Files. (a) Project with Kim Baldridge and Wibke Sudholt

  21. (Flowchart; labels as extracted) Unsuitable Data • Program Crashes • Pathological Behaviour • Inform Developer • Protocol • System Crashes • Log Files • Statistics • Science Errors • Parse • Analysis • Other Science • Disseminate Results

  22. Repeat runs, different methods • Multiple runs give the same final structure from the same input • Changing memory allocation doesn’t make a difference
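A reproducibility check of the kind described on this slide could be sketched as below. This is an assumption-laden illustration, not the authors' procedure: it assumes both runs list atoms in the same order and in the same coordinate frame (no rotation or translation between runs), and the coordinates and the tolerance of 0.001 Å are hypothetical.

```python
import math


def max_atom_deviation(coords_a, coords_b):
    """Largest distance between corresponding atoms of two structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures have different numbers of atoms")
    return max(math.dist(a, b) for a, b in zip(coords_a, coords_b))


def same_structure(coords_a, coords_b, tol=1e-3):
    """True if every atom moved by no more than tol (same units as the input)."""
    return max_atom_deviation(coords_a, coords_b) <= tol


# Hypothetical final geometries (in Å) from three runs on the same input.
run_1 = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
run_2 = [(0.0, 0.0, 0.0), (0.9601, 0.0, 0.0), (-0.24, 0.9299, 0.0)]
run_3 = [(0.0, 0.0, 0.0), (1.10, 0.0, 0.0), (-0.24, 0.93, 0.0)]

print(same_structure(run_1, run_2))
print(same_structure(run_1, run_3))
```

If runs can end in different orientations, the structures would need to be superimposed first, or compared through internal coordinates such as bond lengths.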

  23. Pathological behaviour - Early detection (table; cells as extracted) divinyl ether • trans-Crotonaldehyde • 100 min • 6-31G*, B3LYP • 200 min • Z matrix • 15 min • 6-31G*, B3LYP • 10080 min
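One simple way to flag pathological run times is sketched below. The rule (a job is suspicious if it runs longer than a fixed multiple of the median) and the factor of 10 are assumptions for illustration, not the criterion used in the talk. The four durations are the ones quoted on the slide, but the job names are hypothetical and do not reproduce the slide's assignment of times to molecules or methods.

```python
import statistics


def flag_long_jobs(run_minutes, factor=10.0):
    """Return the names of jobs that ran longer than factor x the median time."""
    typical = statistics.median(run_minutes.values())
    return sorted(
        name for name, minutes in run_minutes.items() if minutes > factor * typical
    )


# Durations in minutes; the job labels are placeholders.
jobs = {"job_a": 100, "job_b": 200, "job_c": 15, "job_d": 10080}

print(flag_long_jobs(jobs))
```

For early detection the same test can be applied to the elapsed time of jobs that are still running, so that a runaway calculation is stopped before it consumes a week of machine time.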

  24. Times to run jobs

  25. Analysis of different computational methods • Mean - Overall difference • Normality - Distribution of values • Outliers - Unusual molecules? • Variance - Spread of the data, depends on both distributions (standard deviation)
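The quantities on this slide can be computed for a pair of data sets as in the following sketch. The talk used R for this step; Python is used here only for illustration. The bond lengths are hypothetical, and the outlier rule (more than `k` scaled median absolute deviations from the median) is a swapped-in robust heuristic, not necessarily the one the authors applied.

```python
import math
import statistics


def compare_methods(values_a, values_b, k=3.0):
    """Summarise pairwise differences (a - b) and flag outlying pairs."""
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two equally long series with at least two values")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    mean = statistics.fmean(diffs)            # overall difference
    sd = statistics.stdev(diffs)              # spread of the differences
    sem = sd / math.sqrt(len(diffs))          # standard error of the mean
    median = statistics.median(diffs)
    mad = statistics.median([abs(d - median) for d in diffs])
    scale = 1.4826 * mad                      # MAD rescaled to match a normal SD
    outliers = [
        i for i, d in enumerate(diffs)
        if scale > 0 and abs(d - median) > k * scale
    ]
    return {"mean": mean, "sd": sd, "sem": sem, "outliers": outliers}


# Hypothetical bond lengths (in Å) for the same bonds from two methods.
method_a = [1.50, 1.34, 1.40, 1.21, 1.47, 1.53, 1.39, 1.62]
method_b = [1.52, 1.37, 1.43, 1.23, 1.50, 1.55, 1.42, 1.50]

summary = compare_methods(method_a, method_b)
print(summary)
```

The median-based rule is used because a single bad molecule inflates the ordinary standard deviation and can hide itself; the flagged indices are candidates for human inspection, not automatic rejection.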

  26. Probability Plot (Normal QQ plot)

  27. Probability Plot (Normal QQ plot) • S.D. 0.020 Å • Mean of distribution (Approx - 0.03 Å) • Range over which sample distribution is approximately normal • Outliers
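The points of a normal QQ plot, and the mean and standard deviation read from it, can be obtained as in this sketch. It assumes the common plotting positions (i + 0.5)/n and fits a straight line through all points by least squares; on real data the fit would be restricted to the approximately normal central range. The sample is synthetic, generated with a mean of -0.03 and a standard deviation of 0.02 purely for illustration.

```python
import random
from statistics import NormalDist, fmean


def normal_qq_points(sample):
    """Pair each sorted sample value with its standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


def fit_line(points):
    """Least-squares line y = slope * x + intercept through the QQ points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic differences; for normal data the slope estimates the S.D.
# and the intercept estimates the mean.
random.seed(0)
sample = [random.gauss(-0.03, 0.02) for _ in range(500)]
slope, intercept = fit_line(normal_qq_points(sample))
print(f"S.D. estimate: {slope:.4f}, mean estimate: {intercept:.4f}")
```

Points that fall well off the fitted line at either end of the plot are the outliers that the slide singles out for inspection.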

  28. All bonds* Δr (MOPAC – GAMESS) / Å. * Excludes bonds to hydrogen

  29. All bonds* Δr (MOPAC – GAMESS) / Å • Good agreement • S.D. 0.005 Å • Nearly normal • Outliers. * Excludes bonds to hydrogen

  30. Bad molecules and data usually cause outliers (structure diagram; labels as extracted: H O 2- P H O Na)

  31. Mean Δr (M - G) / Å • Standard Error of the Mean / Å • All values given to 3 significant figures

  32. Δr CC bonds (M - G) / Å

  33. Δr CC bonds (M - G) / Å • Good agreement • S.D. 0.013 Å • Nearly normal • Outliers • JUNK

  34. Selection of molecules with CC Δr (M - G) > 0.05 Å

  35. Non-aromatic CC bonds adjacent to CFn • Y = 0.0277 X – 0.0061

  36. Δr NN bonds (M - G) / Å

  37. Δr NN bonds (M - G) / Å • Good agreement • S.D. 0.022 Å • Nearly normal • Kink

  38. Density plot of Δr NN bonds (M - G) / Å

  39. Density plot of Δr NN bonds (M - G) / Å • RIGHT • LEFT

  40. Most common fragments found in Left set but not Right set N(ar) S(sp2) N (ar) (sp3) C(sp2) Or C(sp3) C(sp3) N(ar) S(sp2) N (ar) C(sp2)

  41. Comparison of theory and experiment (diagram; labels as extracted) CIF* • CIF* • GAMESS • CIF 2 CML • CIF* • CIF* • CIF* • Log Files. * CIF: Crystallographic Information File

  42. Reading Acta Crystallographica Section E

  43. All bonds* Δr (Cryst. – GAMESS) / Å • Single molecules, no disorder. * Excludes bonds to hydrogen

  44. All bonds* Δr (Cryst. – GAMESS) / Å • Single molecules, no disorder • S.D. 0.014 Å • Mean Δr - 0.011 Å • Nearly normal • Outliers. * Excludes bonds to hydrogen

  45. Δr CC bonds (C – G) / Å

  46. Δr CC bonds (C – G) / Å • Mean Δr - 0.01 Å • S.D. 0.009 Å • Nearly normal

  47. Δr CO bonds (C – G) / Å

  48. Δr CO bonds (C – G) / Å • S.D. 0.011 Å • Good agreement • Nearly normal • Outliers?

  49. Chemistry can cause outliers • Δr = +0.08 Å • H movement

  50. Conclusions • Protocols can be automated • Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider • Computational programs can provide high quality “experimental” molecular properties
