Explore the potential of high-throughput computation as an experimental resource in computational chemistry. Learn about protocol automation, error rates, variation between algorithms, and validation of results. Understand why humans must validate protocols rather than individual data, how conformance is assessed, and how computed results are disseminated. Discover tools such as Taverna for workflow management, JUMBOMarker for parsing log files, and R for data analysis. Examine comparisons between computational methods and between computed and experimental data, the use of InChI, and the detection of outliers. Gain insights into improving reproducibility and reliability in computational chemistry.
Computational Chemistry Robots. ACS, Sep 2005. J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang. jat45@cam.ac.uk
Can high-throughput computation provide a reliable “experimental” resource for molecular properties? • Can protocols be automated? • Can we believe the results?
Aspects of complete automation • Humans must validate protocols rather than individual data • Low rates of error must be addressed • Users should know the rates of error and degree of conformance
Approaches to conformance • Explore limits of job behaviour (times, convergence, etc.) • Analyse reproducibility • Vary and analyse effects of parameters and algorithms • Compare output with other “measurements” of same quantity
The overall view • molecules • computation • dissemination • Check results
Components of System • Workflow for management of jobs (Taverna) • Natural Language Processing based parsing of outputs (JUMBOMarker) • Pairwise comparison of data sets (R) • Analysis of mean and variance • Detection and analysis of outliers
Computing the NCI database • MOPAC PM5* * MOPAC PM5: collaboration with J.J.P. Stewart
[Workflow diagram. Labels: Protocol, Unsuitable Data, Program Crashes, Pathological Behaviour, System Crashes, Inform Developer, Log Files, Parse, Analysis, Statistics, Science Errors, Other Science, Disseminate Results]
Taverna • Workflow programs allow a series of small tasks to be linked together to develop more complex tasks • Open Source • myGRID, eScience • European Bioinformatics Institute • University of Manchester
Parsing computational chemistry log files to CML • Coordinates • Calculation Type • Molecular Formula • Point Group • Total Energy • Dipole
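The extraction step named on this slide can be sketched in a few lines. This is a minimal illustration using regular expressions, which is a simpler stand-in and not the NLP-based approach of JUMBOMarker. The log layout, `SAMPLE_LOG`, `PATTERNS`, and `parse_log` are all hypothetical and do not reproduce the output format of MOPAC, GAMESS, or any other program; coordinates are left out for brevity.

```python
import re

# Hypothetical log text: this layout is invented for illustration and is
# NOT the real output format of MOPAC, GAMESS, or any other program.
SAMPLE_LOG = """\
CALCULATION TYPE: geometry optimisation
MOLECULAR FORMULA: C2 H6 O
POINT GROUP: Cs
TOTAL ENERGY: -154.123456 hartree
DIPOLE: 1.69 debye
"""

# One pattern per property listed on the slide (coordinates omitted).
PATTERNS = {
    "calculation_type": re.compile(r"^CALCULATION TYPE:\s*(.+)$", re.M),
    "molecular_formula": re.compile(r"^MOLECULAR FORMULA:\s*(.+)$", re.M),
    "point_group": re.compile(r"^POINT GROUP:\s*(\S+)", re.M),
    "total_energy": re.compile(r"^TOTAL ENERGY:\s*(-?\d+\.\d+)", re.M),
    "dipole": re.compile(r"^DIPOLE:\s*(-?\d+\.\d+)", re.M),
}
NUMERIC = {"total_energy", "dipole"}


def parse_log(text):
    """Return a dict of extracted properties; fields not found map to None."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            result[name] = None
        else:
            value = match.group(1).strip()
            result[name] = float(value) if name in NUMERIC else value
    return result


record = parse_log(SAMPLE_LOG)
```

Returning `None` for a missing field, rather than raising, lets a later stage count how many log files failed to yield each property, which is one way to report a rate of error.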
Parsers • CompChem Output to CML File • CML components: CMLCore, CMLComp, CMLSpect • Parsed sections: Input/jobControl, General, Coordinates, Energy Levels, Vibrations
Dissemination of results • LOG FILE • JUMBOMarker: NLP-based log file parser • CML FILE • HUMAN DISPLAY • Outside world: WWMM* Server and DSpace * World Wide Molecular Matrix
InChI: IUPAC International Chemical Identifier • A non-proprietary unique identifier for the representation of chemical structures. • A normal, canonicalised and serialised form of a chemical connection table. • InChI FAQ: http://wwmm.ch.cam.ac.uk/inchifaq/
Proteus molecules* • JUNK • Cured by MOPAC Calculation * Proteus was a shape-changing ocean deity
Proteus molecules Input JUNK Calculation
How do we know our results are valid? • Computational Method 1 • Computational Method 2 • Experiment
J.J.P. Stewart’s example: Calculated ΔHf – Expt ΔHf
GAMESS • MOPAC results • GAMESS*, 6-31G* B3LYP • Log Files * GAMESS: project with Kim Baldridge and Wibke Sudholt
Repeat runs, different methods • Multiple runs give same final structure from same input • Changing memory allocation doesn’t make a difference
Pathological behaviour: early detection [Figure labels: divinyl ether; trans-Crotonaldehyde; 100 min; 6-31G*, B3LYP, 200 min; Z matrix, 15 min; 6-31G*, B3LYP, 10080 min]
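One simple way to detect pathological jobs is to compare each run time with the typical run time of the batch. The sketch below is an assumption about how such a check might look, not the method used in this work. The function `flag_slow_jobs`, the `factor` threshold, and the job labels are hypothetical; the four run times are the values quoted on the slide, used without assigning them to particular molecules.

```python
from statistics import median


def flag_slow_jobs(elapsed_minutes, factor=10.0):
    """Return the names of jobs that ran longer than factor x the median time.

    elapsed_minutes maps a job name to its run time in minutes.
    The factor of 10 is an arbitrary illustrative threshold.
    """
    typical = median(elapsed_minutes.values())
    return sorted(
        name for name, minutes in elapsed_minutes.items()
        if minutes > factor * typical
    )


# Hypothetical job labels paired with the run times quoted on the slide.
run_times = {"job_a": 100, "job_b": 200, "job_c": 15, "job_d": 10080}
slow = flag_slow_jobs(run_times)
```

The median is used rather than the mean so that a single week-long job does not raise the threshold that is meant to catch it.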
Analysis of different computational methods • Mean: overall difference • Normality: distribution of values • Outliers: unusual molecules? • Variance: spread of the data (standard deviation); depends on both distributions
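The quantities on this slide can be computed from paired values of the same property from two methods. The following is a minimal sketch in Python (the talk itself names R for this step). The function `compare_methods`, the robust outlier rule based on the median absolute deviation, and the cutoff of 3.5 are assumptions chosen for illustration, and the bond lengths in the usage example are made up.

```python
import math
import statistics


def compare_methods(values_a, values_b, cutoff=3.5):
    """Summarise paired differences (a - b) between two methods."""
    if len(values_a) != len(values_b):
        raise ValueError("both methods must cover the same items")
    diffs = [a - b for a, b in zip(values_a, values_b)]

    mean = statistics.mean(diffs)            # overall difference
    sd = statistics.stdev(diffs)             # sample standard deviation
    sem = sd / math.sqrt(len(diffs))         # standard error of the mean

    # Robust outlier rule (an assumption, not taken from the slides):
    # scaled distance from the median in units of the median absolute deviation.
    med = statistics.median(diffs)
    mad = statistics.median([abs(d - med) for d in diffs])
    if mad == 0:
        outliers = [i for i, d in enumerate(diffs) if d != med]
    else:
        outliers = [
            i for i, d in enumerate(diffs)
            if 0.6745 * abs(d - med) / mad > cutoff
        ]
    return {"mean": mean, "sd": sd, "sem": sem, "outliers": outliers}


# Made-up bond lengths in angstroms, for illustration only.
method_a = [1.50, 1.51, 1.49, 1.50, 1.51, 1.49, 1.50, 1.51, 1.49, 1.60]
method_b = [1.50] * 10
summary = compare_methods(method_a, method_b)
```

A median-based rule is used for outliers because a single extreme value inflates the standard deviation and can hide itself from a plain "mean plus or minus k standard deviations" test.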
Probability Plot (Normal QQ plot) [Plot annotations: S.D. 0.020 Å; mean of distribution (approx. -0.03 Å); range over which sample distribution is approximately normal; outliers]
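A normal QQ plot pairs each sorted sample value with the quantile a normal distribution would give at the same rank; a normal sample falls on a straight line whose intercept is the mean and whose slope is the standard deviation. The sketch below computes those points and the straightness of the line. The names `qq_points` and `qq_correlation` and the plotting positions `(i + 0.5) / n` are assumptions for illustration, and the sample is synthetic.

```python
import math
from statistics import NormalDist, mean


def qq_points(sample):
    """Return (theoretical quantile, observed value) pairs for a normal QQ plot."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def qq_correlation(points):
    """Pearson correlation of the QQ points; values near 1 suggest normality."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic, exactly normal sample with mean -0.03 and S.D. 0.020 (illustration only).
clean = [NormalDist(-0.03, 0.020).inv_cdf((i + 0.5) / 50) for i in range(50)]
r_clean = qq_correlation(qq_points(clean))

# The same sample with one gross outlier appended.
r_with_outlier = qq_correlation(qq_points(clean + [0.5]))
```

Outliers show up as points that leave the line at either end of the plot, which lowers the correlation relative to the clean sample.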
All bonds*: Δr (MOPAC – GAMESS) / Å • Good agreement • S.D. 0.005 Å • Nearly normal • Outliers * Excludes bonds to hydrogen
Bad molecules and data usually cause outliers [Structure drawing: atoms H, O, P, Na; charge 2-]
[Table columns: Mean Δr (M - G) / Å; Standard Error of the Mean / Å] All values given to 3 significant figures
Δr CC bonds (M - G) / Å • Good agreement • S.D. 0.013 Å • Nearly normal • Outliers • JUNK
Selection of molecules with C–C Δr (M - G) > 0.05 Å
Non-aromatic C–C bonds adjacent to CFn • Fitted line: Y = 0.0277 X – 0.0061
Δr NN bonds (M - G) / Å • Good agreement • S.D. 0.022 Å • Nearly normal • Kink
Density plot of Δr NN bonds (M - G) / Å [Plot regions: LEFT, RIGHT]
Most common fragments found in Left set but not Right set N(ar) S(sp2) N (ar) (sp3) C(sp2) Or C(sp3) C(sp3) N(ar) S(sp2) N (ar) C(sp2)
Comparison of theory and experiment [Diagram: CIF* files, CIF2CML, GAMESS, Log Files] * CIF: Crystallographic Information File
All bonds*: Δr (Cryst. – GAMESS) / Å • Single molecules, no disorder • S.D. 0.014 Å • Mean Δr -0.011 Å • Nearly normal • Outliers * Excludes bonds to hydrogen
Δr CC bonds (C – G) / Å • Mean Δr -0.01 Å • S.D. 0.009 Å • Nearly normal
Δr CO bonds (C – G) / Å • S.D. 0.011 Å • Good agreement • Nearly normal • Outliers?
Chemistry can cause outliers • Δr = +0.08 Å • H movement
Conclusions • Protocols can be automated • Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider • Computational programs can provide high quality “experimental” molecular properties