Modelling in Chemistry: High and Low-Throughput Regimes

Modelling in Chemistry: High and Low-Throughput Regimes. Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K. We look at data, analyse data, use data to find correlations ... ... to develop models ...

Presentation Transcript


  1. Modelling in Chemistry: High and Low-Throughput Regimes Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.

  2. We look at data, analyse data, use data to find correlations ... ... to develop models ... ... and to make (hopefully) useful predictions. Let’s look at some data ...

  3. New York Times, 4th October 2005.

  4. Happiness ≈ (GNP/$5000) − 1. Poor fit to a linear model.

  5. Outliers? [Plot axes: Happiness; (GNP/$5000) − 2]

  6. Fitting with a curve: reduce RMSE

  7. Outliers? Different linear models for different regimes

  8. Only one obvious (to me) conclusion: this area is empty. No country is both rich and unhappy. All other combinations are observed. [Plot axes: Happiness; (GNP/$5000) − 2]

  9. ... but what is the connection with chemistry?

  10. Modelling in Chemistry. Diagram axes: PHYSICS-BASED to EMPIRICAL; NON-ATOMISTIC to ATOMISTIC. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning, CoMFA.

  11. LOW THROUGHPUT to HIGH THROUGHPUT. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning, CoMFA.

  12. THEORETICAL CHEMISTRY to INFORMATICS: NO FIRM BOUNDARIES! Methods shown: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning, CoMFA.

  13. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning, CoMFA.

  14. Theoretical Chemistry • Calculations and simulations based on real physics. • Calculations are either quantum mechanical or use parameters derived from quantum mechanics. • Attempt to model or simulate reality. • Usually Low Throughput.

  15. Informatics and Empirical Models • In general, Informatics methods represent phenomena mathematically, but not in a physics-based way. • Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. • Do not attempt to simulate reality. • Usually High Throughput.

  16. QSPR • Quantitative Structure-Property Relationship • Physical property related to more than one other variable • Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s). • Property-property relationships date from the 1860s • General form (for non-linear relationships): y = f(descriptors)

  17. QSPR Y = f(X1, X2, ... , XN) • Optimisation of Y = f(X1, X2, ... , XN) is called regression. • Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
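A minimal sketch of the train-then-test workflow described on this slide, assuming a single descriptor and an ordinary least-squares line. The data values are hypothetical, and r² and RMSE are used here only as common quality measures, not necessarily the ones the talk had in mind.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for one descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: one descriptor value and one property value per molecule.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
test_x = [7, 8]
test_y = [15.2, 16.8]

# Optimise the model on the training molecules only ...
a, b = fit_line(train_x, train_y)

# ... then judge it on molecules it has never seen.
test_pred = [a * x + b for x in test_x]
test_rmse = rmse(test_y, test_pred)
test_r2 = r_squared(test_y, test_pred)
print(f"slope={a:.3f} intercept={b:.3f} test RMSE={test_rmse:.3f} test r2={test_r2:.3f}")
```

Keeping the test molecules out of the fit is the point of the split: the test statistics then estimate predictive ability rather than goodness of fit.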

  18. QSPR • Quality of the model is judged by three parameters:

  19. QSPR • Different methods for carrying out regression: • LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. • NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

  20. QSPR • However, this does not guarantee a good predictive model….

  21. QSPR • Problems with experimental error. • A QSPR is only as accurate as the data it is trained upon. • Therefore, we need accurate experimental data.

  22. QSPR • Problems with “chemical space”. • “Sample” molecules must be representative of “Population”. • Prediction results will be most accurate for molecules similar to the training set. • Global or Local models?

  23. Relationship of Chemical Structure With Lattice Energy. Can we predict lattice energy from 2D molecular structure? Dr Carole Ouvrard & Dr John Mitchell, Unilever Centre for Molecular Informatics, University of Cambridge. C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

  24. Why Do We Need a Predictive Model? Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ... ... but only in very low throughput.

  25. Why Do We Need a Predictive Model? • A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials • From 2-D molecular structure only • Without knowing the crystal packing • Without expensive theoretical calculations • Should help predict solubility.

  26. Why Do We Think it Will Work? • Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule. • Many molecules have a plurality of different experimentally observable polymorphs. • We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

  27. [Plot: calculated lattice energy (kJ/mol, axis from -92.0 to -98.0) against density (g/cc, axis from 1.40 to 1.60) for possible crystal packings in space groups P1-, P212121, P21/c and P21. The experimental crystal structure and the calculated lowest energy structure are marked.]

  28. Expression for the Lattice Energy • U crystal = U molecule + U lattice • Theoretical lattice energy • Crystal binding = Cohesive energy • Experimental lattice energy is related to −ΔH sublimation: ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
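As a worked illustration of the relation ΔH sublimation = −U lattice − 2RT, the sketch below converts between the two quantities. The temperature of 298.15 K and the example enthalpy of 90 kJ/mol are assumptions chosen for illustration; the slide gives neither.

```python
# Gas constant in kJ mol^-1 K^-1 (CODATA value, converted from J to kJ).
R_KJ = 8.314462618e-3

def lattice_energy_from_sublimation(dh_sub, temperature=298.15):
    """Rearrange dH_sub = -U_lattice - 2RT to give U_lattice in kJ/mol."""
    return -dh_sub - 2.0 * R_KJ * temperature

def sublimation_from_lattice_energy(u_lattice, temperature=298.15):
    """Apply dH_sub = -U_lattice - 2RT directly (kJ/mol)."""
    return -u_lattice - 2.0 * R_KJ * temperature

two_rt = 2.0 * R_KJ * 298.15  # about 4.96 kJ/mol at room temperature

# Hypothetical example: a sublimation enthalpy of 90 kJ/mol.
u_lattice = lattice_energy_from_sublimation(90.0)
print(f"2RT = {two_rt:.3f} kJ/mol, U_lattice = {u_lattice:.3f} kJ/mol")
```

At room temperature the 2RT term is roughly 5 kJ/mol, so the lattice energy and the negative of the sublimation enthalpy differ by only a few percent for typical organic crystals.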

  29. Partitioning of the Lattice Energy • U crystal = U molecule + U lattice • ΔH sublimation = −U lattice − 2RT • Partitioning the lattice energy in terms of structural contributions • Choice of the significant parameters • Number of atoms of each type? • Number of rings, aromatics? • Number of bonds of each type? • Symmetry? • Hydrogen bond donors and acceptors? Intramolecular? • We choose counts of atom type occurrences.

  30. Analysis of the Sublimation Energy Data • Experimental data: ΔHsublimation, from NIST (National Institute of Standards and Technology, USA) and the scientific literature • Atom types: SATIS codes (10-digit connectivity code + bond types); each 2-digit code = atomic number • HN: 01 07 99 99 99 • HO: 01 08 99 99 99 • O=C: 08 06 99 99 99 • -O-: 08 06 06 99 99 • Typically, several similar SATIS codes are grouped to define an atom type. • Statistical analysis: Multi-Linear Regression Analysis of ΔHsub against # atoms of each type.
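The SATIS examples on this slide can be decoded with a few lines of code. This is only a sketch under two assumptions that the slide implies but does not state: the first 2-digit field is the atom itself and the remaining fields are its bonded neighbours, and 99 marks an unused neighbour slot. Bond types, which the slide says are recorded separately, are ignored here.

```python
# Atomic numbers for the elements named in the talk.
ELEMENTS = {1: "H", 6: "C", 7: "N", 8: "O", 9: "F",
            16: "S", 17: "Cl", 35: "Br", 53: "I"}
EMPTY_SLOT = 99  # assumption: 99 means "no neighbour in this position"

def decode_satis(code):
    """Split a SATIS connectivity code into (central atom, neighbour list).

    Assumes five space-separated 2-digit fields, central atom first.
    """
    fields = [int(field) for field in code.split()]
    if len(fields) != 5:
        raise ValueError(f"expected 5 two-digit fields, got {len(fields)}")
    central = ELEMENTS[fields[0]]
    neighbours = [ELEMENTS[z] for z in fields[1:] if z != EMPTY_SLOT]
    return central, neighbours

# The four example codes from the slide.
examples = {
    "HN": "01 07 99 99 99",
    "HO": "01 08 99 99 99",
    "O=C": "08 06 99 99 99",
    "-O-": "08 06 06 99 99",
}
decoded = {label: decode_satis(code) for label, code in examples.items()}
for label, (atom, neighbours) in decoded.items():
    print(f"{label}: {atom} bonded to {neighbours}")
```

Read this way, HN is a hydrogen attached to nitrogen and -O- is an oxygen attached to two carbons, which matches the labels on the slide.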

  31. Training Dataset of Model Molecules • 226 organic compounds (cumulative totals in brackets): 19 linear alkanes (19), 14 branched alkanes (33), 17 aromatics (50), 106 other non-H-bonders (156), 70 H-bond formers (226) • Non-specific interacting: hydrocarbons; nitrogen compounds; nitro-, CN, halogens, S, Se substituents; pyridine • Potential hydrogen bonding interactions: amides, carboxylic acids, amino acids…

  32. Study of Non-specific Interactions: Linear Alkanes • 19 compounds: CH4 to C20H42 • Limit for van der Waals interactions • ΔHsub = 7.955 C − 2.714 • r² = 0.977 • s = 7.096 kJ/mol. Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems. Note the odd-even variation in ΔHsub for this series. [Plot labels: BPt; ΔHsub]
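The linear alkane fit can be evaluated directly. The sketch below assumes that C in the equation is the number of carbon atoms in the chain, which the slide does not spell out; the chain lengths in the loop are arbitrary illustrations.

```python
def dh_sub_linear_alkane(n_carbons):
    """Slide's fit for linear alkanes: dH_sub = 7.955*C - 2.714 (kJ/mol).

    Assumes C is the carbon count; fitted on chains from 1 to 20 carbons.
    """
    if not 1 <= n_carbons <= 20:
        raise ValueError("fit covers only the range CH4 to C20H42")
    return 7.955 * n_carbons - 2.714

decane_estimate = dh_sub_linear_alkane(10)

for n in (5, 10, 20):
    print(f"C{n}: estimated dH_sub = {dh_sub_linear_alkane(n):.1f} kJ/mol")
```

Each additional carbon adds about 8 kJ/mol, and the quoted standard error of 7.096 kJ/mol is of similar size, so the fit resolves chain length to roughly one carbon.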

  33. Include Branched Alkanes. Add 14 branched alkanes to the dataset. The graph below highlights the reduction of sublimation enthalpy due to bulky substituents. • 33 compounds: CH4 to C20H42 • ΔHsub = 7.724 Cnonbranched + 3.703 • r² = 0.959 • s = 8.117 kJ/mol • If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

  34. All Hydrocarbons: Include Aromatics. Add 17 aromatics to the dataset (note: we have no alkenes or alkynes). • 50 compounds • ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162 aliphatic • r² = 0.958 • s = 7.478 kJ/mol • As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

  35. All Non-Hydrogen-Bonded Molecules. Add 106 non-hydrocarbons to the dataset. Include elements H, C, N, O, F, S, Cl, Br & I. • 156 compounds • ΔHsub predicted by 16 parameter model • r² = 0.896 • s = 9.976 kJ/mol. Parameters in model are counts of atom type occurrences.

  36. General Predictive Model. Add 70 hydrogen bond forming molecules to the dataset. • 226 compounds • ΔHsub predicted by 19 parameter model • r² = 0.925 • s = 9.579 kJ/mol. Parameters in model are counts of atom type occurrences.

  37. Predictive Model Determined by MLRA. ΔHsublimation (kJ mol⁻¹) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 − 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether aliphatic. All these parameters are significantly larger than their standard errors.
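The 19-parameter equation is a constant plus 18 atom-type counts, so it can be applied with a simple lookup. In this sketch the coefficients are copied from the slide, while the dictionary keys (such as `C_aromatic`) are illustrative spellings of the slide's descriptor names. How atoms in a real molecule are assigned to atom types follows the SATIS grouping in the original paper and is not reproduced here; the example counts are therefore hypothetical.

```python
# Intercept and per-occurrence coefficients (kJ/mol), copied from the slide.
INTERCEPT = 6.942
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172,
    "F": 3.127, "Cl": 10.456, "Br": 12.926, "I": 19.763,
    "C3": 3.297, "C4": -3.305,
    "C_aromatic": 5.970, "C_nonbranched": 7.631,
    "CO": 7.341, "CS": 19.676,
    "N_nitrile": 11.415, "N_nonnitrile": 8.953, "NO": 8.466,
    "O_ether": 18.249, "SO": 20.585, "S_thioether": 12.840,
}

def predict_dh_sub(counts):
    """Estimate dH_sub (kJ/mol) from a mapping of atom type -> occurrence count."""
    unknown = set(counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * n for name, n in counts.items())

# Hypothetical input: a molecule described by six aromatic-carbon counts only.
six_aromatic_carbons = {"C_aromatic": 6}
estimate = predict_dh_sub(six_aromatic_carbons)
print(f"estimated dH_sub = {estimate:.3f} kJ/mol")
```

The hydrogen-bonding terms HN and HO carry the largest coefficients, which is the kind of chemical insight a count-based model offers. C4 is the only negative term, consistent with the earlier observation that bulky substituents reduce the sublimation enthalpy.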

  38. Distribution of Residuals. The residuals between calculated and experimental data are approximately normally distributed, as expected.

  39. Validation on an Independent Test Set • 35 diverse compounds • r² = 0.928 • s = 7.420 kJ/mol. Very encouraging result: accurate prediction possible. Nitro-compounds are often outliers.

  40. Major Conclusion • Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing!

  41. Conclusions • We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of about 9 kJ/mol. • A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies. • Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing. • Avoids need for expensive calculations. • May help predict solubility. • Model gives good chemical insight.

  42. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.

  43. Drug Disc. Today, 10 (4), 289 (2005). Cohesive interactions in the lattice reduce solubility. Predicting lattice (or almost equivalently sublimation) energy should help predict solubility.

  44. Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints. Ed Cannon, Andreas Bender, David Palmer & John Mitchell, J. Chem. Inf. and Model., 46, 2369-2380 (2006). www-mitchell.ch.cam.ac.uk/ jbom1@cam.ac.uk

  45. Classifying the WADA Prohibited List • Aims & Background. • Methods. • Data. • Results. • Conclusions.
