Informed by Informatics? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.
Modelling in Chemistry
[Diagram: modelling methods arranged along two axes – PHYSICS-BASED ↔ EMPIRICAL and ATOMISTIC ↔ NON-ATOMISTIC – and also labelled LOW THROUGHPUT ↔ HIGH THROUGHPUT and THEORETICAL CHEMISTRY ↔ INFORMATICS, with NO FIRM BOUNDARIES. Methods shown: ab initio, Density Functional Theory, Car-Parrinello, AM1, PM3 etc., Fluid Dynamics, Molecular Dynamics, Monte Carlo, DPD, Docking, 2-D QSAR/QSPR, Machine Learning, CoMFA.]
Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • The relationship between inputs and output is captured by an empirically parameterised equation or a more elaborate mathematical model. • They do not attempt to simulate reality. • Usually high throughput.
QSPR • Quantitative Structure-Property Relationship • A physical property is related to more than one other variable. • Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s). • Property-property relationships date from the 1860s. • General form (which may be non-linear): y = f(descriptors)
QSPR Y = f(X1, X2, ..., XN) • Fitting the function Y = f(X1, X2, ..., XN) is called regression. • The model is optimised on a set of "training" molecules and then tested on a separate set of "test" molecules.
QSPR • The quality of the model is judged by three statistics (reported for the solubility models below): the squared correlation coefficient (r2), the root mean squared error (RMSE) and the bias.
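As an illustration (not from the talk), these three statistics can be computed as follows; the function and array names are placeholders.

```python
# Minimal sketch (illustrative only) of the three model-quality statistics:
# root mean squared error, squared correlation coefficient and bias.
import numpy as np

def qspr_stats(y_obs, y_pred):
    """Return (RMSE, r^2, bias) for observed vs. predicted property values."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    residuals = y_pred - y_obs
    rmse = np.sqrt(np.mean(residuals ** 2))        # root mean squared error
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2     # squared Pearson correlation
    bias = np.mean(residuals)                      # mean signed error
    return rmse, r2, bias
```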
QSPR • Different methods for carrying out regression: • LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. • NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
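A hedged sketch of how one might try a linear and a non-linear regression method on the same descriptor matrix; the talk names only the method families, and the scikit-learn estimators, data shapes and names below are illustrative stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression   # linear
from sklearn.ensemble import RandomForestRegressor      # non-linear
from sklearn.svm import SVR                             # non-linear

X = np.random.rand(200, 52)   # descriptor matrix (placeholder sizes)
y = np.random.rand(200)       # property values (placeholder)

models = {
    "PLS": PLSRegression(n_components=10),
    "Random Forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "SVM": SVR(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "fitted; first prediction:", float(model.predict(X[:1]).ravel()[0]))
```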
QSPR • However, a good fit to the training data does not guarantee a good predictive model….
QSPR • Problems with experimental error. • A QSPR model is only as accurate as the data it is trained upon. • Therefore, we need accurate experimental data.
QSPR • Problems with "chemical space". • The "sample" of molecules must be representative of the "population". • Predictions will be most accurate for molecules similar to the training set. • Global or local models?
1. Solubility • Dave Palmer • Pfizer Institute for Pharmaceutical Materials Science • http://www.msm.cam.ac.uk/pfizer D.S. Palmer, et al., J. Chem. Inf. Model., 47, 150-158 (2007) D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)
Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.
Datasets
● Aqueous solubility: the thermodynamic solubility in unbuffered water at 25 °C.
• Phase 1 – Literature Data
● Compiled from the Huuskonen dataset and the AquaSol database – pharmaceutically relevant molecules
● All molecules solid at room temperature
● n = 1000 molecules
• Phase 2 – Our Own Experimental Data
● Measured by Toni Llinàs using the CheqSol machine
● Pharmaceutically relevant molecules
● n = 135 molecules
Diversity-Conserving Partitioning
● MACCS Structural Key fingerprints
● Tanimoto coefficient
● MaxMin algorithm
Full dataset: n = 1000 molecules → Training set: n = 670 molecules; Test set: n = 330 molecules
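A hedged sketch of this kind of diversity-conserving split, using RDKit as a stand-in toolkit (MACCS keys, Tanimoto distances, MaxMin picker); the SMILES and set sizes are placeholders, and this is not necessarily the exact procedure used in the work described.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "CCCCCCO"]  # placeholders
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]          # MACCS structural keys

def dist(i, j, fps=fps):
    """Distance = 1 - Tanimoto similarity of the MACCS fingerprints."""
    return 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

picker = MaxMinPicker()
test_idx = list(picker.LazyPick(dist, len(fps), 2))      # diverse subset, e.g. a test set
train_idx = [i for i in range(len(fps)) if i not in set(test_idx)]
print("test:", test_idx, "train:", train_idx)
```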
Structures & Descriptors
● 3D structures from Concord
● Minimised with MMFF94
● MOE 2D/3D descriptors
● Separate analysis of 2D and 3D descriptors
● QuaSAR Contingency module (MOE)
● 52 descriptors selected
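Concord and MOE are commercial tools; purely as an illustrative stand-in, the same kind of workflow (3D structure generation, MMFF94 minimisation, descriptor calculation) can be sketched with RDKit. The molecule and descriptor choice are placeholders.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, placeholder
AllChem.EmbedMolecule(mol, randomSeed=42)   # generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)           # minimise with the MMFF94 force field

descriptors = {                             # a few simple 2D descriptors
    "MolWt": Descriptors.MolWt(mol),
    "LogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
}
print(descriptors)
```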
Random Forest Machine Learning Method
Random Forest ● Introduced by Breiman and Cutler (2001) ● A development of Decision Trees (Recursive Partitioning): ● The dataset is partitioned into consecutively smaller subsets ● Each partition is based upon the value of one descriptor ● The descriptor used at each split is selected so as to optimise the splitting ● A bootstrap sample of N objects is chosen, with replacement, from the N available objects
Random Forest ● Random Forest is a collection of decision or regression trees grown with the CART algorithm. ● Standard parameters: ● 500 decision trees ● No pruning back; minimum node size = 5 ● "mtry" descriptors (the square root of the total number of descriptors) tried at each split Important features: ● Incorporates descriptor selection ● Incorporates "out-of-bag" validation – using those molecules not in the bootstrap samples
Random Forest for Solubility Prediction A Forest of Regression Trees • The dataset is partitioned into consecutively smaller subsets (of similar solubility) • Each partition is based upon the value of one descriptor • The descriptor used at each split is selected so as to minimise the MSE Leo Breiman, "Random Forests", Machine Learning 45, 5-32 (2001).
Random Forest for Predicting Solubility • A Forest of Regression Trees • Each tree is grown until its terminal nodes contain a specified number of molecules • No need to prune back • High predictive accuracy • Includes a method for descriptor selection • No training problems – largely immune to overfitting.
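A minimal sketch, assuming scikit-learn as a stand-in for the original Random Forest implementation and using the parameters quoted above (500 trees, minimum node size 5, mtry = square root of the number of descriptors, out-of-bag validation); the data arrays are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(670, 52)   # descriptor matrix (placeholder)
y = np.random.rand(670)       # e.g. measured log S values (placeholder)

rf = RandomForestRegressor(
    n_estimators=500,        # 500 regression trees
    max_features="sqrt",     # "mtry" = sqrt of the number of descriptors
    min_samples_leaf=5,      # minimum node size of 5; no pruning back
    oob_score=True,          # out-of-bag validation
    random_state=0,
)
rf.fit(X, y)
print("OOB r2:", rf.oob_score_)
print("Most important descriptors:", np.argsort(rf.feature_importances_)[::-1][:5])
```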
Random Forest: Solubility Results
Training set: RMSE(tr) = 0.27, r2(tr) = 0.98, bias(tr) = 0.005
Out-of-bag: RMSE(oob) = 0.68, r2(oob) = 0.90, bias(oob) = 0.01
Test set: RMSE(te) = 0.69, r2(te) = 0.89, bias(te) = -0.04
D.S. Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?
ΔGsub from lattice energy & an entropy term
ΔGhydr from a semi-empirical solvation model
(i.e., different kinds of theoretical/computational methods)
ΔGsub from lattice energy (DMAREL) plus entropy
ΔGsolv from SCRF using the B3LYP DFT functional
ΔGtr from ClogP
(i.e., different kinds of theoretical/computational methods)
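A hedged sketch of the underlying cycle (solid → gas → solution): the free energy of solution is decomposed into sublimation and solvation/transfer contributions, and the solubility follows from it; standard-state corrections are omitted here, and the exact formulation used in the work may differ.

```latex
% Thermodynamic-cycle sketch (standard states omitted); the second variant
% replaces \Delta G_{\mathrm{hydr}} by \Delta G_{\mathrm{solv}} + \Delta G_{\mathrm{tr}}.
\begin{align}
  \Delta G_{\mathrm{sol}} &= \Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{hydr}}
    \;\approx\; \Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{solv}} + \Delta G_{\mathrm{tr}} \\
  \log_{10} S &\propto -\,\frac{\Delta G_{\mathrm{sol}}}{RT \ln 10}
\end{align}
```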
● A nice idea, but it didn't quite work – the errors were larger than for QSPR ● "Why not add a correction factor to account for the difference between the theoretical methods?" …
… ● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors
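A minimal sketch of that hybrid idea: the computed thermodynamic-cycle energies are simply appended to the conventional descriptor matrix before fitting the regression model. Array names and sizes are illustrative only, not the actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_qspr = np.random.rand(100, 52)          # conventional QSPR descriptors (placeholder)
dG_terms = np.random.rand(100, 3)         # e.g. sublimation / solvation / transfer energies (placeholder)
y = np.random.rand(100)                   # measured log S (placeholder)

X_hybrid = np.hstack([X_qspr, dG_terms])  # theoretical energies become extra descriptors
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_hybrid, y)
```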
Solubility by TD Cycle: Conclusions ● We have a hybrid part-theoretical, part-empirical method. ● An interesting idea, but relatively low throughput, since a crystal structure is needed. ● Slightly more accurate than pure QSPR for a druglike set. ● Instructive to compare with the literature on theoretical solubility studies.
2. Bioactivity Florian Nigsch F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)
Feature Space – Chemical Space
Each molecule is represented as a vector m = (f1, f2, …, fn) in a feature space of high dimensionality.
[Diagram: feature-space axes f1, f2, f3 with protein targets such as COX2, CDK1, CDK2 and DHFR occupying different regions.]
Properties of Drugs
• High affinity to protein target
• Soluble
• Permeable
• Absorbable
• High bioavailability
• Specific rate of metabolism
• Renal/hepatic clearance?
• Volume of distribution?
• Low toxicity
• Plasma protein binding?
• Blood-brain barrier penetration?
• Dosage (once/twice daily?)
• Synthetic accessibility
• Formulation (important in development)
Multiobjective Optimisation
Candidates must be optimised against many objectives at once: bioactivity, solubility, permeability, metabolism, toxicity and synthetic accessibility.
There is a huge number of candidates – most of which are useless; only a tiny fraction qualify as drugs.
Spam • Unsolicited (commercial) email • Approx. 90% of all email traffic is spam • Where are the legitimate messages? • Filtering
Analogy to Drug Discovery • Huge number of possible candidates • Virtual screening helps in the selection process
Winnow Algorithm • Invented in the late 1980s by Nick Littlestone to learn Boolean functions • Named after the verb "to winnow" • Suited to high-dimensional input data • Used in Natural Language Processing (NLP), text classification, bioinformatics • Different varieties (regularised, Sparse Network of Winnows - SNoW, …) • Error-driven, linear-threshold, online algorithm
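For illustration, a minimal sketch of the classic Winnow update rule (error-driven, online, linear-threshold, multiplicative weight updates over Boolean features); this is the textbook formulation, not necessarily the exact variant used in the work described here.

```python
def winnow_train(examples, n_features, alpha=2.0):
    """examples: iterable of (active_feature_indices, label in {0, 1})."""
    w = [1.0] * n_features           # all weights start at 1
    theta = float(n_features)        # linear threshold
    for active, label in examples:
        pred = 1 if sum(w[i] for i in active) >= theta else 0
        if pred == 0 and label == 1:         # false negative: promote active weights
            for i in active:
                w[i] *= alpha
        elif pred == 1 and label == 0:       # false positive: demote active weights
            for i in active:
                w[i] /= alpha
        # correct predictions leave the weights untouched (error-driven)
    return w, theta

# Toy usage: feature 0 alone determines the label
data = [({0, 2}, 1), ({1, 2}, 0), ({0, 3}, 1), ({1, 3}, 0)] * 5
weights, theta = winnow_train(data, n_features=4)
print(weights, theta)
```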
Winnow (“Molecular Spam Filter”) Machine Learning Methods
Features of Molecules Based on circular fingerprints
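As an illustrative stand-in for such circular fingerprints, RDKit's Morgan fingerprints assign each atom an identifier describing its circular neighbourhood; the molecule below is a placeholder.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")        # aspirin, placeholder
info = {}
fp = AllChem.GetMorganFingerprint(mol, 2, bitInfo=info)  # radius-2 circular environments
features = sorted(info.keys())    # integer identifiers of the circular features
print(len(features), "distinct circular features")
```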
Combinations of Features Combinations of molecular features to account for synergies.
Orthogonal Sparse Bigrams • Technique used in text classification/spam filtering • Sliding window process • Sparse - not all combinations • Orthogonal - non-redundancy
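A minimal sketch of the sliding-window idea behind orthogonal sparse bigrams: within each window the leading item is paired with each later item, recording the gap, so only a sparse, non-redundant set of combinations is generated. The token sequence and window size are placeholders.

```python
def osb(tokens, window=4):
    """Orthogonal sparse bigrams: (first item, gap, later item) within a sliding window."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            gap = j - i - 1                  # number of skipped positions
            pairs.append((tokens[i], gap, tokens[j]))
    return pairs

print(osb(["A", "B", "C", "D", "E"], window=3))
# [('A', 0, 'B'), ('A', 1, 'C'), ('B', 0, 'C'), ('B', 1, 'D'), ('C', 0, 'D'), ('C', 1, 'E'), ('D', 0, 'E')]
```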
Protein Target Prediction • Which proteins does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology