Predictive ADME: How Do I Know If My Predictions Will Be Useful? Curt Breneman October 19, 2006
Exploring Chemical Data The data pyramid, bottom-up: DATA → INFORMATION → KNOWLEDGE → UNDERSTANDING → WISDOM
Prediction of Chemical Behavior • Datasets, Information and Descriptors • Modeling and Mining Methods • Validation Methods
Critical Analysis of Dataset Properties • Size of the dataset (Gedeck, 2006) • Quality of the dataset (Gottmann et al., 2001) • Data acquired under a single protocol are more reliable • Be wary of data compilations: different labs, different assays • Interpretation of outliers can identify mechanism (Cronin, 2003) • Found that small, specifically reactive molecules showed greater toxicity than QSAR predicted • Errors inherent in the dataset • Experimental error • Descriptor noise The modeling method should match the quality of the dataset
Problem Definition and Method Selection • Which approach makes sense? Too focused ↔ too broad • The solution will depend on dataset quality
QSAR: Quantitative Structure-Activity Relationships • The process by which chemical structure is quantitatively correlated with a well-defined process • Biological or chemical processes • Structure-Activity Relationship • Hypothesis: similar molecules have similar activities • What does “similarity” really mean?
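One common operational answer to the similarity question is to represent each molecule as a binary fingerprint and compare fingerprints with the Tanimoto coefficient. A minimal sketch; the fingerprints below are invented on-bit sets, not real molecules:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints,
    each given as the set of its on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

# Two hypothetical fingerprints sharing 4 of 6 distinct on-bits
mol1 = {1, 4, 7, 12, 30}
mol2 = {1, 4, 7, 12, 45}
print(tanimoto(mol1, mol2))  # 4 shared bits / 6 total = 0.666...
```

The "similar molecules have similar activities" hypothesis then becomes testable: compounds above some Tanimoto cutoff should show correlated activities.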
Encoding Structure: Descriptors AAACCTCATAGGAAGCATACCAGGAATTACATCA… Descriptor classes: Structural, Physicochemical, Topological, Geometrical Workflow: Molecular Structures → Descriptors → Model → Activity
Descriptor Choices • No particular class of descriptors addresses all problems • May be chosen to be problem-specific • May be chosen to be method-specific
Software • MOE • Sybyl • Almond / GRIND • Dragon • Pipeline Pilot (SciTegic) • Proprietary solutions: RECON, PEST, and many others…
Descriptor Types • 1-D (molecular weight, # of atoms, bonds) • 2-D (adjacency & distance…) • 3-D Properties (volume, potential energy) • Fragment Counts • Electronic Properties • TAE Properties • Others • Combined Descriptors
Dataset and Descriptor Analysis • Standard deviation of the activity > 1.0 is recommended (Gedeck, 2006) • Reduction of collinearity between descriptors • Molecule-to-descriptor ratio should be high • 5:1 ratio or higher in traditional QSAR (Topliss, 1972) • Special cases: data strip mining • Consistent scaling of descriptors between training, test, and validation sets • Single-conformation models do not fully represent dynamic systems
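The consistent-scaling point can be made concrete: scaling parameters must be learned on the training set only, then reused unchanged on test and validation data. A minimal sketch with synthetic descriptor matrices (the shapes and distributions are made up for illustration):

```python
import numpy as np

def fit_scaler(X_train):
    """Learn per-descriptor mean/std on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # invariant columns: avoid divide-by-zero
    return mu, sigma

def apply_scaler(X, mu, sigma):
    """Apply previously learned scaling parameters unchanged."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(20, 3))  # 20 molecules, 3 descriptors
X_test = rng.normal(5.0, 2.0, size=(5, 3))

mu, sigma = fit_scaler(X_train)
Z_train = apply_scaler(X_train, mu, sigma)
Z_test = apply_scaler(X_test, mu, sigma)  # same parameters, never re-fit
```

Re-fitting the scaler on the test set would silently leak test-set statistics into the model inputs, which is exactly the inconsistency the bullet warns against.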
Modern QSAR Adventures • Virtual screening using validated ALL-QSAR models (Tropsha, 2004) • Large chemical databases are very chemically diverse • ALL-QSAR models: locally weighted linear regression models • Well suited to modeling sparse or unevenly distributed data sets • Comparative QSAR hydrophobicity study on HIV-1 protease inhibitors (Garg, 2005) • Established a working optimal value of ClogP • Observed that molecules in the small set fell outside this range • Concluded that a more diverse dataset is required
Machine Learning Methods • Support Vector Machines for ADME Property Classification (Trotter, 2003) • Comparing MLR, PLS, and ANN QSPR Models (Erösa, 2004) • Best model generated was an ANN with a Q2 of 0.85 • Comparison of MLR, PLS, and GA-MLR in QSAR analysis (Saxena, 2003) • Training of 70, testing of 27, activity spanned five orders of magnitude • Combined GA-MLR provided simple, robust models
Popularity of Methods(a highly scientific analysis) • Genetic Algorithm • Single GA method • 74,700 hits (Genetic Algorithm QSAR) • Combined with other methods (MLR, PLS, ANN) • 98,600 hits (GA QSAR) • Artificial Neural Network • 94,300 hits (Artificial Neural Network QSAR) • Partial Least Squares • 56,400 hits (Partial Least Squares QSAR) • Support Vector Machines • 31,300 hits (Support Vector Machines QSAR)
Metrics for Measuring Models • For the training set: • LMSE: least mean square error • r²: correlation coefficient • R²: PRESS R² • For the validation/test set: • LMSE: least mean square error • q²: 1 - r²(test) • Q²: 1 - R²(test)
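A hedged sketch of these metrics under conventional definitions (naming conventions for r²/R²/q²/Q² vary between groups; this follows the common squared-correlation and 1 - SS_res/SS_tot forms, with invented toy numbers):

```python
import numpy as np

def lmse(y_obs, y_pred):
    """Least mean square error: mean of squared residuals."""
    return np.mean((y_obs - y_pred) ** 2)

def r_squared(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

def press_r2(y_obs, y_pred, y_ref_mean):
    """PRESS-style R²: 1 - SS_res / SS_tot around a reference mean
    (the training-set mean, in the usual external-validation form)."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_ref_mean) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy observed/predicted activities (invented numbers)
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(lmse(y_obs, y_pred))  # mean of the four squared residuals
```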
Model Building and Validation [Workflow diagram] DATASET → training set + test set (bootstrap sample k) • Training → Learning → Predictive Model → Validation → Model Tuning / Prediction • Y-scrambling for method validation! • Models will not reveal mechanism
Validation Strategies • Y-scrambling • Randomization of the modeled property • External validation • Split ratio (training and test data sets) • Bootstraps • Leave-group-out • Leave-one-out
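Y-scrambling can be sketched in a few lines: refit the model on randomly permuted activities and check that the fit to the real activities clearly beats every scrambled fit. A minimal illustration on a synthetic linear dataset, with ordinary least squares standing in for a full QSAR method:

```python
import numpy as np

def fit_r2(X, y):
    """Least-squares linear fit with intercept; return training r²."""
    Xb = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    y_hat = Xb @ coef
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))                     # 40 molecules, 3 descriptors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

true_r2 = fit_r2(X, y)
# Y-scrambling: refit against randomly permuted activities
scrambled = [fit_r2(X, rng.permutation(y)) for _ in range(50)]
print(true_r2, max(scrambled))  # real fit should win by a wide margin
```

If the scrambled fits approach the real one, the apparent model is fitting chance correlations rather than structure-activity signal.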
Case Studies • Caco-2 Model Feature Selection • hERG Channel Inhibition ROC Analysis
Case 1: Caco-2 Data “or… Why feature selection is crucial” • Human intestinal cell line • Predicts drug absorption • 27 molecules with tested permeability • 718 descriptors generated • Electronic TAE • Shape/Property (PEST) • Traditional [Scatter plot: predicted vs. observed values]
Feature Importance Starplot: Caco-2 – 31 Variables [Star plot of descriptor importances: DRNB10, DRNB00, PIPB04, PEOE.VSA.FHYD, PEOE.VSA.FNEG, KB11, PEOE.VSA.4, BNPB31, PEOE.VSA.FPOL, PEOE.VSA.PPOS, PEOE.VSA.FPPOS, SlogP.VSA6, PIPMAX, EP2, FUKB14, ANGLEB45, apol, BNPB50, SlogP.VSA9, PIPB53, BNP8, ABSFUKMIN, BNPB21, ABSKMIN, ABSDRN6, SlogP.VSA0, a.don, KB54, SMR.VSA2, pmiZ, SIKIA]
Feature Importance Starplot: Caco-2 – 15 Variables [Star plot of descriptor importances: a.don, DRNB10, PEOE.VSA.FNEG, BNPB31, KB54, ABSKMIN, FUKB14, ABSDRN6, SMR.VSA2, SIKIA, SlogP.VSA0, PEOE.VSA.FPPOS, DRNB00, ANGLEB45, pmiZ]
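The slides do not state which importance measure produced these starplots, but one minimal feature-selection filter is univariate correlation ranking: score each descriptor by its absolute correlation with the activity and keep the top k. A sketch on synthetic data (columns 2 and 7 are built to carry the signal):

```python
import numpy as np

def top_k_features(X, y, k):
    """Univariate filter: rank descriptors by |Pearson correlation|
    with the activity and keep the k strongest."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))                # 60 molecules, 10 descriptors
y = 2.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=60)

keep = top_k_features(X, y, 2)
print(sorted(keep))  # the informative columns 2 and 7 should survive
```

Univariate filters are cheap but blind to descriptor interactions; wrapper or embedded methods (GA selection, model-based importances) address that at higher cost.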
Bagged SVM Results Caco-2 – 718 Variables vs. Caco-2 – 15 Variables [Two scatter plots: predicted vs. observed values]
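Bagging itself is method-agnostic: fit one model per bootstrap resample of the training set and average the predictions. The sketch below substitutes a closed-form ridge regressor for the SVM base learner so it stays self-contained (data, sizes, and the λ value are all illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression (stand-in for an SVM base learner)."""
    Xb = np.column_stack([X, np.ones(len(X))])  # add intercept column
    p = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(p), Xb.T @ y)

def ridge_predict(X, coef):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bagged_predict(X_train, y_train, X_new, n_models=25, seed=0):
    """Bagging: fit each model on a bootstrap resample, average predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
        coef = ridge_fit(X_train[idx], y_train[idx])
        preds.append(ridge_predict(X_new, coef))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=40)
y_hat = bagged_predict(X, y, X)
```

Averaging over resamples mainly reduces variance, which is why bagging pairs well with small, noisy datasets like the 27-molecule Caco-2 set.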
hERG: ROC Curve Comparisons (leave-one-out results) [ROC curves: Before Feature Selection vs. After Feature Selection]
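The area under an ROC curve can be computed directly as a rank statistic, without drawing the curve: it is the probability that a randomly chosen active outscores a randomly chosen inactive. A minimal sketch with made-up labels and scores:

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic: probability that
    a random positive outranks a random negative; ties count half."""
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented classifier outputs: 3 actives, 4 inactives
labels = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])
print(roc_auc(labels, scores))  # 11 of 12 pairs concordant = 11/12
```

AUC = 0.5 is chance; the "after feature selection" curve winning means its score ranking separates hERG blockers from non-blockers more often.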
Characteristics of High-Quality Predictive Models • All descriptors used in the model are significant • No descriptor accounts for a single peculiarity • No leverage or outlier compounds in the training set (Schneider, 2006) • Cross-validation performance should show: • Significantly better performance than randomized (Y-scrambled) tests • Homogeneity between the training set and external test set
Pitfalls In QSAR: Addressed? • Data Sets • Problems: compilation of data, outliers, sample size • Solutions: well-standardized assays, clear and unambiguous endpoints • Descriptors • Problems: collinearity, interpretability, error in data, too many variables • Solutions: domain knowledge, combined descriptors, feature selection • Statistical Methods • Problems: overfitting of data, non-linearity, interpretability • Solutions: simple models with validation “Development of QSARs is more an art than a science” - Mark T.D. Cronin and T. Wayne Schultz
The Eight Commandments of Successful QSPR/QSAR Modeling 1. There should be a PLAUSIBLE (not necessarily known or well understood) mechanism or connection between the descriptors and response. Otherwise we could be doing numerology… 2. Robustness: you cannot keep tweaking parameters until you find one that works just right for a particular problem or dataset and then apply it to another. A generalizable model should be applicable across a broad range of parameter space. 3. Know the domain of applicability of the model and stay within it. What is sauce for the goose is sauce for the gander, but not necessarily for the alligator. 4. Likewise, know the error bars of your data.
The Eight Commandments of Successful QSPR/QSAR Modeling (continued) 5. No cheating… no looking at the answer. This is the minimum requirement for developing a predictive model or hypothesis. 6. Not all datasets contain a useful QSAR/QSPR “signal”. Don’t look too hard for something that isn’t there… 7. Consider the use of “filters” to scale the data and then remove correlated, invariant, and “noise” descriptors, and to remove outliers from consideration. 8. Use your head and try to understand the chemistry of the problem that you are working on: modeling is meant to assist human intelligence, not to replace it…
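The descriptor "filters" of commandment seven might look like this minimal pre-modeling pass on synthetic data (the 0.95 cutoff and the test matrix are purely illustrative):

```python
import numpy as np

def filter_descriptors(X, corr_cutoff=0.95):
    """Drop invariant columns, then greedily drop one of each
    highly correlated pair; return surviving column indices."""
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    X = X[:, keep]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < corr_cutoff for i in selected):
            selected.append(j)
    return [keep[j] for j in selected]

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
# Append a scaled copy of column 0 and a constant column
X = np.column_stack([X, X[:, 0] * 1.001, np.full(25, 7.0)])
print(filter_descriptors(X))  # only the four independent columns remain
```

Filters like this are deliberately dumb: they remove descriptors no model could use (invariant) or that duplicate information (collinear), before any fitting, so they cannot overfit.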
ACKNOWLEDGMENTS • Current and Former members of the DDASSL group • Breneman Research Group (RPI Chemistry) • N. Sukumar • M. Sundling • Min Li • Long Han • Jed Zaretski • Theresa Hepburn • Mike Krein • Steve Mulick • Shiina Akasaka • Hongmei Zhang • C. Whitehead (Pfizer Global Research) • L. Shen (BNPI) • L. Lockwood (Syracuse Research Corporation) • M. Song (Synta Pharmaceuticals) • D. Zhuang (Simulations Plus) • W. Katt (Yale University chemistry graduate program) • Q. Luo (J & J) • Embrechts Research Group (RPI DSES) • Tropsha Research Group (UNC Chapel Hill) • Bennett Research Group (RPI Mathematics) • Collaborators: • Tropsha Group (UNC Chapel Hill - CECCR) • Cramer Research Group (RPI Chemical Engineering) • Funding • NIH (GM047372-07) • NIH (1P20HG003899-01) • NSF (BES-0214183, BES-0079436, IIS-9979860) • GE Corporate R&D Center • Millennium Pharmaceuticals • Concurrent Pharmaceuticals • Pfizer Pharmaceuticals • ICAGEN Pharmaceuticals • Eastman Kodak Company • Chemical Computing Group (CCG)
References • Trotter, Matthew W. B. and Holden, Sean B. Support Vector Machines for ADME Property Classification. QSAR (2003) 533-548. • Saxena, A. K. and Prathipati, P. Comparison of MLR, PLS, and GA-MLR in QSAR analysis. Medicinal Chemistry Division, Central Drug Research Institute (CDRI). 9/1/2003. • Cronin, Mark T.D. and Schultz, T. Wayne. Pitfalls in QSAR. Journal of Molecular Structure (Theochem) 622 (2003) 39-51. • Guha, Rajarshi and Jurs, Peter C. Determining the Validity of a QSAR Model: A Classification Approach. J. Chem. Inf. Model. 45 (2005) 65-73. • Dimitrov, Sabcho; Dimitrova, Gergana; Pavlov, Todor; Dimitrova, Nadezhda; Patlewicz, Grace; Niemela, Jay and Mekenyan, Ovanes. A Stepwise Approach for Defining the Applicability Domain of SAR and QSAR Models. J. Chem. Inf. Model. 45 (2005) 839-849. • Guha, Rajarshi and Jurs, Peter C. Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance. J. Chem. Inf. Model. 45 (2005) 800-806. • Kawakami, R. et al. A method for calibration and validation subset partitioning. Talanta (2005). • Garg, Rajni and Bhhatarai, Barun. From SAR to comparative QSAR: role of hydrophobicity in the design of 4-hydroxy-5,6-dihydropyran-2-ones HIV-1 protease inhibitors. Bioorganic & Medicinal Chemistry 13 (2005) 4078-4084. • Zhang, Shuxing; Golbraikh, Alexander; Oloff, Scott; Kohn, Harold and Tropsha, Alexander. A Novel Automated Lazy Learning QSAR (ALL-QSAR) Approach: Method Development, Applications, and Virtual Screening of Chemical Databases Using Validated ALL-QSAR Models. J. Chem. Inf. Model. (2006). • Gedeck, Peter; Rohde, Bernhard and Bartels, Christian. QSAR: How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. J. Chem. Inf. Model. 46 (2006) 1924-1936. • Schneider, Gisbert. Development of QSAR Models. Eurekah Bioscience Database. 2006.
Model Parsimony Rules • Simple models are better • Interpretable models are better • Reality: need to balance predictive ability and interpretability