
Analysis of Multiple QSAR Models - A Basis for Experimental Design

Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.

Presentation Transcript


  1. Analysis of Multiple QSAR Models: A Basis for Experimental Design. Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.

  2. QSAR Problems
     • Find and report unsuspected trends between biological activity and molecular descriptors
     • Linear or non-linear behavior
     • Variable interactions
     …and Solutions
     • Look at large ensembles of solutions
     • Consider variables in combinations
     • Efficient variable selection via GA
     • Provide multiple models
     • Critical analysis
     • Experimental design

  3. Single QSAR Model [Diagram: Training Set → MLR, PLS, PCR, Stepwise Linear, etc. → QSAR Model; Candidate Molecules → QSAR Model → Prioritized Candidates]

  4. Simplistic Approach
     • Large error in biological activities
     • Largely underdetermined situation
     • Consider 2 models: CV-R2 0.81 vs 0.79
     Multiple Models
     • Consider many possible models
     • All equally probable
     • Use quantitative and qualitative analysis
     • Statistics
     • Intuition

  5. Multiple QSAR Models [Diagram: Training Set → GFA, G/PLS → QSAR Models (Model 1, Model 2, Model 3, ...). The models are applied to Candidate Molecules in one of two ways: “Prediction safe” (Consensus Prediction) or “Prediction sensitive” (Experimental Design).]

  6. GA Variable Selection [Diagram: in Generation #1, Eq. 1 and Eq. 2 (each built from Terms 1 to 4) are cut at a crossover point; exchanging terms across that point gives Eq. 1 and Eq. 2 of Generation #2.]
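
The crossover step pictured on this slide can be sketched as follows. This is a minimal illustration, assuming each equation is simply a list of descriptor terms and that a single crossover point is used; the descriptor names are hypothetical placeholders, and the sketch does not attempt to reproduce the actual GFA operators (mutation, spline terms, duplicate handling).

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equations (lists of terms)."""
    # Cut strictly inside the shorter parent so each child mixes both parents.
    cut = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

rng = random.Random(0)
# Hypothetical descriptor names, used only for illustration.
eq1 = ["LogP", "MW", "HOMO", "Dipole"]
eq2 = ["PSA", "Rotors", "LUMO", "Charge"]

child_1, child_2 = crossover(eq1, eq2, rng)
print(child_1)
print(child_2)
```

Repeating this over a population, and keeping the better-scoring children, is what lets the GA explore descriptors in combination rather than one at a time.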

  7. GA Variable Usage

  8. GA Model Selection

  9. Analysis of Models from GFA
     • Relationship between models (which are similar to or different from each other?)
     • Are the best models (based on LOF) similar to each other or different?
     • How to sample the ensemble of models?
     • How many to select?
     • Which ones?
     • Can we find an average/consensus model?

  10. Dataset
      D. L. Selwood et al., “Structure-Activity Relationships of Antifilarial Antimycin Analogues: A Multivariate Pattern Recognition Study”, J. Med. Chem., 33, 136-142, 1990.
      Conditions for GFA
      • Run GFA with:
      • Population size: 200 models
      • Initial equation length 3 and d=2
      • Number of iterations: 20,000

  11. Analysis of GFA Models
      Objectives:
      • Identify each model on a graphical representation
      • Similar models (e.g. those giving similar predicted values) are close to each other; significantly different models are distant
      • Provide capabilities for analyzing models: do the best models correspond to a single prediction scheme (a cluster of equivalent models) or to several schemes (dispersed models)?

  12. Analysis of GFA Models
      Usual Method:
      1. Select the top half (100) of the 200 models
      2. Generate predicted value columns for these 100 models
      3. Generate residual value columns (Pred. - Actual) for these 100 models
      4. Perform PCA on the residual columns
      5. Inspect the descriptor plot
      Problem: Visualization only

  13. Graphical Representation of Models [Figure: PCA descriptor plot]

  14. Analysis of GFA Models
      Proposed Method:
      1. Select the top half (100) of the 200 models
      2. Generate predicted value columns for these 100 models
      3. Generate residual value columns (Pred. - Actual) for these 100 models
      4. Generate a 100x100 correlation matrix between the residual columns
      5. Analyze the matrix by MDS and generate MDS coordinates for N=3
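
Steps 3 to 5 of the proposed method can be sketched with NumPy as below. This is an illustration under stated assumptions, not the Cerius2 procedure: the predicted and actual values are random placeholders, the correlation matrix is turned into distances with d = sqrt(2(1 - r)) (an assumed mapping, since the slides do not say how the matrix is fed to MDS), and classical (Torgerson) MDS stands in for whichever MDS variant was actually used.

```python
import numpy as np

def classical_mds(dist, n_components=3):
    """Classical (Torgerson) MDS of a symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Placeholder data: 30 compounds, 100 models (sizes chosen for illustration).
rng = np.random.default_rng(0)
n_compounds, n_models = 30, 100
actual = rng.normal(size=n_compounds)
predicted = actual[:, None] + rng.normal(scale=0.5, size=(n_compounds, n_models))

residuals = predicted - actual[:, None]                 # step 3: Pred. - Actual
corr = np.corrcoef(residuals, rowvar=False)             # step 4: 100 x 100 matrix
dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))  # assumed distance mapping
np.fill_diagonal(dist, 0.0)
coords = classical_mds(dist, n_components=3)            # step 5: N = 3 coordinates
print(coords.shape)
```

Each row of `coords` is one model, so models whose residuals are highly correlated end up close together in the 3D plot.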

  15. Graphical Representation of Models [Figures: PCA descriptor plot; MDS samples plot]

  16. Analysis of GFA Models Proposed Method: MDS on Correlation Matrix

  17. Graphical Analysis of Models
      • Models can be colored by LOF (panel a) or Adj. R2 (panel b)
      • Other possible coloring schemes:
      • F-test values
      • Nvars
      • R
      • LSE

  18. Clustering/Selection of Models [Figures: clustering of models; selection of 10 diverse models]

  19. Selection of Basis Models
      D. Rogers, “Evolutionary Statistics: Using a Genetic Algorithm and Model Reduction to Isolate Alternate Statistical Hypotheses of Experimental Data”, Proceedings of the 7th International Conference on Genetic Algorithms, East Lansing, MI (1997)
      Procedure:
      • Perform PCA on the residual columns (the last retained component should retain at least 1/N of the original variance)
      • For each retained component, select the model whose residual correlates most strongly with that component
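
The two-step procedure can be sketched with NumPy as follows. This is a rough illustration rather than the published algorithm: it assumes N is the number of training compounds (the reading suggested by the 1/37 = 2.7% figure used for the DBH set), it compares absolute correlations because the sign of a principal component is arbitrary, and the residual matrix is synthetic placeholder data.

```python
import numpy as np

def select_basis_models(residuals):
    """Pick one base model per retained principal component.

    residuals: array (n_compounds, n_models), one residual column per model.
    Returns (component_number, pct_variance, model_index, abs_correlation) tuples.
    """
    n_compounds, n_models = residuals.shape
    centered = residuals - residuals.mean(axis=0)
    # PCA via SVD; u[:, k] * s[k] holds the scores of component k.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    # Assumed reading of the 1/N rule: N = number of training compounds.
    n_keep = int(np.sum(explained >= 1.0 / n_compounds))
    basis = []
    for k in range(n_keep):
        scores = u[:, k] * s[k]
        corrs = [abs(np.corrcoef(scores, residuals[:, j])[0, 1])
                 for j in range(n_models)]
        best = int(np.argmax(corrs))
        basis.append((k + 1, 100.0 * explained[k], best, corrs[best]))
    return basis

# Synthetic residuals: 20 compounds, 8 models built from two error patterns.
rng = np.random.default_rng(1)
pattern_a, pattern_b = rng.normal(size=(2, 20))
residuals = np.column_stack(
    [pattern_a + 0.1 * rng.normal(size=20) for _ in range(5)]
    + [0.5 * pattern_b + 0.1 * rng.normal(size=20) for _ in range(3)]
)
basis = select_basis_models(residuals)
for row in basis:
    print("PC%d: %.1f%% variance, base model index %d, |r| = %.3f" % row)
```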

  20. Selection of Basis Models
      Component   % Variance   Base Model   Correlation
      PC1         79.05        M36          0.972
      PC2         7.12         M98          0.772
      PC3         3.5          M72          0.488
      Alternate Selection:
      Component   % Variance   Base Model   Correlation
      PC1         79.05        M23          0.970
      PC2         7.12         M58          0.720
      PC3         3.5          M72          0.488

  21. Representation of Basis Models
      Initial Selection: M36, M98, M72
      Alternate Selection: M23, M58, M72

  22. Identification of Consensus Model
      Objective:
      • Select a “central” model with minimal contradictions vs other possible models
      Consensus Model: M86
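
One way to make “central” concrete is sketched below. The slides do not state the criterion used to arrive at M86, so this is only an assumption: the consensus model is taken to be the one whose residuals have the highest mean correlation with the residuals of all other models. Model names and residual values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def consensus_model(residuals_by_model):
    """Return the model with the highest mean correlation to all others."""
    names = list(residuals_by_model)

    def mean_corr(name):
        others = [m for m in names if m != name]
        return sum(pearson(residuals_by_model[name], residuals_by_model[m])
                   for m in others) / len(others)

    return max(names, key=mean_corr)

# Hypothetical residual columns: model_b lies "between" model_a and model_c.
residuals_by_model = {
    "model_a": [1.0, 0.0, -1.0, 0.0],
    "model_b": [1.0, 1.0, -1.0, -1.0],
    "model_c": [0.0, 1.0, 0.0, -1.0],
}
central = consensus_model(residuals_by_model)
print(central)
```

An alternative with the same intent would be to pick the model closest to the centroid of the MDS coordinates.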

  23. Use in Experimental Design
      Objective:
      • First: identify orthogonal models (basis models) corresponding to radically different prediction schemes
      • Then: identify compounds which are “prediction safe” (predicted active by the selected models), or identify compounds which are “prediction sensitive” (predicted differently across the selected models)
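
The split into “prediction safe” and “prediction sensitive” candidates can be sketched as below. The two cutoffs are assumptions introduced for illustration (the slides give no numeric criteria), and the compound names and predicted activities are made up.

```python
def classify_candidates(predictions, active_cutoff, spread_cutoff):
    """Split candidates by how the basis models agree on them.

    predictions: {compound: [predicted activity from each basis model]}
    safe: every basis model predicts activity >= active_cutoff.
    sensitive: not safe, and predictions differ by >= spread_cutoff.
    """
    safe, sensitive = [], []
    for compound, values in predictions.items():
        if min(values) >= active_cutoff:
            safe.append(compound)
        elif max(values) - min(values) >= spread_cutoff:
            sensitive.append(compound)
    return safe, sensitive

# Hypothetical predictions from three basis models.
predictions = {
    "cmpd_1": [7.2, 7.0, 7.5],   # all models agree: active
    "cmpd_2": [7.8, 5.1, 6.0],   # models disagree strongly
    "cmpd_3": [5.0, 5.2, 4.9],   # all models agree: inactive
}
safe, sensitive = classify_candidates(predictions, active_cutoff=6.5, spread_cutoff=1.5)
print("prediction safe:", safe)
print("prediction sensitive:", sensitive)
```

Safe compounds are low-risk picks, while sensitive compounds are the ones whose measurement would best discriminate between the competing models.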

  24. Application of Multiple QSAR Analysis Methodology on a set of Dopamine beta-Hydroxylase Inhibitors

  25. Training Set and Test Set Selection
      • Chemically and biologically diverse
      • Divided into three classes based on biological activity
      • Selected a set of diverse compounds from each class using a distance-based MaxMin diversity metric
      • 37 molecules in the training set and 10 molecules in the test set (1/37 = 2.7%)
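
A greedy MaxMin selection of the kind mentioned here can be sketched as follows. This is a generic illustration, assuming Euclidean distance in some descriptor space and an arbitrary starting compound; the coordinates are placeholders, and in the workflow described on the slide the selection would be run separately within each activity class.

```python
from math import dist

def maxmin_select(points, n_select, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the selected set."""
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    nearest = [dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(d, dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return selected

# Placeholder 2D descriptor coordinates for five compounds.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (2.5, 2.5)]
picked = maxmin_select(points, n_select=3)
print(picked)
```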

  26. Selection of Basis Models
      4 components are needed so that the last component explains at least 2.7% of the variance
      Component   % Variance   Base Model   Correlation
      PC1         70.3         M5           0.92
      PC2         14.5         M12          0.80
      PC3         7.3          M49          0.74
      PC4         3.6          M52          0.45

  27. Prediction of Test Set Using Basis Models

  28. Experimental Design in DBH Set

  29. Conclusions
      • We can provide new capabilities for analyzing multiple models from GFA using standard tools in Cerius2.
      • Differentiate between original and redundant models
      • Identify clustered or dispersed nature of “best” models
      • Provide reasonable sampling of models
      • Critical analysis and causality remain paramount.
