Analysis of Multiple QSAR Models - A Basis for Experimental Design
Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.
QSAR Problems
• Find and report unsuspected trends between biological activity and molecular descriptors
• Linear or non-linear behavior
• Variable interactions
…and Solutions
• Look at large ensembles of solutions
• Consider variables in combinations
• Efficient variable selection via GA
• Provide multiple models
• Critical analysis
• Experimental design
Single QSAR Model
[Figure: Training Set -> MLR, PLS, PCR, Stepwise Linear, etc. -> QSAR Model; Candidate Molecules -> QSAR Model -> Prioritized Candidates]
Simplistic Approach
• Large error in biological activities
• Largely underdetermined situation
• Consider 2 models: CV-R2 0.81 vs 0.79
Multiple Models
• Consider many possible models
• All equally probable
• Use quantitative and qualitative analysis
• Statistics
• Intuition
Multiple QSAR Models
[Figure: Training Set -> GFA, G/PLS -> QSAR Models (Model 1, Model 2, Model 3, ...); Candidate Molecules -> QSAR Models -> one of two outcomes]
- “Prediction safe”: Consensus Prediction
or
- “Prediction sensitive”: Experimental Design
GA Variable Selection
[Figure: in Generation #1, Eq. 1 and Eq. 2 (Terms 1 to 4) are cut at a crossover point and exchange terms to give Eq. 1 and Eq. 2 of Generation #2]
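The crossover step in this figure can be sketched in a few lines. This is only an illustration, assuming each equation is a plain list of term names and that a single crossover point is used. The function name `crossover` and the term names are hypothetical, and a real GFA run also involves mutation and scoring, which are not shown.

```python
import random

def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two term lists at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

# Hypothetical parent equations from generation #1 (term names are placeholders).
eq1 = ["term1", "term2", "term3", "term4"]
eq2 = ["termA", "termB", "termC", "termD"]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
point = rng.randint(1, len(eq1) - 1)   # crossover point strictly inside the list
gen2_eq1, gen2_eq2 = crossover(eq1, eq2, point)
```

Every term of the parents survives in exactly one child, so the pair of generation #2 equations is a recombination of generation #1, not a new random draw.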
Analysis of Models from GFA
• Relationship between models (which are similar to or different from each other?)
• Are the best models (based on LOF) similar to each other or different?
• How to sample the ensemble of models?
• How many to select?
• Which ones?
• Can we find an average/consensus model?
Dataset
D. L. Selwood et al. “Structure-Activity Relationships of Antifilarial Antimycin Analogues: A Multivariate Pattern Recognition Study”, J. Med. Chem., 33, 136-142, 1990.
Conditions for GFA
• Run GFA with:
• Population size: 200 models
• Initial equation length 3 and d=2
• Number of iterations: 20,000
Analysis of GFA Models
• Objectives:
• Identify each model on a graphical representation
• Similar models (e.g. those giving similar predicted values) lie close to each other; significantly different models are distant
• Provide capabilities for analyzing models: do the best models correspond to a single prediction scheme (a cluster of equivalent models) or to several schemes (dispersed models)?
Analysis of GFA Models
Usual Method:
1. Select the top half (100) of the 200 models
2. Generate predicted-value columns for these 100 models
3. Generate residual-value columns (Pred. - Actual) for these 100 models
4. Perform PCA on the residual columns
5. Inspect the descriptor plot
Problem: Visualization only
Graphical Representation of Models
[Figure: PCA descriptor plot]
Analysis of GFA Models
Proposed Method:
1. Select the top half (100) of the 200 models
2. Generate predicted-value columns for these 100 models
3. Generate residual-value columns (Pred. - Actual) for these 100 models
4. Generate a 100 x 100 correlation matrix between the residual columns
5. Analyze the matrix by MDS and generate MDS coordinates for N=3
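Steps 3 and 4 of the proposed method can be sketched as follows, on invented toy data with 3 models and 5 compounds in place of 100 models. The helper names are hypothetical. The final conversion of correlations to dissimilarities as 1 - r is an assumption, since the slides do not say what form of the matrix was passed to MDS, and the MDS step itself is left to a statistics package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residual_columns(predicted, actual):
    """Step 3: one residual column (Pred. - Actual) per model."""
    return [[p - a for p, a in zip(col, actual)] for col in predicted]

def correlation_matrix(columns):
    """Step 4: model-by-model correlation matrix of the residual columns."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Toy data (invented for illustration): 3 models, 5 compounds.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [
    [1.2, 1.9, 3.3, 3.8, 5.1],   # model A
    [1.3, 1.8, 3.4, 3.7, 5.2],   # model B: errs the same way as A
    [0.8, 2.2, 2.7, 4.3, 4.9],   # model C: errs the opposite way
]
res = residual_columns(predicted, actual)
corr = correlation_matrix(res)

# Assumed conversion to a dissimilarity for MDS: near 0 for matching error
# patterns, near 2 for opposed ones.
dissim = [[1.0 - r for r in row] for row in corr]
```

In this toy case models A and B end up close together and model C far from both, which is the behavior the MDS plot is meant to expose.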
Graphical Representation of Models
[Figure: PCA descriptor plot and MDS samples plot]
Analysis of GFA Models
Proposed Method: MDS on Correlation Matrix
Graphical Analysis of Models
• Models can be colored by LOF (a) or Adj. R2 (b)
• Other possible coloring schemes:
• F-test values
• Nvars
• R
• LSE
[Figure: panels (a) and (b)]
Clustering/Selection of Models
[Figure: clustering of models; selection of 10 diverse models]
Selection of Basis Models
D. Rogers “Evolutionary Statistics: Using a Genetic Algorithm and Model Reduction to Isolate Alternate Statistical Hypotheses of Experimental Data”, Proceedings 7th International Conference on Genetic Algorithms, East Lansing, MI (1997)
Procedure:
• Perform PCA on the residual columns (the last retained component should retain at least 1/N of the original variance)
• For each retained component, select the model whose residual column correlates most strongly with that component
Selection of Basis Models
Component   % Variance   Base Model   Correlation
PC1         79.05        M36          0.972
PC2         7.12         M98          0.772
PC3         3.5          M72          0.488
Alternate Selection:
Component   % Variance   Base Model   Correlation
PC1         79.05        M23          0.970
PC2         7.12         M58          0.720
PC3         3.5          M72          0.488
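The selection rule can be illustrated with the correlations listed above. This is a sketch only: the function names are hypothetical, only the models named on the slide are included, and treating the alternate selection as the runner-up for each component is an assumption that happens to reproduce the listed values.

```python
# Correlations between each component and candidate models' residual
# columns, copied from the values above. Correlations for models not
# listed on the slide are not given, so they are left out.
correlations = {
    "PC1": {"M36": 0.972, "M23": 0.970},
    "PC2": {"M98": 0.772, "M58": 0.720},
    "PC3": {"M72": 0.488},
}

def rank_models(corr_by_model):
    """Order models by absolute correlation with a component, best first."""
    return sorted(corr_by_model, key=lambda m: abs(corr_by_model[m]), reverse=True)

def select_basis(correlations, rank=0):
    """Pick the model at position `rank` per component (0 = most correlated).

    Falls back to the last available candidate when a component has fewer
    than `rank + 1` candidates.
    """
    chosen = {}
    for pc, corr_by_model in correlations.items():
        ranked = rank_models(corr_by_model)
        chosen[pc] = ranked[min(rank, len(ranked) - 1)]
    return chosen

initial = select_basis(correlations)            # most correlated model per PC
alternate = select_basis(correlations, rank=1)  # runner-up where one exists
```

The small gap between 0.972 and 0.970 on PC1 shows why an alternate selection is worth reporting: M36 and M23 are nearly interchangeable as representatives of that component.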
Representation of Basis Models
Initial Selection: M36, M98, M72
Alternate Selection: M23, M58, M72
Identification of Consensus Model
• Objective:
• Select a “central” model with minimal contradictions vs. other possible models
Consensus Model: M86
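The slide does not say how the “central” model was computed. One plausible reading, shown here purely as an assumption, is to take the model whose residuals have the highest mean correlation with those of all other models. The function name, the model names, and the toy matrix are invented for illustration.

```python
def consensus_model(names, corr):
    """Return the model with the highest mean correlation to all others."""
    best_name, best_score = None, float("-inf")
    n = len(names)
    for i, name in enumerate(names):
        # Average the off-diagonal entries of row i.
        mean_r = sum(corr[i][j] for j in range(n) if j != i) / (n - 1)
        if mean_r > best_score:
            best_name, best_score = name, mean_r
    return best_name, best_score

# Toy residual-correlation matrix for four hypothetical models.
names = ["A", "B", "C", "D"]
corr = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.4],
    [0.6, 0.7, 1.0, 0.3],
    [0.1, 0.4, 0.3, 1.0],
]
central, score = consensus_model(names, corr)
```

Here model B is selected because it agrees reasonably well with every other model, whereas A agrees strongly with B and C but hardly at all with D.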
Use in Experimental Design
• Objective:
• First: Identify orthogonal models (basis models) corresponding to radically different prediction schemes
• Then: Identify compounds which are “prediction safe”: predicted active by the selected models
• or
• Identify compounds which are “prediction sensitive”: predicted differently across the selected models
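The two categories can be made concrete with a small sketch. The cutoffs, compound names, and predicted activities below are invented, and the exact criteria for “active” and “predicted differently” are assumptions, since the slides do not quantify them.

```python
def classify_candidates(predictions, active_cutoff, spread_cutoff):
    """Label each candidate from its predictions across the basis models.

    predictions: {compound: [predicted activity from each basis model]}
    """
    labels = {}
    for compound, preds in predictions.items():
        spread = max(preds) - min(preds)
        if all(p >= active_cutoff for p in preds):
            # Every basis model agrees the compound is active.
            labels[compound] = "prediction safe"
        elif spread >= spread_cutoff:
            # The basis models disagree strongly about this compound.
            labels[compound] = "prediction sensitive"
        else:
            labels[compound] = "neither"
    return labels

# Hypothetical predicted activities from three basis models.
predictions = {
    "cand1": [6.5, 6.8, 6.2],
    "cand2": [4.0, 6.9, 5.1],
    "cand3": [4.1, 4.3, 4.0],
}
labels = classify_candidates(predictions, active_cutoff=6.0, spread_cutoff=1.0)
```

“Prediction safe” compounds are candidates for a consensus prediction, while “prediction sensitive” ones are the informative choices for experimental design, because measuring them discriminates between the competing models.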
Application of Multiple QSAR Analysis Methodology on a set of Dopamine beta-Hydroxylase Inhibitors
Training Set and Test Set Selection
• Chemically and biologically diverse
• Divided into three classes based on biological activity
• Selected a set of diverse compounds from each group using a distance-based MaxMin diversity metric
• 37 molecules in the training set and 10 molecules in the test set (1/37 = 2.7%)
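A greedy MaxMin selection can be sketched as below. The slide does not specify the descriptor space, the distance function, or the starting compound, so Euclidean distance on toy 2-D points and a fixed start index are assumptions. In the study this kind of selection would be applied within each of the three activity classes.

```python
import math

def maxmin_select(points, k, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the current picks."""
    if k > len(points):
        raise ValueError("cannot select more points than are available")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return selected

# Toy descriptor coordinates (invented): three points crowd the origin.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.2, 0.1)]
selected = maxmin_select(points, k=3)
```

The selection skips the near-duplicates around the origin and picks the two distant points, which is the intended effect of a diversity metric.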
Selection of Basis Models
4 components are needed so that the last component explains at least 2.7% of the variance
Component   % Variance   Base Model   Correlation
PC1         70.3         M5           0.92
PC2         14.5         M12          0.80
PC3         7.3          M49          0.74
PC4         3.6          M52          0.45
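The retention rule behind this table is simple arithmetic and can be checked directly. The function name is hypothetical. The variance of any fifth component is not reported on the slide, so the sketch only confirms that the four listed components all pass the 1/37 threshold.

```python
def n_components_to_keep(variance_pct, n_samples):
    """Count leading components that each explain at least 1/N of the variance."""
    threshold = 100.0 / n_samples
    kept = 0
    for pct in variance_pct:
        if pct < threshold:
            break
        kept += 1
    return kept

variance_pct = [70.3, 14.5, 7.3, 3.6]   # PC1..PC4 from the table above
threshold = 100.0 / 37                   # about 2.7 % for 37 training molecules
kept = n_components_to_keep(variance_pct, 37)
```

With 37 training molecules the threshold is about 2.7%, and PC4 at 3.6% is the last listed component to clear it.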
Prediction of Test Set Using Basis Models
Conclusions
• We can provide new capabilities for analyzing multiple models from GFA using standard tools in Cerius2.
• Differentiate between original and redundant models
• Identify clustered or dispersed nature of “best” models
• Provide reasonable sampling of models
• Critical analysis and causality remain paramount.