Analysis of Multiple QSAR Models - A Basis for Experimental Design

Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.

QSAR Problems
• Find and report unsuspected trends between biological activity and molecular descriptors
• Linear or non-linear behavior
• Variable interactions
…and Solutions
• Look at large ensembles of solutions
• Consider variables in combinations
• Efficient variable selection via GA
• Provide multiple models
• Critical analysis
• Experimental design

Single QSAR Model
[Diagram: Training Set → MLR, PLS, PCR, Stepwise Linear…etc → QSAR Model; Candidate Molecules → Prioritized Candidates]

Simplistic Approach
• Large error in biological activities
• Largely underdetermined situation
• Consider 2 models: CV-R² 0.81 vs 0.79
Multiple Models
• Consider many possible models
• All equally probable
• Use quantitative and qualitative analysis
• Statistics
• Intuition

Multiple QSAR Models
[Diagram: Training Set → GFA, G/PLS → QSAR Models (Model 1, Model 2, Model 3, ...); Candidate Molecules → “Prediction safe” Consensus Prediction, or “Prediction sensitive” Experimental Design]

GA Variable Selection
[Figure: in Generation #1, Eq. 1 and Eq. 2 (each built from Terms 1 to 4) are cut at a crossover point; Generation #2 contains the resulting Eq. 1 and Eq. 2]
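
The crossover step pictured on this slide can be sketched in a few lines. This is only an illustration, assuming each equation is a plain list of terms and that a single crossover point is used; the descriptor names are hypothetical and are not taken from the slides.

```python
def crossover(eq1, eq2, point):
    """Single-point crossover: swap the tails of two term lists at `point`."""
    child1 = eq1[:point] + eq2[point:]
    child2 = eq2[:point] + eq1[point:]
    return child1, child2

# Hypothetical descriptor names, for illustration only.
parent1 = ["LogP", "Dipole", "MolWt", "HBD"]
parent2 = ["PSA", "Charge", "Volume", "HBA"]

child1, child2 = crossover(parent1, parent2, point=2)
print(child1)  # ['LogP', 'Dipole', 'Volume', 'HBA']
print(child2)  # ['PSA', 'Charge', 'MolWt', 'HBD']
```

Each child keeps the first two terms of one parent and the last two of the other, so term combinations that work well together can be passed on to the next generation.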

GA Variable Usage

GA Model Selection

Analysis of Models from GFA
• Relationship between models (which are similar to, or different from, each other?)
• Are the best models (based on LOF) similar to each other or different?
• How to sample the ensemble of models?
• How many to select?
• Which ones?
• Can we find an average/consensus model?

Dataset
D. L. Selwood et al., “Structure-Activity Relationships of Antifilarial Antimycin Analogues: A Multivariate Pattern Recognition Study”, J. Med. Chem., 33, 136-142, 1990.
Conditions for GFA
• Run GFA with:
• Population size: 200 models
• Initial equation length 3 and d = 2
• Number of iterations: 20,000
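
For orientation, the sketch below shows how a setting such as d = 2 would enter a lack-of-fit (LOF) score of the Friedman type that is commonly used to rank GFA models. It assumes that d is the smoothing parameter of that score and that the formula takes the form LSE / (1 - (c + d·p)/M)²; the slides do not state the formula, so both the form and the argument names are assumptions.

```python
def lack_of_fit(lse, n_basis, n_features, n_samples, d=2.0):
    """Friedman-style LOF (assumed form): LSE / (1 - (c + d*p) / M)**2.

    lse        -- least-squares error of the model
    n_basis    -- c, number of basis functions (terms) in the equation
    n_features -- p, number of descriptors used across all terms
    n_samples  -- M, number of training compounds
    d          -- smoothing parameter; larger values penalize longer equations
    """
    shrink = 1.0 - (n_basis + d * n_features) / n_samples
    if shrink <= 0.0:
        raise ValueError("model is too complex for the number of samples")
    return lse / shrink**2

# Illustrative numbers only: a 3-term equation fitted to 30 compounds.
lof_small = lack_of_fit(lse=1.0, n_basis=3, n_features=3, n_samples=30, d=2.0)
lof_large = lack_of_fit(lse=1.0, n_basis=6, n_features=6, n_samples=30, d=2.0)
print(lof_small, lof_large)
```

With the same least-squares error, the six-term equation receives a worse (larger) score than the three-term one, which is how a penalty of this kind discourages overfitting.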

Analysis of GFA Models
• Objectives:
• Identify each model on a graphical representation
• Similar models (e.g. those giving similar predicted values) are close to each other. Significantly different models are distant.
• Provide capabilities for analyzing models: do the best models correspond to a single prediction scheme (a cluster of equivalent models) or to several schemes (dispersed models)?

Analysis of GFA Models
Usual Method:
1. Select the top half (100) of the 200 models
2. Generate predicted-value columns for these 100 models
3. Generate residual-value columns (Pred. - Actual) for these 100 models
4. Perform PCA on the residual columns
5. Inspect the descriptor plot
Problem: Visualization only
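
Steps 3 to 5 of the usual method can be sketched with NumPy. The residual table below is synthetic, and the sketch assumes that the descriptor plot corresponds to a PCA loadings plot in which each residual column (one per model) becomes a point.

```python
import numpy as np

# Synthetic stand-in for the residual table: rows are training compounds,
# columns are models (the slides use 100 models; 6 are used here for brevity).
rng = np.random.default_rng(0)
n_compounds, n_models = 20, 6
residuals = rng.normal(size=(n_compounds, n_models))

# PCA via SVD of the column-centered residual matrix.
centered = residuals - residuals.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component

# Loadings: one row per model; the first two columns give plot coordinates.
loadings = vt.T * s / np.sqrt(n_compounds - 1)
print(loadings[:, :2])
print(explained.round(3))
```

Models with similar residual patterns end up with similar loadings, so they appear close together in the plot.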

Graphical Representation of Models PCA Descriptor Plot

Analysis of GFA Models
Proposed Method:
1. Select the top half (100) of the 200 models
2. Generate predicted-value columns for these 100 models
3. Generate residual-value columns (Pred. - Actual) for these 100 models
4. Generate a 100×100 correlation matrix between the residual columns
5. Analyze the matrix by MDS and generate MDS coordinates in N = 3 dimensions
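
Steps 4 and 5 of the proposed method can be sketched as follows. The slides do not say how the correlation matrix is turned into distances or which MDS variant is used; this sketch assumes the distance sqrt(2(1 - r)) and classical (Torgerson) MDS, and runs on synthetic residuals.

```python
import numpy as np

def classical_mds(dist, n_dims=3):
    """Classical (Torgerson) MDS: embed a distance matrix in n_dims dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering     # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]            # largest first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))   # guard tiny negatives
    return eigvecs[:, order] * scale

# Synthetic residual table: 20 compounds (rows) by 6 models (columns).
rng = np.random.default_rng(0)
residuals = rng.normal(size=(20, 6))

# Step 4: model-by-model correlation matrix of the residual columns.
corr = np.corrcoef(residuals, rowvar=False)

# Step 5: convert correlations to distances (assumed form) and embed in 3-D.
dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
coords = classical_mds(dist, n_dims=3)
print(coords.round(3))
```

Each row of `coords` is one model, so highly correlated models are placed near each other and anti-correlated models far apart.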

Graphical Representation of Models PCA Descriptor Plot MDS Samples Plot

Analysis of GFA Models Proposed Method: MDS on Correlation Matrix

Graphical Analysis of Models
• Models can be colored by LOF (a) or Adj. R² (b)
• Other possible coloring schemes:
• F-test values
• Nvars
• R
• LSE

Clustering/Selection of Models
[Figure panels: clustering of models; selection of 10 diverse models]

Selection of Basis Models
D. Rogers, “Evolutionary Statistics: Using a Genetic Algorithm and Model Reduction to Isolate Alternate Statistical Hypotheses of Experimental Data”, Proceedings 7th International Conference on Genetic Algorithms, East Lansing, MI (1997)
Procedure:
• Perform PCA on the residual columns (the last retained component should explain at least 1/N of the original variance)
• For each retained component, select the model whose residuals correlate most strongly with that component
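
The two-step procedure can be sketched as below. The sketch takes N to be the number of training compounds, which matches the 1/37 = 2.7% threshold quoted later for the 37-compound training set; the function name and the synthetic residuals are illustrative assumptions.

```python
import numpy as np

def select_basis_models(residuals):
    """Return (component, model index, variance fraction, |correlation|) tuples."""
    n_compounds, n_models = residuals.shape
    centered = residuals - residuals.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)

    # Retain leading components while each explains at least 1/N of the
    # variance (N is taken here as the number of training compounds).
    n_keep = 0
    while n_keep < len(var_frac) and var_frac[n_keep] >= 1.0 / n_compounds:
        n_keep += 1

    scores = u[:, :n_keep] * s[:n_keep]
    basis = []
    for k in range(n_keep):
        # Correlate every model's residual column with this component's scores.
        corr = np.array([abs(np.corrcoef(centered[:, j], scores[:, k])[0, 1])
                         for j in range(n_models)])
        best = int(np.argmax(corr))
        basis.append((k + 1, best, float(var_frac[k]), float(corr[best])))
    return basis

# Synthetic residuals: models 0-3 share one error pattern, models 4-5 another.
rng = np.random.default_rng(0)
a = np.linspace(-1.0, 1.0, 20)
b = a**2 - np.mean(a**2)
patterns = [a, a, a, a, b, b]
residuals = np.column_stack([p + 0.02 * rng.normal(size=20) for p in patterns])

basis = select_basis_models(residuals)
for pc, model, frac, corr in basis:
    print(f"PC{pc}: {100 * frac:.1f}% variance, model {model}, |r| = {corr:.3f}")
```

On this synthetic table two components pass the threshold, and one basis model is drawn from each of the two groups of similar models.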

Selection of Basis Models
Component   % Variance   Base Model   Correlation
PC1         79.05        M36          0.972
PC2         7.12         M98          0.772
PC3         3.5          M72          0.488
Alternate Selection:
Component   % Variance   Base Model   Correlation
PC1         79.05        M23          0.970
PC2         7.12         M58          0.720
PC3         3.5          M72          0.488

Representation of Basis Models
Initial Selection: M36, M98, M72
Alternate Selection: M23, M58, M72

Identification of Consensus Model
• Objective:
• Select a “central” model with minimal contradictions vs. the other possible models
Consensus Model: M86
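
The slides do not say how the central model is computed. One plausible reading, sketched here purely as an assumption, is to pick the model whose residuals have the highest average correlation with the residuals of all other models.

```python
import numpy as np

def most_central_model(residuals):
    """Index of the model with the highest mean correlation to all others."""
    corr = np.corrcoef(residuals, rowvar=False)
    n_models = corr.shape[0]
    # Subtract the self-correlation (1.0) before averaging.
    mean_corr = (corr.sum(axis=1) - 1.0) / (n_models - 1)
    return int(np.argmax(mean_corr)), mean_corr

# Synthetic example: two orthogonal error patterns and a model that mixes them.
a = np.linspace(-1.0, 1.0, 20)
b = a**2 - np.mean(a**2)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
residuals = np.column_stack([a, a + b, b])

central, mean_corr = most_central_model(residuals)
print(central)  # 1
```

The mixed model sits between the two extremes, so it contradicts neither of them strongly and is returned as the central one.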

Use in Experimental Design
• Objective:
• First: Identify orthogonal models (basis models) corresponding to radically different prediction schemes
• Then: Identify compounds which are “prediction safe”: predicted active by the selected models
• or
• Identify compounds which are “prediction sensitive”: predicted differently across the selected models
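
The two compound categories can be sketched as simple tests on a table of predictions from the basis models. The activity cutoff and the spread tolerance below are hypothetical values chosen for illustration; the slides give no numeric criteria.

```python
import numpy as np

# Rows are candidate compounds, columns are predictions from three basis models.
preds = np.array([
    [7.2, 7.0, 7.4],   # all models predict high activity
    [7.5, 5.1, 6.0],   # models disagree strongly
    [4.9, 5.0, 5.2],   # all models predict low activity
])

active_cutoff = 6.5   # hypothetical threshold for calling a compound active
spread_tol = 1.0      # hypothetical tolerance on the prediction range

# "Prediction safe": every selected model predicts the compound active.
safe = (preds >= active_cutoff).all(axis=1)

# "Prediction sensitive": predictions differ widely across the selected models.
sensitive = (preds.max(axis=1) - preds.min(axis=1)) > spread_tol

print(safe.tolist())       # [True, False, False]
print(sensitive.tolist())  # [False, True, False]
```

Safe compounds are candidates for consensus prediction, while sensitive compounds are the ones whose measurement would best discriminate between competing models.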

Application of Multiple QSAR Analysis Methodology on a set of Dopamine beta-Hydroxylase Inhibitors

Training Set and Test Set Selection
• Chemically and biologically diverse
• Divided into three classes based on biological activity
• Selected a set of diverse compounds from each group using the distance-based MaxMin diversity metric
• 37 molecules in the training set and 10 molecules in the test set (1/37 = 2.7%, the variance threshold for retaining components)
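
A MaxMin selection can be sketched as a greedy loop that repeatedly adds the point farthest from everything already chosen. The Euclidean distance, the starting point, and the toy coordinates are assumptions; the slides do not specify the descriptor space or the seed compound.

```python
import numpy as np

def maxmin_select(points, k, start=0):
    """Greedy MaxMin: pick k points, each maximizing its distance to the picks so far."""
    points = np.asarray(points, dtype=float)
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy descriptor coordinates for five compounds.
coords = [[0, 0], [1, 0], [2, 0], [10, 0], [11, 0]]
picked = maxmin_select(coords, k=3)
print(picked)  # [0, 4, 2]
```

The same routine could be applied within each activity class, which is how the slide describes the selection being done.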

Selection of Basis Models
4 components are needed so that the last component explains at least 2.7% of the variance
Component   % Variance   Base Model   Correlation
PC1         70.3         M5           0.92
PC2         14.5         M12          0.80
PC3         7.3          M49          0.74
PC4         3.6          M52          0.45

Prediction of Test Set Using Basis Models

Experimental Design in DBH Set

Conclusions
• We can provide new capabilities for analyzing multiple models from GFA using standard tools in Cerius2.
• Differentiate between original and redundant models
• Identify clustered or dispersed nature of “best” models
• Provide reasonable sampling of models
• Critical analysis and causality remain paramount.
