Test Set Validation Revisited – Good Validation Practice in QSAR
Knut Baumann
Department of Pharmacy, University of Würzburg, Germany
Quantitative Structure-Activity Relationships
• Build mathematical model: Activity = f(Structural Properties)
• Use it to predict the activity of novel compounds
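To make the idea concrete, here is a minimal sketch (not part of the original slides): a linear model is fitted to a hypothetical descriptor matrix and used to predict novel compounds. The data are synthetic placeholders, not the HEPT set used later in the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(53, 5))                    # 53 compounds, 5 structural descriptors
y = X @ np.array([1.2, -0.8, 0.5, 0.0, 0.3]) + rng.normal(scale=0.3, size=53)

model = LinearRegression().fit(X, y)            # Activity = f(structural properties)
X_new = rng.normal(size=(3, 5))                 # descriptors of three novel compounds
print(model.predict(X_new))                     # predicted activities
```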
Model Validation
• Ultimate goal of QSAR: predictivity
• Prerequisites:
  • Valid biological and structural data
  • Stable mathematical model
  • Exclusion of chance correlation and overfitting
Outline • Conditions for good external predictivity • Practice of external validation
Levels of Model Validity
• Data fit
• Internal predictivity → internal validation
• External predictivity → external validation
Definition: Data Fit
The same data are used to build and to assess the model → resubstitution error
[Plot: fitted vs. observed activity, GRID-PLS, R² = 0.94]
R²: squared multiple correlation coefficient
Data: HEPT; n = 53
Definition: Internal Predictivity
A measure of predictivity (cross-validation, validation set prediction) that is used for model selection
[Plot: R² (fit) and R²CV-1 (cross-validation) vs. number of PLS factors, GRID-PLS; R²CV-1 reaches its maximum at an intermediate number of factors]
R²CV-1: leave-one-out cross-validated squared correlation coefficient (Q²)
Data: HEPT; n = 53
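A hedged illustration of the fit-versus-cross-validation curves on this slide: resubstitution R² grows monotonically with the number of PLS factors, while leave-one-out Q² peaks and then declines. Synthetic data and scikit-learn's PLSRegression stand in for the GRID-PLS/HEPT analysis.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(53, 30))                       # 53 compounds, 30 descriptors
y = X[:, :3] @ np.array([1.0, 0.7, -0.5]) + rng.normal(scale=0.5, size=53)

for a in range(1, 11):                              # number of PLS factors
    pls = PLSRegression(n_components=a)
    r2_fit = r2_score(y, pls.fit(X, y).predict(X))  # resubstitution R^2
    y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut())
    q2 = r2_score(y, y_cv)                          # R^2_CV-1 (Q^2)
    print(f"factors={a:2d}  R2={r2_fit:.3f}  Q2={q2:.3f}")
```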
Definition: External Predictivity
A measure of predictivity (cross-validation, test set prediction) for a set of data that did not influence model selection
The activity values of the test set are concealed and not known to the user during model selection
Example: External Predictivity
[Plot: R² (fit), R²CV-1 (cross-validation), and R²Test (test set prediction) vs. number of PLS factors, GRID-PLS; maxima of R²CV-1 and R²Test marked]
Data: HEPT; n = 53, nTest = 27
Importance of the Selection Criterion
Good external predictivity depends on the quality of the measure of predictivity used for model selection!
[Plot: R² (fit), R²CV-1 (cross-validation), and R²Test (test set prediction) vs. number of PLS factors (up to 35); maximum R² marked]
Data: HEPT; n = 53, nTest = 27
Usefulness of Internal Predictivity
Do internal measures of predictivity provide useful information?
It depends …
Case 1: No Model Selection
Multiple Linear Regression:
• CV: R²CV-1
• Test: R²Test
With no model selection, the internal (cross-validated) estimate and the test set estimate agree.
MSEP: mean squared error of prediction
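For reference, a small sketch of the error measure named on this slide; the observed and predicted values are made-up placeholders:

```python
import numpy as np

def msep(y_obs, y_pred):
    """Mean squared error of prediction."""
    return np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2)

y_obs = np.array([6.1, 7.3, 5.8, 8.0])          # observed activities (placeholders)
y_pred = np.array([6.4, 7.0, 6.1, 7.6])         # predicted activities (placeholders)
print(msep(y_obs, y_pred))                      # MSEP
print(np.sqrt(msep(y_obs, y_pred)))             # RMSEP
```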
Case 2: Little Model Selection
Stable mathematical modelling technique & few models are compared → internal ≈ external
[Plot: R²CV-1 (cross-validation) and R²Test (test set prediction) vs. number of PLS factors, GRID-PLS; the two curves track each other]
Case 3: Extensive Model Selection
Here: variable subset selection
[Plot: internal R²CV-1 vs. number of models evaluated (0–45,000); the internal score keeps rising]
Case 3: Extensive Model Selection
Here: variable subset selection
[Plot: internal R²CV-1 and external R²Test vs. number of models evaluated (0–45,000); the internal maximum keeps rising while the external score stagnates]
Extensive model selection → (danger of) overfitting → internal measures of predictivity are of limited usefulness
Data: Steroids; n = 21, nTest = 9
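The overfitting mechanism on this slide can be reproduced with a toy experiment: on pure noise, screening many random variable subsets and keeping the one with the best LOO Q² produces an impressive internal score that collapses on an untouched test set (chance correlation). This is a sketch with synthetic data, not the steroid analysis itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X, y = rng.normal(size=(21, 200)), rng.normal(size=21)        # y is pure noise
X_test, y_test = rng.normal(size=(9, 200)), rng.normal(size=9)

best_q2, best_cols = -np.inf, None
for _ in range(1000):                                         # extensive selection
    cols = rng.choice(200, size=3, replace=False)             # random 3-variable subset
    y_cv = cross_val_predict(LinearRegression(), X[:, cols], y, cv=LeaveOneOut())
    q2 = r2_score(y, y_cv)                                    # internal LOO Q^2
    if q2 > best_q2:
        best_q2, best_cols = q2, cols

final = LinearRegression().fit(X[:, best_cols], y)
ext = r2_score(y_test, final.predict(X_test[:, best_cols]))
print(f"best internal Q2: {best_q2:.2f}")                     # typically inflated
print(f"external R2:      {ext:.2f}")                         # typically near or below 0
```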
Outline • Conditions for good external predictivity • Practice of external validation
Meaningful External Validation
The two problems of external validation:
• Data splitting
• Variability
Problem 1: Data Splitting
[Diagram: the full data set (activity values + structure descriptors) is split into a training set and a test set]
Techniques for splitting:
• Experimental design using descriptors → biased¹
• Random partition → variability
→ Use multiple random splits into training and test sets
1) E. Roecker, Technometrics 1991, 33, 459-468.
Problem 2: Variability
• nTest = 5 → rel. sdv(RMSEP) = 32%
• nTest = 10 → rel. sdv(RMSEP) = 22%
• nTest = 50 → rel. sdv(RMSEP) = 10%
RMSEP: root mean squared error of prediction
Problem 2: Variability – Example
Steroid data set, nTest = 9:
RMSEP = 0.53, R²Test = 0.73
RMSEP ± 2·sdv(RMSEP) = 0.53 ± 0.25 → R²Test ∈ [0.40, 0.92]
RMSEP: root mean squared error of prediction
Problem 2: Variability
Unless the test data set is huge (nTest ≥ 100):
→ Use multiple random splits into training and test sets
RMSEP: root mean squared error of prediction
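A sketch of this recommendation (synthetic data; scikit-learn's train_test_split as one possible splitting tool): repeating the random split many times yields a distribution of RMSEP values, whose spread makes the variability of a single small test set visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                    # 30 compounds, 5 descriptors
y = X @ np.array([1.0, -0.5, 0.8, 0.0, 0.2]) + rng.normal(scale=0.4, size=30)

rmseps = []
for seed in range(100):                         # 100 random training/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=9, random_state=seed)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rmseps.append(np.sqrt(mean_squared_error(y_te, pred)))

rmseps = np.array(rmseps)
print(f"RMSEP = {rmseps.mean():.2f} +/- {rmseps.std():.2f} over 100 splits")
```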
Variability Illustrated I
GRID-PLS; n = 29, nTest = 15
[Scatter plot: R²Test vs. R²CV-1 for a single split]
Data: W84
Variability Illustrated I
GRID-PLS; 100 random splits into n = 29 and nTest = 15; mean marked
[Scatter plot: R²Test vs. R²CV-1 across the 100 splits]
Data: W84
Variability Illustrated II
Influence of extensive model selection; 100 random splits into n = 29 and nTest = 15
[Scatter plot: R²Test vs. R²CV for GRID-PLS and for variable selection, means marked; the variable selection cloud is far more scattered, with R²Test down to −1]
→ Extensive model selection causes instability
Data: W84
Conclusion
• Internal predictivity must reliably characterize model performance
• Avoid extensive model selection if possible
• Do not use the activity values of the test set until the final model is selected
• Model selection: variation of any operational parameter
• Use multiple splits into test and training set unless the test set is huge
Financial support: German Research Foundation, SFB 630 – TP C5
knut.baumann@chemometrix.de
The Kubinyi Paradox Explained
Data: Log P
Definition: Data Fit
The same data are used to build and to assess the model → resubstitution error
[Plot: fitted vs. observed, GRID-PLS, R² = 0.99]
Usefulness: strongly biased
Internal Predictivity
Does internal predictivity provide useful information? It depends!
[Plot: predicted vs. observed, GRID-PLS; fit R² = 0.99 vs. cross-validation R²CV-1 = 0.62]
Definition: Internal Predictivity
A measure of predictivity (cross-validation, test set prediction) that was used for model selection
[Plot: R² (fit) and R²CV-1 (cross-validation) vs. number of PLS factors, GRID-PLS]
Usefulness: it depends …
Variability Illustrated
[Scatter plot: R²Test vs. R²CV-1 for data 26, data 27, and data 28]
Conclusion
• Internal figures of merit in VS (variable selection) are largely inflated and can, in general, not be trusted
• The resulting models are far more complex than anticipated
• VS is prone to chance correlation, in particular with LOO-CV and similar statistics as objective function
→ Rigorous validation is mandatory
„Trau, schau, wem!" – "Try before you trust"
Similar in spirit to: "The Importance of Being Earnest", Tropsha et al.
For a PDF reprint of the slides, email: knut.baumann@chemometrix.de