Test Set Validation Revisited – Good Validation Practice in QSAR • Knut Baumann, Department of Pharmacy, University of Würzburg, Germany
Quantitative Structure-Activity Relationships • [Figure: k = f(molecular structure)] • Build mathematical model: Activity = f(Structural Properties) • Use it to predict activity of novel compounds
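To make the idea concrete, here is a minimal sketch of such a model in Python, using PLS regression (as in the GRID-PLS examples that follow) on synthetic data; the descriptor matrix, compound counts, and coefficients are placeholders, not the data from the slides.

```python
# Minimal sketch: fit Activity = f(descriptors) with PLS regression.
# Synthetic data stands in for real structural descriptors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(53, 100))                  # 53 compounds, 100 descriptors
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=53)  # activities

model = PLSRegression(n_components=3).fit(X, y)
y_new = model.predict(rng.normal(size=(1, 100)))  # predict a novel compound
```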
Model Validation • Ultimate goal of QSAR: predictivity • Prerequisites: • Valid biological and structural data • Stable mathematical model • Exclusion of chance correlation and overfitting
Outline • Conditions for good external predictivity • Practice of external validation
Levels of Model Validity • Data fit • Internal predictivity → internal validation • External predictivity → external validation
Definition: Data Fit • The same data are used to build and to assess the model (resubstitution error) • [Figure: fitted vs. observed activity, GRID-PLS, R² = 0.94] • R²: squared multiple correlation coefficient • Data: HEPT; n = 53
Definition: Internal Predictivity • A measure of predictivity (cross-validation, validation set prediction) that is used for model selection • [Figure: R² (fit) and R²CV-1 (cross-validation) vs. number of PLS factors, GRID-PLS; max. R²CV-1 marked] • R²CV-1: leave-one-out cross-validated squared correlation coefficient (Q²) • Data: HEPT; n = 53
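A hedged illustration of the two quantities just defined, computed with scikit-learn on synthetic stand-in data (the HEPT set itself is not reproduced here): the resubstitution R² is measured on the training data, while Q² (R²CV-1) comes from leave-one-out predictions.

```python
# Contrast the resubstitution R^2 (data fit) with the leave-one-out
# cross-validated Q^2 (internal predictivity) on synthetic data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(53, 100))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=53)

pls = PLSRegression(n_components=5)
r2_fit = r2_score(y, pls.fit(X, y).predict(X))      # resubstitution R^2
y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut())
q2 = r2_score(y, y_cv)                              # LOO Q^2 (R^2_CV-1)
print(f"R^2 = {r2_fit:.2f}, Q^2 = {q2:.2f}")        # Q^2 < R^2, as in the figure
```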
Definition: External Predictivity A measure of predictivity (cross-validation, test set prediction) for a set of data that did not influence model selection The activity values of the test set are concealed and not known to the user during model selection
Example: External Predictivity • [Figure: R² (fit), R²CV-1 (cross-validation), and R²Test (test set prediction) vs. number of PLS factors, GRID-PLS; max. R²CV-1 and max. R²Test marked] • Data: HEPT; n = 53, nTest = 27
Importance of Selection Criterion • [Figure: R² (fit), R²CV-1 (cross-validation), and R²Test (test set prediction) vs. number of PLS factors; max. R² marked] • Good external predictivity depends on the quality of the measure of predictivity used for model selection! • Data: HEPT; n = 53, nTest = 27
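The following sketch, again on synthetic stand-in data, shows the selection step the plot describes: the number of PLS factors is chosen where Q² peaks, and only then is the model scored on a test set whose activities played no part in the choice.

```python
# Pick the number of PLS factors that maximizes LOO Q^2, then evaluate
# R^2_Test on held-out data not used for model selection. Synthetic data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 100))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=80)
X_tr, y_tr, X_te, y_te = X[:53], y[:53], X[53:], y[53:]  # 53 train / 27 test

q2 = {a: r2_score(y_tr, cross_val_predict(PLSRegression(n_components=a),
                                          X_tr, y_tr, cv=LeaveOneOut()))
      for a in range(1, 11)}
best = max(q2, key=q2.get)                               # max. R^2_CV-1
model = PLSRegression(n_components=best).fit(X_tr, y_tr)
print(f"A = {best}, Q^2 = {q2[best]:.2f}, "
      f"R^2_Test = {r2_score(y_te, model.predict(X_te)):.2f}")
```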
Usefulness of Internal Predictivity Do internal measures of predictivity provide useful information? It depends …
Case 1: No Model Selection • Multiple Linear Regression • CV: R²CV-1 • Test: R²Test • MSEP: mean squared error of prediction
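A minimal illustration of Case 1 (an assumed setup on synthetic data, since the slide's equations did not survive extraction): for one fixed multiple linear regression with no model selection, the leave-one-out MSEP and the test-set MSEP should agree within the noise.

```python
# With a fixed MLR model and no model selection, the cross-validated
# error is a fair estimate of the test error. Synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)
X_tr, y_tr, X_te, y_te = X[:53], y[:53], X[53:], y[53:]

mlr = LinearRegression()
y_cv = cross_val_predict(mlr, X_tr, y_tr, cv=LeaveOneOut())
mlr.fit(X_tr, y_tr)
print("MSEP (LOO-CV):", mean_squared_error(y_tr, y_cv))
print("MSEP (test):  ", mean_squared_error(y_te, mlr.predict(X_te)))
```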
Case 2: Little Model Selection • [Figure: R²CV-1 (cross-validation) and R²Test (test set prediction) vs. number of PLS factors, GRID-PLS] • Stable mathematical modelling technique & few models compared → internal ≈ external
Case 3: Extensive Model Selection • Here: variable subset selection • [Figure: R²CV-1 (internal) and R²Test (external) vs. number of models evaluated (0 to 45,000); max. R²CV-1 marked] • Extensive model selection → (danger of) overfitting → internal measures of predictivity are of limited usefulness • Data: Steroids; n = 21, nTest = 9
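A small simulation of Case 3 (my construction, not the steroid study itself): screening thousands of random variable subsets with LOO Q² as the objective inflates the best internal score even when the descriptors are pure noise, while the external R²Test stays near zero.

```python
# Extensive variable subset selection on pure-noise descriptors: the
# best internal Q^2 is inflated by chance, the external R^2_Test is not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 200))     # pure-noise descriptors
y = rng.normal(size=30)            # activities unrelated to X
X_tr, y_tr, X_te, y_te = X[:21], y[:21], X[21:], y[21:]  # n = 21, nTest = 9

best_q2, best_subset = -np.inf, None
for _ in range(2000):                            # extensive model selection
    subset = rng.choice(200, size=3, replace=False)
    y_cv = cross_val_predict(LinearRegression(), X_tr[:, subset], y_tr,
                             cv=LeaveOneOut())
    q2 = r2_score(y_tr, y_cv)
    if q2 > best_q2:
        best_q2, best_subset = q2, subset

model = LinearRegression().fit(X_tr[:, best_subset], y_tr)
r2_test = r2_score(y_te, model.predict(X_te[:, best_subset]))
print(f"best internal Q^2 = {best_q2:.2f}, external R^2_Test = {r2_test:.2f}")
```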
Outline • Conditions for good external predictivity • Practice of external validation
Meaningful External Validation • The two problems of external validation: • Data splitting • Variability
Problem 1: Data Splitting • [Diagram: data set (activity values + structure descriptors) partitioned into a training set and a test set] • Techniques for splitting: • Experimental design using descriptors → biased¹ • Random partition → variability → use multiple random splits into training and test sets • 1) E. Roecker, Technometrics 1991, 33, 459-468.
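A sketch of the recommended remedy under assumed dimensions (the W84-like sizes n = 29 / nTest = 15 are used only for illustration): repeat the random split many times and report the distribution of R²Test rather than a single value.

```python
# Multiple random training/test splits: report mean and spread of R^2_Test.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
X = rng.normal(size=(44, 50))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=44)

scores = []
for seed in range(100):                                  # 100 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=15,
                                              random_state=seed)
    model = PLSRegression(n_components=3).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

print(f"R^2_Test: mean = {np.mean(scores):.2f}, sd = {np.std(scores):.2f}")
```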
Problem 2: Variability • nTest = 5 → rel. sdv(RMSEP) = 32% • nTest = 10 → rel. sdv(RMSEP) = 22% • nTest = 50 → rel. sdv(RMSEP) = 10% • RMSEP: root mean squared error of prediction
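A hedged aside, not on the original slide: the three quoted values match the large-sample approximation for the relative standard deviation of the RMSEP under independent, normally distributed prediction errors,

$$\frac{\operatorname{sdv}(\mathrm{RMSEP})}{\mathrm{RMSEP}} \;\approx\; \frac{1}{\sqrt{2\,n_{\mathrm{Test}}}},$$

which gives 1/√10 ≈ 32% for nTest = 5, 1/√20 ≈ 22% for nTest = 10, and 1/√100 = 10% for nTest = 50.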
Problem 2: Variability – Example • Steroid data set: nTest = 9, RMSEP = 0.53, R²Test = 0.73 • RMSEP ± 2·sdv(RMSEP) = 0.53 ± 0.25 → R²Test ∈ [0.40, 0.92] • RMSEP: root mean squared error of prediction
Problem 2: Variability • Unless the test set is huge (nTest ≥ 100): use multiple random splits into training and test sets • RMSEP: root mean squared error of prediction
Variability Illustrated I • GRID-PLS • [Figure: scatter of R²Test vs. R²CV-1 for 100 random splits into n = 29 / nTest = 15; mean marked] • Data: W84
Variability Illustrated II • Influence of extensive model selection • [Figure: R²Test vs. R²CV for 100 random splits into n = 29 / nTest = 15; variable selection vs. GRID-PLS, means marked] • Extensive model selection causes instability • Data: W84
Conclusion • Internal predictivity must reliably characterize model performance • Avoid extensive model selection if possible • Do not use the activity values of the test set until the final model is selected • Model selection: variation of any operational parameter • Use multiple splits into test and training sets unless the test set is huge • Financial support: German Research Foundation, SFB 630 – TP C5 • knut.baumann@chemometrix.de
The Kubinyi Paradox Explained • Data: log P
Definition: Data Fit • The same data are used to build and to assess the model (resubstitution error) • [Figure: fitted vs. observed, GRID-PLS, R² = 0.99] • Usefulness: strongly biased
Internal Predictivity • Does internal predictivity provide useful information? It depends! • [Figure: predicted vs. observed, GRID-PLS; fit (R² = 0.99) vs. cross-validation (R²CV-1 = 0.62)]
Definition: Internal Predictivity • A measure of predictivity (cross-validation, validation set prediction) that was used for model selection • [Figure: R² (fit) and R²CV-1 (cross-validation) vs. number of PLS factors, GRID-PLS] • Usefulness: it depends …
Variability Illustrated • [Figure: R²Test vs. R²CV-1 for three data sets (data 26, data 27, data 28)]
Conclusion • Internal figures of merit in variable selection (VS) are largely inflated and can, in general, not be trusted • The resulting models are far more complex than anticipated • VS is prone to chance correlation, in particular with LOO-CV and similar statistics as the objective function • → rigorous validation is mandatory • „Trau, schau, wem!“ – “Try before you trust” • Similar in spirit to: “The Importance of Being Earnest”, Tropsha et al. • For a PDF reprint of the slides, email: knut.baumann@chemometrix.de