In the name of GOD
8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009
QSAR/QSPR Model development and Validation for successful prediction and interpretation
Mohsen Kompany-Zareh
Contents:
• Introduction
• Selwood data set (all descriptors)
• Model development
• Model validation
• Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV)
• Internal validation
• QUIK
• Selwood data (a number of descriptors)
• Descriptor selection
• LMO and Jackknife
• Cross model validation
• Bootstrapping
• Training and test set selection
• Leverage
Introduction
QSPR/QSAR (quantitative structure-property/activity relationship): a mathematical relation between structural attribute(s) and a property (an activity) of a set of chemicals.
Application:
Prediction of a property for a variety of chemicals,
• prior to expensive synthesis and experimental measurement;
• to determine the environmental risk of thousands of untested industrial chemicals.
Description of a mechanism of action for a variety of chemicals.
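Introduction
[Figure: descriptor matrix X (rows: molecules 1 to 6; columns: surface area, MW, lipophilicity, LUMO) and activity vector y. A QSAR model built from the molecules with known activities predicts the activities marked "?".]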
Introduction
Data preparation:
1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.
2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of conformation); more than 3000 descriptors.
Software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs 2002; Stuper and Jurs 1976), OASIS (Mekenyan and Bonchev 1986), CODESSA (Katritzky et al., 1994), Gaussian, …
Introduction
Unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes:
• chosen from a very large number of descriptors from different software packages;
• as few explanatory descriptors as possible, for simple interpretation of the model (sometimes by variable selection).
Descriptors:
• Topological (edges and vertices)
• Geometric (surface, volume, …)
• Electronic (electron density, local charges)
• Constitutional (#C, #OH, …)
• …
[Diagram: Structure, Model, Activity]
Data set
Selwood data: 31 molecules, 53 descriptors; D (31x53), Y (31x1)
Selwood et al., J Med Chem (1990) 33, 136.
31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors.
>> load selwood.txt;
>> D=selwood(:,1:end-1);
>> y=selwood(:,end);
Model development
Model generation:
Independent variables: descriptors
Dependent variables: properties (activities)
• Model development methods:
• Multiple linear regression (MLR),
• Partial least squares (PLS),
• Artificial neural networks (ANNs),
• k-nearest neighbors
Note: #samples < #descriptors !!
Model development
Multiple Linear Regression
Simplest model: D b = y, so b = D+ y
>> b= D\y;
>> yEST= D*b;
Result: R2=1; 22 of 53 coefficients are zero!!
The model is developed. Application of the model? Validation?
[Figure: block diagram of D b = y, with intercept b0]
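The exact fit on this slide follows from the shape of D: with 31 molecules and 53 descriptors, D b = y is underdetermined, so a coefficient vector that reproduces y exactly exists whenever D has full row rank. The sketch below uses random stand-in data with the Selwood dimensions (not the real values) and compares the pseudoinverse solution b = D+ y with the backslash solution. In MATLAB, backslash returns a basic solution with at most rank(D) nonzero coefficients, which would be consistent with 53 - 31 = 22 zeros; Octave returns the minimum-norm solution instead, so the zero count may differ there.

```matlab
% Stand-in data with the Selwood dimensions (random numbers, NOT the real data set)
n = 31;                 % molecules
p = 53;                 % descriptors
D = randn(n, p);
y = randn(n, 1);

bPinv = pinv(D) * y;    % minimum-norm solution, b = D+ y
bBack = D \ y;          % backslash solution (basic solution in MATLAB)

yEST = D * bBack;
R2 = 1 - sum((y - yEST).^2) / sum((y - mean(y)).^2);

fprintf('R2 = %.4f\n', R2);
fprintf('max |residual|, pinv:      %.2e\n', max(abs(y - D*bPinv)));
fprintf('max |residual|, backslash: %.2e\n', max(abs(y - yEST)));
fprintf('zero coefficients (backslash): %d of %d\n', sum(bBack == 0), p);
```

Both solutions fit the training data to machine precision even though y is pure noise here, which is why R2 = 1 alone says nothing about predictive ability.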
Model development
Other statistical diagnostics:
Coefficient of determination, R2: the fraction of dependent variable variance explained by a model (e.g. an MLR model). Closer to unity is better.
It is a measure of the quality of fit between model-predicted and experimental values, and does not reflect the predictive power at all.
Model development
Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model !! They do not include model validation in model development.
Ex: Schultz et al., Toxicity of Tetrahymena pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.
log(1/IGC50) = 0.54 logKw – 8.90 LUMO – 0.99
n=11, r2=0.82, s=0.28, r2cv=0.64
n/#descr = 11/2 > 5, but r2cv < r2 of fit: unstable model
Model development
Ex: Akers et al., Structure-toxicity relationships for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39.
Claim: The goodness of fit is satisfactory for predictive purposes.
Ex: Benigni et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714.
“..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics” !!
Problem: Sometimes a model that fits the training set closely and accurately does not predict the validation sets well !! ..so, the model is not reliable !!
Model validation
Model validation: the real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals.
• Quantitative assessment of model robustness and its predictive power.
• Definition of the application domain of the model in the space of applied chemical descriptors.
Model validation
External validation: division into calibration and test sets
• 1 4 7 10 13 …
• 2 5 8 11 14 …
• 3 6 9 12 15 …
calD = [D(1:3:end,:); D(2:3:end,:)];
valD = D(3:3:end,:);
caly = [y(1:3:end,:); y(2:3:end,:)];
valy = y(3:3:end,:);
There are many different methods for selecting the members of the training and test sets.
Model development (calD, caly):
b=calD\caly; %model development
Model validation (valD, valy)
Model validation
>> calyEST=calD*b;
R2=1
>> valyEST=valD*b; % model validation
Not good prediction
Model validation
>> calDr=size(calD,1); valDr=size(valD,1); % numbers of calibration and validation molecules
>> calyEST=calD*b; % root mean square error of calibration
>> rmsec=sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)
RMSEC=2.9396e-014
>> valyEST=valD*b; % root mean square error of validation
>> rmsep=sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)
RMSEP=2.2940
Not good prediction
Model validation
• A model with high R2 could be a poor predictor because of:
• Variable multicollinearity,
• Statistically insignificant model descriptors,
• High leverage points in the training set.
A regression model with k descriptors and n training set compounds may be acceptable for validation only if:
n > 4 k
For any of the k descriptors:
pair-wise correlation coefficient < 0.9,
tolerance > 0.1.
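These three rules of thumb can be checked directly from the training descriptor matrix. The sketch below assumes the usual definition of tolerance, 1 - Rj2, where Rj2 comes from regressing descriptor j on the other descriptors; this equals the reciprocal of the j-th diagonal element of the inverse correlation matrix. The slides do not define tolerance, so that definition and the small synthetic matrix X are assumptions made for illustration.

```matlab
% Hypothetical training descriptors: 20 compounds, 3 descriptors.
% Column 3 is built to be nearly collinear with column 1.
idx = (1:20)';
x1 = idx;
x2 = mod(7 * idx, 11);
x3 = x1 + 0.1 * (-1).^idx;
X  = [x1 x2 x3];

[n, k] = size(X);
okRatio = n > 4*k;                    % rule 1: n > 4k

R = corrcoef(X);                      % k-by-k descriptor correlation matrix
offDiag = abs(R) - eye(k);            % remove the unit diagonal
maxPairCorr = max(offDiag(:));        % rule 2: should stay below 0.9

tol = 1 ./ diag(inv(R));              % rule 3: tolerance = 1/VIF, should exceed 0.1

fprintf('n > 4k: %d\n', okRatio);
fprintf('largest pair-wise |r|: %.4f\n', maxPairCorr);
fprintf('tolerances: %s\n', mat2str(tol', 3));
```

For this matrix the sample-size rule passes, while the pair-wise correlation and tolerance rules flag descriptors 1 and 3 as collinear.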
Model validation
• Validation strategies:
• Randomization of the modeled property (Y-scrambling).
• Internal validation: only the training set is used.
• External validation: division into training and test sets.
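Y-scrambling can be sketched as follows: the activities are randomly permuted, the model is refitted, and the resulting R2 values are compared with the R2 of the real model. If scrambled models fit almost as well, the original fit is probably a chance correlation. The data, the number of permutations and the helper name r2 are hypothetical choices for this illustration.

```matlab
% Synthetic example: intercept plus two hypothetical descriptors, 30 compounds
n = 30;
idx = (1:n)';
X = [ones(n,1), idx, mod(5 * idx, 7)];
y = 2 + 0.5 * X(:,2) - 0.3 * X(:,3) + 0.2 * sin(idx);   % response truly related to X

% R2 of an MLR fit of y on X
r2 = @(X, y) 1 - sum((y - X*(X\y)).^2) / sum((y - mean(y)).^2);

R2true = r2(X, y);

nPerm = 200;                          % number of scrambling runs (arbitrary choice)
R2perm = zeros(nPerm, 1);
for i = 1:nPerm
    yPerm = y(randperm(n));           % break the X-y link, keep the y distribution
    R2perm(i) = r2(X, yPerm);
end

fprintf('R2 of the real model:      %.4f\n', R2true);
fprintf('mean R2 of scrambled runs: %.4f\n', mean(R2perm));
fprintf('max R2 of scrambled runs:  %.4f\n', max(R2perm));
```

A large gap between the real R2 and the scrambled R2 values supports the model; when #descriptors exceeds #samples, as with all 53 Selwood descriptors, scrambled models also reach R2 = 1 and the test fails.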
Model validation
Predictive power of QSAR models: assessed from a sufficiently large external test set of compounds that were not used in the model development.
Golbraikh et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.
Zefirov et al., QSAR for boiling points of “small” sulfides. Are the “high-quality structure-property-activity regressions” the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.
Model validation
[Figures: residual sum of squares (SS) and total variance SS, each shown for the training set and the test set]
Model validation
Train: R2 = 1.0000
Test: q2 = -8.5220
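Both statistics are one minus a ratio of a residual sum of squares to a total sum of squares, so q2 becomes negative as soon as the test-set prediction errors exceed the spread of the test activities. The sketch below uses small made-up vectors. It takes the test-set total SS about the training mean, which is one common convention; the slides do not state which mean was used, so that choice is an assumption.

```matlab
% Made-up activities and predictions, for illustration only
caly    = [1; 2; 3; 4];              % training activities
calyEST = [1.1; 1.9; 3.2; 3.8];      % fitted values
valy    = [2; 5];                    % test activities
valyEST = [4; 1];                    % predictions for the test set

% Training set: residual SS over total SS
R2 = 1 - sum((caly - calyEST).^2) / sum((caly - mean(caly)).^2);

% Test set: PRESS over total SS about the training mean (assumed convention)
press = sum((valy - valyEST).^2);
q2ext = 1 - press / sum((valy - mean(caly)).^2);

rmsec = sqrt(sum((caly - calyEST).^2) / numel(caly));
rmsep = sqrt(press / numel(valy));

fprintf('R2 = %.4f, RMSEC = %.4f\n', R2, rmsec);
fprintf('q2 = %.4f, RMSEP = %.4f\n', q2ext, rmsep);
```

Here the fit is close (R2 = 0.98) while q2 is negative, the same pattern as the Selwood result with all descriptors.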
Internal validation
Internal validation: cross validation (CV), applied to the training set
• Leave-one-out (LOO) (common)
• Leave-many-out (LMO) (sometimes)
CV correlation coefficient (q2): similar to R2 !
Internal validation
Uses the training set only.
Cross validation, leave-one-out: useful when only a small number of molecules is available.
Internal validation
Subsamples (copies of the training set); # subsamples = # molecules in the training set.
[Figure: each subsample i is split into SubTrain i and SubValid i; the squared prediction errors are accumulated in cumPRESS]
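Internal validation
LOO CV
Dr = size(X,1); % number of molecules (rows of X)
valyEST = zeros(Dr,1); press = zeros(Dr,1);
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)]; % leave molecule i out
    valX = X(i,:);
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly; % model from the remaining molecules
    valyEST(i) = valX*b; % prediction for the left-out molecule
    press(i) = (valyEST(i)-valy)^2;
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1-((y-valyEST)'*(y-valyEST))/((y-mean(y))'*(y-mean(y)))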
Internal validation
q2LOO = -4.8574
RMSECV = 2.0397
>> q2ASYMPTOT=1-(1-R2)*(calDr/(calDr-calDc))^2
q2ASYMPTOT = 1.0000
>> if q2LOO-q2ASYMPTOT<0.005,disp('reject'),end
REJECT
q2LOO and R2 should not be considerably different.
Internal validation
Many authors consider q2LOO>0.5 an indicator of high predictive power of the model and do not evaluate the model on an external test set, or use a test set of only one or two compounds.
Ex: Cronin et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.
Ex: Moss et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.
Ex: Suzuki et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.
Internal validation
A small value of q2LOO or q2LMO indicates low prediction ability, but the opposite is not necessarily true (a high q2LOO is necessary but not sufficient).
It indicates robustness, but not the prediction ability of the model.
Internal validation
It has been shown that there exists no correlation between the LOO cross-validated q2LOO and the correlation coefficient R2 between the predicted and observed activities for an external test set.
Kubinyi et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.
Golbraikh et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.
A high q2LOO is a necessary condition for a model to have high predictive power, but not a sufficient condition.
QUIK
QUIK rule: R. Todeschini et al., Detecting bad regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208.
For illustration of correlation (collinearity) among independent variables.
Based on the multivariate correlation index K.
QUIK
4 correlated descriptors
[Numeric values of M and y are not recoverable]
>> corr(M)
>> p=size(M,2);
>> CorrEV=svds(corr(M),p);
It seems possible to use svd(M)
QUIK
>> K=sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
As a function:
>> [KM]=QUIK(M)
KM = 1.0000 (maximum correlation between descriptors)
>> [KMY]=QUIK([M Y]) % in the presence of the dependent variable
KMY = 1.0000
>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
REJECT
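The source of the QUIK function is not shown on the slides, so the following is only a sketch of the K index built from the one-line formula above. The name quikK and the two tiny test matrices are hypothetical. It swaps in corrcoef and eig (both base functions) for corr and svds; for a correlation matrix, which is symmetric and positive semidefinite, the eigenvalues equal the singular values.

```matlab
% Multivariate correlation index K from the eigenvalues of the correlation matrix
nCols  = @(M) size(M, 2);
evFrac = @(M) eig(corrcoef(M)) / sum(eig(corrcoef(M)));    % eigenvalue fractions
quikK  = @(M) sum(abs(evFrac(M) - 1/nCols(M))) / (2 * (nCols(M) - 1) / nCols(M));

Mcorr = [1 2; 2 4; 3 6; 4 8];        % column 2 = 2 * column 1 (fully correlated)
Morth = [1 1; -1 1; 1 -1; -1 -1];    % uncorrelated columns
yorth = [2; 0; 0; -2];               % response equal to the sum of the two columns

Kcorr = quikK(Mcorr);                % upper limit of K
KX    = quikK(Morth);                % lower limit of K
KXY   = quikK([Morth yorth]);        % K in the presence of the response

fprintf('K (correlated descriptors)   = %.4f\n', Kcorr);
fprintf('K (uncorrelated descriptors) = %.4f\n', KX);
fprintf('KXY = %.4f, KXY - KX = %.4f\n', KXY, KXY - KX);
if KXY - KX < 0.05, disp('reject'), else, disp('NOT reject'), end
```

K runs from 0 for uncorrelated columns to 1 for fully correlated ones. The rule keeps a model only when adding the response raises K by more than the threshold, that is, when the descriptors correlate with y more than they do with each other.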
QUIK
>> M=rand(4,5)
[Numeric values of M and y are not recoverable]
>> corr(M)
QUIK
>> [KM]=QUIK(M)
KM = 0.5000
>> [KMY]=QUIK([M Y])
KMY = 0.6000
>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED
QUIK
>> [KM]=QUIK(calD) % Selwood data, all descriptors
KM = 0.7919
>> [KMY]=QUIK([calD Y])
KMY = 0.7923
>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED
Development of an MLR model using all descriptors is not acceptable. The model can be improved by using a factor-based method, …and by descriptor selection.
A number of descriptors
Development of an MLR model using a number of descriptors.
>> D=Dini(:,[51 37 35 38 39 36 15]);
RMSEP = 0.4993
RMSEC = 0.4989
Comparable. Improved.
A number of descriptors
R2 = 0.6495
q2 = 0.5490
Comparable. Improved.
q2LOO = 0.2816
NOT REJECTED
A number of descriptors
D=Dini(:,[51 37 35 38 39 36 15]);
QUIK
KX = 0.6384
KXY = 0.5996
if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED
A number of descriptors
D=Dini(:,[51 1 38]);
QUIK
KX = 0.3159
KXY = 0.3953
if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED
Using a proper set of descriptors, improved results can be obtained from MLR. But how can the proper set of descriptors be selected?
Descriptor Selection
Descriptor selection:
• Forward selection,
• Backward elimination,
• Genetic algorithm
• Kohonen map
• SPA
• CWSPA
Descriptor Selection
Kohonen map: the rows (descriptors) of the transposed Selwood data matrix (53 × 31) are the input.
1. Sampling from all regions in descriptor space
2. Sampling from regions in which descriptors have high correlation with Y (activity)
By: Mehdi Vasighi
Descriptor Selection
Successive projections algorithm (SPA)
SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached.
To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection of a variable in SPA is its orthogonality to the previously selected variable.
Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.
Descriptor Selection
Important parameters:
1- Starting vector
2- N, maximum number of descriptors
Araujo et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65–73.
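The two parameters map directly onto a short loop. The sketch below shows the projection step as it is usually described for SPA: every column is projected onto the subspace orthogonal to the most recently selected column, and the column with the largest remaining norm is selected next. The matrix X, the starting column k0 and N are made-up values, and the columns are used as given (centring or scaling would be a separate preprocessing choice).

```matlab
% Made-up descriptor matrix: column 2 is nearly collinear with column 1
X = [1 0.9 0 0;
     0 0.1 0 1;
     0 0   2 0;
     0 0   0 1];
k0 = 1;                 % starting vector (index of the first selected column)
N  = 3;                 % number of descriptors to select

Xp  = X;                % working copy that is projected at each iteration
sel = zeros(1, N);
sel(1) = k0;
for n = 2:N
    xk = Xp(:, sel(n-1));                      % last selected (projected) column
    Xp = Xp - xk * (xk' * Xp) / (xk' * xk);    % remove the component along xk
    norms = sqrt(sum(Xp.^2, 1));               % what is left of each column
    norms(sel(1:n-1)) = -1;                    % never reselect a column
    [~, sel(n)] = max(norms);                  % most orthogonal remaining column
end

disp(sel)
```

Column 2 is skipped because almost nothing of it remains after column 1 has been projected out, which is how SPA limits collinearity among the selected descriptors.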
Descriptor Selection
Correlation weighted SPA (CWSPA)
A limitation of SPA is that the only criterion for the stepwise selection of a variable is its orthogonality to the previously selected variable; the relation of the entering vector, as an independent variable, to the response is not considered.
This limitation is overcome by incorporating a correlation ranking procedure within SPA, in which the variables are weighted by their correlation coefficient with the dependent variable.
M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J of Chemom (2007) 21, 239-250.
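One plausible reading of this weighting, offered only as a sketch and not as the published procedure, is to scale each descriptor column by the absolute value of its correlation with y before running the SPA projections. A descriptor that is orthogonal to the selected ones but unrelated to the response then loses its advantage. The data, k0 and N below are made up, and plain SPA is run alongside for comparison.

```matlab
% Made-up data: x2 is orthogonal to x1 but uncorrelated with y
y = [1; -1; 1; -1];
X = [ 1  3  2;
     -1  3  0;
      1 -3  0;
     -1 -3 -2];
k0 = 1;  N = 2;

R = corrcoef([X y]);
w = abs(R(1:end-1, end))';            % |correlation| of each descriptor with y

inputs = {X, X * diag(w)};            % plain SPA input, correlation-weighted input
sel = zeros(2, N);
for c = 1:2
    Xp = inputs{c};
    sel(c, 1) = k0;
    for n = 2:N
        xk = Xp(:, sel(c, n-1));
        Xp = Xp - xk * (xk' * Xp) / (xk' * xk);   % project out the last selection
        norms = sqrt(sum(Xp.^2, 1));
        norms(sel(c, 1:n-1)) = -1;                % never reselect a column
        [~, sel(c, n)] = max(norms);
    end
end

fprintf('plain SPA selects:    %s\n', mat2str(sel(1, :)));
fprintf('weighted SPA selects: %s\n', mat2str(sel(2, :)));
```

Plain SPA picks column 2 because of its large orthogonal norm, whereas the weighted run prefers column 3, which is less orthogonal but correlated with the response.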