8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009
QSAR/QSPR Model Development and Validation: Essential for successful application and interpretation
Mohsen Kompany-Zareh
Selwood data: 31 molecules, 53 descriptors; D (31x53), y (31x1)
>> load selwood.txt;
>> D = selwood(:,1:end-1);   % descriptors
>> y = selwood(:,end);       % response (last column)
D -> Model -> y
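Simplest model: Multiple Linear Regression (MLR)
D b = y, so b = D+ y (D+ is the pseudoinverse of D)
>> b0 = D\y;
>> yEST = D*b0;
The model is developed. Validation?
22 of the 53 coefficients in b0 are zero! With 31 molecules and 53 descriptors the system is underdetermined, and MATLAB's backslash returns a basic solution with at most rank(D) nonzero coefficients (here 31) rather than the minimum-norm solution pinv(D)*y.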
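Why backslash leaves exactly 53 - 31 = 22 coefficients at zero can be checked on stand-in data. The following is a minimal sketch, assuming random matrices of the same shape as the Selwood data (the numbers are synthetic, not the real descriptors). It compares the basic solution from backslash with the minimum-norm solution from pinv. The zero pattern is MATLAB behavior; other environments may return a different solution for underdetermined systems.

```matlab
% Synthetic stand-in for the Selwood matrices (random numbers, NOT the real data)
D = randn(31, 53);            % 31 "molecules", 53 "descriptors"
y = randn(31, 1);

bBasic   = D \ y;             % what the slide computes; at most rank(D) nonzeros in MATLAB
bMinNorm = pinv(D) * y;       % minimum-norm solution, b = D+ y

fprintf('nonzero coefficients: backslash %d, pinv %d\n', nnz(bBasic), nnz(bMinNorm));
fprintf('max fit error: backslash %.2e, pinv %.2e\n', ...
        max(abs(D*bBasic - y)), max(abs(D*bMinNorm - y)));
fprintf('norm of b: backslash %.3f, pinv %.3f\n', norm(bBasic), norm(bMinNorm));
```

Both solutions reproduce y essentially exactly, which is the point of the slide: a perfect fit on the training data says nothing yet about predictive ability.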
Problem: a model that fits the training set very accurately is sometimes not suitable for validation sets. Such a model is not reliable!
External validation
There are many different methods for selecting the members of the training and test sets.
Division into calibration and test (validation) sets:
>> calD = [D(1:3:end,:); D(2:3:end,:)];
>> valD = D(3:3:end,:);
>> caly = [y(1:3:end,:); y(2:3:end,:)];
>> valy = y(3:3:end,:);
calD, caly -> model development; valD, valy -> validation
>> b1 = calD\caly;   % model development
>> calyEST = calD*b1;
>> calDr = size(calD,1);   % number of calibration molecules
>> rmsec1 = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)   % root mean square error of calibration
RMSEC = 2.9396e-014
>> valyEST = valD*b1;      % external model validation
>> valDr = size(valD,1);   % number of validation molecules
>> rmsep1 = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)   % root mean square error of prediction
RMSEP = 2.2940
Not a good prediction: the calibration error is essentially zero, while the prediction error is large.
[Figure: residual sum of squares (SS) for the training and test sets]
[Figure: total variance SS for the training and test sets]
Training set: R2 = 1.0000
Test set: q2 = -8.5220
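Both statistics follow from the two sums of squares named above, as 1 minus the ratio of residual SS to total SS. A minimal sketch with toy numbers (chosen only to make the arithmetic easy, not taken from the Selwood data) shows how a negative value arises. The helper name r2fun is hypothetical, and it assumes the total SS is taken about the mean of the observed values in the same set; the slides do not state which mean is used for the test set.

```matlab
% Hypothetical helper (not from the slides): 1 - residual SS / total SS
r2fun = @(yObs, yEst) 1 - sum((yObs - yEst).^2) / sum((yObs - mean(yObs)).^2);

% Toy numbers, for illustration only
yObs  = [1; 2; 3; 4];
yGood = [1.1; 1.9; 3.2; 3.8];   % close to the observations
yBad  = [4; 1; 5; 0];           % worse than simply predicting the mean

r2Good = r2fun(yObs, yGood);    % residual SS 0.10, total SS 5
r2Bad  = r2fun(yObs, yBad);     % residual SS 30,   total SS 5

fprintf('good predictions: %.2f\n', r2Good);   % prints 0.98
fprintf('bad predictions:  %.2f\n', r2Bad);    % prints -5.00
```

A negative q2 therefore means the residual SS exceeds the total SS: the model predicts the test molecules worse than the mean value would.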
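Internal validation on the training set: cross-validation, leave-one-out (LOO)
[Figure: each molecule in turn is held out for validation while the model is developed on the rest; number of subsamples = number of molecules in the training set; the squared prediction errors accumulate into cumPRESS]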
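LOO CV
Dr = size(X,1);                         % X, y: descriptors and responses of the training set
valyEST = zeros(1,Dr); press = zeros(1,Dr);
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)];   % leave molecule i out
    valX = X(i,:);
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly;                      % model developed without molecule i
    valyEST(i) = valX*b;                % prediction for the left-out molecule
    press(i) = (valyEST(i)-valy)^2;     % squared prediction error
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1 - ((y-valyEST')'*(y-valyEST'))/ ...
        ((y-mean(y))'*(y-mean(y)))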
q2LOO = -4.8574
RMSECV = 2.0397
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
q2ASYMPTOT = 1.0000
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
REJECT
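The rejection test above can be traced by hand. This is a minimal sketch using the R2 and q2LOO values reported on the slides. The values of n and p are assumptions (the 21x53 size of calD produced by the 2:1 split); because R2 = 1 here, they do not affect the result. The threshold 0.005 is taken as written on the slide.

```matlab
R2    = 1.0000;    % training-set R2 reported on the slides
q2LOO = -4.8574;   % leave-one-out q2 reported on the slides
n = 21;            % assumed number of calibration molecules (rows of calD)
p = 53;            % assumed number of descriptors (columns of calD)

q2Asym = 1 - (1 - R2) * (n / (n - p))^2;   % equals 1 whenever R2 = 1
delta  = 0.005;                            % threshold as written on the slide

fprintf('q2LOO - q2ASYMPTOT = %.4f\n', q2LOO - q2Asym);   % prints -5.8574
if q2LOO - q2Asym < delta
    disp('reject');
else
    disp('NOT reject');
end
```

The gap between the cross-validated and the asymptotic value is far below the threshold, so the model is rejected.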
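QUIK
4 correlated descriptors in M, response in y
[Matrices M and y were shown on the slide]
>> corr(M)
>> p = size(M,2);
>> CorrEV = svds(corr(M),p);   % eigenvalues of the correlation matrix
It seems possible to use svd instead of svds here: corr(M) is symmetric and positive semidefinite, so svd(corr(M)) returns its eigenvalues.
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
All in a function:
>> [KM] = QUIK(M)
KM = 1.0000
Maximum correlation between descriptors.
>> [KMY] = QUIK([M y])
KMY = 1.0000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
REJECT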
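The two extremes of the K index can be reproduced on small synthetic matrices. This is a minimal sketch, not the author's QUIK function: the names Kfun and quikK are hypothetical, corrcoef and eig are swapped in for corr and svds (corrcoef is part of base MATLAB), and the data are made up for illustration.

```matlab
% Hypothetical helpers mirroring the K formula on the slide
Kfun  = @(ev) sum(abs(ev / sum(ev) - 1 / numel(ev))) / (2 * (numel(ev) - 1) / numel(ev));
quikK = @(M) Kfun(eig(corrcoef(M)));   % eigenvalues of the correlation matrix

% Three perfectly correlated columns: eigenvalues 3, 0, 0
t = (1:10)';
Mcollinear = [t, 2*t + 1, -3*t];
Kcollinear = quikK(Mcollinear);

% Three mutually uncorrelated columns: correlation matrix is the identity
Morthogonal = [ 1  1  1;
               -1  1 -1;
                1 -1 -1;
               -1 -1  1];
Korthogonal = quikK(Morthogonal);

fprintf('K, perfectly correlated columns: %.4f\n', Kcollinear);    % prints 1.0000
fprintf('K, uncorrelated columns:         %.4f\n', Korthogonal);   % prints 0.0000
```

K runs from 0 (no correlation among the columns) to 1 (maximum correlation), which matches KM = 1.0000 for the four correlated descriptors on the slide. Appending y can only be judged useful if it moves K by more than the 0.05 margin in the rule.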