800 likes | 969 Views
Topic 18: Model Selection and Diagnostics. Variable Selection. We want to choose a “best” model that is a subset of the available explanatory variables Two separate problems How many explanatory variables should we use (i.e., subset size)
E N D
Variable Selection • We want to choose a “best” model that is a subset of the available explanatory variables • Two separate problems • How many explanatory variables should we use (i.e., subset size) • Given the subset size, which variables should we choose
KNNL Example • Page 350, Section 9.2 • Y : survival time of patient (liver op) • X’s (explanatory variables) are • Blood clotting score • Prognostic index • Enzyme function test • Liver function test
KNNL Example cont. • n = 54 patients • Start with the usual plots and descriptive statistics • Time-to-event data is often heavily skewed and typically transformed with a log
Data Tab delimited Data a1; infile'U:\.www\datasets512\CH09TA01.txt‘ delimiter='09'x; input blood prog enz liver age gender alcmod alcheavy surv lsurv; run; Ln(surv) Dummy variables for alcohol use
Log Transform of Y • Recall that regression model does not require Y to be Normally distributed • In this case, transform reduces influence of long right tail and often stabilizes the variance of the residuals
Scatterplots proccorr plot=matrix; var blood prog enz liver; run; proccorr plot=scatter; var blood prog enz liver; with lsurv; run;
The Two Problems in Variable Selection • To determine an appropriate subset size • Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC) • To determine best model of this fixed size • Might use R2
Adjusted R2 • R2 by its construction is guaranteed to increase with p • SSE cannot decrease with additional X and SSTO constant • Adjusted R2 uses df to account for p
Adjusted R2 • Want to find model that maximizes • Since MSTO will remain constant for a given data set • Depends only on Y • Equivalent information to MSE • Thus could also find choice of model that minimizes MSE • Details on pages 354-356
Cp Criterion • The basic idea is to compare subset models with the full model • A subset model is good if there is not substantial “bias” in the predicted values (relative to the full model) • Looks at the ratio of total mean squared error and the true error variance • See page 357-359 for details
Cp Criterion SSE based on a specific choice of p-1 variables MSE based on the full set of variables Select the full set and Cp=(n-p)-(n-2p)=p
Use of Cp • p is the number of regression coefficients including the intercept • A model is good according to this criterion if Cp ≤p • Rule: Pick the smallest model for which Cp is smaller than por pick the model that minimizes Cp, provided the minimum Cp is smaller than p
SBC (BIC) and AIC Criterion based on log(likelihood) plus a penalty for more complexity • AIC – minimize • SBC – minimize
Other approaches • PRESS (prediction SS) • For each case i • Delete the case and predict Y using a model based on the other n-1 cases • Look at the SS for observed minus predicted • Want to minimize the PRESS
Variable Selection • Additional proc reg model statement options useful in variable selection • INCLUDE=n forces the first n explanatory variables into all models • BEST=n limits the output to the best n models of each subset size or total • START=n limits output to models that include at least n explanatory variables
Variable Selection • Step type procedures • Forward selection (Step up) • Backward elimination (Step down) • Stepwise (forward selection with a backward glance) • Very popular but now have much better search techniques like BEST
Ordering models of the same subset size • Use R2 or SSE • This approach can lead us to consider several models that give us approximately the same predicted values • May need to apply knowledge of the subject matter to make a final selection • Not that important if prediction is the key goal
Proc Reg proc reg data=a1; model lsurv= blood prog enz liver/ selection=rsquare cp aic sbc b best=3; run;
Proc Reg proc reg data=a1; model lsurv= blood prog enz liver/ selection=cp aic sbc b best=3; run;
Selection Results WARNING: “selection=cp” just lists the models in order based on lowest C(p), regardless of whether it is good or not
How to Choose with C(p) • Want small C(p) • Want C(p) near p In original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria Can be somewhat subjective when determining “near”
Creates data set with estimates & criteria Proc Reg proc reg data=a1 outest=b1; model lsurv=blood prog enz liver/ selection=rsquare cp aic sbc b; run;quit; symbol1 v=circle i=none; symbol2 v=none i=join; proc gplot data=b1; plot _Cp_*_P_ _P_*_P_ / overlay; run;
Model Validation • Since data used to generate parameter estimates, you’d expect model to predict fitted Y’s well • Want to check model predictive ability for a separate data set • Various techniques of cross validation (data split, leave one out) are possible
Regression Diagnostics • Partial regression plots • Studentized deleted residuals • Hat matrix diagonals • Dffits, Cook’s D, DFBETAS • Variance inflation factor • Tolerance
KNNL Example • Page 386, Section 10.1 • Y is amount of life insurance • X1 is average annual income • X2 is a risk aversion score • n = 18 managers
Read in the data set data a1; infile ‘../data/ch10ta01.txt'; input income risk insur;
Partial regression plots • Also called added variable plots or adjusted variable plots • One plot for each Xi
Partial regression plots • These plots show the strength of the marginal relationship between Y and Xi in the full model . • They can also detect • Nonlinear relationships • Heterogeneous variances • Outliers
Partial regression plots • Consider plot for X1 • Use the other X’s to predict Y • Use the other X’s to predict X1 • Plot the residuals from the first regression vs the residuals from the second regression
The partial option with proc reg and plots= proc reg data=a1 plots=partialplot; model insur=income risk /partial; run;
Output • The partial option on the model statement in proc reg generates graphs in the output window • These are ok for some purposes but we prefer better looking plots • To generate these plots we follow the regression steps outlined earlier and use gplot or plots=partialplot
Partial regression plots *partial regression plot for risk; proc reg data=a1; model insur risk = income; output out=a2 r=resins resris; symbol1 v=circle i=sm70; proc gplot data=a2; plot resins*resris; run;