1 / 80

Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics. Variable Selection. We want to choose a “best” model that is a subset of the available explanatory variables Two separate problems How many explanatory variables should we use (i.e., subset size)

penha
Download Presentation

Topic 18: Model Selection and Diagnostics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Topic 18: Model Selection and Diagnostics

  2. Variable Selection • We want to choose a “best” model that is a subset of the available explanatory variables • Two separate problems • How many explanatory variables should we use (i.e., subset size) • Given the subset size, which variables should we choose

  3. KNNL Example • Page 350, Section 9.2 • Y : survival time of patient (liver op) • X’s (explanatory variables) are • Blood clotting score • Prognostic index • Enzyme function test • Liver function test

  4. KNNL Example cont. • n = 54 patients • Start with the usual plots and descriptive statistics • Time-to-event data is often heavily skewed and typically transformed with a log

  5. Data Tab delimited Data a1; infile'U:\.www\datasets512\CH09TA01.txt‘ delimiter='09'x; input blood prog enz liver age gender alcmod alcheavy surv lsurv; run; Ln(surv) Dummy variables for alcohol use

  6. Data

  7. Log Transform of Y • Recall that regression model does not require Y to be Normally distributed • In this case, transform reduces influence of long right tail and often stabilizes the variance of the residuals

  8. Scatterplots proccorr plot=matrix; var blood prog enz liver; run; proccorr plot=scatter; var blood prog enz liver; with lsurv; run;

  9. Correlation Summary

  10. The Two Problems in Variable Selection • To determine an appropriate subset size • Might use adjusted R2, Cp, MSE, PRESS, AIC, SBC (BIC) • To determine best model of this fixed size • Might use R2

  11. Adjusted R2 • R2 by its construction is guaranteed to increase with p • SSE cannot decrease with additional X and SSTO constant • Adjusted R2 uses df to account for p

  12. Adjusted R2 • Want to find model that maximizes • Since MSTO will remain constant for a given data set • Depends only on Y • Equivalent information to MSE • Thus could also find choice of model that minimizes MSE • Details on pages 354-356

  13. Cp Criterion • The basic idea is to compare subset models with the full model • A subset model is good if there is not substantial “bias” in the predicted values (relative to the full model) • Looks at the ratio of total mean squared error and the true error variance • See page 357-359 for details

  14. Cp Criterion SSE based on a specific choice of p-1 variables MSE based on the full set of variables Select the full set and Cp=(n-p)-(n-2p)=p

  15. Use of Cp • p is the number of regression coefficients including the intercept • A model is good according to this criterion if Cp ≤p • Rule: Pick the smallest model for which Cp is smaller than por pick the model that minimizes Cp, provided the minimum Cp is smaller than p

  16. SBC (BIC) and AIC Criterion based on log(likelihood) plus a penalty for more complexity • AIC – minimize • SBC – minimize

  17. Other approaches • PRESS (prediction SS) • For each case i • Delete the case and predict Y using a model based on the other n-1 cases • Look at the SS for observed minus predicted • Want to minimize the PRESS

  18. Variable Selection • Additional proc reg model statement options useful in variable selection • INCLUDE=n forces the first n explanatory variables into all models • BEST=n limits the output to the best n models of each subset size or total • START=n limits output to models that include at least n explanatory variables

  19. Variable Selection • Step type procedures • Forward selection (Step up) • Backward elimination (Step down) • Stepwise (forward selection with a backward glance) • Very popular but now have much better search techniques like BEST

  20. Ordering models of the same subset size • Use R2 or SSE • This approach can lead us to consider several models that give us approximately the same predicted values • May need to apply knowledge of the subject matter to make a final selection • Not that important if prediction is the key goal

  21. Proc Reg proc reg data=a1; model lsurv= blood prog enz liver/ selection=rsquare cp aic sbc b best=3; run;

  22. Selection Results

  23. Selection Results

  24. Proc Reg proc reg data=a1; model lsurv= blood prog enz liver/ selection=cp aic sbc b best=3; run;

  25. Selection Results WARNING: “selection=cp” just lists the models in order based on lowest C(p), regardless of whether it is good or not

  26. How to Choose with C(p) • Want small C(p) • Want C(p) near p In original paper, it was suggested to plot C(p) versus p and consider the smallest model that satisfies these criteria Can be somewhat subjective when determining “near”

  27. Creates data set with estimates & criteria Proc Reg proc reg data=a1 outest=b1; model lsurv=blood prog enz liver/ selection=rsquare cp aic sbc b; run;quit; symbol1 v=circle i=none; symbol2 v=none i=join; proc gplot data=b1; plot _Cp_*_P_ _P_*_P_ / overlay; run;

  28. Start to approach C(p)=p line here

  29. Model Validation • Since data used to generate parameter estimates, you’d expect model to predict fitted Y’s well • Want to check model predictive ability for a separate data set • Various techniques of cross validation (data split, leave one out) are possible

  30. Regression Diagnostics • Partial regression plots • Studentized deleted residuals • Hat matrix diagonals • Dffits, Cook’s D, DFBETAS • Variance inflation factor • Tolerance

  31. KNNL Example • Page 386, Section 10.1 • Y is amount of life insurance • X1 is average annual income • X2 is a risk aversion score • n = 18 managers

  32. Read in the data set data a1; infile ‘../data/ch10ta01.txt'; input income risk insur;

  33. Partial regression plots • Also called added variable plots or adjusted variable plots • One plot for each Xi

  34. Partial regression plots • These plots show the strength of the marginal relationship between Y and Xi in the full model . • They can also detect • Nonlinear relationships • Heterogeneous variances • Outliers

  35. Partial regression plots • Consider plot for X1 • Use the other X’s to predict Y • Use the other X’s to predict X1 • Plot the residuals from the first regression vs the residuals from the second regression

  36. The partial option with proc reg and plots= proc reg data=a1 plots=partialplot; model insur=income risk /partial; run;

  37. Output

  38. Output

  39. Output • The partial option on the model statement in proc reg generates graphs in the output window • These are ok for some purposes but we prefer better looking plots • To generate these plots we follow the regression steps outlined earlier and use gplot or plots=partialplot

  40. Partial regression plots *partial regression plot for risk; proc reg data=a1; model insur risk = income; output out=a2 r=resins resris; symbol1 v=circle i=sm70; proc gplot data=a2; plot resins*resris; run;

More Related