Choosing the “best” model (Session 08)
Learning Objectives
At the end of this session, you will be able to
• use a simple descriptive approach to select the most appropriate subset of explanatory variables
• apply methods of variable selection (based on statistical tests) in a meaningful way to get the “best” model
• appreciate the effect on t-probabilities when x’s are added to or dropped from a model
• understand the dangers of using automatic selection procedures
Example of choosing “best” set of x’s
Consider data (fictitious) from a retrospective study of patients surviving less than 4 months after being diagnosed as having acute leukaemia.
Objective: To identify factors affecting survival time.
Variables were:
y = survival time (days) after diagnosis
x1 = no. of chemotherapy sessions
x2 = total volume of blood transfused
x3 = no. of days of hospital care
x4 = age of patient (years).
Summary statistics for all regressions
How many possible regression models exist?
Example with x1 and x3 to show summaries:
---------+---------------------------------------
Source   |       SS    df       MS      F  Prob>F
---------+---------------------------------------
Model    | 1488.691     2  744.346   6.07  0.0188
Residual | 1227.072    10  122.707
---------+---------------------------------------
Total    | 2715.763    12  226.314
---------+---------------------------------------
No. of parameters fitted (p) = 3
R2p = 1488.69 / 2715.76 = 0.5482
Adjusted R2p = 1 – 122.71 / 226.31 = 0.4578
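The summaries above can be checked with a few lines of arithmetic. The sketch below recomputes R2 and adjusted R2 from the ANOVA table and counts the candidate models, assuming every subset of the four x's (including the constant-only model) counts as one model. The variable names are illustrative only.

```python
from itertools import combinations

# ANOVA values copied from the table for the model with x1 and x3.
ss_model, ss_resid, ss_total = 1488.691, 1227.072, 2715.763
df_resid, df_total = 10, 12

# R2 = model SS / total SS; adjusted R2 = 1 - residual MS / total MS.
r2 = ss_model / ss_total
adj_r2 = 1 - (ss_resid / df_resid) / (ss_total / df_total)

# Each subset of the candidate x's defines one regression model.
# The empty subset is the constant-only model (an assumption about what to count).
candidates = ["x1", "x2", "x3", "x4"]
subsets = [combo
           for k in range(len(candidates) + 1)
           for combo in combinations(candidates, k)]

print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}")
print(f"{len(subsets)} possible models")
```

Leaving out the constant-only model gives one fewer, i.e. the number of models that contain at least one x.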
A descriptive approach… continued
Plot R2 versus no. of parameters (p) in model.
Which model would you select on the basis of these results?

A descriptive approach… continued
Alternatively, plot residual mean square. Small residual mean square is good!
Which model would you select on the basis of the residual mean square?
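The link between the residual mean square and the adjusted R2 can be written out explicitly. In the notation below (introduced here for illustration, not taken from the slides), RSS_p is the residual sum of squares of a model with p parameters, TSS is the total sum of squares, and n is the number of observations:

```latex
\[
\bar{R}^2_p
  \;=\; 1 - \frac{\mathrm{RSS}_p/(n-p)}{\mathrm{TSS}/(n-1)}
  \;=\; 1 - \frac{\text{residual mean square}}{\text{total mean square}}
\]
% For the model with x1 and x3 (n = 13, p = 3):
\[
\bar{R}^2_3 \;=\; 1 - \frac{122.707}{226.314} \;=\; 0.4578
\]
```

Because the total mean square is the same for every model fitted to the same y, choosing the model with the smallest residual mean square is equivalent to choosing the one with the largest adjusted R2.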
An inferential approach…
• Use a sequential procedure to select variables that contribute most, and significantly, to the regression model.
• Three popular methods exist:
  • Forward selection
  • Backward elimination
  • Stepwise regression
Forward selection …
Select the “best” single variable - see slide 6
Ask, “Is it contributing significantly?”
Answer: Yes (see below)
-----------------------------------------
y      |    Coef.  Std. Err.       t   P>|t|
-------+---------------------------------
x4     |  -.73816     .1546    -4.77   0.001
const. |   117.57    5.2622    22.34   0.000
-----------------------------------------
Now consider 2-variable models with x4.
Two-variable models with x4
-----------------------------------------
y      |    Coef.   Std.Err.       t   P>|t|
-------+---------------------------------
x4     |  -.61395    .04864   -12.62   0.000
x1     |   1.4400    .13842    10.40   0.000
const. |   103.10    2.1240    48.54   0.000
-----------------------------------------
x4     |  -.45694    .69595    -0.66   0.526
x2     |   .31090    .74861     0.42   0.687
const. |   94.160    56.627     1.66   0.127
-----------------------------------------
x4     |  -.72460    .07233   -10.02   0.000
x3     |  -1.1999    .18902    -6.35   0.000
const. |   131.28    3.2748    40.09   0.000
-----------------------------------------
Three-variable models with x4, x1
-----------------------------------------
y      |    Coef.   Std.Err.       t   P>|t|
-------+---------------------------------
x4     |  -.23654    .17329    -1.37   0.205
x1     |   1.4519    .11700    12.41   0.000
x2     |   .41611    .18561     2.24   0.052
const. |   71.648    14.142     5.07   0.001
-----------------------------------------
x4     |  -.64280    .04454   -14.43   0.000
x1     |   1.0519    .22368     4.70   0.001
x3     |  -.41004    .19923    -2.06   0.070
const. |   111.68    4.5625    24.48   0.000
-----------------------------------------
Model with x1, x2 and x4 would be selected, despite x4 now being non-significant!
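The forward-selection decisions can be replayed from the reported t-ratios and p-values. The sketch below does not refit any model, since the data are not given. It only applies the rule "add the candidate with the largest |t|, provided its p-value is below an entry threshold". The threshold of 0.10 is an assumption, chosen here because the slides accept x2 at p = 0.052. The names `steps` and `P_ENTER` are illustrative.

```python
# (t, p) for each candidate variable when added to the current model,
# copied from the two- and three-variable tables.
steps = [
    # candidates added to a model that already contains x4
    {"x1": (10.40, 0.000), "x2": (0.42, 0.687), "x3": (-6.35, 0.000)},
    # candidates added to a model that contains x4 and x1
    {"x2": (2.24, 0.052), "x3": (-2.06, 0.070)},
]

P_ENTER = 0.10  # assumed entry threshold; not stated in the slides

selected = ["x4"]  # best single variable (t = -4.77, p = 0.001)
for candidates in steps:
    # The strongest candidate is the one with the largest absolute t-ratio.
    best = max(candidates, key=lambda v: abs(candidates[v][0]))
    t_ratio, p_value = candidates[best]
    if p_value >= P_ENTER:
        break  # no remaining candidate contributes significantly
    selected.append(best)

print(selected)
```

Note that the rule never looks back at variables already in the model, which is why x4 stays in even after its p-value rises to 0.205.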
Backward elimination gives x1, x2
---------------------------------------
y    |    Coef.   Std.Err.       t   P>|t|
-----+---------------------------------
x1   |   1.5511    .74477     2.08   0.071
x2   |   .51017     .7238     0.70   0.501
x3   |   .10191     .7547     0.14   0.896
x4   |  -.14406     .7091    -0.20   0.844
---------------------------------------
x1   |   1.4519    .11700    12.41   0.000
x2   |   .41611    .18561     2.24   0.052
x4   |  -.23654    .17329    -1.37   0.205
---------------------------------------
x1   |   1.4683    .12130    12.10   0.000
x2   |   .66225    .04585    14.44   0.000
---------------------------------------
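The backward-elimination path can be replayed in the same way from the reported p-values. This is a sketch under the assumption of a removal threshold of 0.10, which the slides do not state. It looks up the p-values from the three tables rather than refitting anything, and `p_values` and `P_REMOVE` are illustrative names.

```python
# p-values copied from the three tables, keyed by the x's in each model.
p_values = {
    ("x1", "x2", "x3", "x4"): {"x1": 0.071, "x2": 0.501, "x3": 0.896, "x4": 0.844},
    ("x1", "x2", "x4"): {"x1": 0.000, "x2": 0.052, "x4": 0.205},
    ("x1", "x2"): {"x1": 0.000, "x2": 0.000},
}

P_REMOVE = 0.10  # assumed removal threshold; not stated in the slides

model = ("x1", "x2", "x3", "x4")  # start from the full model
dropped = []
while model:
    fit = p_values[model]
    # The weakest variable is the one with the largest p-value.
    worst = max(fit, key=fit.get)
    if fit[worst] < P_REMOVE:
        break  # every remaining variable is significant
    dropped.append(worst)
    model = tuple(v for v in model if v != worst)

print("dropped:", dropped, "final model:", model)
```

In the full model no variable is significant at the 5% level, yet x1 and x2 are both highly significant once x3 and x4 have been removed, which shows how strongly the t-probabilities depend on which other x's are present.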
Stepwise selection procedure…
This is similar to forward selection, but at each stage of the process, all x’s in the model are re-assessed to check if those that entered the model at an earlier stage still remain “important”.
Note: Software packages allow automatic use of one of these with pre-specified p-values for selection and deletion of variables. Usually available only with quantitative x’s.
Discussion… in small groups
• Look back at the results. What do you observe with the forward and backward procedures? Do they give the same results?
• Did the selection using the forward procedure seem sensible, given that for x4 the p-value = 0.205?
• Can you work out what model would result from a stepwise selection procedure?
• Is it a good idea to use such automatic selection procedures available in software packages? If not, why not?
Discussion continued…
• Suppose a medical researcher told you that a model without x2 was not meaningful. How would you proceed with your model selection?
• What other latent (lurking) variables, measurable or non-measurable, might affect y?
• What further steps would you undertake before accepting the final model?
Practical work follows to ensure learning objectives are achieved…