Model selection and model building
Model selection: Selection of predictor variables
Statement of problem • A common problem is that there is a large set of candidate predictor variables. • The goal is to choose a small subset from the larger set so that the resulting regression model is simple yet has good predictive ability.
Example: Cement data • Response y: heat evolved in calories during hardening of cement on a per gram basis • Predictor x1: % of tricalcium aluminate • Predictor x2: % of tricalcium silicate • Predictor x3: % of tetracalcium alumino ferrite • Predictor x4: % of dicalcium silicate
Two basic methods of selecting predictors • Stepwise regression: Enter and remove variables, in a stepwise manner, until no justifiable reason to enter or remove more. • Best subsets regression: Select the subset of variables that do the best at meeting some well-defined objective criterion.
Stepwise regression: the idea • Start with no predictors in the model. • At each step, enter or remove a variable based on partial F-tests. • Stop when no more variables can be justifiably entered or removed.
Stepwise regression: the steps • Specify an Alpha-to-Enter level (here 0.15) and an Alpha-to-Remove level (here 0.15). • Start with no predictors in the model. • Enter the predictor with the smallest P-value based on its partial F-statistic (equivalently, its t-statistic). If that P-value > 0.15, stop: none of the predictors have good predictive ability. Otherwise …
Stepwise regression: the steps (cont’d) • Add the predictor with the smallest P-value (below 0.15) based on its partial F-statistic (a t-statistic). If none of the remaining predictors yield P-values < 0.15, stop. • If the P-value for any predictor already in the model rises above 0.15, remove the violating predictor. • Continue the above two steps until no more predictors can be entered or removed. A minimal sketch of this loop appears below.
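A minimal sketch of this enter/remove loop in Python with statsmodels (the DataFrame `X` of candidate predictors, the response `y`, and the 0.15 cutoffs are illustrative assumptions; Minitab's implementation may differ in detail):

```python
import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Stepwise selection: enter/remove predictors by partial F (t) P-values."""
    selected = []
    while True:
        changed = False
        # Entry step: among predictors not yet in the model, find the one
        # with the smallest P-value; enter it if below alpha_enter.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                     for c in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                selected.append(best)
                changed = True
        # Removal step: refit; drop the worst predictor if its P-value
        # now exceeds alpha_remove.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues[selected].idxmax()
            if fit.pvalues[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```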
Example: Cement data

Step #1: fit each single-predictor model.

Predictor      Coef  SE Coef      T      P
Constant     81.479    4.927  16.54  0.000
x1           1.8687   0.5264   3.55  0.005

Predictor      Coef  SE Coef      T      P
Constant     57.424    8.491   6.76  0.000
x2           0.7891   0.1684   4.69  0.001

Predictor      Coef  SE Coef      T      P
Constant    110.203    7.948  13.87  0.000
x3          -1.2558   0.5984  -2.10  0.060

Predictor      Coef  SE Coef      T      P
Constant    117.568    5.262  22.34  0.000
x4          -0.7382   0.1546  -4.77  0.001
Step #2: with x4 in the model, try adding each remaining predictor.

Predictor      Coef  SE Coef       T      P
Constant    103.097    2.124   48.54  0.000
x4         -0.61395  0.04864  -12.62  0.000
x1           1.4400   0.1384   10.40  0.000

Predictor      Coef  SE Coef       T      P
Constant      94.16    56.63    1.66  0.127
x4          -0.4569   0.6960   -0.66  0.526
x2           0.3109   0.7486    0.42  0.687

Predictor      Coef  SE Coef       T      P
Constant    131.282    3.275   40.09  0.000
x4         -0.72460  0.07233  -10.02  0.000
x3          -1.1999   0.1890   -6.35  0.000
Step #3: with x1 and x4 in the model, try adding each remaining predictor.

Predictor      Coef  SE Coef       T      P
Constant      71.65    14.14    5.07  0.001
x4          -0.2365   0.1733   -1.37  0.205
x1           1.4519   0.1170   12.41  0.000
x2           0.4161   0.1856    2.24  0.052

Predictor      Coef  SE Coef       T      P
Constant    111.684    4.562   24.48  0.000
x4         -0.64280  0.04454  -14.43  0.000
x1           1.0519   0.2237    4.70  0.001
x3          -0.4100   0.1992   -2.06  0.070
Step #4: x2 enters, and x4 is then removed (P = 0.205 > 0.15), leaving the model with x1 and x2.

Predictor      Coef  SE Coef      T      P
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000
x2          0.66225  0.04585  14.44  0.000
Next, try re-entering x4 or entering x3: neither achieves P < 0.15.

Predictor      Coef  SE Coef      T      P
Constant      71.65    14.14   5.07  0.001
x1           1.4519   0.1170  12.41  0.000
x2           0.4161   0.1856   2.24  0.052
x4          -0.2365   0.1733  -1.37  0.205

Predictor      Coef  SE Coef      T      P
Constant     48.194    3.913  12.32  0.000
x1           1.6959   0.2046   8.29  0.000
x2          0.65691  0.04423  14.85  0.000
x3           0.2500   0.1847   1.35  0.209
So the stepwise procedure stops: the final model contains x1 and x2.

Predictor      Coef  SE Coef      T      P
Constant     52.577    2.286  23.00  0.000
x1           1.4683   0.1213  12.10  0.000
x2          0.66225  0.04585  14.44  0.000
Stepwise Regression: y versus x1, x2, x3, x4

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is y on 4 predictors, with N = 13

Step            1        2        3        4
Constant   117.57   103.10    71.65    52.58

x4         -0.738   -0.614   -0.237
T-Value     -4.77   -12.62    -1.37
P-Value     0.001    0.000    0.205

x1                    1.44     1.45     1.47
T-Value              10.40    12.41    12.10
P-Value              0.000    0.000    0.000

x2                            0.416    0.662
T-Value                        2.24    14.44
P-Value                       0.052    0.000

S            8.96     2.73     2.31     2.41
R-Sq        67.45    97.25    98.23    97.87
R-Sq(adj)   64.50    96.70    97.64    97.44
C-p         138.7      5.5      3.0      2.7
Drawbacks of stepwise regression • The final model is not guaranteed to be optimal in any specified sense. • The procedure yields a single final model, although in practice there are often several almost equally good models.
Best subsets regression • If there are p-1 possible predictors, then there are 2^(p-1) possible regression models containing them. • For example, 10 predictors yield 2^10 = 1024 possible regression models. • A best subsets algorithm determines the best subsets of each size, so that the choice of the final model can be made by the researcher. A sketch of such an algorithm appears below.
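Under the same illustrative assumptions as before (a predictor DataFrame `X` and response `y`), a sketch of the enumeration in Python, collecting the same statistics Minitab reports (R-Sq, R-Sq(adj), C-p, S):

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

def best_subsets(X, y):
    """Fit every non-empty subset of predictors and tabulate summary statistics."""
    n = len(y)
    mse_full = sm.OLS(y, sm.add_constant(X)).fit().mse_resid  # full-model MSE
    rows = []
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            p = k + 1  # parameters, including the intercept
            rows.append({"Vars": subset,
                         "R-Sq": 100 * fit.rsquared,
                         "R-Sq(adj)": 100 * fit.rsquared_adj,
                         "C-p": fit.ssr / mse_full - (n - 2 * p),
                         "S": np.sqrt(fit.mse_resid)})
    return pd.DataFrame(rows)
```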
What is used to judge “best”? • R-squared • Adjusted R-squared • MSE (or S = square root of MSE) • Mallows’ Cp
R-squared Use the R-squared values to find the point where adding more predictors is not worthwhile because it leads to a very small increase in R-squared.
Adjusted R-squared or MSE Adjusted R-squared increases only if MSE decreases, so adjusted R-squared and MSE provide equivalent information. Find a few subsets for which MSE is smallest (or adjusted R-squared is largest) or so close to the smallest (largest) that adding more predictors is not worthwhile.
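This equivalence follows from a standard identity: for a model with p parameters,

$$R^2_{adj} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{MSE}{SSTO/(n-1)},$$

and since SSTO/(n-1) is fixed for a given response, the subset with the smallest MSE is exactly the subset with the largest adjusted R-squared.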
Mallows’ Cp criterion

Mallows’ Cp statistic,

$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p),$$

where $SSE_p$ is the error sum of squares for the subset model with $p$ parameters (including the intercept) and the denominator is the MSE of the full model, is an estimator of the total standardized mean square error of prediction:

$$\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[\left(\hat{Y}_i - E(Y_i)\right)^2\right],$$

which equals:

$$\Gamma_p = \frac{1}{\sigma^2} \left[ \sum_{i=1}^{n} \left(E(\hat{Y}_i) - E(Y_i)\right)^2 + \sum_{i=1}^{n} \operatorname{Var}(\hat{Y}_i) \right].$$
Using the Cp criterion • Subsets with small Cp values have a small total (standardized) mean square error of prediction. • When the Cp value is also near p (the number of parameters in the subset model, including the intercept), the bias of the regression model is small.
Using the Cp criterion (cont’d) • So, identify subsets of predictors for which: • the Cp value is smallest, and • the Cp value is near p (if possible). • Note, though, that for the full model Cp = p always, so the full model is always judged to be a good candidate model by this criterion.
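As a check against the cement best-subsets output on the next slide: for the two-predictor (x1, x2) model, n = 13 and p = 3, and using S = 2.4063 for that model and S = 2.4460 for the full model (so $SSE_p = S^2(n-p)$ and $MSE_{full} = S_{full}^2$):

$$C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p) = \frac{(2.4063)^2 \times 10}{(2.4460)^2} - (13 - 6) \approx 9.68 - 7 \approx 2.7.$$

Likewise, for the full model $SSE/MSE = n - p$, so $C_p = (n-p) - (n-2p) = p$, which is why the four-predictor model shows C-p = 5.0.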
Best Subsets Regression: y versus x1, x2, x3, x4

Response is y

Vars   R-Sq  R-Sq(adj)    C-p       S   x1  x2  x3  x4
   1   67.5       64.5  138.7  8.9639                X
   1   66.6       63.6  142.5  9.0771        X
   2   97.9       97.4    2.7  2.4063    X   X
   2   97.2       96.7    5.5  2.7343    X           X
   3   98.2       97.6    3.0  2.3087    X   X       X
   3   98.2       97.6    3.0  2.3121    X   X   X
   4   98.2       97.4    5.0  2.4460    X   X   X   X
Stepwise Regression: PIQ versus MRI, Height, Weight

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is PIQ on 3 predictors, with N = 38

Step            1        2
Constant    4.652  111.276

MRI          1.18     2.06
T-Value      2.45     3.77
P-Value     0.019    0.001

Height              -2.73
T-Value             -2.75
P-Value             0.009

S            21.2     19.5
R-Sq        14.27    29.49
R-Sq(adj)   11.89    25.46
C-p           7.3      2.0
Best Subsets Regression: PIQ versus MRI, Height, Weight

Response is PIQ

Vars   R-Sq  R-Sq(adj)   C-p       S   MRI  Height  Weight
   1   14.3       11.9   7.3  21.212    X
   1    0.9        0.0  13.8  22.810          X
   2   29.5       25.5   2.0  19.510    X     X
   2   19.3       14.6   6.9  20.878    X             X
   3   29.5       23.3   4.0  19.794    X     X       X
The regression equation is
PIQ = 111 + 2.06 MRI - 2.73 Height

Predictor      Coef  SE Coef      T      P
Constant     111.28    55.87   1.99  0.054
MRI          2.0606   0.5466   3.77  0.001
Height      -2.7299   0.9932  -2.75  0.009

S = 19.51   R-Sq = 29.5%   R-Sq(adj) = 25.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       2   5572.7  2786.4  7.32  0.002
Error           35  13321.8   380.6
Total           37  18894.6

Source   DF  Seq SS
MRI       1  2697.1
Height    1  2875.6
Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is BP on 6 predictors, with N = 20

Step            1        2        3
Constant    2.205  -16.579  -13.667

Weight      1.201    1.033    0.906
T-Value     12.92    33.15    18.49
P-Value     0.000    0.000    0.000

Age                  0.708    0.702
T-Value              13.23    15.96
P-Value              0.000    0.000

BSA                             4.6
T-Value                        3.04
P-Value                       0.008

S            1.74    0.533    0.437
R-Sq        90.26    99.14    99.45
R-Sq(adj)   89.72    99.04    99.35
C-p         312.8     15.1      6.4
Best Subsets Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Response is BP

Vars   R-Sq  R-Sq(adj)    C-p        S   Age  Weight  BSA  Dur  Pulse  Stress
   1   90.3       89.7  312.8   1.7405          X
   1   75.0       73.6  829.1   2.7903                 X
   2   99.1       99.0   15.1  0.53269    X     X
   2   92.0       91.0  256.6   1.6246          X      X
   3   99.5       99.4    6.4  0.43705    X     X      X
   3   99.2       99.1   14.1  0.52012    X     X                X
   4   99.5       99.4    6.4  0.42591    X     X      X         X
   4   99.5       99.4    7.1  0.43500    X     X      X    X
   5   99.6       99.4    7.0  0.42142    X     X      X    X    X
   5   99.5       99.4    7.7  0.43078    X     X      X         X      X
   6   99.6       99.4    7.0  0.40723    X     X      X    X    X      X
The regression equation is
BP = -13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor      Coef  SE Coef      T      P
Constant    -13.667    2.647  -5.16  0.000
Age         0.70162  0.04396  15.96  0.000
Weight      0.90582  0.04899  18.49  0.000
BSA           4.627    1.521   3.04  0.008

S = 0.4370   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  556.94  185.65  971.93  0.000
Error           16    3.06    0.19
Total           19  560.00

Source    DF  Seq SS
Age        1  243.27
Weight     1  311.91
BSA        1    1.77
Stepwise regression in Minitab • Stat >> Regression >> Stepwise … • Specify response and all possible predictors. • If desired, specify predictors that must be included in every model. • Select OK. Results appear in session window.
Best subsets regression in Minitab • Stat >> Regression >> Best subsets … • Specify response and all possible predictors. • If desired, specify predictors that must be included in every model. • Select OK. Results appear in session window.
The first step • Decide on the type of model needed • Predictive: model used to predict the response variable from a chosen set of predictors. • Theoretical: model based on theoretical relationship between response and predictors. • Control: model used to control a response variable by manipulating predictor variables.
The first step (cont’d) • Decide on the type of model needed • Inferential: model used to explore strength of relationships between response and predictors. • Data summary: model used merely as a way to summarize a large set of data by a single equation.
The second step • Decide which predictor and response variables to collect data on. • Collect the data.
The third step • Explore the data • Check for outliers, gross data errors, missing values on a univariate basis. • Study bivariate relationships to reveal other outliers, to suggest possible transformations, to identify possible multicollinearities.
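A brief sketch of such checks in Python with pandas (the file name `cement.csv` is a hypothetical placeholder):

```python
import pandas as pd

data = pd.read_csv("cement.csv")      # hypothetical data file
print(data.describe())                # univariate summaries: outliers, gross errors
print(data.isna().sum())              # missing values, variable by variable
print(data.corr())                    # pairwise correlations: possible multicollinearity
pd.plotting.scatter_matrix(data)      # bivariate plots: outliers, transformations
```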
The fourth step • Randomly divide the data into a training set and a test set: • The training set, with at least 15-20 error d.f., is used to fit the model. • The test set is used for cross-validation of the fitted model.
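A minimal sketch of such a random split (the DataFrame name `data`, the 25% test fraction, and the seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(12345)    # fixed seed for a reproducible split
n = len(data)
test_idx = rng.choice(n, size=n // 4, replace=False)  # hold out ~25% for testing
mask = np.zeros(n, dtype=bool)
mask[test_idx] = True
train, test = data[~mask], data[mask]
```

Check that the training set leaves at least 15-20 error d.f. for the largest model you intend to fit.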
The fifth step • Using the training set, fit several candidate models: • Use best subsets regression. • Use stepwise regression (which yields only one model unless you run it with different alpha-to-remove and alpha-to-enter values).
The sixth step • Select and evaluate a few “good” models: • Select based on adjusted R-squared, Mallows’ Cp, and the number and nature of the predictors. • Evaluate the selected models for violations of the model assumptions. • If none of the models provides a satisfactory fit, try something else, such as collecting more data, using different predictors, or trying a different class of model …
The final step • Select the final model: • Compare the competing models by cross-validating them against the test data; a sketch appears below. • The model with the larger cross-validation R-squared is the better predictive model. • Consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
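A sketch of the cross-validation comparison, assuming each candidate model was fit on the training set with statsmodels (the function and variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

def cv_r2(fit, X_test, y_test):
    """Cross-validation R-squared: 1 - SSE(predictions) / SSTO on the test set."""
    y_hat = fit.predict(sm.add_constant(X_test, has_constant="add"))
    sse = np.sum((y_test - y_hat) ** 2)
    ssto = np.sum((y_test - y_test.mean()) ** 2)
    return 1 - sse / ssto
```

The candidate whose cv_r2 value is larger is the better predictive model; a value far below the training R-squared signals overfitting.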