Stats 330: Lecture 19

Models with many continuous and categorical explanatory variables

Plan of the day
In today’s lecture, we look at some general strategies for choosing models having lots of continuous and categorical explanatory variables, and discuss an example.

General Principle
• For a problem with both continuous and categorical explanatory variables, the most general model is to fit separate regressions for each possible combination of the factor levels.
• That is, we allow the categorical variables to interact with each other and with the continuous variables.

Illustration
• Two factors A and B, two continuous explanatory variables X and Z
• General model is y ~ A*B*X + A*B*Z
• Suppose A has a levels and B has b levels, so there are a × b factor level combinations
• Each combination has a separate regression with 3 parameters
    • Constant term
    • Coefficient of X
    • Coefficient of Z
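As a rough check of the parameter count, the sketch below fits the general model to simulated data and counts the fitted coefficients. The data frame sim.df, the choice a = 3 and b = 2, and the coefficient values used to simulate y are all made up for illustration; they are not from the lecture.

```r
# Simulated (hypothetical) data: a = 3 levels of A, b = 2 levels of B,
# with 10 observations in every factor level combination.
set.seed(330)
a <- 3
b <- 2
sim.df <- expand.grid(A = factor(1:a), B = factor(1:b), obs = 1:10)
sim.df$X <- rnorm(nrow(sim.df))
sim.df$Z <- rnorm(nrow(sim.df))
sim.df$y <- 1 + 2 * sim.df$X - sim.df$Z + rnorm(nrow(sim.df))

# General model: a separate plane for each A x B combination
general.lm <- lm(y ~ A*B*X + A*B*Z, data = sim.df)

# One constant, one X coefficient and one Z coefficient per combination
n.coef <- length(coef(general.lm))
n.coef       # number of fitted coefficients
3 * a * b    # count expected from the argument on the slide
```

The two printed numbers should agree, since every combination has enough data here to estimate its own plane.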

Illustration (Cont)
• There are a × b constant terms; we can arrange them in a table
• Can split the table up into main effects and interactions as in two-way anova
• Listed in output as Intercept, A, B and A:B

Illustration (Cont)
• There are a × b X-coefficients; we can also arrange them in a table
• Again, we can split the table up into main effects and interactions as in two-way anova
• Listed in output as X, A:X, B:X and A:B:X
• Ditto for Z
• If all the A:X, B:X, A:B:X interactions are zero, the coefficient of X is the same for all the a × b regressions
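One way to examine whether the X coefficient is common to all combinations is to compare the general model with a submodel that drops A:X, B:X and A:B:X. The sketch below does this on simulated data in which the true X slope really is the same everywhere; the data frame sim.df and all simulated values are hypothetical.

```r
# Simulated (hypothetical) data with a common X slope in every combination
set.seed(330)
a <- 3
b <- 2
sim.df <- expand.grid(A = factor(1:a), B = factor(1:b), obs = 1:10)
sim.df$X <- rnorm(nrow(sim.df))
sim.df$Z <- rnorm(nrow(sim.df))
sim.df$y <- 1 + 2 * sim.df$X - sim.df$Z + rnorm(nrow(sim.df))

# General model: separate X and Z coefficients for each combination
general.lm <- lm(y ~ A*B*X + A*B*Z, data = sim.df)

# Submodel without A:X, B:X and A:B:X, so X has a single common coefficient
common.X.lm <- lm(y ~ A*B + X + A*B*Z, data = sim.df)

# F test of the dropped interactions (a*b - 1 degrees of freedom)
slope.test <- anova(common.X.lm, general.lm)
slope.test
```

A large p-value in this comparison is consistent with a common X coefficient; the same comparison can be made for Z.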

Model selection
• In these situations, the number of possible models is large
• Need variable selection techniques
    • Anova
    • Stepwise
• Don’t include high order interactions unless you include the lower order interactions

Caution
• Sometimes we don’t have enough data to fit a separate regression to each factor level combination (need at least one more data point than the number of continuous variables per combination)
• In this case we drop the higher level interactions, forcing coefficients to have common values.
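A quick way to see whether each combination has enough data is to tabulate the factors and compare the cell counts with the minimum needed. The following is a minimal sketch on a small made-up data frame (toy.df is hypothetical); only the factors are needed for the count, so the continuous variables are left out.

```r
# Hypothetical toy data: two factors, assumed to go with 2 continuous variables
toy.df <- data.frame(
  A = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3), levels = 1:3),
  B = factor(c(1, 1, 1, 2, 1, 1, 1, 2, 2, 2), levels = 1:2)
)

n.continuous <- 2
min.points <- n.continuous + 1   # one more point than continuous variables

# Number of observations in each factor level combination
cell.counts <- with(toy.df, table(A, B))
cell.counts

# Combinations that cannot support their own regression
too.small <- cell.counts < min.points
which(too.small, arr.ind = TRUE)
```

Any combination flagged here is a reason to drop higher order interactions rather than fit the most general model.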

Example: Risk factors for low birthweight
These data were collected at Baystate Medical Center, Springfield, Mass., during 1986, as part of a study to identify risk factors for low-birthweight babies. The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables.

Variables
age: mother's age in years, continuous
lwt: mother's weight in pounds, continuous
race: mother's race ('1' = white, '2' = black, '3' = other), factor
smoke: smoking during pregnancy (1 = smoked, 0 = didn’t smoke), factor
ht: history of hypertension (0 = No, 1 = Yes), factor
ui: presence of uterine irritability (0 = No, 1 = Yes), factor
bwt: birth weight in grams, continuous, response
Must be a factor!!

Preliminary plots

Plotting conclusions
Some relationships between bwt and the covariates:
• Slight relationship with lwt
• Small effects due to the categorical variables
On to fitting models...

Factor level combinations
• There are 2 continuous explanatory variables and 4 categorical explanatory variables: race (3 levels), smoke (2 levels), ht (2 levels) and ui (2 levels). There are 3 × 2 × 2 × 2 = 24 factor level combinations.
• 24 regressions in all!!

Models
• The most general model would fit separate regression surfaces to each of the 24 combinations
• Assuming planes are appropriate, this means 24 × 3 = 72 parameters. There are 189 observations, so this is rather a lot of parameters (usually we want at least 5 observations per parameter). In fact, not all factor level combinations have enough data to fit a plane (need at least 3 points)
• The model fitting separate planes to each combination is
    bwt ~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui
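The counting argument above can be written out as a few lines of arithmetic. This sketch uses only the numbers quoted on the slides (levels per factor, 2 continuous variables, 189 observations); the variable names are made up for illustration.

```r
# Numbers taken from the slides; object names are illustrative only
levels.per.factor <- c(race = 3, smoke = 2, ht = 2, ui = 2)
n.continuous <- 2
n.obs <- 189

n.combinations   <- prod(levels.per.factor)            # factor level combinations
params.per.plane <- n.continuous + 1                   # constant + age + lwt
n.params         <- n.combinations * params.per.plane  # parameters in general model
obs.per.param    <- n.obs / n.params                   # observations per parameter

c(combinations = n.combinations,
  parameters   = n.params,
  obs.per.par  = obs.per.param)
```

The ratio of observations to parameters comes out well below the rule of thumb of 5 quoted on the slide, which is the motivation for reducing the model.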

Fitting
• Can fit the model and use the anova function to reduce the number of variables
    > births.lm <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht, data=births.df)
    > anova(births.lm)
• Also use stepwise selection (the step function) with the forward option
    > null.lm <- lm(bwt ~ 1, data=births.df)
    > step(null.lm, formula(births.lm), direction="forward")

Results: anova
Analysis of Variance Table
                  Df   Sum Sq  Mean Sq F value    Pr(>F)
age                1   806927   806927  2.0610  0.153251
race               2  4456772  2228386  5.6916  0.004167 **
smoke              1  7098861  7098861 18.1314 3.674e-05 ***
ui                 1  6513795  6513795 16.6370 7.414e-05 ***
ht                 1  2458238  2458238  6.2786  0.013317 *
lwt                1  2779537  2779537  7.0993  0.008579 **
age:race           2   368694   184347  0.4708  0.625420
age:smoke          1  2220991  2220991  5.6727  0.018520 *
race:smoke         2  1085210   542605  1.3859  0.253374
age:ui             1   187617   187617  0.4792  0.489886
race:ui            2   774013   387006  0.9885  0.374625
smoke:ui           1    43060    43060  0.1100  0.740641
age:ht             1  1573461  1573461  4.0188  0.046844 *
race:ht            2   318415   159207  0.4066  0.666639
smoke:ht           1   115215   115215  0.2943  0.588322
race:lwt           2  1008962   504481  1.2885  0.278798
smoke:lwt          1    86923    86923  0.2220  0.638215

Results: anova (cont)
Analysis of Variance Table
                   Df   Sum Sq  Mean Sq F value   Pr(>F)
ui:lwt              1   196810   196810  0.5027 0.479457
ht:lwt              1  1145508  1145508  2.9258 0.089300 .
age:race:smoke      2  1063946   531973  1.3587 0.260218
age:race:ui         2   108742    54371  0.1389 0.870455
age:smoke:ui        1      533      533  0.0014 0.970632
race:smoke:ui       1   617235   617235  1.5765 0.211272
age:race:ht         2  1220320   610160  1.5584 0.213948
age:smoke:ht        1   406773   406773  1.0389 0.309752
race:smoke:lwt      2  1052747   526373  1.3444 0.263898
race:ui:lwt         2   786735   393367  1.0047 0.368668
smoke:ui:lwt        1  1128102  1128102  2.8813 0.091744 .
race:ht:lwt         1   435519   435519  1.1124 0.293310
age:race:smoke:ui   1  2544108  2544108  6.4980 0.011832 *
race:smoke:ui:lwt   1   150811   150811  0.3852 0.535806
Residuals         146 57162471   391524

Results: stepwise (forward/both)
Step: AIC= 2451.34
bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke
             Df Sum of Sq      RSS  AIC
<none>                    73000256 2451
- race:smoke  2   1657370 74657625 2452
+ ui:lwt      1    304152 72696104 2453
+ smoke:ht    1    168685 72831571 2453
- ht:lwt      1   1397486 74397742 2453
+ age         1    149901 72850355 2453
+ smoke:lwt   1     11843 72988412 2453
+ race:ht     2    497275 72502981 2454
+ race:lwt    2    441336 72558920 2454
- ui          1   6968046 79968302 2467

Comparisons
• 3 models to compare
    • Full model
    • Model indicated by anova (model 2)
        bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke
    • Model chosen by stepwise (model 3)
        bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

extractAIC(model3.lm)
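A comparison of models 2 and 3 along these lines might look like the sketch below. It assumes that the birthwt data frame in the MASS package is the same Baystate data as births.df, which is not defined in these slides; the object names model2.lm and aic.table are likewise assumptions.

```r
library(MASS)   # assumption: MASS::birthwt matches the lecture's births.df

# Code the categorical variables as factors
births.df <- within(birthwt, {
  race  <- factor(race)
  smoke <- factor(smoke)
  ht    <- factor(ht)
  ui    <- factor(ui)
})

# Model 2 (indicated by anova) and model 3 (chosen by stepwise)
model2.lm <- lm(bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke,
                data = births.df)
model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)

# extractAIC() returns c(equivalent df, AIC); smaller AIC is preferred
aic.table <- rbind(model2 = extractAIC(model2.lm),
                   model3 = extractAIC(model3.lm))
colnames(aic.table) <- c("edf", "AIC")
aic.table

# Adjusted R-squared for the same two models
c(model2 = summary(model2.lm)$adj.r.squared,
  model3 = summary(model3.lm)$adj.r.squared)
```

Both models use the same number of parameters, so here the AIC comparison comes down to the residual sums of squares.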

Deleting?
• Point 133 seems influential: large covariance ratio and hat matrix diagonal (HMD)
• Refitting without point 133 makes model 3 the best, so we will go with model 3
• Could also just use a purely additive model (i.e. parallel planes), but adjusted R² and AIC are slightly worse.
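The influence checks mentioned here can be sketched with the base R functions hatvalues() and covratio(). The sketch assumes that MASS::birthwt is the same data as births.df and that row position 133 there corresponds to the lecture's point 133; neither is confirmed by the slides. The 3p/n cut-off is a common rule of thumb, not something stated in the lecture.

```r
library(MASS)   # assumption: MASS::birthwt matches the lecture's births.df

births.df <- within(birthwt, {
  race  <- factor(race)
  smoke <- factor(smoke)
  ht    <- factor(ht)
  ui    <- factor(ui)
})
model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)

p <- length(coef(model3.lm))
n <- nrow(births.df)

hmd       <- hatvalues(model3.lm)   # hat matrix diagonals
cov.ratio <- covratio(model3.lm)    # covariance ratios

# Rule-of-thumb cut-off (an assumption): flag points beyond 3p/n
cutoff  <- 3 * p / n
flagged <- which(hmd > cutoff | abs(cov.ratio - 1) > cutoff)
flagged

# Refit without one row; the position 133 is taken from the slide
drop.row <- 133
model3.refit.lm <- lm(formula(model3.lm), data = births.df[-drop.row, ])
extractAIC(model3.refit.lm)
```

Note that AIC values from the refit are based on 188 observations, so they should only be compared with other models refitted to the same reduced data.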

Summary Model 3
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   3158.801    267.867  11.792  < 2e-16 ***
ui1           -548.459    133.567  -4.106 6.12e-05 ***
race2         -561.784    187.680  -2.993 0.003152 **
race3         -500.440    133.004  -3.763 0.000228 ***
smoke1        -529.973    133.865  -3.959 0.000109 ***
ht1          -1978.134    711.642  -2.780 0.006026 **
lwt              2.426      1.788   1.357 0.176520
ht1:lwt         10.236      4.535   2.257 0.025217 *
race2:smoke1   255.066    300.258   0.849 0.396750
race3:smoke1   510.755    244.031   2.093 0.037768 *

Interpretation (cont)
Other things being equal:
• Uterine irritability associated with lower birthweight
• Smoking associated with lower birthweight, but differently for different races
• Hypertension associated with lower birthweight
• Race associated with lower birthweight
    • Black lower than white
    • “Other” lower than white
• Higher mother’s weight associated with higher birthweight, for the hypertension group
• Smoking lowers birthweight more for race 1 (white).
• These effects are significant but small compared to the variability.
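The statement about mother's weight and hypertension can be read off the model 3 coefficients: the lwt slope in the hypertension group is the lwt coefficient plus the ht1:lwt interaction. The sketch below copies the two estimates from the model 3 summary; the object names are made up for illustration.

```r
# Estimates copied from the model 3 summary (grams per pound of lwt)
est <- c("lwt" = 2.426, "ht1:lwt" = 10.236)

slope.no.ht <- est[["lwt"]]                     # no history of hypertension
slope.ht    <- est[["lwt"]] + est[["ht1:lwt"]]  # history of hypertension

c(no.hypertension = slope.no.ht, hypertension = slope.ht)
```

The slope for the group without hypertension is the one with the large p-value in the summary, which is why the weight effect is stated only for the hypertension group.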

Interpretation of interactions -836 = -530 -561 + 255
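The same kind of sum can be worked out for every race and smoking combination. This sketch builds a table of estimated differences from the baseline group (white non-smokers), using estimates copied from the model 3 summary; the object names are hypothetical. Summing the unrounded estimates for black smokers gives about -836.7, in line with the figure quoted above.

```r
# Estimates copied from the model 3 summary (grams)
est <- c("race2" = -561.784, "race3" = -500.440, "smoke1" = -529.973,
         "race2:smoke1" = 255.066, "race3:smoke1" = 510.755)

# Difference in mean birthweight from white non-smokers,
# other variables held fixed
effects <- rbind(
  nonsmoker = c(white = 0,
                black = est[["race2"]],
                other = est[["race3"]]),
  smoker    = c(white = est[["smoke1"]],
                black = est[["smoke1"]] + est[["race2"]] + est[["race2:smoke1"]],
                other = est[["smoke1"]] + est[["race3"]] + est[["race3:smoke1"]])
)
effects

# Estimated smoking effect within each race (smoker minus non-smoker)
smoking.effect <- effects["smoker", ] - effects["nonsmoker", ]
smoking.effect
```

The within-race differences show the pattern described in the interpretation: the estimated smoking effect is largest in magnitude for race 1 (white).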

Diagnostics for model 2
Point 133!! Check for high influence etc.
