Lecture 7: Multiple Linear Regression Interpretation with different types of predictors

Lecture 7:Multiple Linear RegressionInterpretation with different types of predictors BMTRY 701Biostatistical Methods II

Interpreting regression coefficients • So far, we’ve considered continuous covariates • Covariates can take other forms: • binary • nominal categorical • quadratics (or other transforms) • interactions • Interpretations may vary depending on the nature of your covariate

Binary covariates • Considered ‘qualitative’ • The ordering of numeric assignments does not matter • Examples: MEDSCHL: 1 = yes; 2 = no • More popular examples: • gender • mutation • pre vs. post-menopausal • Two age categories

How is MEDSCHL related to LOS? • How to interpret β1? • Coding of variables: • 2 vs. 1 • i prefer 1 vs. 0 • difference? the intercept. • Let’s make a new variable: • MS = 1 if MEDSCHL = 1 (yes) • MS = 0 if MEDSCHL = 2 (no)

How is MEDSCHOOL related to LOS? • What does β1 mean? • Same, yet different interpretation as continuous • What if we had used old coding?

R code > data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL) > table(data$ms, data$MEDSCHL) 1 2 0 0 96 1 17 0 > > reg <- lm(LOS ~ ms, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.4105 0.1871 50.290 < 2e-16 *** ms 1.5807 0.4824 3.276 0.00140 ** --- Residual standard error: 1.833 on 111 degrees of freedom Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997 F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404

Scatterplot? Residual Plot? res <- reg$residuals plot(data$ms, res) abline(h=0)

Fitted Values • Only two fitted values: • Diagnostic plots are not as informative • Extrapolation and Interpolation are meaningless! • We can estimate LOS for MS=0.5 • LOS = 9.41 + 1.58*0.5 = 10.20 • Try to interpret the result…

“Linear” regression? • But, what about ‘linear’ assumption? • we still need to adhere to the model assumptions • recall that they relate to the residuals primarily • residuals are independent and identically distributed: • The model is still linear in the parameters!

MLR example: Add infection risk to our model > reg <- lm(LOS ~ ms + INFRISK, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.4547 0.5146 12.542 <2e-16 *** ms 0.9717 0.4316 2.251 0.0263 * INFRISK 0.6998 0.1156 6.054 2e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.595 on 110 degrees of freedom Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036 F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10 How does interpretation change?

What about more than two categories? • We looked briefly at region a few lectures back. • How to interpret? • You need to define a reference category • For med school: • reference was ms=0 • almost ‘subconscious’ with only two categories • With >2 categories, need to be careful of interpretation

LOS ~ REGION • Note how ‘indicator’ or ‘dummy’ variable is defined: • I(condition) = 1 if condition is true • I(condition) = 0 if condition is false

Interpretation • β0 = • β1 = • β2 = • β3 =

hypothesis tests? • Ho: β1 = 0 • Ha: β1 ≠ 0 • What does that test (in words)? • Ho: β2 = 0 • Ha: β2 ≠ 0 • What does that test (in words)? • What if we want to test region, in general? • One of our next topics!

R > reg <- lm(LOS ~ factor(REGION), data=data) > summary(reg) Call: lm(formula = LOS ~ factor(REGION), data = data) Residuals: Min 1Q Median 3Q Max -3.05893 -1.03135 -0.02344 0.68107 8.47107 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.0889 0.3165 35.040 < 2e-16 *** factor(REGION)2 -1.4055 0.4333 -3.243 0.00157 ** factor(REGION)3 -1.8976 0.4194 -4.524 1.55e-05 *** factor(REGION)4 -2.9752 0.5248 -5.669 1.19e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.675 on 109 degrees of freedom Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07

Interpreting • Is mean LOS different in region 2 vs. 1? • What about region 3 vs. 1 and 4 vs. 1? • What about region 4 vs. 3? • How to test that? • Two options? • recode the data so that 3 or 4 is the reference • use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients • For now…we’ll focus on the first.

Make REGION=4 the reference • Our model then changes:

R code: recoding so last category is reference > data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))) ) > reg <- lm(LOS ~ rev.region, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.1137 0.4186 19.381 < 2e-16 *** rev.region3 1.0776 0.5010 2.151 0.03371 * rev.region2 1.5697 0.5127 3.061 0.00277 ** rev.region1 2.9752 0.5248 5.669 1.19e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.675 on 109 degrees of freedom Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07

Quite a few differences: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.0889 0.3165 35.040 < 2e-16 *** factor(REGION)2 -1.4055 0.4333 -3.243 0.00157 ** factor(REGION)3 -1.8976 0.4194 -4.524 1.55e-05 *** factor(REGION)4 -2.9752 0.5248 -5.669 1.19e-07 *** Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.1137 0.4186 19.381 < 2e-16 *** rev.region3 1.0776 0.5010 2.151 0.03371 * rev.region2 1.5697 0.5127 3.061 0.00277 ** rev.region1 2.9752 0.5248 5.669 1.19e-07 ***

But the “model” is the same • Model 1: Residual standard error: 1.675 on 109 degrees of freedom • Model 2: Residual standard error: 1.675 on 109 degrees of freedom • The model represent the data equally well. • However, the ‘reparameterization’ yields a difference of interpretation for the model parameters.

Diagnostics # residual plot reg <- lm(LOS ~ factor(REGION), data=data) res <- reg$residuals fit <- reg$fitted.values plot(fit, res) abline(h=0, lwd=2)

Diagnostics # residual plot reg <- lm(logLOS ~ factor(REGION), data=data) res <- reg$residuals fit <- reg$fitted.values plot(fit, res) abline(h=0, lwd=2)

Next type: polynomials • Most common: quadratic • What does that mean? • Including a linear and ‘squared’ term • Why? to adhere to model assumptions! • Example: • last week we saw LOS ~ NURSE • quadratic actually made some sense

Scatterplot

Fitting the model > data$nurse2 <- data$NURSE^2 > reg <- lm(logLOS ~ NURSE + nurse2, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.090e+00 3.645e-02 57.355 < 2e-16 *** NURSE 1.430e-03 3.525e-04 4.058 9.29e-05 *** nurse2 -1.789e-06 6.262e-07 -2.857 0.00511 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.161 on 110 degrees of freedom Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819 F-statistic: 13.45 on 2 and 110 DF, p-value: 5.948e-06 Interpretable?

How does it fit? Note: ‘abline’ only will work for simple linear regression. when there is more than one predictor, you need to make the line another way. # make regression line plot(data$NURSE, data$logLOS, pch=16) coef <- reg$coefficients nurse.values <- seq(15,650,5) fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2 lines(nurse.values, fit.line, lwd=2)

Another approach to the same data • Does it make sense that it increases, and then decreases? • Or, would it make more sense to increase, and then plateau? • Which do you think makes more sense? • How to tell? • use a data driven approach • tells us “what do the data suggest?”

Smoothing • Empirical way to look at the relationship • data is ‘binned’ by x • for each ‘bin’, the average y is estimated • but, it is a little fancier • it is a ‘moving average’ • each x value is in multiple bins • Modern methods use models within bins • Lowess smoothing • Cubic spline smoothing • Specifics are not so important: the “empirical” result is

smoother <- lowess(data$NURSE, data$logLOS) plot(data$NURSE, data$logLOS, pch=16) lines(smoother, lwd=2, col=2) lines(nurse.values, fit.line, lwd=2) legend(450,3,c("Quadratic Model","Lowess Smooth"), lty=c(1,1),lwd=c(2,2),col=c(1,2))

Inference? • What do the data say? • Looks like a plateau • How can we model that? • One option: a spline • Zeger: “broken arrow” model • Example: looks like a “knot” at NURSE = 250 • there is a linear increase in logLOS until about NURSE=250 • then, the relationship is flat • this implies a slope prior to NURSE=250, and one after NURSE=250

Implementing a spline • A little tricky • We need to define a new variable, NURSE* • And then we write the model as follows:

How to interpret? • When in doubt, condition on different scenarios • What is E(logLOS) when NURSE<250? • What is E(logLOS) when NURSE>250?

R data$nurse.star <- ifelse(data$NURSE<=250,0,data$NURSE-250) data$nurse.star reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data) # make regression line coef.spline <- reg.spline$coefficients nurse.values <- seq(15,650,5) nurse.values.star <- ifelse(nurse.values<=250,0,nurse.values-250) spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star plot(data$NURSE, data$logLOS, pch=16) lines(smoother, lwd=2, col=2) lines(nurse.values, fit.line, lwd=2) lines(nurse.values, spline.line, col=4, lwd=3) legend(450,3,c("Quadratic Model","Lowess Smooth","Spline Model"), lty=c(1,1,1),lwd=c(2,2,3),col=c(1,2,4))

Interpreting the output > summary(reg.spline) Call: lm(formula = logLOS ~ NURSE + nurse.star, data = data) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.1073877 0.0332135 63.450 < 2e-16 *** NURSE 0.0010278 0.0002336 4.399 2.52e-05 *** nurse.star -0.0011114 0.0004131 -2.690 0.00825 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1616 on 110 degrees of freedom Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754 F-statistic: 12.91 on 2 and 110 DF, p-value: 9.165e-06 How do we interpret the coefficient on nurse.star?

Why subtract the 250 in defining nurse.star? • It ‘calibrates’ where the two pieces of the lines meet • if it is not included, then they will not connect

Why a spline vs. the quadratic? • it fits well! • it is more interpretable • it makes sense • it is less sensitive to the outliers • Can be generalized to have more ‘knots’

Next time • ANOVA • F-tests

Lecture 7: Multiple Linear Regression Interpretation with different types of predictors

Lecture 7: Multiple Linear Regression Interpretation with different types of predictors

Presentation Transcript

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression uses 2 or more predictors

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression

Multiple linEAr regression

Multiple Linear Regression with Mediator

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression

Multiple Linear Regression

Multiple linear regression

Multiple Linear Regression

Lecture 13: Multiple linear regression

Multiple Regression Interpretation

Multiple Linear Regression.

Linear Regression with Multiple Regressors

Lecture 8: Multiple Linear Regression Interpretation with different types of predictors

Multiple Linear Regression

Linear Regression with Multiple Regressors