380 likes | 510 Views
Lecture 7: Multiple Linear Regression Interpretation with different types of predictors. BMTRY 701 Biostatistical Methods II. Interpreting regression coefficients. So far, we’ve considered continuous covariates Covariates can take other forms: binary nominal categorical
E N D
Lecture 7:Multiple Linear RegressionInterpretation with different types of predictors BMTRY 701Biostatistical Methods II
Interpreting regression coefficients • So far, we’ve considered continuous covariates • Covariates can take other forms: • binary • nominal categorical • quadratics (or other transforms) • interactions • Interpretations may vary depending on the nature of your covariate
Binary covariates • Considered ‘qualitative’ • The ordering of numeric assignments does not matter • Examples: MEDSCHL: 1 = yes; 2 = no • More popular examples: • gender • mutation • pre vs. post-menopausal • Two age categories
How is MEDSCHL related to LOS? • How to interpret β1? • Coding of variables: • 2 vs. 1 • i prefer 1 vs. 0 • difference? the intercept. • Let’s make a new variable: • MS = 1 if MEDSCHL = 1 (yes) • MS = 0 if MEDSCHL = 2 (no)
How is MEDSCHOOL related to LOS? • What does β1 mean? • Same, yet different interpretation as continuous • What if we had used old coding?
R code > data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL) > table(data$ms, data$MEDSCHL) 1 2 0 0 96 1 17 0 > > reg <- lm(LOS ~ ms, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.4105 0.1871 50.290 < 2e-16 *** ms 1.5807 0.4824 3.276 0.00140 ** --- Residual standard error: 1.833 on 111 degrees of freedom Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997 F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404
Scatterplot? Residual Plot? res <- reg$residuals plot(data$ms, res) abline(h=0)
Fitted Values • Only two fitted values: • Diagnostic plots are not as informative • Extrapolation and Interpolation are meaningless! • We can estimate LOS for MS=0.5 • LOS = 9.41 + 1.58*0.5 = 10.20 • Try to interpret the result…
“Linear” regression? • But, what about ‘linear’ assumption? • we still need to adhere to the model assumptions • recall that they relate to the residuals primarily • residuals are independent and identically distributed: • The model is still linear in the parameters!
MLR example: Add infection risk to our model > reg <- lm(LOS ~ ms + INFRISK, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.4547 0.5146 12.542 <2e-16 *** ms 0.9717 0.4316 2.251 0.0263 * INFRISK 0.6998 0.1156 6.054 2e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.595 on 110 degrees of freedom Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036 F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10 How does interpretation change?
What about more than two categories? • We looked briefly at region a few lectures back. • How to interpret? • You need to define a reference category • For med school: • reference was ms=0 • almost ‘subconscious’ with only two categories • With >2 categories, need to be careful of interpretation
LOS ~ REGION • Note how ‘indicator’ or ‘dummy’ variable is defined: • I(condition) = 1 if condition is true • I(condition) = 0 if condition is false
Interpretation • β0 = • β1 = • β2 = • β3 =
hypothesis tests? • Ho: β1 = 0 • Ha: β1 ≠ 0 • What does that test (in words)? • Ho: β2 = 0 • Ha: β2 ≠ 0 • What does that test (in words)? • What if we want to test region, in general? • One of our next topics!
R > reg <- lm(LOS ~ factor(REGION), data=data) > summary(reg) Call: lm(formula = LOS ~ factor(REGION), data = data) Residuals: Min 1Q Median 3Q Max -3.05893 -1.03135 -0.02344 0.68107 8.47107 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.0889 0.3165 35.040 < 2e-16 *** factor(REGION)2 -1.4055 0.4333 -3.243 0.00157 ** factor(REGION)3 -1.8976 0.4194 -4.524 1.55e-05 *** factor(REGION)4 -2.9752 0.5248 -5.669 1.19e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.675 on 109 degrees of freedom Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
Interpreting • Is mean LOS different in region 2 vs. 1? • What about region 3 vs. 1 and 4 vs. 1? • What about region 4 vs. 3? • How to test that? • Two options? • recode the data so that 3 or 4 is the reference • use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients • For now…we’ll focus on the first.
Make REGION=4 the reference • Our model then changes:
R code: recoding so last category is reference > data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))) ) > reg <- lm(LOS ~ rev.region, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.1137 0.4186 19.381 < 2e-16 *** rev.region3 1.0776 0.5010 2.151 0.03371 * rev.region2 1.5697 0.5127 3.061 0.00277 ** rev.region1 2.9752 0.5248 5.669 1.19e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.675 on 109 degrees of freedom Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
Quite a few differences: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.0889 0.3165 35.040 < 2e-16 *** factor(REGION)2 -1.4055 0.4333 -3.243 0.00157 ** factor(REGION)3 -1.8976 0.4194 -4.524 1.55e-05 *** factor(REGION)4 -2.9752 0.5248 -5.669 1.19e-07 *** Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.1137 0.4186 19.381 < 2e-16 *** rev.region3 1.0776 0.5010 2.151 0.03371 * rev.region2 1.5697 0.5127 3.061 0.00277 ** rev.region1 2.9752 0.5248 5.669 1.19e-07 ***
But the “model” is the same • Model 1: Residual standard error: 1.675 on 109 degrees of freedom • Model 2: Residual standard error: 1.675 on 109 degrees of freedom • The model represent the data equally well. • However, the ‘reparameterization’ yields a difference of interpretation for the model parameters.
Diagnostics # residual plot reg <- lm(LOS ~ factor(REGION), data=data) res <- reg$residuals fit <- reg$fitted.values plot(fit, res) abline(h=0, lwd=2)
Diagnostics # residual plot reg <- lm(logLOS ~ factor(REGION), data=data) res <- reg$residuals fit <- reg$fitted.values plot(fit, res) abline(h=0, lwd=2)
Next type: polynomials • Most common: quadratic • What does that mean? • Including a linear and ‘squared’ term • Why? to adhere to model assumptions! • Example: • last week we saw LOS ~ NURSE • quadratic actually made some sense
Fitting the model > data$nurse2 <- data$NURSE^2 > reg <- lm(logLOS ~ NURSE + nurse2, data=data) > summary(reg) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.090e+00 3.645e-02 57.355 < 2e-16 *** NURSE 1.430e-03 3.525e-04 4.058 9.29e-05 *** nurse2 -1.789e-06 6.262e-07 -2.857 0.00511 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.161 on 110 degrees of freedom Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819 F-statistic: 13.45 on 2 and 110 DF, p-value: 5.948e-06 Interpretable?
How does it fit? Note: ‘abline’ only will work for simple linear regression. when there is more than one predictor, you need to make the line another way. # make regression line plot(data$NURSE, data$logLOS, pch=16) coef <- reg$coefficients nurse.values <- seq(15,650,5) fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2 lines(nurse.values, fit.line, lwd=2)
Another approach to the same data • Does it make sense that it increases, and then decreases? • Or, would it make more sense to increase, and then plateau? • Which do you think makes more sense? • How to tell? • use a data driven approach • tells us “what do the data suggest?”
Smoothing • Empirical way to look at the relationship • data is ‘binned’ by x • for each ‘bin’, the average y is estimated • but, it is a little fancier • it is a ‘moving average’ • each x value is in multiple bins • Modern methods use models within bins • Lowess smoothing • Cubic spline smoothing • Specifics are not so important: the “empirical” result is
smoother <- lowess(data$NURSE, data$logLOS) plot(data$NURSE, data$logLOS, pch=16) lines(smoother, lwd=2, col=2) lines(nurse.values, fit.line, lwd=2) legend(450,3,c("Quadratic Model","Lowess Smooth"), lty=c(1,1),lwd=c(2,2),col=c(1,2))
Inference? • What do the data say? • Looks like a plateau • How can we model that? • One option: a spline • Zeger: “broken arrow” model • Example: looks like a “knot” at NURSE = 250 • there is a linear increase in logLOS until about NURSE=250 • then, the relationship is flat • this implies a slope prior to NURSE=250, and one after NURSE=250
Implementing a spline • A little tricky • We need to define a new variable, NURSE* • And then we write the model as follows:
How to interpret? • When in doubt, condition on different scenarios • What is E(logLOS) when NURSE<250? • What is E(logLOS) when NURSE>250?
R data$nurse.star <- ifelse(data$NURSE<=250,0,data$NURSE-250) data$nurse.star reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data) # make regression line coef.spline <- reg.spline$coefficients nurse.values <- seq(15,650,5) nurse.values.star <- ifelse(nurse.values<=250,0,nurse.values-250) spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star plot(data$NURSE, data$logLOS, pch=16) lines(smoother, lwd=2, col=2) lines(nurse.values, fit.line, lwd=2) lines(nurse.values, spline.line, col=4, lwd=3) legend(450,3,c("Quadratic Model","Lowess Smooth","Spline Model"), lty=c(1,1,1),lwd=c(2,2,3),col=c(1,2,4))
Interpreting the output > summary(reg.spline) Call: lm(formula = logLOS ~ NURSE + nurse.star, data = data) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.1073877 0.0332135 63.450 < 2e-16 *** NURSE 0.0010278 0.0002336 4.399 2.52e-05 *** nurse.star -0.0011114 0.0004131 -2.690 0.00825 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1616 on 110 degrees of freedom Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754 F-statistic: 12.91 on 2 and 110 DF, p-value: 9.165e-06 How do we interpret the coefficient on nurse.star?
Why subtract the 250 in defining nurse.star? • It ‘calibrates’ where the two pieces of the lines meet • if it is not included, then they will not connect
Why a spline vs. the quadratic? • it fits well! • it is more interpretable • it makes sense • it is less sensitive to the outliers • Can be generalized to have more ‘knots’
Next time • ANOVA • F-tests