
Lecture 4: Linear Models III


Presentation Transcript


1. Advanced Research Skills
Lecture 4: Linear Models III
Olivier MISSA, om502@york.ac.uk

2. Outline
• "Refresher" on different types of model:
  • Multiple regression
  • Polynomial regression
• Model building: finding the "best" model

3. Multiple Regression
• When more than one continuous or discrete variable is used to predict a response variable.
• Example: the 38 car drivers dataset.
> data(seatpos)   ## from the faraway package
> attach(seatpos)
> names(seatpos)
[1] "Age"       "Weight"    "HtShoes"   "Ht"        "Seated"
[6] "Arm"       "Thigh"     "Leg"       "hipcenter"
> summary(lm(hipcenter ~ Age))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -192.9645    24.3015  -7.940 2.00e-09 ***
Age            0.7963     0.6331   1.258    0.217
...
Multiple R-squared: 0.0421, Adjusted R-squared: 0.01549
...

4. Multiple Regression
> summary(lm(hipcenter ~ Age))
         Estimate Std. Error t value Pr(>|t|)
Age        0.7963     0.6331   1.258    0.217
Multiple R-squared: 0.0421
> summary(lm(hipcenter ~ Weight))
         Estimate Std. Error t value Pr(>|t|)
Weight    -1.0674     0.2134  -5.002 1.49e-05 ***
Multiple R-squared: 0.41
> summary(lm(hipcenter ~ HtShoes))
         Estimate Std. Error t value Pr(>|t|)
HtShoes   -4.2621     0.5391  -7.907 2.21e-09 ***
Multiple R-squared: 0.6346
> summary(lm(hipcenter ~ Ht))
         Estimate Std. Error t value Pr(>|t|)
Ht        -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383
> summary(lm(hipcenter ~ Seated))
         Estimate Std. Error t value Pr(>|t|)
Seated     -8.844      1.375  -6.432 1.84e-07 ***
Multiple R-squared: 0.5347

5. Multiple Regression
> summary(lm(hipcenter ~ Arm))
       Estimate Std. Error t value Pr(>|t|)
Arm     -10.351      2.391  -4.329 0.000114 ***
Multiple R-squared: 0.3423
> summary(lm(hipcenter ~ Thigh))
       Estimate Std. Error t value Pr(>|t|)
Thigh    -9.100      2.069  -4.398 9.29e-05 ***
Multiple R-squared: 0.3495
> summary(lm(hipcenter ~ Leg))
       Estimate Std. Error t value Pr(>|t|)
Leg     -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196
[Scatterplots of hipcenter against Leg, Ht and Age]
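Rather than repeating summary() once per predictor as above, the single-predictor R-squared values can be collected in one pass. A minimal sketch, assuming the seatpos data frame from the faraway package is loaded (the loop itself is not from the slides):

predictors <- setdiff(names(seatpos), "hipcenter")
r2 <- sapply(predictors, function(v) {
  f <- reformulate(v, response = "hipcenter")   ## builds hipcenter ~ v
  summary(lm(f, data = seatpos))$r.squared
})
sort(r2, decreasing = TRUE)   ## Ht and HtShoes come out on top, Age last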

6. Multiple Regression: slope parameters change
When Ht and Leg enter the model together, both slopes shrink and neither remains significant, even though each predictor is highly significant on its own.
> summary(mod <- lm(hipcenter ~ Ht + Leg))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  491.244     99.543   4.935 1.95e-05 ***
Ht            -2.565      1.268  -2.022   0.0509
Leg           -6.136      4.164  -1.473   0.1496
---
Residual standard error: 35.79 on 35 degrees of freedom
Multiple R-squared: 0.6594, Adjusted R-squared: 0.6399
F-statistic: 33.88 on 2 and 35 DF, p-value: 6.517e-09
For comparison, each predictor alone:
> summary(lm(hipcenter ~ Ht))
       Estimate Std. Error t value Pr(>|t|)
Ht      -4.2650     0.5351  -7.970 1.83e-09 ***
Multiple R-squared: 0.6383
> summary(lm(hipcenter ~ Leg))
       Estimate Std. Error t value Pr(>|t|)
Leg     -13.795      1.801  -7.658 4.59e-09 ***
Multiple R-squared: 0.6196

7. Multiple Regression: beware of strong collinearity
> drop1(mod, test="F")
Single term deletions
Model: hipcenter ~ Ht + Leg
       Df Sum of Sq   RSS AIC F value   Pr(F)
<none>              44835 275
Ht      1      5236 50071 277  4.0877 0.05089 .
Leg     1      2781 47616 275  2.1711 0.14957
> cor(Ht, Leg)
[1] 0.9097524
> plot(Ht ~ Leg)
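Collinearity this strong can also be quantified with variance inflation factors from the car package (not shown on the slide); values much above 5-10 are usually taken as a warning sign. A minimal sketch:

library(car)   ## for vif()
mod <- lm(hipcenter ~ Ht + Leg, data = seatpos)
vif(mod)   ## with two predictors, VIF = 1/(1 - r^2) = 1/(1 - 0.91^2), about 5.8 for both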

8. Multiple Regression: beware of strong collinearity
No added variable significantly improves the model once Ht is included.
> add1(lm(hipcenter ~ Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F")
Single term additions
Model: hipcenter ~ Ht
        Df Sum of Sq   RSS AIC F value  Pr(F)
<none>               47616 275
Age      1      2354 45262 275  1.8199 0.1860
Weight   1       196 47420 277  0.1446 0.7061
HtShoes  1        26 47590 277  0.0189 0.8913
Seated   1       102 47514 277  0.0748 0.7861
Arm      1        76 47540 277  0.0558 0.8147
Thigh    1         5 47611 277  0.0034 0.9538
Leg      1      2781 44835 275  2.1711 0.1496

9. Comparing models
• How can we compare models on an equal footing, regardless of their number of parameters?
• The multiple R-squared can only increase as more variables enter a model (because the RSS can only decrease), so it is no good for comparing models.
• The adjusted R-squared corrects for the number of parameters to some extent, as the sketch below illustrates.
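As a quick illustration (my own sketch on the seatpos data): adding a predictor that explains almost nothing, such as Thigh on top of Ht (see slide 8), still nudges R-squared up but pulls the adjusted R-squared down.

small <- lm(hipcenter ~ Ht,         data = seatpos)
big   <- lm(hipcenter ~ Ht + Thigh, data = seatpos)
c(summary(small)$r.squared,     summary(big)$r.squared)       ## R-squared can only rise
c(summary(small)$adj.r.squared, summary(big)$adj.r.squared)   ## adjusted R-squared falls here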

10. Akaike Information Criterion
• Invented by Hirotugu Akaike in 1971, after a few weeks of sleepless nights stressing over a conference presentation.
• The AIC, originally called 'An Information Criterion', penalizes the likelihood of a model according to the number of parameters being estimated:
    AIC = 2k - 2 ln(L)
  where L is the maximized value of the likelihood function and k is the number of parameters.
• The lower the AIC value, the better the model.
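To make the definition concrete, the AIC can be computed by hand from the maximized log-likelihood and checked against R's built-in AIC(); a minimal sketch, reusing the Ht + Leg model from slide 6:

mod <- lm(hipcenter ~ Ht + Leg, data = seatpos)
L <- logLik(mod)        ## maximized log-likelihood
k <- attr(L, "df")      ## parameters counted by the likelihood: 3 coefficients + sigma
2*k - 2*as.numeric(L)   ## AIC from its definition
AIC(mod)                ## identical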

11. Akaike Information Criterion
• When the residuals are normally & independently distributed, the AIC can be written in terms of the residual sum of squares (RSS). The exact expression is
    AIC = n ln(2 pi RSS/n) + n + 2(k + 1)
  while the simplified expression, used for model comparison, is
    AIC = n ln(RSS/n) + 2k.
• The two are not equal, but for a fixed dataset they differ only by a constant, so they rank models identically.
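In R this is the distinction between AIC() (exact) and extractAIC() (simplified); a minimal sketch showing that, for a given dataset, the two differ by the same constant whatever the model:

m1 <- lm(hipcenter ~ Ht,       data = seatpos)
m2 <- lm(hipcenter ~ Ht + Leg, data = seatpos)
AIC(m1) - extractAIC(m1)[2]   ## extractAIC() returns c(edf, AIC)
AIC(m2) - extractAIC(m2)[2]   ## same offset as above, so model rankings agree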

12. Akaike Information Criterion
• May need to be corrected for small sample sizes:
    AICc = AIC + 2k(k + 1) / (n - k - 1)
• A variant, the BIC ('Bayesian Information Criterion', also known as the Schwarz criterion), penalizes free parameters more strongly than the AIC. Its exact expression is
    BIC = k ln(n) - 2 ln(L)
  and, when residuals are normally & independently distributed, the simplified expression (again not equal, but equivalent for model comparison) is
    BIC = n ln(RSS/n) + k ln(n).
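Both flavours are available in R: BIC() gives the exact value, while extractAIC(..., k = log(n)) gives the simplified one; passing k = log(n) to step() runs the stepwise search under the BIC penalty instead. A minimal sketch:

mod <- lm(hipcenter ~ Ht + Leg, data = seatpos)
n <- nrow(seatpos)
BIC(mod)                         ## exact BIC
extractAIC(mod, k = log(n))[2]   ## simplified BIC; constant offset, as for the AIC
## step(mod, k = log(n))         ## stepwise selection under the BIC penalty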

13. Multiple Regression: searching for the "best" solution
> mod <- lm(hipcenter ~ ., data=seatpos)   ## starts with everything
> s.mod <- step(mod)   ## by default prunes variables out ('backward')
Start: AIC=283.62
hipcenter ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg
Candidate deletions are ranked in increasing order of AIC:
          Df Sum of Sq   RSS AIC
- Ht       1         5 41267 282
- Weight   1         9 41271 282
- Seated   1        29 41290 282
- HtShoes  1       108 41370 282
- Arm      1       165 41427 282
- Thigh    1       263 41525 282
<none>                 41262 284
- Age      1      2632 43894 284
- Leg      1      2655 43917 284

14. Multiple Regression: searching for the "best" solution
Step: AIC=281.63   ## after removing Ht
hipcenter ~ Age + Weight + HtShoes + Seated + Arm + Thigh + Leg
          Df Sum of Sq   RSS AIC
- Weight   1        11 41278 280
- Seated   1        31 41297 280
- Arm      1       161 41427 280
- Thigh    1       269 41536 280
- HtShoes  1       972 42239 281
<none>                 41267 282
- Leg      1      2665 43931 282
- Age      1      2809 44075 282

15. Multiple Regression: searching for the "best" solution
Step: AIC=279.64   ## after removing Ht & Weight
hipcenter ~ Age + HtShoes + Seated + Arm + Thigh + Leg
          Df Sum of Sq   RSS AIC
- Seated   1        35 41313 278
- Arm      1       156 41434 278
- Thigh    1       285 41563 278
- HtShoes  1       975 42253 279
<none>                 41278 280
- Leg      1      2661 43939 280
- Age      1      3012 44290 280

16. Multiple Regression: searching for the "best" solution
Step: AIC=277.67   ## after removing Ht, Weight & Seated
hipcenter ~ Age + HtShoes + Arm + Thigh + Leg
          Df Sum of Sq   RSS AIC
- Arm      1       172 41485 276
- Thigh    1       345 41658 276
- HtShoes  1      1853 43166 277
<none>                 41313 278
- Leg      1      2871 44184 278
- Age      1      2977 44290 278

Step: AIC=275.83   ## after removing Arm as well
hipcenter ~ Age + HtShoes + Thigh + Leg
          Df Sum of Sq   RSS AIC
- Thigh    1       473 41958 274
<none>                 41485 276
- HtShoes  1      2341 43826 276
- Age      1      3501 44986 277
- Leg      1      3592 45077 277

17. Multiple Regression: searching for the "best" solution
Step: AIC=274.26   ## after removing Thigh too
hipcenter ~ Age + HtShoes + Leg
          Df Sum of Sq   RSS AIC
<none>                 41958 274
- Age      1      3109 45067 275
- Leg      1      3476 45434 275
- HtShoes  1      4219 46176 276
> summary(s.mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 456.2137   102.8078   4.438  9.1e-05 ***
Age           0.5998     0.3779   1.587   0.1217
HtShoes      -2.3023     1.2452  -1.849   0.0732 .
Leg          -6.8297     4.0693  -1.678   0.1024
---
Residual standard error: 35.13 on 34 degrees of freedom
Multiple R-squared: 0.6813, Adjusted R-squared: 0.6531
F-statistic: 24.22 on 3 and 34 DF, p-value: 1.437e-08

18. Multiple Regression: searching for the "best" solution
> anova(s.mod)
Response: hipcenter
          Df Sum Sq Mean Sq F value    Pr(>F)
Age        1   5541    5541  4.4904   0.04147 *
HtShoes    1  80664   80664 65.3647 1.996e-09 ***
Leg        1   3476    3476  2.8169   0.10244
Residuals 34  41958    1234
> drop1(s.mod, test="F")
hipcenter ~ Age + HtShoes + Leg
        Df Sum of Sq   RSS AIC F value   Pr(F)
<none>               41958 274
Age      1      3109 45067 275  2.5192 0.12173
HtShoes  1      4219 46176 276  3.4185 0.07318 .
Leg      1      3476 45434 275  2.8169 0.10244
Note that anova() tests terms sequentially, each added to those listed above it, while drop1() tests each term as if entered last; this is why the two tables disagree for Age and HtShoes.

19. Multiple Regression
> plot(hipcenter ~ s.mod$fit)
> abline(0, 1, col="red")
> null <- lm(hipcenter ~ 1, data=seatpos)   ## starting from a null model
> s.mod <- step(null, ~. +Age +Weight +Ht +HtShoes +Seated +Arm +Thigh +Leg, direction="forward")
## intermediate steps removed
Step: AIC=274.24
hipcenter ~ Ht + Leg + Age
          Df Sum of Sq   RSS AIC
<none>                 41938 274
+ Thigh    1       373 41565 276
+ Arm      1       257 41681 276
+ Seated   1       121 41817 276
+ Weight   1        47 41891 276
+ HtShoes  1        13 41925 276
There is no guarantee that the forward & backward searches will find the same solution: here the forward search ends on Ht + Leg + Age (AIC 274.24), not the Age + HtShoes + Leg model found backward (AIC 274.26).
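One way to hedge against this path-dependence (not covered in the slides) is step() with direction="both", which lets dropped terms re-enter and added terms leave at every stage. A minimal sketch:

null <- lm(hipcenter ~ 1, data = seatpos)
s.both <- step(null,
               scope = ~ Age + Weight + HtShoes + Ht + Seated + Arm + Thigh + Leg,
               direction = "both")
formula(s.both)   ## may agree with the forward answer, the backward one, or neither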

20. Multiple Regression
Another example: how is ozone concentration in the atmosphere related to solar radiation, ambient temperature & wind speed?
> ozone.pollution <- read.table("ozone.data.txt", header=T)
> dim(ozone.pollution)
[1] 111   4
> names(ozone.pollution)
[1] "rad"   "temp"  "wind"  "ozone"
> attach(ozone.pollution)
> pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2)
> model <- lm(ozone ~ ., data=ozone.pollution)

21. Multiple Regression
> summary(model)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.23208   23.04204  -2.788  0.00628 **
rad           0.05980    0.02318   2.580  0.01124 *
temp          1.65121    0.25341   6.516 2.43e-09 ***
wind         -3.33760    0.65384  -5.105 1.45e-06 ***
---
Residual standard error: 21.17 on 107 degrees of freedom
Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952
F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16
> drop1(model, test="F")
ozone ~ rad + temp + wind
       Df Sum of Sq   RSS AIC F value     Pr(F)
<none>              47964 682
rad     1      2984 50948 686  6.6565   0.01124 *
temp    1     19032 66996 717 42.4567 2.429e-09 ***
wind    1     11680 59644 704 26.0567 1.450e-06 ***

22. Multiple Regression
> plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration")
> abline(0, 1, col="red", lwd=2)
> shapiro.test(model$res)
        Shapiro-Wilk normality test
data:  model$res
W = 0.9173, p-value = 3.704e-06
> plot(model, which=1)
The residuals clearly depart from normality, so the strictly linear model is not adequate.

23. Multiple Regression: which predictor variable is non-linearly related to ozone?
> library(car)
> cr.plot(model, rad, pch=16, main="")
> cr.plot(model, temp, pch=16, main="")
> cr.plot(model, wind, pch=16, main="")
[Component + residual plots for rad, temp and wind]
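A side note, not from the slides: in recent releases of the car package these functions are spelled crPlot()/crPlots() rather than cr.plot(); the equivalent one-liner for the same fitted model would be:

library(car)
crPlots(model)   ## component + residual plot for each predictor in one call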

24. Polynomial Regression
When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees.
> model2 <- lm(ozone ~ poly(rad,2) + poly(temp,2) + poly(wind,2))
> summary(model2)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       42.099      1.735  24.269  < 2e-16 ***
poly(rad, 2)1     65.085     19.379   3.359  0.00110 **
poly(rad, 2)2    -16.259     20.174  -0.806  0.42213
poly(temp, 2)1   142.708     23.502   6.072 2.09e-08 ***
poly(temp, 2)2    56.043     19.138   2.928  0.00419 **
poly(wind, 2)1  -125.800     21.391  -5.881 4.99e-08 ***
poly(wind, 2)2    88.636     19.199   4.617 1.12e-05 ***
---
Residual standard error: 18.28 on 104 degrees of freedom
Multiple R-squared: 0.7148, Adjusted R-squared: 0.6984
F-statistic: 43.44 on 6 and 104 DF, p-value: < 2.2e-16
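One detail about poly() that the slide leaves implicit: by default it builds orthogonal polynomials, so the printed coefficients are not the familiar b1*x + b2*x^2 ones; raw = TRUE gives the raw powers, and either way the fitted curve is the same. A minimal sketch:

orth <- lm(ozone ~ poly(temp, 2),             data = ozone.pollution)
raw  <- lm(ozone ~ poly(temp, 2, raw = TRUE), data = ozone.pollution)
all.equal(fitted(orth), fitted(raw))   ## TRUE: same fit, different parameterization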

25. Polynomial Regression
But how many degrees should we choose?
> extractAIC(model2)   ## simplified version of the AIC
[1]   7.0000 651.8105
> model3 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,2))
> extractAIC(model3)
[1]   6.0000 650.5016
> model4 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(model4)
[1]   7.0000 648.0394
> model5 <- lm(ozone ~ rad + poly(temp,2) + poly(wind,3))
> extractAIC(model5)
[1]   7.0000 651.6489
> model6 <- lm(ozone ~ rad + poly(temp,3) + poly(wind,3))
> extractAIC(model6)
[1]   8.0000 649.0149
> extractAIC(model)   ## original, strictly linear model
[1]   4.0000 681.6233
Best model: model4, with the lowest AIC (648.04).

26. Polynomial Regression
> plot(model4, which=1)
> shapiro.test(model4$res)
        Shapiro-Wilk normality test
data:  model4$res
W = 0.9309, p-value = 2.267e-05
> plot(model4, which=2)
The residuals are still far from normal, suggesting the response itself should be transformed.

27. Polynomial Regression: finding the best transformation of our response variable
> library(MASS)
> boxcox(model4, plotit=T)
> boxcox(model4, plotit=T, lambda=seq(0, 1, by=.1))
The profile likelihood peaks near lambda = 1/3, suggesting a cube-root transformation:
> new.ozone <- ozone^(1/3)
> mod4 <- lm(new.ozone ~ rad + poly(temp,3) + poly(wind,2))
> extractAIC(mod4)
[1]   7.0000 -162.6569
(This AIC cannot be compared with the earlier values: the response has changed, so the likelihood is on a different scale.)
> shapiro.test(mod4$res)
        Shapiro-Wilk normality test
data:  mod4$res
W = 0.9899, p-value = 0.5855
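The peak of the Box-Cox profile likelihood can also be located programmatically rather than by eye; a minimal sketch (boxcox() returns the lambda grid as x and the profile log-likelihood as y):

library(MASS)
bc <- boxcox(model4, plotit = FALSE, lambda = seq(0, 1, by = 0.01))
bc$x[which.max(bc$y)]   ## close to 1/3, motivating the cube-root transformation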

28. Polynomial Regression
> par(mfrow=c(2,2))
> plot(mod4)
> anova(mod4)
Analysis of Variance Table
Response: new.ozone
               Df Sum Sq Mean Sq F value    Pr(>F)
rad             1 15.531  15.531  71.467 1.859e-13 ***
poly(temp, 3)   3 40.947  13.649  62.804 < 2.2e-16 ***
poly(wind, 2)   2  8.129   4.065  18.703 1.152e-07 ***
Residuals     104 22.602   0.217
> summary(mod4)
Residual standard error: 0.4662 on 104 degrees of freedom
Multiple R-squared: 0.7408, Adjusted R-squared: 0.7259
F-statistic: 49.55 on 6 and 104 DF, p-value: < 2.2e-16

29. Polynomial Regression
> par(mfrow=c(1,1))
> plot(new.ozone ~ mod4$fitted)
> abline(0, 1, col="red", lwd=2)
> cr.plot(mod4, rad, pch=16, main="")
> cr.plot(mod4, poly(temp,3), pch=16, main="")
> cr.plot(mod4, poly(wind,2), pch=16, main="")
[Component + residual plots for rad, temp and wind]
