310 likes | 456 Views
Advanced Research Skills. Lecture 4 Linear Models III. Olivier MISSA, om502@york.ac.uk. Outline. "Refresher" on different types of model: Multiple regression Polynomial regression Model building Finding the "best" model. Multiple Regression.
E N D
Advanced Research Skills Lecture 4 Linear Models III Olivier MISSA, om502@york.ac.uk
Outline • "Refresher" on different types of model: • Multiple regression • Polynomial regression • Model building Finding the "best" model.
Multiple Regression • When more than one continuous or discrete variables are used to predict a response variable. • Example: The 38 car drivers dataset > data (seatpos) ## from faraway package > attach(seatpos) > names(seatpos) [1] "Age" "Weight" "HtShoes" "Ht" "Seated" [6] "Arm" "Thigh" "Leg" "hipcenter" > summary(lm(hipcenter ~ Age)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -192.9645 24.3015 -7.940 2.00e-09 *** Age 0.7963 0.6331 1.258 0.217 ... Multiple R-squared: 0.0421, Adjusted R-squared: 0.01549 ...
Multiple Regression > summary(lm(hipcenter ~ Age)) Estimate Std. Error t value Pr(>|t|) Age 0.7963 0.6331 1.258 0.217 Multiple R-squared: 0.0421 > summary(lm(hipcenter ~ Weight)) Estimate Std. Error t value Pr(>|t|) Weight -1.0674 0.2134 -5.002 1.49e-05 *** Multiple R-squared: 0.41 > summary(lm(hipcenter ~ HtShoes)) Estimate Std. Error t value Pr(>|t|) HtShoes -4.2621 0.5391 -7.907 2.21e-09 *** Multiple R-squared: 0.6346 > summary(lm(hipcenter ~ Ht)) Estimate Std. Error t value Pr(>|t|) Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383 > summary(lm(hipcenter ~ Seated)) Estimate Std. Error t value Pr(>|t|) Seated -8.844 1.375 -6.432 1.84e-07 *** Multiple R-squared: 0.5347
Multiple Regression > summary(lm(hipcenter ~ Arm)) Estimate Std. Error t value Pr(>|t|) Arm -10.351 2.391 -4.329 0.000114 *** Multiple R-squared: 0.3423 > summary(lm(hipcenter ~ Thigh)) Estimate Std. Error t value Pr(>|t|) Thigh -9.100 2.069 -4.398 9.29e-05 *** Multiple R-squared: 0.3495 > summary(lm(hipcenter ~ Leg)) Estimate Std. Error t value Pr(>|t|) Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196 Leg Ht Age
No Longer Significant Still Significant Multiple Regression slope parameters change > summary(mod <- lm(hipcenter ~ Ht + Leg)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 491.244 99.543 4.935 1.95e-05 *** Ht -2.565 1.268 -2.022 0.0509 Leg -6.136 4.164 -1.473 0.1496 --- Residual standard error: 35.79 on 35 degrees of freedom Multiple R-squared: 0.6594, Adjusted R-squared: 0.6399 F-statistic: 33.88 on 2 and 35 DF, p-value: 6.517e-09 > summary(lm(hipcenter ~ Ht)) Estimate Std. Error t value Pr(>|t|) Ht -4.2650 0.5351 -7.970 1.83e-09 *** Multiple R-squared: 0.6383 > summary(lm(hipcenter ~ Leg)) Estimate Std. Error t value Pr(>|t|) Leg -13.795 1.801 -7.658 4.59e-09 *** Multiple R-squared: 0.6196
Beware of strong collinearity Multiple Regression > drop1(mod, test="F") Single term deletions Model: hipcenter ~ Ht + Leg Df Sum of Sq RSS AIC F value Pr(F) <none> 44835 275 Ht 1 5236 50071 277 4.0877 0.05089 . Leg 1 2781 47616 275 2.1711 0.14957 > cor(Ht,Leg) [1] 0.9097524 > plot(Ht ~ Leg)
No Added Variable Significantly Improves the Model Beware of strong collinearity Multiple Regression > add1( lm(hipcenter~Ht), ~. +Age +Weight +HtShoes +Seated +Arm +Thigh +Leg, test="F") Single term additions Model: hipcenter ~ Ht Df Sum of Sq RSS AIC F value Pr(F) <none> 47616 275 Age 1 2354 45262 275 1.8199 0.1860 Weight 1 196 47420 277 0.1446 0.7061 HtShoes 1 26 47590 277 0.0189 0.8913 Seated 1 102 47514 277 0.0748 0.7861 Arm 1 76 47540 277 0.0558 0.8147 Thigh 1 5 47611 277 0.0034 0.9538 Leg 1 2781 44835 275 2.1711 0.1496
Comparing models • How can we compare models on an equal footing ? (regardless of the number of parameters). • The multiple-R2 can only increase as more variables enter a model • (because the RSS can only decrease). • The adjusted R2 corrects for the different number of parameters to some extent. no good to compare models
Maximized value of the Likelihood function Number of Parameters Akaike Information Criterion • Invented by Hirotugu Akaike in 1971 • after a few weeks of sleepless nights stressing over a conference presentation. • The AIC originally called 'An Information Criterion'penalizes the likelihood of a model accordingto the number of parameters being estimated. The lower the AIC value the better the model is
Akaike Information Criterion • Invented by Hirotugu Akaike in 1971 • after a few weeks of sleepless nights stressing over a conference presentation. • When residuals are normally & independently distributed: Exact expression Simplified expression(not equal)
Akaike Information Criterion • May need to be corrected for small sample sizes • A variant, the BIC,'Bayesian Information Criterion'(Schwartz Criterion) penalizes free parameters more strongly than AIC • When residuals are normally & independently distributed: Exact expression Simplified expression(not equal)
Deletions ranked in increasing order of AIC Searching for the "best solution" Multiple Regression > mod <- lm(hipcenter ~. , data=seatpos) ## starts with everything > s.mod <- step(mod) ## by default prunes variables out ('backward') Start: AIC=283.62 hipcenter ~ Age +Weight +HtShoes +Ht +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Ht 1 5 41267 282 - Weight 1 9 41271 282 - Seated 1 29 41290 282 - HtShoes 1 108 41370 282 - Arm 1 165 41427 282 - Thigh 1 263 41525 282 <none> 41262 284 - Age 1 2632 43894 284 - Leg 1 2655 43917 284
Searching for the "best solution" Multiple Regression Step: AIC=281.63 ## after removing Ht hipcenter ~ Age +Weight +HtShoes +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Weight 1 11 41278 280 - Seated 1 31 41297 280 - Arm 1 161 41427 280 - Thigh 1 269 41536 280 - HtShoes 1 972 42239 281 <none> 41267 282 - Leg 1 2665 43931 282 - Age 1 2809 44075 282
Searching for the "best solution" Multiple Regression Step: AIC=279.64 ## after removing Ht & Weight hipcenter ~ Age +HtShoes +Seated +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Seated 1 35 41313 278 - Arm 1 156 41434 278 - Thigh 1 285 41563 278 - HtShoes 1 975 42253 279 <none> 41278 280 - Leg 1 2661 43939 280 - Age 1 3012 44290 280
Searching for the "best solution" Multiple Regression Step: AIC=277.67 ## after removing Ht, Weight & Seated hipcenter ~ Age +HtShoes +Arm +Thigh +Leg Df Sum of Sq RSS AIC - Arm 1 172 41485 276 - Thigh 1 345 41658 276 - HtShoes 1 1853 43166 277 <none> 41313 278 - Leg 1 2871 44184 278 - Age 1 2977 44290 278 Step: AIC=275.83 ## after removing Arm as well hipcenter ~ Age + HtShoes + Thigh + Leg Df Sum of Sq RSS AIC - Thigh 1 473 41958 274 <none> 41485 276 - HtShoes 1 2341 43826 276 - Age 1 3501 44986 277 - Leg 1 3592 45077 277
Searching for the "best solution" Multiple Regression Step: AIC=274.26 ## after removing Thigh too hipcenter ~ Age + HtShoes + Leg Df Sum of Sq RSS AIC <none> 41958 274 - Age 1 3109 45067 275 - Leg 1 3476 45434 275 - HtShoes 1 4219 46176 276 > summary(s.mod) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 456.2137 102.8078 4.438 9.1e-05 *** Age 0.5998 0.3779 1.587 0.1217 HtShoes -2.3023 1.2452 -1.849 0.0732 . Leg -6.8297 4.0693 -1.678 0.1024 --- Residual standard error: 35.13 on 34 degrees of freedom Multiple R-squared: 0.6813, Adjusted R-squared: 0.6531 F-statistic: 24.22 on 3 and 34 DF, p-value: 1.437e-08
Searching for the "best solution" Multiple Regression > anova(s.mod) Response: hipcenter Df Sum Sq Mean Sq F value Pr(>F) Age 1 5541 5541 4.4904 0.04147 * HtShoes 1 80664 80664 65.3647 1.996e-09 *** Leg 1 3476 3476 2.8169 0.10244 Residuals 34 41958 1234 > drop1(s.mod, test="F") hipcenter ~ Age + HtShoes + Leg Df Sum of Sq RSS AIC F value Pr(F) <none> 41958 274 Age 1 3109 45067 275 2.5192 0.12173 HtShoes 1 4219 46176 276 3.4185 0.07318 . Leg 1 3476 45434 275 2.8169 0.10244
Multiple Regression > plot(hipcenter ~ s.mod$fit) > abline(0,1, col="red") > null <- lm(hipcenter ~1, data=seatpos) ## starting from a null model > s.mod <- step(null, ~. +Age+Weight+Ht+HtShoes+Seated+Arm+Thigh+Leg, direction="forward") ## intermediate steps removed Step: AIC=274.24 hipcenter ~ Ht + Leg + Age Df Sum of Sq RSS AIC <none> 41938 274 + Thigh 1 373 41565 276 + Arm 1 257 41681 276 + Seated 1 121 41817 276 + Weight 1 47 41891 276 + HtShoes 1 13 41925 276 No guarantee that the forward & backward searches will find the same solution
Multiple Regression Another example: How is ozone concentration in the atmosphere related to solar radiation, ambient temperature & wind speed ? > ozone.pollution <- read.table("ozone.data.txt", header=T) > dim(ozone.pollution) [1] 111 4 > names(ozone.pollution) [1] "rad" "temp" "wind" "ozone" > attach(ozone.pollution) > pairs(ozone.pollution, panel=panel.smooth, pch=16, lwd=2) > model <- lm(ozone ~ ., data=ozone.pollution)
Multiple Regression > summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -64.23208 23.04204 -2.788 0.00628 ** rad 0.05980 0.02318 2.580 0.01124 * temp 1.65121 0.25341 6.516 2.43e-09 *** wind -3.33760 0.65384 -5.105 1.45e-06 *** --- Residual standard error: 21.17 on 107 degrees of freedom Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952 F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16 > drop1(model, test="F") ozone ~ rad + temp + wind Df Sum of Sq RSS AIC F value Pr(F) <none> 47964 682 rad 1 2984 50948 686 6.6565 0.01124 * temp 1 19032 66996 717 42.4567 2.429e-09 *** wind 1 11680 59644 704 26.0567 1.450e-06 ***
Multiple Regression > plot(ozone ~ model$fitted, pch=16, xlab="Model Predictions", ylab="Ozone Concentration") > abline(0,1, col="red", lwd=2) > shapiro.test(model$res) Shapiro-Wilk normality test data: model$res W = 0.9173, p-value = 3.704e-06 > plot(model, which=1)
rad temp wind Multiple Regression Which predictor variable is non-linearly related to Ozone ? > library(car) > cr.plot(model, rad, pch=16, main="") > cr.plot(model, temp, pch=16, main="") > cr.plot(model, wind, pch=16, main="")
Polynomial Regression When the trend between a predictor variable and the response variable is not linear, the curvature can be "captured" using polynomials of various degrees. > model2 <- lm(ozone ~ poly(rad,2)+poly(temp,2)+poly(wind,2)) > summary(model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 42.099 1.735 24.269 < 2e-16 *** poly(rad, 2)1 65.085 19.379 3.359 0.00110 ** poly(rad, 2)2 -16.259 20.174 -0.806 0.42213 poly(temp, 2)1 142.708 23.502 6.072 2.09e-08 *** poly(temp, 2)2 56.043 19.138 2.928 0.00419 ** poly(wind, 2)1 -125.800 21.391 -5.881 4.99e-08 *** poly(wind, 2)2 88.636 19.199 4.617 1.12e-05 *** --- Residual standard error: 18.28 on 104 degrees of freedom Multiple R-squared: 0.7148, Adjusted R-squared: 0.6984 F-statistic: 43.44 on 6 and 104 DF, p-value: < 2.2e-16
Polynomial Regression But how many degrees should we choose ? > extractAIC(model2) ## Simplified Version [1] 7.0000 651.8105 > model3 <- lm(ozone ~ rad +poly(temp,2) +poly(wind,2)) > extractAIC(model3) [1] 6.0000 650.5016 > model4 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,2) ) > extractAIC(model4) [1] 7.0000 648.0394 > model5 <- lm(ozone ~ rad+poly(temp,2)+poly(wind,3) ) > extractAIC(model5) [1] 7.0000 651.6489 > model6 <- lm(ozone ~ rad+poly(temp,3)+poly(wind,3) ) > extractAIC(model6) [1] 8.0000 649.0149 > extractAIC(model) ## original, strictly linear model [1] 4.0000 681.6233 Best Model
Polynomial Regression > plot(model4, which=1) > shapiro.test(model4$res) Shapiro-Wilk normality test data: model4$res W = 0.9309, p-value = 2.267e-05 > plot(model4, which=2)
Polynomial Regression Finding the best transformation of our response variable > library(MASS) > boxcox(model4, plotit=T) > boxcox(model4, plotit=T, lambda=seq(0,1,by=.1) ) > new.ozone <- ozone^(1/3) > mod4 <- lm(new.ozone ~ rad +poly(temp,3) +poly(wind,2) ) > extractAIC(mod4) [1] 7.0000 -162.6569 > shapiro.test(mod4$res) Shapiro-Wilk normality test data: mod4$res W = 0.9899, p-value = 0.5855
Polynomial Regression > par(mfrow=c(2,2)) > plot(mod4) > anova(mod4) Analysis of Variance Table Response: new.ozone Df Sum Sq Mean Sq F value Pr(>F) rad 1 15.531 15.531 71.467 1.859e-13 *** poly(temp, 3) 3 40.947 13.649 62.804 < 2.2e-16 *** poly(wind, 2) 2 8.129 4.065 18.703 1.152e-07 *** Residuals 104 22.602 0.217 > summary(mod4) Residual standard error: 0.4662 on 104 degrees of freedom Multiple R-squared: 0.7408, Adjusted R-squared: 0.7259 F-statistic: 49.55 on 6 and 104 DF, p-value: < 2.2e-16
temp wind rad Polynomial Regression > par(mfrow=c(1,1)) > plot(new.ozone ~ mod4$fitted) > abline(0,1, col="red", lwd=2) > cr.plot(mod4, rad, pch=16, main="") > cr.plot(mod4, poly(temp,3), pch=16, main="") > cr.plot(mod4, poly(wind,2), pch=16, main="")