1 / 25

Lecture 3 Linear Models II

Advanced Research Skills. Lecture 3 Linear Models II. Olivier MISSA, om502@york.ac.uk. Outline. "Refresher" on different types of model: Two-way Anova Ancova. Two-Way Anova. When your observations are categorized according to two different factors.

gyda
Download Presentation

Lecture 3 Linear Models II

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Advanced Research Skills Lecture 3 Linear Models II Olivier MISSA, om502@york.ac.uk

  2. Outline • "Refresher" on different types of model: • Two-way Anova • Ancova

  3. Two-Way Anova • When your observations are categorized according to two different factors. • Example: Blood calcium concentration in a population of birds,10 males and 10 females randomly split in 2 groups, and given either hormonal treatment or not. > dataset <- read.table("2wayAnova.csv", header=T, sep=",") > attach(dataset) > names(dataset) [1] "plasma" "sex" "hormone" > tapply(plasma, list(sex, hormone), length) no yes female 5 5 male 5 5 balanced design equal replications among groupings

  4. Two-Way Anova > stripchart(plasma ~ hormone, + col=c("orange","blue") > h.ave <- tapply(plasma, hormone, mean) > h.ave no yes 13.50 30.15 > stripchart(plasma ~ sex, + col=c("red","green") > s.ave <- tapply(plasma, sex, mean) > s.ave female male 23.70 19.95 30.15 13.50 23.70 19.95

  5. Two-Way Anova > h.ave no yes 13.50 30.15 > gd.ave <- mean(plasma); gd.ave [1] 21.825 > summary(aov(plasma ~ hormone)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 56.501 5.883e-07 *** Residuals 18 441.59 24.53 > SS.h <- 10*(21.825-13.50)^2 + ## sum of squares due to 10*(21.825-30.15)^2 ## hormone treatment [1] 1386.112 > TSS <- sum((plasma-gd.ave)^2) [1] 1827.697 30.15 13.50

  6. Two-Way Anova > s.ave female male 23.70 19.95 > gd.ave [1] 21.825 > summary(aov(plasma ~ sex)) Df Sum Sq Mean Sq F value Pr(>F) sex 1 70.31 70.31 0.7202 0.4072 Residuals 18 1757.39 97.63 > SS.s <- 10*(21.825-23.70)^2 + ## sum of squares due to sex 10*(21.825-19.95)^2 [1] 70.3125 > TSS <- sum((plasma-gd.ave)^2) [1] 1827.697 23.70 19.95

  7. Two-Way Anova Analysing the two factors at the same time can make a big difference to the outcome > summary(aov(plasma ~ hormone)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 56.501 5.883e-07 *** Residuals 18 441.59 24.53 > summary(aov(plasma ~ sex)) Df Sum Sq Mean Sq F value Pr(>F) sex 1 70.31 70.31 0.7202 0.4072 Residuals 18 1757.39 97.63 > summary(aov(plasma ~ hormone + sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 63.4680 3.863e-07 *** sex 1 70.31 70.31 3.2195 0.09057 . Residuals 17 371.27 21.84

  8. Two-Way Anova Including an interaction term is important too > summary(aov(plasma ~ hormone + sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 63.4680 3.863e-07 *** sex 1 70.31 70.31 3.2195 0.09057 . Residuals 17 371.27 21.84 > summary(aov(plasma ~ hormone * sex)) Df Sum Sq Mean Sq F value Pr(>F) hormone 1 1386.11 1386.11 60.5336 7.943e-07 *** sex 1 70.31 70.31 3.0706 0.09886 . hormone:sex 1 4.90 4.90 0.2140 0.64987 Residuals 16 366.37 22.90

  9. Two-Way Anova What is the meaning of this interaction term ? Assess whether the impact of the two factors are mutually interdependent. > gd.ave [1] 21.825 > h.diff <- h.ave - gd.ave no yes -8.325 8.325 > s.diff <- s.ave - gd.ave female male 1.875 -1.875 > predicted <- predict(aov(plasma~hormone+sex)) 1 2 3 4 5 6 7 8 ... 15.375 15.375 15.375 15.375 15.375 11.625 11.625 11.625 ... 11 12 13 14 15 16 17 18 ... 32.025 32.025 32.025 32.025 32.025 28.275 28.275 28.275 ...

  10. Two-Way Anova What is the meaning of the interaction term ? Assess whether the impact of the two factors are mutually interdependent. > pred.ave <- tapply(predicted, list(sex, hormone), mean) > pred.ave no yes female 15.375 32.025 male 11.625 28.275 > averages <- tapply(plasma, list(sex, hormone), mean) > averages no yes female 14.88 32.52 male 12.12 27.78 > sum((averages-pred.ave)^2)*5 [1] 4.9005 ## SS due to interaction

  11. Observed averages Predicted averages Two-Way Anova Graphical representations > averages no yes female 14.88 32.52 male 12.12 27.78 > barplot(averages, beside=T, col=c("red","green"), ...) > pred.ave no yes female 15.375 32.025 male 11.625 28.275 > barplot(pred.ave, beside=T, col=c("red2","green3"), ...)

  12. Two-Way Anova Graphical representations > interaction.plot(hormone, sex, plasma, col=c("red","green2"), ... ) > interaction.plot(sex, hormone, plasma, col=c("orange","blue"), ...)

  13. Two-Way Anova as a linear model > mod <- lm(plasma ~ sex + hormone) > summary(mod) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 15.38 1.81 8.495 1.60e-07 *** sexmale -3.75 2.09 -1.794 0.0906 . hormoneyes 16.65 2.09 7.967 3.86e-07 *** --- Residual standard error: 4.673 on 17 degrees of freedom Multiple R-squared: 0.7969, Adjusted R-squared: 0.773 F-statistic: 33.34 on 2 and 17 DF, p-value: 1.307e-06 > pred.ave no yes female 15.375 32.025 male 11.625 28.275

  14. Two-Way Anova as a linear model > mod2 <- lm(plasma ~ sex * hormone) > summary(mod2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 14.880 2.140 6.953 3.25e-06 *** sexmale -2.760 3.026 -0.912 0.375 hormoneyes 17.640 3.026 5.829 2.57e-05 *** sexmale:hormoneyes -1.980 4.280 -0.463 0.650 --- Residual standard error: 4.785 on 16 degrees of freedom Multiple R-squared: 0.7995, Adjusted R-squared: 0.762 F-statistic: 21.27 on 3 and 16 DF, p-value: 7.89e-06 > averages no yes female 14.88 32.52 male 12.12 27.78

  15. Are the assumptions met ? • 1: Is the response variable continuous ? • 2: Are the residuals normally distributed ? YES ! > shapiro.test(mod$residuals) Shapiro-Wilk normality test data: mod$residuals W = 0.9788, p-value = 0.9172 > plot(mod, which=2) ## qqplot of std.residuals Answer: YES !

  16. 3a : Are the residuals independent and identically distributed? > plot(mod, which=1) ## residuals vs fitted values. > plot(mod, which=3) ## sqrt(abs(standardized(residuals))) vs fitted values. Answer: Female treated with hormone seem to vary more in their response

  17. Ancova • When one predictor variable is continuous or discrete, • and the other predictor variable is categorical. • Example: Fruit production in a biennial plant,40 plants were allocated to two treatments, grazed and ungrazed. The grazed plants were exposed to rabbits during the first two weeks of stem elongation, then allowed to regrow protected by a fence. > dataset2 <- read.table("ipomopsis.txt", header=T, sep="\t") > attach(dataset2) > names(dataset2) [1] "Root" "Fruit" "Grazing" > str(dataset2) 'data.frame': 40 obs. of 3 variables: $ Root : num 6.22 6.49 4.92 5.13 5.42 ... $ Fruit : num 59.8 61.0 14.7 19.3 34.2 ... $ Grazing: Factor w/ 2 levels "Grazed","Ungrazed": 2 2 2 ...

  18. 67.9 50.9 Ancova > summary(dataset2) Root Fruit Grazing Min. : 4.43 Min. : 14.7 Grazed :20 1st Qu.: 6.08 1st Qu.: 41.1 Ungrazed:20 Median : 7.12 Median : 60.9 Mean : 7.18 Mean : 59.4 3rd Qu.: 8.51 3rd Qu.: 76.2 Max. :10.25 Max. :116.0 > stripchart(Fruit ~ Grazing, col=c("blue","green2"), ...) > summary( lm(Fruit ~ Grazing) ) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 67.941 5.236 12.976 1.54e-15 *** GrazingUngrazed -17.060 7.404 -2.304 0.0268 * --- Residual standard error: 23.41 on 38 degrees of freedom Multiple R-squared: 0.1226, Adjusted R-squared: 0.09949 F-statistic: 5.309 on 1 and 38 DF, p-value: 0.02678

  19. Ancova > plot(Fruit ~ Root, col=c("blue","green2")[as.numeric(Grazing)], ...) ## a few graphical functions omitted > summary( lm(Fruit ~ Root * Grazing) ) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -125.173 12.811 -9.771 1.15e-11 *** Root 23.240 1.531 15.182 < 2e-16 *** GrazingUngrazed 30.806 16.842 1.829 0.0757 . Root:GrazingUngrazed 0.756 2.354 0.321 0.7500 --- Residual standard error: 6.831 on 36 degrees of freedom Multiple R-squared: 0.9293, Adjusted R-squared: 0.9234 F-statistic: 157.6 on 3 and 36 DF, p-value: < 2.2e-16 slope 24.00 slope 23.24

  20. The variables order matters in the F table Ancova > anova( lm(Fruit ~ Root * Grazing) ) Analysis of Variance Table Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Root 1 16795.0 16795.0 359.9681 < 2.2e-16 *** Grazing 1 5264.4 5264.4 112.8316 1.209e-12 *** Root:Grazing 1 4.8 4.8 0.1031 0.75 Residuals 36 1679.6 46.7 --- > anova( lm(Fruit ~ Grazing * Root) ) Analysis of Variance Table Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Grazing 1 2910.4 2910.4 62.3795 2.262e-09 *** Root 1 19148.9 19148.9 410.4201 < 2.2e-16 *** Grazing:Root 1 4.8 4.8 0.1031 0.75 Residuals 36 1679.6 46.7 ---

  21. A safer test of significance, dropping each term from the model Ancova > drop1( lm(Fruit ~ Root * Grazing), test="F" ) Single term deletions Model: Fruit ~ Root * Grazing Df Sum of Sq RSS AIC F value Pr(F) <none> 1679.65 157.50 Root:Grazing 1 4.81 1684.46 155.61 0.1031 0.75 > drop1( lm(Fruit ~ Root + Grazing), test="F" ) Single term deletions Model: Fruit ~ Root + Grazing Df Sum of Sq RSS AIC F value Pr(F) <none> 1684.5 155.6 Root 1 19148.9 20833.4 254.2 420.62 < 2.2e-16 *** Grazing 1 5264.4 6948.8 210.3 115.63 6.107e-13 *** Akaike Information Criteria

  22. Ancova After simplifying the model, the p-values in the summary table agree with those in the F table > mod3 <- lm(Fruit ~ Root + Grazing) > summary(mod3) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -127.829 9.664 -13.23 1.35e-15 *** Root 23.560 1.149 20.51 < 2e-16 *** GrazingUngrazed 36.103 3.357 10.75 6.11e-13 *** --- Residual standard error: 6.747 on 37 degrees of freedom Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252 F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16 > anova(mod3) Analysis of Variance Table Response: Fruit Df Sum Sq Mean Sq F value Pr(>F) Root 1 16795.0 16795.0 368.91 < 2.2e-16 *** Grazing 1 5264.4 5264.4 115.63 6.107e-13 *** Residuals 37 1684.5 45.5

  23. Are the assumptions met ? • 1: Is the response variable continuous ? • 2: Are the residuals normally distributed ? YES ! > shapiro.test(mod3$residuals) Shapiro-Wilk normality test data: mod3$resid W = 0.9736, p-value = 0.4637 > plot(mod3, which=2) ## qqplot of std.residuals Answer: YES !

  24. 3a : Are the residuals independent and identically distributed? > plot(mod3, which=1) ## residuals vs fitted values. > plot(mod3, which=3) ## sqrt(abs(standardized(residuals))) vs fitted values. Answer: Looks OK

  25. Any major influential points ? > plot(mod3, which=5) ## Residuals vs Leverages. Answer: No !

More Related