190 likes | 466 Views
BIOL 582. Lecture Set 13 Combining factors and covariates Analysis of Covariance (ANCOVA). BIOL 582. Linear Model Set-up. Often, biological research is concerned with the relationship of two or more variables
E N D
BIOL 582 Lecture Set 13 Combining factors and covariates Analysis of Covariance (ANCOVA)
BIOL 582 Linear Model Set-up • Often, biological research is concerned with the relationship of two or more variables • Often, biological research is concerned with the difference in means between groups • Often, biological research is concerned in differences in the relationships of two or more variables, between two or more groups y y y x x x No difference in trends; comparison of predicted y between red and blue groups is the same for any value of x. Can just compare means Red group = positive trend; blue group = no trend; comparison of predicted y between red and blue groups is the not the same for any value of x. Red group = steep positive trend; blue group = shallow positive trend; comparison of predicted y between red and blue groups is the not the same for any value of x.
BIOL 582 Linear Model Set-up • Often, biological research is concerned with the relationship of two or more variables • Often, biological research is concerned with the difference in means between groups • Often, biological research is concerned in differences in the relationships of two or more variables, between two or more groups • Consider Bumpus data • The procedure for analyzing heterogeneity in trends (slopes) is called analysis of covariance (ANCOVA). It is ANOVA done a specific way on a linear model that contains both factors and covariates > rm(list=ls()) > > bumpus<-read.csv("bumpus.csv") > attach(bumpus) > group<-as.factor(paste(sex,survived,sep="-"))
BIOL 582 Linear Model Set-up • One starts ANCOVA by creating a linear model that has a (group) factor-covariate interaction. ANOVA is done on this model to test the homogeneity of slopes (null hypothesis) • The linear equation is • Which has the model • That produces error The value, x0, is generally 1 for all subjects, and can, therefore, be eliminated from the equation. However, it can also be 0, if one wishes to force the linear model through the origin (where y must be 0, if x is 0). Thus, x0 is a dummy variable which indicates whether to include or exclude a y-intercept. The effect, α, means all group-level effects for distinguishing group differences. This can be thought of as a summation of slopes and dummy variables that adjust the intercept by group.
BIOL 582 Linear Regression Model Set-up • To test the homogeneity of slopes, there is only one reduction (remove interaction) • One should compare the errors of these two models (e.g., via ANOVA) to ascertain if the interaction is important. • One-way ANOVA
BIOL 582 Linear Model Evaluation • There will be one of two outcomes: • Interaction is significant (i.e., variance due to interaction is significantly greater than 0) keep it in the model and analyze differences in trends • Interaction is not significant (i.e., variance due to interaction is not significantly greater than 0) remove it from model and compare means (or any prediction of y for any value of x) between groups, but with the effects of x left in the model. (This last part is important because if there is a general relationship between y and x, one would have greater statistical power to find group differences by having x in the model. This will hopefully become apparent soon.)
BIOL 582 Linear Model Assumptions • These still apply • Assumptions include • Normally distributed residuals (not data) • Homoscedasticity • Independent observations • These are the assumptions of Linear Models!
BIOL 582 Example homogeneity of slopes test • > # ANCOVA for the response, alar extent (AE) • > # the independent variable, total length (TL) • > # and groups defined by sex and survival • > • > lm.full<-lm(AE~TL*group) # will include main effects + interaction • > lm.reduced<-lm(AE~TL + group) # will only include main effects • > • > # residual plots • > • > par(mfrow=c(2,4)) • > plot(lm.full) • > plot(lm.reduced)
BIOL 582 Example homogeneity of slopes test > # ANCOVA for the response, alar extent (AE) > # the independent variable, total length (TL) > # and groups defined by sex and survival > > lm.full<-lm(AE~TL*group) # will include main effects + interaction > lm.reduced<-lm(AE~TL + group) # will only include main effects > > # residual plots > > par(mfrow=c(2,4)) > plot(lm.full) > plot(lm.reduced) > > # Shapiro-Wilktests > > shapiro.test(resid(lm.full)) Shapiro-Wilk normality test data: resid(lm.full) W = 0.9876, p-value = 0.2607 > shapiro.test(resid(lm.reduced)) Shapiro-Wilk normality test data: resid(lm.reduced) W = 0.9876, p-value = 0.2616 > > # neither model has assumption concerns > # the similarity in results probably says something about the interaction
BIOL 582 Example homogeneity of slopes test > # ANCOVA for the response, alar extent (AE) > # the independent variable, total length (TL) > # and groups defined by sex and survival > > # For homogeneity of slopes, always wise to do ANOVA like this > > anova(lm.reduced,lm.full) Analysis of Variance Table Model 1: AE ~ TL + group Model 2: AE ~ TL * group Res.Df RSS Df Sum of Sq F Pr(>F) 1 131 1568.8 2 128 1541.4 3 27.361 0.7574 0.5201 Conclusion: having a TL x group interaction does not contribute much to the model. It can be assumed that all groups have similar (parallel) slopes of AE vs. TL. We can remove the interaction term and just used the reduced model.
BIOL 582 Example homogeneity of slopes test > # First, look at the two models to see why > > lm.full Call: lm(formula = AE ~ TL * group) Coefficients: (Intercept) TL groupf-TRUE groupm-FALSE groupm-TRUE TL:groupf-TRUETL:groupm-FALSETL:groupm-TRUE 60.83603 1.14080 50.52820 -3.09614 43.58310 -0.31709 0.02876 -0.24292 > > lm.reduced Call: lm(formula = AE ~ TL + group) Coefficients: (Intercept) TL groupf-TRUE groupm-FALSE groupm-TRUE 79.8819 1.0206 0.4978 2.0041 4.9970 > > # Any parameter of the model that has "TL" in it is an adjustment of the slope > # Any parameter that has only a coefficient that does not include "TL" > # is an adjustment of the intercept For example, the full model indicates for surviving females that the intercept is 60.836 + 50.528 = 111.364 mm and the slope is 1.141 – 0.317 = 0.824 The reduced model indicates that for surviving females, the intercept is 79.882 + 0.498 = 80.380 mm and the slope is 1.021. ANOVA indicates that these equations (with consideration to the error) are relatively similar for making predictions
BIOL 582 Example homogeneity of slopes test Call: lm(formula = AE ~ TL + group) Coefficients: (Intercept) TL groupf-TRUE groupm-FALSE groupm-TRUE 79.8819 1.0206 0.4978 2.0041 4.9970 > > # Any parameter of the model that has "TL" in it is an adjustment of the slope > # Any parameter that has only a coefficient that does not include "TL" > # is an adjustment of the intercept Pop Quiz! What is the linear equation for males that did not survive? What is the linear equation for males that survived? Look at the coefficients. Assign a value of x = 0 for False and x = 1 for True, and create a string of scalars for pertinent coefficients Non surviving males: 1, 1, 0, 1, 0 Surviving males: 1, 1, 0, 0, 1 Then multiply these values times the coefficients to see which ones remain Non surviving males: 79.8819, 1.0206 (TL), 2.0041 Surviving males: 79.8819, 1.0206 (TL), 4.9970 Finally, add up “intercept” components and slope components NS males: AE = (79.8819 + 2.0041) + (1.0206)*TL S males: AE = (79.8819 + 4.9970) + (1.0206)*TL
BIOL 582 ANCOVA • Step 2, provided homogeneity of slopes is not rejected • ANOVA of the reduced model • Review multiple regression, factorial ANOVA to understand the differences between types I and III SS # our ANCOVA model is the reduced model > lm.ancova<-lm(AE~TL + group,x=T) > > # look at the model and the design (first 10 subjects) > lm.ancova Call: lm(formula = AE ~ TL + group, x = T) Coefficients: (Intercept) TL groupf-TRUE groupm-FALSE groupm-TRUE 79.8819 1.0206 0.4978 2.0041 4.9970 > lm.ancova$x (Intercept) TL groupf-TRUE groupm-FALSE groupm-TRUE 1 1 154 0 0 1 2 1 165 0 1 0 3 1 160 0 1 0 4 1 160 0 0 1 5 1 155 0 0 1 6 1 161 0 1 0 7 1 154 0 0 1 8 1 162 0 1 0 9 1 156 0 0 1 10 1 163 0 1 0
BIOL 582 ANCOVA • Step 2, provided homogeneity of slopes is not rejected • ANOVA of the reduced model • Review multiple regression, factorial ANOVA to understand the differences between types I and III SS > # ANOVA with type I SS > anova(lm.ancova) Analysis of Variance Table Response: AE Df Sum Sq Mean Sq F value Pr(>F) TL 1 1964.68 1964.68 164.060 < 2.2e-16 *** group 3 581.57 193.86 16.188 5.22e-09 *** Residuals 131 1568.78 11.98 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > # ANOVA with type III SS > Anova(lm.ancova, type="III") Anova Table (Type III tests) Response: AE Sum SqDf F value Pr(>F) (Intercept) 340.59 1 28.441 4.11e-07 *** TL 1398.07 1 116.745 < 2.2e-16 *** group 581.57 3 16.188 5.22e-09 *** Residuals 1568.78 131 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > # R2 values > SS<-anova(lm.ancova)[,2] > R2<-SS/sum(SS); names(R2)<-rownames(anova(lm.ancova)) > R2 TL group Residuals 0.4774407 0.1413276 0.3812317 Conclusion: the overall AE vs. TL relationship is significantly different from 0. TL accounts for 47.7% percent of the variation in AE. Group differences are also significant (reject the null hypothesis of no group differences), and account for 14.1% of the overall variation
BIOL 582 ANCOVA • Step 2, provided homogeneity of slopes is not rejected • ANOVA of the reduced model • Review multiple regression, factorial ANOVA to understand the differences between types I and III SS > # ANOVA with type I SS > anova(lm.ancova) Analysis of Variance Table Response: AE Df Sum Sq Mean Sq F value Pr(>F) TL 1 1964.68 1964.68 164.060 < 2.2e-16 *** group 3 581.57 193.86 16.188 5.22e-09 *** Residuals 131 1568.78 11.98 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > # ANOVA with type III SS > Anova(lm.ancova, type="III") Anova Table (Type III tests) Response: AE Sum SqDf F value Pr(>F) (Intercept) 340.59 1 28.441 4.11e-07 *** TL 1398.07 1 116.745 < 2.2e-16 *** group 581.57 3 16.188 5.22e-09 *** Residuals 1568.78 131 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > # R2 values > SS<-anova(lm.ancova)[,2] > R2<-SS/sum(SS); names(R2)<-rownames(anova(lm.ancova)) > R2 TL group Residuals 0.4774407 0.1413276 0.3812317 > # FYI > summary(lm.ancova) Call: lm(formula = AE ~ TL + group, x = T) Residuals: Min 1Q Median 3Q Max -10.2823 -2.2040 -0.1516 2.7291 6.8690 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 79.88192 14.97880 5.333 4.11e-07 *** TL 1.02058 0.09446 10.805 < 2e-16 *** groupf-TRUE 0.49775 1.00386 0.496 0.6208 groupm-FALSE 2.00414 0.93783 2.137 0.0345 * groupm-TRUE 4.99700 0.81768 6.111 1.06e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.461 on 131 degrees of freedom Multiple R-squared: 0.6188, Adjusted R-squared: 0.6071 F-statistic: 53.16 on 4 and 131 DF, p-value: < 2.2e-16
BIOL 582 ANCOVA • Step 3, look at the patterns • (r script for plots left as data analysis exercise)
BIOL 582 Multiple Comparisons • It is worth it to look at multiple comparisons done for three different models to understand why one might want to do ANCOVA versus ANOVA on just factors Here is the set-up # Tukey HSD TukeyHSD(aov(AE~group)) # no consideration to bird size TukeyHSD(aov(AE/TL~group)) # try to change response to account for bird size TukeyHSD(aov(AE~group + TL)) # analyze group difference in response, with respect to bird size Cannot really do the last one, but can do this TukeyHSD(aov(resid(lm(AE~TL))~group)) Since there is a “common” slope, using residuals from a common AE-TL relationship allows one to test differences in groups, which corresponds to the vertical distance between lines in the previous plot.
BIOL 582 Multiple Comparisons > TukeyHSD(aov(AE~group)) # no consideration to bird size Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = AE ~ group) $group diff lwrupr p adj f-TRUE-f-FALSE -0.5714286 -4.132556 2.989699 0.9753950 m-FALSE-f-FALSE 5.7341270 2.625720 8.842534 0.0000249 m-TRUE-f-FALSE 5.8403361 2.938803 8.741870 0.0000037 m-FALSE-f-TRUE 6.3055556 2.918248 9.692863 0.0000207 m-TRUE-f-TRUE 6.4117647 3.213240 9.610289 0.0000041 m-TRUE-m-FALSE 0.1062092 -2.579144 2.791562 0.9996079 > TukeyHSD(aov(AE/TL~group)) # try to change response to account for bird size Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = AE/TL ~ group) $group diff lwrupr p adj f-TRUE-f-FALSE 0.0066006921 -0.011336053 0.02453744 0.7736828 m-FALSE-f-FALSE 0.0008899351 -0.014766540 0.01654641 0.9988410 m-TRUE-f-FALSE 0.0287499024 0.014135411 0.04336439 0.0000063 m-FALSE-f-TRUE -0.0057107570 -0.022772005 0.01135049 0.8198412 m-TRUE-f-TRUE 0.0221492103 0.006038831 0.03825959 0.0027068 m-TRUE-m-FALSE 0.0278599673 0.014334336 0.04138560 0.0000021 > TukeyHSD(aov(resid(lm(AE~TL))~group)) # analyze group difference in response, with respect to bird size Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = resid(lm(AE ~ TL)) ~ group) $group diff lwrupr p adj f-TRUE-f-FALSE 0.5509291 -2.041452 3.143311 0.9455920 m-FALSE-f-FALSE 1.8186291 -0.444187 4.081445 0.1613057 m-TRUE-f-FALSE 4.9550540 2.842835 7.067273 0.0000001 m-FALSE-f-TRUE 1.2677000 -1.198147 3.733547 0.5406886 m-TRUE-f-TRUE 4.4041249 2.075706 6.732543 0.0000149 m-TRUE-m-FALSE 3.1364249 1.181578 5.091272 0.0003110 Notice the general gain in statistical power (i.e. less likely to make a type II error)
BIOL 582 Comments • BE CAREFUL WITH CANNED STATS SOFTWARE! • Sometime it is assumed that a homogeneity of slopes test was performed. • The “reduced” model is assumed to be appropriate • The steps for ANCOVA should ALWAYS include • 1. Homogeneity of slopes test • 2. ANOVA and evaluation with the appropriate model • If the reduced model is used, it is a “true” ANCOVA • If the full model is used, it is more of a multiple regression problem • This is just a semantics issues • WHAT IS REALLY IMPORTANT IS THAT • 1. One finds the best model • 2. One evaluates the model appropriatly • E.g., it would not be appropriate to do multiple comparisons for groups with different slopes • Complex models will be dealt with via example • Model Selection is also a much deeper concept we will explore.