BIOL 582 Lecture Set 12 ANOVA for simple linear regression Multiple regression With R annotations
BIOL 582 Linear Regression Model Set-up
• Often, biological research is concerned with the relationship between two variables
• Consider the snake data. There are two variables that could be related: HS and SVL
• First, one might not wish to assert a “dependence” of one variable on the other, but simply ask whether there is a linear relationship
• Are the variables correlated?

> rm(list=ls())
> snake<-read.csv("snake.data.csv")
> attach(snake)
> cor(SVL,HS)
[1] 0.7289513
BIOL 582 Linear Regression Model Set-up
• The null hypothesis for a correlation coefficient is H0: ρ = 0
• The alternative can have direction:
HA: ρ > 0
HA: ρ < 0
HA: ρ ≠ 0
• A t-test or randomization test can be used to test the null. Remember that r is the sample statistic; ρ is the population parameter.
BIOL 582 Linear Regression Model Set-up
• A t-test of H0: ρ = 0 uses t = r/√((1 − r²)/(n − 2)) with n − 2 df:

> r<-cor(SVL,HS)
> n<-length(HS)
> t<-r/sqrt(((1-r^2)/(n-2)))
> t
[1] 6.5641
> p.value<-1-pt(t,n-2)
> p.value
[1] 4.809665e-08

Note that this p-value is one-tailed (for HA: ρ > 0).
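As a minimal sketch (not the course data, which are not included here), the built-in `cor.test()` performs the same t-test in one call. The `SVL` and `HS` vectors below are simulated stand-ins for the snake data:

```r
# Simulated stand-ins for the snake data (hypothetical values)
set.seed(1)
SVL <- runif(40, 50, 100)                  # hypothetical snout-vent lengths
HS  <- 2 + 0.1 * SVL + rnorm(40, sd = 2.5) # hypothetical head sizes

ct <- cor.test(SVL, HS, alternative = "greater")  # one-tailed, HA: rho > 0

# Manual t statistic, as on the slide
r <- cor(SVL, HS)
n <- length(HS)
t.manual <- r / sqrt((1 - r^2) / (n - 2))

# cor.test()'s t statistic and one-tailed p-value match the manual calculation
all.equal(unname(ct$statistic), t.manual)
all.equal(ct$p.value, 1 - pt(t.manual, n - 2))
```

`cor.test()` also reports a confidence interval for ρ, which the manual calculation does not provide.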
BIOL 582 Linear Regression Model Set-up
• Are the variables correlated? A randomization test shuffles one variable to build a null distribution of r:

> # Randomization approach
> r<-cor(SVL,HS)
> result<-r; p<-1
> for(i in 1:10000){
+   HS.r<-sample(HS)
+   r.r<-cor(SVL,HS.r)
+   result<-c(result,r.r)
+   p<-ifelse(r.r>=r,p+1,p)
+ }
> p.value<-p/10000
> p.value
[1] 1e-04
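A compact sketch of the same randomization test using `replicate()` instead of an explicit loop; simulated data stand in for `SVL` and `HS` (assumed, not the real snake data):

```r
# Simulated stand-ins for the snake data (hypothetical values)
set.seed(2)
SVL <- runif(40, 50, 100)
HS  <- 2 + 0.1 * SVL + rnorm(40, sd = 2.5)

r.obs  <- cor(SVL, HS)
r.rand <- replicate(9999, cor(SVL, sample(HS)))  # null distribution of r

# The observed value counts as one permutation
p.value <- (sum(r.rand >= r.obs) + 1) / (length(r.rand) + 1)
```

Growing `result` with `c()` inside a loop, as on the slide, works but is slow for large numbers of permutations; `replicate()` preallocates the result vector.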
BIOL 582 Linear Regression Model Set-up
• Often, biological research is concerned with the relationship between two variables
• Consider the snake data. There are two variables that could be related: HS and SVL
• Second, one might wish to assert a “dependence” of one variable on the other. This makes most sense when one variable is clearly independent of the other (e.g., age), but sometimes the independent variable is simply a “general” variable (e.g., overall size) and the dependent variable is a specific variable (e.g., the size of a certain morphological trait).
• The linear equation is y = β0x0 + β1x1
• Which has the model yi = β0x0 + β1xi + εi
• That produces error εi = yi − (β0x0 + β1xi)
The value x0 is generally 1 for all subjects and can, therefore, be eliminated from the equation. However, it can also be 0, if one wishes to force the linear model through the origin (where y must be 0 if x is 0). Thus, x0 is a dummy variable which indicates whether to include or exclude a y-intercept.
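A sketch of the x0 dummy in R's formula interface, with simulated data standing in for the snake variables: the intercept column of the design matrix is the x0 = 1 vector, and removing it forces the line through the origin.

```r
# Simulated data (hypothetical; not the snake data)
set.seed(3)
x <- runif(40, 50, 100)
y <- 2 + 0.1 * x + rnorm(40, sd = 2)

fit.int   <- lm(y ~ x)      # x0 = 1: intercept estimated
fit.noint <- lm(y ~ x - 1)  # x0 = 0: line forced through the origin

colnames(model.matrix(fit.int))    # "(Intercept)" "x"
colnames(model.matrix(fit.noint))  # "x"
```

`y ~ x + 0` is equivalent to `y ~ x - 1`; both drop the x0 column from the design matrix.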
BIOL 582 Linear Regression Model Set-up
• It is simple to see how the linear model is reduced
• Imagine that for each model (with at least one parameter), SSE can be obtained easily (from residuals of predictions made by estimated model parameters). There are two sets of SSE from the two different models:
• From the model containing slope + intercept: SSEfull = Σ(yi − (b0 + b1xi))²
• From the model containing the intercept only: SSEreduced = Σ(yi − b0)² = Σ(yi − ȳ)²
• Both models contain an intercept
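A sketch of the two SSE values and their comparison, using simulated data (hypothetical; not the snake data). `anova()` on the two nested fits performs the same comparison as an F test:

```r
# Simulated data (hypothetical values)
set.seed(4)
x <- runif(40, 50, 100)
y <- 2 + 0.1 * x + rnorm(40, sd = 2)

fit.reduced <- lm(y ~ 1)  # intercept only
fit.full    <- lm(y ~ x)  # slope + intercept

SSE.reduced <- sum(resid(fit.reduced)^2)
SSE.full    <- sum(resid(fit.full)^2)
SS.regression <- SSE.reduced - SSE.full  # SS explained by the slope

anova(fit.reduced, fit.full)  # F test of the model reduction
```

The "Sum of Sq" entry in the `anova()` comparison equals SSE.reduced − SSE.full, and its F test matches the single-slope ANOVA shown later.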
BIOL 582 Linear Regression Model Evaluation
• Calculating SS for the regression (the type of SS is irrelevant with only one slope): SSregression = SSEintercept only − SSEslope+intercept
• These are the only values needed for ANOVA: F = (SSregression/1)/(SSEslope+intercept/(n − 2))
• One can use ANOVA or t-tests
BIOL 582 Linear Regression Model Evaluation
• t-tests to test coefficient null hypotheses
• First, the t stats are t = bj/SE(bj)
• Where the standard errors are SE(b1) = √(MSE/Σ(xi − x̄)²) and SE(b0) = √(MSE[1/n + x̄²/Σ(xi − x̄)²])
• In each case, H0: βj = 0 is tested with n − 2 df

> lm.svl<-lm(HS~SVL, x=T)
> summary(lm.svl)

Call:
lm(formula = HS ~ SVL, x = T)

Residuals:
    Min      1Q  Median      3Q     Max
-4.7469 -1.2607 -0.7911  1.1206  7.2280

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.26976    1.15227   1.970   0.0562 .
SVL          0.09987    0.01521   6.564 9.62e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.547 on 38 degrees of freedom
Multiple R-squared: 0.5314, Adjusted R-squared: 0.519
F-statistic: 43.09 on 1 and 38 DF, p-value: 9.619e-08
BIOL 582 Linear Regression Model Evaluation
• Summary of ANOVA for simple linear regression

> lm.svl<-lm(HS~SVL, x=T)
> anova(lm.svl)
Analysis of Variance Table

Response: HS
          Df Sum Sq Mean Sq F value    Pr(>F)
SVL        1 279.48 279.475  43.087 9.619e-08 ***
Residuals 38 246.48   6.486
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
BIOL 582 Linear Regression Model Uses and Assumptions
• Linear regression (or extensions of it) is used in just about any analysis that requires examining trends
• E.g., trait vs. time, trait vs. age, trait vs. size, response vs. temp.
• Assumptions include
• Normally distributed residuals (not data)
• Homoscedasticity
• Independent observations (i.e., samples don’t contain multiple measurements on the same subjects; however, there are analyses that deal with repeated measures)
• These are the assumptions of Linear Models!
BIOL 582 Linear Regression Model Set-up
• One can also consider more than one independent variable
• Some data on fish body shape will be used as an example, but first the model set-up
• The linear equation is y = β0 + β1x1 + β2x2 + … + βkxk
• Which has the model yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi
• That produces error εi = yi − ŷi
BIOL 582 Linear Regression Model Set-up
• Using two independent variables, the model y = β0 + β1x1 + β2x2 + ε is reduced by comparing it with nested models that drop one slope (or both)
• It should be obvious that ANOVA will have to consider the type of SS
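A small sketch of why the type of SS matters: with correlated predictors, sequential (Type I) SS depend on the order in which terms enter the model. The predictors below are simulated, not the fish data:

```r
# Simulated, deliberately correlated predictors (hypothetical values)
set.seed(5)
x1 <- rnorm(60)
x2 <- 0.6 * x1 + rnorm(60)          # x2 correlated with x1
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(60)

ss.first <- anova(lm(y ~ x1 + x2))["x1", "Sum Sq"]  # x1 entered first
ss.last  <- anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]  # x1 entered last

c(ss.first, ss.last)  # generally unequal when predictors are correlated
```

If the predictors were exactly orthogonal, the two values would coincide and the choice of SS type would be moot.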
BIOL 582 Example Data (more pupfish)
• Data collected from photographs – 13 anatomical landmarks
• Body shape calculated with geometric morphometrics (GM) methods
• The size of the head, body, and tail also calculated using log(centroid size)
• The goal of the research was to understand how salinity influences body shape, but we knew that fish size could also influence shape
• Question: Is body shape (streamlined/bluff) more associated with head, body, or tail size?
[Figure: shape index (the response variable), ranging from more negative to more positive]
From Collyer et al. 2007, Ecol. Res.
BIOL 582 Multiple Linear Regression Model Uses and Assumptions
• Linear regression (or extensions of it) is used in just about any analysis that requires examining trends
• E.g., trait vs. time, trait vs. age, trait vs. size, response vs. temp.
• E.g., multiple regression: trait vs. time, age, and size
• Assumptions include
• Normally distributed residuals (not data)
• Homoscedasticity
• Independent observations (i.e., samples don’t contain multiple measurements on the same subjects; however, there are analyses that deal with repeated measures)
• These are the assumptions of Linear Models!
BIOL 582 Multiple Linear Regression Model Uses and Assumptions
• Checking the assumptions for the pupfish model:

> shape.fit<-lm(shape~log(headCS)+log(bodyCS)+log(tailCS))
> par(mfcol=c(2,2))
> plot(shape.fit)
> shapiro.test(resid(shape.fit))

        Shapiro-Wilk normality test

data:  resid(shape.fit)
W = 0.9888, p-value = 0.8362
BIOL 582 Multiple Linear Regression Model Uses and Assumptions

> # Some additional diagnostic plots (just for visual reference)
> par(mfcol=c(1,3))
> plot(log(headCS),resid(shape.fit))
> plot(log(bodyCS),resid(shape.fit))
> plot(log(tailCS),resid(shape.fit))
BIOL 582 Linear Regression Model Evaluation
• Calculating SS for the model: SSmodel = SSEintercept only − SSEfull
• These are the only values needed for ANOVA, but ANOVA can also look at the SS for each independent variable separately
• One can use ANOVA or t-tests
BIOL 582 Linear Regression Model Evaluation
• t-tests to test coefficient null hypotheses
• First, the t stats are t = bj/SE(bj)
• Where the standard errors are the square roots of the diagonal elements of MSE(XᵀX)⁻¹
• In each case, H0: βj = 0

> summary(shape.fit)

Call:
lm(formula = shape ~ log(headCS) + log(bodyCS) + log(tailCS))

Residuals:
      Min        1Q    Median        3Q       Max
-0.036973 -0.011269 -0.001346  0.011482  0.039555

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.73692    0.16605   4.438 4.03e-05 ***
log(headCS)  0.10427    0.04503   2.315   0.0241 *
log(bodyCS)  0.92555    0.15862   5.835 2.43e-07 ***
log(tailCS)  0.12293    0.06217   1.977   0.0527 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01589 on 59 degrees of freedom
Multiple R-squared: 0.4348, Adjusted R-squared: 0.4061
F-statistic: 15.13 on 3 and 59 DF, p-value: 2.043e-07
BIOL 582 Linear Regression Model Evaluation
• The F-statistic line at the bottom of the summary() output above is the full-model ANOVA: F = 15.13 on 3 and 59 DF, p-value = 2.043e-07
BIOL 582 Linear Regression Model Evaluation
• Summary of ANOVA for different independent variables
• Type I (sequential) SS: each term is tested in order of entry, i.e., SS(b1), SS(b2|b1), …, SS(bk|b1, …, b(k−1))
• Note M refers to b1, b2, …, b(k−1), bk – in other words, all model parameters
BIOL 582 Linear Regression Model Evaluation
• Summary of ANOVA for different independent variables
• Type III (marginal) SS: each term is tested given all others, i.e., SS(bk|M − bk) for every k
• Note M refers to b1, b2, …, b(k−1), bk – in other words, all model parameters
• Note M − bk refers to all model parameters except the kth
• Note Type II only applies if there are interactions
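A sketch of the Type III idea without loading `car`: for a model with no interactions, each term's Type III SS is the increase in SSE from dropping that term from the full model, which `drop1()` computes directly. Simulated predictors stand in for the pupfish size variables:

```r
# Simulated data (hypothetical; not the pupfish data)
set.seed(6)
x1 <- rnorm(60)
x2 <- rnorm(60)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(60)

fit <- lm(y ~ x1 + x2)

# Marginal (Type III-style) F test for each term, given all others
drop1(fit, test = "F")

# The "Sum of Sq" for x1 is SSE(without x1) - SSE(full)
ss.x1 <- sum(resid(lm(y ~ x2))^2) - sum(resid(fit)^2)
```

`car::Anova(fit, type = "III")` gives the same F tests for the non-intercept terms of this model.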
BIOL 582 Linear Regression Model Evaluation

> library(car)
> # Type I SS ANOVA
> anova(shape.fit)
Analysis of Variance Table

Response: shape
            Df    Sum Sq   Mean Sq F value    Pr(>F)
log(headCS)  1 0.0011737 0.0011737  4.6475   0.03519 *
log(bodyCS)  1 0.0093027 0.0093027 36.8360 9.921e-08 ***
log(tailCS)  1 0.0009874 0.0009874  3.9099   0.05268 .
Residuals   59 0.0149001 0.0002525
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> # Type III SS ANOVA
> Anova(shape.fit,type="III")
Anova Table (Type III tests)

Response: shape
               Sum Sq Df F value    Pr(>F)
(Intercept) 0.0049742  1 19.6963 4.034e-05 ***
log(headCS) 0.0013539  1  5.3609   0.02409 *
log(bodyCS) 0.0085985  1 34.0477 2.425e-07 ***
log(tailCS) 0.0009874  1  3.9099   0.05268 .
Residuals   0.0149001 59
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
BIOL 582 Multiple Comparisons?

> # There really is not a way to do multiple comparisons if there are not multiple groups
> # But one can still visualize which effects are more meaningful
> # for predicting responses
> par(mfcol=c(1,3))
> plot(log(headCS),shape); abline(lm(shape~log(headCS)),col='red')
> plot(log(bodyCS),shape); abline(lm(shape~log(bodyCS)),col='blue')
> plot(log(tailCS),shape); abline(lm(shape~log(tailCS)),col='green')
BIOL 582 Multiple Comparisons?

> # And if one is clever, one can look at partial effects
> # Warning, only try this if you know what you are doing
> par(mfcol=c(1,3))
>
> # partial residuals
> res.no.head<-resid(lm(shape~log(bodyCS)+log(tailCS)))
> res.no.body<-resid(lm(shape~log(headCS)+log(tailCS)))
> res.no.tail<-resid(lm(shape~log(headCS)+log(bodyCS)))
>
> plot(log(headCS),res.no.head); abline(lm(res.no.head~log(headCS)),col='red')
> plot(log(bodyCS),res.no.body); abline(lm(res.no.body~log(bodyCS)),col='blue')
> plot(log(tailCS),res.no.tail); abline(lm(res.no.tail~log(tailCS)),col='green')
>
> # The plots show “partial” effects
BIOL 582 Some other considerations
• What if one just wanted to know correlations among variables? Correlation matrix:

> # first make a matrix
> log.headCS<-log(headCS);log.bodyCS<-log(bodyCS);log.tailCS<-log(tailCS)
> Y<-cbind(log.headCS,log.bodyCS,log.tailCS,shape)
>
> # then make a correlation matrix
> cor(Y)
            log.headCS  log.bodyCS log.tailCS      shape
log.headCS  1.00000000  0.01516104 -0.1124418  0.2109957
log.bodyCS  0.01516104  1.00000000 -0.6660615  0.5971490
log.tailCS -0.11244183 -0.66606147  1.0000000 -0.2754235
shape       0.21099571  0.59714905 -0.2754235  1.0000000
>
> # and plot
> pairs(Y,main="All pairwise scatterplots")
BIOL 582 Some other considerations
• What if one does not want to assume strict independence of independent variables?
• Structural Equation Modeling (Path Analysis) calculates partial correlations among variables, and one can create a path diagram to show the relationships. (This is a more complex subject; we might cover it later.)
[Path diagram: Variable 1, Variable 2, Variable 3 → Response]
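A sketch of the basic quantity path analysis works with: the partial correlation between a response and one variable, controlling for another, computed from residuals. The variables below are simulated, not the pupfish data:

```r
# Simulated variables (hypothetical values)
set.seed(7)
x2 <- rnorm(60)
x1 <- 0.5 * x2 + rnorm(60)          # x1 and x2 are not independent
y  <- 0.7 * x1 + 0.3 * x2 + rnorm(60)

# Partial correlation of y and x1, holding x2 constant:
# correlate the parts of y and x1 not explained by x2
pcor <- cor(resid(lm(y ~ x2)), resid(lm(x1 ~ x2)))
pcor
```

This residual-based value equals the textbook formula (r.yx1 − r.yx2·r.x1x2)/√((1 − r.yx2²)(1 − r.x1x2²)), which is how path coefficients account for shared variation among predictors.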
BIOL 582 Some other considerations
• What if one wants to examine curvilinear relationships? Polynomial regression:

> # polynomial regression of shape and total body size (total CS)
> plot(totalCS,shape)
BIOL 582 Some other considerations
• What if one wants to examine curvilinear relationships? Polynomial regression:

> shapefit<-lm(shape~poly(totalCS,3))
> par(mfcol=c(2,2))
> plot(shapefit)
> shapiro.test(resid(shapefit))

        Shapiro-Wilk normality test

data:  resid(shapefit)
W = 0.9723, p-value = 0.1665
BIOL 582 Some other considerations
• What if one wants to examine curvilinear relationships? Polynomial regression:

> plot(totalCS,shape)
> points(totalCS,predict(shapefit),pch=21,bg="red")
> summary(shapefit)

Call:
lm(formula = shape ~ poly(totalCS, 3))

Residuals:
      Min        1Q    Median        3Q       Max
-0.035464 -0.012123 -0.002539  0.010906  0.055230

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       -5.902e-20  2.254e-03   0.000   1.0000
poly(totalCS, 3)1  7.911e-02  1.789e-02   4.422 4.27e-05 ***
poly(totalCS, 3)2 -3.485e-02  1.789e-02  -1.948   0.0562 .
poly(totalCS, 3)3  1.798e-03  1.789e-02   0.100   0.9203
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01789 on 59 degrees of freedom
Multiple R-squared: 0.2836, Adjusted R-squared: 0.2472
F-statistic: 7.785 on 3 and 59 DF, p-value: 0.0001834
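A sketch of a detail worth knowing about `poly()`: it uses orthogonal polynomials by default, while `raw = TRUE` gives ordinary powers. The fitted curve is identical either way; only the coefficients (and their individual tests) differ. Simulated data stand in for totalCS and shape:

```r
# Simulated stand-ins for totalCS and shape (hypothetical values)
set.seed(8)
totalCS <- runif(60, 40, 120)
shape   <- 0.02 * totalCS - 1e-4 * totalCS^2 + rnorm(60, sd = 0.02)

fit.orth <- lm(shape ~ poly(totalCS, 3))              # orthogonal polynomials
fit.raw  <- lm(shape ~ poly(totalCS, 3, raw = TRUE))  # ordinary powers

# Same fitted curve from both parameterizations
all.equal(fitted(fit.orth), fitted(fit.raw))
```

Orthogonal polynomials are preferred for testing because the columns are uncorrelated, so each degree's coefficient can be assessed independently, as in the summary table above.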