R for Macroecology: Tests and models
Homework
• Solutions to the color assignment problem?
Statistical tests in R
• This is why we are using R (and not C++)!
• t.test()
• aov()
• lm()
• glm()
• And many more
Read the documentation!
• With statistical tests, it's particularly important to read and understand the documentation of each function you use
• They may do complicated things depending on their options, and you want to make sure they do what you want
• Default behavior can also change with, e.g., sample size
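For example, t.test() performs the Welch test by default (var.equal = FALSE); one option changes what the "same" function computes. A minimal sketch:

> x = 1:10
> y = 3:12
> t.test(x, y)                    # default: Welch test, unequal variances
> t.test(x, y, var.equal = TRUE)  # classical pooled-variance t test
> ?t.test                         # always check the defaults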
Returns from statistical tests
• Statistical tests are functions, so they return objects
> x = 1:10
> y = 3:12
> t.test(x,y)

        Welch Two Sample t-test

data:  x and y
t = -1.4771, df = 18, p-value = 0.1569
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.8446619  0.8446619
sample estimates:
mean of x mean of y
      5.5       7.5
Returns from statistical tests
• Statistical tests are functions, so they return objects
• t.test() returns a list
> x = 1:10
> y = 3:12
> test = t.test(x,y)
> str(test)
List of 9
 $ statistic  : Named num -1.48
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 18
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.157
 $ conf.int   : atomic [1:2] -4.845 0.845
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 5.5 7.5
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "x and y"
 - attr(*, "class")= chr "htest"
Returns from statistical tests
• Getting the results out
• This hopefully looks familiar after last week’s debugging
> x = 1:10
> y = 3:12
> test = t.test(x,y)
> test$p.value
[1] 0.1569322
> test$conf.int[2]
[1] 0.8446619
> test[[3]]
[1] 0.1569322
Model specification
• Models in R use a common formula syntax:
• Y ~ X1 + X2 + X3 + . . . + Xn
• Means Y is modeled as a linear function of the predictors X1 through Xn
Linear models
• Basic linear models are fit with lm()
• Again, lm() returns a list
> x = 1:10
> y = 3:12
> test = lm(y ~ x)
> test

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
          2            1
Linear models
• summary() is helpful for looking at a model
> summary(test)

Call:
lm(formula = y ~ x)

Residuals:
       Min         1Q     Median         3Q        Max
-1.293e-15 -1.986e-16  1.343e-16  3.689e-16  6.149e-16

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 2.000e+00  4.019e-16 4.976e+15   <2e-16 ***
x           1.000e+00  6.477e-17 1.544e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.883e-16 on 8 degrees of freedom
Multiple R-squared: 1,  Adjusted R-squared: 1
F-statistic: 2.384e+32 on 1 and 8 DF,  p-value: < 2.2e-16
Extracting coefficients, P values
• For the list (e.g. "test") returned by lm(), test$coefficients will give the coefficients, but not the std. error or p value
• Instead, use summary(test)$coefficients, which returns the full coefficient table
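A minimal sketch of pulling individual values out of that table (the row and column labels below are the standard ones produced by summary() on an lm fit):

> test = lm(y ~ x)
> coefTable = summary(test)$coefficients
> coefTable["x", "Estimate"]    # slope estimate
> coefTable["x", "Std. Error"]  # its standard error
> coefTable["x", "Pr(>|t|)"]    # its p value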
Model specification - interactions
• Interactions are specified with a * or a :
• X1 * X2 means X1 + X2 + X1:X2
• (X1 + X2 + X3)^2 means each term and all second-order interactions
• - removes terms
• constants are included by default, but can be removed with "-1"
• more help available using ?formula
• (see the sketch below for how these shorthands expand)
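A minimal sketch showing the expansions, using model.matrix() to print the terms a formula generates (the variable names here are arbitrary):

> x1 = runif(20); x2 = runif(20); x3 = runif(20)
> y = runif(20)
> colnames(model.matrix(y ~ x1 * x2))           # x1, x2, and x1:x2
> colnames(model.matrix(y ~ (x1 + x2 + x3)^2))  # main effects + all pairwise interactions
> colnames(model.matrix(y ~ x1 * x2 - x1:x2))   # - drops the interaction again
> colnames(model.matrix(y ~ x1 - 1))            # -1 drops the intercept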
Quadratic terms
• Because ^2 means something specific in the context of a model formula, if you want to square one of your predictors you have to wrap it in I():
> x = 1:10
> y = 3:12
> test = lm(y ~ x + x^2)
> test$coefficients
(Intercept)           x
          2           1
> test = lm(y ~ x + I(x^2))
> test$coefficients
  (Intercept)             x        I(x^2)
 2.000000e+00  1.000000e+00 -4.401401e-17
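A related alternative worth knowing (not from the original slide, but base R): poly() builds polynomial terms inside a formula, and by default uses orthogonal polynomials, which avoids the collinearity between x and x^2:

> test = lm(y ~ poly(x, 2))              # orthogonal linear + quadratic terms
> test = lm(y ~ poly(x, 2, raw = TRUE))  # raw x and x^2, equivalent to x + I(x^2)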
A break to try things out
• t test
• anova
• linear models
Plotting a test object
• plot(t.test(x,y)) does nothing useful - there is no plot method for "htest" objects
• plot(lm(y~x)) plots diagnostic graphs
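A minimal sketch for viewing all four default lm() diagnostic plots on one screen:

> test = lm(y ~ x)
> par(mfrow = c(2, 2))  # 2x2 grid so all four diagnostics fit at once
> plot(test)            # residuals vs fitted, Q-Q, scale-location, leverage
> par(mfrow = c(1, 1))  # reset the plotting grid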
Forward and backward selection
• Uses the step() function
• object – starting model
• scope – specifies the range of models to consider
• direction – "backward", "forward", or "both"?
• trace – print to the screen?
• steps – set a maximum number of steps
• k – penalization for adding variables (2 means AIC)
> x1 = runif(100)
> x2 = runif(100)
> x3 = runif(100)
> x4 = runif(100)
> x5 = runif(100)
> x6 = runif(100)
> y = x1+x2+x3+runif(100)
> model = step(lm(y~1), scope = y~x1+x2+x3+x4+x5+x6, direction = "both", trace = F)
> summary(model)

Call:
lm(formula = y ~ x2 + x3 + x1)

Residuals:
      Min        1Q    Median        3Q       Max
-0.517640 -0.264732 -0.003983  0.241431  0.525636

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47422    0.09381   5.055 2.06e-06 ***
x2           0.90890    0.09887   9.192 8.08e-15 ***
x3           1.11036    0.10555  10.520  < 2e-16 ***
x1           1.00836    0.10865   9.281 5.23e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3059 on 96 degrees of freedom
Multiple R-squared: 0.7701,  Adjusted R-squared: 0.7629
F-statistic: 107.2 on 3 and 96 DF,  p-value: < 2.2e-16
All subsets selection
• Uses the leaps() function (from the leaps package)
leaps(x=, y=, wt=rep(1, NROW(x)), int=TRUE, method=c("Cp", "adjr2", "r2"),
      nbest=10, names=NULL, df=NROW(x), strictly.compatible=TRUE)
• x – a matrix of predictors
• y – a vector of the response
• method – how to compare models (Mallows Cp, adjusted R2, or R2)
• nbest – number of models of each size to return
> Xmat = cbind(x1,x2,x3,x4,x5,x6)
> leaps(x = Xmat, y, method = "Cp", nbest = 2)
$which
      1     2     3     4     5     6
1 FALSE  TRUE FALSE FALSE FALSE FALSE
1  TRUE FALSE FALSE FALSE FALSE FALSE
2  TRUE FALSE  TRUE FALSE FALSE FALSE
2 FALSE  TRUE  TRUE FALSE FALSE FALSE
3  TRUE  TRUE  TRUE FALSE FALSE FALSE
3 FALSE  TRUE  TRUE  TRUE FALSE FALSE
4  TRUE  TRUE  TRUE  TRUE FALSE FALSE
4  TRUE  TRUE  TRUE FALSE  TRUE FALSE
5  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
6  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1" "2" "3" "4" "5" "6"

$size
 [1] 2 2 3 3 4 4 5 5 6 6 7

$Cp
 [1] 191.732341 197.672391  85.498456  87.117764   3.466464  81.286117   3.579295   5.078805
 [9]   5.138063   5.509598   7.000000
> Xmat = cbind(x1,x2,x3,x4,x5,x6)
> leapOut = leaps(x = Xmat, y, method = "Cp", nbest = 2)
> aicVals = NULL
> for(i in 1:nrow(leapOut$which))
+ {
+   # rebuild the formula for row i of $which, then record that model's AIC
+   model = as.formula(paste("y~", paste(c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]], collapse = "+")))
+   test = lm(model)
+   aicVals[i] = AIC(test)
+ }
> aicVals
 [1] 159.13825 161.18166 113.95184 114.84992  52.81268 112.42959  52.81609
 [8]  54.40579  54.34347  54.74159  56.19513
> i = 8
> leapOut$which[i,]
    1     2     3     4     5     6
 TRUE  TRUE  TRUE FALSE  TRUE FALSE
> c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]]
[1] "x1" "x2" "x3" "x5"
> paste("y~", paste(c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]], collapse = "+"))
[1] "y~ x1+x2+x3+x5"
Comparing AIC of the best models
> data.frame(leapOut$which, aicVals)
      X1    X2    X3    X4    X5    X6   aicVals
1  FALSE  TRUE FALSE FALSE FALSE FALSE 159.13825
2   TRUE FALSE FALSE FALSE FALSE FALSE 161.18166
3   TRUE FALSE  TRUE FALSE FALSE FALSE 113.95184
4  FALSE  TRUE  TRUE FALSE FALSE FALSE 114.84992
5   TRUE  TRUE  TRUE FALSE FALSE FALSE  52.81268
6  FALSE  TRUE  TRUE  TRUE FALSE FALSE 112.42959
7   TRUE  TRUE  TRUE  TRUE FALSE FALSE  52.81609
8   TRUE  TRUE  TRUE FALSE  TRUE FALSE  54.40579
9   TRUE  TRUE  TRUE  TRUE  TRUE FALSE  54.34347
10  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  54.74159
11  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  56.19513
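A small follow-up sketch (not on the original slide) for pulling the lowest-AIC model out of this table and refitting it:

> best = which.min(aicVals)  # row with the lowest AIC
> bestVars = c("x1","x2","x3","x4","x5","x6")[leapOut$which[best,]]
> bestModel = lm(as.formula(paste("y ~", paste(bestVars, collapse = "+"))))
> summary(bestModel)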
Practice with the mammal data
• VIF, lm(), AIC(), leaps()
• (a starting-point sketch follows below)
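A minimal sketch to get started, assuming a data frame named mammals with a response mass and predictor columns temp, precip, and range.size (these names are placeholders, not the actual assignment data); vif() comes from the car package:

> library(car)    # for vif()
> library(leaps)  # for leaps()
> # 'mammals' and its column names are hypothetical placeholders
> model = lm(mass ~ temp + precip + range.size, data = mammals)
> vif(model)      # variance inflation factors for the predictors
> AIC(model)
> leaps(x = as.matrix(mammals[, c("temp","precip","range.size")]),
+       y = mammals$mass, method = "Cp")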