340 likes | 854 Views
Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE) Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"):
E N D
Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca
Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE)
Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"): diane$treatment OR attach(diane); treatment At end of session, remember to: detach(diane)
Summary statistics length (x) mean (x) var (x) cor (x,y) sum (x) summary (x) minimum, maximum, mean, median, quartiles What is the correlation between two variables in your dataset?
Factors • A factor has several discrete levels (e.g. control, herbicide) • If a vector contains text, R automatically assumes it is a factor. • To manually convert numeric vector to a factor: • x <- as.factor(x) • To check if your vector is a factor, and what the levels are: • is.factor(x) ; levels(x)
Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small Area 5.1 to 10: medium Area > 10 ha: large You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels 3. Make AreaFactor into a factor, then check that it is a factor
Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small Area 5.1 to 10: medium Area > 10 ha: large You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large" 3. Make AreaFactor into a factor, then check that it is a factor AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)
Linear regression ALT+126 model <- lm (y ~ x, data = diane) insert your dataframe name here invent a name for your model linear model insert your x vector name here insert your y vector name here
Linear regression model <- lm (Species ~ Elevation, data = diane) summary (model) Call: lm(formula = Species ~ Elevation, data = diane) Residuals: Min 1Q Median 3Q Max -7.29467 -2.75041 -0.04947 1.83054 15.00270 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.421568 2.426983 3.882 0.000551 *** Elevation -0.0026090.003663 -0.712 0.482070 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.811 on 29 degrees of freedom Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167 F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821
Linear regression model2 <- lm (Species ~ AreaFactor, data = diane) summary (model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.833 1.441 8.210 6.17e-09 *** AreaFactormedium -2.405 1.723 -1.396 0.174 AreaFactorsmall -8.288 1.792 -4.626 7.72e-05 *** Large has mean species richness of 11.8 Medium has mean species richness of 11.8 - 2.4 = 9.4 Small has a mean species richness of 11.8 - 8.3 = 3.5 mean(Species[AreaFactor=="medio"])
ANOVA model2 <- lm (Species ~ AreaFactor, data = diane) anova (model2) Analysis of Variance Table Response: Species Df Sum Sq Mean Sq F value Pr(>F) AreaFactor 2 333.85 166.92 13.393 8.297e-05 *** Residuals 28 348.99 12.46
F tests in regression model3 <- lm (Species ~ Elevation + pH, data = diane) anova (model, model3) Model 1: Species ~ Elevation Model 2: Species ~ Elevation + pH Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 671.10 2 28 502.06 1 169.05 9.4279 0.004715 ** F 1, 28 = 9.43
Exercise • Fit the model: Species~pH • Fit the model: Species~pH+pH2 • ("pH2" is just pH2) • Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH
Distributions: not so normal! • Review assumptions for parametric stats (ANOVA, regressions) • Why don’t transformations always work? • Introduce non-normal distributions
Tests before ANOVA, t-tests Normality Constant variances
Tests for normality: exercise data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data How many datapoints are there?
Tests for normality: exercise • Shapiro-Wilks (if sig, not normal) shapiro.test (data) If error message, make sure the stats package is loaded, then try again: library(stats); shapiro.test (data)
Tests for normality: exercise • Kolmogorov-Smirnov (if sig, not normal) ks.test(data,”pnorm”,mean(data),sd=sqrt(var(data)))
Tests for normality: exercise • Quantile-quantile plot (if wavers substantially off 1:1 line, not normal) par(pty="s") qqnorm(data); qqline(data) opens up a single plot window
Tests for normality: exercise If the distribution isn´t normal, what is it? freq.data<-table(data); freq.data barplot(freq.data)
Non-normal distributions • Poisson (count data, left-skewed, variance = mean) • Negative binomial (count data, left-skewed, variance >> mean) • Binomial (binary or proportion data, left-skewed, variance constrained by 0,1) • Gamma (variance increases as square of mean, often used for survival data)
Exercise model2 <- lm (Species ~ AreaFactor, data = diane) anova (model2) 1. Test for normality of residuals resid2<- resid (model2) ...you do the rest! 2. Test for homogeneity of variances summary (lm (abs (resid2) ~ AreaFactor))
Regression diagnostics • Residuals are normally distributed • Absolute value of residuals do not change with predicted value (homoscedastcity) • Residuals show no pattern with predicted values (i.e. the function “fits”) • No datapoint has undue leverage on the model.
Regression diagnostics model3 <- lm (Species ~ Elevation + pH, data = diane) par(mfrow=c(2,2)); plot(model3)
1. Residuals are normally distributed • Straight “Normal Q-Q plot” Std. deviance resid. Theoretical Quantiles
2. Absolute residuals do not change with predicted values • No fan shape in Residuals vs fitted plot • No upward (or downward) trend in Scale-location plot MALO BUENO Residuals Sqrt (abs (SD resid.)) Fitted values Fitted values
3. Residuals show no pattern Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)
4. No unusual leverage Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)
Transformations Try transforming your y-variable to improve the regression diagnostic plot • replace Species with log(Species) • replace Species with sqrt(Species)
Poisson distribution • Frequency data • Lots of zero (or minimum value) data • Variance increases with the mean
What do you do with Poisson data? • Correct for correlation between mean and variance by log-transforming y (but log (0) is undefined!!) • Use non-parametric statistics (but low power) • Use a “generalized linear model” specifying a Poisson distribution
The problem: Hard to transform data to satisfy all requirements! Tarea: Janka example Janka dataset: Asks if Janka hardness values are a good estimate of timber density? N=36