230 likes | 249 Views
Join the workshop to enhance your research with R statistical software. Learn how R can help in data analysis and visualization.
E N D
R by Example CNR SBCS workshop Gary Casterline
Experience and Applications • Name and Dept/Lab • Experience with Statistics / Software • Your research and how R might help
3 easy steps to get R http://cran.cnr.berkeley.edu Select a platform-specific binary Windows OS X Unix/Linux 3. Save to disk and install R is already installed on the lab machines, so just start it up.
Set up working environment for project • create a directory for today's workshop. SayC:\temp\gary • (you may use your own name :) • change to that directory within R with either the File\Change Dir ... menu command • or using the friendly command line: > setwd("C:\temp\gary") > getwd() [1] "C:/temp/gary" > dir()# shows what is there character(0)
Excel into R • get this spreadsheet using a web browser:http://nature.berkeley.edu/~casterln/jantemps.xls • open in Excel and Save-As file-type .csv in your temp folder. • back in R > dir()#is it there yet? [1] "jantemps.csv"# yes! > jantemps = read.csv("jantemps.csv",header=T)#create dataframe
attaching dataframes to search path > names(jantemps) [1] "tmax" "tmin" "day" > attach(jantemps) > search()# now where is R going to look for stuff [1] ".GlobalEnv" "jantemps" "package:methods" [4] "package:stats" "package:graphics" "package:grDevices" [7] "package:utils" "package:datasets" "Autoloads" [10] "package:base" > max(tmax) [1] 10.8 > min(tmin) [1] -11.5
building up a plot > plot(day,tmax,ylim=c(-12,12),type="n",ylab="Temperature") > points(day,tmin,col="blue",pch=16) > points(day,tmax,col="red",pch=16) > for( i in 1:31 ) lines( c(i,i), c(tmin[i],tmax[i]), + col="green") > detach(jantemps)
plot with categorical variables > weather = read.table( + "http://nature.berkeley.edu/~casterln/crawley/SilwoodWeather.txt",header=T) > names(weather) [1] "upper" "lower" "rain" "month" "yr" > attach(weather) > month <- factor(month) > plot(month,upper) > detach(weather) > objects() [1] "jantemps" "month" "weather"
> pollution = read.table( • + "http://nature.berkeley.edu/~casterln/crawley/pollute.txt",header=T) • > names(pollution) • [1] "Pollution" "Temp" "Industry" "Population" "Wind" "Rain" "Wet.days" • > pairs(pollution, • + panel=panel.smooth) • every possible pairwise plot • each plot in 2nd row has Temp on y-axis • each plot in 1st col has Pollution on x-axis multivariate plots
> library(tree)# may have to install package first • > regtree = tree(Pollution ~ . , data=pollution) • > plot(regtree) • > text(regtree) • goal is to create homogeneous buckets by splitting the data at optimal levels of explanatory variables • first split is on Industry, mean pollution level of all Industry > 748 is 67 tree-based models
> attach(pollution)# may already be attached (search() ??) • > coplot(Pollution ~ Temp | Rain) • ordered from lower left, row-wise, to upper right, from lowest rainfall to highest. • big range of rain in lower left panel conditioning plots
which analysis is right? Answer these questions about your data: which variable is the response? which are the explanatory variables? are the explanatory variables continuous, categorical or a mix? what kind of response: continuous, count, proportion, time-at-death or category? Then use this guide: • Explanatory variables all explanatory variables continuous: Regression all explanatory variables categorical: Anova explanatories both cont. and categ: Ancova 2. Response variable Continuous: Normal Regression, Anova or Ancova Proportion: Logistic Regression (glm, binomial errors) Count: Contingency Tables (glm, poisson errors) Binary: Binary logistic analysis (glm Time-at-death: Survival analysis
>decay = read.table("http://nature.berkeley.edu/~casterln/crawley/decay.txt", +header=T) >attach(decay) >names(decay) [1] "x" "y" >plot(x,y) >abline(lm(y~x)) >result = lm(y~x) >summary(result) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -19.065 -10.029 -2.058 5.107 40.447 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 84.5534 5.0277 16.82 < 2e-16 *** x -2.8272 0.2879 -9.82 9.94e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 14.34 on 29 degrees of freedom Multiple R-Squared: 0.7688, Adjusted R-squared: 0.7608 F-statistic: 96.44 on 1 and 29 DF, p-value: 9.94e-11 log transform and prediction
>par(mfcol=c(2,2)) • >plot(result) • U-shaped resid vs fitted • qq plot not a straight line • (non normal residuals) regression summary plots
>transformed = • + lm(log(y) ~ x) • >plot(result) • much better • qq plot nearly straight • but we may have nonconstant • variance (heteroskedasticity) check transformed model
> smoothx = seq(0,30,0.1) • >smoothy = exp(predict(transformed, list(x=smoothx))) • >par(mfrow=c(1,1)) • >plot(x,y) • >lines(smoothx, smoothy) • use predict() with the transformed model • and the smoothed x's • and unlog (exp()) the results so they'll • fit on the original axes. plot predicted w/back-transform
analysis of covariance with count data This dataset contains the counts on the number of "Species" on plots that have different "Biomass" and soil "pH" ( low, mid, high). > species = read.table( + "http://nature.berkeley.edu/~casterln/crawley/species.txt", + header=T) > attach(species) > names(species) [1] "pH" "Biomass" "Species" > plot(Biomass,Species,type="n") > spp = split(Species,pH) > bio = split(Biomass,pH) > points(bio[[1]],spp[[1]],pch=16) > points(bio[[2]],spp[[2]],pch=17) > points(bio[[3]],spp[[3]]) Note the use of split to generate sub lists. check out what is in bio and bio[[1]] etc.
Ancova w/count data (cont.) > model1 = glm( Species ~ Biomass*pH, poisson) > summary(model1) > model2 = glm( Species ~ Biomass + pH, poisson) > anova(model1, model2, test="Chi") Analysis of Deviance Table Model 1: Species ~ Biomass * pH Model 2: Species ~ Biomass + pH Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 84 83.201 2 86 99.242 -2 -16.040 0.0003288 We've tested whether model1 is significantly better than model2. It is: p=.00032
Ancova w/count data (concl.) Now lets add the predictions to our plot: > xv = seq(0,10,0.10) > levels(pH) [1] "high" "low" "mid" > length(xv) [1] 101 > phv = rep("high",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv) > phv = rep("mid",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv) > phv = rep("low",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv)
logistic reg w/bin errors Question: Does the proportion of males increase with density of insects? >numbers = read.table( + "http://nature.berkeley.edu/~casterln/crawley/sexratio.txt", + header=T) > attach(numbers) > par(mfrow=c(1,2)) > p = males/(males+females) > plot(density, p, ylab = "Proportion male") > plot(log(density),p,ylab = "Proportion male") Log looks like it will work better. Lets see.
Logistic regression w/binomial errors (cont.) The "y" variable should be 2-column matrix: > y = cbind( males, females) > model = glm( y ~ density, binomial) > summary(model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.0807368 0.1550376 0.521 0.603 density 0.0035101 0.0005116 6.862 6.81e-12 *** Null deviance: 71.159 on 7 degrees of freedom Residual deviance: 22.091 on 6 degrees of freedom AIC: 54.618 Not too bad, but let's try the log(density) version (watch resid variance). > model = glm( y ~ log(density), binomial) > summary(model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.65927 0.48758 -5.454 4.92e-08 *** log(density) 0.69410 0.09056 7.665 1.80e-14 *** Null deviance: 71.1593 on 7 degrees of freedom Residual deviance: 5.6739 on 6 degrees of freedom AIC: 38.201
Logistic regression (concl) Draw the fitted model through the scatter: > xv = seq(0,6,0.05) > par(mfrow=c(1,1)) > plot(log(density),p,ylab="Proportion male") > lines(xv,predict(model,list(density=exp(xv)), + type="response"))
questions? • please fill out an evaluation before you leave • feel free to contact me anytime