1 / 23

R by Example

Join the workshop to enhance your research with R statistical software. Learn how R can help in data analysis and visualization.

gcanales
Download Presentation

R by Example

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. R by Example CNR SBCS workshop Gary Casterline

  2. Experience and Applications • Name and Dept/Lab • Experience with Statistics / Software • Your research and how R might help

  3. 3 easy steps to get R http://cran.cnr.berkeley.edu Select a platform-specific binary Windows OS X Unix/Linux 3. Save to disk and install R is already installed on the lab machines, so just start it up.

  4. Set up working environment for project • create a directory for today's workshop. SayC:\temp\gary • (you may use your own name :) • change to that directory within R with either the File\Change Dir ... menu command • or using the friendly command line: > setwd("C:\temp\gary") > getwd() [1] "C:/temp/gary" > dir()# shows what is there character(0)

  5. Excel into R • get this spreadsheet using a web browser:http://nature.berkeley.edu/~casterln/jantemps.xls • open in Excel and Save-As file-type .csv in your temp folder. • back in R > dir()#is it there yet? [1] "jantemps.csv"# yes! > jantemps = read.csv("jantemps.csv",header=T)#create dataframe

  6. attaching dataframes to search path > names(jantemps) [1] "tmax" "tmin" "day" > attach(jantemps) > search()# now where is R going to look for stuff [1] ".GlobalEnv" "jantemps" "package:methods" [4] "package:stats" "package:graphics" "package:grDevices" [7] "package:utils" "package:datasets" "Autoloads" [10] "package:base" > max(tmax) [1] 10.8 > min(tmin) [1] -11.5

  7. building up a plot > plot(day,tmax,ylim=c(-12,12),type="n",ylab="Temperature") > points(day,tmin,col="blue",pch=16) > points(day,tmax,col="red",pch=16) > for( i in 1:31 ) lines( c(i,i), c(tmin[i],tmax[i]), + col="green") > detach(jantemps)

  8. plot with categorical variables > weather = read.table( + "http://nature.berkeley.edu/~casterln/crawley/SilwoodWeather.txt",header=T) > names(weather) [1] "upper" "lower" "rain" "month" "yr" > attach(weather) > month <- factor(month) > plot(month,upper) > detach(weather) > objects() [1] "jantemps" "month" "weather"

  9. > pollution = read.table( • + "http://nature.berkeley.edu/~casterln/crawley/pollute.txt",header=T) • > names(pollution) • [1] "Pollution" "Temp" "Industry" "Population" "Wind" "Rain" "Wet.days" • > pairs(pollution, • + panel=panel.smooth) • every possible pairwise plot • each plot in 2nd row has Temp on y-axis • each plot in 1st col has Pollution on x-axis multivariate plots

  10. > library(tree)# may have to install package first • > regtree = tree(Pollution ~ . , data=pollution) • > plot(regtree) • > text(regtree) • goal is to create homogeneous buckets by splitting the data at optimal levels of explanatory variables • first split is on Industry, mean pollution level of all Industry > 748 is 67 tree-based models

  11. > attach(pollution)# may already be attached (search() ??) • > coplot(Pollution ~ Temp | Rain) • ordered from lower left, row-wise, to upper right, from lowest rainfall to highest. • big range of rain in lower left panel conditioning plots

  12. which analysis is right? Answer these questions about your data: which variable is the response? which are the explanatory variables? are the explanatory variables continuous, categorical or a mix? what kind of response: continuous, count, proportion, time-at-death or category? Then use this guide: • Explanatory variables all explanatory variables continuous: Regression all explanatory variables categorical: Anova explanatories both cont. and categ: Ancova 2. Response variable Continuous: Normal Regression, Anova or Ancova Proportion: Logistic Regression (glm, binomial errors) Count: Contingency Tables (glm, poisson errors) Binary: Binary logistic analysis (glm Time-at-death: Survival analysis

  13. >decay = read.table("http://nature.berkeley.edu/~casterln/crawley/decay.txt", +header=T) >attach(decay) >names(decay) [1] "x" "y" >plot(x,y) >abline(lm(y~x)) >result = lm(y~x) >summary(result) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -19.065 -10.029 -2.058 5.107 40.447 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 84.5534 5.0277 16.82 < 2e-16 *** x -2.8272 0.2879 -9.82 9.94e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 14.34 on 29 degrees of freedom Multiple R-Squared: 0.7688, Adjusted R-squared: 0.7608 F-statistic: 96.44 on 1 and 29 DF, p-value: 9.94e-11 log transform and prediction

  14. >par(mfcol=c(2,2)) • >plot(result) • U-shaped resid vs fitted • qq plot not a straight line • (non normal residuals) regression summary plots

  15. >transformed = • + lm(log(y) ~ x) • >plot(result) • much better • qq plot nearly straight • but we may have nonconstant • variance (heteroskedasticity) check transformed model

  16. > smoothx = seq(0,30,0.1) • >smoothy = exp(predict(transformed, list(x=smoothx))) • >par(mfrow=c(1,1)) • >plot(x,y) • >lines(smoothx, smoothy) • use predict() with the transformed model • and the smoothed x's • and unlog (exp()) the results so they'll • fit on the original axes. plot predicted w/back-transform

  17. analysis of covariance with count data This dataset contains the counts on the number of "Species" on plots that have different "Biomass" and soil "pH" ( low, mid, high). > species = read.table( + "http://nature.berkeley.edu/~casterln/crawley/species.txt", + header=T) > attach(species) > names(species) [1] "pH" "Biomass" "Species" > plot(Biomass,Species,type="n") > spp = split(Species,pH) > bio = split(Biomass,pH) > points(bio[[1]],spp[[1]],pch=16) > points(bio[[2]],spp[[2]],pch=17) > points(bio[[3]],spp[[3]]) Note the use of split to generate sub lists. check out what is in bio and bio[[1]] etc.

  18. Ancova w/count data (cont.) > model1 = glm( Species ~ Biomass*pH, poisson) > summary(model1) > model2 = glm( Species ~ Biomass + pH, poisson) > anova(model1, model2, test="Chi") Analysis of Deviance Table Model 1: Species ~ Biomass * pH Model 2: Species ~ Biomass + pH Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 84 83.201 2 86 99.242 -2 -16.040 0.0003288 We've tested whether model1 is significantly better than model2. It is: p=.00032

  19. Ancova w/count data (concl.) Now lets add the predictions to our plot: > xv = seq(0,10,0.10) > levels(pH) [1] "high" "low" "mid" > length(xv) [1] 101 > phv = rep("high",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv) > phv = rep("mid",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv) > phv = rep("low",101) > yv = predict(model1,list(pH=factor(phv),Biomass=xv),type="response") > lines(xv,yv)

  20. logistic reg w/bin errors Question: Does the proportion of males increase with density of insects? >numbers = read.table( + "http://nature.berkeley.edu/~casterln/crawley/sexratio.txt", + header=T) > attach(numbers) > par(mfrow=c(1,2)) > p = males/(males+females) > plot(density, p, ylab = "Proportion male") > plot(log(density),p,ylab = "Proportion male") Log looks like it will work better. Lets see.

  21. Logistic regression w/binomial errors (cont.) The "y" variable should be 2-column matrix: > y = cbind( males, females) > model = glm( y ~ density, binomial) > summary(model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.0807368 0.1550376 0.521 0.603 density 0.0035101 0.0005116 6.862 6.81e-12 *** Null deviance: 71.159 on 7 degrees of freedom Residual deviance: 22.091 on 6 degrees of freedom AIC: 54.618 Not too bad, but let's try the log(density) version (watch resid variance). > model = glm( y ~ log(density), binomial) > summary(model) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.65927 0.48758 -5.454 4.92e-08 *** log(density) 0.69410 0.09056 7.665 1.80e-14 *** Null deviance: 71.1593 on 7 degrees of freedom Residual deviance: 5.6739 on 6 degrees of freedom AIC: 38.201

  22. Logistic regression (concl) Draw the fitted model through the scatter: > xv = seq(0,6,0.05) > par(mfrow=c(1,1)) > plot(log(density),p,ylab="Proportion male") > lines(xv,predict(model,list(density=exp(xv)), + type="response"))

  23. questions? • please fill out an evaluation before you leave • feel free to contact me anytime

More Related