R by Example CNR SBCS workshop Gary Casterline
Experience and Applications
• Name and Dept/Lab
• Experience with Statistics / Software
• Your research and how R might help
3 easy steps to get R
1. Go to http://cran.cnr.berkeley.edu
2. Select a platform-specific binary (Windows, OS X, Unix/Linux)
3. Save to disk and install
R is already installed on the lab machines, so just start it up.
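Not on the slides, but once R is running you can confirm the installation with two base commands (a minimal check):
> R.version.string    # print the installed R version
> sessionInfo()       # platform details and loaded packages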
Set up working environment for project
• create a directory for today's workshop, say C:\temp\gary
  (you may use your own name :)
• change to that directory within R with either the File > Change dir... menu command
• or using the friendly command line (note that R wants forward slashes, or doubled backslashes, in Windows paths):
> setwd("C:/temp/gary")
> getwd()
[1] "C:/temp/gary"
> dir()               # shows what is there
character(0)
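If the folder does not exist yet, it can also be created from inside R (not shown on the slide); dir.create() is a base function:
> dir.create("C:/temp/gary", recursive=TRUE)   # make the folder (and any parents) if missing
> setwd("C:/temp/gary")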
Excel into R
• get this spreadsheet using a web browser: http://nature.berkeley.edu/~casterln/jantemps.xls
• open in Excel and Save As file type .csv in your temp folder
• back in R:
> dir()                    # is it there yet?
[1] "jantemps.csv"         # yes!
> jantemps = read.csv("jantemps.csv", header=T)   # create dataframe
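A quick sanity check on the imported data can save trouble later; these are standard base functions (not part of the slide):
> dim(jantemps)       # number of rows and columns
> str(jantemps)       # column names and types
> head(jantemps)      # first few rows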
attaching dataframes to search path
> names(jantemps)
[1] "tmax" "tmin" "day"
> attach(jantemps)
> search()            # now where is R going to look for stuff
 [1] ".GlobalEnv"        "jantemps"          "package:methods"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "Autoloads"
[10] "package:base"
> max(tmax)
[1] 10.8
> min(tmin)
[1] -11.5
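attach() makes the columns visible by name, but it can cause confusion if objects mask each other; an alternative (not on the slide) is with(), which evaluates an expression inside the dataframe without changing the search path:
> with(jantemps, max(tmax))
> with(jantemps, range(tmin))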
building up a plot
> plot(day, tmax, ylim=c(-12,12), type="n", ylab="Temperature")
> points(day, tmin, col="blue", pch=16)
> points(day, tmax, col="red", pch=16)
> for( i in 1:31 ) lines( c(i,i), c(tmin[i],tmax[i]), col="green")
> detach(jantemps)
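The for loop draws one vertical line per day; base R's segments() is vectorized and draws them all in one call (an equivalent alternative, not what the slide uses):
> with(jantemps, segments(day, tmin, day, tmax, col="green"))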
plot with categorical variables
> weather = read.table(
+   "http://nature.berkeley.edu/~casterln/crawley/SilwoodWeather.txt", header=T)
> names(weather)
[1] "upper" "lower" "rain"  "month" "yr"
> attach(weather)
> month <- factor(month)
> plot(month, upper)
> detach(weather)
> objects()
[1] "jantemps" "month"    "weather"
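When the x variable is a factor, plot() produces one box-and-whisker plot per level; the same picture can be requested explicitly with boxplot() and a formula (an equivalent call, not on the slide):
> boxplot(upper ~ factor(month), data=weather, xlab="month", ylab="upper")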
multivariate plots
> pollution = read.table(
+   "http://nature.berkeley.edu/~casterln/crawley/pollute.txt", header=T)
> names(pollution)
[1] "Pollution"  "Temp"  "Industry"  "Population"  "Wind"  "Rain"  "Wet.days"
> pairs(pollution, panel=panel.smooth)
• every possible pairwise plot
• each plot in the 2nd row has Temp on the y-axis
• each plot in the 1st column has Pollution on the x-axis
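A numeric companion to the scatterplot matrix (not on the slide) is the correlation matrix; cor() works directly on a numeric dataframe like this one:
> round(cor(pollution), 2)    # pairwise correlations, rounded for readability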
tree-based models
> library(tree)       # may have to install the package first
> regtree = tree(Pollution ~ . , data=pollution)
> plot(regtree)
> text(regtree)
• goal is to create homogeneous buckets by splitting the data at optimal levels of the explanatory variables
• first split is on Industry; the mean pollution level where Industry > 748 is 67
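If library(tree) fails because the package is missing, it can be fetched from CRAN first (the standard R mechanism, not shown on the slide), then re-run the commands above:
> install.packages("tree")    # one-time download from CRAN
> library(tree)
> print(regtree)              # text listing of the splits, once regtree is refitted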
conditioning plots
> attach(pollution)   # may already be attached (check search())
> coplot(Pollution ~ Temp | Rain)
• panels are ordered from lower left, row-wise, to upper right, from lowest rainfall to highest
• big range of rain in the lower left panel
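coplot() chooses how to slice the conditioning variable on its own; the number and overlap arguments of the graphics function (not used on the slide) control how many panels appear and how much the Rain intervals overlap:
> coplot(Pollution ~ Temp | Rain, number=4, overlap=0.25)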
which analysis is right?
Answer these questions about your data:
• which variable is the response?
• which are the explanatory variables?
• are the explanatory variables continuous, categorical, or a mix?
• what kind of response: continuous, count, proportion, time-at-death, or category?
Then use this guide (skeleton R calls follow the list):
1. Explanatory variables
   all explanatory variables continuous: Regression
   all explanatory variables categorical: Anova
   explanatories both continuous and categorical: Ancova
2. Response variable
   Continuous: Normal Regression, Anova or Ancova
   Proportion: Logistic Regression (glm, binomial errors)
   Count: Contingency Tables (glm, poisson errors)
   Binary: Binary logistic analysis (glm, binomial errors)
   Time-at-death: Survival analysis
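As a rough translation of that guide into R calls, here are the general forms (a sketch, not from the slides; y, x, f, count, time, status are placeholder names for your own variables):
> lm(y ~ x1 + x2)                              # regression: continuous explanatories
> aov(y ~ f1 * f2)                             # anova: categorical explanatories
> lm(y ~ x + f)                                # ancova: a mix of both
> glm(cbind(success, failure) ~ x, binomial)   # proportion data
> glm(count ~ x, poisson)                      # count data
> glm(y01 ~ x, binomial)                       # binary (0/1) response
> library(survival); survreg(Surv(time, status) ~ x)   # time-at-death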
log transform and prediction
> decay = read.table("http://nature.berkeley.edu/~casterln/crawley/decay.txt",
+   header=T)
> attach(decay)
> names(decay)
[1] "x" "y"
> plot(x,y)
> abline(lm(y~x))
> result = lm(y~x)
> summary(result)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-19.065 -10.029  -2.058   5.107  40.447

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.5534     5.0277   16.82  < 2e-16 ***
x            -2.8272     0.2879   -9.82 9.94e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.34 on 29 degrees of freedom
Multiple R-Squared: 0.7688, Adjusted R-squared: 0.7608
F-statistic: 96.44 on 1 and 29 DF, p-value: 9.94e-11
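The coefficient table can also be pulled out programmatically; coef() and confint() are standard extractor functions (not shown on the slide):
> coef(result)        # intercept and slope
> confint(result)     # 95% confidence intervals for both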
regression summary plots
> par(mfcol=c(2,2))
> plot(result)
• U-shaped residuals vs fitted
• qq plot not a straight line (non-normal residuals)
check transformed model
> transformed = lm(log(y) ~ x)
> plot(transformed)
• much better
• qq plot nearly straight
• but we may have nonconstant variance (heteroskedasticity)
plot predicted w/back-transform
> smoothx = seq(0,30,0.1)
> smoothy = exp(predict(transformed, list(x=smoothx)))
> par(mfrow=c(1,1))
> plot(x,y)
> lines(smoothx, smoothy)
• use predict() with the transformed model and the smoothed x's
• and unlog (exp()) the results so they'll fit on the original axes
analysis of covariance with count data
This dataset contains counts of the number of "Species" on plots that have different "Biomass" and soil "pH" (low, mid, high).
> species = read.table(
+   "http://nature.berkeley.edu/~casterln/crawley/species.txt",
+   header=T)
> attach(species)
> names(species)
[1] "pH"      "Biomass" "Species"
> plot(Biomass, Species, type="n")
> spp = split(Species, pH)
> bio = split(Biomass, pH)
> points(bio[[1]], spp[[1]], pch=16)
> points(bio[[2]], spp[[2]], pch=17)
> points(bio[[3]], spp[[3]])
Note the use of split() to generate sub-lists; check out what is in bio and bio[[1]] etc.
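To do that checking, the split lists can be inspected with a few base functions (not part of the slide):
> names(bio)               # one list element per pH level
> sapply(spp, length)      # how many plots fall in each pH group
> str(bio[[1]])            # the Biomass values for the first level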
Ancova w/count data (cont.)
> model1 = glm( Species ~ Biomass*pH, poisson)
> summary(model1)
> model2 = glm( Species ~ Biomass + pH, poisson)
> anova(model1, model2, test="Chi")
Analysis of Deviance Table

Model 1: Species ~ Biomass * pH
Model 2: Species ~ Biomass + pH
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        84     83.201
2        86     99.242 -2  -16.040 0.0003288
We've tested whether model1 is significantly better than model2. It is: p = 0.00033.
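With Poisson models it is also worth comparing the residual deviance to its degrees of freedom as a rough overdispersion check (a standard diagnostic, not on the slide); a ratio much above 1 would suggest refitting with quasipoisson errors:
> deviance(model1) / df.residual(model1)    # close to 1 is what we want
> deviance(model2) / df.residual(model2)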
Ancova w/count data (concl.)
Now let's add the predictions to our plot:
> xv = seq(0,10,0.10)
> levels(pH)
[1] "high" "low"  "mid"
> length(xv)
[1] 101
> phv = rep("high",101)
> yv = predict(model1, list(pH=factor(phv), Biomass=xv), type="response")
> lines(xv,yv)
> phv = rep("mid",101)
> yv = predict(model1, list(pH=factor(phv), Biomass=xv), type="response")
> lines(xv,yv)
> phv = rep("low",101)
> yv = predict(model1, list(pH=factor(phv), Biomass=xv), type="response")
> lines(xv,yv)
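The three nearly identical blocks can be collapsed into a loop over the factor levels (an equivalent rewrite, not what the slide shows); setting the levels explicitly keeps predict() happy:
> for (lev in levels(pH)) {
+   phv = factor(rep(lev, length(xv)), levels=levels(pH))
+   lines(xv, predict(model1, list(pH=phv, Biomass=xv), type="response"))
+ }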
logistic reg w/bin errors
Question: Does the proportion of males increase with the density of insects?
> numbers = read.table(
+   "http://nature.berkeley.edu/~casterln/crawley/sexratio.txt",
+   header=T)
> attach(numbers)
> par(mfrow=c(1,2))
> p = males/(males+females)
> plot(density, p, ylab = "Proportion male")
> plot(log(density), p, ylab = "Proportion male")
Log looks like it will work better. Let's see.
Logistic regression w/binomial errors (cont.)
The "y" variable should be a 2-column matrix:
> y = cbind( males, females)
> model = glm( y ~ density, binomial)
> summary(model)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0807368  0.1550376   0.521    0.603
density     0.0035101  0.0005116   6.862 6.81e-12 ***

    Null deviance: 71.159 on 7 degrees of freedom
Residual deviance: 22.091 on 6 degrees of freedom
AIC: 54.618
Not too bad, but let's try the log(density) version (watch the residual deviance).
> model = glm( y ~ log(density), binomial)
> summary(model)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.65927    0.48758  -5.454 4.92e-08 ***
log(density)  0.69410    0.09056   7.665 1.80e-14 ***

    Null deviance: 71.1593 on 7 degrees of freedom
Residual deviance:  5.6739 on 6 degrees of freedom
AIC: 38.201
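Because both fits were stored under the same name, a direct comparison needs them kept separately; a small sketch (using the hypothetical names model_lin and model_log) compares the two by AIC:
> model_lin = glm( y ~ density, binomial)
> model_log = glm( y ~ log(density), binomial)
> AIC(model_lin, model_log)    # the lower AIC favors the log(density) model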
Logistic regression (concl.)
Draw the fitted model through the scatter:
> xv = seq(0,6,0.05)
> par(mfrow=c(1,1))
> plot(log(density), p, ylab="Proportion male")
> lines(xv, predict(model, list(density=exp(xv)), type="response"))
questions?
• please fill out an evaluation before you leave
• feel free to contact me anytime