An Introduction to R

An Introduction to R Xiaozhi Zhou

What is R?
• The R system for statistical computing is an environment for data manipulation, calculation and graphical display.
• R can be regarded as an implementation of the S language (also the basis of the S-Plus system), which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
• Strengths
  • free and open source, supported by a strong user community
  • highly extensible and flexible
  • implementation of modern statistical methods
  • moderately flexible graphics with intelligent defaults
• Weaknesses
  • can be slow, or run out of memory, with large data sets, because data are held in memory
  • non-standard programming paradigms

Installing R
• Main source: http://www.R-project.org
• The base system is available for Windows, Mac OS X and Linux/Unix platforms
• Current version at the time of writing: 2.8.0 (2008-10-20)

Documentation
• An Introduction to R: gives an introduction to the language and how to use R for statistical analysis and graphics.
• R Data Import/Export: describes the import and export facilities available either in R itself or via packages which are available from CRAN.
• R Installation and Administration: hints for installing R on special platforms.
• Writing R Extensions: covers how to create your own packages, write R help files, and use the foreign language (C, C++, Fortran, ...) interfaces.
• The R Reference Index: contains all help files of the R standard and recommended packages in printable form.

Installing Packages
• A large amount of additional functionality is implemented in add-on packages, which can be downloaded from the official home page
• Packages can be installed directly from the R prompt:
  • R> install.packages("package name")
  • or from the menu: 'Packages' - 'Install package(s)'
• The package functionality is available after attaching the package with:
  • R> library("package name")

Help
• Online help via the official home page
• Help systems in R:
  • R> help(help) or ?help for short
  • R> help.start() opens the help in HTML format
  • R> help("mean") or ?mean for short
  • R> help(package="survey") lists the help available for a package
  • R> help.search("topic") searches the help pages for a topic

R commands, case sensitivity, recall commands
• Case sensitive: A and a are different symbols and refer to different variables
• Elementary commands consist of either expressions or assignments
  • Expressions: evaluated and printed, after which the value is lost
  • Assignments: evaluate an expression and pass the value to a variable; the result is not automatically printed
• Commands are separated either by a newline or by a ';'
• Elementary commands can be grouped together into one compound expression by '{' and '}'
• A comment starts with # and runs to the end of the line
• Recall previous commands: vertical arrow keys ↑ ↓
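The rules above can be tried out with a few lines at the prompt. This is only a small sketch; the variable names (total, Total, a, b, result) are made up for illustration.

```r
# Expression: evaluated and printed, but the value is not stored.
2 + 3

# Assignment: the value is stored in `total`; nothing is printed.
total <- 2 + 3

# R is case sensitive: `Total` is a different symbol from `total`.
Total <- 10

# Two commands on one line, separated by ';'.
a <- 1; b <- 2

# A compound expression: its value is that of the last expression inside.
result <- {
  tmp <- a + b      # a comment runs from '#' to the end of the line
  tmp * total
}
print(result)
```

Typing the name of a variable on its own (for example total) is itself an expression, so it prints the stored value.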

Data import and export
• Reading data from files
• read.table() function: reads an entire data frame
  • R> survey <- read.table("survey.csv", header=TRUE, sep=",", row.names=1)
  • header=TRUE: the first line of the file is a line of column headings
  • sep=",": columns are separated by a comma
  • row.names=1: the first column is interpreted as row names rather than as a variable
• scan() function: reads in vectors, here returned together as a list
  • R> inp <- scan("input.dat", list("",0,0))
  • the second argument describes the three vectors to be read: one character and two numeric
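As a self-contained sketch of the calls above, the following writes a small made-up file to a temporary location and reads it back. The file contents and column names (id, age, score) are assumptions for illustration, not the survey.csv from the slide. Since the slide covers only import, the export step with write.table() is an addition.

```r
# Hypothetical data written to a temporary file so the example runs anywhere.
path <- tempfile(fileext = ".csv")
writeLines(c("id,age,score",
             "a,23,4.5",
             "b,31,3.9",
             "c,27,4.8"), path)

# Read the file as a data frame; the first column becomes the row names.
survey <- read.table(path, header = TRUE, sep = ",", row.names = 1)
str(survey)

# Read the same file with scan(): skip the header line and describe the
# three columns as one character and two numeric vectors.
inp <- scan(path, what = list("", 0, 0), sep = ",", skip = 1, quiet = TRUE)

# Export: col.names = NA writes an empty heading above the row-name column,
# so the file can be read back with row.names = 1.
out <- tempfile(fileext = ".csv")
write.table(survey, out, sep = ",", col.names = NA)
back <- read.table(out, header = TRUE, sep = ",", row.names = 1)
```

scan() returns a plain list of vectors, whereas read.table() returns a data frame with named columns, which is usually the more convenient form for analysis.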

Reading a CSV file (for example, one exported from Excel)
• R> read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="", ...)
  • file - the name of the file from which the data are to be read
  • header - a logical value indicating whether the file contains the names of the variables as its first line
  • sep - the field separator character; values on each line of the file are separated by this character
  • quote - the set of quoting characters
  • dec - the character used in the file for decimal points
• Check help(read.csv) for more information
• R> read.csv("F:/houses.data.csv", header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="")
      X Price Floor Area Rooms Age Cent.heat
    1 1 52.00   111  830     5 6.2        No
    2 2 54.75   128  710     5 7.5        No
    3 3 57.50   101 1000     5 4.2        No
    4 4 57.50   131  690     5 8.8        No
    5 5 59.75    93  900     5 1.9       Yes
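The column named X in the output above is what read.csv() produces when the first column of the file has an empty heading, which typically happens when row names were written into the file. The sketch below reproduces this with the five rows shown on the slide, written to a temporary file. That the original file was created this way is an assumption.

```r
# The five rows shown above, rebuilt as a data frame.
houses <- data.frame(
  Price     = c(52.00, 54.75, 57.50, 57.50, 59.75),
  Floor     = c(111, 128, 101, 131, 93),
  Area      = c(830, 710, 1000, 690, 900),
  Rooms     = c(5, 5, 5, 5, 5),
  Age       = c(6.2, 7.5, 4.2, 8.8, 1.9),
  Cent.heat = c("No", "No", "No", "No", "Yes")
)

# write.csv() writes the row names as a first column with an empty heading.
path <- tempfile(fileext = ".csv")
write.csv(houses, path)

# Read with defaults: the unnamed first column is given the name "X".
with_x <- read.csv(path)
names(with_x)

# Read with row.names = 1: the first column becomes the row names instead.
no_x <- read.csv(path, row.names = 1)
names(no_x)
```

read.csv() is read.table() with header = TRUE and sep = "," as defaults, so these arguments only need to be given when they differ.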

Simple manipulations; numbers and vectors
• Vectors and assignment (the first three commands are equivalent)
  • R> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
  • R> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
  • R> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
  • R> y <- 2*x-3.5
• Vector arithmetic (here the sample variance of x)
  • R> sum((x-mean(x))^2)/(length(x)-1)
• Generating regular sequences
  • R> seq(-5, 5, by=.2) -> s3
  • R> s4 <- seq(length=51, from=-5, by=.2)
  • R> s5 <- rep(x, times=5)
  • R> x6 <- rep(x, each=5)
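The commands above can be checked against each other. As a short sketch, the hand-written variance should agree with the built-in var(), and the two seq() calls should describe the same sequence. The small vector 1:3 is used only to make the difference between times and each visible.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Arithmetic is element by element; the scalars 2 and 3.5 are recycled.
y <- 2 * x - 3.5

# Sample variance written out by hand, compared with the built-in var().
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(manual_var, var(x))

# Two ways of describing the same sequence from -5 to 5 in steps of 0.2.
s3 <- seq(-5, 5, by = 0.2)
s4 <- seq(length = 51, from = -5, by = 0.2)
length(s3)
all.equal(s3, s4)

# times repeats the whole vector; each repeats every element in turn.
rep(1:3, times = 2)   # 1 2 3 1 2 3
rep(1:3, each = 2)    # 1 1 2 2 3 3
```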

Simple manipulations; numbers and vectors
• Logical vectors (elements can be TRUE, FALSE or NA)
  • R> temp <- x>13
  • R> str(temp)
    logi [1:5] FALSE FALSE FALSE FALSE TRUE
• Character vectors
  • R> labs <- paste(c("X", "Y"), 1:10, sep="")
  • R> str(labs)
    chr [1:10] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"
• Missing values
  • R> z <- c(1:3, NA); is.na(z)
  • is.na(xx) is TRUE both for NA and NaN values
  • is.nan(xx) is only TRUE for NaNs
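Logical vectors are most often used to select elements, and missing values propagate through most calculations unless they are removed explicitly. A brief sketch follows; the names big and w are made up for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# A logical vector can be used as an index to pick out elements.
big <- x > 13
x[big]        # the elements of x greater than 13
sum(big)      # TRUE counts as 1, so this is the number of such elements

# Missing values propagate unless removed explicitly.
z <- c(1:3, NA)
mean(z)                  # NA
mean(z, na.rm = TRUE)    # mean of the non-missing values

# NA and NaN are both "missing" for is.na(), but only NaN is "not a number".
w <- c(0/0, NA, 1)
is.na(w)      # TRUE  TRUE FALSE
is.nan(w)     # TRUE FALSE FALSE
```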

Simple summary statistics
• R> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
• R> summary(x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       3.10    5.60    6.40    9.44   10.40   21.70
• R> str(x)
    num [1:5] 10.4 5.6 3.1 6.4 21.7

Lists and data frames
• R> Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4, 7, 9))
• R> str(Lst)
    List of 4
     $ name       : chr "Fred"
     $ wife       : chr "Mary"
     $ no.children: num 3
     $ child.ages : num [1:3] 4 7 9
• Components are accessed as name$component_name
  • R> str(Lst$child.ages)
    num [1:3] 4 7 9
• attach() and detach(): make the components of a list or data frame visible by name, and remove them again
  • R> attach(Lst)
  • R> str(child.ages)
    num [1:3] 4 7 9
  • R> detach(Lst)
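Besides the $ operator, components can be extracted by name or position with [[ ]], and with() gives the convenience of attach() for a single expression without changing the search path. The sketch below also builds a small data frame, which is a list of equal-length columns. The child labels "A", "B", "C" are made up for illustration.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

# Three equivalent ways to reach the same component.
Lst$child.ages
Lst[["child.ages"]]
Lst[[4]]

# with() evaluates one expression with the components visible by name.
mean_age <- with(Lst, mean(child.ages))

# A data frame is a list whose components are columns of equal length.
family <- data.frame(child = c("A", "B", "C"), age = Lst$child.ages)
str(family)
family$age[family$age > 5]   # columns are accessed like list components
```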

Calculate simple correlation
• Arrange the data in a .CSV file and read the file into memory
• The file is read twice: using one column as row names leaves the other column as the only variable
  • R> weight = read.csv("F:/height-weight.csv", header=TRUE, row.names=1)
  • R> height = read.csv("F:/height-weight.csv", header=TRUE, row.names=2)
• R> cor(height, weight, method = "pearson")
              Weight
    Height 0.9936593
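An alternative that reads the data once and refers to the columns by name is sketched below. The height and weight values are made up for illustration, since the contents of height-weight.csv are not shown, so the resulting coefficient will differ from the one above.

```r
# Hypothetical heights (cm) and weights (kg) standing in for the CSV file.
hw <- data.frame(Height = c(150, 160, 165, 172, 180, 185),
                 Weight = c(50, 56, 61, 66, 74, 79))

# Pearson correlation between the two columns.
r <- cor(hw$Height, hw$Weight, method = "pearson")

# The same quantity from its definition: covariance over the product of
# the standard deviations.
manual <- cov(hw$Height, hw$Weight) / (sd(hw$Height) * sd(hw$Weight))
all.equal(r, manual)

# cor.test() adds a significance test and a confidence interval.
ct <- cor.test(hw$Height, hw$Weight)
ct$estimate
```

Reading the file once avoids using data values as row names, which fails when a column contains duplicate values.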

Conduct a linear regression analysis
• The basic function is: lm(formula, data)
• Example data from Annette Dobson (1990), "An Introduction to Generalized Linear Models"
  • R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
  • R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
  • R> group <- gl(2,10,20, labels=c("Ctl","Trt"))
  • gl(): generates factors by specifying the pattern of their levels
  • R> weight <- c(ctl, trt)
• R> anova(lm.D9 <- lm(weight ~ group))
    Analysis of Variance Table
    Response: weight
              Df Sum Sq Mean Sq F value Pr(>F)
    group      1 0.6882  0.6882  1.4191  0.249
    Residuals 18 8.7293  0.4850

Conduct a linear regression analysis
• The term - 1 in the formula removes the intercept, so each coefficient is the mean of one group
• R> summary(lm.D90 <- lm(weight ~ group - 1))
    Call:
    lm(formula = weight ~ group - 1)
    Residuals:
        Min      1Q  Median      3Q     Max
    -1.0710 -0.4938  0.0685  0.2462  1.3690
    Coefficients:
             Estimate Std. Error t value Pr(>|t|)
    groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
    groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual standard error: 0.6964 on 18 degrees of freedom
    Multiple R-squared: 0.9818, Adjusted R-squared: 0.9798
    F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16
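The two fits, with and without an intercept, describe the same group means in different ways. The sketch below uses the plant weight data from the slides to make the relationship explicit. The helper name group_means is introduced here for illustration.

```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

lm.D9  <- lm(weight ~ group)       # intercept + difference between groups
lm.D90 <- lm(weight ~ group - 1)   # one coefficient per group, no intercept

# Without an intercept the coefficients are simply the group means.
group_means <- tapply(weight, group, mean)
coef(lm.D90)
group_means

# With an intercept: (Intercept) is the Ctl mean, groupTrt is Trt minus Ctl.
coef(lm.D9)

# Both models give the same fitted values.
all.equal(fitted(lm.D9), fitted(lm.D90))
```

For a model without an intercept, R computes R-squared relative to zero rather than to the overall mean, which is why the value of 0.9818 reported above is so much larger than for the model with an intercept. The two R-squared values should not be compared directly.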

Conduct a generalized linear model (GLM) analysis
• Generalized linear models are fitted with glm(); logistic regression is the special case family=binomial
• The example here is a Poisson log-linear model for counts (family=poisson), not a logistic regression
• R> counts <- c(18,17,15,20,10,20,25,13,12)
• R> outcome <- gl(3,1,9)
    Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3
• R> treatment <- gl(3,3)
    Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3
• R> print(d.AD <- data.frame(treatment, outcome, counts))
• R> glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
• R> anova(glm.D93)
    Analysis of Deviance Table
    Model: poisson, link: log
    Response: counts
    Terms added sequentially (first to last)
              Df Deviance Resid. Df Resid. Dev
    NULL                          8    10.5814
    outcome    2   5.4523         6     5.1291
    treatment  2   0.0000         4     5.1291

Conduct a generalized linear model (GLM) analysis
• R> summary(glm.D93)
    Call:
    glm(formula = counts ~ outcome + treatment, family = poisson())
    Deviance Residuals:
           1         2         3         4         5         6         7         8         9
    -0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  -0.96656
    Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)
    (Intercept)  3.045e+00  1.709e-01   17.815   <2e-16 ***
    outcome2    -4.543e-01  2.022e-01   -2.247   0.0246 *
    outcome3    -2.930e-01  1.927e-01   -1.520   0.1285
    treatment2   8.717e-16  2.000e-01 4.36e-15   1.0000
    treatment3   4.557e-16  2.000e-01 2.28e-15   1.0000
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for poisson family taken to be 1)
    Null deviance: 10.5814 on 8 degrees of freedom
    Residual deviance: 5.1291 on 4 degrees of freedom
    AIC: 56.761
    Number of Fisher Scoring iterations: 4
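The worked example above uses family = poisson(), which gives a log-linear model for counts. For comparison, a logistic regression uses the same glm() function with family = binomial(). The sketch below is a minimal illustration on made-up dose-response data; the variables dose, n and responders are hypothetical and do not come from the slides.

```r
# Hypothetical data: at each dose, `responders` out of `n` subjects respond.
dose       <- c(1, 2, 3, 4, 5, 6, 7, 8)
n          <- rep(20, 8)
responders <- c(1, 3, 5, 9, 12, 15, 18, 19)

# For grouped binary data the response is a two-column matrix of
# successes and failures.
fit <- glm(cbind(responders, n - responders) ~ dose, family = binomial())
summary(fit)

# Coefficients are on the log-odds scale; exp() gives odds ratios.
exp(coef(fit))

# Predicted probability of response at a new dose.
p <- predict(fit, newdata = data.frame(dose = 4.5), type = "response")
p
```

With type = "response" the prediction is returned as a probability; without it, predict() returns the linear predictor on the log-odds scale.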

Simple graphs - Histogram
• The basic function is hist(x), where x is a numeric vector
• R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)
• R> hist(test.data)
• R> hist(log(test.data))
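hist() also returns the binning it computed, which can be inspected without drawing anything, and any plot can be sent to a file instead of the screen. The sketch below shows both; the output file is a temporary PDF, and the title and axis label are chosen here for illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8,
               1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

# plot = FALSE returns the bins and counts without drawing.
h <- hist(test.data, plot = FALSE)
h$breaks   # bin boundaries chosen by default
h$counts   # number of observations in each bin

# Draw the histogram into a PDF file instead of the screen device.
out <- tempfile(fileext = ".pdf")
pdf(out)
hist(test.data, main = "Histogram of test.data", xlab = "Value")
invisible(dev.off())   # close the device so the file is written
file.exists(out)
```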

Simple graphs - Scatter Plot
• The basic function is plot(y ~ x) or, equivalently, plot(x, y); the variable on the left of ~ goes on the vertical axis
• R> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
• R> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
• R> plot(ctl ~ trt)

Simple graphs - Boxplot
• The basic function is boxplot(x) for a single sample, or boxplot(y ~ group) for one box per group
• R> counts <- c(18,17,15,20,10,20,25,13,12)
• R> outcome <- gl(3,1,9)
• R> boxplot(counts~outcome)
• R> test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8, 1.4, 2.5, 2.7, 3.1, 2.6, 2.8)
• R> boxplot(test.data, xlab="Single sample", ylab="Value axis", col="lightblue")
• R> title(main="Plot with outlier", font.main= 4)
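The numbers behind a boxplot can be obtained with boxplot.stats(), which returns the five values drawn as the box and whiskers, together with any points that would be plotted individually as outliers. The sketch below applies it to test.data from the slide; the variable name stats is introduced here for illustration.

```r
test.data <- c(2.1, 2.6, 2.7, 3.2, 4.1, 4.3, 5.2, 5.1, 4.8, 1.8,
               1.4, 2.5, 2.7, 3.1, 2.6, 2.8)

stats <- boxplot.stats(test.data)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats

# Points lying more than 1.5 times the box length beyond the hinges
# are returned here and drawn as individual points in the plot.
stats$out
```

Checking stats$out is a quick way to confirm whether a sample contains any points that the default rule would flag as outliers before giving the plot a title that refers to them.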

References
• B. S. Everitt and T. Hothorn (2006). A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, FL.
• W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. http://www.R-project.org
