Introduction to R

"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator

R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI

RStudio • Datasets • Scripts • Results • Files, plots, packages, & help

Creating a project • Project > Create Project… • Store all R scripts and data in the same folder or directory by creating a project

Script • # CO2 parts per million for 2000-2009 • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) # a range of values • # show values • co2 • year • #compute mean and standard deviation • mean(co2) • sd(co2) • plot(year,co2) • A script is a set of R commands • A program • c is short for combine in c(369.40, …)

Exercise • Plot kWh per square foot by year for the following University of Georgia data. • Smart editing • Copy each column to a word processor • Convert table to text • Search and replace commas with null • Search and replace returns with commas • Edit to put R text around numbers

Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types

Data structures • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) • co2[2] # get the second value • m <- matrix(1:12, nrow=4,ncol=3) • m[4,3] • Vector • A single row table where data are all of the same type • Matrix • A table where all data are of the same type

Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

Data structures • a <- array(1:24, c(4,3,2)) • a[1,1,1] • gender <- c("m","f","f") • age <- c(5,8,3) • df <- data.frame(gender,age) • df[1,2] • df[1,] • df[,2] • Array • Extends a matrix beyond two dimensions • Data frame • Same as a relational table • Columns can have different data types • Typically, read a file to create a data frame

Data structures • l <- list(co2,m,df) • l[[3]] # third element of the list (df) • l[[1]][2] # second value of the first element (co2) • List • An ordered collection of objects • Can store a variety of objects under one name

Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …

Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio

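Factors • Nominal and ordinal data are factors • Before R 4.0.0, strings were converted to factors by default when creating or reading a data frame; from 4.0.0 onward they stay as character strings unless stringsAsFactors=TRUE is specified • Factors determine how data are analyzed and presented • Failure to realize that a column contains a factor can cause confusion • Use str() to find out a data frame's structure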
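
A minimal sketch of how nominal and ordinal data can be stored as factors. The vectors `gender` and `size` and their values are made up for illustration; `factor()`, `levels()`, and `str()` are base R functions.

```r
# Nominal data: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)            # levels are sorted alphabetically: "f" "m"

# Ordinal data: categories with a meaningful order (illustrative values)
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size[1] < size[2]         # comparisons respect the level order

# Build a data frame and inspect how each column is stored
df <- data.frame(gender, size, age = c(5, 8, 3))
str(df)
```

Running `str(df)` shows which columns are factors, which is usually the quickest way to spot an unexpected factor column.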

Missing values • Missing values are indicated by NA (not available) • Arithmetic expressions and functions containing missing values generate missing values • Use the na.rm=T option to exclude missing values from calculations • sum(c(1,NA,2)) • sum(c(1,NA,2),na.rm=T)

Missing values • Remove rows with missing values by using na.omit() • gender <- c("m","f","f","f") • age <- c(5,8,3,NA) • df <- data.frame(gender,age) • df2 <- na.omit(df)
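
A short sketch of how missing values propagate and how they can be counted. It uses only base R; the `age` values repeat the slide's example and the variable names `n_missing` and `mean_age` are introduced here for illustration.

```r
# Illustrative vector with one missing value
age <- c(5, 8, 3, NA)

is.na(age)                           # flags which values are missing
n_missing <- sum(is.na(age))         # count of missing values

mean(age)                            # NA, because one value is missing
mean_age <- mean(age, na.rm = TRUE)  # mean of the three observed values
```

Counting missing values before an analysis is a cheap check that `na.rm=T` or `na.omit()` is not silently discarding most of the data.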

Packages • R’s base set of packages can be extended by installing additional packages • Over 4,000 packages • Search the R Project site to identify packages and functions • Install using RStudio • Packages must be installed prior to use, and their use must be specified in a script • require(packagename)

Packages
# install ONCE on your computer
# can also use RStudio to install
install.packages("knitr")
# require EVERY TIME before using a package in a session
# loads the package into memory
require(knitr)

Compile a notebook • A notebook is a report of an analysis • Interweaves R code and output • File > Compile Notebook … • Select html, pdf, or Word output • Install knitr before use • Install suggested packages

PDF

Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS

Reading a text file • t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep=',') • The path above is local to the author's computer, so R will not find this file on your computer • Delimited text file, such as CSV • Creates a data frame • Specify as required • Presence of header • Separator • Row names
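
A self-contained sketch of the header and separator options, assuming made-up data supplied through the `text` argument of `read.table()` so that no local file is needed. The name `csv_text` and the values are illustrative only.

```r
# Inline text stands in for a delimited file (values are made up)
csv_text <- "year,month,temperature
1999,1,33.2
1999,2,38.7
2000,1,31.3"

# header = TRUE: the first line holds column names; sep = ",": comma-delimited
d <- read.table(text = csv_text, header = TRUE, sep = ",")

dim(d)    # 3 rows, 3 columns
str(d)    # year and month are integers, temperature is numeric
```

Replacing `text = csv_text` with a file path or URL reads the same structure from a file.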

Reading a text file • url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt" • t <- read.table(url, header=T, sep=',') Can read a file using a URL

Learning about an object
Click on the name of the file in the top-right window to see its content
Click on the blue icon of the file in the top-right window to see its structure
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimension
str(t) # structure of a dataset
class(t) # type of object

Referencing data
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
# qualify with the dataset name to reference columns
mean(t$temperature)
max(t$year)
range(t$month)
datasetName$columnName (data set, then column)

Creating a new column
url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt"
t <- read.table(url, header=T, sep=',')
# compute Celsius
t$Ctemp = round((t$temperature-32)*5/9,1)

Reshaping • Converting data from one format to another • Melt • Wide to narrow • Cast • Narrow to wide

Reshaping
require(reshape)
url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv'
s <- read.table(url, header=F, sep=',')
colnames(s) <- c('year', 1:12)
# melt: wide to narrow, one row per year and month
m <- melt(s,id='year')
colnames(m) <- c('year','month','co2')
# cast: narrow back to wide, one column per month
c <- cast(m,year~month, value='co2')
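
For readers without the reshape package, here is a hedged sketch of the wide-to-narrow step using base R's `reshape()` function instead of `melt()`. This is a swapped-in technique, not the slide's method; the data frame `wide`, its column names `m1` and `m2`, and all values are made up.

```r
# Wide format: one row per year, one column per month (made-up values)
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(369.1, 370.2),
                   m2 = c(369.5, 370.9))

# Wide to narrow: one row per year-month combination
narrow <- reshape(wide, direction = "long",
                  varying = c("m1", "m2"), v.names = "co2",
                  timevar = "month", times = 1:2, idvar = "year")
narrow
```

The narrow result has one `co2` value per row, which is the layout most plotting and aggregation functions expect.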

Writing files
The file is stored in the project's folder
url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
t <- read.table(url, header=T, sep=',')
# compute Celsius and round to one decimal place
t$Ctemp = round((t$temperature-32)*5/9,1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write.table(t,"centralparktempsCF.txt")

sqldf • An R package for using SQL with data frames • Returns a data frame • Uses SQLite by default; also supports MySQL

Subset • Selecting rows • trow <- t[t$year == 1999,] • trowSQL <- sqldf("select * from t where year = 1999") • Selecting columns • tcol <- t[,c(1:2,4)] • tcolSQL <- sqldf("select year, month, Ctemp from t") • Selecting rows and columns • trowcol <- t[(t$year > 1989 & t$year < 2000) ,c(1:2,4)] • trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")

Sort • Sort on column 1 descending and column 2 ascending • Sorting on column name • s <- t[order(-t$year, t$month),] • sSQL <- sqldf("select * from t order by year desc, month") • Sorting on column position • s <- t[order(-t[,1], t[,2]),]

Recoding • Some analyses might be facilitated by the recoding of data • Split a continuous measure into two categories • t$Category <- 'Other' • t$Category[t$Ftemp >= 30] <- 'Hot'
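
A small sketch of two other ways to recode a continuous measure, using base R's `ifelse()` and `cut()`. The `Ftemp` values are made up, the threshold of 30 follows the slide, and the break points and labels passed to `cut()` are assumptions chosen only for illustration.

```r
# Illustrative Fahrenheit temperatures
Ftemp <- c(25.0, 48.3, 71.9, 30.0)

# Two categories with ifelse(), using the slide's threshold
category <- ifelse(Ftemp >= 30, "Hot", "Other")

# More than two categories with cut(); break points are assumptions
band <- cut(Ftemp, breaks = c(-Inf, 32, 60, Inf),
            labels = c("Freezing", "Mild", "Warm"))
table(band)
```

`cut()` returns a factor, so the recoded column is ready for tabulation or grouping.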

Deleting a column • Assign NULL • t$Category <- NULL

Exercise • Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Export a CSV file that contains three columns: year, month, and average CO2 • Read the file into R • Recode missing values (-99.99) to NA • Plot year versus CO2

Tabulating data
Report counts
url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
t <- read.table(url, header=T, sep=',')
# tabulate temperatures in integer format
table(round(t$temperature,0))
sqldf("select round(temperature,0) as temperature, count(*) as frequency from t group by round(temperature,0)")
# tabulate temperatures by month
table(round(t$temperature,0),t$month)
sqldf("select month, round(temperature,0) as temperature, count(*) as frequency from t group by month, round(temperature,0)")

Aggregate data • Summarize data using a specified function • Compute the mean monthly temperature for each year • # average temperature for each year • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) = c('year', 'mean') • a • sqldf("select year, avg(temperature) as mean from t group by year")
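
A self-contained sketch of the same aggregation on made-up data, using the formula interface of `aggregate()` as an alternative to the `by=list(...)` form. The data frame `d` and its values are illustrative only.

```r
# Made-up monthly temperatures for two years
d <- data.frame(year = c(1999, 1999, 2000, 2000),
                temperature = c(30, 40, 32, 44))

# Mean temperature per year using the formula interface
a <- aggregate(temperature ~ year, data = d, FUN = mean)
colnames(a) <- c("year", "mean")
a
```

The formula form keeps the original column names in the result, so the renaming step is only needed when a different name is wanted.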

Merging files • There must be a common column in both files • url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt' • t <- read.table(url, header=T, sep=',') • # average yearly temperature • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) = c('year', 'meanTemp') • # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html) • url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt' • carbon <- read.table(url, header=T, sep=',') • m <- merge(carbon,a,by='year') • mSQL <- sqldf("select a.year, CO2, meanTemp from a, carbon where a.year = carbon.year")
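
A minimal sketch of how `merge()` treats rows that lack a match in the other data frame. The two data frames reuse the slide's names `a` and `carbon`, but the values are made up; `m_all` is a name introduced here.

```r
# Made-up yearly data with partially overlapping years
a <- data.frame(year = c(1999, 2000, 2001),
                meanTemp = c(54.1, 54.9, 55.6))
carbon <- data.frame(year = c(2000, 2001, 2002),
                     CO2 = c(369.4, 371.1, 373.2))

# Default merge keeps only years present in both data frames
m <- merge(carbon, a, by = "year")

# all = TRUE keeps unmatched years and fills the gaps with NA
m_all <- merge(carbon, a, by = "year", all = TRUE)
```

Comparing `nrow(m)` with the row counts of the inputs is a quick way to see how many observations were dropped by the merge.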

Correlation coefficient
cor.test(m$meanTemp,m$CO2)
Pearson's product-moment correlation
data: m$meanTemp and m$CO2
t = 3.1173, df = 51, p-value = 0.002997
95 percent confidence interval:
0.1454994 0.6049393
sample estimates:
cor
0.4000598
Significant

Concatenating files • Taking a set of files with the same structure and creating a single file • Same type of data in corresponding columns • Files should be in the same directory

Concatenating files
Local directory
# read the file names from a local directory
# pattern is a regular expression: file names ending in .csv
filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
# append the files one after another
for (i in seq_along(filenames)) {
  # create the concatenated data frame using the first file
  if (i == 1) {
    cp <- read.table(filenames[i], header=F, sep=',')
  } else {
    temp <- read.table(filenames[i], header=F, sep=',')
    cp <- rbind(cp, temp) # append to existing data frame
    rm(temp) # remove the temporary data frame
  }
}
colnames(cp) <- c('time','watts')
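
An alternative sketch of the same idea without an explicit loop: read every file into a list with `lapply()` and bind the rows once with `do.call(rbind, ...)`. The temporary directory `demo_dir`, the file names `a.csv` and `b.csv`, and the values are made up so that the example runs on any computer.

```r
# Write two small CSV files to a temporary directory (made-up data)
demo_dir <- file.path(tempdir(), "power_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(1:2, c(100, 110)), file.path(demo_dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c(120, 130)), file.path(demo_dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# The pattern is a regular expression: file names ending in .csv
filenames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then bind the rows in one step
frames <- lapply(filenames, read.table, header = FALSE, sep = ",")
cp <- do.call(rbind, frames)
colnames(cp) <- c("time", "watts")
```

Binding once at the end avoids growing the data frame on every iteration, which tends to matter when there are many files.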

Concatenating files
Remote directory with FTP
Takes a while to run
# read the file names from a remote directory (FTP)
require(RCurl)
url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/"
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir,"\n")) # split into filenames
# append the files one after another
for (i in seq_along(filenames)) {
  file <- paste(url,filenames[i],sep='') # concatenate for url
  if (i == 1) {
    cp <- read.table(file, header=F, sep=',')
  } else {
    temp <- read.table(file, header=F, sep=',')
    cp <- rbind(cp, temp) # append to existing data frame
    rm(temp) # remove the temporary data frame
  }
}
colnames(cp) <- c('time','kwh')

Database access: Server
MySQL access
require(rJava)
require(RJDBC)
# load the driver (might need to edit)
drv <- JDBC("com.mysql.jdbc.Driver", "$:/usr/share/java/mysql-connector-java-5.1.16.jar")
# connect to the database
# change user and pwd for your values
conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)

Database access: Personal computer • MySQL access • You need the appropriate Java ARchive (JAR) file for MySQL access installed on your computer • http://dev.mysql.com/downloads/connector/j/

Database access: Personal computer • Decompress and move mysql-connector-java-5.1.xx*-bin.jar • OS X • Macintosh HD/Library/Java/Extensions • Windows • c:\jre\lib\ext • * xx is the release number

Database access: Personal computer
MySQL access
require(RJDBC)
# Load the driver - change the path to point to your jar file (the path differs on Windows)
drv <- JDBC("com.mysql.jdbc.Driver", "/Library/Java/Extensions/mysql-connector-java-5.1.33-bin.jar")
# connect to the database
# change user and pwd for your values
conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)

Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5 pm in August • Determine the maximum temperature for each day in August for each year

Resources • R books • Reference card • Quick-R

Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn
