Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator
R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI
RStudio • Datasets • Scripts • Results • Files, plots, packages, & help
Creating a project • Project > Create Project… Store all R scripts and data in the same folder or directory by creating a project
Script • # CO2 parts per million for 2000-2009 • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) # a range of values • # show values • co2 • year • #compute mean and standard deviation • mean(co2) • sd(co2) • plot(year,co2) • A script is a set of R commands • A program • c is short for combine in c(369.40, …)
Exercise • Plot kWh per square foot by year for the following University of Georgia data. • Smart editing • Copy each column to a word processor • Convert table to text • Search and replace commas with null • Search and replace returns with commas • Edit to put R text around numbers
Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types
Data structures • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) • m <- matrix(1:12, nrow=4,ncol=3) • Vector • A single row table where data are all of the same type • Matrix • A table where all data are of the same type
Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18
Data structures • a <- array(1:24, c(4,3,2)) • gender <- c("m","f","f") • age <- c(5,8,3) • df <- data.frame(gender,age) • Array • Extends a matrix beyond two dimensions • Data frame • Same as a relational table • Columns can have different data types • Typically, read a file to create a data frame
Data structures • l <- list(co2,m,df) • List • An ordered collection of objects • Can store a variety of objects under one name
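A brief sketch of how each structure from the preceding slides can be indexed. The positions picked here are arbitrary and only for illustration, and the co2 vector is shortened to three values.

```r
# Rebuild the structures from the slides (co2 shortened to three values)
co2 <- c(369.40, 371.07, 373.17)
m <- matrix(1:12, nrow = 4, ncol = 3)   # filled column by column
a <- array(1:24, c(4, 3, 2))
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))
l <- list(co2, m, df)

co2[2]          # second element of a vector
m[2, 3]         # row 2, column 3 of a matrix
a[1, 2, 2]      # row 1, column 2, layer 2 of an array
df$age          # a data frame column, referenced by name
l[[3]]$age[1]   # first age in the data frame stored as the third list item
```

Single brackets select by position; double brackets pull one object out of a list so that it can be indexed in turn.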
Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …
Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio
Factors • Nominal and ordinal data are factors • Factors determine how data are analyzed and presented
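A minimal sketch of creating factors for nominal and ordinal data. The size categories are hypothetical and not part of the original slides.

```r
# Nominal data: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)   # levels are sorted alphabetically by default
table(gender)    # counts per category

# Ordinal data: categories with a defined order (hypothetical size labels)
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"   # comparisons respect the level order
```

Declaring the order explicitly matters because R would otherwise sort the levels alphabetically, which rarely matches the intended ranking.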
Missing values • sum(c(1,NA,2)) • sum(c(1,NA,2),na.rm=T) • Missing values are indicated by NA (not available) • Arithmetic expressions and functions containing missing values generate missing values • Use the na.rm=T option to exclude missing values from calculations
Missing values • gender <- c("m","f","f","f") • age <- c(5,8,3,NA) • df <- data.frame(gender,age) • df2 <- na.omit(df) • Remove rows with missing values by using na.omit()
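A short sketch of how missing values might be located and counted before deciding whether to drop them, reusing the data from the slide.

```r
age <- c(5, 8, 3, NA)
is.na(age)                # TRUE only for the missing entry
sum(is.na(age))           # number of missing values
mean(age)                 # NA, because one value is missing
mean(age, na.rm = TRUE)   # mean of the three observed values

df <- data.frame(gender = c("m", "f", "f", "f"), age = age)
df[complete.cases(df), ]  # keeps the same rows as na.omit(df)
```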
Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS
Reading a text file • t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep="\t") • t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",") • Delimited text file, such as CSV • Creates a data frame • Specify as required • Presence of header • Separator • Row names
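A self-contained sketch of the header and separator options. It writes a tiny file to a temporary location first so that it runs without a network connection. The two data rows are made-up values, and the column names are assumed from how the file is used later in the slides.

```r
# Write a tiny comma-delimited file (made-up values) to a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c("year,month,temperature",
             "1999,1,33.1",
             "1999,2,38.0"), path)

# header = TRUE: the first line holds the column names
# sep = ",":     fields are separated by commas
temps <- read.table(path, header = TRUE, sep = ",")
str(temps)   # a data frame with 2 observations of 3 variables
```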
Learning about an object
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimension
str(t) # structure of a dataset
class(t) # type of object
Click on the name of the file in the top-right window to see its content
Referencing data
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# qualify with tablename to reference fields
mean(t$temperature)
max(t$year)
range(t$month)
# create a new column with the temperature in Celsius
t$Ctemp = (t$temperature-32)*5/9
Reference a column as datasetName$columnName
Packages • R’s base set of packages can be extended by installing additional packages • Over 4,000 packages • Search the R Project site to identify packages and functions • Install using RStudio • Packages must be installed prior to use and their use specified in a script • require(packagename)
Packages
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
require(weathermetrics) # previously installed
# compute Celsius
t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)
Exercise Install the weathermetrics package and run the preceding code
Reshaping • Converting data from one format to another • Melt • Wide to narrow • Cast • Narrow to wide
Reshaping
require(reshape)
s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
colnames(s) <- c('year', 1:12)
m <- melt(s,id='year')
colnames(m) <- c('year','month','co2')
c <- cast(m,year~month, value='co2')
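For readers without the reshape package, the same wide-to-narrow idea can be sketched with base R's reshape() function. This is a swapped-in alternative, not the melt/cast approach from the slide, and the two-month table holds made-up values.

```r
# A small wide table: one row per year, one column per month (made-up values)
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(369.1, 370.2),
                   m2 = c(369.5, 370.9))

# Wide to narrow: one row per year-month combination
narrow <- reshape(wide, direction = "long",
                  varying = c("m1", "m2"), v.names = "co2",
                  timevar = "month", times = 1:2, idvar = "year")
narrow

# Narrow back to wide: one co2 column per month
wide2 <- reshape(narrow, direction = "wide",
                 idvar = "year", timevar = "month", v.names = "co2")
wide2
```

The narrow form has one observation per row, which is the layout most plotting and aggregation functions expect.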
Writing files
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# compute Celsius and round to one decimal place
t$Ctemp = round((t$temperature-32)*5/9,1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write.table(t,"centralparktempsCF.txt")
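A self-contained sketch of writing a data frame as comma-delimited text and reading it back. The temperatures are made-up values, and the sep and row.names settings are one reasonable choice rather than the slide's defaults.

```r
# Made-up Fahrenheit readings, with Celsius rounded to one decimal place
temps <- data.frame(year = c(1999, 1999), month = c(1, 2), Ftemp = c(33.1, 38.0))
temps$Ctemp <- round((temps$Ftemp - 32) * 5 / 9, 1)

out <- tempfile(fileext = ".csv")
# sep = ","          writes comma-delimited text
# row.names = FALSE  omits the column of row numbers
write.table(temps, out, sep = ",", row.names = FALSE)

# Read the file back to check that it round-trips
back <- read.table(out, header = TRUE, sep = ",")
back
```

Without row.names = FALSE, write.table adds a column of row names, which can shift the header when the file is read by other tools.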
Subset • trow <- t[t$year == 1999,] # selecting rows • tcol <- t[,c(1:2,4)] # selecting columns • trowcol <- t[(t$year > 1989 & t$year < 2000), c(1:2,4)] # selecting rows and columns
Sort • Sorting on column name • s <- t[order(-t$year, t$month),] • Sorting on column number • s <- t[order(-t[,1], t[,2]),] • Both sort on column 1 descending and column 2 ascending
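The subsetting and sorting expressions above can be tried on a small stand-in data frame. The values are made up, and the column names are assumed to match the Central Park file.

```r
temps <- data.frame(year = c(1998, 1999, 1999, 2000),
                    month = c(12, 2, 1, 1),
                    temperature = c(40.1, 38.0, 33.1, 31.3))

# Rows for 1999 only, keeping the year and temperature columns
y1999 <- temps[temps$year == 1999, c("year", "temperature")]

# Year descending, then month ascending within each year
sorted <- temps[order(-temps$year, temps$month), ]
sorted
```

Selecting columns by name instead of by number keeps the code working if the column order in the file changes.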
Recoding • m$Cut <- 'Other' • m$Cut[m$Temperature >= 90] <- 'Hot' • Some analyses might be facilitated by the recoding of data • Split a continuous measure into two categories
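A sketch of the same recoding on a plain vector, followed by a three-category variant using cut(). The 60-degree break point and the Cold and Mild labels are hypothetical additions for illustration.

```r
temperature <- c(45, 72, 90, 101)

# Two categories, as on the slide
cut2 <- ifelse(temperature >= 90, "Hot", "Other")

# Three categories with cut(); right = FALSE makes each interval [lower, upper)
cut3 <- cut(temperature, breaks = c(-Inf, 60, 90, Inf),
            labels = c("Cold", "Mild", "Hot"), right = FALSE)
data.frame(temperature, cut2, cut3)
```

cut() returns a factor, so the categories are ready for tabulation or grouped plots.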
Exercise • Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Export a CSV file that contains three columns: year, month, and average CO2 • Read the file into R • Recode missing values (-99.99) to NA • Plot year versus CO2
Aggregate data • # average temperature for each year • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) = c('year', 'mean') • Summarize data using a specified function • Compute the mean monthly temperature for each year
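An equivalent sketch using aggregate()'s formula interface, which some readers find easier to scan. The four temperature readings are made up.

```r
temps <- data.frame(year = c(1999, 1999, 2000, 2000),
                    temperature = c(30, 40, 32, 44))

# Formula interface: summarize temperature within each year
a <- aggregate(temperature ~ year, data = temps, FUN = mean)
colnames(a) <- c("year", "mean")
a
```

Any function that reduces a vector to one value, such as max or sd, can be supplied as FUN.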
Merging files • t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=',') • # average monthly temp for each year • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) = c('year', 'meanTemp') • # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html) • carbon <- read.csv("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep=',',header=T) • m <- merge(carbon,a,by='year') • There must be a common column in both files
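A small sketch of what merge() does with rows that lack a match. Both data frames contain made-up values, and the CO2 column name is assumed from its later use in the correlation slide.

```r
a <- data.frame(year = c(2009, 2010, 2011), meanTemp = c(54.0, 56.7, 56.4))
carbon <- data.frame(year = c(2010, 2011, 2012), CO2 = c(389.9, 391.6, 393.8))

# Default: keep only the years present in both data frames
inner <- merge(carbon, a, by = "year")

# all.x = TRUE: keep every carbon row, with NA where no temperature matches
left <- merge(carbon, a, by = "year", all.x = TRUE)
left
```

Checking the row count after a merge is a quick way to spot years that were silently dropped.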
Concatenating files • Taking a set of files with the same structure and creating a single file • Same type of data in corresponding columns • Files should be in the same directory
Concatenating files
# read the file names from a local directory
# pattern is a regular expression: names ending in .csv
filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
# append the files one after another
for (i in seq_along(filenames)) {
  # Create the concatenated data frame using the first file
  if (i == 1) {
    cp <- read.table(filenames[i], header=F, sep=',')
  } else {
    temp <- read.table(filenames[i], header=F, sep=',')
    cp <- rbind(cp, temp) # append to the existing data frame
    rm(temp) # remove the temporary data frame
  }
}
colnames(cp) <- c('time','watts')
Local directory
Concatenating files
# read the file names from a remote directory (FTP)
require(RCurl)
url <- "ftp://watson_ftp:bulldawg1989@richardtwatson.com/power/"
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir,"\n")) # split into filenames
# append the files one after another
for (i in seq_along(filenames)) {
  # the listing holds bare file names, so prefix each with the directory URL
  file <- paste0(url, filenames[i])
  # Create the concatenated data frame using the first file
  if (i == 1) {
    cp <- read.table(file, header=F, sep=',')
  } else {
    temp <- read.table(file, header=F, sep=',')
    cp <- rbind(cp, temp) # append to the existing data frame
    rm(temp) # remove the temporary data frame
  }
}
colnames(cp) <- c('time','kwh')
Remote directory with FTP
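A more compact way to concatenate files, sketched with lapply() and do.call(rbind, ...) instead of an explicit loop. The folder name and file contents are invented so that the example runs on its own in a temporary directory.

```r
# Create two small files with the same structure (made-up readings)
folder <- file.path(tempdir(), "power-example")
dir.create(folder, showWarnings = FALSE)
write.table(data.frame(1:2, c(100, 110)), file.path(folder, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c(120, 130)), file.path(folder, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# The pattern is a regular expression: file names ending in .csv
filenames <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list of data frames, then stack them in one step
frames <- lapply(filenames, read.table, header = FALSE, sep = ",")
cp <- do.call(rbind, frames)
colnames(cp) <- c("time", "watts")
cp
```

Stacking once at the end avoids copying the growing data frame on every iteration, which can matter when there are many files.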
Correlation coefficient
cor.test(m$meanTemp,m$CO2)

Pearson's product-moment correlation
data: m$meanTemp and m$CO2
t = 3.1173, df = 51, p-value = 0.002997
95 percent confidence interval:
0.1454994 0.6049393
sample estimates:
cor
0.4000598

Significant
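The pieces of the cor.test() report can also be pulled out individually, which helps when the result feeds into further code. The two vectors are synthetic values chosen only so the sketch runs; they are not the temperature and CO2 data.

```r
# Synthetic values, not real measurements
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

result <- cor.test(x, y)   # Pearson's product-moment correlation by default
result$estimate            # the correlation coefficient
result$p.value             # compare with the chosen significance level, e.g. 0.05
result$conf.int            # 95 percent confidence interval
```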
Database access • MySQL access • You need the appropriate Java ARchive (JAR) file for MySQL access installed on your computer • http://dev.mysql.com/downloads/connector/j/
Database access • Decompress and move mysql-connector-java-5.1.xx*-bin.jar • OS X • Macintosh HD/Library/Java/Extensions • Windows • c:\jre\lib\ext • * xx is the release number
Database access
require(RJDBC)
# Load the driver; change the path to point to your jar file (change path for Windows)
drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.26-bin.jar")
# connect to the database
# change user and pwd for your values
conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
# Query the database and create data frame t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)
MySQL access
Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5 pm in August • Determine the maximum temperature for each day in August for each year
Resources • R books • Reference card • Quick-R
Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn