Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator
R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI
RStudio • Datasets • Scripts • Results • Files, plots, packages, & help
Creating a project • Project > Create Project… • Store all R scripts and data in the same folder or directory by creating a project
Script
# CO2 parts per million for 2000-2009
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009) # a range of values
# show values
co2
year
# compute mean and standard deviation
mean(co2)
sd(co2)
plot(year,co2)
• A script is a set of R commands
• A program
• c is short for combine in c(369.40, …)
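As a rough illustration of what mean() and sd() report, the sketch below recomputes both by hand for the co2 vector from the script. The names n, m, and s are introduced here for illustration only.

```r
# CO2 values copied from the script slide
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)

n <- length(co2)                        # number of observations
m <- sum(co2) / n                       # arithmetic mean
s <- sqrt(sum((co2 - m)^2) / (n - 1))   # sample standard deviation (divides by n - 1)

all.equal(m, mean(co2))                 # TRUE
all.equal(s, sd(co2))                   # TRUE
```

Note that sd() uses the sample formula (n - 1 in the denominator), not the population formula.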
Exercise
• Compute some statistics for the data in the following table
• Smart editing
• Copy each column to a word processor
• Convert table to text
• Search and replace commas with null
• Search and replace returns with commas
• Edit to put R text around numbers
Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types
Data structures
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009)
m <- matrix(1:12, nrow=4,ncol=3)
• Vector: a single-row table where all data are of the same type
• Matrix: a table where all data are of the same type
Data structures
a <- array(1:24, c(4,3,2))
gender <- c("m","f","f")
age <- c(5,8,3)
df <- data.frame(gender,age)
• Array: extends a matrix beyond two dimensions
• Data frame: same as a relational table
• Columns can have different data types
• Typically, read a file to create a data frame
Data structures
l <- list(co2,m,df)
• List: an ordered collection of objects
• Can store a variety of objects under one name
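A minimal sketch of how elements of each structure on the three slides above can be reached. The co2 vector is shortened here for brevity; everything else follows the slide code.

```r
co2 <- c(369.40, 371.07, 373.17)          # vector (shortened for the example)
m <- matrix(1:12, nrow = 4, ncol = 3)     # matrix, filled column by column
a <- array(1:24, c(4, 3, 2))              # 4 x 3 x 2 array
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))
l <- list(co2, m, df)                     # list holding all three kinds of object

co2[2]        # second element of the vector
m[2, 3]       # row 2, column 3 of the matrix
a[1, 2, 2]    # row 1, column 2, layer 2 of the array
df$age        # the age column of the data frame
l[[3]]$age    # the same column, reached through the list
```

Single brackets select by position, $ selects a data frame column by name, and double brackets pull one object out of a list.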
Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …
Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio
Factors • Nominal and ordinal data are stored as factors • Factors determine how data are analyzed and presented
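A short sketch of the two kinds of factor. The size ratings are hypothetical values made up for this example.

```r
# Nominal data: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)            # levels are sorted alphabetically by default

# Ordinal data: categories with a meaningful order (hypothetical size ratings)
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size < "large"            # comparisons are allowed for ordered factors
table(size)               # counts are reported in level order
```

Declaring the level order explicitly matters because R would otherwise sort the labels alphabetically, which rarely matches the intended ranking.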
Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS
Reading a text file
t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep="\t")
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep="\t")
• Delimited text file, such as CSV
• Creates a data frame
• Specify as required: presence of header, separator, row names
Learning about an object
• Click on the name of the file in the top-right window to see its content
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimensions
str(t) # structure of a dataset
class(t) # type of object
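The read-and-inspect steps can be tried without a network connection. The sketch below writes a tiny tab-delimited file to a temporary location and reads it back. The three column names mirror those used on later slides, but the values are made up for illustration.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
f <- tempfile(fileext = ".txt")
writeLines(c("year\tmonth\ttemperature",
             "1999\t1\t33.1",
             "1999\t2\t38.6"), f)

# header = TRUE: the first line holds column names; sep = "\t": tab-delimited
t <- read.table(f, header = TRUE, sep = "\t")

class(t)    # a data frame
dim(t)      # 2 rows, 3 columns
str(t)      # column names and types
head(t)     # first few rows
```

If the separator does not match the file, read.table typically returns a single wide column, so dim() is a quick check that the file was parsed as intended.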
Packages
• R's base set of packages can be extended by installing additional packages
• Over 4,000 packages
• Search the R Project site to identify packages and functions
• Install using RStudio
• Packages must be installed prior to use, and their use specified in a script
require(packagename)
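One possible way to handle installation from within a script, rather than through the RStudio menus, is sketched below. The package name "stats" is used only because it ships with R and so runs anywhere; substitute the package you need.

```r
# "stats" ships with R, so this runs anywhere; substitute the package you need
pkg <- "stats"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # one-time download from CRAN
}

# Load the package for the current session
library(pkg, character.only = TRUE)
```

library() stops with an error when a package is missing, whereas require() only returns FALSE with a warning, so library() tends to surface problems earlier in a script.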
Packages
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
require(weathermetrics) # previously installed
# compute Celsius
t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)
Reshaping Melt Cast • Converting data from one format to another • Wide to narrow
Reshaping
require(reshape)
s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
colnames(s) <- c('year', 1:12)
m <- melt(s,id='year')
colnames(m) <- c('year','month','co2')
c <- cast(m,year~month, value='co2')
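For readers without the reshape package installed, the same wide-to-narrow idea can be sketched with base R's reshape() function. This is a substitute technique, not the melt()/cast() calls used on the slide. The table wide and its columns m1 to m3 are hypothetical, with made-up values.

```r
# Hypothetical wide table: one row per year, one column per month (made-up values)
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(369.1, 370.2),
                   m2 = c(369.5, 370.9),
                   m3 = c(370.4, 371.6))

# Wide to narrow: one row per year-month pair (the role melt() plays on the slide)
long <- reshape(wide, direction = "long",
                varying = c("m1", "m2", "m3"),   # columns to stack
                v.names = "co2",                 # name of the stacked value column
                timevar = "month", times = 1:3,  # records which column each value came from
                idvar = "year")
long

# Narrow back to wide (the role cast() plays on the slide)
wide2 <- reshape(long, direction = "wide",
                 idvar = "year", timevar = "month", v.names = "co2")
wide2
```

The narrow form has one observation per row, which is the layout most aggregation and plotting functions expect.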
Exercise • Create a CSV file of the annual mean CO2 concentrations (PPM) measured at the Mauna Loa Observatory for 1969 through 2011 • http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Read the file into R and compute some basic statistics
Referencing fields
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# qualify with the table name to reference fields
mean(t$temperature)
max(t$year)
range(t$month)
# create a new column with the temperature in Celsius
t$Ctemp = (t$temperature-32)*5/9
Writing files
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# compute Celsius and round to one decimal place
t$Ctemp = round((t$temperature-32)*5/9,1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write.table(t,"centralparktempsCF.txt")
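A self-contained sketch of the same compute, rename, and write sequence, using a small made-up data frame and a temporary file so nothing is downloaded or overwritten. The sep, row.names, and quote arguments are choices made for this example; the slide uses the write.table defaults.

```r
# Small stand-in for the Central Park data (made-up values)
t <- data.frame(year = c(1999, 1999), month = c(1, 2),
                temperature = c(33.1, 38.6))

# Compute Celsius rounded to one decimal place, then rename the third column
t$Ctemp <- round((t$temperature - 32) * 5 / 9, 1)
colnames(t)[3] <- 'Ftemp'

# Write a tab-delimited file without row names or quotes, then read it back
f <- tempfile(fileext = ".txt")
write.table(t, f, sep = "\t", row.names = FALSE, quote = FALSE)
back <- read.table(f, header = TRUE, sep = "\t")
back
```

Reading the file back is a cheap way to confirm that the separator and header settings used for writing match those used for reading.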
Exercise • Install weathermetrics • Run the code on the prior slide
Subset
• Selecting rows
trow <- t[t$year== 1999,]
• Selecting columns
tcol <- t[,c(1:2,4)]
• Selecting rows and columns
trowcol <- t[(t$year > 1989 & t$year < 2000) ,c(1:2,4)]
Sort
• Sort on column 1 descending and column 2 ascending
s <- t[order(-t[,1], t[,2]),]
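The three subsetting patterns can be tried on a small stand-in table. The data frame below is made up for illustration and has the four columns (year, month, temperature, Ctemp) that the slide code assumes.

```r
# Made-up stand-in for the temperature table, with four columns
t <- data.frame(year = c(1989, 1995, 1999, 1999, 2001),
                month = c(6, 7, 1, 8, 8),
                temperature = c(71.2, 77.5, 33.1, 76.0, 78.4))
t$Ctemp <- round((t$temperature - 32) * 5 / 9, 1)

# Rows: the condition goes before the comma
trow <- t[t$year == 1999, ]

# Columns: positions 1, 2, and 4 go after the comma
tcol <- t[, c(1:2, 4)]

# Rows and columns together: years 1990 through 1999, columns 1, 2, and 4
trowcol <- t[(t$year > 1989 & t$year < 2000), c(1:2, 4)]
trowcol
```

Leaving one side of the comma empty means "keep everything" on that dimension.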
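A brief sketch of how order() produces the sort, using a made-up four-row table. order() returns row positions, and the minus sign reverses the direction for a numeric column.

```r
# Made-up table with repeated years so the second sort key matters
t <- data.frame(year = c(1999, 2000, 1999, 2000),
                month = c(2, 1, 1, 2),
                temperature = c(38.6, 31.3, 33.1, 37.3))

# Row positions: year descending, then month ascending within each year
idx <- order(-t[, 1], t[, 2])

# Apply the positions as a row index
s <- t[idx, ]
s
```

The negation trick works only for numeric columns; for character columns, order() also accepts a decreasing argument.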
Recoding
m$Cut <- 'Other'
m$Cut[m$Temperature >= 90] <- 'Hot'
• Some analyses might be facilitated by the recoding of data
• Split a continuous measure into two categories
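The two-step recode can be compared with an equivalent one-line ifelse() call. The data frame and the column name Cut2 are made up for this sketch.

```r
# Made-up temperatures, including one exactly at the cut point
m <- data.frame(Temperature = c(72, 95, 90, 88))

# Two-step recode, as on the slide: default first, then overwrite matching rows
m$Cut <- 'Other'
m$Cut[m$Temperature >= 90] <- 'Hot'

# Equivalent one-line form using ifelse() (Cut2 is a hypothetical column name)
m$Cut2 <- ifelse(m$Temperature >= 90, 'Hot', 'Other')

identical(m$Cut, m$Cut2)   # TRUE
```

For more than two categories, the cut() function is usually a better fit than chained assignments.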
Aggregate data
# average temperature for each year
a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
# name columns
colnames(a) = c('year', 'mean')
• Summarize data using a specified function
• Compute the mean monthly temperature for each year
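A small worked case shows what aggregate() returns. The table below is made up, with two months for each of two years so the means are easy to verify by hand.

```r
# Made-up table: two monthly readings per year
t <- data.frame(year = c(1999, 1999, 2000, 2000),
                month = c(1, 2, 1, 2),
                temperature = c(30, 40, 32, 36))

# One output row per distinct year, holding the mean of that year's temperatures
a <- aggregate(t$temperature, by = list(t$year), FUN = mean)
colnames(a) <- c('year', 'mean')
a   # 1999 -> 35, 2000 -> 34
```

Any function that reduces a vector to a single value (max, min, sd, length) can be passed as FUN in the same way.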
Merging files
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# average monthly temp for each year
a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
# name columns
colnames(a) = c('year', 'meanTemp')
# read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
carbon <- read.csv("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep='\t',header=T)
m <- merge(carbon,a,by='year')
• There must be a common column in both files
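The effect of the common column can be seen with two small made-up tables. The values below are illustrative only, and m_all is a hypothetical name for the variant that keeps unmatched rows.

```r
# Illustrative tables sharing a 'year' column (values are made up)
carbon <- data.frame(year = c(1999, 2000, 2001),
                     CO2 = c(368.3, 369.4, 371.1))
a <- data.frame(year = c(1999, 2000),
                meanTemp = c(35, 34))

# Default merge keeps only years present in both tables
m <- merge(carbon, a, by = 'year')
m

# all.x = TRUE keeps every row of the first table, filling gaps with NA
m_all <- merge(carbon, a, by = 'year', all.x = TRUE)
m_all
```

Rows without a match are silently dropped by the default merge, so comparing nrow() before and after is a useful check.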
Correlation coefficient
cor.test(m$meanTemp,m$CO2)
Pearson's product-moment correlation
data: m$meanTemp and m$CO2
t = 3.1173, df = 51, p-value = 0.002997
95 percent confidence interval: 0.1454994 0.6049393
sample estimates: cor 0.4000598
• Significant
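The t statistic and p-value in the output follow from the correlation and the degrees of freedom. The sketch below recomputes them from the two printed values using the standard relationship t = r * sqrt(df) / sqrt(1 - r^2); the variable names are introduced here for illustration.

```r
# Values copied from the cor.test output on the slide
r  <- 0.4000598   # sample correlation
df <- 51          # degrees of freedom (number of observations minus 2)

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p <- 2 * pt(-abs(t_stat), df)

round(t_stat, 4)  # should agree with the printed t, up to rounding
round(p, 6)       # should agree with the printed p-value, up to rounding
```

A correlation of about 0.40 is moderate in size; the small p-value reflects that 53 yearly observations are enough to distinguish it from zero.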
Database access
• MySQL access
require(RJDBC)
# Load the driver. Change the code to point to your jar file
drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.23-bin.jar")
# Connect to the database
# Change user and pwd for your values
conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
# Query the database and create data frame t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)
Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5pm in August • Determine the maximum temperature for each day in August for each year
Resources R books Reference card Quick-R
Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn