
Introduction to R


asher

Presentation Transcript


  1. Introduction to R • "Statistics are no substitute for judgment" (Henry Clay, U.S. congressman and senator)

  2. R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI

  3. RStudio • Datasets • Scripts • Results • Files, plots, packages, & help

  4. Creating a project • Project > Create Project… • Store all R scripts and data in the same folder or directory by creating a project

  5. Script • A script is a set of R commands • A program • c is short for combine in c(369.40, …)
     # CO2 parts per million for 2000-2009
     co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
     year <- (2000:2009) # a range of values
     # show values
     co2
     year
     # compute mean and standard deviation
     mean(co2)
     sd(co2)
     plot(year,co2)

  6. Exercise • Plot kWh per square foot by year for the following University of Georgia data. • Smart editing • Copy each column to a word processor • Convert table to text • Search and replace commas with null • Search and replace returns with commas • Edit to put R text around numbers

  7. Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types

  8. Data structures • Vector • A single row table where data are all of the same type • Matrix • A table where all data are of the same type
     co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
     year <- (2000:2009)
     m <- matrix(1:12, nrow=4,ncol=3)
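The difference between a vector and a matrix can be checked interactively. The following is a minimal sketch; the names `mixed` and `m_by_row` are illustrative and not part of the original slides.

```r
# A vector holds values of a single type
co2 <- c(369.40, 371.07, 373.17)
length(co2)   # number of elements
class(co2)    # type of the elements

# Mixing types coerces every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed)

# matrix() fills column by column unless byrow = TRUE is given
m <- matrix(1:12, nrow = 4, ncol = 3)
dim(m)        # number of rows, number of columns
m[2, 3]       # element in row 2, column 3

m_by_row <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
m_by_row[2, 3]
```

Comparing `m[2, 3]` with `m_by_row[2, 3]` shows how the fill order changes where each value lands.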

  9. Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

  10. Data structures • Array • Extends a matrix beyond two dimensions • Data frame • Same as a relational table • Columns can have different data types • Typically, read a file to create a data frame
      a <- array(1:24, c(4,3,2))
      gender <- c("m","f","f")
      age <- c(5,8,3)
      df <- data.frame(gender,age)
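As a small sketch of how these two structures are indexed and inspected (the calls shown are base R; nothing here goes beyond the objects defined on the slide):

```r
# A 4 x 3 x 2 array: two "layers", each a 4 x 3 matrix
a <- array(1:24, c(4, 3, 2))
dim(a)
a[1, 2, 2]    # row 1, column 2 of the second layer

# A data frame whose columns have different types
gender <- c("m", "f", "f")
age <- c(5, 8, 3)
df <- data.frame(gender, age)
str(df)       # one line per column, with its type
nrow(df)      # number of observations
df$age        # a single column, returned as a vector
```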

  11. Data structures • List • An ordered collection of objects • Can store a variety of objects under one name
      l <- list(co2,m,df)
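Retrieving objects from a list can be sketched as follows. The shortened `co2` vector, the small matrix, and the named list `named` are illustrative assumptions, not part of the original slides.

```r
co2 <- c(369.40, 371.07)
m <- matrix(1:4, nrow = 2)
l <- list(co2, m)

l[[1]]          # double brackets return the stored object itself
class(l[1])     # single brackets return a list holding that object

# Naming the elements allows access with $
named <- list(co2 = co2, m = m)
named$m[2, 1]   # row 2, column 1 of the stored matrix
```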

  12. Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …

  13. Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio

  14. Factors • Nominal and ordinal data are factors • Factors determine how data are analyzed and presented
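A minimal sketch of nominal and ordinal factors in base R follows. The `size` values are made up for illustration.

```r
# Nominal data: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)   # levels are sorted alphabetically by default
table(gender)    # counts per category

# Ordinal data: state the order of the levels explicitly
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"   # comparisons respect the declared order
max(size)        # the largest level present
```

Declaring the order matters because an unordered factor does not support comparisons such as `<` or `max()`.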

  15. Missing values • Missing values are indicated by NA (not available) • Arithmetic expressions and functions containing missing values generate missing values • Use the na.rm=T option to exclude missing values from calculations
      sum(c(1,NA,2))
      sum(c(1,NA,2),na.rm=T)

  16. Missing values • You remove rows with missing values by using na.omit()
      gender <- c("m","f","f","f")
      age <- c(5,8,3,NA)
      df <- data.frame(gender,age)
      df2 <- na.omit(df)

  17. Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS

  18. Reading a text file • Delimited text file, such as CSV • Creates a data frame • Specify as required • Presence of header • Separator • Row names
      t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep="\t")
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
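The header and separator options can be tried without any external file. This sketch writes a tiny comma-delimited file to a temporary location and reads it back; the file contents and the name `temps` are made up for illustration and are not the Central Park data.

```r
# Write a small comma-delimited file to a temporary path (illustrative values)
path <- tempfile(fileext = ".txt")
writeLines(c("year,month,temperature",
             "2000,1,30.0",
             "2000,2,35.5"), path)

# header = TRUE takes column names from the first line; sep gives the delimiter
temps <- read.table(path, header = TRUE, sep = ",")
str(temps)

# read.csv() is read.table() with defaults suited to CSV files
temps2 <- read.csv(path)
```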

  19. Learning about an object • Click on the name of the file in the top-right window to see its content
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
      head(t) # first few rows
      tail(t) # last few rows
      dim(t) # dimension
      str(t) # structure of a dataset
      class(t) # type of object

  20. Referencing data • Reference a column as datasetName$columnName
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
      # qualify with the dataset name to reference columns
      mean(t$temperature)
      max(t$year)
      range(t$month)
      # create a new column with the temperature in Celsius
      t$Ctemp = (t$temperature-32)*5/9
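The same column references can be tried on a small in-memory data frame. The values below are made up so that the Celsius conversion is easy to check by hand; `temps` is an illustrative name.

```r
# Illustrative data, not the Central Park file
temps <- data.frame(year = c(2000, 2000, 2001),
                    month = c(1, 2, 1),
                    temperature = c(32, 50, 41))

mean(temps$temperature)    # reference a column with $
temps[["temperature"]]     # equivalent reference by name

# Assigning to a column name that does not exist yet creates the column
temps$Ctemp <- (temps$temperature - 32) * 5 / 9
temps
```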

  21. Packages • R’s base set of packages can be extended by installing additional packages • Over 4,000 packages • Search the R Project site to identify packages and functions • Install using RStudio • Packages must be installed prior to use and their use specified in a script • require(packagename)

  22. Packages
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
      require(weathermetrics) # previously installed
      # compute Celsius
      t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)

  23. Exercise Install the weathermetrics package and run the preceding code

  24. Reshaping • Melt • Cast • Converting data from one format to another • Wide to narrow

  25. Reshaping
      require(reshape)
      s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
      colnames(s) <- c('year', 1:12)
      m <- melt(s,id='year')
      colnames(m) <- c('year','month','co2')
      c <- cast(m,year~month, value='co2')
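To see what melting and casting do to the shape of a table, the idea can be sketched in base R without the reshape package. This is a substitute technique for illustration, not the package's implementation, and the two-month table is made up.

```r
# Wide format: one row per year, one column per month (illustrative values)
wide <- data.frame(year = c(2000, 2001),
                   `1` = c(10, 30),
                   `2` = c(20, 40),
                   check.names = FALSE)

# "Melt": one row per year-month combination
months <- setdiff(names(wide), "year")
long <- data.frame(
  year  = rep(wide$year, times = length(months)),
  month = rep(as.integer(months), each = nrow(wide)),
  co2   = unlist(wide[months], use.names = FALSE)
)
long

# "Cast": back to a year-by-month table
# (xtabs sums values, so each year-month pair must occur only once)
back <- xtabs(co2 ~ year + month, data = long)
back
```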

  26. Writing files
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
      # compute Celsius and round to one decimal place
      t$Ctemp = round((t$temperature-32)*5/9,1)
      colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
      write.table(t,"centralparktempsCF.txt")

  27. Subset
      # selecting rows
      trow <- t[t$year== 1999,]
      # selecting columns
      tcol <- t[,c(1:2,4)]
      # selecting rows and columns
      trowcol <- t[(t$year > 1989 & t$year < 2000) ,c(1:2,4)]
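The three kinds of selection can be checked on a small data frame. The values and the name `temps` are made up for illustration; `subset()` is shown as a base R alternative to bracket indexing.

```r
# Illustrative data with the same column layout as the slides
temps <- data.frame(year = c(1985, 1995, 1999, 2005),
                    month = c(6, 6, 6, 6),
                    temperature = c(68, 77, 86, 95))
temps$Ctemp <- (temps$temperature - 32) * 5 / 9

# Rows: a logical condition before the comma
trow <- temps[temps$year == 1999, ]

# Columns: positions (or names) after the comma
tcol <- temps[, c(1:2, 4)]

# Rows and columns together; names are less fragile than positions
trowcol <- temps[temps$year > 1989 & temps$year < 2000,
                 c("year", "month", "Ctemp")]

# subset() expresses the same selection
same <- subset(temps, year > 1989 & year < 2000,
               select = c(year, month, Ctemp))
```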

  28. Sort • Sort on column 1 descending and column 2 ascending
      # sorting on column name
      s <- t[order(-t$year, t$month),]
      # sorting on column position
      s <- t[order(-t[,1], t[,2]),]
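A short sketch with made-up rows shows how `order()` produces the row positions used for sorting (`temps` is an illustrative name):

```r
temps <- data.frame(year = c(2000, 2001, 2000, 2001),
                    month = c(2, 1, 1, 2))

# order() returns row positions; the minus sign reverses a numeric key
idx <- order(-temps$year, temps$month)
idx

# Indexing the rows by those positions gives the sorted data frame
s <- temps[idx, ]
s
```

The minus sign only works for numeric columns; for other types, `order()` accepts a `decreasing` argument.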

  29. Recoding • Some analyses might be facilitated by the recoding of data • Split a continuous measure into two categories
      m$Cut <- 'Other'
      m$Cut[m$Temperature >= 90] <- 'Hot'
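The two-step recode can be tried on made-up temperatures. The `cut()` call is an alternative way to split a continuous measure, shown here as an assumption about what might be convenient rather than the slides' method.

```r
# Illustrative temperatures
m <- data.frame(Temperature = c(75, 90, 95, 60))

# Two-step recode: set a default, then overwrite the rows that qualify
m$Cut <- "Other"
m$Cut[m$Temperature >= 90] <- "Hot"

# cut() assigns each value to an interval; right = FALSE makes
# the intervals [-Inf, 90) and [90, Inf), so 90 counts as Hot
m$Cut2 <- cut(m$Temperature,
              breaks = c(-Inf, 90, Inf),
              labels = c("Other", "Hot"),
              right = FALSE)
m
```

`cut()` returns a factor, which extends naturally to more than two categories by adding break points and labels.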

  30. Exercise • Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Export a CSV file that contains three columns: year, month, and average CO2 • Read the file into R • Recode missing values (-99.99) to NA • Plot year versus CO2

  31. Aggregate data • Summarize data using a specified function • Compute the mean monthly temperature for each year
      # average temperature for each year
      a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
      # name columns
      colnames(a) = c('year', 'mean')
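On a small made-up data frame the grouping is easy to verify by hand. The formula form of `aggregate()` is an equivalent base R interface that keeps the column names; `temps` and `a2` are illustrative names.

```r
# Illustrative data: two months for each of two years
temps <- data.frame(year = c(2000, 2000, 2001, 2001),
                    temperature = c(30, 50, 40, 60))

# Group by year and apply mean() to each group
a <- aggregate(temps$temperature, by = list(temps$year), FUN = mean)
colnames(a) <- c("year", "mean")
a

# Formula interface: summarize temperature for each year
a2 <- aggregate(temperature ~ year, data = temps, FUN = mean)
a2
```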

  32. Merging files • There must be a common column in both files
      t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=',')
      # average monthly temp for each year
      a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
      # name columns
      colnames(a) = c('year', 'meanTemp')
      # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
      carbon <- read.csv("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep=',',header=T)
      m <- merge(carbon,a,by='year')
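How `merge()` treats years that appear in only one table can be seen with two small made-up data frames:

```r
# Illustrative data: the tables overlap only for 2001 and 2002
a <- data.frame(year = c(2000, 2001, 2002), meanTemp = c(55, 56, 57))
carbon <- data.frame(year = c(2001, 2002, 2003), CO2 = c(371, 373, 375))

# By default only rows with a matching year in both tables are kept
m <- merge(carbon, a, by = "year")
m

# all = TRUE keeps unmatched rows and fills the gaps with NA
m_all <- merge(carbon, a, by = "year", all = TRUE)
m_all
```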

  33. Concatenating files • Taking a set of files with the same structure and creating a single file • Same type of data in corresponding columns • Files should be in the same directory

  34. Concatenating files • Local directory
      # read the file names from a local directory
      # (pattern is a regular expression, not a shell wildcard)
      filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE)
      # append the files one after another
      for (i in seq_along(filenames)) {
        # Create the concatenated data frame using the first file
        if (i == 1) {
          cp <- read.table(filenames[i], header=F, sep=',')
        } else {
          temp <- read.table(filenames[i], header=F, sep=',')
          cp <- rbind(cp, temp) # append to existing data frame
          rm(temp) # remove the temporary data frame
        }
      }
      colnames(cp) <- c('time','watts')
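The same concatenation can be written without an explicit loop. This sketch creates two tiny files in a temporary directory so that it runs anywhere; the directory, file names, and values are made up for illustration.

```r
# Create a temporary directory holding two small files (illustrative values)
dir <- tempfile("power-demo")
dir.create(dir)
writeLines(c("1,100", "2,110"), file.path(dir, "a.csv"))
writeLines("3,120", file.path(dir, "b.csv"))

filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list of data frames, then stack them row-wise
parts <- lapply(filenames, read.table, header = FALSE, sep = ",")
cp <- do.call(rbind, parts)
colnames(cp) <- c("time", "watts")
cp
```

Reading everything first and binding once avoids growing the data frame inside a loop, which tends to matter as the number of files increases.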

  35. Concatenating files • Remote directory with FTP
      # read the file names from a remote directory (FTP)
      require(RCurl)
      url <- "ftp://watson_ftp:bulldawg1989@richardtwatson.com/power/"
      dir <- getURL(url, dirlistonly = T)
      filenames <- unlist(strsplit(dir,"\n")) # split into filenames
      # append the files one after another
      for (i in seq_along(filenames)) {
        # prefix each name with the directory URL so the remote file is read
        file <- paste0(url, filenames[i])
        # Create the concatenated data frame using the first file
        if (i == 1) {
          cp <- read.table(file, header=F, sep=',')
        } else {
          temp <- read.table(file, header=F, sep=',')
          cp <- rbind(cp, temp) # append to existing data frame
          rm(temp) # remove the temporary data frame
        }
      }
      colnames(cp) <- c('time','kwh')

  36. Correlation coefficient • Significant
      cor.test(m$meanTemp,m$CO2)
      Pearson's product-moment correlation
      data: m$meanTemp and m$CO2
      t = 3.1173, df = 51, p-value = 0.002997
      95 percent confidence interval:
      0.1454994 0.6049393
      sample estimates:
      cor
      0.4000598
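The printed report is also available as an object whose parts can be used in later code. This sketch uses the `cars` dataset that ships with R rather than the temperature and CO2 data, and the 0.05 threshold is a conventional assumption.

```r
# cars is a built-in dataset with columns speed and dist
result <- cor.test(cars$speed, cars$dist)

result$estimate    # the correlation coefficient
result$parameter   # degrees of freedom (number of pairs minus 2)
result$p.value     # p-value of the test
result$conf.int    # 95 percent confidence interval

# A conventional significance check at the 5 percent level
significant <- result$p.value < 0.05
significant
```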

  37. Database access • MySQL access • You need the appropriate Java ARchive (JAR) file for MySQL access installed on your computer • http://dev.mysql.com/downloads/connector/j/

  38. Database access • Decompress and move mysql-connector-java-5.1.xx-bin.jar (xx is the release number) • OS X • Macintosh HD/Library/Java/Extensions • Windows • c:\jre\lib\ext

  39. Database access • MySQL access • Change path for Windows
      require(RJDBC)
      # Load the driver: change path to point to your jar file
      drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.26-bin.jar")
      # connect to the database
      # change user and pwd for your values
      conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
      # Query the database and create file t for use with R
      t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
      head(t)

  40. Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5 pm in August • Determine the maximum temperature for each day in August for each year

  41. Resources • R books • Reference card • Quick-R

  42. Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn
