
Introduction to R


salome

Presentation Transcript


  1. Introduction to R
     "Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator

  2. R
     • R is a free software environment for statistical computing and graphics
     • Object-oriented
     • Runs on a wide variety of platforms
     • Highly extensible
     • Command line and GUI
     • There is a tension between extensibility and a GUI

  3. RStudio
     • Datasets
     • Scripts
     • Results
     • Files, plots, packages, & help

  4. Creating a project
     • Project > Create Project…
     • Creating a project stores all R scripts and data in the same folder or directory

  5. Script
     • A script is a set of R commands
     • A program
     • c is short for combine in c(369.40, …)
        # CO2 parts per million for 2000-2009
        co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
        year <- (2000:2009) # a range of values
        # show values
        co2
        year
        # compute mean and standard deviation
        mean(co2)
        sd(co2)
        plot(year,co2)

  6. Exercise
     • Compute some statistics for the data in the following table
     • Smart editing
       • Copy each column to a word processor
       • Convert table to text
       • Search for commas and replace them with nothing
       • Search for returns and replace them with commas
       • Edit to put R text around the numbers

  7. Datasets
     • A dataset is a table
     • One row for each observation
     • Columns contain observation values
     • Same as the relational model
     • R supports multiple data structures and multiple data types

  8. Data structures
     • Vector: a single-row table where data are all of the same type
     • Matrix: a table where all data are of the same type
        co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
        year <- (2000:2009)
        m <- matrix(1:12, nrow=4,ncol=3)
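
The single-type rule for vectors and matrices can be sketched as follows. The values are made up for illustration and are not from the slides; the point is that c() coerces mixed input to one type and that matrix() fills column by column.

```r
# Illustrative values only (not from the slides)
v <- c(1.5, 2, 3)        # numeric vector
mixed <- c(1.5, "a")     # the number is coerced to character
class(v)
class(mixed)

m <- matrix(1:12, nrow = 4, ncol = 3)  # filled column by column
dim(m)                   # number of rows, number of columns
m[2, 3]                  # element in row 2, column 3
```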

  9. Data structures
     • Array: extends a matrix beyond two dimensions
     • Data frame
       • Same as a relational table
       • Columns can have different data types
       • Typically, read a file to create a data frame
        a <- array(1:24, c(4,3,2))
        gender <- c("m","f","f")
        age <- c(5,8,3)
        df <- data.frame(gender,age)

  10. Data structures
     • List
       • An ordered collection of objects
       • Can store a variety of objects under one name
        l <- list(co2,m,df)
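
A minimal sketch of getting objects back out of a list. The element names co2 and people are assumptions added for illustration; the slide's list is unnamed, so there only positional access such as l[[1]] applies.

```r
co2 <- c(369.40, 371.07, 373.17)
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))

# Element names are hypothetical, chosen for this example
l <- list(co2 = co2, people = df)

l[[1]]          # first element, by position
l$people$age    # a column of the data frame stored in the list
length(l)       # number of objects in the list
```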

  11. Objects
     • Anything that can be assigned to a variable
       • Constant
       • Data structure
       • Function
       • Graph
       • …

  12. Types of data
     • Classification: nominal
     • Sorting or ranking: ordinal
     • Measurement: interval, ratio

  13. Factors
     • Nominal and ordinal data are factors
     • Factors determine how data are analyzed and presented
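
As a rough illustration of the two kinds of factor, the sketch below builds a nominal factor and an ordered one. The size variable and its levels are made up for this example.

```r
# Nominal: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)

# Ordinal: levels listed from lowest to highest (hypothetical data)
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
table(size)       # counts follow the level order, not alphabetical order
size < "high"     # comparisons are meaningful for ordered factors
```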

  14. Reading a file
     • R can read a wide variety of input formats
       • Text
       • Statistical package formats (e.g., SAS)
       • DBMS

  15. Reading a text file
     • Delimited text file, such as CSV
     • Creates a data frame
     • Specify as required
       • Presence of header
       • Separator
       • Row names
        # read from a local file
        t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=TRUE, sep="\t")
        # read from a URL
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep="\t")
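
To try the header and separator options without a file or a network connection, read.table() also accepts the data as a string through its text argument. The rows below are made up and only stand in for a tab-delimited file; the column names are assumptions.

```r
# Made-up rows standing in for a tab-delimited file
raw <- "year\tmonth\ttemperature\n1999\t1\t33.9\n1999\t2\t38.7\n"

temps <- read.table(text = raw, header = TRUE, sep = "\t")
str(temps)   # a data frame with one column per field
```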

  16. Learning about an object
     • Click on the name of the file in the top-right window to see its content
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep=",")
        head(t)  # first few rows
        tail(t)  # last few rows
        dim(t)   # dimension
        str(t)   # structure of a dataset
        class(t) # type of object

  17. Packages
     • R's base set of packages can be extended by installing additional packages
     • Over 4,000 packages
     • Search the R Project site to identify packages and functions
     • Install using RStudio
     • Packages must be installed prior to use, and their use must be specified in a script
        require(packagename)

  18. Packages
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep=",")
        require(weathermetrics) # previously installed
        # compute Celsius
        t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)

  19. Reshaping
     • Converting data from one format to another
     • Melt: wide to narrow
     • Cast: narrow to wide
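
A wide-to-narrow conversion can also be sketched with base R's reshape() function, which is a swapped-in alternative here and not the reshape package's melt(). The data frame and its column names (m1, m2) are made up for illustration.

```r
# Hypothetical wide table: one row per year, one column per month
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(369.1, 370.2),
                   m2 = c(369.5, 370.9))

# Narrow form: one row per year-month combination
long <- reshape(wide, direction = "long",
                varying = list(c("m1", "m2")), v.names = "co2",
                timevar = "month", times = 1:2, idvar = "year")

long[order(long$year, long$month), ]
```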

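  20. Reshaping
        require(reshape)
        s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
        # name the columns: year, then months 1 to 12
        colnames(s) <- c('year', 1:12)
        # melt: wide to narrow, one row per year and month
        m <- melt(s,id='year')
        colnames(m) <- c('year','month','co2')
        # cast: narrow back to wide (note that naming the result c masks the name of the c() function)
        c <- cast(m,year~month, value='co2')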

  21. Exercise
     • Create a CSV file of the annual mean CO2 concentrations (PPM) measured at the Mauna Loa Observatory for 1969 through 2011
       http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
     • Read the file into R and do some basic stats

  22. Referencing fields
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep=",")
        # qualify with tablename to reference fields
        mean(t$temperature)
        max(t$year)
        range(t$month)
        # create a new column with the temperature in Celsius
        t$Ctemp = (t$temperature-32)*5/9

  23. Writing files
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep=",")
        # compute Celsius and round to one decimal place
        t$Ctemp = round((t$temperature-32)*5/9,1)
        colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
        write.table(t,"centralparktempsCF.txt")
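
A self-contained sketch of the same write step, using a made-up data frame and a temporary file so that nothing is left in the working directory. The sep, row.names, and quote arguments are choices made for this example, not settings taken from the slide.

```r
# Made-up data standing in for the Central Park temperatures
temps <- data.frame(year = c(1999, 1999), month = c(1, 2),
                    Ftemp = c(33.9, 38.7))
temps$Ctemp <- round((temps$Ftemp - 32) * 5 / 9, 1)

# Write a tab-delimited file without row names, then read it back
out <- tempfile(fileext = ".txt")
write.table(temps, out, sep = "\t", row.names = FALSE, quote = FALSE)
back <- read.table(out, header = TRUE, sep = "\t")
back
```

Dropping row names keeps the header aligned with the columns, which makes the file easier to read back with header=TRUE.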

  24. Exercise
     • Install weathermetrics
     • Run the code on the prior slide

  25. Subset
        # selecting rows
        trow <- t[t$year == 1999,]
        # selecting columns
        tcol <- t[,c(1:2,4)]
        # selecting rows and columns
        trowcol <- t[(t$year > 1989 & t$year < 2000), c(1:2,4)]
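
The three kinds of selection can be tried on a small made-up data frame. The name temps and its values are assumptions for illustration; columns are picked by name in the last line, which is an alternative to the positional c(1:2,4) form.

```r
# Made-up data for illustration
temps <- data.frame(year = c(1989, 1995, 1999, 2003),
                    month = c(1, 6, 7, 12),
                    temperature = c(31.0, 72.5, 81.4, 37.6),
                    station = "made-up")

rows <- temps[temps$year == 1999, ]       # rows: condition before the comma
cols <- temps[, c(1:2, 4)]                # columns: positions after the comma
both <- temps[temps$year > 1989 & temps$year < 2000,
              c("year", "temperature")]   # rows and columns together
```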

  26. Sort
     • Sort on column 1 descending and column 2 ascending
        s <- t[order(-t[,1], t[,2]),]
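
A small sketch of how order() drives the sort, on made-up data: order() returns row positions, and a minus sign on a numeric column reverses its direction.

```r
# Made-up data for illustration
temps <- data.frame(year = c(1999, 2000, 1999, 2000),
                    month = c(2, 1, 1, 2))

# year descending, then month ascending within each year
s <- temps[order(-temps$year, temps$month), ]
s
```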

  27. Recoding
     • Some analyses might be facilitated by the recoding of data
     • Split a continuous measure into two categories
        m$Cut <- 'Other'
        m$Cut[m$Temperature >= 90] <- 'Hot'
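
The same two-way split can be written in one step with ifelse(). This is an alternative sketch on made-up temperatures, not the slide's own code.

```r
# Made-up temperatures for illustration
m <- data.frame(Temperature = c(72, 90, 95, 88))

# 'Hot' at 90 or above, 'Other' otherwise
m$Cut <- ifelse(m$Temperature >= 90, "Hot", "Other")
table(m$Cut)
```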

  28. Aggregate data
     • Summarize data using a specified function
     • Compute the mean monthly temperature for each year
        # average temperature for each year
        a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
        # name columns
        colnames(a) = c('year', 'mean')
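
aggregate() also has a formula interface that keeps the column names, so the renaming step is not needed. A minimal sketch on made-up data:

```r
# Made-up data for illustration
temps <- data.frame(year = c(1999, 1999, 2000, 2000),
                    temperature = c(30, 40, 50, 70))

# mean of temperature within each year
a <- aggregate(temperature ~ year, data = temps, FUN = mean)
a
```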

  29. Merging files
     • There must be a common column in both files
        t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=TRUE, sep=",")
        # average monthly temp for each year
        a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
        # name columns
        colnames(a) = c('year', 'meanTemp')
        # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
        carbon <- read.csv("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep='\t',header=TRUE)
        m <- merge(carbon,a,by='year')
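
What merge() does with years that appear in only one table can be seen on two small made-up data frames: by default unmatched rows are dropped, and all=TRUE keeps them with NA in the missing columns.

```r
# Made-up data for illustration
carbon <- data.frame(year = c(1999, 2000, 2001),
                     CO2 = c(368.3, 369.5, 371.1))
a <- data.frame(year = c(2000, 2001, 2002),
                meanTemp = c(55.1, 56.0, 56.4))

inner <- merge(carbon, a, by = "year")              # only years in both tables
outer <- merge(carbon, a, by = "year", all = TRUE)  # every year, NA where missing
```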

  30. Correlation coefficient
        cor.test(m$meanTemp,m$CO2)

        Pearson's product-moment correlation
        data:  m$meanTemp and m$CO2
        t = 3.1173, df = 51, p-value = 0.002997
        95 percent confidence interval:
         0.1454994 0.6049393
        sample estimates:
              cor
        0.4000598
     • Significant
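
The pieces of the printed cor.test() result can also be pulled out by name, which is handy in a script. A sketch on made-up values (x and y are not the slide's data):

```r
# Made-up, nearly linear data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r <- cor.test(x, y)   # Pearson correlation by default
r$estimate            # correlation coefficient
r$p.value             # significance of the test
r$conf.int            # 95 percent confidence interval
```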

  31. Database access
     • MySQL access
        require(RJDBC)
        # Load the driver -- change the code to point to your jar file
        drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.23-bin.jar")
        # connect to the database
        # change user and pwd for your values
        conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
        # Query the database and create data frame t for use with R
        t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
        head(t)

  32. Exercise
     • Using the Atlanta weather database and the lubridate package
       • Compute the average temperature at 5pm in August
       • Determine the maximum temperature for each day in August for each year

  33. Resources
     • R books
     • Reference card
     • Quick-R

  34. Key points
     • R is a platform for a wide variety of data analytics
       • Statistical analysis
       • Data visualization
       • HDFS and MapReduce
       • Text mining
       • Energy Informatics
     • R is a programming language
       • Much to learn
