A comprehensive introduction to the R programming language for statistical analysis. Learn about functions, working with data, importing/exporting data, graphs and statistics, and more. Access presentation materials and additional resources.
Ann Arbor ASA Up and Running Series: R
Sponsored by the Ann Arbor Chapter of the American Statistical Association and the Department of Statistics of the University of Michigan
Content
• Introduction
• R Help
• Functions
• Working with Data
• Importing/Exporting Data
• Graphs + Statistics
• Practice Problems
• Further Resources
Ann Arbor ASA Up and Running Series: R
http://community.amstat.org/annarbor/home
• Presentation Materials → Up and Running Series → R
• Select files: R - code, R - furniture_txt, R - furniture_xlsx, R - datafiles, R - workshop
• Save each file to the Desktop
Introduction - What is R?
• R is open source, with code available to users
• R is an object-oriented programming language
• R is based on the S language
• R is commonly used for statistical analysis
• R is a free software package
• R-project.org
Introduction - Installation
Introduction - More About R
• Statistical analysis is done using pre-defined functions in R.
• When you install R, the 'base' package gives you access to many functions.
• More advanced functions require downloading other packages.
Introduction – What can you do with R?
• Many topics in statistics are readily available
  • Linear modeling, linear mixed modeling, clustering, multivariate analysis, non-parametric methods, and classification, among others
• R produces high-quality graphics
  • Simple plots are easy
  • With more practice, users can produce publishable graphics!
Introduction - Launch R
• Start → All Programs → Math & Statistics → R
[Screenshot: R workspace]
Introduction – Editor Window
• Get the editor window: File → New script
• More convenient than typing in the workspace
[Screenshot: workspace and editor window]
Introduction - Data Objects in R
• Users create different data objects in R
• Data objects include variables, arrays of numbers, character strings, functions, and other more complicated structures
• The assignment operator <- assigns a value to a data object
• Type in your editor window: a <- 7
• Submit this command by highlighting it and pressing Ctrl+R
• Practice creating different data objects and submit them to the workspace
Introduction - Data Objects in R
• Type objects()
• This shows that you have created the object a during this R session
• You can view previously submitted commands by using the up/down arrow keys
• You can remove this object by typing rm(a)
• Try removing some objects you created and then type objects() to see if they are still listed
Examples
• Set up a vector named x:
  • x <- c(5,4,3,6)
  • This is an assignment statement
  • The function c() creates a vector by concatenating its arguments
• Perform vector/matrix arithmetic:
  • v <- 3*x - 5
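The commands from the last three slides can be run together as one short script. This is only a sketch of the workflow described above; it uses the object names a, x, and v from the slides.

```r
# Create a scalar and a vector with the assignment operator <-
a <- 7
x <- c(5, 4, 3, 6)   # c() concatenates its arguments into a vector

# Arithmetic is applied to every element of the vector
v <- 3 * x - 5
print(v)             # 10 7 4 13

# List the objects created in this session, then remove one of them
print(objects())
rm(a)
print(objects())     # a is no longer listed
```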
Questions?
R Help – CRAN: Search
• CRAN: Search covers the R archives (manuals, mailing lists, help files, etc.)
• When faced with a tough analysis question, see if another R user has addressed it before
R Help
• To get help on any specific function:
  • help(function.name)
  • ?function.name
• help() and ? only search packages that are currently loaded; to search all installed packages:
  • ??function.name
R Help
• To see a list of all of the functions that come with the base R package:
  • library(help = "base")
• Use straight quotes in R code. Curly quotes pasted from a word processor cause an error:
  • library(help = “base”)
  • Error: unexpected input in "library(help = “"
R Help
• Two popular R resource websites:
  • Rseek.org
  • nabble.com
R Help
• To browse the HTML help pages in a web browser, submit help.start()
Questions?
Functions - R Reference Card* (*created by Tom Short)
• There are thousands of functions available in R
• The Reference Card provides a strong working knowledge
• Look at the organization of the Reference Card
• Try out a few of the functions available!
Functions - Generating Sequences
• Sequences
  • seq(-5, 5, by=.2)
  • seq(length=51, from=-5, by=.2)
• Both produce a sequence of 51 values from -5 to 5 in steps of .2
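A quick way to confirm that the two calls agree is to store both results and compare them. In this sketch the names s1 and s2 are illustrative, and the argument is written out in full as length.out (the slide's length= is a partial match for it).

```r
# Two ways to build the same sequence from -5 to 5 in steps of 0.2
s1 <- seq(-5, 5, by = 0.2)
s2 <- seq(from = -5, by = 0.2, length.out = 51)

length(s1)           # number of values in the first sequence
length(s2)           # number of values in the second sequence
all.equal(s1, s2)    # compares the two sequences up to rounding error
range(s1)            # smallest and largest value
```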
Functions - Replicating Objects
• Replications
  • rep("x", times=5)
  • rep("x", each=5)
• Both produce "x" replicated 5 times; for a vector longer than one element, times repeats the whole vector and each repeats every element in turn
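The difference between times and each only shows up with a vector of more than one element. The two-element vector c("x", "y") below is an illustration added here and does not appear in the slides.

```r
# With a single value, times and each give the same result
r_times <- rep("x", times = 5)
r_each  <- rep("x", each = 5)
identical(r_times, r_each)

# With a longer vector they differ
v_times <- rep(c("x", "y"), times = 2)   # "x" "y" "x" "y"
v_each  <- rep(c("x", "y"), each = 2)    # "x" "x" "y" "y"
print(v_times)
print(v_each)
```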
Questions?
Working with Data
• There are many data sets available for use in R
  • data() to see what's available
• We will work with the trees data set
  • data(trees)
  • This data set is now ready to use in R
• The following are useful commands:
  • summary(trees): summary of each variable
  • dim(trees): dimensions of the data set
  • names(trees): variable names
  • attach(trees): lets you refer to the variables by name
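The inspection commands above can be tried directly on trees, which ships with R. The last two lines are an addition to the slides: they show trees$Height and with() as ways to reach a variable without calling attach().

```r
data(trees)

dim(trees)       # 31 rows and 3 columns
names(trees)     # "Girth" "Height" "Volume"
summary(trees)   # five-number summary and mean of each variable

# Alternatives to attach(): refer to a column explicitly
mean_height_1 <- mean(trees$Height)
mean_height_2 <- with(trees, mean(Height))
```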
Extracting Data
• R has saved the data set trees as a data frame object
  • Check this by typing class(trees)
• A data frame is indexed like a matrix, by row and column: data.frame[rows,columns]
  • trees[c(1:2),2]: first 2 rows of the 2nd column
  • trees[3,c("Height", "Girth")]: row 3, with the columns referenced by name
  • trees[-c(10:20), "Height"]: the variable Height, skipping rows 10-20
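The indexing examples can be checked by storing each result and looking at its size. The names first_two, row_three, and h_subset are illustrative only.

```r
data(trees)
class(trees)                                 # "data.frame"

first_two <- trees[1:2, 2]                   # rows 1-2 of column 2 (Height)
row_three <- trees[3, c("Height", "Girth")]  # row 3, two named columns
h_subset  <- trees[-(10:20), "Height"]       # Height without rows 10-20

length(first_two)    # 2 values
dim(row_three)       # 1 row, 2 columns
length(h_subset)     # 31 rows minus the 11 skipped rows = 20 values
```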
Extracting Data
• The subset() command extracts the rows that satisfy a logical condition; the 1st argument is the data and the 2nd argument is the condition
  • subset(trees, Height>80): trees with height > 80
  • subset(trees, Height<70 & Girth>10): trees with height < 70 AND girth > 10
  • subset(trees, Height <60 | Girth >11): trees with height < 60 OR girth > 11
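As a sketch of what subset() is doing, the same rows can be selected with bracket indexing and a logical condition. The object names below are illustrative; the bracket form is an addition for comparison and is not from the slides.

```r
data(trees)

tall   <- subset(trees, Height > 80)
both   <- subset(trees, Height < 70 & Girth > 10)
either <- subset(trees, Height < 60 | Girth > 11)

# Equivalent selection with bracket indexing
tall_brackets <- trees[trees$Height > 80, ]

nrow(tall)            # number of trees taller than 80
nrow(tall_brackets)   # same count from the bracket form
```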
Questions?
Importing Data
• The most common (and easiest) file to import is a text file, using the read.table() command
• R needs to be told where the file is located. Either:
  • set the working directory, which tells R where your files are located: setwd("C:\\Users\\akazanis\\Desktop")
  • OR choose the working directory from the menu: File → Change dir…, then select the location of the files
  • OR include the full path of your file in the read.table() command
Using the read.table() command
• Include the full path of your file in the read.table() command:
  read.table("C:\\Users\\akazanis\\Desktop\\furniture.txt",header=TRUE,sep="")
• In a Windows path, use double backslashes \\ rather than a single backslash \ (forward slashes / also work)
• header=TRUE or header=FALSE tells R whether the first line of the file contains column names
• sep="" means the columns are separated by white space
Using the read.table() command
• Another way of specifying the file's location is to set the working directory first and then read in the file
  • setwd("C:\\Users\\akazanis\\Desktop")
  • read.table("furniture.txt",header=TRUE,sep="")
• OR choose the location from the menu: File → Change dir…
  • Then read in the data file: read.table("furniture.txt",header=TRUE,sep="")
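Both ways of locating a file can be tried without the workshop files. The sketch below writes a small, invented stand-in file to a temporary folder; the file name furniture_demo.txt, its column layout, and its values are assumptions made for illustration and are not the contents of furniture.txt.

```r
# Write a small white-space-delimited file to a temporary folder (invented data)
demo_dir  <- tempdir()
demo_file <- file.path(demo_dir, "furniture_demo.txt")
writeLines(c("Type Area Cost",
             "chair 100 250",
             "table 150 400",
             "desk 120 310"), demo_file)

# Option 1: give read.table() the full path
furn_full <- read.table(demo_file, header = TRUE, sep = "")

# Option 2: set the working directory, then use the file name only
old_wd <- setwd(demo_dir)
furn_wd <- read.table("furniture_demo.txt", header = TRUE, sep = "")
setwd(old_wd)        # restore the previous working directory

str(furn_full)
```

file.path() builds the path with forward slashes, which avoids the double-backslash issue altogether.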
read.table(), read.csv(), Missing Values
• It is also popular to import csv files, since Excel files are easily converted to csv files
• read.csv() and read.table() are very similar, although they handle missing values differently
  • read.csv() automatically assigns NA to an empty numeric field
  • read.table() with sep="" stops with an error when a row has fewer values than the other rows
• For read.table(), enter NA for each missing value in the file before reading it into R
read.table(), read.csv(), Missing Values
• Let's remove a data entry from both furniture.txt and furniture.csv
  • In the first row, erase 100 from the Area column
• Now read in the data from these two files using read.table() and read.csv()
• You should see that read.table() cannot read the data unless you enter a value (such as NA) for the missing entry
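The exercise can be reproduced with two small invented files, so the original furniture files stay untouched. File names, columns, and values here are assumptions for illustration; tryCatch() is used only so that the script keeps running after the read.table() error.

```r
txt_file <- file.path(tempdir(), "missing_demo.txt")
csv_file <- file.path(tempdir(), "missing_demo.csv")

# The Area value is missing from the first data row of both files
writeLines(c("Type Area Cost", "chair 250", "table 150 400"), txt_file)
writeLines(c("Type,Area,Cost", "chair,,250", "table,150,400"), csv_file)

# read.csv() fills the empty numeric field with NA
from_csv <- read.csv(csv_file)
print(from_csv)

# read.table() stops because the first row has too few values
txt_attempt <- tryCatch(
  read.table(txt_file, header = TRUE, sep = ""),
  error = function(e) e
)
print(conditionMessage(txt_attempt))

# After NA is typed into the file, read.table() succeeds
writeLines(c("Type Area Cost", "chair NA 250", "table 150 400"), txt_file)
from_txt <- read.table(txt_file, header = TRUE, sep = "")
print(from_txt)
```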
Other Options for Importing Data
• The foreign package is installed automatically when you download R
• Submit library(foreign) to load it
• It provides many more options for importing data:
  • read.xport(), read.spss(), read.dta(), read.mtp()
• For more information on these options, submit help(read.xxxx), replacing read.xxxx with the function name
Exporting Data
• You can export data by using the write.table() command:
  write.table(trees,"treesDATA.txt",row.names=FALSE,sep=",")
• trees: the data set we want exported
• "treesDATA.txt": the name of the file to be written
  • By default, R writes the file to the current working directory unless you give a full path
• row.names=FALSE: do not write the row names to the file
• sep=",": the output file is comma delimited
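One way to check an export is to read the file straight back in. This sketch writes to a temporary folder rather than the working directory; that choice, and the names out_file and trees_back, are assumptions made for illustration.

```r
data(trees)
out_file <- file.path(tempdir(), "treesDATA.txt")

# Export trees as a comma-delimited file without row names
write.table(trees, out_file, row.names = FALSE, sep = ",")

# Look at the first lines of the file, then read it back in
print(readLines(out_file, n = 3))
trees_back <- read.csv(out_file)

dim(trees_back)      # same dimensions as trees
names(trees_back)    # same variable names as trees
```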
Questions?
Example - Furniture Data Set
• Assign a name to the furniture data set as we read it in, so that we can do some analysis:
  furn<-read.table("furniture.txt",sep="",header=TRUE)
• To examine the data set:
  • dim(furn)
  • summary(furn)
  • names(furn)
  • attach(furn)
• It is important to attach furn before the subsequent steps, because they refer to its variables (Area, Cost, Type) by name
Graphs
• R can produce both very simple and very complex graphs
• Make a simple scatter plot of the Area and Cost variables from the furniture data set
  • plot(Area,Cost,main="Area vs Cost",xlab="Area",ylab="Cost")
  • Area is on the x-axis
  • Cost is on the y-axis
  • main sets the title; xlab and ylab label the axes
Graphs
• Examine the distribution of variables using graphs in R
  • hist(Area): histogram of Area
  • hist(Cost): histogram of Cost
  • boxplot(Cost ~ Type): boxplot of Cost by Type
Graphs
• We can make the boxplot much prettier:
  boxplot(Cost~Type,main="Boxplot of Cost by Type", col=c("orange","green","blue"), xlab="Type", ylab="Cost")
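The plotting commands from the last three slides can be tried without furniture.txt by building a small stand-in data frame. Everything in furn_demo (the column layout and all values) is invented for illustration. The data= argument is used instead of attach(), which is a substitution on my part rather than the approach in the slides.

```r
# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

# Hypothetical stand-in for the furniture data (invented values)
furn_demo <- data.frame(
  Type = factor(c("chair", "chair", "desk", "desk", "table", "table")),
  Area = c(40, 55, 90, 110, 130, 160),
  Cost = c(120, 150, 310, 390, 420, 520)
)

# Scatter plot with a title and axis labels
plot(Cost ~ Area, data = furn_demo,
     main = "Area vs Cost", xlab = "Area", ylab = "Cost")

# Histograms of the two numeric variables
hist(furn_demo$Area, main = "Histogram of Area", xlab = "Area")
hist(furn_demo$Cost, main = "Histogram of Cost", xlab = "Cost")

# Boxplot of Cost by Type, one color per type
bp <- boxplot(Cost ~ Type, data = furn_demo,
              main = "Boxplot of Cost by Type",
              col = c("orange", "green", "blue"),
              xlab = "Type", ylab = "Cost")
```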
Graphs
• Scatter plot matrix of all variables in a data set using the pairs() function
  • pairs(furn)
• Correlation/covariance matrix of the numeric variables
  • cor(furn[,c(2:3)])
  • cov(furn[,c(2:3)])
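The same commands work on any data frame with numeric columns. Since the furniture file is not reproduced here, this sketch uses the built-in trees data as a stand-in. The final line illustrates how the two matrices are related: rescaling a covariance matrix by the standard deviations gives the correlation matrix.

```r
# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

data(trees)
pairs(trees)                          # scatter plot matrix of all variables

num_cols <- trees[, c(1:2)]           # Girth and Height
cor_mat  <- cor(num_cols)
cov_mat  <- cov(num_cols)
print(cor_mat)
print(cov_mat)

# cov2cor() rescales a covariance matrix into a correlation matrix
all.equal(cov2cor(cov_mat), cor_mat)
```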
Graphs + Statistics
• Simple linear regression using the furniture data
  • m1<-lm(Cost ~ Area)
  • summary(m1)
  • coef(m1)
  • fitted.values(m1)
  • residuals(m1)
Graphs + Statistics
• Plot the residuals against the fitted values
  • plot(fitted.values(m1), residuals(m1))
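The regression steps can be sketched on the built-in trees data, with Girth standing in for Area and Volume for Cost. The stand-in data, the name m_demo, and the dashed reference line at zero are additions for illustration.

```r
# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

data(trees)
m_demo <- lm(Volume ~ Girth, data = trees)

summary(m_demo)        # coefficient table, R-squared, residual standard error
coef(m_demo)           # intercept and slope

fit_vals <- fitted.values(m_demo)
res_vals <- residuals(m_demo)

# Residuals against fitted values; a patternless band around 0 is the ideal
plot(fit_vals, res_vals, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```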
Graphs + Statistics
• Scatter plot of Area and Cost with fitted lines
  • plot(Area,Cost,main="Cost Regression Example",xlab="Area", ylab="Cost")
  • abline(lm(Cost~Area), col=3, lty=1)
  • lines(lowess(Cost~Area), col=3, lty=2)
• Interactively add a legend
  • legend(locator(1),c("Linear","Lowess"),lty=c(1,2),col=3)
  • Click on the graph to place the legend where you wish!
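locator(1) needs a mouse click, so it cannot be used in a script that runs unattended. The sketch below places the legend with the keyword "topleft" instead. It uses trees as a stand-in for the furniture data (Girth for Area, Volume for Cost), which is an assumption made for illustration.

```r
# When run as a script, draw to a null device instead of creating Rplots.pdf
if (!interactive()) pdf(NULL)

data(trees)
plot(trees$Girth, trees$Volume,
     main = "Volume Regression Example", xlab = "Girth", ylab = "Volume")

# Straight-line fit (solid) and lowess smoother (dashed), both in color 3
abline(lm(Volume ~ Girth, data = trees), col = 3, lty = 1)
lo_demo <- lowess(trees$Girth, trees$Volume)
lines(lo_demo, col = 3, lty = 2)

# Fixed legend position instead of an interactive click
legend("topleft", legend = c("Linear", "Lowess"), lty = c(1, 2), col = 3)
```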
Graphs + Statistics
• Identify different points on the graph
  • identify(Area, Cost, row.names(furn))
  • Makes it easy to identify outliers
• Use the locator() command to quantify differences between the regression fit and the lowess line
  • locator(2)
  • Example: compare the predicted values of Cost when Area is equal to 50
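The comparison can also be made numerically rather than by clicking on the graph. This is an alternative to locator(), not the method in the slides. It uses trees as a stand-in for the furniture data, and the value x0 = 12 for Girth is an arbitrary choice playing the role of Area = 50. predict() gives the value on the regression line, and approx() interpolates the lowess curve at the same point.

```r
data(trees)
m_demo  <- lm(Volume ~ Girth, data = trees)
lo_demo <- lowess(trees$Girth, trees$Volume)

x0 <- 12   # hypothetical point of comparison, inside the observed Girth range

# Value on the regression line at x0
linear_at_x0 <- unname(predict(m_demo, newdata = data.frame(Girth = x0)))

# Value on the lowess curve at x0, by linear interpolation between its points
lowess_at_x0 <- approx(lo_demo$x, lo_demo$y, xout = x0, ties = mean)$y

print(linear_at_x0)
print(lowess_at_x0)
print(linear_at_x0 - lowess_at_x0)   # difference between the two fits at x0
```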