Introduction to R




  1. Introduction to R Summer session: Lecture 3 Brian Healy

  2. Outline • Discussion of R • Importing and changing data • Creating your own data • Summary statistics / graphs • Tests for normality

  3. What is R? • Statistical computer language similar to S-plus • Has many built-in statistical functions • Easy to build your own functions (similar to SAS macros) • Good graphic displays • Extensive help files

  4. Strengths • Many built-in functions • Can get other functions from the internet by downloading libraries • Relatively easy data manipulations Weaknesses • Not as commonly used by non-statisticians • Many datasets already in SAS form

  5. Starting R • Windows / HSPH computers • Open using the start menu or g: drive • Unix / Telnet in HSPH from home • Open using R1.9 command • Unix specific commands will be discussed at the end of the session

  6. Writing R code • Can input lines one at a time into R • Can write many lines of code in a text editor and run all at once • Using Windows version, simply paste the commands into R • Using Unix version, save the commands and run in batch mode

  7. Types of commands • Defining variables • Inputting data • Using built-in functions • Using the help menu and notation • ?functionname, help.search("functionname") • Writing your own functions

  8. Language layout • Three types of statement • expression: it is evaluated, printed, and the value is lost (3+5) • assignment: passes the value to a variable but the result is not printed automatically (out<-3+5) • comment: (#This is a comment)

  9. Naming conventions • Names may contain Roman letters, digits, and ‘.’; a name cannot start with a digit • Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi, range, rank, tree, var • These rules hold for variables, data and functions

  10. Arithmetic operations and functions • Most operations in R are similar to Excel and calculators • Basic: +(add), -(subtract), *(multiply), /(divide) • Exponentiation: ^ • Remainder or modulo operator: %% • Matrix multiplication: %*% • sin(x), cos(x), cosh(x), tan(x), tanh(x), acos(x), acosh(x), asin(x), asinh(x), atan(x), atan2(y,x), atanh(x) • abs(x), ceiling(x), floor(x) • exp(x), log(x, base=exp(1)), log10(x), sqrt(x), trunc(x) (the next integer closer to zero) • max(), min()
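
The operators on this slide can be tried on a few small values. The following is a minimal sketch; the numbers and the names A, B, AB, tr, fl and ang are made up for illustration and are not part of the lecture.

```r
# Remainder and exponentiation
rem <- 17 %% 5        # remainder after dividing 17 by 5
pow <- 2^10           # 2 raised to the 10th power

# Matrix multiplication: multiplying by the identity matrix returns A
A <- matrix(1:4, 2, 2)
B <- diag(2)          # 2 x 2 identity matrix
AB <- A %*% B

# trunc() rounds toward zero, floor() rounds down
tr <- trunc(-2.7)
fl <- floor(-2.7)

# Two-argument arctangent: the angle of the point (x = 1, y = 1)
ang <- atan2(1, 1)
```

Comparing tr and fl on a negative number shows why trunc is described as "the next integer closer to zero".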

  11. Defining new variables • Assignment symbol: use <- (the _ form from S-plus is not accepted by current versions of R) • Scalars • scal<-6 • value<-7 • Vectors • vec<-c(0,1,2) • vec2<-c(1:10) • vec3<-c(8,6,4,2,10,12,14) • famnames<-c("Kate", "Andrew", "Brian") • Variable names are case sensitive

  12. Examples • Try the following • 3*4+2 • 5*scal • (value+scal)*2 • 3+vec • sqrt(vec2) • trunc(log(vec3)) • What happened in the last three cases? • How can we assign the result of these to a new variable? • What is the minimum of vec3?

  13. Indexing vectors • To choose individual observations from a vector, use vec[1] • Try the following: • Find the 3rd value from vec • List the 4th, 5th, and 6th values from vec2 • Make a new variable that is only the first 2 values from vec2
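
One possible way to work through the vector examples and the indexing questions from slides 12 and 13 is sketched below. The variables are the ones defined on slide 11; the result names (shifted, logs, smallest, third, middle, first2) are made up for illustration.

```r
scal <- 6
value <- 7
vec <- c(0, 1, 2)
vec2 <- c(1:10)
vec3 <- c(8, 6, 4, 2, 10, 12, 14)

# Arithmetic and functions are applied to every element of a vector
shifted <- 3 + vec
logs <- trunc(log(vec3))

# Assigning a result to a new variable keeps it for later use
smallest <- min(vec3)

# Indexing with square brackets
third <- vec[3]        # a single element
middle <- vec2[4:6]    # a range of positions
first2 <- vec2[1:2]    # a new variable holding the first two values
```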

  14. Matrix • There are several ways to make a matrix • To make a 2x3 (2 rows, 3 columns) matrix of 0’s: • mat<-matrix(0,2,3) • To make the following matrix: • mat2<-rbind(c(71,172),c(73,169),c(69,160),c(65,130)) • mat3<-cbind(c(71,73,69,65),c(172,169,160,130)) • To make the following matrix: • mat4<-matrix(vec2,2,5, byrow=T)

  15. Matrix Examples • Create the following matrices: ex1, ex2, ex3 (the three matrices appeared as images on the original slide)

  16. Rows and columns of matrices • You can pick an individual data point, row or column from a matrix • mat[1,2] will give the observation in the first row, second column • What happens when you type • mat2[,2] • mat2[2,] • mat4[1:3,1]
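
The matrix commands from slides 14 and 16 can be run together as a short sketch. The tryCatch() wrapper and the names col2, row2 and res are additions for illustration: mat4 has only two rows, so asking for rows 1 to 3 is expected to fail, and the wrapper lets the script continue.

```r
vec2 <- c(1:10)

mat <- matrix(0, 2, 3)                                  # 2 x 3 matrix of zeros
mat2 <- rbind(c(71, 172), c(73, 169), c(69, 160), c(65, 130))  # built row by row
mat3 <- cbind(c(71, 73, 69, 65), c(172, 169, 160, 130))        # built column by column
mat4 <- matrix(vec2, 2, 5, byrow = TRUE)                # filled along rows

col2 <- mat2[, 2]   # entire second column
row2 <- mat2[2, ]   # entire second row

# Rows 1:3 do not all exist in mat4, so this request raises an error
res <- tryCatch(mat4[1:3, 1], error = function(e) "error")
```

Leaving an index blank selects the whole row or column, which is why mat2[,2] returns a vector of four values.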

  17. Lists • The final form of data is lists • Type the following code • ourlist<-list(v1=vec,v2=vec2, fam=famnames) • Now, let’s look at ourlist • If there are no names, we can give the elements of our list names using • names(ourlist)<-c("v1", "v2", "fam") • To get individual parts of a list we can use the $ sign • What happens when you type ourlist$v1? • What happens when you type ourlist$v1[1]?
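
A short sketch of the list operations on this slide follows. It starts from an unnamed list so that the names() step has something to do; the double-bracket form ourlist[["fam"]] is not on the slide and is shown only as an equivalent alternative to the $ sign.

```r
vec <- c(0, 1, 2)
vec2 <- c(1:10)
famnames <- c("Kate", "Andrew", "Brian")

# A list can hold elements of different types and lengths
ourlist <- list(vec, vec2, famnames)
names(ourlist) <- c("v1", "v2", "fam")

whole <- ourlist$v1          # the entire first element
first <- ourlist$v1[1]       # the first value inside that element
family <- ourlist[["fam"]]   # same idea as ourlist$fam
```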

  18. Data frames • Very similar to matrices in form • Different columns can be of different types, unlike matrices • famages<-c(24, 29, 27); famheight<- c(64, 73, 71) • We can put my family names, ages and heights into a data frame • fam<-data.frame(famnames, famages, famheight)

  19. Converting data • To change a matrix to a data frame • as.data.frame() • To change a data frame to a matrix use • fammat<-as.matrix(fam) • Try this on your own • How does this change? • What happens when we type fam[,2]+3 and fammat[,2]+3
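
The difference between fam[,2]+3 and fammat[,2]+3 can be seen in a small sketch. Because a matrix holds a single type, converting a data frame that contains names turns every column into text, so the arithmetic on the matrix is expected to fail. The tryCatch() wrapper and the names older and res are additions for illustration.

```r
famnames <- c("Kate", "Andrew", "Brian")
famages <- c(24, 29, 27)
famheight <- c(64, 73, 71)

fam <- data.frame(famnames, famages, famheight)
fammat <- as.matrix(fam)

# The data frame keeps ages as numbers
older <- fam[, 2] + 3

# The matrix stores everything as character strings
storage <- mode(fammat)
res <- tryCatch(fammat[, 2] + 3, error = function(e) "error")
```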

  20. Opening data • You must always know the directory where your files have been stored • In R for Windows, use \\ to separate directories: c:\\splus\\free.dat • In Unix, use /: splus/free.dat • For now we will show how to do this in Windows

  21. Reading in data • Change the directory to g:\shared\bio271summer • Let’s look at the class data set in the notepad • To read in data, use this command: class<-read.table("class.dat", header=T) • This command assumes that the data is space or tab delimited • Reads in as data frame

  22. Working with the data • Type class to look at the data • How could we find people's heights in cm? • Use cmheight<-class[,3]*2.54 • This command completes the operation on the entire column as we discussed before • We can attach this new variable to the old dataset using newclass<-cbind(class,cmheight) • rbind is used to combine extra rows • In this example, that would be more students

  23. Reading in data cont. • Look at the data in auto.dat in the notepad • Note that data is comma delimited • Try the read.table method • What is wrong? Look at ?read.table • auto<-read.table("auto2.dat", sep=",", header=T)
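
The course files are not available here, so the sketch below writes two small stand-in files to a temporary location and reads them back with the commands from slides 21 to 23. The file contents, column names and values are hypothetical; only the read.table() and cbind() calls follow the slides.

```r
# Hypothetical stand-in for class.dat: space-delimited, with a header row
classfile <- tempfile(fileext = ".dat")
writeLines(c("name age height",
             "Ann 25 64",
             "Bob 31 70",
             "Cy 28 68"), classfile)
class <- read.table(classfile, header = TRUE)

# Convert the third column (height in inches) to cm and attach it
cmheight <- class[, 3] * 2.54
newclass <- cbind(class, cmheight)

# Hypothetical stand-in for the comma-delimited auto data, with a missing value
autofile <- tempfile(fileext = ".dat")
writeLines(c("make,price,mpg",
             "A,4099,22",
             "B,4749,NA"), autofile)
auto <- read.table(autofile, sep = ",", header = TRUE)
```

Without sep="," the second file would be read as a single column, which is the problem the slide asks about.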

  24. Practice • Find the minimum height in the class • Add two family members to your class data set • Make a new variable mpg from the third column of auto

  25. Outputting data-CHECK • Write a vector to a file with write(): write(x,file="outdata") • Write a matrix or data frame: write.matrix<-function(x,file="",sep="") { x<-as.matrix(x); p<-ncol(x); cat(dimnames(x)[[2]],format(t(x)),file=file, sep=c(rep(sep,p-1),"\n")) } • Try to write your family dataset to your P:\\ drive
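
As an alternative to the hand-written function on this slide, base R provides write.table(), which is not mentioned in the lecture. The sketch below writes the family data frame to a temporary file and reads it back as a check; the temporary path stands in for the P: drive.

```r
fam <- data.frame(famnames = c("Kate", "Andrew", "Brian"),
                  famages = c(24, 29, 27),
                  famheight = c(64, 73, 71))

# Write the data frame as space-delimited text with a header row
outfile <- tempfile(fileext = ".dat")
write.table(fam, file = outfile, row.names = FALSE, quote = FALSE)

# Read it back to confirm the file has the expected shape
back <- read.table(outfile, header = TRUE)
```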

  26. Input/Output • execute commands from a file: source("command.s") • use options(echo=T) to have the commands echoed. • divert output to a file: sink("record.lis") • write objects to an external file: dump(c("a","x","ink"),file="outdat") • general print: cat(format(iris3[,1,1]),fill=60)

  27. Sorting data • There are several ways to sort data in R • To sort a vector, use sort() • To sort the ages in the class, sort(class[,2]) • To sort the entire matrix, use order() • To order the class by age, class[order(class[,2]),] • To get the same result in two steps • o<-order(class[,2]) • class[o,] • Try to sort auto by foreign
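
The difference between sort() and order() is easiest to see on a small table. The data frame below is a hypothetical stand-in for the class data, with age in the second column as on the slide.

```r
class <- data.frame(name = c("Ann", "Bob", "Cy"),
                    age = c(25, 31, 28),
                    height = c(64, 70, 68))

# sort() returns the sorted values themselves
sorted_ages <- sort(class[, 2])

# order() returns the row positions that would sort the values
o <- order(class[, 2])

# Using those positions as row indices reorders the whole data frame
byage <- class[o, ]
```

Because order() returns positions rather than values, the same index o could be used to reorder any other object with matching rows.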

  28. Missing values • What happens if there are missing values as in the auto data sets? • R codes these as NA • How can we change the NA’s to 0’s? • is.na(x) is a logical function that assigns a T to all values that are NA and F otherwise • Ans: data[is.na(data)]<-0

  29. Practice • In slide 10, several functions were mentioned, including max and min; sum works the same way • Try sum and min on mpg • What has happened? • How can we find the sum of mpg not including the missing values? • sum(mpg[!is.na(mpg)])

  30. Practice • What is the difference when you use the function sqrt on the same data? • Why is there a difference in the effect of the missing data?
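
The questions on slides 28 to 30 can be explored with a short vector containing missing values. The mpg values below are hypothetical. The na.rm=TRUE argument is not on the slides; it is a standard option of sum() that gives the same result as removing the NA's by hand.

```r
mpg <- c(22, NA, 17, 30, NA)   # hypothetical values with two missing entries

# A summary over the whole vector is missing if any value is missing
total <- sum(mpg)

# Two equivalent ways to leave the missing values out
total_known <- sum(mpg[!is.na(mpg)])
total_narm <- sum(mpg, na.rm = TRUE)

# sqrt() works element by element, so only the missing positions stay NA
roots <- sqrt(mpg)

# Replacing the NA's with 0, as on slide 28
filled <- mpg
filled[is.na(filled)] <- 0
```

The contrast between total and roots is the point of slide 30: a function that combines all values is affected by a single NA, while an elementwise function is affected only in the missing positions.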

  31. Loops and conditionals • Conditional • if (expr) expr • if (expr) expr else expr • Iteration • repeat expr • while (expr) expr • for (name in expr1) expr • For comparisons use: • == for equal • != for not equal • > for greater than • & for and (&& when comparing single values, as in if) • | for or (|| when comparing single values)

  32. Examples • What happens with the following code? • if (value==1) {check<-1} else {check<-0} • counter<-0; for (i in 1:10) { if (auto[i,3]<10){counter<-counter+1} }
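
The counting loop on this slide can be tried on a plain vector. The values below are hypothetical and stand in for the third column of auto. The is.na() check is an addition: a comparison with NA is itself NA, and if() stops with an error when its condition is NA.

```r
mpg <- c(8, 22, NA, 9, 17, 5, 30, 12, 7, 25)   # hypothetical values

# Loop version: count the values below 10, skipping missing ones
counter <- 0
for (i in 1:10) {
  if (!is.na(mpg[i]) && mpg[i] < 10) {
    counter <- counter + 1
  }
}

# Vectorized version of the same count
vcount <- sum(mpg < 10, na.rm = TRUE)
```

Both versions give the same count; the vectorized form is usually shorter in R.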

  33. Basic R functions • Let’s use the class data set • Find the summary statistics of the data • summary(class) • Notice how names are handled compared to ages and heights • What happens when we type range(class) • How can we find the range of age and height using this command?

  34. Functions • You can define a function to complete any operation • out<-function(var){definition} • Let’s look at this function: filter <- function(x){ if (is.na(sum(x))){fil<-F} else {fil<-T}; fil }; filt <- apply(auto, 1, filter); newauto<-auto[filt,] • What is this function doing?
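
A runnable sketch of the row filter follows, using a small hypothetical all-numeric data frame in place of auto. The function name has_no_na is made up for illustration; complete.cases() is a base R function that is not on the slide and is shown only for comparison.

```r
auto <- data.frame(price = c(4099, 4749, 3799),
                   mpg = c(22, NA, 17))   # hypothetical values

# TRUE when a row contains no missing values: the sum of a row is NA
# whenever any of its entries is NA
has_no_na <- function(x) {
  !is.na(sum(x))
}

keep <- apply(auto, 1, has_no_na)   # apply the function to each row
newauto <- auto[keep, ]

# Built-in shortcut for the same selection
same <- auto[complete.cases(auto), ]
```

This approach assumes every column is numeric, since sum() cannot be applied to text.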

  35. Practice • Write a function to calculate the sum of the numbers in a vector greater than ten and the sum of the numbers less than or equal to ten • Output the answer in a list using list(ans1,ans2) at the end of your function • Try your function on the mpg data from the auto data set • Look at the output when you apply your function

  36. Generating data from distributions • Many applications require generating data from specific distributions • r<distname>(n,<parameters>) • Possible distributions: beta, cauchy, chisq, f, gamma, norm, t, unif • You can find other characteristics of distributions as well • d<dist>(x,<parameters>): density at x • p<dist>(x,<parameters>): cumulative distribution function at x • q<dist>(p,<parameters>): inverse cdf

  37. Using the distribution functions • Often when we use simulations we need to use the r functions • Try sample<-runif(n=10,min=0,max=1) • What happens here?
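
The four prefixes from slide 36 can be tried with the uniform and normal distributions. In this sketch set.seed() is an addition that makes the random draws repeatable, and the result names are made up for illustration.

```r
set.seed(1)   # fix the random number stream so the draws can be repeated

# r: ten random draws from the uniform distribution on [0, 1]
draws <- runif(n = 10, min = 0, max = 1)

# d: density of the standard normal at 0
dens <- dnorm(0)

# p: probability that a standard normal value is at most 1.96
prob <- pnorm(1.96)

# q: the value below which 97.5% of the standard normal lies
quant <- qnorm(0.975)
```

The p and q functions are inverses of each other, so pnorm(qnorm(0.975)) returns 0.975.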

  38. Plots • One of the biggest advantages of R is the quality of the plots • Let’s plot the ages of the class • To make plots in R, use the following commands for the appropriate plots • age<-class[,2] • histogram- hist(age) • box plot- boxplot(age)

  39. Plot Command • plot() is the basic command for producing a scatter plot or line graph • col= set colors, lty= set line types, lwd= set line widths, pch= set the character type, type= pick points (type = "p") or lines ("l"), cex= set the "character expansion", xlab= and ylab= set the labels, xlim= and ylim= set the limits of the axes, main= put a title on the plot, sub= add a sub-title (mtext() adds text in the margins) • See help(par) for details

  40. One-Dimensional Plots • barplot(height) #simple form • barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE) • boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE) • hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)

  41. Two-Dimensional Plots • lines(x, y, type="l") • points(x, y, type="p") • matplot(x, y, type="p", lty=1:5, pch=, col=1:4) • matpoints(x, y, type="p", lty=1:5, pch=, col=1:4) • matlines(x, y, type="l", lty=1:5, pch=, col=1:4) • plot(x, y, type="p", log="") • abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=) • qqplot(x, y, plot=TRUE) • qqnorm(x, datax=FALSE, plot=TRUE)

  42. Three-Dimensional Plots • contour(x, y, z, v, nint=5, add=FALSE, labex) • interp(x, y, z, xo, yo, ncp=0, extrap=FALSE) • persp(z, eye=c(-6,-8,5), ar=1)

  43. Multiple Plots Per Page • par(mfrow=c(nrow, ncol), oma=c(0, 0, 4, 0)) • mfrow=c(m,n): subsequent figures will be drawn row-by-row in an m by n matrix on the page. • oma=c(xbot,xlef,xtop,xrig): outer margin lines of text. • mtext(side=3, line=0, cex=2, outer=T, "This is an Overall Title For the Page") • Try this code on your own • par(mfrow=c(2,1)) • hist(age) • plot(class[,2],class[,3])
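
The two-plot page from this slide can be sketched as follows. The age and height values are hypothetical stand-ins for the class data, and the plots are sent to a temporary PDF file only so that the code runs without a screen; in an interactive session the pdf() and dev.off() lines can be dropped.

```r
age <- c(25, 31, 28, 22, 35)      # hypothetical ages
height <- c(64, 70, 68, 62, 72)   # hypothetical heights in inches

plotfile <- tempfile(fileext = ".pdf")
pdf(plotfile)                     # open a file device

# Two rows, one column, with four lines of outer margin at the top
par(mfrow = c(2, 1), oma = c(0, 0, 4, 0))
hist(age, main = "Histogram of age", xlab = "Age")
plot(age, height, xlab = "Age", ylab = "Height", pch = 16)

# Overall title written into the outer margin
mtext("Class data", side = 3, line = 0, cex = 2, outer = TRUE)

dev.off()                         # close the device and finish the file
```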

  44. Output to a postscript file • Often we want to output an R graph to a postscript file to place it into a LaTeX file or other document • To do this, we use the following code • postscript("graph1.ps") – This opens a postscript file in the current working directory • plot(regr) – This plots a graph into the file • dev.off() – This closes the postscript file
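
The three steps on this slide can be run end to end with made-up data. In this sketch the file is written to a temporary path instead of graph1.ps, and a simple scatter plot stands in for plot(regr), since the regr object is not defined in the lecture.

```r
x <- 1:10
y <- x^2   # made-up data for the plot

psfile <- tempfile(fileext = ".ps")
postscript(psfile)                       # open the postscript file
plot(x, y, main = "Example plot", xlab = "x", ylab = "y")
dev.off()                                # close the file so it can be used elsewhere
```

Nothing is written to the file reliably until dev.off() is called, so the closing step should not be skipped.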

  45. Making plots of your own • Make the following plots • Histogram of height in the class with the appropriate labels • Scatterplot of height and age in the class using a different point • Make a postscript file with four plots of your choice from the baby dataset on one graph • Write a function to make a histogram and boxplot on one graph and use it on any of your data
