Introduction to R Summer session: Lecture 3 Brian Healy
Outline • Discussion of R • Importing and changing data • Creating your own data • Summary statistics / graphs • Tests for normality
What is R? • Statistical computer language similar to S-plus • Has many built-in statistical functions • Easy to build your own functions (similar to SAS macros) • Good graphic displays • Extensive help files
Strengths • Many built-in functions • Can get other functions from the internet by downloading libraries • Relatively easy data manipulations Weaknesses • Not as commonly used by non-statisticians • Many datasets already in SAS form
Starting R • Windows / HSPH computers • Open using the start menu or g: drive • Unix / Telnet in HSPH from home • Open using R1.9 command • Unix specific commands will be discussed at the end of the session
Writing R code • Can input lines one at a time into R • Can write many lines of code in a text editor and run all at once • Using Windows version, simply paste the commands into R • Using Unix version, save the commands and run in batch mode
Types of commands • Defining variables • Inputting data • Using built-in functions • Using the help menu and notation • ?functionname, help.search("functionname") • Writing your own functions
Language layout • Three types of statement • expression: it is evaluated, printed, and the value is lost (3+5) • assignment: passes the value to a variable but the result is not printed automatically (out<-3+5) • comment: (#This is a comment)
Naming conventions • Any roman letters, digits, and '.' (a name cannot begin with a digit) • Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi, range, rank, tree, var • These rules hold for variables, data and functions
Arithmetic operations and functions • Most operations in R are similar to Excel and calculators • Basic: +(add), -(subtract), *(multiply), /(divide) • Exponentiation: ^ • Remainder or modulo operator: %% • Matrix multiplication: %*% • sin(x), cos(x), cosh(x), tan(x), tanh(x), acos(x), acosh(x), asin(x), asinh(x), atan(x), atan2(y,x), atanh(x) • abs(x), ceiling(x), floor(x) • exp(x), log(x, base=exp(1)), log10(x), sqrt(x), trunc(x) (the next integer closer to zero) • max(), min()
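A few of these operators are easy to confuse, so here is a minimal sketch of how they behave. The matrix A is a made-up example, and atan2 is R's two-argument arc tangent.

```r
7 %% 3                   # remainder after division: 1
2^10                     # exponentiation: 1024
trunc(-2.7)              # drops the fraction, moving toward zero: -2
floor(-2.7)              # largest integer not above x: -3
ceiling(-2.7)            # smallest integer not below x: -2
A <- matrix(1:4, 2, 2)   # filled by column, so the rows are (1,3) and (2,4)
prod_mat <- A %*% A      # matrix product
prod_elt <- A * A        # elementwise product, a different result
angle <- atan2(1, 1)     # two-argument arc tangent, equal to pi/4
```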
Defining new variables • Assignment symbol, use "<-" (the older "_" form is not accepted by current versions of R) • Scalars • scal<-6 • value<-7 • Vectors • vec<-c(0,1,2) • vec2<-c(1:10) • vec3<-c(8,6,4,2,10,12,14) • famnames<-c("Kate", "Andrew", "Brian") • Variable names are case sensitive
Examples • Try the following • 3*4+2 • 5*scal • (value+scal)*2 • 3+vec • sqrt(vec2) • trunc(log(vec3)) • What happened in the last three cases? • How can we assign the result of these to a new variable? • What is the minimum of vec3?
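The last three examples work because arithmetic and most math functions in R are applied to every element of a vector. A short sketch using the variables defined on the earlier slide (res1 and res2 are names chosen here for illustration):

```r
scal <- 6
value <- 7
vec <- c(0, 1, 2)
vec2 <- c(1:10)
vec3 <- c(8, 6, 4, 2, 10, 12, 14)

3 + vec                    # 3 is added to each element: 3 4 5
sqrt(vec2)                 # square root of each element
res1 <- 3 + vec            # store a result with the assignment symbol
res2 <- trunc(log(vec3))   # natural log of each element, then truncated
res2                       # 2 1 1 0 2 2 2
min(vec3)                  # smallest element: 2
```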
Indexing vectors • To choose individual observations from a vector, use vec[1] • Try the following: • Find the 3rd value from vec • List the 4th, 5th, and 6th values from vec2 • Make a new variable that is only the first 2 values from vec2
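One possible way to work through these indexing exercises is sketched below. The name first2 is chosen here for illustration, and the last two lines show negative and logical indexing, which the slide does not cover.

```r
vec <- c(0, 1, 2)
vec2 <- c(1:10)

vec[3]                 # 3rd value of vec: 2
vec2[4:6]              # 4th, 5th and 6th values: 4 5 6
first2 <- vec2[1:2]    # new variable holding the first two values
vec2[-1]               # a negative index drops that position
vec2[vec2 > 7]         # a logical condition keeps matching values: 8 9 10
```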
Matrix • There are several ways to make a matrix • To make a 2x3 (2 rows, 3 columns) matrix of 0's: • mat<-matrix(0,2,3) • To make a 4x2 matrix with rows (71,172), (73,169), (69,160), (65,130), bind the rows or bind the columns: • mat2<-rbind(c(71,172),c(73,169),c(69,160),c(65,130)) • mat3<-cbind(c(71,73,69,65),c(172,169,160,130)) • To make a 2x5 matrix holding 1 to 10, filled row by row: • mat4<-matrix(vec2,2,5, byrow=T)
Matrix Examples • Create the following matrices (shown as figures in the original slides) • ex1 • ex2 • ex3
Rows and columns of matrices • You can pick an individual data point, row or column from a matrix • mat[1,2] will give the observation in the first row, second column • What happens when you type • mat2[,2] • mat2[2,] • mat4[1:3,1]
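The sketch below rebuilds the matrices from the previous slide and tries the indexing commands. Note that mat4 has only two rows, so asking for rows 1 to 3 stops with a "subscript out of bounds" error; try() is used here only so the script keeps running.

```r
mat2 <- rbind(c(71, 172), c(73, 169), c(69, 160), c(65, 130))
mat3 <- cbind(c(71, 73, 69, 65), c(172, 169, 160, 130))
mat4 <- matrix(1:10, 2, 5, byrow = TRUE)   # rows are 1..5 and 6..10

all(mat2 == mat3)     # TRUE: both constructions give the same 4x2 matrix
mat2[1, 2]            # first row, second column: 172
mat2[, 2]             # whole second column
mat2[2, ]             # whole second row: 73 169
mat2[1:3, 1]          # first three rows of column 1: 71 73 69
res <- try(mat4[1:3, 1], silent = TRUE)   # mat4 has no third row
inherits(res, "try-error")                # TRUE
```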
Lists • The final form of data is lists • Type the following code • ourlist<-list(v1=vec,v2=vec2, fam=famnames) • Now, let's look at ourlist • If there are no names, we can give the elements of our list names using • names(ourlist)<-c("v1", "v2", "fam") • To get individual parts of a list we can use the $ sign (or double brackets, ourlist[["v1"]]) • What happens when you type ourlist$v1? • What happens when you type ourlist$v1[1]?
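A short sketch of building and reading a list, using the vectors from the earlier slides. The object unnamed is a hypothetical list created here without names, to show how names() adds them afterwards.

```r
vec <- c(0, 1, 2)
vec2 <- c(1:10)
famnames <- c("Kate", "Andrew", "Brian")

ourlist <- list(v1 = vec, v2 = vec2, fam = famnames)
ourlist$v1             # the whole first element: 0 1 2
ourlist$v1[1]          # first value inside that element: 0
ourlist[["fam"]]       # double brackets extract an element by name

unnamed <- list(vec, vec2, famnames)       # no names given
names(unnamed) <- c("v1", "v2", "fam")     # add names afterwards
unnamed$fam[3]         # "Brian"
```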
Data frames • Very similar to matrices in form • Different columns can be of different types, unlike matrices • famages<-c(24, 29, 27); famheight<- c(64, 73, 71) • We can put my family names, ages and heights into a data frame • fam<-data.frame(famnames, famages, famheight)
Converting data • To change a matrix to a data frame • as.data.frame() • To change a data frame to a matrix use • fammat<-as.matrix(fam) • Try this on your own • How does this change? • What happens when we type fam[,2]+3 and fammat[,2]+3
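The conversion matters because a matrix can hold only one type of value. A minimal sketch with the family data from the previous slide; try() is used only to capture the error without stopping the script.

```r
famnames <- c("Kate", "Andrew", "Brian")
famages <- c(24, 29, 27)
famheight <- c(64, 73, 71)
fam <- data.frame(famnames, famages, famheight)
fammat <- as.matrix(fam)

fam[, 2] + 3                   # numeric column: 27 32 30
mode(fammat)                   # "character": the names forced every cell to text
res <- try(fammat[, 2] + 3, silent = TRUE)   # arithmetic on text fails
inherits(res, "try-error")     # TRUE
as.numeric(fammat[, 2]) + 3    # convert back to numbers first: 27 32 30
```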
Opening data • You must always know the directory where your files have been stored • In R for Windows, use double backslashes (\\) in the path: c:\\splus\\free.dat • In Unix, splus/free.dat • For now we will show how to do this in Windows
Reading in data • Change the directory to g:\shared\bio271summer • Let’s look at the class data set in the notepad • To read in data, use this command: class<-read.table("class.dat", header=T) • This command assumes that the data is space or tab delimited • Reads in as data frame
Working with the data • Type class to look at the data • How could we find people's heights in cm? • Use cmheight<-class[,3]*2.54 • This command completes the operation on the entire column as we discussed before • We can attach this new variable to the old dataset using newclass<-cbind(class,cmheight) • rbind is used to combine extra rows • In this example, that would be more students
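The class data file is not reproduced in these slides, so the sketch below writes a small stand-in file first. The names, ages and heights are invented, and only the column order (name, age, height in inches) follows the slides.

```r
# Hypothetical stand-in for class.dat, written to a temporary file
tmp <- tempfile(fileext = ".dat")
writeLines(c("name age height",
             "Ann 25 64",
             "Bob 31 70",
             "Cy 28 68"), tmp)

class <- read.table(tmp, header = TRUE)   # space delimited, first line holds names
cmheight <- class[, 3] * 2.54             # whole column converted to cm at once
newclass <- cbind(class, cmheight)        # add the new variable as a column

extra <- data.frame(name = "Di", age = 22, height = 66)  # one more (invented) student
bigger <- rbind(class, extra)             # add the new student as a row
```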
Reading in data cont. • Look at the data in auto.dat in the notepad • Note that data is comma delimited • Try the read.table method • What is wrong? Look at ?read.table • auto<-read.table("auto2.dat", sep=",", header=T)
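To see what goes wrong without sep=",", the sketch below reads a small comma delimited file both ways. The file contents and column names are invented for illustration; only the position of mpg (third column) and the presence of a foreign column follow the slides.

```r
# Hypothetical comma delimited file standing in for the auto data
tmp <- tempfile(fileext = ".dat")
writeLines(c("make,price,mpg,foreign",
             "AMC,4099,22,0",
             "Buick,4816,NA,0",
             "Audi,9690,17,1"), tmp)

wrong <- read.table(tmp, header = TRUE)             # default separator is white space
ncol(wrong)                                         # 1: each line is read as one field
auto <- read.table(tmp, sep = ",", header = TRUE)   # tell R the separator
ncol(auto)                                          # 4
auto_csv <- read.csv(tmp)                           # read.csv uses sep="," and header=TRUE by default
mpg <- auto[, 3]                                    # third column as its own variable
```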
Practice • Find the minimum height in the class • Add two family members to your class data set • Make a new variable mpg from the third column of auto
Outputting data • write a vector to a file: write() write(x,file="outdata") • write a matrix or data frame: write.matrix<-function(x,file="",sep=""){ x<-as.matrix(x); p<-ncol(x); cat(dimnames(x)[[2]],format(t(x)),file=file, sep=c(rep(sep,p-1),"\n")) } • Try to write your family dataset to your P:\\ drive
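As an alternative to a hand-written function, base R's write.table() writes a data frame directly. This is a sketch that swaps in that function rather than the slide's own write.matrix, and it writes to a temporary file instead of the P: drive.

```r
famnames <- c("Kate", "Andrew", "Brian")
famages <- c(24, 29, 27)
famheight <- c(64, 73, 71)
fam <- data.frame(famnames, famages, famheight)

outfile <- tempfile(fileext = ".txt")     # replace with your own path
write.table(fam, file = outfile, sep = "\t",
            quote = FALSE, row.names = FALSE)   # tab delimited, header line included
back <- read.table(outfile, header = TRUE, sep = "\t")   # read it back to check
```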
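Input/Output • execute commands from a file: source("command.s") • use source("command.s", echo=T) to have the commands echoed as they run • divert output to a file: sink("record.lis"); call sink() with no arguments to return output to the console • write objects to an external file: dump(c("a","x","ink"),file="outdat") • general print: cat(format(iris3[,1,1]),fill=60) (iris3 is R's 3-dimensional array form of the iris data)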
Sorting data • There are several ways to sort data in R • To sort a vector, use sort() • To sort the ages in the class, sort(class[,2]) • To sort an entire matrix or data frame by one column, use order() • To order the class by age, class[order(class[,2]),] • To get the same result in two steps • o<-order(class[,2]) • class[o,] • Try to sort auto by foreign
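The difference between sort() and order() is easiest to see on a small example. The class data below is invented; only the column order (name, age, height) follows the slides.

```r
class <- data.frame(name = c("Ann", "Bob", "Cy"),
                    age = c(25, 31, 22),
                    height = c(64, 70, 68))

sort(class[, 2])          # sorted ages only: 22 25 31
o <- order(class[, 2])    # row positions that put age in order: 3 1 2
byage <- class[o, ]       # whole rows rearranged by age
oldest_first <- class[order(class[, 2], decreasing = TRUE), ]   # reverse order
```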
Missing values • What happens if there are missing values as in the auto data sets? • R codes these as NA • How can we change the NA's to 0's? • is.na(x) is a logical function that returns T for every value that is NA and F otherwise • Ans: data[is.na(data)]<-0
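A minimal sketch of the answer on an invented vector. The copy mpg0 is made here so that the original values, including the NA's, are kept.

```r
mpg <- c(22, NA, 17, NA, 9)    # invented values with two missing entries

is.na(mpg)                     # FALSE TRUE FALSE TRUE FALSE
sum(is.na(mpg))                # number of missing values: 2
mpg0 <- mpg                    # work on a copy
mpg0[is.na(mpg0)] <- 0         # replace only the missing positions
mpg0                           # 22 0 17 0 9
```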
Practice • In slide 11, several functions were mentioned including sum and min • Try these functions on mpg • What has happened? • How can we find the sum of mpg not including the missing values? • sum(mpg[!is.na(mpg)]) or sum(mpg, na.rm=T)
Practice • What is the difference when you use the function sqrt on the same data? • Why is there a difference in the effect of the missing data?
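The two practice slides can be explored with the sketch below, again on an invented vector. Functions that combine all values into one number return NA if any value is missing, while functions applied element by element leave NA only where the data was missing.

```r
mpg <- c(22, NA, 17, NA, 9)    # invented values with two missing entries

sum(mpg)                       # NA: one missing value spoils the total
min(mpg)                       # NA for the same reason
total1 <- sum(mpg[!is.na(mpg)])    # drop the missing values first: 48
total2 <- sum(mpg, na.rm = TRUE)   # or ask sum() to remove them: 48
roots <- sqrt(mpg)             # elementwise: NA only in positions 2 and 4
```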
Loops and conditionals • Conditional • if (expr) expr • if (expr) expr else expr • Iteration • repeat expr • while (expr) expr • for (name in expr1) expr • For comparisons use: • == for equal • != for not equal • > for greater than (< for less than) • && for and • || for or • The single forms & and | work element by element on vectors
Examples • What happens with the following code? • if (value==1) {check<-1} else {check<-0} • counter<-0; for (i in 1:10) { if (auto[i,3]<10){counter<-counter+1} }
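A sketch of the counting loop on an invented vector. One assumption added here is the is.na() check: if() stops with an error when its condition is NA, so missing values have to be handled before the comparison.

```r
value <- 7
if (value == 1) { check <- 1 } else { check <- 0 }   # check becomes 0

mpg <- c(22, 17, NA, 9, 8)     # invented values, one missing
counter <- 0
for (i in seq_along(mpg)) {
  # test for NA first so that the comparison is never NA inside if()
  if (!is.na(mpg[i]) && mpg[i] < 10) {
    counter <- counter + 1
  }
}
counter                            # 2
vec_count <- sum(mpg < 10, na.rm = TRUE)   # same count without a loop
```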
Basic R functions • Let’s use the class data set • Find the summary statistics of the data • summary(class) • Notice how names are handled compared to ages and heights • What happens when we type range(class) • How can we find the range of age and height using this command?
Functions • You can define a function to complete any operation • out<-function(var){definition} • Let's look at this function: filter <- function(x){ if (is.na(sum(x))){fil<-F} else {fil<-T} }; filt <- apply(auto, 1, filter); newauto<-auto[filt,] • What is this function doing?
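The sketch below runs the same idea on a small invented data frame with numeric columns only, since sum() needs numbers. The function name keep_complete is chosen here for illustration; it returns TRUE for a row with no missing values, and base R's complete.cases() gives the same answer.

```r
auto <- data.frame(price = c(4099, NA, 3799),
                   mpg = c(22, 17, NA))     # invented, numeric columns only

# TRUE when the row sums to a number, i.e. no value in the row is NA
keep_complete <- function(x) {
  !is.na(sum(x))
}

filt <- apply(auto, 1, keep_complete)   # apply the function to each row
newauto <- auto[filt, ]                 # keep only the complete rows
same <- all(filt == complete.cases(auto))   # built-in check gives the same rows
```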
Practice • Write a function to calculate the sum of the numbers in a vector greater than ten and the sum of the numbers less than or equal to ten • Output the answer in a list using list(ans1,ans2) at the end of your function • Try your function on the mpg data from the auto data set • Look at the output when you apply your function
Generating data from distributions • Many applications require generating data from specific distributions • r<distname>(n,<parameters>) • Possible distributions: beta, cauchy, chisq, f, gamma, norm, t, unif • You can find other characteristics of distributions as well • d<dist>(x,<parameters>): density at x • p<dist>(x,<parameters>): cumulative distribution function at x • q<dist>(p,<parameters>): inverse cdf
Using the distribution functions • Often when we use simulations we need to use the r functions • Try sample<-runif(n=10,min=0,max=1) • What happens here?
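A short sketch of the four function families for the uniform and normal distributions. The call to set.seed() is an addition here so that the random draws can be repeated, and the result is stored in u rather than sample to avoid hiding R's own sample() function.

```r
set.seed(1)                            # makes the random draws repeatable
u <- runif(n = 10, min = 0, max = 1)   # 10 random values between 0 and 1
z <- rnorm(5, mean = 0, sd = 1)        # 5 random standard normal values

dnorm(0)              # density of the standard normal at 0, about 0.399
pnorm(1.96)           # probability of a value below 1.96, about 0.975
qnorm(0.975)          # inverse cdf, about 1.96
qnorm(pnorm(1.3))     # q undoes p: 1.3
```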
Plots • One of the biggest advantages of R is the quality of the plots • Let’s plot the ages of the class • To make plots in R, use the following commands for the appropriate plots • age<-class[,2] • histogram- hist(age) • box plot- boxplot(age)
Plot Command • plot() is the basic command for producing a scatter plot or line graph • col= set colors, lty= set line types, lwd= set line widths, pch= set the plotting character, type= pick points (type="p") or lines (type="l"), cex= set the "character expansion", xlab= and ylab= set the axis labels, xlim= and ylim= set the limits of the axes, main= put a title on the plot, sub= add a sub-title (mtext() is a separate function that writes text in a margin) • See help(par) for details
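The arguments can be combined in one call, as in the sketch below. The data, colors and labels are invented for illustration.

```r
x <- 1:10
y <- x^2

# type = "b" draws both points and lines
plot(x, y, type = "b", col = "blue", pch = 16, lty = 2, lwd = 2, cex = 1.2,
     xlab = "x", ylab = "x squared", xlim = c(0, 11), ylim = c(0, 110),
     main = "Main title", sub = "Sub-title")
mtext("Text in the top margin", side = 3, line = 0.5)   # margin text, added after plot()
abline(h = 50, lty = 3)                                  # horizontal reference line
```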
One-Dimensional Plots • barplot(height) #simple form • barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE) • boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE) • hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)
Two-Dimensional Plots • lines(x, y, type="l") • points(x, y, type="p") • matplot(x, y, type="p", lty=1:5, pch=, col=1:4) • matpoints(x, y, type="p", lty=1:5, pch=, col=1:4) • matlines(x, y, type="l", lty=1:5, pch=, col=1:4) • plot(x, y, type="p", log="") • abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=) • qqplot(x, y, plot=TRUE) • qqnorm(x, datax=FALSE, plot=TRUE)
Three-Dimensional Plots • contour(x, y, z, v, nint=5, add=FALSE, labex) • interp(x, y, z, xo, yo, ncp=0, extrap=FALSE) • persp(z, eye=c(-6,-8,5), ar=1)
Multiple Plots Per Page • par(mfrow=c(nrow, ncol), oma=c(0, 0, 4, 0)) • mfrow=c(m,n) : subsequent figures will be drawn row-by-row in an m by n matrix on the page. • oma=c(xbot,xlef,xtop,xrig): outer margin lines of text. • mtext(side=3, line=0, cex=2, outer=T, "This is an Overall Title For the Page") • Try this code on your own • par(mfrow=c(2,1)) • hist(age) • plot(class[,2],class[,3])
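A self-contained version of this exercise is sketched below with invented ages and heights. Saving the old settings in old and restoring them at the end is an addition here, so that later plots go back to one figure per page.

```r
age <- c(25, 31, 22, 28, 35, 27)        # invented class data
height <- c(64, 70, 68, 66, 72, 69)

old <- par(mfrow = c(2, 1), oma = c(0, 0, 4, 0))   # 2 rows, 1 column, room for a page title
hist(age)                                          # top panel
plot(age, height)                                  # bottom panel
mtext("Class data", side = 3, line = 1, cex = 1.5, outer = TRUE)   # overall title
par(old)                                           # restore the previous settings
```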
Output to a postscript file • Often we want to output an R graph to a postscript file to place it into a LaTeX file or other document • To do this, we use the following code • postscript("graph1.ps"): this opens a postscript file in the current working directory • plot(regr): this plots a graph into the file • dev.off(): this closes the postscript file
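The three steps can be sketched as follows. The plotted data is invented, and the file is written to R's temporary directory here rather than the working directory.

```r
psfile <- file.path(tempdir(), "graph1.ps")   # replace with your own path

postscript(psfile)                   # open the postscript file as the graphics device
plot(1:10, (1:10)^2, main = "Example plot")   # the plot is drawn into the file
dev.off()                            # close the file; nothing is saved until this runs

file.exists(psfile)                  # check that the file was written
```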
Making plots of your own • Make the following plots • Histogram of height in the class with the appropriate labels • Scatterplot of height and age in the class using a different point • Make a postscript file with four plots of your choice from the baby dataset on one graph • Write a function to make a histogram and boxplot on one graph and use it on any of your data