Introduction to R Spring 2010 Short Course Center for Social Research
Outline Center for Social Research
Goals • Introduce the program well enough for one to understand some basics and feel comfortable trying it on their own • Focus on some typical applications with simple examples • Identify the myriad means of obtaining assistance in further learning in order to eventually become self-sufficient Center for Social Research
Basics – What is R? • R is a dialect of the S language (‘Gnu-S’) • A long past but a short history • S developed primarily by John Chambers at Bell Labs in 70s • R was initially written by Robert Gentleman and Ross Ihaka 1993; Version 1.0.0 in 2000 • From the website: “R is a language and environment for statistical computing and graphics.” • ‘Suggests a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software’ • The base install and accompanying packages provide a fully functioning statistical environment in which one may conduct any number of typical and advanced analyses • However… • There are ~2000+ (and counting) user contributed packages that provide a vast amount of enhanced functioning • See CRAN Task views Center for Social Research
Basics – What is R? • R is a true programming language, object-oriented • Objects are manipulated by functions, creating new objects which may then have more functions that can be applied to them • Objects • Can be just about anything: a single value, variable, lists, datasets • Class • Determines how a generic function (like plot and summary) will treat the object • Type of object e.g. numeric, factor, data frame, matrix etc. • Modes (numeric, logical etc.) are the default • It may be a little confusing at first, and it will take time to get used to, but in the end it can be much more efficient than other stat languages for many applications Center for Social Research
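A small sketch of how an object's class determines what a generic function such as summary() does with it. The object names (ages, groups) are invented for this illustration and do not come from the slides.

```r
# Hypothetical objects, made up for this illustration
ages   <- c(23, 35, 41, 29)                # a numeric vector
groups <- factor(c("a", "b", "a", "b"))    # a factor

class(ages)     # class of the numeric vector
class(groups)   # class of the factor

# The same generic function treats each class differently
num_summary <- summary(ages)     # minimum, quartiles, mean, maximum
fac_summary <- summary(groups)   # counts per factor level
num_summary
fac_summary
```

The point is only that summary() was called the same way both times; the class of the argument decided what came back.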
Basics – What is R?
• There are often several ways to do the same thing, depending on the objects and functions being used
• The same function may do different things for different objects
  • Example: the summary function can work on single variables, whole data sets, or specific models
  • The plot function can be used for e.g. a simple scatterplot, or have specific capabilities depending on what type of object is supplied to it
• When performing analyses, functions will create objects that in turn have several parts (more objects) associated with them that can be used further
  • See the regression ‘lm’ example below
• Type methods(plot) to see examples of what kinds of methods are available to the generic function plot
• Lines beginning with > show what you would type at the command prompt
> UScapt = "D.C."
> UScapt
[1] "D.C."
> myvar1 = c(1,2,3)
> myvar2 = 1:3
> myvar3 = seq(from = 1, to=3, by=1)
> myvar1
[1] 1 2 3
> myvar2
[1] 1 2 3
> myvar3
[1] 1 2 3
> mymod = lm(y~x, mydata)
> coef(mymod)      # shows coefficients
> summary(mymod)   # more detailed, typical regression output
> fitted(mymod)    # fitted values
> plot(mymod)      # diagnostic plots
Basics – Comparisons to Other Packages • Compared to programs such as SPSS and SAS, the language is more flexible, offers more options for analysis and graphics, and may be updated on a daily basis • It is similar to Stata in that it has a strong user community that makes their own contributions through various packages • Relative to other common statistical packages such as those mentioned above, R has a steep learning curve Center for Social Research
Getting Started - Installation • Download • www.r-project.org (or google R) • Free as in speech and beer • CRAN • Comprehensive R Archive Network • Current version 2.10.1; 2.11 scheduled for release 4-22-10 • Choose a mirror from which to download (there are many from around the globe) • Harvard, National Cancer Institute, Berkeley etc. • The base install can be used as fully functioning general purpose statistical package • Rprofile.site file for customized startup • In the R/etc folder • Can use it to preload commonly used packages for example Center for Social Research
Getting Started – Adding Functionality Through Packages
• If you install all packages available from the default repositories, R will take up several gigabytes on your hard drive
• If you would rather save the space, install packages as needed and simply copy them over to new installations, followed by an occasional update
• As you use R more, you’ll find that you use some packages very frequently, and so might choose to have them autoload on startup
  • Would not recommend more than a couple, as some may have the same function names as others
• Command line, if you already know the package name(s):
> install.packages(c("Rpack1","Rpack2"))
• Go ahead and install the MASS, car and lattice packages
Getting Started – The Workspace • The workspace is your current R working environment and includes any and every user-defined object (vectors, matrices, functions, data frames, lists) • May save all objects created as .Rdata file • This is not a data set! • When you next load the file, you will have everything you worked on before readily available with no need to rerun analyses Center for Social Research
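A minimal sketch of saving and restoring workspace objects. The object name (my_scores) and the temporary file location are assumptions made for this example; save.image() would save every object in the workspace rather than a selected one.

```r
# Hypothetical object and file location, chosen for illustration
my_scores <- c(10, 20, 30)
ws_file   <- file.path(tempdir(), "myworkspace.RData")

save(my_scores, file = ws_file)   # write the selected object to disk
rm(my_scores)                     # remove it from the current session
load(ws_file)                     # restore it from the file
my_scores                         # available again without rerunning anything
```

Note that the file stores R objects, not a rectangular data set, which is why it cannot be opened as data in another program.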
Getting Started - Console
• Not too friendly
• Command line interface similar to Stata single line input
• Still useful for easy one-liners like the hist() or summary() functions, but typically you don’t want to spend much time there
Getting Started - Writing Scripts • Avoid the console • Built in editor (top) • File/New script • Simple text editor like Notepad • Not so good for extended programming, but definitely better than the console approach • Tinn-R (bottom) • After initial setup it’s pretty easy to use, nice highlighting, and many options • Others • Emacs Speaks Statistics • Jaguar (JGR) • RWinEdt • Eclipse • Do not use MS Word and similar • Hidden characters and other font chicanery will make perfectly valid code fail Center for Social Research
A Note About Object Assignment
• When looking through texts or websites regarding R, or even these notes, you will notice objects assigned with either ‘<-’ or ‘=’
• This used to be of some concern, and you are more than welcome to browse the R-help archives for reading material regarding the distinction that you will likely regret spending time with on your death bed
• For some it is a pragmatic issue (= takes fewer keystrokes, and they don’t typically have to worry about others reading their code), but others find that <- makes code easier to read, and it technically can be used more flexibly
  • Not to mention it allows them to be snobby toward those who use the = sign
• Type ?"<-" at the console (with quotes)
• There is a bit of a difference, but it is typically not an issue except perhaps in writing some functions in particular ways. Normal usage should not result in any mishap.
• I will flip between both styles so that you are used to seeing it either way.
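One place where the two operators behave differently is inside a function call. The sketch below uses made-up names (v1, m1, m2) to show it; it is an illustration of the distinction, not a recommendation for either style.

```r
# Inside a call, <- assigns in the workspace and passes the value along
m1 <- median(v1 <- c(1, 5, 9))   # creates v1 AND hands it to median()

# Inside a call, = only matches an argument name; no object named x is created
m2 <- median(x = c(1, 5, 9))

v1   # the vector created as a side effect of the first call
m1
m2
```

At the top level of a script, `a = 3` and `a <- 3` do the same thing.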
Getting Started - Output
• Unlike some programs (e.g. SPSS), there is no designated output file/system
• However, one can look at it such that everything created within the environment is ‘output’
  • All objects are able to be examined, manipulated, exported etc.
  • Graphics can be saved as one of many typical file types, e.g. jpeg, bmp etc.
  • Results may be stored within an object, which may then be saved within the .Rdata file mentioned previously, or perhaps as a text file to be imported into another program
• Thus rather than have a separate output file full of tables, graphics etc. (which often get notably large and much of which is disregarded after a time), one has a single file that provides the ability to call up anything one has ever done in a session very easily
> load("myworkspace.Rdata")
Getting Started – Menus and Lack Thereof • R-commander (top) > library(Rcmdr) • True menu system • Has a notable variety of analyses, and many more via Rcmdr-specific add-ons and other customizations • Very useful for starting out with R as it will automatically spit out the code and makes data import easy • Various R projects • Rattle for data mining • Rkward • JGR (Jaguar), for script writing • Others available on campus • S-Plus is the proprietary version of the S language that has a look and feel like SPSS or Excel but coupled with the capabilities of the S language • SPSS with R plugin Center for Social Research
Working with the Language
• Simple calculator
> 7 * 10/2
[1] 35
• Generate some data
> X1 = rnorm(100)
> X1
[1] 0.478461962 -0.536267522 1.925856627 0.655816482 -0.243718989
[6] 0.863304660 0.240281286 -0.370047167 0.652397107 0.706349110
…
• Note that R is case sensitive
> x1
Error: object 'x1' not found
• > designates command line entry
• [] designates output by element number
• Typos will often be the culprit behind error messages, especially as you start out.
Working with the Language: Some Example Code Comparison
• Descriptive statistics
  • Stata: summarize var1
  • SAS: proc means data=mydata; var var1; run;
  • SPSS: DESCRIPTIVES VARIABLES= var1 /STATISTICS=MEAN STDDEV MIN MAX.
  • R:
> summary(var1)
• Regression
  • Stata: reg y x1 x2
  • SAS: proc reg; model y = x1 x2; run;
  • SPSS: REGRESSION /DEPENDENT y /METHOD=ENTER x1 x2. EXECUTE.
  • R:
> model1 = lm(y~x1+x2, mydata)
> summary(model1)
Working with the Language: Writing a Function
• Writing your own user-defined function can enhance the work you do or make it more efficient
• It’s always good to search for what you’re attempting first, as a relevant package may already be available, and as the code is open source, you can take existing functions and change them to meet your needs
  • No sense reinventing the wheel
• But if you can’t find it already available or you need a function specific to your data needs, R’s programming language will surely allow you to do the trick
> mymean = function(x) {
    # Return NA with a warning when the input is not numeric
    if (!is.numeric(x)) {
      warning("I pity dafoo tries to bring non-numeric jibbajabba to MY house!!")
      return(NA_real_)
    }
    return(sum(x)/length(x))
  }
> var1 = 1:5
> mymean(var1)
[1] 3
> mymean("Alf")
[1] NA
Warning message:
In mymean("Alf") : I pity dafoo tries to bring non-numeric jibbajabba to MY house!!
Working with the Language
• Looping can allow iteration over indices of vectors, matrices, lists etc.
• The first example below creates an empty vector ‘temp’ and then, for each of its five elements, draws one value from the numbers 1 through 20
• This is just for demonstration purposes, as it would have been much more efficient to simply type
> sample(1:20,5,replace=T)
• Often there are more efficient ways to do things than explicit looping
• ifelse
  • Compares the elements of two objects according to some boolean statement
  • Can return scalar or vector values for a true condition, and a different set of values for a false condition
  • The second example below provides cake for any case where x is less than y, death for all other situations
> temp <- rep(NA,5)
> for (i in 1:5){
    temp[i] <- sample(1:20,1,replace=T)
  }
> temp
[1] 12 7 1 19 1
> x <- c(1,3,5)
> y <- c(6,4,2)
> z <- ifelse(x < y, "Cake","Death")
> z
[1] "Cake" "Cake" "Death"
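To make the "more efficient ways than explicit looping" point concrete, here is a small sketch that computes the same result three ways. The task (squaring the numbers 1 to 5) and the object names are invented for illustration.

```r
# 1. Explicit loop: fill a pre-allocated vector element by element
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# 2. Vectorized arithmetic: the operation applies to every element at once
squares_vec <- (1:5)^2

# 3. sapply: apply a function to each element and collect the results
squares_apply <- sapply(1:5, function(i) i^2)

squares_loop
squares_vec
squares_apply
```

All three give the same vector; the vectorized form is usually the shortest to write and the quickest to run.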
Creation and Simulation of Data
• Create a dataset to apply new skills, test functions or future analyses
• Create factor and numeric variables easily
> library(lattice) # provides dotplot
# Create data set with factor variables of time, subject and gender
> mydata <- expand.grid(factor(1:4),factor(1:10))
> names(mydata) <- c("time", "subject")
> mydata$gender <- factor(rep(0:1, each=4), labels=c("Male", "Female"))
> mydata$y = as.numeric(mydata$time) + rep(rnorm(4,50,10),each=10) + rnorm(nrow(mydata),0,1)
> dotplot(y~time, data=mydata, xlab="Time", ylab="DV") # example plot
# Other plot variations
> dotplot(y~time, data=mydata, groups=gender, xlab="Time", ylab="DV", key = simpleKey(levels(mydata$gender)))
> dotplot(y~time|gender, data=mydata, groups=gender, xlab="Time", ylab="DV")
Creation and Simulation of Data • One can easily generate data • Simple random sample > x = sample(c("Heads", "Tails"), 5, replace=T, prob=c(.5,.5)) • Random draws from a binomial distribution > x = rbinom(5,1,.5) • Random normal deviates > x = rnorm(50, mean=0,sd=1) • See webpage code for other examples Center for Social Research
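Simulated data change on every run unless the random number generator is seeded. A short sketch, assuming an arbitrary seed value (2010) and made-up object names:

```r
set.seed(2010)        # arbitrary seed, chosen only for this illustration
draw1 <- rnorm(5)

set.seed(2010)        # resetting the seed reproduces the same draws
draw2 <- rnorm(5)

identical(draw1, draw2)

# 1000 simulated fair coin flips coded 0/1; the mean is the proportion of 1s
flips <- rbinom(1000, 1, .5)
mean(flips)
```

Setting a seed at the top of a script is a simple way to make a simulation repeatable by others.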
Reading in and Working with Data
• Importing Data
• For basic text files – read.table
> mydata <- read.table("c:/trainsacrossthesea.txt", header=TRUE, sep=",", row.names="id")
• write.table for saving as text file
• Use the foreign library for other programs
> library(foreign)
> statadat <- read.dta("c:/rangelife.dta", convert.factors = TRUE)
> spssdat <- read.spss("c:/ourwaytofall.sav", to.data.frame=TRUE)
• Reading in data from the web. Note this can work with read.dta etc.
> mydata <- read.table("http://csr.nd.edu/assets/22641/testwebdata.txt", header=T, sep="")
• Note the forward slashes in the paths (as in Unix). If you want to use the backslash, as is commonly seen in Windows, you must double each one up as \\
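A self-contained round trip through write.table and read.table, so the arguments can be tried without any of the files named above. The data frame (toy) and the temporary file are invented for this sketch.

```r
# Hypothetical data frame and temporary file, made up for illustration
toy      <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
toy_file <- file.path(tempdir(), "toy.txt")

# Save as comma-separated text, without R's automatic row names
write.table(toy, toy_file, sep = ",", row.names = FALSE)

# Read it back; the id column becomes the row names
toy_back <- read.table(toy_file, header = TRUE, sep = ",", row.names = "id")
str(toy_back)
```

Because id is used for row names, the data frame that comes back has a single column, score.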
Reading in and Working with Data • We’ll start with some data that’s included in the base install of R • state.x77 • Matrix with state level data regarding Population, Income, Illiteracy, Life Exp, Murder rate, HS Grad, # Frost days, Area > ?state.x77 • There are actually other variables related to this data set also available but as separate vectors, e.g. state.region • Using it as a data frame will allow us a little more flexibility with some functions > state2 <- data.frame(state.x77) Center for Social Research
Reading in and Working with Data • To examine the values of the data set • Type the dataset name to see the whole thing • Return the first or last parts of a vector, matrix, table, data frame or function > head(state2, 10) • First 10 lines > tail(state2, 10) • Last 10 lines • Examine the structure of the dataset > str(state2) • Using the data editor > fix(state2) • Probably best if you don’t • If you want to work with the data in that fashion use SPSS or Excel Center for Social Research
Reading in and Working with Data • Data indexing • mydata[rows,columns] • Using row/column numbers > state2[,3] • Select the 3rd column/variable > state2[14,] • Selects the 14th row/case • Using column names > state2[, "Frost"] • However, dropping by names is not straightforward > state2[, names(state2)!="Frost"] Center for Social Research
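Two other ways to drop columns by name, sketched with the state2 data frame from these slides. The result names (no_frost, no_frost2) are made up; %in% and the select argument of subset are standard base R.

```r
state2 <- data.frame(state.x77)

# Keep every column whose name is NOT in the list to drop
no_frost <- state2[, !(names(state2) %in% c("Frost", "Area"))]

# subset() accepts negated column names in its select argument
no_frost2 <- subset(state2, select = -c(Frost, Area))

names(no_frost)
names(no_frost2)
```

The %in% form extends naturally to dropping several columns at once, which the != comparison does not.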
Reading in and Working with Data
• R can have as many datasets available as desired/computationally feasible
• With the subset function, one can select specific variables or rows from particular datasets
> mysubset = subset(state2, state.region == "South")
• Alternative approach using indexing:
> mysubset = state2[state.region == "South", ]
• Again, recall that there is often more than one way to do most things in R
> state2$Income
> mysubset$Income
• Attaching a dataset will allow you to just call the variable names rather than using the $, but when working with multiple data sets of a similar nature this can be a problem
> attach(state2)
> attach(mysubset) # Note the message about masked objects
> Income
• Which will you get?
• Aside
  • Note that the dollar sign is often how you can obtain values, vectors etc. from the results of completed analyses
> mymodel$coefficients
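One way to avoid the ambiguity that attach() creates is with(), which evaluates an expression inside a single named data frame. A minimal sketch; the result names (south, mean_all, mean_south) are invented for this example.

```r
state2 <- data.frame(state.x77)
south  <- subset(state2, state.region == "South")

# Each call states explicitly which data frame Income comes from
mean_all   <- with(state2, mean(Income))
mean_south <- with(south,  mean(Income))

mean_all
mean_south
```

Nothing is attached, so there is never a question of which Income is being used.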
Reading in and Working with Data • ‘Factors’ represent categorical variables • Example: > gender <- c(rep("male",20), rep("female", 20)) > gender <- factor(gender) • Some functions may specifically require a factor or numeric variable in order to run • Suppose you have a numeric scale that actually represents category membership, you can at any time treat it as a factor > mynewdepvar = as.factor(mydepvar) • If it is a string variable you may want to recode it to numeric with labels at some point to avoid any hassles with analyses • Factors used as predictors are automatically dummy coded when used in modeling with many commonly used functions Center for Social Research
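A short sketch of turning numeric codes into a labeled factor and of the dummy coding that modeling functions apply to it. The codes and labels (low, mid, high) are assumptions invented for this example.

```r
# Hypothetical numeric codes standing in for category membership
mydepvar <- c(1, 2, 2, 3, 1, 3)

# Convert to a factor and attach readable labels
mynewdepvar <- factor(mydepvar, levels = 1:3, labels = c("low", "mid", "high"))
levels(mynewdepvar)
table(mynewdepvar)

# The design matrix shows the dummy (treatment) coding used in models:
# an intercept plus one indicator column for each non-reference level
dummies <- model.matrix(~ mynewdepvar)
colnames(dummies)
```

The first level ("low" here) serves as the reference category under R's default contrasts.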
Initial Data Exploration: Numeric • With single variables > mean(state2$Life.Exp) > sd(state2$Life.Exp) > summary(state2$Population) • For this sort of thing, I suggest the describe function from the ‘psych’ library • Apply a function over groups > tapply(state2$Frost, state.region, mean) • Alternative approaches: > by(state2$Frost, state.region,mean) • Mean number of days with minimum temperature below freezing • With the psych library, see the describe.by function • Bivariate information > cor(state2) Center for Social Research
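Two further base R options for summaries, sketched with the state2 data: aggregate() returns group means as a data frame, and sapply() applies a function to every column. Adding a region column to state2 is an assumption made for this example.

```r
state2 <- data.frame(state.x77)
state2$region <- state.region    # add region as a column (done here for illustration)

# Group means returned as a data frame rather than a named vector
frost_by_region <- aggregate(Frost ~ region, data = state2, FUN = mean)
frost_by_region

# Apply mean() to each of the eight numeric columns
col_means <- sapply(state2[, 1:8], mean)
col_means
```

The aggregate() result holds the same values as the tapply call, just arranged with one row per region.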
Initial Data Exploration: Graphical • Graphical > hist(state2$Illiteracy) > stripchart(Illiteracy~state.region, data=state2, col=rainbow(4),method="jitter") • More on graphs later Center for Social Research
Analysis Example: Regression Model • Typical statistics packages fit a model and output the results • R produces ‘model objects’ that store all the relevant information • Some commonly used model functions include • lm() Linear regression • glm() Generalized linear models • lme() Linear mixed models (in nlme package) Center for Social Research
Analysis Example: Regression Model > mod1 <- lm(Life.Exp~Frost, data=state2) > summary(mod1) • Note the model objects now available • mod1$res, $fitted, $coef etc. • You don’t do anything special to obtain these, they are automatically saved and available as separate objects for further processing • This basic lm approach above applies to an enormous array of other model functions, and many of the same types of model objects would be available • Add variables > mod2 <- update(mod1, ~. + Population + Illiteracy) • Drop variables > mod3 <- update(mod2,~. - Population) • Model Comparison > anova(mod1,mod2) Center for Social Research
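The sketch below reruns the models from this slide and inspects what the model objects contain. It uses only base R functions; the name comparison is made up for the example.

```r
state2 <- data.frame(state.x77)
mod1 <- lm(Life.Exp ~ Frost, data = state2)
mod2 <- update(mod1, ~ . + Population + Illiteracy)

names(mod1)             # the components stored in the model object
coef(mod2)              # extractor functions work the same across many model types
head(residuals(mod1))   # first few residuals

# Compare the nested models; the result is itself an object that can be stored
comparison <- anova(mod1, mod2)
comparison
```

Extractor functions such as coef() and residuals() are generally a safer habit than $, because they keep working when the internal layout of a model object differs.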
Getting Started with Graphics Center for Social Research
Creating Graphs
• Creating graphs in R allows you fine control over all aspects of the graph, and you have the ability to create extremely impressive images
• The typical statistical plots are available, but most have enhanced versions available in other packages
• Use generic functions such as plot() followed by lines(), points(), legend(), text(), and others to add further information or changes
• The generic plot function can be used after many analyses, such that using it on a model object will bring up plot(s) specific to the analysis
> mymod = lm(y~x, mydata)
> plot(mymod)
• In other cases, there are functions for specific types of graphs
  • boxplot()
  • hist()
  • barplot()
> ?plot
> ?par
Graphics Example
• Start simply
> x <- rnorm(100)
> hist(x)
> boxplot(x)
• Using Cars93 from the MASS package
> library(MASS)
• Add a little to it
> hist(Cars93$MPG.highway, col="papayawhip", main="Distance per Gallon 1993", xlab="Highway MPG")
• Add a little more
> par(mfrow=c(1,2))
> hist(Cars93$MPG.highway, col="papayawhip", prob = T)
> lines(density(Cars93$MPG.highway), col="lightblue3")
> boxplot(MPG.highway~Origin, col="aliceblue", data=Cars93)
> par(mfrow=c(1,1))
Graphics Example • Some typically used plots come with extras by default • Example: the scatterplot function in the car library automatically provides a smoothed loess line and marginal boxplots > library(car) > scatterplot(prestige ~ income|type, data=Prestige) > scatterplot(vocabulary ~ education, jitter=list(x=1, y=1), data=Vocab) Center for Social Research
More Possibilities Center for Social Research
Graphics • With great graphing power comes great responsibility! • Just a reminder that the goal of scientific graphing is not to be fancy but to tell the story in a clear manner 3-d bar charts make R Bunny cry! Center for Social Research
Getting Help: Basics • ? and ?? • Same as help and findit in Stata • Example: ?hist ??scatterplot • Typing help.start() at the console will bring up manuals etc. • Note that while there are help files for everything, not all have the same amount of info • For example: some functions/packages come with vignettes, links to reference articles etc. Others give only the minimal information needed to use the function. • The Rseek search engine addon for Firefox and IE and others is really nice Center for Social Research
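When you only half remember a function name, apropos() searches the names of available objects, and args() shows a function's arguments without opening the full help page. A small sketch; the pattern and the name read_funs are chosen for illustration.

```r
# Names of objects on the search path that start with "read."
read_funs <- apropos("^read\\.")
read_funs

# Quick look at a function's arguments and their defaults
args(read.table)
```

From there, ?read.table gives the full help page.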
A Plan of Attack Center for Social Research
Issues in Learning and Using R
• It can be more difficult to learn, though this depends in part on prior experiences with other programs
• The actual help in help files can vary quite a bit
  • They often assume the knowledge one might in fact be looking for, and may simply provide syntax
  • However, there are many avenues for assistance that are extremely helpful (upcoming slides)
• As with other stat programs, cryptic error messages can leave one guessing whether there is a data, code or estimation problem
• Can be slower than other programming languages like C
  • But not noticeably so for most computers and typical usage
  • Gustaf Rydevik: The author also has some thought-provoking opinions on R being no-good and that you should write everything in C.
  • Paul Gilbert: People used to say assembler, that's progress.
  • -- Gustaf Rydevik and Paul Gilbert (in a discussion about an ‘R is slow’ blog post)
• Can be relatively slow with huge data sets (hundreds of thousands of cases)
  • But not for typical ‘large’ sizes, and there are packages and proprietary versions (REvolution) that can help optimize processing of large amounts of data
So Why R? • It’s awesome • Really, with all libraries included other statistical packages do not have as much functionality or flexibility • Great graphics • Get precisely what you want and how you want it • It’s free • Many bad puns are readily available • R’nt you glad you know more about R? • All data analyses should receive an R rating • Possible combinations with pirate-speak Center for Social Research
Why not R? No Real Reason. • Unlike other general statistical packages in which the vast majority of their capabilities can be found in other packages of the same type, R at the very least can supplement any primarily used statistical package • Furthermore, many packages such as SPSS and SAS now are able to integrate R into their regular syntax • “R is a leading language for developing new statistical methods.” • Bob Rodriguez, Senior Director of Statistical Development at SAS • “We see R as complementary more than competitive to SPSS Statistics. R is a powerful and flexible programming language for statistics, and there are many, many procedures (packages of functions) for statistics and graphics in it.” • Jon Peck, Principal Software Engineer and Technical Advisor at SPSS • SPSS syntax example: Begin program R. hist(rnorm(100)) End program. • Did we mention it’s free? Center for Social Research
Snips from the R Lists • [the effort of learning how to use R] is the price paid, just as the dollars or euros for a commercial package would be. For that price, I've learnt a great deal - and not only about R… It's a huge, awe-inspiring package - easier to perceive as such because the power is not hidden beneath a cosmetic veneer. • Felix Grant (in an article about free statistics software) • Scientific Computing World (November 2004) • Statistical computing is not easy, so how could R be? Who has ever claimed it is? Any package that makes statistical computing appear to be easy is probably giving you wrong answers half the time, or is extremely limited in scope. • Duncan Murdoch • D. Peng: I don't think anyone actually believes that R is designed to make *everyone* happy. For me, R does about 99% of the things I need to do, but sadly, when I need to order a pizza, I still have to pick up the telephone. • Douglas Bates: There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it's only a matter of time before you will have a pizza-ordering function available. Center for Social Research
Places to Start • Main R website • CRAN Task Views • http://cran.r-project.org/web/views/ • Useful packages grouped by discipline/subject • R forge development site http://r-forge.r-project.org/ • Most recent versions and unreleased packages • Quick-R website: www.statmethods.net • Particularly geared toward SAS and SPSS users but useful to anyone • Manuals • List at the R website • also accessible via help menu • An Introduction to R: main R manual • Simple R • Fundamentals: from R core developer Thomas Lumley • Reference Card Center for Social Research
Places to Start
• Graphics
  • Addicted to R
  • Paul Murrell’s website (another core developer)
  • Lattice vignette
• UCLA’s Statistical Computing group
  • http://www.ats.ucla.edu/stat/r/
  • Great resource for all general packages
• http://www.r-bloggers.com/
• http://www.crantastic.org
  • Notes new packages and updates
• R for SAS and SPSS Users
  • Actually a good way to learn all three (plus S)
• Commercial versions
  • S-Plus
  • R+
  • R stat
  • REvolution
Some Useful General Packages
• Analysis
  • MASS
    • As in Venables and Ripley’s Modern Applied Statistics with S
  • Frank Harrell’s packages
    • Hmisc, rms (as in the text Regression Modeling Strategies)
  • car, Rcmdr
    • John Fox
  • Zelig
    • Gary King’s ‘one function fits all’ approach
• Graphics
  • lattice
  • rggobi
  • ggplot2
List of R Functions and Packages Used in These Slides • Functions • Dealing with data • read.table • attach, detach • subset • factor • data.frame • names • Analysis and other • cor • lm, update • summary • tapply • rnorm • sample • Graphical • hist • plot • xyplot • par • dotplot • stripchart • scatterplot • Packages • MASS • car • lattice Center for Social Research
Summary Center for Social Research