Introduction to R

Workshop Plan
• The R interface
• The Console
• The Script Editor
• The “Workspace”
• R programming rules…
• How does R ‘think’
• R Objects
• The data frame
• Importing Data
• Data Manipulation
• Simple Analyses
J. Charles Victor – Intro to R

What is R?
• Programming environment
• Useful for statistics, with powerful graphing capabilities
• But you will be programming, not pointing and clicking
• Free, ‘open’ software
• Users create programs which are made available to other users via the web and an installation interface
• Based on the S and S-Plus programming languages

First Step
• Open R…

The R Console
• The main window
• Commands are written and submitted
• Log of progress recorded
• Output (except graphs) produced
• Similar to the Stata interface in look and function
• The prompt ‘>’ indicates R is waiting for a command
• Try the following:
    > x <- c(1,2,3,4,5) [ENTER]
    > mean(x) [ENTER]
• You should find the following result:
    [1] 3
• R is telling you the mean of (1,2,3,4,5) is 3

The Script Editor
• Accessible from the File menu item
• Used to create a series of commands (i.e. a program) that can be saved and run at a later date
• Similar to the DO editor in Stata
• Will make SAS and SPSS syntax users more comfortable
• Write commands, highlight them and click on the submit button
• Try opening the Script editor (‘New Script’) and repeating the same commands as before (note the lowercase x: R is case-sensitive)
    x <- c(1,2,3,4,5)
    mean(x)
• Now highlight this code and click on the submit button

Script Editor
• Nothing fancy, but VERY useful
• Saves programs

The Workspace
• The ‘Workspace’ is the R data and objects
• When exiting R, saving the workspace saves your data and work
• Let’s see our work thus far
• Type: ls()
• What do you see?
• Try saving your work thus far
• File -> Save Workspace

R Programming
• General Programming
• R is case-sensitive
• Character strings must be in quotes (double " " or single ' ')
• Hitting ENTER submits a command
• If a command is incomplete when you hit ENTER, R shows a ‘+’ prompt on the next line and waits for the rest (you do not type the ‘+’ yourself)
• Try the following:
    > newy <- c(0,0,1,
    + 1,0)
• Use ‘comments’ to identify what you have done
• Comments begin with “#”

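As a rough illustration of these rules in a script, here is a minimal sketch. The object names (score, Score, greeting) are made up for this example and are not part of the workshop data.

```r
# R is case-sensitive: 'score' and 'Score' are two different objects
score <- 10
Score <- 99
score == Score        # FALSE

# Character strings need quotes
greeting <- "hello"

# An incomplete command simply continues on the next line
# (at the console R would show a '+' prompt here)
newy <- c(0, 0, 1,
          1, 0)
length(newy)          # 5
```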
How does R think?
• R thinks of data elements as ‘objects’
• Objects can be:
• Single variables
• Arrays of variables
• Entire datasets
• Results from analyses (if saved as an object)
• When you save the ‘Workspace’ you save all of these objects
• So in a small sense, R works like Excel.

OK, I don’t understand this Object thing…
• For data analysts it is usually easiest to start by treating the term ‘object’ as meaning ‘variable’
• We have already created one variable called ‘x’
• We can create another variable (object) called ‘y’ that has the values (20, 27, 18, 50, 99)
    > y <- c(20,27,18,50,99)
• To see all of the variables (objects) in memory we can use the ‘list’ command
    ls()
• Or click on MISC -> LIST OBJECTS
• What do you see?

Creating Data
• How? No Spreadsheet??
• Create your own
• Class of 5 students, need average test score

    First name   Last name    Mark   Sex
    John         Smith        58     M
    Jaysharee    Singh        82     F
    Emily        Xu           90     F
    Ute          VanDroglen   65     F
    Charles      Victor       90     M

First Attempt to Enter Data
• Many ways to create and edit data in R
• First create variables (objects)
• Then compile the data set from the variables
• Creating variables: 2 main ways
• 1) Relatively few values
    VARIABLENAME <- c(VALUE1,VALUE2,VALUE3….)
• Character values in quotes
    z <- c("ABC","DEF")
• 2) Many values
    VARIABLENAME <- scan() ENTER
    VALUE1 VALUE2 VALUE3 VALUE4 …. VALUE8 ENTER
    VALUE9 VALUE10 ….. ENTER
    ENTER
    >

Try entering this data
• Use method 1 for first name, last name and sex
• Use method 2 for exam mark

    John         Smith        58
    Jaysharee    Singh        82
    Emily        Xu           90
    Ute          VanDroglen   65
    Charles      Victor       90

• After you create each variable, look at the variable to see that it is correct by typing the variable name at the command prompt
    > firstname

A few notes on entering values
1) Variable names can contain letters, digits, ‘.’ and ‘_’, and must not start with a digit
2) Missing values should be coded as NA
3) To create a variable whose values are a sequential list of numbers, use a colon (:)
    studentid <- c(1:5)

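The notes on entering values can be sketched as follows. This is only an illustration: the missing mark is invented here to show how NA behaves and is not part of the class data.

```r
# Vectors built with c(); character values go in quotes
firstname <- c("John", "Jaysharee", "Emily", "Ute", "Charles")

# NA marks a missing value (the third mark is made missing for illustration)
mark <- c(58, 82, NA, 65, 90)

# The colon builds the sequence 1, 2, 3, 4, 5
studentid <- 1:5

mean(mark)                 # NA, because one value is missing
mean(mark, na.rm = TRUE)   # 73.75, the mean of the four known marks
```

Most summary functions return NA when any value is missing unless told otherwise, which is why the na.rm argument matters once real data are involved.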
Creating the Dataset
• Currently we just have 5 variables (objects)
• These objects are independent of each other (i.e. the first name John is not linked with the last name Smith)
• To ‘link’ these objects we need to compile these variables together in a dataset, which R calls a ‘data frame’
• In R a data frame is an object just like a variable, and thus it is created in a similar fashion
    DATA_NAME <- data.frame(VARIABLE1,VARIABLE2,VARIABLE3)
• Note: All variables must have the same number of observations
• Now take a look at the data by typing the dataset name

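Putting the pieces together for the class of five students might look like the sketch below. The dataset name class follows the name used later in the workshop; the variable names are the ones suggested so far.

```r
firstname <- c("John", "Jaysharee", "Emily", "Ute", "Charles")
lastname  <- c("Smith", "Singh", "Xu", "VanDroglen", "Victor")
sex       <- c("M", "F", "F", "F", "M")
mark      <- c(58, 82, 90, 65, 90)
studentid <- 1:5

# Compile the five variables into one data frame
class <- data.frame(studentid, firstname, lastname, mark, sex)

class          # print the dataset
dim(class)     # 5 rows, 5 columns
mean(mark)     # 77, the average test score
```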
Back to ‘Objects’
• Look at the objects now in memory
• ls() or click MISC -> List objects
• You should see all of the variables + the dataset
• You can now use the dataset similarly to how we have used variables
• To see a variable, type the variable name
• To see the dataset, type the dataset name

BUT…
• Once compiled into a dataset, the variables (studentid, firstname, lastname, mark, sex) are separate copies, distinct from the ‘objects’ of the same name in R’s memory
• So we have
• The object: mark
• The variable mark on the class dataset
• You may want to get rid of the ‘objects’ now that you have compiled them onto the dataset (any changes made to the objects will not be reflected on the dataset)
    rm(studentid, firstname, lastname, mark, sex)

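A small sketch of why the object and the dataset variable are two separate things. The dataset name exams is hypothetical and used only for this illustration.

```r
mark  <- c(58, 82, 90)
exams <- data.frame(mark)   # the data frame receives its own copy of mark

mark[1] <- 0                # change the free-standing object only
exams$mark[1]               # still 58: the dataset is untouched

rm(mark)                    # remove the object
exams$mark                  # the dataset variable is still there
```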
Importing Existing Data into R
• R has not been very friendly to foreign data formats
• But this is changing rapidly
• Optimally datasets need to be in the form of:
• ASCII text
• Tab delimited
• Comma delimited
• Best to convert Excel data into one of these formats

Importing: ASCII text
• Use command: read.table
    OBJECT <- read.table("C:\\My Document\\FILE.TXT", header=TRUE)
• Note: In Windows paths each backslash must be doubled (\\); a single forward slash (/) also works
• If variable names are on the first row, use the header=TRUE option
• Otherwise variables will be named V1 V2 V3…
• Try to import the heart_rx dataset
• If you are unsure of the path you can nest the command file.choose() inside read.table
• This will cause R to bring up a GUI to choose your file
    OBJECT <- read.table(file.choose(), header=FALSE)
• Try to import the heart_rx_noheader dataset this way

Importing: Tab Delimited or Comma Separated or Database File
• Tab Delimited
• Use command: read.delim
    OBJECT <- read.delim("C:\\My Document\\FILE.TXT", header=TRUE, sep="\t")
• Comma Separated Value (CSV)
• Use command: read.csv
    OBJECT <- read.csv("C:\\My Document\\FILE.CSV", header=TRUE, sep=",")

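To try read.csv without depending on a particular folder, the sketch below first writes a tiny CSV to a temporary file. The column names (id, drug, rate) and values are invented for illustration and are not the contents of the heart_rx dataset.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,drug,rate",
             "1,A,72",
             "2,B,80"), csv_path)

# First row holds the variable names
heart <- read.csv(csv_path, header = TRUE)
heart
names(heart)     # "id" "drug" "rate"

# Same file read without a header: the names row becomes data
# and the columns are called V1, V2, V3
raw <- read.csv(csv_path, header = FALSE)
names(raw)
```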
Importing: Access, SPSS, Stata etc
• Best method: 3rd party software to convert data to a delimited or CSV file
• DBMS Copy is very popular
• Stat Transfer is very good
• Some users have created
• read.spss
• read.xport (for SAS files)
• read.dta (for Stata files)
• But these commands need to be downloaded and installed (more on that later)

Importing: R Dataset
• If a workspace has been saved from a previous session, simply load the workspace by ‘pointing and clicking’
• Or use the load command
    load("PATHWAY\\FILENAME.Rdata")

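A minimal sketch of saving objects and loading them back. It uses a temporary file instead of a real path, and the save() call is the console counterpart assumed here for the File -> Save Workspace menu item.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(20, 27, 18, 50, 99)

# Save the two objects to a temporary .RData file
# (save.image(file = ...) would save every object in the workspace)
rdata_path <- tempfile(fileext = ".RData")
save(x, y, file = rdata_path)

rm(x, y)            # remove them from memory
load(rdata_path)    # x and y are back
mean(y)             # 42.8
```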
Creating a Dataset from a Dataset
• Creating a copy of a current dataset is simple in R
• Simply create a new object (i.e. with a different name) from the existing dataset
    NEWDATA <- OLDDATA
• To create a new dataset from an edited version of an old dataset
    NEWDATA <- edit(OLDDATA)
• This will bring up the data editor (more on this later), and any changes will be applied to NEWDATA, but not to OLDDATA

DATA MANIPULATION 99% of the work (don’t underestimate)
Data Manipulation: General
• Most of your time should be spent in this phase
• R is probably not the ‘best’ package
• Data manipulation includes (among other things)
• Renaming variables
• Getting rid of variables
• Creating variables
• Changing variables (e.g. categorising age)
• Changing values of specific observations (e.g. someone reports age of 180)
• Getting rid of observations
• Merging datasets

A couple of things first….
• R has MANY ways of accomplishing similar tasks due to its open software construction
• When referring to variables on a dataset you must either:
• Use: d_name$v_name OR
• “Attach” the dataset
    attach(d_name)
• But attaching the dataset only allows the use of the dataset variables, not their manipulation

What is he talking about??
• Let’s create a new dataset with two variables x and y
• x will be the numbers 1 to 20
• y will be 20 random values from a normal distribution
    x <- c(1:20)
    y <- rnorm(20)
    testdata <- data.frame(x,y)
• Remove the x and y objects
    rm(x,y)
• Print the dataset, and then x and y
    testdata
    x
    y
• Notice we could not access x and y this way. Try:
    testdata$x
    testdata$y
• That worked, but is a lot of typing. So we could also:
    attach(testdata)
    x
    y
• That worked too! So attaching a dataset allows us to access the variables on the dataset without using the $ format, but only for viewing and analysing, not editing (so I don’t like to do it)

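The limitation of attach can be seen in a short sketch. The doubling of x is an arbitrary edit chosen for illustration, and detach() is used here to undo the attach.

```r
testdata <- data.frame(x = 1:20, y = rnorm(20))

attach(testdata)
mean(x)             # 10.5: x is found inside the attached dataset
x <- x * 2          # this creates a NEW object x in the workspace...
testdata$x[1:3]     # ...the dataset itself still holds 1 2 3
detach(testdata)
rm(x)

# Editing a dataset variable needs the $ form
testdata$x <- testdata$x * 2
testdata$x[1:3]     # 2 4 6
```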
Renaming Variables
• Occasionally we need to rename a variable
• Many ways
• We can edit the data like a spreadsheet
    fix(d_name)
• Or create a copy of the class dataset and edit the copy
    NEWDATA <- edit(d_name)
• OR we can create a new variable from the old one (the old variable stays on the dataset until it is deleted)
    d_name$new_v_name <- d_name$old_v_name

Deleting and Creating Variables
• To delete a variable, set it to NULL
    d_name$v_name <- NULL
• To create a variable, just set the new variable equal to some value, using a similar construct as before
    d_name$v_name <- SOME_VALUE OR EXPRESSION

Creating Variables
• Suppose we want a variable identifying the day the exam was written and a variable identifying the maximum value for the exam
    class$test_day <- c("Monday")
    class$test_max <- c(100)

Creating Variables
• We can also create variables based on other variables
• Imagine that we now want to calculate each student’s percentage on the exam
    d_name$newv_name <- expression
• For example:
    class$prct <- (class$score/class$test_max)*100
• Remember the rules of BEDMAS

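The create and delete steps from the last few slides, run on a two-student stand-in for the class dataset. The small data frame below is made up so the example runs on its own.

```r
class <- data.frame(firstname = c("John", "Emily"),
                    score     = c(58, 90))

# A single value is repeated for every observation
class$test_day <- "Monday"
class$test_max <- 100

# A new variable computed from existing ones
class$prct <- (class$score / class$test_max) * 100

# Setting a variable to NULL deletes it
class$test_day <- NULL

names(class)   # "firstname" "score" "test_max" "prct"
class
```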
A Note on Mathematical Functions
• + = addition
• - = subtraction
• * = multiplication
• / = division
• ( ) = brackets
• ^ = to the exponent (** is also accepted)
• abs( x ) = absolute value of x
• trunc( x ) = integer part of x
• log( x ) = natural log of x (i.e. Ln to non-math types)
• log10( x ) = log base 10 of x (i.e. Log to non-math types)
• sqrt( x ) = square root of x
• round( x, value) = round x to value decimals

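A few of these operators and functions evaluated on made-up numbers, as a quick check of what each one returns:

```r
2^3                 # 8, exponent
(2 + 3) * 4         # 20, brackets are evaluated first
abs(-4.2)           # 4.2
trunc(7.9)          # 7, drops the decimal part
log(exp(1))         # 1, natural log
log10(1000)         # 3
sqrt(49)            # 7
round(3.14159, 2)   # 3.14
```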
Changing Variables
• Let’s change the existing prct variable into letter grades
• Map out which letter grades apply to which percents
• Below 50 = F
• 50 – 59 = D
• 60 – 69 = C
• 70 – 79 = B
• 80 – 100 = A

Changing Variables - Recoding
• Two ways
• 1) Only for numeric variables
• Using Base R
• cut function
    d_name$new_v_name <- cut(d_name$old_v_name,
        breaks = c(breakpoints) OR breaks = number of intervals,
        labels = c("LABEL1", "LABEL2", …) )
• EG (each breakpoint is the upper limit of a grade, so the limits are 49, 59, 69, 79 and 100)
    class$lettergrd <- cut(class$prct,
        breaks = c(-Inf,49,59,69,79,100),
        labels = c("F","D","C","B","A") )

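A runnable sketch of the cut approach on a handful of made-up percentages, using the grade boundaries mapped out earlier:

```r
prct <- c(45, 58, 65, 82, 90, 100)

# Each interval is open on the left and closed on the right,
# e.g. (49, 59] becomes "D"
lettergrd <- cut(prct,
                 breaks = c(-Inf, 49, 59, 69, 79, 100),
                 labels = c("F", "D", "C", "B", "A"))

data.frame(prct, lettergrd)
table(lettergrd)   # counts per grade
```

Because the intervals are closed on the right, a fractional value such as 49.5 would land in the D group with these breakpoints; if percentages are not whole numbers, the breakpoints may need adjusting.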
Recoding variables: Second Method
• There is a “recode” function, but it has been developed outside of the original Base R
• We can incorporate programs that have been written by other people
• Often these programs are compiled into a group of programs that are used for a similar purpose
• These groups of programs are called “Packages”

Installing a Package (to get a function that you do not have)
• First, note that you do not have ‘recode’
    help(recode)
• Now (after searching Google) you find out that a function called ‘recode’ is available in the package called ‘car’
• Click PACKAGES -> INSTALL PACKAGE(S)
• R will ask you to set a CRAN Mirror (site from which to download packages)
• Choose CANADA (ON)
• R will now ask which package you want to download
• Choose “car”
• R will now download the ‘car’ package
• BUT the car package has only been installed; it has not yet been loaded
• Click PACKAGES -> LOAD PACKAGE(S)
• R will ask which package to load from all that you have installed
• Choose “car”
• The same two steps can be typed at the console: install.packages("car") and then library(car)
• You can now use the recode function
• Type help(recode)

Recoding: Second Method
• Now that the ‘car’ package is installed and loaded, we can use ‘recode’
    d_name$new_v_name <- recode(d_name$old_v_name, recodes)
• Where recodes can be in the form of:
    specific values: "c(99,999) = NA; c(1)='Y' "
    range of values: "lo:49='F'; 50:59='D' "
• For example:
    class$lettergrd2 <- recode(class$prct, "lo:49='F'; 50:59='D';…..")

Combining Conditional Statements to Change Values within Observations
• Your TA informs you that Jim Smith was sick for the Monday exam; instead he was given a makeup exam, out of 98
• To identify observations using conditional statements, we use the R function ifelse
    ifelse(condition/expression, value if true, value if false)
• For example:
    class$test_max <- ifelse(class$firstname == 'Jim' & class$lastname == 'Smith', 98, class$test_max)

More complex…
• You are then informed that the twins (Joan and John Smith) cheated, so you have to give them zeros:
    class$score <- ifelse((class$firstname == 'Joan' | class$firstname == 'John') & class$lastname == 'Smith', 0, class$score)

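Logical Statements
• < = Less than
• <= = Less than or equal to
• > = Greater than
• >= = Greater than or equal to
• != = Not equal to
• == = Equal to
• & = Intersection (AND) boolean operator, applied to every observation of a variable
• | = Union (OR) boolean operator, applied to every observation of a variable
• && and || = AND and OR for a single TRUE/FALSE value only (inside ifelse use & and |)
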
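The conditional edit for the twins can be tried on a three-student stand-in dataset. The data frame below is invented so the example runs on its own, and the intermediate name cheated is hypothetical.

```r
class <- data.frame(
  firstname = c("John", "Joan", "Emily"),
  lastname  = c("Smith", "Smith", "Xu"),
  score     = c(58, 61, 90)
)

# One TRUE/FALSE per observation: | is OR, & is AND
cheated <- (class$firstname == "Joan" | class$firstname == "John") &
           class$lastname == "Smith"
cheated                      # TRUE TRUE FALSE

# Where the condition is TRUE use 0, otherwise keep the old score
class$score <- ifelse(cheated, 0, class$score)
class$score                  # 0 0 90
```

Storing the condition in its own object first makes it easy to check which observations will be changed before overwriting anything.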
Deleting Observations (or Subsetting)
• Suppose we want to look at only the female students
• We need to either delete the males or keep the females
• It is best to create a new dataset with only females rather than deleting observations from our original dataset
• Many ways; here we use the subset command
    new_d_name <- subset(old_d_name, condition, select=variables wanted)

    Females <- subset(class, class$sex == "F")
• Note, we can also select out certain variables only
    Males <- subset(class, class$sex == "M", select=c(firstname,lastname,lettergrd) )

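A self-contained sketch of subset on a cut-down version of the class data. Inside subset the variables can be named directly, without the class$ prefix.

```r
class <- data.frame(
  firstname = c("John", "Jaysharee", "Emily", "Ute", "Charles"),
  sex       = c("M", "F", "F", "F", "M"),
  mark      = c(58, 82, 90, 65, 90)
)

# Keep only the observations that meet the condition
females <- subset(class, sex == "F")

# Keep only some observations AND only some variables
males <- subset(class, sex == "M", select = c(firstname, mark))

nrow(females)   # 3
names(males)    # "firstname" "mark"
nrow(class)     # 5: the original dataset is unchanged
```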
Data Merge
• Two important types of merge
• Concatenation
• Adding new observations to a set of old observations
• Matched merge
• Adding new variables (values) to an existing dataset with the same observations
• (e.g. we need to add mid-term marks to our exam database)

Concatenation
• Easy
• Use the rbind function, and add all datasets
    new_d_name <- rbind(d_name1, d_name2,…)
• But all datasets must have the same number (and names) of variables!

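A minimal sketch of concatenation with two made-up class sections that share the same variables:

```r
section1 <- data.frame(id = 1:2, mark = c(58, 82))
section2 <- data.frame(id = 3:4, mark = c(90, 65))

# Stack the observations of both datasets
all_sections <- rbind(section1, section2)

all_sections
nrow(all_sections)   # 4
```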
Matched Merge
• A little more complex
• Use the merge function
• If there is a common variable on which to merge:
    new_d_name <- merge(d_name1, d_name2, by = "ID", all=TRUE)
• If the matching variables have different names:
    new_d_name <- merge(d_name1, d_name2, by.x="IDX", by.y="IDY", all=TRUE)
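Both forms of the matched merge, sketched on made-up exam and mid-term datasets. The names exam, midterm, midterm2 and STUDENT are hypothetical and chosen only for this illustration.

```r
exam    <- data.frame(ID = c(1, 2, 3), exam    = c(58, 82, 90))
midterm <- data.frame(ID = c(2, 3, 4), midterm = c(70, 88, 61))

# Common matching variable; all=TRUE keeps unmatched observations too
both <- merge(exam, midterm, by = "ID", all = TRUE)
both
# IDs 1 and 4 each appear once, with NA for the mark they lack

# Matching variable has a different name in the second dataset
midterm2 <- data.frame(STUDENT = c(2, 3, 4), midterm = c(70, 88, 61))
both2 <- merge(exam, midterm2, by.x = "ID", by.y = "STUDENT", all = TRUE)
names(both2)   # "ID" "exam" "midterm"
```

Leaving out all=TRUE would keep only the observations found in both datasets (IDs 2 and 3 here).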