470 likes | 575 Views
Appendix A Introduction to R. References: An Introduction to R . (available in pdf from Help menu) Several books in the library. Contents. Introduction to R Some basic concepts An R session Managing R usage Entering data in R Manipulating data Examples
E N D
Appendix A Introduction to R References: An Introduction to R. (available in pdf from Help menu) Several books in the library.
Contents • Introduction to R • Some basic concepts • An R session • Managing R usage • Entering data in R • Manipulating data • Examples • Numeric and Character Vectors versus Factors Statistical Modelling Appendix A
1. Introduction to R • R is a free software environment for statistical computing and graphics. • Downloadable from CRAN • Built on S developed at AT&T’s Bell Labs. • R provides: • An extensive and coherent collection of tools for statistical analysis of data; • Graphical facilities for data display either at a workstation or as hardcopy; • An effective object-oriented programming language that can be easily extended. • R does not have a GUI so script-based: • Use Tinn-R a convenient interface for editing and executing scripts Statistical Modelling Appendix A
R capabilities include: • Import data from other software packages, including Excel spreadsheets • Summarize data in diagrams with complete control over the detail of the graphs • Perform calculations on data and basic statistical procedures • Regression including multiple linear, generalized linear, generalized additive and nonlinear regression • Design and analysis of experiments • Linear and nonlinear mixed models • Multivariate analyses • And packages …. • dae for design and analysis of experiment from Resources web site Statistical Modelling Appendix A
2. Some basic concepts • Performing tasks • R is a function-based language. So that tasks are achieved by calling functions. • When you start R, the R Console is displayed. • After some messages, there is a prompt “>”. • indicates that R is ready to accept commands. • Type 2+2 • Type sqrt(2) • Type x <- 1:5 followed by > x Statistical Modelling Appendix A
b) Type of data objects The entities R operates on are as objects. Vectors: a one-way array of values, all of which are numbers, logical values or character strings, but not a combination of these. • Attributes are length and mode. • All values in the array must have same mode; supply > 1 mode and will be coerced to single mode. Factors: vectors that hold categorical values, like the sex of a person: “male” and “female”. • Attributes are length and levels. • Levels are set of character strings allowed in the vector. • Mode of levels is always character. • While enter levels and operations in terms of levels, internally stored as codes. Statistical Modelling Appendix A
b) Type of data objects (cont’d) Matrix: a two-way array of values of the same mode. • Attributes are dim, length and mode, with dim being no. of rows and columns. Data frames: two-way array of values • Columns can be of different modes but must be of the same length. Lists: A list is an ordered collection of components where each component can be any data object. • Most general and flexible data object in R. • Often used for storing the results of functions with complex output. Statistical Modelling Appendix A
c) Naming conventions • The names of objects • can consist of any combination of upper and lowercase letters, numbers (not first) and periods (.). • Spaces not allowed • R is case-sensitive. • avoid system names — can sometimes cause problems so better to avoid • The following are all different, legal names: • Mydata, mydata, data.ozone, RandomNos, • lottery.ohio.1.28.90, • data.1, data.2, data.3 Statistical Modelling Appendix A
3. An R session An R session normally consists of: • Initializing Tinn-R and R: get Tinn-R and R started • Data entry: the data is entered from the keyboard or loaded from a file • Initial graphs: produce graphical displays of the data to explore it • Statistical analysis: one or more analyses are run on the data • Report generation: tables and diagrams that illustrate the results of the analysis are obtained • Finishing up: save those files that you want to save Statistical Modelling Appendix A
a) Initializing R and Tinn-R • Start Tinn-R (and create new file if new project). • Start R (there is a button in Tinn-R) • The program starts with the working directory set. • Working dir checked and changed using File > Change dir … • On your machine, default is the temp directory or one specified at installation. For me, might have been C:\Documents and Settings\briencj\Local Settings\Temp \Analyses but I specified D:\Analyses\R. • In a computer pool, it is C:\Program Files\R\R-2.5.0 but probably change to E:\My Documents. • If reset working directory, remains in place until changed again, even if intervening sessions. • Check have a suitable working directory. Statistical Modelling Appendix A
b) Data Entry • A common way to enter data is with the c (concatenate) function. • Following code sets up a data frame object with two columns named Temp and Thrust: Rocket.dat <- data.frame(Temp = c(19,15,35,52,35,33,30,57,49,26), Thrust = c(1.2,1.5,1.5,3.3,2.5,2.1,2.5,3.2,2.8,1.5)) > Rocket.dat Temp Thrust 1 19 1.2 2 15 1.5 3 35 1.5 4 52 3.3 5 35 2.5 6 33 2.1 7 30 2.5 8 57 3.2 9 49 2.8 10 26 1.5 > • Execute using a Send button on R tool bar in Tinn-R • To refer to one of these variable, say Temp, enter Rocket.dat$Temp or Rocket.dat[[“Temp”]]. • To avoid the need to specify the name of the data frame, attach it. • For example, attach(Rocket.dat). Statistical Modelling Appendix A
c) Initial graphs • Want a scatter diagram to look at the relationship between Temp and Thrust. • use plot(Thrust, Temp) Statistical Modelling Appendix A
d) Statistical analysis • Compute some summary statistics. > summary(Rocket.dat) Temp Thrust Min. :15.0 Min. :1.200 1st Qu.:27.0 1st Qu.:1.500 Median :34.0 Median :2.300 Mean :35.1 Mean :2.210 3rd Qu.:45.5 3rd Qu.:2.725 Max. :57.0 Max. :3.300 > Statistical Modelling Appendix A
e) Report generation • In this case, there is little to be done here, except perhaps to print out the graph and statistics. • In other cases further graphs and tables of statistics will need to be produced. • To get plot into Word or PowerPoint use • File > Copy to the clipboard > As a Metafile and then paste it. • Text output from R into a Word document • Copy-and-paste from the Console window Statistical Modelling Appendix A
f) Finishing up • One exits or quits R using the q() function. • Note must have parentheses; q on its own will print out the function q, not execute it. • Can also click on the × in the top right hand corner of the R window. • There will be a message asking whether to save the workspace image. • The workspace image contains the objects, data structures and functions, created during a session. • Your response to this depends on whether you want to save objects between sessions (see next section). • It will be saved into the current working directory with the name .RData. • You will also want to save the script file in Tinn-R, for example as Rocket.r. Statistical Modelling Appendix A
4. Managing R usage • Directories, the workspace and objects • Recommended that separate directories are used for different sets of data. Keeps separate from other data sets and everything together. • For the example I have directory Rocket. • The File menu in the R console and some buttons in Tinn-R can be used to manipulate the working directory, the workspace and objects. • The working directory • can be checked and changed using File > Change dir … in the R Console. • can also be set to the current path in Tinn-R using the Set work directory (current file path) button or R > Controlling R > Set work directory (current file path). Statistical Modelling Appendix A
The workspace • Workspaces can be loaded and saved from the File menu in R. • Note that loading a workspace does not clear the current workspace. Rather it adds to the current workspace. • So if you do not want objects currently loaded all can be cleared from Tinn-R using the Clear all objects button or this command from the R > Controlling R menu. • The rm function in R can be used to delete specific objects. • To find out what objects are in the current workspace • use the ls() function in the R Console or the List all objects button in Tinn-R. For example, > ls() [1] "MnThrust" "Rocket.dat" > • So it contains a vector MnThrust and a data frame called Rocket.dat. Note Rocket.dat is the full name and that .dat is not a file extension. Statistical Modelling Appendix A
Objects • There are several functions that allow us to find out about objects: • str, class, attributes, names and length. • Also, putting in the name of an object will cause its contents to be printed. • Convenient are buttons in Tinn-R that Print content, List names and List structure of selected objects. Statistical Modelling Appendix A
b) Getting help in R • There are three forms of help for R: • Online help for functions via the help function. For example, help(log) provides help on the log function. • You can also access this from Tinn-R by selecting the function and hitting the F1 key or clicking on the Help (selected) button. • Electronic manuals that can be accessed with Help > Manuals (in PDF). The most important of these is An introduction to R. • Published books such as Maindonald and Braun as well as Venables and Ripley. Statistical Modelling Appendix A
c) Functions in R • A function is an R expression that returns a value, usually after performing some operation on one or more arguments. • Functions are entered into the Console window and executed to perform tasks. • My preference is to enter functions into a file in Tinn-R: • allows me to edit, execute and save as I go so that I can reuse it later • easier to execute batches of commands. Statistical Modelling Appendix A
Example: plot and calculation of the mean for the Rocket data • Having set the working directory to be Rocket using File > Change dir, following commands entered into the Console window: attach(Rocket.dat) plot(Thrust, Temp) MnThrust <- mean(Thrust) MnThrust • Output in Console window is: > attach(Rocket.dat) > plot(Thrust, Temp) > MnThrust <- mean(Thrust) > MnThrust [1] 2.21 Statistical Modelling Appendix A
Points arising from these commands • As seen already, to access the columns in a data frame, without specifiying the name of the data frame, must attach using the attach function. attach(Rocket.dat) plot(Thrust, Temp) • Places a copy in the Search Path for functions • Changes not made to copy. • So to make changes to an attached data frame, • detach it before making the changes and • attach it again after the changes have been made Statistical Modelling Appendix A
Points arising from these commands • Functions can be called with or without assignment. plot(Thrust, Temp) MnThrust <- mean(Thrust) • Plot function is called without assignment and its results are displayed, in this case in a Graphics device window. • Mean function is called with assignment, the operator <- being the assignment operator. • Result assigned to object MnThrust, vector of length one. • No output is produced. • If function called without assignment, the result would have been printed but not stored. • Typing in the name of an object results in its being printed. > MnThrust [1] 2.21 The [1] at the start of the line indicates that the number 2.21 is first element of object MnThrust. Generally, such an index put at start of each line of output of an object — helps interpreting output. Statistical Modelling Appendix A
Some general rules for using functions • R is case-sensitive for names, including those of functions. • Functions may have their arguments specified or unspecified. • Arguments unspecified for concatenation function c, that puts a list of values into a vector. y <- c(2.5, 2.7, 2.1) places the three values, supplied as arguments to c, in the vector object named y. • Plot has specified arguments where the argument is given in the form name = value. Statistical Modelling Appendix A
Some general rules for using functions (cont’d) • In the case of plot a mixture of conventions can be used. • Plot of two vectors Weight and Mileage with a logarithmic y-axis could be obtained using either of the following commands: plot(x=Weight, y=Mileage, log="y") plot(Weight, Mileage, log="y") Statistical Modelling Appendix A
5. Entering data in R a) Creating data • Have used c to enter data • However, inconvenient if have the data does not have commas. • Another function requiring only spaces is scan. • See notes • If a particular value is missing in the sense that a value was not recorded for it, then enter NA to indicate this. • For example, if a respondent answered most questions in a questionnaire, NA would be entered for those unanswered. • Often want to get factors and response variable into a data.frame • See Appendix C1, Entering the results of an experiment into a data.frame. Statistical Modelling Appendix A
b) Opening previously-stored data • Data, with all its attributes, can be stored in .rda files using the save function and then later opened in an R session using the load function. • For example, save(Rocket.dat, file="Rocket.dat.rda") load("Rocket.dat.rda") • The .rda files are binary format files. Statistical Modelling Appendix A
c) Importing data from other programs • Simplest is to save the data as a comma separated (csv) file in the other program (Excel, S-Plus). • The read.table can be used to import the data. • Similarly write.table can be used to export data to other programs. • The following command reads in a file: CRDRat.dat <- read.table(“CRDRat.dat.csv”, header= TRUE, sep=”,”, colClasses = c(“factor”, “factor”, “numeric”) • The following writes a file that can be imported to Excel: write.table(CRDRat.dat, file = “CRDRat.dat.csv”, sep=”,”, col.names=NA) Statistical Modelling Appendix A
6. Manipulating data • To subset and reorder data vectors and data frames use indices. • Vector has one index • Data frame has two indices • They can be numbers or, if available, names. • Place them between (single) square brackets. • The following commands show this for the Rocket.dat data frame. > attach(Rocket.dat) > T <- Temp[1:3] > T [1] 19 15 35 > Th <- Rocket.dat[, "Thrust"] > Th [1] 1.2 1.5 1.5 3.3 2.5 2.1 2.5 3.2 2.8 1.5 > class(Th) [1] "numeric" > Th <- Th[order(Th)] > Th [1] 1.2 1.5 1.5 1.5 2.1 2.5 2.5 2.8 3.2 3.3 Statistical Modelling Appendix A
Functions that manipulate • Some return a single value, like min, max, sum and mean. • Others return vectors, like sin, cos, log and sqrt. • Can have a combination like • Deviation <- Thrust – mean(Thrust) • takes the mean of Thrust off each value. • Useful function for generating patterned data is the rep function. • rep(c(2,3,5), each = 3, times = 2) • yields • 2 2 2 3 3 3 5 5 5 2 2 2 3 3 3 5 5 5 • Many as.something functions that coerce the argument to class something. • For instance as.factor converts a numeric to a factor: > A <- rep(1:3, each = 3, times = 2) > class(A) [1] "numeric" > A <- as.factor(A) > class(A) [1] "factor" Statistical Modelling Appendix A
7. ExamplesExample I.1 House price (continued) • Data for response variable Price and explanatory variables Age and Area: • Enter this data into R. • Then saved in, say, House.Prices.dat.rda. Statistical Modelling Appendix A
R functions to obtain regression analysis #Set working directory and load House.Prices.dat load("House.Prices.dat.rda") attach(House.Prices.dat) pairs(House.Prices.dat, pch=16) #Copy scatterplot matrix to clipboard House.lm <- lm(Price ~ Age + Years, House.Prices.dat) summary(House.lm) residuals <- residuals(House.lm) fitted <- fitted(House.lm) plot(fitted, residuals, pch=16) #Copy residuals-versus-fitted-values plot to clipboard qqnorm(residuals, pch=16) qqline(residuals) #Copy qq plot to clipboard Statistical Modelling Appendix A
R output — graph Statistical Modelling Appendix A
R output — text > House.lm <- lm(Price ~ Age + Years, House.Prices.dat) > summary(House.lm) Call: lm(formula = Price ~ Age + Years, data = House.Prices.dat) Residuals: 1 2 3 4 5 6.409 -2.832 -1.551 -5.602 3.576 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 33.0626 10.5066 3.147 0.0879 Age -0.1897 1.1106 -0.171 0.8801 Years 10.7182 9.7272 1.102 0.3854 Residual standard error: 6.916 on 2 degrees of freedom Multiple R-Squared: 0.7142, Adjusted R-squared: 0.4285 F-statistic: 2.499 on 2 and 2 DF, p-value: 0.2858 fitted equation: Price= 33.0626 – 0.1897 Age + 10.7182 Years Statistical Modelling Appendix A
R output — residual plots Statistical Modelling Appendix A
Example III.1 Rat experiment • An experiment in which 6 rats to be fed one of three diets: 3 rats to be fed diet A, 2 diet B and 1 diet C. • Liver weights, as a percentage of total body weight, in standard order, are as follows: Statistical Modelling Appendix A
R functions to analyse rat data load("CRDRat.dat.rda") attach(CRDRat.dat) boxplot(split(LiverWt, Diet)) # # AOV with Error Rat.aov <- aov(LiverWt ~ Diet + Error(Rat), CRDRat.dat) summary(Rat.aov) model.tables(Rat.NoError.aov, type = "means") • split function in the boxplot function splits LiverWt into groups defined by Diet. • aov function does the analysis of variance and summary displays a summary Analysis of Variance table. Statistical Modelling Appendix A
R output load("CRDRat.dat.rda") attach(CRDRat.dat) boxplot(split(LiverWt, Diet)) Statistical Modelling Appendix A
AOV > # > # AOV with Error > # > Rat.aov <- aov(LiverWt ~ Diet + Error(Rat), CRDRat.dat) > summary(Rat.aov) Error: Rat Df Sum Sq Mean Sq F value Pr(>F) Diet 2 0.240000 0.120000 3.6 0.1595 Residuals 3 0.100000 0.033333 > model.tables(Rat.aov, type="means") Tables of means Grand mean 3.1 Diet A B C 3.1 3.3 2.7 rep 3.0 2.0 1.0 Statistical Modelling Appendix A
8. Numeric and Character Vectors versus Factors Mileages and types of cars from a small study • Mileages are positive integers and types of car are words. • What sort of data objects— numeric vectors, character vectors, factors? • Mileages are straightforward — they would go into a numeric vector as they are arbitrary numbers. • But types of car is different as it has only three values and they are words. Statistical Modelling Appendix A
Different data-object types Could: • decide on some codes — 1 = small, 2 = sporty and 3 = compact — and enter these into a numeric vector; • enter the words into a Character vector; or • enter either codes or words into a factor. > TestData.dat Mileage Type Type.code Type.char Type.num Eagle Summit 4 33 Small 1 Small 1 Ford Escort 4 33 Small 1 Small 1 Ford Festiva 4 37 Small 1 Small 1 Chevrolet Camaro V8 20 Sporty 2 Sporty 2 Dodge Daytona 27 Sporty 2 Sporty 2 Ford Mustang V8 19 Sporty 2 Sporty 2 Audi 80 4 27 Compact 3 Compact 3 Buick Skylark 4 23 Compact 3 Compact 3 Chevrolet Beretta 4 26 Compact 3 Compact 3 > str(TestData.dat) 'data.frame': 9 obs. of 5 variables: $ Mileage : num 33 33 37 20 27 19 27 23 26 $ Type : Factor w/ 3 levels "Compact","Small",..: 2 2 2 3 3 3 1 1 1 $ Type.code: Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 $ Type.char: chr "Small" "Small" "Small" "Sporty" ... $ Type.num : num 1 1 1 2 2 2 3 3 3 Note rownames Statistical Modelling Appendix A
Decision • Factors treat the values as categories and are treated specially in analyses by R. • So the decision is important. • Most likely type of car entered into a factor; • Probably more informative to enter the words. • Entering words is more work so enter numeric codes and • use labels argument in factor to add words; • Or use levels function on left hand side to change the levels to words. Statistical Modelling Appendix A
Factor data structures • A factor is a vector that has only a limited set of possible values. Possible values are called the levels of the factor. • If t is the number of levels of a factor and n is the number of values recorded for the factor, R stores n codes each between 1 and t, rather than the actual levels. • It also keeps a list of the t levels associated with the t codes. • An error will occur if you try to enter a value or code outside those specified. Statistical Modelling Appendix A
Generating regular patterns • Because limited set of values • each value often occurs repeatedly • often a regular pattern to the values A: 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 B: 4 4 4 4 1 1 1 1 2 2 2 2 4 4 4 4 1 1 1 1 2 2 2 2 C: 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 • Consequently, often convenient to use rep, factor and data.frame functions • A <- factor(rep(c(1,2), each=12)) • B <- factor(rep(c(4,1,2), each=4, times=2)) • C <- factor(rep(1:4, times=6)) • Fac3.ran <- data.frame(A,B,C) • First argument of rep function gives the levels of the factor • each argument repeats each value a specified number of times • times argument repeats the whole pattern specified by the first argument and each. Statistical Modelling Appendix A
Generating regular patterns (cont’d) • This pattern can be generated using fac.gen, that has the following arguments: • fac.gen(generate, each=1, times=1, order="standard") • generate is a list of factor names or numbers • each component is a number or a vector (numeric or character) of levels. • order="standard" means first factor’s values change slowest, last changes fastest (opposite to Yates order). • Fac3.ran <- fac.gen(list(A = 2, B = c(4,1,2), C = 1:4)) • A: 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 B: 4 4 4 4 1 1 1 1 2 2 2 2 4 4 4 4 1 1 1 1 2 2 2 2 C: 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Statistical Modelling Appendix A
Ordered factors • Sometimes the order of the factors is arbitrary and R orders the factors alphabetically. • For example, the factor Sex with levels male and female would be ordered with female assigned code 1 and male assigned code 2. • Sometimes an innate order to the levels that is different to the alphabetical order. • For example, a factor for income with values lo, mid and hi would by default have the levels ordered hi, lo, mid. • Can specify the order of the levels by converting the factor object to be an ordered object. • An ordered object is the same as a factor except that the levels are specifically ordered. Statistical Modelling Appendix A
Converting a factor to an ordered • To convert the already existing income factor object to an ordered object: ordered(income) <- c(“lo”, “mid”, “hi”) • Ordered factors are in some circumstances treated differently by R. Statistical Modelling Appendix A