Training on R for 3rd and 4th Year Honours Students, Dept. of Statistics, RU

Installation and Data Structures of R
Empowered by Higher Education Quality Enhancement Project (HEQEP)
Department of Statistics, Rajshahi University, Rajshahi-6205, Bangladesh
March 21-23, 2013

History of R
• 1976: Statistical programming language S developed at Bell Labs.
• 1983: Licensed as S-Plus.
• 1990: R, an open-source program similar to S, developed by Robert Gentleman and Ross Ihaka (Auckland, NZ).
• 1997: International “R-core” team formed.
• Updated versions are available every couple of months.
• For more: http://cran.r-project.org/mirrors.html

Advantages of R
• R is a free computer programming language, developed by renowned statisticians.
• It is open-source and runs on Windows, Linux and Macintosh.
• R has excellent graphing capabilities.
• R has an excellent built-in help system.
• R's language has a powerful, easy-to-learn syntax with many built-in statistical functions.
• The language is easy to extend with user-written functions.

To obtain and install R on your computer
• Go to http://cran.r-project.org/mirrors.html to choose a mirror near you.
• Click on your operating system (Windows, Linux, or Mac).
• Download and install from “base”.
To install additional packages
• Start R on your computer.
• Choose the appropriate item from the “Packages” menu.
Here, CRAN = Comprehensive R Archive Network.

[Screenshots: downloading the R installer, then double-clicking it to run the installation]

The R Environment
[Screenshot of the R window, labelling the menu bar, toolbar, and command prompt]

The R Environment
• Press Ctrl + L to clear the screen.

Creating a Script File

Working in R: As a Calculator
Numeric operators
• 4 + 2 = 6
• 4 - 2 = 2
• 4 * 2 = 8
• 4 / 2 = 2
• 4 ^ 2 = 16
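The calculator operations above can be typed directly at the prompt. The following is a minimal sketch; the variable names are chosen here for illustration only, and the last two lines show how operator precedence affects a longer expression.

```r
# Basic arithmetic, stored in variables so the results can be inspected.
add_result <- 4 + 2
sub_result <- 4 - 2
mul_result <- 4 * 2
div_result <- 4 / 2
pow_result <- 4 ^ 2

# Precedence: ^ is evaluated before * and /, which are evaluated before + and -.
no_brackets   <- 4 + 2 * 3 ^ 2      # same as 4 + (2 * 9)
with_brackets <- ((4 + 2) * 3) ^ 2  # same as (6 * 3) ^ 2

print(c(add_result, sub_result, mul_result, div_result, pow_result))
print(c(no_brackets, with_brackets))
```

When in doubt about precedence, adding brackets makes the intended order explicit.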

Variables & Assignment Operator
Data types
• Numeric: 5, 5.76, etc.
• Logical: values corresponding to TRUE or FALSE
• Character strings: sequences of characters ("blue", "male", "Rahim", etc.)
Assignment
• Variables are assigned with the operator <- or =.
• The data type need not be declared.
> a = 5 # or, a <- 5
> b = "blue"
> c = a^2 + 5
> c > a
etc.
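As a small sketch of the three basic types, the code below assigns one value of each kind and asks R for its class with class(). The names c_val, is_larger and types are illustrative choices, not part of the slides.

```r
# Assign values without declaring a type; R infers it from the value.
a <- 5              # numeric
b <- "blue"         # character string
c_val <- a^2 + 5    # numeric result of an expression
is_larger <- c_val > a   # logical result of a comparison

# class() reports the type R has inferred for each variable.
types <- c(class(a), class(b), class(is_larger))
print(types)

# "=" and "<-" give the same result for a top-level assignment.
a2 = 5
print(identical(a, a2))
```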

Data Structure • Vectors • Matrices • Arrays • Factors • Lists • Data frames

Vector
Here we introduce three functions, c, seq, and rep, that are used to create vectors in various situations.
• c() to concatenate elements or sub-vectors
• rep() to repeat elements or patterns
• seq() to generate sequences
> c(2, 7, 9)
[1] 2 7 9
> a = c(2, 7, 9)
> b = c(3, 5, 8, a)
> b
[1] 3 5 8 2 7 9
rep(value(s), number of repetitions)
> rep(5, 10)
[1] 5 5 5 5 5 5 5 5 5 5
> rep(c(2, 4, 6), 3)
[1] 2 4 6 2 4 6 2 4 6
seq(initial value, terminal value, increment)
> seq(2, 10, 2)
[1]  2  4  6  8 10
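The same three functions also accept named arguments, which can make the intent clearer. This is a brief sketch using the each, by and length.out arguments of rep() and seq(); the variable names are illustrative.

```r
# rep() with each=: repeat every element before moving to the next one.
rep_each <- rep(c(2, 4, 6), each = 2)              # 2 2 4 4 6 6

# seq() with named arguments: start, end and step size.
seq_by <- seq(from = 2, to = 10, by = 2)           # 2 4 6 8 10

# seq() with length.out=: fix the number of points instead of the step.
seq_len5 <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# c() concatenates the results of other vector-building calls.
combined <- c(seq_by, rep(0, 3))

print(rep_each)
print(seq_by)
print(seq_len5)
print(combined)
```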

Vector
> h = c(21, 25, 19, 22, 23, 20) # Numeric vector
> h
[1] 21 25 19 22 23 20
> name = c("Rahim", "Rani", "Raju") # Character vector
> name
[1] "Rahim" "Rani"  "Raju"
> c = h > 22 # Logical vector
> c
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE
> a = c(1, 2, 3, 4, 5)
> a
[1] 1 2 3 4 5
> a = 1:5
> a
[1] 1 2 3 4 5

Vector Indexing
> w = c(1, 3, 5, 2, 10)
> w[3] # the third element of w
[1] 5
> w[3:5] # the third to fifth elements of w, inclusive
[1]  5  2 10
> w[w > 3] # elements of w greater than 3
[1]  5 10
> w[-2] # all except the second element
[1]  1  5  2 10
> w[w > 2 & w <= 5] # greater than 2 and less than or equal to 5
[1] 3 5
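Logical indexing works because a comparison such as w > 3 itself returns a logical vector. The sketch below makes that intermediate step visible and shows which() and assignment through an index; the names is_large, positions, w_capped and w_no_ends are illustrative.

```r
w <- c(1, 3, 5, 2, 10)

# The comparison returns one TRUE/FALSE per element.
is_large <- w > 3             # FALSE FALSE TRUE FALSE TRUE

# which() converts the logical vector into positions.
positions <- which(w > 3)     # 3 5

# An index can also appear on the left of an assignment.
w_capped <- w
w_capped[w_capped > 3] <- 3   # 1 3 3 2 3

# Negative indices drop elements: here the first and the last.
w_no_ends <- w[-c(1, length(w))]   # 3 5 2

print(is_large)
print(positions)
print(w_capped)
print(w_no_ends)
```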

Vector
Vectors used in functions
> w = c(1, 3, 5, 2, 10)
> length(w)
> sum(w)
> cumsum(w)
> min(w)
> max(w)
> range(w)
> mean(w)
> median(w)
> var(w)
> sd(w) # standard deviation (R has no std() function)
> summary(w)
> abs(10 - 50)
> sort(w)
> sort(w, decreasing = TRUE)
etc.
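To see what a few of these functions return for w, the following sketch stores the results and also recomputes the sample variance by hand (dividing by n - 1), which is the definition var() uses. The variable names are illustrative.

```r
w <- c(1, 3, 5, 2, 10)

w_mean   <- mean(w)      # 4.2
w_median <- median(w)    # 3
w_cumsum <- cumsum(w)    # 1 4 9 11 21
w_var    <- var(w)       # 12.7
w_sd     <- sd(w)        # square root of the variance

# Sample variance by hand: sum of squared deviations over (n - 1).
manual_var <- sum((w - mean(w))^2) / (length(w) - 1)

print(c(mean = w_mean, median = w_median, var = w_var, sd = w_sd))
print(w_cumsum)
print(all.equal(manual_var, w_var))
```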

Working in R: Using help
• Help on a specific R keyword: ?keyword or help(keyword)
> ?mean # information on the mean command
> help(mean)
> help(median)
• Full manual in HTML (also available on CRAN): help.start()
> help.start()
• Finding a "vague" topic: help.search("topic") or ??topic

Array & Matrix
A matrix in mathematics is just a two-dimensional array of numbers. Matrices and arrays are represented as vectors with dimensions:
# Generate a 3 by 4 array
> x <- 1:12
> dim(x) <- c(3, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Generate a 4 by 5 array
> A <- array(1:20, dim = c(4, 5))
> A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
• The dim assignment function sets or changes the dimension attribute of x, causing R to treat the vector of 12 numbers as a 3 × 4 matrix.
• Notice that the storage is column-major; that is, the elements of the first column are followed by those of the second, etc.
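The same idea extends beyond two dimensions. As a sketch, the code below builds a 2 × 3 × 4 array (an example size chosen here, not taken from the slides) and shows that the column-major filling rule still applies, with the first index varying fastest.

```r
# A three-dimensional array: 2 rows, 3 columns, 4 layers.
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))

# The first layer holds the first 6 values, filled column by column.
first_layer <- arr[, , 1]
print(first_layer)

# Element [1, 2, 3]: position 1 + (2 - 1) * 2 + (3 - 1) * 6 = 15 in the vector.
one_value <- arr[1, 2, 3]
print(one_value)

# For comparison, matrix() can fill by rows instead when byrow = TRUE.
by_col <- matrix(1:6, nrow = 2)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_col)
print(by_row)
```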

Array & Matrix
# Generate a 3 by 4 matrix, filled row by row
> A = matrix(1:12, nrow = 3, byrow = T)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# 3 x 2 matrix of 0
> Y <- matrix(0, nrow = 3, ncol = 2)
> Y
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0
> A[ , 2] # 2nd column of matrix A
[1]  2  6 10
> A[3, ] # 3rd row of matrix A
[1]  9 10 11 12
> A[2, 2] # (2, 2)th element of matrix A
[1] 6

Basic operations – Matrix
> A.mat <- matrix(c(19, 8, 11, 2, 18, 17, 15, 19, 10), nrow = 3)
> A.mat
     [,1] [,2] [,3]
[1,]   19    2   15
[2,]    8   18   19
[3,]   11   17   10
> inv.A <- solve(A.mat) # inverse of matrix A.mat
> t(A.mat) # transpose of matrix A.mat
> A.mat %*% inv.A # matrix product; gives the identity matrix, up to rounding error
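Because of floating-point rounding, the product of a matrix and its inverse is usually only approximately the identity. A minimal sketch of how this can be checked with all.equal(), and of using solve() with a right-hand side; the vector rhs is a made-up example.

```r
A.mat <- matrix(c(19, 8, 11, 2, 18, 17, 15, 19, 10), nrow = 3)
inv.A <- solve(A.mat)

# Compare the product with the 3 x 3 identity, allowing for rounding error.
product <- A.mat %*% inv.A
is_identity <- isTRUE(all.equal(product, diag(3)))
print(is_identity)

# solve(A, b) solves the linear system A x = b without forming the inverse.
rhs <- c(1, 2, 3)              # hypothetical right-hand side
x_sol <- solve(A.mat, rhs)
residual <- A.mat %*% x_sol - rhs
print(max(abs(residual)))      # should be very close to zero
```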

Basic operations – Matrix
> a = matrix(1:9, nrow = 3)
> b = matrix(2:10, nrow = 3)
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> b
     [,1] [,2] [,3]
[1,]    2    5    8
[2,]    3    6    9
[3,]    4    7   10
> cbind(a, b) # combine by columns
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7    2    5    8
[2,]    2    5    8    3    6    9
[3,]    3    6    9    4    7   10
> rbind(a, b) # combine by rows
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
[4,]    2    5    8
[5,]    3    6    9
[6,]    4    7   10
> Cov.matrix = cov(b)
> Cor.matrix = cor(b)
> Row.mean = apply(b, 1, mean)
> Col.mean = apply(b, 2, mean)
NOTE: apply(X, MARGIN, FUN), where MARGIN = 1 applies FUN to each row and MARGIN = 2 to each column.
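A short sketch of what apply() returns for the matrix b above. The anonymous function computing a per-row range is an illustrative addition; rowMeans() and colMeans() are base R shortcuts that give the same means.

```r
b <- matrix(2:10, nrow = 3)

# MARGIN = 1: apply the function to each row.
Row.mean <- apply(b, 1, mean)   # 5 6 7

# MARGIN = 2: apply the function to each column.
Col.mean <- apply(b, 2, mean)   # 3 6 9

# FUN can be any function, including one written on the spot.
Row.range <- apply(b, 1, function(r) max(r) - min(r))   # 6 6 6

print(Row.mean)
print(Col.mean)
print(Row.range)

# The built-in shortcuts agree with apply().
print(all.equal(Row.mean, rowMeans(b)))
print(all.equal(Col.mean, colMeans(b)))
```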

List
• vector: an ordered collection of data of the same type.
> a = c(7, 5, 1)
> a[2]
[1] 5
• list: an ordered collection of data of arbitrary types.
> a = list(Name = "Rahim", age = c(12, 23, 10), Married = FALSE)
> a
$Name
[1] "Rahim"
$age
[1] 12 23 10
$Married
[1] FALSE
• Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string).
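The different ways of reaching into a list can be sketched as follows. The list is the one from the slide, renamed person here for readability; the City component is a made-up addition for illustration.

```r
person <- list(Name = "Rahim", age = c(12, 23, 10), Married = FALSE)

# Access by name with $ or [[ ]]: returns the component itself.
person_name <- person$Name        # "Rahim"
person_ages <- person[["age"]]    # 12 23 10

# Access by position, then index into the component.
first_age <- person[[2]][1]       # 12

# Single brackets return a (shorter) list, not the component.
sub_list <- person["Name"]

# New components can be added by assignment (hypothetical value).
person$City <- "Rajshahi"

print(person_name)
print(person_ages)
print(first_age)
print(class(sub_list))
print(names(person))
```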

Data frames
• A data frame represents the typical data table that researchers come up with, like a spreadsheet.
• It is a rectangular table whose columns all have the same length; data within each column have the same type (e.g. number, text, logical), but different columns may have different types.
• Example:
> a
  localisation tumorsize progress
1     proximal       6.3    FALSE
2       distal       8.0     TRUE
3     proximal      10.0    FALSE

Making data frames
We illustrate how to construct a data frame from the following car data.

Making data frames
> Make <- c("Honda", "Chevrolet", "Ford", "Eagle", "Volkswagen", "Buick", "Mitsbusihi",
+           "Dodge", "Chrysler", "Acura")
> Model <- c("Civic", "Beretta", "Escort", "Summit", "Jetta", "Le Sabre", "Galant",
+            "Grand Caravan", "NewYorker", "Legend")
> Cylinder <- c(rep("V4", 5), "V6", "V4", rep("V6", 3))
> Weight <- c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265)
> Mileage <- c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20)
> Type <- c("Sporty", "Compact", rep("Small", 3), "Large", "Compact", "Van",
+           rep("Medium", 2))

Making data frames
Now the data.frame() function combines the six vectors into a single data frame.
> Car <- data.frame(Make, Model, Cylinder, Weight, Mileage, Type)
> Car

Making data frames
> names(Car)
[1] "Make"     "Model"    "Cylinder" "Weight"   "Mileage"  "Type"
> Car[1, ]
   Make Model Cylinder Weight Mileage   Type
1 Honda Civic       V4   2170      33 Sporty
> Car[10, 4]
[1] 3265
> Car$Mileage
 [1] 33 26 33 33 26 23 25 18 22 20
> mean(Car$Mileage) # average mileage of the 10 vehicles
[1] 25.9
> min(Car$Weight)
[1] 2170

Making data frames
> table(Car$Type) # gives a frequency table
Compact   Large  Medium   Small  Sporty     Van
      2       1       2       3       1       1
> table(Car$Make, Car$Type) # Cross tabulation
             Compact Large Medium Small Sporty Van
  Acura            0     0      1     0      0   0
  Buick            0     1      0     0      0   0
  Chevrolet        1     0      0     0      0   0
  Chrysler         0     0      1     0      0   0
  Dodge            0     0      0     0      0   1
  Eagle            0     0      0     1      0   0
  Ford             0     0      0     1      0   0
  Honda            0     0      0     0      1   0
  Mitsbusihi       1     0      0     0      0   0
  Volkswagen       0     0      0     1      0   0

Making data frames
> Make.Small <- Car$Make[Car$Type == "Small"] # makes of the cars of type "Small"
> summary(Car$Mileage) # gives summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   22.25   25.50   25.90   31.25   33.00
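Selecting rows by a condition, as in Make.Small above, generalises to whole data frames. The sketch below rebuilds a reduced version of the car data (Model, Weight, Mileage and Type only) so that it runs on its own; the object names small_cars, efficient, heavy and mean_by_type are illustrative.

```r
# Reduced copy of the car data from the slides.
Car <- data.frame(
  Model   = c("Civic", "Beretta", "Escort", "Summit", "Jetta", "Le Sabre",
              "Galant", "Grand Caravan", "NewYorker", "Legend"),
  Weight  = c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265),
  Mileage = c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20),
  Type    = c("Sporty", "Compact", rep("Small", 3), "Large", "Compact", "Van",
              rep("Medium", 2))
)

# Rows where a condition holds, all columns.
small_cars <- Car[Car$Type == "Small", ]

# Rows where a condition holds, selected columns only.
efficient <- Car[Car$Mileage > 25, c("Model", "Mileage")]

# subset() expresses the same kind of selection without repeating "Car$".
heavy <- subset(Car, Weight > 3000)

# Mean mileage within each type of car.
mean_by_type <- tapply(Car$Mileage, Car$Type, mean)

print(small_cars)
print(efficient)
print(heavy)
print(mean_by_type)
```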

Making data frames
> b = data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)) # random values; yours will differ
> b
            x            y           z
1  -1.7651180  0.462309932  0.09230914
2  -0.7340731 -1.681826091  0.66648791
3  -0.4968900  1.728658405 -0.68281664
4  -1.3217873  0.307030157  0.24192745
5  -0.2070019  0.003892192  1.19591807
6  -0.9633084  0.060328696 -1.40424843
7  -1.1323626  1.079521099  1.63552915
8  -0.7301976 -1.422012899 -0.16695860
9   0.2979073  0.528152338  0.65995778
10 -0.5759655  0.655296337 -0.39156127
> cor(b) # correlation matrix of the columns
             x             y           z
x 1.0000000000  0.0007151043  0.12151913
y 0.0007151043  1.0000000000 -0.05770153
z 0.1215191317 -0.0577015345  1.00000000
> apply(b, 1, var) # variance of each row
 [1] 1.42472853 1.39573092 1.80047438 0.85041478 0.57226442 0.56454121
 [7] 2.14379987 0.39516798 0.03357767 0.44098693
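Since rnorm() draws random numbers, each run produces a different data frame. One way to make such an example repeatable is set.seed(); the sketch below uses an arbitrary seed chosen for illustration and also contrasts row-wise and column-wise use of apply(). The names b1, b2, row_var and col_var are illustrative.

```r
# Fixing the seed makes the "random" draws repeatable.
set.seed(2013)   # arbitrary seed, chosen for illustration
b1 <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

set.seed(2013)
b2 <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

print(identical(b1, b2))   # same seed, same data

# MARGIN = 1: one variance per row (10 values).
row_var <- apply(b1, 1, var)

# MARGIN = 2: one variance per column (3 values), the diagonal of cov(b1).
col_var <- apply(b1, 2, var)

print(length(row_var))
print(col_var)
print(diag(cov(b1)))
```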

Making data frames
> b = data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
> b
            x            y           z
1  -1.7651180  0.462309932  0.09230914
2  -0.7340731 -1.681826091  0.66648791
3  -0.4968900  1.728658405 -0.68281664
4  -1.3217873  0.307030157  0.24192745
5  -0.2070019  0.003892192  1.19591807
6  -0.9633084  0.060328696 -1.40424843
7  -1.1323626  1.079521099  1.63552915
8  -0.7301976 -1.422012899 -0.16695860
9   0.2979073  0.528152338  0.65995778
10 -0.5759655  0.655296337 -0.39156127
> attach(b) # makes the columns x, y, z accessible by name
> lm.D9 <- lm(y ~ x) # Regression of y on x
> lm.D90 <- lm(y ~ x - 1) # omitting intercept
> anova(lm.D9)
> summary(lm.D9)
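A self-contained sketch of the same regression workflow on simulated data, so that the fitted coefficients can be compared with known values. The data frame sim, the true intercept 2 and slope 3, and the seed are all assumptions made for this illustration. It passes the data frame through the data argument of lm() instead of calling attach().

```r
# Simulate data with a known linear relationship: y = 2 + 3 x + noise.
set.seed(1)   # arbitrary seed, chosen for illustration
sim <- data.frame(x = rnorm(100))
sim$y <- 2 + 3 * sim$x + rnorm(100, sd = 0.5)

# Regression of y on x, with and without an intercept.
fit  <- lm(y ~ x, data = sim)
fit0 <- lm(y ~ x - 1, data = sim)

# Estimated coefficients; the slope should be close to the true value 3.
print(coef(fit))
print(coef(fit0))

# Analysis of variance table and R-squared of the first model.
print(anova(fit))
r2 <- summary(fit)$r.squared
print(r2)
```

Using data = sim keeps the variables inside the data frame, which avoids name clashes that attach() can cause when objects called x or y already exist in the workspace.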

Data Entry using Data Editor
• R has a Data Editor with a spreadsheet-like interface.
• The interface is quite useful for small data sets.
• Suppose we want to construct a data frame based on the following data.

Data Entry using Data Editor
• To do this, type:
> result <- data.frame(Roll = integer(0), Bstat101 = numeric(0), Bstat102 = numeric(0))
> result <- edit(result)
• Then enter the data in the Data Editor and close the Editor.
> result # To see the data
> result <- edit(result) # To modify the data

Reading data from File
• An entire data frame can be read directly with the read.table() function.
# Reading data from an Excel .csv file
> data1 <- read.table(file = "d:/RFiles/data1.csv", header = T, sep = ",")
> data1 <- read.csv(file = "d:/RFiles/data1.csv", header = T)
> data1
# Reading data from a text file
> data2 <- read.table(file = "d:/RFiles/data3.txt", header = T, sep = "\t")
> data2
> attach(data1) # makes the columns of data1 accessible by name
> detach(data1) # undoes attach(data1)
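The following sketch shows the round trip without depending on files in d:/RFiles: it writes a small made-up data frame to a temporary .csv file and reads it back with both read.csv() and read.table(). The marks are invented for illustration; the column names follow the Data Editor example.

```r
# A small made-up data set (values are illustrative only).
marks <- data.frame(Roll = 1:3,
                    Bstat101 = c(65, 72.5, 58),
                    Bstat102 = c(70, 68, 61.5))

# Write it to a temporary .csv file.
csv_path <- tempfile(fileext = ".csv")
write.csv(marks, file = csv_path, row.names = FALSE)

# read.csv() assumes a header line and comma separators by default.
from_csv <- read.csv(file = csv_path, header = TRUE)

# read.table() needs the header and separator stated explicitly.
from_table <- read.table(file = csv_path, header = TRUE, sep = ",")

print(from_csv)
print(all.equal(from_csv, from_table))

# Remove the temporary file.
unlink(csv_path)
```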

Importing from other statistical systems
Package foreign on CRAN provides import facilities for files produced by the following statistical software.
> read.mtp # imports a 'Minitab Portable Worksheet'
> read.xport # reads a file in SAS transport (XPORT) format
> read.spss # reads files created by SPSS
Package Rstreams on CRAN contains functions
> readSfile # reads binary objects produced by S-PLUS
> data.restore # reads S-PLUS data dumps (created by data.dump)

Thanks
