520 likes | 633 Views
Training on R for Students. H igher E ducation Q uality E nhancement P roject (HEQEP). Software Training Program. Lecture 3. (An Open Source Package). Date: March 22-23, 2013. Organized by. Department of Statistics. Rajshahi University, Rajshahi-6205, Bangladesh. Programming with R.
E N D
Training on R for Students Higher Education Quality Enhancement Project (HEQEP) Software Training Program Lecture 3 (An Open Source Package) Date: March 22-23, 2013 Organized by Department of Statistics Rajshahi University, Rajshahi-6205, Bangladesh.
Matrices X<-matrix(rpois(20,1.5),nrow=4) # rpois() is random sample from Poisson dist X [,1] [,2] [,3] [,4] [,5] [1,] 1 0 2 5 3 [2,] 1 1 3 1 3 [3,] 3 1 0 2 2 [4,] 1 0 2 1 0 Suppose that the rows refer to four different trials and we want to label the rows ‘Trial.1’ etc. We employ the function rownames to do this. We could use the paste function but here we take advantage of the prefix option: rownames(X)<-rownames(X,do.NULL=FALSE, prefix="Trial.") X [,1] [,2] [,3] [,4] [,5] Trial.1 1 0 2 5 3 Trial.2 1 1 3 1 3 Trial.3 3 1 0 2 2 Trial.4 1 0 2 1 0
Matrices For the columns we want to supply a vector of different names for the five drugs involved in the trial, and use this to specify the colnames(X): drug.names<-c("aspirin", "paracetamol", "nurofen", "hedex", "placebo") colnames(X)<-drug.names X aspirin paracetamol nurofenhedex placebo Trial.1 1 0 2 5 3 Trial.2 1 1 3 1 3 Trial.3 3 1 0 2 2 Trial.4 1 0 2 1 0 Alternatively, you can use the dimnames function to give names to the rows and/or columns of a matrix. In this example we want the rows to be unlabelled (NULL) and the column names to be of the form ‘drug.1’, ‘drug.2’, etc. dimnames(X)<-list(NULL, paste("drug.",1:5,sep="")) X drug.1 drug.2 drug.3 drug.4 drug.5 [1,] 1 0 2 5 3 [2,] 1 1 3 1 3 [3,] 3 1 0 2 2 [4,] 1 0 2 1 0
Making data frames (1) We illustrate how to construct a data frame from the following car data.
Making data frames (2) > Make <- c("Honda","Chevrolet","Ford","Eagle","Volkswagen","Buick","Mitsbusihi", + "Dodge","Chrysler","Acura") > Model <- c("Civic","Beretta","Escort","Summit","Jetta","Le Sabre","Galant", + "Grand Caravan","New Yorker","Legend") > Cylinder <-c (rep("V4",5),"V6","V4",rep("V6",3)) > Weight <- c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265) > Mileage <- c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20) > Type <- c("Sporty","Compact",rep("Small",3),"Large","Compact","Van", + rep("Medium",2)) # rep("V4",5) instructs R to repeat V4 five times.
Making data frames (3) Now data.frame() function combines the six vectors into a single data frame. Car <- data.frame(Make,Model,Cylinder,Weight,Mileage,Type) Car
Few Operations in data frame Car (1) names(Car) [1] "Make" "Model" "Cylinder“ "Weight" "Mileage" "Type" Car[1,] Make Model Cylinder Weight Mileage Type 1 Honda Civic V4 2170 33 Sporty Car[10,4] [1] 3265 Car$Mileage [1] 33 26 33 33 26 23 25 18 22 20 mean(Car$Mileage) #average mileage of the 10 vehicles [1] 25.9 min(Car$Weight)# minimum of car weights[1] 2170
Few Operations in data frame Car (2) table(Car$Type)# gives a frequency table Compact Large Medium Small Sporty Van 2 1 2 3 1 1 table(Car$Make, Car$Type)# Cross tabulation Compact Large Medium Small Sporty Van Acura 0 0 1 0 0 0 Buick 0 1 0 0 0 0 Chevrolet 1 0 0 0 0 0 Chrysler 0 0 1 0 0 0 Dodge 0 0 0 0 0 1 Eagle 0 0 0 1 0 0 Ford 0 0 0 1 0 0 Honda 0 0 0 0 1 0 Mitsbusihi 1 0 0 0 0 0 Volkswagen 0 0 0 1 0 0
Making data frames (6) Make.Small <- Car$Make[Car$Type == "Small"] summary(Car$Mileage)# gives summary statistics Min. 1st Qu. Median Mean 3rd Qu. Max. 18.00 22.25 25.50 25.90 31.25 33.00
Rank, Sorting and Order Price<- scan() 1: 325 201 157 162 164 101 211 188 95 117 188 121 13: Read 12 items ranks<-rank(Price) sorted<-sort(Price) ordered<-order(Price) # position view<-data.frame(Price, ranks, sorted, ordered) view Price ranks sorted ordered 1 325 12.0 95 9 2 201 10.0 101 6 3 157 5.0 117 10 4 162 6.0 121 12 5 164 7.0 157 3 6 101 2.0 162 4 7 211 11.0 164 5 8 188 8.5 188 8 9 95 1.0 188 11 10 117 3.0 201 2 11 188 8.5 211 7 12 121 4.0 325 1
Evaluating Functions with apply, sapply and lapply apply( arr, margin, fct ) Applies the function fct along some dimensions of the array arr, according to margin, and returns a vector or array of the appropriate size. The apply function is used for applying functions to the rows or columns of matrices or dataframes.
For example: apply (X<-matrix(1:24,nrow=4)) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 5 9 13 17 21 [2,] 2 6 10 14 18 22 [3,] 3 7 11 15 19 23 [4,] 4 8 12 16 20 24 apply(X,1,sum) # to obtain the row total [1] 66 72 78 84 apply(X,2,sum) # to obtain the column totals (six of them): [1] 10 26 42 58 74 90 apply(X,1,sqrt) apply(X,2,sqrt) apply(X,1,sample) apply(X,1,function(x) x^2+x)
Vector and sapply If you want to apply a function to a vector then use sapply (rather than apply for matrices or margins of matrices). Here is the code to generate a list of sequences from 1:3 up to 1:7 sapply(3:7, seq) [[1]] [1] 1 2 3 [[2]] [1] 1 2 3 4 [[3]] [1] 1 2 3 4 5 [[4]] [1] 1 2 3 4 5 6 [[5]] [1] 1 2 3 4 5 6 7 The function sapply is most useful with complex iterative calculations.
Example: sapply a<-seq(0.01,0.2,.005) Now we can use sapply to apply the sum of squares function for each of these values of a (without writing a loop), and plot the deviance against the parameter value for a: sumsq<- function(x) {sum(x^2)} # function that produce sum of squares plot(a, sapply(a, sumsq), type="l")
Lists and lapply lapply( li, fct ) # to each element of the list li, the function fct is applied. a<-c("a","b","c","d") b<-c(1,2,3,4,4,3,2,1) c<-c(T,T,F) list.object<-list(a,b,c) # create a list object class(list.object) # to see the class type [1] "list" list.object # to see the contents of the list we just type its name: [[1]] [1] "a" "b" "c" "d" [[2]] [1] 1 2 3 4 4 3 2 1 [[3]] [1] TRUE TRUE FALSE The function lapply applies a specified function to each of the elements of a list in turn (without the need for specifying a loop, and not requiring us to know how many elements there are in the list).
Lists and lapply #To know the length of each of the vectors making up the list: lapply(list.object, length) #To find out class, we apply the function class to the list: lapply(list.object, class) #To find mean lapply(list.object, mean)
Working with Vectors and Logical Subscripts Take the example of a vector containing the 11 numbers 0 to 10: x<-0:10 There are two quite different kinds of things we might want to do with this. We might want to add up the values of the elements: sum(x) # adds up the values of the xs [1] 55 Alternatively, we might want to countthe elements that passed some logical criterion. Suppose we wanted to know how many of the values were less than 5: sum(x<5) # counts up the number of cases that pass the logical # condition ‘x is less than 5’ [1] 5
When we counted the number of cases, the counting was applied to the entire vector, using sum(x<5). To find the sum of the values of x that are less than 5: sum(x[x<5]) [1] 10 The logical condition x<5 is either true or false: x<5 [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE [10] FALSE FALSE Imagine false as being numeric 0 and true as being numeric 1. Then the vector of subscripts [x<5] is five 1s followed by six 0s: 1*(x<5) [1] 1 1 1 1 1 0 0 0 0 0 0 Now imagine multiplying the values of x by the values of the logical vector x*(x<5) [1] 0 1 2 3 4 0 0 0 0 0 0 When the function sum is applied, it gives us the answer we want: the sum of the values of the numbers 0+1+2+3+4=10. sum(x*(x<5)) [1] 10 This produces the same answer as sum(x[x<5])
Addresses within Vectors There are two important functions for finding addresses within arrays. The function which is very easy to understand. y<-c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11) y [1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11 Suppose we wanted to know which elements of y contained values bigger than 5. We type which(y>5) [1] 1 4 5 6 7 8 11 13 15 Notice that the answer to this enquiry is a set of subscripts. We don’t use subscripts inside the which function itself. The function is applied to the whole array. To see the values of y that are larger than 5, we just type y[y>5] [1] 8 7 6 6 8 9 9 10 11 Note that this is a shorter vector than y itself, because values of 5 or less have been left out: length(y) [1] 15 length(y[y>5]) [1] 9
Finding Closest Values Finding the value in a vector that is closest to a specified value is straightforward using which. Here, we want to find the value of xv that is closest to 108.0: which(abs(y-8.9)==min(abs(y-8.9))) [1] 8 11 The closest value to 108.0 is in location 332. But just how close to 8.9 is this 8th and 11th value? We use 8 and 11 as a subscript on y to find this out y[c(8,11)] [1] 9 9 Thus, we can write a function to return the closest value to a specified value sv closest<-function(xv, sv){ xv[which(abs(xv-sv)==min(abs(xv-sv)))] } and run it like this: closest(y,8.9) [1] 9 9
The sample Function y<- scan() [1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11 and here are two samples of y: sample(y) [1] 8 8 9 9 2 10 6 7 3 11 5 4 6 3 4 sample(y) [1] 9 3 9 8 8 6 5 11 4 6 4 7 3 2 10 The order of the values is different each time that sample is invoked, but the same numbers are shuffled in every case. This is called sampling without replacement. sample(y, 5) [1] 9 4 10 8 11 sample(y, 5) [1] 9 3 4 2 8 The option replace=T allows for sampling with replacement, sample(y, replace=T) [1] 9 6 11 2 9 4 6 8 8 4 4 4 3 9 3
Some general programming guidelines Writing a computer program to solve a problem can usually be reduced by following this sequence of steps: 1 Understand the problem. 2 Work out a general idea how to solve it. 3 Translate your general idea into a detailed implementation. 4 Check: Does it work? Is it good enough? If yes, you are done! If no, go back to step 2.
Flow control • The for()statement • The if() statement • The while() loop • The repeat loop, and the break and next statements
Flow control The for statement (1) The for()statement allows one to specify that a certain operation should be repeated a fixed number of times. Syntax for (variable in sequence) expression Or, for (variable in sequence) { expression expression expression }
Example: The for statement sum.x <- 0 for (i in 1:5) { sum.x <- sum.x + i print(i) } [1] 1 [1] 2 [1] 3 [1] 4 [1] 5 sum.x [1] 15 print() # Prints a single R object cat() # Prints multiple objects, # one after the other
Example: for loop The Fibonacci sequence is a famous sequence in mathematics. The first two elements are defined as [1, 1]. Subsequent elements are defined as the sum of the preceding two elements. For example, the third element is 2 (= 1+1), the fourth element is 3 (= 1+2), the fifth element is 5 (= 2+3), and so on. To obtain the first 12 Fibonacci numbers in R, we can use Fibonacci <- numeric(12) # numeric array of size 12 Fibonacci[1] <- Fibonacci[2] <- 1 for (i in 3:12) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1] To see all 12 values, type in Fibonacci [1] 1 1 2 3 5 8 13 21 34 55 89 144
The for statement (2) What are the outputs of the following statement? for (x in 1:10) print(sqrt(x)) It prints the square root of the integers one to ten
Conditional execution (1) The if() statements: The if() statement allows us to control which statements are executed Syntax if (condition) {commands when TRUE} Or, if (condition) {commands when TRUE} else {commands when FALSE} That is, if (expr1) expr2 else expr3 where expr1 must evaluate to a single logical value and the result of the entire expression is then evident.
Conditional execution: if statement Entry True (1) Expre-ssion False Statement 1 Exit
Conditional execution: if – else statement Entry True (1) False (0) Test Expre- ssion Body of else Body of if Exit
Example: Conditional execution (2) A simple example: x <- 3 if (x > 2) y <- 2 * x else y <- 3 * x Since x > 2 is TRUE, y is assigned 2 * 3 = 6. If it hadn’t been true, y would have been assigned the value of 3 * x.
The while() loop Sometimes we want to repeat statements, but the pattern of repetition isn’t known in advance. We need to do some calculations and keep going as long as a condition holds. The while() statement accomplishes this. Syntax while (condition) {statements} The condition is evaluated, and if it evaluates to FALSE, nothing more is done. If it evaluates to TRUE the statements are executed, condition is evaluated again, and the process is repeated.
Example: while loop Suppose we want to list all Fibonacci numbers less than 300. We don’t know beforehand how long this list is, so we wouldn’t know how to stop the for()loop at the right time, but a while()loop is perfect: Fib1 <- 1 Fib2 <- 1 Fibonacci <- c(Fib1, Fib2) while (Fib2 < 300) { Fibonacci <- c(Fibonacci, Fib2) oldFib2 <- Fib2 Fib2 <- Fib1 + Fib2 Fib1 <- oldFib2 } To see the final result of the computation, type Fibonacci [1] 1 1 1 2 3 5 8 13 21 34 55 89 144 233
The repeat loop, and the break and next statements Sometimes we don’t want a fixed number of repetitions of a loop, and we don’t want to put the test at the top of the loop the way it is in a while() loop. In this situation we can use a repeat loop. This loop repeats until we execute a break statement. Syntax repeat { statements } This causes the statements to be repeated endlessly. The statements should normally include a break statement, typically in the form if (condition) break but this is not a requirement of the syntax. The break statement causes the loop to terminate immediately. Break statements can also be used in for() and while() loops. The next statement causes control to return immediately to the top of the loop; it can also be used in any loop. The repeat loop and the break and next statements are used relatively infrequently.
Example: repeat, break We can repeat the Newton’s algorithm example from the previous example using a repeat loop: x <- x0<- .5 tolerance <- 0.000001 repeat { f <- x^3 + 2 * x^2 - 7 if (abs(f) < tolerance) break f.prime <- 3 * x^2 + 4 * x x <- x - f / f.prime } x This version removes the need to duplicate the line that calculates f. #****** Prog Using while x <- x0<- .5 f <- x^3 + 2 * x^2 - 7 tolerance <- 0.000001 while (abs(f) > tolerance) { f.prime <- 3 * x^2 + 4 * x x <- x - f / f.prime f <- x^3 + 2 * x^2 - 7 } x
Writing functions (1) A function is defined by an assignment of the form > name <- function(arg_1, arg_2, ...) expression The expression is an R expression, (usually a grouped expression), that uses the arguments, arg_i, to calculate a value. The value of the expression is the value returned for the function. A call to the function then usually takes the form > name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.
Writing functions (2) Example 1:Write a function to compute standard deviation sd <- function(x){ sqrt(var(x)) } If X = 9, 5, 2, 3, 7; type x <- c(9,5,2,3,7) sd(x) [1] 2.863564 Exercise:Calculate the coefficient of variation as the standard deviation of a variable, after dividing by its mean.
Example : Geometric mean as a function insects<-c(1,10,1000,10,1) mean(insects) [1] 204.4 To calculate a geometric mean by finding the antilog (exp) of the average of the logarithms (log) of the data: exp(mean(log(insects))) [1] 10 So a function to calculate geometric mean of a vector of numbers x: geometric<-function (x) {exp(mean(log(x)))} and testing it with the insect data geometric(insects) [1] 10
Writing functions (4) Example 3:Calculation of Grade point and Letter grade from score or marks. grade <- function (s) { n <- length(s) gp <- matrix(0, nrow = n, ncol = 1) # gp means Grade Point lg <- matrix(0, nrow = n, ncol = 1) # lg means Letter Grade for (i in 1:n) { if (s[i] < 40){ gp[i] = 0.00; lg[i]= "F" } else if (s[i] >= 40 && s[i] < 45){ gp[i] = 2.00; lg[i] = "D" }
Writing functions (5) else if (s[i] >= 45 && s[i] < 50){ gp[i] = 2.25; lg[i] = "C" } else if (s[i] >= 50 && s[i] < 55){ gp[i] = 2.50; lg[i] = "C+" } else if (s[i] >= 55 && s[i] < 60){ gp[i] = 2.75; lg[i] = "B-" } else if (s[i] >= 60 && s[i] < 65){ gp[i] = 3.00; lg[i] = "B" } else if (s[i] >= 65 && s[i] < 70){ gp[i] = 3.25; lg[i] = "B+" }
Writing functions (6) else if (s[i] >= 70 && s[i] < 75){ gp[i] = 3.50; lg[i] = "A-" } else if (s[i] >= 75 && s[i] < 80){ gp[i] = 3.75; lg[i] = "A" } else{ gp[i] = 4.00; lg[i] = "A+" } } # end of for loop return(list(Grade.Point = gp, Letter.Grade = lg)) } # end of function score <- c(80, 45, 55, 90, 75, 38, 62) result <- grade(score) result
Writing functions (7) Example 4: Write a function to calculate the two sample t-statistic. This is an artificial example. The function is defined as follows: twosam <- function(y1, y2) { n1 <- length(y1); n2 <- length(y2) yb1 <- mean(y1); yb2 <- mean(y2) s1 <- var(y1); s2 <- var(y2) s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2) tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2)) deg.free <- n1+n2-2 return(list(test.stat = tst, df=deg.free)) }
Writing functions (8) If, x <- c(37, 29, 35, 28, 24, 36, 40, 37, 33, 28, 39) y <- c(22, 32, 27, 30, 24, 34, 32, 20, 24, 25, 28, 26, 26) Then twosam(x, y) $test.stat [1] 3.307523 $df [1] 22
Maximum likelihood estimation Example 4: Maximum likelihood estimation (Gamma distribution as an example) The pdf of Gamma distribution The likelihood and log-likelihood are
Maximum likelihood estimation mle() allows to estimate parameters by maximum likelihood method using iterative methods of numerical calculus to minimize the negative log-likelihood (which is the same of maximizing the log-likelihood). This requires to specify the negative log-likelihood analytical expression as argument and giving some starting parameters estimates.
Maximum likelihood estimation x.gam <- rgamma(200, rate=0.5, shape=3.5) # sample size n=200 from a gamma distribution with # λ=0.5 (scale parameter) and α=3.5 (shape parameter) library(stats4) # loading package stats4 for mle() logL <- function(lambda, alfa) { n <-200 x <- x.gam temp1 <- -n*alfa*log(lambda)+n*log(gamma(alfa)) temp2 <- -(alfa-1)*sum(log(x))+lambda*sum(x) temp1+temp2 # -log-likelihood function } est <- mle(minuslog =logL, start =list(lambda =2, alfa =1))