450 likes | 579 Views
Welcome to lecture 6: An introduction to data analysis with R. IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA. We're moving ahead a bit. The majority of the class does some type of microarray analysis
E N D
Welcome to lecture 6:An introduction to data analysis with R IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA
We're moving ahead a bit... • The majority of the class does some type of microarray analysis • Microarray analysis utilizes the same programmatic concepts we've been exploring • We’ve covered variables, control structures, data structures, functions in perl • Now we’ll introduce a new language called R • particularly suited for numerical analysis • We’ll learn about explorative data analysis • We’ll write our own functionality in this new language
R(A data-structure focused introduction) • Every R intro I’ve read tends towards statistical tools first and data structures / programmatic concepts as an afterward • I don’t think this is the best way to learn a language • After all, we’ll be doing complex data analysis • Going beyond one-off biostatistics learning • I’m going to introduce R the same way I introduced PERL • I hope to show you that R is just as friendly…
What is R? From the R-project webpage (www.r-project.org): R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
Where to get R? Available for a wide variety of platforms ... (handled well in windows too!) http://www.r-project.org Libraries available via CRAN (like CPAN we used before) The “bioperl” version of R is “bioconductor” - used extensively for routine and experimental microarray analysis
Let’s begin The default mode for R is interactive CLI (with emacs keybindings!) A query – response line mode • An overgrown calculator • R evaluates commands through function calls • > 2+2 • [1] 4 • A programming language • Complex data structures • Control blocks • Objects!
Symbolic variables > x<-2 > x [1] 2 > x + x [1] 4 > x<-”ACTCGATCGACT” > x [1] “ACTCGATGCACT” > x<-T > x [1] TRUE • Variable assignment is handled via arrow notation <- • Variables can be examined by simply calling the variable • The index of the first element of the variable is given in brackets on each line [1] • Scalar elements can be numerical, character, or boolean
Vectors > x<-c(1,2,3,4,5) > x [1] 1 2 3 4 5 > x<-c(“ACT”,”TCA”,”GGA”,”CCG”) > x [1] “ACT” “TCA” “GGA” “CCG” > x<-c(T,T,F,T) > x [1] TRUE TRUE FALSE TRUE • R handles vectors as single objects • R defines three types of vectors: • numerical vectors • character vectors • logical vectors • Vectors are created (and treated) as concatenation of scalar elements:
Vector element access > x<-seq(1,20,2) > x[5] [1] 9 > x[c(1,3,4)] [1] 1 5 7 > x[x>10] [1] 11 13 15 17 19 > x[c(-1,-2,-3,-4,-5)] [1] 11 13 15 17 19 • Very similar to Perl array element access • Access by index • The index itself can be a vector, or any type of data element • Can be an expression • Negative indeces denote exclusion
Vector functions seq (“sequence”) Creates a range of values in a vector > x<-seq(1,10,1) > x [1] 1 2 3 4 5 6 7 8 9 10 > x<-4:12 > x [1] 4 5 6 7 8 9 10 11 12 > x<-LETTERS(1:3) > x [1] A B C
Vector functions rep (“replicate”) Generates repeated values - Can be used to generate complex patterns - Can be used to generate data grouping codes > x<-c(10,100,1000) > rep(x,3) [1] 10 100 1000 10 100 1000 10 100 1000 > rep(x,1:3) [1] 10 100 100 1000 1000 1000 > rep(1:2,c(5,10)) [1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Vector functions sort Sorts an array in-place > x<-c(10000,10,1000) > sort(x) [1] 10 1000 10000
Vector functions factor Grouping for categorical data > x<-c(0,1,2,1,2) > fx<-factor(x,levels=0:2) > levels(fx)<-c(“low”,”middle”,”grande”) > fx [1] low middle grande middle grande
Matrices > x<-seq(1,12) > dim(x)<-c(3,4) > x [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > matrix(1:12,nrow=3,byrow=T) [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 • Simply n-dimensional arrays • in R, most everything is an array • Extends elements of any type • Can dynamically set and change dimensions • default matrix dim is by columns
Matrix functions > matrix(1:12,nrow=3,byrow=T) [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > t(x) [,1] [,2] [,3] [1,] 1 5 9 [2,] 2 6 10 [3,] 3 7 11 [4,] 4 8 12 t (“transpose”) Changes rows and columns
Matrix functions > x<-matrix(1:12,nrow=3,byrow=T) > x [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > rownames(x)<-c(“one”,”two”,”three”) > x [,1] [,2] [,3] [,4] one 1 2 3 4 two 5 6 7 8 three 9 10 11 12 rownames Assigns scalars to the row indeces (like a hash)
Matrix functions > x<-matrix(1:12,nrow=3,byrow=T) > x [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > colnames(x)<-c(“one”,”two”,”three”,”four”) > x one two three four [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 colnames Assigns scalars to the column indeces (like a hash)
Matrix functions > x<-matrix(1:12,nrow=3,byrow=T) > x [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > cbind(x,c(111,222,333)) [,1] [,2] [,3] [,4] [,5] [1,] 1 2 3 4 111 [2,] 5 6 7 8 222 [3,] 9 10 11 12 333 cbind Adds (in the agglomerative sense) cols together like XL
Matrix functions > x<-matrix(1:12,nrow=3,byrow=T) > x [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > cbind(x,c(111,222,333,444)) [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 [4,] 111 222 333 444 rbind Adds (in the agglomerative sense) rows together like XL
Object functions > x<-matrix(1:12,nrow=3,byrow=T) > x [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 > y<-c(1:5) > z<-list(matrix=x,vector=y) > z $matrix [,1] [,2] [,3] [,4] ... $vector [1] 1 2 3 4 5 list Combines collections into composite objects - Objects are treated as vectors in R, plus methods - Matrices are collections of vectors
list (object) functions > z$vector [1] 1 2 3 4 5 > z$vector[5] [1] 5 > z$matrix[1,3] [1] 3 > z$vector[z$vector>3] [1] 4 5 indexing Since it's a vector, we can obtain the elements
list (object) functions > x<-c(1:5) > y<-c(6:10) > z<-data.frame(x,y) > z x y 1 1 6 2 2 7 3 3 8 4 4 9 5 5 10 data.frame If the vectors are the same length, we can agglomerate them in a special data matrix - the data is paired, and has unique row names
Reading data from files > myData<-read.table(“example.txt”, header=T) > myData field_one field_two 10.3 1.08 11.2 0.97 … … data.frame / read.table / read.csv / read.delim • The data frame is ideal for handling delimited files • Assumes a header is present • (takes the header to have n-1 entries) • Can handle a wide variety of interfaces with outputs • Tab, comma delimited txt files • SPSS, SAS, Stata, Minitab, S-PLUS v3 files • Works well with DB interface calls as well
Persistence save.image() / .RData / ls() • The workspace is dynamic • Variables and functions are created or loaded • objects() or ls() shows availability of both • Can be saved to a local .RData file using save.image() • .RData loaded by default upon startup • Can specify the .RData (or whatever you name it) workspace using load() (may have to specify pathname!)
data frame (object) functions > x<-c(1:5) > y<-c(6:10) > z<-data.frame(x,y) > subset(z,x>2) x y 3 3 8 4 4 9 5 5 19 subset Allows extraction of a portion of a data frame
data frame (object) functions > x<-c(1:5) > y<-c(6:10) > z<-data.frame(x,y) > transform(z,x.log=log(x)) x x.log 1 1 0.0000000 2 2 0.6931472 3 3 1.0986123 4 4 1.3862944 5 5 1.6094379 transform Allows extension of a data frame
data frame (object) functions > x<-c(1:5) > y<-c(6:10) > z<-data.frame(x,y) > h<-split(z$x,z$y) > h $”1” [1] 6 $”2” [1] 7 $”3” [1] 8 $”4” [1] 9 $”5” [1] 10 split Lists vectors according to group
data frame (object) functions > x<-c(1:5) > y<-c(6:10) > z<-data.frame(x,y) > lapply(z, mean) $x [1] 3 $y [1] 8 lapply Implicit looping over group members
Functions in R very similar to what we've seen in perl! Blocks are the same - Takes arguments - Uses control structures (for, if, while loops, ...) > x<-c(1:5) > my.function<-function(x) { u<-mean(x) } > y<-my.function(x) > y [1] 3
Control structures > myfunction<-function(x) { for (i in 1:10) { do something here } } The variable i will take values of the sequence in turn The range is specified by the sequence for loop Loops over a set range
A stupid function example Just to illustrate passing args back and forth… • > myfun<-function(x) • + { • + X<-x • + for (i in 1:10) • + { • + X<-c(X,i) • + } • + X • + } • > myfun(0) • [1] 0 1 2 3 4 5 6 7 8 9 10
A better function example A function to calculate the two sample t-statistic, showing “all the steps”. (Fromhttp://cran.r-project.org/doc/manuals/R-intro.html) • > twosam <- function(y1, y2) • { • n1 <- length(y1); n2 <- length(y2) • yb1 <- mean(y1); yb2 <- mean(y2) • s1 <- var(y1); s2 <- var(y2) • s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2) • tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2)) • tst • } With this function defined, you could perform two sample t-tests using a call such as: > tstat <- twosam(data$male, data$female); tstat
Control structures > myfunction<-function(x) { while (x>10) { do something here } } The evaluation is tested at the beginning of the loop; Note that in this case, the block may never be executed while loop Loops while an evaluation returns boolean TRUE
Control structures > myfunction<-function(x) { repeat { do something here if (x>10) break } } Uses a conditional if statement; The break is called whenever the boolean evaluation is true and the block is exited repeat loop Loops until told to stop by break
Descriptive statistics > x<-rnorm(100) > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.08800 -0.70850 -0.11480 -0.04413 0.76510 2.89100 > summary() Summary statistics related to a numeric variable
Descriptive statistics > x<-rnorm(100) > y<-rnorm(100) > plot(x,y) > plot(rnorm(500)) > lines(rnorm(500)) plot() Simple x vs. y gram (scatterplot)
Descriptive statistics heatmap generation (image) Scatterplot grid color weighted by intensities… - very useful for microarray analysis (we’ll see next time…) - can be used with dendrogram generation
Statistics of populations • The equations so far are for sample statistics • a statistic is a single number estimated from a sample • We use the sample to make inferences about the population. • a parameter is a single number that summarizes some quality of a variable in a population. • the term for the population mean is (mu), and Ybaris a sample estimator of . • the term for the population standard deviation is (sigma), and s is a sample estimator of . IQR Note that and are both elements of the normal probability curve. Source: http://www.bsos.umd.edu/socy/smartin/601/
Measuring probabilities under the normal curve • We can make transformations by scaling everything with respect to the mean and standard deviation. • Let z = the number of standard deviations above or below the population mean. • z = 0 y = • z = 1 y = +/- (p=0.68) • z = 2 y = +/- 2 (p=0.95) • z = 3 y = +/- 3 (p=0.997) IQR
Plotting using hist() and curve() > y<-hist(h,plot=F) > ylim<-range(0,y$density,dnorm(0)) > hist(x,freq=F,ylim=ylim > curve(dnorm(x),add=T)
Difficult to integrate… But probabilities have been Mapped out to this curve. Transformations from other Curves possible…
> qqnorm(x) Plotting using qqnorm()
Box plots (box and whiskers plots, Tukey, 1977) > boxplot(x) > boxplot(log(x)) min((Q3+1.5(IQR)),largest X) Outliers Fence / whiskers Q3 Median IQR Q1 Fence / whiskers max((Q1+1.5(IQR)),smallest X) Plotting using boxplot()
My advice First learn to program in R. Then use the R libraries. Everything in R can be built up piecewise • The data is made of component parts • It’s extremely useful to know how to handle the objects • The graphics are made of component parts • This allows extreme fine-tuning of your visualization! • Go beyond scatterplots and barplots to describe complex data well and visualize hidden trends • A good reference is Data Visualization by Edward Tufte.
Homework A simple problem, but one we may use frequently Use lapply (or sapply) to simulate the result of taking the mean of 100 random numbers from the normal distribution for 10 independent samples.