760 likes | 904 Views
Welcome to lecture 7: An introduction to bioconductor. IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA. What is bioconductor?. From the bioconductor webpage ( www.bioconductor.org ):
E N D
Welcome to lecture 7:An introduction to bioconductor IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA
What is bioconductor? From the bioconductor webpage (www.bioconductor.org): • Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data. • The project was started in the Fall of 2001. The Bioconductor core team is based primarily at the Fred Hutchinson Cancer Research Center. Other members come from various US and international institutions. • Bioconductor is primarily based on the R programming language but we do accept contributions in any programming language. There are two releases of Bioconductor every year (they appear shortly after the corresponding R release). At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition there are a large number of meta-data packages available. They are mainly, but not solely oriented towards different types of microarrays.
Where to get bioconductor? From the bioconductor website: Right now, the easiest way to install BioConductor packages is using the biocLite.R installation script. Using the R CLI, simply type the following: >source("http://www.bioconductor.org/biocLite.R") >biocLite() This will install a subset of packages by default. Some optional parameters are accepted by the script: • pkgs • character vector of Bioconductor packages to install. • destdir • the directory where the downloaded packages will be stored. • lib • character vector giving the library directories under which packages may be installed. Recycled as needed.
Bioconductor principles • The following slides were compiled from the bioconductor vignettes and short courses available through the bioconductor website. • Of special note are the presentations by Sandrine Dudoit and Robert Gentleman, 2002 (http://www.bioconductor.org/workshops/Summer02Course/RBioC/RBioC.pdf) and the EMBO short course on microarray analysis, with emphasis on the bioconductor project (http://www.bioconductor.org/workshops/EMBO03), also by Sandrine Dudoit and Robert Gentleman, 2003.
Bioconductor design • Object-oriented class/method design. • Allows efficient representation and manipulation of large and complex biological datasets of multiple types. • Widgets. • Specific, small scale, interactive components providing graphically driven analyses - point & click interface.
Bioconductor tools • Interactive tools for linking experimental results to annotation and literature WWW resources in real time. • E.g. PubMed, GenBank, LocusLink. • Scenario: For a list of differentially expressed genes obtained from multtest or genefilter, use the annotate package to retrieve PubMed abstracts for these genes and to generate an HTML report with links to LocusLink for each gene.
Bioconductor packages • General infrastructure • – Biobase • – annotate, AnnBuilder • – tkWidgets. • Pre-processing for Affymetrix data • – affy • • Pre-processing for cDNA data • – marrayClasses, marrayInput, marrayNorm, marrayPlots • • Differential expression
Bioconductor training • R vignettes system • – comprehensive repository of step-by-step tutorials covering a wide variety of computational objectives in /doc subdirectory; • – integrated statistical documents intermixing text, code, and code output (textual and graphical); • – documents can be automatically updated if either data or analyses are changed. • • Modular training segments • – short courses: lectures and computer labs; • – interactive learning and experimentation with the software • platform and statistical methodology.
R programming, extended • In order to deliver high quality software the Bioconductor project relies on a few programming techniques that might not be familiar: • – environments and closures; • – object oriented programming. • • We review these here for interested programmers (understanding them is not essential but is often very helpful).
Environments and closures • An environment is an object that contains bindings between symbols and values. • It is very similar to a hash table. • Environments can be accessed using the following functions: • – ls(env=e) # get a listing. • – get(“x”, env=e) # get the value of the object in e • with name x. • – assign(“x”,y,env=e) # assign to the name x • the value y in the environment e.
Environments and closures • Since these operations are used a great deal in Bioconductor we have provided two helper functions: • – multiget • – multiassign • • These functions get and assign multiple values into the specified environment.
Environments and closures • Environments can be associated with functions: • • When an environment is associated with a function, then that environment is used to obtain values for any unbound variables. • • The term closure refers to the coupling of the function body with the enclosing environment. • • The annotate, genefilter, and other packages take advantage of environments
Environments and closures • > x <- 4 • > e1 <- new.env() • > assign(“x”,10, env=e1) • > f <- function() x • > environment(f) <- e1 • > x # returns 4 • > f() # returns 10!
OOP • The Bioconductor project has adopted the OOP paradigm presented in Programming with Data, J. M. Chambers, 1998. • • Tools for programming using the class/method mechanism are provided in the methods package.
OOP • A class provides a software abstraction of a real world object. It reflects how we think of certain objects and what information these objects should contain. • • A class defines the structure, inheritance, and initialization of objects. • • Classes are defined in terms of slots which contain the relevant data. • • An object is an instance of a class.
OOP • A method is a function that performs an action on data (objects). • • A generic function is a dispatcher, it examines its arguments and determines the appropriate method to invoke. • • Examples of generic functions include • plot • summary • print
OOP • To obtain documentation (on-line help) about: • – a class: class?classname • so, class?exprSet, will display the help file for the exprSet class. • – a method: methods?methodname • so, methods?print, will display the help file for the print methods.
OOP • The methods package contains a number of functions for defining new classes and methods (e.g. setClass, setMethod) and for working with these classes and methods. • • A tutorial is available at • http://www.omegahat.org/RSMethods/index.html
OOP • > setClass(“simple", • representation(x="numeric",y="matrix“), • prototype = list(x=numeric(),y=matrix(0))) • > z <- new("simple", x=1:10, • y=matrix(rnorm(50),10,5)) • > z@x • [1] 1 2 3 4 5 6 7 8 9 10
Biobase • The Biobase package provides class definitions and other infrastructure tools that will be used by other packages. • • The two important classes defined in Biobase are: • – phenoData: sample level covariate data. • – exprSet: the sample level covariate data combined with the expression data and a few other quantities of interest.
Biobase: exprSet • Slots for the exprSet class: • • exprs: a matrix of expression measures • genes are rows, samples are columns. • • se.exprs: standard errors for the expression measures, if available. • • phenoData: an object of class phenoData that describes the samples. • • annotation: a character vector. • • description: a character vector. • • notes: a character vector.
Biobase: exprSet • One of the most important tasks is to align the expression data and the phenotypic data (and to keep that alignment through the analysis). • • To achieve this, the exprSet class combines these two data sources into one object, and provides subsetting and access methods that make it easy to manipulate the data while ensuring that they are correctly aligned.
Biobase: exprSet • A design principle that was adopted for the exprSet and other classes was that they should be closed under the subset • operation. • • So any subsetting, either of rows or columns, will return a valid exprSet object. • • This makes it easier to use exprSet in other software packages
Biobase: exprSet • Some methods for the exprSet class: • • show: controls the printing (you seldom want a few hundred thousand numbers rolling by). • • subset, [ and $, are both designed to keep correct subsets of the exprs, se.exprs, and phenoData objects. • • split, splits the exprSet into two or more parts depending on the vector used for splitting.
Biobase: exprSet • Some methods for the exprSet class: • • geneNames, retrieves the gene names (row names of exprs). • • phenoData, pData, and sampleNames provide access to the phenoData slot. • • write.exprs, writes the expression values to a file for processing or storage.
Biobase: phenoData • Slots for the phenoData class: • • pData: a dataframe, where the samples • are rows and the variables are columns • (this is the standard format). • • varLabels: a vector containing the • variable names (as they appear in pData) • and a longer description of the variables.
Biobase: phenoData • Methods for the phenoData class include: • – [, the subset operator; • this method ensures that when a subset is taken, both the pData object and the varLabels object have the appropriate subsets taken. • – $, extracts the appropriate column of the • pData slot (as for a dataframe). • – show, a method to control printing, we show • only the varLabels (and the size).
Biobase: exprSet • Some methods for the exprSet class: • • show: controls the printing (you seldom want a few hundred thousand numbers rolling by). • • subset, [ and $, are both designed to keep correct subsets of the exprs, se.exprs, and phenoData objects. • • split, splits the exprSet into two or more parts depending on the vector used for splitting.
OOP • A method is a function that performs an action on data (objects). • • A generic function is a dispatcher, it examines its arguments and determines the appropriate method to invoke. • • Examples of generic functions include • plot • summary • print
Let’s take an example simple affymetrix analysis using the Golub data set • AML/ALL leukemia dataset (1999) ALL ALL 47 ALL TEST AML AML Train 25 AML Hu6800 GeneChip
Affymetrix GeneChip Design 3’ 5’ Reference Sequence …TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT… CCCTTACCCAGTCTTCCGGAGGCTA Perfect Match CCCTTACCCAGTGTTCCGGAGGCTA Mismatch
Terminology • Each gene or portion of a gene is represented by 1q to 20 oligonucleotides of 25 base-pairs. • • Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer. • • Perfect match (PM): A 25-mer complementary to a reference sequence of interest • (e.g., part of a gene). • • Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <->pyrimidine, G <->C, A <->T) . • • Probe-pair: a (PM,MM) pair. • • Probe-pair set: a collection of probe-pairs (1q to 20) related to a common gene or fraction of a gene. • • Affy ID: an identifier for a probe-pair set. • The purpose of the MM probe design is to measure nonspecific binding and background noise.
Affymetrix files Main software from Affymetrix company MicroArray Suite - MAS, now version 5. • DAT file: Image file, ~10^7 pixels, ~50 MB. • CEL file: Cell intensity file, probe level PM and MM values. • CDF file: Chip Description File. Describes which probes go in which probe sets and the location of probe-pair sets (genes, gene fragments, ESTs).
Expression Measures 10-20K genes represented by 11-20 pairs of probe intensities (PM & MM) • Obtain expression measure for each gene on each array by summarizing these pairs • Background adjustment and normalization are important issues • There are many methods (MAS, RMA, Li/Wong, …)
Let’s take an example the Golub training data set We can access this as built-in data • For a description of the Golub et al. (1999) dataset type ? golubTrain and to load these data type: • > library(golubEsets) • > data(golubTrain) • > data(golubTest)
affy: oligo arrays case study • Let’s look at some of the main functions in the affy package for pre-processing • Affymetrix microarray data. A number of sample datasets are available in the package; to list these, type data(package="affy"). To load the package: • > library(affy) • The function ReadAffy is available for reading CEL files. • One of the main classes in affy is the AffyBatch class. Other classes include ProbeSet (PM and MM intensities for individual probe sets), Cdf (information contained in a CDF file), and Cel (single array cel intensity data). • The object Dilution is an instance of the class AffyBatch.
affy: oligo arrays case study • > class(Dilution) • [1] "AffyBatch" • > slotNames(Dilution) • [1] "cdfName" "nrow" "ncol" "exprs" "se.exprs" • [6] "phenoData" "description" "annotation" "notes" • > Dilution • AffyBatch object • size of arrays=640x640 features (12805 kb) • cdf=HG_U95Av2 (12625 affyids) • number of samples=4 • number of genes=12625 • annotation=hgu95av2 • notes= • > annotation(Dilution) • [1] "hgu95av2"
affy: oligo arrays case study • For a description of the target samples hybridized to the arrays • > phenoData(Dilution) • phenoData object with 3 variables and 4 cases • varLabels • liver: amount of liver RNA hybridized to array in micrograms • sn19: amount of central nervous system RNA hybridized to array in micrograms • scanner: ID number of scanner used • > pData(Dilution) • liver sn19 scanner • 20A 20 0 1 • 20B 20 0 2 • 10A 10 0 1 • 10B 10 0 2
affy: oligo arrays case study • The exprs slot contains a matrix with columns corresponding to arrays and rows to • individual probes on the array. To obtain the matrix of intensities for all four arrays: • > e <- exprs(Dilution) • > nrow(Dilution) * ncol(Dilution) • [1] 409600 • > dim(e) • [1] 409600 4 • You can access probe-level PM and MM intensities using: • > PM <- pm(Dilution) • > dim(PM) • [1] 201800 4 • > PM[1:5, ] • 20A 20B 10A 10B • 1000_at1 468.8 282.3 433.0 198.0 • 1000_at2 430.0 265.0 308.5 192.8 • 1000_at3 182.3 115.0 138.0 86.3 • 1000_at4 930.0 588.0 752.8 392.5 • 1000_at5 171.0 128.0 152.3 97.8
affy: oligo arrays case study • To get the probe set names (Affy IDs) • > gnames <- geneNames(Dilution) • > length(gnames) • [1] 12625 • > gnames[1:5] • [1] "1000_at" "1001_at" "1002_f_at" "1003_s_at" "1004_at" • > nrow(e)/length(gnames) • [1] 32.44356 • As with other microarray objects in Bioconductor packages, you can use subsetting • commands for AffyBatch objects • > dil1 <- Dilution[1] • > class(dil1) • [1] "AffyBatch" • > dil1 • AffyBatch object • size of arrays=640x640 features (3204 kb) • cdf=HG_U95Av2 (12625 affyids) • number of samples=1 • number of genes=12625 • annotation=hgu95av2 • notes=
affy: oligo arrays case study • To produce boxplots of probe log intensities: • > boxplot(Dilution, col = c(2, 2, 3, 3))
affy: oligo arrays case study • The boxplots and density plots show that the Dilution data needs normalization. As • described in the dataset help file and in the phenoData slot (pData(Dilution)), two concentrations of mRNA were used and, for each concentration, two scanners were used. • From the plots, we note that scanner effects seem stronger than concentration effects (different colors). Arrays that should be the same are different; arrays that should be different are similar. Because different mRNA concentrations were used, we perform probe-level normalization within concentration groups: • > Dil20 <- normalize(Dilution[1:2]) • > Dil10 <- normalize(Dilution[3:4]) • > normDil <- merge(Dil20, Dil10) • Notice how the boxplot now looks better: • > boxplot(normDil, col = c(2, 2, 3, 3))
affy: oligo arrays case study • Having access to PM and MM data can be useful. Let’s look at a plot of PM vs.MM: • > plot(mm(Dilution[1]), pm(Dilution[1]), pch = ".", log = "xy") • > abline(0, 1, col = "red")
Case study: Golub et. al., 1999 • The microarray expression measures and clinical variables corresponding to each of • the 38 training mRNA samples are stored in an object of class exprSet. For information • on the training data golubTrain: • > class(golubTrain) • [1] "exprSet“ • > slotNames(golubTrain) • [1] "exprs" "se.exprs" "phenoData" "description" "annotation" • [6] "notes“ • > golubTrain • Expression Set (exprSet) with • 7129 genes • 38 samples • phenoData object with 11 variables and 38 cases • varLabels • Samples: Samples • ALL.AML: ALL.AML • BM.PB: BM.PB • T.B.cell: T.B.cell • …
Naïve differential expression: Golub et. al., 1999 • We will illustrate some of the tools for reasoning about differential expression. • First, we compute two-sample t statistics in this restricted dataset: • > library(genefilter) • > ALLinds <- which(small$ALL.AML == "ALL") • > AMLinds <- which(small$ALL.AML == "AML") • > myt <- fastT(exprs(small), ALLinds, AMLinds) • We will focus on genes for which the • signed t statistic exceeds 2: • > bigT <- myt$z > 2 • > heatmap(exprs(small)[bigT, ])
Case study: Golub et. al., 1999 • The phenotypic data are stored in a separate, but linked, object of class phenoData. • An object of class phenoData is a combination of a data frame containing a number of variables for each array and a list that explains what each variable represents. • > phenoTrain <- phenoData(golubTrain) • > class(phenoTrain) • [1] "phenoData" • > slotNames(phenoTrain) • [1] "pData" "varLabels" • > varLabels(phenoTrain) • $Samples • [1] "Samples" • $ALL.AML • [1] "ALL.AML" • … • > pData(phenoTrain) • Samples ALL.AML BM.PB T.B.cell FAB Date Gender pctBlasts Treatment • 1 1 ALL BM B-cell <NA> 9/4/1996 M NA <NA> • …
Case study: Golub et. al., 1999 • The $ operator can be used to extract particular variables from an object of class phenoData. It also can be used directly on the exprSet instance: • > table(phenoTrain$ALL.AML) • ALL AML • 27 11 • > table(golubTest$ALL.AML) • ALL AML • 20 14 • Data on only the first 10 genes in the first 3 chips can be obtained using the subsetting operator "[“ • > golubTrain[1:10, 1:3] • Expression Set (exprSet) with • 10 genes • 3 samples • phenoData object with 11 variables and 3 cases • …
Annotation: Golub et. al., 1999 • One of the largest challenges in analyzing genomic data is associating the experimental data with the available biological metadata, e.g., sequence, gene annotation, • chromosomal maps, literature. • Bioconductor provides two main packages for this purpose: • annotate (end-user) • AnnBuilder (developer) • These packages can be used for querying databases such as GenBank, GO, LocusLink, and PubMed, from R, and • for processing the query results in R.