E N D
1. Introduction to R + Bioconductor J. Verducci
MBI Summer Workshop
August, 2005
2. Outline Advantages & Disadvantages of R+BioC
Getting Started: Setup and Basic Objects
Preprocessing: from probe-level to expressions
Differential Gene Expression in +8 AML
Interpretation via GO
3. Advantages of R + Bioconductor Most Extensive and Flexible Environment
Oligonucleotide & cDNA Arrays
Multiple methods for
Background correction
Probe-specific correction
Normalization over multiple chips
State of the Art Routines
Open Source
Best Platform for Development
FREE!
4. Disadvantages LARGE startup/learning cost in time
No warrantee
Informal Support
Needs LOTS of memory
Code typically not optimized
5. Getting Started Background
Basics of Affymetrics Chip Design
Need to correct for Background Noise
Need for Normalization
NCI Expression Data on +8 subtype of AML
The R computing environment
Installation for Windows
Bioconductor
Installation
Biobase Package
Affy CEL and CDF files
Affy Package
ReadAffy
Expresso
6. Affymetrics Chip System Probes
Oligionucleotides 25 bp in length (~1030 possible)
Attach to different segments of each RNA target
Probe Pair
Perfect Match (PM) probe: exact match to target
Mismatch (MM) probe: middle base pair mismatch
Probe Pair Sets
16 probe pairs for a specific RNA
Ordered from 5 to 3 end of the RNA
Files
CEL: probe level PMs and MMs related to a common affyID (one CEL file for chip)
CDF: relates probe pair sets to location on the array (same map for all chips used in an experiment)
7. Calibration Issues Global Background Correction
Spatial flaws in chip (dust, scratches, etc.)
Variations in binding
due to sample preparation
due to variations in number of G-C bonds
due to positioning of probe along length of target
Probe Specific Background Correction
A MM probe may be similar to other PM probes, some corresponding to highly expressed genes, making MM > PM
Sample may have contained DNA fragments
Normalization to compensate for chip to chip variations
8. Krahe AML Data Acute myeloid leukemia (AML) is a heterogeneous group of diseases.
Normal cytogenetics (CN) constitutes the single largest group
trisomy 8 (+8) as a sole abnormality is the most frequent trisomy
Control Group
CD34(+) cells purified from normal bone marrow (BM) were also analyzed as a representative heterogeneous population of stem and progenitor cells.
Results
Expression patterns of AML patients were clearly distinct from those of CD34(+) cells of normal individuals.
AML+8 blasts overexpress genes on chromosome 8, estimated at 32% on average, suggesting gene-dosage effects underlying AML+8.
Systematic analysis by cellular function indicated
up-regulation of genes involved in cell adhesion in both groups of AML compared with CD34(+) blasts from normal individuals
apoptosis-regulating genes were significantly down-regulated in AML+8 compared with AML-CN
Clinical and cytogenetic heterogeneity of AML is due to fundamental biological differences.
9. The R Computing Environment Download from
http://cran.r-project.org/bin/windows/base/
CRAN
Comprehensive R Archive Network
Site for introduction, help, and news
Object oriented programming
S developed by John Chambers, AT&T
Commercial package Splus (7.0) from Insightful
Freeware R (2.1.0) supported by CRAN
10. Bioconductor Installation From your R session, type:source("http://www.bioconductor.org/getBioC.R") This will download the getBioC functionality into your R session.
To install Bioconductor packages, use the function "getBioC" as follows.
To download and install the default packages, type: getBioC().
Note: despite documentation, tkWidgets [not really needed] must be loaded separately
Win32 package download
Remove the folder tkWidgets from tkWidgets_1.5.20 to unzip.
From the Packages menu, pull down Load package
From the new window select affy
This automatically loads the required additional packages
Biobase
Tools
reposTools
11. Biobase Toy data
geneData, data frame of real, anonymized, expression data
500 rows (genes)
26 cases (chips)
eset, same information saved as an expression set
To make these accessible, at the command sign > , type
data(geneData)
data(eset)
12. Expression Set Fundamental Object used for most analyses
Contains more stored information than a simple Data frame
Information is stored in slots. To access these type expr.set@slot name (e.g eset@exprs)
slotNames(eset)
"exprs" 500 x 26 matrix of expressions
"se.exprs" estimated std. errors
"phenoData" info on covariates assoc. w cases
"description" usually contains MIAME information
"annotation" links affy names of genes to other systems
"notes"
13. Reading in CEL files Create a directory
Move all relevant CEL files to that directory
This will unzip files that have been downloaded
Select Change Dir from the File menu to point to the CEL file directory
Load the affy package [if not already done; you can check using >search()]
Read in the data using
nciAML <- ReadAffy()
The symbol <- is the assignment operator in R.
nciAML is an AffyBatch object
This may require a lot of memory (at least 1G for Krahe AML data)
Quick alternative, requiring less space, is to go directly to an expression file: >AMLeset <- justRMA()
14. AffyBatch Example: nciAML slots nciAML@cdfName Hu6800
nciAML@nrow -- 536 common chip dimension
nciAML@ncol -- 536 common chip dimension
nciAML@exprs <287,296 x 31> matrix
Probe values for 31 chips
nciAML@se.exprs -- <0 x 0 matrix>
Not yet processed
nciAML@phenoData phenoData object with 1 covariate, which gives type of cell for each chip
Chips 1-9: AML-CN
Chips 10-23: AML+8
Chips 24-31: CD34 bone marrow stem cells from normals
15. Preprocessing Background correction
Adjust for spatial biases on each chip
Normalization
Adjust for Chip to Chip technical effects
Probe specific background correction
Adjust PM for MM
Summarize adjusted probe set values into one expression measure (and s.e.)
16. Expresso Combines 4 preprocessing steps:
nciAMLeset <- expresso(nciAML, bgcorrect.method=rma, normalize.method=quantiles, pmcorrect.method=pmonly, summary.method=medianpolish)
Alternatively, preprocessing may be done one step at a time, examining effects of each step (see Exploration of Probe Level Data)
17. Method Options > bgcorrect.methods
mas
none
rma
rma2
> normalize.methods(nciAML)
constant
contrasts
invariantset
loess
qspline
quantiles
quantiles.robust
18. Method Options (continued) > pmcorrect.methods
mas
pmonly
subtractmm
> express.summary.stat.methods
avgdiff
liwong
mas
medianpolish
playerout
19. Generic Versionsof Commercial Alternatives MAS 5.0 (generic version of Affymetrics)
>AMLeset5 <- mas5(nciAML)
>AMLcalls <- mas5calls(nciAML)
dchip (Li and Wongs MBEI)
>AMLdchip <- expresso(nciAML, normalize.method=invariantset, bg.correct=FALSE, pmcorrect.method=pmonly, summary.method=liwong)
20. Exploration of Probe Level Data Its good practice to look at the data in a variety of ways.
Can check on outliers, patterns of bias, need for log transform, etc.
Possible Actions
Correct obvious problems directly
Use standard correction methods appropriate to the data
21. Chip Imagesimage(nciAML)
22. Odd AML+8 Chip
23. ProbeSet class of objects Organizes information from an AffyBatch object into pm and mm matrices, one for each gene.
>gn <- geneNames(affybatch.example)
>ps <- probeset(affybatch.example,gn)
>show(ps[[1]])
ProbeSet object:
id=A28102_at
pm= 16 probes x 3 chips
24. Toy Data (AffyData) barplot.ProbeSet(ps[[1]])
25. Another gene
26. Compare Pairs of Chipspairs.AffyBatch(affybatch.example)
27. The M vs A plot CAGE Xylella Project (Apr/2004) The so-called M-A-plot is a graphical way to see ratios and fluorescence intensity at the same time. It was proposed by: Dudoit et al. Statistica Sinica (2002) 12:111.
The merits are:
it shows non-linear unwanted dependence between ratios and fluorescence intensities and
it shows that using only ratios is a naive way to identify differentially expressed genes.
As defined for two color cDNA arrays in that work: A = 1/2*(log2(Cy5) + log2(Cy3)) M = log2(Cy5 / Cy3)
For Affy arrays,
Cy3 is the reference value, which is the median value of a pm over all chips chosen from a homogeneous group
Cy5 is the pm value on the chip of interest
Using just the ratios or log ratios to visualize the data does not enable us to see the systematic dependence of the ratio on intensity values.
The final result from micro-array should depend only on biological material hybridized and not on the strength of the hybridization signal.
The M-A plot, such as the one in the figure on the next slide, shows if it is necessary to use a non-linear fitting for normalization.
28. Lowess Normalization
29. Chip Bias May Depend on Level
30. Or Not
31. Determining Differential Expression Usually just 2 groups, but sometimes more
Assume Expression Set contains all relevant information
Most direct approach:
Use (some sort of) t-test for each gene
Correct for multiple comparisons (e.g. by restricting the FDR)
Verify statistically significant genes by rtPCR
32. Danger of Using Fold-Difference Using just the ratios would not enable us to see that (corrected) ratios of low-intensity spots are not reliable.
R = 100/10 = 10-fold difference has not the same experimental value of
R' = 1000/100 = 10-fold too, for example.
This is obvious from M-A-plot of "self-self" hybridizations (same biological material labeled with both Cy3 and Cy5), as shown in the figure below:
33. t-test The t-test is the most basic and well-known hypothesis test of Statistics.
For each gene, compares the difference:
(mean expression in group A) (mean expression in group B)
Relative to levels of within-group variability
Some methods try to incorporate std err of the expression values
Some analysts use Bayesian methods to expand within-group variability, especially for low-level expression values
Typically this expression in the expression data set is measured on a log scale, so differences in logs correspond to log ratio
34. Multiple Comparisonsvia Limmahttp://bioinf.wehi.edu.au/limma/usersguide.pdf Load package limma from the Package menu.
>aml.type <- c(rep(1,9),rep(0,14))
>design<-model.matrix(~aml.type)
>colnames(design) <- c("CN","+8")
>fit <- lmFit(AMLeset[,1:23],design)
>results <- decideTests(fit, adjust="fdr", p=0.05)
To select all up genes:
>fit.up <- fit[results[,2] ==1, ]
To select all down genes:
>fit.down <- fit[results[,2]== -1, ]
To output to a file:
>write.fit(fit.up[,2], file="myresults.txt")
or simply
>write(fit.up[,2], file="myresults.txt")
35. Heatmap of Selected Genes
36. Heatmap of Genes from eset
37. More Advanced MethodsIncorporating Gene Correlations Elastic Net
Zou & Hastie, 2005
http://cran.r-project.org/src/contrib/PACKAGES.html
Gradient Directed Regularization
Friedman & Popescu, 2004
http://www-stat.stanford.edu/~jhf/PathSeeker.html
Shrunken Centroid Ordering by Othogonal Projections (SCOOP)
Verducci, 200?
E-mail me (verducci.1@osu.edu) for alpha version
38. Connecting to Gene Ontology OntologyTraverser: an R package for GO analysis
A. Young , N. Whitehouse , J. Cho and C. Shaw *
Baylor College of Medicine, Department of Molecular and Human Genetics Houston, TX, USA
Summary: Gene Ontology (GO) annotations have become a major tool for analysis of genome-scale experiments. We have created OntologyTraverseran R package for GO analysis of gene lists. Our system is a major advance over previous work because
the system can be installed as an R package,
the system uses Java to instantiate the GO structure and the SJava system to integrate R and Java
the system is also deployed as a publicly available web tool.
Availability: Our software is academically available through http://franklin.imgen.bcm.tmc.edu/OntologyTraverser/. Both the R package and the web tool are accessible.
Contact: cashaw@bcm.tmc.edu