Introduction to R Bioconductor

1. Introduction to R + Bioconductor J. Verducci MBI Summer Workshop August, 2005

2. Outline Advantages & Disadvantages of R+BioC Getting Started: Setup and Basic Objects Preprocessing: from probe-level to expressions Differential Gene Expression in +8 AML Interpretation via GO

3. Advantages of R + Bioconductor Most Extensive and Flexible Environment Oligonucleotide & cDNA Arrays Multiple methods for Background correction Probe-specific correction Normalization over multiple chips State of the Art Routines Open Source Best Platform for Development FREE!

4. Disadvantages LARGE startup/learning cost in time No warrantee Informal Support Needs LOTS of memory Code typically not optimized

5. Getting Started Background Basics of Affymetrics Chip Design Need to correct for Background Noise Need for Normalization NCI Expression Data on +8 subtype of AML The R computing environment Installation for Windows Bioconductor Installation Biobase Package Affy CEL and CDF files Affy Package ReadAffy Expresso

6. Affymetrics Chip System Probes Oligionucleotides 25 bp in length (~1030 possible) Attach to different segments of each RNA target Probe Pair Perfect Match (PM) probe: exact match to target Mismatch (MM) probe: middle base pair mismatch Probe Pair Sets 16 probe pairs for a specific RNA Ordered from 5� to 3� end of the RNA Files CEL: probe level PMs and MMs related to a common affyID (one CEL file for chip) CDF: relates probe pair sets to location on the array (same map for all chips used in an experiment)

7. Calibration Issues Global Background Correction Spatial flaws in chip (dust, scratches, etc.) Variations in binding due to sample preparation due to variations in number of G-C bonds due to positioning of probe along length of target Probe Specific Background Correction A MM probe may be similar to other PM probes, some corresponding to highly expressed genes, making MM > PM Sample may have contained DNA fragments Normalization to compensate for chip to chip variations

8. Krahe AML Data Acute myeloid leukemia (AML) is a heterogeneous group of diseases. Normal cytogenetics (CN) constitutes the single largest group trisomy 8 (+8) as a sole abnormality is the most frequent trisomy Control Group CD34(+) cells purified from normal bone marrow (BM) were also analyzed as a representative heterogeneous population of stem and progenitor cells. Results Expression patterns of AML patients were clearly distinct from those of CD34(+) cells of normal individuals. AML+8 blasts overexpress genes on chromosome 8, estimated at 32% on average, suggesting gene-dosage effects underlying AML+8. Systematic analysis by cellular function indicated up-regulation of genes involved in cell adhesion in both groups of AML compared with CD34(+) blasts from normal individuals apoptosis-regulating genes were significantly down-regulated in AML+8 compared with AML-CN Clinical and cytogenetic heterogeneity of AML is due to fundamental biological differences.�

9. The R Computing Environment Download from http://cran.r-project.org/bin/windows/base/ CRAN Comprehensive R Archive Network Site for introduction, help, and news Object oriented programming S developed by John Chambers, AT&T Commercial package Splus (7.0) from Insightful Freeware R (2.1.0) supported by CRAN

10. Bioconductor Installation From your R session, type:source("http://www.bioconductor.org/getBioC.R") This will download the getBioC functionality into your R session. To install Bioconductor packages, use the function "getBioC" as follows. To download and install the default packages, type: getBioC(). Note: despite documentation, tkWidgets [not really needed] must be loaded separately Win32 package download Remove the folder tkWidgets from tkWidgets_1.5.20 to unzip. From the Packages menu, pull down �Load package� From the new window select �affy� This automatically loads the required additional packages Biobase Tools reposTools

11. Biobase �Toy� data geneData, data frame of real, anonymized, expression data 500 rows (genes) 26 cases (chips) eset, same information saved as an expression set To make these accessible, at the command sign �>� , type data(geneData) data(eset)

12. Expression Set Fundamental Object used for most analyses Contains more stored information than a simple Data frame Information is stored in �slots�. To access these type �expr.set�@�slot name� (e.g eset@exprs) slotNames(eset) "exprs" 500 x 26 matrix of expressions "se.exprs" estimated std. errors "phenoData" info on covariates assoc. w cases "description" usually contains MIAME information "annotation" links affy names of genes to other systems "notes"

13. Reading in CEL files Create a directory Move all relevant CEL files to that directory This will unzip files that have been downloaded Select �Change Dir� from the File menu to point to the CEL file directory Load the affy package [if not already done; you can check using >search()] Read in the data using nciAML <- ReadAffy() The symbol �<-� is the assignment operator in R. nciAML is an �AffyBatch� object This may require a lot of memory (at least 1G for Krahe AML data) Quick alternative, requiring less space, is to go directly to an expression file: >AMLeset <- justRMA()

14. AffyBatch Example: nciAML slots nciAML@cdfName � �Hu6800� nciAML@nrow -- 536 common chip dimension nciAML@ncol -- 536 common chip dimension nciAML@exprs � <287,296 x 31> matrix Probe values for 31 chips nciAML@se.exprs -- <0 x 0 matrix> Not yet processed nciAML@phenoData � phenoData object with 1 covariate, which gives type of cell for each chip Chips 1-9: AML-CN Chips 10-23: AML+8 Chips 24-31: CD34 bone marrow stem cells from normals

15. Preprocessing Background correction Adjust for spatial biases on each chip Normalization Adjust for Chip to Chip technical effects Probe specific background correction Adjust PM for MM Summarize adjusted probe set values into one expression measure (and s.e.)

16. Expresso Combines 4 preprocessing steps: nciAMLeset <- expresso(nciAML, bgcorrect.method=�rma�, normalize.method=�quantiles�, pmcorrect.method=�pmonly�, summary.method=�medianpolish�) Alternatively, preprocessing may be done one step at a time, examining effects of each step (see Exploration of Probe Level Data)

17. Method Options > bgcorrect.methods mas none rma rma2 > normalize.methods(nciAML) constant contrasts invariantset loess qspline quantiles quantiles.robust

18. Method Options (continued) > pmcorrect.methods mas pmonly subtractmm > express.summary.stat.methods avgdiff liwong mas medianpolish playerout

19. Generic Versionsof Commercial Alternatives MAS 5.0 (generic version of Affymetrics) >AMLeset5 <- mas5(nciAML) >AMLcalls <- mas5calls(nciAML) dchip (Li and Wong�s MBEI) >AMLdchip <- expresso(nciAML, normalize.method=�invariantset�, bg.correct=FALSE, pmcorrect.method=�pmonly�, summary.method=�liwong�)

20. Exploration of Probe Level Data It�s good practice to look at the data in a variety of ways. Can check on outliers, patterns of bias, need for log transform, etc. Possible Actions Correct obvious problems directly Use �standard� correction methods appropriate to the data

21. Chip Imagesimage(nciAML)

22. Odd AML+8 Chip

23. ProbeSet class of objects Organizes information from an AffyBatch object into pm and mm matrices, one for each gene. >gn <- geneNames(affybatch.example) >ps <- probeset(affybatch.example,gn) >show(ps[[1]]) ProbeSet object: id=A28102_at pm= 16 probes x 3 chips

24. �Toy� Data (AffyData) barplot.ProbeSet(ps[[1]])

25. Another gene

26. Compare Pairs of Chipspairs.AffyBatch(affybatch.example)

27. The M vs A plot CAGE Xylella Project (Apr/2004) The so-called M-A-plot is a graphical way to see ratios and fluorescence intensity at the same time. It was proposed by: Dudoit et al. Statistica Sinica (2002) 12:111. The merits are: it shows non-linear unwanted dependence between ratios and fluorescence intensities and it shows that using only ratios is a naive way to identify differentially expressed genes. As defined for two color cDNA arrays in that work: A = 1/2*(log2(Cy5) + log2(Cy3)) M = log2(Cy5 / Cy3) For Affy arrays, �Cy3� is the reference value, which is the median value of a pm over all chips chosen from a homogeneous group �Cy5� is the pm value on the chip of interest Using just the ratios or log ratios to visualize the data does not enable us to see the systematic dependence of the ratio on intensity values. The final result from micro-array should depend only on biological material hybridized and not on the strength of the hybridization signal. The M-A plot, such as the one in the figure on the next slide, shows if it is necessary to use a non-linear fitting for normalization.

28. Lowess Normalization

29. Chip Bias May Depend on Level

30. Or Not

31. Determining Differential Expression Usually just 2 groups, but sometimes more Assume Expression Set contains all relevant information Most direct approach: Use (some sort of) t-test for each gene Correct for multiple comparisons (e.g. by restricting the FDR) Verify �statistically significant� genes by rtPCR

32. Danger of Using Fold-Difference Using just the ratios would not enable us to see that (corrected) ratios of low-intensity spots are not reliable. R = 100/10 = 10-fold difference has not the same experimental value of R' = 1000/100 = 10-fold too, for example. This is obvious from M-A-plot of "self-self" hybridizations (same biological material labeled with both Cy3 and Cy5), as shown in the figure below:

33. t-test The t-test is the most basic and well-known hypothesis test of Statistics. For each gene, compares the difference: (mean expression in group A) � (mean expression in group B) Relative to levels of within-group variability Some methods try to incorporate std err of the expression values Some analysts use Bayesian methods to expand within-group variability, especially for low-level expression values Typically this expression in the expression data set is measured on a log scale, so differences in logs correspond to log ratio

34. Multiple Comparisonsvia Limmahttp://bioinf.wehi.edu.au/limma/usersguide.pdf Load package �limma� from the Package menu. >aml.type <- c(rep(1,9),rep(0,14)) >design<-model.matrix(~aml.type) >colnames(design) <- c("CN","+8") >fit <- lmFit(AMLeset[,1:23],design) >results <- decideTests(fit, adjust="fdr", p=0.05) To select all up genes: >fit.up <- fit[results[,2] ==1, ] To select all down genes: >fit.down <- fit[results[,2]== -1, ] To output to a file: >write.fit(fit.up[,2], file="myresults.txt") or simply >write(fit.up[,2], file="myresults.txt")

35. Heatmap of Selected Genes

36. Heatmap of Genes from �eset�

37. More Advanced MethodsIncorporating Gene Correlations Elastic Net Zou & Hastie, 2005 http://cran.r-project.org/src/contrib/PACKAGES.html Gradient Directed Regularization Friedman & Popescu, 2004 http://www-stat.stanford.edu/~jhf/PathSeeker.html Shrunken Centroid Ordering by Othogonal Projections (SCOOP) Verducci, 200? E-mail me (verducci.1@osu.edu) for alpha version

38. Connecting to Gene Ontology OntologyTraverser: an R package for GO analysis A. Young , N. Whitehouse , J. Cho and C. Shaw * Baylor College of Medicine, Department of Molecular and Human Genetics Houston, TX, USA Summary: Gene Ontology (GO) annotations have become a major tool for analysis of genome-scale experiments. We have created OntologyTraverser�an R package for GO analysis of gene lists. Our system is a major advance over previous work because the system can be installed as an R package, the system uses Java to instantiate the GO structure and the SJava system to integrate R and Java the system is also deployed as a publicly available web tool. Availability: Our software is academically available through http://franklin.imgen.bcm.tmc.edu/OntologyTraverser/. Both the R package and the web tool are accessible. Contact: cashaw@bcm.tmc.edu

Introduction to R Bioconductor

Introduction to R Bioconductor

Presentation Transcript

Introduction to BioConductor

Introduction to R

BioConductor

5. Introduction to R and Bioconductor

Introduction to R and Bioconductor BMI 731 Winter 2005

Intro to R / BioConductor

Introduction to R

An Introduction to Bioconductor

Bioconductor

R/Bioconductor Workshop Session I – Surviving in R

R/Bioconductor Workshop Session II - Bioconductor-ing

Getting Data into R & Bioconductor

Enabling Bioconductor R packages for caGrid services

Functional Genomics with R and Bioconductor

Welcome to lecture 7: An introduction to bioconductor

Introduction to BioConductor

Introduction to R / sma / Bioconductor

An Introduction to Bioconductor

Microarray Data Analysis using R and Bioconductor

Introduction to R and Bioconductor

Overview of R & Bioconductor

Introduction to R and Bioconductor BMI 731 Winter 2005

Introduction to R Bioconductor