460 likes | 597 Views
Functional Genomics with R and Bioconductor. Wolfgang Huber EMBL-EBI 3 June 2008. The S language. The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs.
E N D
Functional Genomics withR and Bioconductor Wolfgang Huber EMBL-EBI 3 June 2008
The S language • The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs. • The language has been through a number of major changes but has been relatively stable since the mid 1990s • The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization.
Implementations • S-Plus is a commercialization of the Bell Labs code. • R is an independent open source version that was originally developed at the University of Auckland but which is now developed by a world wide group of developers. • Each version has advantages and problems.
What can you do with R http://cran.r-project.org/web/views e.g. Optimization; Multivariate
How to get information on R Online manuals (show) Function manual pages CRAN task views (show) Package vignette (e.g. vsn) Books Courses Google Mailing lists: R-help, R-devel - Check the list archives (e.g. Google) - Read the posting guide- Send email
Main features of R • Most comprehensive collection of statistical models + functions • Publication quality graphics • Package system with dependency management, name spaces; typical sessions with dozens of packages from different authors • Functional language • Object oriented programming • Foreign language interface (shared memory) • Pragmatic: emphasis on inclusion of many different tools and ideas, and on making particular tasks simple; but not on stringent overall design or safety
R is a functional language • cf. Lisp, Scheme, Haskell • Functions are “first-class” objects (like integers, vectors, etc.) and can be passed to other functions, computed upon, modified, be applied to a list of arguments etc. • Lazy evaluation (evaluate function arguments only if and when needed) • Pass-by-value semantics: function calls are not supposed to change the state of the calling environment. If two different variable names x and y start out with the same value, there is (almost) no way for R-level operations on the value of x to change the value of y. • Weak typing: e.g. f = function (x, y) { x*y } • ‘…’ argument allows variable length argument lists and encapsulation of function arguments in nested functions f = function(a, …) { p = transform(a) plot(p, …) }
Major drawbacks of R • Its loops are slow • Pass-by-value semantics can cause a lot of unnecessary copying of large objects – wasting CPU time and memory • Ad 1.: operators and many functions are vectorized • Not difficult to include user-defined C functions for time-critical loops • Ad 2.: R has some support for references and mutable state of objects, and future versions of R may support this more (http://www.stat.uiowa.edu/~luke/R/references.html)
Foreign language interface demo.R
Debugging R code demo.R
S4 object system Class: like a C struct, plus user-defined validity check methods (e.g. to make slot values internally consistent) Generic function: exists besides and independent of classes; depending on argument types, dispatch to specific methods; multiple dispatch! e.g. “plot(x, y, z, …)” A frequently used paradigm in OOP is of an object sitting in memory (e.g. representing a user interface window, or a database) and sending and receiving messages to/from other objects, and so changing its state. Since R is a functional language, this paradigm does not work well.
Memory and CPU usage profiling Extensive support for profiling • in which parts of the code CPU time is spent • where and how much memory allocation is caused
References R Programming for Bioinformatics by Robert Gentleman, Chapman & Hall/CRC (July 2008)
Packages • Packages are the main unit of software authoring, versioning and distribution • CRAN is the major repository for R packages. It is hosted by TU Vienna and ETH Zürich, and has many mirrors world-wide • Bioconductor is a repository for biology related packages. It is hosted at the Fred Hutchinson Cancer Research Centre.
Bioconductor an open sourceand open development software project for the analysis of biomedical and genomic data was started in the autumn of 2001 and includes core developers in the US, Europe, and Australia R and the R package system are used to design and distribute software
Goals of the Bioconductor project Provide access to powerful statistical and graphical methods for the analysis of genomic data. Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A), PubMed) in the analysis of experimental data. Allow the rapid development of extensible, interoperable, and scalable software. Promote high-quality documentation and reproducible research. Provide training in computational and statistical methods.
Bioconductor Strict 6-monthly release cycle (in sync with R), starting with about 15 packages 1.0 in March 2003, now at 2.2 with 260 packages Thousands of downloads within 4 weeks after release Aggressive development Focus on cutting edge research Packages vary in their maturity: software ecosystem
Why are we Open Source? • so that you can find out what algorithm is being used, and how it is being used • so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs • so that they can be used by others as components (potentially modified)
Good scientific software is like a good scientific publication oReproducible oPeer-review oEasy to access by other researchers, society o Builds on the work of others o Others will build their work on top of it o Commercialization of spin-offs can make sense
Some Bioconductor packages • limma – differential expression for designed (ANOVA-type) experiments; moderated variance estimation (sharing info across genes) • affy – more sensitive / specific preprocessing of Affymetrix genechip data • beadarray, lumi - … for Illumina beadarrays • GSEA, GOstats, topGO … - gene set enrichment methods • cellHTS2, EBImage – RNAi and high content screening
Self-describing, standardized data structures • One of the challenges of Bioinformatics is that datasets are more complex than just tables and matrices (e.g. a microarray experiments) • Another one is that we want to use software packages from many different authors (e.g. normalisation, quality assessment, differential expression) • A major contribution of Bioconductor to R are common data structures
D Sample-ID red R Sample-ID green G Physical coordinates Sample-ID blue B Sequence Array-ID _ALL_ Target gene ID NChannelSet Physical coordinates assayData can contain N=0, 1, 2, ..., matrices of the same size Sequence Target gene ID Sample-ID red Sample-ID green Sample-ID blue Array ID “pheno”Data (AnnotatedDataframe) featureData (AnnotatedDataframe) labelDescription channelDescription labelDescription varMetaData
Bioconductor support for HTS Bioconductor support for HTS is still at an early stage. So far, we have these packages: • Biostrings: Infrastructure to handle large amounts of character or genomic data; simple exact matching • ShortReads*: Reading in data from the SolexaPipeline and Maq • HilbertCurveDisplay*: Visualization of genomic data • TileQC: raw-data quality control tools * not yet released (available from SVN)
How to get started • Go to www.bioconductor.org and follow installation instructions (Linux, Windows, Mac) • Read the book • Subscribe to the mailing list • Contribute your own package!
Courses Brixen / Bressanone 15-20 June 2008 W Huber, R Gentleman, A Ladurner, R Bourgon, F Hahne, S Anders Day 1: Introduction to R and Bioconductor; microarray preprocessing, quality assessment and normalisation Day 2: Chromatin Biology; ChIP-chip analysis Day 3: Statistical methods for differential expression; Using gene annotation Day 4: Gene set enrichment analysis; techniques for data integration, clustering, classification, visualisation. Day 5: Graphs, networks; Introduction to higher throughput sequencing data analysis and ChIP-Seq
High-resolution mapping of meioticcrossovers and noncrossovers Eugenio Mancera*, Riuchard Bourgon*, Alessandro Brozzi, Wolfgang Huber, Lars Steinmetz EMBL – HD & EBI
Meiosis • Two rounds of cell division. • Reduction from diploid to haploid gametes, required for sexual reproduction. • Recombination • Provides physical contact between homologs. • Increases genetic diversity, and is the primary determinant of haplotype structure: within-species genetic similarity and difference. From Molecular Biology of the Cell, Fourth Edition.
Double-strand break repair • Recombination initiates with a double-strand break in one DNA molecule. • Only two DNA molecules (homologs) are shown here. CO: NCO:
Saccharomyces cerevisiae microarray data • Two strains: S96 and YJM789. • Tiling microarrays: • ≈ 6.5 M 5µ features, tiling non-repetitive S96 every 4 bases. • ≈ 4% of probes are specific to YJM789 sequence. • Markers: • Data • 25 parental genomic hybridizations. • 208 wildtype offspring hybes. • 20 msh4 offspring hybes. • 20 mms4 offspring hybes.
Questions • Rate inhomogeneity • Which parts of the genome are hot or cold? • Do COs and NCOs behave similarly? • The “hot spot conversion paradox” • How transient are hot spots? • Do biases in mismatch repair contribute to hot spot transiency, or to base usage bias in general? • Gene conversion and haplotype structure • How prevalent is gene conversion, and what are its effects on haplotype structure? • Mechanisms of meiotic recombination • Are observed phenomena consistent with current models? • What do msh4 and mms4 mutants tell us about these models?
Chromosome length and events per meiosis • An average of 90.5 COs and 46.2 observed NCOs per meiosis. • 30% of COs fall between markers: correct the 46.2 NCOs to ≈66 per meiosis. • Up to 1% of the genome of each meiotic product falls within a recombination interval — and is thus subject to gene conversion — in a single meiosis.
CO and NCO rates along the chromosome • Centromeres are uniformly cold, as boundaries of large insertions and a large transversion. • Many regions show elevated CO or NCO rate, but total event counts are too low for locus-specific testing. • Global CO/NCO bias testing: an excess of skewed ratios from IMIs corresponding to ≈100 kb, or 1.4% of the recombinatorially active fraction (p < .0005).
Correspondence between DSBs and event intervals • After adjusting for marker spacing, event counts correlate well with recent ChIP-chip experiments for the Spo11 maps. (C Buhler et al., PLoS Biology 5:e324, 2007.)
Conclusions Used high-density tiling arrays to genotype >50 tetrads (>200 spores) from crossing of two phenotypically diverse strains 1.8% - 4.5% of genome is part of recombination events in a single meiosis (up to 544 Kb) Prevalence and relevance of gene conversion (CO and NCO associated) Chromosomal position is associated with whether outcome of DSB is CO or NCO Conversion hotspots unlink genomic regions from the linkage map
Eugenio Mancera Ramos Richard Bourgon • Lars Steinmetz Acknowledgement Seattle Robert Gentleman Martin Morgan Hervé Pagés Seth Falcon Harvard Vince Carey WEHI Gordon Smyth JHSPH Rafael Irizarry EBI • Simon AndersElin Axelsson • Ligia Bras • Alessandro Brozzi • Tony Chiang • Audrey Kauffmann • Greg Pau • Oleg Sklyar • Jörn Tödling • David Jitao Zhang • All contributors to R and Bioconductor projects