• Preprocessing and normalization of microarray data, cell-based assays, mass spectrometry
• Uni- and multivariate statistical analysis methods
• Machine learning
• Harvesting and using metadata from biological databases
• Visualization
• Graphs and networks in molecular biology
Including the software code (R) to reproduce all examples, figures, etc.
Goals of statistical analysis
• To get the most out of your experiment: optimal sensitivity and specificity
• To present your data in easily accessible, intuitive, visually attractive ways, beyond long, boring tables of numbers or messy Excel bar graphs
• To associate your own data with all relevant existing data in public databases or from other authors, which can add substantial value
Bioconductor
• An open source and open development software project for the analysis of biomedical and genomic data
• Started in the fall of 2001 by Robert Gentleman (then at Harvard); it now includes 23 core developers in the US, Europe, and Australia
• R and the R package system are used to design and distribute software
• Initial focus on microarrays; now also proteomics, cell-based assays, bioinformatic metadata, and graph-theoretic methods
Bioconductor
• Strict 6-monthly release cycle: started with about 15 packages in release 1.0 (March 2003), now at release 1.7 with ca. 140 packages
• Each release has several thousand downloads
• Aggressive development: state-of-the-art algorithms; backward compatibility is desired but sometimes not possible
• Packages vary in their maturity: somewhere between "software textbook" and "software journal"
Acknowledgments
• Ben Bolstad, UC Berkeley
• Vince Carey, Biostatistics, Harvard
• Sandrine Dudoit, Biostatistics, UC Berkeley
• Seth Falcon, Fred Hutchinson Cancer Research Center (FHCRC), Seattle
• Robert Gentleman, FHCRC
• Jeff Gentry, Dana-Farber Cancer Institute
• Florian Hahne, DKFZ
• Rafael Irizarry, Biostatistics, Johns Hopkins
• Li Long, Swiss Institute of Bioinformatics, Switzerland
• James MacDonald, University of Michigan, USA
• Martin Maechler, ETH Zürich, CH
• Denise Scholtens, U Chicago
• Gordon Smyth, WEHI
• … and many others
Goals of Bioconductor
• Provide access to powerful statistical and graphical methods for the analysis of genomic data
• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed, Ensembl) in the analysis of experimental data
• Allow the rapid development of extensible, interoperable, and scalable software
• Provide high-quality documentation
• Promote reproducible research
• Provide training in computational and statistical methods
Tools in Bioconductor
Main platform: R
But we also use many other tools:
• Graphviz
• Boost Graph Library (BGL)
• libxml
• MySQL
• BioMart/Ensembl
• ImageMagick
• C/C++, Perl, Java
• MAGEstk
• Tcl/Tk, Gtk
Philosophy: don't reinvent the wheel
Component software
• Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated, interoperable software.
• Don't think your method is the end of it all. Design your piece to be a cog in a big machine.
• Software modules with standardized I/O instead of stand-alone applications
• Web service instead of web site
Why are we Open Source?
• So that you can find out what algorithm is being used, and how it is being used
• So that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
• So that they can be used as components
Transparency, pursuit of reproducibility, efficiency of development, training
Good scientific software is like a good scientific publication
• Reproducibility
• Peer review
• Easy accessibility by other researchers and society
• Build on the work of others
• Others will build their work on top of it
• Commercialize those developments that are successful and have a market
Commercial profit ports
• We coordinate with Insightful to help provide ArrayAnalyzer (which contains many Bioconductor packages and resources)
• Bioconductor is also interfaced by Genespring, Spotfire, ExpressionNTI
• Our software remains free, but some users are willing to pay money for professional support (hotlines, handbooks, stricter enforcement of uniform user interfaces, …)
• Win-win arrangement
Bioconductor packages
Release 1.7, Oct 2005: ca. 140 packages
• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings
• Annotation: annotate, biomaRt, AnnBuilder & data packages
• Graphics: geneplotter, hexbin
• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
• Differential gene expression: genefilter, limma, multtest, siggenes, EBarrays, factDesign
• Graphs and networks: graph, RBGL, Rgraphviz
• Other data: SAGElyzer, DNAcopy, PROcess, aCGH, prada
affy package (R. Irizarry, B. Bolstad, L. Gautier, C. Miller)
Pre-processing oligonucleotide chip data:
• diagnostic plots
• background correction
• probe-level normalization
• computation of expression measures
[Figure: example diagnostic plots from plotAffyRNADeg, barplot.ProbeSet, image, and plotDensity]
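To make "probe-level normalization" concrete, here is a minimal sketch of quantile normalization, one widely used normalization for oligonucleotide arrays. It uses base R only and simulated intensities; the function name quantile_normalize and the data are assumptions made for illustration, and this is not the affy package's own implementation.

```r
# Quantile normalization sketch (base R only; illustrative, not the affy code).
# Rows are probes, columns are arrays.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), !anyNA(x))
  # Sort each column, then average across arrays at each rank
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)
  # Replace each value by the reference value at its rank within its array
  # (ties.method = "first" keeps ranks integer; ties are broken by position)
  normalized <- apply(x, 2, function(col) reference[rank(col, ties.method = "first")])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated log2 intensities: 1000 probes on 4 arrays with different offsets
set.seed(1)
raw <- sapply(c(7, 8, 9, 10), function(mu) rnorm(1000, mean = mu, sd = 1.5))
colnames(raw) <- paste0("array", 1:4)
normalized <- quantile_normalize(raw)

# After normalization every array has the same empirical distribution
print(apply(normalized, 2, quantile, probs = c(0.25, 0.5, 0.75)))
```

The within-array ordering of probes is unchanged; only the intensity scale is forced to a common reference distribution.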
affycomp Cope, Irizarry et al. Bioinformatics (2004)
tilingArray package Collaboration with Lars Steinmetz EMBL-HD
LIMMA package: Linear Models for Microarray Data (Gordon Smyth)
• Analysis of differential expression studies
• Complex designed experiments: linear models, contrasts
• Empirical Bayes methods for differential expression: t-tests, F-tests, posterior odds
• Inference methods for duplicate spots, technical replication
• Analyse log-ratios or log-intensities
• Spot quality weights
• Control of FDR across genes and contrasts
• Stemmed heat diagrams, Venn diagrams
• Pre-processing: background correction, normalization
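The idea of controlling the FDR across genes can be sketched in base R. The example below is an assumption-laden illustration on simulated data: it uses ordinary Welch t-tests per gene and the Benjamini-Hochberg adjustment from p.adjust, not limma's moderated (empirical Bayes) statistics, and the group sizes and effect size are arbitrary.

```r
# Per-gene two-group tests followed by Benjamini-Hochberg FDR control.
# Simulated data; ordinary t-tests stand in for limma's moderated statistics.
set.seed(42)
n_genes <- 2000
n_per_group <- 6
group <- rep(c("control", "treated"), each = n_per_group)
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)

# Make the first 100 genes truly differentially expressed (shift of 2 log2 units)
expr[1:100, group == "treated"] <- expr[1:100, group == "treated"] + 2

# One Welch t-test per gene (row)
p_raw <- apply(expr, 1, function(y) {
  t.test(y[group == "control"], y[group == "treated"])$p.value
})

# Benjamini-Hochberg adjusted p-values
p_bh <- p.adjust(p_raw, method = "BH")

called <- which(p_bh < 0.05)
cat("genes with raw p < 0.05:          ", sum(p_raw < 0.05), "\n")
cat("genes called at 5% FDR:           ", length(called), "\n")
cat("of these, truly changed (1..100): ", sum(called <= 100), "\n")
```

Comparing the first two counts shows how many nominally significant genes the multiple-testing adjustment removes.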
Classification, class prediction, machine learning
Predict an outcome on the basis of past observations of some explanatory variables (features)
• Outcome: e.g. tumor class, type of bacterial infection, response to treatment, survival
• Features: gene expression measures, covariates such as age, sex
R class prediction packages
• class: k-nearest neighbor (knn), learning vector quantization (lvq)
• classPP: projection pursuit
• e1071: support vector machine (svm)
• MASS: linear and quadratic discriminant analysis (lda, qda)
• sma: diagonal linear and quadratic discriminant analysis, naïve Bayes
• nnet: feed-forward neural networks and multinomial log-linear models
• rpart: classification and regression trees
• knnTree: k-nn classification with variable selection inside leaves of a tree
• randomForest: random forests
• LogitBoost: boosting for tree stumps
• ipred: bagging, resampling-based estimation of prediction error
• mlbench: machine learning benchmark problems
• pamr: prediction analysis for microarrays
• gpls: generalized partial least squares
graph, RBGL, Rgraphviz
• graph: basic class definitions and functionality
• RBGL: graph algorithms (e.g. shortest path, connectivity)
• Rgraphviz: rendering, with different layout algorithms; seamlessly combinable with R graphics
RBGL: interface to the Boost Graph Library
Connected components of the graph rg:
    cc = connComp(rg)
    table(listLen(cc))
     1  2  3  4 15 18
    36  7  3  2  1  1
Choose the largest component:
    wh = which.max(listLen(cc))
    sg = subGraph(cc[[wh]], rg)
Depth-first search:
    dfsres = dfs(sg, node = "N14")
    nodes(sg)[dfsres$discovered]
     [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
     [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
    [17] "N48" "N09"
[Figure: plot of the graph rg]
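For readers without RBGL at hand, the connected-components step can be sketched in base R. This is an illustrative breadth-first search over an adjacency list, not the Boost Graph Library algorithm that RBGL calls; the function connected_components and the six-node toy graph are assumptions made for the example.

```r
# Connected components of an undirected graph by breadth-first search.
# The graph is a named list: each element holds the neighbours of that node.
connected_components <- function(adj) {
  component <- setNames(rep(NA_integer_, length(adj)), names(adj))
  current <- 0L
  for (start in names(adj)) {
    if (!is.na(component[[start]])) next   # node already assigned
    current <- current + 1L
    component[[start]] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      for (nb in adj[[node]]) {
        if (is.na(component[[nb]])) {
          component[[nb]] <- current
          queue <- c(queue, nb)
        }
      }
    }
  }
  # One character vector of node names per component
  split(names(component), component)
}

# Toy graph (hypothetical): a 3-node component, a 2-node component, a singleton
adj <- list(
  N1 = c("N2", "N3"),
  N2 = "N1",
  N3 = "N1",
  N4 = "N5",
  N5 = "N4",
  N6 = character(0)
)

cc <- connected_components(adj)
print(table(lengths(cc)))                  # component size distribution
largest <- cc[[which.max(lengths(cc))]]    # same idiom as on the slide
print(largest)
```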
Algorithms
• Shortest paths (edge weights: 1, positive, real)
• Connectivity (strong, weak)
• Graph traversal
• Minimal spanning tree
Domain combination graph
Apic, Huber, Teichmann, J. Struct. Funct. Genomics (2003)
Reproducible Research and Compendia
"There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proof of correctness."
F. Leisch, T. Rossini, Chance 16 (2003)
"We re-analyzed the breast cancer data from van't Veer et al. (2002). ... Even with some help of the authors, we were unable to exactly reproduce this analysis."
R. Tibshirani, B. Efron, SAGMB (2002)
Re-analysis of a breast cancer outcome study
E. Huang et al., Gene expression predictors of breast cancer outcome, The Lancet 361 (9369): 1590-6 (2003)
• 89 primary breast tumors on Affymetrix chips (HG-U95Av2), among them 52 with 1-3 positive lymph nodes; of these, 18 led to recurrence within 3 years and 34 did not
• Goal: predict recurrence
• Claim: 5 misclassification errors, 1 unclear (leave-one-out cross-validation)
• Method: Bayesian binary prediction trees (at the time, unpublished)
http://www.cagp.duke.edu
… we tried to reproduce these results, starting from the published microarray raw data (CEL files), but couldn't.
• The paper (and supplements) didn't contain the necessary details to re-implement their algorithm.
• The authors didn't provide comparisons to simple, well-known methods.
• In our hands, all other methods resulted in worse misclassification results.
Is their new Bayesian tree method miles better than everything else? Or was their analysis over-optimistic (over-fitting, selection bias)?
A general pattern
New publications often present a new microarray data set and a new classification method. The merits of the method and the merits of the data are entangled.
• Is it necessary to develop an idiosyncratic method?
• Which result could be achieved with standard approaches? (accuracy vs. interpretability)
• Is there a big difference, and what are the reasons for it? (errors happen … in implementation and validation)
Compendia
Interactive documents that contain:
• Primary data
• Processing methods (computer code)
• Derived data, figures, tables and other output
• Text: research report (results, materials and methods, conclusions)
Package compHuang: reanalysis of the Huang et al. data, using different classification and preprocessing methods and a correct cross-validation procedure for estimating the prediction error
Based on R/Bioconductor's package and vignette technologies
M. Ruschhaupt, W. Huber, A. Poustka, U. Mansmann, Statistical Applications in Genetics and Molecular Biology (2004)
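Why a correct cross-validation procedure matters can be shown with a small simulation. The sketch below is an illustration under stated assumptions, not the compHuang or MCRestimate code: the data are pure noise, the classifier is a simple nearest-centroid rule, and the helper names (top_genes, predict_centroid, loocv_error) are hypothetical. Selecting genes on all samples before leave-one-out cross-validation gives an optimistic error estimate; repeating the selection inside every fold does not.

```r
# Selection bias in cross-validation, demonstrated on pure noise.
set.seed(2004)
n <- 40
p <- 5000
x <- matrix(rnorm(n * p), nrow = p)   # genes in rows, samples in columns
y <- rep(c(0, 1), each = n / 2)       # labels carry no information about x

# Indices of the k genes with the largest absolute two-sample t-statistic
top_genes <- function(x, y, k = 10) {
  x0 <- x[, y == 0, drop = FALSE]
  x1 <- x[, y == 1, drop = FALSE]
  n0 <- ncol(x0)
  n1 <- ncol(x1)
  m0 <- rowMeans(x0)
  m1 <- rowMeans(x1)
  ss <- rowSums((x0 - m0)^2) + rowSums((x1 - m1)^2)
  se <- sqrt(ss / (n0 + n1 - 2) * (1 / n0 + 1 / n1))
  order(abs(m1 - m0) / se, decreasing = TRUE)[seq_len(k)]
}

# Nearest-centroid prediction (0 or 1) for a single test sample
predict_centroid <- function(train, ytrain, test) {
  c0 <- rowMeans(train[, ytrain == 0, drop = FALSE])
  c1 <- rowMeans(train[, ytrain == 1, drop = FALSE])
  as.numeric(sum((test - c1)^2) < sum((test - c0)^2))
}

# Leave-one-out error; gene selection is done either once on all samples
# (select_inside = FALSE, biased) or again in every fold (TRUE, honest)
loocv_error <- function(x, y, select_inside) {
  fixed <- if (select_inside) NULL else top_genes(x, y)
  wrong <- vapply(seq_along(y), function(i) {
    genes <- if (select_inside) top_genes(x[, -i], y[-i]) else fixed
    pred <- predict_centroid(x[genes, -i, drop = FALSE], y[-i], x[genes, i])
    pred != y[i]
  }, logical(1))
  mean(wrong)
}

err_biased <- loocv_error(x, y, select_inside = FALSE)
err_honest <- loocv_error(x, y, select_inside = TRUE)
cat(sprintf("selection outside CV: %.2f\n", err_biased))
cat(sprintf("selection inside CV:  %.2f\n", err_honest))
```

Since the labels are unrelated to the data, an honest estimate should sit near chance level; a much lower figure from the first variant is an artifact of the procedure.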
Sweave: from source markup (here: LaTeX & R) to processed document (here: PDF)
<<MCRestimate call,eval=FALSE,echo=TRUE>>=
r.forest <- MCRestimate(eset, class.label, class.function="RF.wrap",
                        select.fun=red.fct, cross.outer=10,
                        cross.inner=5, cross.repeat=20)
@
<<rf.save,echo=FALSE,results=hide>>=
savepdf(plot(r.forest, main="Random Forest"), "image-RF.pdf")
@
<<result>>=
r.forest
@
The final document includes results of the calculation, graphical outputs, tables, and optionally parts of the R code which has been used. Also the description of the experiment, the interpretation of the results, and the conclusion can be integrated. In this example we applied our compendium to T. Golub's ALL/AML data~\cite{Golub.1999}.
\begin{figure}[h]
\begin{center}
\includegraphics[width=0.4\textwidth]{image-RF}
\end{center}
\end{figure}
\smallskip
<<summary,echo=FALSE>>=
method.list <- list(r.forest, r.pam, r.logReg, r.svm)
name.list <- c("RF", "PAM", "PLR", "SVM")
conf.table <- MCRconfusion(method.list, col.names=name.list)
@
<<writinglatex1,echo=FALSE,results=tex>>=
xtable(conf.table, "Overall number of misclassifications",
       label="conf.table", display=rep("d",6))
@
%\input{samples.1}
%\input{conf.table}
\begin{thebibliography}{1}
\bibitem[Golub et al., 1999]{Golub.1999} Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.
\newblock Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
\newblock \textit{Science} 286(5439): 531-7 (1999).
\end{thebibliography}
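How such a source file is turned into LaTeX can be tried with a much smaller document. The sketch below writes a minimal Sweave file to a temporary location and processes it with utils::Sweave; the file contents and the chunk name simulate are made up for illustration and have nothing to do with the compendium above. Only the weave step is run, so no LaTeX installation is assumed.

```r
# Minimal Sweave round trip: write an .Rnw source, weave it to .tex.
rnw <- tempfile(fileext = ".Rnw")
tex <- sub("\\.Rnw$", ".tex", rnw)

writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<simulate,echo=TRUE>>=",
  "set.seed(1)",
  "x <- rnorm(20)",
  "mean_x <- round(mean(x), 3)",
  "@",
  "The sample mean is \\Sexpr{mean_x}.",
  "\\end{document}"
), rnw)

# Code chunks are evaluated and \Sexpr{} is replaced by its value
utils::Sweave(rnw, output = tex, quiet = TRUE)

out <- readLines(tex)
cat(out, sep = "\n")
```

Compiling the resulting .tex file with LaTeX would then give the processed document.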
Structure of a compendium
Package directory:
• General info: author, version, …
• Source markup
• Additional software code: function definitions
• Software documentation: manual pages
• Data directory: data files
Courses
• NGFN: 4 times a year in Berlin, München, Saarbrücken, Heidelberg
• EMBO course: June 2006 at EBI, Cambridge, UK (A. Brazma, W. Huber, J. Quackenbush)
• Brixen/Bressanone (Tyrol): June 2006 (R. Irizarry, W. Huber, R. Gentleman)
Courses: EBI, Hinxton / Cambridge, 26-30 June 2006
A Brazma, M Kapushesky, J Quackenbush, A Culhane, W Huber
• ArrayExpress / Expression Profiler
• Bioconductor
• MEV
Courses: Brixen / Bressanone, 19-23 June 2006
W Huber, R Irizarry, R Gentleman
• Day 1: Introduction to R and Bioconductor; biology for non-biologists; microarray and other high-throughput technologies
• Day 2: Affymetrix and cDNA experiments; data preprocessing and normalization; quality assessment; model-based analysis
• Day 3: Machine learning approaches: supervised and unsupervised; data visualization
• Day 4: Annotation, KEGG, GO, BIND; meta-analysis: combining microarray experiments
• Day 5: Graphs, networks, biochemical pathways, proteomics