Preprocessing and normalization of microarray data, cell-based assays, mass spectrometry


  1. Preprocessing and normalization of microarray data, cell-based assays, mass spectrometry
     • Uni- and multivariate statistical analysis methods
     • Machine Learning
     • Harvesting and using metadata from biological databases
     • Visualization
     • Graphs and Networks in molecular biology
     Including the software code (R) to reproduce all examples, figures etc.

  2. Goals of statistical analysis
     • To get the most out of your experiment - optimal sensitivity, specificity
     • To present your data in easily accessible, intuitive, visually attractive ways - beyond long boring tables of numbers or messy Excel bar graphs
     • To associate your own data with all relevant existing data in public databases or from other authors - this can substantially add value!

  3. Bioconductor
     • An open source and open development software project for the analysis of biomedical and genomic data.
     • Started in the fall of 2001 by Robert Gentleman (then at Harvard); now includes 23 core developers in the US, Europe, and Australia.
     • R and the R package system are used to design and distribute software.
     • Initial focus on microarrays, now also: proteomics, cell-based assays, bioinformatic metadata, graph-theoretic methods

  4. Bioconductor
     • Strict 6-monthly release cycle: release 1.0 in March 2003 with about 15 packages, now at 1.7 with ca. 140 packages
     • Each release has several thousand downloads
     • Aggressive development - state of the art algorithms; backward compatibility is desired but sometimes not possible
     • Packages vary in their maturity: somewhere between "software textbook" and "software journal"

  5. Acknowledgments
     • Ben Bolstad, UC Berkeley
     • Vince Carey, Biostatistics, Harvard
     • Sandrine Dudoit, Biostatistics, UC Berkeley
     • Seth Falcon, Fred Hutchinson Cancer Res Ctre, Seattle
     • Robert Gentleman, FHCRC
     • Jeff Gentry, Dana-Farber Cancer Institute
     • Florian Hahne, DKFZ
     • Rafael Irizarry, Biostatistics, Johns Hopkins
     • Li Long, Swiss Institute of Bioinformatics, Switzerland
     • James MacDonald, University of Michigan, USA
     • Martin Maechler, ETH Zürich, CH
     • Denise Scholtens, U Chicago
     • Gordon Smyth, WEHI
     • … and many others

  6. Goals of Bioconductor
     • Provide access to powerful statistical and graphical methods for the analysis of genomic data
     • Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed, Ensembl) in the analysis of experimental data
     • Allow the rapid development of extensible, interoperable, and scalable software
     • Provide high-quality documentation
     • Promote reproducible research
     • Provide training in computational and statistical methods

  7. Tools in Bioconductor
     Main platform: R
     But we also use many other tools:
     • graphviz
     • Boost Graph Library (BGL)
     • libxml
     • mySQL
     • Biomart/Ensembl
     • imageMagick
     • C/C++, Perl, Java
     • MAGEstk
     • tcl/tk, Gtk
     Philosophy: don’t reinvent the wheel

  8. Component software
     • Most interesting problems will require the coordinated application of many different techniques. Thus we need integrated interoperable software.
     • Don’t think your method is the end of it all. Design your piece to be a cog in a big machine.
     • Software modules with standardized I/O instead of stand-alone applications
     • Web service instead of web site

  9. Why are we Open Source
     • so that you can find out what algorithm is being used, and how it is being used
     • so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
     • so that they can be used as components
     • Transparency
     • Pursuit of reproducibility
     • Efficiency of development
     • Training

  10. Good scientific software is like a good scientific publication
     • Reproducibility
     • Peer-review
     • Easy accessibility by other researchers, society
     • Build on the work of others
     • Others will build their work on top of it
     • Commercialize those developments that are successful and have a market

  11. Commercial profit ports
     • We coordinate with Insightful to help provide ArrayAnalyzer (which contains many Bioconductor packages and resources)
     • Bioconductor is also interfaced by Genespring, Spotfire, ExpressionNTI
     • Our software remains free, but some users are willing to pay money for professional support (hotlines, handbooks, stricter enforcement of uniform user interfaces, …)
     • Win-win arrangement

  12. Overview of the Bioconductor Project

  13. Bioconductor packages
     Release 1.7, Oct 2005; ca. 140 packages
     • General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings
     • Annotation: annotate, biomaRt, AnnBuilder & data packages
     • Graphics: geneplotter, hexbin
     • Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
     • Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
     • Differential gene expression: genefilter, limma, multtest, siggenes, EBarrays, factDesign
     • Graphs and networks: graph, RBGL, Rgraphviz
     • Other data: SAGElyzer, DNAcopy, PROcess, aCGH, prada

  14. affy package
     Pre-processing oligonucleotide chip data:
     • diagnostic plots
     • background correction
     • probe-level normalization
     • computation of expression measures
     R. Irizarry, B. Bolstad, L. Gautier, C. Miller
     [Figure panels: plotAffyRNADeg, barplot.ProbeSet, image, plotDensity]

  15. affycomp Cope, Irizarry et al. Bioinformatics (2004)

  16. tilingArray package Collaboration with Lars Steinmetz EMBL-HD

  17. LIMMA Package: Linear Models for Microarray Data
     Analysis of differential expression studies (Gordon Smyth)
     • complex designed experiments: linear models, contrasts
     • empirical Bayes methods for differential expression: t-tests, F-tests, posterior odds
     • inference methods for duplicate spots, technical replication
     • analyse log-ratios or log-intensities
     • spot quality weights
     • control of FDR across genes and contrasts
     • stemmed heat diagrams, Venn diagrams
     • pre-processing: background correction, normalization
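The idea of gene-wise linear models with FDR control across genes can be sketched in base R. This is only a minimal illustration on simulated data: it uses ordinary per-gene t-statistics and Benjamini-Hochberg adjustment via `p.adjust`, not limma's empirical Bayes moderated statistics, and the data, gene names, and the helper `fit_gene` are assumptions made up for the example.

```r
# Hypothetical simulated data: 200 genes x 6 arrays (3 control, 3 treated).
set.seed(1)
n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 3))
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Spike a true effect into the first 10 genes.
expr[1:10, group == "treated"] <- expr[1:10, group == "treated"] + 4

# Design matrix for the model: expression ~ intercept + group effect.
design <- model.matrix(~ group)
xtx_inv <- solve(crossprod(design))

# Fit the linear model for one gene and test the group coefficient
# with an ordinary (unmoderated) t-test.
fit_gene <- function(y) {
  fit <- lm.fit(design, y)
  df <- fit$df.residual
  sigma2 <- sum(fit$residuals^2) / df
  se <- sqrt(sigma2 * xtx_inv[2, 2])
  t_stat <- unname(fit$coefficients[2]) / se
  c(logFC = unname(fit$coefficients[2]),
    t = t_stat,
    p = 2 * pt(-abs(t_stat), df))
}

res <- as.data.frame(t(apply(expr, 1, fit_gene)))

# Control the false discovery rate across all genes (Benjamini-Hochberg).
res$fdr <- p.adjust(res$p, method = "BH")
print(head(res[order(res$p), ]))
```

With only three arrays per group the per-gene variance estimates are unstable, which is the situation that empirical Bayes shrinkage of the variances is meant to address.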

  18. limma GUI

  19. Classification, class prediction, machine learning
     • Predict outcome on the basis of past observations of some explanatory variables (features)
     • Outcome: e.g. tumor class, type of bacterial infection, response to treatment, survival
     • Features: gene expression measures, covariates such as age, sex

  20. R class prediction packages
     • class: k-nearest neighbor (knn), learning vector quantization (lvq)
     • classPP: projection pursuit
     • e1071: support vector machine (svm)
     • MASS: linear and quadratic discriminant analysis (lda, qda)
     • sma: diagonal linear and quadratic discriminant analysis, naïve Bayes
     • nnet: feed-forward neural networks and multinomial log-linear models
     • rpart: classification and regression trees
     • knnTree: k-nn classification with variable selection inside leaves of a tree
     • randomForest: random forests
     • LogitBoost: boosting for tree stumps
     • ipred: bagging, resampling based estimation of prediction error
     • mlbench: machine learning benchmark problems
     • pamr: prediction analysis for microarrays
     • gpls: generalized partial least squares
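As a small illustration of class prediction with one of the packages listed above, the following sketch runs k-nearest-neighbor classification with leave-one-out cross-validation using `knn.cv` from the `class` package. The built-in `iris` data stand in for features and outcome here; this choice, and `k = 5`, are assumptions for the example, not something taken from the slides.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(1)  # knn breaks ties at random

# Features (explanatory variables) and outcome (class labels).
features <- scale(iris[, 1:4])
outcome <- iris$Species

# knn.cv() predicts each sample from all remaining samples (leave-one-out).
pred <- knn.cv(train = features, cl = outcome, k = 5)

# Compare predicted and true classes.
confusion <- table(predicted = pred, true = outcome)
loo_error <- mean(pred != outcome)
print(confusion)
print(loo_error)
```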

  21. graph, RBGL, Rgraphviz
     • graph: basic class definitions and functionality
     • RBGL: graph algorithms (e.g. shortest path, connectivity)
     • Rgraphviz: rendering. Different layout algorithms. Seamlessly combinable with R graphics.

  22. RBGL: interface to the Boost Graph Library
     [Figure: the graph rg]
     Connected components
       cc = connComp(rg)
       table(listLen(cc))
        1  2  3  4 15 18
       36  7  3  2  1  1
     Choose the largest component
       wh = which.max(listLen(cc))
       sg = subGraph(cc[[wh]], rg)
     Depth first search
       dfsres = dfs(sg, node = "N14")
       nodes(sg)[dfsres$discovered]
        [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
        [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
       [17] "N48" "N09"

  23. Algorithms
     • Shortest paths (edge weights: 1, positive, real)
     • Connectivity (strong, weak)
     • Graph traversal
     • Minimal spanning tree
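To make two of these algorithms concrete, here is a minimal base-R sketch of breadth-first traversal, used both for shortest paths with unit edge weights and for finding connected components of an undirected graph. It does not use the graph or RBGL packages; the toy graph and the helper names `bfs_dist` and `conn_comp` are hypothetical and exist only for this illustration.

```r
# Hypothetical toy graph: a path A-B-C-D, an edge E-F, and an isolated node G.
nodes <- c("A", "B", "C", "D", "E", "F", "G")
edges <- rbind(c("A", "B"), c("B", "C"), c("C", "D"), c("E", "F"))

# Build an undirected adjacency list.
adj <- setNames(vector("list", length(nodes)), nodes)
for (i in seq_len(nrow(edges))) {
  a <- edges[i, 1]
  b <- edges[i, 2]
  adj[[a]] <- c(adj[[a]], b)
  adj[[b]] <- c(adj[[b]], a)
}

# Breadth-first search: shortest path lengths (unit edge weights) from
# `start`; nodes that cannot be reached keep distance Inf.
bfs_dist <- function(adj, start) {
  dist <- setNames(rep(Inf, length(adj)), names(adj))
  dist[start] <- 0
  queue <- start
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (w in adj[[v]]) {
      if (is.infinite(dist[w])) {
        dist[w] <- dist[v] + 1
        queue <- c(queue, w)
      }
    }
  }
  dist
}

# Connected components: repeatedly traverse from a node not yet assigned.
conn_comp <- function(adj) {
  unseen <- names(adj)
  comps <- list()
  while (length(unseen) > 0) {
    d <- bfs_dist(adj, unseen[1])
    members <- names(d)[is.finite(d)]
    comps[[length(comps) + 1]] <- members
    unseen <- setdiff(unseen, members)
  }
  comps
}

comps <- conn_comp(adj)
print(lengths(comps))        # component sizes
print(bfs_dist(adj, "A"))    # distances from node A
```

For graphs of realistic size, the compiled implementations that RBGL exposes from the Boost Graph Library are the better choice; the loop above is only meant to show what is being computed.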

  24. Rgraphviz: the different layout engines

  25. domain combination graph Apic, Huber, Teichmann, J. Struct. Fct. Genomics (2003)

  26. Reproducible Research and Compendia
     • "There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proof of correctness." F. Leisch, T. Rossini, Chance 16 (2003)
     • "We re-analyzed the breast cancer data from van 't Veer et al. (2002). ... Even with some help of the authors, we were unable to exactly reproduce this analysis." R. Tibshirani, B. Efron, SAGMB (2002)

  27. Re-analysis of a breast cancer outcome study
     E. Huang et al., Gene expression predictors of breast cancer outcome, The Lancet 361 (9369): 1590-6 (2003)
     • 89 primary breast tumors on Affymetrix chips (HG U95av2), among them 52 with 1-3 positive lymph nodes: 18 led to recurrence within 3 years, 34 did not.
     • Goal: predict recurrence
     • Claim: 5 misclassification errors, 1 unclear (leave-one-out cross-validation)
     • Method: Bayesian binary prediction trees (at the time, unpublished)
     http://www.cagp.duke.edu

  28. …we tried to reproduce these results, starting from the published microarray raw data (CEL files)
     But couldn't.
     • The paper (and supplements) didn't contain the necessary details to re-implement their algorithm.
     • Authors didn't provide comparisons to simple well-known methods.
     • In our hands, all other methods resulted in worse misclassification results.
     • Is their new Bayesian tree method miles better than everything else?
     • Or was their analysis over-optimistic? (over-fitting, selection bias)
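Selection bias is easy to demonstrate on data that contain no signal at all. The sketch below is an illustration under stated assumptions, not a reconstruction of either analysis: the data are pure simulated noise, the classifier is k-nearest-neighbor from the `class` package, and the helper `top_features` is hypothetical. Selecting features on all samples before cross-validation makes the error estimate look far better than chance, while repeating the selection inside every cross-validation fold gives an honest estimate.

```r
library(class)
set.seed(42)

n <- 40        # samples
p <- 2000      # features ("genes"), all pure noise
n_sel <- 10    # number of features to keep
x <- matrix(rnorm(n * p), nrow = n)
y <- factor(rep(c("recurrence", "no_recurrence"), each = n / 2))

# Rank features by an absolute two-sample t-like score; return the top indices.
top_features <- function(x, y, n_sel) {
  g1 <- x[y == levels(y)[1], , drop = FALSE]
  g2 <- x[y == levels(y)[2], , drop = FALSE]
  se <- sqrt(apply(g1, 2, var) / nrow(g1) + apply(g2, 2, var) / nrow(g2))
  score <- abs(colMeans(g1) - colMeans(g2)) / se
  order(score, decreasing = TRUE)[seq_len(n_sel)]
}

# Biased: features chosen using ALL samples, then leave-one-out CV of the
# classifier only. The left-out sample has already influenced the selection.
sel_all <- top_features(x, y, n_sel)
biased_err <- mean(knn.cv(x[, sel_all], y, k = 3) != y)

# Honest: the feature selection is repeated inside each leave-one-out fold.
honest_pred <- sapply(seq_len(n), function(i) {
  sel <- top_features(x[-i, ], y[-i], n_sel)
  as.character(knn(x[-i, sel], x[i, sel, drop = FALSE], y[-i], k = 3))
})
honest_err <- mean(honest_pred != as.character(y))

print(c(biased = biased_err, honest = honest_err))
```

Since the labels are unrelated to the features, any estimate well below 50% error is an artifact of the validation scheme.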

  29. A general pattern
     • New publications often present a new microarray data set, and a new classification method.
     • Merits of the methods and merits of the data are entangled.
     • Is it necessary to develop an idiosyncratic method?
     • Which result could be achieved with standard approaches? (accuracy vs. interpretability)
     • Is there a big difference, and what are the reasons for it? (errors happen … in implementation / validation)

  30. Compendia
     Interactive documents that contain:
     • Primary data
     • Processing methods (computer code)
     • Derived data, figures, tables and other output
     • Text: research report (result, materials and methods, conclusions)
     Package compHuang: reanalysis of Huang et al. data, using different classification and preprocessing methods and a correct cross-validation procedure for estimating the prediction error
     Based on R/Bioconductor's package and vignette technologies
     M. Ruschhaupt, W. Huber, A. Poustka, U. Mansmann, Statistical Applications in Genetics and Molecular Biology (2004)

  31. Sweave: from source markup (here: LaTeX & R) to processed document (here: PDF)
     Source markup:
       <<MCRestimate call,eval=FALSE,echo=TRUE>>=
       r.forest <- MCRestimate(eset, class.label, class.function="RF.wrap",
                               select.fun=red.fct, cross.outer=10,
                               cross.inner=5, cross.repeat=20)
       @
       <<rf.save,echo=FALSE,results=hide>>=
       savepdf(plot(r.forest, main="Random Forest"), "image-RF.pdf")
       @
       <<result>>=
       r.forest
       @
       The final document includes results of the calculation, graphical
       outputs, tables, and optionally parts of the R-Code which has been
       used. Also the description of the experiment, the interpretation of the
       results, and the conclusion can be integrated. In this example we
       applied our compendium to T. Golubs ALL/AML data~\cite{Golub.1999}.
       \begin{figure}[h]
       \begin{center}
       \includegraphics[width=0.4\textwidth]{image-RF}
       \end{center}
       \end{figure}
       \smallskip
       <<summary,echo=FALSE>>=
       method.list <- list(r.forest, r.pam, r.logReg, r.svm)
       name.list <- c("RF", "PAM", "PLR", "SVM")
       conf.table <- MCRconfusion(method.list, col.names=name.list)
       @
       <<writinglatex1,echo=FALSE, results=tex>>=
       xtable(conf.table, "Overall number of misclassifications",
              label="conf.table", display=rep("d",6))
       @
       %\input{samples.1}
       %\input{conf.table}
       \begin{thebibliography}{1}
       \bibitem[Golub et al. ,1999]{Golub.1999} Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.
       \newblock Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
       \newblock\textit{Science} 286(5439): 531-7 (1999).
       \end{thebibliography}
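The Sweave workflow itself can be tried with a much smaller document than the one on the slide. The following sketch writes a tiny, made-up `.Rnw` file into a temporary directory and processes it with `Sweave()` from the `utils` package, which evaluates the R chunks and writes a `.tex` file ready for LaTeX. The file name `demo.Rnw` and its contents are assumptions for illustration only.

```r
# A tiny, hypothetical Sweave source: LaTeX text with one embedded R chunk
# and one inline \Sexpr{} expression.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<compute, echo=TRUE>>=",
  "x <- c(1, 2, 3, 4)",
  "mean(x)",
  "@",
  "The mean of x is \\Sexpr{mean(x)}.",
  "\\end{document}"
)

# Work in a fresh temporary directory, because Sweave() writes its output
# into the current working directory.
workdir <- tempfile("sweave-demo")
dir.create(workdir)
old_wd <- setwd(workdir)
writeLines(rnw, "demo.Rnw")

# Run the chunks and weave code, results and text into demo.tex.
utils::Sweave("demo.Rnw", quiet = TRUE)
tex <- readLines("demo.tex")
setwd(old_wd)

cat(tex, sep = "\n")
```

Running LaTeX on the resulting `demo.tex` then yields the processed document; the R code, its output, and the text all come from the single source file.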

  32. Structure of a compendium
     Package directory:
     • General info: author, version, …
     • software documentation: manual pages
     • source markup
     • additional software code: function definitions
     • data directory: data files

  33. Courses
     • NGFN: 4 times per year; Berlin, München, Saarbrücken, Heidelberg
     • EMBO course: June 2006 at EBI, Cambridge UK (A. Brazma, W. Huber, J. Quackenbush)
     • Brixen/Bressanone (Tyrol): June 2006 (R. Irizarry, W. Huber, R. Gentleman)

  34. Courses

  35. Courses
     EBI, Hinxton / Cambridge, 26-30 June 2006
     A Brazma, M Kapushevsky, J Quackenbush, A Culhane, W Huber
     • Array Express / Expression Profiler
     • Bioconductor
     • MEV

  36. Courses
     Brixen / Bressanone, 19-23 June 2006
     W Huber, R Irizarry, R Gentleman
     • Day 1: Introduction to R and Bioconductor; biology for nonbiologists; microarray and other high throughput technologies.
     • Day 2: Affymetrix and cDNA experiments; data preprocessing and normalization; quality assessment; model based analysis.
     • Day 3: Machine learning approach: supervised and unsupervised; data visualization.
     • Day 4: Annotation, KEGG, GO, bind, meta analysis: combining microarray experiments.
     • Day 5: Graphs, networks, biochemical pathways, proteomics.

  37. I don't need statistics
