Analysis of Time Course Microarray Experiments

Analysis of Time Course Microarray Experiments. Claudia Angelini Istituto per le Applicazioni del Calcolo c.angelini@iac.cnr.it. Outline. Background From Steady-state to Time-course microarray experiments From Biological to Statistical questions



  1. Analysis of Time Course Microarray Experiments Claudia Angelini Istituto per le Applicazioni del Calcolo c.angelini@iac.cnr.it

  2. Outline • Background • From Steady-state to Time-course microarray experiments • From Biological to Statistical questions • Detection and estimation of genes’ expression profiles • Statistical methods for Time Course Microarray and their Software: EDGE, TimeCourse, BATS • Real Data Application • Other related problems • Conclusion

  3. Background on Gene Expression • Each cell contains a complete copy of the organism's genome, yet cells are of many different types and states. • What makes the cells different is the way they synthesize proteins and develop their biological functions. Understanding the differences among cell types, and the way cells react to a given stimulus or treatment, is the key to understanding functional genomics. • In most cases the difference among biological samples is reflected in the “genes’ expression” (i.e., the transcription level or abundance; transcription is the step in which DNA is transcribed into mRNA). • Differential gene expression describes when, where, and how much each gene is expressed.

  4. Background on Microarray Microarray experiments are high-throughput biological assays for measuring the abundance of DNA or mRNA sequences in different types of cell samples (for several thousand sequences simultaneously), and hence yield information on “gene expression levels”. They are based on hybridization.

  5. Background on Microarray Gene Expression assays • Spotted cDNA arrays (Brown/Botstein) • Short oligonucleotide arrays (Affymetrix) • Long oligonucleotide arrays (Agilent Inkjet) • Fibre optic arrays (Illumina BeadChips) • Etc. In the following we will focus on cDNA microarray experiments; however, most of the statistical approaches apply to other platforms. The fluorescent spectrum is a picture of the relative DNA/mRNA abundance in the two samples under given conditions (or at a specific time).

  6. cDNA microarray: experiment description for case-control design Figure labels: “Treated Sample”, “Control Sample”, “statistical level”. Note: after some pre-processing, a summary value is computed for each spot (gene). GREEN represents high Control hybridization. RED represents high Treated Sample hybridization. YELLOW represents a combination of Control and Treated Sample where both hybridized equally. BLACK represents areas where neither the Control nor the Treated Sample hybridized.
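The per-spot value computed after pre-processing is not preserved in this transcript. A common choice for two-colour arrays, assumed here purely for illustration, is the log2 ratio of the treated (red) to the control (green) intensity. The function name `log_ratio` and the `floor` guard are hypothetical, not taken from the slides.

```python
import numpy as np

def log_ratio(treated, control, floor=1.0):
    """Per-spot log2(treated / control) expression measure.

    `floor` is a hypothetical guard against zero or negative
    background-corrected intensities; it is not from the slides.
    """
    treated = np.maximum(np.asarray(treated, dtype=float), floor)
    control = np.maximum(np.asarray(control, dtype=float), floor)
    return np.log2(treated / control)

# Four illustrative spots: red-dominant, green-dominant, balanced, and dark.
red = [4000.0, 500.0, 1200.0, 1.0]
green = [1000.0, 2000.0, 1200.0, 1.0]
m = log_ratio(red, green)
```

Under this convention positive values correspond to red spots, negative values to green spots, and values near zero to yellow or black spots, which is why intensity filtering is usually needed to tell the last two apart.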

  7. Static microarray experiments Experiments are replicated (either technical or biological replicates). Statistical problems • Testing • Estimation • Classification • Clustering • Genes’ Network Several methods have been proposed and implemented in standard software.

  8. More general experimental designs • The expression level of genes in a given cell can be influenced by a pathological status or by a pharmacological or medical treatment. • The response to a given stimulus is usually different for different genes and may depend on time; in fact, gene expression is often a dynamic process. • Time course microarray experiments can be carried out in order to study these dynamics. Study dynamic biological processes • Cell cycle • Response to temperature/environment changes or treatments • Developmental studies, immune response • Etc. (e.g., age or dose response) Source: Ernst & Bar-Joseph, 2006. About 30% of microarray experiments are time course.

  9. Some comments About 70-80% of microarray time series experiments are short: • 5-10 time points & very few replicates • Often samples are not taken on a regularly spaced grid • Cost of microarrays • Limited availability of biological material • Presence of missing data • High level of noise in the data (even after preprocessing) Statistical problems related to time series microarray experiments • Automatic detection of periodic (cell cycle) genes • Automatic detection of differentially expressed profiles • Estimation • Clustering (or also Classification) • Genes’ Network Focus of the talk: automatic detection of differentially expressed profiles and their estimation.

  10. Testing and Estimation in time course experiments Several experimental designs can be considered, each of which may require tailored statistical methods: “one (statistical) sample” and “two (statistical) samples”. “Simple and preliminary” statistical questions: given the observations, for which gene index are the curves different from zero? Or for which index do the two curves differ from each other in the two samples? And if the curve is different from zero (or differs from the other sample), how can we estimate the “treatment effect”, i.e., the temporal response of each gene to the treatment?

  11. “Standard” Software for Microarray Data Analysis Most analysis software is designed for “Static Gene Expression Data”. Such tools • Do not take advantage of the sequential information in time series data • Are often invariant under permutation of the time points • Do not use the value of the time explicitly. Statistical methods not specifically designed for time course data can lack statistical power.
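The permutation-invariance point can be checked numerically. The sketch below uses a one-way ANOVA F statistic as a stand-in for a generic "static" test (the slides do not name a specific one) and shows that scrambling the order of the time points leaves the statistic unchanged. The simulated data and all names are illustrative assumptions.

```python
import numpy as np

def anova_f(y):
    """One-way ANOVA F statistic; y has shape (time points, replicates)."""
    n_groups, n_rep = y.shape
    grand = y.mean()
    group_means = y.mean(axis=1)
    ss_between = n_rep * np.sum((group_means - grand) ** 2)
    ss_within = np.sum((y - group_means[:, None]) ** 2)
    return (ss_between / (n_groups - 1)) / (ss_within / (n_groups * (n_rep - 1)))

rng = np.random.default_rng(0)
times = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
# Simulated profile for one gene: smooth trend plus noise, 3 replicates.
y = np.log2(times)[:, None] + rng.normal(scale=0.3, size=(5, 3))

shuffled = y[rng.permutation(len(times))]  # scramble the time order
f_original, f_shuffled = anova_f(y), anova_f(shuffled)
```

Because `f_original` and `f_shuffled` agree, such a test cannot distinguish a smooth temporal response from the same values in arbitrary order, which is the information a time-course method is meant to exploit.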

  12. “Standard” methods for time series analysis On the other hand, most of the methods for time series data are based on transformations such as Fourier or wavelet, or on asymptotic approaches. They cannot be applied to microarray experiments due to the limited number of observations available. Moreover, when testing significance a “global” answer is required, and since thousands of curves are simultaneously compared, some “multiple comparisons control procedures” are mandatory (e.g., to control the false positives). Global answer vs. point-by-point analysis. NEED FOR “NEW” STATISTICAL METHODS AND SPECIFIC SOFTWARE
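As one concrete example of a multiple comparisons control procedure, here is a minimal sketch of the Benjamini-Hochberg step-up rule for controlling the false discovery rate. It is shown only as a familiar illustration; the p-values are made up and the function name is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
mask = benjamini_hochberg(pvals, alpha=0.05)
```

With these eight p-values only the two smallest are rejected, although five of them fall below 0.05 individually; this gap is what the control procedure is there to enforce.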

  13. “New” approaches Recently, new methods have been proposed in the literature for the identification and estimation of time-course gene expression profiles from microarray data. Among others: • EDGE • TimeCourse • BATS • Others are currently under development… These methods are based on different assumptions and can usually be applied in different contexts. For a comparison among different methods: M. Mutarelli, L. Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A. Weisz (2007). Time course whole-genome microarray analysis of estrogen effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to appear)

  14. What is EDGE? EDGE is a user-friendly software for the Extraction and Analysis of Differential Gene Expression. It implements the novel approach proposed in Storey et al. (2005). EDGE allows the user to automatically identify and rank differentially expressed genes, but it does not explicitly estimate their expression profiles. It also controls the FDR.

  15. What is EDGE used for? EDGE can be applied to both ‘one sample’ and ‘two sample’ designs, and to both ‘longitudinal’ and independent-samples data. EDGE implements a functional approach in which gene expression is expanded in a B-spline basis with fixed (common) degree; an F-test similar to the one used in the ANOVA model provides the test statistic, the p-values are estimated using the bootstrap, and an FDR-type control is applied.

  16. The statistical model in EDGE Functional approach, for the “one sample” problem. Notation on the slide: number of genes; number of time points; number of replicates; B-spline basis of degree p. The coefficients are estimated from the data Z using “least squares”.
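The least-squares step can be sketched as follows for a single gene. This is not EDGE's code: to stay dependency-free it uses a truncated power basis, which spans the same spline space as a B-spline basis with the same degree and knots but is less stable numerically. The time grid, knot position, and simulated values are illustrative assumptions.

```python
import numpy as np

def spline_design(t, degree, knots):
    """Truncated power basis: 1, t, ..., t^degree, (t - knot)_+^degree."""
    t = np.asarray(t, dtype=float)
    cols = [t ** d for d in range(degree + 1)]
    cols += [np.clip(t - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)

# One gene, 8 time points with 2 replicates each (illustrative values).
hours = np.repeat([0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 16.0], 2)
t = hours / hours.max()                      # rescale to [0, 1] for conditioning
rng = np.random.default_rng(1)
z = np.sin(hours / 4.0) + rng.normal(scale=0.1, size=hours.size)

X = spline_design(t, degree=3, knots=[0.375])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
fitted = X @ coef
```

Replicates enter simply as repeated rows of the design matrix, so the fitted curve is shared by all replicates of the gene.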

  17. Testing significance with EDGE The statistic is built from the sum of squares under the alternative and the sum of squares under the null. The p-values are estimated using bootstrap/permutations. Automatic detection is carried out by controlling q-values. A corresponding formulation is given for the “two sample” problem.
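A rough single-gene sketch of this kind of test follows. It assumes the statistic has the form (SS0 - SS1) / SS1 with a flat curve as the null model, and it uses a simple residual bootstrap; both are assumptions made for illustration, since the slide's formulas are not preserved, and EDGE's actual resampling scheme may differ. A cubic polynomial stands in for the spline basis.

```python
import numpy as np

def rss(X, z):
    """Residual sum of squares of the least-squares fit of z on X."""
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return float(np.sum((z - X @ coef) ** 2))

def f_like_statistic(X_alt, z):
    """(SS0 - SS1) / SS1, with a flat curve as the null model."""
    X_null = np.ones((z.size, 1))
    ss0, ss1 = rss(X_null, z), rss(X_alt, z)
    return (ss0 - ss1) / ss1

def bootstrap_pvalue(X_alt, z, n_boot=500, seed=0):
    """Residual-bootstrap p-value under the flat-curve null."""
    rng = np.random.default_rng(seed)
    observed = f_like_statistic(X_alt, z)
    coef, *_ = np.linalg.lstsq(X_alt, z, rcond=None)
    resid = z - X_alt @ coef
    hits = 0
    for _ in range(n_boot):
        # Null data: flat mean plus resampled alternative-model residuals.
        z_star = z.mean() + rng.choice(resid, size=z.size, replace=True)
        hits += f_like_statistic(X_alt, z_star) >= observed
    return (hits + 1) / (n_boot + 1)

t = np.repeat(np.linspace(0.0, 1.0, 8), 2)
X_alt = np.vander(t, 4, increasing=True)           # cubic polynomial stand-in
rng = np.random.default_rng(2)
z_flat = rng.normal(scale=0.2, size=t.size)        # no time effect
z_trend = 2.0 * t + rng.normal(scale=0.2, size=t.size)
p_flat = bootstrap_pvalue(X_alt, z_flat)
p_trend = bootstrap_pvalue(X_alt, z_trend)
```

The smallest attainable p-value here is 1 / (n_boot + 1), which illustrates why resampling-based p-values are discrete and why many resamples are needed when thousands of genes are tested.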

  18. What else about EDGE? EDGE was the first to propose a truly functional approach and was found to perform well in several applications; however: • Since p-values are evaluated via the bootstrap, EDGE can be computationally intensive and can suffer from the so-called “granularity” problem. • EDGE does not allow “missing values”; however, it implements a KNN procedure in order to fill them. • EDGE imposes that all genes are expanded in a B-spline basis with fixed degree p (the degree can be estimated from the data or chosen by the user). EDGE has been developed at the University of Washington. Leek JT, Monsen EC, Dabney AR, and Storey JD. (2006) EDGE: Extraction and analysis of differential gene expression. Bioinformatics, 22: 507-508. Storey JD, Xiao W, Leek JT, Tompkins RG, and Davis RW. (2005) Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102: 12837-12842.

  19. EDGE web site EDGE can be downloaded at http://www.biostat.washington.edu/software/jstorey/edge//index.php. It requires R > 2.5 and Bioconductor. EDGE is free for academic, non-commercial use.

  20. What is TimeCourse? TimeCourse is a new R package that implements the novel Multivariate Empirical Bayes approach described in Tai and Speed (2006). • TimeCourse allows the user only to rank the genes’ expression profiles. It does not provide any automatic cut-off for selecting differentially expressed genes, nor does it control the multiple comparisons error. • TimeCourse can be applied to both ‘one sample’ and ‘two sample’ designs. However, in the latter case it is applicable only to data sets with identical time grids. • TimeCourse is specifically designed for purely ‘longitudinal’ data.

  21. The statistical model in TimeCourse Notation on the slide: number of genes; number of replicates; values observed at the time points for gene i and replicate/individual k. Designed for longitudinal studies, i.e., the “same individual” k is assumed to be recorded at all time points. Note: TimeCourse requires the number of replicates to be the same at each time point; it can differ from gene to gene. Missing values are not allowed. Note: in the “two sample” design the grid has to be the same for all samples and all replicates.

  22. Multivariate Bayesian model in TimeCourse Note: the Bayesian approach is elicited in the time domain! The “value” of the time points does not enter the model. The prior distinguishes two cases: gene “non affected” by the treatment and gene “affected” by the treatment. All hyper-parameters can be estimated from the data.

  23. Testing significance with TimeCourse For the “one sample” problem: prior information, combined with the model and the observed data, gives a posterior distribution that is analytically known. TimeCourse ranks the genes’ expression profiles using the Hotelling T² statistic if all the genes have the same number of replicates, or the MB-statistic if genes have different numbers of replicates. Explicit forms of both the Hotelling and MB statistics are available in Tai and Speed (2006) (not shown here for brevity). A corresponding formulation is given for the “two sample” problem.
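For orientation, the classical (unmoderated) one-sample Hotelling T² statistic can be sketched as below. This is the textbook statistic, not the moderated version of Tai and Speed (2006); it needs more replicates than time points, which microarray time courses rarely have, and that limitation is one motivation for the empirical Bayes treatment. The toy data are illustrative.

```python
import numpy as np

def hotelling_t2(x, mu0=None):
    """One-sample Hotelling T^2; x has shape (replicates, time points)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    if n <= k:
        raise ValueError("sample covariance is singular unless replicates > time points")
    mu0 = np.zeros(k) if mu0 is None else np.asarray(mu0, dtype=float)
    diff = x.mean(axis=0) - mu0
    s = np.atleast_2d(np.cov(x, rowvar=False))   # k x k sample covariance
    return float(n * diff @ np.linalg.solve(s, diff))

# Toy gene: 4 replicates, 2 time points, mean profile (2, 3).
x = np.array([[3.0, 3.0], [1.0, 3.0], [2.0, 4.0], [2.0, 2.0]])
t2 = hotelling_t2(x)
```

Note that the statistic treats the profile as a vector of values and never uses the times themselves, in line with the remark that the time variable does not enter the model explicitly.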

  24. What else about TimeCourse? TimeCourse only ranks the genes’ expression profiles; it does not automatically select the ones that are “differentially expressed”. All the parameters of the method can be estimated from the data or chosen by the user. • The ‘two sample’ design is applicable only to data sets with identical time grids. • No missing data are allowed (a preliminary procedure for filling the missing values or filtering the data is necessary). • Since the Multivariate Bayesian approach is elicited in the physical domain, the time variable does not enter explicitly in the model. TimeCourse has been developed at the Speed Berkeley Research Group. Tai, Y.C. and Speed, T.P. (2006) A multivariate empirical Bayes statistic for replicated microarray time course data. Annals of Statistics, 34, 2387-2412. Tai, Y.C. and Speed, T.P. (2007) On the gene ranking of replicated microarray time course data. Tech. rep. 735, Berkeley University.

  25. What is BATS? BATS is a user-friendly software for Bayesian Analysis of Time Series microarray experiments, based on the novel functional Bayesian approach proposed in Angelini et al. (2007). BATS allows the user to automatically identify and rank differentially expressed genes, to control the multiple comparisons error and to estimate their expression profiles. BATS successfully manages various technical difficulties which arise in microarray time-course experiments, such as the small number of observations available, non-uniform sampling intervals, the presence of missing or multiple data, as well as temporal dependence between observations for each gene. BATS is suited for the “one sample” statistical design; the “two sample” statistical design (without any grid restriction) is under implementation.

  26. The statistical model in BATS For the “one sample” problem. Notation on the slide: number of genes; number of time points; number of replicates; gene “true” functional profile; noise, assumed i.i.d. We assume that the gene expression time profile is a smooth curve expanded in an orthogonal system on [0,T].
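To make the expansion idea concrete, the sketch below fits one gene's profile by least squares in a Legendre polynomial basis mapped from [0, T] to [-1, 1]. The choice of Legendre polynomials is an assumption for illustration (the slide only says "orthogonal system"), this is a plain least-squares fit rather than the Bayesian estimator in BATS, and the function names and data are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def fit_profile(times, values, horizon, n_terms):
    """Least-squares Legendre coefficients for one gene's time profile.

    `times` may be irregular and may repeat (replicates); missing values
    are simply left out of both arrays.
    """
    u = 2.0 * np.asarray(times, dtype=float) / horizon - 1.0   # map [0, T] to [-1, 1]
    design = legendre.legvander(u, n_terms - 1)
    coef, *_ = np.linalg.lstsq(design, np.asarray(values, dtype=float), rcond=None)
    return coef

def evaluate_profile(coef, t, horizon):
    """Evaluate the fitted expansion at arbitrary times in [0, T]."""
    u = 2.0 * np.asarray(t, dtype=float) / horizon - 1.0
    return legendre.legval(u, coef)

# Irregular grid; t = 2 has three replicates and t = 12 has a single observation.
times = np.array([1, 1, 2, 2, 2, 4, 4, 8, 8, 12, 16, 16], dtype=float)
values = 0.5 + 1.5 * (2.0 * times / 16.0 - 1.0)     # noiseless degree-1 profile
coef = fit_profile(times, values, horizon=16.0, n_terms=3)
midpoint = float(evaluate_profile(coef, 8.0, horizon=16.0))
```

Because the curve is a function of time, unequal spacing, unequal replicate counts, and dropped observations only change which rows appear in the design matrix.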

  27. Bayesian model in BATS We assume that genes are conditionally independent, and we place a prior on the unknown parameters. Labels on the slide: a truncated Poisson prior; gene “affected” by the treatment; gene-specific variance; gene “non affected” by the treatment; prior probability of not being affected by the treatment (to be estimated from the data).

  28. Noise model in BATS We distinguish 3 models: Model 1), i.e., the marginal distribution of the noise is Gaussian; Model 2), i.e., the marginal distribution of the noise is Student T; Model 3), i.e., the marginal distribution of the noise is double-exponential. It is therefore possible to model “non-Gaussian noise”. Prior information, combined with the model and the observed data, gives the posterior distribution, which in cases 1)-3) is analytically known.
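One standard way to obtain these three marginals, assumed here because the slide's formulas are not preserved, is to treat the noise as Gaussian given its variance and to vary the law of the variance: a fixed variance gives a Gaussian marginal, an inverse-gamma variance gives a Student t marginal, and an exponential variance gives a double-exponential (Laplace) marginal. The simulation below checks the heavier tails through the kurtosis; the parameter values are illustrative.

```python
import numpy as np

def sample_noise(model, size, rng, scale=1.0, dof=6.0):
    """Draw noise as a normal scale mixture; the mixing law picks the marginal."""
    if model == "gaussian":            # fixed variance
        var = np.full(size, scale ** 2)
    elif model == "student":           # inverse-gamma variance -> Student t
        var = scale ** 2 * dof / rng.chisquare(dof, size)
    elif model == "laplace":           # exponential variance -> double exponential
        var = rng.exponential(2.0 * scale ** 2, size)
    else:
        raise ValueError(model)
    return rng.normal(size=size) * np.sqrt(var)

def kurtosis(x):
    """Fourth moment over squared second moment (3 for a Gaussian)."""
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2)

rng = np.random.default_rng(0)
kurt = {m: kurtosis(sample_noise(m, 200_000, rng))
        for m in ("gaussian", "student", "laplace")}
```

The Student t and double-exponential draws show a kurtosis well above the Gaussian value of 3, which is the sense in which models 2) and 3) accommodate outlying measurements.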

  29. Testing significance with BATS For each gene i=1,…,N, given the posterior distribution, testing can in general be carried out by looking at the posterior probability of being significant or, equivalently, at the Bayes Factor. For the models under consideration the Bayes Factor (BF) can be analytically evaluated. Multiple comparisons control with BATS: in order to account for multiple comparisons, BATS implements the Bayesian Multiple Testing Procedure by Abramovich & Angelini (2006). The procedure is based on “ordered Bayes Factors” and is similar in spirit to the Benjamini and Hochberg FDR control. Estimating the treatment’s effect with BATS.
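The equivalence between the posterior probability and the Bayes Factor follows from Bayes' rule: posterior odds equal the Bayes Factor times the prior odds. The sketch below assumes the Bayes Factor is written in favour of the null ("not affected") hypothesis; that direction, and the function name, are assumptions made for illustration. It does not implement the Abramovich & Angelini procedure.

```python
def posterior_null_probability(bayes_factor_null, prior_null):
    """P(H0 | data) from BF01 = p(data | H0) / p(data | H1) and P(H0)."""
    if not 0.0 < prior_null < 1.0:
        raise ValueError("prior_null must lie strictly between 0 and 1")
    odds = bayes_factor_null * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)

# With a prior probability 0.9 of being unaffected:
uninformative = posterior_null_probability(1.0, 0.9)    # data do not discriminate
strong_signal = posterior_null_probability(0.01, 0.9)   # data favour H1 100 to 1
```

For a fixed prior the posterior probability is monotone in the Bayes Factor, so ranking genes by either quantity gives the same order.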

  30. BATS website BATS Version 1.0 is freely downloadable at http://www.na.iac.cnr.it/~bats/

  31. BATS: Main windows Screenshot labels: one section for running a simulation study; one section for analyzing a given dataset; several additional tools for filtering data, displaying profiles and comparing results. “About” & “Help” buttons are available for each section.

  32. What about BATS BATS can carry out analysis with both simulated and real experimental data. BATS is written in MATLAB; executable files for Windows, Linux and Macintosh are now available. BATS is currently implemented for a single processor; however, future releases of the software will also include a version for multi-processor workstations. BATS has been developed at IAC-CNR and is a part of the CNR-Bioinformatics Interdepartmental Project. C. Angelini, D. De Canditiis, M. Mutarelli, M. Pensky (2007). Bayesian Approach to Estimation and Testing in Time Course Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, vol. 6, iss. 1, article 24. C. Angelini, L. Cutillo, D. De Canditiis, M. Mutarelli, M. Pensky (2007). BATS: A Bayesian User-Friendly software for analyzing time series microarray data. (Technical report IAC CNR 331/07) C. Angelini, D. De Canditiis, M. Pensky (2008). Bayesian models for the two-sample time-course microarray experiments. (Technical report IAC CNR, in preparation) M. Mutarelli, L. Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A. Weisz (2007). Time course whole-genome microarray analysis of estrogen effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to appear)

  33. Time course experiments: a Case Study The aim of the experiment is to identify estrogen-responsive genes in a human breast cancer cell line. Control sample: ZR-75.1 (human breast cancer cells). Treated sample: ZR-75.1 cells stimulated with a mitogenic dose of 17ß-estradiol. Treated samples were taken at successive time points; control samples were always taken at time t=0. The experiment was replicated at each time point. Biological questions: • Which genes are activated or repressed due to the treatment? • And if a gene is affected by the treatment, what is the treatment effect?

  34. Cicatiello et al. (2004) dataset description In the experiment, ZR-75.1 cells were stimulated with a mitogenic dose of 17ß-estradiol after 5 days of starvation on a hormone-free medium, and samples were taken after t = 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32 hours (non-regular grid), with a total of 11 time points covering the completion of a full mitotic cycle in hormone-stimulated cells. For each time point at least 2 replicates were available (3 replicates at t = 2, 8, 16 hours). After suitable filtering and preprocessing (Yang et al. (2002) and Cui et al. (2002)), N=8161 genes were analyzed by our method in order to detect estrogen-responsive genes. (Note that about 350 genes had at least one missing value.) The normalized dataset is included in BATS as an example for a guided analysis. Cicatiello, L., Scarfoglio, C., Altucci, L., Cancemi, M., Natoli, G., Facchiano, A., Iazzetti G., Calogero, R., Biglia, N., De Bortoli, M., Sfiligol, C., Sismondi, P., Bresciani, F. and Weisz, A. (2004). A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. Journal of Molecular Endocrinology, 32, 719-775.
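The sampling design can be written down as a small data structure. The replicate count of exactly 2 at the time points other than t = 2, 8, 16 is an assumption (the slide says "at least 2"), so the total below is a lower bound on the number of treated arrays.

```python
# Sampling times in hours, as listed on the slide.
time_points = [1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]

# Assumption: exactly 2 replicates wherever the slide says "at least 2".
replicates = {t: (3 if t in (2, 8, 16) else 2) for t in time_points}

gaps = [b - a for a, b in zip(time_points, time_points[1:])]
total_arrays = sum(replicates.values())
```

The spacing between consecutive samples takes the values 1, 2 and 4 hours, which is the non-regular grid the slide refers to.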

  35. Results using BATS BATS is very robust with respect to the list of genes detected as significant: 574 genes were common to all 28 lists, while 958 genes were selected by at least one combination of methods/parameters. Comparing with the 344 genes selected by hand in Cicatiello et al. (2004), the list of 574 common genes includes 270 of them; among the remaining 74 genes, 16 were filtered out in our analysis due to a more stringent quality selection before processing the data. On the other hand, 309 out of 344 were selected by at least one combination. Interestingly, 17 out of the 304 newly selected genes were replicate spots of genes already selected in Cicatiello et al. (2004), and most of the remaining ones are known to be involved in biological processes related to estrogen response, such as cell cycle and cell proliferation (AREG, NOLC1, cyclin D1), DNA replication (MCM7, RFC5), mRNA processing (SFRS1) and lipid metabolism (APOD and LDHA).
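The counts quoted on this slide fit together as follows; the snippet only redoes the arithmetic with the slide's own numbers, and the variable names are illustrative.

```python
hand_selected = 344        # genes chosen by hand in Cicatiello et al. (2004)
common_to_all = 574        # genes present in all 28 BATS lists
overlap = 270              # hand-selected genes among the 574

newly_selected = common_to_all - overlap                    # 304
missed = hand_selected - overlap                            # 74
filtered_out = 16
missed_but_analysed = missed - filtered_out                 # 58
recovered_by_some_list = 309
never_selected = hand_selected - recovered_by_some_list     # 35
```

So the 304 "newly selected" genes are the common BATS genes that were not in the hand-curated list, and 58 hand-selected genes passed the quality filter but did not appear in every BATS list.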

  36. Comparisons with EDGE and TimeCourse We compare the results of the BATS analysis with the newly available user-friendly software EDGE (Bioinformatics, 2006) and with the R package TimeCourse (Speed & Tai, Annals of Statistics, 2006). • On real data BATS shows a much wider overlap with the “biologist-inspired selection” than EDGE and the R TimeCourse package. Similar results are also confirmed by simulations using FDR, FNR, etc. as “goodness” measures.

  37. Other related problems: Clustering The identification of genes which are responsive to a given treatment is often a preliminary step for answering other questions of interest. Biological questions: • Which genes show a similar response to the treatment? Several methods have been proposed for gene clustering; however, very few of them are designed for time course microarray data. As a consequence, most of the information contained in the data cannot be properly used and the results are often not stable. Heard, N. A., Holmes, C. C., Stephens, D. A., Hand, D. J. and Dimopoulos, G. (2005) Bayesian Co-clustering of Anopheles Gene Expression Time Series: A Study of Immune Defense Response To Multiple Experimental Challenges. Proceedings of the National Academy of Sciences USA, 102(47), 16939-16944. Heard, N.A., Holmes, C.C., and Stephens, D. A. (2006). A quantitative study of gene regulation involved in the Immune response of Anopheline Mosquitoes: An application of Bayesian hierarchical clustering of curves. J. Amer. Statist. Assoc., 101, 18-29.

  38. Clustering • SplineClust: a “functional” Bayesian approach with the possibility of estimating the number of clusters from the posterior distributions, but: • No missing data are allowed • Only one observation per time point is allowed • Same degree for all functions • Computationally fast, but at a price: it uses hierarchical clustering (often not optimal) • No “goodness” measure available. There is still a shortage of specifically designed methods and of careful analyses of their performance.

  39. Conclusion Time-course microarray experiments are becoming extremely popular as a tool for investigating gene expression dynamics; however, they pose new challenges to statisticians and computer scientists, who have to develop specifically designed tools for handling and analyzing them. We have presented and compared several currently available methods and related software for analyzing time course microarray data, with particular focus on the problem of the automatic identification and estimation of gene expression profiles. There is still room for improving the available methods, and several projects are now devoted to this topic (since specific questions require specific tools).

  40. Thanks to the Organizers and all the participants for their attention. For any information contact me: c.angelini@iac.cnr.it