1 / 26

Statistical Analysis of Microarray Data

Statistical Analysis of Microarray Data. By H. Bjørn Nielsen & Hanne Jarmer. Question Experimental Design. The DNA Array Analysis Pipeline. Array design Probe design. Sample Preparation Hybridization. Buy Chip/Array. Image analysis. Normalization. Expression Index Calculation.

ifama
Download Presentation

Statistical Analysis of Microarray Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical Analysis of Microarray Data By H. Bjørn Nielsen & Hanne Jarmer

  2. Question Experimental Design The DNA Array Analysis Pipeline Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Normalization Expression Index Calculation Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

  3. What's the question? Typically we want to identify differentially expressed genes Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media alcohol dehydrogenase without alcohol with alcohol

  4. He’s going to say it However, the measurements contain stochastic noise There is no way around it Statistics

  5. statistics Noisy measurements p-value You can choose to think of statistics as a black box But, you still need to understand how to interpret the results

  6. The output of the statistics P-value The chance of rejecting the null hypothesis by coincidence ---------------------------- For gene expression analysis we can say: the chance that a gene is categorized as differentially expressed by coincidence

  7. The statistics gives us a p-value for each gene We can rank the genes according to the p-value But, we can’t really trust the p-value in a strict statistical way!

  8. Why not! For two reasons: We are rarely fulfilling all the assumptions of the statistical test We have to take multi-testing into account

  9. The t-test Assumptions 1. The observations in the two categories must be independent 2. The observations should be normally distributed 3. The sample size must be ‘large’ (>30 replicates)

  10. Multi-testing? In a typical microarray analysis we test thousands of genes If we use a significance level of 0.05 and we test 1000 genes. We expect 50 genes to be significant by chance 1000 x 0.05 = 50

  11. Bonferroni: P ≤ Confidence level of 99% 0.01 N Benjamini-Hochberg: P ≤ i N 0.01 Correction for multiple testing N = number of genes i = number of accepted genes

  12. But really, those methods are too strict What we can trust is the ability of the statistical test to rank the genes according to their reliability The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff If we permute the samples we can get an estimate of the False Discovery Rate (FDR) in this set

  13. P-value log2 fold change (M) Volcano Plot

  14. What's inside the black box ‘statistics’ t-test or ANOVA

  15. The t-test Calculate T Lookup T in a table

  16. Density Intensity of gene x The t-test II The t-test tests for difference in means () mut mutant wt wt

  17. The t-test III The t statistic is based on the sample mean and variance t

  18. ANOVA ANalysis Of Variance Very similar to the t-test, but can test multiple categories Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2 Advantage: it has more ‘power’ than the t-test

  19. Variance between groups Variance within groups ANOVA II Density Intensity

  20. Blocks and paired tests Some undesired factors may influence the experiments, the effect of such can be greatly reduced if they are blocked out or if the experiment is paired. Some possible blocks: - Dye - Patient - Technician - Batch - Day of experiment - Array

  21. Paired t-test Hypothesis: There is no difference between the mean BLUE and RED Block 1 Block 2 Block 3

  22. A187 CreA Glucose Ethanol Glucose Ethanol 3x An example: (2-way ANOVA with blocks) Experimental Design: Everything was done in batches to capture the systematic noise

  23. Example: Batch to batch variation • Within batch variation is lower than the between batch variation

  24. Two-way ANOVA Effect 1 Ethanol Glucose Effect 2 A187 CreA Batch ABC Example: Data analysis - Blocking • We can capture the batch variation by blocking

  25. A genotype p-value A growth media p-value Two-way ANOVA An interaction p-value Media effect Glucose Ethanol Genotype effect A187 CreA Batch ABC Ex: Result of the 2-way ANOVA(3 p-values)

  26. Conclusion • Array data contains stochastic noise • Therefore statistics is needed to conclude on differential expression • We can’t really trust the p-value • But the statistics can rank genes • The capacity/needs of downstream processes can be used to set cutoff • FDR can be estimated • t-test is used for two category tests • ANOVA is used for multiple categories

More Related