Statistics for Differential Expression

Presentation Transcript


  1. Statistics for Differential Expression Naomi Altman Oct. 06

  2. Some things to consider before we start
     • Model
     • Replication
     • Correlation / Independence
     • Treatments (conditions, varieties ...)

  3. Some things to consider before we start: Model
     Using a statistical model sheds light on the analysis by quantifying features such as condition effects and sources of biological and experimental variation. Models can be written down before the data are collected, which clarifies how the data should be collected and analyzed. When an estimate of variability is available, the model can be used to determine an appropriate sample size.

  4. Some things to consider before we start: Replication
     Statistical methods compare the differences among condition means to the variation within condition. The within-condition variation can only be estimated by replicating the condition. Often technical replication (multiple probes in a probeset or multiple hybridizations of the same sample) is treated as if it had biological meaning, but this is not true replication.

  5. Some things to consider before we start: Correlation / Independence
     Observations are correlated because:
     • they are taken on the same individual
     • they are measured on the same array
     • they are processed in the same replicate
     Most simple analysis methods assume independence and hence must be modified to handle correlated data.

  6. Some things to consider before we start: Treatments
     • what is interesting?
     • what is the "action"?
     • how many can we really handle?

  7. 2 treatments
     We have already considered the simple case of 2 treatments using t-tests (or permutation, bootstrap or Wilcoxon versions of the tests). Which tests do we use, and when are they appropriate?

  8. Tests for 2 treatments
     Two-sample "t-tests" (and similar tests) require independent samples within and between the 2 treatments, i.e.
     • all RNA samples are biologically independent
     • each sample is hybridized to a different array:
         • single-channel arrays such as Affy, Nimblegen, CodeLink
         • 2-channel arrays with a reference sample in the same channel on each array (use M as the data)
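The two-sample tests named on slides 7 and 8 can be sketched in R roughly as follows. This is a minimal illustration, not the author's code: the expression values are simulated, and the group sizes, means, and standard deviations are assumptions chosen only to make the example run.

```r
# Hypothetical log2 expression values for one gene,
# one array per biologically independent sample (simulated, not real data).
set.seed(1)
ctrl  <- rnorm(6, mean = 8.0, sd = 0.4)   # treatment 1
treat <- rnorm(6, mean = 8.8, sd = 0.4)   # treatment 2

# Welch two-sample t-test (R's default; does not assume equal variances).
t_res <- t.test(treat, ctrl)

# Wilcoxon rank-sum test: the rank-based version.
w_res <- wilcox.test(treat, ctrl)

# Permutation test: reshuffle the treatment labels and
# recompute the difference in means each time.
y    <- c(ctrl, treat)
grp  <- rep(c("ctrl", "treat"), each = 6)
obs  <- mean(y[grp == "treat"]) - mean(y[grp == "ctrl"])
perm <- replicate(5000, {
  shuffled <- sample(grp)
  mean(y[shuffled == "treat"]) - mean(y[shuffled == "ctrl"])
})
p_perm <- mean(abs(perm) >= abs(obs))     # two-sided permutation p-value

c(t = t_res$p.value, wilcoxon = w_res$p.value, permutation = p_perm)
```

All three tests rely on the samples being independent within and between treatments, which is why the conditions on this slide matter.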

  9. Tests for 2 treatments
     The paired "t-test" (and similar tests) requires:
     1. Each array includes both treatments.
     2. Different arrays come from different biological samples.
     3. There is no dye effect, or technical dye-swaps have been done and the technical replicates have been averaged.
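Condition 3 can be illustrated with a small R sketch. It assumes a hypothetical design with 5 biological pairs, each hybridized twice with the dyes swapped; the effect size, dye bias, and noise levels are made up. Averaging the two arrays on the A - B scale cancels a constant dye bias, after which the paired t-test is a one-sample t-test of the averaged M against 0.

```r
# Simulated dye-swap data (all numbers are assumptions for illustration).
set.seed(2)
n_pairs  <- 5
true_eff <- 1.0                        # treatment A minus treatment B, log2 scale
dye_eff  <- 0.3                        # constant red-minus-green dye bias
bio      <- rnorm(n_pairs, sd = 0.3)   # biological variation of the difference

# M = log2(Red) - log2(Green) on each array.
M_fwd <-  (true_eff + bio) + dye_eff + rnorm(n_pairs, sd = 0.1)  # A in red
M_rev <- -(true_eff + bio) + dye_eff + rnorm(n_pairs, sd = 0.1)  # A in green

# Put both arrays on the A - B scale and average: the dye bias cancels.
M_avg <- (M_fwd - M_rev) / 2

# Paired t-test expressed as a one-sample t-test on the averaged M.
res <- t.test(M_avg, mu = 0)
res
```

The test has `n_pairs - 1` degrees of freedom: the dye-swap arrays are technical replicates and do not add biological replication.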

  10. Tests for 3 or more treatments with independent samples
      Requires independent samples. (We cannot extend the paired-sample idea, because we do not have 3 or more channels on the array.)
      H0: all the population means are equal
      HA: at least one of the means differs

  11. Tests for 3 or more treatments with independent samples
      Examples:
      • Cancers: several cancer types with 1 sample per patient, several patients with each cancer
      • Genotypes: several genotypes of mice with 1 sample per mouse, several mice per genotype
      • Drug: different doses applied to different individuals with 1 sample per individual, several individuals per dose

  12. Tests for 3 or more treatments with independent samples
      The F-test (the extension of the t-test to 3 or more treatments) assumes that the spreads are all approximately equal and that the populations are approximately normally distributed. The other versions of the test do not require normality. The test statistic is the ratio of the variance among the sample means to the pooled variance within the samples.

  13. Tests for 3 or more treatments with independent samples: One-Way ANOVA
      Suppose there are $T$ treatments, with $n_i$ observations from the $i$th treatment, and $N = n_1 + \dots + n_T$. Then $F^* = MS_{tr}/MSE$ has an F-distribution when the null hypothesis is true.
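The slide's formulas for the two mean squares did not survive in the transcript. The standard one-way ANOVA definitions are given below as a reminder. The notation $Y_{ij}$ (observation $j$ in treatment $i$), $\bar{Y}_{i\cdot}$ (treatment mean) and $\bar{Y}_{\cdot\cdot}$ (overall mean) is an assumption and may differ from the original slide.

```latex
MS_{tr} = \frac{1}{T-1} \sum_{i=1}^{T} n_i \left( \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \right)^2,
\qquad
MSE = \frac{1}{N-T} \sum_{i=1}^{T} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{Y}_{i\cdot} \right)^2,
\qquad
F^* = \frac{MS_{tr}}{MSE} \sim F_{T-1,\; N-T} \ \text{under } H_0 .
```

With $T = 3$ and $N = 150$ this gives $2$ and $147$ degrees of freedom, which matches the iris output on the next slide.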

  14. One-way ANOVA
      summary(aov(iris$Sepal.Length~iris$Species))
                    Df Sum Sq Mean Sq F value    Pr(>F)
      iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
      Residuals    147 38.956   0.265
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Permutation, bootstrap and rank tests (Kruskal-Wallis test) are readily extended to this situation.
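As a sketch of the rank and permutation extensions mentioned on this slide, the same iris comparison can be run with base R as follows. The permutation scheme (shuffling the species labels and recomputing the F statistic 999 times) is one common choice and is not taken from the slides.

```r
# Rank-based alternative to the one-way ANOVA F-test.
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Helper (hypothetical name): F statistic of a one-way ANOVA of y on group g.
f_stat <- function(y, g) anova(lm(y ~ g))[["F value"]][1]

# Permutation version of the F-test: shuffle the species labels.
set.seed(3)
f_obs  <- f_stat(iris$Sepal.Length, iris$Species)
f_perm <- replicate(999, f_stat(iris$Sepal.Length, sample(iris$Species)))

# The observed labelling counts as one of the permutations.
p_perm <- (1 + sum(f_perm >= f_obs)) / (1 + length(f_perm))

kw$p.value
p_perm
```

With 999 permutations the smallest attainable permutation p-value is 0.001, so more permutations are needed when many genes are tested and very small p-values matter.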

  15. More complex situations
      Many microarray experiments do not fall into this simple situation, because of correlation in the data due to:
      • biological correlation (same cell-line, individual ...)
      • using 2-channel microarrays
      • having multiple probes for the same gene
      Also, we may have multifactor studies, e.g. 2 genotypes, control and exposed, time course.
      For this we use Linear Mixed Models.

  16. Linear Models
      It is useful to consider a model for the observed data (on a single probe or probeset):
      $Y = \log_2(\text{intensity}) = \mu + \alpha + \beta + \gamma + \dots + \text{error}$
      • $\mu$ is the mean over all the conditions and arrays
      • error is the random error, which is a mixture of measurement error and biological variability
      • the other terms are systematic deviations from the mean, due to the treatments, array effects, lab effects, etc.

  17. Linear Models
      e.g. Comparison of liver and kidney tissue in male and female mice on 2-channel arrays with 3 replicate spots per gene, using 5 males and 5 females.
      Y is the log2(intensity) in one channel for one spot.
      We need to remember that dye might have an effect.

  18. Linear Models
      Fixed effects are the conditions of interest in the experiment:
      Random effects are conditions which explain some of the noise in the model:

  19. How does the model help us? Generally, differential expression analysis is looking for differences between treatments that are larger than expected by chance. The model helps us to understand the meaning of "by chance". The model also allows us to design our experiment to minimize the probability of chance observation of large differences.

  20. How Does the Model Help Us?
      What is larger than expected by chance?
      • difference between male and female in liver
      • difference between liver and kidney in males
      Suppose the arrays are:
      • 5 arrays - male and female liver
      • 5 arrays - male and female kidney
      Alternatively, suppose the arrays are:
      • 5 arrays - male liver and kidney
      • 5 arrays - female liver and kidney

  21. The simplest model
      2 treatments on 2-channel arrays with independent biological samples, no dye effect and no dye-swap. All of the data are independent.
      $M = \log_2(\text{Red}) - \log_2(\text{Green})$
      $M_i = \mu + \text{error}_i$
      No differential expression implies $H_0: \mu = 0$.
      The F-test for this model is just $t^2$ from the paired t-test.
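The claim that the F-test equals $t^2$ can be checked numerically. The sketch below uses simulated log2 intensities (the sample size and means are assumptions) and builds F from the residual sums of squares of the null model ($\mu = 0$) and the full model ($\mu$ estimated).

```r
# Simulated log2 intensities for one gene on n two-channel arrays.
set.seed(4)
n     <- 8
red   <- rnorm(n, mean = 9.5, sd = 0.5)
green <- rnorm(n, mean = 9.0, sd = 0.5)
M     <- red - green                     # M = log2(Red) - log2(Green)

# Paired t statistic.
t_paired <- unname(t.test(red, green, paired = TRUE)$statistic)

# F statistic comparing the null model (mu = 0) to the full model (mu free).
ss_null <- sum(M^2)                      # residual SS under H0
ss_full <- sum((M - mean(M))^2)          # residual SS with mu estimated
F_stat  <- (ss_null - ss_full) / (ss_full / (n - 1))

all.equal(F_stat, t_paired^2)
```

The numerator has 1 degree of freedom and the denominator has $n - 1$, so the F-test and the two-sided paired t-test give the same p-value.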

  22. One-Way "ANOVA"
      $Y_{ij} = \mu + \alpha_i + \text{error}_{ij}$
      • $\mu$ is the mean expression for the gene over the entire experiment.
      • $\alpha_i$ is the deviation of the mean of the $i$th condition from the overall mean, $\sum_i \alpha_i = 0$.
      The error variance should not depend on the condition.

  23. More Complicated Models with Fixed Effects Only
      $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$
      We may have 2 or more factors, e.g.
      • genotype and drug dose
      • genotype and time point
      • treatment and dye
      $\mu$ is the mean expression for the gene over the entire experiment.
      $\alpha_i$ is the deviation of the mean of the $i$th level of factor A from the overall mean, $\sum_i \alpha_i = 0$.
      $\beta_j$ is the deviation of the mean of the $j$th level of factor B from the overall mean, $\sum_j \beta_j = 0$.
      $(\alpha\beta)_{ij}$ is the deviation of the mean of the $ij$th combination of levels from $\mu + \alpha_i + \beta_j$, $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$.
      The error variance should not depend on the condition.

  24. More Complicated Models with Fixed Effects Only
      (Two figure panels: "No interaction among factors" and "Interaction among factors".)

  25. More Complicated Models with Fixed Effects Only
      $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$
      Normal Theory ANOVA is readily extended to this situation and more factors can be added.
      Permutation and bootstrap methods begin to get complicated, but can still be applied.
      Rank-based methods are available for 2 factors, but get complicated.
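A two-factor fixed-effects model with interaction can be fitted with `aov` in the same way as the one-way case. The sketch below uses R's built-in `ToothGrowth` data purely as a stand-in, since no two-factor expression data appear in the slides: `supp` plays the role of factor A and `dose` the role of factor B.

```r
# ToothGrowth ships with R; it is a stand-in here, not expression data.
tg <- transform(ToothGrowth, dose = factor(dose))

# len ~ supp * dose expands to supp + dose + supp:dose,
# i.e. both main effects plus the interaction term.
fit <- aov(len ~ supp * dose, data = tg)
tab <- summary(fit)[[1]]
tab

# Cell means: if the supp difference changes across doses,
# that is what the interaction term captures.
cell_means <- with(tg, tapply(len, list(supp, dose), mean))
cell_means
```

The ANOVA table has one row per term plus the residual row. Adding further factors only lengthens the formula, which is the sense in which Normal Theory ANOVA extends readily.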

  26. Replicates that are not Independent
      We often have replicates that are NOT independent:
      • multiple spots for the same gene on an array
      • multiple arrays from the same RNA
      • multiple RNAs from the same tissue
      • multiple samples from the same individual
      • multiple labs
      • multiple "batches"

  27. Replicates that are not Independent
      e.g. A dye-swap experiment in which the dye-swaps are technical replicates (1 dye-swap pair per sample) and there are 2 spots per gene on the array, with 2 or more treatments:
      $Y_{ijkt} = \mu + \tau_i + \delta_j + \alpha_k + \gamma_s + \beta_t + \text{error}_{ijkt}$
      • $\mu$ is the mean expression for the gene over the entire experiment.
      • $\tau_i$ is the deviation of the mean of the $i$th treatment from the overall mean, $\sum_i \tau_i = 0$.
      • $\delta_j$ is the deviation of the mean of the $j$th level of dye from the overall mean, $\delta_r + \delta_g = 0$.
      • $\alpha_k$ is the array effect, which induces a correlation between the 2 spots on the same array, $\alpha_k \sim N(0, \sigma_\alpha^2)$.
      • $\gamma_s$ is the spot effect, which induces a correlation between the 2 channels at the same spot, $\gamma_s \sim N(0, \sigma_\gamma^2)$.
      • $\beta_t$ is the biological sample effect, which induces a correlation between the 2 arrays in the dye-swap pair, $\beta_t \sim N(0, \sigma_\beta^2)$.

  28. Replicates that are not Independent The lack of independence can be modeled as a random effect. This is handled in a straightforward manner by ANOVA modeling but ... all the other methods get MUCH more complicated. Much of the available software does not handle this very well.

  29. Replicates that are not Independent In some cases, we can return to fixed effects models by averaging (but this loses power). e.g. technical replicates can be averaged and the averages can be used as if they were the primary data This is much better than discarding technical replicates, but not as good as modeling them.
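A minimal sketch of the averaging approach in R, assuming a hypothetical single-channel design with 2 treatments, 4 biological samples per treatment, and 3 technical replicates per sample (all values simulated):

```r
# Simulated data: 2 treatments x 4 biological samples x 3 technical replicates.
set.seed(5)
d <- expand.grid(tech = 1:3, sample = 1:4, treatment = c("A", "B"))
d$sample_id <- interaction(d$treatment, d$sample)   # 8 biological samples

# One biological deviation per sample, shared by its technical replicates.
bio <- rnorm(nlevels(d$sample_id), sd = 0.4)
d$y <- 8 + 0.8 * (d$treatment == "B") +
  bio[as.integer(d$sample_id)] + rnorm(nrow(d), sd = 0.2)

# Average the technical replicates: one value per biological sample.
avg <- aggregate(y ~ treatment + sample_id, data = d, FUN = mean)

# The averages are independent, so the ordinary two-sample test applies.
res <- t.test(y ~ treatment, data = avg)
res
```

The test is based on 8 averages rather than 24 raw values, so its degrees of freedom reflect the number of biological samples.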

  30. Replicates that are not Independent: Example
      (Array layout diagram: A1 A2 A3 B1 B2 B3 / B1 B2 B3 A1 A2 A3.)
      2 conditions on a 2-channel array with replicate spots for each gene, and a dye-swap technical replicate, e.g.
      • 2 genotypes of mouse
      • 3 mice per genotype
      • 1 mouse from each genotype on each array
      • 2 arrays from each pair of mice
      • 4 replicate spots per array
      We will simplify by modeling M, rather than each channel.

  31. Replicates that are not Independent: Example
      effects:
      • mouse pair
      • dye (or equivalently genotype)
      • array
      Mouse pair and array are random; dye is fixed.
      We need to keep track of whether M is R-G or A-B (genotype difference).
      We do not need to include spot, as we are using M.

  32. Replicates that are not Independent: Example
      Data for 1 mouse pair (m): 2 arrays, with 4 spots per array, written $M_{mdas}$, where
      • m is the mouse pair identifier (1,2,3)
      • d is the dye for genotype A (r,g)
      • a is the array (1-6 or 1,2 within m)
      • s is the spot (1-4 within array)

  33. Replicates that are not Independent: Example
      $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
      effects:
      • $m$: mouse pair (random)
      • $d$: dye (or equivalently the genotype in the red channel, R)
      • $a$: array (random)
      • $t = 1, 2, 3, 4$ for the spots
      The hypothesis of no genotype effect is $\mu = 0$. Notice that we have to be careful about the sign of M: if we code the effects in the way it is usually done for ANOVA, M = A-B, not R-G.

  34. Replicates that are not Independent: Example
      $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
      Our estimate of $\mu$ is just the sample mean of M over all the spots. But the SE of ave(M) is not the usual one (the sample SD divided by the square root of the number of spots), because of the other effects.

  35. Replicates that are not Independent: Example
      $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
      • 3 mouse pairs
      • 6 arrays
      • 24 observations/gene
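To see why these counts matter, the variance of ave(M) can be worked out under some added assumptions: the mouse-pair effects $m_i$, array effects $a_k$ and errors are mutually independent with variances $\sigma_m^2$, $\sigma_a^2$ and $\sigma^2$ (these symbol names are not from the slides), and the design is a balanced dye-swap, so the fixed dye effects average to zero.

```latex
\bar{M} = \mu + \frac{1}{3}\sum_{i=1}^{3} m_i + \frac{1}{6}\sum_{k=1}^{6} a_k
        + \frac{1}{24}\sum_{i,j,k,t} \text{error}_{ijkt}
\quad\Longrightarrow\quad
\mathrm{Var}(\bar{M}) = \frac{\sigma_m^2}{3} + \frac{\sigma_a^2}{6} + \frac{\sigma^2}{24}
```

Each variance component is divided by the number of independent units that carry it (3 pairs, 6 arrays, 24 spots), so only the spot-level error benefits from all 24 observations.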

  36. What if we ignore the Dependence?
      We would use the ordinary t-test, whose denominator is much too small. (The slide's comparison formulas are not preserved in this transcript.)
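A rough simulation of the slide 35 design (3 mouse pairs, 2 dye-swapped arrays per pair, 4 spots per array) shows how the two standard errors can be compared in R. All variance components, the genotype effect and the dye bias below are assumed values, and M is coded as A - B, so the dye bias enters with opposite signs on the two arrays of a pair.

```r
# Simulated M values for one gene (all parameter values are assumptions).
set.seed(6)
d <- expand.grid(spot = 1:4, swap = 1:2, pair = 1:3)   # 24 rows
d$array <- (d$pair - 1) * 2 + d$swap                    # arrays 1..6

mu <- 0.5; sd_pair <- 0.4; sd_array <- 0.2; sd_err <- 0.15; dye <- 0.1
pair_eff  <- rnorm(3, sd = sd_pair)     # random mouse-pair effects
array_eff <- rnorm(6, sd = sd_array)    # random array effects

d$M <- mu + pair_eff[d$pair] + ifelse(d$swap == 1, dye, -dye) +
  array_eff[d$array] + rnorm(24, sd = sd_err)

# Ignoring the dependence: treat the 24 spots as independent (23 df).
se_naive <- sd(d$M) / sqrt(24)

# Respecting it by averaging: one mean per mouse pair (2 df).
# Averaging over the dye-swap pair also cancels the dye bias.
pair_means <- tapply(d$M, d$pair, mean)
se_pairs   <- sd(pair_means) / sqrt(3)

# SE implied by the assumed components: each variance is divided by the
# number of independent units carrying it (3 pairs, 6 arrays, 24 spots).
se_true <- sqrt(sd_pair^2 / 3 + sd_array^2 / 6 + sd_err^2 / 24)

c(naive = se_naive, by_pair = se_pairs, implied = se_true)
```

Both approaches give the same estimate of $\mu$, because the design is balanced. They differ in the denominator and in the degrees of freedom (23 versus 2), which is where ignoring the dependence overstates the evidence. A single simulated data set is noisy, so repeating the simulation gives a fairer picture of how the two SEs compare on average.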
