190 likes | 350 Views
Differential expression and testing. Consider a case where we have observed two genes with estimated fold changes of 2 Is this worth reporting? Some journals require measures of statistical significance (“p-values”). Repeated experiment. Repeated experiment. Back to Basics.
E N D
Differential expression and testing Consider a case where we have observed two genes with estimated fold changes of 2 Is this worth reporting? Some journals require measures of statistical significance (“p-values”)
Back to Basics If we have two measurements X, Y then Y-X may have a different distribution under the null hypothesis for different genes More specifically the standard deviation of Y-X may be different. We could consider (Y-X) / instead But we do not know ! What is ? Why is it not 0? How about taking samples and using the t-statistic?
Back to Basics Observations: Averages: SD2 or variances:
Back to Basics t - statistic:
Back to Basics If the number of replicates N is very large the t-statistic is normally distributed with mean 0 and and SD of 1 If the observed data is normally distributed then the t-statistic follows a t-distribution regardless of sample size We can then compute probability that t-statistic is as extreme or more, when null hypothesis is true: the p-value Where does this probability come from? We will see that the t-statistic is not a good strategy when N is small…
Estimating the variance The t-test considers difference between group means to standard deviation of data within groups F-test is a generalization of this idea to more than 2 groups But with few replicates, estimates of SE are not stable This explains why t-test is not powerful There are many proposals for estimating variation Many share information across genes Empirical Bayesian Approaches are popular SAM, an ad-hoc procedure, is even more popular Many are what some call “moderated” t-tests
Notation: T is average log expression in Tx C is average log expression in Control S is SD Note taking log before average is important Tests: Average log fold-change: (T-C) t-statistic: (T-C) / S SAM shrunken t-statistic: (T-C) / (S + S0) Bayesian posteriors: (T-C) / √(S2+K2) Wilcoxon Rank test Ad-hoc pairwise comparison No formula Some Examples of Tests
Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate? Test for each gene null hypothesis: no differential expression. Notice that if you have look at 10,000 genes for which the null is true you expect to see 500 attain p-values of 0.05 This is called the multiple comparison problem. Statisticians fight about it. But not about the above. Main message: p-values can’t be interpreted in the usual way A popular solution is to report FDR instead. One final problem
Multiple Hypothesis Testing • What happens if we call all genes significant with p-values ≤ 0.05, for example? Null = Equivalent Expression; Alternative = Differential Expression
Error Rates • Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses PCER = E(V)/m • Per family error rate (PFER): the expected number of Type I errors PFER = E(V) • Family-wise error rate: the probability of at least one Type I error FEWR = Pr(V ≥ 1) • False discovery rate (FDR) rate that false discoveries occur FDR = E(V/R; R>0) = E(V/R | R>0)Pr(R>0) • Positive false discovery rate (pFDR): rate that discoveries are false pFDR = E(V/R | R>0).
p >> n Goal: find statistically significant associations of biological conditions or phenotypes with gene expression. Consider the two class problem. Data: n (10…100) points in a p-dimensional (5000…30000) space. Problem: There are infinitely many ways to separate the space into two regions by a hyperplane such that the two groups are perfectly separated. This is a simple geometrical fact and holds as long as n<p!
p >> n Problem: If I find such a perfectly separating hyperplane, it doesn’t mean anything. It is not surprising. It is not a significant finding. I would always find it, no matter how random the data are! Answer: regularization Rather than searching in the huge space of all hyperplanes in n-1 dimensional space, restrict ourselves to a much smaller space. Two major approaches: - only the hyperplanes perpendicular to one of the n coordinate axis gene-by-gene discrimination, gene-by-gene hypothesis testing. - any other reasonable, not too complex set of hypersurfaces machine learning
Gene by gene tests t-test Wilcoxon F-test / more complex linear models Cox-regression Problem: Treating each gene independently of each other wastes information – many properties may be shared among genes. E.g. their within-group variability.
The volcano plot shows, for a particular test, negative log p-value against the effect size (M) A useful plot
Complications • What is the distribution of SAM statistic? • How about t-statistic is it really t-distributed? • How can we get p-values when we don’t know the distribution?
p-values by permutations We focus on one gene only. For the bth iteration, b = 1, , B; Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”. For each gene, calculate the corresponding two sample t-statistic, tb. After all the B permutations are done; Put p = #{b: |tb| ≥ |tobserved|}/B (p lower if we use >).