
Non-specific filtering and control of false positives: an update

Non-specific filtering and control of false positives: an update. Richard Bourgon, 16 March 2009, bourgon@ebi.ac.uk. Experiment-wide type I error rates.


Presentation Transcript


  1. Non-specific filtering and control of false positives: an update. Richard Bourgon, 16 March 2009, bourgon@ebi.ac.uk

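  2. Experiment-wide type I error rates • Notation: m hypotheses are tested, of which m0 are true nulls; R is the number of rejections (discoveries) and V is the number of false positives among them. • Family-wise error rate (FWER): P(V > 0), i.e., the probability of one or more false positives. For large m0, this is very difficult to keep small. • False discovery rate (FDR): let Q = V/R, or 0 if R is 0. The FDR is E(Q), i.e., the expected fraction of false positives among all discoveries.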
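
The two error rates on slide 2 can be compared by simulation. The following is a minimal sketch, not part of the original talk; the values of `m`, `m0`, `alpha`, the effect size, and the number of simulated experiments are all illustrative assumptions.

```r
# Sketch: estimate FWER = P(V > 0) and FDR = E(Q) for a fixed per-test cutoff.
# All settings below are assumptions chosen for illustration.
set.seed(1)
m     <- 1000   # total number of hypotheses
m0    <- 800    # number of true nulls
alpha <- 0.001  # per-test cutoff applied to every p-value
n_sim <- 2000   # number of simulated experiments

one_experiment <- function() {
  # z-statistics: true nulls have mean 0, alternatives have mean 3 (assumed)
  z <- rnorm(m, mean = c(rep(0, m0), rep(3, m - m0)))
  p <- 1 - pnorm(z)                 # one-sided p-values
  reject <- p <= alpha
  V <- sum(reject[seq_len(m0)])     # false positives
  R <- sum(reject)                  # all discoveries
  c(V = V, Q = if (R > 0) V / R else 0)
}

res  <- replicate(n_sim, one_experiment())
fwer <- mean(res["V", ] > 0)  # estimate of P(V > 0)
fdr  <- mean(res["Q", ])      # estimate of E(Q)
```

Under these assumptions the expected number of false positives per experiment is m0 × alpha = 0.8, so at least one false positive occurs in a large share of experiments, while false positives remain a small fraction of all discoveries.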

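  3. A nice property of CDFs for continuous RVs
     > X = rnorm(100000)         # sample from the standard normal
     > F = pnorm                 # its CDF
     > hist(X, breaks = 50)      # bell-shaped
     > hist(F(X), breaks = 50)   # approximately flat: F(X) is uniform on [0,1]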

  4. A “nice” property? • To compute a p-value for testing a null hypothesis H0, we typically… • Define a test statistic T, and compute its value t for the observed data. • Assume we know the distribution of T when H0 is true: F0. • Compute p = 1 – F0(t), i.e., define p = P(T > t | H0 is true). • Compare p to some α. • Now define the random variable P = 1 – F0(T). If H0 is true, then… • F0(T) is uniformly distributed on [0,1]. • By symmetry, P is uniformly distributed on [0,1] as well. • Suppose 20% of genes are differentially expressed, so that

  5. Observed p-values: a mixture

  6. Observed p-values: a mixture
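
The mixture of null and non-null p-values described on slides 4 to 6 can be sketched by simulation. This is an illustration only: the one-sided z-test and the shift of 2 for differentially expressed genes are assumptions, not values from the talk; the 20% fraction is taken from slide 4.

```r
# Sketch: observed p-values as a mixture of uniform (null) and
# left-skewed (differentially expressed) components.
set.seed(2)
m    <- 10000
n_de <- 0.2 * m                      # 20% differentially expressed

z_null <- rnorm(m - n_de, mean = 0)  # true nulls
z_alt  <- rnorm(n_de, mean = 2)      # assumed effect size for DE genes

p_null <- 1 - pnorm(z_null)          # uniform on [0,1]
p_alt  <- 1 - pnorm(z_alt)           # concentrated near 0
p      <- c(p_null, p_alt)

hist(p, breaks = 50)                 # flat background plus a peak near 0
```

The histogram of `p` is roughly flat at a height set by the null genes, with an excess of small p-values contributed by the differentially expressed genes.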

  7. Non-specific filtering • For a given gene, write the data as ((c1,Y1),…,(cp,Yp)). • First group (c = 1): i = 1, …, p1. • Second group (c = 2): i = p1 + 1, …, p1 + p2. • Conditions under which we expect little variation in Y: • Genes which are absent in both samples. (Probes will still report noise and cross-hybridization, typically at the same level in both groups.) • Probe sets which do not respond to target. • Genes which are not differentially expressed. • A “non-specific” filter: • Ignores c1, …, cp, i.e., f(Y). • Helps identify any of these three classes, based on our a priori understanding of array behavior. • Apply standard testing to genes passing the filter, using some g(c,Y).

  8. Increased detection rate • Stage 1: compute the non-specific filter statistic for each gene and remove the θ smallest. • Stage 2: standard two-sample t-test for genes passing stage 1.
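
The two-stage procedure from slides 7 and 8 might look as follows in base R. This is a sketch under stated assumptions: the data are simulated, overall variance is used as the filter statistic f(Y), θ is interpreted as the fraction of genes removed, and BH at level 0.1 is used to count detections. None of these specific settings come from the slides.

```r
# Sketch: stage 1 non-specific variance filter, stage 2 two-sample t-test.
set.seed(3)
m   <- 2000; p1 <- 5; p2 <- 5
cls <- rep(c(1, 2), c(p1, p2))                  # class labels c_1, ..., c_p
Y   <- matrix(rnorm(m * (p1 + p2)), nrow = m)   # one row per gene
de  <- seq_len(200)                             # assumed DE genes
Y[de, cls == 2] <- Y[de, cls == 2] + 2          # assumed shift in group 2

# Stage 1: f(Y) ignores the class labels entirely
theta <- 0.5                                    # assumed fraction to remove
s2    <- apply(Y, 1, var)                       # overall variance per gene
keep  <- s2 > quantile(s2, theta)

# Stage 2: g(c, Y), the standard equal-variance two-sample t-test
t_pval <- function(y) {
  t.test(y[cls == 1], y[cls == 2], var.equal = TRUE)$p.value
}
p_all  <- apply(Y, 1, t_pval)
p_filt <- p_all[keep]

# Detections with and without filtering (BH adjustment at 0.1)
n_unfiltered <- sum(p.adjust(p_all,  method = "BH") <= 0.1)
n_filtered   <- sum(p.adjust(p_filt, method = "BH") <= 0.1)
```

Comparing `n_filtered` with `n_unfiltered` gives the change in detection rate for this simulated data set. Whether that change reflects a real gain in power is the question taken up on the following slides.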

  9. Increased power? • An increased detection rate implies increased power only if we are still controlling type I errors at the nominal level.

  10. Result: independence of stage 1 and stage 2 test statistics • For genes for which the null hypothesis is true, f(Y) and g(c, Y) are statistically independent in both of the following cases: • For normally distributed data: • Stage 1: overall variance, S². • Stage 2: the standard two-sample t-statistic. • Non-parametrically: • Stage 1: any function of the data which doesn’t depend on the order of the arguments. S² above, or the IQR, are both candidates. • Stage 2: the Wilcoxon rank sum test statistic. • Both can be extended to the multi-class context: ANOVA and Kruskal-Wallis. • Bonferroni and Holm go through easily, in expectation.
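
The independence claim for the normal case can be checked empirically. The sketch below simulates only true-null genes and is not a proof; the sample sizes, the number of genes, and the 50% filter cutoff are assumptions.

```r
# Sketch: for true nulls with normal data, overall variance (stage 1) and the
# two-sample t-statistic (stage 2) should show no dependence.
set.seed(4)
m   <- 5000; p1 <- 5; p2 <- 5
cls <- rep(c(1, 2), c(p1, p2))
Y   <- matrix(rnorm(m * (p1 + p2)), nrow = m)   # every gene is a true null

s2 <- apply(Y, 1, var)                          # stage 1 statistic f(Y)
tt <- apply(Y, 1, function(y) {
  t.test(y[cls == 1], y[cls == 2], var.equal = TRUE)$statistic
})                                              # stage 2 statistic g(c, Y)
p  <- 2 * pt(-abs(tt), df = p1 + p2 - 2)        # two-sided p-values

# Rank correlation between the filter statistic and |t|: should be near 0
rho <- cor(s2, abs(tt), method = "spearman")

# p-values of nulls that pass the filter should still be uniform
keep       <- s2 > quantile(s2, 0.5)
frac_small <- mean(p[keep] <= 0.05)             # should be near 0.05
```

If the filter and test statistics were dependent, `frac_small` would drift away from 0.05 and type I error control after filtering would no longer hold at the nominal level.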

  11. Independence: Benjamini & Hochberg and Storey FDR adjustments • What is the FDR associated with use of cutoff α? Naive estimator: E(V)/E(R). • V is not observable, but E(V) is m0α, bounded by mα. • E(R) cannot be computed, but R can be used as an estimator. • Evaluating at each p(i), using m or an estimate of m0, gives the BH95 or Storey adjustments, respectively.
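
The step from the naive estimator to the BH95 adjustment can be made concrete. In the sketch below, the simulated p-value mixture is an assumption, and the Storey-style variant uses a fixed tuning value λ = 0.5 for estimating m0, which is a simplification chosen here rather than anything stated on the slide.

```r
# Sketch: evaluate the naive estimate m * alpha / R(alpha) at alpha = p_(i),
# where R = i genes are rejected, then enforce monotonicity.
set.seed(5)
m <- 1000
p <- c(runif(800), rbeta(200, 0.2, 5))    # assumed mix of nulls and non-nulls

ord   <- order(p)
p_ord <- p[ord]
naive <- m * p_ord / seq_len(m)           # m * p_(i) / i

# Running minimum from the largest p-value downward, capped at 1
bh_sorted <- pmin(1, rev(cummin(rev(naive))))
bh        <- numeric(m)
bh[ord]   <- bh_sorted                    # back to the original gene order

# Agrees with R's built-in BH adjustment
same_as_builtin <- isTRUE(all.equal(bh, p.adjust(p, method = "BH")))

# Storey-style variant: replace m by an estimate of m0 (lambda = 0.5 assumed)
lambda   <- 0.5
m0_hat   <- sum(p > lambda) / (1 - lambda)
q_sorted <- pmin(1, rev(cummin(rev(m0_hat * p_ord / seq_len(m)))))
storey      <- numeric(m)
storey[ord] <- q_sorted
```

Because `m0_hat` is at most `m` here, the Storey-style values are never larger than the BH values, which is the sense in which BH is the more conservative of the two.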

  12. Independence: Benjamini & Hochberg and Storey FDR adjustments The foregoing motivation for the BH95 and Storey procedures uses E(V(α)) = m0α. For true nulls, marginal independence of f(Y) and g(c,Y) means that this still applies at stage 2, in expectation. Define M0 to be the random number of true nulls passing stage 1. Then

  13. For true nulls, we show independence between P and f(Y) over repeated data realizations. The P within a single realization may be correlated. • FDR control is on average only: no guarantees for a single realization. (Figure labels: repeated data realizations; genes: stage I and stage II statistics.)

  14. Correlation and a single data instance • Given pervasive correlation (here, all pairs at +ρ), the empirical distribution of p-values for a single data instance can vary widely. (Figure: most extreme distributions in 1000 trials.)
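
The effect of pervasive correlation on a single data instance can be sketched with equicorrelated null z-scores. The values of `m` and `rho` and the shared-factor construction are assumptions made for illustration; only the 1000 trials and the "all pairs at +ρ" structure follow the slide.

```r
# Sketch: all genes are true nulls and every pair has correlation rho.
# Marginally each p-value is uniform, but the fraction of small p-values
# in one data instance varies far more than under independence.
set.seed(6)
m        <- 1000
rho      <- 0.5    # assumed common pairwise correlation
n_trials <- 1000

frac_below <- replicate(n_trials, {
  shared <- rnorm(1)                                   # factor shared by all genes
  z <- sqrt(rho) * shared + sqrt(1 - rho) * rnorm(m)   # each z is N(0,1), cor = rho
  p <- 2 * pnorm(-abs(z))                              # two-sided p-values
  mean(p <= 0.05)                                      # fraction called at 0.05
})

avg_frac <- mean(frac_below)   # near 0.05 on average
sd_frac  <- sd(frac_below)     # much larger than the independent-case value
sd_indep <- sqrt(0.05 * 0.95 / m)
```

Averaged over trials the nominal rate holds, but individual trials can show far more or far fewer small p-values than expected, which is why guarantees that hold "on average" say little about any one realization.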

  15. FWER: Westfall and Young • Westfall and Young (1993) controls FWER with more power, but depends on the joint distribution of all p-values. • WY93 is valid under subset pivotality. If this holds for the one-stage procedure, it holds for the two-stage non-specific filtering approach as well. • The distribution of min Pj under the complete null is typically estimated by permutation. If filtering changes the correlation structure, the new structure is used by permutation!
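
Estimating the distribution of min Pj by permutation can be sketched as below. Note that this is a simplified single-step minP adjustment, not the full step-down Westfall and Young procedure; the simulated data, the helper `row_pvals`, and the number of permutations `B` are all assumptions introduced for illustration.

```r
# Sketch: single-step minP adjustment with class labels permuted.
set.seed(7)
m   <- 200; p1 <- 6; p2 <- 6; B <- 200
cls <- rep(c(1, 2), c(p1, p2))
Y   <- matrix(rnorm(m * (p1 + p2)), nrow = m)
Y[1:10, cls == 2] <- Y[1:10, cls == 2] + 3      # assumed DE genes

# Hypothetical helper: row-wise equal-variance two-sample t-test p-values
row_pvals <- function(Y, cls) {
  n1 <- sum(cls == 1); n2 <- sum(cls == 2)
  m1 <- rowMeans(Y[, cls == 1]); m2 <- rowMeans(Y[, cls == 2])
  ss <- rowSums((Y[, cls == 1] - m1)^2) + rowSums((Y[, cls == 2] - m2)^2)
  se <- sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
  2 * pt(-abs((m1 - m2) / se), df = n1 + n2 - 2)
}

p_obs <- row_pvals(Y, cls)

# Permuting labels keeps each sample's gene vector intact, so the
# gene-gene correlation structure is carried into the null distribution.
min_p <- replicate(B, min(row_pvals(Y, sample(cls))))

# Adjusted p-value: how often the permutation minimum is at least as small
p_adj <- sapply(p_obs, function(p) mean(min_p <= p))
```

In a two-stage analysis the same code would be applied to the filtered rows only, for example `Y[keep, ]` for a hypothetical stage 1 indicator `keep`, so that the permutation distribution reflects the correlation structure of the genes that passed the filter.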

  16. Correlation and FDR control Storey et al. q-values. Correlation: all pairs at +ρ. Some anti-conservative bias in FDR estimation. oFDR substantially greater than nominal for a small fraction of data instances. BH more conservative, since the estimated proportion of true nulls is fixed at 1.

  17. Conclusions • In actual examples, non-specific filtering leads to (biologically) significant increases in the number of genes identified. • Commonly used stage 1/stage 2 test statistic pairs are statistically independent for genes which are not differentially expressed. • Given this independence, Bonferroni and Holm FWER control is valid in the two-stage procedure. • Correlation structure may change under filtering. • Permutation-based Westfall and Young correction accounts for this. FDR control, however, may suffer. • The effect of filtering on correlation can be checked and its impact assessed.
