Significance Testing of High-Throughput Data CSHL Data Analysis 2012 Mark Reimers
Goals of Testing High-Throughput Data • To identify those genes most likely changed • To prioritize candidates for focused follow-up studies • To characterize functional changes reflected in changes in gene regulation • In practice we don’t need exact p-values… …but we do need critical thinking!
Outline • Family wide error rates • False discovery rates • Benjamini-Hochberg • Storey positive FDR • Correlated errors • Permutations and empirical p-values • Empirical Bayes approaches • Power to detect differences
Characterizing False Positives • Family-Wide Error Rate (FWE) or ‘corrected p-values’ • probability of at least one false positive arising from the selection procedure • Strong control of FWE: • Bound on FWE independent of number changed • False Discovery Rate: • Proportion of false positives arising from selection procedure • This is unknown; we can only estimate this!
Catalog of Type I Error Rates • Notation: m tests; R = number of rejections; V = number of false positives among them • Per-family Error Rate PFER = E(V) • Per-comparison Error Rate PCER = E(V)/m • Family-wise Error Rate FWER = P(V ≥ 1) • False Discovery Rate: i) FDR = E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0 (Benjamini-Hochberg) ii) pFDR = E(V/R | R > 0) (Storey)
Simple Multiple Testing Example • Suppose 10,000 genes on a chip • Suppose no genes really changed • all samples drawn from the same population • Each test statistic still has a 5% chance of exceeding the threshold at a p-value of .05 • Type I error • So the test statistics for about 500 genes should exceed the .05 threshold ‘by chance’
What is the Distribution of Null P-Values? • The ‘p-value’ is the probability, if there is no real difference, of getting a test statistic at least as extreme as the one observed • If one Null test has a p-value of 0.3, then 30% of all Null tests should have bigger test stats, hence smaller p-values • Therefore 30% of p-values are under 0.3
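The two slides above can be checked with a minimal simulation (an illustration, not from the slides): 10,000 "genes" with no real group difference, each tested with a two-sided z-test. Under the Null the p-values are uniform, so about 5% fall below 0.05 and about 30% fall below 0.3.

```python
# Simulate 10,000 null genes: two groups drawn from the SAME population.
import math
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_per_group = 10_000, 10

a = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
b = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))

# z-statistic for the difference in means (variance known to be 1).
z = (a.mean(axis=1) - b.mean(axis=1)) / math.sqrt(2.0 / n_per_group)

# Two-sided p-value from the standard normal: p = 2 * (1 - Phi(|z|)) = erfc(|z|/sqrt(2)).
pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

print((pvals < 0.05).sum())   # roughly 500 exceed the .05 threshold 'by chance'
print((pvals < 0.30).mean())  # roughly 0.30: null p-values are uniform
```

The second print illustrates the argument on the slide directly: the fraction of Null p-values under 0.3 is itself close to 0.3.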
Distributions of p-values • [Figure] Histograms of p-values for real microarray data and for random data, with the expected histogram height under the Null marked
Distribution of Numbers of p-values • Each bin of width w contains a random number of p-values • Each Null p-value has probability w of lying in the bin, so the expected number is Nw • The count approximately follows the Poisson law • SD ≈ √mean
When Might it not be Uniform? • When the actual distribution of the test statistic departs from the reference distribution • Outliers in the data may give rise to more extreme statistics • More small p-values than expected • Approximate tests are often conservative • P-values are larger than the true occurrence probability • Distribution shifted to the right
General Issues for Multiple Comparisons • FWER vs FDR • Are you willing to tolerate some false positives? • FDR: E(FDR) or P(FDR < Q)? • The actual (random) FDR has a long-tailed distribution • But E(FDR) methods are simpler and cleaner • Correlations • Many procedures surprise you when tests are correlated • Always check the assumptions of the procedure! • Models for the Null distribution: a matter of art • Strong vs weak control • Will the procedure work for any combination of true and false null hypotheses?
FWER - Setting a Higher Threshold • Suppose we want to test N independent genes at overall level a • What level a* should each gene be tested at? • Want to ensure P(any false positive) < a • i.e. 1 – a = P(all true negatives) = P(a given null accepted)^N = (1 – a*)^N • Solve for a* = 1 – (1 – a)^(1/N)
Expectation Argument • P(any false positive) ≤ E(# false positives) = N × P(a given test is a false positive) = N a* • So we set a* = a / N • NB: no assumptions about the joint distribution of the tests
‘Corrected’ p-Values for FWE • Sidak (exact correction for independent tests): pi* = 1 – (1 – pi)^N if all pi are independent • Expanding, pi* ≈ 1 – (1 – N pi + …) = N pi, which gives Bonferroni • Bonferroni correction: pi* = N pi if N pi < 1, otherwise 1 • Expectation argument • Still conservative if genes are co-regulated (correlated) • Both are too conservative for array use!
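The two corrections can be sketched in a few lines (an illustration, not code from the slides). Note how close the two per-gene thresholds are for large N, and that Bonferroni-corrected p-values are always at least as large as Sidak's.

```python
# Sidak and Bonferroni FWER corrections for N tests.
import numpy as np

def bonferroni(p, n):
    """Bonferroni-corrected p-value: p* = min(N * p, 1)."""
    return np.minimum(n * np.asarray(p, dtype=float), 1.0)

def sidak(p, n):
    """Sidak-corrected p-value: p* = 1 - (1 - p)^N (exact for independent tests)."""
    return 1.0 - (1.0 - np.asarray(p, dtype=float)) ** n

N, alpha = 10_000, 0.05
# Per-gene thresholds that keep the FWER at alpha:
print(alpha / N)                   # Bonferroni: 5e-6
print(1 - (1 - alpha) ** (1 / N))  # Sidak: slightly larger, ~5.13e-6
```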
Traditional Multiple Comparisons Methods • Key idea: sequential testing • Order p-values: p(1), p(2), … • If p(1) significant then test p(2) , etc … • Mostly improvements on this simple idea • Complicated proofs
Holm’s FWER Procedure • Order p-values: p(1), …, p(N) • If p(1) < a/N, reject H(1), then… • If p(2) < a/(N-1), reject H(2), then… • Let k be the largest number such that p(n) < a/(N-n+1) for all n ≤ k • Reject H(1) … H(k) • Then P(at least one false positive) < a • The proof doesn’t depend on the joint distribution of the tests
Hochberg’s FWER Procedure • Find the largest k: p(k) < a / (N – k + 1) • Then select genes (1) to (k) • More powerful than Holm’s procedure • But … requires assumptions: independence or ‘positive dependence’ • When there is one type I error, there could be many false positives
Holm & Hochberg Adjusted P • Order p-values: p(1), p(2), …, p(M) • Holm (1979) step-down adjusted p-values: p(j)* = max{k = 1 to j} min((M-k+1) p(k), 1); adjust out-of-order p-values in relation to those lower (‘step-down’) • Hochberg (1988) step-up adjusted p-values: p(j)* = min{k = j to M} min((M-k+1) p(k), 1); adjust out-of-order p-values in relation to those higher (‘step-up’)
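The two adjustment formulas above translate directly into running max/min operations over the sorted p-values. A minimal numpy implementation (an assumption about details, not the slides' code):

```python
# Holm (step-down) and Hochberg (step-up) adjusted p-values.
import numpy as np

def holm_adjust(p):
    """Step-down: p(j)* = max over k <= j of min((M-k+1) p(k), 1) on sorted p-values."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = np.minimum((m - np.arange(m)) * p[order], 1.0)
    adj = np.maximum.accumulate(scaled)  # running max enforces step-down monotonicity
    out = np.empty(m)
    out[order] = adj
    return out

def hochberg_adjust(p):
    """Step-up: p(j)* = min over k >= j of min((M-k+1) p(k), 1) on sorted p-values."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = np.minimum((m - np.arange(m)) * p[order], 1.0)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # running min from the right
    out = np.empty(m)
    out[order] = adj
    return out

p = [0.001, 0.01, 0.04, 0.2]
print(holm_adjust(p))
print(hochberg_adjust(p))
```

Hochberg's adjusted p-values are never larger than Holm's, reflecting the extra power (bought with the dependence assumption) noted above.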
False Discovery Rate • In genomic problems a few false positives are often acceptable • Want to trade off power vs. false positives • Could control: • Expected number of false positives • Expected proportion of false positives • What to do with E(V/R) when R is 0? • Actual proportion of false positives
Truth vs. Decision • [Table] Of m tests, R are declared significant • Among the true Nulls, V are (falsely) declared significant • V/R is the (unobserved) proportion of false discoveries
Estimating False Discovery Rate • At a given threshold we expect 20 genes by chance, but see 75 selected • Estimated FDR ≈ 20/75 ≈ 27%
Benjamini-Hochberg Procedure • Can’t know what the FDR is for a particular sample • B-H suggest a procedure controlling the average FDR • Order the p-values: p(1), p(2), …, p(N) • Find the largest k such that p(k) < k a / N • Then select genes (1) to (k) • NB: an acceptable FDR may be much larger than an acceptable p-value (e.g. 0.10) • NB: the theorem guarantees the FDR for the procedure as a whole: set the target, then select the threshold and genes • Most people apply it adaptively: fiddle with the level until they get a gene list they like • The B-H theorem does not validate this use
Benjamini-Hochberg Example • FDR target 0.1; N = 1,000
k   P-value   Threshold (k a/N)   p(k) < threshold?
1   2e-4      1e-4                No
2   2.4e-4    2e-4                No
3   2.5e-4    3e-4                Yes
4   3.2e-4    4e-4                Yes
5   6e-4      5e-4                No
• The largest passing k is 4, so genes (1) to (4) are selected, even though p(1) and p(2) fail their own per-rank thresholds
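The step-up selection in the example above can be sketched as follows (an illustration; the function name is my own):

```python
# Benjamini-Hochberg step-up selection at FDR level q.
import numpy as np

def bh_select(pvals, q, n_tests):
    """Return the p-values of the genes selected at FDR level q."""
    p = np.sort(np.asarray(pvals, dtype=float))
    thresholds = np.arange(1, len(p) + 1) * q / n_tests
    passing = np.nonzero(p <= thresholds)[0]
    if len(passing) == 0:
        return p[:0]               # nothing selected
    k = passing.max() + 1          # largest k with p(k) <= k*q/N
    return p[:k]                   # select genes (1)..(k)

# The five smallest p-values from the example; N = 1,000 tests in all.
p = [2e-4, 2.4e-4, 2.5e-4, 3.2e-4, 6e-4]
selected = bh_select(p, q=0.1, n_tests=1000)
print(len(selected))   # 4
```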
Argument for B-H Method • If no true changes (all H0’s hold): Q = 1 exactly when the rejection condition of Simes’ lemma holds, which has probability < a; otherwise Q = 0; so E(Q) < a • If all are true changes (no H0 holds): Q = 0 < a • Build the argument by induction from both ends, starting from N = 2
Simes’ Lemma • Suppose we order the p-values from N independent tests on random data: p(1), p(2), …, p(N) • Pick a target threshold a • Then P(p(1) < a/N or p(2) < 2a/N or p(3) < 3a/N or …) = a • Worked case N = 2: P = P(min(p1, p2) < a/2) + P(min(p1, p2) > a/2 & max(p1, p2) < a) = (a/2 + a/2 – a²/4) + a²/4 = a
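The lemma is easy to verify by Monte Carlo (a check I am adding for illustration): for N independent Uniform(0,1) p-values, the probability that p(k) < k·a/N for some k should come out very close to a.

```python
# Monte Carlo check of Simes' lemma for N independent uniform p-values.
import numpy as np

rng = np.random.default_rng(1)
N, alpha, n_sim = 10, 0.05, 200_000

p = np.sort(rng.uniform(size=(n_sim, N)), axis=1)   # sorted p-values, row-wise
thresholds = np.arange(1, N + 1) * alpha / N        # k * a / N for k = 1..N
reject = (p < thresholds).any(axis=1)               # Simes rejection event

print(reject.mean())   # close to alpha = 0.05
```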
Simes’ Test for Some Non-Nulls • Pick a target threshold a • Order the p-values: p(1), p(2), …, p(N) • If for any k, p(k) < k a / N, reject the complete Null • The test is valid against the complete Null hypothesis if tests are independent or ‘positively dependent’ • Doesn’t give strong control (i.e. when some alternatives are true) • Somewhat non-conservative if tests are negatively correlated
Practical Issues • The actual proportion of false positives varies from data set to data set • The mean FDR could be low overall, yet the actual FDR could be high in your data set
Distributions of numbers of p-values below threshold • [Figure] 10,000 genes; 10,000 random drawings • Left: uncorrelated tests; Right: highly correlated tests
Controlling the Number of FP’s in One Study • The B-H procedure only guarantees the long-term average value of E(V/R | R>0) P(R>0) • It can be quite badly wrong in individual studies • Korn’s method gives a confidence bound on the FDR for individual studies • It also addresses the issue of correlations • Builds on the Westfall-Young approach to control the tail probability of the proportion of false positives (TPPFP)
Korn’s Procedure • To guarantee no more than k false positives: • Construct the null distribution as in Westfall-Young • Order p-values: p(1), …, p(M) • Reject H(1), …, H(k) without further testing • For each subsequent p-value, compare it to the full null distribution • Continue until one H is not rejected • N.B. This gives strong control
Issues with Korn’s Procedure • Valid if you select k first and then follow through the procedure; not if you try a number of different k and pick the one giving the most genes, which is how people actually proceed • Only an approximate FDR • Computationally intensive • Available in BRB
Storey’s pFDR • Storey argues that E(Q | V > 0) is what most people think FDR means • Sometimes quite different from the B-H FDR • Especially if the number of rejected nulls needs to be quite small in order to get an acceptable FDR • e.g. if P(R = 0) = 1/2, then pFDR = 2 × FDR
A Bayesian Interpretation • Suppose nature generates true nulls with probability p0 and true alternatives with probability p1 • Then define the FDR as the probability of a false positive among the rejected tests: pFDR = P(H0 true | test statistic in rejection region) • Issue: we don’t know p0 • Storey suggests estimating p0 by examining the right end of the p-value distribution • [Figure] p-value histogram: expected density if all H0 true vs. observed density in the right half
Storey’s Procedure • Estimate the number of true Nulls as 2 × (# p > ½), i.e. p0 ≈ 2 (# p > ½)/M • Try several p-value thresholds p1 • ‘fishing’ is OK with this procedure (unlike B-H) • Estimate the probability that a true-Null p-value lands in the rejection region • Form the ‘naive’ ratio: p0 p1 M / (# p < p1) • ‘Adjust’ for small numbers • Bootstrap the ratio to obtain a confidence interval for pFDR
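The first two estimation steps can be sketched on a toy mixture of Null and changed genes (the data and function names are hypothetical; the "adjust" and bootstrap steps are omitted):

```python
# Storey's pi0 estimate and the 'naive' pFDR ratio at threshold t.
import numpy as np

def storey_pi0(pvals):
    """Estimate the proportion of true Nulls: 2 * #{p > 1/2} / M."""
    p = np.asarray(pvals, dtype=float)
    return min(1.0, 2.0 * (p > 0.5).mean())

def naive_pfdr(pvals, t):
    """Naive pFDR at threshold t: pi0 * t * M / #{p <= t}."""
    p = np.asarray(pvals, dtype=float)
    n_selected = max(1, int((p <= t).sum()))
    return min(1.0, storey_pi0(p) * t * len(p) / n_selected)

# Toy mixture: 9,000 Nulls (uniform p-values) + 1,000 real changes (small p-values).
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=9000),
                    rng.beta(0.5, 20.0, size=1000)])
print(storey_pi0(p))        # close to the true proportion of Nulls, 0.9
print(naive_pfdr(p, 0.01))
```

The pi0 estimate works because true-Null p-values are uniform, so the right half of the histogram is (almost) purely Null.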
Q-Values • P-value is minimum test level at which a gene gets selected (declared ‘significant’) • Q-value is minimum FDR at which a gene is included in the selected set • In Storey’s procedure this is a Bayesian posterior probability • Term is commonly applied to B-H procedure
Confidence in pFDR • Storey estimates confidence intervals for his procedure by bootstrapping p-values • This is relatively easy to do • However, the correct bootstrap procedure is to resample the samples and then re-compute the p- and q-values • The confidence intervals obtained this way are very different from those obtained by resampling p-values if the tests are moderately correlated, as is often the case
Correlated Tests and FWER • Typically tests are correlated • Extreme case: all tests highly correlated • One test is proxy for all • ‘Corrected’ p-values are the same as ‘uncorrected’ • Intermediate case: some correlation • Usually probability of obtaining a p-value by chance is in between Sidak and uncorrected values
Symptoms of Correlated Tests • [Figure] p-value histograms
Permutation Tests • We don’t know the true distribution of gene expression measures within groups • We simulate the Null distribution by pooling the two groups and randomly splitting the pooled samples into two groups of the same sizes as those we are testing • Need at least 5 samples in each group to do this!
How To Do Permutation Tests • Suppose samples 1, 2, …, 10 are in group 1 and samples 11–20 are from group 2 • Permute 1, 2, …, 20: say 13, 4, 7, 20, 9, 11, 17, 3, 8, 19, 2, 5, 16, 14, 6, 18, 12, 15, 10, 1 • Construct mean differences (or t-scores) for each gene with the first ten as one pseudo-group and the last ten as the other • Repeat many times to obtain the Null distribution of random mean differences (or t-scores) • This will approximate a z- or t-distribution if the original distribution is roughly Normal (has no outliers)
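The recipe above, for a single gene, can be sketched as follows (a minimal illustration; the function name is my own):

```python
# Two-group permutation test for the difference in means of one gene.
import numpy as np

def permutation_pvalue(group1, group2, n_perm=10_000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)
    observed = abs(np.mean(group1) - np.mean(group2))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)        # shuffle the group labels
        diff = abs(perm[:n1].mean() - perm[n1:].mean())
        if diff >= observed:
            count += 1
    # add-one correction so the p-value is never exactly 0
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
same = permutation_pvalue(rng.normal(0, 1, 10), rng.normal(0, 1, 10), rng=rng)
shifted = permutation_pvalue(rng.normal(0, 1, 10), rng.normal(2, 1, 10), rng=rng)
print(same, shifted)   # the shifted group should give a much smaller p-value
```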
Critiques of Permutations • Variances of the permuted values are inflated when the groups really are separated • Permuted t-scores for many genes may then be lower than those from random samples all drawn from the same population • Therefore p-values are somewhat too conservative for some genes
Multivariate Permutation Tests • Want a null distribution with the same correlation structure as the given data but no real differences between groups • Permute the group labels among samples • Redo the tests with the pseudo-groups • Repeat many times (e.g. 10,000)
Westfall-Young Approach • A procedure analogous to Holm’s, except that at each stage the smallest remaining p-value is compared to the distribution of the smallest p-value under an empirical null for the hypotheses still being tested • How often is the smallest p-value less than a given threshold if the tests are correlated to the same extent and all Nulls are true? • Construct permuted samples: n = 1, …, N • Determine p-values pj[n] for each permuted sample n
Westfall-Young Approach – 2 • Construct permuted samples: n = 1, …, N • Determine p-values pj[n] for each permuted sample n • To correct the i-th smallest p-value, drop the hypotheses already rejected (at a smaller level) before taking the minimum over the null p-values • Enforce monotonicity: the i-th corrected p-value cannot be smaller than any of the previous corrected p-values
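The core of the idea, in its simplest (single-step, minP) form, can be sketched as follows. This is an illustration under my own assumptions, not the slides' step-down refinement: the corrected p-value for a gene is the fraction of permuted data sets whose smallest p-value beats that gene's observed p-value, which automatically respects the correlation structure because whole samples are permuted.

```python
# Single-step Westfall-Young minP correction, given a matrix of null p-values.
import numpy as np

def westfall_young_minp(null_pvals, observed_pvals):
    """null_pvals: (n_perm, n_genes) p-values from permuted data sets."""
    min_per_perm = null_pvals.min(axis=1)   # smallest p-value in each permutation
    obs = np.asarray(observed_pvals, dtype=float)
    # corrected p-value = P(min null p-value <= observed p-value)
    return (min_per_perm[:, None] <= obs[None, :]).mean(axis=0)

# Toy null distribution: 1,000 "permutations" of 50 independent uniform p-values.
rng = np.random.default_rng(4)
null_p = rng.uniform(size=(1000, 50))
corrected = westfall_young_minp(null_p, [0.0005, 0.01, 0.5])
print(corrected)   # small observed p-values survive correction; p = 0.5 does not
```

With real correlated data, `null_p` would come from the label permutations described above rather than from independent uniforms, and the corrected p-values would be less severe than Sidak's.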