Dealing With Statistical Uncertainty

Dealing With Statistical Uncertainty Richard Mott Wellcome Trust Centre for Human Genetics

Synopsis • Hypothesis testing • P-values • Confidence intervals • T-test • Non-parametric tests • Sample size calculations • Permutations • Bootstrap

Hypothesis testing Null Hypothesis H0 Alternative Hypothesis H1 Examples of H1: Mean of population > 3.0 There is an association between disease and genotype Examples of H0: • Mean of a population is 3.0 • In a genetic association study, there is no association between disease state and the genotypes of a particular SNP

P-values • The P-value of a statistic Z is the probability of sampling a value more extreme than that observed value z if the null hypothesis is true. • For a one-sided test, the p-value would be a = Prob( Z > z | H0) • For a two-sided test it is a = Prob( |Z| > z | H0) • Note this is not the probability density at x, but the tails of the cumulative distribution function Upper and lower 2.5% tails of the N(0,1) distribution, corresponding to |z|=1.96

Power • The power of a statistic Z is the probability of sampling a value more extreme than that observed value z if the alternative hypothesis is true. • For a one-sided test, the power would be b = Prob( Z > z | H1) • For a two-sided test it is b = Prob( |Z| > z | H1) • Note this is not the probability density at x, but the tails of the cumulative distribution function H0: N(1,0) H1: N(1.5,1) H2: N(4,1) 0.05 = Prob( Z> 1.644|H0) (5% upper tail) 0.44 = Prob( Z> 1.644|H1) 0.99 = Prob( Z> 1.644|H2)

The Likelihood • Likelihood = “probability of the data given the model” • Basis of parametric statistical inference • Different hypotheses can often be expressed in terms of different values of the parameters q • Hypothesis testing equivalent to comparing the likelihoods for different q

The Likelihood Ratio Test • General Framework for constructing hypothesis tests: S = Likelihood( data | H1) / Likelihood( data | H0) Reject H0 if S > s(H0, H1) Threshold s is chosen such that there is a probability a of making a false rejection under H0. • is the size of the test (or false positive rate or Type I error) e.g. a=0.05 • is the power of the test, the probability of correctly rejecting H0 when H1 is true e.g. b=0.8 1- b is the type II error, the false negative rate Generally, for fixed sample size n, if we fix a then we can’t fix b If we fix a and b then we must let n vary. The Neyman Pearson Lemma states that the likelihood ratio test is the most powerful of all tests of a given size Type III error: "correctly rejecting the null hypothesis for the wrong reason".

Example: The mean of a Normal Distribution H0: mean = m0vs H1: mean = m1 data: y1, … yniid N(m,s2) Assume variance is the same and is known Therefore we base all inferences on the sample mean compared to the difference in the hypothesised means

The Normal and T distributions • For a sample from a Normal distribution with known mean and variance, the sample mean can be standardised to follow the standard Normal distribution • And the 95% confidence interval for the mean is given by • But we often wish to make inferences about samples from a Normal Distribution when the variance is unknown: The variance must be estimated from the data as the sample variance • Because of this uncertainty the distribution of the sample mean is broader, and follows the Tn-1 distribution

The T distribution • The Tn and Normal distributions are almost identical for n>20, • As a rule of thumb, an approximate 95% confidence interval for n>20 is

T-tests • The T-test compares the means of samples • The T-test is optimal (in LR sense) if the data are sampled from a Normal distribution • Even if the data are not Normal the test is useful in large samples • There are several versions, depending on the details

T tests for the comparison of sample meansOne Sample Test • One sample y1 ….. yn • H0: the sample mean = m0 • H1: the sample mean != m0 • Reject H0 if • tn-1(a/2) is the quantile of the Tn-1 distribution such that the probability of exceeding tn-1(a/2) is a/2 • Two sided test • in R, tn-1(0.025) = pt(0.025,n-1)

T tests for the comparison of sample means Paired T-test • Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn) • Example: • two experimental replicates taken from n individuals, we wish to test if the replicates are statistically similar • H0: mean(x) = mean(y) • Take differences d1= x1- y1 … dn= xn - yn • H0: mean(d) = 0 • One sample T-test with m0=0

T tests for the comparison of sample meansTwo-Sample T-test • Two samples of unpaired data x1, x2 ….. xn y1, y2 ….. ym • H0: mean(x) = mean(y) • If we assume the variance is the same in each group then we can estimate the variance as the pooled estimator • The test statistic is (The case with unequal variances is more complicated)

T tests in R t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...) • x a numeric vector of data values. • y an optional numeric vector data values. • alternative a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". • mu a number indicating the true value of the mean • paired a logical indicating whether you want a paired t-test. • var.equal a logical variable indicating whether to treat the two variances as being equal. • conf.level confidence level of the interval. • formula a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups. • data an optional data frame containing the variables in the model formula. • subset an optional vector specifying a subset of observations to be used.

T tests in R samp1 <- rnorm( 20, 0, 1) # sample 20 numbers from N(0,1) distribution samp2 <- rnorm( 20, 1.5, 1) # sample from N(1.5,1) t.test( samp1, samp2,var.equal=TRUE ) Two Sample t-test data: samp1 and samp2 t = -3.5892, df = 38, p-value = 0.000935 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.9272037 -0.5372251 sample estimates: mean of x mean of y 0.4040969 1.6363113

OutliersA small number of contaminating observations with different distribution • Outliers can destroy statistical genuine significance Reason: the estimate of the variance is inflated, reducing the T statistic • And sometimes create false significance : Type III error: "correctly rejecting the null hypothesis for the wrong reason". • There are Robust Alternatives to the T-test: Wilcoxon tests > samp3 <- samp2 > samp3[1] = 40 # add one outlier > t.test( samp1, samp3,var.equal=TRUE ) Two Sample t-test data: samp1 and samp3 t = -1.5911, df = 38, p-value = 0.1199 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -7.0519594 0.8450837 sample estimates: mean of x mean of y 0.4040969 3.5075347 > var(samp2) [1] 1.160251 > var(samp3) [1] 74.88977

Non-parametric tests:Wilcoxon signed rank test(alternative to Paired T-test) • Two samples of paired data (x1,y1), (x2-y2) …. (xn,yn) • Take differences d1= x1 - y1 … dn= xn – yn • H0: distribution of d’s is symmetric about 0 • Compute the ranks of the absolute differences, rank(|d1|) etc • Compute the signs of the differences • Compute W, the sum of the ranks with positive signs • If H0 true then W should be close to the mean rank n/2 • The distribution of W is known and for n>20 a Normal approximation is used. • in R wilcox.test(x, y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = TRUE, exact = NULL, correct = TRUE, conf.int = FALSE, conf.level = 0.95, ...)

Wilcoxon rank sum test(alternative to unpaired T-test) • Two samples of unpaired data x1, x2 ….. xn y1, y2 ….. ym • Arrange all the observations into a single ranked series of length N = n+m. • R1 = sum of the ranks for the observations which came from sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N + 1)/2 • Compute U = min(R1 ,R2) - n(n + 1)/2. • z = (U –m)/s Normal approximation m = nm/2 s2 = nm(n=m+1)/12 • The Wilcoxon Rank sum test is only about 5% less efficient than the unpaired T-test (in terms of the sample size required to achieve the same power) in large samples.

Wilcoxon rank sum test(alternative to unpaired T-test) >wilcox.test(samp1,samp2) Wilcoxon rank sum test data: samp1 and samp2 W = 82, p-value = 0.001041 alternative hypothesis: true mu is not equal to 0 > wilcox.test(samp1,samp3) # with outliers Wilcoxon rank sum test data: samp1 and samp3 W = 82, p-value = 0.001041 alternative hypothesis: true mu is not equal to 0

Binomial Sign Test(T-test on one sample) • One sample y1 ….. yn H0: the sample mean = m0 • Count M, the number of y’s that are > m0 • If H0 is true, this should be about n/2, and should be distributed as a Binomial random variable B(n/2,n)

Contingency Tables • Data are a table of counts, with the row and column margins fixed. • Ho: the counts in each cell are consistent with the rows and columns acting independently

The Chi-Squared statistics

Fisher’s Exact Testfor 2 x 2 Contingency tables • Similar idea to a permutation test (see below) • Given the marginal totals in a 2x2 table, we can compute the probability P of getting the observed cell counts assuming the null hypothesis of no associations between classifying factors. • hypergeometric distribution. • Find all tables with same margins and with probabilities no bigger than P • P-value of the table is the sum of these probabilities

Fisher’s Exact Test R implementation: m <- matrix(c(10,10,5,12),nrow=2) fisher.test(m) Fisher's Exact Test for Count Data data: m p-value = 0.3152 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.513082 11.942154 sample estimates: odds ratio 2.342796 >

Permutation TestsAnother non-parametric Alternative • Permutation is a general principle with many applications to hypothesis testing. • But it is not useful for parameter estimation • The method is best understood by an example: • Suppose a data set is classified into N groups. We are interested in whether the differences between group means (eg if N=2 then the two sample T test or Wilcoxon Rank Sum test is appropriate) • Compute a statistic S (eg the difference between the maximum and minimum group mean) • If there are no differences between groups then any permutation of the group labellings between individuals should produce much the same result for S • Repeat i = 1 to M times: • permute the data • compute the statistic on the permuted data, Si • The Permutation p-value is the fraction of permutations that exceed S • It’s that simple!

Bootstrapping • Given a set of n things S = {x1 …. x2}, a bootstrap sample is formed by sampling with replacement from S, to create a new set S* of size n • Original • Resample • Some elements of S will be missing in S* and some will be present multiply often (about 1/e = 63% of the elements will be present in S* on average) • S* can be used as a replacement for S in any algorithm that works on S, and we can repeatedly draw further bootstrap samples. • Any statistic Z that can be calculated on S can also be calculated on S*, and so we can construct the bootstrap sampling distribution of Z from repeated samples. • From this we can make inferences about the variation of Z, eg construct a 95% confidence interval (if Z is numeric) • Can be used for parameter estimation

Bootstrapping • Bootstrapping is a general way to evaluate the uncertainty in estimates of statistics whilst making few assumptions about the data • It is particularly useful for complex statistics whose distribution cannot be evaluated analytically • Example: • In a gene expression study, we wish to find a minimal set of SNPs that in combination will predict the variation in a phenotype. • Suppose we have a deterministic algorithm A that will find us a plausible set of SNPs (ie a model). • This is likely to be a complex programme and contain some arbitrary steps. • It is also unlikely to give any measure of uncertainty to the model. • There may be many possible sets of SNPs with similar claims, so we would like a way to average across different models, and evaluate the evidence for a SNP as the fraction of models in which the SNP occurs • To do this, we need a way to perturb the data, to generate new data sets on which we can apply the procedure A • Bootstrapping is one way to do this

Dealing With Statistical Uncertainty