Confirmatory Statistics: Identifying, Framing and Testing Hypotheses Andrew Mead (School of Life Sciences)
Contents • Philosophy and Language • Why test hypotheses? • Underlying method • Terminology and Language • Statistical tests for particular hypotheses • Tests for means – t-tests and non-parametric alternatives • Tests for variances – F-tests • Tests for frequencies – chi-squared test • More tests for means – analysis of variance • Issues with multiple testing
Comparative Studies • Much research is concerned with comparing two or more treatments/conditions/systems/… • Interest is often in identifying which is “best” • Or, more likely, whether the best is better than some other • Statistical hypothesis testing provides a way of assessing this • In other areas of research we want to know the size or shape of some response • Estimation, including a measure of uncertainty • Modelling, including estimation of model parameters • And testing of whether parameters could take particular values
Hypothesis testing • Scientific method: • Formulate hypothesis • Collect data to test hypothesis • Decide whether or not to accept hypothesis • Repeat • Scientific statements are falsifiable • Hypothesis testing is about falsifying them.
Example • Hypothesis: Men have big feet • What does this mean? • What kind of data do we need to test it? • What data would cause us to believe it? • What would cause us not to believe it?
What does it mean? • Men have big feet • On an absolute scale? • Relative to what – women (or children or elephants)? • For all men? • Need to add some precision to the statement • The average shoe size taken by adult males in the UK is larger than the average shoe size taken by adult females in the UK • Perhaps after adjusting for general size • Should think about what the alternative is (if our hypothesis is not true)
What kind of data are useful? • Shoe sizes of adult men • And of adult women? • Or of children? • Additional data • To adjust for other sources of variation • Ages, heights, weights • Paired data from brothers and sisters? • To control for other sources of variation • Fraternal twins? • How much data?
Assessing the hypothesis • What would cause us to believe it? • If in our samples, the feet sizes of men were consistently larger than those of women • And perhaps that we couldn’t explain this by height/weight • Do we care how much bigger? • What would cause us not to believe it? • If in our sample men’s shoe sizes were not on average bigger • Or maybe that some were bigger and some smaller • If the average was bigger, but (perhaps after adjusting for height/weight) not by more than could plausibly result from sampling variability • How can we assess this?
Conclusions • Need to carefully define your hypothesis • Think carefully about what you really mean • Be precise • Make sure you measured everything relevant • Or choose your samples to exclude other sources of variability • Need to have a way of assessing the evidence • Does the evidence support our hypothesis, or could it have occurred by chance? • We usually compare the hypothesis of interest with a (default) hypothesis that nothing interesting is happening • i.e. that the apparent effect is just due to sampling variability
Assessing Evidence • Statistical significance testing is the classical statistical way to do this • Standard way used in science • Typically this involves: • Comparing two or more hypotheses • Often a default belief (null hypothesis) and what we are actually interested in (alternative hypothesis) • Considering how likely the evidence would be if each of the hypotheses were true • Deciding if there is enough evidence to choose the non-default (alternative) hypothesis over the default (null) hypothesis
Hypothesis testing in Science • In science, we take the default to be that nothing interesting is happening • cf. Occam’s razor: ‘Do not multiply entities beyond need’ • ‘the simplest explanation is usually the correct one’ • Call this the null hypothesis • Compare with the alternative hypothesis that something interesting is happening • E.g. Men have big feet • We generally deal with quantitative data, and so can set quantitative criteria for rejecting the null hypothesis
Outcomes • Three possible outcomes from a significance test • We reach the correct conclusion • We incorrectly reject the null hypothesis • type 1 error • most serious mistake • equivalent to a false conviction • as in a criminal trial, we strive to avoid this • We incorrectly accept the null hypothesis • type 2 error
Type 1 error • Incorrectly reject the null hypothesis • Probability of making a Type 1 error is called the size of the test • This is the quantity (usually denoted α) usually associated with significance tests • if a result is described as being significant at 5%, then this means: • “given a test of size 5% the result led to the rejection of the null hypothesis” • so, the probability of rejecting the null hypothesis when it is true is 5% • Usually we want to control the size of the test, and choose for this to be small • We want to be fairly certain before we change our view from the default (null) hypothesis
Type 2 error • Incorrectly accept the null hypothesis • Probability of not making a Type 2 error is called the power of the test • the probability of (correctly) rejecting the null hypothesis when it is false • power commonly written as (1 − β) • so probability of making a Type 2 error is β • Alternative hypotheses are usually not exact • e.g. men’s feet are bigger than women’s are • not men’s feet are 10% bigger than women’s are • power of a test will vary according to which exact statement is true about the alternative hypothesis • a test may have small power to detect a small difference but will have higher power to detect a large difference • so usually talk about the power of a test to detect some specified degree of difference from the null hypothesis • can calculate power for a range of differences and construct a power curve
Conventional levels • Significance tests conventionally performed at certain ‘round’ sizes • 5% (lowest level normally quoted in journals), 1% and 0.1% • may sometimes be reasonable to quote 10% • values available in books of tables • Computer packages generally give exact levels (to some limit) • traditionally round up to nearest conventional level • editors becoming more accepting of quoting the exact level • but shouldn’t quote really small values • in tables significance sometimes shown using asterisks • usually * = 5%, ** = 1%, *** = 0.1%
Confusing scientific and statistical significance • Statistical significance indicates whether evidence suggests that null hypothesis is false • does not mean that differences have biological importance • If our experiment/sample is too big • treatment differences of no importance can show up as significant • consider size of treatment differences as well as statistical significance to decide whether treatment effects matter • If our experiment/sample is too small • real and important differences between treatments may escape detection • if a treatment difference is not significant we cannot assume that treatments are equal • was power of test sufficient to detect important differences? • if a test is not significant this does not mean strong evidence for null hypothesis, but lack of strong evidence for alternative hypothesis
Other potential problems • Significance testing deliberately makes it difficult to reject the null hypothesis • Sometimes this is not what we are interested in doing • May want to estimate some characteristic from the data • May want to fit some sort of model to describe the data • Hypotheses tested here in terms of the parameter values • Possible to test inappropriate hypotheses • Standard tests tend to have the null hypothesis that two (or more) treatments are equal, or that a parameter equals zero. • If the question of interest is whether a parameter equals one, testing whether it’s different from zero doesn’t help • Could also be interested in determining that two treatments are similar (equivalence testing), so don’t want to test whether they are different
Summary of hypothesis testing theory (1) • Compare alternative hypothesis of interest to null hypothesis • Null hypothesis (default) says nothing interesting happening • Do we really believe it? • Believe null hypothesis unless compelled to reject it • Need strong evidence in favour of the alternative hypothesis to reject the null hypothesis • Size of test (α) gives the probability of rejecting the null hypothesis when it is true • Usually referred to as the significance level for the test
Summary of hypothesis testing theory (2) • Pick a test statistic with good power for the alternative hypothesis of interest • Power of a test (1 − β) gives the probability of rejecting the null hypothesis when it is false • Power changes with the particular statement that is true for the alternative hypothesis • Size of test is used to determine the critical value of test statistic at which the null hypothesis is rejected • Statistical and scientific significance are different!
Applying statistical tests • Almost always using the collected data to test hypotheses about some larger population • Using statistical methods to make inferences from the collected data about a broader scenario • What is the larger population? • How broadly can the inferences be applied? • Most tests have associated assumptions that need to be met for the test to be valid • If assumptions fail then conclusions from test are likely to be flawed • Need to assess the assumptions • Often related to the form of data – the way the data were collected
Selecting an appropriate test • ‘100 Statistical Tests’ (G.K. Kanji, 1999, SAGE Publications) • General introduction • Example applications • Classification of tests • By number of samples • 1 sample, 2 samples, K samples • By type of data • Linear, Circular • By type of test • Parametric classical, Parametric, Distribution-free (non-parametric), Sequential • By aim of test • Central tendency, proportion, variability, distribution functions, association, probability, randomness, ratio
Student’s t-test – for means • Three types of test: • One-sample • To test whether the sample could have come from a population with a specified mean value • Two-sample • To test whether the two samples are from populations with the same means • Paired-sample • To test whether the mean difference between pairs of observations from the two samples is zero
Student’s t-test • One-sample t-test • H0 : μ = μ0 • H1 : μ ≠ μ0 • Given a sample x1, x2,…, xn the test statistic, t, is the absolute difference between the sample mean and μ0, divided by the standard error of the mean • Compare the test statistic with the critical value from a t-distribution with (n - 1) degrees of freedom • For a test of size 5%, we will reject H0 if t is greater than the critical value such that 2.5% of the distribution is in each tail
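In symbols (a standard formulation, supplementing the verbal description above), with x̄ the sample mean and s the sample standard deviation:

$$ t = \frac{|\bar{x} - \mu_0|}{s / \sqrt{n}} $$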
Why 2.5%? • Interested in detecting difference from the specified (null hypothesis) value • Don’t care in which direction • Formula for test statistic looks at absolute value of difference • So reject the null hypothesis if t falls in either tail of the distribution • With 2.5% in each tail, we get 5% in total
Example • Yields of carrots per hectare from 14 farmers: 97.1 99.2 95.6 97.6 99.7 94.2 95.3 74.6 112.8 110.0 91.5 96.3 85.7 112.4 • The “standard” yield per hectare is 93; is this an abnormal year? • Test H0 : μ = 93 against H1: μ ≠ 93
Calculations • Mean yield = 97.29 • Standard deviation = 10.15 • Standard error of mean = 10.15 / √14 = 2.71 • Test statistic: t = |97.29 – 93| / 2.71 = 4.29 / 2.71 = 1.58 • Critical value is t13; 0.025 = 2.160 • Test statistic is smaller than this • So fail to reject (accept) H0 at the 5% significance level – not an abnormal year
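A minimal sketch of these calculations in Python using scipy (the two-sided p-value from ttest_1samp is an equivalent route to the same conclusion):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

yields = [97.1, 99.2, 95.6, 97.6, 99.7, 94.2, 95.3,
          74.6, 112.8, 110.0, 91.5, 96.3, 85.7, 112.4]

# Test statistic by hand: |mean - 93| / (s / sqrt(n))
n = len(yields)
se = stdev(yields) / sqrt(n)          # standard error of the mean
t = abs(mean(yields) - 93) / se
print(round(t, 2))                    # 1.58

# Built-in two-sided test: p > 0.05, so do not reject H0
t_stat, p_value = stats.ttest_1samp(yields, popmean=93)
print(round(t_stat, 2), round(p_value, 3))
```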
Power analysis • Alternative hypotheses were not exact, so we cannot calculate an exact power for this test • But we can calculate the power of the test to detect various specified degrees of difference from the null hypothesis • Reminder: Power is the probability of rejecting the null hypothesis when it is false • For the example, we would have accepted the null hypothesis if the absolute difference in means was less than the least significant difference
Calculations • For the test, the power for any given “alternative” mean value, μ1,is • the probability of getting a value greater than 98.86 (mean + LSD = 93 + 5.86) • PLUS • the probability of getting a value less than 87.14 (mean – LSD = 93 - 5.86) • for a t-distribution with 13 degrees of freedom with mean μ1 and standard error as calculated from the observed standard deviation
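A sketch of this power calculation in Python, following the slide’s approximation of recentring a t-distribution (scale = observed standard error) at the alternative mean μ1; an exact calculation would use the noncentral t-distribution:

```python
from scipy import stats

df, se = 13, 2.71
lower, upper = 93 - 5.86, 93 + 5.86   # null mean -/+ least significant difference

def power(mu1):
    # Probability the sample mean lands in the rejection region when the true mean is mu1
    return (stats.t.sf(upper, df, loc=mu1, scale=se)
            + stats.t.cdf(lower, df, loc=mu1, scale=se))

# At mu1 = 93 (null true) this returns about 0.05, the size of the test
for mu1 in (93, 96, 99, 102):
    print(mu1, round(power(mu1), 3))
```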
Two sample test • Two-sample test • H0 : μ1 = μ2 • H1 : μ1 ≠ μ2 • The usual assumption for a two-sample t-test is that the distributions from which the two samples are taken have the same variances • An alternative test allows the variances to be different • Given two samples x1, x2,…, xm and y1, y2,…, yn the test statistic t is calculated as the absolute value of the difference between the sample means, divided by the standard error of that difference (sed) • Compare the test statistic with the critical value from a t-distribution with (n + m − 2) degrees of freedom • For a test of size 5%, we will reject H0 if t is greater than the critical value such that 2.5% of the distribution is in each tail
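Written out (the standard pooled-variance form, under the equal-variances assumption above), with sp² the pooled variance estimate:

$$ t = \frac{|\bar{x} - \bar{y}|}{s_p \sqrt{\frac{1}{m} + \frac{1}{n}}}, \qquad s_p^2 = \frac{(m-1)s_x^2 + (n-1)s_y^2}{m + n - 2} $$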
Paired sample t-test • Here we have paired observations, one from each sample, and are interested in differences between the samples, when we also believe there are differences between pairs • H0 : μ1 = μ2 • H1 : μ1 ≠ μ2 • Because of the differences between pairs, it’s more powerful to test the differences within pairs • Given two paired samples x1, x2,…, xn and y1, y2,…, yn, we calculate the differences between each pair, d1, d2,…, dn, and calculate their mean, d̄ • Then we do a one-sample test of whether the population mean difference is zero
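A minimal sketch in Python with invented paired data; scipy.stats.ttest_rel is exactly the one-sample t-test applied to the differences:

```python
from scipy import stats

# Hypothetical paired measurements (e.g. the same subjects under two conditions)
x = [10.5, 9.0, 11.0, 10.0, 9.5, 12.0]
y = [8.0, 8.5, 9.0, 9.5, 8.0, 10.5]

t_stat, p_value = stats.ttest_rel(x, y)

# Identical result from a one-sample test of the differences against zero
diffs = [a - b for a, b in zip(x, y)]
t_check, p_check = stats.ttest_1samp(diffs, popmean=0.0)
print(round(t_stat, 4), round(p_value, 4))
print(round(t_check, 4), round(p_check, 4))   # same values
```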
Assumptions • General assumption for all three types of test is that the values from each sample are independent and come from Normal distributions • For the paired-sample t-test this assumption applies to the differences • For the two-sample t-test we have the additional assumption that the distributions have the same variance (though there is a variant that allows different variances) • Homoscedasticity • For the paired t-test we have the additional assumption that each observation in one sample can be ‘paired’ with a value from the other sample
One- and two-sided tests • All the t-tests described so far have been what is called two-sided • That is, they have alternative hypotheses of the form ‘two things are different’ • There are very similar tests available when the alternative hypothesis is that one mean is greater (or, alternatively, less) than the other • These are called one-sided tests • Now calculate a signed test statistic • For a test of size 5%, compare with critical value such that 5% of distribution is in the tail
Power • A one-sided test is more powerful than a two-sided one for testing a one-sided hypothesis • It can never reject the null hypothesis if the means differ in the direction not predicted by the alternative hypothesis • So when calculating the rejection region, we can put the entire size of the test in the direction of interest
Alternative (distribution-free) tests • Appropriate when data cannot be assumed to be from a Normal distribution • Generally still for continuous data • Wilcoxon-Mann-Whitney rank sum test • For two populations with the same mean • Sign tests for medians • One-sample and two-sample tests • Signed rank tests for means • One-sample and paired sample tests
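Minimal sketches of two of these in Python via scipy.stats, with invented samples (scipy states the null hypotheses in terms of the underlying distributions rather than means):

```python
from scipy import stats

x = [5.1, 4.8, 6.2, 5.5, 5.9]
y = [4.2, 4.9, 4.6, 5.0, 4.4]

# Wilcoxon-Mann-Whitney rank sum test for two independent samples
u_stat, p_u = stats.mannwhitneyu(x, y, alternative="two-sided")

# Wilcoxon signed-rank test for paired samples (here x, y treated as pairs)
w_stat, p_w = stats.wilcoxon(x, y)

print(round(p_u, 3), round(p_w, 3))
```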
F-test • Comparison of variances for two populations • Obvious application to test whether two samples for a two-sample t-test do come from populations with the same variance • Rarely actually used for that • Actually used in Analysis of Variance and Linear Regression • H0 : σ1² = σ2² • H1 : σ1² ≠ σ2² • Given two samples x1, x2,…, xm and y1, y2,…, yn, the test statistic F is given by the ratio of the sample variances, with the larger variance always in the numerator
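scipy has no single built-in function for this two-sample variance F-test, so a minimal sketch by hand (invented samples; larger variance in the numerator as described above):

```python
from statistics import variance
from scipy import stats

x = [12.1, 9.8, 11.3, 10.4, 12.0, 9.5]
y = [10.2, 10.5, 10.1, 10.8, 10.3, 10.6]

v_x, v_y = variance(x), variance(y)          # sample variances (n - 1 divisor)
if v_x >= v_y:
    F, df1, df2 = v_x / v_y, len(x) - 1, len(y) - 1
else:
    F, df1, df2 = v_y / v_x, len(y) - 1, len(x) - 1

# Two-sided test: 2.5% in the upper tail, i.e. double the upper-tail probability
p_value = 2 * stats.f.sf(F, df1, df2)
print(round(F, 2), round(p_value, 4))
```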
Type of test and rejection regions • When comparing two samples we are usually interested in a two-sided test (no prior expectation about which will be larger) • Compare the test statistic with the critical value of an F-distribution with (m – 1) and (n – 1) degrees of freedom • For a test of size 5% we will reject H0 if F is greater than the critical value such that 2.5% is in the upper tail • For a one-sided test, with H1 : σ1² > σ2², we calculate the test statistic with the variance for the first sample in the numerator, and reject H0 if F is greater than the critical value such that 5% is in the upper tail • Assumptions • That the data are independent and normally distributed for each sample
Alternative Tests • Bartlett’s Test, Hartley’s Test • Extensions to cope with more than 2 samples • Siegel-Tukey rank sum dispersion test • Non-parametric alternative for comparing two samples
Chi-Squared Test • Two main applications • Testing goodness-of-fit • e.g. for observed data to a distribution, or some hypothesised model • Testing association • e.g. between two classifications of observations • Both applications essentially the same • The test compares observed counts to those expected under the null hypothesis • Where these differ substantially we would anticipate that the test statistic will cause us to reject the null hypothesis
Testing Association • Test for association between (independence of) two classifications of observations • The Chi-squared test involves comparing the observed counts with the expected counts under the null hypothesis • Under the null hypothesis of the independence of the two classifications, the counts in each row (column) of the table will be in the same proportions as the sums across all rows (columns) • Expected frequencies, eij, for each cell of the table are given by eij = (Ri × Cj) / N • Ri = row totals, Cj = column totals, N = overall total
Test statistic • The test statistic is calculated by summing, over all cells of the table, the squared difference between the observed and expected frequencies divided by the expected frequency • Compare the test statistic, χ², with the critical values of a χ²-distribution with degrees of freedom equal to the number of rows minus one, times the number of columns minus one • For a test of size 5% we then reject the null hypothesis of independence if χ² is greater than the critical value such that 5% of the distribution is in the upper tail
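In symbols (standard form), with oij and eij the observed and expected counts in row i and column j of an r × c table:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad \text{df} = (r - 1)(c - 1) $$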
Pooling • The Chi-squared test is an asymptotic test • This means that the distribution of the test statistic under the null hypothesis only approximately follows the stated distribution • The approximation is good if the expected number in each cell is more than about 5 • Therefore, if some cells have an expected count of fewer than five, we must pool rows or columns until this constraint is satisfied • In pooling we should aim to avoid removing any interesting associations if possible
Goodness of Fit • The Chi-squared test can also be used to test goodness of fit of some observed counts to those predicted by a statistical distribution or model • Examples include statistical distributions, Mendel’s laws of genetic inheritance, … • Expected values are calculated based on the predicted probabilities • If any expected values are less than five then an appropriate pooling of categories must be made • The test statistic is calculated as the same sum as for the test of association • It is compared to a Chi-squared distribution with degrees of freedom equal to one fewer than the number of elements in the sum, less one for each parameter estimated from the data
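A sketch in Python using scipy.stats.chisquare, testing counts from a dihybrid cross against Mendel’s 9:3:3:1 ratio (the counts are the classic ones usually quoted for Mendel’s seed shape/colour experiment):

```python
from scipy import stats

observed = [315, 101, 108, 32]       # four phenotype classes, total 556
total = sum(observed)
expected = [total * r / 16 for r in (9, 3, 3, 1)]   # all well above 5, no pooling needed

# df = 4 classes - 1 = 3 (no parameters were estimated from the data)
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 3), round(p_value, 3))   # large p: no evidence against the 9:3:3:1 model
```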
Summary of Chi-squared test • Allows testing ‘goodness of fit’ of observations to expected values under the model specified as the null hypothesis • Expected values can be from a probability distribution, or what is expected under some postulated relationship between variables • Most commonly independence in contingency tables • There are better tests for goodness of fit to a distribution • Test statistic is sum of contributions of the form (observed - expected)2 / expected • Compare with critical values from a Chi-squared distribution, with degrees of freedom depending on the number of contributions and the number of model parameters • Asymptotic test: critical values only approximate • Approximation is bad if too few expected in any ‘cell’ – hence need all expected values to be at least 5.
Analysis of Variance (ANOVA) • Initially a simple extension of the two-sample t-test to compare more than two samples • Null hypothesis: all samples are from populations with the same mean • Alternative hypothesis: some samples are from populations with different means • Test statistic compares variance between sample means with (pooled) variance within samples • Reject null hypothesis if between-sample variance is sufficiently larger than within-sample variance • Use a one-sided F-test – between-sample variance will be larger if the sample means are not all the same • Still need to identify those samples that are from populations with different means • Use two-sample t-tests based on pooled within-sample variance • Same assumptions as for a two-sample t-test
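A minimal one-way ANOVA sketch in Python with three invented samples:

```python
from scipy import stats

a = [24.1, 25.3, 26.0, 24.8]
b = [27.5, 28.1, 26.9, 27.8]
c = [24.9, 25.5, 24.2, 25.8]

# One-sided F-test comparing between-sample to within-sample variance
F, p_value = stats.f_oneway(a, b, c)
print(round(F, 2), round(p_value, 4))   # small p suggests the population means are not all equal
```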
Extensions • Can be applied to a wide range of designs • Identify different sources of variability within an experiment • Blocks – sources of background (nuisance) variation • Treatments – what we usually care about • Construct comparisons (contrasts) to address more specific questions • Also used to summarise the fitting of regression models • Assess whether the variation explained by the model is large compared with the background variation • Can also be used to compare two alternative (nested) models • Does a more complex model provide an improved fit to the data? • Other approaches also available to address this question
Multiple testing • Remember what specifying a test of size 5% means • This is the probability of rejecting the null hypothesis when it is true • With a large number of related tests, will incorrectly reject some null hypotheses that are true • Multiple testing corrections modify the size of each individual test so that the size of the combined tests is 5% • i.e. the overall probability of incorrectly rejecting any of the null hypotheses is 5% • Many different approaches with different assumptions • Tukey test (HSD), Dunnett’s test, Link-Wallace test, … • Generally concerned with making all pairwise comparisons • A well-designed experiment/study will have identified a number of specific questions to be addressed • Often these comparisons will be independent, so less need to adjust the sizes of individual tests
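A sketch of the simplest such correction (Bonferroni) in Python with invented p-values; packages such as statsmodels provide this and the other named procedures ready-made:

```python
# Bonferroni: test each of m hypotheses at size alpha / m, so the overall
# probability of incorrectly rejecting any true null hypothesis is at most alpha
p_values = [0.001, 0.012, 0.030, 0.210, 0.650]   # hypothetical individual p-values
alpha = 0.05
m = len(p_values)

for p in p_values:
    verdict = "reject" if p < alpha / m else "do not reject"
    print(f"p = {p:.3f}: {verdict} at overall size {alpha}")
```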
Confirmatory Statistics • Hypothesis Testing – a five-step method (Neary, 1976) • Formulate the problem in terms of hypotheses • Calculate an appropriate test statistic from the data • Choose the critical (rejection) region • Decide on the size of the critical region • Draw a conclusion/inference from the test • Large number of tests developed for particular problems • Many readily implemented in statistical packages • Approaches can be extended for more complicated problems • Identification of appropriate test depends on the type of data, the type of problem, and the assumptions that we are willing to make