
The end of statistical significance Rob Herbert

This presentation delves into the limitations of significance testing and hypothesis testing in health research, offering alternative approaches like estimation. It discusses the misuse of p-values and suggests focusing on effect sizes. A world without the fixation on p<0.05 is explored, envisioning a shift towards estimating effects rather than relying on statistical significance. Join us in reevaluating statistical practices in health research.


Presentation Transcript


  1. The end of statistical significance Rob Herbert

  2. Significance testing and hypothesis testing • Significance tests and hypothesis tests have become ubiquitous in health research. • Few researchers appreciate that these procedures have profound limitations. • This presentation will: • review the reasoning behind significance testing and hypothesis testing. • outline problems with null hypothesis statistical tests. • consider how researchers could analyse data and report statistical analyses without conducting significance tests or hypothesis tests.

  3. Significance testing and hypothesis testing Both significance testing and hypothesis testing: • are conducted within a frequentist framework. • begin by positing a null hypothesis. • involve calculating p – the probability, when a study is hypothetically replicated many times, of observing an effect as large as or larger than the effect actually observed, if the null hypothesis is true.
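The definition above can be made concrete with a small simulation. The sketch below (all data hypothetical) uses a permutation test: under the null hypothesis the group labels are exchangeable, so "replicating the study many times" amounts to reshuffling labels and counting how often the effect is as large as or larger than the one observed.

```python
# Minimal sketch of the frequentist p-value via a permutation test.
# Outcomes are made-up numbers for illustration only.
import random

random.seed(1)
treated = [12.1, 9.8, 11.4, 13.0, 10.7]
control = [9.9, 10.2, 8.7, 9.5, 10.1]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treated) - mean(control)

pooled = treated + control
n_reps = 10_000
as_extreme = 0
for _ in range(n_reps):
    random.shuffle(pooled)
    diff = mean(pooled[:5]) - mean(pooled[5:])
    if abs(diff) >= abs(observed):  # two-sided: "as large as or larger"
        as_extreme += 1

p = as_extreme / n_reps
print(f"observed difference = {observed:.2f}, p = {p:.3f}")
```

Note that p here is a statement about hypothetical replications under the null, not about the truth of the null itself, which is the distinction the next slides turn on.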

  4. Significance testing and hypothesis testing • Fisher developed significance testing. • He considered that small p values (say, p < 0.05) provide evidence that the null hypothesis is not true.

  5. Significance testing and hypothesis testing • Neyman and Pearson developed hypothesis testing. • In hypothesis testing, there are two hypotheses: the null hypothesis and an alternative hypothesis. • Neyman and Pearson used p values only to choose between these two hypotheses: If p ≥ 0.05, retain the null hypothesis. If p < 0.05, reject the null hypothesis and accept the alternative.
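The Neyman–Pearson rule can be written as a two-line decision function, which makes its knife-edge character plain: the p-values below are hypothetical inputs, and observations a hair's breadth apart lead to opposite conclusions.

```python
# Sketch of the Neyman-Pearson decision rule at a fixed alpha.
# Hypothesis testing reduces p to a binary choice between H0 and H1.
ALPHA = 0.05

def decide(p):
    if p < ALPHA:
        return "reject H0, accept H1"
    return "retain H0"

print(decide(0.049))  # just under the threshold
print(decide(0.051))  # just over the threshold
```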

  6. Significance testing and hypothesis testing • Significance testing and hypothesis testing are mathematically similar but differ greatly in interpretation and reporting. • Few researchers (maybe even few journal editors) appreciate these differences. Modern practices are often an inconsistent and illogical mish-mash of the two traditions.

  7. Limitations of significance testing and hypothesis testing • p is not the probability that a hypothesis is (or is not) true. • p is the probability of observing data as extreme as (or more extreme than) the actually-observed data, given that the null hypothesis is true. • We should be more interested in the probability that the null hypothesis is true given the actually-observed data.

  8. Limitations of significance testing and hypothesis testing • p is not evidence. • The probability of an observation given a particular hypothesis (like the null hypothesis) does not, on its own, provide evidence for or against that hypothesis. • It is only possible to quantify the strength of evidence for a hypothesis compared to a competing hypothesis. Data provide evidence for hypothesis A and against hypothesis B if the data were more likely to be observed given hypothesis A than given hypothesis B.
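The idea that evidence is inherently comparative can be sketched as a likelihood ratio. Below, the same (hypothetical) observation is evaluated under two assumed normal models — one with the null's zero effect, one with a non-zero effect — and the ratio of the two likelihoods quantifies the relative support the data give each hypothesis.

```python
# Sketch of evidence as a likelihood ratio under two normal models.
# The observed effect, the assumed effect under H1, and the standard
# error are all illustrative numbers.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x = 1.5       # hypothetical observed effect
sigma = 1.0   # assumed known standard error
lik_H0 = normal_pdf(x, mu=0.0, sigma=sigma)  # H0: effect is zero
lik_H1 = normal_pdf(x, mu=2.0, sigma=sigma)  # H1: effect is 2.0

lr = lik_H1 / lik_H0
print(f"likelihood ratio (H1 vs H0) = {lr:.2f}")
```

A ratio above 1 means the observation was more probable under H1 than under H0; a p-value alone, computed under H0 only, cannot make this comparison.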

  9. Limitations of significance testing and hypothesis testing • Significant findings are often not replicable. “the probability of non-replication of published studies with p-values in the range 0.005 to 0.05 is roughly 0.33.” Boos DD, Stefanski LA (2011). American Statistician 65: 213-221.

  10. Limitations of significance testing and hypothesis testing • In most clinical research, the null hypothesis must be false. • The null hypothesis posits that the effect of interest is exactly zero in the population. • In clinical research, it is generally not plausible that the effect of interest is exactly zero. • If it is not plausible that the null hypothesis is exactly true, all claims of statistical significance must be correct and all findings of a lack of statistical significance must represent failures to detect an effect that really does exist.

  11. Limitations of significance testing and hypothesis testing • We need to know about the size of effects. • Information about the size of effects, rather than just the existence of effects, is needed if we are to identify important causes of disease, determine the primary mechanisms by which health interventions work, and assess if the effects of interventions are large enough to make them worthwhile. • p says nothing about the size of effects.

  12. A simple alternative • There are alternative approaches to statistical inference that do not involve significance testing or hypothesis testing. • The simplest is estimation. • Estimation is built on the same frequentist framework and the same mathematics as significance testing and hypothesis testing. • In estimation, the aim is to estimate effects (parameters) of populations using sample data. The uncertainty or imprecision of those estimates is communicated with confidence intervals.
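Estimation as described above can be sketched with the stdlib alone. The example below (illustrative data; the 1.96 normal critical value is an assumption — small samples would use a t critical value) reports a point estimate of a mean difference together with a 95% confidence interval, with no test and no p-value in sight.

```python
# Sketch of estimation: point estimate of a mean difference plus a
# 95% confidence interval. Data are hypothetical.
import statistics

treated = [12.1, 9.8, 11.4, 13.0, 10.7, 11.9, 10.4, 12.6]
control = [9.9, 10.2, 8.7, 9.5, 10.1, 9.0, 10.8, 9.4]

diff = statistics.mean(treated) - statistics.mean(control)
se = (statistics.variance(treated) / len(treated)
      + statistics.variance(control) / len(control)) ** 0.5
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"estimate = {diff:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```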

  13. What would research look like in a world without “p < 0.05”? • Study objectives would be described in terms of the estimation of effects

  14. What would research look like in a world without “p < 0.05”? • The words “statistically significant” and “p < 0.05” would disappear from journals

  15. A landmark issue of The American Statistician In October 2017, the American Statistical Association convened a two-day symposium on statistical inference. The proceedings were published as 43 papers in The American Statistician in March 2019.

  16. The lead editorial (Wasserstein et al)

  17. What would research look like in a world without “p < 0.05”? • The interpretation of data would focus on point estimates and confidence intervals. • Is the point estimate of the effect large enough to be, in some sense, important? Or is the effect trivially small? • Is the point estimate likely to be biased? If so, in what direction and by how much? • How much certainty can we have in the estimate of effect? We should interpret confidence intervals by “describ[ing] the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits.” Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature. 2019;567:305-307.
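The interpretive stance Amrhein et al. recommend — consider the practical implications of all values inside the interval — can be sketched as a simple comparison of the interval against a smallest worthwhile effect. The interval and the clinical threshold below are hypothetical.

```python
# Sketch of interpreting a confidence interval by the practical
# meaning of the values it contains. All numbers are illustrative.
estimate, lo, hi = 1.8, 0.2, 3.4  # point estimate and 95% CI
smallest_worthwhile = 1.0         # assumed smallest worthwhile effect

if lo >= smallest_worthwhile:
    verdict = "all plausible values are worthwhile"
elif hi <= smallest_worthwhile:
    verdict = "all plausible values are too small to matter"
else:
    verdict = "compatible with both worthwhile and trivial effects"

print(f"estimate {estimate} (95% CI {lo} to {hi}): {verdict}")
```

Note that the interval here straddles the threshold, so an honest report acknowledges the uncertainty rather than forcing a binary significant/non-significant verdict.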

  18. Summary • Contemporary approaches to statistical inference are usually an illogical and inconsistent mish-mash of significance testing and hypothesis testing. • Significance testing and hypothesis testing are uninformative, misleading, and unnecessary. • Journals should discourage or ban the use of significance testing and hypothesis testing (except, perhaps, in certain contexts) and explicit or implicit references to statistical significance. • A simple, easily implemented alternative is to employ estimation. • Estimation focuses on interpretation of the estimated size of effects and the uncertainty of those estimates.
