210 likes | 422 Views
Null Hypothesis Significance Testing. What the heck have we been doing this whole time?. Some thoughts. “statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (Schmidt & Hunter, 1997, p. 37).
E N D
Null Hypothesis Significance Testing What the heck have we been doing this whole time?
Some thoughts • “statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (Schmidt & Hunter, 1997, p. 37). • “The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (Meehl, 1978, p. 817). • Cohen (1994) suggested that Statistical Hypothesis Inference Testing produces a more appropriate acronym. • What is NHST, what isn’t it, and why is it over- and mis-used?
What is hypothesis testing about? • Use an inferential procedure to examine the credibility of a hypothesis about a population based on the probability of our sample data
How is NHST made possible? • The sampling distribution tells us the degree of variability to expect with regard to some statistic. We can then see whether our sample stat varies greatly from the random error we would expect from sampling from a population with a particular value (point estimate) for that statistic. • Example: is the mean of test scores from this school all that different from the national average?
Logic of the NHST • If we believe something to be different why do we start by hypothesizing that things are the same? • Falsification • cannot prove things true • Provides a basis for statistical test • Gives us somewhere to start i.e. something to test; We don’t know the precise value of an alternative
Hypothesis testing • The explosion: NHST almost non-existent prior to 1940, today almost used exclusively • Used to see extremely controlled, low N or N of 1 studies. The idea was to get rid of the error beforehand. However the practical side of psychology (esp. in education) wanted to see different groups tested.
Fisher vs. Neyman • vs. • For heavyweight stats champion thingy
Fisher • Rejected the Bayesian model of p(H|D) for the frequentist approach of p(D|H), claims too subjective. • His work in the early part of the 20th century eventually led to near unilateral use of many of his techniques in the field psychology
Fisher’s “level of significance” • How determined • Early Fisher: set some acceptable standard, say .05 • Later: State exact level as a communication to researchers
Neyman-Pearson’s • ‘Level of significance’ must be set before the experiment to interpret it as a long run frequency of error (Type I): level • Also added (Type II), power (1-), alternative hypothesis • So now that we have this new sort of thing to worry about (), how do we make it more confusing? • Set the standard level at… .05.
So what does a significant result mean? • Fisher: epistemic interpretation about the likelihood of the null hypothesis (how much do we believe in the false null), p is a property of the data • N-P: behavioristic interpretation (reject or don’t) that refers to repeated experimentation, p is a property of the test • in fact we don’t really have a p-value to report, our statistic either falls in our region of rejection or doesn’t
So what does a non-significant result mean? • Fisher: nothing, can’t prove the null (can only disprove) • N-P: act as if the null were true.
What the p-value means • Probability obtained tells us: If the null hypothesis were true, the probability of obtaining a sample statistic (mean, difference between means, etc.) of the kind observed
What we want it to mean • We want the p value to be a probability about a hypothesis. • P(H0|D) • Some probability of H0 conditional on the data
Psych today- the hybrid • Fisher and N-P interpretations of p-value, incorrect inferences about the probabilities of hypotheses or error rates, dogmatic approach to scientific investigation.
What’s the alternative • “a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn’t exist” (Cohen, 1997, p. 31) • Goodness of fit intervals (pretty tricky stats) • Bayesian (which does give P(H|D), but has its own problems) • Confidence intervals • Effect sizes • Graphs and more descriptives
Solutions • Don’t forget to use the noggin when conducting analyses- don’t let stat programs or textbooks tell you what it is “significant”. • There are other ways to analyze data without using NHST. But don’t fall in to the same trap of rigid thinking with those either. • Focus on effect sizes, report as much information as possible, let others know exactly why you came to your conclusions • Collect good data (not as easy as it sounds) and have good theories and clear ideas driving the motivation for your research.
Resources • Gigerenzer, G. (1993). The Superego, The Ego and the Id in Statistical Reasoning. In Keren & Lewis (Eds.) Data Analysis in the Behavioral Sciences. • Cohen, J. (1994). The earth is round, p < .05. American Psychologist, 49, 997-1003. • Hubbard R. & Bayarri, M.J. (2003). Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing. The American Statistician. Volume: 57 Number: 3 Page: 171 – 178 • Oakes, M. 1986. Statistical Inference: A Commentary for the Social and Behavioral Sciences. Chichester, John Wiley & Sons. • Abelson, Robert. Statistics as Principled Argument. Mahwah, NJ:Erlbaum, 1995.
Quotes • http://www.indiana.edu/~stigtsts/quotsagn.html