Statistical power


  1. Statistical power Learning outcomes • Distinguish between different types of error in statistical inference • Define effect sizes for common statistical tests • Describe the principles of statistical power analysis

  2. Hypothesis-testing • Hypothesis-testing is a process designed to evaluate competing explanations for a phenomenon • Based on the work of Fisher and Neyman-Pearson • Influenced by Karl Popper

  3. Fisher • Gave us ANOVA and various other statistical tests • Statistician in the field of agronomy (for example, the effect of fertiliser on wheat) • Argued for NHST • Constructed testable and specific null hypotheses • Suggested a 1 in 20 (p < 0.05) threshold for rejecting the null hypothesis

  4. Neyman-Pearson • Fisher had a long-standing feud with Karl Pearson, who worked in the same field • Pearson coined the term standard deviation, and introduced tests of correlation and the chi-square test • Pearson’s son Egon collaborated with Jerzy Neyman on hypothesis-testing • Proposed a decision about the alternative (experimental or research) hypothesis • Appropriate in process-control settings

  5. Null-hypothesis significance testing (NHST) • We wish to test an experimental hypothesis • For example: cognitive-behavioural treatment for chronic pain is more effective than a control • We obtain a random sample of participants • We construct a null hypothesis that the two population means are identical (note: this is just an example) • We measure the difference between the experimental and control group • We calculate the probability of finding a difference at least this large if the null hypothesis were true • If this probability falls below a set threshold (usually 1 in 20), then we reject the null hypothesis (see the sketch below)
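
A minimal sketch of this procedure in Python; the group data are simulated purely for illustration, and scipy's ttest_ind carries out the two-sample t test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Hypothetical outcome scores for the two groups
    treatment = rng.normal(loc=55, scale=10, size=30)  # CBT group
    control = rng.normal(loc=50, scale=10, size=30)    # control group

    # Test the null hypothesis that the two population means are identical
    result = stats.ttest_ind(treatment, control)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
    if result.pvalue < 0.05:  # the conventional 1-in-20 threshold
        print("Reject the null hypothesis")
    else:
        print("Fail to reject the null hypothesis")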

  6. How do we interpret p? • You run a t test, which gives you a p value of 0.43. Which is the correct interpretation? • (a) Given the data, the probability that the null hypothesis is true is 0.57 • (b) Given the data, the probability that the experimental hypothesis is true is 0.57 • (c) Given the data, the probability that the experimental hypothesis is true is 0.43 • (d) If the null hypothesis is true, then the probability of getting these results is 0.43 • (e) If the null hypothesis is false, then the probability of getting these results is 0.57 • (f) If the null hypothesis is false, then the probability that the experimental hypothesis is true is 0.43

  7. The correct answer is: ….. • Misinterpretation of hypothesis-testing is widespread and pervasive

  8. Non-rejection of the null hypothesis • We now know about the rejection of the null hypothesis • If the null hypothesis is true, then the probability of getting these results is ≤ .05 • The null hypothesis is rejected • But what happens if p > .05? Do we ... • accept the null hypothesis? • neither accept nor reject the null hypothesis? • A statistically non-significant result does not allow us to make a decision

  9. Hypothesis-testing in applied psychology • IQ scores linked to height • 14,000 children tested • Controlled for age and sex, socio-economic status, birth order and family size • Statistically significant result • Authors speculate that smaller children may be treated as less mature • What additional information do we need in order to evaluate the importance of this claim?

  10. Cohen (1990) • With 14,000 cases a correlation of only 0.0278 is statistically significant (p < 0.001) • This accounts for only 0.077% of the variance in the sample • We need to understand the effect size • Effect size (ES) is a measure of the magnitude of an effect, often expressed as the amount of variance explained by a test result
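
Cohen's arithmetic can be checked directly with a short sketch (the t-to-r conversion used here appears again on slide 20):

    import math
    from scipy import stats

    r, n = 0.0278, 14000
    t = r * math.sqrt((n - 2) / (1 - r**2))           # t statistic for a correlation
    p = 2 * stats.t.sf(abs(t), df=n - 2)              # two-tailed p value
    print(f"t = {t:.2f}, p = {p:.4f}")                # p < .001, as Cohen reports
    print(f"variance explained = {100 * r**2:.3f}%")  # about 0.077%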

  11. Close your eyes and ... Imagine that you are a counselling psychologist. You are interested in the effectiveness of an intervention aimed at raising the morale of carers working with older adults. You compare morale scores of a control group with a group which has received the intervention. What possible outcomes are there?

  12. Possible outcomes (1) • There is a statistically significant difference between the groups and we reject the null hypothesis (i) This conclusion is correct because in reality your intervention is beneficial (ii) This conclusion is incorrect and your intervention is no more effective than the control treatment

  13. Possible outcomes (2) • There is no statistically significant difference between the groups and we do not reject the null hypothesis (i) This conclusion is correct because in reality the intervention is no more effective than the control treatment (ii) This conclusion is incorrect because your intervention is more effective than the control, but your study did not detect it

  14. What kinds of error are 1(ii) and 2(ii)? • 1(ii) is termed a Type I error • 2(ii) is termed a Type II error • Which kind of error is the more important one to avoid in psychological research?

  15. Type I and Type II errors • Type I and Type II errors are mutually exclusive • The probability of making a Type I error is referred to as alpha (α) • The probability of making a Type II error is referred to as beta (β) • A goal of good research is finding a way to reduce the probability of making both Type I and Type II errors • How can we reduce these errors?

  16. Reducing Type I errors • Reduce the rejection level (alpha), for example from .05 to .01 • There are cases when this is justified (for example, when you are running a large number of tests) • But, generally, it is not a good idea

  17. Reducing Type II errors • β depends on a number of factors: • α: if we reduce the probability of making a Type I error, the probability of making a Type II error increases • Sample size: an increase in sample size reduces the probability of a Type II error • Effect size: the greater the experimental effect, the lower the probability of making a Type II error • A simulation sketch of the sample-size factor follows below
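
A small Monte Carlo sketch of the sample-size factor; the true effect (d = 0.5) and the group sizes are illustrative assumptions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, d, n_sims = 0.05, 0.5, 2000       # fixed alpha, medium true effect

    for n in (20, 64, 150):                  # participants per group
        misses = 0
        for _ in range(n_sims):
            a = rng.normal(0.0, 1.0, n)      # control group
            b = rng.normal(d, 1.0, n)        # treatment group, true shift = d
            if stats.ttest_ind(a, b).pvalue >= alpha:
                misses += 1                  # a real effect that the test missed
        beta = misses / n_sims
        print(f"n = {n:3d}: beta = {beta:.2f}, power = {1 - beta:.2f}")

Larger groups drive beta down (and power up); at n = 64 per group power is close to .80, anticipating slide 26.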

  18. Power • The probability of rejecting the null hypothesis when it is false is 1 − β, and this is referred to as the power of the study: the probability that you will detect an experimental effect • If our study has a large effect size, then this will increase its power • Standardised effect sizes are independent of sample size • Example: the number of standard deviations by which two means differ • What measures of effect size are there?

  19. Measures of effect size • (i) t test for difference between two means: Cohen's d, r • (ii) Pearson’s correlation: r² or r • (iii) chi-square: w • (iv) ANOVA: η², f, ω², ε² • (v) multiple regression: R² • (vi) logistic regression: odds ratio, R²L

  20. How do you calculate an effect size? • Difference between means: d = (mean(A) − mean(B)) / s • So imagine that you have compared the means of two groups on IQ scores. One group scores 90 and the other group 100. • What is the value of d? (remember IQ has an SD of 15) • What does this value actually mean? • r = √(t² / (t² + df)), with t and df taken from the t test results • Both calculations are sketched below
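
A sketch of both calculations (the function names here are my own, not from the slides):

    import math

    def cohens_d(mean_a: float, mean_b: float, sd: float) -> float:
        """d: the difference between means in standard-deviation units."""
        return (mean_a - mean_b) / sd

    def r_from_t(t: float, df: int) -> float:
        """r = sqrt(t^2 / (t^2 + df))."""
        return math.sqrt(t**2 / (t**2 + df))

    print(cohens_d(100, 90, 15))  # 0.67: the groups differ by two thirds of an SD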

  21. Correlation and multiple regression • For correlation, effect size is calculated by r² or r • r² (×100) is very informative because it tells you the amount of variability in one variable which is attributable to the other variable • E.g. suppose we correlate height and weight and find r = +0.6; therefore r² is 0.36; that is to say that for our sample 36% of the variability in our participants' weight is attributable to their height • A similar conclusion can be drawn about R² in multiple regression

  22. ANOVA

  Source     df    SS        MS       F        p
  Factor 1    1    61.3      61.3     37.354   .0001
  Factor 2    2    5.6       2.8      1.697    NS
  F1 × F2     2    11.467    5.733    3.475    .0473
  Error      24    39.6      1.65
  Total      29    118.0

  • η² = SSFactor 1/SSTotal = 61.3/118.0 = 0.52, or (×100) 52% of the variability in scores • Cohen uses f, a measure in terms of SDs (like d; not to be confused with F): f = √(η² / (1 − η²)) and η² = f² / (1 + f²) • ω² is less biased, but conservative (and a more accurate measure of effect size for ANOVA): ω² = (SSFactor − (k − 1)MSerror) / (SSTotal + MSerror), where k = number of levels of the independent variable • ε² is an approximately unbiased estimate of the (population) parameter for proportion of explained variance: ε² = (SSFactor − (k − 1)MSerror) / SSTotal
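
The same effect sizes computed from the table above for Factor 1 (k = 2 because Factor 1 has two levels):

    import math

    ss_factor, ss_total, ms_error, k = 61.3, 118.0, 1.65, 2

    eta_sq = ss_factor / ss_total                        # ~0.52, as on the slide
    f = math.sqrt(eta_sq / (1 - eta_sq))                 # Cohen's f
    omega_sq = (ss_factor - (k - 1) * ms_error) / (ss_total + ms_error)
    epsilon_sq = (ss_factor - (k - 1) * ms_error) / ss_total

    print(f"eta2 = {eta_sq:.2f}, f = {f:.2f}, "
          f"omega2 = {omega_sq:.2f}, epsilon2 = {epsilon_sq:.2f}")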

  23. Statistical power • Cohen (1962) reviewed published research and found that the average power in reported studies was only .48 (i.e. that there was only a 48% chance of successfully rejecting the null hypothesis, given a medium effect size) • Sedlmeier and Gigerenzer (1989), in an updated review, claimed that the situation was becoming worse, not better • Maxwell (2004) found that psychological research studies remain underpowered • Cohen (1992) offered some simple guidelines for calculating sample size, aimed at producing studies with a power of .80

  24. Prospective statistical power (a priori power analysis) • We can use our calculation for effect size to predict how many participants we need to test • Suppose that we are interested in examining the IQ of two different groups (A and B) and we predict that the difference will be 10 IQ points. We know that σ = 15. How many participants do we need in each group? • First we need to calculate the effect size

  25. What is the effect size? • Cohen (1992) helps us out here:

  Effect size    small    medium    large
  d              .20      .50       .80

  • So an effect size of d = .67 (= 10/15) is between a medium and a large effect

  26. We can now use this to work out how many participants we need in our sample (α = .05):

  Test      small    medium    large
  t test    393      64        26

  • So we need 64 in each group to have a power of .80 • More accurate is to conduct a power analysis using (a) power tables (e.g. Cohen, 1988; Clark-Carter, 1997, 2004, 2010) or (b) power analysis software (SamplePower, G*Power); a sketch with such software follows below
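
A sketch of the same calculation with statsmodels, assuming a two-tailed independent-samples t test; the results match the Cohen (1992) table above:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
        n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
        print(f"{label}: n per group = {n:.0f}")  # 393, 64, 26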

  27. But how do we know what our effect size will be for a study that has not yet been conducted? • Examine previous research as a guide to effect sizes • Calculate an effect size from a pilot study • If all else fails, decide beforehand what effect size you want to detect, based on conventions for effect sizes (Cohen, 1988)

  28. Retrospective statistical power (post-hoc power analysis) • Sometimes the results of our studies do not yield any significant results • We can determine how likely a significant effect would have been, given our number of participants and a hypothesised psychologically meaningful effect size (i.e. a difference which researchers/practitioners would be interested in) • This will tell us whether it is worthwhile to run an additional study (a sketch follows below)
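
A sketch with statsmodels; the sample size (n = 30 per group) and the hypothesised meaningful effect (d = 0.5) are illustrative assumptions:

    from statsmodels.stats.power import TTestIndPower

    # With effect_size, nobs1 and alpha given, solve_power returns the power
    power = TTestIndPower().solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
    print(f"power = {power:.2f}")  # roughly .47: the study was underpowered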

  29. Alternatives • Counternull statistic (Rosenthal & Rubin, 1994) • The non-null magnitude of effect size that has the same p value as the null value of the effect size • Related to effect size • prep (Killeen, 2005) • The probability of replicating the obtained result • Related to NHST • p intervals (Cumming, 2008) • Show the unreliability of p values • Provide another justification for the use of confidence intervals

  30. Alternatives (2) • (Power analysis for) minimum-effect tests (Murphy & Myors, 1999; Murphy et al., 2009) • The nil hypothesis is almost always wrong • Minimum-effect tests are alternatives to traditional hypothesis-testing • Test the hypothesis that treatment effects are negligible • Use one-stop tables or one-stop calculator for minimum-effect tests • Magnitude-based inference (Batterham & Hopkins, 2006) • Takes into account the smallest important effect in making inferences • Uses qualitative descriptors in inference • Mechanistic and clinical (practical) inference

  31. Preparation for next week • Study statistical power • Lecture notes • Further reading (see module guide) • Practical exercises

  32. Summary • Stated the characteristics of null hypothesis significance testing (NHST) • Defined Type I and Type II errors • Stated the factors affecting statistical power • Summarised the major measures of effect size • Illustrated prospective power analysis • Summarised alternatives to NHST

  33. Bibliography • Alternatives • Batterham, A.M., & Hopkins, W.G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50-57. • Buchheit, M. (2016). The numbers will love you back in return—I promise. International Journal of Sports Physiology and Performance, 11, 551-554. • Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300. • Hopkins, W.G., & Batterham, A.M. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine, 46(10), 1563-1573. doi:10.1007/s40279-016-0517-x • Killeen, P. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345-353. • Murphy, K.R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84(2), 234-248. • Murphy, K.R., Myors, B., & Wolach, A.H. (2009). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (3rd ed.). London: Routledge. • Rosenthal, R., & Rubin, D. (1994). The counternull value of an effect size. Psychological Science, 5, 329-334. • van Schaik, P., & Weston, M. (2016). Magnitude-based inference and its application in user research. International Journal of Human-Computer Studies, 88, 38-50. doi:10.1016/j.ijhcs.2016.01.002

  34. Bibliography (continued) • Power analysis and effect size • Clark-Carter, D. (2009). Quantitative psychological research: The complete student's companion. Hove: Psychology Press. • Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum. • Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312. • Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. • Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191. • Hopkins, W. (2006). Estimating sample size for magnitude-based inferences. Sportscience, 10, 63-70. • Jaccard, J. (1998). Interaction effects in factorial analysis of variance. Thousand Oaks, CA: Sage. • Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163. • Murphy, K., Myors, B., & Wolach, A. (2009). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (3rd ed.). London: Routledge.
