Power Analysis for Traditional and Modern Hypothesis Tests Kevin R. Murphy Pennsylvania State University
Power Analysis • Helps you plan better studies • Helps you make better sense of existing studies • Is not limited to traditional null hypothesis tests • Application of power analysis to minimum-effect tests will be discussed
Errors in Null Hypothesis Tests • Your decision is judged against the true state of affairs • Rejecting H0 when H0 is true: Type I error • Failing to reject H0 when H0 is false: Type II error • The other two outcomes are correct decisions
Power Depends On • Effect Size • How large is the effect in the population? • Sample Size (N) • You are using a sample to make inferences about the population. How large is the sample? • Decision Criteria (α) • How do you define “significant” and why?
Power Analysis and the F Distribution • The power of most statistical tests in the social sciences (e.g., ANOVA, regression, t-tests, other linear model statistics) can be evaluated via the familiar F distribution • F is a ratio of observed effect to error • F = MS treatments / MS error • F = (True Effect + Error) / Error • The larger the true treatment effect, the larger the F you expect to find • If the null hypothesis is correct, E(F) ≈ 1.0
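A quick simulation makes the last point concrete. The sketch below (Python with NumPy/SciPy; not part of the original slides, and the group sizes are arbitrary assumptions) draws four groups from the same population, so H0 is true, and shows that the average F lands near 1.0:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
f_values = []
for _ in range(2000):
    # Four groups of 25 drawn from one population: the null is true
    groups = [rng.normal(loc=0.0, scale=1.0, size=25) for _ in range(4)]
    f_values.append(f_oneway(*groups).statistic)

# Near 1.0 (the exact expectation is df_error / (df_error - 2))
print(np.mean(f_values))
```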
How Does Power Analysis Work? In the familiar F distribution with df = (7, 200), 95% of the values fall below about 2.00. F = 2.00 therefore represents the cutoff for rejecting H0
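That cutoff is easy to verify with SciPy; a minimal sketch, assuming the same df = (7, 200):

```python
from scipy.stats import f

# 95th percentile of the central F distribution with df = (7, 200)
f_crit = f.ppf(0.95, 7, 200)
print(round(f_crit, 2))  # about 2.06, close to the 2.00 quoted above
```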
The Noncentral F Distribution If the null hypothesis is false, the noncentral F distribution is needed. In a noncentral F distribution where 75% of the values fall below 2.00, power = .25
A Larger Effect In a noncentral F distribution reflecting a larger effect, only 30% of the values fall below 2.00. Therefore power = .70
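Both power figures come from the noncentral F. The sketch below reproduces the logic; the noncentrality values (4 and 12) are illustrative assumptions chosen to give power near .25 and .70 respectively:

```python
from scipy.stats import f, ncf

df1, df2 = 7, 200
f_crit = f.ppf(0.95, df1, df2)            # cutoff for rejecting H0

for nc in (4.0, 12.0):                    # assumed noncentrality parameters
    power = ncf.sf(f_crit, df1, df2, nc)  # P(F > cutoff | true effect)
    print(f"noncentrality = {nc}: power = {power:.2f}")
```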
How to Increase Power • Increase N • Effects of adding more subjects are not identical to those of adding more observations • Increase ES • Choose a different research question • Use stronger treatments or interventions • Use better measures • Use a more lenient alpha • p<.05 is driven by force of habit, not necessarily by substantive concerns
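All three levers can be seen in one place with a small helper function. This is a sketch, not the slides' own calculator; the design (k = 4 groups) and the λ = f² × N approximation for a one-way ANOVA are assumptions:

```python
from scipy.stats import f, ncf

def anova_power(pv, n_total, alpha, k=4):
    """Approximate power of a one-way ANOVA, where pv is the
    proportion of variance explained by the treatment."""
    f2 = pv / (1 - pv)               # Cohen's f-squared
    lam = f2 * n_total               # noncentrality parameter
    df1, df2 = k - 1, n_total - k
    crit = f.ppf(1 - alpha, df1, df2)
    return ncf.sf(crit, df1, df2, lam)

print(anova_power(0.05, 100, 0.05))  # baseline
print(anova_power(0.05, 200, 0.05))  # lever 1: increase N
print(anova_power(0.10, 100, 0.05))  # lever 2: increase effect size
print(anova_power(0.05, 100, 0.10))  # lever 3: more lenient alpha
```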
Effects of Implementing Power Analysis • Stronger studies • Larger samples, better measures • Fewer studies • Adequate studies are harder to do than most people realize • Less emphasis, in the long term, on null hypothesis testing
Conducting a Power Analysis • The classic text in this field is still one of the best sources • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum • More current (and more accessible) sources include • Lipsey, M. (1990). Design sensitivity. Newbury Park, CA: Sage • Murphy, K. & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ: Erlbaum
Conducting a Power Analysis • Power Analysis software • Power and Precision - Biostat • www.PowerAnalysis.com • One-Stop F Calculator • Included in Murphy & Myors (2004) • PASS - NCSS software • www.ncss.com/pass.html
Conducting a Power Analysis • In planning studies, you should • Assume relatively small effects • If it were reasonable to expect a large effect, you probably wouldn’t need to do the study or the test • Aim for power of .80 or better • Power of .50 means that significance tests have become a coin flip
Effect Size Conventions • In the behavioral and social sciences, there are widely followed conventions for describing small, moderate, and large effects • Small: d = .20 (standardized mean difference), 1% of variance explained • Moderate: d = .50, 10% of variance explained • Large: d = .80, 25% of variance explained
Applications of Power Analysis • Study planning - Given ES and α, solve for N • If you wanted to compare the effects of four types of training programs and: • You expected small to moderate effects (programs account for 5% of variation in performance) • You use an α level of .05 • You need N=214 to achieve Power=.80
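A sketch of how that N could be found numerically, using the same approximate ANOVA power helper as above (λ = f² × N is an assumption of this sketch); the search should stop near the slide's N = 214:

```python
from scipy.stats import f, ncf

def anova_power(pv, n_total, alpha, k):
    f2 = pv / (1 - pv)
    lam = f2 * n_total
    df1, df2 = k - 1, n_total - k
    return ncf.sf(f.ppf(1 - alpha, df1, df2), df1, df2, lam)

n = 8
while anova_power(0.05, n, 0.05, k=4) < 0.80:
    n += 1
print(n)  # close to the slide's N = 214
```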
Applications of Power Analysis • Study evaluation - Given N and α, solve for ES • If you wanted to compare the effects of four safety interventions and: • You have 44 subjects available • You use an α level of .05 • You will achieve Power=.80 only if the effects of interventions are truly large (accounting for 25% of the variance in outcomes)
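The same helper can be run in reverse (a sketch under the same assumptions): hold N = 44 and k = 4 fixed and search for the smallest effect that yields power = .80:

```python
from scipy.stats import f, ncf

def anova_power(pv, n_total, alpha, k):
    f2 = pv / (1 - pv)
    lam = f2 * n_total
    df1, df2 = k - 1, n_total - k
    return ncf.sf(f.ppf(1 - alpha, df1, df2), df1, df2, lam)

pv = 0.01
while anova_power(pv, 44, 0.05, k=4) < 0.80:
    pv += 0.01
print(pv)  # a large effect, in the general range the slide describes
```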
Applications of Power Analysis • Making a rational choice regarding α - Given N and ES, solve for α • If you wanted to compare the effects of two leadership development programs and: • You have 200 subjects available • You expect a small difference (d=.20, or 1% of the variance explained by programs) • You will achieve Power=.64 using the more lenient α level • You will achieve only Power=.37 using the stricter α level
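The slide's specific alpha levels did not survive extraction, so the sketch below simply assumes a few common choices (α = .01, .05, .10) with N = 200 in total and d = .20, to show how power shifts with alpha; the exact power figures depend on the design the slide assumed:

```python
from scipy.stats import f, ncf

n, d = 200, 0.20
pv = d**2 / (d**2 + 4)            # d = .20 is about 1% of variance
lam = (pv / (1 - pv)) * n
df1, df2 = 1, n - 2               # two groups

for alpha in (0.01, 0.05, 0.10):  # assumed alpha levels
    power = ncf.sf(f.ppf(1 - alpha, df1, df2), df1, df2, lam)
    print(f"alpha = {alpha}: power = {power:.2f}")
```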
Moving Beyond Traditional Significance Testing • Traditional null hypothesis tests are the focus of most power analyses • These tests are deeply flawed, and there is relatively little research on the power of alternatives • Minimum-effect tests represent one useful alternative
Nil Hypothesis Testing • Testing the hypothesis that treatments, interventions, etc. have no effect (Nil Hypothesis Test - NHT) is the most common and least useful thing social and behavioral scientists do • Two problems loom largest: • Confusion over Type I errors • Likelihood of rejecting the null hypothesis eventually reaches 1.0, regardless of the research question
Type I Errors are Very Rare • Type I error - reject H0 when it is true • If H0 is never true, it is impossible to make a Type I error • If H0 is very unlikely, a Type I error is even less likely • H0 - treatment had NO effect at all • H1 - SOMETHING happened • Most things we do to minimize Type I errors lead to more Type II errors
This Implies • The large literature on protecting yourself from Type I errors is not really useful • NHTs yield one of two outcomes • They confirm the obvious: you reject H0, which you already knew was likely to be wrong • Or they confuse you: you “accept” H0 even though you know it is likely to be wrong
In NHT, All You Need Is N • As N increases, the likelihood of rejecting the nil hypothesis approaches 1.0 • Power to reject H0 does not depend all that much on the phenomenon • if N is big enough, you will reject H0 • if N is small enough, you won’t • Significance tests are an indirect index of how many subjects showed up
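A sketch of that point: fix a trivially small effect (PV = .001, an assumed value) and watch power crawl toward 1.0 as N grows:

```python
from scipy.stats import f, ncf

pv = 0.001                         # a trivially small true effect
for n in (100, 1_000, 10_000, 100_000):
    lam = (pv / (1 - pv)) * n
    df1, df2 = 1, n - 2
    power = ncf.sf(f.ppf(0.95, df1, df2), df1, df2, lam)
    print(f"N = {n}: power = {power:.2f}")
```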
There Must be a Better Way • Stop doing significance tests (e.g., Schmidt, 1992) • Confidence intervals (e.g., APA Task Force, American Psychologist, August 1999) • Bayesian methods (e.g., Rouanet, Psychological Bulletin, 1996)
There Must be a Better Way • Minimum-Effect Tests • Test the hypothesis that something nontrivial happened • Murphy, K. & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ: Erlbaum • Murphy, K. & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234-248
Minimum-Effect Tests • H0 - treatments have a negligible effect (e.g., they account for 1% or less of the variance) • H1 - the effect of treatments is big enough to care about • This approach addresses the two biggest flaws of traditional tests • H0 really is plausible. Treatments rarely have zero effect but they often have negligible effects • Increasing N does not automatically increase likelihood of rejecting H0
Minimum-Effect Tests • With Minimum-Effect Tests (METs) • Type I errors are once again possible, but can be minimized • the question asked in a MET is no longer trivial • you can actually learn something by doing the test • Power analysis works exactly the same way in MET as in NHT
Performing Minimum-Effect Tests • Put your test statistics in a simple, common form • e.g., F • Decide what you mean by a negligible effect • Find or create an F table based on that definition of a negligible effect (using the noncentral F distribution) • Proceed as you would for any traditional NHT
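A minimal sketch of that recipe, assuming a two-group design with N = 128 and 1% of variance as the definition of "negligible" (both are illustrative choices, as is the observed F):

```python
from scipy.stats import ncf

n, k = 128, 2                       # assumed design: two groups
df1, df2 = k - 1, n - k
pv_negligible = 0.01                # assumed definition of "negligible"
lam0 = (pv_negligible / (1 - pv_negligible)) * n

# The cutoff comes from the noncentral F, not the central F
met_crit = ncf.ppf(0.95, df1, df2, lam0)

f_observed = 9.3                    # illustrative observed F
print(f_observed > met_crit)        # True => reject "effect is negligible"
```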
Working with the Noncentral F • Calculating or deriving noncentral F distributions was once a daunting task • Many simple calculators are now available • http://calculators.stat.ucla.edu/cdf/ncf/ncfcalc.php • Noncentrality parameter (λ) • λ is a measure of effect size • λ = [df_h × (MS_h − MS_e)] / MS_e
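Plugging illustrative (made-up) mean squares into the formula above:

```python
df_h = 3                  # hypothesis degrees of freedom
ms_h, ms_e = 45.0, 10.0   # made-up mean squares for illustration
lam = df_h * (ms_h - ms_e) / ms_e
print(lam)                # 10.5
```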
What Constitutes a “Negligible” Effect ? • Standards for “negligible” effects depend on the research area and on the consequences of decisions • Aspirin use accounts for very little variance in heart attacks, but the use of aspirin saves thousands of lives at minimal cost • In personnel selection, it is relatively easy to account for a large proportion of the variance in performance with simple cognitive tests, so the increase in effectiveness that is defined as negligible might be larger
Defining a “Negligible” Effect • Effect size conventions are useful, but by themselves may not be sufficient • Consequences of errors must also be considered • Small: d = .20 (standardized mean difference), 1% of variance explained • Moderate: d = .50, 10% of variance explained
Errors in MET • The potential downsides of MET are: • Type I errors could actually occur • Lower power than corresponding NHT • You can reduce Type I errors by using larger samples • The loss of power is more than balanced by the fact that the hypothesis being tested is not a trivial one
Type I vs Type II Errors • The tradeoff between Type I and Type II errors is more complicated in METs than in nil tests • In a MET, alpha is precise only if the true effect size is exactly the same as your definition of “negligible” • Type II errors are more of a problem with METs • METs are less powerful than NHTs (it is easier to reject the hypothesis that nothing happened than the hypothesis that nothing important happened), but this is not necessarily a bad thing • METs place an even greater premium on large samples, but small samples cause problems even where there is substantial power
Examples - Comparing Two Treatments • N needed • True effect PV = .05: Nil test N = 149; MET (1% = negligible) N = 375 • True effect PV = .10: Nil test N = 79; MET (1% = negligible) N = 117
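As a check, the table can be approximately reproduced with the noncentral F: the nil test uses the central F cutoff, while the MET cutoff comes from a noncentral F with 1% of variance treated as negligible. This is a sketch (λ = f² × N and a target power of .80 are assumptions); the Ns it prints should be close to, though not necessarily identical to, the slide's figures:

```python
from scipy.stats import f, ncf

def required_n(pv_true, pv_null=0.0, alpha=0.05, target=0.80, k=2):
    """Smallest total N giving the target power for a nil test
    (pv_null = 0) or a minimum-effect test (pv_null > 0)."""
    n = k + 2
    while True:
        df1, df2 = k - 1, n - k
        lam_true = (pv_true / (1 - pv_true)) * n
        if pv_null > 0:
            lam0 = (pv_null / (1 - pv_null)) * n
            crit = ncf.ppf(1 - alpha, df1, df2, lam0)  # MET cutoff
        else:
            crit = f.ppf(1 - alpha, df1, df2)          # nil cutoff
        if ncf.sf(crit, df1, df2, lam_true) >= target:
            return n
        n += 1

for pv in (0.05, 0.10):
    print(pv, required_n(pv), required_n(pv, pv_null=0.01))
```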