Power and Effect Size
Previous Weeks • A few weeks ago I made a small chart outlining all the different statistical tests we’ve covered (week 9) • I want to complete that chart using information from the past week • Most of this is a repeat – but a few new tests have been added • It is important that you are familiar with these tests, know when they are appropriate to use, and know how to run (most of) them in SPSS • You are excused from running ANCOVA and RM ANOVA
Tonight… • A break from learning a new statistical ‘test’ • Focus will be on two critical statistical ‘concepts’ • Statistical Power • Related to Alpha/Statistical Significance • Brief overview of Effect Size • Statistically significant results vs Meaningful results • First, a quick review of error in testing…
Example Hypothesis • Pretend my master’s thesis topic is the influence of exercise on body composition • I believe people who exercise more will have lower %BF • To study this: • I draw a sample and group subjects by how much they exercise – High and Low Exercise Groups (this is my IV) • I also assess %BF in each subject as a continuous variable (DV) • I plan to see if the two groups have different mean %BF • My hypotheses (HO and HA): • HA: There is a difference in %BF between the groups • HO: There is not a difference in %BF between the groups
Example Continued • Now I’m going to run my statistical test, get my test statistic, and calculate a p-value • I’ve set alpha at the standard 0.05 level • By the way, what statistical test should I use…? • My final decision on my hypotheses is going to be based on that p-value: • I could reject the null hypothesis (accept HA) • I could accept the null hypothesis (reject HA)
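(As an aside on the mechanics: for two independent groups and a continuous DV, the natural choice is an independent-samples t-test. Below is a minimal sketch of the decision rule in Python – we use SPSS in this course, so this is illustration only, and the %BF numbers are made up:)

```python
from scipy import stats

# Hypothetical %BF values for the two exercise groups (not real data)
high_exercise = [18.2, 21.5, 24.0, 19.8, 23.1, 20.4, 25.2, 22.6]
low_exercise = [24.9, 27.3, 23.8, 29.1, 25.5, 26.0, 28.4, 24.2]

# Independent-samples t-test, two-sided
t_stat, p_value = stats.ttest_ind(high_exercise, low_exercise)

alpha = 0.05
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject HO (accept HA)")
else:
    print(f"p = {p_value:.3f} > {alpha}: accept HO")
```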
Statistical Errors… • Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make: • Type I Error • We could reject the null hypothesis although it was really true (should have accepted the null) • Type II Error • We could fail to reject the null hypothesis when it was really false (should have rejected the null)
HA: There is a difference in %BF between the groups • HO: There is not a difference in %BF between the groups
Statistical Errors… • Remember – My final decision is based on the p-value
If p ≤ 0.05, our decision is reject HO • If p > 0.05, our decision is accept HO
Statistical Errors… • In my analysis, I find: • High Exercise Group mean %BF = 22% • Low Exercise Group mean %BF = 26% • p = 0.08 • What is my decision? • Accept HO • There is NOT a difference in %BF between the groups • Why is that my decision, when the means ARE different? • I can’t be confident that the 4% difference between the two groups is not due to random sampling error. Is it possible I’ve made an error in my decision?
Possible Error…? • If I did make an error, what type would it be? • Type II Error • When you find a p-value greater than alpha • The only possible error is Type II error • When you find a p-value less than alpha • The only possible error is Type I error
If p ≤ 0.05, our decision is reject HO • If p > 0.05, our decision is accept HO
Possible Error…? • Compare Type I and Type II error like this: • The only concern when you find statistical significance (p < 0.05) is Type I Error • Is the difference between groups REAL, or due to Random Sampling Error? • Thankfully, the p-value tells you how likely a difference this large would be from random sampling error alone • In other words, the p-value tells you how likely Type I error is • But does the p-value tell you how likely Type II error is? • No – the probability of Type II error is instead tied to Power
Possible Error…? • The probability of Type II error is β; Statistical Power is 1 – β, the probability of avoiding that error • We will not discuss the specific calculation of power in this class • SPSS can calculate this for you • Power (1 – β) is related to Alpha, but: • Alpha is the probability of having Type I error • Lower number is better (i.e., 0.05 vs 0.01 vs 0.001) • Power is the probability of NOT having Type II error • The probability of being right (correctly rejecting the null hypothesis) • Higher number is better (typical goal is 0.80). Let’s continue this in the context of my ‘thesis’ example
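(Before returning to the example, here is a minimal sketch of a power calculation in Python using statsmodels – SPSS is the course tool, so this is purely illustrative, and the effect size and group size are assumed values:)

```python
from statsmodels.stats.power import TTestIndPower

# Power of an independent-samples t-test:
# the probability of correctly rejecting HO (power = 1 - beta)
analysis = TTestIndPower()
power = analysis.power(effect_size=0.5,  # assumed standardized mean difference
                       nobs1=30,         # assumed subjects per group
                       alpha=0.05)
print(f"Power = {power:.2f}; beta (Type II error risk) = {1 - power:.2f}")
```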
Statistical Errors… • In my analysis, I found: • High Exercise Group mean %BF = 22% • Low Exercise Group mean %BF = 26% • p = 0.08 • Decided to accept the null • What do I do when I don’t find statistical significance? • What happens when the result does not reflect expectations? First, consider the situation
Should it be statistically significant? • The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result • Just because you wanted your test to be significant doesn’t mean it should be • This wouldn’t be Type II error – it would just be the correct decision! • In my example, researchers have shown in several studies that exercise does influence %BF • This result ‘should’ be statistically significant, right? • If the answer is yes, then you need to consider power
In my ‘thesis’ • This result ‘should’ be statistically significant, right? • Probably an issue with Statistical Power • This scenario plays out at least once a year between me and a grad student working on a thesis or research project: • How can I increase the chance that I will find statistically significant results? • Why was this analysis not statistically significant? • What can I do to decrease the chance of Type II error? • Several different factors influence power – your ability to detect a true difference
How can I increase Power? • 1) Increase Alpha level • Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results) • Downsides to increasing your alpha level? • This will increase the chance of Type I error! • This is rarely acceptable in practice • Only really an option when working in a new area: • Researchers are unsure of how to measure a new variable • Researchers are unaware of confounders to control for
How can I increase Power? • 2) Increase N • Sample size is directly used when calculating p-values • Including more subjects will increase your chance of finding statistically significant results • Downsides to increasing sample size? • More subjects means more time/money • More subjects is ALWAYS a better option if possible
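(You can also turn the question around and solve for the N needed to reach the usual 0.80 power target. A sketch with statsmodels, again using an assumed effect size:)

```python
from statsmodels.stats.power import TTestIndPower

# Subjects per group needed for 80% power, assuming a
# medium standardized difference (d = 0.5) and alpha = 0.05
n_per_group = TTestIndPower().solve_power(effect_size=0.5,
                                          power=0.80,
                                          alpha=0.05)
print(f"About {n_per_group:.0f} subjects per group")  # roughly 64
```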
How can I increase Power? • 3) Use fewer groups/variables (simpler designs) • Related to sample size, but different • ‘Use fewer groups’ NOT ‘use fewer subjects’ • More groups negatively affects your degrees of freedom • Remember, df is calculated from the # of groups and # of subjects • Lots of variables, groups, and interactions make it more difficult to find statistically significant differences • Corrections for the family-wise error rate exist precisely to make it harder to find significant results! • Downsides to fewer groups/variables? • Sometimes you NEED to make several comparisons and test for interactions – unavoidable
How can I increase Power? • 4) Measure variables more accurately • If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc…) this increases measurement error • More measurement error decreases confidence in the result • For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error. • More of an internal validity problem than statistical problem • Downsides to measuring more accurately? • None – if you can afford the best tools
How can I increase Power? • 5) Decrease subject variability • Subjects will have various characteristics that may also be correlated with your variables • SES, sex, race/ethnicity, age, etc… • These variables can confound your results, making it harder to find statistically significant results • When planning your sample (to enhance power), select subjects that are very similar to each other • This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results • Downside to decreasing subject variability? • Will decrease your external validity – generalizability • If you only test women, your results do not apply to men
How can I increase Power? • 6) Increase magnitude of the mean difference • If your groups are not different enough, make them more different! • For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people? • Compare a ‘very’ high exercise to a ‘very’ low exercise group • Sampling at the extremes, getting rid of the middle group • Downsides to using the extremes? • Similar to decreasing subject variability, this will decrease your external validity Questions on Power/Increasing Power?
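(To tie a few of these levers together, the sketch below – same assumptions as the earlier power examples – shows that loosening alpha and widening the group difference both raise power for a fixed N:)

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.10):        # lever 1: raise the alpha level
    for d in (0.3, 0.5, 0.8):     # lever 6: bigger group difference
        p = analysis.power(effect_size=d, nobs1=30, alpha=alpha)
        print(f"alpha = {alpha:.2f}, d = {d:.1f}: power = {p:.2f}")
```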
The Catch-22 of Power and P-values • I’ve mentioned this previously – but once you are able to draw a large sample, it ruins the utility of the p-value/statistical significance • The larger your sample, the more likely you’ll find statistically significant results • Sometimes minuscule differences between groups or tiny correlations are ‘significant’ • This becomes relevant once sample size grows to 100–150 subjects per group • Once you approach 1000 subjects, it’s hard not to find p < 0.05 • Example from the most highly cited paper in Psych, 2004…
This paper was the first to find a link between playing video games/TV and aggression in children: • Every correlation in this table except one has p < 0.05 • Do you remember what a correlation of 0.10 looks like?
(Scatterplot of two variables with r = 0.10.) Do you see a relationship between these two variables?
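(You can check this catch-22 directly: the t statistic for testing a correlation is t = r·√((n − 2)/(1 − r²)), so a fixed, weak r = 0.10 becomes ‘significant’ purely because n grows. A sketch:)

```python
from math import sqrt
from scipy import stats

r = 0.10
for n in (30, 150, 1000):
    # t statistic for testing HO: the true correlation is zero
    t = r * sqrt((n - 2) / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
    print(f"n = {n:4d}: p = {p:.3f}")
# The same weak r = 0.10 gives p of about 0.60 at n = 30,
# crosses p < 0.05 near n = 400, and is ~0.002 at n = 1000
```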
What now? • This realization has led scientists to begin avoiding p-values (or at least to avoid reporting only p-values) • Moving toward reporting 95% confidence intervals instead • Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.) • Some people interpret ‘statistically significant’ as meaning ‘important’ • We’ve mentioned several times this is NOT true • Statistically significant just means the result is likely not Type I error • You can have ‘important’ results that aren’t statistically significant
Effect Size • To get an idea of how ‘important’ a difference or association is, we can use Effect Size • There are over 40 different types of effect size • Depends on statistical test used • SPSS will NOT always calculate effect size • Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference • Not impacted by statistical significance • Effect size can stay the same even if p-value changes • Present the two together when possible • The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it
Effect Size • Understanding effect size from correlations and regressions is easy (and you already know it): • r², the coefficient of determination • % Variance accounted for • Pearson correlations between %BF and 3 variables: • r = 0.54, r = -0.92, r = 0.70 • Which of the three correlations has the most important association with %BF? • r² = 0.29, r² = 0.85, r² = 0.49
Interpreting Effect Size • Usually, guidelines are given for interpreting the effect size • They help you know how important the effect is • Only a guide – you can use your own brain to compare • In general, r² is interpreted as: • 0.01 or smaller, a Trivial Effect • 0.01 to 0.09, a Small Effect • 0.09 to 0.25, a Moderate Effect • > 0.25, a Large Effect
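(Just to make the cutoffs concrete, a tiny helper that encodes the scale above – hypothetical, not part of the course materials:)

```python
def interpret_r_squared(r2: float) -> str:
    """Label an r-squared value using the guidelines above."""
    if r2 <= 0.01:
        return "Trivial effect"
    elif r2 <= 0.09:
        return "Small effect"
    elif r2 <= 0.25:
        return "Moderate effect"
    return "Large effect"

print(interpret_r_squared(0.85))  # Large effect
print(interpret_r_squared(0.29))  # Large effect
```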
Effect Size in Regression • Two regression equations each contain 4 predictors of %BF. Each ‘model’ is statistically significant. Here are their r² values: • 0.29 and 0.15 • Which has the largest effect size? Does either of the regression models have a large effect size? • The 0.29 model is the most important, and has a ‘large’ effect size. • The 0.15 model is of ‘moderate’ importance.
Effect Size for Group Differences • Effect size in t-tests and ANOVAs is a bit more complicated • In general, effect size is the ratio of the mean difference between two groups to the standard deviation • Does this remind you of anything we’ve previously seen? • Z-score = (Score – Mean)/SD • Effect size, when calculated this way, is basically determining how many standard deviations apart the two groups are • E.g., an effect size of 1 means the two groups differ by 1 standard deviation (this would be a big difference)!
Example • When working with t-tests, the effect size calculated as mean difference/SD is called Cohen’s d • < 0.1 Trivial effect • 0.1-0.3 Small effect • 0.3-0.5 Medium effect • > 0.5 Large effect • The next slide shows the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d
Paired-Samples t-test Output • Mean difference = 2.9, Std. Deviation = 5.2 • Cohen’s d = 0.55, a large effect size • Essentially, the weight loss program reduced body weight by just about half a standard deviation
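(The arithmetic as a sketch – the small difference from the slide’s 0.55 is just rounding in the reported mean and SD:)

```python
# Cohen's d for the paired-samples result above
mean_difference = 2.9  # mean weight change from the SPSS output
sd_difference = 5.2    # SD of the difference scores

cohens_d = mean_difference / sd_difference
print(f"Cohen's d = {cohens_d:.2f}")  # ~0.56, a large effect on this scale
```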
Other example • I sample a group of 100 ISU students and find their average IQ is 103. • Recall, the population mean for IQ is 100, SD = 15. • I run a one-sample t-test and find it to be statistically significant (p < 0.05) • However, effect size is… • 0.2, or Small Effect • Interpretation: While this difference is likely not due to random sampling error – it’s not very important either
Other types of effect sizes • SPSS will not calculate Cohen’s d for t-tests • However, it will calculate effect size for ANOVAs (if you request it) • Not Cohen’s d, but Partial Eta Squared (η²) • Similar to r², interpreted the same way (same scale) • Here is last week’s cancer example • Do Tumor Size and Lymph Node Involvement affect Survival Time? • I’ll re-run and request effect size…
Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually • How would you describe the effect of Tumor Size, or our interaction? • Trivial to Small Effect – How did we get a significant p-value? • Other factors not in our model are also very important
Notice that the r² is equal to the η² of the full model • The advantage of η² is that you can evaluate individual effects
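(SPSS derives partial η² from the sums of squares in the ANOVA table: η² = SS_effect / (SS_effect + SS_error). A sketch with made-up sums of squares, since the original output isn’t reproduced here:)

```python
# Partial eta squared from ANOVA sums of squares:
# eta_p^2 = SS_effect / (SS_effect + SS_error)
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    return ss_effect / (ss_effect + ss_error)

# Hypothetical values standing in for the SPSS output
print(f"{partial_eta_squared(ss_effect=120.0, ss_error=3500.0):.3f}")  # ~0.033, small
```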
Effect Size Summary • Many other types of effect sizes are out there – I just wanted to show you the effect sizes most commonly used with the tests we know: • Correlation and Regression: r² • T-tests: Cohen’s d • ANOVA: Partial eta squared (η²) and/or r² • You are responsible for knowing: • The general theory behind effect sizes/why to use them • Which tests they are associated with • How to interpret them
Upcoming… • In-class activity • Homework: • Cronk – Read Appendix A (pp. 115–119) on Effect Size • Holcomb Exercises 21 and 22 • No out-of-class SPSS work this week • Things are slowing down – next week we’ll discuss non-parametric tests • Chi-Square and Odds Ratio