Sample Sizes for IE Power Calculations
Overview • General question: How large does the sample need to be to credibly detect a given effect size? • What does “credibly” mean here? • It means we can be reasonably sure that the difference between the treatment group and the comparison group is due to the program • Randomization removes bias, but it does not remove noise. To reduce noise, we need a large sample. But how large is large?
Measuring Impact • At the end of an experiment, we compare the outcome of interest in the treatment and the comparison groups. • We are interested in the difference: Mean in treatment - Mean in control = Effect size • For example: mean malaria prevalence in villages with ITN distribution vs. mean malaria prevalence in villages with no ITNs • To draw conclusions from that effect size, we need it to be estimated with precision, since there is always variability in the data • If many other unobserved factors affect outcomes, it is harder to say whether the treatment had an effect
Confidence Intervals • We only work with data that is a sample of the population. To assess whether a result holds for the entire population, we need a measure of reliability • A 95% confidence interval for an effect size is built so that, across 95% of the samples we could have drawn from the same population, the interval would contain the true effect. • The standard error (se) of the estimate captures both the size of the sample and the variability of the outcome • It is larger with a small sample and with a variable outcome
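The standard error and confidence interval described above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `diff_in_means_ci` is hypothetical, and the interval uses a normal (large-sample) approximation with unequal variances in the two groups.

```python
import math
from statistics import NormalDist, mean, variance


def diff_in_means_ci(treatment, control, confidence=0.95):
    """Return (effect, se, (low, high)) for mean(treatment) - mean(control).

    Hypothetical helper for illustration; assumes independent observations
    and a sample large enough for the normal approximation.
    """
    effect = mean(treatment) - mean(control)
    # The se shrinks as the groups get larger and grows with outcome variance.
    se = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    # Critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return effect, se, (effect - z * se, effect + z * se)


effect, se, (low, high) = diff_in_means_ci([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
print(effect, se, low, high)
```

With only five observations per group the interval is wide and includes zero, which is the "noise" problem that a larger sample addresses.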
Two Types of Errors • First type of error: concluding that there is an effect when in fact there is no effect. The level of your test is the probability that you will falsely conclude that the program has an effect, when in fact it does not. So with a level of 5%, a program with no true effect will be wrongly declared effective only 5% of the time. Common levels: α = 5%, 10%, 1% • Rule of thumb: if the estimated effect is more than twice its standard error, you can reject the hypothesis of no effect at the 5% level
Two Types of Errors • Second type of error: you fail to reject the hypothesis that the program had no effect, when in fact it does have an effect. • The power of a test is the probability of finding a significant effect in the RCT when the program truly has one • Only with a significant effect can you cleanly influence policy • Power calculations are a tool to see how likely we are to find a significant effect for a given sample size
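To make the link between sample size and power concrete, here is a rough sketch of a power calculation for a two-arm, individually randomized trial. It is not from the slides: the function name `power_two_arm` is hypothetical, and the calculation assumes equal arm sizes, a common outcome standard deviation, and a two-sided test under a normal approximation.

```python
import math
from statistics import NormalDist


def power_two_arm(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test of mean(T) - mean(C) = 0.

    Hypothetical helper; assumes equal arm sizes, a common standard
    deviation `sd`, and a normal approximation to the test statistic.
    """
    std_normal = NormalDist()
    se = sd * math.sqrt(2 / n_per_arm)           # se of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 5%
    shift = effect / se                          # true effect in se units
    # Probability that the estimate lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (20, 64, 200):
    print(n, round(power_two_arm(effect=0.5, sd=1.0, n_per_arm=n), 3))
```

When the true effect is zero the function returns the level of the test (the first type of error); as `n_per_arm` grows, power rises toward 1.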
How to Determine Effect Size • What is the smallest effect that would justify adopting the program (in terms of cost-benefit)? • This sets the minimum effect size we would want to be able to test for • Common danger: using an effect size that is too optimistic leads to a sample size that is too small • How large an effect you can detect with a given sample depends on how variable the outcome is. • Example: If all children have very similar diarrhea prevalence without a program, even a very small impact will be easy to detect • The standardized effect size is the effect size divided by the standard deviation of the outcome • Common effect sizes are: 0.20 (small); 0.40 (medium); 0.50 (large)
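The danger of an over-optimistic effect size can be seen by turning the standardized effect size into a required sample size. The sketch below is an illustration under assumptions not stated in the slides: individual-level randomization, two equal arms, a two-sided test at level `alpha`, 80% power by default, and the usual normal-approximation formula n = 2 (z_{1-alpha/2} + z_{power})^2 / d^2 per arm. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm_for_effect(std_effect, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a standardized effect.

    Hypothetical helper; `std_effect` is the effect size divided by the
    standard deviation of the outcome. Normal approximation, equal arms.
    """
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return math.ceil(2 * z_total ** 2 / std_effect ** 2)


for d in (0.20, 0.40, 0.50):
    print(d, n_per_arm_for_effect(d))
# 0.2 393
# 0.4 99
# 0.5 63
```

Planning for a standardized effect of 0.50 when the true effect is 0.20 would leave the study with roughly one sixth of the sample it needs.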
Design Factors to Take into Account • Availability of a baseline • A baseline can help reduce the needed sample size since it: • Removes some variability in the data, increasing precision • Can be used to stratify and create subgroups • The level of randomization • Whenever treatment is assigned at the group level, power is lower than with randomization at the individual level
Implications from Group Design • The outcomes for all the individuals within a unit may be correlated • All villagers affected by spring improvements at same time • All students at school with trained teachers may have benefited from information • The sample size needs to be adjusted for this correlation • The more correlation within the group, the more we need to adjust the standard errors
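One common way to make this adjustment, not spelled out in the slides, is the design effect 1 + (m - 1) * rho, where m is the number of individuals per group and rho is the intracluster correlation. The sketch below assumes equal-sized groups; the function names and the example values are illustrative only.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from group-level randomization.

    Assumes equal-sized groups; `icc` is the intracluster correlation
    (0 = outcomes within a group are unrelated, 1 = identical).
    """
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


# Illustrative values: 50 villages of 20 people, intracluster correlation 0.05.
print(design_effect(20, 0.05))
print(effective_sample_size(50, 20, 0.05))
```

In this example the 1,000 surveyed individuals carry about as much information as roughly 513 independent ones; standard errors are inflated by the square root of the design effect.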
Implications • It is extremely important to randomize an adequate number of groups. • Typically the number of individuals within groups matters less than the number of groups • Big increases in power usually happen only when the number of randomized groups increases • If you randomize at the level of the district, with one treated district and one control district, you have 2 observations!
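The trade-off between more groups and more individuals per group can be checked numerically with a minimum detectable effect (MDE) calculation. This is a sketch under assumptions not in the slides: equal-sized groups, equal numbers of groups per arm, a normal approximation, and the design effect 1 + (m - 1) * rho; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def mde_clustered(clusters_per_arm, cluster_size, icc,
                  sd=1.0, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect for a group-randomized trial.

    Hypothetical helper; with sd=1.0 the result is a standardized effect.
    """
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    deff = 1 + (cluster_size - 1) * icc              # design effect
    n_per_arm = clusters_per_arm * cluster_size
    se = sd * math.sqrt(2 * deff / n_per_arm)        # inflated standard error
    return z_total * se


base = mde_clustered(clusters_per_arm=20, cluster_size=20, icc=0.1)
more_people = mde_clustered(clusters_per_arm=20, cluster_size=40, icc=0.1)
more_groups = mde_clustered(clusters_per_arm=40, cluster_size=20, icc=0.1)
print(round(base, 3), round(more_people, 3), round(more_groups, 3))
```

Both changes double the total number of individuals surveyed, but doubling the number of groups lowers the MDE far more than doubling the number of individuals per group.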
Conclusions • Power calculations involve some guesswork • Sometimes we do not have the right information to conduct them properly • However, it is important to do them to: • Avoid launching studies that will have no power at all: a waste of time and money • Devote the appropriate resources to the studies that you decide to conduct (and not too much) • If you have a fixed budget, determine whether the project is feasible at all • Software: http://sitemaker.umich.edu/group-based/optimal_design_software