Sampling & Power Calculations in Randomized Field Experiments Craig McIntosh, IRPS/UCSD Prepared for The Asia Foundation Evaluation Workshop Singapore, March 25, 2010
Sampling: The purpose of sampling is to draw a study group that is representative of some known and interesting population. • For a program with a defined beneficiary group, you may not want to include ‘average’ individuals in the sample because they are irrelevant. • For a study to make claims about impacts on the population as a whole, it must track a sample representative of the population. So, how do you draw a random sample? Not so easy in countries without deep census data.
Drawing a random sample: Assume we don’t have an individual-level census data source. How can this be done? • Most expensive and best way: a ‘listing exercise’ goes through the entire village and gathers very basic demographic information on all households. • From this listing you can then sample households/individuals for detailed surveys using any rule, because the sample can be weighted back to the population using the listing data. • Can also pursue an ‘every third house’ type of rule, can work well but has some methodological problems (unintentional violations of random sampling, lack of weights, etc.). • Use a ‘knowledgeable informant’ to prepare village lists and then sample from these lists. • Cluster sampling.
Cluster sampling: This is a multi-tiered sampling rule, conducted as follows: • Using a data source that has basic information on population by geographic areas, randomly pick ‘Primary Sampling Units’ (PSUs) such that probability of selection is proportional to population. • Within the selected PSUs, you then conduct some type of listing exercise to understand the population within each PSU. • Then sample households from the listing so that the sample within each PSU is representative of the PSU itself. This creates a representative sample at the national level while minimizing both listing and logistical (travel) costs. However: • Not representative at a sub-national level, and • Sampling of PSUs may not line up well with the geographical rollout of a specific program.
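A minimal sketch of this two-stage logic in Python, assuming an invented list of villages with rough population counts (all names, counts, and sample sizes below are hypothetical):

```python
# Two-stage cluster sampling sketch: PSUs drawn with probability proportional to size (PPS),
# then a simple random sample of households from a listing within each selected PSU.
import random

random.seed(42)

villages = {"A": 1200, "B": 800, "C": 3000, "D": 500, "E": 1500}  # population per PSU (invented)

# Stage 1: pick PSUs with probability proportional to population. Drawing with replacement
# keeps the sketch short; real designs usually use systematic PPS without replacement.
names = list(villages)
psus = random.choices(names, weights=[villages[v] for v in names], k=2)

# Stage 2: within each selected PSU, build a household listing and draw a fixed-size
# simple random sample from it.
for psu in psus:
    listing = [f"{psu}-hh{i}" for i in range(villages[psu] // 5)]  # stub listing exercise
    sample = random.sample(listing, k=20)
    print(psu, sample[:3], "...")
```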
Sampling weights: A sample that is directly representative of a population can be used ‘unweighted’. However, often we choose to oversample certain kinds of units: • ‘Rare’ types who are of particular interest given the impact the program is trying to measure (politically engaged, entrepreneurs, at risk for disease, etc.) • For a voluntary program, we may want to oversample the kinds of people who look as if they will take up the program so that we observe as many of them as possible. Having oversampled certain units, we must use sampling weights to keep the sample representative of the population. This oversampling does not affect the expected value of the outcome, but it gives you a more precise (less noisy) measure of outcomes within these ‘of interest’ groups. Then, how many people do you need to track to measure impacts?
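As a small illustration of how weights restore representativeness after oversampling, here is a sketch with invented numbers, where a ‘rare’ group makes up 10% of the population but 50% of the sample:

```python
# Inverse-probability-of-selection weights after oversampling a rare group (numbers invented).
import numpy as np

rng = np.random.default_rng(0)

y_rare  = rng.normal(5.0, 1.0, size=500)   # oversampled 'of interest' group
y_other = rng.normal(2.0, 1.0, size=500)   # everyone else
y = np.concatenate([y_rare, y_other])

# Weight = population share / sample share for each group.
w = np.concatenate([np.full(500, 0.10 / 0.50), np.full(500, 0.90 / 0.50)])

print("unweighted mean:", y.mean())                  # pulled toward the oversampled group
print("weighted mean:  ", np.average(y, weights=w))  # close to 0.1*5 + 0.9*2 = 2.3
```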
Power & Significance. How many observations is ‘enough’? • Not a straightforward question to answer. • Even the simplest power calculation requires that you know the expected treatment effect (ETE), the variance of outcomes, and the treatment uptake percentage. • From here, you need to pick a ‘power’ κ (the probability that you reject when you should reject, and thus avoid Type II error), typically 0.80 or 0.90 (one-tailed). • Then, pick a ‘significance’ level α (the probability that you falsely reject when you should accept, and thus commit Type I error), typically 0.05 or 0.01 (two-tailed). With these you can calculate the minimum sample size as a function of the desired test power.
Minimum Sample Size: With power κ, significance level α, minimum detectable effect β, outcome variance σ², and treated share P, the minimum total sample size is N = (z_power + z_significance)² σ² / (P(1−P) β²), where z_power is the standard normal quantile at the desired power (e.g. 0.84 for 80% power) and z_significance is the critical value for the chosen significance level (e.g. 1.96 for a two-tailed 5% test). And so you can get away with a smaller sample size if you have: • High expected treatment effects • Low variance outcomes • Treatment & control groups of similar sizes (P = .5) • A willingness to accept low significance and power.
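A short sketch of this calculation using normal critical values; the function name and every parameter value below are illustrative assumptions, not recommendations:

```python
# Minimum total sample size: N = (z_power + z_significance)^2 * sd^2 / (p*(1-p) * effect^2)
from scipy.stats import norm

def min_sample_size(effect, sd, p=0.5, alpha=0.05, power=0.80):
    """Total N needed to detect `effect`, with outcome std dev `sd` and treated share `p`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # two-tailed significance, one-tailed power
    return (z ** 2) * (sd ** 2) / (p * (1 - p) * effect ** 2)

# Example: a 0.2 standard-deviation effect with a 50/50 treatment/control split.
print(round(min_sample_size(effect=0.2, sd=1.0)))   # about 785 in total
```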
How to Determine Effect Size • What is the smallest effect that would justify adopting the program (in terms of cost-benefit)? • This sets the minimum effect size we would want to be able to detect. • Common danger: using an effect size that is too optimistic leads to too small a sample size. • How large an effect you can detect with a given sample depends on how variable the outcome is. • Example: If all children have very similar diarrhea prevalence without a program, a very small impact will be easy to detect. • The standardized effect size is the effect size divided by the standard deviation of the outcome. • Common effect sizes are: .20 (small); .40 (medium); .50 (large)
Power & Significance: The left-hand curve is the distribution of beta hat under the null that the true effect is zero; the right-hand curve is the distribution of beta hat if the true effect size is beta. Significance is the area in the right tail of the left-hand (null) curve beyond the critical value; power is the area of the right-hand curve beyond that same critical value, so its left tail below the critical value is the Type II error. (source: Duflo & Kremer ‘Toolkit’)
Design Factors to Take into Account • Availability of a Baseline • A baseline can help reduce the needed sample size since it: • Removes some variability in the data, increasing precision • Can be used to stratify and create subgroups • The level of randomization • Whenever treatment occurs at a group level, this reduces power relative to randomization at the individual level
Clustered Treatment Designs: It is often natural to implement randomization at a more aggregated unit than the one at which data are available. • Examples: • School- or village-level randomization of programs studied using students • Market- or city-level tests of political messages studied using voters • Police precinct interventions studied using individual surveys Conducting randomization at a more aggregated level creates a loss of power (relative to doing it at the individual level) called the ‘design effect’, which is strongest when outcomes are already correlated within those aggregate units. Upshot: the power of the test has more to do with the number of units over which you randomize than with the number of units in the study.
Clustered Treatment Designs: Look at the difference in the ‘minimum detectable effect’: • Without a clustered design: MDE = (z_power + z_significance) √(σ² / (P(1−P) N)) • With a clustered design: MDE = (z_power + z_significance) √(1 / (P(1−P) J)) √(ρ + (1−ρ)/n) σ (J is the number of equally-sized clusters, ρ is the intra-cluster correlation, and n is the number of observations per cluster.) As ρ goes to 0, the design effect disappears; as ρ goes to 1, power comes from J only.
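A small sketch of the clustered formula, holding the total sample fixed at 2,000 units while varying the number of clusters; the helper name and all parameter values are assumptions made for illustration:

```python
# Minimum detectable effect with J equally-sized clusters of n units and intra-cluster
# correlation `icc`; equivalent to the unclustered MDE inflated by the design effect.
from scipy.stats import norm

def mde_clustered(J, n, icc, sd=1.0, p=0.5, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    design_effect = 1 + (n - 1) * icc                  # = 1 when icc = 0, = n when icc = 1
    return z * sd * (design_effect / (p * (1 - p) * J * n)) ** 0.5

print(mde_clustered(J=20,  n=100, icc=0.05))   # few large clusters  -> larger MDE (~0.31)
print(mde_clustered(J=200, n=10,  icc=0.05))   # many small clusters -> smaller MDE (~0.15)
```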
Implications • It is extremely important to randomize an adequate number of groups. • Typically the number of individuals within groups matters less than the number of groups. • Big increases in power usually happen only when the number of groups that are randomized increases. • If you randomize at the level of the district, with one treated district and one control district, you have 2 observations!
Power calculations in practice: • Use software! Numerous free programs exist on the internet: • ‘Optimal Design’ • http://sitemaker.umich.edu/group-based/optimal_design_software • ‘Edgar’ • http://www.edgarweb.org.uk/ • ‘G*Power’ • http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/ Many of these use medical rather than social-science descriptions of the statistical parameters and can be confusing to use. Make sure to use a power calculator that can handle clustered designs if your study units are not the same as the intervention units. Reality check: You very frequently face a sample size constraint for logistical reasons, and then power calculations are relevant only to tell you the probability that you’ll detect anything.
Conducting the Actual Randomization. In many ways simpler than it seems. Easy to use random-number generator in Excel or Stata. • For straight one-shot randomizations: • List the ids of the units in the randomization frame. • Generate a random number for each unit in the frame. • Sort the list by the random number. • Flip a coin for whether the first or the second unit on the list goes into the treatment or the control. • Assign every other unit to the treatment. • Enjoy!
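A minimal sketch of that one-shot procedure, assuming an invented frame of 100 unit ids:

```python
# One-shot randomization: assign a random number to each unit, sort, flip a coin for the
# first unit's status, then alternate down the list.
import random

random.seed(2010)

ids = [f"unit_{i:03d}" for i in range(1, 101)]             # randomization frame (invented)
draws = {u: random.random() for u in ids}                   # random number per unit
ordered = sorted(ids, key=lambda u: draws[u])               # sort by the random number

first_is_treated = random.random() < 0.5                    # the coin flip
assignment = {u: ("treatment" if (i % 2 == 0) == first_is_treated else "control")
              for i, u in enumerate(ordered)}                # alternate assignment

print(sum(v == "treatment" for v in assignment.values()), "treated out of", len(ids))
```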
Stratification & Blocking. Why might you not want to do a single-shot randomization? Imagine that you have a pre-observable continuous covariate X which you know to be highly correlated with outcomes. • Why rely on chance to make the treatment orthogonal to this X? You can stratify across this X to generate an incidence of treatment which is orthogonal to this variable by construction. What if you have a pre-observable discrete covariate which you know to be highly correlated with outcomes, or if you want to be sure to be able to analyze treatment effects within cells of this discrete covariate? • You can Block on this covariate to guarantee that each subgroup has the same treatment percentage as the sample as a whole. The expected variance of a stratified or blocked randomized estimator cannot be higher than the expected variance of the one-shot randomized estimator.
Conducting Stratified or Blocked Randomizations. Again, not very hard to do. • For stratified or blocked randomizations: • Take a list of the units in the randomization frame. • Generate a random number for each unit in the frame. • Sort the list by stratification or block criterion first, then by the random number. • Flip a coin for whether you assign the first unit on the list to the treatment or the control. • Then alternate treatment status for every unit on the list; this generates p=.5. • For multiple strata or blocks: • Sort by multiple strata or blocks and then by the random number last, then follow same as above.
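A sketch of the blocked version, assuming an invented blocking variable with three levels; because assignment alternates within the sorted list, the treated share within each block can differ from one half by at most one unit:

```python
# Blocked randomization: sort by the blocking variable first, then by a random number,
# flip a coin for the first unit, and alternate treatment status down the list.
import random

random.seed(7)

units = [{"id": i, "block": random.choice(["north", "south", "east"])} for i in range(60)]
for u in units:
    u["draw"] = random.random()

units.sort(key=lambda u: (u["block"], u["draw"]))            # block first, random number last

start_treated = random.random() < 0.5                        # coin flip for the first unit
for i, u in enumerate(units):
    u["arm"] = "treatment" if (i % 2 == 0) == start_treated else "control"

for b in ["east", "north", "south"]:
    treated = sum(u["arm"] == "treatment" for u in units if u["block"] == b)
    total = sum(u["block"] == b for u in units)
    print(b, treated, "treated of", total)
```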
Post-randomization tests: Quite common for researchers to write loops which conduct randomizations many times, check for balance across numerous characteristics, keep re-running until balance across a pre-specified set of statistics has been realized. Debate exists over this practice. Of course, generates a good-looking t-table of baseline outcomes. Functions like a multi-dimensional stratification criterion. However: • T-tests of difference based on a single comparison of means are no longer correct, and • It is not easy to see how to correct for the research design structure in the estimation of treatment effects (Bruhn & McKenzie, 2008).
Experimental Estimation of Spillover Effects: In the presence of spillover effects, the simple treatment-control difference no longer gives the correct treatment effect. Spillover effects cause trouble for designs where the treatment saturation is blocked, but there are a couple of easy ways to use or create variation to measure them directly. • Miguel & Kremer, ‘Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities’. • Deworming program randomized at the school level • Controlling for the number of pupils within a given distance of an untreated unit, they look at how outcomes change as a function of the number of these pupils that were randomly assigned to treatment. • Because treatment is randomized, localized intensity of treatment is incidentally randomized. • Baird, McIntosh, & Özler, ‘Schooling, Income, & HIV Risk in Malawi’. • Conditional Cash Transfer Program run at the Village level • Saturation of treatment in each village directly randomized, so we can compare untreated girls in treatment villages to the control as a function of the share of girls in the corresponding village that were treated.
Randomizing things that alter the probability of treatment: Local Average Treatment Effects. Angrist, Imbens, and Rubin, ‘Identification of Causal Effects Using Instrumental Variables’, JASA 1996: • Paper tests for the health impact of serving in Vietnam using the Vietnam draft lottery number as an instrument for the probability of serving. • The identification problem is the endogeneity of the decision to serve in Vietnam. • The draft lottery number is a classic instrument in the sense that it has no plausible connection with subsequent health outcomes other than through military service and is not itself caused by military service (exclusion), and strongly drives participation (significance in the first stage). • The only atypical element is that the first-stage outcome is binary. • LATE estimates a quantity similar to the TET (treatment effect on the treated). • Rather than giving a treatment effect for the whole sample, it gives the treatment effect on those units induced to take the treatment by the instrument (the compliers). • LATE can’t capture the impact of serving on those who would have served no matter what, nor on those who dodged the draft. • It therefore provides a very clean treatment effect on a very unclear subset of the sample: you do not know which units identify the treatment effect.
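To make the mechanics concrete, here is a hedged sketch of the simple Wald version of this estimator on simulated data; the instrument, the shares of always-takers, compliers, and never-takers, and the effect size are all invented:

```python
# Wald/LATE estimator: reduced-form effect of the instrument Z on the outcome Y, divided
# by the first-stage effect of Z on treatment take-up D. All data here are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.integers(0, 2, n)                                   # randomized instrument (e.g. lottery)
types = rng.choice(["always", "complier", "never"], size=n, p=[0.2, 0.3, 0.5])
d = ((types == "always") | ((types == "complier") & (z == 1))).astype(int)

y = 1.0 * d + rng.normal(0, 1, n)                           # true treatment effect = 1.0

late = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print("LATE estimate:", round(late, 2))                     # close to 1.0, the complier effect
```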
Application of LATE in the field: In principle, a very attractive method: • You don’t have to randomize access to the treatment, just something that alters the probability of taking the treatment. • This opens up the idea that promotions, publicity, and pricing could be randomized in simple ways to gain experimental identification. Problem in practice: • It turns out to be pretty tough to promote programs well enough to gain first-stage significance. • Many field researchers have tried LATE promotions in recent years; very few papers are coming out because the promotions have not been sufficiently effective. • Even if they work, you need to ask whether this newly promoted group represents a sample over which you want treatment effects. Sometimes it does, sometimes it doesn’t.
Internal vs. External Validity in Randomized Trials: Randomized field researchers tend to be meticulous about internal validity, somewhat dismissive of external validity. • To some extent, what can you say? You want to know whether these results would hold elsewhere? Then replicate them there! • However, this is an attitude much excoriated by policymakers: Just tell us what works and stop asking for additional research money! So, given that it is always difficult to make claims about external validity from randomized trials, what can you do? • The more ‘representative’ the study sample is of a broader population, the better. • The more heterogeneous the study sample is, the more ability you have to (for example) reweight the sample to look like the population and therefore use the internal variation to project the external variation. • Beware of controlling the treatment in a randomized trial to such a perfect extent that it stops resembling what the project would actually look like when implemented in the field. Remember that your job is to analyze the actual program, ‘warts and all’.
Conclusions • Sampling and sample size are questions that must be considered in tandem. • Power calculations involve some guesswork • Sometimes we do not have the right information to conduct them properly • However, it is important to do them to: • Avoid launching studies that will have no power at all: a waste of time and money • Devote the appropriate resources to the studies that you decide to conduct (and not too much) • If you have a fixed budget, they can determine whether the project is feasible at all • Lots of good software now exists to help us do this.