Practical Sampling for Impact Evaluations

Vincenzo di Maro

Introduction • How do we construct a sample to credibly detect a meaningful effect? • Which populations or groups are we interested in and where do we find them? • How many people/firms/units should be interviewed/observed from that population? • How does this affect the evaluation budget?

Outline • Sampling frame • What populations or groups are we interested in? • How do we find them? • Sample size • Why it is so important: confidence in results • Determinants of appropriate sample size • Further issues • Examples • Budgets

Sampling frame • Who are we interested in? • (a) All SMEs? • (b) All formal SMEs? • (c) All formal SMEs in a particular sector? • (d) All formal SMEs in a particular region? • Need to keep in mind external validity • Can findings from population (c) inform appropriate programs to help informal firms in a different sector? • Can findings from population (d) inform national policy? • But should also keep in mind feasibility and what you want to learn • Might not be possible or desirable to pilot a very broadly defined program or policy

Sampling frame: Finding the units we’re interested in • Depends on size and type of experiment • Lottery among applicants • Example: BDS program among informal firms in a particular area • Can use treatment and comparison units from applicant pool • If not feasible (50,000 get the treatment), need to draw a sample to measure impact • Policy change • Example: A change in business registration rules in randomly selected districts • To measure impact on profits, cannot sample all informal businesses in treatment and comparison districts. • Will need to draw a sample of firms within districts. • Required information before sampling • Complete listing of all units of observation available for sampling in each area or group • Tricky for units like informal firms, but there are techniques to overcome this


Sample size and confidence • Start with a simpler question than program impact • Say we wanted to know the average annual profits of an SME in Rio. • Option 1: We go out and track down 5 business owners and take the average of their responses. • Option 2: We track down 1,000 business owners and average their responses. • Which average is likely to be closer to the true average?

Sample size and confidence • [Figure: two panels, "5 firms" and "1,000 firms"]
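The comparison between 5 and 1,000 business owners can be illustrated with a small simulation. This is a minimal sketch, not the analysis behind the slide: the log-normal profit distribution, its parameters, and the function name `sample_mean_spread` are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_mean_spread(sample_size, n_draws=200, seed=0):
    """Draw many samples of a given size and return the spread (standard
    deviation) of the resulting sample averages."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        # Hypothetical profit distribution: log-normal, i.e. right-skewed.
        sample = [rng.lognormvariate(10, 1) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

spread_5 = sample_mean_spread(5)
spread_1000 = sample_mean_spread(1000)
print(f"Spread of the average with 5 firms:     {spread_5:,.0f}")
print(f"Spread of the average with 1,000 firms: {spread_1000:,.0f}")
```

The averages from 1,000 firms cluster far more tightly around the true average than the averages from 5 firms, which is the sense in which the larger sample is "likely to be closer".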

Sample size and confidence • Similarly, when determining program impact • Need many observations to say with confidence whether average outcome of treatment group is higher/lower than in comparison group • What do I mean by confidence? • Minimizing statistical error • Types of errors • Type 1 error: You say there is a program impact when there really isn’t one. • Type 2 error: There really is a program impact but you cannot detect it.

Sample size and confidence • Type 1 error: Find program impact when there’s none • Error can be minimized after data collection, during statistical analysis • Need to adjust the significance levels of impact estimates (e.g. 99% or 95% confidence intervals) • Type 2 error: Cannot see that there really is a program impact • In jargon: statistical test has low power • Error must be minimized before data collection • Best method of doing this: ensuring you have a large enough sample • Whole point of an impact evaluation is to learn something • Ex ante: We don’t know how large the impact of this program is • Low powered ex-post: This program might have increased firms’ profits by 50% but we cannot distinguish a 50% increase from an increase of zero with any confidence

Calculating sample size • The formula: • Main things to be aware of: • Detectable effect size • Probability of type 1 and 2 errors • Variance of outcome(s) • Units (firms, banks) per treated area
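The formula itself did not survive in this text. For orientation only, a commonly used textbook form of the minimum detectable effect for an individually randomized design is sketched below. It is an assumption that this matches the slide; the symbols (N, P, sigma, alpha, beta, m, rho) are notation introduced here.

```latex
\text{MDE} = \left( z_{1-\alpha/2} + z_{1-\beta} \right)
             \sqrt{\frac{\sigma^{2}}{P\,(1-P)\,N}}
% N      : total sample size
% P      : share of the sample assigned to treatment
% \sigma^2 : variance of the outcome
% \alpha : probability of a type 1 error (significance level)
% \beta  : probability of a type 2 error (power = 1 - \beta)
%
% If treatment is assigned by area, with m units per area and
% intra-cluster correlation \rho, the variance term is multiplied by
% the design effect 1 + (m - 1)\rho.
```

Each item in the list above maps to one term: the detectable effect is the left-hand side, the error probabilities enter through the two z values, the outcome variance through sigma squared, and units per treated area through the design effect.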

Calculating sample size • Smallest detectable effect size • Smallest effect you want to be able to distinguish from zero • A 30% increase in sales, a 25% decrease in bribes paid • Larger samples → easier to detect smaller effects • Do female and male entrepreneurs work similar hours? • Claim: On average, women work 40 hours/week, men work 44 hours/week • If statistic came from sample of 10 women & 10 men • Hard to say if they are different • Would be easier to say they are different if women work 30 hours/week and men work 80 hours/week • But if statistic came from sample of 500 women and 500 men • More likely that they truly are different
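The working-hours example can be made concrete with a rough calculation. This is a sketch under stated assumptions: the slides give no standard deviation, so the 15 hours/week used here is hypothetical, and the normal approximation is crude for groups of 10.

```python
import math
from statistics import NormalDist

def two_sided_p_value(mean_a, mean_b, sd, n_per_group):
    """Approximate p-value for a difference in means between two equal-sized
    groups with a common standard deviation (normal approximation)."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_a - mean_b) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

ASSUMED_SD = 15  # hours/week; hypothetical, not from the slides

cases = [
    ("40 vs 44 hours, 10 per group", 40, 44, 10),
    ("30 vs 80 hours, 10 per group", 30, 80, 10),
    ("40 vs 44 hours, 500 per group", 40, 44, 500),
]
for label, women, men, n in cases:
    p = two_sided_p_value(women, men, ASSUMED_SD, n)
    print(f"{label}: p = {p:.4f}")
```

Under these assumptions the 4-hour gap cannot be told apart from zero with 10 people per group, but it can with 500 per group, while the 50-hour gap is detectable even in the small sample.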

Calculating sample size • How do you choose the smallest detectable effect size? • Smallest effect that would prompt a policy response • Smallest effect that would allow you to say that a program was not a failure • This program significantly increased sales by 40%. • Great - let’s think about how we can scale this up. • This program significantly increased sales by 10%. • Great….uh..wait: we spent all of that money and it only increased sales by that much?

Calculating sample size • Type 1 and Type 2 errors • Type 1 • Significance level of estimates usually set to 1% or 5% • 1% or 5% probability that there is no effect but we think we found one • Type 2 • Power usually set to 80% or 90% • 20% or 10% probability that there is an effect but we cannot detect it • Larger samples → higher power
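To see what the choice of significance level and power costs in sample size, the sketch below compares the four combinations mentioned above. It assumes a two-sided test and the usual normal approximation, in which required sample size is proportional to the squared sum of two z values; the function name `error_multiplier` is hypothetical.

```python
from statistics import NormalDist

def error_multiplier(alpha, power):
    """Squared sum of the z values for a two-sided test at level alpha and the
    chosen power. Required sample size is proportional to this number."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2

baseline = error_multiplier(0.05, 0.80)
for alpha in (0.05, 0.01):
    for power in (0.80, 0.90):
        ratio = error_multiplier(alpha, power) / baseline
        print(f"significance {alpha:.0%}, power {power:.0%}: "
              f"{ratio:.2f}x the sample needed at 5% / 80%")
```

Moving from 5% significance and 80% power to 1% and 90% roughly doubles the required sample, holding effect size and variance fixed.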

Calculating sample size • Variance of outcomes • Less underlying variance → easier to detect difference → can have lower sample size

Calculating sample size • Variance of outcomes • How do we know this before we decide our sample size and collect our data? • Ideal pre-existing data often... non-existent • Can use pre-existing data from a similar population • Example: Enterprise Surveys, labor force surveys • Makes this a bit of guesswork, not a foolproof exercise • Use as a guide

Further issues • Multiple treatment arms • Group-disaggregated results • Take-up • Data quality

Further issues • Multiple treatment arms • Straightforward to compare each treatment separately to the comparison group • To compare treatment groups requires very large samples • Especially if treatments very similar, differences between the treatment groups would be smaller • In effect, it’s like fixing a very small detectable effect size • Group-disaggregated results • Are effects different for men and women? For different sectors? • If genders/sectors expected to react in a similar way, then estimating differences in treatment impact also requires very large samples
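A back-of-the-envelope sketch of why comparing treatment arms with each other is so demanding: holding variance, significance and power fixed, required sample size scales with one over the squared effect. The effect sizes below (20 and 15 percentage points) and the function name `relative_sample_size` are hypothetical.

```python
def relative_sample_size(reference_effect, target_effect):
    """How many times larger the sample must be to detect target_effect
    instead of reference_effect, all else equal (n scales with 1/effect^2)."""
    return (reference_effect / target_effect) ** 2

effect_a = 20  # hypothetical: treatment A raises the outcome by 20 points
effect_b = 15  # hypothetical: treatment B raises the outcome by 15 points
gap_between_arms = effect_a - effect_b

multiple = relative_sample_size(effect_a, gap_between_arms)
print(f"Comparing A with B needs about {multiple:.0f}x the sample "
      f"needed to compare A with the comparison group")
```

The same logic applies to group-disaggregated results: if men and women respond similarly, the difference in their treatment effects is small, and detecting it needs a correspondingly large sample.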

Who is taller? • Detecting smaller differences is harder

Further issues • Group-disaggregated results • To ensure balance across treatment and comparison groups, good to divide sample into strata before assigning treatment • Strata • Sub-populations • Common strata: geography, gender, sector, initial values of outcome variable • Treatment assignment (or sampling) occurs within these groups

Why do we need strata? • Geography example • [Figure: map of units marked T (treatment) and C (comparison)]

Why do we need strata? • What’s the impact in a particular region? • Sometimes hard to say with any confidence

Why do we need strata? • Random assignment to treatment within geographical units • Within each unit, ½ will be treatment, ½ will be comparison. • Similar logic for gender, industry, firm size, etc
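Random assignment within strata can be sketched as follows. This is a minimal illustration, not a prescribed protocol: the firm identifiers, the regions, and the function name `assign_within_strata` are made up, and the rule that the extra unit in an odd-sized stratum goes to the comparison group is an arbitrary choice.

```python
import random
from collections import defaultdict

def assign_within_strata(units, stratum_of, seed=0):
    """Randomly assign half of the units in each stratum to treatment ("T")
    and the rest to comparison ("C")."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    assignment = {}
    for stratum in sorted(by_stratum):  # fixed order keeps results reproducible
        members = by_stratum[stratum]
        rng.shuffle(members)
        half = len(members) // 2  # with an odd count, the extra unit goes to C
        for position, unit in enumerate(members):
            assignment[unit] = "T" if position < half else "C"
    return assignment

# Hypothetical listing of 12 firms in three regions.
regions = ["North"] * 4 + ["South"] * 6 + ["East"] * 2
region_of = {f"firm{i:02d}": region for i, region in enumerate(regions)}

assignment = assign_within_strata(list(region_of), region_of.get)
for firm in sorted(assignment):
    print(firm, region_of[firm], assignment[firm])
```

Because the draw happens separately inside each region, every region ends up with treatment and comparison firms, so a region-level impact estimate is always possible. The same function works for other strata by passing a different `stratum_of`.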

Further issues • Take-up • Low take-up increases detectable effect size • Can only find an effect if it is really large • Effectively decreases sample size • Example: Offering matching grants to SMEs for BDS services • Offer to 5,000 firms • Only 50 participate • Probably can only say there is an effect on sales with confidence if they become Fortune 500 companies
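One way to see why low take-up is so costly, sketched under the assumption that the program only affects firms that actually participate. The symbols c, ITT, TOT and N are notation introduced here, not taken from the slides.

```latex
\text{ITT} = c \cdot \text{TOT},
\qquad
N_{\text{required}} \;\propto\; \frac{1}{\text{ITT}^{2}}
                    \;=\; \frac{1}{c^{2}\,\text{TOT}^{2}}
% c   : take-up rate (share of firms offered the program that participate)
% TOT : average effect on firms that participate
% ITT : average effect across all firms offered the program
%
% Matching-grant example: c = 50 / 5000 = 0.01, so 1 / c^2 = 10{,}000.
```

With 50 participants out of 5,000 offers, the average effect across offered firms is one hundredth of the effect on participants, so the sample needed to detect it is on the order of 10,000 times larger than with full take-up.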

Further issues • Data quality • Poor data quality effectively increases required sample size • Missing observations • Increased noise • Can be partly addressed with field coordinator on the ground monitoring data collection

Example from Ghana • Calculations can be made in many statistical packages – e.g. Stata, Optimal Design • Experiment in Ghana designed to increase the profits of microenterprises • Baseline profits: 50 cedi per month • Profits data typically noisy, so a coefficient of variation >1 is common • Example Stata code to detect a 10% increase in profits with 80% power: • sampsi 50 55, p(0.8) pre(1) post(1) r1(0.5) sd1(50) sd2(50) • Having both a baseline and endline decreases required sample size (pre and post)

Example from Ghana • Results • 10% increase (from 50 to 55): 1,178 firms in each group • 20% increase (from 50 to 60): 295 firms in each group. • 50% increase (from 50 to 75): 48 firms in each group (But this effect size not realistic) • What if take-up is only 50%? • Offer business training that increases profits by 20%, but only half the firms do it. • Mean for treated group = 0.5*50 + 0.5*60 = 55 • Equivalent to detecting a 10% increase with 100% take-up → need 1,178 in each group instead of 295 in each group
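The figures above can be reproduced outside Stata with a short calculation. This is a sketch under assumptions, not a reimplementation of `sampsi`: it uses the normal approximation for a two-sided test at 5% significance (assumed to be the default used in the example) and applies the standard adjustment for one baseline and one follow-up round, which scales the follow-up-only sample by one minus the squared baseline-endline correlation. The function name `firms_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def firms_per_group(control_mean, treated_mean, sd,
                    baseline_corr=0.5, alpha=0.05, power=0.80):
    """Firms needed in each group to detect the difference in means, with one
    baseline and one follow-up round (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    effect = treated_mean - control_mean
    n_followup_only = 2 * (z * sd / effect) ** 2
    # Controlling for the baseline outcome removes the share of variance
    # explained by it, which is the squared baseline-endline correlation.
    return math.ceil(n_followup_only * (1 - baseline_corr ** 2))

for treated_mean in (55, 60, 75):
    n = firms_per_group(50, treated_mean, sd=50)
    print(f"50 -> {treated_mean}: {n:,} firms in each group")

# 50% take-up of a program that raises profits from 50 to 60:
diluted_mean = 0.5 * 50 + 0.5 * 60
print(f"50% take-up: {firms_per_group(50, diluted_mean, sd=50):,} firms in each group")
```

Setting `baseline_corr=0` in this sketch gives the sample needed with an endline survey only, which shows how much the baseline round saves.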


Budgets • What is required? • Data collection • Survey firm • Data entry • Field coordinator to ensure treatment follows randomization protocol and to monitor data collection • Data analysis

Budgets • How much will all of this cost? • Huge range. Often depends on • Length of survey • Ease of finding respondents • Spatial dispersion of respondents • Security issues • Formal vs informal firms • Required human capital of enumerator • Et cetera... • Firm-level survey data: $40-$350/firm • Household survey data: $40+/household • Field coordinator: $10,000-$40,000/year • Depends on whether you can find a local hire • Administrative data: Usually free • Sometimes has limited outcomes, can miss most of the informal sector
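A rough sketch of how sample size feeds into the survey budget, combining the per-firm cost range above with the sample sizes from the Ghana example. It assumes two groups, two survey rounds (baseline and endline), and that the per-firm cost applies to each round; those assumptions and the function name `survey_cost` are not from the slides, and field coordinator, data entry and analysis costs are left out.

```python
def survey_cost(firms_per_group, cost_per_firm, groups=2, survey_rounds=2):
    """Total data collection cost, assuming the per-firm cost is paid for
    every firm in every survey round."""
    return firms_per_group * groups * survey_rounds * cost_per_firm

LOW_COST, HIGH_COST = 40, 350  # dollars per firm, from the range above

for label, n in [("10% increase", 1178), ("20% increase", 295)]:
    low = survey_cost(n, LOW_COST)
    high = survey_cost(n, HIGH_COST)
    print(f"Detecting a {label}: ${low:,} to ${high:,}")
```

Under these assumptions, being able to detect a 10% rather than a 20% increase roughly quadruples the data collection bill.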

Budgets • Money can buy power!

Summing up • The sample size of your impact evaluation will determine how much you can learn from your experiment • Some judgment and guesswork in calculations but important to spend time on them • If sample size is too low: waste of time and money because you will not be able to detect a non-zero impact with any confidence • If little effort put into sample design and data collection: See above. • Questions?
