Practical Sampling for Impact Evaluations

Marie-Hélène Cloutier

Introduction Ideally, we want to compare what happens to the same schools with and without the program. But this is impossible → use statistics. • Define treatment and control groups • Compare the mean value of the outcome (e.g. test scores) • Random assignment ensures comparability but does not remove noise… How big should the groups be, and how should we select them? Warning! • The goal is to give an overview of how sampling features affect what it is possible to learn from an impact evaluation • Not to make you a sampling expert or give you a headache

Introduction Sampling frame - Representativeness / external validity Which populations or groups are we interested in and where do we find them? Sample size - Groups large enough to credibly detect a meaningful effect How many people/schools/units should be interviewed/observed from that population?

Sampling frame • Census vs. sample? • Sample: lower cost, faster data collection (avoids capturing dynamics), and smaller data set (improved data quality) • Who are we interested in? Depends on feasibility and on what you want to learn • (a) All schools? • (b) All public schools? • (c) All public primary schools? • (d) All public primary schools in a particular region? • External validity • Can findings from a sample of population (c) inform appropriate programs to help secondary schools? • Can findings from a sample of population (d) inform national policy?

Sampling frame Finding the units we’re interested in • Depends on the size and type of experiment • Required information before sampling • Complete listing of all units of observation available for sampling in each area or group


Sample size and confidence Example: a simpler question than program impact • Say we wanted to know the average annual expenses of a school • Option 1: We go out and interview 5 randomly selected headmasters and take the average of their responses. • Option 2: We interview 1,000 randomly selected headmasters and average their responses. Which average is likely to be closer to the true average? Why? • With an impact evaluation (IE), we need many observations to say with confidence whether the average outcome in the treatment group is higher or lower than the average outcome in the control group
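The intuition behind the headmaster example can be written as a standard-error calculation. This is a sketch that assumes simple random sampling from a large population; the symbols σ (standard deviation of school expenses) and n (number of headmasters interviewed) are introduced here for illustration and do not appear in the slides.

```latex
\[
\operatorname{SE}(\bar{y}) = \frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\operatorname{SE}_{n=5}}{\operatorname{SE}_{n=1000}}
= \sqrt{\frac{1000}{5}} = \sqrt{200} \approx 14
\]
```

Under these assumptions, the average from 1,000 interviews is typically about 14 times closer to the true average than the average from 5 interviews.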

Calculating Sample Size There is a formula… Main things to be aware of: • Detectable effect size • Probability of type 1 error (significance) • Probability of type 2 error (1 – power) • Variance of outcome(s)
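The slides do not spell the formula out. One common textbook version is given here as an assumption about which formula is meant: it is the normal-approximation formula for comparing the means of two equal-sized groups under individual randomization, with a two-sided test. Here n is the number of units per group, δ the detectable effect size, σ² the variance of the outcome, α the probability of a type 1 error, and 1 − β the power.

```latex
\[
n \;=\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}}
\]
```

Each of the four ingredients listed above enters directly: a smaller δ, a smaller α, a higher power, or a larger σ² all increase n.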

Calculating Sample Size Detectable Effect Size • What is an effect size? The extent to which the intervention affects the outcome of interest. E.g. 10% increase in test scores, 25% increase in completion rate • A smaller effect is harder to capture (detect)

Calculating Sample Size Detectable Effect Size Who is taller? Detecting smaller differences is harder

Calculating Sample Size Detectable Effect Size • Larger samples → easier to detect smaller effects • E.g. Are test scores different in schools where teachers receive a bonus than in schools where they do not?
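To see how quickly the required sample grows as the detectable effect shrinks, here is a minimal Python sketch. It assumes the usual normal-approximation formula for comparing two means with equal-sized groups and individual randomization; the function name `n_per_arm` and the standard deviation of 20 points are hypothetical choices for illustration, not values from the slides.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Units needed in each group to detect a difference in means of `effect`.

    Assumes a two-sided test, equal group sizes, and individual randomization.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # critical value for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / effect ** 2)


# Hypothetical outcome: test scores with a standard deviation of 20 points.
for effect in (10, 5, 2, 1):
    print(f"effect = {effect:>2} points -> {n_per_arm(effect, sd=20)} units per group")
```

Because the effect size enters squared in the denominator, halving the detectable effect roughly quadruples the required sample.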

Calculating Sample Size Detectable Effect Size How to determine the detectable effect size? • Smallest effect that would prompt a policy response • Smallest cost-effective effect E.g. Constructing toilets for girls • Significantly ↑ girls’ access by 10%. • Great, let’s think about how we can scale this up. • Significantly ↑ girls’ access by 0.5%. • Great… uh… wait: we spent all of that money and it only increased access by that much?

Calculating Sample Size Type 1 and Type 2 errors Minimize 2 types of statistical error: • Type 1 error → repeating/continuing a bad program • Minimized after data collection, during analysis • Type 2 error → stopping/not scaling up a good program • Minimized before data collection

Calculating Sample Size Type 1 and Type 2 errors • Type 1: significance • Lower significance level → larger samples • Common levels: α = 1% or α = 5% • 1% or 5% probability that there is no effect but we think we found one • 1 – Type 2: power • Higher power → larger samples • Common levels: 1 – β = 80% or 1 – β = 90% • 20% or 10% probability that there is an effect but we cannot detect it
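A short worked calculation shows how much these choices matter. It assumes the usual normal-approximation formula for comparing two means with a two-sided test, in which the required sample size is proportional to the squared sum of two standard normal critical values; the rounded critical values below are standard table values.

```latex
\[
n \;\propto\; \bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}
\]
\[
\begin{aligned}
\alpha = 5\%,\ 1-\beta = 80\%: &\quad (1.960 + 0.842)^{2} \approx 7.85 \\
\alpha = 5\%,\ 1-\beta = 90\%: &\quad (1.960 + 1.282)^{2} \approx 10.51 \\
\alpha = 1\%,\ 1-\beta = 80\%: &\quad (2.576 + 0.842)^{2} \approx 11.68 \\
\alpha = 1\%,\ 1-\beta = 90\%: &\quad (2.576 + 1.282)^{2} \approx 14.88
\end{aligned}
\]
```

Under these assumptions, moving from α = 5% and 80% power to α = 1% and 90% power roughly doubles the required sample (14.88 / 7.85 ≈ 1.9), holding the effect size and variance fixed.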

Calculating Sample Size Variance in Outcome Less underlying variance → easier to detect a difference → smaller sample

Calculating Sample Size Variance in Outcome • How do we know the variance before we decide our sample size and collect our data? • Ideal pre-existing data are often… non-existent • Examples of possible sources: EMIS, school census, national assessment • Can use pre-existing data from a similar population • Makes this a bit of guesswork, not an exact science

Further Issues • Multiple treatment arms • Group-disaggregated results • Clustered design • Stratification

Further Issues 1. Multiple Treatment Arms • Straightforward to compare each treatment separately to the comparison group • To compare multiple treatment groups with each other → larger samples • Especially if the treatments are very similar, because differences between treatment groups would be smaller • Like fixing a very small detectable effect size • E.g. Distinguishing between two scholarship amounts

Further Issues 2. Group-Disaggregated Results • Are effects different for men and women? For different grades? • Estimating differences in treatment impacts (heterogeneous effects) → larger samples • Especially if the groups are expected to react in a similar way

Further Issues 3. Clustered design • Sampling units are clusters rather than individuals • Very common in education: the outcome of interest is at the student level but the sampling/randomization units are villages/schools/classrooms • Example: Impact of teacher training on student test scores

Further Issues 3. Clustered design Why? • Minimize or remove contamination • E.g. In the deworming program, the school was chosen as the unit because worms are contagious • Basic feasibility/political considerations • E.g. School feeding: cannot include and exclude different students from the same school • Only natural choice • E.g. Any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)

Further Issues 3. Clustered design Implications of clustering • Outcomes for all the individuals within a unit may be correlated • All villagers are exposed to the same weather • All students share a schoolmaster • The program affects all students at the same time • The members of a village interact with each other • The sample size needs to be adjusted for this correlation • More correlation between outcomes → larger sample • Adequate number of groups!!! (often matters more than the number of individuals per group) • E.g. You CANNOT randomize at the level of the district, with one treated district and one control district!!!!
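The adjustment for within-cluster correlation is often expressed through a design effect. The Python sketch below assumes the standard formula, design effect = 1 + (m − 1) × ICC, for equal-sized clusters of m individuals with intra-cluster correlation ICC. The function names, the starting point of 252 students per group, and the 30 students sampled per school are hypothetical values chosen for illustration.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Inflation factor for the sample size when whole clusters are randomized."""
    return 1 + (cluster_size - 1) * icc


def clusters_per_arm(n_individual, cluster_size, icc):
    """Clusters needed per group, given the sample size under individual randomization."""
    n_adjusted = n_individual * design_effect(cluster_size, icc)
    return ceil(n_adjusted / cluster_size)


# Hypothetical inputs: 252 students per group would be enough if students were
# randomized individually, and 30 students are sampled in each school.
for icc in (0.0, 0.05, 0.2, 0.5):
    schools = clusters_per_arm(252, cluster_size=30, icc=icc)
    print(f"ICC = {icc:.2f}: design effect = {design_effect(30, icc):.2f}, "
          f"{schools} schools per group")
```

With a high ICC, adding students within the same schools adds little information, which is why the number of clusters tends to drive what can be detected.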

Further Issues 4. Stratifying What? • Sub-populations/blocks defined by the values of control variables • Common strata: geography, gender, sector, etc. • Treatment assignment (or sampling) occurs within these groups Why? • Ensures treatment and control groups are balanced • ↓ sample size because • ↓ variance of the outcome of interest in each stratum (most when there is high correlation between stratification variables and the outcome) • ↓ correlation of units within clusters

Further Issues 4. Stratifying Geography example: What’s the impact in a particular region? Sometimes hard to say with any confidence [Figure legend: T = treatment, C = control]

Further Issues 4. Stratifying Why do we need strata? • Random assignment to treatment within geographical units • Within each unit, ½ will be treatment, ½ will be control • Similar logic for gender, type of schools, school size, etc
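Random assignment within strata can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: the function name `stratified_assignment`, the school names, and the three regions are hypothetical, and when a stratum has an odd number of units the extra unit goes to control.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=0):
    """Assign half of each stratum to treatment ("T") and the rest to control ("C")."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)        # random order within the stratum
        half = len(members) // 2    # odd strata: the extra unit goes to control
        for position, unit in enumerate(members):
            assignment[unit] = "T" if position < half else "C"
    return assignment


# Hypothetical sampling frame: 12 schools spread evenly over 3 regions.
regions = ["North", "Center", "South"] * 4
schools = [(f"school_{i:02d}", region) for i, region in enumerate(regions)]
assignment = stratified_assignment(schools, stratum_of=lambda school: school[1])

for (name, region), arm in sorted(assignment.items(), key=lambda item: item[0][1]):
    print(region, name, arm)
```

By construction every region ends up with the same number of treatment and control schools, which a single draw over the whole list would not guarantee.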

Summing up • Your sample size will determine how much you can learn from your IE • Some judgment and guesswork go into the calculations, but it is important to spend time on them • If the sample size is too low: waste of time and money, because you will not be able to detect a non-zero impact with any confidence Questions?

Example/Exercise • Example: Sampling efficiency • We generated data from a population • Compute the mean and variance • Select random samples of different sizes and compute the average of each • See how close we get to the real population value
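The exercise above can be reproduced with a short simulation. The Python sketch below is one possible version: the population (10,000 schools with normally distributed annual expenses), the sample sizes, and the number of repeated draws are all hypothetical choices, since the slides do not specify how the data were generated.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical population: annual expenses of 10,000 schools (arbitrary units).
population = [rng.gauss(50_000, 15_000) for _ in range(10_000)]
true_mean = statistics.fmean(population)
true_variance = statistics.pvariance(population)


def average_error(sample_size, draws=500):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(draws):
        sample = rng.sample(population, sample_size)  # simple random sample
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)


results = {n: average_error(n) for n in (5, 50, 500, 1000)}

print(f"population mean = {true_mean:.0f}, variance = {true_variance:.0f}")
for n, gap in results.items():
    print(f"sample size {n:>4}: average gap from the true mean = {gap:.0f}")
```

The average gap shrinks as the sample grows, but with diminishing returns: each tenfold increase in sample size cuts the gap by only about a factor of three.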

Example/Exercise • Example: Sample size Country X wishes to improve students’ math performance in grade 2. To do so, the Ministry of Education of X decides to distribute to those students new math textbooks that they can take home. One year earlier, a national test in math indicated that the average test score was 40% with a standard deviation of 19. The national statistics indicate that 15% of the students repeat grade 2. Distributing the textbooks costs on average $125 (cost of the book and distribution). Given that the Minister is unsure of the impact of this program, he would like you to evaluate it. List the different items that you need in order to determine your sample size. Set the value of each item.
