Assessing Threats to Validity and Implications for Use of Impact Evaluation Findings
Michael Woolcock
Development Research Group, World Bank
Kennedy School of Government, Harvard University
mwoolcock@worldbank.org
InterAction, May 13, 2013
Overview
• Background
• The art, science and politics of evaluation
• Forms and sources of validity
  • Construct, Internal, External
• Applications to ‘complex’ interventions
  • With a focus on External Validity
• If it works there, will it work here?
• Expanding range of ideas, methods and strategies
[T]he bulk of the literature presently recommended for policy decisions… cannot be used to identify ‘what works here’. And this is not because it may fail to deliver in some particular cases [; it] is not because its advice fails to deliver what it can be expected to deliver… The failing is rather that it is not designed to deliver the bulk of the key facts required to conclude that it will work here.
Nancy Cartwright and Jeremy Hardie (2012) Evidence-Based Policy: A Practical Guide to Doing it Better (New York: Oxford University Press, p. 137)
Contesting Development: Participatory Projects and Local Conflict Dynamics in Indonesia
Patrick Barron, Rachael Diprose and Michael Woolcock
Yale University Press, 2011
The art, science and politics of evaluation
• The Art…
  • Sensibility, experience
  • Optimizing under (numerous) constraints
  • Taking implementation, monitoring, context seriously
• The Science…
  • Skills, theory
  • Modes of causal reasoning (statistical, logical, legal), time
• …and the Politics
  • Competence and confidence under pressure
  • Picking battles…
Making, assessing impact claims
The quality of empirical knowledge claims turns on…
1. Construct validity
  • Do key concepts (‘property rights’, ‘informal’) mean the same thing to different people? What gets “lost in translation”?
2. Internal validity…
  • In connecting ‘cause’ (better schools) and ‘effect’ (smarter children), have we considered other factors that might actually be driving the result (home environment, community safety, cultural norms)? Programs are rarely placed randomly…
3. …assessed against a ‘theory of change’
  • Specification of how a project’s components (and their interaction) and processes generate outcomes
  • Reasoned expectations: where, by when?
4. External validity (how generalizable are the claims?)
  • If it works here, will it work there? If it works with this group, will it work with that group? Will bigger be better?
1. Construct validity
• Asking, answering and interpreting questions
  • To what extent do all parties share similar understandings of key concepts?
  • E.g., ‘poverty’, ‘ethnicity’, ‘violence’, ‘justice’…
• Can be addressed using mixed methods:
  • Iterative field testing of questionnaire items, and their sequencing (NOT cut-and-paste from elsewhere)
  • ‘Anchoring vignettes’ (Gary King et al): assessing “quality of government” in China and Mexico
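To make the anchoring-vignette idea concrete, here is a minimal, hypothetical sketch of the nonparametric rescaling step (a simplification of King et al.'s approach, with made-up respondents and ratings): a self-assessment is re-expressed by where it falls among that respondent's own ratings of fixed vignettes, so groups that use response scales differently can still be compared.

```python
# Hypothetical illustration of nonparametric anchoring-vignette rescaling
# (a simplification of King et al.): a self-rating is re-expressed as its
# position among the same respondent's ratings of fixed vignettes, ordered
# from worst to best, so different uses of the response scale cancel out.

def rescale_self_rating(self_rating, vignette_ratings):
    """Rank of the self-rating among the vignette ratings; ties give a range."""
    below = sum(1 for v in vignette_ratings if self_rating > v)
    ties = sum(1 for v in vignette_ratings if self_rating == v)
    lo = below + 1
    hi = below + ties + 1 if ties else lo
    return lo, hi

# Two invented respondents rate the same three vignettes (worst -> best) and
# then their own situation, each on a 1-5 scale used rather differently.
respondents = {
    "respondent 1 (generous scale use)": {"self": 4, "vignettes": [2, 3, 4]},
    "respondent 2 (strict scale use)":   {"self": 2, "vignettes": [1, 2, 4]},
}
for name, r in respondents.items():
    print(name, "->", rescale_self_rating(r["self"], r["vignettes"]))
```

On the raw 1-5 scale the two self-ratings look different (4 vs 2); relative to the shared vignettes they occupy similar positions, which is the comparability the method is after.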
2. Internal validity
In Evaluation 101, we assume…
Impact = f(Design) | Selection, Confounding Variables
Adequate for ‘simple’ interventions with a ‘good-enough’ counterfactual. But this is inadequate for assessing ‘complex’ interventions, where:
• design is multi-faceted (i.e., many ‘moving parts’)
• interaction with context is pervasive, desirable
• implementation quality is vital
• trajectories of change are probably non-linear (perhaps unknowable ex ante)
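A small simulated example (all variables and numbers are hypothetical) of why the conditioning on selection and confounding matters: when school placement is correlated with home environment, a naive treated-vs-untreated comparison overstates the effect, while adjusting for the confounder recovers it.

```python
# Hypothetical simulation of the internal-validity problem: programme placement
# correlates with a confounder ("home environment"), so a naive comparison of
# treated vs. untreated children overstates the true effect of better schools.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
home_env = rng.normal(size=n)                                       # confounder
better_school = (home_env + rng.normal(size=n) > 0).astype(float)   # non-random placement
true_effect = 0.3
test_score = true_effect * better_school + 0.8 * home_env + rng.normal(size=n)

# Naive contrast (ignores selection on home environment): biased upward.
naive = test_score[better_school == 1].mean() - test_score[better_school == 0].mean()

# Adjusted estimate: OLS of scores on treatment and the confounder.
X = np.column_stack([np.ones(n), better_school, home_env])
beta, *_ = np.linalg.lstsq(X, test_score, rcond=None)

print(f"true effect:    {true_effect:.2f}")
print(f"naive contrast: {naive:.2f}")
print(f"adjusted (OLS): {beta[1]:.2f}")
```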
Pervasive problem
• Such projects are inherently very complex, thus:
  • Very hard to isolate ‘true’ impact
  • Very hard to make claims about likely impact elsewhere
• Understanding how (not just whether) impact is achieved is also very important
  • Process evaluations, or ‘realist evaluations’, can be most helpful (see work of Ray Pawson, Patricia Rogers et al)
  • Mixed methods, theory, and experience are all crucial for investigating these aspects
Evaluating ‘complex’ projects
Impact = f([DQ, CD], SF) | SE, CV, RE
DQ = Design quality (weak, strong)
CD = Causal density (low, high)
SF = Support factors: implementation, context
SE = Selection effects (non-random placement, participation)
CV = Confounding variables
RE = Reasoned expectations (where, by when?)
In Social Development projects (cf. roads, immunizations):
• CD is high, loose, often unobserved (unobservable?)
• Implementation and context are highly variable
• RE is often unknown (unknowable?)
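One purely illustrative way to operationalise this notation (the functional form and numbers are assumptions, not the presentation's model) is a toy function in which every additional 'moving part' must be carried by implementation and context:

```python
# Toy reading of Impact = f([DQ, CD], SF) | SE, CV, RE. The functional form and
# numbers are assumptions for illustration only.
def realised_impact(design_quality, causal_density, implementation, context,
                    elapsed_share_of_expected_time=1.0):
    """Strong designs help, but every extra 'moving part' must be carried by
    implementation and context, and impact accrues only as the
    reasoned-expectation horizon (RE) is reached."""
    support = implementation * context          # SF: both needed
    carried = support ** causal_density         # high CD punishes weak SF
    return design_quality * carried * min(elapsed_share_of_expected_time, 1.0)

# A 'simple' intervention (CD = 1) vs a 'complex' one (CD = 5), same design
# quality and the same moderately strong implementation and context (0.8 each):
print(realised_impact(1.0, 1, 0.8, 0.8))   # ~0.64
print(realised_impact(1.0, 5, 0.8, 0.8))   # ~0.11
```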
3. Theory of Change, Reasoned Expectations: understanding impact trajectories
[Figure: net impact plotted over time, from t = 0 to t = 1]
Understanding impact trajectories
“Same” impact claim, but entirely a function of when the assessment was done…
[Figure: net impact plotted over time, from t = 0 to t = 1]
Understanding impact trajectories
If an evaluation was done at ‘A’ or ‘B’, what claims about impact would be made?
[Figure: net-impact trajectory over time (t = 0 to t = 1) with evaluation points marked A, B and C]
Understanding impact trajectories
[Figure: net-impact trajectories over time (t = 0 to t = 2) with points marked A, B, C and D, and an uncertain (?) trajectory beyond t = 1]
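A brief simulated illustration of the timing point (the trajectory shape and evaluation dates are invented): the same project can look harmful, marginal, or successful depending solely on when the assessment is done.

```python
# Invented J-curve impact trajectory: early disruption, slow recovery, eventual
# gains. The "same" project looks harmful, marginal, or successful depending
# purely on when the evaluation happens.
import numpy as np

t = np.linspace(0, 10, 101)                         # project time
net_impact = 0.6 * (1 - np.exp(-0.4 * t)) - 0.8 * t * np.exp(-t)

for label, when in [("A (early)", 1.0), ("B (midline)", 4.0), ("C (endline)", 10.0)]:
    idx = int(np.argmin(np.abs(t - when)))
    print(f"evaluation at {label:12s}: measured net impact = {net_impact[idx]:+.2f}")
```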
4. External Validity: some background
• Rising obsession with causality, RCTs as ‘gold standard’
  • Pushed by donors, foundations (e.g., Gates), researchers
  • Campbell Collaboration, Cochrane Collaboration, NIJ, J-PAL, et al
  • For the “busy policymaker”, “warehouses” of interventions that “work”
• Yet also serious critiques…
  • In medicine: Rothwell (2005), Groopman (2008)
  • In philosophy: Cartwright (2011)
  • In economics: Deaton (2010), Heckman (1992), Ravallion (2009)
  • Reddy (2013) on Poor Economics: “from rigor to rigor mortis”; a radical approach to defining development down, delimiting innovation space
• …especially as it pertains to external validity…
  • NYT (2013), Engber (2011) on ‘Black 6’ (biomedical research)
  • Henrich et al (2011) on ‘WEIRD’ people (social psychology)
  • Across time, space, groups, scale, units of analysis
• …and understanding of mechanisms
  • A true “science of delivery” requires knowledge of how, not just whether, something ‘works’ (Cartwright and Hardie 2012)
Evaluating ‘complex’ projects (recap)
Impact = f([DQ, CD], SF) | SE, CV, RE
DQ = Design quality (weak, strong)
CD = Causal density (low, high)
SF = Support factors: implementation, context
SE = Selection effects (non-random placement, participation)
CV = Confounding variables
RE = Reasoned expectations (where, by when?)
In Social Development projects (cf. roads, immunizations):
• CD is high, loose, often unobserved (unobservable?)
• Implementation and context are highly variable
• RE is often unknown (unknowable?)
From IV to EV, ‘simple’ to ‘complex’
• Causal density
• Support factors
  • Implementation
  • Context
• Reasoned expectations
Central claim: the higher the intervention’s complexity, the lower its external validity
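A small simulation sketch of the central claim (the probabilities and the "all support factors must be present" rule are assumptions for illustration): as causal density rises, the chance that every component finds its local support factor falls, so impact replicates less reliably across new contexts.

```python
# Toy simulation: an intervention with `causal_density` components delivers its
# impact in a site only if every component finds its local support factor
# (probability p_support each, an assumption). Higher causal density means
# lower and more variable replication across sites.
import numpy as np

rng = np.random.default_rng(1)

def replicate_impact(causal_density, n_sites=2_000, p_support=0.8, base_effect=1.0):
    supports = rng.random((n_sites, causal_density)) < p_support
    realised = supports.all(axis=1) * base_effect
    return realised.mean(), realised.std() / realised.mean()

for cd in [1, 3, 6, 10]:
    mean, cv = replicate_impact(cd)
    print(f"causal density {cd:2d}: mean impact across sites = {mean:.2f}, "
          f"relative variability = {cv:.2f}")
```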
1. ‘Causal density’: Which way up? RCTs vs QICs
Eppstein et al (2012) “Searching the clinical fitness landscape” PLoS ONE 7(11): e49901
How ‘simple’ or ‘complex’ is your policy/project?
Specific questions to ask: to what extent does producing successful outcomes from your policy/project require…
• that the implementing agents make fine-grained distinctions about the “state of the world”? Are these distinctions difficult for a third party to assess/verify? (Local discretion)
• many agents to act, or few, over extended time periods? (Transaction intensity)
• that the agents resist large temptations/pressures to do something besides implement the policy? (High stakes)
• that agents innovate to achieve desired outcomes? (Known technology)
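A hypothetical scoring sketch for these four questions (the 0-3 ratings and the additive rule are assumptions, not part of the presentation): rate a policy/project on each dimension and read the total as a rough position on the simple-to-complex spectrum.

```python
# Hypothetical scoring sketch: rate a policy/project from 0 (not at all) to 3
# (to a great extent) on the four dimensions above; higher totals suggest a
# more 'complex' intervention. The scoring rule itself is an assumption.
DIMENSIONS = ("local_discretion", "transaction_intensity",
              "high_stakes", "unknown_technology")

def complexity_score(ratings):
    """Sum of 0-3 ratings on the four dimensions (0 = simple, 12 = highly complex)."""
    assert set(ratings) == set(DIMENSIONS)
    return sum(ratings.values())

# Illustrative ratings (guesses, not the presentation's):
immunisation_campaign = {"local_discretion": 0, "transaction_intensity": 2,
                         "high_stakes": 0, "unknown_technology": 0}
participatory_project = {"local_discretion": 3, "transaction_intensity": 3,
                         "high_stakes": 2, "unknown_technology": 3}

print("immunisation campaign:", complexity_score(immunisation_campaign))
print("participatory project:", complexity_score(participatory_project))
```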
Classification of “activities” in health
• Technocratic (implementation light; policy decree)
• Logistical (implementation intensive, but easy)
• Implementation intensive, ‘downstream’ (of services)
• Implementation intensive, ‘upstream’ (of obligations)
• Complex (implementation intensive, motivation hard), needing (continuous?) innovation
2. Implementation: Using RCTs to test the EV of RCTs
• Bold, Sandefur et al (2013)
• Take a project (contract teachers) with a positive impact in India, as determined by an RCT…
• …to Kenya: 192 schools randomly split into three groups:
  • a control group
  • schools receiving a contract teacher through an NGO (World Vision)
  • schools receiving a contract teacher through the MoE
• Result?
Implementation matters (a lot) Bold et al (2013)
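A hypothetical re-creation of the study's design logic (the effect sizes below are illustrative placeholders, not Bold et al.'s estimates): the same contract-teacher intervention, randomly assigned to NGO or government delivery, can yield very different measured impacts.

```python
# Hypothetical re-creation of the design logic: one contract-teacher
# intervention, three randomly assigned arms. Effect sizes are illustrative
# placeholders, not the study's estimates.
import numpy as np

rng = np.random.default_rng(2)
n_per_arm = 64   # schools per arm (the study split 192 schools across three arms)

arm_effects = {"control": 0.0, "NGO delivery": 0.20, "government delivery": 0.02}
scores = {arm: eff + rng.normal(scale=0.5, size=n_per_arm)
          for arm, eff in arm_effects.items()}

control_mean = scores["control"].mean()
for arm in ("NGO delivery", "government delivery"):
    print(f"{arm:20s}: impact vs control = {scores[arm].mean() - control_mean:+.2f} sd")
```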
The fact is that RCTs come at the end, when you have already decided that it will probably work, here and maybe anywhere… To know that this is a good bet, you have to have thought about causal roles and support factors… [A]nswering the how question is made easier in science by background knowledge of how things work. Nancy Cartwright and Jeremy Hardie (2012) Evidence-Based Policy: A Practical Guide to Doing it Better (New York: Oxford University Press, p. 125)
Learning from intra-project variation (‘complex’ projects)
Iterative, adaptive learning
[Figure: impact trajectories for project sites/variants (A, B) over time, t = 0 to t = 1]
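A small simulated sketch of learning from intra-project variation (site counts and effects are invented): the project-wide average impact can mask large site-level differences, which is exactly the variation an iterative, adaptive evaluation would exploit.

```python
# Invented multi-site project: the project-wide average impact hides large
# site-level differences driven by (unobserved) implementation quality.
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_per_site = 20, 200

site_quality = rng.normal(size=n_sites)        # unobserved implementation quality
site_effects = 0.2 + 0.3 * site_quality        # true site-level impacts

site_estimates = np.array(
    [(eff + rng.normal(size=n_per_site)).mean() for eff in site_effects])

print(f"project-wide average impact: {site_estimates.mean():+.2f}")
print(f"best site:  {site_estimates.max():+.2f}")
print(f"worst site: {site_estimates.min():+.2f}")
# An adaptive design would ask why the best sites differ from the worst
# (implementation, context) rather than reporting only the average.
```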
Putting it all together
[Figure: summary matrix/graphic with axes running from Low to High]
• Utility of case studies, of process evaluations, of MM
• Even with low EV interventions, the ideas and processes behind them may still travel well
Implications
• Take the analytics of knowledge claims surrounding EV as seriously as we do IV
• Engage with the vast array of social science tools available for rigorously assessing complex interventions
  • Within and beyond economics
  • RCTs as one tool among many
  • New literature on case studies (Mahoney), QCA (Ragin), complexity
  • See especially ‘realist evaluation’ (Pawson, Tilley)
• Make implementation cool; it really matters…
  • Learning from intra-project variation; projects themselves as laboratories, as “policy experiments” (Rondinelli 1993)
• A ‘science of delivery’ must know how, not just whether, interventions work (mechanisms, theory of change)
  • Especially important for engaging with ‘complex’ interventions
  • Need ‘counter-temporal’ (not just counterfactual) analysis
  • Reasoned expectations about what and where, by when?
Primary source material
• Bamberger, Michael, Vijayendra Rao and Michael Woolcock (2010) ‘Using Mixed Methods in Monitoring and Evaluation: Experiences from International Development’, in Abbas Tashakkori and Charles Teddlie (eds.) Handbook of Mixed Methods (2nd revised edition), Thousand Oaks, CA: Sage Publications, pp. 613-641
• Barron, Patrick, Rachael Diprose and Michael Woolcock (2011) Contesting Development: Participatory Projects and Local Conflict Dynamics in Indonesia, New Haven: Yale University Press
• Pritchett, Lant, Salimah Samji and Jeffrey Hammer (2012) ‘It’s All About MeE: Using Experiential Learning to Navigate the Design Space’, Center for Global Development Working Paper No.
• Woolcock, Michael (2009) ‘Toward a Plurality of Methods in Project Evaluation: A Contextualized Approach to Understanding Impact Trajectories and Efficacy’, Journal of Development Effectiveness 1(1): 1-14
• Woolcock, Michael (forthcoming) ‘Using Case Studies to Explore the External Validity of Complex Development Interventions’, Evaluation