
Why Non-Experimental Methods are “Not Good Enough” and Why Experimental Methods Are: Challenging the Folk Lore of Evaluation Research David Weisburd Hebrew University George Mason University

Oliver Wendell Holmes

Where I am Going • Describe how non-experimental evaluation studies attempt to gain unbiased results in a world where outcomes are confounded. • Define the fundamental weakness of this approach. • Critically examine the “folklore” that suggests non-experimental studies are “good enough” despite this weakness. • Folk lore: the traditional beliefs, customs, and stories of a community, passed through the generations by word of mouth. (Oxford Pocket Dictionary)

Experiments are Good Enough • Experimental studies provide a statistical solution to the problem of confounding. • They should be “good enough.” • Critically examine the “folk lore” that seems to suggest that experiments are not “good enough” despite their statistical advantages.

Neutralizing Confounding in Non-Experimental Research

The Key Question • In evaluating treatments or programs the key issue is getting an unbiased estimate of the treatment effect. • Without that, any other considerations such as the ability to generalize results are superfluous. • The main problem we face is that treatment is confounded with other factors.

The Problem We Need to Solve • Example: We measure the effect of prison on recidivism. • We find that prison increases recidivism. • But the reason for this may be that we have not taken into account the fact that “prisoners” are more likely to recidivate in the first place because they have, on average, more severe prior records. • Treatment (prison) is confounded with prior record.

Creating Unbiased Estimates in Non-Experimental Studies • Non-experimental methods such as regression techniques or matching rely on a similar logic. • If we know what the factors are that confound treatment we can take them into account. • The primary method of doing this is statistical (Multivariate Statistical Methods). • But Quasi-Experiments that rely on matching, or propensity scores are based on the same logic.
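The logic of "taking a confounder into account" can be sketched with synthetic data. This is a minimal illustration, not the analysis behind the slides: the data-generating process, the coefficient 0.5, and the names `simulate`, `naive_slope`, and `adjusted_slope` are all assumptions made for the example. Prison is given no true effect on recidivism, yet the unadjusted estimate is clearly positive because prior record drives both.

```python
import random

def simulate(n=20000, true_effect=0.0, seed=0):
    """Synthetic data (assumed process): prior record raises both prison and recidivism."""
    rng = random.Random(seed)
    prison, prior, recid = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)                          # confounder: prior record severity
        t = 1.0 if c + rng.gauss(0.0, 1.0) > 0 else 0.0  # prison is more likely with a worse record
        y = true_effect * t + 0.5 * c + rng.gauss(0.0, 1.0)
        prison.append(t)
        prior.append(c)
        recid.append(y)
    return prison, prior, recid

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def naive_slope(t, y):
    """Bivariate regression of y on t (the confounder is ignored)."""
    return cov(t, y) / cov(t, t)

def adjusted_slope(t, c, y):
    """Coefficient on t in the regression y ~ t + c, from the 2x2 normal equations."""
    stt, scc, stc = cov(t, t), cov(c, c), cov(t, c)
    sty, scy = cov(t, y), cov(c, y)
    return (sty * scc - scy * stc) / (stt * scc - stc ** 2)

prison, prior, recid = simulate()
naive = naive_slope(prison, recid)
adjusted = adjusted_slope(prison, prior, recid)
print(f"unadjusted estimate: {naive:.3f}")
print(f"adjusted for prior record: {adjusted:.3f}")
```

The adjustment only works here because the single confounder is known and measured, which is exactly the condition the non-experimental approach depends on.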

Solving the Problem Statistically: CC is the Confounding Cause
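One standard way to write the problem down is the omitted-variable result for a linear model. The notation below (T for treatment, CC for the confounding cause, the Greek coefficients) is an assumption for illustration and is not taken from the original slide.

```latex
% Assumed true model: outcome Y depends on treatment T and confounding cause CC
Y = \beta_0 + \beta_1 T + \beta_2 \, CC + \varepsilon

% Model estimated when CC is left out
Y = b_0 + b_1 T + u

% Expected value of the estimated treatment coefficient
E[\hat{b}_1] = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(T, CC)}{\operatorname{Var}(T)}
```

The second term is the bias. It vanishes only if CC has no effect on the outcome or CC is unrelated to treatment; including CC in the model removes it.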

Elegant Solution, But… • If we want to get an unbiased estimate of treatment in a non-experimental study we would in theory have to identify all “confounding causes.” • That “assumption” is on its face unrealistic, but evaluation researchers often use “folk lore” to argue that non-experimental studies are in any event “good enough.”

The Folk Lore of Non-Experimental Evaluations Non-Experimental Methods are “Good Enough”?

1) Overall We Identify the Most Important Causes

Aren’t we Doing Well Enough? • A common “defense” for non-experimental methods is that our “models” take into account the “most important factors.” • The assumption here is that in practice we don’t have to worry about excluded variables. • The major ones (that might affect the outcomes in meaningful ways) are already known and accounted for in the model.

Impact of Small Excluded Effects With Little Influence is Small

How Well do Criminologists Explain Crime • Alex Piquero and I have recently published an article in Crime and Justice in which we examined this assumption. • We reviewed all the articles in Criminology that used multivariate statistical modeling to examine a criminological theory and provided some measurement of “variance explained.” • While my concern here is isolating a treatment effect, the question is similar, since we would not expect our understanding of treatments or programs to be very different than our underlying understanding of crime and justice.

Average Variance Explained • Across the articles that reported an R² value over the time period covered, the average R² was .389. • Some 25% of the 169 articles exhibit R² values below .20, while over 70% have an R² under .50.

Aggregate R² Value over Time (N=169 articles). (Note: Years with zero observations are removed for ease of presentation.)

The Folk Lore is Most Likely Wrong • There is a good deal left unexplained, most often more than half the variance. • It would seem very difficult to assume that all of this unexplained variance contains no meaningful confounding factors that are routinely excluded.

2) If the Effect of Treatment is Large then You can Assume that Excluded Causes Would not Change that Estimate in a Meaningful Way

This Effect is Large Enough Not to Worry About! • Another piece of folk lore often used to defend a reliance on non-experimental methods is that very large and robust effects are not likely to be meaningfully altered even if there are unmeasured confounding factors. • Statisticians, in contrast, have often noted the “instability” of regression parameters under differing assumptions.

AOC Death Penalty Study • Joe Naus from Rutgers University and I were asked by the AOC of New Jersey to assess the effects of race on death penalty sentencing. • Following an approach that identified major factors influencing death penalty sentencing, we developed a model that showed a very significant effect of race of victim on the likelihood of advancement to penalty trial.

White Victim is the Single Most “Significant” Effect on Advancement to Penalty Trial

Regional Effects • The State Prosecutor argued that the effect of race of victim was confounded by district of prosecution. • He noted that counties that had large numbers of “white victims” were places where it was more likely for a case to go to penalty trial for other reasons. • For example, the counties with large numbers of white victims had many fewer “death eligible cases.” Prosecutors in such counties were more likely to focus more aggressively on such cases.
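The form of the prosecutor's argument can be sketched with purely synthetic numbers. Nothing below comes from the New Jersey data: the two county types, their shares of white-victim cases, and their advancement rates are hypothetical values chosen for illustration, and the names `simulate_cases` and `rate_gap` are invented for the example. Advancement depends on county only, yet the pooled comparison shows a large apparent gap by race of victim.

```python
import random

def simulate_cases(n_per_county=20000, seed=2):
    """Synthetic cases as (county, white_victim, advanced). No race effect is built in."""
    rng = random.Random(seed)
    # Hypothetical county types: (share of white-victim cases, advancement rate).
    counties = {"A": (0.8, 0.6), "B": (0.2, 0.2)}
    cases = []
    for county, (p_white, p_advance) in counties.items():
        for _ in range(n_per_county):
            white = rng.random() < p_white
            advanced = rng.random() < p_advance  # depends on county only
            cases.append((county, white, advanced))
    return cases

def rate_gap(cases):
    """Advancement rate for white-victim cases minus the rate for all other cases."""
    white = [adv for _, w, adv in cases if w]
    other = [adv for _, w, adv in cases if not w]
    return sum(white) / len(white) - sum(other) / len(other)

cases = simulate_cases()
pooled_gap = rate_gap(cases)
gap_by_county = {c: rate_gap([x for x in cases if x[0] == c]) for c in ("A", "B")}

print(f"pooled gap: {pooled_gap:.3f}")
for county, gap in gap_by_county.items():
    print(f"gap within county {county}: {gap:.3f}")
```

The sketch shows only that a large pooled effect is compatible with no effect inside each county; it says nothing about what the actual study found once county was controlled.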

White Victim Controlling for County

3) We can Assume that the Biases are Balanced

Everything Will Balance Off in the End • A common folklore is that the excluded variables “balance each other,” so we can assume that the parameter estimate is unbiased. • This assumption relies on a model in which the exclusion of variables is random, and therefore we would assume an unbiased estimate of b. • If this assumption had any basis to it we could just rely on the bivariate model. No one would argue that the bivariate model provides an unbiased estimate!
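What "balancing" would require can be written out for the linear case. The notation is an assumption for illustration: T is treatment, CC_1 through CC_K are the excluded confounding causes, gamma_k is the effect of CC_k on the outcome, and beta is the true treatment effect that the estimate b is meant to recover.

```latex
% Expected value of the treatment coefficient when K confounding causes are excluded
E[\hat{b}] = \beta + \sum_{k=1}^{K} \gamma_k \, \frac{\operatorname{Cov}(T, CC_k)}{\operatorname{Var}(T)}

% "Balancing" requires the positive and negative terms to cancel exactly
\sum_{k=1}^{K} \gamma_k \, \frac{\operatorname{Cov}(T, CC_k)}{\operatorname{Var}(T)} = 0
```

Nothing in the way variables come to be excluded guarantees that this sum is zero.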

Knowledge Development is not Random • Indeed, there is good reason to believe that we identify variables in clusters around specific theoretical constructs (like poverty or social disorganization). • By definition we are then missing clusters which are likely to cause bias in specific directions. • Data restrictions (e.g. gathering official data) are likely to be even more systematic in their biases.

So Why are Experiments Good Enough?

Randomized Experiments: A Naïve Approach • Because treatment has been allocated randomly, in theory it is not going to be related systematically to other factors such as gender, race, age, attitudes, etc. • THERE ARE NO CONFOUNDING CAUSES!

No Confounding!

So Rather than Taking Confounding Causes Into Account a Randomized Experiment Makes the Confounding Irrelevant The product of the correlations is zero in a randomized experiment.
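A small simulation can illustrate why random allocation makes the confounder irrelevant. This is a sketch under assumed values (a true effect of 0.3, a confounder effect of 0.5, and the hypothetical helpers `run_study`, `corr`, and `diff_in_means`), not a reproduction of any study. The same outcome process is run twice: once with treatment assigned by coin flip, and once with treatment depending on the confounder.

```python
import random

def run_study(randomized, n=20000, true_effect=0.3, seed=1):
    """Return (treatment, confounder, outcome) lists from an assumed synthetic process."""
    rng = random.Random(seed)
    treat, conf, out = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)                              # confounding cause
        if randomized:
            t = 1.0 if rng.random() < 0.5 else 0.0           # coin flip, unrelated to c
        else:
            t = 1.0 if c + rng.gauss(0.0, 1.0) > 0 else 0.0  # selection on c
        y = true_effect * t + 0.5 * c + rng.gauss(0.0, 1.0)
        treat.append(t)
        conf.append(c)
        out.append(y)
    return treat, conf, out

def corr(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def diff_in_means(treat, out):
    """Mean outcome of the treated minus mean outcome of the controls."""
    treated = [y for t, y in zip(treat, out) if t == 1.0]
    control = [y for t, y in zip(treat, out) if t == 0.0]
    return sum(treated) / len(treated) - sum(control) / len(control)

t_rct, c_rct, y_rct = run_study(randomized=True)
t_obs, c_obs, y_obs = run_study(randomized=False)

rct_corr, rct_estimate = corr(t_rct, c_rct), diff_in_means(t_rct, y_rct)
obs_corr, obs_estimate = corr(t_obs, c_obs), diff_in_means(t_obs, y_obs)

print(f"randomized:    corr(T, CC) = {rct_corr:.3f}, estimate = {rct_estimate:.3f}")
print(f"observational: corr(T, CC) = {obs_corr:.3f}, estimate = {obs_estimate:.3f}")
```

In the randomized run the confounder is never measured or adjusted for, yet the simple difference in means lands close to the assumed true effect because treatment and confounder are essentially uncorrelated.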

The Folk Lore of Why Experiments are Not Good Enough

1) Experiments are Not Ethical • Many people still claim that it is not ethical to carry out social experiments. • It seems that, at least in crime and justice, evaluation researchers don’t really accept this folk lore (Lum and Yang, 2003).

“Randomized experimental design is the best method of linking cause and effect.” t= -2.70* p= .010

“Randomized experiments cannot be carried out ethically in criminal justice settings.” t= -1.98 p= .051

2) Experiments Cannot be Implemented in the Real World • Crime Reduction Experiments 1945-1993 (N=267)

3) Experiments Have Low External Validity • Only innovative agencies are willing to participate in experiments. • “Ordinary” agencies may be brought on board if there is strong governmental encouragement and financial support that rewards participation. • Experiments operate in an artificial world that is controlled and not dynamic. • There is no free lunch!

Randomized Experiments are Good Enough Everything Else is Commentary

The great Talmudic scholar Hillel, when asked to explain Judaism while standing on one foot, responded that its essence was the dictum: “Treat others as you would like them to treat you.” He then noted that everything else is “commentary.” • In our case, the essence of evaluation research is that “experiments are good enough.” • Non-experimental methods are “not good enough.”

The Commentary • Of course, as in the case of Hillel, the commentary is very important. • But a simple rule should follow. We should begin any study with an assumption that an experimental design is required. We should only then get to the commentary.
