Propensity Score Matching Onur Baser, PhD Pharmaceutical Outcomes Research onur.baser@thomson.com September 7, 2005
Outline of the Presentation • Why Use Propensity Score Matching? • Propensity Score Matching Defined • An Extended Example and Guidelines for Choosing the Most Appropriate Type • Limitations of Propensity Score Matching • Summary and Discussion
Overview of Key Points • Propensity Score Matching creates a “quasi-random experiment” from observational data • Choosing among the different matching techniques is important, and we should look at several criteria • Multivariate analysis after applying the correct matching technique increases the efficiency of the outcome estimator
Why Use Propensity Score Matching? • Addresses an evaluation problem: • Theory of counterfactuals: some people receive treatment • Question: What would have happened to those who, in fact, did receive treatment, if they had not received it (or the converse)? • Counterfactuals cannot be observed directly • We can only create and estimate them • Propensity Score Matching is one strategy that corrects for selection bias in making such estimates
Ideal Way to Address the Problem: Randomized Experiment • A sample of N individuals is selected from the population • The sample is divided randomly into two groups • The Treatment group is treated with Drug A, while the Control group is not • The outcome is observed for both the Treatment and Control groups • The difference in empirical means of the outcome is the value of interest [Figure: sample N split into Treatment (T) and Control (C) groups]
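The difference-in-means estimator described above can be sketched in a few lines of Python. This is a toy simulation (the sample size, outcome distribution, and a true effect of 5.0 are all made up for illustration), not data from any study:

```python
import random
import statistics

def simulate_trial(n=4000, true_effect=5.0, seed=42):
    """Simulate a randomized trial: assign treatment by a fair coin flip,
    then estimate the effect as the difference in empirical group means."""
    rng = random.Random(seed)
    outcomes_t, outcomes_c = [], []
    for _ in range(n):
        baseline = rng.gauss(50.0, 10.0)   # outcome the subject would have untreated
        if rng.random() < 0.5:             # random assignment removes selection bias
            outcomes_t.append(baseline + true_effect)
        else:
            outcomes_c.append(baseline)
    # The difference in empirical means is the value of interest
    return statistics.mean(outcomes_t) - statistics.mean(outcomes_c)

estimate = simulate_trial()
```

Because assignment is independent of the baseline outcome, the estimate is unbiased for the true effect; with n = 4000 it lands close to 5.0.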
Limitations of Randomized Evaluations • High cost • Financial cost • Ethical problems • Threats to internal validity • Non-response bias (patients leave trial) • Threats to external validity • Limited duration (effects occur after study period) • Experiment specificity (effects not generalizable) • Hawthorne and John Henry effects • General equilibrium effects • Threats to power • Small samples
Limitations of Observational Studies • Heterogeneity of real-life patient populations • Lack of standardized analysis • Lack of random assignment: • Randomized controlled trials (RCTs) look at the effect of exposure by assigning exposure to a random sample • The investigator plays no role in assigning exposure to the study subjects • This makes observational studies more vulnerable to methodological problems • RCTs are considered the best method of demonstrating causality
Controlling for Heterogeneity by Controlling for Observables • Ordinary Least Squares (OLS): a basic regression design • Stata command: reg y T x1 x2 x3, robust
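The regression above can be sketched in pure Python by solving the normal equations directly. The data-generating process (effect of 5.0, one covariate) is invented for illustration; the point is that the coefficient on T recovers the treatment effect when the relevant observables are controlled for:

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gaussian elimination with partial pivoting.
    X: list of rows (include a 1.0 column for the intercept)."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):                       # forward elimination
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for c in range(p, k):
                A[r][c] -= f * A[p][c]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in reversed(range(k)):             # back substitution
        beta[p] = (b[p] - sum(A[p][c] * beta[c] for c in range(p + 1, k))) / A[p][p]
    return beta

# Toy data: y = 2 + 5*T + 0.5*x + noise
rng = random.Random(0)
rows, ys = [], []
for _ in range(500):
    T = rng.random() < 0.5
    x = rng.gauss(0.0, 1.0)
    rows.append([1.0, float(T), x])
    ys.append(2.0 + 5.0 * T + 0.5 * x + rng.gauss(0.0, 1.0))
beta0, beta_T, beta_x = ols(rows, ys)
```

Here `beta_T` estimates the treatment effect; the robust standard errors from the Stata command are omitted from this sketch.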
Outline of the Presentation • Why Use Propensity Score Matching? • Propensity Score Matching Defined • The Concept of Matching • The Types of Matching • An Extended Example and Guidelines for Choosing the Most Appropriate Type • Limitations of Propensity Score Matching • Summary and Discussion
Propensity Score Matching (PSM) • Employs a predicted probability of group membership • E.g. treatment vs. control group • Based on observed predictors; usually obtained from a logistic regression, and used to create a counterfactual group (Rosenbaum & Rubin, 1983) • Dependent variable: T=1, if participate; T=0, otherwise T=f(age, gender, pre-cci, etc.) • Allows a “quasi-randomized” experiment • Two subjects, one in the treated group and one in the control, with the same (or similar) propensity score, can be seen as “randomly assigned” to either group
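A propensity score model of this form can be sketched in pure Python with a toy gradient-ascent logit. The single standardized covariate and the simulated assignment rule are illustrative only (not the MarketScan example, and not how Stata's `logit` maximizes the likelihood internally):

```python
import math
import random

def fit_logit(X, t, lr=0.1, iters=2000):
    """Fit T = f(x) by logistic regression via gradient ascent on the
    average log-likelihood; returns [intercept, slope, ...]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted propensity score
            grad[0] += ti - p
            for j, x in enumerate(xi):
                grad[j + 1] += (ti - p) * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity(beta, xi):
    """Predicted probability of treatment for covariate vector xi."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: higher values of the (standardized) covariate -> more likely treated
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0)] for _ in range(400)]
t = [1 if rng.random() < 1.0 / (1.0 + math.exp(-x[0])) else 0 for x in X]
beta = fit_logit(X, t)
```

Subjects with similar `propensity(beta, xi)` values are the candidates treated as "randomly assigned" in the matching steps that follow.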
Criteria for “Good” PSM Before Matching • Identify treatment and comparison group with substantial overlap • Same exclusion, inclusion criteria • Overweighting some variables (Medicare vs Medicaid) • Choose variables • Conduct logit estimations correctly
Choosing appropriate variables • Kitchen-sink approach • Including irrelevant variable increases the variance of predictions • Dropping independent variables that have small coefficients can reduce the average error of predictions • Correlation is not causality! • Example: Storks and babies
Choose Appropriate Conditioning Variables-2 • Include all variables that affect both treatment assignment and the outcome variables • Including variables only weakly related to treatment assignment usually reduces bias more than it increases variance • To avoid post-treatment bias, we should exclude variables affected by the treatment • Step-wise regression Stata command: sw logit treatment age cci cap female race, pr(.2) backward My favorite: Stata command: sw logit treatment (age female) cci cap race, pr(.2) pe(.5) lockterm1
Example • Variables that affect both treatment and outcomes Ex: age, gender, race, pre-period severity, top 10 comorbidities, plan type, CCI, total number of diagnoses, pre-period expenditures • Variables that may create post-treatment bias (OVER-MATCHING) Ex: post-period severity, comorbidities, CCI, total number of diagnoses
Conduct logit estimations correctly • Proportional frequency: beware when the treated are a small share of the total sample (%T < 10%), e.g. T=100 against C=900, 10,000, or 100,000 • ROC curve: plots sensitivity against 1 − specificity • Sensitivity: the proportion of cases picked out by the logit, relative to all cases of the disease • Specificity: the ability to pick out patients who do NOT have the disease • Stata commands: qui logit treatment age gender race lroc
General Procedure for Conducting PSM STEP 1: Run Logistic Regression • Dependent variable: T=1, if participate; T=0, otherwise • Choose appropriate conditioning variables • Obtain propensity score: predicted probability (p) or log[p/(1-p)] STEP 2: Match Each Participant to One or More Non-participants on Propensity Score • Stratification Matching • Nearest Neighbor Matching • Caliper Matching • Mahalanobis Matching • Kernel Matching • Radius Matching
Stratification Method • Divide the range of variation of the propensity score into intervals such that, within each interval, treated and control units have on average the same propensity score • Calculate the difference in the outcome measure between the treatment and control group in each interval • The average treatment effect is obtained as an average over blocks, with weights given by the distribution of treated units across blocks • Discard observations in blocks where either treated or control units are absent Stata command: atts cost, pscore(phat) blockid(5) boot reps(250)
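The steps above can be sketched in pure Python with equal-width blocks on [0, 1]. The tuples of (propensity score, treated flag, outcome) are invented toy data; this is an illustration of the estimator, not the `atts` implementation:

```python
import statistics

def stratified_att(units, n_blocks=5):
    """Stratification matching: split [0,1] into equal-width propensity
    score blocks, take the treated-minus-control mean outcome difference
    within each block, and weight blocks by their number of treated units.
    Blocks missing either group are discarded."""
    diffs, weights = [], []
    for b in range(n_blocks):
        lo, hi = b / n_blocks, (b + 1) / n_blocks
        block = [u for u in units if lo <= u[0] < hi]
        t = [y for p, tr, y in block if tr]
        c = [y for p, tr, y in block if not tr]
        if t and c:                       # discard blocks missing a group
            diffs.append(statistics.mean(t) - statistics.mean(c))
            weights.append(len(t))
    return sum(d * w for d, w in zip(diffs, weights)) / sum(weights)

units = [
    (0.15, 1, 12.0), (0.12, 0, 10.0),                  # block [0.0, 0.2)
    (0.35, 1, 20.0), (0.33, 0, 15.0), (0.38, 0, 17.0), # block [0.2, 0.4)
    (0.90, 1, 30.0),                                    # treated only: discarded
]
att = stratified_att(units)  # (2.0 + 4.0) / 2 = 3.0
```

The third treated unit contributes nothing because its block has no controls, mirroring the "discard" rule on the slide.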
Nearest Neighbor Matching • Randomly order the participants and non-participants • Then select the first participant and find the non-participant with the closest propensity score
Nearest Neighbor Matching (1 to 1) Stata command: psmatch2, outcome(cost) pscore(phat) n(1) norep [Figure: two columns of propensity scores, treatment vs. control (0.005, 0.005, 0.007, 0.0055, 0.006, 0.0061, 0.009); lines mark the matched treated–control pairs]
Nearest Neighbor Matching (2 to 1) Stata command: psmatch2, outcome(cost) pscore(phat) n(2) norep [Figure: two columns of propensity scores, treatment vs. control (0.005, 0.005, 0.007, 0.0055, 0.006, 0.0061, 0.009); each treated unit is paired with its two nearest controls]
Nearest Neighbor Matching with Replacement (1 to 1) Stata command: psmatch2, outcome(cost) pscore(phat) n(1) [Figure: two columns of propensity scores, treatment vs. control (0.005, 0.005, 0.0051, 0.0055, 0.006, 0.0061); with replacement, the same control can be matched to more than one treated unit]
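The nearest-neighbor variants above can be sketched in Python. The scores echo the toy figures; the greedy pairing logic is illustrative and is not a reimplementation of psmatch2:

```python
def nearest_neighbor_match(treated, controls, replace=True):
    """1-to-1 nearest-neighbor matching on propensity score.
    Returns (treated_index, control_index) pairs. Without replacement
    each control is used at most once, so the order of the treated
    list matters (hence the random ordering on the earlier slide)."""
    available = list(range(len(controls)))
    pairs = []
    for i, pt in enumerate(treated):
        pool = range(len(controls)) if replace else available
        j = min(pool, key=lambda j: abs(controls[j] - pt))
        pairs.append((i, j))
        if not replace:
            available.remove(j)          # this control is now used up
    return pairs

treated = [0.005, 0.0051]
controls = [0.005, 0.0055, 0.0061]
pairs_norep = nearest_neighbor_match(treated, controls, replace=False)  # [(0,0), (1,1)]
pairs_rep = nearest_neighbor_match(treated, controls, replace=True)     # [(0,0), (1,0)]
```

With replacement, both treated units grab the same best control (index 0); without replacement, the second treated unit must settle for the next-closest score, 0.0055.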
Caliper Matching • Define a common support region • Suggested caliper: 1/4 of the standard deviation of the estimated propensity score • Randomly select one treated unit and match it to the control whose propensity score falls within the caliper qui predict phat sum phat Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- phat | 30878 .027301 .034677 .0000891 .7371861 Caliper: 0.034677/4 = 0.0086 Stata command: psmatch2, outcome(cost) pscore(phat) n(2) norep cal(0.0086)
Caliper Matching [Figure: for a treated unit with propensity score P, controls whose scores fall in the range P ± (1/4)·SD(P) are eligible matches; controls outside that band are excluded]
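The caliper rule can be sketched in Python. The scores are toy values (not the MarketScan `phat` output), and the greedy pairing is illustrative rather than psmatch2's algorithm:

```python
import statistics

def caliper_match(treated, controls, caliper):
    """Caliper matching: a treated unit is matched to its nearest
    control only if the score distance is within the caliper;
    otherwise it is left unmatched (outside common support)."""
    pairs = []
    for i, pt in enumerate(treated):
        j = min(range(len(controls)), key=lambda j: abs(controls[j] - pt))
        if abs(controls[j] - pt) <= caliper:
            pairs.append((i, j))
    return pairs

scores = [0.02, 0.03, 0.04, 0.10, 0.50]      # all estimated propensity scores
caliper = statistics.stdev(scores) / 4       # suggested 1/4 of the SD
treated = [0.03, 0.45]
controls = [0.031, 0.10]
pairs = caliper_match(treated, controls, caliper)  # only the first treated unit matches
```

The second treated unit's nearest control is 0.35 away, far outside the caliper, so it is dropped rather than matched badly.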
Radius and Kernel Matching • Radius matching • Each treated unit is matched only with the control units whose propensity score falls in a predefined neighborhood of the propensity score of the treated unit • Kernel matching • All treated are matched with a weighted average of all controls • Weights are inversely proportional to the distance between the propensity scores of treated and controls
Radius Matching Stata command: psmatch2, outcome(cost) pscore(phat) radius cal(0.005) [Figure: every control unit whose propensity score falls within radius r of a treated unit's score is picked as a match]
Kernel Matching Stata command: psmatch2, outcome(cost) pscore(phat) kernel [Figure: each treated unit is compared with a weighted average of controls; the weighted differences (1/d1)[C1T − C1C], (1/d2)[C2T − C2C], (1/d3)[C3T − C3C] show that the pair with the lowest distance d is the best match and gets the highest weight]
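The inverse-distance weighting in the figure can be sketched in Python. Note the hedge: psmatch2's kernel option actually uses a kernel function with a bandwidth; this simplified version uses the raw 1/d weights from the slide, on invented (pscore, outcome) pairs:

```python
def kernel_att(treated, controls):
    """Kernel-style matching as in the figure: each treated unit is
    compared to a weighted average of ALL controls, with weights
    inversely proportional to propensity-score distance."""
    effects = []
    for pt, yt in treated:
        w = [1.0 / max(abs(pc - pt), 1e-12) for pc, _ in controls]
        total = sum(w)
        # Weighted-average control outcome = estimated counterfactual
        counterfactual = sum(wi * yc for wi, (_, yc) in zip(w, controls)) / total
        effects.append(yt - counterfactual)
    return sum(effects) / len(effects)

treated = [(0.30, 110.0)]
controls = [(0.31, 100.0), (0.60, 50.0)]
att = kernel_att(treated, controls)
```

The nearby control (score 0.31) gets weight 100 versus 10/3 for the distant one, so the counterfactual sits close to 100 and the estimated effect close to 10.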
Mahalanobis Metric Matching • Randomly order subjects, then calculate the distance between the first treated subject and all controls, where the distance is d(i,j) = (u − v)' C⁻¹ (u − v) • u and v are the vectors of matching variables • C is the sample covariance matrix of the matching variables from the full set of control subjects • (D’Agostino, 1998, Statistics in Medicine) • Stata: psmatch2, outcome(cost) mah(age gender phat)
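The distance d(i,j) = (u − v)' C⁻¹ (u − v) can be computed directly for two matching variables, inverting the 2×2 covariance matrix in closed form. The (age, propensity score) values and the diagonal covariance are illustrative assumptions:

```python
def mahalanobis_sq(u, v, cov):
    """Squared Mahalanobis distance (u-v)' C^-1 (u-v) for a 2x2
    covariance matrix C, using the closed-form 2x2 inverse
    C^-1 = (1/det) * [[e, -b], [-c, a]]."""
    d0, d1 = u[0] - v[0], u[1] - v[1]
    (a, b), (c, e) = cov
    det = a * e - b * c
    return (d0 * (e * d0 - b * d1) + d1 * (-c * d0 + a * d1)) / det

u = [55.0, 0.03]                          # treated: (age, phat) -- toy values
v = [50.0, 0.02]                          # control candidate
cov = [[100.0, 0.0], [0.0, 0.0001]]       # diagonal: uncorrelated matching vars
d2 = mahalanobis_sq(u, v, cov)            # 25/100 + 0.0001/0.0001 = 1.25
```

With a diagonal C this reduces to summing each squared difference over its variance, which is why the age gap of 5 years and the score gap of 0.01 end up on a comparable scale.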
Outline of the Presentation • Why Use Propensity Score Matching? • Propensity Score Matching Defined • An Extended Example and Guidelines for Choosing the Most Appropriate Type • Limitations of Propensity Score Matching • Summary and Discussion
Example from MarketScan Database • MarketScan data are used to estimate the cost of illness for asthma patients • Details of the data set can be found at Baser (2005) • Treatment Group (Patients with the disease) = 1184 • Control Group (Patients without the disease)= 3169
Types of Propensity Score Matching Used • M1: Nearest Neighbor • M2: 2 to 1 • M3: Mahalanobis • M4: Mahalanobis with caliper • M5: Radius • M6: Kernel • M7: Stratified
Which One to Choose? • Asymptotically, all matching estimators end up comparing only exact matches • And therefore give the same answer • In a finite sample, however, the choice makes a difference! • The general tendency in the literature is to choose matching with replacement when the control data set is small • If it is large and evenly distributed, matching without replacement is better • Kernel, Mahalanobis, and radius matching work better with asymmetrically distributed, large control data sets • Stratification works better if we suspect unobservable effects in the matching
Quantifiable Criteria-1 C1. Two-sample t-statistics & chi-square tests between treatment and matched control observations • Insignificant values indicate a good match [Table: results for M1: Nearest Neighbor; M2: 2 to 1; M3: Mahalanobis; M4: Mahalanobis with caliper; M5: Radius; M6: Kernel; M7: Stratified]
Quantifiable Criteria-2 C2. Mean difference as a percentage of the average standard deviation: B = 100 × (x̄_T − x̄_C) / √((s²_T + s²_C)/2) • The lower the value, the better the match
Quantifiable Criteria-3 C3. Calculate the percent reduction in bias in the means of the explanatory variables after matching (A) relative to the initial bias (I): 100 × (1 − |B_A| / |B_I|) • The best value is 100, meaning a 100% reduction; the more the deviation from 100, the worse the match.
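Criteria C2 and C3 can be sketched in a few lines of Python. The three-observation age samples are invented toy data; the standardized-bias formula is the standard one from the matching literature (Rosenbaum & Rubin):

```python
import math
import statistics

def standardized_bias(xs_t, xs_c):
    """C2: mean difference as a percentage of the average standard
    deviation: 100 * (mean_T - mean_C) / sqrt((var_T + var_C) / 2)."""
    pooled_sd = math.sqrt((statistics.variance(xs_t) +
                           statistics.variance(xs_c)) / 2)
    return 100.0 * (statistics.mean(xs_t) - statistics.mean(xs_c)) / pooled_sd

def pct_bias_reduction(bias_initial, bias_after):
    """C3: percent reduction in bias; 100 means bias fully removed."""
    return 100.0 * (1.0 - abs(bias_after) / abs(bias_initial))

age_t = [60.0, 62.0, 64.0]            # treated group (toy)
age_c_initial = [50.0, 52.0, 54.0]    # controls before matching
age_c_matched = [59.0, 62.0, 65.0]    # controls after matching
b_i = standardized_bias(age_t, age_c_initial)   # large: groups unbalanced
b_a = standardized_bias(age_t, age_c_matched)   # zero: means now equal
reduction = pct_bias_reduction(b_i, b_a)        # 100.0
```

Matching here equalizes the group means exactly, so C3 reaches its best value of 100.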
Quantifiable Criteria-4 C4. Use the Kolmogorov-Smirnov test to compare treatment and matched-control density estimates for the explanatory variables • Insignificant values indicate a good match [Table: results for M1: Nearest Neighbor; M2: 2 to 1; M3: Mahalanobis; M4: Mahalanobis with caliper; M5: Radius; M6: Kernel; M7: Stratified]
Quantifiable Criteria-5 C5. Use the Kolmogorov-Smirnov test to compare density estimates of the propensity scores of control units with those of the treated units • Insignificant values indicate a good match [Table: results for M1: Nearest Neighbor; M2: 2 to 1; M3: Mahalanobis; M4: Mahalanobis with caliper; M5: Radius; M6: Kernel; M7: Stratified]
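The Kolmogorov-Smirnov statistic behind C4 and C5 is just the largest gap between two empirical CDFs, which can be sketched in pure Python (the score lists are toy inputs; a full test would also compute a p-value, e.g. via the common asymptotic cutoff D > 1.36·√((n+m)/(nm)) at the 5% level):

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)

    def ecdf(sorted_vals, x):
        # Fraction of observations <= x
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    points = sorted(set(xs + ys))
    return max(abs(ecdf(xs, p) - ecdf(ys, p)) for p in points)

# Identical score distributions -> statistic 0; disjoint ones -> 1
same = ks_statistic([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
disjoint = ks_statistic([0.1, 0.2], [0.8, 0.9])
```

For C5, `xs` and `ys` would be the propensity scores of the matched treated and control units; a small, insignificant statistic indicates a good match.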
Average Treatment Effect: Regression • So far we looked at the difference in means of expenditure between the matched treatment and control units • Now we run a regression over the matched groups; the independent variables are the treatment indicator plus the same variables used in the propensity score estimation • GLM with log link and gamma family Stata command: glm cost treatment age …, link(log) family(gamma) robust
Multivariate Regression After Propensity Score Matching • Is it necessary? • Results are at least as good as the ones after Propensity Score Matching alone • Tells us the marginal effect of each variable on the outcome measure glm cost treatment age …, link(log) family(gamma) robust • Increases efficiency - double filtering • Covers your mistakes!
Why do we need Propensity Score Matching if we run multivariate regression? - 1 • Argument against using regression: it fits parameters using a “global” method; many prefer “local” methods, where the regression function at a point is affected only, or mainly, by nearby observations • Dehejia and Wahba (1999) show that the propensity score matching approach leads to results that are close to the experimental evidence where the regression approaches failed • Since you don’t care about individual coefficients (the main purpose is classification), you can put as many variables as possible on the right-hand side in the logit runs, but doing the same in the multivariate analysis will cost you
The Counter-Argument: -Jeffrey M. Wooldridge, University Distinguished Professor, Fellow of Econometric Society, Author of “Introductory Econometrics” and “Econometric Analysis of Cross Section and Panel Data” … “as I often say, people who do data analysis want to do anything other than standard regression analysis. It’s too bad we have the mind set, as a careful regression with good controls and flexible functional forms is probably the best we can do in many cases. …what is a bit disconcerting is that these things take on a life of their own. It seems that once the method is blessed by a few smart people, there is no turning back!”
Outline of the Presentation • Why Use Propensity Score Matching? • Propensity Score Matching Defined • Guidelines for Choosing the Most Appropriate Type • An Extended Example • Limitations of Propensity Score Matching • Summary and Discussion
Limitations-1 • If the two groups do not have substantial overlap, substantial error may be introduced • Individuals that fall outside the region of common support have to be disregarded; information from those outside the common support could be useful, especially if treatment effects are heterogeneous • Possible solutions: 1. Analyze them separately 2. Calculate non-parametric bounds on the parameter of interest (Lechner, 2000)