Propensity Scores Making Sense of Non-Randomized Observational Data

Atul Sharma MD, MSc, FRCPC(ret) • Biostatistical Consulting Unit • April 2014

Propensity score 1996 - 2013

RCT – the gold standard • R.A. Fisher: The Design of Experiments, 1935 • Introduced the concept of randomization • Neyman-Rubin Counterfactual Framework • Rosenbaum and Rubin, 1983 • Formalizes conditions for a valid RCT • Goal: estimate the treatment effect in the ith individual, ti = Yi1 – Yi0, where • Yi1 = outcome of ith patient with treatment, Yi0 = outcome with control • Usually, we observe only one outcome, i.e. the “counterfactual” is hypothetical • This is the “fundamental problem of causal inference” -> randomization

Neyman-Rubin Theory Counterfactual Framework • With randomization, the average treatment effect is estimated by the difference in means or proportions between treatment & control groups, i.e. mean(Y | treated) – mean(Y | control) • Randomization ensures groups are balanced for all important covariates -> unbiased estimate of average treatment effect. Treatment assignment is ‘strongly ignorable’ • 1:1 randomization allows direct comparison of groups with simple statistics e.g. t-tests for means; RR, OR, NNT for proportions • With more complex study designs e.g. 1:k randomization, weighted averages still permit unbiased estimates of average treatment effect • All such estimates rely implicitly on knowing the probability of assignment to treatment group = propensity score (e.g. ½)
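A minimal R sketch of the counterfactual idea, using simulated data. The variable names, the sample size, and the true effect of 2 are assumptions for illustration and are not from the RHC study. Both potential outcomes are generated, only one is observed per subject, and under 1:1 randomization the difference in group means should land close to the true average treatment effect.

```r
set.seed(1)
n  <- 10000
y0 <- rnorm(n, mean = 10)       # potential outcome under control
y1 <- y0 + 2                    # potential outcome under treatment (true effect = 2)
z  <- rbinom(n, 1, 0.5)         # 1:1 randomization: propensity score = 1/2 for everyone
y  <- ifelse(z == 1, y1, y0)    # only one potential outcome is ever observed

# Difference in observed group means estimates the average treatment effect
ate_hat <- mean(y[z == 1]) - mean(y[z == 0])
ate_hat
```

Because assignment does not depend on y0 or y1, no adjustment is needed here; the later slides deal with the case where it does.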

Neyman-Rubin Theory Counterfactual Framework • Sometimes randomization fails: • Selection bias arises when there is an imbalance between groups for observed or unobserved covariates, and the probability of treatment now depends on covariates e.g. older or sicker patients more likely to receive treatment • Confounding occurs when there is an imbalance between treatment groups on observed or unobserved covariates that influence outcome e.g. older or sicker patients more likely to receive treatment and more likely to die • Potential confounders: covariates correlated with both treatment and outcomes

Neyman-Rubin Theory Counterfactual Framework • Imbalance may be due to failure of randomization, non-compliance, unblinded treatment assignment. • Imbalance also arises as a result of non-randomized treatment assignment chosen by patient/physician or institutional/social policies, i.e. ‘natural experiments’ or ‘quasi-experimental designs’ • A non-randomized study can yield unbiased estimates of treatment effects

Conventional methods for adjusting for imbalance: • Stratification (limited to a few key covariates e.g. age, gender). For example, stratifying on 10 binary factors (e.g. disease history) -> 2^10 = 1024 strata • Matching case-control (also limited to a few key covariates). Given covariates, treatment assignment is ‘strongly ignorable’ • Regression adjustment: models effects of confounders on outcome directly • Can adjust for multiple confounders, but effectiveness depends on correctly specifying the model e.g. non-linear terms, interactions, etc. Misspecification may actually worsen bias • Overfitting may improve fit with current data (‘internal validity’), but limit generalizability to larger population (‘external validity’). Principle of parsimony limits model complexity • Almost impossible to assess effectiveness of ‘bias control’ (unlike PS)

Balancing Property of the Propensity Score: • PS is the probability of assignment to treatment given confounders; it summarizes all observed confounders in a single number (scalar) • The ‘balancing property’ tells us that conditioning on the PS by stratifying, matching or regression adjustment will improve balance between groups • Alternatively, the ‘fundamental theorem’: as sample size increases, treatment assignment is independent of observed covariates after conditioning on the propensity score (Rosenbaum and Rubin, 1983) • Given PS, treatment assignment is said to be ‘strongly ignorable’

Propensity scores in the real world: • Advantages: • PS estimated by familiar methods (e.g. logistic regression) • Balances multiple covariates simultaneously • Model of treatment assignment applies only to current data, so external validity is not a concern • Similarly, less need for parsimony. Over-fitting is not a problem if balancing works • After conditioning on PS, balance can be assessed easily and directly • Caveats: • Unlike randomization, no assurance that unobserved covariates will balance • You can only balance observed/measured covariates • Works best with large N, as imbalance may be unavoidable with small N • Take care to avoid covariates that are really treatment outcomes (mediators). Adjusting for covariates that are affected by treatment may worsen bias

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 • Dataset and codebook available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 • RCT is often impractical for ethical, logistic, financial reasons e.g. “the popularity of this procedure and the widespread belief that it is beneficial makes the performance of an RCT difficult” • Treatment selection at discretion of MD, likely to be confounded with patient characteristics related to the outcome e.g. patients with low BP more likely to receive RHC and more likely to die • 7 critical care specialists identified 50 variables likely to influence treatment • Dependent variable is RHC in first 24h (2184 cases vs 3551 controls) • Independent variables are age, sex, race, education, income, medical insurance, 9 primary disease categories, 10 admission diagnoses, 12 comorbid conditions, pre-admission ADL and DASI, day 1 DNR, cancer, 2 mo. survival probability, APACHE III APS score, Glasgow Coma Score, weight, temperature, BP, respiratory rate, heart rate, PaO2/FiO2, PaCO2, pH, WBC, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output • Great care taken to identify all confounders that might bias outcomes through literature review, expert panels

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Pre-treatment balance: 8 select covariates • How do we compare groups that are fundamentally different?

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 Kaplan Meier Survival, death by day 30

A.F. Connors et al. The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. JAMA 1996, Sep 18; 276(11): 889-97 • Cox proportional hazards regression • λ(t) = hazard function = instantaneous risk of death at time t given survival until t • log λ(t) = log λ0(t) + b1 × RHC • e^b1 = hazard ratio for RHC • Unadjusted Cox PH model: • coxph(survival~swang1, data=rhc) • Given selection biases, patients with RHC are sicker (lower BP, higher creatinine, higher APACHE scores), so the unadjusted HR likely overestimates the true risk of treatment

Estimating propensity scores • Logistic regression for binary outcome e.g. RHC vs No RHC (swang1) • log(p/(1-p)) = b0 + b1x1 + ... + bkxk • LHS p = probability of treatment assignment = propensity score • RHS covariates may be numeric or categorical (0/1 dummy variables) • b’s or OR = e^b measure impact of each predictor on probability of RHC • For each subject, fitted value of p or linear predictor may be used as PS

Estimating propensity scores • Logistic regression: RHC+/RHC- adjusted for 50 risk factors • model1 = glm(swang1 ~ age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 + wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc, family=binomial) • prop_score = fitted.values(model1) • 30 of 50 were significant independent predictors of treatment (p < 0.05) • OR = e^b measures impact of predictors on probability of RHC • In general, sicker patients more likely to undergo RHC
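The same estimation step can be tried on a small simulated dataset. This is only a sketch: the data frame toy, its two covariates, and the coefficients used to generate treatment are invented for illustration and do not reproduce the RHC data. It shows the two forms of the score mentioned on the previous slide (fitted probability and linear predictor) and the odds ratios e^b.

```r
set.seed(2014)
n      <- 2000
age    <- rnorm(n, mean = 60, sd = 10)
meanbp <- rnorm(n, mean = 80, sd = 20)

# Hypothetical assignment mechanism: older patients and those with lower BP
# are more likely to be treated
p_true <- plogis(-0.5 + 0.03 * (age - 60) - 0.02 * (meanbp - 80))
treat  <- rbinom(n, 1, p_true)
toy    <- data.frame(treat, age, meanbp)

# Treatment model: logistic regression of treatment on the covariates
ps_model <- glm(treat ~ age + meanbp, data = toy, family = binomial)

toy$prop_score <- fitted.values(ps_model)          # PS on the probability scale
toy$lin_pred   <- predict(ps_model, type = "link") # PS on the log-odds scale
odds_ratios    <- exp(coef(ps_model))              # OR = e^b for each predictor
odds_ratios
```

Either prop_score or lin_pred can serve as the score in later steps; the linear predictor is often preferred for matching because it is not compressed near 0 and 1.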

Estimating propensity scores Does LR model for PS adequately discriminate treatment groups? Check ROC • Data: 3551 controls < 2184 cases • Area under the curve: 0.80 • 95% CI: 0.79-0.81 • If discrimination inadequate, refine model: • Add covariates, nonlinear terms, interactions • Alternate PS estimation procedure • Probit regression • Discriminant analysis
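The area under the ROC curve can also be computed without extra packages, since it equals the probability that a randomly chosen treated subject has a higher score than a randomly chosen control. The helper below, auc_rank, is a hypothetical name introduced here; it uses the rank-sum (Mann-Whitney) identity and is a sketch, not the procedure used to produce the figures on the slide.

```r
# AUC via the rank-sum identity:
#   AUC = (sum of ranks of treated scores - n1 * (n1 + 1) / 2) / (n1 * n0)
# score:   numeric propensity scores
# treated: 0/1 indicator of treatment
auc_rank <- function(score, treated) {
  r  <- rank(score)                 # mid-ranks are used for ties
  n1 <- sum(treated == 1)
  n0 <- sum(treated == 0)
  (sum(r[treated == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Perfect separation of treated and controls
auc_rank(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))
# Treated scores at both extremes: no discrimination on average
auc_rank(c(1, 2, 3, 4), c(1, 0, 0, 1))
```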

Estimating propensity scores: • Compare frequency distributions • PS should discriminate, but overlap is essential!

Using Propensity Scores to balance covariates in observational data: • 1. Stratify by quintile (Rosenbaum and Rubin, 1983): • Each quintile analyzed separately as if from RCT (~meta-analysis) • With perfect stratification, average treatment effect across strata is unbiased estimate of true treatment effect • 5 groups eliminate 90% of bias from observed confounders • 2. Covariate (regression) adjustment: outcome vs both treatment and PS quintile • Continuous numeric outcomes: ordinary linear regression • Binary categorical outcomes: logistic regression • Survival analysis: Cox proportional hazards regression • 3. Match cases and controls by PS • Compare groups as if from RCT • Almost always reduces sample size (unlike 1, 2)

Stratify by Propensity Score: • As few as 5 strata can eliminate 90% of bias (Rosenbaum and Rubin 1983) • centiles = quantile(prop_score, probs=c(0,0.2,0.4,0.6,0.8,1)) • ps_quintiles = cut(prop_score, centiles, include.lowest=TRUE) • Balancing property of propensity score: • Like randomization, treated and control patients in each PS quintile have similar distribution of measured covariates • Unlike randomization, does not mean they will be similar for unobserved covariates • Does not imply that pairs matched by PS will be similar on covariates

Assess balance by quintile:

Are covariates related to RHC after PS adjustment? • Covariate balance after stratification must be assessed formally • Simple t-tests affected by sample-size differences (N=5735 vs 1147) • Differences between RHC+/RHC- should be tested using methods unaffected by sample size • Recommended : Regress each covariate on RHC ± PS quintile • covariate ~ treatment ± ps_quintile • If covariate unrelated to RHC after adjustment for PS, strata balanced

Are covariates related to RHC after PS adjustment? • Linear regression: age vs RHC status ± PS quintile (categorical, 5 levels): • lm(age~swang1, data=rhc) • lm(age~swang1 + ps_quintiles, data=rhc) • Logistic regression: sex vs RHC status ± PS quintile: • glm(sex~swang1, family = binomial, data=rhc) • glm(sex~swang1 + ps_quintiles, family = binomial, data=rhc) • After adjusting for PS quintile, cases and controls are well balanced. • May need to refine treatment model by adding covariates, non-linear terms, interactions - iteratively until balanced
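A small simulated illustration of this balance check, with a single invented confounder (age) and a hypothetical assignment rule. The coefficient on treatment in the covariate regression is the between-group difference in mean age; it should shrink markedly once PS quintile is added.

```r
set.seed(7)
n     <- 5000
age   <- rnorm(n, mean = 60, sd = 10)
treat <- rbinom(n, 1, plogis((age - 60) / 10))   # older patients treated more often

# Estimate PS and cut into quintiles
ps   <- fitted(glm(treat ~ age, family = binomial))
ps_q <- cut(ps, quantile(ps, probs = seq(0, 1, 0.2)), include.lowest = TRUE)

# covariate ~ treatment, without and with PS quintile
unadj <- coef(lm(age ~ treat))["treat"]
adj   <- coef(lm(age ~ treat + ps_q))["treat"]
c(unadjusted = unname(unadj), adjusted = unname(adj))
```

With real data the same two-model comparison would be repeated for each covariate, using glm() with family = binomial for binary covariates.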

Stratify by Quintile : Analyze each separately • Unadjusted Cox PH model applied to each stratum • Sometimes meaningful to report strata individually (distinct profiles, outcomes) • Average across strata (HR=1.24) is an ‘unbiased estimate of true treatment effect’ • For most analyses, formulae exist for calculating pooled significance tests adjusted for sample sizes and variances in each stratum • coxph(survival ~ swang1+strata(ps_quintiles), data=rhc)
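The idea of averaging stratum-specific effects can be sketched with a continuous outcome and simulated data (a simpler setting than the Cox model on the slide). Everything here is an assumption for illustration: one confounder x, a true treatment effect of 1, and strata weighted by their size.

```r
set.seed(42)
n     <- 5000
x     <- rnorm(n)                        # confounder
treat <- rbinom(n, 1, plogis(x))         # treatment depends on x
y     <- 1 * treat + 2 * x + rnorm(n)    # true treatment effect = 1

ps <- fitted(glm(treat ~ x, family = binomial))
q  <- cut(ps, quantile(ps, probs = seq(0, 1, 0.2)), include.lowest = TRUE)

# Naive comparison ignores confounding
naive <- mean(y[treat == 1]) - mean(y[treat == 0])

# Treated-minus-control difference within each PS quintile
by_stratum <- sapply(split(data.frame(y, treat), q), function(d) {
  mean(d$y[d$treat == 1]) - mean(d$y[d$treat == 0])
})

# Pool across strata, weighting each stratum by its size
stratified <- weighted.mean(by_stratum, w = as.numeric(table(q)))
c(naive = naive, stratified = stratified)
```

Five strata do not remove all of the bias, so the stratified estimate is expected to sit much closer to 1 than the naive one without matching it exactly.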

Stratify by Quintile : Analyze each separately • Unadjusted Cox model applied to each stratum • Individual strata can be reported separately (sometimes meaningful) • coxph(survival~ swang1, data=rhc3) PS Quintile 3


Regression Adjustment using propensity scores • Regression of outcome on treatment with adjustment for PS quintile • Cox proportional hazards models: • Unadjusted : coxph(survival~swang1, data=rhc) • Quintile PS : coxph(survival~swang1+ps_quintiles, data=rhc) • Linear PS : coxph(survival~swang1+prop_score, data=rhc)

Regression Adjustment using propensity scores • In weighted regression, underrepresented subjects are given greater weight so that the weighted sample is more representative • Inverse-Probability-of-Treatment Weights (IPTW) • treatment effect in a population whose risk factor distribution equals that of all study subjects • regression weights : 1/PS for treated and 1/(1-PS) for controls • iptw=ifelse(swang1==TRUE, 1/prop_score, 1/(1-prop_score)) • Standardized Mortality Ratio (SMR) Weights • treatment effect in a population whose risk factor distribution equals that of treated subjects • regression weights : PS/(1-PS) for controls and 1 for treated • smr=ifelse(swang1==TRUE, 1, prop_score/(1-prop_score)) • Weighted regression • coxph(survival ~ swang1, weights=smr, data=rhc) • Results disappointing: • Freedman and Berk, 2008: better just to fit the model without weights
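A tiny worked example of the two weighting schemes, using four hypothetical subjects with made-up propensity scores. A treated subject with a low PS (an "unexpected" treatment) receives a large IPTW weight, as does a control with a high PS; SMR weights leave treated subjects at 1 and reweight controls toward the treated population.

```r
# Hypothetical subjects: two treated, two controls
treated    <- c(TRUE, TRUE, FALSE, FALSE)
prop_score <- c(0.80, 0.25, 0.80, 0.25)

# IPTW: 1/PS for treated, 1/(1 - PS) for controls
iptw <- ifelse(treated, 1 / prop_score, 1 / (1 - prop_score))

# SMR weights: 1 for treated, PS/(1 - PS) for controls
smr <- ifelse(treated, 1, prop_score / (1 - prop_score))

data.frame(treated, prop_score, iptw, smr)
```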

Regression Adjustment without propensity scores • Adjusted Cox PH model: model effect of confounders on outcome directly • coxph(survival ~ swang1 + age + sex + race + edu + income + ninsclas + cat1 + das2d3pc + dnr1 + ca + surv2md1 + aps1 + scoma1 + wtkilo1 + temp1 + meanbp1 + resp1 + hrt1 + pafi1 + paco21 + ph1 + wblc1 + hema1 + sod1 + pot1 + crea1 + bili1 + alb1 + resp + card + neuro + gastr + renal + meta + hema + seps + trauma + ortho + cardiohx + chfhx + dementhx + psychhx + chrpulhx + renalhx + liverhx + gibledhx + malighx + immunhx + transhx + amihx, data=rhc) • Misspecification and over-fitting may limit generalizability (external validity) • Difficult to assess adequacy of balance when modeling outcome directly • May provide reassuring confirmation of bias adjustment by PS


Propensity score matching: • Create matched case-control pairs by matching on PS • Confirm balance before and after matching by comparing treatment groups with procedure independent of sample size (e.g. standardized differences). If inadequate, refine PS model or matching procedure iteratively until it works • Compare groups as if from randomized study • Caveats: • ‘Balancing property’ of PS does not ensure matched pairs will match on covariates, only groups on average • Matching constraints usually reduce sample size (stratification and regression adjustment use all cases and controls) • Requires specialized software

Propensity score matching: • Matching on PS: • Nearest neighbour: matches each case to closest control • Caliper matching: only if PS within a pre-specified caliper distance, typically 0.2 SDs • Matching on original covariates • Minimize sum of distances between cases and controls • Euclidean distance • Mahalanobis distance weights the contributions by inverse covariance, to avoid over-counting correlated measures; generally preferable • Exact matching for categorical variables e.g. gender, disease status

Propensity score matching: • Options include: • matching ± replacement (re-use controls; less bias, more analysis) • one:many, many:one or variable ratio matching • greedy (closest) vs optimal matching (minimizes sum of distances) • almost exact matching, fine balance (without matching) • Suggested approach: • start with nearest neighbor matching without replacement • ± caliper ≤ 0.20 SDs • ± exact matching for priority covariates (gender, disease) • check balance using standardized differences • consider optimal matching (effective, but computationally intensive) • experiment with other options until cases and controls balanced • may need to use specialized packages (Stata, R)

Propensity score matching in R: • Matching (http://sekhon.berkeley.edu/matching): Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software 42(7). • Uses automated procedure to select matches, based on univariate and multivariate balance diagnostics. Primarily 1:M matching (where M is a positive integer), allows matching with or without replacement, caliper, exact • MatchIt (http://gking.harvard.edu/matchit): Ho, D.E., Imai, K., King, G., and Stuart, E.A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software 42(8). • Two-step process: does matching, then user does outcome analysis. Wide array of estimation procedures and matching methods available: nearest neighbor, Mahalanobis, caliper, exact, full, optimal, subclassification • optmatch (http://cran.r-project.org/web/packages/optmatch/index.html): Hansen, B.B., and Fredrickson, M. (2009). optmatch: Functions for optimal matching. • Variable ratio, optimal, and full matching.

Propensity score matching in Stata: • psmatch2 (http://ideas.repec.org/c/boc/bocode/s432001.html): Leuven, E. and Sianesi, B. (2003). Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. • Allows k:1 matching, kernel weighting, Mahalanobis matching • pscore (http://www.lrz-muenchen.de/~sobecker/pscore.html): Becker, S.O. and Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata Journal 2(4): 358-377. • k:1 matching, radius (caliper) matching, and stratification (subclassification) • match (http://www.economics.harvard.edu/faculty/imbens/software_imbens): Abadie, A., Drukker, D., Herr, J. L., and Imbens, G. W. (2004). Implementing matching estimators for average treatment effects in Stata. The Stata Journal 4(3): 290-311. • Primarily k:1 matching (with replacement) • cem (http://gking.harvard.edu/cem/): Iacus, S.M., King, G., and Porro, G. (2008). Matching for Causal Inference Without Balance Checking. • Implements coarsened exact matching

Propensity score matching in SAS: • Local and global optimal propensity score matching: Coca-Perraillon, M. (2007). Local and global optimal propensity score matching. In SAS Global Forum 2007, Paper 185-2007. • Variety of matching methods. No built-in diagnostics. • Greedy matching (1:1 nearest neighbor): Parsons, L. S. (2001). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. In SAS SUGI 26, Paper 214-26. • Parsons, L.S. (2005). Using SAS software to perform a case-control match on propensity score in an observational study. In SAS SUGI 30, Paper 225-25. • Kosanke, J., and Bergstralh, E. (2004): Match 1 or more controls to cases using the GREEDY algorithm • 1:1 Mahalanobis matching within propensity score calipers: Feng, W.W., Jun, Y., and Xu, R. (2005). A method/macro based on propensity score and Mahalanobis distance to reduce bias in treatment comparison in observational study. www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf

Propensity Score Matching • Using the R Match() function to implement • Nearest neighbour match on PS with caliper = 0.1 (PS within 0.025) • Exact matching on gender, admission diagnosis (9 disease categories) • Original number of observations: 5735 • Original number of treated obs: 2184 • Matched number of observations: 1461 • Matched number of observations (unweighted): 1461 • Number of obs dropped by 'exact' or 'caliper': 723 cases, 2090 controls • Balance checks before and after matching are mandatory • Test for balance must be independent of sample size • For a numeric covariate x, the standardized mean difference scales the difference in group means by a standard deviation; for a proportion p, the difference in proportions is scaled analogously • Most authors equate balance with a standardized mean difference < 10% • Full range of pre- and post-match test statistics output by MatchBalance()
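One common way to compute standardized differences is sketched below. The function names std_diff and std_diff_prop are hypothetical, and the pooled-variance denominator is an assumption: it is a widely used convention, but software packages differ (some scale by the treated-group standard deviation instead), so values need not agree exactly with MatchBalance() output.

```r
# Standardized mean difference (in percent) for a numeric covariate,
# scaled by the pooled standard deviation of the two groups
std_diff <- function(x, treated) {
  x1 <- x[treated]
  x0 <- x[!treated]
  100 * (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
}

# Analogous quantity (in percent) for a binary covariate,
# given the proportion in treated (p1) and controls (p0)
std_diff_prop <- function(p1, p0) {
  100 * (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
}

# Toy check: group means 2 vs 3, both variances 1
std_diff(c(1, 2, 3, 2, 3, 4), c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
std_diff_prop(0.6, 0.4)
```

Because neither formula involves the sample size, the 10% rule of thumb can be applied equally before and after matching.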

Propensity Score Matching • Output of MatchBalance() function includes tests for equality, treatment vs controls • means, standardized mean differences (< 10%) • variances (treatment vs controls) (0.5-2.0, Rubin 2001) • empiric quantiles (quantile-quantile plot) • empiric cumulative distribution function
***** (V5) meanbp1 *****
                        Before Matching    After Matching
mean treatment........       68.198            73.523
mean control..........       84.869            73.673
std mean diff.........      -48.685            -0.42029
mean raw eQQ diff.....       16.673             1.6547
med  raw eQQ diff.....       11                 1
max  raw eQQ diff.....       49                74
mean eCDF diff........        0.092555          0.0092787
med  eCDF diff........        0.06949           0.0058179
max  eCDF diff........        0.21172           0.039014
var ratio (Tr/Co).....        0.77589           1.0215
T-test p-value........     < 2.22e-16           0.90308
KS Naive p-value......     < 2.22e-16           0.21612
KS Statistic..........        0.21172           0.039014

Propensity Score Matching: Assessing matching balance dot chart effective for display of many simultaneous covariates Treatment and control groups balance when Standardized Mean Differences < 10%


Propensity Score Matching: Assessing matching balance Equality of distributions may be assessed by QQ plots, eCDF, KS statistic Quantile – quantile plot: Creatinine(mmol/l) Empiric CDF: Creatinine (mmol/l) Formally compared via KS statistic

Propensity Score Matching Cox PH regression: 1461 matched pairs coxph(survival ~ swang1, data = rhcm)

Testing the fundamental assumption: “Estimated treatment effects are unbiased if there are no unobserved confounders i.e. all relevant covariates have been included” • Sensitivity Analysis seeks to test robustness of conclusions to hidden biases from unobserved confounders • Connors. 1996: • Asked 13 practicing clinicians to identify 10 variables with the greatest influence on their decision to use RHC in practice. None missing in study • From logistic regression, identified 4 physiologic variables with the largest effect on the decision to treat: PaO2/FiO2, mean BP, pulse, and resp. rate. Individually omitted from PS model -> hazard ratio ± 0.01 • Formal ‘Rosenbaum bounds’ to estimate how much hidden bias would need to be present to invalidate conclusions, implemented in rbounds libraries for R, Stata.

Rosenbaum Sensitivity Test for matched pairs: • Matched patients have same PS and same probability of treatment (RHC) if no unmeasured confounders • Assume true probabilities differ due to an unmeasured confounder that correlates perfectly with outcome (worst-case scenario). Let Γ (Gamma) = odds ratio for its effect on treatment assignment (in logistic regression for PS) • How large does Γ need to be to invalidate the conclusion that RHC increased mortality? • For binary outcome: 2 x 2 table for survival in 1461 matched pairs
                  RHC dies   RHC lives
Control dies        651        274
Control lives       356        180
Discordant outcomes in 274 + 356 = 630 of 1461 matched pairs • McNemar test for matched pairs: Under H0, how likely is it that the two discordant cells (274 vs 356) arose from the same binomial distribution by chance?
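The McNemar test on these counts can be run with base R's mcnemar.test(). The sketch below enters the four cell counts from the slide; the object names are illustrative. Only the discordant cells (274 and 356) contribute to the statistic, and R applies a continuity correction by default.

```r
# Matched-pair outcomes: rows = control patient, columns = RHC patient
pairs_tab <- matrix(c(651, 356, 274, 180), nrow = 2,
                    dimnames = list(control = c("dies", "lives"),
                                    rhc     = c("dies", "lives")))

mc <- mcnemar.test(pairs_tab)   # continuity-corrected chi-square, 1 df
mc$statistic                    # (|274 - 356| - 1)^2 / (274 + 356)
mc$p.value
```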

Rosenbaum Sensitivity Test for matched pairs: • For binary outcome: 2 x 2 table for survival in 1461 matched pairs
                  RHC dies   RHC lives
Control dies        651        274
Control lives       356        180      ← McNemar p = 0.001
• binarysens(274,356,Gamma=1.2,GammaInc=0.02) ← varies Γ -> effect on McNemar p-value
Rosenbaum Sensitivity Test
p-value of estimate .... 0.0006
Gamma   Lower bound   Upper bound
1.00      0.00062       0.00062   ← no confounding by unobserved covariate
1.02      0.00025       0.00143
1.04      0.00010       0.00308
1.06      0.00004       0.00619
1.08      0.00001       0.01170
1.10      0.00000       0.02082
1.12      0.00000       0.03503
1.14      0.00000       0.05590   ← McNemar p > 0.05
An unmeasured confounder would need to increase the odds of RHC by a factor of about 1.14 for the increase in mortality with treatment to lose statistical significance
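The logic behind these bounds can be sketched in base R. Under hidden bias of at most Γ, the chance that the treated member of a discordant pair is the one who dies lies between 1/(1+Γ) and Γ/(1+Γ), so the one-sided p-value is bracketed by two binomial tail probabilities. This is an assumption about how such bounds are usually computed, written from the textbook definition rather than copied from the rbounds source; the object names are illustrative.

```r
only_rhc_dies     <- 356                      # discordant pairs: RHC patient dies
only_control_dies <- 274                      # discordant pairs: control dies
n_disc <- only_rhc_dies + only_control_dies   # 630 discordant pairs

gamma <- seq(1, 1.2, by = 0.02)

# P(at least 356 of 630 discordant pairs have the RHC death), evaluated at the
# most and least favourable success probabilities allowed by Gamma
upper <- pbinom(only_rhc_dies - 1, n_disc, gamma / (1 + gamma), lower.tail = FALSE)
lower <- pbinom(only_rhc_dies - 1, n_disc, 1 / (1 + gamma), lower.tail = FALSE)

bounds <- data.frame(gamma, lower = round(lower, 5), upper = round(upper, 5))
bounds
```

The upper bound is the one that matters for the conclusion: the first Γ at which it exceeds 0.05 is the amount of hidden bias that would be enough to explain away the observed excess mortality.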

Sensitivity analysis: Interpreting Rosenbaum bounds • Becker et al, 2007: “These are worst case scenarios…a critical value of Γ = 1.15 does not mean there is no effect of treatment on the outcome. However, the results are sensitive to deviations from the unconfoundedness assumption, and we advise some caution when interpreting” • Connors et al used a ‘less worst-case’ scenario to estimate the Γ needed to ‘invert’ the HR, i.e. reduce it from 1.2 to 0.8, and cited a 3-fold increase in odds of RHC due to unobserved confounders. • rbounds now the standard measure of sensitivity to hidden biases in matched studies • psens() performs a similar analysis for numeric outcomes based on the Wilcoxon signed-rank test (≠ McNemar) • On Friday, we’ll demonstrate both binarysens() and psens() to perform sensitivity analysis for different outcomes

Devices to increase robustness • Multiple control groups: • consistency between control groups that differ in unobserved confounders • e.g. different ways to escape treatment e.g. declined vs denied treatment • Coherence among outcomes: • e.g. length or cost of hospitalization rather than survival, looking for consistency • subgroup analysis: looking for consistency or plausible variations e.g. elderly, women, whites, shock/sepsis/ post-op comparable • Dose response relationships: looking for a consistent effect • e.g. SVC pressure line vs. full RHC • Known effects • e.g. outcomes known to be unaffected by treatment

A brief review • Quasi-experimental designs require as much care and attention as an RCT • Carefully identify all potential confounders by • Clinical judgment • Literature review • Expert panels • Prioritize confounders to identify those which must be balanced (a priori) • Avoid controlling for post-treatment covariates • After estimating PS, make sure treatment model discriminates. Add covariates, non-linear terms, or interactions until it does. Make sure that group scores overlap • After conditioning on PS by stratification, regression, or matching, make sure that groups are balanced using measures independent of sample size. If needed, revise the treatment model or tighten up matching criteria • Assess robustness of conclusions to hidden bias due to unobserved confounders: • Have external panel of experts review choice of predictors • Re-analyze after omitting important confounders • Estimate Rosenbaum bounds i.e. Γ = how much hidden bias needed to invalidate conclusions
