Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics

Comparing high-dimensional propensity score versus lasso variable selection for confounding adjustment in a novel simulation framework Jessica Franklin Instructor in Medicine Division of Pharmacoepidemiology & Pharmacoeconomics Brigham and Women’s Hospital and Harvard Medical School QMC, Department of Quantitative Health Sciences University of Massachusetts Medical School April 15, 2014

Background • Administrative healthcare claims data are a popular data source for nonrandomized studies of interventions. • Because treatments are not randomized, addressing confounding is the primary methodological challenge.

Claims Data • Comprehensive claims databases contain information on patient insurance enrollment and demographics, as well as every healthcare encounter, including: • Diagnoses • Procedures • Hospitalizations • Medications dispensed • Dates of encounters provide a complete longitudinal record of patients’ healthcare interactions.

New user design • Potential confounders are measured prior to initiation of exposure. • Active treatment comparator group reduces biases associated with non-user comparators. • [Timeline figure: covariates are assessed before exposure initiation; follow-up for outcome events runs from exposure initiation until the end of data or the end of enrollment.]

Principles of variable selection • Brookhart et al. (2006) showed that the best PS model is the model that includes all predictors of outcome (regardless of whether they are associated with exposure). • Pearl (2010) and Myers et al. (2011) further noted that including instrumental variables (IVs) can increase bias from unmeasured confounding. • IVs are associated with exposure, but not associated with outcome except through exposure.

hd-PS variable selection • The high-dimensional propensity score (hd-PS) algorithm screens thousands of diagnoses, medications, and procedure codes and ranks variables according to likelihood of confounding. • Relies on the idea that a large number of “proxy” variables can reduce bias from unmeasured confounding. • Empirical evidence has shown a reduction in bias.

Shrinkage methods • Greenland (2008) suggested regularization methods as preferable to variable selection. • Shrinking coefficients allows for efficient estimation, even in models with many degrees of freedom. • Lasso regression provides both shrinkage and principled variable selection. • Shrinkage allows for direct modeling of the outcome even with many potential confounders. • Some coefficients are shrunk all the way to 0.

Objective • To compare the performance of • hd-PS variable selection • Ridge regression of the outcome on all potential confounders • Lasso regression of the outcome on all potential confounders • The goal is maximum reduction in confounding bias.

Comparing high-dimensional methods • How can we answer this question? • Empirical studies are useful when we “know” the true treatment effect, but even then we can’t determine the contributions of bias and variance to overall error. • Ordinary simulation techniques with completely synthetic data cannot capture the complex correlation structure among covariates in claims data.

Plasmode simulation • We start with a real empirical cohort study: • 49,653 patients • Exposed to either ns-NSAIDs or Cox-2 inhibitors (X) • Followed for gastrointestinal events (Y) • Pre-defined covariates include age, sex, race, and 16 diagnosis/medication/procedure variables (C1) • To get reasonable values for associations between covariates and outcome, we estimated a model with: • Y ~ X + all pre-defined covariates + interactions between age and binary covariates

Simulation setup • True outcome generation model: • Estimated coefficient values from the observed outcome model • Except for the coefficient on exposure: . • To create simulated datasets: • Sample rows with replacement from (X, C). • Calculate the outcome probability under the generation model for each patient in the sample. • Simulate the outcome. • We created 500 datasets, each of size 30,000, with outcome prevalence set to 5% and exposure prevalence set to 40%.
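The three simulation steps can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`expit`, `calibrate_intercept`, `simulate_plasmode`), the toy cohort, the covariate coefficients, and the exposure coefficient of 0.3 are all hypothetical, and the sketch assumes a logistic outcome model whose intercept is tuned by bisection to reach the target outcome prevalence. It does not reproduce the 40% exposure prevalence.

```python
import math
import random


def expit(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(lin_preds, target_prev, lo=-20.0, hi=20.0):
    """Bisection search for the intercept that gives the target mean probability."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_p = sum(expit(mid + lp) for lp in lin_preds) / len(lin_preds)
        if mean_p < target_prev:
            lo = mid  # prevalence too low, so the intercept must increase
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_plasmode(rows, coefs, beta_x, n, outcome_prev, rng):
    """rows: list of (x, c) pairs, x in {0, 1}, c a list of covariate values.
    coefs: outcome-model coefficients for c; beta_x: chosen exposure coefficient."""
    # Step 1: resample observed (X, C) rows with replacement, so the real
    # correlations among exposure and covariates are preserved.
    sample = [rows[rng.randrange(len(rows))] for _ in range(n)]
    # Step 2: outcome probability for each sampled patient.
    lin = [beta_x * x + sum(b * v for b, v in zip(coefs, c)) for x, c in sample]
    b0 = calibrate_intercept(lin, outcome_prev)
    probs = [expit(b0 + lp) for lp in lin]
    # Step 3: draw a binary outcome from each probability.
    y = [1 if rng.random() < p else 0 for p in probs]
    return sample, probs, y


# Toy cohort with hypothetical values, only to show the call.
rng = random.Random(0)
toy_rows = [(rng.randint(0, 1), [rng.random(), rng.randint(0, 1)])
            for _ in range(200)]
sample, probs, y = simulate_plasmode(toy_rows, [0.8, 0.5], 0.3, 1000, 0.05, rng)
```

Because only the outcome is synthetic, every simulated dataset keeps the covariate structure of the source cohort while the true exposure effect is known by construction.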

True causal diagram • Any variables associated with exposure remain associated with exposure. • Any correlations among covariates and true confounders remain intact. • Associations with outcome are determined by the chosen simulation model. • C1 = true confounders, a subset of C = all measured covariates. • [Diagram with nodes C, C1, X, Y]

Outcome generation

The mechanics of hd-PS • For each diagnosis, procedure, and medication code, hd-PS creates 3 potential variables: • Code observed ≥ 1 time during baseline period • Code observed ≥ median number of times • Code observed ≥ 75th percentile number of times • There are 2 potential ranking methods: • Exposure-based: A simple RR association measure between exposure and each variable. • Bias-based: Bross’s bias formula, which considers the association of each variable with both exposure and outcome.
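The variable construction and bias-based ranking can be sketched as follows. This is an illustrative sketch, not the hd-PS software: the function names and toy data are hypothetical, the median and 75th percentile are assumed to be taken among patients who have the code at least once (using a nearest-rank percentile), and the bias term uses the basic form of Bross's formula, [P_C1 (RR_CD - 1) + 1] / [P_C0 (RR_CD - 1) + 1], with variables ranked by |log(bias)|.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in (0, 100]) of a non-empty sorted list."""
    k = max(1, math.ceil(q / 100.0 * len(sorted_vals)))
    return sorted_vals[k - 1]


def frequency_indicators(counts):
    """Turn per-patient baseline counts of one code into three binary variables."""
    positive = sorted(c for c in counts if c > 0)
    if not positive:
        return None  # the code never occurs, so it yields no variables
    med = percentile(positive, 50)
    p75 = percentile(positive, 75)
    once = [1 if c >= 1 else 0 for c in counts]
    sporadic = [1 if c >= med else 0 for c in counts]
    frequent = [1 if c >= p75 else 0 for c in counts]
    return once, sporadic, frequent


def bross_bias(var, exposure, outcome):
    """Bross's multiplicative bias term for one binary variable.
    Assumes every group used as a denominator is non-empty and has events."""
    n1 = sum(exposure)
    n0 = len(exposure) - n1
    p_c1 = sum(v for v, x in zip(var, exposure) if x == 1) / n1  # prevalence, exposed
    p_c0 = sum(v for v, x in zip(var, exposure) if x == 0) / n0  # prevalence, unexposed
    with_v = [y for v, y in zip(var, outcome) if v == 1]
    without_v = [y for v, y in zip(var, outcome) if v == 0]
    rr_cd = (sum(with_v) / len(with_v)) / (sum(without_v) / len(without_v))
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)


def rank_by_bias(variables, exposure, outcome):
    """Order variable names from most to least potential confounding."""
    score = {name: abs(math.log(bross_bias(v, exposure, outcome)))
             for name, v in variables.items()}
    return sorted(score, key=score.get, reverse=True)


# Toy data (hypothetical): 8 patients, one code count vector, two candidates.
once, sporadic, frequent = frequency_indicators([0, 1, 2, 5, 0, 3, 1, 0])
exposure = [1, 1, 1, 1, 0, 0, 0, 0]
outcome = [1, 1, 0, 1, 1, 0, 0, 0]
candidates = {"A": [1, 1, 1, 0, 1, 0, 0, 0], "B": [1, 0, 1, 0, 1, 0, 1, 0]}
ranking = rank_by_bias(candidates, exposure, outcome)
```

In the toy data, variable A is unbalanced across exposure groups and associated with the outcome, while B is balanced, so A ranks first.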

hd-PS Analyses • PSs were constructed using: • The top 500 exposure-ranked variables + demographics • The top 500 bias-ranked variables + demographics • The top 30 exposure-ranked variables + demographics • The top 30 bias-ranked variables + demographics • Logistic regression of the outcome on exposure + deciles of each PS
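Adjusting for PS deciles means replacing the continuous score with ten ordered strata and entering them in the outcome model as indicator variables. The sketch below shows one way to build those indicators; the function names and scores are hypothetical, and ties are broken by sort order, which a production analysis might handle differently.

```python
def ps_deciles(scores):
    """Assign each patient a decile (0 = lowest scores, 9 = highest) by rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n
    return deciles


def decile_dummies(deciles):
    """Nine indicator columns per patient, with decile 0 as the reference."""
    return [[1 if d == k else 0 for k in range(1, 10)] for d in deciles]


# Hypothetical propensity scores for 20 patients, listed from high to low.
scores = [i / 21 for i in range(20, 0, -1)]
deciles = ps_deciles(scores)
dummies = decile_dummies(deciles)
```

Each row of `dummies`, together with the exposure indicator, would form one row of the design matrix for the outcome regression.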

Shrinkage analyses • Regression of the outcome on all hdPS-screened variables (4,800, minus those that never occur) + exposure + demographics • Ridge regression • Lasso regression • We apply no shrinkage to the coefficient on exposure. • Calculate the crude estimate for comparison.
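Leaving the exposure coefficient unpenalized can be illustrated with a small proximal-gradient (ISTA) lasso for logistic regression, in which the soft-threshold step is skipped for any coefficient flagged as unpenalized. This is a teaching sketch with hypothetical names and toy data, not the software used in the study; the step size, iteration count, and penalty value are assumptions chosen for the toy problem.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: move z toward 0 by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, penalized, step=0.5, iters=3000):
    """L1-penalized logistic regression; penalized[j] is False for
    coefficients (such as exposure) that must not be shrunk."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # Residuals (fitted probability minus outcome) at the current fit.
        resid = []
        for row, yi in zip(X, y):
            z = b0 + sum(b * v for b, v in zip(beta, row))
            resid.append(1.0 / (1.0 + math.exp(-z)) - yi)
        # Gradient of the mean negative log-likelihood.
        g0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * g0  # the intercept is never penalized
        for j in range(p):
            z = beta[j] - step * grad[j]
            beta[j] = soft_threshold(z, step * lam) if penalized[j] else z
    return b0, beta


# Toy data (x = exposure, c = covariate, outcome): risk is 4/6 when exposed
# and 2/6 when unexposed, and c carries no information about the outcome.
data = ([(1, 1, 1)] * 2 + [(1, 1, 0)] + [(1, 0, 1)] * 2 + [(1, 0, 0)]
        + [(0, 1, 1)] + [(0, 1, 0)] * 2 + [(0, 0, 1)] + [(0, 0, 0)] * 2)
X = [[x, c] for x, c, _ in data]
y = [out for _, _, out in data]
b0, beta = lasso_logistic(X, y, lam=0.1, penalized=[False, True])
```

In this toy fit the covariate coefficient is set exactly to 0 while the exposure coefficient reaches its unpenalized value, the log odds ratio log(4). A ridge version would replace the soft-threshold step with proportional shrinkage, so no coefficient would be set exactly to 0.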

Combination approaches • Using the variables selected by the lasso regression: • Include them in a PS analysis • Include them in an ordinary logistic regression outcome model • Using the 500 variables chosen by bias-based hd-PS: • Include them in an ordinary logistic regression outcome model • Include them in a lasso outcome model • Include them in a ridge outcome model

Results – Variable selection • Lasso selected 103 variables on average. • 66% were also selected by at least one hdPS algorithm • IQR: 62-70% • Age was selected in 100% of simulations. • Race was selected in 28%.

Results - Bias

Results - Bias Crude confounding bias of 0.19.

Results - Bias Ridge and lasso regression with all variables reduce bias by 41% and 63%, respectively.

Results - Bias Ridge and lasso do better when they start with pre-screened variables. Bias is reduced by 70% and 83%, respectively.

Results - Bias Ordinary regression and PS approaches performed better. Exposure-based hdPS with 500 variables completely eliminated bias.

Results - Bias Bias-based hdPS variable selection also performed well, with 93% and 91% bias reduction in the PS and ordinary regression models, respectively.

Results - Bias PS and regular regression models performed well using lasso variable selection as well (95% and 96% bias reduction).

Results - Bias When restricting variables to a very small set, bias-based hdPS was much preferred.
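To put the percentages on the same footing as the crude bias, the arithmetic below converts each reported reduction into a residual bias. It assumes that "bias reduced by r%" means residual bias = crude bias × (1 − r), on whatever scale the crude value of 0.19 is reported; the dictionary labels are shorthand introduced here, and "completely eliminated" is entered as a 100% reduction.

```python
crude_bias = 0.19

# Reported bias reductions, as fractions of the crude bias.
reductions = {
    "ridge, all variables": 0.41,
    "lasso, all variables": 0.63,
    "ridge, pre-screened variables": 0.70,
    "lasso, pre-screened variables": 0.83,
    "exposure-based hdPS, 500 variables": 1.00,
    "bias-based hdPS, PS model": 0.93,
    "bias-based hdPS, ordinary regression": 0.91,
    "lasso selection, PS model": 0.95,
    "lasso selection, ordinary regression": 0.96,
}

# Residual bias implied by each reduction, under the assumption above.
residual = {name: crude_bias * (1 - r) for name, r in reductions.items()}

for name, value in sorted(residual.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.4f}")
```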

Conclusion • The variable selection method had relatively little importance. • The estimation method mattered much more. • Shrinkage of coefficient estimates led to insufficient bias control. • Focus on including a large number of potential confounders or confounder proxies.

Limitations • There are many “instruments” in the current simulation setup. • Variables associated with exposure that are not included in the outcome simulation model are essentially IVs, which is unrealistic. • There is no unmeasured confounding in these data. • Variable selection is an easier task when all important confounders are measured.

Future work • Enrich the outcome model • Non-linear associations, more interactions, more true confounders • Vary the true treatment effect • Modify the coefficient on treatment in the outcome generation model. • Vary exposure prevalence • Can be accomplished by sampling within exposure group. • Vary outcome prevalence • Modify the intercept in the outcome generation model. • Unmeasured confounding • Set aside one or more true confounders and don’t allow methods to utilize these variables. • Other base datasets
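Sampling within exposure group, as proposed for varying exposure prevalence, can be sketched as below. The function name and toy cohort are hypothetical. The sketch assumes each row is an (exposure, covariates) pair and draws the exposed and unexposed rows separately, with replacement, so the exposed fraction is fixed by design while covariate distributions within each group are preserved.

```python
import random


def sample_with_exposure_prevalence(rows, n, prevalence, rng):
    """Bootstrap n rows so that round(n * prevalence) of them are exposed."""
    exposed = [r for r in rows if r[0] == 1]
    unexposed = [r for r in rows if r[0] == 0]
    n_exposed = round(n * prevalence)
    sample = [rng.choice(exposed) for _ in range(n_exposed)]
    sample += [rng.choice(unexposed) for _ in range(n - n_exposed)]
    rng.shuffle(sample)  # remove the exposed-first ordering
    return sample


# Toy cohort (hypothetical): 30% exposed, resampled to 40% exposed.
rng = random.Random(1)
cohort = [(1 if i < 30 else 0, [i % 7]) for i in range(100)]
resampled = sample_with_exposure_prevalence(cohort, 1000, 0.40, rng)
```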

Thanks! • Co-authors: • Wesley Eddings • Jeremy A Rassen • Robert J Glynn • Sebastian Schneeweiss • Contact: • jmfranklin@partners.org • www.drugepi.org/faculty-staff-trainees/faculty/jessica-franklin/
