Using Secondary Data Analysis for Outcomes Research

Using Secondary Data Analysis for Outcomes Research Epi 211 April 2012 Michael Steinman, MD

Disclosures and acknowledgements Disclosures: None Acknowledgements: J. Michael McWilliams Ann Nattinger SGIM Research Committee Shameless plug for CER http://ctsi.ucsf.edu/research/cer

Question: • You are a fellow / junior faculty member interested in studying... • Impact of nurse-led HTN clinics on clinical outcomes in patients with HTN • Impact of implementing EMRs on appropriate prescribing in ambulatory surgical patients • Whether quality measures of asthma control in children correlate with actual clinical outcomes in this population

Question: • Here’s your choice: • A. Get a multimillion dollar grant to conduct a multi-center, multi-year RCT • B. Analyze existing data

Learning objectives • Appreciate key conceptual and methodologic issues involved in outcomes research employing secondary data analysis • Identify and use online tools for locating and learning about datasets relevant to your research

Overview • Working with secondary data • Conceptual and methodologic issues • Overview of high-value datasets and web-based resources • Q&A

Working with Secondary Data

Key Take-Home Points • Secondary data analysis is rigorous research • Not throwing data on a wall and seeing what sticks • RQ must meet FINER criteria and be interesting a priori • Know the data as if it were your own • How was it collected; limitations (including validity) • Read the codebooks and any/all documentation; validation studies; speak with PIs. • Perfect enemy of good (but so is bad)

Conceiving a Project Which comes first: question or dataset? • Research question first • Dataset first

Conceiving a Project Which comes first: question or dataset? • Research question first • Dataset first Hybrid approach • Identify research focus, broad question • Consider candidate datasets • Hone question • Iterate between 2 and 3

Types of Secondary Data • Data that have been collected but not for you • Survey • Administrative (claims) • Discharge • Medical chart / EMR • Disease registries • Aggregate (ARF, US Census) • Combinations and linkages

Selecting a Database • Compatibility with research question(s) • Availability and expense • Sample: representativeness, power • Measures of interest present and valid • Predictors, outcomes, confounders • Messiness and missingness • Local expertise • Linkages

Challenges and Pitfalls • Causal inference • Inherently limited with observational data • Does not preclude quasi-experimental designs to recover causal effects • Core of comparative effectiveness research • Value of these approaches highly dependent on expected confounders • For example, study of medical management vs. catheterization for AMI

Challenges and Pitfalls • Validity of measures • Beware of assumptions • Problems: coding, reporting, recall biases • Carefully read the codebooks and documentation about the study • How variables measured • (Who was included in study) • Solutions: direct validation in subgroup or another data source, literature review, sensitivity analyses

What You Want and What You Have • Want to measure financial resources • Explanation for underuse of health services, poor outcomes? • Have measures of income. • Are the two equivalent? • Might financial resources also depend on: • Other assets – especially retired persons? • Family and community resources

What You Want and What You Have • Want to measure presence of a chronic disease • Have ICD9 codes from Medicare billing claims. • Will this work? • Accuracy of ICD9 claims may depend on: • Type of disease – specificity of symptoms, “dominance” in clinical visit, accuracy of clinician diagnosis • Coding incentives – upcoding in Medicare, undercoding in VA • How codes operationalized – which codes to use; require 1 or 2 separate codes; what time period; etc.

Challenges and Pitfalls 3. Complexity of file structure • Row in dataset may not be unit of analysis • Skip patterns, proxy respondents

A Simple Question? Ask: IF ((piRTab1X007AFinFam = FAMILYR) OR (piRTab1X007AFinFam = FINANCIAL_FAMILYR)) AND ((ACTIVELANGUAGE <> EXTENG) AND (ACTIVELANGUAGE <> EXTSPN)) AND (piInitA106_NumContactKids > 0) AND (piInitA100_NumNRKids > 0) JE012 CHILDREN LIVE WITHIN 10 MILES Section: E Level: Household Type: Numeric Width: 1 Decimals: 0 CAI Reference: SecE.KidStatus.E012_ 2000 Link: G19802002 Link: HE012 IF {R DOES NOT HAVE SPOUSE/PARTNER and DOES NOT STILL HAVE HOME OUTSIDE NURSING HOME {(CS11/A028=1) and (CS26/A070 NOT 1)}} or {R & SPOUSE/PARTNER} LIVE IN SAME NURSING HOME (CS11/A028=1 and CS12/A030=1): [Do any of your children who do not live with you/Does CHILD NAME] live within 10 miles of you (in R's NURSING HOME CITY, STATE (CS25b/A067))? OTHERWISE: [Do any of your children (who do not live with you)/Does CHILD NAME] live within 10 miles of you (in MAIN RESIDENCE [CITY/CITY, STATE STATE])? 6802 1. YES 4720 5. NO 32 8.DK (Don't Know); NA (Not Ascertained) 4 9. RF (Refused) 2087 Blank. INAP (Inapplicable) * From the Health and Retirement Study

Challenges and Pitfalls 4. Data mining / overfitting • Is urine cortisol associated with Catholicism? • But… • “Just because you were too stupid to think of the question in advance doesn’t mean it’s not important” - Warren Browner

Challenges and Pitfalls • Representativeness of Sample • External validity (generalizability) • Internal validity (selection bias) • Example: comparing outcomes for insured and uninsured patients using hospital discharge data • Must be hospitalized to enter sample • Not only limits generalizability (to outpatients) • But inferences about the sample may be wrong • Sample would need to include uninsured who would have been hospitalized if insured

Finding the Right Dataset

Finding the Right Dataset • Contain variables of interest • predictor, outcome, confounders • Relevant time frame • Cross-sectional, longitudinal • Feasible • Access: time, bureaucracy, cost • Usable • No perfect datasets -> hybrid approach of developing research question

Administrative Data (VA) • VA has multiple high-value administrative databases • Outpatient visit information • Visit date, type of clinic, provider, ICD9 diagnoses • Inpatient information • Admitting dx(s), discharge dx(s), CPT codes, bed section, meds administered • Lab data • >40 labs • Pharmacy data • All inpatient and outpatient fills • Academic affiliation • etc

Administrative Data (VA) • Huge bureaucracy and paperwork

Administrative Data (VA) • Messy data • Huge size • 2+ TB server • Data analyst

Survey Data (NHANES) • National Health and Nutrition Examination Survey (NHANES) • Nationally representative sample of >10K patients every 2 years • Extensive interview data on clinical history (including diseases, behaviors, psychosocial parameters, etc.) • Physical exam information (e.g. VS) • Labs, biomarkers

Survey Data (NHANES) • Free and easy to download • (Relatively) easy to use • Although requires careful reading of documentation • Serial cross-sectional • Disease data self-report • Very limited information about providers and systems of care

Survey Data (NAMCS) • National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) • Nationally representative sample of ~70K outpatient and ED visits per year • Physician-completed form about office visit

Survey Data (NAMCS) • Data more from physician perspective (diagnoses, treatments Rx’ed, etc) and some info on providers (e.g., clinic organization, use of EMRs, etc) • Serial cross-sectional • Visit-focused • Not comprehensive, ? value for chronic diseases

Discharge Data (NIS) • National Inpatient Sample (NIS) • Database of inpatient hospital stays collected from ~20% of US community hospitals by AHRQ • Diagnoses and procedures, severity adjustment elements, payment source, hospital organizational characteristics • Hospital and county identifiers that allow linkage to the American Hospital Association Annual Survey and Area Resource File

Discharge Data (NIS) • Relatively easy to access (DUA, $200/yr) • Relatively easy to use • Though need close attention to documentation • Limited data elements • Huge data files

Web-Based Resources • Society of General Internal Medicine (SGIM) Research Dataset Compendium • www.sgim.org/go/datasets • UCSF CELDAC • http://ctsi.ucsf.edu/research/celdac • UCSF K-12 Data Resource Center • http://www.epibiostat.ucsf.edu/courses/RoadmapK12/PublicDataSetResources/

Finding Additional Resources • National Information Center on Health Services Research and Health Care Technology (NICHSR) • Inter-University Consortium for Political and Social Research (ICPSR) • Partners in Information Access for the Public Health Workforce • Roadmap K-12 Data Resource Center (UCSF) • List of datasets from the American Sociologic Association • Canadian Research Data Centers – Data Sets and Research Tools (Canada) • Directory of Health and Human Services Data Resources • Publicly Available Databases from National Institute on Aging (NIA) • Publicly Available Databases from National Heart, Lung, & Blood Institute (NHLBI) • National Center for Health Statistics (NCHS) Data Warehouse • Medicare Research Data Assistance Center (RESDAC); and Centers for Medicare and Medicaid Services (CMS) Research, Statistics, Data & Systems • Veterans Affairs (VA) data (all available at www.sgim.org/go/datasets)

National Information Center on Health Services Research and Health Care Technology (NICHSR) • Databases, data repositories, health statistics • Fellowship and funding opportunities • Glossaries, research and clinical guidelines • Evidence-based practice and health technology assessment • Specialized PubMed searches on healthcare quality and costs http://www.nlm.nih.gov/hsrinfo/datasites.html

Inter-University Consortium for Political and Social Research (ICPSR) • World’s largest archive of social science data • Searchable • Many sub-archives relevant to HSR • Health and Medical Care Archive • National Archive of Computerized Data on Aging http://www.icpsr.umich.edu/icpsrweb/ICPSR/partners/archives.jsp

Conclusions • Secondary data has lots of advantages • Relatively quick, tremendous power, high-profile work • Approach with a high level of detail and care • Conceptual background and RQ • Validity and use of measures • Explore range of options available – but also take advantage of resources at hand

Using Secondary Data Analysis for Outcomes Research