
Strategic Health IT Advanced Research Projects (SHARP) Area 4: Secondary Use of EHR Data Project 3: High-Throughput Phenotyping June 30, 2011 Jyoti Pathak, PhD Assistant Professor of Biomedical Informatics Department of Health Sciences Research

Project 3: Collaborators and Acknowledgments • CDISC (Clinical Data Interchange Standards Consortium) • Rebecca Kush, Landen Bain, Mark Arratoon • Centerphase Solutions • Gary Lubin, Jeff Tarlowe • Harvard University/MIT • Guergana Savova, Margarita Sordo, Peter Szolovits • IBM T.J. Watson Research Labs • Marshall Schor • Intermountain Healthcare/University of Utah • Susan Welch, Herman Post, Darin Wilcox, Peter Haug • Mayo Clinic • Cui Tao, Lacey Hart, Erin Martin, Sridhar Dwarkanath, Calvin Beebe, Kent Bailey, Kevin Bruce, Mike Conway (UCSD)

Outline • Background • On-going projects and updates • Proposed project ideas for Year 2 • Productivity till date • Q & A

The Big Question… • The era of Genome-Wide Association Studies (GWAS) has arrived • Genotyping cost is asymptoting to free [Altman et al.] • Most (all?) published GWAS are done on carefully selected and uniformly characterized patient populations • Time consuming • Clinical Phenotyping, on the other hand, is lacking • Slow-throughput • Costly and time consuming • How “good” are EMRs (with inconsistencies and biases) as a source for phenotypes?

Why is this important now? • Bio-repositories are becoming popular • Linking biospecimens to personal health data • Population-based studies for genetic and environmental conditions and contributions to disease etiology • Often limited in scope or population diversity • Clinical trials eligibility • Cohort identification is always a bottleneck • Quality metrics and HITECH Act • Large-scale prospective cohort studies could be facilitated by availability of complete, standardized, and unbiased data from EMRs

Pros and Cons of EMR Data for Phenotyping • We have a LOT of information about subjects • Demographics, labs, meds, procedures… • Team diagnoses as opposed to a diagnosis based on a single person’s opinion • Potential for more reliable diagnoses • Identification of otherwise latent population differences • Possible issues with using EMR data for phenotyping • Non-standardized, heterogeneous, unstructured data • Measured (e.g., demographics) vs. un-measured (e.g., socio-economic status) population differences • Hospital specialization and coding practices • Population/regional market landscape

But…the challenges can be addressed…if we • Develop techniques for standardization and normalization of clinical data • Develop techniques for transforming and managing unstructured clinical text into structured representations • Develop techniques for resolving missing and inconsistent data • Develop a scalable, robust and flexible framework for demonstrating all of the above in a “real-world setting” SHARP Area 4 Project!

EMR-derived Phenotyping • Overarching goal • To develop techniques and algorithms that operate on normalized EMR data to identify cohorts of potentially eligible subjects on the basis of disease, symptoms, or related findings • Phenotyping (from our perspective) • Inclusion and exclusion criteria for cohort identification • Numerator and denominator criteria for clinical quality metrics • Trigger criteria for clinical decision support • …

EMR-based Phenotype Algorithms • Typical components • Billing and diagnosis codes • Procedure codes • Labs • Medications • Phenotype-specific co-variates (e.g., Demographics, Vitals, Smoking Status, CASI scores) • Pathology • Imaging? • Organized into inclusion and exclusion criteria • Experience from eMERGE (http://www.gwas.net) • Electronic Medical Records and Genomics Network
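
As a rough illustration of how such components can be organized into inclusion and exclusion criteria, here is a minimal Python sketch. The record layout, the helper names, and the criteria values (including the placeholder code "EXCL-1" and the lab name "example_lab") are assumptions made for this example. They are not taken from any eMERGE algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified patient record (not a CEM)."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)  # e.g. ICD-9-CM codes
    medications: set = field(default_factory=set)
    labs: dict = field(default_factory=dict)           # lab name -> latest value


def has_any_code(prefixes):
    """Criterion: some diagnosis code starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    return lambda p: any(code.startswith(prefixes) for code in p.diagnosis_codes)


def on_any_medication(names):
    """Criterion: the patient is on at least one of the named medications."""
    names = set(names)
    return lambda p: bool(p.medications & names)


def lab_at_least(name, threshold):
    """Criterion: the named lab is present and at or above the threshold."""
    return lambda p: name in p.labs and p.labs[name] >= threshold


def select_cohort(patients, inclusion, exclusion):
    """Keep patients meeting every inclusion and no exclusion criterion."""
    return [
        p.patient_id
        for p in patients
        if all(rule(p) for rule in inclusion)
        and not any(rule(p) for rule in exclusion)
    ]


# Illustrative data and criteria only; "EXCL-1" is a placeholder code.
patients = [
    PatientRecord("p1", {"250.00"}, {"metformin"}),
    PatientRecord("p2", {"250.00", "EXCL-1"}, {"metformin"}),
    PatientRecord("p3", {"401.9"}, set()),
]
inclusion = [has_any_code(["250"]), on_any_medication(["metformin"])]
exclusion = [has_any_code(["EXCL"])]
cohort = select_cohort(patients, inclusion, exclusion)
```

Keeping each criterion as a small predicate makes it easy to add, drop, or reorder criteria during the iterative refinement that phenotype algorithms typically go through.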

EMR-based Phenotype Algorithms • Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95% • For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
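
The PPV and NPV targets can be checked against a manually reviewed sample along the following lines. This is a minimal sketch that assumes each reviewed chart carries a simple "case"/"control" label from both the algorithm and the reviewer. The function names are hypothetical, and only the 95% and 98% thresholds come from the slide.

```python
from collections import Counter


def confusion_counts(reviewed):
    """Tally (algorithm_label, chart_review_label) pairs.

    Labels are assumed to be the strings "case" or "control".
    """
    counts = Counter()
    for predicted, actual in reviewed:
        if predicted == "case":
            counts["tp" if actual == "case" else "fp"] += 1
        else:
            counts["tn" if actual == "control" else "fn"] += 1
    return counts


def ppv(counts):
    """Positive predictive value: confirmed cases / algorithm-flagged cases."""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else None


def npv(counts):
    """Negative predictive value: confirmed controls / algorithm-flagged controls."""
    cleared = counts["tn"] + counts["fn"]
    return counts["tn"] / cleared if cleared else None


def meets_targets(counts, ppv_target=0.95, npv_target=0.98):
    """True when both values are defined and reach the targets."""
    p, n = ppv(counts), npv(counts)
    return p is not None and n is not None and p >= ppv_target and n >= npv_target
```

In an iterative workflow, a failed check would send the case or control definition back for another round of refinement and partial chart review.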

Example: Type 2 Diabetes (cases)

Challenges • Algorithm design • Non-trivial; requires significant expert involvement • Highly iterative process • Time-consuming manual chart reviews • Representation of “phenotypic logic” • Data access and representation • Lack of unified vocabularies, data elements, and value sets • Questionable reliability of ICD & CPT codes (e.g., omit codes that don’t pay well, billing the wrong code since it is easier to find) • Natural Language Processing needs • And many more…

Outline • Background • On-going projects and updates • Proposed projects for Year 2 • Productivity till date • Q & A

Current HTP Project Themes • Identification of Clinical Element Models • Phenotyping Execution Logic • Data Quality, Validation and Cost Effectiveness

Project Overview • Three eMERGE phenotyping algorithms as initial Use Cases • Type 2 Diabetes Mellitus (T2DM) • Peripheral Arterial Disease (PAD) • Hypothyroidism • Specified computable mappings between CEMs and algorithms • Classified phenotyping input specifications into two categories: • General EHR data requirements (Examples: demographics, diagnoses) • Phenotype-specific EHR data (Example: Ankle-brachial index for PAD) • Proposed semantic types of the input specifications

Semantic Classification Types • Demographic data (e.g., Gender, Race, Age, etc.) • Physical measurements (e.g., Weight, Height, BMI, etc.) • Diagnosis (ICD codes, SNOMED CT annotations from problem list, administrative coding workflows, clinical notes, etc.) • Procedure (CPT codes, ICD procedure codes) • Medication • Laboratory

General Models for Scalability • Diagnosis • AdministrativeDiagnosisCode: billing purposes • ClinicalAssertedDiagnosisCode: problem list, clinical notes, etc. • Medication • Prescribed/Ordered • Dispensed • Administered • Procedure • AdministrativeProcedureCode: CPT code, ICD-9 code for inpatient • Laboratory
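
One way to picture these general models is as a small set of typed records in which the diagnosis source and the medication status are explicit fields. The Python sketch below is an assumption-laden illustration. The class and field names are invented for this example and do not reproduce the actual Clinical Element Model definitions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class DiagnosisSource(Enum):
    ADMINISTRATIVE = "AdministrativeDiagnosisCode"        # billing purposes
    CLINICAL_ASSERTED = "ClinicalAssertedDiagnosisCode"   # problem list, notes


class MedicationStatus(Enum):
    PRESCRIBED_ORDERED = "Prescribed/Ordered"
    DISPENSED = "Dispensed"
    ADMINISTERED = "Administered"


@dataclass(frozen=True)
class Diagnosis:
    patient_id: str
    code: str
    code_system: str          # e.g. "ICD-9-CM" or "SNOMED CT"
    source: DiagnosisSource
    recorded_on: date


@dataclass(frozen=True)
class Medication:
    patient_id: str
    name: str
    status: MedicationStatus
    recorded_on: date


def from_source(diagnoses, source):
    """Return only the diagnoses that come from the given source."""
    return [d for d in diagnoses if d.source is source]
```

Making the source explicit lets an algorithm state whether a billing code alone is enough evidence or whether a clinically asserted diagnosis is required.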

Mapping Issues • Secondary use versus patient care meanings • History of X meaning “evidence of X prior to date Y” versus history of X statement in text documents • Diagnosis inputs often validated on ICD-9-CM codes • Non-standard aggregations • Fasting glucose test • Availability of data in EHR • Age at onset of X • Medical specialty (ankle brachial index) • Smoking history/family history (NLP/structured solutions)

Mapping Considerations • Algorithm inputs are abstractions of EHR content • Native content • Generalized content • Computed • Selected content • Common constraints of EHR content • Source of data, i.e., EHR application used, encounter type • Allowable codes • Temporal bounds • Relationships among separate observations
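
The common constraints listed above (source, allowable codes, temporal bounds, relationships among separate observations) can be combined into a single query, roughly as follows. This is a minimal sketch: the `Observation` layout and function names are assumptions for illustration, and the "at least N distinct visit dates" relationship is just one example of a constraint spanning several observations.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """Hypothetical coded observation drawn from an EHR."""
    patient_id: str
    code: str
    observed_on: date
    encounter_type: str   # source constraint, e.g. "outpatient"


def satisfies(obs, allowed_codes, start, end, encounter_types):
    """Check the allowable-code, temporal-bound, and source constraints."""
    return (
        obs.code in allowed_codes
        and start <= obs.observed_on <= end
        and obs.encounter_type in encounter_types
    )


def patients_with_min_visits(observations, allowed_codes, start, end,
                             encounter_types, min_visits=2):
    """Patients with qualifying observations on at least min_visits distinct dates."""
    dates_by_patient = {}
    for obs in observations:
        if satisfies(obs, allowed_codes, start, end, encounter_types):
            dates_by_patient.setdefault(obs.patient_id, set()).add(obs.observed_on)
    return sorted(
        pid for pid, dates in dates_by_patient.items() if len(dates) >= min_visits
    )
```

Counting distinct dates rather than raw rows avoids treating duplicate codes from a single visit as separate pieces of evidence.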

Example CEM to Algorithm Map

Example CEM to Algorithm Map - 2

Drools-based Phenotyping Architecture • Clinical Element Database → Drools (along with other technologies) → List of Patients for Specific Cases • Workflow authoring by domain experts (clinicians) • Rule accessibility by clinicians – BPMN, decision tables, DSL; collaborative authoring • Domain Expert ~ Analyst ~ Developer

Drools-based Phenotyping Architecture • Components: Clinical Element Database; Data Access Layer; Business Logic; Transformation Layer; Inference Engine (Drools); Service for Creating Output (File, Database, etc.); List of Diabetic Patients • Transform physical representation → Normalized logical representation (Fact Model)
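
To make the layering concrete, the sketch below walks a few rows through data access, transformation into a fact model, rule evaluation, and output. It is a plain-Python stand-in written for this example, not Drools and not the project's implementation. The ordering of the layers, the row layout, the rule names, and the glucose threshold (a placeholder with no clinical meaning) are all assumptions.

```python
import io
from dataclasses import dataclass, field


@dataclass
class PatientFacts:
    """Normalized logical representation (a hypothetical fact model)."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    lab_values: dict = field(default_factory=dict)


def fetch_rows():
    """Data access layer: stands in for a clinical element database query."""
    return [
        {"pid": "A", "kind": "dx", "name": "ICD9", "value": "250.00"},
        {"pid": "A", "kind": "lab", "name": "glucose", "value": "140"},
        {"pid": "B", "kind": "lab", "name": "glucose", "value": "90"},
    ]


def to_fact_model(rows):
    """Transformation layer: physical rows -> one PatientFacts per patient."""
    facts = {}
    for row in rows:
        patient = facts.setdefault(row["pid"], PatientFacts(row["pid"]))
        if row["kind"] == "dx":
            patient.diagnosis_codes.add(row["value"])
        elif row["kind"] == "lab":
            patient.lab_values[row["name"]] = float(row["value"])
    return list(facts.values())


PLACEHOLDER_THRESHOLD = 100.0  # illustrative only, not clinical guidance


def has_diabetes_code(f):
    return any(code.startswith("250") for code in f.diagnosis_codes)


def glucose_above_placeholder(f):
    return f.lab_values.get("glucose", float("-inf")) >= PLACEHOLDER_THRESHOLD


RULES = [has_diabetes_code, glucose_above_placeholder]


def run_rules(facts, rules):
    """Inference step: keep patients for whom every rule holds."""
    return [f.patient_id for f in facts if all(rule(f) for rule in rules)]


def write_output(patient_ids, stream):
    """Output service: one patient id per line to a file-like object."""
    for pid in patient_ids:
        stream.write(pid + "\n")


buffer = io.StringIO()
write_output(run_rules(to_fact_model(fetch_rows()), RULES), buffer)
```

A rule engine would replace `run_rules` with declarative rules matched against the same facts, which is what lets the rules be authored and reviewed separately from the data access code.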

Drools – Workflow

Diabetes Project Status • Diabetes Rules are Completed • Demonstrated the Workflow/Rules for Feedback • Make Rules “Shareable” • Performance Validation • More details in the later session!

DM2 algorithm

NQF QDM Criteria

Data Quality: Objectives • Assess Data variability within and across institutions • Assess impact of this variability on Secondary Use of EMR • Generate specifications for Widgets • “Warning Label” for suspect data categories • Data quality audits with logs • Batch data correction / removal • More details during the later session!

Centerphase Project Research Design • Randomly generate ONE sample set of patient records from the database, based on T2DM ICD-9 codes from at least 2 visits during the measurement period • Manual process: study coordinator (SC) conducts manual review of patient charts and monitors activity time; Screens 1-3 yield a patient result set • Algorithm-driven process: programmer develops and runs an algorithm to query the records and monitors development and run time; Screens 1-3 yield a patient result set • Compare time, cost and accuracy of results

Project 1: National Library for Clinical Phenotyping Algorithms • Current state of the art • MS Word files: do not scale • An FTP server: will not work either • We need…programmatic access, querying, navigation • Promote re-use (where applicable) • Research Question: To develop an implementation-independent phenotyping logic representation template for algorithm design • Existing work on Drools, GELLO and NQF • Leverage CEMs for algorithm design and representation • Publicly accessible Web-based environment for phenotyping algorithms • Validate algorithm deployment in multiple EMR settings

Project 2: Machine Learning and Phenotyping • EMR-derived phenotyping algorithm development is tedious, and time-consuming • Based on our own experience! • Research Question: To leverage machine learning methods for rule/algorithm development, and validate against expert developed ones • Use eMERGE library of phenotype algorithms for validation • Asthma and Diabetes as initial use-cases • Preliminary work by Susan • Work with data normalization and NLP teams

Project 3: Just-in-Time Phenotyping • The current pipeline prototype is based on a relational persistence layer • Access to historical, retrospective data • Offline processing of data and phenotyping algorithms • Research Question: To apply phenotyping algorithms as “data sniffers” that can be plugged within a UIMA pipeline • Online, real-time phenotyping (e.g., for clinical decision support) • How much data is “necessary”? How much data is “necessary and sufficient”? • More active role of NLP techniques
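
The "data sniffer" idea can be pictured as a small stateful component that consumes events as they arrive and fires as soon as a patient has accumulated sufficient evidence. The Python sketch below is one possible reading under that assumption. It does not use UIMA, and the class name, method name, and evidence labels are hypothetical.

```python
class PhenotypeSniffer:
    """Incrementally tracks evidence per patient and flags each patient once."""

    def __init__(self, required_evidence):
        self.required = set(required_evidence)
        self.seen = {}        # patient_id -> set of evidence types seen so far
        self.flagged = set()  # patients already reported

    def observe(self, patient_id, evidence_type):
        """Feed one event.

        Returns True only on the event that completes the required
        evidence for a patient; otherwise returns False.
        """
        if patient_id in self.flagged:
            return False
        seen = self.seen.setdefault(patient_id, set())
        seen.add(evidence_type)
        if self.required <= seen:
            self.flagged.add(patient_id)
            return True
        return False
```

Because the sniffer fires on the first event that completes the required set, the amount of data it consumes before firing gives a rough handle on the "necessary and sufficient" question for a given phenotype definition.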

Project 4: Phenotyping Workbench • EMR-based phenotyping algorithms are hard to design, and even harder to implement • Access to domain experts—often a resource issue • Access to IT/informatics experts—also a resource issue • Lots of moving components • Research Question: To develop a phenotyping “plug & play” workbench for algorithm design and evaluation • Visual and graphical algorithm editing (jPBMN) • Configurable algorithms (Drools code snippets) • User workspace management (who are these “users”?) • File-based or database access layer (CEM-based) • Leverage i2b2 workbench where applicable • “Plug & Play” is still a big challenge…

Productivity till date • Manuscripts/Abstracts/Posters • Conway MA, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, Linneman JG, Pacheco JA, Peissig PL, Rasmussen L, Weston N, Chute CG, Pathak J. Analyzing Heterogeneity and Complexity of Electronic Health Record Oriented Phenotyping Algorithms. AMIA 2011 (paper). • Tao C, Parker CG, Oniki TA, Pathak J, Huff SM, Chute CG. An OWL Meta-Ontology for Representing the Clinical Element Model. AMIA 2011 (paper). • Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, Beebe CE, Huff SM. The SHARPn Project on Secondary Use of Electronic Medical Record Data: Progress, Plans and Possibilities. AMIA 2011 (paper). • Conway MA, Pathak J. Analyzing the Prevalence of Hedges in Electronic Health Record Oriented Phenotyping Algorithms. AMIA 2011 (poster). • Tao C, Welch SR, Wei WQ, Oniki TA, Parker CA, Pathak J, Huff SM, Chute CG. Normalized Representation of Data Elements for Phenotype Cohort Identification in Electronic Health Record. AMIA 2011 (poster). • Prototype software • Drools-based implementation of the diabetes algorithm

Thank You!
