Using Electronic Medical Records Systems for Clinical Research: Benefits and Challenges

Prakash M. Nadkarni

Introduction • Opportunities • Availability of clinical, financial and administrative data in electronic form • Challenges • Using EMR Software for research operations • Using EMR Data for research? Suitability of care-oriented data to clinical research needs. • EMRs queried directly to answer research questions

EMR/Clinical Research Information System (CRIS) Differences: Research Subjects • Subjects are not necessarily “patients”. • Personal Health Information may be optional. • Not all screened subjects are enrolled. • Simultaneous or sequential enrollment • Eligibility Criteria

EMR/CRIS Differences: The Study Calendar • Events/Visits and Study Calendar: Specific evaluations or interventions are done at specific time points ("events") relative to the start of the study. • Not all patients are enrolled at the same time.
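A minimal sketch of the study-calendar idea, assuming each event is stored as a day offset from the subject's own enrollment date. The event names, offsets, and the `StudyEvent` / `schedule_for` names are hypothetical and only illustrate why staggered enrollment yields a different set of calendar dates per subject.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StudyEvent:
    name: str
    day_offset: int  # days after the subject's own day 0 (assumed convention)


# Hypothetical protocol calendar; names and offsets are illustrative only.
CALENDAR = [
    StudyEvent("Baseline", 0),
    StudyEvent("Week 2 follow-up", 14),
    StudyEvent("Week 8 follow-up", 56),
]


def schedule_for(enrollment_date, calendar=CALENDAR):
    """Return (event name, expected date) pairs for one subject."""
    return [
        (event.name, enrollment_date + timedelta(days=event.day_offset))
        for event in calendar
    ]


# Two subjects enrolled on different days share one protocol calendar
# but get different expected visit dates.
first = schedule_for(date(2024, 1, 10))
second = schedule_for(date(2024, 3, 1))
```

Storing offsets rather than absolute dates is one possible design choice: the protocol is defined once, and each subject's expected dates are derived at enrollment.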

EMR/CRIS Differences: Electronic Data Capture (EDC) • CRIS EDC is Far More Structured and Fine-grained – textual comments are only a last resort. • CRISs may need to Support Real-Time Self-reporting of Subject Data • CRIS EDC may not always be Real-Time. • Quality Control considerations dictate many workflow steps.

EMR/CRIS Differences: Trans-Institutional Scope • For trans-institutional scope, Web technology is virtually mandated. • Site restriction in Multi-Site studies – end-users and investigators access only their own site’s patients. • Trans-National Issues: Software Localization/ Globalization – same software, different language/layout.

EMR/CRIS Differences: User Roles • CRISs support differential access to studies. • Most users of a CRIS are unaware of the other studies in the same database. • Some users have read-only access to the data; some only view reports. • Only certain users may be allowed to enter data in particular forms, or even view certain "blinded" data. • Data analysts typically do not need to access PHI. However, in multi-institutional studies, they are typically not site-restricted (see later).
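The access rules on the last two slides (per-study access, site restriction, analysts without PHI) can be sketched as a single filter function. This is an illustration under assumed names: `Grant`, `Record`, `visible_view`, and the role labels are hypothetical and do not describe any particular CRIS product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    study: str
    role: str            # assumed labels: "data_entry", "read_only", "analyst"
    site: Optional[str]  # None means the grant is not site-restricted


@dataclass(frozen=True)
class Record:
    study: str
    site: str
    subject_id: str
    phi: dict   # e.g. name, date of birth
    data: dict  # study measurements


def visible_view(grants, record):
    """Return the part of `record` a user may see, or None if no grant applies."""
    for grant in grants:
        if grant.study != record.study:
            continue  # users only see studies they were granted
        if grant.site is not None and grant.site != record.site:
            continue  # site-restricted users see only their own site's subjects
        view = {"subject_id": record.subject_id, "data": dict(record.data)}
        if grant.role != "analyst":
            view["phi"] = dict(record.phi)  # analysts get de-identified views
        return view
    return None


record = Record("STUDY-A", "site-2", "S001", {"name": "Jane Doe"}, {"sbp": 128})
coordinator = [Grant("STUDY-A", "data_entry", "site-1")]
analyst = [Grant("STUDY-A", "analyst", None)]
```

Here the site-1 coordinator gets no view of the site-2 subject, while the analyst sees the study data across sites but not the PHI.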

EMR/CRIS Differences: Summary • EMRs are intended to primarily support patient care, not research. CRISs are specifically designed for research protocols. • EMRs may inter-operate with CRISs. • Sub-systems: Laboratory, Pharmacy, Scheduling • EMR *may* be used with structured EDC for intra-institutional studies if the only alternative is paper, or if data-entry would otherwise be duplicated. • Claims by any EMR vendor that their systems are CRIS-capable should be viewed skeptically.

EMR Data for Research • The Nature of EMR Data • Significant dependence on narrative text, which is often the gold standard for clinical findings. • Using administrative/billing data as a surrogate for clinical data • Miscoding, variations in coding

Using EMR Data for Research • Primarily hypothesis suggestion/generation rather than confirmation • Sample size may be too small to achieve statistical significance • Most data mining tests only show association, which does not prove causation. • Selection of patients matching complex criteria: sample size projections for a planned study (a strength of I2B2 – no IRB approval needed because only anonymized data is returned).

Medical Natural Language Processing 101 • NLP is concerned with extraction of meaningful information from human language input. • Ultimate goal is to transform unstructured text into a structured form. • Most NLP applications are targeted toward specific goals – e.g., identification of medications, adverse drug events. • NLP is not 100% accurate

Medical NLP 101: Symbolic/Rule-based approaches • Linguistic/symbolic NLP approaches employ hand-crafted grammar rules to parse text into units of speech (symbols), which are then processed further. • Still used successfully for limited problems. • This approach does not always scale. • Labor-intensive, ambiguous parses, poor results with telegraphic text.

Medical NLP 101: Statistical NLP • Relies on large bodies of text annotated with the correct answers by humans. • Utilizes probabilistic methods for prediction • The larger and more representative the training data, the better the results will be. • Approaches include Support Vector Machines (SVMs), Hidden Markov Models (HMMs), and Conditional Random Fields (CRFs).

Medical NLP 101: Sub-problems (1) • NLP software typically works as a pipeline of modules: modules for low-level tasks precede those for high-level tasks • Low-level tasks • Segmentation: sentence and word boundary detection, problem-specific boundary detection • Part-of-speech tagging • Morphological decomposition of compound words • Aggregation: identification of phrases
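As a rough illustration of the first low-level task, the sketch below segments a short clinical-style note into sentences and tokens using regular expressions. The rules are deliberately naive assumptions, and the note text and function names are made up for the example.

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at . ! or ? followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split into numbers (with optional decimals), words, and punctuation marks."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+(?:[-'][A-Za-z]+)*|[^\w\s]", sentence)


note = "Pt denies chest pain. BP 120/80 today. Started metoprolol 12.5 mg."
sentences = split_sentences(note)
tokens = [tokenize(sentence) for sentence in sentences]
# tokens[1] is ['BP', '120', '/', '80', 'today', '.']
```

The decimal point in "12.5" is not treated as a sentence boundary because no whitespace follows it. An abbreviation such as "Dr." followed by a capitalized name would be split incorrectly, which is the kind of case that calls for problem-specific boundary detection.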

Medical NLP 101: Sub-problems (2) • High-level tasks • Spelling and grammatical error correction • Named Entity Recognition, including medical concept recognition • Word/abbreviation disambiguation • Negation and uncertainty identification • Relationship extraction • Temporal inferencing
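Negation identification can be illustrated with a deliberately simple trigger-phrase rule: a concept counts as negated if a trigger word appears within a few tokens before it. This is a toy sketch, not a published algorithm; the trigger list, the window size, and the function name are assumptions chosen for the example.

```python
import re

# Illustrative trigger phrases only; a real system would need a far larger list.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
WINDOW = 5  # number of preceding tokens inspected (assumed value)


def is_negated(sentence, concept):
    """Return True if a trigger occurs within WINDOW tokens before the concept."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != concept_tokens:
            continue
        preceding = " ".join(tokens[max(0, i - WINDOW):i])
        for trigger in NEGATION_TRIGGERS:
            if re.search(r"\b" + re.escape(trigger) + r"\b", preceding):
                return True
    return False


negated = is_negated("Patient denies chest pain or dyspnea.", "chest pain")    # True
affirmed = is_negated("Patient reports chest pain.", "chest pain")             # False
mistaken = is_negated("No fever, but chest pain present.", "chest pain")       # True
```

The third call is a false positive: the rule ignores that "but" ends the scope of "No". Errors like this are one reason NLP output is not 100% accurate and usually needs validation against annotated text.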

Medical NLP: Practical Issues • Change of Workflow and Introduction of Structure can eliminate a difficult problem. • Code Reuse to avoid reinventing wheels. • General vs. Specific Solutions • Tools Need Commoditization

Querying EMR Data: Technological Considerations • A database cannot be simultaneously designed for rapid query as well as efficient interactive, multi-user updates. • EMR database designs are transaction-oriented. • EMRs are optimized for "Patient/Entity Centric", not "Attribute-Centric" queries

Data Warehousing 101 • Principle: Operating on a separate read-only copy of the data on separate hardware yields better query performance. • Structural tweaks include adding extra and pre-computation of aggregate values. • Special types of indexes (bitmap indexes) yield improved query performance. • "Star schemas" characterize most warehouse designs. • Farmers vs. Explorers (Inmon) • "Virtual" integration ("federation")
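A star schema and a pre-computed aggregate can be sketched with Python's built-in `sqlite3` module. The table names, column names, and values below are hypothetical; the point is only the shape of the design: one fact table surrounded by small dimension tables, plus a summary table computed once at load time.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
    -- Dimension tables: one row per patient / per laboratory test.
    CREATE TABLE dim_patient (patient_key INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER);
    CREATE TABLE dim_test (test_key INTEGER PRIMARY KEY, test_name TEXT);

    -- Fact table: one row per result, pointing at the dimensions.
    CREATE TABLE fact_lab_result (
        patient_key INTEGER REFERENCES dim_patient(patient_key),
        test_key    INTEGER REFERENCES dim_test(test_key),
        result_date TEXT,
        value       REAL
    );

    INSERT INTO dim_patient VALUES (1, 'F', 1960), (2, 'M', 1975);
    INSERT INTO dim_test VALUES (1, 'hemoglobin'), (2, 'creatinine');
    INSERT INTO fact_lab_result VALUES
        (1, 1, '2024-01-05', 12.0),
        (1, 1, '2024-02-05', 13.0),
        (2, 1, '2024-01-09', 14.0),
        (2, 2, '2024-01-09', 1.0);

    -- Pre-computed aggregate: built once at load time, read many times.
    CREATE TABLE agg_test_summary AS
        SELECT t.test_name, COUNT(*) AS n_results, AVG(f.value) AS mean_value
        FROM fact_lab_result f JOIN dim_test t USING (test_key)
        GROUP BY t.test_name;
""")

summary = wh.execute(
    "SELECT test_name, n_results, mean_value FROM agg_test_summary ORDER BY test_name"
).fetchall()
```

Queries against `agg_test_summary` avoid re-scanning the fact table, at the cost of refreshing the summary whenever the warehouse copy is reloaded.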

Data Warehousing: Practical Considerations • After a warehouse is deployed, the need for custom reports may increase rather than decrease. • The critical requirement for effective ad hoc query is a comprehensive understanding of the data. This is generally a full-time effort.

Special Considerations: Querying of Clinical Data • Both EMRs and large-scale CRISs typically store clinical data in Entity-Attribute-Value (EAV) form. • Hundreds of thousands of clinical parameters exist across all medical domains. • The vast majority of parameters will be inapplicable for a particular subject/patient. • EAV is a triple: Entity = patient + point in time, Attribute = parameter, Value = value of that parameter. • EPIC Flowsheet data uses EAV.
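The EAV triple can be made concrete with a small `sqlite3` sketch. The table layout, parameter names, and values are hypothetical. The example shows a patient-centric query, an attribute-centric query, and the pivot that analysis tools usually need, since they expect one column per parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observation (
        patient_id  TEXT NOT NULL,   -- Entity, part 1
        observed_at TEXT NOT NULL,   -- Entity, part 2 (point in time)
        attribute   TEXT NOT NULL,   -- Attribute (parameter name)
        value       REAL NOT NULL    -- Value
    )
""")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        ("P1", "2024-01-05", "serum_potassium", 4.1),
        ("P1", "2024-01-05", "hemoglobin", 13.2),
        ("P2", "2024-01-06", "serum_potassium", 5.9),
        ("P3", "2024-01-07", "hemoglobin", 9.8),
    ],
)

# Patient-centric: everything recorded for one patient.
p1 = conn.execute(
    "SELECT attribute, value FROM observation WHERE patient_id = ? ORDER BY attribute",
    ("P1",),
).fetchall()

# Attribute-centric: every patient whose value for one parameter exceeds a cutoff.
high_k = conn.execute(
    "SELECT patient_id FROM observation WHERE attribute = ? AND value > ? ORDER BY patient_id",
    ("serum_potassium", 5.5),
).fetchall()

# Pivot: one row per entity, one column per parameter; missing values become NULL.
pivot = conn.execute("""
    SELECT patient_id, observed_at,
           MAX(CASE WHEN attribute = 'serum_potassium' THEN value END) AS serum_potassium,
           MAX(CASE WHEN attribute = 'hemoglobin' THEN value END) AS hemoglobin
    FROM observation
    GROUP BY patient_id, observed_at
    ORDER BY patient_id
""").fetchall()
```

Only parameters that were actually measured occupy rows, which is why EAV suits sparse clinical data. The price is that every attribute-centric analysis needs a pivot like the last query.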

Standardization • The mere presence of structure does not solve all problems. • Synonyms in narrative text are unavoidable and must be reduced to the same concept. Controlled medical vocabularies (UMLS) help. • UMLS is not a panacea. • Institutions will therefore evolve their internal controlled vocabularies.
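Reducing synonyms to a single concept can be sketched as a lookup against an internal controlled vocabulary. The concept identifiers below are invented local codes, not UMLS identifiers, and the synonym lists are illustrative only.

```python
# Hypothetical internal vocabulary: local concept ID -> known surface forms.
LOCAL_VOCABULARY = {
    "LOCAL:0001": {"myocardial infarction", "heart attack", "mi"},
    "LOCAL:0002": {"hypertension", "high blood pressure", "htn"},
}

# Invert once so each surface form maps straight to its concept.
TERM_TO_CONCEPT = {
    term: concept_id
    for concept_id, terms in LOCAL_VOCABULARY.items()
    for term in terms
}


def normalize(term):
    """Map a free-text term to a local concept ID, or None if it is unknown."""
    key = " ".join(term.lower().split())  # fold case and collapse whitespace
    return TERM_TO_CONCEPT.get(key)


same_concept = normalize("Heart  Attack") == normalize("myocardial infarction")  # True
unknown = normalize("angina")  # None: the vocabulary has no entry yet
```

A plain lookup cannot resolve ambiguous abbreviations. "MI", for instance, may also be used for mitral insufficiency, so a real pipeline would need the word/abbreviation disambiguation step listed among the high-level NLP tasks.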

Standardization Considerations • Standardizing your definitions • 2nd Law of Thermodynamics • Poor definition quality becomes a problem if pooled-data (or meta-) analysis is intended. • Features of certain systems predispose to disorder. (Learn As You Go, separate definitions databases.) • Even the best system is not immune – path of least resistance. • Consistent definition is difficult to achieve after the fact – Deming.

EMR use as the basis for research hypotheses • Conflicting evidence regarding EMR benefit still appears. • A *well designed* EMR may benefit. • Electronic Alerting Systems themselves may not improve care, unless EMRs also reduce workload through automatic actions. • Review vendor-supplied templates carefully.

Conclusions: Future EMR Evolution • EMRs fully supporting CRIS capability are unlikely to evolve. • No software should attempt to do everything • Differences in storage-engine capabilities • Jack-of-all-trades approach (doing everything in a mediocre manner) is not viable. • Difficult (or impossible) to devise a logically consistent user-interface metaphor that applies to diverse unrelated features. • Example of Microsoft Office.

Inter-operation (1) • Co-existing and inter-operating best-of-breed packages offer the best usability and feature set. • CRISs, Genomic/Proteomic Data Management Packages • There may be minimal data duplication, e.g., EMRs may pull in very limited summary information on critical genetic data for selected patients, so that it is immediately visible.

Inter-operation (2) • CRIS/EMR • Bulk import of laboratory parameters, to avoid duplicate data entry • Automatic grading of laboratory-based adverse events (oncology studies) – Richesson et al. • Use for scheduling research subject visits • Pharmacy subsystem for drug dispensation • EMR for primary EDC in intra-institutional studies if the only alternative is paper, or if data-entry would otherwise be duplicated. • EMR/Specialized EMR • Picture-archiving systems

Inter-operation (3) • Application Programming Interfaces (APIs) • All large packages – CRISs, EMRs, 'Omics – require APIs to make inter-operation efficient. • APIs are vendor-specific. Inter-operation standards (e.g., the HL7 Virtual Medical Record) have not gained much traction. • Currently, many vendors set unreasonable financial and other barriers to use of their APIs (e.g., official certification, withholding of documentation). • EMRs lag behind the software industry's trend toward open source.

Questions?

Further reading • CRIS • Richesson and Andrews, Clinical Research Informatics, 2012 (Springer) • NLP • Jurafsky and Martin: Natural Language Processing • Manning and Schuetze: Foundations of Statistical Natural Language Processing • Nadkarni, Ohno-Machado and Chapman: Natural Language Processing: An Introduction. Journal of the American Medical Informatics Association 2011. • Data Warehousing • Larry Greenfield. The Data Warehousing Information Center. www.dwinfocenter.org/ • Kimball, Reeves, Ross and Thornthwaite. The Data Warehouse Lifecycle Toolkit : Expert Methods for Designing, Developing, and Deploying Data Warehouses. Wiley, 1998.
