An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain

Integrating Data for Analysis, Anonymization, and Sharing An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain Wendy W. Chapman, PhD Division of Biomedical Informatics University of California, San Diego

Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • iDASH • Opportunities for sharing and collaboration in NLP

NLP Success “IBM’s computer could very well herald a whole new era in medicine.” ComputerWorld, February 17, 2011 Dr. Watson?? “Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer ‘Watson’ has enrolled in medical school at Columbia University.” New York Daily News, February 18, 2011

Clinical NLP Since the 1960s Why has clinical NLP had little impact on clinical care?

Barriers to Development • Sharing clinical data difficult • Have not had shared datasets for development and evaluation • Modules trained on general English not sufficient • Insufficient common conventions and standards for annotations • Data sets are unique to a lab • Not easily interchangeable

Limited collaboration • Clinical NLP applications silos and black boxes • Have not had open source applications • Reproducibility is formidable • Open source release not always sufficient • Software engineering quality not always great • Mechanisms for reproducing results are sparse

Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH

Security & Privacy Concerns Institutions are reluctant to share data • Clinical texts have many patient identifiers • 18 HIPAA identifiers • Names • Addresses • Items not regulated by HIPAA • Tight end for the Steelers • Unique cases • Woman in her 50s who is pregnant • Sensitive information • HIV status

Lack of user-centered development and scalability • Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)

Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH

iDASH • integrating Data • Analysis • Anonymization • Sharing Data Software/Tools Computational Resources

Disincentives to Share iDASH aims to minimize these disincentives • ‘Scooping’ by faster analysts • Exposure of potential errors in data • Resources for preparing data submissions • Maintaining data • Interacting with potential users takes time • Threat of privacy breach when human subjects are involved • Do not have policies in place • Fallible de-identification, anonymization algorithms

nlp-ecosystem.ucsd.edu

HIPAA &/or FISMA Compliant Cloud Digital informed consent • Access control • De-identification • Query counts • Artificial data generators Privacy preserving Informed Consent Registry Customizable DUAs Researcher access

Schemas Bibliography Research Tutorials Guidelines Resources Education NLP Ecosystem UCSD Clinical Data Data Evaluation Workbench De-Identification MT Samples Tools & Services Collaborative Development Tools TextVect Virtual Machines Annotation Admin & eHOST Registry 2011 summer internship program funded by NIH U54HL108460

Collaborative Effort to Build Ecosystem Evaluation Workbench De-Identification Tools & Services Collaborative Knowledge Authoring TextVect Increase access to NLP Virtual Machines Annotation Environment Decrease Burden of Developing NLP Registry

Increase ability to find NLP tools orbit

Registry: orbit.nlm.nih.gov Len D’Avolio, Dina Demner-Fushman

Increase access to clinical text De-identification service

De-identification Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery • Several available de-identification modules • Need to adapt to local text • Efficient • Secure • Customizable ensemble de-identification system • Build a de-identified corpus • Incorporate existing de-id modules • Launch as virtual machine • Iterative training, evaluation, and modification by user • Correct mistakes • Add regular expressions
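The "correct mistakes, add regular expressions" loop on the slide above can be pictured with a minimal Python sketch. It is not the iDASH de-identification system: the pattern names, the regular expressions, and the helper functions `deidentify` and `add_pattern` are all hypothetical, and a real ensemble would combine existing de-id modules with far broader rules tuned to local text.

```python
import re

# Hypothetical starter patterns (assumptions for illustration only).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
}


def deidentify(text, patterns=PATTERNS):
    """Replace every match of every pattern with a bracketed label."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def add_pattern(patterns, label, regex):
    """Return a copy of the pattern set extended with a user-supplied regex."""
    updated = dict(patterns)
    updated[label] = re.compile(regex)
    return updated


note = "Seen on 3/14/2011, MRN: 12345, call 619-555-0100."
print(deidentify(note))

# A user who spots a missed identifier adds a rule and re-runs.
local = add_pattern(PATTERNS, "NAME", r"\bDr\. [A-Z][a-z]+")
print(deidentify("Dr. Smith saw the patient on 3/14/2011.", local))
```

Keeping the user's additions in a separate copy of the pattern set is one way to let each site adapt the rules without changing the shared defaults.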

Increase access to textual features TextVect

TextVect NLM: Abhishek Kumar

Decrease the Burden of Customizing an NLP Application Collaborative Knowledge Authoring Support Service (cKASS)

Customizing an IE App IE Output User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Map

Customizing an IE App IE Output Dry cough Productive cough Cough Hacking cough Bloody cough User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Which concepts?

Customizing an IE App IE Output Temp 38.0C Low-grade temperature User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy What is a fever?

Customizing an IE App IE Output NECK: no adenopathy Disorder: adenopathy Negation: negated User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Section mapping
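The four customization slides above raise three mapping problems: which specific concepts roll up to a user's concept, where a numeric value crosses into "fever", and how a section header disambiguates a finding. A minimal Python sketch of one way to encode those decisions follows. Everything in it is an assumption made for illustration: the synonym table, the 38.0 C threshold, the section rule, and the function name `map_to_user_concept` are hypothetical and are not taken from cKASS.

```python
import re

# Hypothetical user decisions (assumptions, not from the slides):
# which IE outputs count as the user's "Cough" concept.
SYNONYMS = {
    "cough": "Cough",
    "dry cough": "Cough",
    "productive cough": "Cough",
    "hacking cough": "Cough",
}

# Assumed cutoff; the slide leaves "What is a fever?" to the user.
FEVER_THRESHOLD_C = 38.0

# Section-aware rule: "adenopathy" under NECK maps to the cervical concept.
SECTION_RULES = {("NECK", "adenopathy"): "Cervical Lymphadenopathy"}

TEMP_PATTERN = re.compile(r"temp\s+(\d+(?:\.\d+)?)\s*c")


def map_to_user_concept(ie_text, section=None):
    """Map one IE output string to a user concept, or None if unmapped."""
    key = ie_text.strip().lower()
    if section is not None and (section.upper(), key) in SECTION_RULES:
        return SECTION_RULES[(section.upper(), key)]
    if key in SYNONYMS:
        return SYNONYMS[key]
    match = TEMP_PATTERN.fullmatch(key)
    if match:
        return "Fever" if float(match.group(1)) >= FEVER_THRESHOLD_C else None
    return None


print(map_to_user_concept("Dry cough"))
print(map_to_user_concept("Temp 38.0C"))
print(map_to_user_concept("adenopathy", section="NECK"))
print(map_to_user_concept("Bloody cough"))  # left unmapped until the user decides
```

Attributes such as negation would travel alongside the mapped concept rather than change the mapping itself, so "NECK: no adenopathy" would still resolve to Cervical Lymphadenopathy, marked as negated.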

KOS-IE: Knowledge Organization Systems for Information Extraction

Compile information helpful for IE

Collaborative Knowledge Base Development: cKASS Radiologist NLP Tools • Physician • Radiologist • Nurse • Clinical Researcher • Knowledge Engineer. Decision Support System User KB Shared KB External KB LQ Wang, M Conway, F Fana, M Tharp, D Hillert

Knowledge Authoring Augment user KB with lexical variants, synonyms, and related concepts • User-driven authoring • Top-down: Provide access to external knowledge sources • UMLS, Specialist Lexicon, Bioportal • Bottom-up: Annotate to derive synonyms • Recommendation-based authoring • Generate lexical variants • Mine external knowledge sources • Mine patient records

Decrease the Burden of Evaluation & Error Analysis Evaluation workbench

Evaluation Workbench • Compare the output of two NLP annotators on clinical text • NLP system vs human annotation • View annotations • Calculate outcome measures • Drill down to all levels of annotation • Document-level • Perform error analysis • Future versions will support formal error analysis
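As a rough picture of "calculate outcome measures" and "perform error analysis", the Python sketch below compares a system's annotations against a human reference using exact matching. It is not the Evaluation Workbench's code: the tuple layout `(document id, start, end, label)` and the function name `outcome_measures` are assumptions, and a real comparison might also credit partial span overlaps.

```python
def outcome_measures(reference, system):
    """Compare two annotation sets by exact match.

    Each annotation is a hashable tuple such as
    (document_id, start_offset, end_offset, label).
    """
    reference, system = set(reference), set(system)
    true_pos = reference & system
    false_pos = system - reference   # system said it, human did not
    false_neg = reference - system   # human said it, system missed it

    tp, fp, fn = len(true_pos), len(false_pos), len(false_neg)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Sorted lists of disagreements to drill into for error analysis.
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
    }


human = [("doc1", 0, 5, "Cough"), ("doc1", 10, 15, "Fever"),
         ("doc2", 3, 11, "Dyspnea"), ("doc2", 20, 28, "Wheezing")]
nlp = [("doc1", 0, 5, "Cough"), ("doc1", 10, 15, "Fever"),
       ("doc2", 30, 35, "Fever")]

result = outcome_measures(human, nlp)
print(result["precision"], result["recall"], result["f1"])
print(result["false_negatives"])
```

Returning the disagreeing annotations alongside the scores keeps the summary numbers and the drill-down view in one place.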

Levels of Annotation • Document • Report classified as Shigellosis • Group • Section classified as Past Medical History Section • Utterance • Group of text classified as Sentence • Snippet • “chest pain” classified as CUI 058273 • Word • “pain” classified as noun • Token • “.” classified as EOS marker

Select Classifications to View Document & annotations Outcome Measures for Selected Annotations Report List Attributes for Selected Annotation Relationships for Selected Annotation VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova

Decrease the Burden of Annotation Annotation Environment

Challenges to Annotating • Time consuming • Recruiting & training annotators for high agreement • Expensive • Domain experts especially expensive • Need for annotation by multiple people • Challenging to design annotation task • How many annotators? • How should I quantify quality of annotations? • Logistically challenging • Managing files and batches of reports • Setting up annotation tool • Reinventing the wheel • Hasn’t someone created a schema for this before?
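One common answer to "How should I quantify quality of annotations?" is chance-corrected agreement between two annotators. The Python sketch below computes Cohen's kappa; the slides do not name a specific measure, so this choice and the function name `cohens_kappa` are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa fits tasks where every item receives exactly one label from each annotator; span-level tasks without a fixed set of items are often scored with pairwise F-measure instead.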

How can we reduce the burden of annotation?

iDASH Annotation Environment Goal: provide an environment to decrease the burden of annotation for research and application Annotator Registry eHOST Annotation Admin Web application iDASH cloud Client app on your computer VA, SHARP, and NIGMS: S Duvall, B South, G Savova, N Elhadad, H Hochheiser

Annotator Registry • Enlist for annotation • Certify for annotation tasks • Personal health information • Part-of-speech tagging • UMLS mapping • Set pay rate • Searchable • Available for inclusion in new annotation task http://idash.ucsd.edu/nlp-annotator-registry

Annotation Admin: Intended Users & Uses Users • NLP researchers • Annotation administrators Uses • Manage annotation projects – who annotates what • Currently done with hundreds of files on hard drive • Integrate with annotation tool (eHOST) • Download batches of raw reports to annotators • Upload and store annotated reports • Manage simple annotation projects • Facilitate distributed annotation
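The "who annotates what" bookkeeping can be sketched in a few lines of Python. This is not how Annotation Admin is implemented: the function name `assign_batches`, the round-robin strategy, and the `annotators_per_report` parameter are assumptions chosen to show one simple way to give every report to more than one annotator.

```python
def assign_batches(report_ids, annotators, annotators_per_report=2):
    """Assign each report to several annotators in round-robin order.

    Returns a dict mapping annotator name to that annotator's batch.
    """
    if not 1 <= annotators_per_report <= len(annotators):
        raise ValueError("annotators_per_report must be between 1 and "
                         "the number of annotators")

    batches = {name: [] for name in annotators}
    for i, report in enumerate(report_ids):
        # Take consecutive annotators, wrapping around, so that no
        # annotator receives the same report twice.
        for offset in range(annotators_per_report):
            name = annotators[(i + offset) % len(annotators)]
            batches[name].append(report)
    return batches


batches = assign_batches(["r1", "r2", "r3", "r4"], ["A", "B", "C"])
for name, batch in batches.items():
    print(name, batch)
```

Double assignment is what makes agreement measurement possible later, at the cost of roughly doubling the annotation effort.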

Annotation Admin 1. Assign annotators to a task

2. Create a Schema

3. Assign users and set time expectations

4. Keep track of progress

Collaborative Effort to Build Resources Evaluation Workbench De-Identification Tools & Services Collaborative Knowledge Authoring TextVect Increase access to NLP Virtual Machines Annotation Environment Decrease Burden of Developing NLP Registry

Conclusion • More demand for EHR data • NLP has potential to extend value of narrative clinical reports • There have been many barriers • To development • To deployment • Recent developments facilitate collaboration & sharing • Common annotation conventions • Privacy algorithms • Shared datasets • Hosted environments • iDASH hopes to facilitate • Development of NLP • Application of NLP

Integrating Data for Analysis, Anonymization, and Sharing Questions | Discussion iDASH/ShARe Workshop on Annotation September 29, 2012 La Jolla, CA wwchapman@ucsd.edu Division of Biomedical Informatics University of California, San Diego
