Integrating Data for Analysis, Anonymization, and Sharing. An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain. Wendy W. Chapman, PhD. Division of Biomedical Informatics, University of California, San Diego.
Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • iDASH • Opportunities for sharing and collaboration in NLP
NLP Success “IBM's computer could very well herald a whole new era in medicine.” ComputerWorld, February 17, 2011. Dr. Watson?? “Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer ‘Watson’ has enrolled in medical school at Columbia University,” New York Daily News, February 18, 2011.
Clinical NLP Since the 1960s: Why has clinical NLP had little impact on clinical care?
Barriers to Development • Sharing clinical data difficult • Have not had shared datasets for development and evaluation • Modules trained on general English not sufficient • Insufficient common conventions and standards for annotations • Data sets are unique to a lab • Not easily interchangeable
Limited collaboration • Clinical NLP applications silos and black boxes • Have not had open source applications • Reproducibility is formidable • Open source release not always sufficient • Software engineering quality not always great • Mechanisms for reproducing results are sparse
Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH
Security & Privacy Concerns Institutions are reluctant to share data • Clinical texts have many patient identifiers • 18 HIPAA identifiers • Names • Addresses • Items not regulated by HIPAA • tight end for the Steelers • Unique cases • 50-year-old woman who is pregnant • Sensitive information • HIV status
Lack of user-centered development and scalability • Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)
Overview • The promise of natural language processing (NLP) • Challenges of developing NLP in the clinical domain • Challenges in applying NLP in the clinical domain • Developing an NLP ecosystem on iDASH
iDASH • integrating Data • Analysis • Anonymization • Sharing Data Software/Tools Computational Resources
Disincentives to Share iDASH aims to minimize these disincentives • ‘Scooping’ by faster analysts • Exposure of potential errors in data • Resources for preparing data submissions • Maintaining data • Interacting with potential users takes time • Threat of privacy breach when human subjects are involved • Do not have policies in place • Fallible de-identification, anonymization algorithms
HIPAA &/or FISMA Compliant Cloud Digital Informed consent • Access control • De-identification • Query counts • Artificial data generators Privacy preserving Informed Consent Registry Customizable DUAs Researcher access
NLP Ecosystem (diagram) • Schemas • Bibliography • Research • Tutorials • Guidelines • Resources • Education • UCSD Clinical Data • Data • Evaluation Workbench • De-Identification • MT Samples • Tools & Services • Collaborative Development Tools • TextVect • Virtual Machines • Annotation Admin & eHOST • Registry. 2011 summer internship program funded by NIH U54HL108460
Collaborative Effort to Build Ecosystem • Evaluation Workbench • De-Identification • Tools & Services • Collaborative Knowledge Authoring • TextVect • Increase access to NLP • Virtual Machines • Annotation Environment • Decrease Burden of Developing NLP • Registry
Registry: orbit.nlm.nih.gov Len D’Avolio, Dina Demner-Fushman
Increase access to clinical text De-identification service
De-identification Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery • Several available de-identification modules • Need to adapt to local text • Efficient • Secure • Customizable ensemble de-identification system • Build a de-identified corpus • Incorporate existing de-id modules • Launch as virtual machine • Iterative training, evaluation, and modification by user • Correct mistakes • Add regular expressions
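The customizable ensemble described on this slide can be sketched as follows. This is a minimal illustration, not the iDASH system: it assumes each de-identification module is a function that returns character spans and that the ensemble redacts the union of all spans. The helper names, the patterns, and the `[REDACTED]` mask are hypothetical.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets
Module = Callable[[str], List[Span]]


def regex_module(pattern: str) -> Module:
    """Wrap a regular expression as a de-identification module (hypothetical helper)."""
    compiled = re.compile(pattern)
    return lambda text: [m.span() for m in compiled.finditer(text)]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Union overlapping or adjacent spans so each character is redacted once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def deidentify(text: str, modules: List[Module], mask: str = "[REDACTED]") -> str:
    """Run every module, merge their spans, and replace each span with the mask."""
    spans = merge_spans([s for module in modules for s in module(text)])
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Illustrative patterns only; real modules would need adapting to local text.
modules = [
    regex_module(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like numbers
    regex_module(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # dates such as 9/29/2012
]
# A user correcting a missed identifier could add another expression:
modules.append(regex_module(r"\bMRN:?\s*\d+\b"))
```

Treating every module as a span-returning function is one way to let existing de-id modules and user-added regular expressions sit side by side; the iterative loop on the slide would then amount to reviewing the output and appending or adjusting modules.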
Increase access to textual features TextVect
TextVect NLM: Abhishek Kumar
Decrease the Burden of Customizing an NLP Application collaborative Knowledge Authoring Support Service (cKASS)
Customizing an IE App IE Output User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Map
Customizing an IE App IE Output Dry cough Productive cough Cough Hacking cough Bloody cough User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Which concepts?
Customizing an IE App IE Output Temp 38.0C Low-grade temperature User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy What is a fever?
Customizing an IE App IE Output NECK: no adenopathy Disorder: adenopathy Negation: negated User’s Concepts Cough Dyspnea Infiltrate on CXR Wheezing Fever Cervical Lymphadenopathy Section mapping
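The four customization questions on these slides (which variants map to a concept, what counts as a fever, how negation is handled, and how sections affect mapping) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the dictionaries, the `Finding` type, and the 38.0 °C default threshold are hypothetical stand-ins for a user's own knowledge base, not part of cKASS.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical user knowledge base: lexical variant -> user concept.
# Whether "low-grade temperature" counts as Fever is the user's decision.
VARIANT_TO_CONCEPT = {
    "dry cough": "Cough",
    "productive cough": "Cough",
    "hacking cough": "Cough",
    "bloody cough": "Cough",
    "cough": "Cough",
    "low-grade temperature": "Fever",
}

# Hypothetical section rule: (section, finding) -> user concept.
SECTION_RULES = {("NECK", "adenopathy"): "Cervical Lymphadenopathy"}


@dataclass
class Finding:
    text: str                      # text of the IE output
    section: Optional[str] = None  # report section, if known
    negated: bool = False          # negation flag from the IE system


def map_to_concept(
    finding: Finding, fever_threshold_c: float = 38.0
) -> Optional[Tuple[str, bool]]:
    """Return (user concept, is_present), or None when nothing maps."""
    text = finding.text.strip().lower()
    present = not finding.negated

    # Numeric temperatures: the threshold is an assumed, user-set parameter.
    match = re.fullmatch(r"temp\s+(\d+(?:\.\d+)?)\s*c", text)
    if match:
        return ("Fever", present and float(match.group(1)) >= fever_threshold_c)

    # Section mapping: "adenopathy" under NECK becomes the cervical concept.
    if finding.section is not None:
        concept = SECTION_RULES.get((finding.section.upper(), text))
        if concept:
            return (concept, present)

    concept = VARIANT_TO_CONCEPT.get(text)
    return (concept, present) if concept else None
```

For example, `map_to_concept(Finding("adenopathy", section="NECK", negated=True))` maps the negated finding to Cervical Lymphadenopathy and marks it absent.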
KOS-IE: Knowledge Organization Systems for Information Extraction
Collaborative Knowledge Base Development: cKASS Radiologist NLP Tools • Physician • Radiologist • Nurse • Clinical Researcher • Knowledge Engineer. Decision Support System User KB Shared KB External KB LQ Wang, M Conway, F Fana, M Tharp, D Hillert
Knowledge Authoring Augment user KB with lexical variants, synonyms, and related concepts • User-driven authoring • Top-down: Provide access to external knowledge sources • UMLS, Specialist Lexicon, Bioportal • Bottom-up: Annotate to derive synonyms • Recommendation-based authoring • Generate lexical variants • Mine external knowledge sources • Mine patient records
Decrease the Burden of Evaluation & Error Analysis Evaluation workbench
Evaluation Workbench • Compare the output of two NLP annotators on clinical text • NLP system vs human annotation • View annotations • Calculate outcome measures • Drill down to all levels of annotation • Document-level • Perform error analysis • Future versions will support formal error analysis
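Comparing an NLP system against human annotation and calculating outcome measures can be sketched as below. This is a minimal illustration, assuming annotations are `(start, end, classification)` tuples and that only exact matches count as agreement; the slides do not state the Workbench's own matching criteria, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, classification)


def outcome_measures(
    system: Set[Annotation], reference: Set[Annotation]
) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 of system output against a reference."""
    tp = len(system & reference)   # annotations both produced
    fp = len(system - reference)   # system-only annotations
    fn = len(reference - system)   # reference-only annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }
```

The set differences `system - reference` and `reference - system` are also the natural starting point for error analysis, since they list the false positives and false negatives directly.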
Levels of Annotation • Document • Report classified as Shigellosis • Group • Section classified as Past Medical History Section • Utterance • Group of text classified as Sentence • Snippet • “chest pain” classified as CUI 058273 • Word • “pain” classified as noun • Token • “.” classified as EOS marker
Evaluation Workbench (screenshot panels) • Select Classifications to View • Document & annotations • Outcome Measures for Selected Annotations • Report List • Attributes for Selected Annotation • Relationships for Selected Annotation. VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova
Decrease the Burden of Annotation Annotation Environment
Challenges to Annotating • Time consuming • Recruiting & training annotators for high agreement • Expensive • Domain experts especially expensive • Need for annotation by multiple people • Challenging to design annotation task • How many annotators? • How should I quantify quality of annotations? • Logistically challenging • Managing files and batches of reports • Setting up annotation tool • Reinventing the wheel • Hasn’t someone created a schema for this before?
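One common answer to "How should I quantify quality of annotations?" is chance-corrected agreement between two annotators. The sketch below computes Cohen's kappa; the slides do not name a specific measure, so the choice of kappa is an assumption made here for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa only covers two annotators assigning one label per item; span-level tasks or more than two annotators would need a different measure.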
iDASH Annotation Environment Goal: provide an environment to decrease the burden of annotation for research and application • Annotator Registry • eHOST • Annotation Admin • Web application • iDASH cloud • Client app on your computer. VA, SHARP, and NIGMS: S Duvall, B South, G Savova, N Elhadad, H Hochheiser
Annotator Registry • Enlist for annotation • Certify for annotation tasks • Personal health information • Part-of-speech tagging • UMLS mapping • Set pay rate • Searchable • Available for inclusion in new annotation task http://idash.ucsd.edu/nlp-annotator-registry
Annotation Admin: Intended Users & Uses Users • NLP researchers • Annotation administrators Uses • Manage annotation projects – who annotates what • Currently done with hundreds of files on hard drive • Integrate with annotation tool (eHOST) • Download batches of raw reports to annotators • Upload and store annotated reports • Manage simple annotation projects • Facilitate distributed annotation
Annotation Admin 1. Assign annotators to a task
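Deciding "who annotates what" and handing out batches of reports can be sketched as follows. This is a minimal illustration, not Annotation Admin's actual logic: it assumes a round-robin assignment in which each report goes to a fixed number of annotators, and both function names are hypothetical.

```python
from typing import Dict, List


def assign_reports(
    report_ids: List[str], annotators: List[str], per_report: int = 2
) -> Dict[str, List[str]]:
    """Assign each report to `per_report` distinct annotators, round-robin."""
    if not 1 <= per_report <= len(annotators):
        raise ValueError("per_report must be between 1 and the number of annotators.")
    assignments: Dict[str, List[str]] = {name: [] for name in annotators}
    for i, report_id in enumerate(report_ids):
        for offset in range(per_report):
            annotator = annotators[(i + offset) % len(annotators)]
            assignments[annotator].append(report_id)
    return assignments


def batches(items: List[str], size: int) -> List[List[str]]:
    """Split one annotator's reports into download batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Assigning every report to at least two annotators keeps the overlap needed to measure agreement, while round-robin rotation spreads the workload evenly.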
Collaborative Effort to Build Resources • Evaluation Workbench • De-Identification • Tools & Services • Collaborative Knowledge Authoring • TextVect • Increase access to NLP • Virtual Machines • Annotation Environment • Decrease Burden of Developing NLP • Registry
Conclusion • More demand for EHR data • NLP has potential to extend value of narrative clinical reports • There have been many barriers • To development • To deployment • Recent developments facilitate collaboration & sharing • Common annotation conventions • Privacy algorithms • Shared datasets • Hosted environments • iDASH hopes to facilitate • Development of NLP • Application of NLP
Integrating Data for Analysis, Anonymization, and Sharing Questions | Discussion iDASH/ShARe Workshop on Annotation September 29, 2012 La Jolla, CA wwchapman@ucsd.edu Division of Biomedical Informatics University of California, San Diego