Selen Bozkurt, PhD Stanford University, Biomedical Data Science, Biomedical Informatics Research

An Automated Feature Engineering for Digital Rectal Examination Documentation using Natural Language Processing
S63: NLP for Phenotyping

Disclosure
• We have NO relevant relationships with commercial interests to disclose.
• Acknowledgement: Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA183962. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
• Authors: Selen Bozkurt, PhD; Jung In Park, PhD, RN; Kathleen Mary Kan, MD; Michelle Ferrari, RN; Daniel L Rubin, MD, MS; James D Brooks, MD; Tina Hernandez-Boussard, PhD
AMIA 2018 | amia.org

Learning Objectives
• After participating in this session, the learner should be better able to understand:
• An NLP framework for automatic identification of prostate cancer quality metrics
• How a rule-based approach, enriched with terms learned from the corpus using distributional semantics algorithms, might be used to extract patient-centered outcomes
• Prostate cancer quality metrics
• Patient-centered outcome phenotypes

Introduction
• Prostate Cancer Quality Metrics
• Patient-reported outcomes (e.g., urinary incontinence, erectile dysfunction)
• Quality of life measures (global mental and physical health)
• Digital rectal exam (DRE) as a pre-treatment assessment
• Pretreatment process quality measure
• Documentation within 6 months prior to initial treatment
• DRE as a quality metric for prostate cancer treatment
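
The documentation window in the metric above can be sketched as a simple date check. This is a minimal illustration only: the function name, the 183-day approximation of "6 months", and the choice to count a note written on the treatment day are all assumptions, since the slides do not define the window precisely.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)

def dre_documented_pretreatment(dre_note_dates, treatment_date, window=SIX_MONTHS):
    """Return True if any DRE note falls within `window` before initial treatment.

    Assumption: a note dated on the treatment day itself counts as documented.
    """
    return any(
        timedelta(0) <= treatment_date - note_date <= window
        for note_date in dre_note_dates
    )

# Hypothetical patient: one DRE note about four and a half months before treatment.
print(dre_documented_pretreatment([date(2015, 1, 15)], date(2015, 6, 1)))
```

A stricter definition (calendar months, or excluding the treatment day) would only change the comparison inside `any`.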

Motivation
• DRE
• Not recorded systematically or included in billing or claims datasets
• Assessing documentation is limited to labor-intensive approaches such as manual chart review
• Well documented in the patient record
Report 1: DRE: Moderately enlarged, smooth, symmetric prostate without any induration or nodularity
Report 2: If he does decide to undergo active surveillance, he will need frequent PSA checks (up to every 3 months), frequent rectal examinations and possibly rebiopsy in the future.
Report 3: At the time of his diagnosis his rectal examination showed no abnormalities. Rectal: Normal perianal skin, good sphincter tone and normal rectal mucosa. Prostate is moderately enlarged and without nodularity or induration.

Assessing Prostate Cancer QM from EHRs
• Problem: Documented in clinical narratives. Solution: Natural Language Processing
• Problem: Lack of labeled data. Solution: Rule-based approaches
• Problem: Terminology. Solution: Ontologies, lexicons, dictionaries
• Problem: Dictionary development. Solution: Domain knowledge, manual development, or distributional semantics

Research Questions and Goals
• Can we extract DRE documentation from clinical notes?
• Develop an NLP solution
• Can we follow up DRE documentation for the entire database?
• Integrate the NLP solution into the research database and follow up documentation
[Figure: study overview. Aim 1: NLP pipeline applied to unstructured EHR data, with evaluation. Aim 2: structured EHR data combined with NLP output.]

Method: Data Source
• The Stanford prostate cancer research database
• Data were linked to the California Cancer Registry
• From 2005 to February 9, 2018
• ICD diagnostic codes: ICD-9-CM: 185 and ICD-10-CM: C61
Reference: Seneviratne, M. G., Seto, T., Blayney, D. W., Brooks, J. D., & Hernandez-Boussard, T. (2018). Architecture and Implementation of a Clinical Research Data Warehouse for Prostate Cancer. eGEMs (Generating Evidence & Methods to improve patient outcomes), 6(1).

Method: Data Set
• All Database: ICD-9-CM: 185 or ICD-10-CM: C61, from Jan 1, 2005 to Mar 30, 2017 (N = 15,834)
• 7,443 excluded for not receiving initial treatment for prostate cancer at our hospital (N = 8,391 remaining)
• 458,339 notes
• 1,038 excluded for missing notes and note dates (N = 7,353 remaining)
• Development + Test Set, Dictionary Creation: N = 301
• Development Set: N = 101 (101 notes)
• Test Set: N = 200 (200 notes)

Method: Development and Test Set
• Development + Test Set: N = 301
• Development Set: N = 101 (101 notes)
• Test Set: N = 200 (200 notes)
• Annotation labels: Hypothetical, Deferred, Refused, Examined, Historical
• Manually annotated at the sentence level by two domain experts; inter-rater reliability: Cohen's κ = 0.97
• Development set: development process, error analysis, and adjustments
• Test set: precision, recall, and F-score
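
The agreement and accuracy measures named on this slide can be computed as in the sketch below. The function names and the toy labels are illustrative assumptions; they are not the study's annotations, and the slides do not say which software the authors used.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy sentence labels (made up for illustration).
annotator_1 = ["Examined", "Examined", "Historical", "Deferred", "Examined", "Hypothetical"]
annotator_2 = ["Examined", "Examined", "Historical", "Deferred", "Historical", "Hypothetical"]
print(cohen_kappa(annotator_1, annotator_2))
print(precision_recall_f1(annotator_1, annotator_2, positive="Examined"))
```

In the second call, `annotator_1` stands in for the gold standard and `annotator_2` for system output, purely to exercise the function.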

Method: The Proposed Pipeline
[Figure: pipeline diagram. Recoverable components:]
• Pre-processing: sentence splitter; tokenization; stop word, number, and punctuation removal
• Terms list creation: ontologies from NCBO; experts' domain knowledge; learning vector space representations of words and phrases in clinical notes
• Output 1: proposed terms list for DRE findings (words and phrases)
• Revised-ConText tagger: key term mapping, named entity tagging, negation, temporality
• Output 2: rule-based information extraction
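
A heavily simplified sketch of the rule-based idea follows: split a note into sentences, keep sentences that mention a DRE key term, and assign one of the annotation labels from nearby cue words. The term pattern and cue lists are hypothetical stand-ins, not the authors' dictionary or their revised ConText modifier list, and negation handling is left out.

```python
import re

# Hypothetical key terms and modifier cues, for illustration only.
DRE_TERM = re.compile(r"\b(?:dre|(?:digital )?rectal exam(?:ination)?s?)\b", re.I)
MODIFIER_CUES = [  # checked in order; the first matching cue wins
    ("Refused", re.compile(r"\b(?:refused|declined)\b", re.I)),
    ("Deferred", re.compile(r"\bdeferred\b", re.I)),
    ("Hypothetical", re.compile(r"\b(?:if|will need|would need)\b", re.I)),
    ("Historical", re.compile(r"\b(?:at the time of|previously|history of)\b", re.I)),
]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def label_sentence(sentence):
    """Return a label for a sentence that mentions a DRE, otherwise None."""
    if not DRE_TERM.search(sentence):
        return None
    for label, cue in MODIFIER_CUES:
        if cue.search(sentence):
            return label
    return "Examined"  # default when no modifier cue is present

def extract_dre_mentions(note):
    """Return (sentence, label) pairs for every sentence that mentions a DRE."""
    labeled = ((s, label_sentence(s)) for s in split_sentences(note))
    return [(s, label) for s, label in labeled if label is not None]

print(extract_dre_mentions(
    "At the time of his diagnosis his rectal examination showed no abnormalities. "
    "PSA was discussed."
))
```

Checking cues in a fixed order is a design shortcut for the sketch; a ConText-style tagger scopes each modifier to the terms it governs instead of labeling the whole sentence.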

Method: Dictionary Creation
• Initial list of terms was generated based on domain knowledge
• Matched with existing ontologies from NCBO
• Candidate terms generated using distributional information on words and phrases
1) Bigram, trigram: rectal_exam, digital_rectal_exam
2) Word2vec: skip-gram model, vector length 100, context window width of 5
• The final list of terms was reviewed by the domain experts
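
The two distributional steps can be illustrated without any NLP library. The sketch below joins frequent adjacent tokens into phrase tokens such as rectal_exam, then lists the (target, context) pairs a skip-gram model would train on with a window of 5. It is not the authors' code: the raw count threshold is a simplification of real phrase scoring, the function names and toy corpus are assumptions, and the actual training of 100-dimensional vectors is not shown.

```python
from collections import Counter

def join_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    joined = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and counts[(s[i], s[i + 1])] >= min_count:
                out.append(s[i] + "_" + s[i + 1])
                i += 2  # skip the token that was merged
            else:
                out.append(s[i])
                i += 1
        joined.append(out)
    return joined

def skipgram_pairs(tokens, window=5):
    """List (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy tokenized corpus (made up for illustration).
corpus = [
    ["digital", "rectal", "exam", "normal"],
    ["rectal", "exam", "deferred"],
    ["digital", "rectal", "exam", "refused"],
]
# A second pass over the joined output can yield trigram tokens.
phrases = join_frequent_bigrams(join_frequent_bigrams(corpus))
print(phrases)
print(skipgram_pairs(phrases[1]))
```

Terms whose learned vectors sit close to known DRE terms would then be the candidates passed to the domain experts for review.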

Method: Terms added to the ConText Modifier List

Results: Accuracy Metrics

Results: DRE Documentation Stats

Conclusion
• We built a rule-based NLP pipeline to follow DRE documentation
• As a quality metric
• As a patient-centered outcome
• Could be expanded to other quality metrics and PCOs
• With NLP techniques, it is feasible to accurately and efficiently identify and extract features associated with quality metrics

Future Work
• Testing our algorithms in another healthcare system to ensure their generalizability
• Expanding to other quality metrics
• The clinical terms used in our algorithms will be disseminated through a national repository (pheKB.org)

Thank you!
• Selen Bozkurt - selenb@stanford.edu
• Tina Hernandez-Boussard - boussard@stanford.edu
• Boussard Lab is looking for new postdocs.
