NLP for Contralateral Breast Cancer Detection in EHR Data

Using natural language processing to identify contralateral breast cancer events in patient data sets from EHR notes and surgical reports for improved surveillance and outcome measurement.

Presentation Transcript


  1. Contralateral Breast Cancer Event Detection Using Natural Language Processing
     Session Title: NLP for Population Health Surveillance
     Session Number: S05
     Speaker: Zexian Zeng
     Mentor: Yuan Luo
     Northwestern University

  2. Motivation – Identify Breast Cancer Outcome Measurement
     A contralateral event is an outcome measurement in breast cancer studies.
     • Contralateral breast cancer is defined as a solid tumor that develops in the opposite breast after detection of the first primary breast cancer.
     • Women with a first primary breast cancer have a two- to six-fold increased risk of developing contralateral breast cancer compared with the general population.
     • Efforts have been devoted to studying the risk factors shared between the first and second primary breast cancers.
     Zexian Zeng (Northwestern), AMIA 2017, 11/05/2017

  3. Motivation – Problems with Chart Review
     Manual chart review is widely used.
     • Researchers still rely heavily on manual chart review to identify sub-cohorts with contralateral breast cancer.
     • The review process is error-prone, labor-intensive, and time-consuming, making it difficult to scale to large cohort studies.
     7,000 × 10 = 70,000 minutes; 70,000 / 60 ≈ 1,166.7 hours, which is 166.7 days at 7 hours per day (116.7 days at 10 hours per day).

  4. Motivation – Information in EHR
     Electronic health records (EHR) contain abundant information.
     • The abundance of information available in EHR makes phenotyping in large cohort studies achievable.
     • Because much of this information is in free text, natural language processing (NLP) is an indispensable tool for text mining.

  5. Objectives
     • Develop a model using natural language processing and machine learning to identify contralateral events in a data set of breast cancer patients.

  6. Data Sources – Progress Notes
     A patient’s progress and clinical status are well recorded in the progress notes.
     • Progress notes communicate opinions, findings, and plans between healthcare professionals.
     • Progress notes are readily and widely available.

  7. Data Sources – Surgical Pathology Report
     A diagnostic procedure for breast cancer generates at least one pathology report.
     • Pathology reports contain anatomic site information.

  8. Methods – Workflow
     Obtain a set of positive CUIs (UMLS Concept Unique Identifiers).
     • Progress notes from 15 women with contralateral breast cancer were extracted and reviewed.
     • Sentences or partial sentences indicating the occurrence of contralateral breast cancer and events related to cancer diagnosis were retrieved.
     • The sentences were annotated using MetaMap.
     • 42 CUIs were generated.

  9. Methods – Workflow
     Generate features from progress notes.
     • Preprocess
       - Remove duplicate copies
       - Divide the notes into sentences
       - Remove non-English symbols
     • Filter out terms that are
       - Negated
       - Not in the positive concept set
     • Combine concepts
       - Power sets
       - Combine any two and any three CUIs extracted from the same sentence
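
The filtering and combination steps on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes each sentence has already been annotated (for example by MetaMap) into a list of CUIs plus a set of negated CUIs, and the function names, the `+` separator, and the placeholder identifiers in `POSITIVE_CUIS` are all hypothetical (the slides do not list the actual 42 CUIs).

```python
from itertools import combinations

# Placeholder identifiers only; the real positive concept set is not given in the slides.
POSITIVE_CUIS = {"C0000001", "C0000002", "C0000003"}


def sentence_features(cuis, negated=frozenset(), positive=POSITIVE_CUIS):
    """Return single CUIs plus every 2- and 3-CUI combination for one sentence.

    CUIs that are negated or outside the positive concept set are dropped first.
    """
    kept = sorted({c for c in cuis if c in positive and c not in negated})
    features = set(kept)
    for size in (2, 3):
        for combo in combinations(kept, size):
            features.add("+".join(combo))
    return features


def patient_features(sentences):
    """Union of sentence-level features over one patient's notes.

    `sentences` is an iterable of (text, cuis, negated) tuples; sentences whose
    text was already seen are skipped, as a simple stand-in for removing
    duplicated note copies.
    """
    seen_text, features = set(), set()
    for text, cuis, negated in sentences:
        if text in seen_text:
            continue
        seen_text.add(text)
        features |= sentence_features(cuis, negated)
    return features
```

Combinations are built only within a sentence, so two concepts that appear in different sentences of the same note never form a joint feature.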

  10. Methods – Workflow
      An example illustrating how features are generated from progress notes (figure not included in this transcript).

  11. Methods – Workflow
      Generate features from surgical pathology reports.
      Algorithm:
          New_feature = '0'
          If 'left' in at least one pathology report:
              If 'right' in at least one pathology report:
                  New_feature = '1'
      One new binary feature was derived, indicating whether the patient has pathology reports for both sides.
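
The pseudocode above translates almost directly into Python. The sketch below is one possible reading of it: the function name is hypothetical, and matching whole words case-insensitively and returning an integer rather than the string `'1'` are assumptions the slide does not specify.

```python
import re

# Whole-word, case-insensitive matching is an assumption; the slide only says
# that 'left' / 'right' appear in at least one pathology report.
LEFT = re.compile(r"\bleft\b", re.IGNORECASE)
RIGHT = re.compile(r"\bright\b", re.IGNORECASE)


def bilateral_pathology_feature(reports):
    """Return 1 if some report mentions 'left' and some report mentions 'right', else 0."""
    has_left = any(LEFT.search(report) for report in reports)
    has_right = any(RIGHT.search(report) for report in reports)
    return int(has_left and has_right)
```

As in the slide's pseudocode, a single report that mentions both sides is enough to set the feature to 1.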

  12. Methods – Workflow
      Train model & evaluation.
      • Support Vector Machine (SVM)
        - Grid search was performed
      • Evaluation
        - Baseline studies were performed
          · Combined MetaMap
          · Pathology Report Count
          · Positive Dictionary without Combination
          · Bag of Words
        - Five-fold cross validation
        - Held-out test
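
A training and evaluation loop of this shape could look like the following scikit-learn sketch. The slides do not give the hyperparameter grid, the scoring metric, or the size of the held-out set, so the grid values, the F1 metric, the 80/20 split, and the synthetic data below are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 200 "patients" with 10 binary features each.
# The label depends on the first two features so there is something to learn.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] & X[:, 1]

# Hold out 20% of patients for the final test (assumed proportion).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search over assumed SVM hyperparameters with five-fold cross validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out patients.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(search.best_score_, 3), round(test_f1, 3))
```

Keeping the held-out patients out of the grid search means the final score is not influenced by hyperparameter selection.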

  13. Results – Cross Validation
      Five-fold cross validation results (table not included in this transcript).

  14. Results – Cross Validation
      Held-out test results (table not included in this transcript).

  15. Results – Feature Study
      Top-ranked features (table not included in this transcript).

  16. Motivation – Discussions and Conclusions
      Discussions and conclusions
      • Patients with contralateral events usually have pathology reports for both breasts.
      • Progress notes do not mention every contralateral event.
      • Combining these two types of features improves performance.
      • The method can be replicated because feature generation is simple.

  17. Thank You!
