Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding

Julia Medori • CENTAL (Centre for Natural Language Processing), Université catholique de Louvain (Belgium) • CENTAL Seminar (Séminaire du Cental), 17/12/2010

Overview • Problem outline • System structure • Extraction • Encoding • Extraction module • Encoding module • Machine learning methods • Experiments for features selection • Results • Symbolic methods description • Method 1: Morphological Analysis (MA) • Method 2: Extended lexical patterns (ELP) • Methods combination • Results • Conclusions

Introduction • Aim : Build a (semi-)automated system for ICD-9-CM encoding • Collaboration CENTAL/Saint-Luc • Université catholique de Louvain (Belgium) • CENTAL : Centre for Natural Language Processing • Saint-Luc hospital : • A team of 10 coders processes medical records : extraction of medical acts and diagnoses => ICD-9-CM codes • 85,000 patient stays encoded each year.

Manual encoding

Data • International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) • Hierarchy : • First 3 digits -> general category : 1,135 categories • Digits 4 and 5 -> specific diagnosis : 15,688 codes
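The hierarchy above can be illustrated with a minimal sketch. The function name `icd9_category` and the sample code strings are assumptions made for illustration. The sketch follows the three-character convention used on these slides, including for V and E codes.

```python
def icd9_category(code: str) -> str:
    """Return the 3-character category prefix of an ICD-9-CM code string.

    Assumption: codes are written either with a dot ("250.00") or
    without ("25000"); the category is taken as the first three characters.
    """
    return code.replace(".", "").strip().upper()[:3]


# Illustrative inputs only
print(icd9_category("250.00"))
print(icd9_category("V12.01"))
```

Mapping full codes to categories in this way is what makes a category-level evaluation (first 3 digits) possible.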

Objectives • Design a coding help: • a tool that will suggest the most likely codes to be assigned to a patient’s medical record. • Why not a fully automated system? • Main source of information : Patient discharge summary (PDS) • PDS : letter addressed to the patient’s GP, with no standard structure • 15-20% of the codes are inferred from other sources in the patient’s medical record (often scanned documents).

System structure • [Diagram; recovered labels: PDS; Preprocessing; Extraction; Dictionaries and linguistic structures; Context analysis + tagging; Coding; Morphological processing; Matching lists; ICD9CM + Inclusions; Machine learning module; Code modification according to context and stats; Manual checking; PDS + ordered list of codes]

Structure outline • 2 steps : • Extraction • Develop an extraction system able to extract information necessary to the encoding task : • Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc. => Reading help tool. • Encoding • Extracted information => codes through a combination of statistical and symbolic methods.

Extraction • Develop specialized linguistic resources • Specialized dictionaries • Diagnoses and procedures <= ICD-9-CM + UMLS • Medications • Anatomy • Linguistic structure description • Diagnoses context (present, absent, probable, etc.) • Allergies and smoking • Dates • Weight and height

Example of linguistic structure graph Fracture de l’épaule (shoulder fracture) => <MALINDET> Fracture de l’<ANAT>épaule</ANAT></MALINDET>

Extraction result

Structure outline • 2 steps : • Extraction • Develop an extraction system able to extract information necessary to the encoding task : • Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc. => Reading help tool. • Encoding • Extracted information => codes through a combination of statistical and symbolic methods.

Machine Learning • Encoding = categorization problem • Features = extracted phrases? • Classes = codes • Baseline method : Naive Bayes • Tool : Weka • Corpus : • 13,635 PDS from Digestive Surgery • 90% training set / 10% test set (1,364 PDS) • Average number of codes per PDS : 6.2 • Trained 1 classifier per code occurring more than 5 times in the corpus : • 775 codes -> 775 classifiers • Limitation : 5% rare codes • Attributes : kept only those co-occurring at least twice with the codes. • Measures : Precision and recall according to the probability returned by the Naive Bayes classifier.
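The "one classifier per code" setup can be sketched as follows. This is not the Weka configuration used in the experiments: it is a small hand-written Bernoulli Naive Bayes with Laplace smoothing, and the names `train_binary_nb`, `predict_proba`, `train_per_code`, the labels `CODE_A`/`CODE_B` and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_binary_nb(docs, labels):
    """Train one binary Bernoulli Naive Bayes model (code present / absent).

    docs: list of feature sets; labels: list of booleans.
    """
    vocab = set().union(*docs)
    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for cls in (True, False):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        # Laplace smoothing on the prior and on each feature probability
        model["prior"][cls] = (len(cls_docs) + 1) / (len(docs) + 2)
        counts = Counter(f for d in cls_docs for f in d)
        model["cond"][cls] = {
            f: (counts[f] + 1) / (len(cls_docs) + 2) for f in vocab
        }
    return model


def predict_proba(model, doc):
    """Return P(code present | doc) under the Bernoulli model."""
    log_scores = {}
    for cls in (True, False):
        score = math.log(model["prior"][cls])
        for f in model["vocab"]:
            p = model["cond"][cls][f]
            score += math.log(p if f in doc else 1.0 - p)
        log_scores[cls] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])


def train_per_code(docs, doc_codes, min_occurrences=5):
    """Train one classifier per code occurring more than min_occurrences times."""
    code_counts = Counter(c for codes in doc_codes for c in codes)
    return {
        code: train_binary_nb(docs, [code in codes for codes in doc_codes])
        for code, n in code_counts.items()
        if n > min_occurrences
    }


# Toy data (hypothetical): four documents, two codes
docs = [{"biopsy", "bile", "duct"}, {"biopsy", "liver"},
        {"hernia", "repair"}, {"hernia", "inguinal"}]
doc_codes = [{"CODE_A"}, {"CODE_A"}, {"CODE_B"}, {"CODE_B"}]
models = train_per_code(docs, doc_codes, min_occurrences=1)
for code, model in sorted(models.items()):
    print(code, round(predict_proba(model, {"biopsy", "bile"}), 3))
```

Ranking the per-code probabilities and cutting the list at a threshold gives the kind of ordered suggestion list on which precision and recall are measured.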

Experiments • A series of experiments was conducted in which the attributes were variants of the extracted diagnoses and procedures after stemming. • Variants involved : • Keeping the original word order or not. • Ex: excisional biopsy bile duct • Or bile biopsy duct excisional • Including details like location, date, context. • Excisional biopsy • Each word of the extracted phrases is a feature • Excisional • Biopsy • Bile • Duct • Words and morphemes (together) composing the extracted phrase • Bile biopsy excision excisional duct • Words and morphemes (separately) composing the extracted phrase • Excisional biopsy bile duct • Excision biopsy bile duct • Values were 0 or 1 depending on whether the attribute was in the text. • Values were the frequency of the attribute in the text.
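A few of these variants can be sketched in code. Stemming and morpheme splitting are left out, and the function names are hypothetical. Alphabetical sorting for the "no word order" variant is an assumption, chosen because it reproduces the example on the slide.

```python
from collections import Counter


def phrase_attribute(phrase, keep_order=True):
    """Build one attribute from a whole phrase, with or without word order."""
    words = phrase.lower().split()
    if not keep_order:
        words = sorted(words)  # order-free variant: alphabetical order
    return " ".join(words)


def word_attributes(phrase):
    """Variant where each word of the phrase is its own attribute."""
    return phrase.lower().split()


def attribute_values(attributes, binary=True):
    """Map attributes to 0/1 presence values or to frequency counts."""
    counts = Counter(attributes)
    if binary:
        return {a: 1 for a in counts}
    return dict(counts)


print(phrase_attribute("Excisional biopsy bile duct", keep_order=False))
print(attribute_values(["biopsy", "biopsy", "duct"], binary=False))
```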

Results • The 3 best results were obtained when thresholding the list of results at a probability returned by Naive Bayes = 1

Discussion • Limitations of the machine learning method: • 5% rare codes : not enough data to build a classifier for these codes • Need for annotated data means that these methods are unable to face changes in classifications • In these cases, we need to use symbolic methods • Reference : Kevers, Laurent and Medori, Julia. Symbolic classification methods for patient discharge summaries encoding into ICD. In: Advances in Natural Language Processing, 7th International Conference on NLP, IceTAL 2010, Reykjavik, August 16-18, 2010. Lecture Notes in Artificial Intelligence, 2010, p. 197-208.

Objective • Automatic encoding of PDS according to categories (first 3 digits) • Use of symbolic methods • No need for annotated data • Can assign rare codes (27% used 5 times or less) • Principle : • Make use of the nomenclature • Enrich it with other resources in French from UMLS (Unified Medical Language System)

Corpus • 19,692 patient discharge summaries (PDS) in French • General Internal Medicine • 150,116 codes (137,336 categories) • 6,029 distinct codes (895 categories) • Average = 7.6 codes/document (7 categories)

Method 1 (MA) – General Principle • Based on the rich morphology of medical language • Ex. Bronchoscopy : Fibroscopie bronchique = bronchoscopie par fibre optique • Two-step process : • Extraction : extract phrases or terms describing diagnoses or procedures to be encoded • Encoding : match these terms to the right code.

Method 1 (MA) – Encoding • Bags-of-words : Words – stop words + morphemes + meaning • An ICD-9-CM term and a PDS term are each turned into a bag : • « Fibroscopie bronchique » => fibroscopie, bronchique + fibro- (fibre), -scopie, bronch- (bronche), -ique • « Bronchoscopie par fibre optique » => bronchoscopie, par, fibre, optique + bronch- (bronche), -scopie • Similarity score computed between the two bags
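The bag comparison can be sketched as follows. The slides do not say which similarity score is used, so the Jaccard overlap here is an assumption. The stop-word list and the hand-written `DECOMPOSITION` table are toy stand-ins for the real morphological analyser.

```python
# Toy stop-word list (assumption)
STOP_WORDS = {"par", "de", "du", "la", "le"}

# Toy morphological decomposition: word -> morphemes and their meanings
DECOMPOSITION = {
    "fibroscopie": ["fibro-", "fibre", "-scopie"],
    "bronchique": ["bronch-", "bronche", "-ique"],
    "bronchoscopie": ["bronch-", "bronche", "-scopie"],
}


def bag(term, use_morphemes=True):
    """Words minus stop words, optionally enriched with morphemes and meanings."""
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    result = set(words)
    if use_morphemes:
        for w in words:
            result.update(DECOMPOSITION.get(w, []))
    return result


def jaccard(a, b):
    """Share of elements common to both bags (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


t1, t2 = "Fibroscopie bronchique", "Bronchoscopie par fibre optique"
print(jaccard(bag(t1, use_morphemes=False), bag(t2, use_morphemes=False)))
print(jaccard(bag(t1), bag(t2)))
```

On surface words alone the two terms share nothing. Once morphemes and their meanings are added, they overlap on fibre, -scopie, bronch- and bronche, which is the point of the morphological analysis.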

Method 1 (MA) – Results

Method 2 (ELP) – General principle • Developed by L. Kevers, originally designed for the Stratego project on parliamentary documents. • Symbolic method with less manual work • Use existing « terminological » resources • ICD-9-CM + UMLS • Two-step process • Automatic transformation of existing terminological resources into an extraction resource (only once) • Use the extraction resource on documents for term extraction and classification (for each document)

Method 2 (ELP) – Build extraction resource (1) • For each ICD-9-CM term (= a class), the automatic processing involves : • Gather synonyms (UMLS) « dengue » → « dengues », « dengue fever », « infection by the dengue virus » • Parse complex compound expressions « Infectious and parasitic diseases » → « Infectious disease » → « Parasitic disease » • Transform initial term into Extended Lexical Pattern (ELP) • Stopwords : → « infection <TOKEN> dengue virus » • Stemming : → « infect <TOKEN> dengue virus » • Allow insertions : → « infect <I> <TOKEN> <I> dengue <I> virus » • Add negative context patterns • Build the main transducer for text annotation
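The three transformation steps can be sketched on the dengue example. The stop-word list and the suffix-stripping `toy_stem` are toy assumptions rather than the resources actually used. Collapsing a run of stop words into a single `<TOKEN>` is inferred from the example on the slide.

```python
# Toy stop-word list (assumption)
STOPWORDS = {"by", "the", "of", "and"}


def toy_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ions", "ion"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word


def to_elp(term):
    """Turn a term into an Extended Lexical Pattern string."""
    tokens = []
    for word in term.lower().split():
        if word in STOPWORDS:
            # Step 1: a run of stop words becomes a single <TOKEN> slot
            if not tokens or tokens[-1] != "<TOKEN>":
                tokens.append("<TOKEN>")
        else:
            # Step 2: content words are stemmed
            tokens.append(toy_stem(word))
    # Step 3: allow insertions between consecutive elements
    return " <I> ".join(tokens)


print(to_elp("infection by the dengue virus"))
```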

Method 2 (ELP) – Transducer & output • Transducer for class '061' Zona [[053]] extremement douloureux [[729]] gastroscopie [[Z44]] acide [[E96]] anemie normochrome normocytaire [[285]] sequellaires apicales droite (tuberculose [[137]] intestin grele [[Z45]] tuberculose [[V12]] • Output of main transducer for a document oesophagite moderee aspecifique [[947]] infection a mycobacterie [[031]] fond de oeil [[Z16]] pas de [[-]] atteinte du nerf [[957]] zona [[053]] hyperthyroidie [[242]] goitre [[706]] goitre [[240]]

Method 2 (ELP) – Class assignment (2) • For a text to classify, analyse the main transducer output • When a phrase occurs in a negative context, it is skipped • Each recognized phrase has one (or more) related codes • Compute a weight for each phrase based on • Its frequency • Whether it is a multi-word expression (frequency*2) or not • Compute a weight for each code by summing up the weights obtained for its phrases • Result : ordered list of codes (possibly thresholded)
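The weighting scheme can be sketched as follows. The input format (tuples of phrase, codes and a negation flag) and the function name `rank_codes` are assumptions. The annotations are toy data loosely modelled on the transducer output shown earlier.

```python
from collections import Counter, defaultdict


def rank_codes(annotations, threshold=0.0):
    """Rank codes from annotated phrases.

    annotations: list of (phrase, codes, negated) tuples.
    Phrase weight = frequency, doubled for multi-word expressions.
    Code weight = sum of the weights of the phrases pointing to it.
    """
    freq = Counter()
    phrase_codes = {}
    for phrase, codes, negated in annotations:
        if negated:
            continue  # phrases in a negative context are skipped
        freq[phrase] += 1
        phrase_codes.setdefault(phrase, set()).update(codes)

    code_weight = defaultdict(float)
    for phrase, n in freq.items():
        weight = n * 2 if len(phrase.split()) > 1 else n
        for code in phrase_codes[phrase]:
            code_weight[code] += weight

    # Highest weight first; ties broken by code for a stable order
    ranked = sorted(code_weight.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(code, w) for code, w in ranked if w >= threshold]


# Toy annotations (hypothetical document)
annotations = [
    ("zona", {"053"}, False),
    ("zona", {"053"}, False),
    ("goitre", {"706", "240"}, False),
    ("infection a mycobacterie", {"031"}, False),
    ("atteinte du nerf", {"957"}, True),  # preceded by "pas de"
]
print(rank_codes(annotations))
```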

Method 2 (ELP) – Results

Combination of methods 1 & 2 • Merge the lists from method 1 & 2 • Threshold(M.1 union M.2) • Threshold(M.1 inter M.2) • Threshold(M.1) union Threshold(M.2) • Threshold(M.1) inter Threshold(M.2) • The weight for each method can be balanced • Example: 0.4*M.1 union 0.6* M.2
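The four merging strategies can be sketched as follows. Two points are assumptions not stated on the slides: the scores of both methods are taken to be on a comparable scale, and the balanced lists are merged by a weighted sum. The function names and the toy scores are hypothetical.

```python
def combine(m1, m2, w1=0.5, w2=0.5, mode="union"):
    """Merge two {code: score} dicts before thresholding."""
    if mode == "union":
        codes = set(m1) | set(m2)
    elif mode == "inter":
        codes = set(m1) & set(m2)
    else:
        raise ValueError("mode must be 'union' or 'inter'")
    # Weighted sum; a code missing from one method contributes 0 there
    return {c: w1 * m1.get(c, 0.0) + w2 * m2.get(c, 0.0) for c in codes}


def threshold(scores, t):
    """Keep codes scoring at least t, best first."""
    kept = [c for c, s in scores.items() if s >= t]
    return sorted(kept, key=lambda c: (-scores[c], c))


# Toy scores for one document (hypothetical)
m1 = {"053": 0.9, "240": 0.4}
m2 = {"053": 0.6, "031": 0.8}

print(threshold(combine(m1, m2, 0.4, 0.6, "union"), 0.4))   # Threshold(M.1 union M.2)
print(threshold(combine(m1, m2, 0.4, 0.6, "inter"), 0.4))   # Threshold(M.1 inter M.2)
print(set(threshold(m1, 0.5)) | set(threshold(m2, 0.5)))    # Threshold(M.1) union Threshold(M.2)
print(set(threshold(m1, 0.5)) & set(threshold(m2, 0.5)))    # Threshold(M.1) inter Threshold(M.2)
```

Thresholding after the merge lets a code that is weak in one method be rescued by the other. Thresholding each list first keeps the two methods independent.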

Evaluation of symbolic methods combination

Conclusions • Results have to be put into perspective: • Inter-annotator agreement ~70% • 15 to 20% cannot be inferred from PDS • Machine learning methods performed well. • Symbolic methods: • MA method based on extraction module : 66% of useful information is extracted. • ELP method performs better when built from short unambiguous phrases. ICD-9-CM code descriptions are more complex. • Future work : • Give more weight to information contained in important parts of the PDS (introduction, conclusion…) • Evaluate the actual help given to human coders • Combine with learning algorithms
