Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding Julia Medori CENTAL (Centre for Natural Language Processing) Université catholique de Louvain (Belgium) Séminaire du Cental - 17/12/2010
Overview • Problem outline • System structure • Extraction • Encoding • Extraction module • Encoding module • Machine learning methods • Experiments for features selection • Results • Symbolic methods description • Method 1: Morphological Analysis (MA) • Method 2: Extended lexical patterns (ELP) • Methods combination • Results • Conclusions
Introduction • Aim : Build a (semi-)automated system for ICD-9-CM encoding • Collaboration CENTAL/Saint-Luc • Université catholique de Louvain (Belgium) • CENTAL : Centre for Natural Language Processing • Saint-Luc hospital : • team of 10 coders processes medical records : Extraction of medical acts and diagnoses => ICD-9-CM codes • 85,000 patient stays encoded each year.
Data • International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) • Hierarchy : • first 3 digits -> general category : 1,135 categories • Digits 4 and 5 -> specific diagnosis : 15,688 codes
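The hierarchy above can be illustrated with a minimal sketch. It assumes codes are written as strings with an optional dot after the third character; the function name `icd9cm_category` and the sample code strings are chosen here for illustration only and do not come from the slides.

```python
def icd9cm_category(code: str) -> str:
    """Return the 3-character category of an ICD-9-CM code string.

    Assumption: the category is simply the first three characters
    once the optional dot has been removed.
    """
    return code.replace(".", "").strip()[:3]


if __name__ == "__main__":
    for sample in ("250.01", "25001", "V12"):
        print(sample, "->", icd9cm_category(sample))
```

Working at the category level is what the symbolic methods later in the deck do (encoding according to the first 3 digits).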
Objectives • Design a coding aid: • a tool that suggests the most likely codes to be assigned to a patient’s medical record. • Why not a fully automated system? • Main source of information : Patient discharge summary (PDS) • PDS : letter addressed to the patient’s GP, with no standard structure • 15-20% of the codes are inferred from other sources in the patient’s medical record (often scanned documents).
System structure • [Diagram; component labels: Machine learning module, Extraction, Coding, Code modification according to context and stats, Context analysis + tagging, PDS, PDS + ordered list of codes, Dictionaries and linguistic structures, Preprocessing, Manual checking, Morphological processing, ICD-9-CM + Inclusions, Matching lists]
Structure outline • 2 steps : • Extraction • Develop an extraction system able to extract information necessary to the encoding task : • Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc. => Reading help tool. • Encoding • Extracted information => codes through a combination of statistical and symbolic methods.
Extraction • Develop specialized linguistic resources • Specialized dictionaries • Diagnoses and procedures <= ICD-9-CM + UMLS • Medications • Anatomy • Linguistic structure description • Diagnoses context (present, absent, probable, etc.) • Allergies and smoking • Dates • Weight and height
Example of linguistic structure graph • Fracture de l’épaule (shoulder fracture) => <MALINDET> Fracture de l’<ANAT>épaule</ANAT></MALINDET>
Structure outline • 2 steps : • Extraction • Develop an extraction system able to extract information necessary to the encoding task : • Diagnoses, procedures, locations, dates, allergies, aggravating factors, etc. => Reading help tool. • Encoding • Extracted information => codes through a combination of statistical and symbolic methods.
Machine Learning • Encoding = categorization problem • Features = extracted phrases? • Classes = codes • Baseline method : Naive Bayes • Tool: Weka • Corpus : • 13,635 PDS from Digestive Surgery • 90% training set / 10% test set (1,364 PDS) • Average number of codes per PDS: 6.2 • Trained 1 classifier per code occurring more than 5 times in the corpus : • 775 codes -> 775 classifiers • Limitation: 5% rare codes • Attributes: kept only those co-occurring at least twice with the codes. • Measures: Precision and recall according to the probability returned by the Naive Bayes classifier.
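The "one classifier per code" setup can be sketched as follows. This is not the Weka configuration used in the experiments: it is a small pure-Python Bernoulli Naive Bayes with add-one smoothing, written only to show the idea. The function names, the `min_count` parameter and the toy codes "A" and "B" are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def train_binary_nb(docs, labels):
    """Train one Bernoulli Naive Bayes model (code assigned: True / False).

    docs: list of feature sets; labels: list of booleans.
    """
    vocab = set().union(*docs)
    n = {True: 0, False: 0}
    counts = {True: defaultdict(int), False: defaultdict(int)}
    for feats, y in zip(docs, labels):
        n[y] += 1
        for f in feats:
            counts[y][f] += 1
    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for y in (True, False):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        model["prior"][y] = (n[y] + 1) / (len(docs) + 2)
        model["cond"][y] = {f: (counts[y][f] + 1) / (n[y] + 2) for f in vocab}
    return model


def predict_proba(model, feats):
    """Probability that the code applies to a document with these features."""
    log_p = {}
    for y in (True, False):
        lp = math.log(model["prior"][y])
        for f in model["vocab"]:
            p = model["cond"][y][f]
            lp += math.log(p if f in feats else 1.0 - p)
        log_p[y] = lp
    top = max(log_p.values())
    e = {y: math.exp(lp - top) for y, lp in log_p.items()}
    return e[True] / (e[True] + e[False])


def train_per_code(docs, doc_codes, min_count=6):
    """One classifier per code found in at least min_count documents
    (min_count=6 mirrors 'occurring more than 5 times')."""
    freq = defaultdict(int)
    for codes in doc_codes:
        for c in set(codes):
            freq[c] += 1
    return {
        c: train_binary_nb(docs, [c in codes for codes in doc_codes])
        for c, k in freq.items()
        if k >= min_count
    }


if __name__ == "__main__":
    # Toy data with hypothetical codes "A" and "B" (not real ICD-9-CM codes).
    docs = [{"biopsi", "bile"}, {"biopsi"}, {"hernia"}, {"hernia", "repair"}]
    doc_codes = [{"A"}, {"A"}, {"B"}, {"B"}]
    models = train_per_code(docs, doc_codes, min_count=2)
    print(predict_proba(models["A"], {"biopsi"}))
```

Ranking the codes of a new PDS by `predict_proba` and cutting the list at a probability threshold gives the precision/recall trade-off mentioned above.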
Experiments • A series of experiments was conducted in which the attributes were variants of the extracted diagnoses and procedures after stemming. • Variants tested: • Original word order kept or not. Ex: « excisional biopsy bile duct » or « bile biopsy duct excisional » • Details like location, date, context included or not. Ex: « excisional biopsy » • Each word of the extracted phrases as a feature. Ex: « excisional », « biopsy », « bile », « duct » • Words and morphemes (together) composing the extracted phrase. Ex: « bile biopsy excision excisional duct » • Words and morphemes (separately) composing the extracted phrase. Ex: « excisional biopsy bile duct », « excision biopsy bile duct » • Values were 0 or 1, depending on whether the attribute was in the text or not. • Values were the frequency of the attribute in the text.
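A few of these attribute variants can be sketched in code. The sketch below is illustrative only: stemming is left out, `TOY_MORPHEMES` is a hypothetical one-entry lookup standing in for a real morphological analyser, and all function names are assumptions.

```python
from collections import Counter

# Hypothetical morpheme lookup; a real system would use a morphological analyser.
TOY_MORPHEMES = {"excisional": ["excision"]}


def words(phrase):
    """Variant: each word of the extracted phrase is a feature."""
    return phrase.lower().split()


def unordered(phrase):
    """Variant: word order discarded (words sorted alphabetically)."""
    return " ".join(sorted(words(phrase)))


def words_and_morphemes(phrase):
    """Variant: words and morphemes together in a single attribute."""
    tokens = set(words(phrase))
    for w in list(tokens):
        tokens.update(TOY_MORPHEMES.get(w, []))
    return " ".join(sorted(tokens))


def attribute_values(attributes, text_attributes, binary=True):
    """Values: 0/1 presence (binary=True) or frequency in the text."""
    counts = Counter(text_attributes)
    if binary:
        return [1 if counts[a] else 0 for a in attributes]
    return [counts[a] for a in attributes]


if __name__ == "__main__":
    phrase = "excisional biopsy bile duct"
    print(unordered(phrase))
    print(words_and_morphemes(phrase))
    print(attribute_values(["bile", "duct", "liver"], ["bile", "bile", "duct"]))
```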
Results • 3 best results when thresholding the list of results where the probability returned by Naive Bayes = 1
Discussion • Limitations of the machine learning method: • 5% rare codes : not enough data to build a classifier for these codes • The need for annotated data means that these methods cannot cope with changes in the classification • In these cases, we need to use symbolic methods • Reference : Kevers, Laurent and Medori, Julia. Symbolic classification methods for patient discharge summaries encoding into ICD. In: Advances in Natural Language Processing, 7th International Conference on NLP, IceTAL 2010, Reykjavik, August 16-18, 2010, Lecture Notes in Artificial Intelligence, 2010, p. 197-208.
Objective • Automatic encoding of PDS according to categories (first 3 digits) • Use of symbolic methods • No need for annotated data • Can assign rare codes (27% used 5 times or fewer) • Principle : • Make use of the nomenclature • Enrich it with other resources in French from UMLS (Unified Medical Language System)
Corpus • 19,692 patient discharge summaries (PDS) in French • General Internal Medicine • 150,116 codes (137,336 categories) • 6,029 distinct codes (895 categories) • Average = 7.6 codes/document (7 categories)
Method 1 (MA) – General Principle • Based on the rich morphology of medical language • Ex. Bronchoscopy: « Fibroscopie bronchique » = « bronchoscopie par fibre optique » • 2-step process : • Extract phrases or terms describing diagnoses or procedures to be encoded • Encoding : match these terms to the right code.
Method 1 (MA) – Encoding • Bags-of-words : words – stop words + morphemes + meaning • An ICD-9-CM term and a PDS phrase are each turned into a bag : • « Fibroscopie bronchique » => fibroscopie, bronchique, fibro-, fibre, -scopie, bronch-, bronche, -ique • « Bronchoscopie par fibre optique » => bronchoscopie, par, fibre, optique, bronch-, bronche, -scopie • Similarity score computed between the two bags
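The bag comparison can be sketched as follows. The slides do not say which similarity score is used, so the Dice coefficient below is an assumption, as are the tiny `STOP_WORDS` set and the `MORPHEMES` table (modelled on the bronchoscopy example, where a real system would call a morphological analyser).

```python
# Hypothetical resources for this sketch only.
STOP_WORDS = {"par", "de", "la", "le"}
MORPHEMES = {
    "fibroscopie": ["fibro-", "fibre", "-scopie"],
    "bronchique": ["bronch-", "bronche", "-ique"],
    "bronchoscopie": ["bronch-", "bronche", "-scopie"],
}


def bag(phrase):
    """Words minus stop words, plus their morphemes and morpheme meanings."""
    kept = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    result = set(kept)
    for w in kept:
        result.update(MORPHEMES.get(w, []))
    return result


def dice(a, b):
    """Dice coefficient between two bags (assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    first = bag("Fibroscopie bronchique")
    second = bag("Bronchoscopie par fibre optique")
    print(sorted(first & second))
    print(dice(first, second))
```

The two phrases share no surface word, yet their bags overlap on the morphemes and their meanings, which is what lets the morphological analysis match them.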
Method 2 (ELP) – General principle • Developed by L. Kevers, originally designed for the Stratego project on parliamentary documents. • Symbolic method with less manual work • Use existing « terminological » resources • ICD-9-CM + UMLS • Two-step process • Automatic transformation of existing terminological resources into an extraction resource (only once) • Use extraction resource on documents for term extraction and classification (for each document)
Method 2 (ELP) – build extraction resource (1) • For each ICD-9-CM term (= a class), the automatic processing involves : • Gather synonyms (UMLS) : « dengue » → « dengues », « dengue fever », « infection by the dengue virus » • Parse complex compound expressions : « Infectious and parasitic diseases » → « Infectious disease », « Parasitic disease » • Transform initial term into Extended Lexical Pattern (ELP) • Stop words : → « infection <TOKEN> dengue virus » • Stemming : → « infect <TOKEN> dengue virus » • Allow insertions : → « infect <I> <TOKEN> <I> dengue <I> virus » • Add negative-context patterns • Build the main transducer for text annotation
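The term-to-ELP transformation can be sketched as below. This is a rough illustration, not the actual tool: the stop-word list and the suffix-stripping `toy_stem` are placeholders, and the matching semantics are assumptions made here (`<TOKEN>` is read as exactly one arbitrary word, `<I>` as up to `max_insertions` optional words). The real system compiles patterns into a transducer rather than a regular expression.

```python
import re

STOP_WORDS = {"by", "the", "of", "a", "an"}  # hypothetical list


def toy_stem(word):
    """Placeholder stemmer: strips a final 'ion' from longer words."""
    return word[:-3] if word.endswith("ion") and len(word) > 5 else word


def to_elp(term):
    """Stop-word runs -> <TOKEN>, stem the other words, allow insertions."""
    tokens = []
    for w in term.lower().split():
        if w in STOP_WORDS:
            if not tokens or tokens[-1] != "<TOKEN>":
                tokens.append("<TOKEN>")
        else:
            tokens.append(toy_stem(w))
    return " <I> ".join(tokens)


def elp_to_regex(elp, max_insertions=2):
    """Compile an ELP into a regex under the assumed semantics."""
    parts = [r"\b"]
    for tok in elp.split(" "):
        if tok == "<I>":
            parts.append(r"\s+(?:\w+\s+){0,%d}" % max_insertions)
        elif tok == "<TOKEN>":
            parts.append(r"\w+")
        else:
            parts.append(re.escape(tok) + r"\w*")
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    elp = to_elp("infection by the dengue virus")
    print(elp)
    pattern = elp_to_regex(elp)
    print(bool(pattern.search("Infected with dengue virus")))
```

Because the pattern is built from stems and allows insertions, one pattern per synonym covers many surface variants without manual rule writing.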
Method 2 (ELP) – Transducer & output • Transducer for class '061' Zona [[053]] extremement douloureux [[729]] gastroscopie [[Z44]] acide [[E96]] anemie normochrome normocytaire [[285]] sequellaires apicales droite (tuberculose [[137]] intestin grele [[Z45]] tuberculose [[V12]] • Output of main transducer for a document oesophagite moderee aspecifique [[947]] infection a mycobacterie [[031]] fond de oeil [[Z16]] pas de [[-]] atteinte du nerf [[957]] zona [[053]] hyperthyroidie [[242]] goitre [[706]] goitre [[240]]
Method 2 (ELP) – Class assignment (2) • For a text to classify, analyse the main transducer output • When a negative context is detected, the phrase is skipped • Each recognized phrase has one (or more) related codes • Compute a weight for each phrase based on • Frequency • Whether it is a multi-word expression (frequency*2) or not • Compute a weight for each code by summing up the weights obtained for the phrases • Result : ordered list of codes (possibly thresholded)
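The weighting scheme above can be sketched as follows. The input format is an assumption: the transducer output is taken to be already parsed into `(phrase, codes, negated)` tuples, and the function name `rank_codes` is chosen for this sketch. The sample annotations reuse phrases from the transducer output slide, with "zona" repeated for illustration.

```python
from collections import Counter, defaultdict


def rank_codes(annotations):
    """Return (code, weight) pairs, highest weight first.

    annotations: iterable of (phrase, codes, negated) tuples.
    """
    freq = Counter()
    phrase_codes = {}
    for phrase, codes, negated in annotations:
        if negated:
            continue  # phrases in a negative context are skipped
        freq[phrase] += 1
        phrase_codes[phrase] = codes

    weights = defaultdict(int)
    for phrase, f in freq.items():
        # Multi-word expressions count double.
        w = f * 2 if len(phrase.split()) > 1 else f
        for code in phrase_codes[phrase]:
            weights[code] += w
    # Ties are broken by code so the order is deterministic.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    annotations = [
        ("zona", ["053"], False),
        ("zona", ["053"], False),
        ("infection a mycobacterie", ["031"], False),
        ("atteinte du nerf", ["957"], True),  # preceded by "pas de"
        ("goitre", ["706", "240"], False),
    ]
    for code, weight in rank_codes(annotations):
        print(code, weight)
```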
Combination of methods 1 & 2 • Merge the lists from methods 1 & 2 • Threshold(M.1 union M.2) • Threshold(M.1 inter M.2) • Threshold(M.1) union Threshold(M.2) • Threshold(M.1) inter Threshold(M.2) • The weight for each method can be balanced • Example: 0.4*M.1 union 0.6*M.2
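The four combination strategies can be sketched as below. The slides do not specify how scores are merged, so this sketch assumes each method returns a code-to-score mapping with scores normalised to [0, 1], and that a merged score is the weighted sum of the two scores (a missing score counts as 0 in a union). All names and the toy scores are illustrative.

```python
def threshold(scores, t):
    """Keep only the codes whose score reaches the threshold."""
    return {c: s for c, s in scores.items() if s >= t}


def union(m1, m2, w1=0.5, w2=0.5):
    """Codes proposed by either method, with weighted summed scores."""
    return {c: w1 * m1.get(c, 0.0) + w2 * m2.get(c, 0.0)
            for c in m1.keys() | m2.keys()}


def inter(m1, m2, w1=0.5, w2=0.5):
    """Codes proposed by both methods, with weighted summed scores."""
    return {c: w1 * m1[c] + w2 * m2[c] for c in m1.keys() & m2.keys()}


if __name__ == "__main__":
    m1 = {"053": 1.0, "240": 0.5}  # toy scores for method 1
    m2 = {"053": 0.5, "706": 1.0}  # toy scores for method 2
    t = 0.5
    print(threshold(union(m1, m2, 0.4, 0.6), t))      # Threshold(M.1 union M.2)
    print(threshold(inter(m1, m2, 0.4, 0.6), t))      # Threshold(M.1 inter M.2)
    print(union(threshold(m1, t), threshold(m2, t)))  # Threshold(M.1) union Threshold(M.2)
    print(inter(threshold(m1, t), threshold(m2, t)))  # Threshold(M.1) inter Threshold(M.2)
```

Thresholding before merging filters each method on its own scale, whereas thresholding after merging lets a weak suggestion from one method be rescued by a strong one from the other.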
Conclusions • Results have to be put into perspective: • Inter-annotator agreement ~70% • 15 to 20% cannot be inferred from PDS • Machine learning methods performed well. • Symbolic methods: • MA method based on extraction module : 66% of useful information is extracted. • ELP method performs better when built from short unambiguous phrases. ICD-9-CM code descriptions are more complex. • Future work : • Give more weight to information contained in important parts of the PDS (introduction, conclusion…) • Evaluate the actual help given to human coders • Combine with learning algorithms