Architecture of a Medical Information Extraction System
Dalila Bekhouche (dalila.bekhouche@loria.fr), Yann Pollet (pollet@cnam.fr), Bruno Grilheres (bruno.grilheres@sysde.eads.net), Xavier Denis (xavier.denis@tiscali.fr)
Index
• Introduction
• Information extraction
• The architecture of the IE System
• Extraction of lexical and medical terms
• Evaluation of ICD-10 and CCMA results
• Limits of this approach and future work
1- Introduction
[Figure: Database]
Problem: this volume of information is difficult to access and exploit
• Variety of content
• Specific terminology
• Practitioners use uncertain expressions and meaning modifiers
These features make the texts difficult for most NLP tools to understand.
2- Information extraction
[Diagram labels: Documents (free text), Extraction, Lexical resources, Domain knowledge, Relevant information]
• Aim
• Identify and extract relevant information from medical documents (examination reports such as colonoscopy reports)
• How to identify the relevant information?
• Relevant information: the events and entities described in the texts that concern the patient (signs, diagnoses, acts, results)
3- The architecture of the IE System
[Architecture diagram. Components: Documents, Extraction, Generation, Database, Thesauri (ICD-10 / Vidal / CCMA dictionary), validation resources and rules]
Extraction levels:
• 1- Lexical level: named entities (names, medical terms)
• 2- Sub-sentence level: signs, symptoms
Extracted information:
• Date of examination
• Document type
• Signs
• Diagnosis
• Acts
• Results
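As an illustration only, the extracted fields listed on this slide could be held in a simple record before being stored in the database. The class name `ExtractedReport` and its field names are assumptions made for this sketch; the slides do not describe the actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractedReport:
    """Hypothetical container for the information extracted from one document."""
    examination_date: Optional[str] = None   # date of examination, as found in the text
    document_type: Optional[str] = None      # e.g. a colonoscopy report
    signs: List[str] = field(default_factory=list)
    diagnoses: List[str] = field(default_factory=list)
    acts: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)


# Usage: fill the record as the extraction levels produce annotations.
report = ExtractedReport(document_type="colonoscopy report")
report.signs.append("abdominal pain")
```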
4- Extraction of the lexical terms
Named entities (locations, companies, organizations, dates)
Input: Mr Smith was addressed for a checkup by McGann
Level 1, REGEX(words) or dictionary: Mr <name> was addressed for a checkup by McGann
Level 2, REGEX(words) and level 1: Mr <name> was addressed for a checkup by McGann
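The two levels could be sketched roughly as follows. This is a minimal illustration, not the system's actual rules: the dictionary `NAME_DICTIONARY` and the level-2 pattern `LEVEL2_RULE` are hypothetical. Level 1 tags words found in a dictionary; level 2 applies a regular expression over both plain words and level-1 tags. Note that the hypothetical level-2 rule here also tags the second name, which the slide's example does not show.

```python
import re

# Hypothetical dictionary of known names (assumption for this sketch).
NAME_DICTIONARY = {"Smith"}

# Hypothetical level-2 rule: after a level-1 <name> tag, a capitalized
# word following "by" is also treated as a name.
LEVEL2_RULE = re.compile(r"(<name>.*?\bby )([A-Z][A-Za-z]+)")


def level1(text: str) -> str:
    """Replace every word found in the dictionary with a <name> tag."""
    def tag(match: re.Match) -> str:
        word = match.group(0)
        return "<name>" if word in NAME_DICTIONARY else word
    return re.sub(r"[A-Za-z]+", tag, text)


def level2(text: str) -> str:
    """Apply a regex that combines plain words with level-1 tags."""
    return LEVEL2_RULE.sub(r"\1<name>", text)


sentence = "Mr Smith was addressed for a checkup by McGann"
after_level1 = level1(sentence)
after_level2 = level2(after_level1)
```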
5- Extraction of the ICD-10 and CCMA
Identify the various occurrences of these thesauri in the text
• 1- Preprocessing step:
• Reduce the text and the thesauri
• Standardisation of words, removal of irrelevant words
• 2- Recognition of the discriminating terms
• 3- Evaluate the similarity (cosine measure) between the neighbouring terms in the text and each candidate ICD-10 entry related to the indexing term
ICD-10: International Classification of Diseases
CCMA: Common Classification of Medical Acts
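The preprocessing and similarity steps above can be sketched as follows, assuming a simple bag-of-words representation. The stop-word list, the function names, and the two thesaurus entries (with made-up identifiers) are assumptions for illustration; the slides do not specify how terms are weighted or how the neighbourhood of a term is chosen.

```python
import math
import re
from collections import Counter

# Hypothetical list of irrelevant words removed during preprocessing.
STOPWORDS = {"the", "of", "a", "an", "in", "was", "and"}


def preprocess(text: str) -> list:
    """Lowercase, split into words, and drop irrelevant words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def cosine(tokens_a: list, tokens_b: list) -> float:
    """Cosine measure between two bags of words."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
        math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def best_entry(window: str, thesaurus: dict) -> str:
    """Return the identifier of the entry most similar to the text window."""
    tokens = preprocess(window)
    scores = {key: cosine(tokens, preprocess(label))
              for key, label in thesaurus.items()}
    return max(scores, key=scores.get)


# Toy thesaurus with made-up identifiers (not real ICD-10 codes).
toy_thesaurus = {
    "entry-1": "polyp of the colon",
    "entry-2": "diverticular disease of the intestine",
}
match = best_entry("a small polyp was found in the sigmoid colon", toy_thesaurus)
```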
6- Evaluation of ICD-10 and CCMA results
Precision = (valid annotations found by the system) / (all annotations found by the system)
Recall = (valid annotations found by the system) / (valid annotations found by the practitioner)
• 50% of the annotations are correct. After adding knowledge, precision rises to 87.7%
• Recall remains approximately the same; the remaining misses are due to ambiguous words.
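The two measures can be computed as in the short sketch below, which treats annotations as sets and takes the practitioner's annotations as the reference. The function name and the sample annotations are made up for illustration and are not the evaluation data from the slides.

```python
def precision_recall(system: set, reference: set) -> tuple:
    """Compute precision and recall of system annotations against a reference."""
    valid = system & reference  # annotations the system found that are valid
    precision = len(valid) / len(system) if system else 0.0
    recall = len(valid) / len(reference) if reference else 0.0
    return precision, recall


# Made-up annotations as (term, thesaurus entry) pairs.
system_annotations = {("polyp", "entry-1"), ("colon", "entry-2"),
                      ("pain", "entry-3"), ("exam", "entry-4")}
practitioner_annotations = {("polyp", "entry-1"), ("colon", "entry-2"),
                            ("biopsy", "entry-5")}

precision, recall = precision_recall(system_annotations, practitioner_annotations)
```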
7- Limits of this approach and future work
• French medical texts only, in specific domains: colonoscopy and oncology records.
• Works on simple sentences such as those in medical records, but may have difficulty analysing complex sentences that need deep syntactic analysis
• We will focus on the generation and acquisition steps.
• Taking into account synonyms and user feedback
Thank you!
dalila.bekhouche@loria.fr
PSI (Perception, system, information), Insa Rouen, Place E. Blondel, 76130 Mont St Aignan, France