Semi-Automated Extension of a Specialized Medical Lexicon for French
Bruno Cartoni & Pierre Zweigenbaum
LIMSI-CNRS, France
Outline
• Context: UMLF for French
• The desired coverage
• The target lexical information
• The organisation of a specialised lexicon
• Acquiring lexical information
  • Initial coverage
  • Obtaining lexical entries from general lexicon
  • Guessing technique
• Results
  • Consensus guessing
  • Acquisition of the full paradigm
  • General improvement
• Conclusion and further work
Context: the InterSTIS project
• InterSTIS: development of a Terminology Server for French medical terminologies
• Sub-project: improving the lexical coverage of a French medical lexicon (UMLF: Unified Medical Lexicon for French)
• Use: support the indexing of medical texts
• Issues:
  • What is the desired lexical knowledge?
  • How can it be acquired?
The desired coverage
• Reference: “Term-Union”
• Union of 10 terminologies (CIM-10, SNOMED, MeSH, CISMeF, …) covering French medical domains, organised around the concept identifiers (CUI) of the UMLS
  • 311,518 terms
  • 203,300 unique concepts (CUI)
  • 94,964 word forms
Term-Union: example
C0000936 | MSHFRE | … | Accommodation de l'oei
C0000936 | MSHFRE | … | Accommodation des yeux
C0000936 | MSHFRE | … | Accommodation oculaire
C0000936 | SNMIGIPFRE | … | accommodation visuelle
...
C00001558 | MSHF | … | Voie cutanée
C00001558 | MSHF | … | Voie intradermique
C00001558 | MSHF | … | Voie percutanée
C00001558 | MSHF | … | Voie transcutanée
Observation of term variation
Target lexical information
Term variation within Term-Union:
• Graphemic: équilibre acido-basique vs. équilibre acidobasique [EN: acid-base balance]
• Morphosyntactic: adaptation de l'oeil vs. adaptation des yeux [EN: eye adaptation]
• Morphosemantic: intoxication à l’alcool vs. intoxication alcoolique [EN: alcohol intoxication]
• Others ...
Organisation of the specialised lexicon
• 3 types of relational tables for the 3 levels of representation (graphemic, inflection, derivation)
• A full-entry lexicon (LMF compliant) that gathers all lexical information

Graphemic table (sample):
…
inter-maxillaire | intermaxillaire
insulino-sécrétantes | insulinosécrétantes
scléro-cornéenne | sclérocornéenne
…

Derivation table (sample):
...
abdominal | abdomen
aplasique | aplasie
arachnoïdien | arachnoïde
argentique | argent
…

Inflection table (sample: form | lemma | tag):
…
sérofibrineux | sérofibrineux | Afpms
sérofibrineuse | sérofibrineux | Afpfs
sérofibrineux | sérofibrineux | Afpmp
sérofibrineuses | sérofibrineux | Afpfp
…
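To make the three levels concrete, here is a minimal sketch that loads the sample rows above into in-memory dictionaries and chains them for one word form. The dictionary layout, the `describe` function and its output keys are hypothetical, chosen for illustration; the slides do not say how the tables are stored or queried, nor which spelling of a graphemic pair is treated as canonical.

```python
# Hypothetical in-memory view of the three relational tables (sample rows only).
GRAPHEMIC = {  # spelling variant -> other attested spelling
    "inter-maxillaire": "intermaxillaire",
    "scléro-cornéenne": "sclérocornéenne",
}
INFLECTION = {  # word form -> list of (lemma, tag); one form may carry several tags
    "sérofibrineux": [("sérofibrineux", "Afpms"), ("sérofibrineux", "Afpmp")],
    "sérofibrineuse": [("sérofibrineux", "Afpfs")],
    "sérofibrineuses": [("sérofibrineux", "Afpfp")],
}
DERIVATION = {  # derived lemma -> base lemma
    "abdominal": "abdomen",
    "aplasique": "aplasie",
}

def describe(form):
    """Gather what the three tables say about one word form."""
    analyses = INFLECTION.get(form, [])
    # Assumption: a form missing from the inflection table is treated as its own lemma.
    lemmas = sorted({lemma for lemma, _ in analyses}) or [form]
    return {
        "form": form,
        "variant": GRAPHEMIC.get(form),
        "analyses": analyses,
        "bases": [DERIVATION[l] for l in lemmas if l in DERIVATION],
    }

if __name__ == "__main__":
    for word in ("sérofibrineuses", "abdominal", "inter-maxillaire"):
        print(describe(word))
```

Keeping the levels in separate tables means a graphemic, inflectional or derivational link can be added without touching the other two.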
Acquiring the lexical information
Initial coverage of UMLF (previous project, UMLF, based on Baud et al. 1998):
• 17,192 lexical units
  • 5,353 adjectives
  • 11,799 nouns
• 36,211 word forms
Acquiring the lexical information
• From a general lexicon
  • Existing French general lexicon (Morphalou)
• With a guessing technique
Acquiring the lexical information
• From a guessing technique (Tanguy & Hathout 2007)
• 3 steps:
  • Learning phase: calculating the most frequent tag for each ending string in 2 existing lexicons
  • Guessing phase: assigning the possible tag(s) to each unknown form
  • Cross-validation between the 2 guessers, one trained on each of the 2 lexicons
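The three steps can be sketched as follows. This is a simplified illustration, not the Tanguy & Hathout implementation: the function names, the maximum ending length of 5, the coarse `ADJ`/`NOUN` tags and the two toy lexicons are all assumptions, and the sketch returns a single tag per form (the one attached to the longest known ending) where the slides speak of possible tag(s).

```python
from collections import Counter, defaultdict

def learn_endings(lexicon, max_len=5):
    """Learning phase: map each ending string to its most frequent tag."""
    counts = defaultdict(Counter)
    for form, tag in lexicon.items():
        for n in range(1, min(max_len, len(form)) + 1):
            counts[form[-n:]][tag] += 1
    return {ending: c.most_common(1)[0][0] for ending, c in counts.items()}

def guess(form, endings, max_len=5):
    """Guessing phase: use the longest ending seen during learning."""
    for n in range(min(max_len, len(form)), 0, -1):
        tag = endings.get(form[-n:])
        if tag is not None:
            return tag
    return None  # no known ending: the form stays unanalysed

def consensus(forms, endings_a, endings_b):
    """Cross-validation: keep a form only if both guessers give the same tag."""
    agreed = {}
    for form in forms:
        tag_a, tag_b = guess(form, endings_a), guess(form, endings_b)
        if tag_a is not None and tag_a == tag_b:
            agreed[form] = tag_a
    return agreed

# Toy stand-ins for the two lexicons (invented entries, coarse tags).
GENERAL = {"musculaire": "ADJ", "scolaire": "ADJ", "maison": "NOUN", "raison": "NOUN"}
MEDICAL = {"oculaire": "ADJ", "vasculaire": "ADJ", "lésion": "NOUN", "fusion": "NOUN"}

if __name__ == "__main__":
    unknown = ["alvéolaire", "cloison", "xyz"]
    print(consensus(unknown, learn_endings(GENERAL), learn_endings(MEDICAL)))
```

Requiring agreement between two independently trained guessers trades recall for precision: forms analysed by only one guesser, or analysed differently by the two, are left for other techniques.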
Acquiring the lexical information
• Acquiring the full paradigm
  • All the inflectional forms
  • Lemma
• Based on “productive” inflectional paradigms
  • 9 for adjectives
  • 3 for nouns
• Algorithm based on lexical tries to cluster forms of the same paradigm
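A minimal sketch of trie-based clustering is given below. The slides do not list the 9 adjectival and 3 nominal paradigms, so the two ending sets used here are illustrative assumptions (one follows the sérofibrineux sample rows, the other is the regular French adjective pattern); the function names and the greedy "largest cluster first" rule are also hypothetical.

```python
END = "$"  # marks the end of a word form inside the trie

# Illustrative ending sets only; not the paradigms used by the authors.
PARADIGMS = {
    "adj-eux": {"eux", "euse", "euses"},
    "adj-regular": {"", "e", "s", "es"},
}

def build_trie(forms):
    """Store every form character by character in nested dictionaries."""
    root = {}
    for form in forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def suffixes(node, prefix=""):
    """Yield every ending that completes a word form below this node."""
    for key, child in node.items():
        if key == END:
            yield prefix
        else:
            yield from suffixes(child, prefix + key)

def cluster(forms, min_forms=2):
    """Group forms sharing a stem whose endings fit one paradigm (possibly incomplete)."""
    candidates = []

    def walk(node, stem):
        endings = set(suffixes(node))
        for name, expected in PARADIGMS.items():
            present = endings & expected
            if len(present) >= min_forms:
                candidates.append((stem, name, {stem + e for e in present}))
        for key, child in node.items():
            if key != END:
                walk(child, stem + key)

    walk(build_trie(forms), "")
    # Greedy choice: largest clusters first, each form assigned at most once.
    accepted, used = [], set()
    for stem, name, members in sorted(candidates, key=lambda c: -len(c[2])):
        if not members & used:
            accepted.append((stem, name, members))
            used |= members
    return accepted

if __name__ == "__main__":
    words = ["sérofibrineux", "sérofibrineuse", "sérofibrineuses",
             "cutané", "cutanée", "cutanés", "cutanées", "abdomen"]
    for stem, name, members in cluster(words):
        print(stem, name, sorted(members))
```

Every trie node is a candidate stem, so the endings observed below it can be compared directly with each paradigm's ending set; the greedy step discards smaller overlapping matches such as sérofibrineuse + s.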
Acquisition from general lexicon: results
Source | Known word entries | Remaining words to describe
Term-Union | | 94,964
Initial UMLF | 19,599 | 81,595
Morphalou | 6,617 | 74,978
Acquisition with guessing techniques: results
• 74,978 unknown forms
• 44,515 analyses from the Morphalou-based program
• 35,438 analyses from the UMLF-based program
• Cross-validation: 30,137 in common
Acquisition with guessing techniques: evaluation
Error type | Count
Wrong label | 12
Proper names | 49
Latin words | 5
English words | 1
Spelling/segmentation | 10
Other | 5
Total | 82
• Errors: 82 out of 1,000 (8.2%)
Acquisition of the full paradigm: results
• 4,453 paradigms captured (incomplete or not, grouping 9,352 word forms)
  • 3,308 adjectives
  • 514 nouns
• Automatic extension for the full paradigms (with canonical forms only)
• Manually checked for the others
General improvement
Source | Forms added | Still unknown in Term-Union | Coverage
UMLF-v1 | 36,211 | 81,595 | 14.1%
Morphalou | 17,828 | 74,978 | 21.0%
Acquisition | 8,088 | 70,602 | 25.7%
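The coverage column is consistent with the share of the 94,964 Term-Union word forms that are no longer unknown. The slides do not state this formula; it is inferred from the figures, and the names in the short check below are illustrative.

```python
TERM_UNION_FORMS = 94_964  # word forms in Term-Union, from the desired-coverage slide

STILL_UNKNOWN = {  # forms still unknown after each source, from the table
    "UMLF-v1": 81_595,
    "Morphalou": 74_978,
    "Acquisition": 70_602,
}

def coverage(unknown, total=TERM_UNION_FORMS):
    """Percentage of Term-Union word forms that are described in the lexicon."""
    return 100 * (1 - unknown / total)

if __name__ == "__main__":
    for source, unknown in STILL_UNKNOWN.items():
        print(f"{source}: {coverage(unknown):.1f}%")
```

Each row subtracts only the forms that actually occur in Term-Union, which would explain why the number of forms added is larger than the drop in unknown forms.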
Discussion and conclusion
• The acquisition and evaluation of specialised lexical resources require a specific reference: Term-Union
  • Extract (full) lexical information
  • Assess lexical needs and target
• Other acquisition techniques (CRF for inflectional information, rule-based techniques for derivational information)
Acknowledgment
• This work was partially funded by project InterSTIS (ANR-07-TECSAN-010)
• InterSTIS project: www.interstis.org