Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar
DFKI Language Technology, Saarbrücken, Germany
Overview
• MuchMore Objectives
• Semantic Annotation: Semantic Resources, Term/Relation Tagging
• Corpus Annotation: Part-of-Speech, Morphology, Chunks, Grammatical Functions
• Annotation Format (DTD), Examples, Demo
MuchMore Objectives
• Evaluation
  – Systematic comparison of CLIR methods on a realistic scenario in the medical domain
  – Establishing a baseline with corpus-based methods
  – Comparison with concept-based methods
• Concept-Based CLIR
  – Effective use of medical and general semantic resources by developing methods for tuning and extension
Semantic Resources
• Medical Domain: UMLS (Unified Medical Language System)
  – Medical MetaThesaurus (MeSH, ICD, …): English, German, Spanish, …; 730,000 concepts; 9 relations (broader, narrower, …)
  – Semantic Network: 134 semantic types, 54 semantic relations
• General: WordNet (EN), GermaNet (DE), EuroWordNet ("linked")
UMLS

Concept Names (MRCON): 1,734,706 English entries, 1,462,202 German entries, 66,381 entries in other languages (a parsing sketch follows below)

C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
C0019682|FIN|P|L1523437|PF|S1819346|HIV|3|
C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|

• Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI), e.g. Clozapine: C0009079, Pharmacologic Substance: T121
• Semantic types are organized in a network through 54 relations, e.g. T121|T154|T047
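The MRCON rows above are pipe-delimited. As a rough illustration of how such rows can be turned into a concept-name index, the following Python sketch assumes the field order visible in the sample (CUI, language, term status, LUI, string type, SUI, string); the file path and function name are illustrative, not part of the MuchMore tooling.

# Minimal sketch of reading UMLS MRCON rows into a CUI -> concept-name index.
# Field positions are assumed from the sample lines shown on the slide.
from collections import defaultdict

def load_mrcon(path, languages=("ENG", "GER")):
    """Return {cui: {lang: [names]}} for the selected languages."""
    index = defaultdict(lambda: defaultdict(list))
    with open(path, encoding="latin-1") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 7:
                continue  # skip malformed rows
            cui, lang, name = fields[0], fields[1], fields[6]
            if lang in languages:
                index[cui][lang].append(name)
    return index

# Example: index["C0019682"]["ENG"] would contain "HIV", "HTLV-III", ...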
Term / Relation Tagging
• Annotate terms (of length 1-4 tokens) with preferred term, CUI and TUI (see the lookup sketch below):

<term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C0021708" tui="T073"/>

• Annotate all possible semantic relations between identified terms within a sentence:

<term id="2" tokenid="2" preferred="Heparinoid" cui="C0019142" tui="T121"/>
<term id="5" tokenid="6" preferred="Thrombin" cui="C0040018" tui="T126"/>
<semrel id="40" relterms="5, 2" reltype="interacts_with"/>
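As a hedged sketch of the n-gram term lookup described above (not the MuchMore implementation), the following Python scans token windows of length 4 down to 1 and emits a term annotation for the longest match found in a UMLS-derived lexicon. The lexicon structure {surface_form: (preferred, cui, tui)} and all names are illustrative assumptions.

def tag_terms(tokens, lexicon, max_len=4):
    """tokens: list of word strings; returns a list of term annotation dicts."""
    terms, i, term_id = [], 0, 1
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n]).lower()
            if surface in lexicon:
                preferred, cui, tui = lexicon[surface]
                terms.append({
                    "id": term_id,
                    "tokenid": list(range(i + 1, i + n + 1)),  # 1-based token ids
                    "preferred": preferred, "cui": cui, "tui": tui,
                })
                term_id += 1
                i += n
                break
        else:
            i += 1  # no term starts here
    return terms

lexicon = {"intensive care unit": ("Intensive Care Unit", "C0021708", "T073")}
print(tag_terms("admitted to the intensive care unit".split(), lexicon))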
Corpus Annotation
• Morpho-Syntactic Processing
  – TnT: tokenization, segmentation, PoS tagging
  – Mmorph: lemmatization (German compound analysis)
  – Chunkie: phrase recognition
  – Under development: grammatical function tagging
• Parallel Corpus
  – ~9,000 English and German medical abstracts from 41 journals (obtained through the Springer LINK web site)
  – ~1 M tokens for each language
  – Manual clean-up
Tokenization, POS Tagging
• Tokenization (see the tokenizer sketch below)
  – Hyphenated compounds, e.g. side-effects, short-term, follow-up
  – Abbreviations, e.g. aquos., emulsific., Ungt.
• TnT PoS Tagger (Brants, 2000)
  – Retrain on an annotated domain-specific corpus
  – Update the underlying lexicon with specialist medical lexicons: UMLS (English), ZInfo (German)
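The sketch below is illustrative only (it is not the TnT tokenizer) and shows the two issues named above: hyphenated compounds and domain abbreviations should stay single tokens instead of being split at the hyphen or at the trailing period. The abbreviation list is a toy assumption.

import re

ABBREVIATIONS = {"aquos.", "emulsific.", "Ungt."}

# Order matters: known abbreviations and hyphenated compounds before plain words.
TOKEN_RE = re.compile(
    r"|".join(re.escape(a) for a in ABBREVIATIONS)   # known abbreviations
    + r"|\w+(?:-\w+)+"                               # hyphenated compounds
    + r"|\w+|[^\w\s]"                                # words and punctuation
)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Short-term follow-up showed no side-effects; apply Ungt. daily."))
# -> ['Short-term', 'follow-up', 'showed', 'no', 'side-effects', ';',
#     'apply', 'Ungt.', 'daily', '.']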
Morphology, Phrase Recognition
• Mmorph
  – Dumped full-form lexicon (domain-independent)
  – Decomposition problematic for German, e.g. Schleimhautoedem > Schleimhaut+Oe+Dem (illustrated in the sketch below)
  – German medical specialist lexicon
• Chunkie
  – HMM-based partial parser (Skut and Brants, 2000)
  – Recognition of the internal structure of simple as well as complex NPs, PPs and APs
  – Retraining needed on annotated medical corpora
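As a hedged illustration of why domain-independent decomposition goes wrong: a greedy lexicon-driven splitter can only return splits whose parts are in its lexicon, so without the medical term "Oedem" it cannot produce the intended Schleimhaut+Oedem analysis. The splitter and the lexicon contents below are illustrative assumptions, not Mmorph itself.

def split_compound(word, lexicon):
    """Return the first full decomposition of `word` into lexicon entries, or None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):  # prefer long first parts
        head, rest = word[:i], word[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon)
            if tail:
                return [head] + tail
    return None

general_lexicon = {"schleimhaut", "oe", "dem"}    # domain-independent entries
medical_lexicon = general_lexicon | {"oedem"}     # plus a specialist entry

print(split_compound("Schleimhautoedem", general_lexicon))  # ['schleimhaut', 'oe', 'dem']
print(split_compound("Schleimhautoedem", medical_lexicon))  # ['schleimhaut', 'oedem']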
Grammatical Function Tagging

Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
(English gloss: 30 patients who had to undergo an elective aortocoronary bypass operation were examined.)

Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:OBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.

"Untersucht"   PAS.SUBJ:SUBJ            "Patienten"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:SUBJ   "Patienten"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:OBJ    "sich"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:IOBJ   "Bypassoperation"
XML Annotation Format (DTD)
[DTD tree diagram: a document element with title, keywords (keyword) and sentence elements; each sentence carries parallel annotation layers (gramrels > gramrel, chunks > chunk, terms > term, ewnterms > ewnterm, semrels > semrel) over the text (token) layer.]
XML Annotation (Example)

<?xml version="1.0" encoding="ISO-8859-1" ?>
<document id="DerHautarzt.80490581.eng" type="abstract" lang="eng">
  <sentence id="s1" corresp="s1">
    <terms>
      <term id="s1.t1" tokenid="s1.w5" preferred="Women" cui="C0043209" tui="T098" />
      <term id="s1.t2" tokenid="s1.w7" preferred="Fevers" cui="C0015967" tui="T184" />
      <term id="s1.t3" tokenid="s1.w9 s1.w10" preferred="Weight Loss" cui="C0043096" tui="T184" />
    </terms>
    <gramrels>
      <gramrel id="s1.g1" tokenid="s1.w6 s1.w6" gramtype="ACT" prob="0.750" />
      <gramrel id="s1.g2" tokenid="s1.w5 s1.w6" gramtype="SUBJ" prob="0.017" />
      <gramrel id="s1.g3" tokenid="s1.w7 s1.w6" gramtype="OBJ" prob="0.056" />
      <gramrel id="s1.g4" tokenid="s1.w10 s1.w6" gramtype="OBJ" prob="0.106" />
    </gramrels>
    <chunks>
      <chunk id="s1.c1" from="s1.w1" to="s1.w5" type="NP" />
      <chunk id="s1.c2" from="s1.w9" to="s1.w10" type="NP" />
      <chunk id="s1.c3" from="s1.w11" to="s1.w13" type="PP" />
    </chunks>
    <text>
      <token id="s1.w1" pos="DT" lemma="a">A</token>
      <token id="s1.w2" pos="JJ">34-year-old</token>
      <token id="s1.w3" pos="VBN" lemma1="HIV" lemma2="infect">HIV-infected</token>
      <token id="s1.w4" pos="JJ" lemma="african">African</token>
      <token id="s1.w5" pos="NN" lemma="woman">woman</token>
      <token id="s1.w6" pos="VBN" lemma="develop">developed</token>
      <token id="s1.w7" pos="NN" lemma="fever">fever</token>
      <token id="s1.w8" pos="CC" lemma="and">and</token>
      <token id="s1.w9" pos="NN" lemma="weight">weight</token>
      <token id="s1.w10" pos="NN" lemma="loss">loss</token>
      <token id="s1.w11" pos="IN" lemma="on">on</token>
      <token id="s1.w12" pos="PRP" lemma="her">her</token>
      <token id="s1.w13" pos="NN" lemma="trunk">trunk</token>
      <token id="s1.w14" pos="CC" lemma="and">and</token>
      <token id="s1.w15" pos="NN" lemma="arm">arms</token>
      <token id="s1.w16" pos="punct">.</token>
    </text>
  </sentence>
</document>
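A short Python sketch of reading the annotation layers back with the standard library; the file name "abstract.xml" (assumed to hold the document above) is illustrative.

import xml.etree.ElementTree as ET

tree = ET.parse("abstract.xml")
for sentence in tree.iter("sentence"):
    # token id -> surface form, for resolving tokenid references
    tokens = {t.get("id"): t.text for t in sentence.iter("token")}

    # UMLS term layer: map multi-token term spans back to their surface strings
    for term in sentence.iter("term"):
        surface = " ".join(tokens[tid] for tid in term.get("tokenid").split())
        print(term.get("cui"), term.get("tui"), term.get("preferred"), "<-", surface)

    # grammatical-function layer: dependent/head token pairs with probabilities
    for rel in sentence.iter("gramrel"):
        dep, head = rel.get("tokenid").split()
        print(rel.get("gramtype"), tokens[dep], "->", tokens[head], rel.get("prob"))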