Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar
DFKI Language Technology, Saarbrücken, Germany
Overview
• MuchMore Objectives
• Semantic Annotation: Semantic Resources, Term/Relation Tagging
• Corpus Annotation: Part-of-Speech, Morphology, Chunks, Grammatical Functions
• Annotation Format (DTD), Examples, Demo
MuchMore Objectives
• Evaluation
  – Systematic comparison of CLIR methods on a realistic scenario in the medical domain
  – Establishing a baseline with corpus-based methods
  – Comparison with concept-based methods
• Concept-Based CLIR
  – Effective use of medical and general semantic resources by developing methods for tuning and extension
Semantic Resources
• Medical Domain: UMLS (Unified Medical Language System)
  – Medical MetaThesaurus (MeSH, ICD, …): English, German, Spanish, …; 730,000 concepts; 9 relations (broader, narrower, …)
  – Semantic Network: 134 semantic types, 54 semantic relations
• General: WordNet (EN), GermaNet (DE), EuroWordNet ("linked")
UMLS

Concept Names (MRCON): 1,734,706 English entries, 1,462,202 German entries, 66,381 entries in other languages (a parsing sketch follows below)

C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
C0019682|FIN|P|L1523437|PF|S1819346|HIV|3|
C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|

• Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI), e.g. Clozapine: C0009079, Pharmacologic Substance: T121
• Semantic types are organized in a network through 54 relations, e.g. T121|T154|T047
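The MRCON rows above are pipe-delimited. As a rough illustration of how such rows can be turned into a concept-name index, the following Python sketch assumes the field order visible in the sample (CUI, language, term status, LUI, string type, SUI, string); the file path and function name are illustrative, not part of the MuchMore tooling.

# Minimal sketch of reading UMLS MRCON rows into a CUI -> concept-name index.
# Field positions are assumed from the sample lines shown on the slide.
from collections import defaultdict

def load_mrcon(path, languages=("ENG", "GER")):
    """Return {cui: {lang: [names]}} for the selected languages."""
    index = defaultdict(lambda: defaultdict(list))
    with open(path, encoding="latin-1") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 7:
                continue  # skip malformed rows
            cui, lang, name = fields[0], fields[1], fields[6]
            if lang in languages:
                index[cui][lang].append(name)
    return index

# Example: index["C0019682"]["ENG"] would contain "HIV", "HTLV-III", ...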
Term / Relation Tagging
• Annotate terms (of length 1-4 tokens) with preferred term, CUI and TUI (see the lookup sketch below):

<term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C0021708" tui="T073"/>

• Annotate all possible semantic relations between identified terms within a sentence:

<term id="2" tokenid="2" preferred="Heparinoid" cui="C0019142" tui="T121"/>
<term id="5" tokenid="6" preferred="Thrombin" cui="C0040018" tui="T126"/>
<semrel id="40" relterms="5, 2" reltype="interacts_with"/>
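As a hedged sketch of the n-gram term lookup described above (not the MuchMore implementation), the following Python scans token windows of length 4 down to 1 and emits a term annotation for the longest match found in a UMLS-derived lexicon. The lexicon structure {surface_form: (preferred, cui, tui)} and all names are illustrative assumptions.

def tag_terms(tokens, lexicon, max_len=4):
    """tokens: list of word strings; returns a list of term annotation dicts."""
    terms, i, term_id = [], 0, 1
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n]).lower()
            if surface in lexicon:
                preferred, cui, tui = lexicon[surface]
                terms.append({
                    "id": term_id,
                    "tokenid": list(range(i + 1, i + n + 1)),  # 1-based token ids
                    "preferred": preferred, "cui": cui, "tui": tui,
                })
                term_id += 1
                i += n
                break
        else:
            i += 1  # no term starts here
    return terms

lexicon = {"intensive care unit": ("Intensive Care Unit", "C0021708", "T073")}
print(tag_terms("admitted to the intensive care unit".split(), lexicon))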
Corpus Annotation
• Morpho-Syntactic Processing
  – TnT: tokenization, segmentation, PoS tagging
  – Mmorph: lemmatization (German compound analysis)
  – Chunkie: phrase recognition
  – Under development: grammatical function tagging
• Parallel Corpus
  – ~9,000 English and German medical abstracts from 41 journals (obtained through the Springer LINK web site)
  – ~1 M tokens for each language
  – Manual clean-up
Tokenization, POS Tagging
• Tokenization (see the tokenizer sketch below)
  – Hyphenated compounds, e.g. side-effects, short-term, follow-up
  – Abbreviations, e.g. aquos., emulsific., Ungt.
• TnT PoS Tagger (Brants, 2000)
  – Retrain on an annotated domain-specific corpus
  – Update the underlying lexicon with specialist medical lexicons: UMLS (English), ZInfo (German)
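The sketch below is illustrative only (it is not the TnT tokenizer) and shows the two issues named above: hyphenated compounds and domain abbreviations should stay single tokens instead of being split at the hyphen or at the trailing period. The abbreviation list is a toy assumption.

import re

ABBREVIATIONS = {"aquos.", "emulsific.", "Ungt."}

# Order matters: known abbreviations and hyphenated compounds before plain words.
TOKEN_RE = re.compile(
    r"|".join(re.escape(a) for a in ABBREVIATIONS)   # known abbreviations
    + r"|\w+(?:-\w+)+"                               # hyphenated compounds
    + r"|\w+|[^\w\s]"                                # words and punctuation
)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Short-term follow-up showed no side-effects; apply Ungt. daily."))
# -> ['Short-term', 'follow-up', 'showed', 'no', 'side-effects', ';',
#     'apply', 'Ungt.', 'daily', '.']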
Morphology, Phrase Recognition
• Mmorph
  – Dumped full-form lexicon (domain-independent)
  – Decomposition problematic for German, e.g. Schleimhautoedem > Schleimhaut+Oe+Dem (illustrated in the sketch below)
  – German medical specialist lexicon
• Chunkie
  – HMM-based partial parser (Skut and Brants, 2000)
  – Recognition of the internal structure of simple as well as complex NPs, PPs and APs
  – Retraining needed on annotated medical corpora
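As a hedged illustration of why domain-independent decomposition goes wrong: a greedy lexicon-driven splitter can only return splits whose parts are in its lexicon, so without the medical term "Oedem" it cannot produce the intended Schleimhaut+Oedem analysis. The splitter and the lexicon contents below are illustrative assumptions, not Mmorph itself.

def split_compound(word, lexicon):
    """Return the first full decomposition of `word` into lexicon entries, or None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):  # prefer long first parts
        head, rest = word[:i], word[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon)
            if tail:
                return [head] + tail
    return None

general_lexicon = {"schleimhaut", "oe", "dem"}    # domain-independent entries
medical_lexicon = general_lexicon | {"oedem"}     # plus a specialist entry

print(split_compound("Schleimhautoedem", general_lexicon))  # ['schleimhaut', 'oe', 'dem']
print(split_compound("Schleimhautoedem", medical_lexicon))  # ['schleimhaut', 'oedem']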
Grammatical Function Tagging

Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
(English gloss: 30 patients who had to undergo an elective aortocoronary bypass operation were examined.)

Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:OBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.

"Untersucht"   PAS.SUBJ:SUBJ            "Patienten"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:SUBJ   "Patienten"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:OBJ    "sich"
"unterziehen"  ACT.SUBJ*OBJ*IOBJ:IOBJ   "Bypassoperation"
XML Annotation Format (DTD)
[DTD tree diagram: a document element with title, keywords (keyword) and sentence elements; each sentence carries parallel annotation layers (gramrels > gramrel, chunks > chunk, terms > term, ewnterms > ewnterm, semrels > semrel) over the text (token) layer.]
XML Annotation (Example)

<?xml version="1.0" encoding="ISO-8859-1" ?>
<document id="DerHautarzt.80490581.eng" type="abstract" lang="eng">
  <sentence id="s1" corresp="s1">
    <terms>
      <term id="s1.t1" tokenid="s1.w5" preferred="Women" cui="C0043209" tui="T098" />
      <term id="s1.t2" tokenid="s1.w7" preferred="Fevers" cui="C0015967" tui="T184" />
      <term id="s1.t3" tokenid="s1.w9 s1.w10" preferred="Weight Loss" cui="C0043096" tui="T184" />
    </terms>
    <gramrels>
      <gramrel id="s1.g1" tokenid="s1.w6 s1.w6" gramtype="ACT" prob="0.750" />
      <gramrel id="s1.g2" tokenid="s1.w5 s1.w6" gramtype="SUBJ" prob="0.017" />
      <gramrel id="s1.g3" tokenid="s1.w7 s1.w6" gramtype="OBJ" prob="0.056" />
      <gramrel id="s1.g4" tokenid="s1.w10 s1.w6" gramtype="OBJ" prob="0.106" />
    </gramrels>
    <chunks>
      <chunk id="s1.c1" from="s1.w1" to="s1.w5" type="NP" />
      <chunk id="s1.c2" from="s1.w9" to="s1.w10" type="NP" />
      <chunk id="s1.c3" from="s1.w11" to="s1.w13" type="PP" />
    </chunks>
    <text>
      <token id="s1.w1" pos="DT" lemma="a">A</token>
      <token id="s1.w2" pos="JJ">34-year-old</token>
      <token id="s1.w3" pos="VBN" lemma1="HIV" lemma2="infect">HIV-infected</token>
      <token id="s1.w4" pos="JJ" lemma="african">African</token>
      <token id="s1.w5" pos="NN" lemma="woman">woman</token>
      <token id="s1.w6" pos="VBN" lemma="develop">developed</token>
      <token id="s1.w7" pos="NN" lemma="fever">fever</token>
      <token id="s1.w8" pos="CC" lemma="and">and</token>
      <token id="s1.w9" pos="NN" lemma="weight">weight</token>
      <token id="s1.w10" pos="NN" lemma="loss">loss</token>
      <token id="s1.w11" pos="IN" lemma="on">on</token>
      <token id="s1.w12" pos="PRP" lemma="her">her</token>
      <token id="s1.w13" pos="NN" lemma="trunk">trunk</token>
      <token id="s1.w14" pos="CC" lemma="and">and</token>
      <token id="s1.w15" pos="NN" lemma="arm">arms</token>
      <token id="s1.w16" pos="punct">.</token>
    </text>
  </sentence>
</document>
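A short Python sketch of reading the annotation layers back with the standard library; the file name "abstract.xml" (assumed to hold the document above) is illustrative.

import xml.etree.ElementTree as ET

tree = ET.parse("abstract.xml")
for sentence in tree.iter("sentence"):
    # token id -> surface form, for resolving tokenid references
    tokens = {t.get("id"): t.text for t in sentence.iter("token")}

    # UMLS term layer: map multi-token term spans back to their surface strings
    for term in sentence.iter("term"):
        surface = " ".join(tokens[tid] for tid in term.get("tokenid").split())
        print(term.get("cui"), term.get("tui"), term.get("preferred"), "<-", surface)

    # grammatical-function layer: dependent/head token pairs with probabilities
    for rel in sentence.iter("gramrel"):
        dep, head = rel.get("tokenid").split()
        print(rel.get("gramtype"), tokens[dep], "->", tokens[head], rel.get("prob"))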