Multilingual Concept Hierarchies for Medical Information Organization and Retrieval

Presentation Transcript


  1. MUCHMORE Multilingual Concept Hierarchies for Medical Information Organization and Retrieval

  2. Project Overview Application  Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval Research & Development  Developing Novel, Hybrid (Corpus-/Concept- Based) Methods for Handling this Scenario Evaluation  Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods

  3. User Perspective (ZInfo) Vision: BAIK Model • MuchMore •  Provide Relevant Medical Information • … for a Specific Patient Problem • … Automatically, from the Web • … Independent of Language

  4. User Perspective (ZInfo) User Requirements • Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient • Retrieval and Relevance Ranking of Evidence Based Medical Literature, Language Independent • Summarization and Filtering of Results According to a User Profile

  5. User Perspective (ZInfo) User Evaluation Evaluate Usefulness  Query Generation  Relevance for Decisions in Diagnostics and Treatment Use for Medical Cases  Part of Postgraduate Course in Medical Informatics Problematic Issues  Different medical profiles, schools, experience, speciality  Relevant for one user may mean less or nothing to another  Evidence based medicine criteria exist only for a small fraction of medicine

  6. MuchMore Prototype • Overview of Prototype Functionality • Relation between Functionality and User Requirements •  Issues Addressed by Research and Development within MuchMore

  7. R&D in MuchMore Semantic Annotation Based CLIR Corpus Annotation (DFKI, ZInfo) •  PoS, Morphology, Phrases, Grammatical Functions •  Term and Relation Tagging • Term Extraction (XRCE, EIT, CMU, CSLI) •  Bilingual Lexicon Extraction, Extension of Semantic Resources • Sense Disambiguation (CSLI, DFKI) •  Tuning and Extension of Semantic Resources •  Combining Sense Disambiguation Methods • Relation Extraction (DFKI, CSLI) •  Grammatical Function Tagging •  Extracting Semantic Relation Indicators •  Extracting Novel Semantic Relations • Semantic Indexing/Retrieval (EIT,DFKI)

  8. R&D in MuchMore Additional Approaches in CLIR • Corpus Based CLIR • Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI) • Pseudo Relevance Feedback: PRF (CMU) • Generalized Vector Space Model: GVSM (CMU) Text Classification Based CLIR (CMU)  Hierarchical/Flat kNN with MeSH Summarization (CMU)  Query, Genre Specific

  9. Corpus Annotation Annotation Evaluation Corpus ~ 9000 English and German Medical Abstracts from 41 Journals, Springer LINK WebSite, ~ 1 M Tokens for each Language PoS • Lexicon Update, Remaining Error Rate ~ 1.5% (EN) Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate. pos=VB > pos_correct=NN Morphology Incorrect, e.g.:Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie • Term and Relation Tagging •  Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query

  10. Term Extraction XRCE (Aims and Resources) Aim • Bilingual Lexicon Extraction: From Comparable Corpora at Word Level; From Parallel Corpora at Word and Term (Multi-Word) Level • Bilingual Extension of Semantic Resource (MeSH) Resources • Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH) • Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)

  11. Term Extraction XRCE (Results of Best Method) Optimal Combination of Resources, Retaining only 10 best Translations for each Candidate • 1. word-to-word, comparable corpora: F1 = 0.84 • 2.a word-to-word, parallel corpora: F1 = 0.98 • 2.b term-to-term, parallel corpora: F1 = 0.85 Evaluating Separately with Individual Resources (F1) • Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56 • 3. MeSH Extension: 1453 new multi-word terms (synonyms or new term entries) extracted from the Springer corpus and added
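The F1 figures on this slide combine precision and recall into one number. A minimal sketch of how such a score could be computed for an extracted lexicon when only the n best translations per candidate are retained; the counting scheme, the function names and the toy lexicon are assumptions for illustration, not the XRCE evaluation protocol.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_lexicon(candidates, gold, n_best=10):
    """Score ranked translation candidates against a gold lexicon.

    candidates: source word -> ranked list of proposed translations
    gold:       source word -> set of accepted translations
    Only the n_best highest-ranked proposals per source word are counted.
    """
    proposed = correct = expected = 0
    for source, gold_translations in gold.items():
        kept = candidates.get(source, [])[:n_best]
        proposed += len(kept)
        correct += len(set(kept) & gold_translations)
        expected += len(gold_translations)
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical toy data, not taken from the Springer corpus.
CANDIDATES = {"Kniegelenk": ["knee joint", "knee"], "Kreuzband": ["ligament"]}
GOLD = {"Kniegelenk": {"knee joint"}, "Kreuzband": {"cruciate ligament"}}

if __name__ == "__main__":
    print(evaluate_lexicon(CANDIDATES, GOLD))
```

Cutting the candidate list at n_best trades recall for precision, which is why the cutoff is reported together with the scores.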

  12. Term Extraction EIT (Similarity Thesauri) Method  Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus Results  Single Word Terms (Springer Abstracts) German-English:104,904 / English-German: 49,454  Multiword Terms (Phrase Lexicon Generated from ICD10) German Phrases: 354 / English Phrases: 665 Bilingual Phrasal Entries Generated: German - English: 225 / English - German: 246
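The frequency comparison described under Method can be sketched as follows. This is only an illustration: the ratio of relative frequencies, the crude smoothing, the thresholds and the toy token lists are assumptions, since the slide does not state the exact scoring used by EIT.

```python
from collections import Counter


def domain_terms(domain_tokens, general_tokens, min_ratio=5.0, min_count=2):
    """Return (word, ratio) pairs for words over-represented in the domain corpus.

    ratio = relative frequency in the domain corpus divided by a smoothed
    relative frequency in the general corpus (assumed scoring, see lead-in).
    """
    domain = Counter(domain_tokens)
    general = Counter(general_tokens)
    n_domain = sum(domain.values())
    n_general = sum(general.values())
    scored = []
    for word, count in domain.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        rel_domain = count / n_domain
        # Crude additive smoothing so unseen general-corpus words do not divide by zero.
        rel_general = (general[word] + 1) / (n_general + 1)
        ratio = rel_domain / rel_general
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first; alphabetical order breaks ties deterministically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical miniature corpora standing in for medical abstracts and newswire.
MEDICAL = "the biopsy showed the carcinoma and the biopsy confirmed carcinoma".split()
GENERAL = "the cat and the dog saw the bird and the fish".split()

if __name__ == "__main__":
    print(domain_terms(MEDICAL, GENERAL, min_ratio=2.0))
```

Function words such as "the" are frequent in both corpora, so their ratio stays low and they drop out without a stop-word list.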

  13. Term Extraction CMU (EBT Bilingual Lexicon) Method • For each word in one language, accumulate counts of the number of times the translations of the sentences containing that word include each word of the other language. These co-occurrence counts may be restricted using word-alignment techniques. • Apply a variable threshold to filter out uncommon co-occurrences which are unlikely to be translations. The result is a lexicon listing candidate translations and their relative frequencies. Results • ~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts) (Estimated Error Rate: < 10%)
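The two steps of this method (accumulate sentence-level co-occurrence counts, then filter them) can be sketched roughly as below. Note the simplifications, all of which are assumptions: a fixed Dice-coefficient threshold stands in for the variable threshold mentioned on the slide, no word alignment is applied, and the sentence pairs are invented.

```python
from collections import Counter, defaultdict


def cooccurrence_lexicon(sentence_pairs, threshold=0.9):
    """Build source word -> {target word: Dice score} from aligned sentence pairs."""
    cooc = defaultdict(Counter)          # source word -> counts of co-occurring target words
    src_freq, tgt_freq = Counter(), Counter()
    for src_sentence, tgt_sentence in sentence_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)       # number of sentences containing each word
        tgt_freq.update(tgt_words)
        for s in src_words:
            cooc[s].update(tgt_words)
    lexicon = {}
    for s, counts in cooc.items():
        scored = {}
        for t, c in counts.items():
            # Dice coefficient: high only if the two words mostly occur together.
            dice = 2 * c / (src_freq[s] + tgt_freq[t])
            if dice >= threshold:
                scored[t] = dice
        if scored:
            lexicon[s] = scored
    return lexicon


# Hypothetical aligned sentences, not taken from the PubMed abstracts.
PAIRS = [
    ("das kniegelenk schmerzt", "the knee hurts"),
    ("das kniegelenk ist stabil", "the knee is stable"),
    ("das kreuzband ist gerissen", "the ligament is torn"),
]

if __name__ == "__main__":
    for source, targets in sorted(cooccurrence_lexicon(PAIRS).items()):
        print(source, targets)
```

On such a tiny sample, words that occur only once (e.g. "kreuzband") keep every word of their single translation sentence as a candidate; this is the kind of noise that more data, word alignment and a tuned threshold are meant to reduce.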

  14. Term Extraction CSLI (Infomap System) Represent English and German Words as Vectors that are Produced by Recording the Number of Co-Occurrences of the Word in Question with each of a Set of Content-Bearing Words. Use (Cosine) Similarity Measure on these Rows to Find “Nearest Neighbours”. [Figure: co-occurrence matrix with English words (e.g. ligament) and German words (e.g. Kreuzband, Kniegelenk) as rows and 1,000 English content-bearing words (e.g. ligament, knee joint) as columns]
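A small sketch of the nearest-neighbour lookup over co-occurrence vectors. The three-dimensional vectors and their counts are made up for illustration (the slide mentions 1,000 content-bearing words), and the function names are not part of the Infomap system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, vectors, k=1):
    """Return the k words whose vectors are most similar to the query word's vector."""
    target = vectors[query]
    scored = [(cosine(target, vec), word) for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:k]]


# Hypothetical co-occurrence counts with three content-bearing words
# (columns: "knee", "rupture", "heart").
VECTORS = {
    "ligament": [4, 5, 0],
    "Kreuzband": [3, 4, 0],
    "Kniegelenk": [6, 1, 0],
    "Herzinfarkt": [0, 1, 7],
}

if __name__ == "__main__":
    print(nearest_neighbours("Kreuzband", VECTORS, k=2))
```

Because English and German words are described by the same set of columns, the nearest neighbour of a German word can be an English word, which is what makes the vectors usable for finding translation candidates.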

  15. WSD: Terms, Senses Semantic Resource Extension and Tuning • Extension (DFKI) • Morphological Analysis (Decomposition) • Entzündungsgewebe (infection tissue) HYPONYM Gewebe,Körpergewebe (body tissue) • Gewebe, Stoff,Textilstoff (textile) • Semantic Similarity (Co-Occurrence Patterns) • Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, .... • Tuning (CSLI, DFKI) • Aligning Clusters with Senses C0043210|GER|P|L1254343|PF|S1496289|Frauen|3| C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|

  16. WSD: Algorithm Combination of Methods (Task, Domain, General) Bilingual Sense Selection (CSLI) • 1 Sense in L1 vs. >1 Sense in L2 • English blood vessel (C0005847)vs. vessel (polysaccharide) (C0148346) • German Blutgefaesse = blood vessel (C0005847) Collocations and Senses (CSLI) • For an ambiguous single word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term. single word term abortion 1) a natural process C0000786 (T047) 2) a medical procedure C0000811 (T061) multiword term recurrent abortion C0000809 (T047) => sense 1 induced abortion C0000811 (T061) => sense 2
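The collocation rule can be sketched with the "abortion" example from the slide. Two things are assumptions here: the corpus counts are invented, and the link from a multiword term to a sense of the single word is made through the shared semantic type (T047 or T061), which is how the example on the slide appears to work.

```python
def sense_from_collocations(word_senses, multiword_types, multiword_counts):
    """Pick the sense of an ambiguous word from its most frequent multiword term.

    word_senses:      semantic type -> concept ID of the single-word term
    multiword_types:  unambiguous multiword term -> its semantic type
    multiword_counts: multiword term -> corpus frequency
    """
    if not multiword_types:
        return None
    most_frequent = max(multiword_types, key=lambda term: multiword_counts.get(term, 0))
    return word_senses.get(multiword_types[most_frequent])


# Senses and terms as listed on the slide.
ABORTION_SENSES = {"T047": "C0000786", "T061": "C0000811"}
MULTIWORD_TYPES = {"recurrent abortion": "T047", "induced abortion": "T061"}
# Hypothetical corpus frequencies.
COUNTS = {"recurrent abortion": 3, "induced abortion": 12}

if __name__ == "__main__":
    print(sense_from_collocations(ABORTION_SENSES, MULTIWORD_TYPES, COUNTS))
```

With these counts "induced abortion" dominates, so the medical-procedure sense (C0000811) is chosen; reversing the counts would select the natural-process sense (C0000786).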

  17. WSD: Algorithm Combination of Methods (Task, Domain, General) Domain Specific Senses (DFKI) • Concept Relevance in Domain Corpus • Mineral 0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium 4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon Instance-Based Learning (DFKI) • Unsupervised Context Models (n-grams) • Training (Learn Class Models) He drank <milk LIQUID> He drank <coffee LIQUID> He drank <tea LIQUID> He drank <chocolate FOOD, LIQUID> • Application (Apply Class Models) He drank <chocolate FOOD, LIQUID> He drank <Java GEOGRAPHICAL, LIQUID>

  18. WSD: Evaluation Lexical Sample Evaluation Corpora (Medical) • Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8) DE: 4059 (1.5) • Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)

  19. Relation Extraction Grammatical Function Tagging (DFKI) • Robust, Shallow Grammatical Function Tagger • EM Model (Trained on Frankfurter Rundschau: 35M Tokens,Adaptation on Medical Corpora Under Development) 1.5M ‘Types’: Verb, Voice, Function, Nom-Head-Argument abarbeiten ACT SUBJ Politiker  Use of PoS Information, Use of Chunk Information Planned  Tags for SUBJ, OBJ, IOBJ, ACT/PAS  German Available, English under Development • Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.

  20. Relation Extraction Semantic Relation Indicators (DFKI, CSLI) Novel Semantic Relations (DFKI, CSLI) • Verbs: differentiate, conclude, discriminate, diagnose, illustrate • Cluster 1: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ... • Verbs: reduce, treat, follow, diagnose, cure • Cluster 3: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ... • Cluster 2: T101/T169, T101/T184, T101/T048, ... • Verbs: suffer, demonstrate, progress, develop, die • Semantic Types: T047: Disease; T048: Mental Dysfunction; T060: Diagnostic Procedure; T101: Patient; T121: Pharm. Substance; T169: Funct. Concept (Syndrome); T184: Sign or Symptom

  21. Summarization (CMU) Extractive Summarization Maximal Marginal Relevance (MMR) • Find passages most relevant to query • Maximize information novelty (minimize passage redundancy) • Assemble extracted passages for summary $\operatorname{Argmax}^{k}_{d_i \in C} \left[ \lambda S(Q, d_i) - (1-\lambda) \max_{d_j \in R} S(d_i, d_j) \right]$ Q = query, d = document, S = similarity function, λ = tradeoff factor between relevance & novelty, k = number of passages to include in summary Applications • Re-ranking retrieved documents from IR Engine • Ranking passages from a document for inclusion in summaries • Ranking passages from topically-related document cluster for cluster summary
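A greedy selection loop is the usual way to apply the MMR criterion; the sketch below assumes that C is the set of candidate passages and R the set of passages already selected, which the slide does not spell out. Jaccard word overlap is used as the similarity function S purely for illustration, and the passages are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, standing in for the similarity function S."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mmr_select(query, candidates, k, lam=0.7, similarity=jaccard):
    """Greedily pick k passages, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(passage):
            # Redundancy is the similarity to the closest already-selected passage.
            redundancy = max((similarity(passage, s) for s in selected), default=0.0)
            return lam * similarity(query, passage) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical query and passages.
QUERY = "knee ligament rupture"
PASSAGES = [
    "rupture of the knee ligament",
    "the knee ligament rupture",
    "surgery after ligament injury",
]

if __name__ == "__main__":
    print(mmr_select(QUERY, PASSAGES, k=2, lam=0.5))
    print(mmr_select(QUERY, PASSAGES, k=2, lam=1.0))
```

With λ = 1.0 the ranking is by relevance alone and both near-duplicate passages are returned; with λ = 0.5 the second pick is the less relevant but novel passage.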

  22. Summarization (CMU) MuchMore Application  INDICATIVE and QUERY-RELEVANT  MMR applies to English and German • Genre-based specialization (e.g. include conclusions for scientific articles) • Linguistic specialization possible  Summarization should apply when retrieving FULL articles  query-driven summaries instead of generic abstracts

  23. Technical Evaluation Test Data  Test Collection: Springer Abstracts (German and English)  Query Set: 25 of 126 Selected by ZInfo  Relevance Assessments Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant Pool Size: 500 Documents Based on 18 Runs Done by CMU, CSLI and EIT German (ZInfo): 959 Relevant Documents English (CMU): 500 Relevant Documents (1 judge) 964 Relevant Documents (3 judges)

  24. Technical Evaluation Methods Evaluated • Corpus Based Similarity Thesaurus (EIT) • Example-based Translation (CMU) • Pseudo Relevance Feedback (CMU) • Generalized Vector Space Model (CMU) • Hybrid Classification (CMU) • Hierarchical: kNN, Rocchio • Flat: kNN, Rocchio-style Classifier • Semantic Annotation + Extraction (DFKI, XRCE) • UMLS / XRCE Terms & Semantic Relations EuroWordNet Terms • Semantic Annotation + Similarity Thesaurus

  25. Technical Evaluation TREC-Style Performance Measurements • Overall Performance •  11point-Average Precision (Interpolated) • Performance in the High-Precision Area • Assumption: User Wants to Get Most Relevant Documents Topranked within the Result List •  Average Interpolated Precision at Recall of 0.1 •  Exact Precision after 10 Retrieved Documents • Applied to Experiments Evaluating Semantic Annotations
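The three measures named on this slide can be sketched for a single query as follows. The interpolation rule (maximum precision at any rank whose recall reaches the given level) is the standard TREC-style definition; the function names and the toy ranking are assumptions for illustration.

```python
def precision_at_k(ranked, relevant, k=10):
    """Exact precision after k retrieved documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def interpolated_precision(ranked, relevant, recall_level):
    """Highest precision at any rank whose recall is at least recall_level."""
    if not relevant:
        return 0.0
    best, hits = 0.0, 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Precision can only peak at ranks where a relevant document is found.
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best


def eleven_point_average(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranked, relevant, r) for r in levels) / len(levels)


# Hypothetical result list for one query.
RANKED = ["d1", "d2", "d3", "d4", "d5"]
RELEVANT = {"d1", "d3"}

if __name__ == "__main__":
    print(eleven_point_average(RANKED, RELEVANT))
    print(interpolated_precision(RANKED, RELEVANT, 0.1))
    print(precision_at_k(RANKED, RELEVANT, k=10))
```

In an evaluation over a query set, each measure would be computed per query as above and then averaged over all queries.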

  26. Technical Evaluation Results: Corpus Based Methods Data Sets  EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English, and 9640 documents for German CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each.

  27. Technical Evaluation Results: Hybrid Methods Categorization (Preliminary Results) Reuters-21578: 10,000+ documents, 90 categories Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories Reuters Koller & Sahami subsets (ICML’98): 138 to 939 documents, 6-11 categories in a set OHSUMED: 233,445 documents, 14,321 categories

  28. Technical Evaluation Results: Hybrid Methods Semantic Annotation + Extraction Data Set Full Springer Corpus Weighting Scheme Coordination Level Matching (CLM): 1. Pass: Documents Preferred Containing Matching Terms or Semantic Relations 2. Pass: All Features Using lnu.ltn Rel. Assessments German

  29. Technical Evaluation Results: Hybrid Methods Semantic Annotation + Similarity Thesaurus Data Set Full Springer Corpus Weighting Scheme Coordination Level Matching (CLM) Rel. Assessments German

  30. Technical Evaluation Summary of the Results • Assumption: CLIR achieves up to 75 % of Monolingual Baseline • (11pt Average Precision) • Corpus-based Methods (Compared to Monolingual PRF) • German – English PRF: 81 %, EBT: 77 %, EIT: 66% • English – German PRF: 113 %, EBT: 106 %, EIT: 60% • Hybrid Methods (Compared to Monolingual EIT) • German – English: 73 % (UMLS Terms & SemRels) • English – German: 50 % (UMLS Terms & SemRels) • English – German: 80 % (UMLS Terms & SemRels & XRCE Terms) • German – English: 74 % (SimThes & UMLS Terms & SemRels) • English – German: 80 % (SimThes & UMLS Terms & SemRels) • English – German: 92 % (SimThes & UMLS Terms & SemRels & XRCE Terms)

  31. Management Deviations from the Work Plan Corpus Collection • Comparable Medical Document Corpora are Very Difficult to Obtain, Anonymization Must be Validated by Hospital CIO • Work with “Shuffled” Parallel Corpus • Radiology Reports (~600,000) Available in German, to be Obtained for English Corpus Annotation • More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon) Relation Extraction • More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction

  32. Management Future Prospects and Activities R&D Topics • Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI) • Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé) Related Projects and Workshops • Project Proposal IKAR/OS on KM & Visualization in Life Sciences • OntoWeb SIG on LT in Ontology Development and Use • MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002) • ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)
