Automatic Document Indexing in Large Medical Collections
Authors : Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis, Evangelos E. Milios (HIKM, 2006)
Advisor : Dr. Hsu
Presenter : Shu-Ya Li
Outline
• Motivation
• Objective
• Current Approach : MMTx
• Method : AMTEx
  • C/NC-value method
  • Use of MeSH Thesaurus as lexical resource
• Experiments
• Conclusion
• Personal Opinions
Motivation
• MMTx, the U.S. NLM approach
  • maps biomedical documents to UMLS term concepts
• The limitations of MMTx in term extraction:
  • term over-generation
  • term concept diffusion
  • unrelated terms added to the final candidate list
• MMTx focuses on UMLS rather than MeSH
  • But MEDLINE indexing is based on MeSH
• Goal: improve the efficiency of automatic indexing of medical documents
Objective
• We propose a new method, AMTEx
  • Improves the efficiency of automatic term extraction by using the C/NC-value method
  • Supports indexing and retrieval of MEDLINE documents, based on the extraction and mapping of document terms to the MeSH Thesaurus
Current Approach : MMTx
• Maps arbitrary text to UMLS Metathesaurus concepts:
  • Parsing (syntactic analysis, linguistic filter)
  • Variant Generation (uses SPECIALIST Lexicon)
  • Candidate Retrieval (mapping process to Metathesaurus concepts)
  • Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)
MMTx Example
• Parsing
  • Shallow syntactic analysis of the input text
  • Linguistic filtering: isolates noun phrases, e.g. the term “ocular complications” is analysed as:
• Variant Generation
  • e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
• Candidate Retrieval
  • Candidate Metathesaurus concepts for the variant “osa”:
    • osa [osa antigen]
    • osa [osa gene product]
    • osa [osa protein]
    • osa [obstructive sleep apnea]
• Candidate Evaluation
    Obstructive Sleep apnea   1000
    Sleep Apnea               901
    Apnea                     827
    …                         …
    Sleeping                  793
    Sleepy                    755
• The limitations of MMTx in term extraction:
  • term over-generation
  • term concept diffusion
  • unrelated terms added to the final candidate list
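The variant generation step in the example above can be illustrated with a deliberately naive stand-in. MMTx itself derives variants from the SPECIALIST Lexicon, which this sketch does not model; here the function `naive_variants` (a hypothetical name) only enumerates contiguous sub-phrases and an initial-letter acronym, which is enough to show why the candidate list grows quickly.

```python
def naive_variants(phrase):
    """Enumerate contiguous sub-phrases plus an acronym (assumption:
    a toy substitute for lexicon-based variant generation)."""
    words = phrase.lower().split()
    variants = []
    # Longest sub-phrases first, then progressively shorter ones.
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            variants.append(" ".join(words[start:start + length]))
    # Initial-letter acronym, only meaningful for multi-word phrases.
    if len(words) > 1:
        variants.append("".join(w[0] for w in words))
    return variants


variants = naive_variants("obstructive sleep apnea")
```

Even this toy version produces seven variants for a three-word phrase, including the ambiguous acronym "osa", which hints at how over-generation arises.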
Method - AMTEx
• Input: Document d, MeSH Ontology
• Resource: MeSH Thesaurus
• Processing steps:
  • C/NC-value Multi-word Term Extraction & Term Ranking
  • Term Mapping
  • Single-word Term Extraction
  • Term Variant Generation
  • Term Expansion
• Output: MeSH Term Lists
Step 1 & 2 : C/NC-value - Multi-word Term Extraction & Ranking
• Part-of-Speech Tagging
• Linguistic filtering:
• Term Extraction - C-value
• Term Ranking - NC-value
• Keep terms up to threshold T1
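The slide names the C-value measure without stating it. As commonly defined in the term-extraction literature, a candidate a with |a| words and frequency f(a) scores log2|a| · f(a) when it is not nested in a longer candidate, and log2|a| · (f(a) minus the mean frequency of the longer candidates containing it) otherwise. The sketch below applies that formula directly. The names `c_value` and `keep_above` are hypothetical, candidates are assumed to arrive already POS-filtered as word tuples, treating T1 as a score cutoff is an assumption, and the NC-value context re-ranking is not modelled.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Compute C-value for each candidate term.

    freq maps a candidate (tuple of words) to its corpus frequency.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates in which `a` is nested.
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            adjusted = f_a - sum(nesting) / len(nesting)
        else:
            adjusted = f_a
        # log2 of the length: single-word candidates score 0 by design.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


def keep_above(scores, t1):
    """Assumption: T1 is read as a minimum score; return terms by rank."""
    kept = [t for t, s in scores.items() if s >= t1]
    return sorted(kept, key=lambda t: scores[t], reverse=True)


# Made-up frequencies, for illustration only.
freq = {("obstructive", "sleep", "apnea"): 4, ("sleep", "apnea"): 6}
scores = c_value(freq)
```

With these made-up counts, "sleep apnea" is penalised because four of its six occurrences are inside the longer term, so the longer term ranks first.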
Step 3 : Term Mapping
• Candidate terms are mapped to terms of the MeSH Thesaurus (simple string matching).
• Only candidate terms matching MeSH are retained.
• Multi-word candidates not matching MeSH may contain (shorter) MeSH terms.
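A minimal sketch of the mapping step, assuming "simple string matching" means exact matching after lowercasing and whitespace normalisation (the slide does not specify the normalisation). The function name `map_to_mesh` and the small vocabulary are hypothetical stand-ins, not real MeSH data.

```python
def normalize(term):
    """Lowercase and collapse whitespace (assumed normalisation)."""
    return " ".join(term.lower().split())


def map_to_mesh(candidates, mesh_terms):
    """Split candidates into those matching a MeSH term and the rest."""
    mesh_index = {normalize(t): t for t in mesh_terms}
    matched, unmatched = [], []
    for candidate in candidates:
        hit = mesh_index.get(normalize(candidate))
        if hit is not None:
            matched.append(hit)          # keep the thesaurus spelling
        else:
            unmatched.append(candidate)  # passed on to the next step
    return matched, unmatched


# Toy vocabulary for illustration only.
toy_mesh = ["Apnea", "Sleep", "Eye Diseases"]
matched, unmatched = map_to_mesh(
    ["eye  diseases", "apnea", "severe sleep apnea"], toy_mesh
)
```

The unmatched list is kept rather than discarded because, as the slide notes, such candidates may still contain shorter MeSH terms.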
Step 4 : Single-word Term Extraction
• Applies to multi-word terms not matching MeSH
• Multi-word terms are split into single-word terms
• Single-word terms are validated against MeSH
• Matched MeSH terms are added to the term list
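The splitting and validation described above can be sketched as follows. This is an illustration under assumptions: matching is case-insensitive, the function name `single_word_terms` is hypothetical, and the vocabulary is a toy list rather than real MeSH data.

```python
def single_word_terms(unmatched, mesh_terms):
    """Split unmatched multi-word terms and keep words found in MeSH."""
    mesh_index = {t.lower(): t for t in mesh_terms}
    found = []
    for term in unmatched:
        for word in term.lower().split():
            hit = mesh_index.get(word)
            # Preserve first-seen order and avoid duplicates.
            if hit is not None and hit not in found:
                found.append(hit)
    return found


toy_mesh = ["Apnea", "Sleep", "Eye Diseases"]
recovered = single_word_terms(["severe sleep apnea", "sleep study"], toy_mesh)
```

Here "severe" and "study" are dropped because they are not in the toy vocabulary, while "Sleep" and "Apnea" are recovered once each.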
Step 5 : Term Variant Generation
• Inflectional variants of the extracted terms are identified during term extraction (C/NC-value)
• Stemmed term forms that are also available in MeSH are added to the list of terms
Step 6 : Term Expansion
• Each term in the list is expanded with neighbor terms in MeSH
• The expansion may include terms more than one level higher or lower than the original term, depending on T2
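One way to read this step is as a bounded walk up and down the thesaurus hierarchy. The sketch below assumes T2 is a maximum number of levels, that only ancestors and descendants count as neighbors, and that each term has a single parent (real MeSH terms can sit in several tree positions). The function name `expand` and the hierarchy are hand-written for illustration, not copied from MeSH.

```python
def expand(term, parent, t2):
    """Return ancestors and descendants of `term` within t2 levels.

    parent maps each child term to its single parent (toy assumption).
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    expanded = []
    # Walk upwards, at most t2 levels.
    node = term
    for _ in range(t2):
        node = parent.get(node)
        if node is None:
            break
        expanded.append(node)
    # Walk downwards level by level, at most t2 levels.
    frontier = [term]
    for _ in range(t2):
        frontier = [c for n in frontier for c in children.get(n, [])]
        expanded.extend(frontier)
    return expanded


# Hand-written toy hierarchy (child -> parent).
toy_parent = {
    "Sleep Apnea Syndromes": "Apnea",
    "Sleep Apnea, Obstructive": "Sleep Apnea Syndromes",
    "Sleep Apnea, Central": "Sleep Apnea Syndromes",
}
one_level = expand("Sleep Apnea Syndromes", toy_parent, 1)
two_levels = expand("Sleep Apnea, Obstructive", toy_parent, 2)
```

Raising T2 widens the expansion, which can add related index terms but also terms that are too general or too specific.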
Experiments
• Precision and Recall measures
• Dataset
  • 61 full MEDLINE documents, from the PMC database of NCBI PubMed
  • MEDLINE documents are paired with their respective MeSH index terms, manually assigned by experts
• Ground Truth
  • the set of MeSH document index terms
• Benchmark method
  • MMTx against AMTEx
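For a single document, the two measures compare the extracted term set with the expert-assigned index terms. The sketch below uses the standard set-based definitions; the function name `precision_recall` and the example terms are made up for illustration and are not results from the experiments.

```python
def precision_recall(extracted, ground_truth):
    """Set-based precision and recall for one document."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    hits = len(extracted & ground_truth)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up example: 2 of 3 extracted terms are among 4 expert terms.
p, r = precision_recall(
    ["Apnea", "Sleep", "Humans"],
    ["Apnea", "Sleep", "Adult", "Male"],
)
```

Over-generation lowers precision (more extracted terms that experts did not assign), which is why the limitations listed for MMTx matter for this evaluation.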
Conclusion - AMTEx
• designed for indexing and retrieval of MEDLINE documents
• focuses on multi-word term extraction using valid linguistic & statistical criteria
• based on MeSH, similarly to human indexing
• selectively expands to term variants & synonyms
• outperforms the current benchmark MMTx method, reaching better precision & recall
Personal Opinions
• Advantage
• Drawback
• …
• Application
• …