An Unsupervised Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline
Bridget T McInnes, University of Minnesota Twin Cities

Background and Introduction
Word Sense Disambiguation is the problem of determining the appropriate sense of a word that has multiple senses. This is a problem for biomedical applications such as medical coding and indexing. We explore whether biomedical knowledge sources, such as the Unified Medical Language System (UMLS) and Medline, can be used to help identify the appropriate sense of a word. To do this, we introduce an unsupervised vector approach that disambiguates words in biomedical text using contextual information from the UMLS, and we compare our results to Humphrey, et al. (JAMIA, 2006) and SenseClusters (Pedersen, et al. http://senseclusters.sourceforge.net).

Data and Resources
• National Library of Medicine WSD dataset
• Conflate Dataset
  • actin – antigens (a_a)
  • angiotensin II – oligomycin (a_o)
  • endogenous – extracellular matrix (e_e)
  • allogenic – arginine – ischemic (a_a_i)
  • X chromosome – peptide – plasmid (x_p_p)
  • diacetate – apamin – meatus – enterocyte (d_a_m_e)
• CuiTools Software Package version 0.13
  • http://sourceforge.cuitools.net

EXAMPLE: Disambiguating mole
Instance: He calculated three moles of the substance in the first sample and five in the second.
• C0439189 : Mole, unit of measurement. It is the amount of substance that contains as many elementary units as there are atoms in 0.012 kg of carbon-12.
• C0027962 : Melanocytic nevus. A benign growth on the skin that contains a cluster of melanocytes and surrounding supportive tissue.

Extract Possible Concepts
Test Data:
He calculated three <head item="mole" sense="?"> mole </head> of the substance in the first sample and five in the second.
Training Data:
... was around 1 mole ...
... mole dose of angiotensin ...
... large mole with brown ...

Algorithm
Inputs: Medline (Training Data), UMLS, Test Data
• Extract Context for Possible Concepts: Possible Concepts and their context
• Create Vectors: Vectors of Possible Concepts and Vector of Target Word
• Calculate Cosine
• Assign Sense: Concept of Target Word

Three vectors
Term          C0439189   C0027962   Target word
amount             4          0          0
substance          4          0          4
elementary         8          0          0
units             12          0          0
atoms             32          0          0
carbon-12          3          0          0
benign             0         10          0
growth             0         12          0
skin               0         34          0
cluster            0         11          0
melanocytes        0          5          0
tissue             0          6          0

[Figure: Calculate the Cosine, with angles Θ¹ and Θ²]

Assign Sense:
He calculated three <head item="mole" sense="C0439189"> mole </head> of the substance in the first sample and five in the second.

Context
• Context of Possible Concepts:
  • CUI: the definition of the possible concept's Concept Unique Identifier (CUI)
  • ST: the definition of the possible concept's Semantic Types (ST)
  • CUI->ST: the definition of the possible concept's CUI; if one does not exist, the definition of its ST
  • CUI+ST: the definitions of both the possible concept's CUI and its ST

[Figure: NLM-WSD Results]
[Figure: Conflate Results]

Conclusion
• The CUI->ST definition obtains the highest accuracy of the context definitions compared
• Our approach makes disambiguation distinctions for words that have the same ST, unlike Humphrey et al.
• Our approach can be used to perform all-words disambiguation, unlike SenseClusters

Acknowledgements
• Ted Pedersen, University of Minnesota Duluth
• John Carlis, University of Minnesota Twin Cities
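
Worked example: the cosine step for mole
The Calculate Cosine and Assign Sense steps can be sketched in Python using the three example vectors for mole. This is a minimal sketch, assuming plain cosine similarity over the twelve terms shown and assuming the sense with the highest cosine is assigned. The names `to_vector`, `cosine`, and `assign_sense` are illustrative and are not taken from CuiTools.

```python
import math

# Term order follows the example vectors on the poster.
TERMS = ["amount", "substance", "elementary", "units", "atoms", "carbon-12",
         "benign", "growth", "skin", "cluster", "melanocytes", "tissue"]

# Weights copied from the poster's example vectors (zero entries omitted).
CONCEPT_COUNTS = {
    "C0439189": {"amount": 4, "substance": 4, "elementary": 8,
                 "units": 12, "atoms": 32, "carbon-12": 3},
    "C0027962": {"benign": 10, "growth": 12, "skin": 34,
                 "cluster": 11, "melanocytes": 5, "tissue": 6},
}
TARGET_COUNTS = {"substance": 4}


def to_vector(counts):
    """Turn a sparse term->weight mapping into a dense vector over TERMS."""
    return [counts.get(term, 0) for term in TERMS]


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_sense(target_counts, concept_counts):
    """Assumed decision rule: pick the concept whose vector is closest in angle."""
    target = to_vector(target_counts)
    scores = {cui: cosine(target, to_vector(counts))
              for cui, counts in concept_counts.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = assign_sense(TARGET_COUNTS, CONCEPT_COUNTS)
print(best, scores)
```

With these numbers the target word vector overlaps only with C0439189 (on "substance"), so its cosine with C0439189 is about 0.11 and its cosine with C0027962 is 0, which matches the sense assigned in the example.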