Codifying Semantic Information in Medical Questions Using Lexical Sources

Paul E. Pancoast, Arthur B. Smith, Chi-Ren Shyu

Research Purpose • To find a method for classifying medical questions asked by clinicians • Hypothesis: simply indexing by keywords is not enough to distinguish questions with different meanings but similar wording, or to group questions with similar meanings but different wording

Definitions • Semantic Information – the meaning of the words • Syntactic Information – the parts of speech of the words (word type, sentence part) • Medical Questions – questions asked by clinicians • Lexical Sources – sources of words and vocabularies • UMLS – Unified Medical Language System

UMLS • Ambitious project of the National Library of Medicine, begun in 1986 • Helps researchers retrieve and integrate electronic biomedical information from a variety of sources • Links over 100 controlled vocabularies • Assigns unique identifiers to medical concepts and strings • Maps the hierarchical relationships between the medical concepts

Why Bother? (To classify medical questions?) • Clinicians have questions when treating patients • Researchers have gathered collections of these questions • No good method exists to classify the questions • How many times has a particular question been asked? • Which questions should receive priority for evidence-based answers?

Examples • What is the best way to treat acute pharyngitis? • How should I approach a patient with a sore throat? • What should I do with a patient with diabetes and insulin resistance? • What should I do with a patient with diabetes who is resistant to taking insulin?
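
The four example questions can be used to illustrate the hypothesis. The following is a minimal sketch, not the authors' method: it indexes each question by its keywords and compares the keyword sets with a Jaccard overlap. The stop list and the `keywords` and `jaccard` helpers are assumptions made up for this illustration.

```python
import re

# Hypothetical, tiny stop list used only for this illustration.
STOP = {"what", "is", "the", "to", "how", "should", "i", "a",
        "with", "do", "and", "who", "way"}

def keywords(question):
    """Lowercase the question, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w not in STOP}

def jaccard(q1, q2):
    """Jaccard overlap of two keyword sets (0 = disjoint, 1 = identical)."""
    a, b = keywords(q1), keywords(q2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

similar_meaning = (
    "What is the best way to treat acute pharyngitis?",
    "How should I approach a patient with a sore throat?",
)
different_meaning = (
    "What should I do with a patient with diabetes and insulin resistance?",
    "What should I do with a patient with diabetes who is resistant to taking insulin?",
)

print(jaccard(*similar_meaning))    # keyword overlap of the sore-throat pair
print(jaccard(*different_meaning))  # keyword overlap of the diabetes pair
```

Under these assumptions the two sore-throat questions share no keywords at all, while the two diabetes questions share half of theirs, which is the opposite of how a clinician would group them.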

Methods Source Questions • American researcher – observed clinicians at work • British researchers – questions sent in by clinicians – answered by researchers • Australian researchers – questions sent in by clinicians – answered by researchers • 4083 total questions

Methods Source Vocabulary • MRCON – a table from the Metathesaurus • Lists the medical concepts by unique identifiers (CUI) and each string associated with a concept • unique (string => 1 concept) • ambiguous (string => 2+ concepts), e.g. COLD – ambient temperature, viral respiratory infection, chronic obstructive lung disease • 2,247,454 strings associated with concepts • Non-medical Lexicon – from Roget’s Thesaurus • Query objects (why, when, how), identifiers (I, you, he), modifiers (soon, frequently) • 749 terms in this lexicon
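
The unique/ambiguous distinction can be sketched as a lookup from each string to the set of concepts it names. This is only an illustration: the rows below are placeholders, the identifiers are invented and are not real UMLS CUIs, and `build_index` and `classify` are hypothetical helper names.

```python
from collections import defaultdict

# Placeholder (CUI, string) rows. The identifiers are invented for this
# example and are NOT real UMLS concept identifiers.
ROWS = [
    ("CUI_A", "cold"),         # ambient temperature
    ("CUI_B", "cold"),         # viral respiratory infection
    ("CUI_C", "cold"),         # chronic obstructive lung disease
    ("CUI_D", "pharyngitis"),
]

def build_index(rows):
    """Map each lowercased string to the set of concept identifiers it names."""
    index = defaultdict(set)
    for cui, string in rows:
        index[string.lower()].add(cui)
    return index

def classify(index, string):
    """Return 'unique', 'ambiguous', or 'unmatched' for a string."""
    cuis = index.get(string.lower(), set())
    if not cuis:
        return "unmatched"
    return "unique" if len(cuis) == 1 else "ambiguous"

index = build_index(ROWS)
for s in ("COLD", "pharyngitis", "sore"):
    print(s, classify(index, s))
```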

String Matching • Parsing program (written in C) • Separates individual questions into 3-word, 2-word, 1-word windows • Matches the window against MRCON and our lexicon • Generates a report of: • Total number of words parsed • Number of matches from unique, ambiguous, non-medical lists • Strings that didn’t match any of the lists
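
The windowed matching described on this slide might look roughly like the sketch below. The original program was written in C and its exact strategy is not given here, so this Python version assumes a greedy, left-to-right match that tries the longest window first and consumes the words it matches. The function name, the report keys, and the sample vocabularies are all hypothetical.

```python
import re

def match_question(question, unique, ambiguous, non_medical, max_window=3):
    """Greedy longest-first matching of 3-, 2-, then 1-word windows.

    Assumption: a matched window consumes its words, so they are not
    matched again as part of a shorter window.
    """
    words = re.findall(r"[a-z0-9']+", question.lower())
    report = {"total_words": len(words), "unique": [], "ambiguous": [],
              "non_medical": [], "unmatched": []}
    i = 0
    while i < len(words):
        # Try the widest window that still fits, then shrink it.
        for size in range(min(max_window, len(words) - i), 0, -1):
            window = " ".join(words[i:i + size])
            if window in unique:
                label = "unique"
            elif window in ambiguous:
                label = "ambiguous"
            elif window in non_medical:
                label = "non_medical"
            else:
                continue
            report[label].append(window)
            i += size
            break
        else:
            # No window starting at this word matched any list.
            report["unmatched"].append(words[i])
            i += 1
    return report

# Tiny made-up vocabularies standing in for MRCON and the non-medical lexicon.
report = match_question(
    "How should I treat a sore throat?",
    unique={"sore throat"},
    ambiguous={"cold"},
    non_medical={"how", "should", "i"},
)
print(report)
```

Trying the longest window first keeps a multi-word string such as "sore throat" from being split into two separate one-word lookups.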

Results • String – individual word or words that matched • Hits – how often the string was found • Words – total number of matching words (some strings have more than one word in them)

Results • 100 strings occurred 7850 times – or 57.6% of the total matches • 712 strings => 3+ hits, 85% of all hits • Our focus was on strings that didn’t match one of the source vocabularies • 19.1% didn’t match • Hypothesis that additional terms not found in MRCON will be important for indexing

Results • Unmatched words – 2+ occurrences • * Can be more than one word type, depending on the context: "attacks", "step", and "process" can all be nouns or verbs

Discussion • MRCON – selected because of low rate of ambiguous string-CUI combinations • 89% unique string matches • 11% ambiguous string matches • Other tables have greater word coverage, but have more ambiguity for each of the words

Discussion • Our word-matching results were similar to those of other researchers • Cimino matched 43% of words with Meta-1 (we had 56% MRCON matches) • Computers & Biomedical Research. Aug 1992;25(4):366-373. • Hersh matched 60% of words to a medical terminology & names dictionary (we had 79% combined lexicon matches) • Proceedings/AMIA Annual Fall Symposium. p. 1997.

Discussion • Stop words – prepositions, conjunctions, pronouns; commonly removed by most normalization tools • Provide valuable contextual information • Blood FOR an HIV-positive patient • Blood FROM an HIV-positive patient • Aspirin AND warfarin • Aspirin OR warfarin
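
A short sketch of why this matters: once stop words are stripped, the phrase pairs on this slide become indistinguishable. The stop list and the `normalize` helper are assumptions for illustration and do not reproduce any particular normalization tool.

```python
import re

# Hypothetical stop list; real normalization tools use much longer ones.
STOP_WORDS = {"a", "an", "the", "for", "from", "and", "or"}

def normalize(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop stop words, keeping word order."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    return [w for w in words if w not in stop_words]

pairs = [
    ("Blood FOR an HIV-positive patient", "Blood FROM an HIV-positive patient"),
    ("Aspirin AND warfarin", "Aspirin OR warfarin"),
]
for left, right in pairs:
    # Both members of each pair reduce to the same token list.
    print(normalize(left), normalize(right), normalize(left) == normalize(right))
```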

Discussion • Integers • 186 distinct integers or integer-word combinations • Occurred 647 times • Provide additional modification of concepts • Hyperkalemia – 5.3 mEq/L & 8.7 mEq/L • Both are hyperkalemia, but the evaluation and management are markedly different

Discussion • Verbs – largest category of unmatched words • Include action and relation concepts • Non-medical lexicon contained some • Treats, attends, increases, lessens, reduce, follows, starts, can, should, is, equal, improve • Verb tense changes the meaning of a question • In a patient TAKING antibiotics • In a patient who TOOK antibiotics

Discussion • Verbs may be conceptually related to medical concepts • Diagnose => Diagnosis • Treat => Treatment • Evaluate => Evaluation • Prescribe => Prescription • In these cases the verb (relationship) is not equivalent to the noun (concept)

Summary • We developed an application to • Parse individual words from collections of medical questions • Match the words (phrases) with lexical sources, codified by the UMLS • Our results were better than those of previous investigators (for percentage of matched words) • We still have some work to do...

Related Experiments • We attempted to cluster questions by sequences of semantic types • Initial attempts mostly clustered common phrases such as “How should I” and “What is the” • We may repeat this method after discarding ‘stop phrases’

Future Work • Family Practice Inquiries Network (FPIN) has 200 questions that have associated MeSH terms manually assigned by librarians. • We will look at these question-term groups for clustering purposes (with the hypothesis that they will not make distinct clusters).

Future Work • I will work with researchers at NLM to apply MetaMap to medical questions • Extract triplets (Medical Concept-Allowable Relation-Medical Concept) from questions, e.g. Drug-treats-Disease • Insert the triplets into a vector-space model and look for clusters
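
One possible reading of the vector-space step is sketched below: each question becomes a bag of triplets, and questions are compared by cosine similarity. This is an assumption about how the planned model could work, not a description of it. The triplets are invented examples, and the extraction step (MetaMap) is not shown.

```python
import math
from collections import Counter

def to_vector(triplets):
    """Represent a question as a bag of (concept, relation, concept) triplets."""
    return Counter(triplets)

def cosine(u, v):
    """Cosine similarity between two sparse triplet-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical triplets; a real pipeline would extract them from the questions.
q1 = to_vector([("Drug", "treats", "Disease")])
q2 = to_vector([("Drug", "treats", "Disease"),
                ("Finding", "manifestation_of", "Disease")])
q3 = to_vector([("Procedure", "diagnoses", "Disease")])

print(cosine(q1, q2))  # questions that share a triplet
print(cosine(q1, q3))  # questions with no triplet in common
```

Questions with high pairwise similarity would then be candidates for the same cluster.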

Thank you! Questions?
