Codifying Semantic Information in Medical Questions Using Lexical Sources • Paul E. Pancoast • Arthur B. Smith • Chi-Ren Shyu
Research Purpose • To find a method for classifying medical questions that are asked by clinicians • Hypothesis - Simply indexing by keywords isn’t enough to • distinguish questions with different meanings but similar wording, or to • group questions with similar meanings but different words.
Definitions • Semantic Information – the meaning of the words • Syntactic Information – the parts of speech of the words (word type, sentence part) • Medical Questions – questions asked by clinicians • Lexical Sources – sources of words and vocabularies • UMLS – Unified Medical Language System
UMLS • Ambitious project of the National Library of Medicine, begun in 1986 • Helps researchers retrieve and integrate electronic biomedical information from a variety of sources • Links over 100 controlled vocabularies • Assigns unique identifiers to medical concepts and strings • Maps the hierarchical relationships between the medical concepts
Why Bother? (To classify medical questions?) • Clinicians have questions when treating patients • Researchers have gathered collections of these questions • No good method exists to classify the questions • How many times has a particular question been asked? • Which questions should receive priority for evidence-based answers?
Examples • What is the best way to treat acute pharyngitis? • How should I approach a patient with a sore throat? • What should I do with a patient with diabetes and insulin resistance? • What should I do with a patient with diabetes who is resistant to taking insulin?
Methods Source Questions • American researcher – observed clinicians at work • British researchers – questions sent in by clinicians – answered by researchers • Australian researchers – questions sent in by clinicians – answered by researchers • 4,083 total questions
Methods Source Vocabulary • MRCON – a table from the Metathesaurus • Lists the medical concepts by unique identifier (CUI) and each string associated with a concept • Unique (string => 1 concept) • Ambiguous (string => 2+ concepts) • Example of an ambiguous string: COLD – ambient temperature, viral respiratory infection, chronic obstructive lung disease • 2,247,454 strings associated with concepts • Non-medical Lexicon – from Roget’s Thesaurus • Query objects (why, when, how), identifiers (I, you, he), modifiers (soon, frequently) • 749 terms in this lexicon
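The unique/ambiguous split described on this slide can be illustrated with a minimal Python sketch. This is not the authors' code (their parser was written in C). The input is assumed to be simple (CUI, string) pairs, and the CUI values below are made-up placeholders, not real UMLS identifiers.

```python
from collections import defaultdict

def classify_strings(rows):
    """Split strings into 'unique' (1 concept) and 'ambiguous' (2+ concepts).

    rows: iterable of (cui, string) pairs, an assumed simplification
    of the MRCON table layout.
    """
    concepts_by_string = defaultdict(set)
    for cui, string in rows:
        # Case-insensitive matching is an assumption of this sketch.
        concepts_by_string[string.lower()].add(cui)
    unique = {s for s, cuis in concepts_by_string.items() if len(cuis) == 1}
    ambiguous = {s for s, cuis in concepts_by_string.items() if len(cuis) >= 2}
    return unique, ambiguous

# Hypothetical rows; the CUIs are placeholders for illustration only.
rows = [
    ("CUI_TEMPERATURE", "COLD"),
    ("CUI_VIRAL_INFECTION", "COLD"),
    ("CUI_LUNG_DISEASE", "COLD"),
    ("CUI_PHARYNGITIS", "acute pharyngitis"),
]
unique, ambiguous = classify_strings(rows)
```

Here "cold" lands in the ambiguous set because it maps to three placeholder concepts, while "acute pharyngitis" maps to one and is unique.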
String Matching • Parsing program (written in C) • Separates individual questions into 3-word, 2-word, 1-word windows • Matches the window against MRCON and our lexicon • Generates a report of: • Total number of words parsed • Number of matches from unique, ambiguous, non-medical lists • Strings that didn’t match any of the lists
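The windowed matching described on this slide might look roughly like the following Python sketch. The original program was written in C and its exact strategy is not given, so the longest-window-first order, the tokenizer, and the names `tokenize` and `match_question` are all assumptions made for illustration.

```python
import re

def tokenize(question):
    # Lowercase and keep alphanumeric words (with inner hyphens/apostrophes).
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", question.lower())

def match_question(question, unique, ambiguous, non_medical, max_window=3):
    """Try 3-word, then 2-word, then 1-word windows at each position."""
    words = tokenize(question)
    lists = (("unique", unique), ("ambiguous", ambiguous),
             ("non_medical", non_medical))
    report = {"total_words": len(words), "unique": [], "ambiguous": [],
              "non_medical": [], "unmatched": []}
    i = 0
    while i < len(words):
        matched = False
        # Longest window first (an assumption about the original parser).
        for size in range(min(max_window, len(words) - i), 0, -1):
            window = " ".join(words[i:i + size])
            label = next((name for name, terms in lists if window in terms), None)
            if label is not None:
                report[label].append(window)
                i += size  # skip past the words consumed by this match
                matched = True
                break
        if not matched:
            report["unmatched"].append(words[i])
            i += 1
    return report

# Tiny hypothetical vocabularies, for illustration only.
report = match_question(
    "What is the best way to treat acute pharyngitis?",
    unique={"acute pharyngitis"},
    ambiguous={"cold"},
    non_medical={"what", "is", "the", "to"},
)
```

With these toy vocabularies the report counts 9 words, records "acute pharyngitis" as a unique match, and leaves "best", "way", and "treat" unmatched, which mirrors the kind of report the slide lists.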
Results • String – individual word or words that matched • Hits – how often the string was found • Words – total number of matching words (some strings have more than one word in them)
Results • 100 strings occurred 7,850 times – or 57.6% of the total matches • 712 strings => 3+ hits, 85% of all hits • Our focus was on strings that didn’t match one of the source vocabularies • 19.1% didn’t match • Our hypothesis: additional terms not found in MRCON will be important for indexing
Results • Unmatched words – 2+ occurrences • * Can be more than one word type, depending on the context: “attacks”, “step”, and “process” can each be a noun or a verb
Discussion • MRCON – selected because of low rate of ambiguous string-CUI combinations • 89% unique string matches • 11% ambiguous string matches • Other tables have greater word coverage, but have more ambiguity for each of the words
Discussion • Our word-matching results were similar to those of other researchers • Cimino matched 43% of words with Meta-1 (we had 56% MRCON matches) • Computers & Biomedical Research. Aug 1992;25(4):366-373. • Hersh matched 60% of words to medical terminology & names dictionary (we had 79% combined lexicon matches) • Proceedings/AMIA Annual Fall Symposium. p. 1997.
Discussion • Stop words (prepositions, conjunctions, pronouns) – commonly removed by most normalization tools • They provide valuable contextual information • Blood FOR an HIV-positive patient • Blood FROM an HIV-positive patient • Aspirin AND warfarin • Aspirin OR warfarin
Discussion • Integers • 186 distinct integers or integer-word combinations • Occurred 647 times • Provide additional modification of concepts • Hyperkalemia – 5.3 mEq/L & 8.7 mEq/L • Both are hyperkalemia, but the evaluation and management are markedly different
Discussion • Verbs – largest category of unmatched words • Include action and relation concepts • Non-medical lexicon contained some • Treats, attends, increases, lessens, reduce, follows, starts, can, should, is, equal, improve • Verb tense changes the meaning of a question • In a patient TAKING antibiotics • In a patient who TOOK antibiotics
Discussion • Verbs may be conceptually related to medical concepts • Diagnose => Diagnosis • Treat => Treatment • Evaluate => Evaluation • Prescribe => Prescription • In these cases the verb (relationship) is not equivalent to the noun (concept)
Summary • We developed an application to • Parse individual words from collections of medical questions • Match the words (phrases) with lexical sources, codified by the UMLS • Our results were better than those of previous investigators (for percentage of matched words) • We still have some work to do...
Related Experiments • We attempted to cluster questions by sequences of semantic types • Initial attempts mostly clustered common phrases such as “How should I” and “What is the” • We may repeat this method after discarding ‘stop phrases’
Future Work • Family Practice Inquiries Network (FPIN) has 200 questions that have associated MeSH terms manually assigned by librarians. • We will look at these question-term groups for clustering purposes (with the hypothesis that they will not make distinct clusters).
Future Work • I will work with researchers at NLM to apply MetaMap to medical questions • Extract triplets (Medical Concept-Allowable Relation-Medical Concept) from questions, e.g. Drug-treats-Disease • Insert the triplets into a vector-space model and look for clusters
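One possible reading of "insert the triplets into a vector-space model" is a bag-of-triplets vector per question, compared by cosine similarity. The Python sketch below illustrates only that idea; it does not call MetaMap, the functions `triplet_vector` and `cosine` are hypothetical names, and the Procedure-diagnoses-Disease triplet is an invented example alongside the slide's Drug-treats-Disease.

```python
import math
from collections import Counter

def triplet_vector(triplets):
    """Represent a question as counts of (concept, relation, concept) triplets."""
    return Counter(triplets)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical triplets, as if already extracted from three questions.
q1 = triplet_vector([("Drug", "treats", "Disease")])
q2 = triplet_vector([("Drug", "treats", "Disease"),
                     ("Procedure", "diagnoses", "Disease")])
q3 = triplet_vector([("Procedure", "diagnoses", "Disease")])

sim_12 = cosine(q1, q2)  # shares one triplet
sim_13 = cosine(q1, q3)  # shares none
```

Questions sharing triplets score above zero and disjoint ones score zero, so a pairwise similarity matrix built this way could feed whatever clustering method is eventually chosen.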
Thank you! Questions?