A real-time system for classifying and retrieving pediatric clinical data using MetaMap software and UMLS concepts, enabling intelligent decision support, automatic billing, and information retrieval. The project addresses challenges in text preprocessing, disambiguation, and semantic interpretation, and outlines a strategy for structured information extraction and unique word-sense determination.
The Semantic Retrieval System: Real-time System for Classifying and Retrieving Unstructured Pediatric Clinical Annotations Charlotte Andersen • John Pestian • Karen Davis • Lukasz Itert • Paweł Matykewicz • Włodzisław Duch Cincinnati, February 2005
Outline • The project • Goals • Focus • Software • Results • Plans…
CCHRF project outline (simplified): INPUT (raw medical text) → Preprocessing → MetaMap input → MetaMap software (UMLS concept discovery and indexing) → Annotations: Concept Space (UMLS concepts) → Hypothesis generation, validation, important relations → Decision support systems • Automatic medical billing
Long-term goals (too ambitious?) An IR system facilitating discoveries and helping to answer questions like: • Retrieve similar cases using discharge summaries. • Is X related to Y? • Will X help a patient with Y? • What correlates with X? • What causes changes of X? • What are the therapy options for X? Automatic creation of medical billing codes from text. Can we work out scenarios of use for our target system?
First big problem: disambiguation. Map raw text to some structured form, removing all ambiguities, expanding acronyms, etc. Use the NLM's MetaMap to create XML-formatted data whose schema is based on the Unified Medical Language System (UMLS) Semantic Network ontology: <semantic type>word</semantic type>, e.g. E.C. => <bacterium>Escherichia coli</bacterium>, or <patient> <FIRST-NAME>Bob</FIRST-NAME> <LAST-NAME>Nope</LAST-NAME> </patient>
XML or structured text. The final XML should include the maximum information that can be derived with high confidence for each word, including: 1. Annotations for parts of speech (tree tagger) – for which types of words? 2. Tags for semantic type (135 types in the UMLS + tags for other, non-medical types); 3. Tags for word sense (UMLS + dictionaries such as WordNet); 4. Values assigned to some semantic types, e.g. Temperature=high, or T=102F. What we should keep depends on the scenarios in which the system will be used.
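A minimal sketch of what such an annotated record could look like, assuming illustrative element and attribute names (this is not a fixed schema, only one way to carry POS, semantic type, sense, and value together):

```python
# Minimal sketch of the target annotation format; element/attribute names
# ("term", "pos", "semtype", ...) are illustrative, not a fixed schema.
import xml.etree.ElementTree as ET

def annotate(word, pos=None, sem_type=None, sense=None, value=None):
    """Wrap a single token with the attributes we hope to derive for it."""
    el = ET.Element("term")
    el.text = word
    if pos:
        el.set("pos", pos)            # part of speech (tree tagger)
    if sem_type:
        el.set("semtype", sem_type)   # UMLS semantic type
    if sense:
        el.set("sense", sense)        # disambiguated word sense
    if value:
        el.set("value", value)        # extracted value, e.g. 102F
    return el

record = ET.Element("record")
record.append(annotate("E.C.", pos="NN", sem_type="bacterium",
                       sense="Escherichia coli"))
record.append(annotate("temperature", pos="NN", sem_type="finding", value="102F"))
print(ET.tostring(record, encoding="unicode"))
```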
Small steps to solve the big problem. Main subproblems: Removing patient-specific information while keeping all information related to a single case together; how do we link a sequence of records for a single person? Text cleaning: misspellings, obtaining unique terms. Expansion of abbreviations and acronyms. Ambiguity of medical terms. Ambiguity of common words; how interesting are common terms, and which categories/semantic types should be used? Assigning values to some categories, e.g. blood pressure, temperature. Check the XML standards developed at AMIA.
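One possible way to remove patient identifiers while still keeping all records of one person linked is a salted-hash pseudonym; the field names, salt handling, and name patterns below are purely illustrative, not a complete de-identification scheme:

```python
# Sketch: strip patient identifiers but keep a stable pseudonym so that all
# records of one patient stay linked.  Field names, the salt handling and the
# name patterns are illustrative, not a complete de-identification scheme.
import hashlib
import re

SECRET_SALT = "project-specific-secret"   # kept separate from the released data

def pseudonym(patient_id: str) -> str:
    """Deterministic pseudonym: the same patient always maps to the same token."""
    return hashlib.sha256((SECRET_SALT + patient_id).encode()).hexdigest()[:12]

def deidentify(text: str, first: str, last: str, patient_id: str) -> str:
    pid = pseudonym(patient_id)
    text = re.sub(re.escape(first), "PATIENT", text, flags=re.IGNORECASE)
    text = re.sub(re.escape(last), "PATIENT", text, flags=re.IGNORECASE)
    return f"[case {pid}] {text}"

print(deidentify("Bob Nope was admitted with fever.", "Bob", "Nope", "MRN-001"))
```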
Human information retrieval works on 3 levels. First: local recognition of terms; we bring a lot of background knowledge to reading the text, ignoring misspellings and small mistakes. Second: larger units, semantic interpretation; discover and understand the meaning of concepts composed of several terms, define the semantic word sense for ambiguous words, expand terms and acronyms to reach an unambiguous interpretation. Third: the episodic level of processing, or what the whole record or text is about; knowing the category of a text helps in unique interpretation at the recognition and semantic levels.
Recognition. Pawel started some work; a short report on text recognition memory was written. NLM has the GSpell and WedSpell spelling suggestion tools, and the BagOWordsPlus phrase retrieval tool (new, worth trying). The GSpell Java classes are used to propose spelling corrections and a unique spelling for words that have alternative spellings. Even a correctly spelled word may be a mistake, e.g.:
disease|disease|0.0|1.0|NGrams|Correct
disease|discase|1.0|0.873|NGrams|
disease|diseased|1.0|0.873|NGrams|
disease|decease|2.0|0.5819672267388108|NGrams|
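The listing below is not GSpell itself – only a toy bigram-overlap similarity – but it shows how candidate corrections such as discase or decease can be ranked against a lexicon:

```python
# Not GSpell itself -- a toy bigram-overlap similarity, to show how candidate
# corrections (discase, decease, ...) can be ranked against a lexicon.
def bigrams(word):
    w = f"#{word}#"                      # pad so word boundaries count
    return {w[i:i + 2] for i in range(len(w) - 1)}

def similarity(a, b):
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb)   # Jaccard overlap of character bigrams

lexicon = ["disease", "diseased", "decease", "discase"]
query = "disese"                         # a misspelling from a clinical note
for cand in sorted(lexicon, key=lambda w: -similarity(query, w)):
    print(f"{query} -> {cand}: {similarity(query, cand):.3f}")
```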
Recognition, cont. Is this an issue in our case? Can we estimate how serious the problems at the recognition level are? A term may be part of a phrase, and this is discovered only when the term is correctly recognized. How do we know that we have an acronym/abbreviation? Frequently capital letters, usually 2-4 letters, with a morphological structure whose bi-grams are improbable, e.g. DMI, CRC, IVF. Acronyms and abbreviations should be recognized and expanded. We need probabilities of various typos (keys that are close, characters that are inverted, frequent errors, anticipation of what character should come next, etc.) and of errors at the spelling and phonological level. External dictionaries should be checked to find out whether a word is a specific medical term that is not listed in the UMLS.
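A rough heuristic of this kind could look as follows; the bigram list and length limits are illustrative only:

```python
# Rough heuristic for flagging acronym/abbreviation candidates: short tokens
# that are all upper-case, or whose character bigrams look "un-English".
# The bigram list and length limits are illustrative only.
import re

COMMON_BIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en",
                  "nd", "ti", "es", "or", "te", "ed", "is", "ra", "as", "sh"}

def looks_like_acronym(token: str) -> bool:
    if not re.fullmatch(r"[A-Za-z]{2,4}", token):
        return False
    if token.isupper():                   # DMI, CRC, IVF
        return True
    bigrams = [token[i:i + 2].lower() for i in range(len(token) - 1)]
    return not any(b in COMMON_BIGRAMS for b in bigrams)

for t in ["DMI", "CRC", "IVF", "the", "rash"]:
    print(t, looks_like_acronym(t))
```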
Semantic level. Required to: select the most probable term when the recognition process gives several alternatives at the same confidence level; perform WSD, i.e. find the semantic word sense for ambiguous words. A word may have a correct spelling but no sense at the semantic level; then go back to the recognition level and generate more similar words to check which is most probable at the semantic level. In most cases this should give a highly probable term; once this is achieved, a unique semantic word sense is defined. Semantic knowledge representation may be done using: • context vectors, • concept-description vectors; • more elaborate approaches, like frames (CYC).
Semantic knowledge representation. Context vectors: numerical, easy to generate from co-occurrence; a widely used statistical approach, but it lacks semantics, and a concept name and its properties may be far apart. Concept description vectors (CDV), knowledge-based: list properties of concepts, derive information from definitions, dictionaries, and ontologies, paying more attention to unique features. Frames, structured representations: more expressive power, with symbolic values such as color = blue or color in {blue, green}, etc.; time = admission_time; time = day before discharge [time = morning ...], etc. Initially a simple vector representation should be sufficient for WSD, but remember that its expressive power is limited. Some thinking about a simplified, computationally efficient frame-based representation should be done.
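A highly simplified context-vector WSD sketch, with a toy sense inventory and toy context windows (the sense labels and training contexts are assumptions):

```python
# Very simplified context-vector WSD: each sense of an ambiguous term gets a
# vector of words seen around it in training text, and a new occurrence is
# assigned the sense whose vector it overlaps most.  The sense inventory and
# context windows here are toy examples.
from collections import Counter

sense_contexts = {
    "ALL = acute lymphoblastic leukemia":
        "acute lymphoblastic leukemia chemotherapy blast marrow remission".split(),
    "all = common word":
        "all labs within normal limits all symptoms resolved".split(),
}
sense_vectors = {s: Counter(ctx) for s, ctx in sense_contexts.items()}

def disambiguate(window):
    """Pick the sense whose context vector overlaps the window most."""
    w = Counter(window)
    overlap = lambda s: sum(min(w[t], sense_vectors[s][t]) for t in w)
    return max(sense_vectors, key=overlap)

print(disambiguate("history of ALL now in remission after chemotherapy".split()))
```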
Episodic level. Try to understand what the whole record or paragraph is about. ACP has at least 14 distinct meanings in Medline abstracts; the recognition/semantic level is not sufficient for disambiguation. This essentially requires categorization of documents/paragraphs: the record should be placed in some category, and this will restrict the type of semantic meanings that are probable in that category. This is more expensive than the semantic level. To achieve it, categories of records should be identified (document classification). Lukasz has made first experiments with different knowledge representations on discharge summaries. The different levels – R, S, E – are coupled: knowing the disease, it is easier to uniquely expand some acronyms and provide WSD. Adding some XML annotation should make text categorization easier. Several interpretations should be maintained, then one selected.
Billing codes. Is it feasible? Complete automation may be hard. Many courses and books are on the market, worth billions of dollars annually. Simplest solution: a proper database => codes generated automatically. Knowledge-based approach: derive billing codes from texts by looking at the rules in books, analyzing the text, and estimating which fields are easy and which are difficult. Memory-based approach: find similar descriptions that have the same codes (used in the national census). Correlation-based approach: look at the statistical distribution of codes and the correlation between digit values; useful for checking, sometimes for prediction. Demo.
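A toy illustration of the memory-based idea – nearest-neighbour lookup over already-coded descriptions; the descriptions and codes below are only examples, not a validated code set:

```python
# Toy illustration of the memory-based approach: pick the billing code of the
# most similar already-coded description.  Codes and descriptions are examples
# only, not a validated code set.
coded_examples = [
    ("acute otitis media right ear",   "382.9"),
    ("asthma with acute exacerbation", "493.92"),
    ("simple febrile seizure",         "780.31"),
]

def tokens(text):
    return set(text.lower().split())

def suggest_code(description):
    overlap = lambda example: len(tokens(description) & tokens(example[0]))
    best_text, best_code = max(coded_examples, key=overlap)
    return best_code, best_text

print(suggest_code("febrile seizure in a 2 year old"))
```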
General questions • How should we proceed? Depending on the scenario of use, we can work on selected aspects of the problem or try to put the whole system together and keep improving it. • What data can we have access to? How reliable is it? • What should we still do at the pre-processing stage? Anonymizing while still linking individual patients? • How should we leverage the POS-tagged corpus? Compare different unsupervised taggers; check the improvement of supervised taggers; use POS as additional information in concept discovery and WSD; other ideas?
Recognition memory level • Cleaning the text, focusing on details: many misspellings; various recognition memory techniques may be applied to token => term mappings. Pawel has made a good start, but be careful, it is easy to introduce errors. • Improvements of GSpell are of interest to NLM. • About 1000 disambiguation rules were derived from >700K trigrams, but how universal are these rules on new texts? Are some of them too specific? • A semi-automatic approach may be based on context vectors: cluster the different uses of mm, ALL, etc. first, and for each cluster try to assign a unique meaning from context; how does this compare with the manually derived rules? Can we combine the two approaches for higher confidence?
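The semi-automatic clustering idea could be prototyped roughly like this; the vectorizer and clusterer choices, and the example contexts, are assumptions:

```python
# Rough prototype of the semi-automatic idea: collect context windows around an
# ambiguous token ("ALL" here), cluster them, and only then assign a meaning to
# each cluster by hand.  Vectorizer, clusterer and example contexts are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

contexts = [
    "diagnosed with ALL leukemia induction chemotherapy started",
    "relapsed ALL leukemia bone marrow transplant planned",
    "all laboratory values within normal limits",
    "all laboratory results within normal range",
]
X = TfidfVectorizer().fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for text, label in zip(contexts, labels):
    print(label, text)    # each cluster then gets a sense assigned manually
```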
Semantic memory level • So far we have used only MetaMap, but we need phrase and concept indexing: noun phrases, creating equivalence classes, compression of information; finding concepts in whole sentences or large windows, not only in phrases. • WSD, or rather concept sense disambiguation (CSD); work with the context vectors in the compressed text. • Knowledge-based approach: create concept-description vectors from medical dictionaries and ontologies; this goes beyond context vectors by providing reference knowledge. • Knowledge discovery: assigning values to concepts, and concepts to numbers and adjectives, e.g. blood_pressure=[xxx-yyy] or blood_pressure=normal; adjective–noun or number–concept relations; look for relations at this stage, using fuzzy/similarity logic.
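A sketch of such value assignment with simple patterns; the regular expressions and thresholds are illustrative, not validated clinical rules:

```python
# Sketch of value assignment: attach numbers to concepts with simple patterns,
# then map them to qualitative labels.  Regular expressions and thresholds are
# illustrative, not validated clinical rules.
import re

TEMP_RE = re.compile(r"\bT(?:emp(?:erature)?)?\s*(?:=|of|was)?\s*(\d{2,3}(?:\.\d)?)\s*F?", re.I)
BP_RE   = re.compile(r"\b(?:BP|blood pressure)\s*(?:=|of|was)?\s*(\d{2,3})/(\d{2,3})", re.I)

def extract_values(text):
    found = {}
    m = TEMP_RE.search(text)
    if m:
        t = float(m.group(1))
        found["temperature"] = (t, "high" if t >= 100.4 else "normal")
    m = BP_RE.search(text)
    if m:
        found["blood_pressure"] = (int(m.group(1)), int(m.group(2)))
    return found

print(extract_values("On admission T 102.3 F, BP 118/76, alert and oriented."))
```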
Episodic memory level • Document categorization: what categories? For billing, very detailed ones; but even rough categories are useful to narrow down the choices for acronym expansion and WSD. • Lukasz: the most common categories were derived from the database; it is not clear how accurate the initial diagnosis is, but at this rough level it should be fine. • Use MeSH headings at some level? Challenge: select the best set of headings that will help to find the unique sense of words and acronyms. • There are many advanced approaches to text categorization, like kernel-based methods for text – a nice field, but the secret is in pre-processing and finding a good feature space. • Relation to the 20Q game: gaining confidence stepwise.
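A toy document-categorization sketch for this level; the categories, training summaries, and the scikit-learn pipeline choice are all assumptions, with real categories coming from the discharge-summary database or MeSH-like headings:

```python
# Toy document-categorization sketch for the episodic level.  The categories,
# training summaries and the scikit-learn pipeline choice are all assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

summaries = [
    "wheezing responded to albuterol asthma exacerbation",
    "nebulizer treatments given for reactive airway disease",
    "fracture of the left radius closed reduction cast applied",
    "forearm fracture orthopedic follow up arranged",
]
labels = ["pulmonary", "pulmonary", "orthopedic", "orthopedic"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(summaries, labels)
print(classifier.predict(["persistent wheezing despite albuterol at discharge"]))
```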
Suggestions & priorities • What are our priorities? All 3 levels are important. Where will our greatest impact be? • Start with document categorization? People usually know a document's category when they read it; misunderstanding is certain if short documents are given to the wrong experts. Try: knowledge-based clustering and supervised learning; recurrent NNs for structured problems; decision trees for many missing values ... • Good categorization needs concepts/phrases, so we should focus on concept discovery and check its coupling with document categorization, exploring parallel hypotheses. • Some work should also be finished at the recognition memory level: acronyms + misspellings.