Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity in the 2001 UMLS Metathesaurus. James J. Cimino, Department of Medical Informatics, Columbia University. 2001 Metathesaurus: 99 sources (92 in 2000); 1,734,707 strings (1,598,176 in 2000); 797,360 concepts (730,155 in 2000).
Lumping vs. Splitting • Split too finely: Cold (temperature), COLD (temperature), Cold (infection), COLD (COPD): Redundancy! • Lumped too coarsely: Cold (temperature), COLD (temperature), Cold (infection), COLD (COPD): Ambiguity!
Three Auditing Methods • Ambiguity through multiple semantic types • Redundancy through semantic string matching • Inconsistency in parent-child semantic types
Previous Results: 1995* • Possible ambiguity: 1,817 • Possible redundancy: 5,031 • Actually redundant: 3,274 • Parent-Child problems: 544 * Cimino JJ. Auditing the Unified Medical Language System with semantic methods. Journal of the American Medical Informatics Association. 1998;5:41-51.
Tools and Rules • Simple Metathesaurus data model • Normalized word index • “Mutually exclusive semantic types” • “Mutual concept subsumption”
Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease • L0486186: S0837575: “Chronic Obstructive Airway Disease”, S0837576: “Chronic Obstructive Lung Disease” • L0009264: S0829315: “COLD <3>”, S0474508: “COLD” Semantic type: T04: Disease or Syndrome
Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease S0837575: “Chronic Obstructive Airway Disease” S0837576: “Chronic Obstructive Lung Disease” S0829315: “COLD <3>” S0474508: “COLD” Semantic type: T04: Disease or Syndrome
Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease “Chronic Obstructive Airway Disease” “Chronic Obstructive Lung Disease” “COLD <3>” “COLD” Semantic type: T04: Disease or Syndrome
Simple Metathesaurus Data Model C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD <3> COLD Semantic type: T04: Disease or Syndrome
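The data model on these slides can be sketched as a pair of small Python classes. This is an illustration only, not the Metathesaurus file format: the class and field names (`Term`, `Concept`, `strings`, `parents`, `all_strings`) are assumptions, while the identifiers and strings are the ones shown on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A lexical group (L-identifier) with its string variants keyed by S-identifier."""
    lui: str
    strings: dict = field(default_factory=dict)


@dataclass
class Concept:
    """A concept (C-identifier) with terms, semantic types, and is-a parents."""
    cui: str
    name: str
    semantic_types: set = field(default_factory=set)
    terms: list = field(default_factory=list)
    parents: set = field(default_factory=set)  # identifiers of is-a parents

    def all_strings(self):
        """Flatten the term level, as the simplified slides do."""
        return [s for term in self.terms for s in term.strings.values()]


copd = Concept(
    cui="C0024117",
    name="Chronic Obstructive Airway Disease",
    semantic_types={"Disease or Syndrome"},
    terms=[
        Term("L0486186", {"S0837575": "Chronic Obstructive Airway Disease",
                          "S0837576": "Chronic Obstructive Lung Disease"}),
        Term("L0009264", {"S0829315": "COLD <3>", "S0474508": "COLD"}),
    ],
    parents={"C0035242"},  # Respiratory Tract Diseases
)
print(copd.all_strings())
```

Dropping the term and string identifiers, as `all_strings` does, mirrors the progressively simplified view used in the rest of the talk.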
UMLS Semantic Types (tree excerpt) • Physical Object: Organism, Substance • Organism: Animal, Plant • Animal: Invertebrate • Plant: Alga • Substance: Food
Mutually Inclusive Semantic Types (figure: tree of Physical Object, Organism, Substance, Animal, Plant, Invertebrate, Food, Alga)
Mutually Exclusive Semantic Types (figure: tree of Physical Object, Organism, Substance, Animal, Plant, Food, Invertebrate, Alga)
Rules for Multiple Semantic Types • 3. Concepts can have two Substance types, except: a) Element, Ion, or Isotope and Chemical Viewed Structurally; b) Inorganic Chemical and Organic Chemical • 5. Concepts can have two Conceptual Entity types, except: a) Molecular Sequence and Geographic Area; b) Molecular Sequence and Body Location or Region; c) Geographic Area and Body Location or Region • 7. Concepts can have two Event types, except: Diagnostic Procedure and Laboratory Procedure • 8. Concepts can have two types that are ancestors/descendants of each other
Detection of Ambiguity by Mutually Exclusive Semantic Types If a concept has multiple semantic types And if any pair of the types are mutually exclusive Then the concept may have multiple meanings (ambiguity) Or the semantic type assignment is incorrect
Ambiguity Examples • C0015155: Euglena gracilis: Alga and Invertebrate • C0223537: Fourth lumbar vertebra: Body Part, Organ, or Organ Component and Disease or Syndrome • C0035510: Toxicodendron: Plant and Disease or Syndrome • C0242789: Crown-Rump Length: Organism Attribute and Diagnostic Procedure • C0007608: Cell Movement: Cell Function and Biomedical Occupation or Discipline • C0030756: Lice Infestations: Invertebrate and Disease or Syndrome • C0008715: Chronically Ill: Disease or Syndrome and Patient or Disabled Group
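The ambiguity check can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `EXCLUSIVE_PAIRS` is a hypothetical, hand-picked subset drawn from the rules and examples on the slides (the full rule set is not reproduced here), and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical subset of mutually exclusive type pairs, taken from the
# rules and examples on the slides; not the complete rule set.
EXCLUSIVE_PAIRS = {
    frozenset({"Alga", "Invertebrate"}),
    frozenset({"Plant", "Disease or Syndrome"}),
    frozenset({"Invertebrate", "Disease or Syndrome"}),
    frozenset({"Diagnostic Procedure", "Laboratory Procedure"}),
    frozenset({"Inorganic Chemical", "Organic Chemical"}),
}


def exclusive_pairs(semantic_types):
    """Return the pairs of types on one concept that are mutually exclusive."""
    return [
        (a, b)
        for a, b in combinations(sorted(semantic_types), 2)
        if frozenset({a, b}) in EXCLUSIVE_PAIRS
    ]


def possibly_ambiguous(concepts):
    """Map concept identifier -> offending type pairs, for manual review."""
    flagged = {}
    for cui, types in concepts.items():
        pairs = exclusive_pairs(types)
        if pairs:
            flagged[cui] = pairs
    return flagged


concepts = {
    "C0015155": {"Alga", "Invertebrate"},          # Euglena gracilis
    "C0035510": {"Plant", "Disease or Syndrome"},  # Toxicodendron
    "C0024117": {"Disease or Syndrome"},           # single type, never flagged
}
flagged = possibly_ambiguous(concepts)
print(flagged)
```

A flagged concept is only a candidate: as the rule states, either the concept has more than one meaning or one of its semantic type assignments is wrong.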
Normalized Word Index • UMLS Normalized Word Index (e.g., “lungs” → “lung”): 293,004 words • Keyword synonyms (e.g., “lung” → “pulmonary”): 9,650 mappings • Translated strings • Built word index
Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD <3> COLD Semantic type: T04: Disease or Syndrome
Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disease chronic obstructive lung disease cold 3 cold Semantic type: T04: Disease or Syndrome
Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disease chronic obstructive pulmonary disease cold 3 cold Semantic type: T04: Disease or Syndrome
Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold 3 cold Semantic type: T04: Disease or Syndrome
Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold Semantic type: T04: Disease or Syndrome
Word Index C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold Word list: airway chronic cold disorder obstructive pulmonary three Semantic type: T04: Disease or Syndrome
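The normalization steps shown across the preceding slides (lowercasing, word normalization, keyword synonyms, then collecting a word list) can be sketched as follows. The two lookup tables are hypothetical stand-ins containing only entries visible on the slides; the real index has 293,004 words and 9,650 synonym mappings, and the function names here are assumptions.

```python
import re

# Hypothetical stand-ins for the UMLS normalized word index and the
# keyword-synonym mappings; only entries visible on the slides are included.
NORMALIZED = {"lungs": "lung"}
SYNONYMS = {"lung": "pulmonary", "disease": "disorder",
            "1": "one", "2": "two", "3": "three"}


def normalize(string):
    """Lowercase, drop punctuation, then map each word through both tables."""
    words = re.findall(r"[a-z0-9]+", string.lower())
    words = [NORMALIZED.get(w, w) for w in words]
    words = [SYNONYMS.get(w, w) for w in words]
    return " ".join(words)


def word_index(strings):
    """Collect the distinct normalized words across a concept's strings."""
    return sorted({w for s in strings for w in normalize(s).split()})


strings = [
    "Chronic Obstructive Airway Disease",
    "Chronic Obstructive Lung Disease",
    "COLD <3>",
    "COLD",
]
translated = [normalize(s) for s in strings]
index = word_index(strings)
print(translated)
print(index)
```

Mapping every word to a single preferred synonym is what lets strings such as “lung disease” and “pulmonary disorder” match later, at the cost of more false matches.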
Mutual String Subsumption 1) If Concept A has String A1 And all words in A1 are in Concept B’s word list Then B subsumes A1 2) If B subsumes any string in A And A subsumes any string in B Then A and B are mutually subsumptive
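The two rules above translate almost directly into set operations. The sketch below assumes the strings have already been normalized; the function names are illustrative, and the example strings are the normalized forms that appear on the slides.

```python
def words_of(strings):
    """Build a concept's word list from its normalized strings."""
    return {w for s in strings for w in s.split()}


def subsumes(word_list, string):
    """Rule 1: every word of `string` appears in the other concept's word list."""
    return set(string.split()) <= set(word_list)


def mutually_subsumptive(strings_a, strings_b):
    """Rule 2: B subsumes some string of A, and A subsumes some string of B."""
    words_a, words_b = words_of(strings_a), words_of(strings_b)
    b_subsumes_a = any(subsumes(words_b, s) for s in strings_a)
    a_subsumes_b = any(subsumes(words_a, s) for s in strings_b)
    return b_subsumes_a and a_subsumes_b


common_cold = ["common cold", "cold two", "cold"]
copd = ["chronic obstructive airway disorder",
        "chronic obstructive pulmonary disorder", "cold three", "cold"]

# Both concepts carry the bare string "cold", so each subsumes the other.
print(mutually_subsumptive(common_cold, copd))

# One-directional case: the longer name contains a word the shorter lacks.
print(mutually_subsumptive(["diabetes insipidus"],
                           ["central diabetes insipidus"]))
```

The second call returns False because subsumption holds in one direction only; it becomes mutual as soon as the longer concept also lists the shorter name as a synonym.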
Mutual String Subsumption • C0009443: Common Cold: strings “common cold”, “cold two”, “cold”; word list: cold, common, two; T04: Disease or Syndrome • C0009264: cold temperature: strings “cold temperature”, “cold one”, “cold”; word list: cold, one, temperature; T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings “chronic obstructive airway disorder”, “chronic obstructive pulmonary disorder”, “cold three”, “cold”; word list: airway, chronic, cold, disorder, obstructive, pulmonary, three; T04: Disease or Syndrome
Detection of Redundancy by String Subsumption If A and B are mutually subsumptive And semantic types of A and B are mutually inclusive Then A and B may be redundant
Detection of Redundancy by String Subsumption • C0009443: Common Cold: strings “common cold”, “cold two”, “cold”; word list: cold, common, two; T04: Disease or Syndrome • C0009264: cold temperature: strings “cold temperature”, “cold one”, “cold”; word list: cold, one, temperature; T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings “chronic obstructive airway disorder”, “chronic obstructive pulmonary disorder”, “cold three”, “cold”; word list: airway, chronic, cold, disorder, obstructive, pulmonary, three; T04: Disease or Syndrome
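Putting the redundancy rule together, a candidate pair needs both mutual string subsumption and mutually inclusive semantic types. The sketch below rests on several assumptions: `PARENT` is a small illustrative excerpt of a type tree, types count as inclusive when they are equal or in an ancestor/descendant line, a single inclusive pair of types is treated as sufficient, and the two "skate" concepts are hypothetical.

```python
# Illustrative excerpt of a semantic type tree (child -> parent); an assumption,
# not the full Semantic Network.
PARENT = {
    "Organism": "Physical Object",
    "Manufactured Object": "Physical Object",
    "Animal": "Organism",
    "Vertebrate": "Animal",
    "Fish": "Vertebrate",
}


def ancestors(semantic_type):
    """Walk up the tree and collect every ancestor of a type."""
    found = set()
    while semantic_type in PARENT:
        semantic_type = PARENT[semantic_type]
        found.add(semantic_type)
    return found


def inclusive(x, y):
    """Assumed reading: equal types, or one is an ancestor of the other."""
    return x == y or x in ancestors(y) or y in ancestors(x)


def mutually_subsumptive(strings_a, strings_b):
    """Each concept's word list covers at least one string of the other."""
    words_a = {w for s in strings_a for w in s.split()}
    words_b = {w for s in strings_b for w in s.split()}
    return (any(set(s.split()) <= words_b for s in strings_a)
            and any(set(s.split()) <= words_a for s in strings_b))


def may_be_redundant(a, b):
    """Flag a pair for manual review; a flag is not a verdict."""
    return (mutually_subsumptive(a["strings"], b["strings"])
            and any(inclusive(x, y) for x in a["types"] for y in b["types"]))


common_cold = {"strings": ["common cold", "cold two", "cold"],
               "types": {"Disease or Syndrome"}}
copd = {"strings": ["chronic obstructive pulmonary disorder", "cold three", "cold"],
        "types": {"Disease or Syndrome"}}
# Hypothetical concepts sharing the string "skate" but with unrelated types.
skate_fish = {"strings": ["skate"], "types": {"Fish"}}
skate_boot = {"strings": ["skate", "ice skate"], "types": {"Manufactured Object"}}

print(may_be_redundant(common_cold, copd))
print(may_be_redundant(skate_fish, skate_boot))
```

The first pair is flagged even though the two concepts are clearly distinct, which illustrates why the rule says "may be redundant" and why manual review follows.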
Redundancy Examples • C0673603: NPS-R-467 (Organic Chemical) and C0673604: NPS R-467 (Organic Chemical) • C0673769: des-Arg(10)-(Leu(9))kallidin (Amino Acid, Peptide or Protein) and C0673771: kallidin, des-Arg(10)-(Leu(9))-) (Amino Acid, Peptide or Protein) • C0266133: Congenital diverticulum of esophagus (Congenital Abnormality) and C0555218: Congenital esophageal pouch (Congenital Abnormality)
Redundancy False Positives • Partial names as synonyms: C0687720: Central Diabetes Insipidus has “Diabetes Insipidus” as a synonym, so it is mutually subsumptive with C0011848: Diabetes Insipidus • Incorrect synonymy (MeSH translations): C0013005: Dolphins has synonyms “ORCA” (Span.) and “FALSA BALEIA ASSASSINA” (Port.), so it is mutually subsumptive with C0325138: Whale, False Killer, which has synonym “FALSA ORCA” (Span.)
Detecting Semantic Type Problems through Parent-Child Relations If Concept A is Parent of Concept B And Concept A has semantic type X And Concept B has semantic type Y And if X and Y are different And X is not an ancestor of Y (in Semantic Net) Then one (or both) semantic types are wrong Or the parent-child relation is wrong
Detecting Semantic Type Problems through Parent-Child Relations • Parent: Cartilaginous Fish (vertebrate) • Child: Skate (manufactured object): Wrong Type or Wrong Concept • Child: Shark (vertebrate): OK • Child: Stingray (animal): Nonspecific Semantic Type • Child: Dogfish (fish): OK
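The parent-child check and the four outcomes in the Cartilaginous Fish example can be sketched as a small classifier. `PARENT_TYPE` is an illustrative excerpt of a type tree and the function names are assumptions; the three-way result follows the labels on the slide rather than any published implementation.

```python
# Illustrative excerpt of a semantic type tree (child -> parent); an assumption,
# not the full Semantic Network.
PARENT_TYPE = {
    "Organism": "Physical Object",
    "Manufactured Object": "Physical Object",
    "Animal": "Organism",
    "Vertebrate": "Animal",
    "Fish": "Vertebrate",
}


def type_ancestors(semantic_type):
    """Collect every ancestor of a semantic type."""
    found = set()
    while semantic_type in PARENT_TYPE:
        semantic_type = PARENT_TYPE[semantic_type]
        found.add(semantic_type)
    return found


def check_relation(parent_type, child_type):
    """Classify one parent-child link by the types of its two concepts."""
    if child_type == parent_type or parent_type in type_ancestors(child_type):
        return "OK"
    if child_type in type_ancestors(parent_type):
        # The child is typed more generally than its own parent.
        return "Nonspecific Semantic Type"
    return "Wrong Type or Wrong Concept"


# Parent: Cartilaginous Fish, typed as Vertebrate on the slide.
children = {
    "Skate": "Manufactured Object",
    "Shark": "Vertebrate",
    "Stingray": "Animal",
    "Dogfish": "Fish",
}
results = {name: check_relation("Vertebrate", t) for name, t in children.items()}
print(results)
```

Only the "OK" cases satisfy the rule as stated; the other two outcomes both point at a semantic type or a parent-child link that needs review.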
Parent-Child Examples • C0013769: Elbow has type Body Location or Region, which is in the Conceptual Entity hierarchy • Is parent of: C0230353: Right elbow has type Body Part, Organ, or Organ Component, which is in the Physical Object hierarchy
Results: 1995 vs. 2001 • Possible ambiguity: 1,817 (0.82%) vs. 8,082 (1.01%) • Possible redundancy: 5,031 (2.26%) vs. 38,140 (4.78%) • Actually redundant: 3,274 (1.47%) vs. not done • Parent-Child problems: 544 (0.54%) vs. 2,868 (0.47%) • Number of concepts: 222,927 vs. 797,359 (3.6x) • Parent-Child relations: 100,586 vs. 607,043 (6.0x)
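The percentages on this slide can be reproduced from the raw counts. The sketch below assumes, consistent with the figures shown, that ambiguity and redundancy are divided by the number of concepts and parent-child problems by the number of parent-child relations; the dictionary keys are illustrative names.

```python
counts = {
    "1995": {"concepts": 222_927, "relations": 100_586,
             "ambiguity": 1_817, "redundancy": 5_031, "parent_child": 544},
    "2001": {"concepts": 797_359, "relations": 607_043,
             "ambiguity": 8_082, "redundancy": 38_140, "parent_child": 2_868},
}


def pct(part, whole):
    """Percentage rounded to two decimals, as on the slide."""
    return round(100 * part / whole, 2)


rates = {
    year: {
        "ambiguity": pct(c["ambiguity"], c["concepts"]),
        "redundancy": pct(c["redundancy"], c["concepts"]),
        "parent_child": pct(c["parent_child"], c["relations"]),
    }
    for year, c in counts.items()
}
growth = {
    key: round(counts["2001"][key] / counts["1995"][key], 1)
    for key in ("concepts", "relations")
}
print(rates)
print(growth)
```

Normalizing by the right denominator matters here: the raw number of parent-child problems grew about fivefold, yet the rate per relation fell slightly.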
Discussion: Ambiguity Detection • Small number (1.01%) is a good sign • Allows focusing manual review • Semantic type definitions need to be clarified • Semantic type assignment rules need to be clarified
Discussion: Redundancy Detection • Specificity is worse, without improved sensitivity • Normalized string index is part of the reason • “Incomplete” names are a bigger part of the reason • Manual review will be relatively inefficient • Incorrect mappings detected, especially foreign language
Discussion: Parent-Child Relations • Mostly detects errors in semantic type assignment • Strict hierarchy in Semantic Net causes problems
Conclusions • Specific “answers” not possible • Domain expertise needed for assessment of chemical names • Assessments are necessarily subjective • NLM gets to make the rules • NLM hasn’t finished making the rules • Methods provide focus for manual review • Methods highlight where clearer definitions are needed • The results show the UMLS is doing well at a difficult task
Acknowledgments • NLM: Bill Hole, Alexa McCray and Betsy Humphreys • Home: Rachel and Rebecca Cimino