Free-text Medical Document Retrieval via Phrase-based Vector Space Model Wenlei Mao, MS and Wesley W. Chu, PhD wenlei@cs.ucla.edu and wwc@cs.ucla.edu Computer Science Department University of California, Los Angeles
Outline • Vector space model (VSM) in document retrieval • Stem-based VSM • Concept-based VSM • Conceptual similarity • Phrase-based VSM • Retrieval effectiveness comparison • Conclusion AMIA 2002
Document Retrieval • Find free-text documents to answer queries like: • “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”
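Vector Space Model (VSM) • Words as terms • [Figure: query vector q and document vector d plotted against the term axes “Hyperthermia” and “Leukocytosis”, with the angle θq,d between them]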
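As a rough illustration of the vector space idea on this slide, the sketch below scores a query against a document by the cosine of the angle between their term vectors. Cosine similarity over raw term counts is a common choice and is assumed here; the slides do not state the exact term weighting used, and the function name is hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two bag-of-terms vectors (raw counts)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    # Dot product over the query's terms; Counter returns 0 for absent terms.
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)

query = ["hyperthermia", "leukocytosis"]
doc = ["hyperthermia"]
score = cosine_similarity(query, doc)
```

A document that shares no terms with the query scores 0, which is the behavior the later slides on synonyms and related concepts set out to improve.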
Stem-based VSM • Morphological variants bear similar content • E.g., “edema” and “edemas” • Use stemmer to extract stems • Lovins stemmer and Porter stemmer • Baseline of comparison • Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… • Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”…
Shortcomings of Stem-based VSM • Inability to capture multi-word concepts • “Increased intracranial pressure” • Inability to utilize the relations between concepts: • Synonyms: “hyperthermia” and “fever” • IS-A relation: “hyperthermia” and “body temperature elevation”
Concept-based VSM • Uses concepts in knowledge base (KB) as terms • KB: Metathesaurus in UMLS • Captures multi-word concepts • Captures synonyms • Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… • CUIs: (C0015967), (C0023518), (C0151740)…
Shortcomings of Concept-based VSM • Concepts may be related: • E.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts • Need to quantify conceptual relations • Knowledge bases are often incomplete, which reduces retrieval effectiveness
Conceptual Similarity Evaluation • [Figure: hypernym hierarchy with nodes c1 “Disease”, c2 “Animal disease”, c3 “Body temperature elevation”, c4 “Hyperthermia”] • Node distance: d(c3,c4) = 1 • Descendant count: D(c3) = 2, D(c4) = 0
Deriving Conceptual Similarity From Hypernym Hierarchy • [Figure: hypernym hierarchy with nodes c1 “Disease”, c2 “Animal disease”, c3 “Body temperature elevation”, c4 “Hyperthermia”]
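The two quantities named on the conceptual similarity slides, node distance d and descendant count D, can be computed on a toy hierarchy as sketched below. The shape of the tree is an assumption (the figure did not survive in this transcript), and "c5" is a hypothetical extra child added only so that D(c3) = 2 matches the slide. The formula that turns d and D into the similarity s(ci,cj) is not preserved here, so the sketch stops at the two inputs.

```python
from collections import deque

# Toy hypernym hierarchy as parent -> children. The shape is assumed;
# "c5" is a hypothetical node so that D(c3) = 2 as stated on the slide.
CHILDREN = {
    "c1": ["c2", "c3"],  # Disease
    "c2": [],            # Animal disease
    "c3": ["c4", "c5"],  # Body temperature elevation
    "c4": [],            # Hyperthermia
    "c5": [],            # hypothetical sibling of c4
}

def descendant_count(node):
    """Number of concepts strictly below `node` in the hierarchy."""
    return sum(1 + descendant_count(child) for child in CHILDREN[node])

def node_distance(a, b):
    """Shortest path length between a and b, ignoring edge direction."""
    neighbors = {n: set(kids) for n, kids in CHILDREN.items()}
    for parent, kids in CHILDREN.items():
        for kid in kids:
            neighbors[kid].add(parent)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected
```

With this assumed tree, `node_distance("c3", "c4")` and `descendant_count("c3")` reproduce the values shown on the slide.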
Shortcomings of Concept-based VSM • Concepts may be related: • The conceptual similarity measure, s(ci,cj), quantifies relations between concepts. • Knowledge bases are often incomplete, which reduces retrieval effectiveness.
Incompleteness of the Knowledge Bases • Missing concepts in KB, e.g., “Infiltrative small bowel process”: (), (C0021852), () • Missing links between related concepts, e.g., (cerebral edema) and (cerebral lesion) • In general, concept-based VSM cannot outperform stem-based VSM
Phrase-based Indexing Examples • Query: “Hyperthermia, leukocytosis, increased intracranial pressure…” • Phrases: [(C0015967); “hypertherm”] [(C0023518); “leukocytos”] [(C0151740); “increas”, “intracran”, “pressur”]… • “Infiltrative small bowel process”: [(); “infiltr”] [(C0021852); “smal”, “bowel”] [(); “proces”] • Query: “Cerebral edema”: [(C0699725); “cerebr”, “edem”] • Document: “Cerebral lesion”: [(C0221505); “cerebr”, “lesion”]
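One way to picture the phrase terms in these examples is as a small record holding an optional concept ID plus the phrase's word stems. The sketch below is only an illustration; the class and function names are hypothetical, while the CUIs and stems are copied from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Phrase:
    """One indexing term: an optional concept ID plus the phrase's stems."""
    cui: Optional[str]       # None when the KB has no matching concept
    stems: Tuple[str, ...]

# Phrases copied from the slide's examples.
cerebral_edema = Phrase("C0699725", ("cerebr", "edem"))
cerebral_lesion = Phrase("C0221505", ("cerebr", "lesion"))
infiltrative = Phrase(None, ("infiltr",))  # word with no concept in the KB

def shared_stems(p, q):
    """Stems that two phrases have in common."""
    return set(p.stems) & set(q.stems)
```

Even though the two cerebral phrases map to different concepts, `shared_stems(cerebral_edema, cerebral_lesion)` is non-empty, which is how stems can still connect a query and a document when the KB lacks a link between their concepts.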
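Evaluate Phrase-based Document Similarity • [Equation: phrase-based document similarity; not preserved in this transcript] • One contribution is due to the conceptual similarity s(ci,cj) between concepts in pq and pd • The other contribution is due to the stem overlap in pq and pd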
To Compare Retrieval Effectiveness • The test set: OHSUMED • 106 queries, 14K documents • Expert relevance judgment: R or N • Retrieval effectiveness: • Recall – the percentage of relevant documents retrieved so far • Precision – the percentage of retrieved documents that are relevant
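The recall and precision definitions above can be made concrete with a short sketch that evaluates the top k documents of a ranked result list. The function name, the document IDs, and the cutoff k are illustrative assumptions; values are returned as fractions rather than percentages.

```python
def precision_recall(ranked_doc_ids, relevant_ids, k):
    """Precision and recall after the top-k ranked documents are retrieved."""
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents retrieved so far.
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = ["d3", "d1", "d7", "d2"]   # made-up ranking
relevant = {"d1", "d2", "d9"}       # made-up expert judgments marked R
p_at_2, r_at_2 = precision_recall(ranked, relevant, k=2)
```

Sweeping k from 1 to the length of the ranking gives the points of a recall-precision curve for one query.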
Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS) • [Figure: retrieval effectiveness comparison chart] • 16% (100 queries) vs. 5% (50 queries)
Stem and Concept Similarity Contribution Weights • fc: similarity contribution weight for concepts • fs: similarity contribution weight for stems
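The formula that combines the stem and concept contributions is not preserved in this transcript. Purely as an illustration of how two contribution weights (written f_s for stems and f_c for concepts below) could act, the sketch uses a weighted sum with a Jaccard stem overlap. Both the combination rule and the overlap measure are assumptions, not the authors' formula, and all names are hypothetical.

```python
def stem_overlap(stems_q, stems_d):
    """Jaccard overlap of two stem collections (an assumed overlap measure)."""
    a, b = set(stems_q), set(stems_d)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def phrase_similarity(stems_q, stems_d, concept_sim, f_s, f_c):
    """Hypothetical weighted sum of the stem and concept contributions.

    concept_sim stands in for s(ci,cj); pass 0.0 when the KB has no link.
    """
    return f_s * stem_overlap(stems_q, stems_d) + f_c * concept_sim

# "Cerebral edema" vs. "cerebral lesion": no KB link, one shared stem.
score = phrase_similarity(("cerebr", "edem"), ("cerebr", "lesion"),
                          concept_sim=0.0, f_s=1.0, f_c=1.0)
```

Under this assumed rule, setting f_c to 0 reduces the score to a stem-only comparison and setting f_s to 0 reduces it to a concept-only comparison.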
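Sensitivity of Retrieval Effectiveness to fs and fc • [Figure: retrieval effectiveness as a function of the stem and concept weights, with axes “Stems” and “Concepts” and a marked “Optimal region”]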
Computational Complexity Using Phrase-based VSM • Data reorganization: • Build separate indexes on stems and concepts • For each concept ci, keep a list of its related concepts cj together with the conceptual similarity s(ci,cj) • Time complexities of document similarity calculation are of the same order of magnitude • Stem-based VSM: [expression not preserved in this transcript] • Phrase-based VSM: [expression not preserved in this transcript]
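The data reorganization described above can be sketched as two inverted indexes plus a related-concept table. This is a minimal illustration under assumptions: the document layout, the function names, the placeholder identifier "CUI_BODY_TEMP_ELEVATION", and the similarity value 0.5 are all made up for the example and do not come from the slides.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs maps doc_id -> list of (cui_or_None, stems) phrases."""
    stem_index = defaultdict(set)      # stem -> ids of documents containing it
    concept_index = defaultdict(set)   # CUI  -> ids of documents containing it
    for doc_id, phrases in docs.items():
        for cui, stems in phrases:
            if cui is not None:
                concept_index[cui].add(doc_id)
            for stem in stems:
                stem_index[stem].add(doc_id)
    return stem_index, concept_index

# For each concept ci, its related concepts cj with s(ci,cj).
# The identifier and the value 0.5 are hypothetical placeholders.
RELATED = {"C0015967": [("CUI_BODY_TEMP_ELEVATION", 0.5)]}

def candidate_docs(cui, stems, stem_index, concept_index):
    """Documents sharing the concept, a related concept, or any stem."""
    found = set(concept_index.get(cui, set()))
    for related_cui, _similarity in RELATED.get(cui, []):
        found |= concept_index.get(related_cui, set())
    for stem in stems:
        found |= stem_index.get(stem, set())
    return found
```

Because the related concepts are looked up from a precomputed list, a query phrase touches only a few extra posting sets beyond those for its own stems and concept.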
Conclusion • A new document indexing paradigm based on phrases is proposed • Use phrases (concept and its word stems) as terms • Document similarity is derived from both the stem and the concept contributions • Conceptual similarity quantifies the concept relations and improves retrieval effectiveness • Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts) • Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM
Acknowledgement • This research is supported in part by NIC/NIH Grant #4442511-33780