Free-text Medical Document Retrieval via Phrase-based Vector Space Model

  1. Free-text Medical Document Retrieval via Phrase-based Vector Space Model • Wenlei Mao, MS (wenlei@cs.ucla.edu) and Wesley W. Chu, PhD (wwc@cs.ucla.edu) • Computer Science Department, University of California, Los Angeles

  2. Outline • Vector space model (VSM) in document retrieval • Stem-based VSM • Concept-based VSM • Conceptual similarity • Phrase-based VSM • Retrieval effectiveness comparison • Conclusion AMIA 2002

  3. Document Retrieval • Find free-text documents to answer queries like: “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation. Cerebral edema secondary to infection, diagnosis and treatment.”

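  4. Vector Space Model (VSM) • Words as terms • [Figure: query vector q and document vector d drawn against the term axes “Hyperthermia” and “Leukocytosis”, with the angle between q and d marked]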
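
The vector space picture on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how terms are weighted, so the sketch uses raw term counts and the standard cosine of the angle between the query and document vectors. The function name `cosine_similarity` is made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two bag-of-terms vectors.

    Assumption: raw term counts are used as vector components; the slides
    do not specify a weighting scheme.
    """
    q = Counter(query_terms)
    d = Counter(doc_terms)
    # Dot product over the terms that occur in the query.
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)


query = ["hyperthermia", "leukocytosis"]
document = ["hyperthermia"]
score = cosine_similarity(query, document)
```

With words as terms, a document that mentions only “fever” would score zero against this query, which is the limitation the later slides address.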

  5. Stem-based VSM • Morphological variants bear similar content, e.g., “edema” and “edemas” • Use a stemmer to extract stems: Lovins stemmer and Porter stemmer • Baseline of comparison • Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… • Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”…

  6. Shortcomings of Stem-based VSM • Inability to capture multi-word concepts, e.g., “increased intracranial pressure” • Inability to utilize the relations between concepts: • Synonyms: “hyperthermia” and “fever” • IS-A relation: “hyperthermia” and “body temperature elevation”

  7. Concept-based VSM • Uses concepts in a knowledge base (KB) as terms • KB: Metathesaurus in UMLS • Captures multi-word concepts • Captures synonyms • Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… • CUIs: (C0015967), (C0023518), (C0151740)…
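
A minimal sketch of how a query might be turned into concept identifiers, assuming a greedy longest-match lookup against a tiny hand-built dictionary. The dictionary `TOY_KB` and the function `map_to_concepts` are hypothetical; the real UMLS Metathesaurus is far larger and the slides do not describe the matching procedure. The three CUIs are the ones listed on the slide, and “fever” is mapped to the same CUI as “hyperthermia” only to illustrate the synonym bullet.

```python
import re

# Hypothetical toy knowledge base: word sequence -> CUI.
# Assumption: only the CUIs come from the slide; the lookup table itself
# is an illustration, not the UMLS Metathesaurus.
TOY_KB = {
    ("hyperthermia",): "C0015967",
    ("fever",): "C0015967",  # synonym mapped to the same concept
    ("leukocytosis",): "C0023518",
    ("increased", "intracranial", "pressure"): "C0151740",
}


def map_to_concepts(text, kb=TOY_KB):
    """Greedy longest-match mapping of free text to concept identifiers."""
    words = re.findall(r"[a-z]+", text.lower())
    max_len = max(len(key) for key in kb)
    concepts = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in kb:
                concepts.append(kb[key])
                i += n
                break
        else:
            i += 1  # no concept starts at this word; skip it
    return concepts


cuis = map_to_concepts("Hyperthermia, leukocytosis, increased intracranial pressure")
```

Because both “hyperthermia” and “fever” resolve to one identifier, a query using one word can match a document using the other, and the three-word phrase is indexed as a single term.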

  8. Shortcomings of Concept-based VSM • Concepts may be related: • E.g., “hyperthermia” and “body temperature elevation” are not identical but related concepts • Need to quantify conceptual relations • Knowledge bases are often incomplete, which reduces retrieval effectiveness

  9. Conceptual Similarity Evaluation • Node distance: d(c3,c4)=1 • Descendant count: D(c3)=2, D(c4)=0 • [Figure: hypernym hierarchy with the labels “Disease”, “Animal disease”, c1, c2, “Body temperature elevation” (c3) and “Hyperthermia” (c4)]

  10. Deriving Conceptual Similarity From Hypernym Hierarchy • [Figure: the same hierarchy, with the labels “Disease”, “Animal disease”, c1, c2, “Body temperature elevation” (c3) and “Hyperthermia” (c4)]
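
The two quantities named on the conceptual similarity slides, node distance d and descendant count D, can be computed on a small tree as sketched below. The shape of the hierarchy is an assumption: the transcript does not preserve the figure's edges, so the tree here is hypothetical, and an extra unnamed node `c5` is added under c3 only so that D(c3)=2 as stated on the slide. The sketch does not compute the similarity s(ci,cj) itself, because its formula is not given in the transcript.

```python
from collections import deque

# Hypothetical hypernym hierarchy (parent -> children).
# Assumption: the edges are invented for illustration; only the facts
# d(c3,c4)=1, D(c3)=2 and D(c4)=0 come from the slide.
CHILDREN = {
    "c1": ["c2", "c3"],
    "c2": [],
    "c3": ["c4", "c5"],  # c5 is a made-up sibling of c4
    "c4": [],
    "c5": [],
}


def descendant_count(node):
    """Number of nodes strictly below `node` in the hierarchy."""
    return sum(1 + descendant_count(child) for child in CHILDREN[node])


def node_distance(a, b):
    """Number of edges on the shortest path between two nodes."""
    # Build an undirected adjacency view of the tree.
    neighbours = {n: set(ch) for n, ch in CHILDREN.items()}
    for parent, children in CHILDREN.items():
        for child in children:
            neighbours[child].add(parent)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the nodes are not connected
```

On this toy tree, `node_distance("c3", "c4")` and the two descendant counts reproduce the values quoted on slide 9.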

  11. Shortcomings of Concept-based VSM • Concepts may be related: • The conceptual similarity measure, s(ci,cj), quantifies relations between concepts. • Knowledge bases are often incomplete, which reduces retrieval effectiveness.

  12. Incompleteness of the Knowledge Bases • Missing concepts in KB, e.g., “Infiltrative small bowel process”: (), (C0021852), () • Missing links between related concepts, e.g., between (cerebral edema) and (cerebral lesion) • In general, concept-based VSM cannot outperform stem-based VSM

  13. Phrase-based Indexing Examples • Query: “Hyperthermia, leukocytosis, increased intracranial pressure…” Phrases: [(C0015967); “hypertherm”] [(C0023518); “leukocytos”] [(C0151740); “increas”, “intracran”, “pressur”]… • “Infiltrative small bowel process”: [(); “infiltr”] [(C0021852); “smal”, “bowel”] [(); “proces”] • Query: “Cerebral edema”: [(C0699725); “cerebr”, “edem”] • Document: “Cerebral lesion”: [(C0221505); “cerebr”, “lesion”]

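  14. Evaluate Phrase-based Document Similarity • [Equation not preserved in the transcript; its two annotated contributions are:] • A contribution due to the stem overlap in pq and pd • A contribution due to the conceptual similarity s(ci,cj) between concepts in pq and pd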

  15. To Compare Retrieval Effectiveness • The test set: OHSUMED • 106 queries, 14K documents • Expert relevance judgment: R or N • Retrieval effectiveness: • Recall – the percentage of relevant documents retrieved so far • Precision – the percentage of retrieved documents that are relevant
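
The recall and precision definitions on this slide can be made concrete with a short sketch. The function name `precision_recall` and the document identifiers are hypothetical; the example evaluates a ranked result list after the top k documents, which is one common reading of “retrieved so far”.

```python
def precision_recall(ranked_doc_ids, relevant_ids, k):
    """Precision and recall after the top-k documents of a ranked list.

    precision = relevant retrieved / retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}

p_at_2, r_at_2 = precision_recall(ranking, relevant, k=2)
p_at_4, r_at_4 = precision_recall(ranking, relevant, k=4)
```

Sweeping k from 1 to the length of the ranking yields the points of a precision-recall curve for that query.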

  16. Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS) • [Figure: chart not preserved; surviving annotation: “16%, 100 queries vs. 5%, 50 queries”]

  17. Stem and Concept Similarity Contribution Weights • fc: similarity contribution weight for concepts • fs: similarity contribution weight for stems

  18. Sensitivity of Retrieval Effectiveness to fs and fc • [Figure: plot not preserved; surviving labels: “Optimal region”, “Concepts”, “Stems”]

  19. Computational Complexity Using Phrase-based VSM • Data reorganization: • Build separate indexes on stems and concepts • For each concept ci, keep a list of related concepts cj and their conceptual similarities s(ci,cj) • Time complexities of the document similarity calculation are of the same order of magnitude • Stem-based VSM: [expression not preserved] • Phrase-based VSM: [expression not preserved]

  20. Conclusion • A new document indexing paradigm based on phrases is proposed • Use phrases (a concept and its word stems) as terms • Document similarity is derived from both the stem and the concept contributions • Conceptual similarity quantifies the concept relations and improves retrieval effectiveness • Stems remedy the incomplete coverage of the knowledge base (missing concepts and missing links between related concepts) • Experimental results reveal a significant retrieval effectiveness improvement of the phrase-based VSM over the stem-based VSM

  21. Acknowledgement • This research is supported in part by NIC/NIH Grant #4442511-33780
