A System for Finding Biological Entities that Satisfy Certain Conditions from Texts

Wei Zhou and Clement Yu (University of Illinois at Chicago); Weiyi Meng (SUNY at Binghamton).


  1. A System for Finding Biological Entities that Satisfy Certain Conditions from Texts. Wei Zhou and Clement Yu (University of Illinois at Chicago); Weiyi Meng (SUNY at Binghamton)

  2. Outline • Problem statement • Techniques and methods • Experimental results • Discussion and conclusion CIKM 2008 By Clement Yu from UIC

  3. Problem statement Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.

  4. An Example • A sample relevant passage: “In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects.” • A sample question: What [GENES] are involved in insect segmentation? • Target: GENES • Qualification concepts: 1) insect 2) segmentation [hb, ftz, and eve are targets found in the passage]

  5. Techniques and methods • Identify concepts in queries and texts • Use of domain knowledge • Related concepts (query expansion) • Gene symbol disambiguation • Conceptual IR models

  6. Identify concepts in queries and texts • In texts: a concept is matched when all of its component words appear within a certain window size. An example: “... Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who ...” [Query concept: colon cancer] • In queries: PubMed automatic term mapping
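
The window rule for texts can be illustrated with a minimal sketch. The slide does not give the tokenizer or the window size the system uses, so the regex tokenizer, the function name `concept_in_text`, and the default `window=5` are all assumptions made for illustration.

```python
import re

def concept_in_text(text, concept, window=5):
    """Return True if every word of `concept` occurs inside some run of
    `window` consecutive tokens of `text` (order does not matter)."""
    # Assumed tokenizer: lowercase alphanumeric runs; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = set(re.findall(r"[a-z0-9]+", concept.lower()))
    if not words:
        return False
    for start in range(len(tokens)):
        # Slide a fixed-size window over the token stream.
        if words <= set(tokens[start:start + window]):
            return True
    return False

passage = ("Women who are postmenopausal and who have never used hormone "
           "replacement therapy have a higher risk of colon , but not rectal, "
           "cancer than do women who")
print(concept_in_text(passage, "colon cancer", window=5))
```

In the slide's example the words "colon" and "cancer" are separated by "but not rectal", so the match succeeds with a window of five tokens and fails with a window of four.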

  7. Use of domain knowledge • Gene/protein species control (rule-based): if a query asks for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant. Example: Query: What [GENES] are involved in axon guidance in C. elegans? An irrelevant passage because of a different species: “We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes”. [Ptp52F is not a relevant target because the passage is about Drosophila, not C. elegans]
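
A rough sketch of the species-control rule follows. The alias table, the substring matching, and the decision to keep passages that name no species at all are assumptions made for illustration; the slide only states that passages about a different species are treated as irrelevant.

```python
# Hypothetical alias table; a real system would draw on a species lexicon.
SPECIES_ALIASES = {
    "C. elegans": ("c. elegans", "c.elegans", "caenorhabditis elegans"),
    "Drosophila": ("drosophila", "d. melanogaster"),
}

def species_in(text):
    """Return the set of known species mentioned in `text`."""
    lowered = text.lower()
    return {name for name, aliases in SPECIES_ALIASES.items()
            if any(alias in lowered for alias in aliases)}

def passes_species_control(query, passage):
    """Reject a passage only when the query names a species and the
    passage mentions species but none of the requested ones."""
    wanted = species_in(query)
    found = species_in(passage)
    if not wanted or not found:
        return True  # assumption: nothing to compare, so keep the passage
    return bool(wanted & found)

query = "What genes are involved in axon guidance in C.elegans?"
passage = ("We describe DPTP52F, which is probably the last remaining RPTP "
           "encoded in the Drosophila genome.")
print(passes_species_control(query, passage))
```

With the slide's query and passage, the passage mentions only Drosophila while the query asks about C. elegans, so the check returns False.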

  8. Use of domain knowledge • Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types. An example: [Target types]: TUMOR TYPES [Dictionary]: UMLS Metathesaurus [Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma

  9. Related concepts • Synonyms • Hyponyms (one-level only) • Hypernyms (one-level only) • Lexical variants • Related abbreviations

  10. Related concepts: lexical variants • Type 1: Automatically generate lexical variants using manually created heuristics, e.g., PLA2 → PLA 2, PLAII, and PLA II. Note: PLA2: Phospholipase A2
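
The heuristics themselves are not listed on the slide, so the following is only one plausible sketch that reproduces the PLA2 example: split a symbol into a letter stem and a trailing digit, then vary the spacing and swap the digit for a Roman numeral. The function name, the regex, and the Roman-numeral table are assumptions.

```python
import re

# Assumed mapping; only small numerals are covered in this sketch.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}

def lexical_variants(symbol):
    """Generate spacing and Roman-numeral variants of a symbol that
    consists of letters followed by a single digit, e.g. 'PLA2'."""
    match = re.fullmatch(r"([A-Za-z]+)[ -]?(\d)", symbol)
    if not match:
        return []
    stem, digit = match.groups()
    forms = [digit]
    if digit in ROMAN:
        forms.append(ROMAN[digit])
    variants = []
    for form in forms:
        for separator in ("", " "):
            candidate = stem + separator + form
            if candidate != symbol:  # do not return the input itself
                variants.append(candidate)
    return variants

print(lexical_variants("PLA2"))  # ['PLA 2', 'PLAII', 'PLA II']
```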

  11. Related concepts: lexical variants • Type 2: Retrieve additional lexical variants from a term database of MEDLINE, e.g., PLA2 → PL-A2. Note: PLA2: Phospholipase A2

  12. Related concepts: lexical variants • Type 3: Retrieve additional lexical variants by recognizing equivalent long-forms of an abbreviation

  13. Related concepts: related abbreviations • Abbreviations whose definitions (or long-forms) contain the query concept. For example, some related abbreviations for the concept “lung cancer” are: • SCLC (small cell lung cancer) • LCSS (lung cancer symptom scale) • NSCLC (non-small cell lung cancer)
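
Selecting related abbreviations can be sketched as a lookup over an abbreviation dictionary. The dictionary below is a hypothetical stand-in built from the slide's examples plus one unrelated entry; the source of the system's real abbreviation list is not stated on this slide.

```python
import re

# Hypothetical abbreviation dictionary: abbreviation -> long-form.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "PLA2": "phospholipase A2",
}

def related_abbreviations(concept, abbreviations):
    """Return abbreviations whose long-form contains `concept` as a
    whole-word phrase, sorted alphabetically."""
    pattern = re.compile(r"\b" + re.escape(concept.lower()) + r"\b")
    return sorted(abbr for abbr, long_form in abbreviations.items()
                  if pattern.search(long_form.lower()))

print(related_abbreviations("lung cancer", ABBREVIATIONS))
# ['LCSS', 'NSCLC', 'SCLC']
```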

  14. Gene symbol disambiguation • Three simple rules are defined to distinguish gene symbols from: • Abbreviations with non-gene meanings (Rules 1 & 2). Example: “Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM.” [NOD is a gene symbol, but it has a non-gene meaning here because it has a non-gene definition “non-obese diabetic”] • Common English words (Rule 3). Example: “The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study.” [“Kit” is a common English word, but it has a gene meaning here because of the adjacent word “gene”]
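
The slide names the rules but does not spell them out, so the sketch below is a guess at their spirit rather than the authors' method: a symbol defined in parentheses by a long-form whose initials spell it out is treated as a non-gene abbreviation, and a common English word counts as a gene only when a cue word such as "gene" is adjacent. `GENE_CUES` and `COMMON_WORDS` are hypothetical word lists.

```python
import re

GENE_CUES = {"gene", "genes", "protein", "proteins"}  # assumed cue words
COMMON_WORDS = {"kit", "not", "was", "for"}           # assumed stop list

def has_non_gene_definition(symbol, text):
    """True if '(SYMBOL)' follows a long-form whose word initials spell
    the symbol and which contains no gene cue word."""
    for match in re.finditer(r"\(" + re.escape(symbol) + r"\)", text):
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(symbol):]
        initials = "".join(word[0] for word in candidate).lower()
        if (initials == symbol.lower()
                and not GENE_CUES & {word.lower() for word in candidate}):
            return True
    return False

def is_gene_mention(symbol, text):
    """Decide whether `symbol` is used with its gene meaning in `text`."""
    if has_non_gene_definition(symbol, text):
        return False
    if symbol.lower() in COMMON_WORDS:
        tokens = re.findall(r"[A-Za-z0-9]+", text)
        for i, token in enumerate(tokens):
            if token == symbol:
                neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
                if GENE_CUES & {n.lower() for n in neighbours}:
                    return True
        return False
    return True

nod_text = "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154"
kit_text = "The Kit gene, which codes for the KIT ligand (KITL) receptor"
print(is_gene_mention("NOD", nod_text), is_gene_mention("Kit", kit_text))
```

On the slide's two examples this sketch rejects NOD, because "non-obese diabetic" defines it, and accepts Kit, because the next word is "gene".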

  15. Conceptual IR Models • Model 1 • Differentiate target instances • Model 2 • Equally weight target instances

  16. Conceptual IR Models – Model 1

  17. Conceptual IR Models – Model 2

  18. Experimental results • Data sets and evaluation metrics • Impact of different techniques and methods • Comparison with best reported results

  19. Data sets and evaluation metrics • Query collection: 36 questions collected from biologists in 2007. • Document collection: 162,259 Highwire full-text documents in HTML format. • Performance metrics: • Passage MAP • Aspect MAP • Document MAP
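
For orientation, document-level mean average precision (MAP) in its standard form can be sketched as below. The passage-level and aspect-level variants used in the evaluation are defined differently and are not reproduced here; the function names and the toy rankings are illustrative assumptions, not data from the experiments.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant
    document is retrieved, divided by the number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs,
    one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy rankings for two queries (made-up document identifiers).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

The first toy query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.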

  20. Impact of different techniques and methods

  21. Impact of different techniques and methods

  22. Comparison with best reported results The improvement of our result over the best reported results is significant (22% for automatic and 16.7% for non-automatic in passage retrieval).

  23. Summary • Studied five different levels of related concepts for query expansion and examined their impact on retrieval effectiveness. • Achieved significant improvement over the best reported results. • Compared the retrieval effectiveness of two conceptual IR models. • Evaluated a simple method for gene symbol disambiguation.

  24. Conclusions • 1. Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved the retrieval effectiveness.

  25. Conclusions • 2. The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.

  26. Future work • Improve the quality of target instances retrieved from different resources • Improve the gene symbol disambiguation method • Handle pronouns • Run more evaluations on other gold standards

  27. Questions • Thanks
