A System for Finding Biological Entities that Satisfy Certain Conditions from Texts. Wei Zhou, Clement Yu (University of Illinois at Chicago); Weiyi Meng (SUNY at Binghamton)
Outline • Problem statement • Techniques and methods • Experimental results • Discussion and conclusion CIKM 2008 By Clement Yu from UIC
Problem statement Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.
An Example A sample relevant passage: In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects. A sample question: What [GENES] are involved in insect segmentation? Target: GENES Qualification concepts: 1) insect 2) segmentation [hb, ftz, and eve are targets found in the passage]
Techniques and methods • Identify concepts in queries and texts • Use of domain knowledge • Related concepts (query expansion) • Gene symbol disambiguation • Conceptual IR models
Identify concepts in queries and texts • In texts: a concept is matched when all of its component words appear within a window of a certain size. An example: “... Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who ...” [Query concept: colon cancer] • In queries: PubMed automatic term mapping
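The window test for concepts in texts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the function names, and the default window of 5 tokens are assumptions, since the slide does not state the window size used.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def concept_in_window(concept, text, window=5):
    """Return True if every word of `concept` occurs inside some run of
    `window` consecutive tokens of `text` (window size is an assumption)."""
    words = set(tokenize(concept))
    if not words:
        return False
    tokens = tokenize(text)
    for start in range(len(tokens)):
        # Compare the concept words against the tokens in this window.
        if words <= set(tokens[start:start + window]):
            return True
    return False

passage = ("Women who are postmenopausal and who have never used hormone "
           "replacement therapy have a higher risk of colon , but not rectal, "
           "cancer than do women who")
print(concept_in_window("colon cancer", passage, window=5))  # True
```

In the slide's example, "colon" and "cancer" are separated by three other tokens, so a window of 5 tokens matches the concept while a window of 4 does not.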
Use of domain knowledge • Gene/protein species control (rule-based): if a query asks for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant. Example: Query: What [GENES] are involved in axon guidance in C. elegans? An irrelevant passage because of a different species: “We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes.” [Ptp52F is not a relevant target because the passage is about Drosophila, not C. elegans]
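A rule of this kind can be sketched as below. The alias table, the function names, and the choice to keep passages that mention no species at all are assumptions made for illustration; the slide only states that genes/proteins of other species are treated as irrelevant.

```python
# Hypothetical alias table; a real system would draw on a species dictionary.
SPECIES_ALIASES = {
    "C. elegans": ["c. elegans", "c.elegans", "caenorhabditis elegans"],
    "Drosophila": ["drosophila", "fruit fly"],
}

def species_mentioned(text):
    """Return the set of known species whose aliases occur in the text."""
    lowered = text.lower()
    return {name for name, aliases in SPECIES_ALIASES.items()
            if any(alias in lowered for alias in aliases)}

def passes_species_control(query, passage):
    """Reject a passage only when it names species other than the queried one."""
    wanted = species_mentioned(query)
    found = species_mentioned(passage)
    if not wanted or not found:
        # Assumption: with no species on either side, the rule does not apply.
        return True
    return bool(wanted & found)

query = "What [GENES] are involved in axon guidance in C. elegans?"
passage = ("We describe DPTP52F, which is probably the last remaining RPTP "
           "encoded in the Drosophila genome.")
print(passes_species_control(query, passage))  # False
```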
Use of domain knowledge • Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types. An example: [Target types]: TUMOR TYPES [Dictionary]: UMLS Metathesaurus [Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma
Related concepts • Synonyms • Hyponyms (one-level only) • Hypernyms (one-level only) • Lexical variants • Related abbreviations
Related concepts: lexical variants • Type 1: Automatically generate lexical variants using manually created heuristics, e.g., PLA2 → PLA 2, PLAII, and PLA II. Note: PLA2 is phospholipase A2.
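The slide does not list the heuristics themselves. As a sketch, two plausible ones (inserting a space at the letter/digit boundary and swapping an Arabic digit for its Roman numeral) are enough to reproduce the PLA2 example; both rules and all names below are assumptions.

```python
import re

# Assumed mapping; only single digits with a common Roman form are covered.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}

def lexical_variants(symbol):
    """Generate spacing and Roman-numeral variants of a symbol such as 'PLA2'."""
    match = re.fullmatch(r"([A-Za-z]+)[\s-]?(\d)", symbol)
    if not match:
        return []  # the sketch only handles 'letters + one digit'
    stem, digit = match.groups()
    number_forms = [digit]
    if digit in ROMAN:
        number_forms.append(ROMAN[digit])
    variants = []
    for number in number_forms:
        for separator in ("", " "):
            candidate = stem + separator + number
            if candidate != symbol:
                variants.append(candidate)
    return variants

print(lexical_variants("PLA2"))  # ['PLA 2', 'PLAII', 'PLA II']
```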
Related concepts: lexical variants • Type 2: Retrieve additional lexical variants from a term database of MEDLINE, e.g., PLA2 → PL-A2. Note: PLA2 is phospholipase A2.
Related concepts: lexical variants • Type 3: Retrieve additional lexical variants by recognizing equivalent long forms of an abbreviation.
Related concepts: related abbreviations • Abbreviations whose definitions (or long forms) contain the query concept. For example, some related abbreviations for the concept “lung cancer” are: • SCLC (small cell lung cancer) • LCSS (lung cancer symptom scale) • NSCLC (non-small cell lung cancer)
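Given a table of abbreviations and their long forms, the selection can be sketched as a contiguous word match. The table and function name are hypothetical; how the authors built their abbreviation list is not described on the slide.

```python
def related_abbreviations(concept, abbreviation_defs):
    """Return abbreviations whose long form contains the concept's words
    as a contiguous sequence (hyphens are treated as spaces)."""
    needle = concept.lower().replace("-", " ").split()
    if not needle:
        return []
    related = []
    for abbreviation, long_form in abbreviation_defs.items():
        words = long_form.lower().replace("-", " ").split()
        for start in range(len(words) - len(needle) + 1):
            if words[start:start + len(needle)] == needle:
                related.append(abbreviation)
                break
    return sorted(related)

# Hypothetical table; the first three entries come from the slide.
definitions = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "COPD": "chronic obstructive pulmonary disease",
}
print(related_abbreviations("lung cancer", definitions))  # ['LCSS', 'NSCLC', 'SCLC']
```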
Gene symbol disambiguation • Three simple rules are defined to distinguish gene symbols from: • Abbreviations with non-gene meanings (Rules 1 and 2). Example: “Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM.” [NOD is a gene symbol, but it has a non-gene meaning here because it is defined as “non-obese diabetic”] • Common English words (Rule 3). Example: “The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study.” [“Kit” is a common English word, but it has a gene meaning here because of the adjacent word “gene”]
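The slide gives examples but not the exact wording of the three rules, so the following is only a rough sketch. It folds Rules 1 and 2 into a single check (a symbol defined in parentheses after a long form that is not a known gene name) and treats Rule 3 as a test for an adjacent cue word. The word lists, the initial-letter matching, and the `gene_long_forms` parameter are all assumptions.

```python
import re

COMMON_ENGLISH = {"kit", "cat", "not", "was"}  # hypothetical word list
GENE_CUES = {"gene", "genes"}                  # the slide names only "gene"

def long_form_before(symbol, passage):
    """Return the words before '(SYMBOL)' whose initials spell the symbol."""
    position = passage.find("(" + symbol + ")")
    if position < 0:
        return None
    preceding = re.findall(r"[A-Za-z0-9]+", passage[:position])
    candidate = preceding[-len(symbol):]
    initials = "".join(word[0] for word in candidate)
    if len(candidate) == len(symbol) and initials.lower() == symbol.lower():
        return " ".join(candidate).lower()
    return None

def is_gene_mention(symbol, passage, gene_long_forms=frozenset()):
    """Decide whether a dictionary gene symbol is used as a gene in the passage."""
    long_form = long_form_before(symbol, passage)
    if long_form is not None:
        # Sketch of Rules 1 and 2: a non-gene definition overrides the symbol.
        return long_form in gene_long_forms
    if symbol.lower() in COMMON_ENGLISH:
        # Sketch of Rule 3: require a cue word next to the symbol.
        tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", passage)]
        for index, token in enumerate(tokens):
            if token == symbol.lower():
                neighbours = tokens[max(index - 1, 0):index] + tokens[index + 1:index + 2]
                if GENE_CUES & set(neighbours):
                    return True
        return False
    return True

nod_passage = "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD)"
kit_passage = "The Kit gene, which codes for the KIT ligand (KITL) receptor"
print(is_gene_mention("NOD", nod_passage))  # False
print(is_gene_mention("Kit", kit_passage))  # True
```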
Conceptual IR Models • Model 1: differentiate target instances • Model 2: weight target instances equally
Conceptual IR Models – Model 1
Conceptual IR Models – Model 2
Experimental results • Data sets and evaluation metrics • Impact of different techniques and methods • Comparison with best reported results
Data sets and evaluation metrics • Query collection: 36 questions collected from biologists in 2007. • Document collection: 162,259 Highwire full-text documents in HTML format. • Performance metrics (mean average precision, MAP): Passage MAP, Aspect MAP, and Document MAP.
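For reference, the standard document-level computation of MAP can be sketched as below. Function names and the toy rankings are illustrative. Passage MAP and Aspect MAP use other units of relevance and are not reproduced here.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant and item not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(item)
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```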
Impact of different techniques and methods
Comparison with best reported results • Our results improve significantly on the best reported results in passage retrieval (22% for automatic and 16.7% for non-automatic).
Summary • Studied five different levels of related concepts for query expansion and examined their impact on retrieval effectiveness. • Achieved significant improvement over the best reported results. • Compared the retrieval effectiveness of two conceptual IR models. • Evaluated a simple method for gene symbol disambiguation.
Conclusions • 1. Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved the retrieval effectiveness.
Conclusions • 2. The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.
Future work • Improve the quality of target instances retrieved from different resources • Improve the gene symbol disambiguation method • Handle pronouns • Run more evaluations on other gold standards
Questions • Thanks