1 / 1

Sentence Extraction Results

Automatically Generating Gene Summaries From Biomedical Literature Jing Jiang, Xu Ling, ChengXiang Zhai University of Illinois at Urbana-Champaign. Document Retrieval. Sentence Extraction. Document Clustering. IE. IR. Problem Definition. System Overview.

Download Presentation

Sentence Extraction Results

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatically Generating Gene Summaries From Biomedical Literature Jing Jiang, Xu Ling, ChengXiang Zhai University of Illinois at Urbana-Champaign Document Retrieval Sentence Extraction Document Clustering IE IR Problem Definition System Overview Document Similarity and Clustering • Motivation: manual annotation of genes can hardly catch up with the fast growth of biomedical literature • Goal: to automatically extract relevant statements from biomedical literature to summarize the already-discovered knowledge about a target gene • Methodology: two phases • IR phase: retrieve articles that contain relevant information about the gene of interest • IE phase: cluster articles into different information aspects, then extract the facts about the target gene at the sentence level • Goal: group similar documents together to help summarize the target gene in several aspects • Difficulties: • Documents on different genes touch different aspects • No clear boundary between different aspects • False positives may dominate due to gene name ambiguity • Partial solution: • Predefine a number of categories based on observation • Generate prior language models using FlyBase annotations • Use priors on PLSA to guide the clustering • Relevant document retrieval • Query expansion using gene synonym list • Stopword removal from synonym list to reduce false positives • Exact phrase matching for multi-token synonyms • Document clustering • PLSA w/ and w/o prior language models • K-means • Sentence extraction • Rank sentences by cosine similarity to the cluster centroid using TF-IDF vectors • Rank sentences based on probabilities of being generated from corresponding language model • Rank by other heuristics: title, location, etc. Document Clustering Results Sentence Extraction Results Conclusion • Document Clustering • Can help recognize false positives • K-means is more effective than PLSA for this problem • Sentence Extraction • Can represent the document cluster to some extent • Future Work • Automatically remove false positives using document clustering • Derive better scoring function for sentence extraction • Develop more systematic way to evaluate and tune the system • Used PubMed abstracts on Drosophila • Evaluated results on two genes: EAG and SS • Two humans with domain knowledge grouped documents into 6 categories (Std 1 and Std 2) • Another grouping based on false positives (Std 3) • Measured results by class entropy, cluster entropy, and their average • Extracted sentences from standard clusters of documents to isolate the evaluation of sentence extraction from document clustering • Combined top K sentences extracted by different methods into a single judgment pool, and judged by a human with domain knowledge • Measured results by precision at K sentences

More Related