10 likes | 97 Views
Automatically Generating Gene Summaries From Biomedical Literature Jing Jiang, Xu Ling, ChengXiang Zhai University of Illinois at Urbana-Champaign. Document Retrieval. Sentence Extraction. Document Clustering. IE. IR. Problem Definition. System Overview.
E N D
Automatically Generating Gene Summaries From Biomedical Literature Jing Jiang, Xu Ling, ChengXiang Zhai University of Illinois at Urbana-Champaign Document Retrieval Sentence Extraction Document Clustering IE IR Problem Definition System Overview Document Similarity and Clustering • Motivation: manual annotation of genes can hardly catch up with the fast growth of biomedical literature • Goal: to automatically extract relevant statements from biomedical literature to summarize the already-discovered knowledge about a target gene • Methodology: two phases • IR phase: retrieve articles that contain relevant information about the gene of interest • IE phase: cluster articles into different information aspects, then extract the facts about the target gene at the sentence level • Goal: group similar documents together to help summarize the target gene in several aspects • Difficulties: • Documents on different genes touch different aspects • No clear boundary between different aspects • False positives may dominate due to gene name ambiguity • Partial solution: • Predefine a number of categories based on observation • Generate prior language models using FlyBase annotations • Use priors on PLSA to guide the clustering • Relevant document retrieval • Query expansion using gene synonym list • Stopword removal from synonym list to reduce false positives • Exact phrase matching for multi-token synonyms • Document clustering • PLSA w/ and w/o prior language models • K-means • Sentence extraction • Rank sentences by cosine similarity to the cluster centroid using TF-IDF vectors • Rank sentences based on probabilities of being generated from corresponding language model • Rank by other heuristics: title, location, etc. Document Clustering Results Sentence Extraction Results Conclusion • Document Clustering • Can help recognize false positives • K-means is more effective than PLSA for this problem • Sentence Extraction • Can represent the document cluster to some extent • Future Work • Automatically remove false positives using document clustering • Derive better scoring function for sentence extraction • Develop more systematic way to evaluate and tune the system • Used PubMed abstracts on Drosophila • Evaluated results on two genes: EAG and SS • Two humans with domain knowledge grouped documents into 6 categories (Std 1 and Std 2) • Another grouping based on false positives (Std 3) • Measured results by class entropy, cluster entropy, and their average • Extracted sentences from standard clusters of documents to isolate the evaluation of sentence extraction from document clustering • Combined top K sentences extracted by different methods into a single judgment pool, and judged by a human with domain knowledge • Measured results by precision at K sentences