
Annotating Gene List From Literature

Annotating Gene List From Literature. Xin He Department of Computer Science UIUC. Motivation. Biologists often need to understand the commonalities of a list of genes (e.g. whether they are involved in the same pathway).


Presentation Transcript


  1. Annotating Gene List From Literature • Xin He, Department of Computer Science, UIUC

  2. Motivation • Biologists often need to understand what the genes in a list have in common (e.g. whether they are involved in the same pathway). • These gene lists typically come from clustering microarray expression data. • Given a list of gene names, is there an automatic way to find the common themes in the literature?

  3. Related Work • The most popular approach is based on analysis of the GO terms associated with genes. • Method: each gene is associated with a set of GO terms; find the GO terms that are overrepresented in the input list. • Hypergeometric test: the p-value of a GO term is computed from N, the total number of genes; M, the total number of genes annotated with this term; n, the number of genes in the list; and k, the number of genes in the list annotated with this term.
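
The hypergeometric test on this slide can be sketched in a few lines. The slide's formula is not preserved in the transcript, so this assumes the usual upper-tail form of the overrepresentation test, P(X >= k) with X ~ Hypergeometric(N, M, n). The function name `hypergeom_pvalue` is hypothetical; N, M, n and k are the symbols defined on the slide.

```python
from math import comb

def hypergeom_pvalue(N, M, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total number of genes
    M: genes annotated with the term
    n: genes in the input list
    k: genes in the input list annotated with the term
    Assumption: the upper-tail form; the slide's own formula is not in the transcript.
    """
    if not (0 <= M <= N and 0 <= n <= N):
        raise ValueError("need 0 <= M <= N and 0 <= n <= N")
    upper = min(n, M)          # the list cannot contain more annotated genes than this
    total = comb(N, n)         # number of ways to draw a list of n genes
    tail = sum(
        comb(M, i) * comb(N - M, n - i)   # lists with exactly i annotated genes
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total
```

With made-up numbers, `hypergeom_pvalue(6000, 40, 6, 3)` asks how surprising it is that 3 of 6 listed genes carry a term that annotates only 40 of 6000 genes; a small value marks the term as overrepresented.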

  4. Problems with GO-based Approach • GO cannot cover all the important concepts in the literature. E.g. GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology). • The associations between genes and concepts change very rapidly. E.g. new functions of known genes are constantly being found.

  5. Text-based Gene List Annotation • Hypothesis testing approach: • find terms that are overrepresented for each gene: Poisson distribution • find common terms across the gene list: hypergeometric distribution • Comparative text mining approach: find the common themes in multiple collections (one for each gene)
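
The first step of the hypothesis testing approach (per-gene term overrepresentation under a Poisson distribution) might look like the sketch below. The slide does not say how the Poisson rate is set, so this assumes the expected count of a term is its background rate per token times the number of tokens in the gene's articles. The names `poisson_upper_tail` and `overrepresented_terms` and the `alpha` threshold are hypothetical.

```python
from math import exp

def poisson_upper_tail(count, lam):
    """P(X >= count) for X ~ Poisson(lam)."""
    if count <= 0:
        return 1.0
    term = exp(-lam)           # P(X = 0)
    cdf = term
    for i in range(1, count):  # accumulate P(X = 1) .. P(X = count - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def overrepresented_terms(term_counts, total_tokens, background_rate, alpha=0.01):
    """Return {term: p-value} for terms seen more often than the background predicts.

    term_counts: term -> count in the articles about one gene
    total_tokens: number of tokens in those articles
    background_rate: term -> occurrences per token in a background corpus (assumed input)
    """
    hits = {}
    for term, count in term_counts.items():
        lam = background_rate.get(term, 0.0) * total_tokens
        if lam <= 0:
            continue           # no background estimate for this term; skip it
        p = poisson_upper_tail(count, lam)
        if p < alpha:
            hits[term] = p
    return hits
```

Each gene then has a set of overrepresented terms, which plays the role GO annotations play in the GO-based method, so the second step can apply a hypergeometric test to find terms shared across the gene list.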

  6. Comparative Text Mining • For each gene, find a collection of articles that discuss this gene • Each article in a collection is a mixture of two distributions: a theme common to all collections; and a collection-specific theme • Parameter estimation in the mixture model: the standard EM algorithm
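
A minimal sketch of EM for this kind of mixture model follows. It makes several simplifying assumptions that are not on the slide: the articles about each gene are pooled into one bag of word counts, the mixing weight `lam` is fixed rather than estimated, and no separate background model is used. The function name `fit_common_theme` is hypothetical.

```python
from collections import Counter, defaultdict

def fit_common_theme(collections, lam=0.5, iterations=50):
    """EM for a two-component mixture per collection.

    collections: list of dicts (word -> count), one per gene; each must be non-empty.
    lam: fixed probability that a word is drawn from the common theme (assumption).
    Returns (common, specific): the common word distribution and a list of
    collection-specific word distributions.
    """
    # Initialise the common theme from pooled counts and each specific theme
    # from its own collection's counts.
    pooled = Counter()
    for counts in collections:
        pooled.update(counts)
    total = sum(pooled.values())
    common = {w: c / total for w, c in pooled.items()}
    specific = []
    for counts in collections:
        size = sum(counts.values())
        specific.append({w: c / size for w, c in counts.items()})

    for _ in range(iterations):
        common_acc = defaultdict(float)
        new_specific = []
        for counts, spec in zip(collections, specific):
            spec_acc = {}
            for w, c in counts.items():
                # E-step: posterior that word w in this collection came from the common theme
                p_common = lam * common[w]
                p_spec = (1.0 - lam) * spec[w]
                denom = p_common + p_spec
                z = p_common / denom if denom > 0 else lam
                # M-step accumulators: split the count between the two themes
                common_acc[w] += c * z
                spec_acc[w] = c * (1.0 - z)
            norm = sum(spec_acc.values())
            new_specific.append({w: v / norm for w, v in spec_acc.items()})
        norm = sum(common_acc.values())
        common = {w: v / norm for w, v in common_acc.items()}
        specific = new_specific
    return common, specific
```

Sorting `common` by probability gives a ranked word list for the shared theme. In a toy input where one word appears in every collection and the others appear in only one, the shared word gains weight in the common theme at each iteration.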

  7. Results: Pelle System • Pelle system in Drosophila: Spätzle, Toll, Pelle, Tube, Cactus, Dorsal • Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic

  8. Results: MET cluster • MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1 • Among the top-50 words: amino, met25, sulphite

  9. Problems and Plan • Many common words (such as stop words) appear in the top list because the scores are not properly normalized • Using the entire Medline corpus as background does not work • Alternative: the hypothesis testing approach • Single words are not very suggestive • Plan: phrase extraction as a postprocessing step
