Annotating Gene List From Literature Xin He Department of Computer Science UIUC
Motivation • Biologists often need to understand the commonalities of a list of genes (e.g. whether they are involved in the same pathway). • These genes typically come from clustering results on microarray expression data. • Given a list of gene names, is there an automatic way to find their common themes in the literature?
Related Work • The most popular approach analyzes the GO terms associated with the genes. • Method: each gene is associated with a set of GO terms; find the GO terms that are overrepresented in the input list. • Hypergeometric test: the p-value of a GO term is $p = \sum_{i=k}^{\min(n, M)} \binom{M}{i} \binom{N-M}{n-i} / \binom{N}{n}$, where N: total number of genes; M: total number of genes annotated with this term; n: number of genes in the list; k: number of genes in the list annotated with this term
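The hypergeometric test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool the talk refers to: the function name `hypergeom_pvalue` is hypothetical, and the p-value is taken to be the upper tail, i.e. the probability of seeing at least k annotated genes in a list of n drawn without replacement.

```python
from math import comb

def hypergeom_pvalue(N: int, M: int, n: int, k: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total number of genes
    M: genes annotated with the GO term
    n: genes in the input list
    k: genes in the input list annotated with the term
    """
    upper = min(n, M)           # cannot observe more hits than list size or annotated genes
    total = comb(N, n)          # number of ways to choose the gene list
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total
```

A small p-value for a term suggests that the term is overrepresented in the list; in practice the values would also need a multiple-testing correction, since many GO terms are tested at once.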
Problems with GO-based Approach • GO cannot cover all the important concepts in the literature. E.g. GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology) • The associations between genes and concepts change very rapidly. E.g. new functions of known genes are constantly being found.
Text-based Gene List Annotation • Hypothesis testing approach: • find terms that are overrepresented for each gene: Poisson distribution • find common terms across the gene list: hypergeometric distribution • Comparative text mining approach: find the common themes in multiple collections (one for each gene)
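The first step of the hypothesis testing approach (finding terms overrepresented for a single gene) might look like the sketch below. The slide only names the Poisson distribution, so everything else here is an assumption: the expected count of a term is taken to be the size of the gene's article collection times the term's relative frequency in a background corpus, and the names `poisson_tail`, `overrepresented_terms`, and the threshold `alpha` are hypothetical.

```python
from math import exp

def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = exp(-lam)            # P(X = 0)
    cdf = term
    for i in range(1, k):       # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)  # clamp tiny negative rounding errors

def overrepresented_terms(gene_counts, gene_total, background_freq, alpha=0.01):
    """Return {term: p_value} for terms seen more often than the background predicts.

    gene_counts: term -> count in the articles about one gene
    gene_total: total number of tokens in those articles
    background_freq: term -> relative frequency in a background corpus (assumed input)
    """
    significant = {}
    for term, count in gene_counts.items():
        if term not in background_freq:
            continue            # assumption: skip terms with no background estimate
        expected = gene_total * background_freq[term]
        p = poisson_tail(count, expected)
        if p < alpha:
            significant[term] = p
    return significant
```

The second step on the slide, finding terms common across the gene list, would then apply a hypergeometric test to the number of genes for which each term is significant.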
Comparative Text Mining • For each gene, find a collection of articles that discuss this gene • Each article in a collection is a mixture of two distributions: a theme common to all collections; and a collection-specific theme • Parameter estimation in the mixture model: the standard EM algorithm
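A simplified version of this mixture model can be sketched as follows. It is an illustration under stated assumptions, not the model used in the talk: each collection is pooled into a single bag of words rather than modeled article by article, the mixing weight `lam` of the common theme is fixed instead of estimated, and the names `fit_common_theme` and `normalize` are hypothetical.

```python
def normalize(weights):
    """Scale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(weights.values())
    if total == 0:
        return {w: 1.0 / len(weights) for w in weights}
    return {w: v / total for w, v in weights.items()}

def fit_common_theme(collections, lam=0.5, iters=50):
    """EM for a two-component mixture: one common theme, one theme per collection.

    collections: name -> {word: count}, one entry per gene
    lam: assumed fixed probability that a word is drawn from the common theme
    Returns (common, specific) where specific maps name -> word distribution.
    """
    vocab = sorted({w for counts in collections.values() for w in counts})
    common = {w: 1.0 / len(vocab) for w in vocab}
    specific = {name: dict(common) for name in collections}
    for _ in range(iters):
        common_acc = dict.fromkeys(vocab, 0.0)
        new_specific = {}
        for name, counts in collections.items():
            spec_acc = dict.fromkeys(vocab, 0.0)
            for w, c in counts.items():
                # E-step: posterior that word w in this collection came from the common theme
                pb = lam * common[w]
                ps = (1.0 - lam) * specific[name][w]
                z = pb / (pb + ps) if pb + ps > 0 else lam
                common_acc[w] += c * z
                spec_acc[w] += c * (1.0 - z)
            # M-step for the collection-specific theme
            new_specific[name] = normalize(spec_acc)
        # M-step for the common theme, pooled over all collections
        common = normalize(common_acc)
        specific = new_specific
    return common, specific
```

Words that occur in every collection accumulate probability in `common`, while words confined to one gene's articles are pushed into that gene's specific theme; sorting `common` by probability gives a top-word list of the kind reported in the results slides.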
Results: Pelle System • Pelle system in Drosophila: Spätzle, Toll, Pelle, Tube, Cactus, Dorsal • Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic
Results: MET cluster • MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1 • Among the top-50 words: amino, met25, sulphite
Problems and Plan • Many common words (such as stop words) appear in the top list: not properly normalized • Using the entire Medline corpus as background: did not work • Alternative: the hypothesis testing approach • Single words are not very suggestive • Plan: phrase extraction as a postprocessing step