220 likes | 336 Views
A Statistical Approach to Literature-based Gene Group Annotation. Xin He 01/31/2007. Annotating a Gene List. Goal: understand the commonalities in a list of genes from: Clustering by expression patterns Differentially expressed genes Genes sharing cis-regulatory elements
E N D
A Statistical Approach to Literature-based Gene Group Annotation Xin He 01/31/2007
Annotating a Gene List • Goal: understand the commonalities in a list of genes from: • Clustering by expression patterns • Differentially expressed genes • Genes sharing cis-regulatory elements • How to automatically construct the annotations?
Gene Ontology-based Approach • Each gene is annotated by a set of GO terms • The importance of any term wrt the gene list is measured by the number of genes that are associated with this term • Need to correct for the uneven distribution of GO terms: a hypergeometric test
Limitations of GO-based Approach • GO annotations of all genes: may not be available • Rapid growth of literature: constantly add new functions to existing genes • Coverage is not even in all areas. E.g. ecology and behavior; medicine; anatomy and physiology; etc.
Literature-based Approach • Annotate via the analysis of text extracted from literature • Advantages: • Not dependent on manually created data • Easy to keep up with the recent discoveries • Broad coverage • Explorative tool: suggest new hypothesis
Literature-based Approach • Extract abstracts for each gene • Idea: If a word is overrepresented in the abstracts for the list, then it is likely to describe the common functions of the list • A simple measure of significance: Z-score = (observed count – expected count under background distribution) / standard deviation
Motivations for the Current Work • Drawbacks of the existing approach • Biased textual representation: genes that are well-studied will dominate the results • Motivations for a new approach: • Need to capture overrepresentation of words • Favor words that are common to all or most genes • A unified way to solve both problems?
Ideas for the Statistical Model • Observation: typically, some genes in the list are related to a given word, but the other genes are not (Few gene clusters are perfect!) • Assumption: the count of a term in a document follows Poisson distribution • Idea: the count of a term in one gene is either from a background distribution (if the gene is unrelated) or from a positive distribution (if the gene is related)
Poisson Mixture Model Notations: Each count is generated from a mixture of the background and signal Poisson distributions: The probability of observing the data is thus:
Parameter Estimation Maximum likelihood estimation of parameters: EM algorithm to maximize the likelihood function. The updating formula is given by: The posterior probability of missing label (zi) is:
Evaluating the Statistical Significance • The candidate terms: those with a large theta (an estimate of the proportion of related genes) • Need to assess the significance • E.g. a term from a distribution slightly different from the background. EM may estimate lambda to be this distribution and theta close to 1 • Idea: if the counts can be explained well by the background, then there is no need to use a mixture of two distributions. This word would be insignificant regardless of the estimated theta
Likelihood Ratio Test Hypothesis testing: Generalized Likelihood Ratio test Statistic (LRS): Reject H0 if T is greater than a certain threshold.
Asymptotic Distribution of LRS • It is well known that the distribution of LRS converges to chi-square, with degree of freedom equal to the difference between the number of free parameters of null and alternative hypothesis • Bonferroni correction: significance level (0.05) divided by the number of hypothesis tested (number of terms)
Implementation • Computational framework • Medline collections: yeast and fly • Indexing, retrieval, analysis by Indri 2.2 • Background distribution of terms • Organism-specific distribution • Phrase construction: bigrams only • NSP (Ngram statistical package) for bigram selection: chi-square measure • Procedure • Retrieval of articles about the given list of genes • Collection of counts of all words and significant bigrams • Statistical analysis by Poission mixture model • Presentation of results: (1) sorted by statistical significance; (2) remove concepts whose fractions (theta) are below a threshold
Web Interface • Web link: http://beespace.igb.uiuc.edu:8080/BeeSpace/Annot.jsp • User interface • Choose collection: MedYeast and MedFly • Input genes: textual names rather than identifiers
Experimental Validation • Methods compared • StatAnnot • vs. term overpresentation method: GEISHA • vs. GO enrichment analysis: FuncAssociate http://llama.med.harvard.edu/cgi/func/funcassociate • Evaluation • Concepts that are common to both methods • Concepts that are unique to the method compared • Concepts that are unique to StatAnnot • Genes that are found by StatAnnot
Comparison with other Literature-based Method • Gene List: Eisen K cluster (15 genes) • Mainly respiratory chain complex (13), one mitochondrial membrane pore (por1) • However, por1 matches many more articles than other genes, thus dominates the results, eg. voltage-dependent channel • Results: • GESHIA: voltage-dependent, outer member, pores, channel, succinate • StatAnnot: electron_transport, electron_transfer, respiratory, cytochrome_c1, mitochondrial_membrane, succinate_dehydrogenase
Comparison with GO-based Method • Gene list: Eisen B cluster (11 genes) • Results • cell_cycle, mitosis, s_phase, microtubule, cytokinesi, cytoskeletal_protein, spindle_pole, pole_body, mitotic_spindle, spindle_apparatus, spindle_checkpoint • contractile ring, actomyosin ring, cell cortex, microtubule nucleation, microtubule organizing center, M phase, bud site selection, phosphatidylinositol binding, septin ring • anaphase, anaphase_promote, ubiquitin, ligase, bud_neck, metaphase, proteolysis • cyclin_b, cdc3, cdc28, cdc12, cdc15, cdc2
Comparison with GO-based Method • Gene List: cAMPDown001 cluster • Originally 59 genes • 16 genes have fly orthologs that match at least one article • Results • heat_shock, molecular_chaperone, stress (response to stress) • response to temperature, unfolded protein binding, response to biotic stimulus • channel/ion_channel/channel_gate, kinase/threonine_kinase/serine_threonine, gtpase, ethanol, immunophilin, tetratricopeptide_repeat, geldanamycin, tacrolimu_bind, co_chaperone/cochaperon, steroid, quinone, cyclophilin • genes: hsp90, hsc70, hsp70, hsp40, cyp40, fkbp52, mok,
Conclusion from Experiments • Solved the biased textual representation problem of the earlier literature-based method • In general, the new method is able to cover a large proportion of terms from GO enrichment analysis • Supplement with additional biological concepts, including many related genes • May be particularly useful for studying aspects not focused in GO, such as medicine
Future Plan (I) • Phrases • N-gram, N > 2 • Syntactic phrase construction • Removal of redundant concepts in the results • Retrieval for genes • Gene name mapping to database entries • Synonym expansion • Gene name disambiguition • Entity recognition in the resulting concepts • Gene names, chemicals, organisms
Future Plan (II) • Sentence selection • Informative sentences summarizing the concepts (theme retrieval in Beespace) • Clustering of concepts • Ex. Distance measure of terms based on co-occurrence • A general tool to summarize a group of entities? For example: • Common aspects of a set of diseases • Genes commonly involved in multiple developmental stages