Predicting Gene Functions from Text Using a Cross-Species Approach. Emilia Stoica and Marti Hearst, School of Information, University of California, Berkeley. Research Supported by NSF DBI-0317510 and a gift from Genentech.
Goal
• Annotate genes with functional information derived from journal articles.
Gene Ontology (GO)
• Gene Ontology (GO): a controlled vocabulary for functional annotation
• ~17,600 terms (circa July 2004)
• Organized into 3 distinct acyclic graphs:
  • molecular functions
  • biological processes
  • cellular locations
• More general terms are “parents” of more specific terms:
  • development (GO:0007275) is the parent of embryonic development (GO:0001756)
Challenges
• GO tokens might not appear explicitly
  Example: PubMed 10692450, GO:0008285: negative regulation of cell proliferation
  Occurs as: inhibition of cell proliferation
• GO tokens might not occur contiguously
  Example: PubMed 10734056, GO:0007186: G-protein coupled receptor protein signaling pathway
  Occurs as: Results indicate that CCR1-mediated responses are regulated … in the signaling pathway, by receptor phosphorylation at the level of receptor G/protein coupling … CCR1 binds MIP-1 alpha.
Challenges
• The simplest strategy (assigning GO codes to genes simply because the GO tokens occur near the gene) yields a large number of false positives.
• Issues:
  • The text does not contain evidence to support the annotation.
  • The text contains evidence for the annotation, but the curator knows the gene to be involved in a function that is more general or more specific than the GO code matched in text.
Challenges
• GO contains hints about what kinds of evidence are required for annotation, e.g.:
  • The text should mention co-purification or co-immunoprecipitation experiments
• Requiring these evidence terms does not seem to improve the algorithms.
Related Work
• Mainly in the context of the BioCreative competition (2004)
• Chiang and Yu 2003, 2004:
  • Find phrase patterns commonly used in sentences describing gene functions (e.g., “gene plays an important role in”, “gene is involved in”)
  • Final assignments made with a Naïve Bayes classifier
• Ray and Craven 2004, 2005:
  • Learn a statistical model for each GO code (which words are likely to co-occur in the paragraphs containing GO codes)
  • Decide among candidates via a multinomial Naïve Bayes classifier
• Rice et al. 2004:
  • Train an SVM for each GO code
  • Target genes are assigned the best-scoring GO code
Related Work, cont.
• Couto et al. 2004:
  • Determine if the “information content” of the matching GO terms is larger than for all the candidate GO terms
• Verspoor et al. 2004:
  • Expand GO tokens with words that frequently co-occur in a training set; use a categorizer that explores the structure of the Gene Ontology to find the best hits
• Ehrler and Ruch 2004:
  • Treat each document as a query to be categorized
  • Create a score based on a combination of pattern matching and TF*IDF weighting
  • Annotate the gene with the top-scoring GO codes
Our Approach
• Two main contributions:
  • Use cross-species information (CSM)
  • Check for biological (in)consistencies (CSC)
Cross-Species Match: Main Idea
• Use orthologous genes
  • [Genes of different species that have evolved directly from a common ancestor.]
• Assumption:
  • Since there is an overlap between the genomes of the two species, their orthologs may share some functions, and consequently some GO codes
• Idea: to predict GO codes for target genes in the target species, use the GO codes assigned to their orthologous genes
• We use Mouse vs. Human genes
General procedure
• Analyze text at the sentence level
• Eliminate stop words and punctuation characters, and divide the text into tokens using space as the delimiter
• Normalize and match different variations of gene names using the algorithm of Bhalotia et al. ’03
• For every sentence that contains the target gene:
  • A GO code is matched if the fraction of its tokens found in the sentence reaches a threshold (0.75 for CSM and 1 for CSC)
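A minimal sketch of the sentence-level token matching described above, under stated assumptions. The stop-word list, the regular expression used to strip punctuation, and the function names are illustrative and not taken from the slides. The threshold test is written as "at least", since a fraction cannot exceed the CSC threshold of 1. Gene-name normalization (Bhalotia et al.) is not modeled.

```python
import re

# Hypothetical, tiny stop-word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "and", "at", "by", "in", "is", "of", "that", "the", "to"}


def tokenize(text):
    """Lowercase, replace punctuation (except hyphens) with spaces,
    split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def go_match_fraction(go_term, sentence):
    """Fraction of the GO term's tokens that appear anywhere in the sentence."""
    go_tokens = set(tokenize(go_term))
    if not go_tokens:
        return 0.0
    sentence_tokens = set(tokenize(sentence))
    return len(go_tokens & sentence_tokens) / len(go_tokens)


def is_matched(go_term, sentence, threshold):
    """True if the matched fraction reaches the threshold (0.75 or 1 in the slides)."""
    return go_match_fraction(go_term, sentence) >= threshold
```

Because tokens are compared as sets, the GO tokens do not need to be contiguous in the sentence. Paraphrases such as "inhibition" for "negative regulation" are still missed, which is one reason the threshold for CSM is below 1.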
Cross-Species Match Algorithm
• CSM(g, a): For a target gene g, search in article a for only the GO codes annotated to its ortholog
• If at least 75% of the GO code's tokens are found in a sentence containing the gene name, the code is matched.
• Note: we must eliminate annotations of orthologs marked with IEA and ISS evidence codes to avoid circular references.
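The CSM step can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the tokenizer is a plain lowercase whitespace split, gene mentions are detected by exact token match rather than the gene-name normalization the slides cite, and the tuple layout of `ortholog_annotations` is hypothetical.

```python
EXCLUDED_EVIDENCE = {"IEA", "ISS"}  # evidence codes the slides say to drop


def tokens(text):
    # Simplified tokenizer (assumption): lowercase, whitespace split, no stop words.
    return set(text.lower().split())


def cross_species_match(gene, sentences, ortholog_annotations, threshold=0.75):
    """Return the GO ids annotated to the ortholog that are matched in the article.

    sentences: the article text, already split into sentences.
    ortholog_annotations: list of (go_id, go_term, evidence_code) tuples
    (hypothetical layout) for the ortholog of `gene`.
    """
    matched = set()
    for sentence in sentences:
        sentence_tokens = tokens(sentence)
        if gene.lower() not in sentence_tokens:
            continue  # only sentences that mention the target gene are considered
        for go_id, go_term, evidence in ortholog_annotations:
            if evidence in EXCLUDED_EVIDENCE:
                continue  # avoid circular references
            go_tokens = tokens(go_term)
            if go_tokens and len(go_tokens & sentence_tokens) / len(go_tokens) >= threshold:
                matched.add(go_id)
    return matched
```

Restricting the search to the ortholog's GO codes keeps the candidate set small, which is what limits the false positives of plain proximity matching.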
Cross-Species Correlation: Main Idea
• Observation:
  • Since GO codes indicate gene function, it is logical for some to often co-occur in annotations and for others to rarely do so.
• Assumption:
  • If one GO code tends to occur in the orthologous genes’ annotations when another one does not, then assume the second is not a valid assignment for the target species
• Example:
  • If text seems to contain evidence for rRNA transcription (GO:0009303), nucleolus (GO:0005730) and extracellular (GO:0005576), then extracellular is suspicious.
• The algorithm identifies the “suspicious” cases.
Cross-Species Correlation Algorithm
• For every pair of GO codes in the orthologous genes database, compute a χ² coefficient.
  • N: the total number of GO codes
  • O11: # of times the ortholog is annotated with both GO1 and GO2
  • O12: # of times the ortholog is annotated with GO1 but not GO2
  • O21: # of times the ortholog is annotated with GO2 but not GO1
  • O22: # of times the ortholog is not annotated with GO1 or GO2

$$\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}$$
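As a rough illustration, the coefficient can be computed from the four counts with the standard Pearson χ² formula for a 2×2 contingency table. That the slides use exactly this form is an assumption, as is taking the total to be the sum of the four cells. The fourth cell (neither code) is called `o22` here, and the `annotations` mapping is a hypothetical data layout.

```python
def cooccurrence_counts(annotations, go1, go2):
    """Count the four cells of the 2x2 table for a pair of GO codes.

    annotations: dict mapping each orthologous gene to its set of GO ids
    (hypothetical layout).
    """
    o11 = o12 = o21 = o22 = 0
    for codes in annotations.values():
        has1, has2 = go1 in codes, go2 in codes
        if has1 and has2:
            o11 += 1
        elif has1:
            o12 += 1
        elif has2:
            o21 += 1
        else:
            o22 += 1
    return o11, o12, o21, o22


def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 table (1 degree of freedom)."""
    n = o11 + o12 + o21 + o22  # assumption: total is the sum of the cells
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0  # an empty row or column: no evidence of association
    return n * (o11 * o22 - o12 * o21) ** 2 / denom
```

A large value means the two codes are far from independent in the ortholog annotations; the statistic alone does not say whether they attract or repel each other.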
Cross-Species Correlation Algorithm
• M(g,a) = GO codes matched in article a for gene g
• O(g) = GO codes assigned to the ortholog of g
• o = size of O(g), p = percentage (0.2)
• For every potentially matching GO code GO1 in M(g,a):
  • For every GO code GO2 in O(g):
    • Count how often χ²(GO1,GO2) is significant
  • If this count is < p*o then assume GO1 is not valid.
  • Else assign GO1 to g
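The filtering loop above might look like the following sketch. The significance cutoff is an assumption (3.841 is the 0.05 critical value of χ² with one degree of freedom; the slides do not state the level used), and the `chi2` lookup table keyed by unordered code pairs is a hypothetical data layout.

```python
CHI2_CRITICAL = 3.841  # assumption: 0.05 level, 1 degree of freedom


def cross_species_correlation(matched, ortholog_codes, chi2, p=0.2,
                              critical=CHI2_CRITICAL):
    """Keep the codes in M(g,a) that correlate with enough of the codes in O(g).

    matched: GO ids matched in the article for the gene, M(g,a).
    ortholog_codes: GO ids assigned to the ortholog, O(g).
    chi2: dict mapping frozenset({go1, go2}) to a precomputed chi-square value;
          missing pairs are treated as not significant.
    """
    o = len(ortholog_codes)
    accepted = set()
    for go1 in matched:
        significant = sum(
            1
            for go2 in ortholog_codes
            if chi2.get(frozenset((go1, go2)), 0.0) >= critical
        )
        if significant >= p * o:  # a count below p*o marks go1 as suspicious
            accepted.add(go1)
    return accepted
```

With p = 0.2, a candidate code survives only if it is significantly correlated with at least a fifth of the ortholog's codes.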
Evaluation using BioCreative
• Task 2.2:
  • Annotate 138 human genes with GO codes using 99 full-text articles
  • For each annotation, provide the passage of text that the annotation was based upon
• Annotations from participants were manually judged by human curators
• A prediction was considered “perfect” if the text passage
  • contained the gene name, and
  • provided evidence for annotating the gene with the GO code
Results on BioCreative
• Our research was conducted after the competition had passed, so our annotations could not be judged by the same curators
• Used the competition's “perfect predictions” as the reference set
  • (unfair to our system; ignores relevant predictions we find that other systems do not)
• Our prediction is correct if it matches a perfect prediction
  • e.g., vhl is annotated with transcription (GO:0006350) in PubMed 12169961: “vhl inhibits transcription elongation, mRNA stability and PKC activity”
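Scoring against a reference set of this kind can be sketched as below. The deck reports F-measures in its conclusions; that the balanced F-measure (F1) is the one meant, and that predictions are compared as (gene, GO code) pairs, are assumptions made for this example.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted (gene, go_id) pairs against the reference pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because only the reference pairs count as correct, any valid annotation that no competing system found lowers precision, which is the unfairness noted on the slide.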
Results on Larger Dataset
• A much larger test set has been made publicly available by Chiang and Yu.
• EBI human test set
  • 4,410 genes
  • 13,626 GO code annotations
• MGI mouse test set
  • 2,188 genes
  • 6,338 GO code annotations
• Note that Chiang and Yu used the same data for both training and testing.
Results on EBI Human and MGI datasets
• EBI human: 4,410 genes and 5,714 abstracts
• MGI: 2,188 genes and 1,947 abstracts
Conclusions and Future Work
• We propose an algorithm that annotates genes with GO codes using the information available from other species
• Experimental results on three datasets show that our algorithm consistently achieves higher F-measures than other solutions
• Future improvements to our algorithm:
  • combine, or use a voting scheme between, the predictions our system makes and the predictions of a machine learning system
  • investigate how effective other genes with sequences similar to the target gene (but not orthologous to it) are for predicting the GO codes
Thank you! http://biotext.berkeley.edu