The GO Reference Genome Annotation Project

Susan Tweedie, Rex Chisholm, Karen Christie, Emily Dimmer, Mary E. Dolan, Pascale Gaudet, David P. Hill, Doug Howe, Jim Hu, Donghui Li, Ruth Lovering, Fiona McCarthy, Sohel Merchant, Victoria Petri, Kimberley Van Auken, Valerie Wood, Suzanna Lewis, Michael Ashburner, J. Michael Cherry, Judy A. Blake, and The Gene Ontology Consortium.

Summary
The GO Reference Genome Annotation Project is a collaboration between model organism databases representing 12 diverse species. Our aim is to provide comprehensive, high-quality GO annotation for every gene in each species. This will serve as a valuable reference set for annotating other genomes. Our strategy is to work together, curating the same genes simultaneously from an agreed list. This poster illustrates the process we follow and highlights some of the curation issues that we have faced.

Cross-species Review
Comparing annotations across species helps:
• Show terms that can be added via experimental data in orthologs.
• Ensure annotation consistency, e.g. by spotting outliers that may reflect curation errors.
• Reveal significant biological differences between species.

[Figure: detail from a graph used for annotation summary and comparison purposes, showing some of the GO biological process terms annotated to the human gene MSH2 and predicted orthologs.]
Graphs for all genes curated in this project are available at the GO website: www.geneontology.org/images/RefGenomeGraphs

[Figure: bar charts of Number of Genes by organism, total gene number and GO aspect (P, F, C). Organisms shown: S. pombe, S. cerevisiae, D. discoideum, D. melanogaster, C. elegans, A. thaliana, M. musculus, R. rattus, H. sapiens, D. rerio. Total gene numbers listed: 5.3K, 6.1K, 12K, 13.9K, 22.9K, 27.3K, 27.9K, 28K, 28K, 28K.]
[Logos on the poster include: ZFIN, E. coli, Chicken.]

What genes to curate first?
For the first year of the project we chose orthologs of human disease genes (taken from the OMIM collection) as our priority targets for curation. We have now expanded our priority targets to 4 areas:
• Orthologs of human disease genes
• Genes involved in metabolic pathways
• Topical or ‘hot’ genes
• Genes that currently lack GO annotation but are conserved from yeast to human
We try to curate related genes in batches to promote curation efficiency. MODs will now take turns to choose genes for curation.

What are the related genes in my species?
Currently each MOD has its own method of ortholog identification. These include YOGY, InParanoid, OrthoMCL, TreeFam, HomoloGene and in-house sequence analysis. Unfortunately, none of these covers all 12 reference genome species, and comparing methods is difficult because identifiers vary and update frequencies differ. We are now working with Kara Dolinski at Princeton to establish a consistent system for representing homologs / orthologs across the reference genome species set.

Making a database
Target genes, orthologs and curation status are currently recorded in shared Google spreadsheets. This has proved inconvenient, so we are developing a dedicated database to store this information. A prototype is shown below.
[Figure: database prototype.]

What papers to curate?
Identifying the relevant papers and prioritizing them for GO curation can be a rate-limiting step for MODs that do not have a literature triage system in place, particularly when there are many papers about a gene. In some cases, working back to the primary literature from recent reviews is an effective approach. Another strategy is the use of text mining tools such as Textpresso.

Overview
• Get list of 20 genes to curate/month
• Identify orthologs
• Record ortholog details
• Triage papers for GO
• Curate selected papers for new GO annotations
• Discuss annotations with other curators
• Create new GO terms as required
• Clean-up existing GO annotations
• Review annotations by other ref genome MODs
• Release annotation set

Ontology Development
Working on the same genes together encourages the development of new GO terms. Over the last year 450 new GO terms have been added by the reference genome annotation group.

Review annotation quality
Annotations should conform to the agreed standards of the reference genome annotation group:
• Experimental evidence (evidence codes IDA, IPI, IGI, IMP, IEP) is preferred.
• Terms assigned by TAS (traceable author statement) should be traced to the primary literature - these don’t always turn out to apply to the correct species! TAS is discouraged for use in this project.
• Terms should only be assigned by sequence similarity (ISS) where the terms are supported by experimental evidence for the similar sequence.
• Non-traceable author statements should be avoided.

Data availability
The annotations from this project are submitted to GO as part of the standard gene association file available from the GO web site or via AmiGO. Efforts to highlight this data set in AmiGO and via dedicated web pages are in progress.

Progress summary
Over 200 human disease genes have been examined by the group in the last year. A comparison of the categories of evidence codes used to assign GO terms to genes indicates generally higher proportions of experimental evidence (shown in pale blue) in the reference genome target set (left graph) versus GO annotation across all genes (right graph).

[Figure: two graphs of evidence code categories. Axis labels: Number of Reference Genome Target Genes; Organism and GO aspect (P = Biological Process, F = Molecular Function, C = Cellular Component).]
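
The evidence-code standards and the experimental-versus-other comparison described above can be illustrated with a minimal Python sketch. This is not the project's actual tooling: the record layout (`gene`, `term`, `evidence`, `with`), the function names and the example data are all hypothetical. The ISS check is deliberately simplified and only accepts an exact term match on the similar gene, whereas real curation may also accept related terms in the ontology. NAS is the GO evidence code for a non-traceable author statement.

```python
# Experimental evidence codes the poster lists as preferred.
EXPERIMENTAL = {"IDA", "IPI", "IGI", "IMP", "IEP"}
# TAS is discouraged in this project; NAS (non-traceable author statement) should be avoided.
DISCOURAGED = {"TAS", "NAS"}


def experimental_fraction(annotations):
    """Return the fraction of annotations that use an experimental evidence code."""
    if not annotations:
        return 0.0
    n_exp = sum(1 for a in annotations if a["evidence"] in EXPERIMENTAL)
    return n_exp / len(annotations)


def iss_supported(annotation, annotations):
    """Check that an ISS annotation's source gene has the same term with experimental evidence.

    Simplifying assumption: only an exact term match counts as support.
    """
    source = annotation.get("with")
    return any(
        a["gene"] == source
        and a["term"] == annotation["term"]
        and a["evidence"] in EXPERIMENTAL
        for a in annotations
    )


def flag_for_review(annotations):
    """Return annotations using discouraged codes or unsupported ISS, in input order."""
    flagged = []
    for a in annotations:
        if a["evidence"] in DISCOURAGED:
            flagged.append(a)
        elif a["evidence"] == "ISS" and not iss_supported(a, annotations):
            flagged.append(a)
    return flagged


# Hypothetical records; gene names and term labels are placeholders, not real data.
example = [
    {"gene": "geneA", "term": "TERM_1", "evidence": "IDA"},
    {"gene": "geneA", "term": "TERM_2", "evidence": "TAS"},
    {"gene": "geneB", "term": "TERM_1", "evidence": "ISS", "with": "geneA"},
    {"gene": "geneB", "term": "TERM_3", "evidence": "ISS", "with": "geneA"},
]

print(experimental_fraction(example))
print([a["term"] for a in flag_for_review(example)])
```

Computing `experimental_fraction` separately for the target-gene set and for all genes would give the kind of comparison summarized in the progress graphs, under the assumptions stated above.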