Empower scientific research with fully annotated genomes. Capture experimental data for high-confidence functional descriptions. Utilize comparative methods for inference and evolution insights. Deliver proteome sets, best practices, documentation, and software tools for comprehensive genome annotation sets and ortholog identification. Ensure concurrent and consistent gene annotations using homology-based evolution trees.
Motivation for Reference Genome Effort
Fully and reliably annotated genomes:
• empower scientific research
• are essential for use in automatic inference
We comprehensively capture the experimental data from the most active research communities producing high-confidence functional descriptions to leverage the power of the comparative method for inference.
Deliverables of Reference Genome Effort
• Proteome sets
• Annotation best practices documentation
• Annotation software tool
• Reference annotations for inference of function in other species
Evolutionary relationships are the “glue” in RefGenome
• Goal
  • Identify genes in reference genomes that may have the same or similar functions, so that comprehensive curation can be done simultaneously
• Why?
  • Different model organisms have different strengths for investigating gene function, and these can often inform each other
  • Most genes did not first evolve within a given extant species: they were INHERITED from a common ancestor shared with other species. Genes in different organisms have similar functions because they were inherited, and haven’t changed much since the common ancestor.
Current process
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build “ortholog clusters”
• Selection of “annotation set”, including independent ortholog identification at each MOD
• Individual MODs annotate in-depth each gene in set
• ISS annotations made independently by each MOD
New process: coordinate and centralize where possible
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build trees
• Gp2protein files used to build “ortholog clusters”
• Trees and clusters used to define ref. genome annotation sets
• Individual MODs annotate in-depth each gene in set
• Inferences made to ancestral proteins
• Inferences made to extant proteins
New process (flowchart repeated), highlighted point:
• Select “gene set for concurrent annotation” from a central resource with more complete information
New process (flowchart repeated), highlighted point:
• Make homology-based annotations concurrently and consistently in the context of an evolutionary tree
Update on progress: comprehensive gene sets from each MOD
• Short term solution implemented as of 9/4
  • Gp2protein files are now approximately complete
  • Most sets were OK as deposited by the MOD
  • A few sets had to be augmented (missing genes filled in from Ensembl or Entrez Gene); one set had to be reduced by selecting a single “representative” protein sequence per gene
• Long term solution: UniProt?
  • SwissProt record includes all alternatively spliced exons, which is ideal for evolutionary modeling of protein coding gene history
  • We have already shared the gp2protein files with SwissProt, and they are comparing to UniProt “complete proteome” sets
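The reduction step mentioned above (one “representative” protein sequence per gene) can be sketched as follows. The slides do not say how the representative was chosen; this sketch assumes the longest sequence is kept, with ties broken by protein identifier. The function name, the input tuple layout, and the toy identifiers are all hypothetical.

```python
def pick_representatives(proteins):
    """Choose one protein per gene.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns {gene_id: protein_id}.
    Assumption (not from the slides): the longest sequence wins;
    ties are broken by the smaller protein_id so the result is stable.
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        # Negative length so that "smaller is better" for both fields.
        candidate = (-len(sequence), protein_id)
        if gene_id not in best or candidate < best[gene_id]:
            best[gene_id] = candidate
    return {gene: protein_id for gene, (_, protein_id) in best.items()}


# Toy input with made-up identifiers and sequences.
example = [
    ("geneA", "A-1", "MKT"),
    ("geneA", "A-2", "MKTLLV"),
    ("geneB", "B-1", "MA"),
]
representatives = pick_representatives(example)
```

Other selection rules (for example, a curated canonical isoform) would fit the same interface by changing how `candidate` is built.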
Proposal made at this meeting
• Write a white paper describing the “complete protein-coding gene set” needs/requirements for the RefGenome project
• Michael will approach Amos and discuss options for working together
Example: NEDD4
• Selected for electronic jamboree Oct. 2008
• Human NEDD4 was “core” target
• OrthoMCL identified “orthologs” in
  • Drosophila
  • C. elegans
  • Mouse (2)
  • Human (2)
  • Zebrafish
  • Rat
• Curators at SGD identified an ortholog in yeast from a published paper
[Figure] Orthologs (green) and paralogs (orange) of human NEDD4 (red)
Figure labels:
• duplications at base of metazoa: WWP1/2; SMURF1/2 diverge, NEDD4 conserved
• duplication at base of chordata: HACE1 diverges, NEDD4 conserved
• duplication at base of reptilia?
[Figure] OrthoMCL cluster containing human NEDD4/NEDD4L (blue) and curator-identified yeast ortholog (lt. blue)
Figure labels: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia
[Figure] Orthologs (green) and paralogs (orange) of human NEDD4 (red), and “conserved orthologs” of NEDD4/NEDD4L (yellow)
Figure labels: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia
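The ortholog/paralog coloring in these figures follows from where duplications sit in the gene tree. A minimal sketch of that standard rule: relative to a focal gene, another gene is a paralog if their most recent common ancestor node is a duplication, and an ortholog if it is a speciation. The `Node` class, the event labels, and the toy tree are assumptions made for illustration; the tree below is not the NEDD4 tree from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""      # leaf label; empty for internal nodes
    event: str = ""     # "speciation" or "duplication" for internal nodes
    children: list = field(default_factory=list)


def leaves(node):
    """Return the leaf names under a node, left to right."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def classify(root, focal):
    """Label every other leaf as 'ortholog' or 'paralog' of the focal leaf.

    Walks from the root down to the focal leaf. At each internal node,
    leaves outside the focal branch share that node as their most recent
    common ancestor with the focal gene, so they take its event type.
    """
    if focal not in leaves(root):
        raise ValueError(f"{focal!r} is not a leaf of this tree")
    labels = {}
    node = root
    while node.children:
        focal_child = next(c for c in node.children if focal in leaves(c))
        label = "paralog" if node.event == "duplication" else "ortholog"
        for child in node.children:
            if child is not focal_child:
                for leaf in leaves(child):
                    labels[leaf] = label
        node = focal_child
    return labels


# Toy tree (hypothetical genes): one duplication after the split from yeast.
toy_tree = Node(event="speciation", children=[
    Node(name="yeast_gene"),
    Node(event="duplication", children=[
        Node(event="speciation", children=[Node(name="human_A"), Node(name="mouse_A")]),
        Node(event="speciation", children=[Node(name="human_B"), Node(name="mouse_B")]),
    ]),
])
toy_labels = classify(toy_tree, "human_A")
```

Here `yeast_gene` and `mouse_A` come out as orthologs of `human_A`, while `human_B` and `mouse_B` come out as paralogs, because they are separated from it by the duplication node.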
Update on progress: gene trees and “homology set” selection tool
• Gene trees have been built for all existing PANTHER families, from all RefGenome species, plus 35 other “phylogenetically informative” species
• Tree Curation Tool has been updated by Paul’s and Suzi’s groups in collaboration
  • Retrieves and displays tree, and UniProt information for each sequence
  • Displays OrthoMCL clustering results; scalable to any number of different clustering algorithms
  • “Pre-alpha” prototype has been installed and is being tested by Pascale
• GOC has obtained supplemental funding to support
  • Adding multiple homology clustering algorithms
  • A “protein family curator”
Proposal made at this meeting
• Lead RefGenome Curator and Protein Family Curator work together to define set of genes to be annotated concurrently
• No need for review by individual MODs
[Flowchart]
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build trees
• Gp2protein files used to build “ortholog clusters”
• Trees and clusters used to define ref. genome annotation sets
• Review and sign off on r.g. experimental annotations
• Inferences made to ancestral proteins
• Inferences made to extant proteins
Annotation inference based on homology
• We need to make homology inferences correctly and consistently
  • Infer only from annotations with experimental evidence
  • Use explicit evolutionary model: inheritance (maybe with modification) from a common ancestor!
• Homology inference is actually two inferences
  • 1. the common ancestor has the same annotation as its descendant that has been characterized
  • 2. another (unannotated) descendant has the same annotation as its ancestor
• Need traceable, versioned evidence trail:
  • Inferred annotation -> tree -> experimental annotation(s) -> literature
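The two-step inference and its evidence trail can be sketched in a few lines. This is an illustration under stated assumptions, not the project's tool: the class names, field names, tree version string, and identifiers are hypothetical, and a real curator would decide case by case whether an annotation is inherited unchanged rather than propagating it automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experimental:
    gene: str
    go_term: str
    reference: str      # literature reference supporting the annotation


@dataclass(frozen=True)
class Inferred:
    gene: str
    go_term: str
    tree_version: str   # which versioned tree the inference was made on
    ancestor: str       # ancestral node that carries the annotation
    support: tuple      # experimental annotations behind the inference


def infer(tree_version, ancestor, descendants, experimental, go_term):
    """Propagate one GO term through one ancestral node.

    Step 1: the ancestor is annotated only if at least one descendant
            has the term with experimental evidence.
    Step 2: every descendant without that experimental annotation
            receives an inferred annotation that records the full trail:
            inferred annotation -> tree -> experimental annotation(s)
            -> literature.
    """
    support = tuple(
        e for e in experimental
        if e.gene in descendants and e.go_term == go_term
    )
    if not support:
        return []  # nothing experimental to infer from
    characterized = {e.gene for e in support}
    return [
        Inferred(gene, go_term, tree_version, ancestor, support)
        for gene in descendants
        if gene not in characterized
    ]


# Toy data: identifiers and the reference string are placeholders.
term = "positive regulation of synaptogenesis"
evidence = [Experimental("geneA", term, "REF:placeholder")]
inferred = infer("tree-v1", "ancestor_1", ["geneA", "geneB", "geneC"], evidence, term)
```

Each `Inferred` record can be traced back through `tree_version` and `ancestor` to the `support` tuple, and from there to the `reference` of each experimental annotation.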
[Figure labels: “?”, “?”; GO process: positive regulation of synaptogenesis]
Proposal made at this meeting
• Protein family curator makes first pass at homology inferences
• Confers with individual MODs as necessary
• Iterative: protein family curator prepares list of inferred annotations for each MOD, each MOD reviews and can suggest changes
Annotation process (flowchart repeated)
• Trees and clusters used to define ref. genome annotation sets: protein family curator (Princeton/Pascale) suggests protein set based on report/examination of trees
• MOD curators annotate all experimental data to completion
• Review and sign off on r.g. experimental annotations: protein family curator mediates annotation review
• Inferences made to ancestral proteins: protein family curator
• Inferences made to extant proteins: protein family curator; reviewed by protein family and MOD curators
• Done!
Transformations (flowchart repeated)
Princeton / P-POD update
• New run with protein sets used by PANTHER under way
• Implementing algorithms for generation of consensus clusters and other ortholog prediction methods
• New P-POD features
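The slides do not describe how P-POD generates consensus clusters. One simple possibility, shown here purely as an illustrative assumption, is pairwise voting: two genes stay together if at least `min_votes` clustering methods put them in the same cluster, and the consensus clusters are the connected components of those agreed pairs. The function name and the toy gene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_clusters(clusterings, min_votes):
    """Merge several clusterings by pairwise voting.

    clusterings: list of clusterings, one per method; each clustering
                 is a list of clusters, each cluster a list of gene ids.
    Returns a sorted list of sorted clusters.
    """
    votes = Counter()
    genes = set()
    for clusters in clusterings:
        pairs = set()  # each method votes at most once per gene pair
        for cluster in clusters:
            genes.update(cluster)
            pairs.update(combinations(sorted(set(cluster)), 2))
        votes.update(pairs)

    # Union-find over genes, joining pairs with enough votes.
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for (a, b), count in votes.items():
        if count >= min_votes:
            parent[find(a)] = find(b)

    groups = {}
    for gene in genes:
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())


# Toy input: two methods that only agree on the pair (a, b).
method_1 = [["a", "b", "c"], ["d"]]
method_2 = [["a", "b"], ["c", "d"]]
strict = consensus_clusters([method_1, method_2], min_votes=2)
loose = consensus_clusters([method_1, method_2], min_votes=1)
```

With `min_votes=2` only the pair both methods agree on survives; with `min_votes=1` any single method's grouping is enough, so all four genes merge into one cluster.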
[Flowchart repeated, with callouts]
• Pascale picks a focal gene
• UniProt complete proteome project?
• How to most efficiently incorporate input from all MOD curators?
• How are resulting homology-based annotations delivered to MODs?