Motivation for Reference Genome Effort

Fully and reliably annotated genomes: • empower scientific research • are essential for use in automatic inference. We comprehensively capture the experimental data from the most active research communities, producing high-confidence functional descriptions, to leverage the power of the comparative method for inference.

Deliverables of the Reference Genome Effort • Proteome sets • Annotation best practices documentation • Annotation software tool • Reference annotations for inference of function in other species

Evolutionary relationships are the “glue” in RefGenome • Goal: identify genes in reference genomes that may have the same or similar functions, so that comprehensive curation can be done simultaneously • Why? • Different model organisms have different strengths for investigating gene function, and these can often inform each other • Most genes did not first evolve within a given extant species: they were INHERITED from a common ancestor shared with other species. Genes in different organisms have similar functions because they were inherited and haven’t changed much since the common ancestor.

Current process • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build “ortholog clusters” • Selection of “annotation set”, including independent ortholog identification at each MOD • Individual MODs annotate in-depth each gene in set • ISS annotations made independently by each MOD

New process: coordinate and centralize where possible • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Individual MODs annotate in-depth each gene in set • Inferences made to ancestral proteins • Inferences made to extant proteins

Select “gene set for concurrent annotation” from a central resource with more complete information • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Individual MODs annotate in-depth each gene in set • Inferences made to ancestral proteins • Inferences made to extant proteins

Make homology-based annotations concurrently and consistently in the context of an evolutionary tree • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Individual MODs annotate in-depth each gene in set • Inferences made to ancestral proteins • Inferences made to extant proteins

Update on progress: comprehensive gene sets from each MOD • Short-term solution implemented as of 9/4 • Gp2protein files are now approximately complete • Most sets were OK as deposited by the MOD • A few sets had to be augmented (missing genes filled in from Ensembl or Entrez Gene); one set had to be reduced by selecting a single “representative” protein sequence per gene • Long-term solution: UniProt? • A SwissProt record includes all alternatively spliced exons, which is ideal for evolutionary modeling of protein-coding gene history • We have already shared the gp2protein files with SwissProt, and they are comparing them to the UniProt “complete proteome” sets
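The reduction step mentioned above (one “representative” protein per gene) could be sketched as follows. The slides do not say how the representative was chosen, so the longest-sequence rule here is purely an assumption, and the gene IDs, protein IDs, and sequences are hypothetical placeholders rather than real gp2protein content.

```python
def pick_representatives(proteins):
    """Pick one protein per gene from (gene_id, protein_id, sequence) tuples.

    Assumption (not from the slides): the longest sequence is taken as the
    representative; on a tie, the first one encountered is kept.
    """
    best = {}  # gene_id -> (protein_id, sequence)
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return {gene_id: protein_id for gene_id, (protein_id, _) in best.items()}


# Hypothetical input: two isoforms for geneA, one protein for geneB.
proteins = [
    ("geneA", "A-1", "MKT"),
    ("geneA", "A-2", "MKTLLV"),
    ("geneB", "B-1", "MS"),
]
representatives = pick_representatives(proteins)
print(representatives)
```

Other selection rules (for example, a curated canonical isoform) would slot into the same comparison.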

Proposal made at this meeting • Write a white paper describing the “complete protein-coding gene set” needs/requirements for the RefGenome project • Michael will approach Amos and discuss options for working together

Example: NEDD4 • Selected for electronic jamboree Oct. 2008 • Human NEDD4 was “core” target • OrthoMCL identified “orthologs” in • Drosophila • C. elegans • Mouse (2) • Human (2) • Zebrafish • Rat • Curators at SGD identified an ortholog in yeast from a published paper

Figure (NEDD4 gene tree): Orthologs (green) and paralogs (orange) of human NEDD4 (red). Marked events: duplications at base of metazoa (WWP1/2 and SMURF1/2 diverge, NEDD4 conserved); duplication at base of chordata (HACE1 diverges, NEDD4 conserved); duplication at base of reptilia?

Figure (NEDD4 gene tree): OrthoMCL cluster containing human NEDD4/NEDD4L (blue) and curator-identified yeast ortholog (lt. blue). Marked events: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia.

Figure (NEDD4 gene tree): Orthologs (green) and paralogs (orange) of human NEDD4 (red), and “conserved orthologs” of NEDD4/NEDD4L (yellow). Marked events: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia.
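The ortholog/paralog coloring in these trees follows from the event at the last common ancestor of two genes: a speciation node gives orthologs, a duplication node gives paralogs. A minimal sketch of that rule is below. The `Node` class and the three-gene topology are illustrative assumptions, a deliberately simplified toy and not the actual PANTHER tree or the Tree Curation Tool's data model.

```python
class Node:
    """Gene-tree node; internal nodes carry an event label."""

    def __init__(self, name, event=None, children=()):
        self.name = name
        self.event = event  # "speciation" or "duplication" (internal nodes only)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def path_to_root(node):
    """Return the node followed by all of its ancestors, nearest first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def relation(a, b):
    """Classify two distinct leaves by the event at their last common ancestor."""
    ancestors_of_a = {id(n) for n in path_to_root(a)}
    for n in path_to_root(b):
        if id(n) in ancestors_of_a:
            return "ortholog" if n.event == "speciation" else "paralog"
    raise ValueError("genes are not in the same tree")


# Toy topology (assumed): a duplication separates the WWP1 lineage from the
# NEDD4 lineage, and a later speciation separates human and mouse NEDD4.
human_nedd4 = Node("human_NEDD4")
mouse_nedd4 = Node("mouse_Nedd4")
human_wwp1 = Node("human_WWP1")
nedd4_clade = Node("NEDD4_ancestor", "speciation", [human_nedd4, mouse_nedd4])
root = Node("metazoan_duplication", "duplication", [nedd4_clade, human_wwp1])

print(relation(human_nedd4, mouse_nedd4))
print(relation(human_nedd4, human_wwp1))
```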

Update on progress: gene trees and “homology set” selection tool • Gene trees have been built for all existing PANTHER families, from all RefGenome species, plus 35 other “phylogenetically informative” species • Tree Curation Tool has been updated by Paul’s and Suzi’s groups in collaboration • Retrieves and displays the tree, and UniProt information for each sequence • Displays OrthoMCL clustering results; scalable to any number of different clustering algorithms • “Pre-alpha” prototype has been installed and is being tested by Pascale • GOC has obtained supplemental funding to support: • Adding multiple homology clustering algorithms • A “protein family curator”

Proposal made at this meeting • Lead RefGenome Curator and Protein Family Curator work together to define set of genes to be annotated concurrently • No need for review by individual MODs

Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Review and sign off on r.g. experimental annotations • Inferences made to ancestral proteins • Inferences made to extant proteins

Annotation inference based on homology • We need to make homology inferences correctly and consistently • Infer only from annotations with experimental evidence • Use explicit evolutionary model: inheritance (maybe with modification) from a common ancestor! • Homology inference is actually two inferences • 1. the common ancestor has the same annotation as its descendant that has been characterized • 2. another (unannotated) descendant has the same annotation as its ancestor • Need traceable, versioned evidence trail: • Inferred annotation -> tree -> experimental annotation(s) -> literature
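The two-step inference and the evidence trail described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class, the `infer` function, the tree version string, and the reference ID are all hypothetical, and the sketch assumes inheritance without modification, whereas a curator would also decide where an annotation was lost or changed.

```python
class Node:
    """Tree node; leaves are extant proteins, internal nodes are ancestors."""

    def __init__(self, name, children=(), experimental=None):
        self.name = name
        self.children = list(children)
        # GO term -> literature reference (experimental evidence only)
        self.experimental = dict(experimental or {})
        # GO term -> evidence trail for inferred annotations
        self.inferred = {}


def leaves(node):
    """Return all extant (leaf) proteins below a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def infer(ancestor, term, tree_version):
    """Propagate `term` through a curator-chosen ancestral node.

    Returns the names of the leaves that received an inferred annotation.
    """
    sources = [l for l in leaves(ancestor) if term in l.experimental]
    if not sources:
        return []  # infer only from annotations with experimental evidence
    # Trail: inferred annotation -> tree -> experimental annotation(s) -> literature
    trail = {
        "tree_version": tree_version,
        "ancestor": ancestor.name,
        "experimental": [(l.name, l.experimental[term]) for l in sources],
    }
    # Inference 1: the common ancestor had the characterized descendant's annotation.
    ancestor.inferred[term] = trail
    # Inference 2: uncharacterized descendants inherited it from that ancestor.
    targets = [l for l in leaves(ancestor) if term not in l.experimental]
    for leaf in targets:
        leaf.inferred[term] = trail
    return [leaf.name for leaf in targets]


term = "ubiquitin-protein ligase activity"
human = Node("human_NEDD4", experimental={term: "REF:hypothetical-paper"})
mouse = Node("mouse_Nedd4")
fly = Node("fly_Nedd4")
ancestor = Node("NEDD4_ancestor", [human, mouse, fly])

print(infer(ancestor, term, "tree-v1"))
print(mouse.inferred[term])
```

Storing the same trail on the ancestor and on each inferred leaf keeps every inferred annotation traceable back to the tree version and the literature.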

GO process: cellular response to UV

GO process: positive regulation of synaptogenesis (two tree nodes marked “?”)

GO function: ubiquitin-protein ligase activity

Proposal made at this meeting • Protein family curator makes first pass at homology inferences • Confers with individual MODs as necessary • Iterative: protein family curator prepares list of inferred annotations for each MOD, each MOD reviews and can suggest changes

Annotation process • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Review and sign off on r.g. experimental annotations • Inferences made to ancestral proteins • Inferences made to extant proteins

Trees and clusters used to define ref. genome annotation sets • Protein family curator (Princeton/Pascale) suggests protein set based on report/examination of trees • MOD curators annotate all experimental data to completion • Protein family curator mediates annotation review • Review and sign off on r.g. experimental annotations • Protein family curator • Inferences made to ancestral proteins • Protein family curator • Reviewed by protein family and MOD curators • Inferences made to extant proteins • Done!

Transformations • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Review and sign off on r.g. experimental annotations • Inferences made to ancestral proteins • Inferences made to extant proteins

Princeton / P-POD update • New run with protein sets used by PANTHER under way • Implementing algorithms for generation of consensus clusters and other ortholog prediction methods • New P-POD features

P-POD search

P-POD results/disambiguation

P-POD-Notung

Pascale picks a focal gene • Structural annotation of genomes used to build gp2protein files • Gp2protein files used to build trees • Gp2protein files used to build “ortholog clusters” • Trees and clusters used to define ref. genome annotation sets • Review and sign off on r.g. experimental annotations • Inferences made to ancestral proteins • Inferences made to extant proteins • Open questions: • UniProt complete proteome project? • How to most efficiently incorporate input from all MOD curators? • How are resulting homology-based annotations delivered to MODs?
