Orthology, paralogy and GO annotation Paul D. Thomas SRI International
Outline • Why does orthology matter to us? • A little background on evolution, orthology and paralogy • Practical considerations for RefGenome
Why does “orthology” matter to us? • Goal • identify genes in reference genomes that have the same or similar functions, so that comprehensive curation can be done simultaneously • Why? • Different model organisms have different strengths for exploring different facets of gene function, and these can often inform each other • Most genes did not first evolve within a given extant species: they were INHERITED from a common ancestor. Genes in different organisms have similar functions because they were inherited, and haven’t changed much since the common ancestor.
How do we identify genes with similar functions? • Evolutionary analysis • Where do orthologs fit in, and what do we mean by orthologs? • Simple answer: “The same gene in different organisms” (separated only by speciation) • Orthology = similar function • Unfortunately, the world is not that simple • Orthologous genes can have different functions • Paralogous genes (duplications) can have (to some extent at least) similar functions • Fortunately, a slightly more complicated view can get us much closer to addressing the question of gene function
Representing evolution of related genes • Start with Darwin’s basic model: • Copying • An ancestral “species” “splits” into two separate species • Divergence • Each copy (species) changes independently over generations • NATURAL SELECTION: adaptation to a different environment
Darwin’s species tree • Number of generations/time along one axis • Amount of divergence along other axis • Characters in common are due to inheritance • Also tells us something about common ancestor
Representing evolution of related genes • “Gene families” • Add detail from population genetics/molecular evolution to apply to genes • Copying • An ancestral species “splits” into two separate species • SPECIATION • A gene is duplicated in one population and subsequently inherited • DUPLICATION • Divergence • Each copy (gene sequence) changes independently over generations • NATURAL SELECTION: sequence substitutions to adapt to new function/role • NEUTRAL DRIFT: accumulation of “neutral” substitutions
How does this relate to gene function? • Copying • An ancestral species “splits” into two separate species • SPECIATION: each copy is likely to continue performing the ancestral function • BUT not always • A gene is duplicated in one population and subsequently inherited • DUPLICATION: a “redundant” gene, free from previous constraints, can adapt to a new function • BUT it still inherits some aspects of the ancestral function • Divergence • Each “new” copy (gene sequence) changes independently over generations • NATURAL SELECTION: sequence substitutions adapt to a new/modified function/role • NEUTRAL DRIFT: sequence changes from accumulation of “neutral” substitutions. This is the MAJOR source of sequence differences!
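The speciation/duplication distinction above can be sketched in code. This is a minimal illustration, not the method of any particular tool: it assumes a gene tree whose internal nodes are already labeled as speciation or duplication events, and it uses the common convention that two genes are orthologs when their most recent common ancestor is a speciation node and paralogs when it is a duplication node. The `Node` class, the `relationship` function and the toy tree (with made-up gene names) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    # "speciation" or "duplication" for internal nodes; None for leaves (genes)
    event: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the list of nodes from root down to the named leaf, or None."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two genes by the event at their most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    lca = root
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        lca = x  # last node shared by both root-to-leaf paths
    return "orthologs" if lca.event == "speciation" else "paralogs"


# Hypothetical toy family: one ancient duplication, then a speciation
# in each of the two resulting subfamilies. Gene names are invented.
toy_tree = Node("D0", "duplication", [
    Node("S1", "speciation", [Node("yeast_A"), Node("human_A")]),
    Node("S2", "speciation", [Node("yeast_B"), Node("human_B")]),
])

print(relationship(toy_tree, "yeast_A", "human_A"))
print(relationship(toy_tree, "yeast_A", "yeast_B"))
```

Note that the label only says how two genes are related by descent. As the slides stress, it does not by itself settle whether their functions are the same.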
A gene tree • Only one “informative” axis: rate of sequence evolution • For neutral changes this can often act as a “molecular clock” • Non-neutral changes will speed up the rate of evolution [Figure: gene tree. Leaves: E.c.; A.t. MTHFR1; A.t. MTHFR2; D.d.; S.p.; S.c. MET13; S.p.; S.c. MET12; C.e.; D.m.; A.g.; D.r.; G.g.; H.s. MTHFR; R.n.; M.m.]
OrthoMCL “ortholog cluster” • An “ortholog cluster” is made by one or more “slices” through the protein family tree • Some combination of evolutionary rates and history of duplications • Might miss genes that could be efficiently annotated at the same time • From a strict evolutionary standpoint, orthologs are separated ONLY by speciation events; TIGRFAMs has coined the term “equivalog” for functionally conserved groups [Figure: the gene tree from the “A gene tree” slide, with an ortholog cluster marked]
“ISS” • Inference from sequence similarity • A class of database search algorithm (e.g. BLAST) has become a metaphor • Implies “genes have similar functions because they have similar sequences” • Function is best determined using pairwise comparison
“ISS” • More properly, ISS of function is inheritance! • “related genes have a common function because their common ancestor had that function, which was inherited by its descendants” • ISS is not just a statement about one gene. It is also making assertions about • The common ancestor • Inheritance of a “character” by • Both “pairwise similar” descendants • Other descendants
Homology inference in a tree: inheritance and divergence of function • Annotations shown on the tree: “methylene tetrahydrofolate reductase activity” (m.f.); “methionine metabolic process” (b.p.) [Figure: the gene tree from the “A gene tree” slide]
Homology inference in a tree • Annotations shown on the tree: NOT “methionine metabolic process” (b.p.); “regulation of homocysteine metabolic process” (b.p.); NOT “methylene tetrahydrofolate reductase activity” (m.f.)?; NOT “methionine metabolic process” (b.p.)?; “methylene tetrahydrofolate reductase activity” (m.f.); “methionine metabolic process” (b.p.) [Figure: the gene tree from the “A gene tree” slide]
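The idea of ISS as inheritance, including the NOT annotations on this slide, can be sketched as a walk down the tree. This is only an illustration under stated assumptions: each node may gain terms (for example, a function inferred for an ancestor) or lose them (a NOT placed on a clade), and every descendant inherits whatever its parent carries. The `AnnotatedNode` class, the `inherit` function, the clade and gene names, and the placement of the annotations are all hypothetical; only the GO term strings come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass
class AnnotatedNode:
    name: str
    gained: Set[str] = field(default_factory=set)  # terms first asserted here
    lost: Set[str] = field(default_factory=set)    # terms marked NOT from here down
    children: List["AnnotatedNode"] = field(default_factory=list)


def inherit(node: AnnotatedNode,
            inherited: FrozenSet[str] = frozenset(),
            out: Optional[Dict[str, Set[str]]] = None) -> Dict[str, Set[str]]:
    """Push terms from ancestors to leaves; return {leaf name: inferred terms}."""
    if out is None:
        out = {}
    # A node carries what it inherited plus what it gained, minus what it lost.
    terms = (set(inherited) | node.gained) - node.lost
    if not node.children:
        out[node.name] = terms
    for child in node.children:
        inherit(child, frozenset(terms), out)
    return out


MTHFR_ACTIVITY = "methylene tetrahydrofolate reductase activity"
MET_PROCESS = "methionine metabolic process"
HCY_REGULATION = "regulation of homocysteine metabolic process"

# Hypothetical placement: the ancestor has both terms; one clade loses
# the process term and gains a different one.
toy = AnnotatedNode("root", gained={MTHFR_ACTIVITY, MET_PROCESS}, children=[
    AnnotatedNode("clade_1", children=[
        AnnotatedNode("gene_1a"), AnnotatedNode("gene_1b")]),
    AnnotatedNode("clade_2", gained={HCY_REGULATION}, lost={MET_PROCESS},
                  children=[AnnotatedNode("gene_2a")]),
])

leaf_terms = inherit(toy)
for gene, terms in sorted(leaf_terms.items()):
    print(gene, sorted(terms))
```

Because the annotation sits on an ancestral node rather than on a pair of genes, every descendant gets a consistent inference, and an exception has to be stated explicitly on the clade where the function changed.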
Homology inference in a tree • COMBINES: • Evolutionary information (tree) • Experimental knowledge (GO annotations from literature) • Organism-specific biological knowledge (curators) [Figure: the gene tree from the “A gene tree” slide]
This is just an easy, self-consistent way of doing ISS! • We have a picture of ALL the relationships rather than N flat lists that need to be reconciled [Figure: the gene tree from the “A gene tree” slide]
Tree annotation tool for RefGenome • Pre-computed, searchable “library” of gene trees • Include “outgroup” organisms to help infer evolutionary histories • Gene members in any tree can be modified by curator feedback • Tool for viewing tree and selecting “homology group” to be annotated • Tool for viewing tree labeled with in-depth GO annotations from all MODs, and inferring ancestral functions and homology annotations • Homology annotations will be supported by a tree node as evidence; trees will be available to the scientific community • HMMs will be constructed to allow other genome projects to infer GO terms, distributed by InterPro