Bioinformatics and Evolutionary Genomics: Gene Trees, Gene Duplications (I), and Orthology

Gene Trees, Gene Duplications and Orthology

Phylogenetic gene trees: how to make them • Homology: are two pieces of sequence related? Trees: when did they diverge (how are they related)? • Start from a multiple sequence alignment • All multiple sequence alignment programs make a global alignment, so feed them only regions that you know are homologous → domains! • MUSCLE / Clustal / T-Coffee • Visual inspection of alignments (gaps, fragments vs complete sequences, weird things e.g. A)

Put only homologs in the alignment • Even if sequences are not homologous, MUSCLE will align them (MUSCLE/ClustalW implicitly “assume” that the sequences you feed them are homologous) • A phylogeny program will likewise cluster non-homologous sequences

Visual inspection of alignments: ?!

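An additive tree which is wrongly reconstructed by UPGMA

Distance matrix:
      A    B    C    D
A     x   12    9    9
B    12    x    9    7
C     9    9    x    6
D     9    7    6    x

After joining C and D:
      A    B   CD
A     x   12    9
B    12    x    8
CD    9    8    x

After joining B and CD:
       A  BCD
A      x   10
BCD   10    x

[Figure: the UPGMA tree built step by step, next to the true additive tree (branch lengths A 6, B 5, C 3, D 2, internal branch 1)]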
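The UPGMA steps on this matrix can be replayed with a short sketch. This is a minimal illustration, not a reference implementation: the function name `upgma` and the dictionary layout are assumptions made here, and cluster names are simply concatenated leaf names.

```python
def upgma(dist):
    """dist: {(a, b): distance} for every unordered pair of leaf names.
    Returns the merges as (cluster_a, cluster_b, node_height)."""
    d = {frozenset(pair): float(value) for pair, value in dist.items()}
    size = {name: 1 for pair in dist for name in pair}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)            # closest pair of clusters
        a, b = sorted(pair)
        new = a + b                         # e.g. "C" + "D" -> "CD"
        merges.append((a, b, d[pair] / 2))  # node height = half the distance
        for c in size:
            if c not in pair:
                # size-weighted average distance to every other cluster
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        # drop all entries that still mention the merged clusters
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[new] = size.pop(a) + size.pop(b)
    return merges

distances = {("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
             ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}
merges = upgma(distances)
for a, b, height in merges:
    print(f"join {a} + {b} at height {height}")
```

UPGMA joins C with D, then B with CD, then A with BCD, whereas in the additive tree A groups with C and B with D. The assumption of equal rates (all leaves equally far from the root) is what misleads it here.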

Neighbour-Joining (Saitou and Nei, 1987) • Global measure: keeps total branch length minimal • At each step, join the two nodes that give the minimal total branch length (criterion of minimum evolution) • Leads to an unrooted tree

Neighbour-Joining At each step all possible “neighbour joinings” are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

Neighbour-Joining
r = net divergence (row sum of distances)

      A    B    C    D     r
A     x   12    9    9    30
B    12    x    9    7    28
C     9    9    x    6    24
D     9    7    6    x    22

M_ab = d_ab - (r_a + r_b)/(N - 2)
M_ab = 12 - (30 + 28)/(4 - 2) = -17

      A    B    C    D
A     x  -17  -18  -17
B          x  -17  -18
C               x  -17
D                    x

AC → U
d_au = d_ac/2 + (r_a - r_c)/(2(N - 2)) = 9/2 + (30 - 24)/(2*2) = 6
d_cu = d_ac - d_au = 9 - 6 = 3
d_bu = (d_ab + d_bc - d_ac)/2 = (12 + 9 - 9)/2 = 6
d_du = (d_ad + d_cd - d_ac)/2 = (9 + 6 - 9)/2 = 3

[Figure: star tree around U with branch lengths A 6, B 6, C 3, D 3]

      U    B    D     r
U     x    6    3     9
B     6    x    7    13
D     3    7    x    10

      U    B    D
U     x  -16  -16
B          x  -16
D               x

e.g. UB → V
d_vu = d_ub/2 + (r_u - r_b)/(2(N - 2)) = 6/2 + (9 - 13)/(2*1) = 3 - 2 = 1
d_vb = d_ub - d_vu = 6 - 1 = 5
d_dv = (d_ud + d_bd - d_ub)/2 = (3 + 7 - 6)/2 = 2

[Figure: final unrooted tree with branch lengths A-U 6, C-U 3, U-V 1, B-V 5, D-V 2]
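The worked example can be checked with a small sketch that applies the same formulas (M_ab, d_au, and the distance update). The function name `neighbour_joining`, the argument layout, and the tie-breaking rule (first minimal pair in list order) are assumptions made here so that the run follows the slide's choices AC → U and UB → V.

```python
from itertools import combinations

def neighbour_joining(leaves, dist, new_names):
    """Return {(node, neighbour): branch_length} for the unrooted NJ tree."""
    d = {frozenset(p): float(v) for p, v in dist.items()}
    nodes = list(leaves)
    names = iter(new_names)
    branches = {}
    while len(nodes) > 2:
        n = len(nodes)
        # net divergence r_i: sum of distances from i to all other nodes
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}

        def m(pair):
            # rate-corrected distance M_ij = d_ij - (r_i + r_j)/(N - 2)
            i, j = pair
            return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

        i, j = min(combinations(nodes, 2), key=m)   # ties: first pair wins
        u = next(names)
        d_ij = d[frozenset((i, j))]
        d_iu = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        branches[(i, u)] = d_iu
        branches[(j, u)] = d_ij - d_iu
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            # distance from the new node to every remaining node
            d[frozenset((k, u))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
            ) / 2
        nodes = [u] + rest
    a, b = nodes
    branches[(a, b)] = d[frozenset((a, b))]         # last connecting branch
    return branches

distances = {("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
             ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}
branches = neighbour_joining("ABCD", distances, ["U", "V"])
for (x, y), length in branches.items():
    print(f"{x}-{y}: {length}")
```

Unlike UPGMA, the result reproduces every pairwise distance in the matrix, for example A to B = 6 + 1 + 5 = 12.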

Unequal rates between species are a very real phenomenon

Character based: parsimony and maximum likelihood • Two-way classification in phylogeny: distance based vs character based • Character state methods search “directly” (i.e. without defining distances) for the tree that fits best to the data (the alignment)

Maximum likelihood • Search for the tree with the highest likelihood • For each possible tree, one computes the maximum likelihood (ML) value for the character state configurations among the sequences under study, and chooses the tree with the largest ML value as the preferred tree.

Maximum likelihood • One has to specify a model of sequence evolution • The likelihood for all sites is the product of the likelihoods for individual sites, assuming all the nucleotide sites evolve independently • The maximum likelihood method computes the probabilities for all possible combinations of ancestral states! • ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree (hypothesis) would give rise to the observed data (the alignment). The tree found to have the highest (log) ML value is considered to be the preferred tree.
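As an illustration of "sum over all ancestral states, multiply over sites", here is a brute-force sketch for one fixed four-leaf tree. Everything specific in it is an assumption for illustration only: the Jukes-Cantor (JC69) model, the topology ((A,C),(B,D)) with internal nodes U and V, the branch lengths, and the toy alignment. Real ML programs also optimise branch lengths and search over trees, which this sketch does not do.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(x, y, t):
    """JC69 probability of base x becoming base y along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(a, b, c, d, t):
    """Likelihood of one alignment column (bases of A, B, C, D) on the
    tree ((A,C)U,(B,D)V), summed over all ancestral states at U and V."""
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (0.25                                   # base frequency at U
                  * jc69(u, a, t["A"]) * jc69(u, c, t["C"])
                  * jc69(u, v, t["UV"])
                  * jc69(v, b, t["B"]) * jc69(v, d, t["D"]))
    return total

def log_likelihood(alignment, t):
    """Sites are independent: log-likelihoods of the columns add up."""
    columns = zip(*(alignment[name] for name in "ABCD"))
    return sum(math.log(site_likelihood(*col, t)) for col in columns)

# Hypothetical branch lengths and toy alignment, for illustration only.
branch_lengths = {"A": 0.6, "C": 0.3, "UV": 0.1, "B": 0.5, "D": 0.2}
toy_alignment = {"A": "ACGTAC", "B": "ACGAAC", "C": "ACGTTC", "D": "ACGAAC"}
print(log_likelihood(toy_alignment, branch_lengths))
```

A quick sanity check on such code is that the site likelihoods of all 4^4 possible columns sum to 1.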

Interpreting trees (recurring theme)

Interpreting the tree • Taxonomic findings • Paraphyly • Monophyly

Interpreting the tree • Outgroup: place the root between a distant homologous sequence and the rest of the group (b) • Midpoint: place the root at the midpoint of the longest path (sum of branches between any two leaves) NB njplot • Gene duplication: place the root between paralogous gene copies (b) • NB all are affected by rates!
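Midpoint rooting can be sketched as follows: find the longest leaf-to-leaf path, then walk along it until half its length is reached. The function names and the edge-dictionary layout are assumptions made here; the example tree uses the branch lengths from the neighbour-joining example in these slides.

```python
from itertools import combinations

def walk(adj, start):
    """Return {node: [(node_on_path, distance_from_start), ...]}."""
    routes = {start: [(start, 0.0)]}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, w in adj[node]:
            if nxt not in routes:
                routes[nxt] = routes[node] + [(nxt, routes[node][-1][1] + w)]
                stack.append(nxt)
    return routes

def midpoint_root(edges):
    """Return (x, y, offset): the root lies on edge x-y, offset away from x."""
    adj = {}
    for (x, y), w in edges.items():
        adj.setdefault(x, []).append((y, w))
        adj.setdefault(y, []).append((x, w))
    leaves = sorted(n for n in adj if len(adj[n]) == 1)
    # longest path between any two leaves
    longest = max((walk(adj, a)[b] for a, b in combinations(leaves, 2)),
                  key=lambda route: route[-1][1])
    half = longest[-1][1] / 2
    for (x, dx), (y, dy) in zip(longest, longest[1:]):
        if dx <= half <= dy:
            return x, y, half - dx

edges = {("A", "U"): 6.0, ("C", "U"): 3.0, ("U", "V"): 1.0,
         ("B", "V"): 5.0, ("D", "V"): 2.0}
root = midpoint_root(edges)
print(root)
```

Here the longest path is A to B (length 12), so the midpoint lies 6 from A, which happens to coincide with node U. With strongly unequal rates the midpoint can land on the wrong branch, which is the caveat in the last bullet.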

Simple example (kinase)

Two genes per species: how to differentiate between one ancient and two recent duplications? • Two genes on human chromosomes (Human A & Human B) & two genes on mouse chromosomes (Mouse A & Mouse B)

Duplications, Speciations 1 2 3 ?

Interpreting the tree: duplications vs speciations, going pseudo 3D Gene Duplication Speciation

Interpreting the tree: gene trees vs species trees

Interpreting the tree Example: vertebrate duplications • Tetraploidy?

Interpreting the tree: Horizontal Gene Transfer (HGT) [Figure: tree of Bacteria, Eukarya, Archaea]

Jargon for interpretation: orthology (and paralogy) as a specification of homology when discussing two species (Fitch 1970) • Two genes in two species are orthologous if they derive from one gene in their last common ancestor: “the corresponding gene” • Genes can diverge by speciation or by duplication (slide note: “Gene duplication by cell division”) • Orthologs are implied to have the same function [Figure: tree with leaves mouse1, human2, human1]

Orthology ~ annotating internal nodes as duplications or speciations • Given the definition, how does that translate to a tree? • With or without species phylogeny?
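One way to annotate nodes without a species phylogeny is the species-overlap rule: a node is called a duplication if the same species occurs on both sides, and a speciation otherwise. The sketch below is an assumption-laden illustration (rooted binary trees as nested tuples, leaf names of the form `Species_gene`, function names invented here), applied to the Human A/B and Mouse A/B example from the earlier slide.

```python
def species_of(gene):
    """Assumed naming convention: 'Human_A' -> 'Human'."""
    return gene.split("_")[0]

def annotate(tree, species_of):
    """Return (species_set, leaves, events) for a rooted binary tree.
    events holds one (leaves_below_node, label) per internal node."""
    if isinstance(tree, str):
        return {species_of(tree)}, [tree], []
    left, right = tree
    l_species, l_leaves, l_events = annotate(left, species_of)
    r_species, r_leaves, r_events = annotate(right, species_of)
    # shared species on both sides -> the node must be a duplication
    label = "duplication" if l_species & r_species else "speciation"
    leaves = l_leaves + r_leaves
    events = l_events + r_events + [(tuple(leaves), label)]
    return l_species | r_species, leaves, events

# One ancient duplication before the human-mouse split ...
ancient = (("Human_A", "Mouse_A"), ("Human_B", "Mouse_B"))
# ... versus two recent, lineage-specific duplications.
recent = (("Human_A", "Human_B"), ("Mouse_A", "Mouse_B"))

ancient_labels = [label for _, label in annotate(ancient, species_of)[2]]
recent_labels = [label for _, label in annotate(recent, species_of)[2]]
print(ancient_labels)
print(recent_labels)
```

Two genes are then orthologs if their last common node is a speciation, and paralogs if it is a duplication. The rule depends on a correctly rooted tree and can be misled by gene loss.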

Terminology: inparalogs, outparalogs, co-orthologs

Importance of orthology for comparative genomics: more resolution • Gene family present in Ec Hi Bs Mg Af • Orthologs 1 present in Ec Hi Bs Af • Orthologs 2 present in Ec Bs Mg Af • Phenotype ~ gene correlation • Function prediction if Hi is the only biochemically characterized enzyme • Function prediction by co-occurrence • Evolution of gene content: loss vs duplication [Figure: gene tree with leaves Af, Af, Bs, Bs, Ec, Ec, Hi, Mg]

Heuristics for orthology definition • Needed because of • Speed (MSA plus reliable tree building is slow) • Difficulty in deciding which things you should make a tree of in the first place (PFAM?) • Difficulty in operationalizing nuanced tree orthology into group orthology • Historically: bidirectional best BLAST hits (BBH)

BBH: extracting tree-like information from pairwise similarities • Ec1-Bs1 50% • Ec1-Bs2 35% • Ec2-Bs1 33% • Ec2-Bs2 48% [Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Mg]
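The bidirectional-best-hit rule on these four similarities can be sketched as below. The function names, the convention that the first two letters of a name give the genome, and the use of one symmetric score per pair are assumptions for illustration; real BLAST scores differ slightly between the two search directions.

```python
def genome_of(protein):
    """Assumed naming convention: 'Ec1' -> genome 'Ec'."""
    return protein[:2]

def best_hits(scores, genome_of):
    """Map (protein, other_genome) -> (best partner, score)."""
    best = {}
    for (q, s), score in scores.items():
        for a, b in ((q, s), (s, q)):        # treat scores as symmetric
            key = (a, genome_of(b))
            if key not in best or score > best[key][1]:
                best[key] = (b, score)
    return best

def bidirectional_best_hits(scores, genome_of):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (a, _), (b, _) in best.items():
        back = best.get((b, genome_of(a)))
        if back and back[0] == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

# Percent identities from the slide.
scores = {("Ec1", "Bs1"): 50, ("Ec1", "Bs2"): 35,
          ("Ec2", "Bs1"): 33, ("Ec2", "Bs2"): 48}
bbh = bidirectional_best_hits(scores, genome_of)
print(bbh)
```

The two pairs Ec1/Bs1 and Ec2/Bs2 recover the grouping a tree would give, without building the tree.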

BBH issues 1: unequal rates [Figure: gene tree with duplication and speciation nodes marked and the labels “1:1 orthologs” and “Outparalogs”. Leaves, in order: prpC N. meningitidis, prpC E. coli, prpC P. aeruginosa, VCh1337 V. cholerae, mmgD B. subtilis, mmgD B. halodurans, citZ B. subtilis, citZ B. halodurans, VCh2092 V. cholerae, gltA P. multocida, gltA E. coli, gltA P. aeruginosa, gltA N. meningitidis]

BBH issues 2: ignores inparalogs • Prevalence? Depends on e.g. evolutionary distance, group vs pairwise orthology • At least 16% (prokaryotes) • INPARANOID • Ec1-Hi 70% • Ec2-Hi 38% • Ec2-Bs2 48% • Ec2-Bs3 51% • (Bs2-Bs3 70%) [Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Bs3]
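With these numbers, BBH pairs Ec2 only with Bs3 (51%) and misses Bs2. An InParanoid-like rule adds a same-genome protein as inparalog when it is more similar to the seed ortholog than the seed ortholog is to its partner in the other genome. The sketch below is a simplified illustration of that idea, not the InParanoid program itself; the function name and data layout are assumptions.

```python
def add_inparalogs(seed_pair, seed_score, within_genome_scores):
    """Extend a BBH seed pair with inparalogs.

    seed_pair: the two bidirectional best hits, e.g. ("Ec2", "Bs3")
    seed_score: their similarity
    within_genome_scores: {(x, y): similarity} for same-genome pairs
    """
    group = set(seed_pair)
    for (x, y), score in within_genome_scores.items():
        if score > seed_score:               # closer than the cross-species pair
            if x in seed_pair:
                group.add(y)
            elif y in seed_pair:
                group.add(x)
    return sorted(group)

# Values from the slide: Ec2-Bs3 51% is the seed, Bs2-Bs3 70%.
group = add_inparalogs(("Ec2", "Bs3"), 51, {("Bs2", "Bs3"): 70})
print(group)
```

Bs2 and Bs3 end up as co-orthologs of Ec2, which plain BBH cannot express.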

BBH issues 3: differential gene loss • Mg-Hi 35% [Figure: gene tree with leaves Af, Af, Bs2, Bs1, Ec2, Ec1, Hi, Mg]

Other Large Scale orthology schemes: Inparanoid (Erik Sonnhammer)

Orthologous groups • Solution to the non-transitivity of the concept of orthology sensu stricto is: “Group orthology” • Conceptually: all proteins that are directly descended from one protein in the last common ancestor are considered orthologous to each other • Operationally: Combine all connected “best triangular hits” into Clusters of Orthologous Groups (COGs, Tatusov et al, 1997). WWW.NCBI.NLM.GOV (Watch out for fusion/fission though !!!)

Large Scale orthology schemes: COG
• 1. Perform the all-against-all protein sequence comparison.
• 2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
• 3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
• 4. Merge triangles with a common side to form COGs.
• 5. A case-by-case analysis of each COG. This analysis serves to eliminate false-positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
• 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
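Steps 3 and 4 (find triangles of best hits across three genomes, merge triangles that share a side) can be sketched as follows. This is a simplified illustration: best hits are treated as undirected edges, paralog collapsing (step 2) and the manual curation (steps 5 and 6) are left out, and the function names and example edges are hypothetical.

```python
from itertools import combinations

def genome_of(protein):
    """Assumed naming convention: 'Ec1' -> genome 'Ec'."""
    return protein[:2]

def cogs_from_best_hits(edges, genome_of):
    """edges: iterable of (protein, protein) best-hit pairs."""
    edge_set = {frozenset(e) for e in edges}
    proteins = sorted({p for e in edges for p in e})
    # Step 3: triangles of best hits spanning three different genomes.
    triangles = [
        t for t in combinations(proteins, 3)
        if len({genome_of(p) for p in t}) == 3
        and all(frozenset(pair) in edge_set for pair in combinations(t, 2))
    ]
    # Step 4: merge triangles that have a side in common.
    clusters = []                      # list of (members, sides)
    for tri in triangles:
        members = set(tri)
        sides = {frozenset(s) for s in combinations(tri, 2)}
        kept = []
        for c_members, c_sides in clusters:
            if c_sides & sides:
                members |= c_members
                sides |= c_sides
            else:
                kept.append((c_members, c_sides))
        clusters = kept + [(members, sides)]
    return sorted(sorted(m) for m, _ in clusters)

# Hypothetical best-hit edges, for illustration only.
edges = [("Ec1", "Bs1"), ("Ec1", "Hi1"), ("Bs1", "Hi1"),   # triangle 1
         ("Ec1", "Mg1"), ("Bs1", "Mg1"),                   # triangle 2 shares Ec1-Bs1
         ("Af1", "Mg1"),                                   # no triangle: left out
         ("Ec2", "Bs2"), ("Ec2", "Af2"), ("Bs2", "Af2")]   # separate triangle
cogs = cogs_from_best_hits(edges, genome_of)
print(cogs)
```

Requiring a shared side rather than a shared protein is what keeps loosely connected proteins such as Af1 out of the group.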

Other Large Scale orthology schemes: OrthoMCL

The too ambitious comparative genomics dilemma: duplication/speciation vs domains [Figure: time axis from the very distant past to the present, with sequence divergence; labels: single structural elements?, domains, domain cassettes, domain composition / accretion, gene fusion, gene; distant homologs, homologs, ~orthologs, trivial orthologs] • Genome comparison between close species: no domain considerations, sub-sub-orthologs • Between distant homologs: loads of domain considerations

Implication of coupling between duplication & domain accretion for evolution and function prediction • For some genes life is easy: 1:1:1 orthologs, no fusion / domains, a couple of losses • But a minority of families, covering a large proportion of proteins, is a formidable challenge: domain permutations and duplications make life complicated

Orthology & function prediction • BLAST with a newly sequenced globin from frog: what kind of globin is it?

Globins Blast query

Orthologs & function prediction vs homologs that are not orthologs & function • Orthologs tend to have the exact same molecular function; mere homologs that are not orthologs (HTANOs) do not • Orthologs also operate in the same “pathway” • Orthologs mostly have the same domain composition

… but inparalogs: fate after duplication: neofunctionalization or subfunctionalization • Even evolutionarily true orthologs can have “different functions” • Subfunctionalization: both co-orthologs have taken over some aspect of the ancestral function and have lost other aspects • Neofunctionalization (acquiring a new function) or loss of function: one of the co-orthologs does something different now.

Does retaining the ancestral “role” correlate with speed of sequence evolution? Yes, but a substantial minority is inconsistent (figure values: 386, 220)

rfbB / rffG • RfbB and RffG catalyze the same reaction, but are involved in two different biological processes • rfb gene cluster: biosynthesis of O-specific polysaccharides (inner membrane) • rff gene cluster: complex biosynthesis of enterobacterial common antigen (outer membrane)

Why do we observe inconsistencies? [Figure: histogram of frequency (# cases) against sequence identity between inparalogs (%), for consistent and inconsistent cases] • Not because of chance due to lack of divergence time

Why do we observe inconsistencies? • If the inparalogs show similar sequence divergence relative to their single ortholog, is their molecular function similar? • Any inconsistencies are then a chance outcome: both duplicates have diverged, but at (roughly) the same evolutionary speed (most amino acid substitutions have only been subject to purifying selection and not to adaptive selection)

In certain orthology schemes, gene order is given precedence over highest similarity • The gene at the conserved position is considered the “original” and the other duplicate the “copy”
