Phylogenetic analysis A brief introduction in 2 x 4 hours brigitte.boeckmann@isb-sib.ch
What you can learn today • Understand trees • Different types of gene relationships • The difference between a cladogram and a phylogram • Phylogenetic analysis methods • Steps performed during a phylogenetic analysis • Search strategies for tree topologies • Measures for tree robustness • Gene relationships and function prediction
Outline Introduction to phylogenetic analysis Application: Protein function prediction Databases, servers and software TP5
Introduction Phylogeny is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring evolutionary relationships. [Figure: an ancestral genome diverging into the genomes of species 1 and species 2, with horizontal gene transfer (HGT) between lineages. Sources of change: polymorphisms, CNV, gene duplication, gene loss, gene fusion, gene fission, exon shuffling, retroposition, mobile elements, de novo gene origination.]
Trees [Figure: two trees with end nodes A to G. Labelled parts: end nodes, internal nodes, branches, roots.]
Phylogenetic trees Cladogram Phylogram The branch length represents the number of character changes Molecular clock
Phylogenetic trees A phylogenetic tree is a model of the evolutionary relationships between operational taxonomic units (OTUs), based on homologous characters. But not all trees are phylogenetic trees. Dendrogram: general term for a branching diagram. Cladogram: branching diagram without branch length estimates. Phylogram or phylogenetic tree: branching diagram with branch length estimates. Please note: Guide trees produced during multiple sequence alignment have no phylogenetic meaning: the dendrograms are based on distances derived from pairwise alignments; they are used to determine in what order sequences are aligned during the construction of the MSA.
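The difference between a phylogram and a cladogram can be illustrated with a small sketch. It assumes the trees are written in Newick notation, which the slides do not mention; the taxon names and branch lengths are made up, and `to_cladogram` is a hypothetical helper.

```python
import re

# Hypothetical phylogram in Newick notation: the number after each ':' is a branch length.
phylogram = "((A:0.10,B:0.20):0.05,(C:0.30,D:0.15):0.08,E:0.40);"


def to_cladogram(newick: str) -> str:
    """Drop all branch lengths and keep only the branching order."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


cladogram = to_cladogram(phylogram)
print(cladogram)  # ((A,B),(C,D),E);
```

Both strings describe the same topology; only the phylogram carries estimates of the amount of change along each branch.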
Rooted and unrooted trees Outgroup
Resolved (bifurcating) and unresolved (multifurcating) trees [Figure: two trees, each with end nodes A to G]
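As a rough sketch of what "bifurcating" and "multifurcating" mean, a rooted tree can be written as nested tuples, where a string is an end node and a tuple is an internal node. The two topologies below are invented for illustration and are not the trees shown on the slide.

```python
# A string is an end node (leaf); a tuple is an internal node with its children.
resolved = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))
unresolved = (("A", "B", "C"), ("D", "E"), ("F", "G"))  # contains multifurcations


def leaves(tree):
    """List the end nodes of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def is_bifurcating(tree):
    """True if every internal node has exactly two children."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)


print(leaves(resolved), is_bifurcating(resolved))
print(leaves(unresolved), is_bifurcating(unresolved))
```

Both trees contain the same seven OTUs, but only the first specifies a branching order for every group.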
Speciation and gene duplication A1 A1 B1 B1 Gene duplication C1 B2 Gene duplication A2 C B2 D C2 E D F
Relationships within homologs Frog gene 1 Orthologs Human gene 1 Mouse gene 1 Gene duplication Paralogs Mouse gene 2 Homologs Ancestral gene Human gene 2 Orthologs Frog gene 2 Drosophila gene
Relationships between orthologs and paralogs Frog gene 1 Orthologs (Group 1) Human gene 1 Mouse gene 1 Co-orthologs of the Drosophila gene Gene duplication Inparalogs of Group 2 Orthologs (Group 2) Mouse gene 2 Ancestral gene Human gene 2 Outparalogs of Group 1 Frog gene 2 Drosophila gene
Gene relationships Homologs = Genes of common origin Orthologs = 1. Genes resulting from a speciation event, 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event Paralogs = Genes resulting from gene duplication Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event One-to-many (1:n) orthologs: Orthologs of which at least one - and at most all but one - has undergone lineage-specific gene duplication subsequent to a particular speciation event Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs Xenologs = Orthologs derived by horizontal gene transfer from another lineage
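The first definitions above can be read as a rule: two homologous genes are orthologs if their last common ancestor in the gene tree is a speciation event, and paralogs if it is a gene duplication. The sketch below applies that rule to a gene tree that is my reading of the frog/human/mouse/Drosophila figure; the exact topology and the helper names are assumptions.

```python
# Internal nodes are (event, left, right); end nodes are gene names.
gene_tree = (
    "speciation",
    "Drosophila gene",
    ("duplication",
     ("speciation", "Frog gene 1",
      ("speciation", "Human gene 1", "Mouse gene 1")),
     ("speciation", "Frog gene 2",
      ("speciation", "Human gene 2", "Mouse gene 2"))),
)


def leaves(node):
    """Set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, a, b):
    """Classify two different genes by the event at their last common ancestor."""
    if isinstance(node, str) or not {a, b} <= leaves(node):
        raise ValueError("both genes must be below this node")
    event, left, right = node
    for child in (left, right):
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return relationship(child, a, b)  # a deeper common ancestor exists
    return "orthologs" if event == "speciation" else "paralogs"


print(relationship(gene_tree, "Human gene 1", "Mouse gene 1"))  # orthologs
print(relationship(gene_tree, "Human gene 1", "Mouse gene 2"))  # paralogs
```

The finer categories (inparalogs, outparalogs, co-orthologs) additionally depend on which speciation event is taken as the reference, which this sketch does not model.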
Sequence data of actin-related protein 2 >Species A - RecName: Full=Actin-related protein 2; MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR >Species B - RecName: Full=Actin-related protein 2; MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR …. Phylogenetic analysis – an approach I Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe
ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_C MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE *:* :* ******** *** *** . **::****::*: . *::::**:***:* ARP2_A AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR ARP2_B CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR ARP2_C ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR ARP2_D ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR ARP2_E ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR .. :*.:*::.***:**::*::::* ::**:** :*:.**. *:******:** ** ARP2_A EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG ARP2_B EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR ARP2_C EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR ARP2_D EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR ARP2_E EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR **: :.*** * *.. *:*:****:****** :***:********* **** . * **. ARP2_A RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL ARP2_B RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL ARP2_C RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL ARP2_D RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL ARP2_E RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL ***:**** *.***.*** .** **.:******* :******:.*::* : .*: ***** ARP2_A MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA ARP2_B SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH ARP2_C VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH ARP2_D VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH ARP2_E VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH .*******:*.:*.**:*.** ******:. * *::*: *. ***:*:* * :*. 
ARP2_A IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV ARP2_B IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV ARP2_C IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV ARP2_D IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV ARP2_E IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV ******::**.*******.*:***:::***:.: : :**:.** .* *. **:**** ARP2_A LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT ARP2_B LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA- ARP2_C LAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH ARP2_D LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR ARP2_E LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR **:: :.* :*::. *::* *: . : ** Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe Which sequence is likely to correspond to which species?
ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
Distance matrix Species are: Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe
Expected species tree for … Caenorhabditis briggsae Drosophila melanogaster Homo sapiens Mus musculus Schizosaccharomyces pombe
Phylogenetic analysis Data selection Data comparison Selection of a data model Selection of an evolutionary model Tree-building Tree evaluation
What data types can be used to infer phylogenies? Morphological characters Physiological characters Gene order Sequence data (nucleotide sequences, amino acid sequences) Mixed characters ….
Data selection To be considered: Input data must be homologous! Taxonomic range and taxonomic distribution (balance, avoid long branches, LB) Content of phylogenetic information Number of character states Size of the dataset etc
Data comparison To be considered: Prediction of characters that are derived from a common ancestor Choose a suitable alignment method Highly diverged sequences Domain/family predictions Structures
Alignment Pairwise alignment versus MSA MSA methods ClustalW (very fast) Muscle (very fast) MAFFT (fast) Probcons T-coffee … When to use which method and why?
Selection of a data model Characters to be selected for the analysis To be considered: Each position in the alignment should be homologous! Missing data (in some OTUs) Number of characters etc
Selection of a data model Common methods Gap removal GBLOCKS
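Gap removal, the simpler of the two methods, can be sketched as dropping every alignment column that contains a gap in at least one sequence. The fragments below are the first 20 columns of three sequences from the ARP2 alignment shown earlier; `remove_gap_columns` is a hypothetical helper, and Gblocks applies more elaborate criteria than this.

```python
# First 20 alignment columns of three sequences from the ARP2 example.
alignment = {
    "ARP2_A": "MESAP---IVLDNGTGFVKV",
    "ARP2_B": "MDSQGRKVIVVDNGTGFVKC",
    "ARP2_D": "MDSQGRKVVVCDNGTGFVKC",
}


def remove_gap_columns(aln, gap="-"):
    """Keep only the columns in which no sequence has a gap."""
    names = list(aln)
    columns = zip(*(aln[name] for name in names))
    kept = [col for col in columns if gap not in col]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


for name, seq in remove_gap_columns(alignment).items():
    print(name, seq)
```

Here the three columns in which ARP2_A has a gap are removed from all sequences, leaving 17 characters per OTU.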
Evolutionary models Phylogenetic tree-building presumes particular evolutionary models The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results
Evolutionary models Which aspects are to be considered? … … … … etc
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange … … … etc
http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif
Frequencies of aa exchange Substitution matrices Empirically derived from alignment datasets: PAM (Dayhoff, 1968); JTT (Jones, Taylor, Thornton, 1992); Gonnet et al. (1992); WAG (Whelan, Goldman, 2001); mtREV (Adachi, Hasegawa, 1996; specific for mitochondrial data). Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)
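The step from one matrix to a series such as PAM1 … PAM250 is repeated matrix multiplication: the probability matrix for n units of change is the one-unit matrix raised to the power n. The sketch below shows this with a toy two-state matrix, not a real 20 x 20 amino acid matrix; all numbers are invented.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(p, n):
    """Return p multiplied by itself n times (identity for n = 0)."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, p)
    return result


# Toy "PAM1-like" matrix: each state has a 1% chance of being replaced per step.
p1 = [[0.99, 0.01],
      [0.01, 0.99]]
p250 = matrix_power(p1, 250)
print(p250)
```

After 250 steps the rows are close to (0.5, 0.5): the starting state is almost forgotten, which is why high-numbered matrices suit distantly related sequences.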
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution … … etc Why?
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution GC content Differs between species (20-72%) Differs within a genome (isochores) Biased recombination-associated DNA repair Temperature
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution Exchangeability matrix can be built for a particular dataset JTT + F
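The "+ F" in JTT + F indicates that amino acid frequencies observed in the dataset are used in place of those that come with the matrix. A minimal sketch of that counting step follows, using two fragments from the ARP2 alignment shown earlier; `aa_frequencies` is a hypothetical helper and gaps are simply ignored.

```python
from collections import Counter


def aa_frequencies(sequences, gap="-"):
    """Observed amino acid frequencies over all sequences, ignoring gaps."""
    counts = Counter(res for seq in sequences for res in seq if res != gap)
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


fragments = ["MESAP---IVLDNGTGFVKV", "MDSQGRKVIVVDNGTGFVKC"]
for aa, freq in aa_frequencies(fragments).items():
    print(aa, round(freq, 3))
```

In a real analysis the frequencies would be taken from the full alignment, not from a 20-column fragment.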
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution Between-site rate variation or Among-site substitution rate heterogeneity
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution Between-site rate variation or Among-site substitution rate heterogeneity Variation in substitution rates among different positions Mostly discrete gamma model
The gamma distribution is a continuous probability distribution, described by a shape parameter (alpha) and a scaling factor. Infinitely large alpha value: the rate is the same for all sites (no rate variation). alpha = 1: extensive rate variation. alpha < 1: most sites evolve very slowly (many nearly invariable sites) while a few evolve fast. [Plot axes: probability density versus relative evolutionary rate] http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution Between-site rate variation or Among-site substitution rate heterogeneity Variation in substitution rates among different positions Mostly discrete gamma model Select the number of categories (4/8)
Evolutionary models Which aspects are to be considered? Frequencies of aa exchange Change of aa frequencies during evolution Between-site rate variation or Among-site substitution rate heterogeneity Presence of invariable sites
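To give a feel for what the categories of a discrete gamma model are, the sketch below approximates them by simulation: it draws rates from a gamma distribution with mean 1, sorts them, splits them into equally sized categories and takes the mean rate of each. Phylogenetics programs compute these values analytically; the Monte Carlo shortcut, the sample size and the function name are assumptions made to keep the example short.

```python
import random


def discrete_gamma_rates(alpha, categories=4, samples=200_000, seed=1):
    """Approximate mean relative rate per equal-probability category."""
    rng = random.Random(seed)
    # Shape alpha and scale 1/alpha give a distribution with mean 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(samples))
    size = samples // categories  # leftover draws, if any, are ignored
    return [sum(draws[i * size:(i + 1) * size]) / size for i in range(categories)]


for alpha in (0.5, 1.0, 10.0):
    print(alpha, [round(r, 3) for r in discrete_gamma_rates(alpha)])
```

With a small alpha the lowest category is close to zero and the highest is several times the average rate; with a large alpha all categories approach 1.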
Evolutionary models Notation, e.g. JTT; JTT + F; JTT + F + gamma (4); JTT + F + gamma (8) + I (under discussion); JTT + F + I. It is not always the most complex model that produces the best result. The more complex the model, the more complex the explanation of the results.
Evolutionary models Statistical selection of the best-fit model of evolution, e.g. with ProtTest. AIC (Akaike Information Criterion): a simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from the truth. BIC (Bayesian Information Criterion): also penalises the number of parameters to avoid overfitting of the selected model, with a penalty that grows with the size of the dataset.
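Both criteria can be written down in a few lines using their standard definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters, ln L the log-likelihood and n the sample size. The log-likelihoods and parameter counts below are invented, and taking n as the alignment length is an assumption.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical results: model name -> (log-likelihood, number of free parameters).
candidates = {"JTT": (-5230.0, 7), "JTT+G4": (-5180.0, 8)}
n_sites = 390  # hypothetical alignment length

best = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
for model, (lnl, k) in candidates.items():
    print(model, aic(lnl, k), round(bic(lnl, k, n_sites), 1))
print("best by BIC:", best)
```

In this made-up case the extra parameter of the gamma model is easily paid for by the gain in likelihood, so both criteria prefer it.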
Tree-building methods Distance (matrix) methods Calculate distances for all pairs of taxa based on the sequence alignment Construct a phylogenetic tree based on a distance matrix Character-based (Sequence) methods Construct a phylogenetic tree based on the sequence alignment
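The second step of a distance method, going from a distance matrix to a tree, can be sketched with UPGMA, chosen here only because it is short; the slides do not say which distance method is meant, and UPGMA assumes a molecular clock. The distances are invented and the function returns the topology only, without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by average distance; returns the topology as nested tuples."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for name in names}
    active = list(names)
    while len(active) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in active:
            if c != a and c != b:
                # Distance to the new cluster: size-weighted average.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        active = [c for c in active if c != a and c != b] + [merged]
    return active[0]


# Hypothetical pairwise distances between four OTUs.
names = ["A", "B", "C", "D"]
dist = {("A", "B"): 0.1, ("A", "C"): 0.4, ("A", "D"): 0.5,
        ("B", "C"): 0.4, ("B", "D"): 0.5, ("C", "D"): 0.2}
print(upgma(names, dist))  # (('A', 'B'), ('C', 'D'))
```

Character-based methods differ in that they score candidate trees directly against the alignment columns and never reduce the data to a single matrix of pairwise distances.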
Step 1: Compute distances Simple measure for the extent of sequence divergence: p distance: p̂ = nd / n, where p̂ = proportion of differing sites (p distance), nd = number of aa differences, n = number of aa compared
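The p distance is easy to compute from two aligned sequences. The sketch below uses the first 20 alignment columns of ARP2_A, ARP2_B and ARP2_D from the alignment shown earlier; skipping columns in which either sequence has a gap is an assumption about how gaps are handled, and `p_distance` is a hypothetical helper.

```python
def p_distance(seq1, seq2, gap="-"):
    """Proportion of differing residues among the columns without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != gap and y != gap]
    n_diff = sum(1 for x, y in pairs if x != y)
    return n_diff / len(pairs)


arp2_a = "MESAP---IVLDNGTGFVKV"
arp2_b = "MDSQGRKVIVVDNGTGFVKC"
arp2_d = "MDSQGRKVVVCDNGTGFVKC"

print(round(p_distance(arp2_a, arp2_b), 3))  # 5 differences in 17 columns
print(round(p_distance(arp2_b, arp2_d), 3))  # 2 differences in 20 columns
```

Computing this value for all pairs of OTUs fills the distance matrix used by distance methods.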