
Phylogenetic trees

Phylogenetic trees. School B&I TCD Bioinformatics May 2010. Why do trees?. Phylogeny 101. OTUs operational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal, often living species, individuals)

yaron

Presentation Transcript


  1. Phylogenetic trees School B&I TCD Bioinformatics May 2010

  2. Why do trees?

  3. Phylogeny 101 • OTUs (operational taxonomic units): species, populations, individuals • Internal nodes: often ancestors • External nodes: terminal, often living species or individuals • Scaled branches: length proportional to evolutionary distance • Unscaled branches: length nominal, arbitrary • Outgroup: an OTU that is most distantly related to all the other OTUs in the study • Choose the outgroup carefully

  4. Phylogeny 102 • Rooted trees: $N = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$ • Unrooted trees: $N = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

     OTUs   #rooted trees   #unrooted trees
        2               1                 1
        3               3                 1
        4              15                 3
        5             105                15
        6             945               105
        7           10395               945
        8          135135             10395
        9         2027025            135135
       10        34459425           2027025
       20      8.2×10^21         2.2×10^20
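The two formulas on this slide can be checked directly. The following is a minimal sketch in Python; the function names `count_rooted` and `count_unrooted` are illustrative and do not come from the presentation.

```python
from math import factorial

def count_rooted(n):
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def count_unrooted(n):
    """Number of unrooted bifurcating trees for n OTUs (n >= 2)."""
    if n < 3:
        return 1  # the formula needs n >= 3; two OTUs give a single tree
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

# Print a few rows of the table: OTUs, rooted count, unrooted count
for n in (4, 6, 10, 20):
    print(n, count_rooted(n), count_unrooted(n))
```

Note that the number of unrooted trees for n OTUs equals the number of rooted trees for n - 1 OTUs, which is why the two columns of the table are offset by one row.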

  5. Four key aspects of a tree • Topology • Root • Branch lengths • Confidence (figure: a basic tree of four OTUs, A to D, redrawn to illustrate each aspect; confidence is shown as node values such as 78 and 100)

  6. Distances from sequence • Use Phylip Protdist or DNAdist • D = non-identical residues / total sequence length • Correction for multiple hits is necessary because repeated substitutions at one site are hidden from a simple count • Jukes-Cantor assumes all substitutions are equally likely • Kimura: transition rate ≠ transversion rate • Ts usually > Tv
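As a rough illustration of these distance measures, the sketch below computes the uncorrected distance D, the Jukes-Cantor correction and the Kimura two-parameter correction for a pair of aligned DNA sequences. It uses the standard textbook formulas; it is not a reimplementation of Phylip's DNAdist, and the function names are hypothetical. The two example sequences are OTU1 and OTU2 from the alignment used later in the deck.

```python
import math

PURINES = {"A", "G"}

def p_distance(seq_a, seq_b):
    """Uncorrected distance: differing sites / alignment length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor correction; undefined once p reaches 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for Jukes-Cantor")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def kimura_2p(seq_a, seq_b):
    """Kimura two-parameter distance: transitions and transversions counted separately."""
    ts = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion
        if (a in PURINES) == (b in PURINES):
            ts += 1
        else:
            tv += 1
    P, Q = ts / len(seq_a), tv / len(seq_a)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

otu1, otu2 = "AAGAGTGCA", "AGCCGTGCG"
p = p_distance(otu1, otu2)
print(p, jukes_cantor(p), kimura_2p(otu1, otu2))
```

Both corrected distances come out larger than the raw proportion of differences, which is the point of correcting for multiple hits.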

  7. Methods • Distance matrix: UPGMA, neighbour joining (NJ) • Maximum parsimony (MP): the tree requiring the fewest changes • Maximum likelihood (ML): the most likely tree • Bayesian: sort of ML; samples a large number of “pretty good” trees

  8. Trees NJ • Distance matrix • Neighbor joining is very fast • Often a “good enough” tree • Embedded in ClustalW
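To make the neighbour-joining step concrete, here is a minimal sketch of the textbook algorithm that returns only the topology (no branch lengths). It is not ClustalW's code, and the five-taxon distance matrix is made-up illustrative data rather than anything from the presentation.

```python
def neighbor_joining(names, dist):
    """Return an unrooted topology as nested tuples (branch lengths omitted)."""
    nodes = list(names)
    d = [row[:] for row in dist]  # work on a copy of the matrix
    while len(nodes) > 3:
        n = len(nodes)
        totals = [sum(row) for row in d]
        # Pick the pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the new internal node to each remaining node
        new_row = [(d[i][k] + d[j][k] - d[i][j]) / 2.0 for k in keep]
        joined = (nodes[i], nodes[j])
        # Rebuild the matrix without i and j, then append the new node
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [joined]
    return tuple(nodes)  # the last three nodes meet at the central trifurcation

# Hypothetical additive distances for taxa a..e (illustrative only)
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(names, matrix)
print(tree)
```

Each pass joins one pair and shrinks the matrix by one row, so the whole tree needs only n - 3 joins; this is why the method is so much faster than searching tree space.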

  9. Trees MP • Maximum parsimony • Minimum # mutations to construct tree • Better than NJ – information lost in distance matrix – but much slower • Sensitive to long-branch attraction • Long branches clustered together • No explicit evolutionary model • Protpars refuses to estimate branch lengths • Informative sites

  10. Long-branch attraction • Rodents evolve faster than primates (figure: the true tree and the false “LBA” tree for MusHBA, MusHBB, HumHBA and HumHBB)

  11. Maximum parsimony. Here we have a representative alignment. We want to determine the phylogenetic relationships among the OTUs.

      Site:  1 2 3 4 5 6 7 8 9
      OTU1   A A G A G T G C A
      OTU2   A G C C G T G C G
      OTU3   A G A T A T C C A
      OTU4   A G A G A T C C G
             *         *   *

      It is a good alignment, clearly aligning homologous sites without gaps.

  12. There are 3 possible trees for 4 taxa (OTUs):

      1     3        1     2        1     2
       \___/          \___/          \___/
       /   \          /   \          /   \
      2     4        3     4        4     3

      Or (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3). Aim to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.

  13. The identical sites 1, 6 and 8 are useless for phylogenetic purposes. (Alignment as on slide 11.)

  14. Site 2 is also useless: OTU1’s A could be grouped with any of the Gs. (Alignment as on slide 11.)

  15. Site 4 is uninformative, as every OTU has a different base, UNLESS transitions are weighted, in which case it favours (1,4)(2,3). (Alignment as on slide 11.)

  16. For site 3 each tree can be made with a minimum of 2 mutations. (Alignment as on slide 11.)

  17. (1,2)(3,4)

      G       A      G       A      G       A
       \     /        \     /        \     /
        G---A          C---A          A---A
       /     \        /     \        /     \
      C       A      C       A      C       A

  18. (1,3)(2,4)

      G       C                      G       C
       \     /                        \     /
        A---A       can do worse:      G---A
       /     \                        /     \
      A       A                      A       A

  19. (1,4)(2,3)

      G       C
       \     /
        A---A
       /     \
      A       A

      So site 3 is (counterintuitively) NOT informative.

  20. Site 5, however, is informative because one tree is shortest. (Alignment as on slide 11.)

  21. Site 5 on each of the three trees:

      (1,2)(3,4)       (1,3)(2,4)       (1,4)(2,3)
      G       A        G       G        G       G
       \     /          \     /          \     /
        G---A            A---A            G---G
       /     \          /     \          /     \
      G       A        A       A        A       A

  22. Sites 7 and 9 are likewise informative. By majority rule the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites (sites 5 and 7; site 9 favours (1,3)(2,4)). (Alignment as on slide 11.)
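The worked example in slides 11 to 22 can be reproduced in a few lines. The sketch below finds the informative sites (those with at least two states that each occur in at least two OTUs) and then counts the minimum number of changes each of the three trees needs, using Fitch's set rule. The helper names are illustrative, and the code handles only the four-OTU case shown in the slides.

```python
from collections import Counter

alignment = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}
seqs = list(alignment.values())

def informative_sites(seqs):
    """1-based sites with two or more states that each occur in at least two OTUs."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

def fitch_pair(x, y):
    """Fitch's rule: intersection at no cost, otherwise union at a cost of one change."""
    common = x & y
    return (common, 0) if common else (x | y, 1)

def quartet_steps(column, split):
    """Minimum changes for one site on the tree (a,b)(c,d), given 0-based OTU indices."""
    (a, b), (c, d) = split
    left, cost_left = fitch_pair({column[a]}, {column[b]})
    right, cost_right = fitch_pair({column[c]}, {column[d]})
    _, cost_middle = fitch_pair(left, right)
    return cost_left + cost_right + cost_middle

splits = {
    "(1,2)(3,4)": ((0, 1), (2, 3)),
    "(1,3)(2,4)": ((0, 2), (1, 3)),
    "(1,4)(2,3)": ((0, 3), (1, 2)),
}
scores = {
    name: sum(quartet_steps(col, split) for col in zip(*seqs))
    for name, split in splits.items()
}
print(informative_sites(seqs))  # sites 5, 7 and 9
print(scores)                   # (1,2)(3,4) needs the fewest changes
```

Summed over all nine sites the trees need 10, 11 and 12 changes respectively, so (1,2)(3,4) is the most parsimonious, in agreement with the majority-rule argument on the slide.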

  23. Protpars infile:

          8 370
      BRU       MSQNSLRLVE DNSV-DKTKA LDAALSQIER
      RLR       ---------- ---V-DKSKA LEAALSQIER
      NGR       ---------- -MSD-DKSKA LAAALAQIEK
      ECO       ---------- AIDE-NKQKA LAAALGQIEK
      YPR       ---------M AIDE-NKQKA LAAALGQIEK
      PSE       ---------- -MDD-NKKRA LAAALGQIER
      TTH       ---------- -MEE-NKRKS LENALKTIEK
      ACD       ---------- -MDEPGGKIE FSPAFMQIEG

  24. Protpars • treefile:(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);

  25. outfile: One most parsimonious tree found:

                      +-ACD
              +-------7
              !       +-TTH
            +-6
            ! !    +----PSE
            ! +----5
          +-3      ! +-YPR
          ! !      +-4
          ! !        +-ECO
        +-2 !
        ! ! +-------------NGR
      --1 !
        ! +----------------RLR
        !
        +-------------------BRU

      remember: this is an unrooted tree!
      requires a total of 853.000 steps

  26. Clustalw

      ****** PHYLOGENETIC TREE MENU ******
      1. Input an alignment
      2. Exclude positions with gaps? = ON
      3. Correct for multiple substitutions? = ON
      4. Draw tree now
      5. Bootstrap tree
      6. Output format options
      S. Execute a system command
      H. HELP
      or press [RETURN] to go back to main menu

  27. Trees General guidelines – NOT rules • More data is better • Excellent alignment = few informative sites • Exclude unreliable data – toss all gaps? • Use seqs/sites evolving at appropriate rate • Phylip DISTANCE • 3rd positions saturated • 2nd positions invariant • Fast evolving seqs for closely related taxa • Eliminate transition - homoplasy

  28. Trees • Beware base composition bias in unrelated taxa • Are sites (hairpins?) independent? • Are substitution rates equal across dataset? • Long branches prone to error – remove them? • Choose outgroup carefully

  29. Bootstrapping

  30. Bootstrapping • Random re-sampling of the data • with replacement • The MSA stays the same • Each column of aligned residues in the MSA is a “site”. • The sites are what is re-sampled.

  31. Bootstrap 2 • Having resampled the data • to get a new dataset/alignment • based on the original • the same length • Redraw the tree from that dataset • For each node • ask is this node retained in the resampled data. • Re-iterate 100, 1000 or 10,000 times
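The resampling procedure described on these two slides can be sketched as follows, using the four-OTU alignment from the parsimony example. Two simplifications are assumed here: the tree for each replicate is chosen by counting the columns that support each of the three possible four-taxon trees (for four OTUs this picks the same tree as parsimony), and tied replicates are counted as not supporting any tree. All function names are hypothetical.

```python
import random

alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

def resample_columns(seqs, rng):
    """Draw as many columns as the original has, at random and with replacement."""
    length = len(seqs[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in seqs]

def supported_split(seqs):
    """Return the four-taxon tree backed by the most columns, or None on a tie."""
    votes = {"(1,2)(3,4)": 0, "(1,3)(2,4)": 0, "(1,4)(2,3)": 0}
    for a, b, c, d in zip(*seqs):
        if a == b and c == d and a != c:
            votes["(1,2)(3,4)"] += 1
        elif a == c and b == d and a != b:
            votes["(1,3)(2,4)"] += 1
        elif a == d and b == c and a != b:
            votes["(1,4)(2,3)"] += 1
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    if ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def bootstrap_support(seqs, split, replicates=1000, seed=0):
    """Percentage of resampled datasets whose best tree is the given one."""
    rng = random.Random(seed)
    hits = sum(
        supported_split(resample_columns(seqs, rng)) == split
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates

print(supported_split(alignment))
print(bootstrap_support(alignment, "(1,2)(3,4)"))
```

With only three informative sites the support value is expected to be modest: a replicate that happens to miss sites 5 and 7, or to draw site 9 several times, will not recover the original tree.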

  32. Bootstrap dataset: 4 OTUs and 9 “sites”. (Alignment as on slide 11.)

  33. What do the little numbers mean?

  34. Why does it work? • The tree based on the real data is the best tree – the best estimate of what happened in evolution. • If a node is based on many bits of info then some of these will be resampled • If the node is based on a single site then it is unlikely to be resampled so we are less confident in that node.
