Phylogenetic trees School B&I TCD Bioinformatics May 2010
Phylogeny 101
• OTUs (operational taxonomic units): species, populations, individuals
• Internal nodes: often ancestors
• External (terminal) nodes: often living species or individuals
• Scaled branches: length proportional to evolutionary distance
• Unscaled branches: length nominal, arbitrary
• Outgroup: an OTU that is more distantly related to each of the other OTUs in the study than they are to one another
• Choose outgroup carefully
Phylogeny 102
• Rooted trees: $N_R = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Unrooted trees: $N_U = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$

OTUs   # rooted trees   # unrooted trees
2      1                1
3      3                1
4      15               3
5      105              15
6      945              105
7      10395            945
8      135135           10395
9      2027025          135135
10     34459425         2027025
20     8.2 × 10^21      2.2 × 10^20
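The two counting formulas can be checked directly. The following is a minimal Python sketch; the function names `rooted_trees` and `unrooted_trees` are illustrative and not part of any package mentioned in the slides, and only strictly bifurcating trees are counted.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Rooted bifurcating trees for n OTUs: (2n-3)! / (2^(n-2) * (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees for n OTUs: (2n-5)! / (2^(n-3) * (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    if n == 2:
        # The formula is undefined for n = 2; two OTUs can only be joined one way.
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 20):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the number of unrooted trees for n OTUs equals the number of rooted trees for n-1 OTUs, which is why each column is the other shifted by one row.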
Four key aspects of a tree: topology, root, branch lengths, confidence
[Figure: a basic four-OTU tree (A, B, C, D) redrawn to illustrate each aspect; the confidence panel labels nodes with support values 78 and 100]
Distances from sequence
• Use Phylip Protdist or DNAdist
• D = non-identical residues / total sequence length
• Correction for multiple hits is necessary because a site can change more than once, so the observed D underestimates the true number of substitutions
• Jukes-Cantor: assumes all substitutions are equally likely
• Kimura: transition rate ≠ transversion rate
• Ts usually > Tv
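As a rough illustration of these corrections, here is a Python sketch of the uncorrected distance D, the Jukes-Cantor correction and the Kimura two-parameter correction for DNA. It is not the Phylip DNAdist implementation; the function names are hypothetical, and the sequences are assumed to be ungapped, equal-length strings over A, C, G, T.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_distance(s1: str, s2: str) -> float:
    """Uncorrected D: non-identical residues / total sequence length."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance, d = -3/4 ln(1 - 4p/3); all substitutions equally likely."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def kimura_2p(s1: str, s2: str) -> float:
    """Kimura two-parameter distance: transitions (P) and transversions (Q) counted separately."""
    n = len(s1)
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    P, Q = transitions / n, transversions / n
    return -0.5 * log((1.0 - 2.0 * P - Q) * sqrt(1.0 - 2.0 * Q))


if __name__ == "__main__":
    a, b = "AAGAGTGCA", "AGCCGTGCG"
    p = p_distance(a, b)
    print(p, jukes_cantor(p), kimura_2p(a, b))
```

Both corrected distances are larger than the raw D, and the gap widens as sequences diverge. The logarithms become undefined once the sequences are saturated (for Jukes-Cantor, when D reaches 0.75).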
Methods
• Distance matrix
  • UPGMA
  • Neighbour joining (NJ)
• Maximum parsimony (MP)
  • Tree requiring fewest changes
• Maximum likelihood (ML)
  • Most likely tree
• Bayesian: sort of ML
  • Samples large number of “pretty good” trees
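To make the distance-matrix idea concrete, below is a small Python sketch of UPGMA, the simplest of the listed methods. It is only an illustration: the function names are hypothetical, the distances are raw difference counts rather than corrected distances, branch lengths are not computed, and the four sequences are the nine-site example alignment used in the parsimony slides.

```python
from itertools import combinations

ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}


def difference_count(s1: str, s2: str) -> int:
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))


def upgma(names, distances):
    """Cluster OTUs by UPGMA and return the topology as nested tuples.

    distances maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(distances)
    sizes = {name: 1 for name in names}  # cluster -> number of OTUs it contains
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    names = list(ALIGNMENT)
    matrix = {
        frozenset((x, y)): difference_count(ALIGNMENT[x], ALIGNMENT[y])
        for x, y in combinations(names, 2)
    }
    print(upgma(names, matrix))
```

UPGMA assumes a constant rate of evolution across lineages, which is one reason neighbour joining is usually preferred among the distance methods.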
Trees NJ
• Distance matrix
• Neighbor joining is very fast
• Often a “good enough” tree
• Embedded in ClustalW
Trees MP • Maximum parsimony • Minimum # mutations to construct tree • Better than NJ – information lost in distance matrix – but much slower • Sensitive to long-branch attraction • Long branches clustered together • No explicit evolutionary model • Protpars refuses to estimate branch lengths • Informative sites
Long-branch attraction
• Rodents evolve faster than primates
[Figure: the true tree and the false “LBA” tree for the four sequences MusHBA, MusHBB, HumHBA and HumHBB]
Maximum parsimony
Here we have a representative alignment. We want to determine the phylogenetic relationships among the OTUs.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *

It is a good alignment, clearly aligning homologous sites without gaps.
There are 3 possible trees for 4 taxa (OTUs):

1       3      1       2      1       2
 \_____/        \_____/        \_____/
 /     \        /     \        /     \
2       4      3       4      4       3

Or (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3).

The aim is to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.
The identical sites 1, 6 and 8 are useless for phylogenetic purposes.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
Site 2 is also useless: OTU1’s A could be grouped with any of the Gs.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
Site 4 is uninformative because every OTU has a different base, UNLESS transitions are weighted, in which case it favours (1,4)(2,3): A↔G (OTU1, OTU4) and C↔T (OTU2, OTU3) are both transitions.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
For site 3, each tree can be made with a minimum of 2 mutations:

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
(1,2)(3,4)

G       A      G       A      G       A
 \     /        \     /        \     /
  G---A          C---A          A---A
 /     \        /     \        /     \
C       A      C       A      C       A
(1,3)(2,4)              can do worse:

G       C               G       C
 \     /                 \     /
  A---A                   G---A
 /     \                 /     \
A       A               A       A
2 mutations             3 mutations
(1,4)(2,3)

G       C
 \     /
  A---A
 /     \
A       A

So site 3 is (counterintuitively) NOT informative.
Site 5, however, is informative because one tree is shorter than the other two.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
(1,2)(3,4)     (1,3)(2,4)     (1,4)(2,3)

G       A      G       G      G       G
 \     /        \     /        \     /
  G---A          A---A          G---G
 /     \        /     \        /     \
G       A      A       A      A       A
1 mutation     2 mutations    2 mutations
Sites 7 and 9 are likewise informative. By majority rule the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites (sites 5 and 7; site 9 favours (1,3)(2,4)).

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
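The site-by-site reasoning above can be reproduced with a short Python sketch that counts the minimum number of changes per site on each of the three four-OTU trees, using Fitch-style set intersection and union. The names `TREES` and `min_changes` are illustrative, and all changes are weighted equally (no transition/transversion weighting).

```python
ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}

# Each unrooted four-OTU tree is described by the two pairs it groups together.
TREES = {
    "(1,2)(3,4)": (("OTU1", "OTU2"), ("OTU3", "OTU4")),
    "(1,3)(2,4)": (("OTU1", "OTU3"), ("OTU2", "OTU4")),
    "(1,4)(2,3)": (("OTU1", "OTU4"), ("OTU2", "OTU3")),
}


def min_changes(tree, site):
    """Minimum number of mutations needed to explain one site on one tree."""
    changes = 0
    internal_sets = []
    for x, y in tree:
        shared = {site[x]} & {site[y]}
        if shared:
            internal_sets.append(shared)
        else:
            # The two tips disagree: one change is needed on this side.
            internal_sets.append({site[x], site[y]})
            changes += 1
    if not (internal_sets[0] & internal_sets[1]):
        # The two internal nodes cannot share a state: one more change.
        changes += 1
    return changes


n_sites = len(ALIGNMENT["OTU1"])
totals = {name: 0 for name in TREES}
informative = []
for i in range(n_sites):
    site = {otu: seq[i] for otu, seq in ALIGNMENT.items()}
    scores = {name: min_changes(tree, site) for name, tree in TREES.items()}
    for name, score in scores.items():
        totals[name] += score
    if len(set(scores.values())) > 1:
        informative.append(i + 1)  # report 1-based site numbers

best = min(totals, key=totals.get)
print("informative sites:", informative)
print("tree lengths:", totals)
print("most parsimonious tree:", best)
```

A site is reported as informative only when the three trees need different numbers of changes, which is the definition used in the worked example.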
Protpars infile (8 sequences, 370 sites; only the first 30 columns are shown):

    8   370
BRU        MSQNSLRLVE DNSV-DKTKA LDAALSQIER
RLR        ---------- ---V-DKSKA LEAALSQIER
NGR        ---------- -MSD-DKSKA LAAALAQIEK
ECO        ---------- AIDE-NKQKA LAAALGQIEK
YPR        ---------M AIDE-NKQKA LAAALGQIEK
PSE        ---------- -MDD-NKKRA LAAALGQIER
TTH        ---------- -MEE-NKRKS LENALKTIEK
ACD        ---------- -MDEPGGKIE FSPAFMQIEG
Protpars • treefile:(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
outfile:

One most parsimonious tree found:

                   +--ACD
           +-------7
           !       +--TTH
        +--6
        !  !    +-----PSE
        !  +----5
     +--3       !  +--YPR
     !  !       +--4
     !  !          +--ECO
  +--2  !
  !  !  +-------------NGR
--1  !
  !  +----------------RLR
  !
  +-------------------BRU

  remember: this is an unrooted tree!

requires a total of    853.000 steps
ClustalW

****** PHYLOGENETIC TREE MENU ******

1. Input an alignment
2. Exclude positions with gaps?        = ON
3. Correct for multiple substitutions? = ON
4. Draw tree now
5. Bootstrap tree
6. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Trees General guidelines – NOT rules • More data is better • Excellent alignment = few informative sites • Exclude unreliable data – toss all gaps? • Use seqs/sites evolving at appropriate rate • Phylip DISTANCE • 3rd positions saturated • 2nd positions invariant • Fast evolving seqs for closely related taxa • Eliminate transition - homoplasy
Trees • Beware base composition bias in unrelated taxa • Are sites (hairpins?) independent? • Are substitution rates equal across dataset? • Long branches prone to error – remove them? • Choose outgroup carefully
Bootstrapping • Random re-sampling of the data • with replacement • The MSA stays the same • Each column of aligned residues in the MSA is a “site”. • The sites are what is re-sampled.
Bootstrap 2 • Having resampled the data • to get a new dataset/alignment • based on the original • the same length • Redraw the tree from that dataset • For each node • ask is this node retained in the resampled data. • Re-iterate 100, 1000 or 10,000 times
Bootstrap dataset
4 OTUs and 9 “sites”

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
      *         *   *
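The resampling procedure can be sketched in Python on this four-OTU dataset. This is an illustration under simplifying assumptions, not how Phylip or ClustalW implement bootstrapping: the function names are hypothetical, the "tree" for each replicate is simply the four-OTU topology supported by the most informative sites (equivalent to the parsimony choice for four OTUs), and replicates with a tie are counted as not supporting any topology.

```python
import random

ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}


def resample_columns(alignment, rng):
    """Draw sites with replacement to build a new alignment of the same length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def best_quartet(alignment):
    """Return the topology backed by the most informative sites, or None on a tie."""
    votes = {"(1,2)(3,4)": 0, "(1,3)(2,4)": 0, "(1,4)(2,3)": 0}
    columns = zip(*(alignment[k] for k in ("OTU1", "OTU2", "OTU3", "OTU4")))
    for a, b, c, d in columns:
        if a == b != c == d:
            votes["(1,2)(3,4)"] += 1
        elif a == c != b == d:
            votes["(1,3)(2,4)"] += 1
        elif a == d != b == c:
            votes["(1,4)(2,3)"] += 1
    top = max(votes.values())
    winners = [name for name, count in votes.items() if count == top]
    return winners[0] if len(winners) == 1 else None


def bootstrap_support(alignment, topology, replicates=1000, seed=0):
    """Fraction of bootstrap replicates whose best tree matches the given topology."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    hits = sum(
        best_quartet(resample_columns(alignment, rng)) == topology
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    original = best_quartet(ALIGNMENT)
    print("tree from the real data:", original)
    print("bootstrap support:", bootstrap_support(ALIGNMENT, original))
```

With only three informative sites, many replicates will miss one or more of them, so the support value for this toy dataset should not be expected to be high.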
Why does it work? • The tree based on the real data is the best tree – the best estimate of what happened in evolution. • If a node is based on many bits of info then some of these will be resampled • If the node is based on a single site then it is unlikely to be resampled so we are less confident in that node.