Why do trees?
Phylogeny 101
• OTUs: operational taxonomic units (species, populations, individuals)
• Nodes: internal (often ancestors) or external (terminal; often living species or individuals)
• Branches: length scaled, or length unscaled (nominal, arbitrary)
• Outgroup: an OTU that is most distantly related to all the other OTUs in the study
Phylogeny 102
• Trees rooted: $N = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
• Trees unrooted: $N = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$

OTUs   # rooted trees   # unrooted trees
   2                1                  1
   3                3                  1
   4               15                  3
   5              105                 15
   6              945                105
   7            10395                945
   8           135135              10395
   9          2027025             135135
  10         34459425            2027025
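The two counting formulas can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the n = 2 case for unrooted trees is handled by hand because the formula itself starts at n = 3.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 OTUs."""
    if n == 2:
        return 1  # two OTUs joined by a single branch
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

# Print one row per number of OTUs: n, rooted count, unrooted count.
for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the unrooted count for n OTUs equals the rooted count for n - 1 OTUs, since rooting an unrooted tree amounts to attaching one extra branch.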
Trees: NJ
• Distance matrix
• UPGMA assumes a constant rate of evolution (molecular clock): don’t publish UPGMA trees
• Neighbor joining is very fast
• Often a “good enough” tree
• Embedded in ClustalW
• Use in publications only if too many taxa to compute with MP or ML
Distances from sequence
• Protdist/DNAdist
• Non-identical residues / total sequence length
• Correction for multiple hits is necessary because two identical residues may hide changes, e.g. C -> T -> C
• Jukes-Cantor assumes all substitutions are equally likely
• Kimura: transition rate ≠ transversion rate
• Ts usually > Tv
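As a rough illustration of the correction for multiple hits, the sketch below computes the uncorrected proportion of differing sites and then applies the standard Jukes-Cantor and Kimura two-parameter formulas. The formulas are the textbook ones, not taken from the slides; the function names are hypothetical, and the two sequences are OTU1 and OTU2 from the worked parsimony alignment. This is not how Protdist or DNAdist are implemented.

```python
from math import log, sqrt

PURINES = {"A", "G"}

def ts_tv_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites showing a transition / a transversion."""
    ts = tv = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts / compared, tv / compared

def jukes_cantor(p):
    """JC distance from the observed proportion p of differing sites."""
    return -0.75 * log(1 - 4 * p / 3)

def kimura_2p(P, Q):
    """K2P distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

P, Q = ts_tv_proportions("AAGAGTGCA", "AGCCGTGCG")
p = P + Q  # uncorrected distance: non-identical sites / sites compared
print(p, jukes_cantor(p), kimura_2p(P, Q))
```

Both corrected distances come out larger than the raw proportion, which is the point of the correction: some observed identities and single differences conceal more than one substitution.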
Trees: MP
• Maximum parsimony
• Minimum # mutations to construct tree
• Better than NJ (information is lost in the distance matrix) but much slower
• Sensitive to long-branch attraction
• No explicit evolutionary model
• Protpars refuses to estimate branch lengths
• Informative sites
Trees: ML
• Very CPU intensive
• Requires explicit model of evolution: rate and pattern of nucleotide substitution
• JC: Jukes/Cantor
• K2P: Kimura 2 parameter (transition/transversion)
• F81: Felsenstein (base composition bias)
• HKY85: merges K2P and F81
• Explicit model -> preferred statistically
• Assumes change more likely on long branch
• No long-branch attraction
• Wrong model -> wrong tree
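Models of sequence evolution: HKY85

        A       C       G       T
A       -       πC·b    πG·a    πT·b
C       πA·b    -       πG·b    πT·a
G       πA·a    πC·b    -       πT·b
T       πA·b    πC·a    πG·b    -

(rows: from; columns: to; π = base frequency, a = transition rate, b = transversion rate)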
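The matrix above can be built programmatically, which also makes the relationship between the models concrete. The sketch below is an assumption-laden illustration: the function name and the example frequencies are invented, and the diagonal is filled with the negative row sum, a common convention for rate matrices that the slide itself does not show.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky85_rate_matrix(pi, a, b):
    """Return q[from][to] for HKY85.

    pi: dict of base frequencies; a: transition rate; b: transversion rate.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                # A<->G and C<->T are transitions; everything else is a transversion.
                transition = (x in PURINES) == (y in PURINES)
                q[x][y] = pi[y] * (a if transition else b)
        # Diagonal: negative sum of the three off-diagonal entries (assumed convention).
        q[x][x] = -sum(q[x].values())
    return q

# Hypothetical, made-up base frequencies for illustration only.
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = hky85_rate_matrix(pi, a=2.0, b=1.0)
for x in BASES:
    print(x, [round(q[x][y], 3) for y in BASES])
```

Setting a equal to b gives F81; setting all four frequencies to 0.25 gives K2P; doing both gives Jukes-Cantor.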
Here we have a representative alignment. We want to determine the phylogenetic relationships among the OTUs:

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *

It is a good alignment, clearly aligning homologous sites without gaps.
There are 3 possible trees for 4 taxa (OTUs):

1       3      1       2      1       2
 \_____/        \_____/        \_____/
 /     \        /     \        /     \
2       4      3       4      4       3

Or, in bracket notation: (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3).

Aim: identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.
The identical sites 1, 6 and 8 are useless for phylogenetic purposes.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* **
Site 2 is also useless: OTU1’s A could be grouped with any of the Gs.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *
Site 4 is uninformative because every OTU has a different base, UNLESS transitions are weighted differently from transversions, in which case it favors (1,4)(2,3): A/G and C/T are the transition pairs.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *
For site 3, each tree can be made with (minimum) 2 mutations:

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *
(1,2)(3,4)

G     A        G     A        G     A
 \   /          \   /          \   /
  G---A          C---A          A---A
 /   \          /   \          /   \
C     A        C     A        C     A
(1,3)(2,4)

G     C        can do worse:   G     C
 \   /                          \   /
  A---A                          G---A
 /   \                          /   \
A     A                        A     A
(1,4)(2,3)

G     C
 \   /
  A---A
 /   \
A     A

So site 3 is (counterintuitively) NOT informative.
Site 5, however, is informative because one tree is shortest.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *
(1,2)(3,4)     (1,3)(2,4)     (1,4)(2,3)

G     A        G     G        G     G
 \   /          \   /          \   /
  G---A          A---A          G---G
 /   \          /   \          /   \
G     A        A     A        A     A
Likewise sites 7 and 9. By majority rule, the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
* * *
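The site-by-site reasoning of the last few slides can be reproduced with a short script. This is a minimal sketch, not Protpars: it uses the usual definition of a parsimony-informative site (at least two states, each present in at least two OTUs) and a Fitch-style count of the minimum number of changes on each of the three four-taxon trees. All function and variable names are hypothetical.

```python
ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}

# Each tree pairs the OTUs as (first, second)(third, fourth); values are row indices.
TREES = {
    "(1,2)(3,4)": (0, 1, 2, 3),
    "(1,3)(2,4)": (0, 2, 1, 3),
    "(1,4)(2,3)": (0, 3, 1, 2),
}

def is_informative(column):
    """True if at least two states each occur in at least two OTUs."""
    counts = {}
    for base in column:
        counts[base] = counts.get(base, 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2

def min_changes(a, b, c, d):
    """Minimum number of changes at one site on the tree (a,b)(c,d)."""
    changes = 0
    left = {a} & {b}
    if not left:          # the two left-hand tips disagree
        left = {a, b}
        changes += 1
    right = {c} & {d}
    if not right:         # the two right-hand tips disagree
        right = {c, d}
        changes += 1
    if not (left & right):  # a change is needed on the internal branch
        changes += 1
    return changes

columns = list(zip(*ALIGNMENT.values()))
informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
lengths = {
    name: sum(min_changes(*(col[i] for i in order)) for col in columns)
    for name, order in TREES.items()
}
best = min(lengths, key=lengths.get)
print("informative sites:", informative)
print("tree lengths:", lengths)
print("most parsimonious:", best)
```

Summing over all nine sites gives tree lengths of 10, 11 and 12 changes, so (1,2)(3,4) is the shortest tree; the uninformative sites add the same amount to every tree and do not affect the ranking.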
Protpars
infile:

8 370
BRU       MSQNSLRLVE DNSV-DKTKA LDAALSQIER
RLR       ---------- ---V-DKSKA LEAALSQIER
NGR       ---------- -MSD-DKSKA LAAALAQIEK
ECO       ---------- AIDE-NKQKA LAAALGQIEK
YPR       ---------M AIDE-NKQKA LAAALGQIEK
PSE       ---------- -MDD-NKKRA LAAALGQIER
TTH       ---------- -MEE-NKRKS LENALKTIEK
ACD       ---------- -MDEPGGKIE FSPAFMQIEG
Protpars
treefile:

(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
outfile:

One most parsimonious tree found:

                   +--ACD
           +-------7
           !       +--TTH
        +--6
        !  !    +-----PSE
        !  +----5
     +--3       !  +--YPR
     !  !       +--4
     !  !          +--ECO
  +--2  !
  !  !  +-------------NGR
--1  !
  !  +----------------RLR
  !
  +-------------------BRU

remember: this is an unrooted tree!

requires a total of 853.000
Clustalw

 ****** PHYLOGENETIC TREE MENU ******

    1. Input an alignment
    2. Exclude positions with gaps?        = ON
    3. Correct for multiple substitutions? = ON
    4. Draw tree now
    5. Bootstrap tree
    6. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu
ClustalW NJ
• (((ACD:0.28958,TTH:0.32705):0.03395,((BRU:0.07321,RLR:0.07032):0.11692,NGR:0.21168):0.02493):0.02092,(ECO:0.05022,YPR:0.05736):0.11997,PSE:0.15632);
• topologically the same as (((ACD,TTH),((BRU,RLR),NGR)),(ECO,YPR),PSE);
• and cf. Protpars: (((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
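Comparing Newick strings by eye is error-prone because the same unrooted tree can be written in many ways. One common approach is to list the splits (bipartitions of the taxa) that each internal branch defines and compare the two sets. The sketch below does that with a deliberately minimal hand-written parser; it assumes plain taxon names, optional branch lengths and no internal node labels, and the function names are hypothetical.

```python
import re

def clades(newick):
    """Return the leaf set under every parenthesised group; the last one is the whole tree."""
    # Drop branch lengths such as ":0.28958" and the trailing semicolon.
    text = re.sub(r":[0-9.eE+-]+", "", newick).strip().rstrip(";")
    stack, found, name = [], [], ""
    for ch in text:
        if ch in "(),":
            if name:                 # a taxon name just ended
                stack[-1].add(name)
                name = ""
            if ch == "(":
                stack.append(set())
            elif ch == ")":
                done = stack.pop()
                found.append(frozenset(done))
                if stack:            # pass the leaves up to the enclosing group
                    stack[-1] |= done
        elif not ch.isspace():
            name += ch
    return found

def splits(newick):
    """Non-trivial splits of an unrooted tree, each stored as the side lacking a reference taxon."""
    found = clades(newick)
    taxa = found[-1]
    reference = min(taxa)
    result = set()
    for clade in found:
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip single-leaf and whole-tree splits
            result.add(side)
    return result

nj = ("(((ACD:0.28958,TTH:0.32705):0.03395,((BRU:0.07321,RLR:0.07032):0.11692,"
      "NGR:0.21168):0.02493):0.02092,(ECO:0.05022,YPR:0.05736):0.11997,PSE:0.15632);")
nj_topology = "(((ACD,TTH),((BRU,RLR),NGR)),(ECO,YPR),PSE);"
protpars = "(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);"

print(splits(nj) == splits(nj_topology))
print(splits(nj) == splits(protpars))
```

On the three strings quoted above, all three yield the same five splits, so under this comparison the NJ and Protpars trees have the same unrooted topology and differ only in how they are written and where they are drawn as rooted.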
Dealing with CDSs
• More info in DNA than proteins
• Systematic 3rd position changes can confuse
• Use DNA directly only if evolutionary distance is short
• For distant relationships: blank 3rd positions
• Translate into protein to align
• then copy gaps back to DNA
• Use dnadist with weights to investigate rates
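The "translate, align, copy gaps back" and "blank 3rd positions" steps can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, the CDS is assumed to be in frame with exactly one codon per aligned residue (no trailing stop codon), the choice of N as the blanking character is arbitrary, and the toy sequences are made up.

```python
def copy_gaps_to_dna(aligned_protein, cds):
    """Thread an unaligned CDS onto its aligned protein.

    Each residue takes the next codon; each protein gap becomes '---'.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for r in aligned_protein if r != "-")
    if residues != len(codons) or len(cds) % 3 != 0:
        raise ValueError("CDS length does not match the protein sequence")
    out, k = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)

def blank_third_positions(aligned_dna, blank="N"):
    """Replace every third codon position with `blank`, leaving gaps untouched."""
    return "".join(
        blank if i % 3 == 2 and base != "-" else base
        for i, base in enumerate(aligned_dna)
    )

# Toy example: protein "MK" aligned as "M-K", encoded by ATG AAA.
aligned = copy_gaps_to_dna("M-K", "ATGAAA")
print(aligned)
print(blank_third_positions(aligned))
```

Aligning at the protein level and threading the codons back keeps gaps in multiples of three, so codon positions stay in register across the whole DNA alignment.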
Trees: general guidelines, NOT rules
• More data is better
• Excellent alignment = few informative sites
• Exclude unreliable data: toss all gaps?
• Use seqs/sites evolving at appropriate rate
• Phylip DISTANCE
• 3rd positions saturated
• 2nd positions invariant
• Fast evolving seqs for closely related taxa
• Eliminate transition - homoplasy
Trees
• Beware base composition bias in unrelated taxa
• Are sites (hairpins?) independent?
• Are substitution rates equal across the dataset?
• Long branches are prone to error: remove them?