Class 9: Phylogenetic Trees
The Tree of Life D’après Ernst Haeckel, 1891
Evolution • Many theories of evolution • Basic idea: • speciation events lead to creation of different species • Speciation caused by physical separation into groups where different genetic variants become dominant • Any two species share a (possibly distant) common ancestor
Phylogenies • A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species • Leaves - current-day species • Internal nodes - hypothetical most recent common ancestors • Edge lengths - “time” from one speciation to the next [Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
Phylogenetic Tree • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2 [Figure labels: branch, internal node, leaf]
Example: Primate evolution [Figure: primate tree with divergence times of 20-25 mya, 35-37 mya, and 40-45 mya]
How to construct a Phylogeny? • Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria) • Since then, the focus has been on objective criteria for constructing phylogenetic trees • Thousands of articles in recent decades • Important for many aspects of biology • Classification (systematics) • Understanding biological mechanisms
Morphological vs. Molecular • Classical phylogenetic analysis: morphological features • number of legs, lengths of legs, etc. • Modern biological methods allow the use of molecular features • Gene sequences • Protein sequences • Analysis based on homologous sequences (e.g., globins) in different species
Dangers in Molecular Phylogenies • We have to remember that gene/protein sequences can be homologous for different reasons: • Orthologs -- sequences diverged after a speciation event • Paralogs -- sequences diverged after a duplication event • Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)
Dangers of Paralogs [Figure: tree marked with a gene duplication and speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]
Dangers of Paralogs • If we only consider 1A, 2B, and 3A... [Figure: the same tree, marked with a gene duplication and speciation events]
Types of Trees • A natural model to consider is that of rooted trees [Figure: rooted tree with the root labeled "Common Ancestor"]
Types of Trees • Depending on the model, data from current-day species does not distinguish between different placements of the root [Figure: two rooted trees that differ only in root placement]
Types of Trees • An unrooted tree represents the same phylogeny without the root node
Positioning Roots in Unrooted Trees • We can estimate the position of the root by introducing an outgroup: • a set of species that are definitely distant from all the species of interest [Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, the outgroup Falcon, and the proposed root]
Types of Data • Distance-based • Input is a matrix of distances between species • Can be the fraction of residues on which they disagree, or the negative of the alignment score between them, or … • Character-based • Examine each character (e.g., residue) separately
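As a minimal sketch of the first distance option (fraction of disagreeing residues), assuming the sequences are already aligned to equal length and contain no gaps. The function names `p_distance` and `distance_matrix` are hypothetical, chosen for illustration only.

```python
from itertools import combinations

def p_distance(s, t):
    """Fraction of aligned positions at which two sequences disagree."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)

def distance_matrix(seqs):
    """Map each unordered pair of species names to its p-distance.

    seqs: dict of species name -> aligned sequence (an assumed input format).
    """
    return {(a, b): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
```

A matrix in this form can serve as input to a distance-based method.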
Simple Distance-Based Method Input: distance matrix between species Outline: • Cluster species together • Initially clusters are singletons • At each iteration combine two “closest” clusters to get a new one
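UPGMA Clustering • Let $C_i$ and $C_j$ be clusters, and define the distance between them to be the average pairwise distance $d(C_i,C_j) = \frac{1}{|C_i||C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$ • When combining two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$: $d(C_k,C_l) = \frac{|C_i|\, d(C_i,C_l) + |C_j|\, d(C_j,C_l)}{|C_i| + |C_j|}$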
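The clustering outline can be sketched as follows. This is an illustrative sketch rather than a reference implementation: it assumes the standard UPGMA definitions (cluster distance is the average pairwise leaf distance, updated as a size-weighted average on merge) and places each new node at half the merge distance, which is only meaningful under a molecular clock. The input format and the name `upgma` are assumptions.

```python
from itertools import combinations

def upgma(labels, d):
    """Build a rooted tree by repeatedly merging the two closest clusters.

    labels: list of leaf names.
    d: dict {(i, j): distance} over leaf indices (assumed input format).
    Returns nested tuples (left_subtree, right_subtree, node_height).
    """
    trees = {i: lab for i, lab in enumerate(labels)}
    size = {i: 1 for i in trees}
    dist = {frozenset(pair): v for pair, v in d.items()}
    next_id = len(labels)
    while len(trees) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(combinations(trees, 2), key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((i, j))] / 2
        k = next_id
        next_id += 1
        # Distance from the merged cluster to every other cluster:
        # size-weighted average of the two old distances.
        for l in trees:
            if l not in (i, j):
                dist[frozenset((k, l))] = (
                    size[i] * dist[frozenset((i, l))]
                    + size[j] * dist[frozenset((j, l))]
                ) / (size[i] + size[j])
        trees[k] = (trees.pop(i), trees.pop(j), height)
        size[k] = size.pop(i) + size.pop(j)
    return trees[next_id - 1]
```

For example, `upgma(["A", "B", "C"], {(0, 1): 2, (0, 2): 6, (1, 2): 6})` first joins A and B, then attaches C.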
Molecular Clock • UPGMA implicitly assumes that all distances measure time in the same way [Figure: two trees on leaves 1, 2, 3, 4]
Additivity • A weaker requirement is additivity • In the “real” tree, distances between species are the sum of distances between intermediate nodes [Figure: leaves i and j and internal node k, with branches labeled a, b, c]
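Consequences of Additivity • Suppose input distances are additive • For any three leaves $i$, $j$, $m$ whose paths meet at an internal node $k$, with path lengths $a = d(i,k)$, $b = d(j,k)$, $c = d(k,m)$: $d(i,j) = a + b$, $d(i,m) = a + c$, $d(j,m) = b + c$ • Thus $d(k,m) = c = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ [Figure: leaves i, j, m joined at node k by branches a, b, c]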
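Neighbor Joining • Can we use this fact to construct trees? • Let $D(i,j) = d(i,j) - (r_i + r_j)$ where $r_i = \frac{1}{|L|-2} \sum_{m \in L} d(i,m)$ and $L$ is the current set of leaves • Theorem: if the distances are additive and $D(i,j)$ is minimal (among all pairs of leaves), then $i$ and $j$ are neighbors in the tree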
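Neighbor Joining • Set $L$ to contain all leaves • Iteration: • Choose $i,j$ such that $D(i,j)$ is minimal • Create new node $k$, and set $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ for every other $m \in L$; the new edges have lengths $d(i,k) = \frac{1}{2}\left(d(i,j) + r_i - r_j\right)$ and $d(j,k) = d(i,j) - d(i,k)$, with $r_i = \frac{1}{|L|-2} \sum_{m \in L} d(i,m)$ • Remove $i,j$ from $L$, and add $k$ • Termination: when $|L| = 2$, connect the two remaining nodes [Figure labels: i, j, k, m]
Distance Based Methods • If we make strong assumptions on distances, we can reconstruct trees • In real life, distances are not additive • Sometimes they are close to additive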
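The iteration can be sketched in code as below. This is a sketch under stated assumptions, not the lecture's own formulation: it uses the standard neighbor-joining criterion D(i,j) = d(i,j) - (r_i + r_j) with r_i equal to the sum of distances from i divided by |L| - 2, and the three-point formula for distances to the new node. The input format and the name `neighbor_joining` are hypothetical.

```python
from itertools import combinations

def neighbor_joining(n, d):
    """Reconstruct an unrooted tree from pairwise distances.

    n: number of leaves, numbered 0..n-1 (new nodes get ids n, n+1, ...).
    d: dict {(i, j): distance} over leaf pairs (assumed input format).
    Returns a list of edges (node, node, length).
    """
    dist = {frozenset(pair): v for pair, v in d.items()}
    L = list(range(n))
    edges = []
    next_id = n
    while len(L) > 2:
        # r[i]: scaled total distance from i to all other active nodes.
        r = {i: sum(dist[frozenset((i, m))] for m in L if m != i) / (len(L) - 2)
             for i in L}
        # Choose the pair minimizing D(i, j) = d(i, j) - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = dist[frozenset((i, j))]
        k = next_id
        next_id += 1
        # Distances from the new node k to every remaining node m.
        for m in L:
            if m not in (i, j):
                dist[frozenset((k, m))] = 0.5 * (
                    dist[frozenset((i, m))] + dist[frozenset((j, m))] - dij)
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, dik))
        edges.append((j, k, dij - dik))
        L = [m for m in L if m not in (i, j)] + [k]
    a, b = L
    edges.append((a, b, dist[frozenset((a, b))]))
    return edges
```

On additive input the sketch recovers the generating tree; for instance, distances derived from a four-leaf tree with pendant edges 1, 2, 3, 4 and an internal edge of length 5 return exactly those edge lengths.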
Character Based Methods • We start with a multiple alignment • Assumptions: • All sequences are homologous • Each position in alignment is homologous • Positions evolve independently • No gaps • We seek to explain the evolution of each position in the alignment
Parsimony • Character-based method • A way to score trees (but not to build trees!) Assumptions: • Independence of characters (no interactions) • Best tree is one where minimal changes take place
A Simple Example • What is the parsimony score of the tree? [Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
A Simple Example • Sequences as above: A: CAGGTA, B: CAGACA, C: CGGGTA, D: TGCACT, E: TGCGTA • Each column is scored separately. • Let’s look at the first column (C, C, C, T, T): • Minimal tree has one evolutionary change [Figure: tree for the first column with leaf and ancestral states C and T]
Evaluating Parsimony Scores • How do we compute the Parsimony score for a given tree? • Traditional Parsimony • Each base change has a cost of 1 • Weighted Parsimony • Each change is weighted by the score c(a,b)
Traditional Parsimony • Solved independently for each position • Linear time solution [Figure: example tree with leaves labeled a and g, and internal nodes annotated with the state sets {a} and {a,g}]
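One common way to obtain the linear-time solution is Fitch's set-based pass: each internal node takes the intersection of its children's state sets if it is non-empty, and otherwise takes the union and counts one change. The sketch below assumes a rooted binary tree given as nested tuples; the tree shape used in the usage note is an assumption, since the slide's figure is not recoverable, and the function names are hypothetical.

```python
def fitch(tree, column):
    """Return (state_set, changes) for one alignment column.

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    column: dict of leaf name -> character at this position.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the per-column change counts over all alignment positions."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[pos] for name, s in seqs.items()})[1]
               for pos in range(length))
```

With the five example sequences A to E and the assumed tree `((("A", "B"), "C"), ("D", "E"))`, the first column needs a single change, in line with the example.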
Evaluating Weighted Parsimony • Dynamic programming on the tree • $S(i,a)$ = cost of tree rooted at $i$ if $i$ is labeled by $a$ • Initialization: • For each leaf $i$ set $S(i,a) = 0$ if $i$ is labeled by $a$, otherwise $S(i,a) = \infty$ • Iteration: • if $k$ is a node with children $i$ and $j$, then $S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right)$ • Termination: • cost of tree is $\min_a S(r,a)$ where $r$ is the root
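The recurrence translates almost directly into a recursive function. The sketch below is a minimal version for a single alignment column, assuming a rooted binary tree given as nested tuples and a user-supplied cost function c(a, b); the names `weighted_cost_table`, `weighted_parsimony`, and `unit_cost` are hypothetical.

```python
import math

def weighted_cost_table(tree, column, alphabet, cost):
    """Return {a: S(node, a)} for the root of `tree`.

    tree: a leaf name (str) or a (left, right) tuple of subtrees.
    column: dict of leaf name -> observed character.
    cost: function c(a, b) giving the cost of a change between a and b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed label, infinity otherwise.
        return {a: 0 if a == column[tree] else math.inf for a in alphabet}
    left = weighted_cost_table(tree[0], column, alphabet, cost)
    right = weighted_cost_table(tree[1], column, alphabet, cost)
    # Iteration: best child label for each possible parent label.
    return {a: min(left[b] + cost(a, b) for b in alphabet)
               + min(right[b] + cost(a, b) for b in alphabet)
            for a in alphabet}

def weighted_parsimony(tree, column, alphabet, cost):
    """Termination: minimum over labels at the root."""
    return min(weighted_cost_table(tree, column, alphabet, cost).values())

def unit_cost(a, b):
    """Traditional parsimony as a special case: every change costs 1."""
    return 0 if a == b else 1
```

With `unit_cost` the result coincides with the traditional parsimony score, which gives a quick sanity check before trying non-uniform costs.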
Cost of Evaluating Parsimony • The score is evaluated on each position independently. Scores are then summed over all positions. • If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted recurrence minimizes over k child labels for each of the k parent labels, giving O(nmk^2) • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node
Maximum Parsimony
Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G
• How many possible unrooted trees?
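For N labeled leaves, the number of distinct unrooted bifurcating topologies is (2N-5)!! = 1 · 3 · 5 · … · (2N-5), so four species give three trees. A small sketch of the count (the function name is hypothetical):

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count
```

The count grows very quickly (already over two million for ten species), which is why exhaustive search over topologies becomes impractical.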
Maximum Parsimony • How many substitutions does each of the three trees require? • Site 1 (A, A, A, A): 0, 0, 0 • Site 2 (G, C, T, A): 3, 3, 3
Maximum Parsimony • Site 2: 1 - G, 2 - C, 3 - T, 4 - A • Each of the three trees requires 3 substitutions [Figure: the three trees with inferred ancestral states]
Maximum Parsimony • Site 3 (G, G, A, T): 2, 2, 2 • Site 4 (G, A, A, G): 2, 2, 1
Maximum Parsimony • Site 4: 1 - G, 2 - A, 3 - A, 4 - G • The three trees require 2, 2, and 1 substitutions [Figure: the three trees with inferred ancestral states]
Maximum Parsimony • Substitutions required per site on each of the three trees, with totals:
Site     1  2  3  4  5  6  7  8  9  10  Total
Tree 1   0  3  2  2  0  1  1  1  1  3   14
Tree 2   0  3  2  2  0  1  2  1  2  3   16
Tree 3   0  3  2  1  0  1  2  1  2  3   15
Maximum Parsimony • The most parsimonious tree requires 14 substitutions (per site: 0 3 2 2 0 1 1 1 1 3)
Searching for the Optimal Tree • Exhaustive Search • Very intensive • Branch and Bound • A compromise • Heuristic • Fast • Usually starts with NJ
Phylogenetic Tree Assumptions • Topology: bifurcating • Leaves - 1…N • Internal nodes - N+1…2N-2 • Lengths t = {t_i} for each branch • Phylogenetic tree = (Topology, Lengths) = (T,t) [Figure labels: branch, internal node, leaf]
Probabilistic Methods • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. • Background probabilities: q(a) • Mutation probabilities: P(a|b,t) • Models for evolutionary mutations • Jukes Cantor • Kimura 2-parameter model • Such models are used to derive the probabilities
Jukes Cantor model • A model for mutation rates • Mutation occurs at a constant rate • Each nucleotide is equally likely to mutate into any other nucleotide, with rate $\alpha$.
Kimura 2-parameter model • Allows a different rate for transitions and transversions.
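A sketch of the corresponding rate matrix, assuming the usual parameterization in which transitions (A↔G, C↔T) occur at rate alpha and each transversion at rate beta. The symbols `alpha` and `beta` and the function names are assumptions for illustration; the slide does not name them.

```python
PURINES = {"A", "G"}
BASES = "ACGT"

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and (a in PURINES) == (b in PURINES)

def kimura_rate_matrix(alpha, beta):
    """Return R as a dict {(a, b): rate}; each row sums to zero."""
    R = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                # One transition and two transversions leave each base.
                R[a, b] = -(alpha + 2 * beta)
            elif is_transition(a, b):
                R[a, b] = alpha
            else:
                R[a, b] = beta
    return R
```

Setting `alpha == beta` makes all off-diagonal rates equal, which is the Jukes Cantor case.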
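Mutation Probabilities • The rate matrix $R$ is used to derive the mutation probability matrix $S$: for a short time interval $\epsilon$, $S(\epsilon) \approx I + R\epsilon$ • $S(t)$ is obtained by integration: $S(t) = e^{Rt}$. For Jukes Cantor: $P(a|a,t) = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and $P(a|b,t) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$ for $a \neq b$ • $q$ can be obtained by letting $t$ go to infinity: $q(a) = \frac{1}{4}$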
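A small numerical sketch of the Jukes Cantor case, assuming the standard closed form in which a base is unchanged after time t with probability (1 + 3e^(-4αt))/4 and becomes any specific other base with probability (1 - e^(-4αt))/4. The function name and argument order are hypothetical.

```python
import math

def jukes_cantor(a, b, t, alpha):
    """P(a | b, t): probability of observing a, starting from b, after time t.

    alpha is the rate at which a base mutates into each specific other base.
    """
    decay = math.exp(-4 * alpha * t)
    if a == b:
        return 0.25 * (1 + 3 * decay)
    return 0.25 * (1 - decay)
```

At t = 0 the function returns 1 for a == b and 0 otherwise; for large t every entry approaches 1/4, matching the uniform background distribution q.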