Phylogenetic Trees Lecture 2. Based on: Durbin et al 7.4; Gusfield 17. Character-based methods for constructing phylogenies. In this approach, trees are constructed by comparing the characters of the corresponding species.
Phylogenetic Trees, Lecture 2
Based on: Durbin et al. 7.4; Gusfield 17.
Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (teeth structures) or molecular (homologous DNA sequences). One common approach is Maximum Parsimony.
Assumptions:
• Independence of characters (no interactions)
• Best tree is one where minimal changes take place
1. Maximum Parsimony
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): pick a tree that has a minimum total number of substitutions of symbols between species and their ancestors in the phylogenetic tree.
[Figure: a tree with ancestral nodes labeled AAA and leaves GGA, AGA, AAG, AAA; the edges carry 2, 1, and 1 substitutions; total # of substitutions = 4]
Example Continued
There are many trees possible. For example:
[Figure: two trees over the leaves AAA, AAG, AGA, GGA with labeled ancestors. Left tree: total # substitutions = 3. Right tree: total # substitutions = 4]
The left tree is preferred over the right tree. The total number of changes is called the parsimony score.
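Once every node of a tree is labeled, its parsimony score is simply the number of mismatches summed over all edges. The sketch below illustrates this on the two trees of the example. The edge lists are an assumed reconstruction of the figure (the ancestor labels and the names `hamming` and `tree_cost` are illustrative, not taken from the slides).

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def tree_cost(edges):
    """Total substitutions over a list of (parent_label, child_label) edges."""
    return sum(hamming(parent, child) for parent, child in edges)


# Assumed reconstruction of the left tree: one ancestor is AGA.
left_tree = [
    ("AAA", "AAA"),  # root -> leaf AAA
    ("AAA", "AAG"),  # root -> leaf AAG
    ("AAA", "AGA"),  # root -> ancestor AGA
    ("AGA", "AGA"),  # ancestor -> leaf AGA
    ("AGA", "GGA"),  # ancestor -> leaf GGA
]

# Assumed reconstruction of the right tree: every ancestor is AAA,
# so only the ancestor -> leaf edges matter.
right_tree = [
    ("AAA", "AAA"),
    ("AAA", "AAG"),
    ("AAA", "AGA"),
    ("AAA", "GGA"),
]

print(tree_cost(left_tree), tree_cost(right_tree))  # prints: 3 4
```

The hard part of the problem is choosing the ancestor labels and the topology; the algorithms on the following slides deal with the labels.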
Simple Example
• Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position
• A minimal tree has one evolutionary change
[Figure: a tree with leaves C, C, C, T, T and a single change between C and T]
Extension to Many Letters
• What is the parsimony score of
  A (Aardvark): CAGGTA
  B (Bison):    CAGACA
  C (Chimp):    CGGGTA
  D (Dog):      TGCACT
  E (Elephant): TGCGTA
We do it character by character; each character's score is computed independently of the others, and the scores are summed.
Fitch’s Algorithm of Evaluating Trees • Assume that a tree is given. • Traverse tree from leaves to root determining set of possible states (e.g. nucleotides) for each internal node • Traverse tree from root to leaves picking ancestral states for internal nodes
Fitch’s Algorithm – Step 1
• # of changes = # union operations
[Figure: a tree with leaf states C, G, T, T, A, T and internal state sets T, T, AGT, CT, GT]
Fitch’s Algorithm – Step 1
• Do a post-order (from leaves to root) traversal of the tree
• Determine the possible states $R_i$ of internal node i with children j and k:
  $R_i = R_j \cap R_k$ if $R_j \cap R_k \neq \emptyset$, otherwise $R_i = R_j \cup R_k$
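The post-order pass can be sketched in a few lines. This is a minimal illustration for one character, assuming the tree is written as nested pairs with leaf states as strings; the topology used here is one reading of the figure residue (leaves C, T, G, T, A, T), and the name `fitch_up` is hypothetical.

```python
def fitch_up(tree):
    """Post-order pass of Fitch's algorithm for a single character.

    `tree` is either a leaf state (a string) or a pair (left, right).
    Returns (set of possible states, number of union operations).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch_up(tree[0])
    right_set, right_cost = fitch_up(tree[1])
    common = left_set & right_set
    if common:
        # Non-empty intersection: no change is needed at this node.
        return common, left_cost + right_cost
    # Empty intersection: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Assumed topology; it produces the internal sets CT, GT, AGT, T, T.
tree = (("C", "T"), ((("G", "T"), "A"), "T"))
states, changes = fitch_up(tree)
print(states, changes)  # prints: {'T'} 3
```

Three union operations occur (at CT, GT, and AGT), so this character needs three changes on this tree.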
Fitch’s Algorithm – Step 2
[Figure: the tree from Step 1 (leaf states C, G, T, T, A, T; internal state sets T, T, AGT, CT, GT), repeated in successive frames as ancestral states are picked from the root down]
Fitch’s Algorithm – Step 2
• Do a pre-order (from root to leaves) traversal of the tree
• Select the state $r_j$ of internal node j with parent i:
  $r_j = r_i$ if $r_i \in R_j$, otherwise $r_j$ is an arbitrary state from $R_j$
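The pre-order pass can be sketched as follows, assuming the state sets from Step 1 are already attached to the internal nodes. The node layout `(state_set, left, right)`, the function names, and the tie-breaking rule (`min`) are illustrative choices, not part of the slides.

```python
def fitch_down(node, parent_state=None):
    """Pre-order pass: pick one state per internal node.

    A leaf is a string; an internal node is (state_set, left, right),
    where state_set comes from Step 1. Returns the same tree with each
    state_set replaced by the chosen state.
    """
    if isinstance(node, str):
        return node
    states, left, right = node
    if parent_state in states:
        chosen = parent_state   # keep the parent's state: no change on this edge
    else:
        chosen = min(states)    # arbitrary, but deterministic, choice
    return (chosen, fitch_down(left, chosen), fitch_down(right, chosen))


def count_changes(node):
    """Count edges whose two endpoints carry different states."""
    if isinstance(node, str):
        return 0
    state, left, right = node
    total = 0
    for child in (left, right):
        child_state = child if isinstance(child, str) else child[0]
        total += (child_state != state) + count_changes(child)
    return total


# State sets as produced by Step 1 on the assumed example tree.
annotated = (
    {"T"},
    ({"C", "T"}, "C", "T"),
    ({"T"}, ({"A", "G", "T"}, ({"G", "T"}, "G", "T"), "A"), "T"),
)
labeled = fitch_down(annotated)
print(labeled[0], count_changes(labeled))  # prints: T 3
```

The rule yields one optimal assignment; other assignments with the same number of changes may exist.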
Weighted Version of Fitch’s Algorithm • Instead of assuming all state changes are equally likely, use different costs c(a, b) for different changes • 1st step of algorithm is to propagate costs up through tree
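Weighted Version of Fitch’s Algorithm
• Want to determine the minimal cost S(i, a) of assigning character a to node i
• For leaf nodes i: $S(i,a) = 0$ if leaf i is labeled a, and $S(i,a) = \infty$ otherwise
Weighted Version of Fitch’s Algorithm
• Want to determine the minimal cost S(i, a) of assigning character a to node i
• For internal nodes i with children j and k:
  $S(i,a) = \min_b \big(S(j,b) + c(a,b)\big) + \min_b \big(S(k,b) + c(a,b)\big)$
[Figure: node i with character a, its children j and k, and a child character b]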
Weighted Version of Fitch’s Algorithm – Step 2 • Do a pre-order (from root to leaves) traversal of tree • Select minimal cost character for root • For each internal node j, select character that produced minimal cost at parent i
Weighted Parsimony Scores
Weighted Parsimony score:
• Each change is weighted by a score c(a, b).
• The weighted parsimony score reduces to the parsimony score when c(a,a)=0 and c(a,b)=1 for all b ≠ a.
Evaluating Weighted Parsimony Scores
Each position is independent and computed by itself. Use dynamic programming on a given tree.
• If i is a node with children j and k, then
  $S(i,a) = \min_x \big(S(j,x) + c(a,x)\big) + \min_y \big(S(k,y) + c(a,y)\big)$
  where S(i, a) is the minimum score of the subtree rooted at i when i has character a.
[Figure: node i with score S(i,a) above its children j and k with scores S(j,x) and S(k,y)]
Evaluating Parsimony Scores
Dynamic programming on a given tree
Initialization:
• For each leaf i set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$
Iteration:
• If i is a node with children j and k, then $S(i,a) = \min_x \big(S(j,x) + c(a,x)\big) + \min_y \big(S(k,y) + c(a,y)\big)$
Termination:
• The cost of the tree is $\min_x S(r,x)$, where r is the root
Comment: To reconstruct an optimal assignment, we need to keep in each node i and for each character a the two characters x, y that bring about the minimum when i has character a.
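The initialization, iteration, and termination steps translate directly into a recursive sketch. The tree encoding (nested pairs), the function names, and the transition/transversion weights below are assumptions made for illustration; the slides do not specify a particular cost function.

```python
import math

ALPHABET = "ACGT"
PURINES = {"A", "G"}


def sankoff(tree, cost):
    """Return {a: S(i, a)} for the root i of `tree`.

    `tree` is a leaf state (string) or a pair (left, right);
    `cost(a, b)` is the weight of changing a into b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed character, infinity otherwise.
        return {a: (0 if a == tree else math.inf) for a in ALPHABET}
    s_j = sankoff(tree[0], cost)
    s_k = sankoff(tree[1], cost)
    # Iteration: best child character on each side, chosen independently.
    return {
        a: min(s_j[x] + cost(a, x) for x in ALPHABET)
        + min(s_k[y] + cost(a, y) for y in ALPHABET)
        for a in ALPHABET
    }


def unit_cost(a, b):
    """Unweighted parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Assumed weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


tree = (("C", "T"), ((("G", "T"), "A"), "T"))
# Termination: the tree's cost is the minimum over root characters.
print(min(sankoff(tree, unit_cost).values()))   # prints: 3
print(min(sankoff(tree, ts_tv_cost).values()))  # prints: 5
```

With unit costs the result matches the count of union operations from Fitch's algorithm on the same tree. Each internal node performs k·k work per side, which is where the k² factor in the running time comes from.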
Cost of Evaluating Parsimony for binary trees
• If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk^2)$.
Of course, we still need to search over ALL possible trees and find the best one. One usually resorts to heuristic search techniques.
Exploring the Space of Trees
• We’ve considered how to find the minimum number of changes for a given tree topology
• Need some search procedure for exploring the space of tree topologies
• Given n sequences there are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ possible rooted trees
Counting Trees
n = 3: one unrooted tree. n = 4: 3 unrooted trees.
[Figure: the unrooted tree on leaves 1, 2, 3]
A rooted tree with n leaves has (2n-1) nodes and (2n-2) edges, discounting the edge to the root; hence an unrooted tree has (2n-3) edges. For each additional leaf we add two edges. Therefore we have 1 • 3 • 5 • … • (2n-5) unrooted trees with n leaves. Each such tree has (2n-3) edges, any of which can be chosen as the position of the root. Hence we have 1 • 3 • 5 • … • (2n-5) • (2n-3) rooted trees with n leaves.
Exploring the Space of Trees
taxa (n)   # of rooted trees
4          15
5          105
6          945
8          135,135
10         34,459,425
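The counts follow directly from the product formula 1 • 3 • 5 • … and can be checked with a short sketch (the function names are illustrative):

```python
def num_unrooted_trees(n):
    """1 * 3 * 5 * ... * (2n-5): unrooted binary trees on n >= 3 leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


def num_rooted_trees(n):
    """Each unrooted tree can be rooted on any of its (2n-3) edges."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in (4, 5, 6, 8, 10):
    print(f"{n:>2}  {num_rooted_trees(n):>12,}")
```

The growth is faster than exponential, which is why exhaustive search stops being practical after roughly ten taxa.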
Maximum Parsimony
Site        1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
Maximum Parsimony: how many substitutions?
For four species there are three possible unrooted trees. Each site is scored separately on each tree, and the site scores are summed.
• Site 1 (A, A, A, A): no change on any tree.
• Site 2 (1 - G, 2 - C, 3 - T, 4 - A): all four bases differ, so every tree needs 3 changes.
• Site 4 (1 - G, 2 - A, 3 - A, 4 - G): the trees ((1,2),(3,4)) and ((1,3),(2,4)) need 2 changes; the tree ((1,4),(2,3)) needs 1.
Changes per site, as given on the slides:
Tree            1 2 3 4 5 6 7 8 9 10   Total
((1,2),(3,4))   0 3 2 2 0 1 1 1 1 3    14
((1,3),(2,4))   0 3 2 2 0 1 2 1 2 3    16
((1,4),(2,3))   0 3 2 1 0 1 2 1 2 3    15
The tree ((1,2),(3,4)) has the lowest total and is the most parsimonious (MP) tree.
Note: with site 10 as printed (G, A, T, G), two changes suffice on every tree, which would give totals of 13, 15, and 14. The ranking of the three trees is the same either way.
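The per-site counting can be reproduced with Fitch's post-order pass applied to each alignment column. This is a sketch under stated assumptions: the alignment is typed in exactly as printed, each unrooted tree is rooted on its internal edge (root placement does not affect the count), and the function names are illustrative. With the alignment as printed, this sketch scores site 10 (G, A, T, G) at 2 changes on every tree.

```python
def fitch_score(tree, column):
    """Changes needed for one alignment column.

    Leaves of `tree` are row indices into `column`; internal nodes are pairs.
    """
    def up(node):
        if isinstance(node, int):
            return {column[node]}, 0
        (ls, lc), (rs, rc) = up(node[0]), up(node[1])
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
    return up(tree)[1]


alignment = [
    "AGGGTAACTG",  # species 1
    "ACGATTATTA",  # species 2
    "ATAATTGTCT",  # species 3
    "AATGTTGTCG",  # species 4
]

# The three unrooted topologies on four species (0-based row indices).
topologies = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}


def per_site_scores(tree):
    """One Fitch score per alignment column."""
    return [fitch_score(tree, column) for column in zip(*alignment)]


for name, tree in topologies.items():
    scores = per_site_scores(tree)
    print(name, scores, sum(scores))
```

The topology ((1,2),(3,4)) comes out shortest, because sites 7 and 9 each need only one change on it.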
Finding most parsimonious trees - exact solutions
• Exact solutions can only be used for small numbers of taxa.
• Exhaustive search examines all possible trees.
• Typically used for problems with fewer than 10 taxa.
Finding most parsimonious trees - exhaustive search
• (1) Starting tree, any 3 taxa (A, B, C)
• (2a), (2b), (2c) Add fourth taxon (D) in each of three possible positions: three trees
• Add fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on
[Figure: the starting tree (1) on A, B, C; the three trees (2a), (2b), (2c) obtained by adding D; arrows marking the five positions where E can be attached]
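The "add the next taxon on every edge" procedure can be sketched with nested pairs. One convention is assumed here: the first taxon is held fixed as an outgroup attached above the root of the remaining subtree, so that every edge of the unrooted tree corresponds to exactly one node of that subtree. The function names are illustrative.

```python
def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `tree`.

    `tree` is a taxon name or a pair (left, right). Attaching above a node
    corresponds to splitting the edge that leads into that node.
    """
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for bigger in insertions(left, taxon):
            yield (bigger, right)
        for bigger in insertions(right, taxon):
            yield (left, bigger)


def all_unrooted_trees(taxa):
    """Enumerate all unrooted binary trees on `taxa` (at least 3 names)."""
    subtrees = [(taxa[1], taxa[2])]          # the single 3-taxon tree
    for taxon in taxa[3:]:
        subtrees = [bigger
                    for subtree in subtrees
                    for bigger in insertions(subtree, taxon)]
    return [(taxa[0], subtree) for subtree in subtrees]


for n in (3, 4, 5, 6):
    print(n, len(all_unrooted_trees(list("ABCDEF")[:n])))
```

The counts printed are 1, 3, 15, and 105, matching the product 1 • 3 • 5 • … • (2n-5).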
Finding most parsimonious trees - exact solutions
• Branch and bound saves time by discarding, during tree construction, whole families of trees that cannot be smaller than the smallest tree found so far. (Here “smaller” means “smaller score”; i.e., more parsimonious.)
• Can be enhanced by specifying an initial upper bound for tree length (total # of changes on the tree); e.g., from a distance method.
• Typically used only for problems with fewer than 20 taxa.
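A compact sketch of branch and bound follows. It relies on the fact that adding a taxon can never shorten a tree, so a partial tree that is already longer than the best complete tree can be discarded together with all its extensions. The data are the five sequences from the "Extension to Many Letters" slide; the function names, the fixed-outgroup convention, and the alphabetical addition order are assumptions for illustration.

```python
def fitch_length(tree, seqs):
    """Parsimony length of `tree` (leaf = taxon name, internal node = pair)."""
    def up(node, site):
        if isinstance(node, str):
            return {seqs[node][site]}, 0
        (ls, lc), (rs, rc) = up(node[0], site), up(node[1], site)
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
    n_sites = len(next(iter(seqs.values())))
    return sum(up(tree, site)[1] for site in range(n_sites))


def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for bigger in insertions(left, taxon):
            yield (bigger, right)
        for bigger in insertions(right, taxon):
            yield (left, bigger)


def branch_and_bound(seqs, upper_bound=float("inf")):
    """Return (best length, list of all trees of that length)."""
    taxa = sorted(seqs)
    best = {"length": upper_bound, "trees": []}

    def extend(subtree, next_index):
        tree = (taxa[0], subtree)            # taxa[0] is the fixed outgroup
        length = fitch_length(tree, seqs)
        if length > best["length"]:
            return                           # bound: no extension can be shorter
        if next_index == len(taxa):
            if length < best["length"]:
                best["length"], best["trees"] = length, []
            best["trees"].append(tree)
            return
        for bigger in insertions(subtree, taxa[next_index]):
            extend(bigger, next_index + 1)   # branch: try every edge

    extend((taxa[1], taxa[2]), 3)
    return best["length"], best["trees"]


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
        "D": "TGCACT", "E": "TGCGTA"}
length, trees = branch_and_bound(seqs)
print(length, len(trees))
```

Passing a tight `upper_bound` (for example, the length of a tree built by a distance method) lets the search prune from the start instead of waiting for the first complete tree.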
Finding most parsimonious trees: branch and bound
[Figure: search tree of partial trees. The starting tree on A, B, C is extended with D into three trees B1, B2, B3; each of these is extended with E into five trees, C1.1 to C1.5, C2.1 to C2.5, and C3.1 to C3.5]
Finding most parsimonious trees - heuristics
• The number of possible trees grows faster than exponentially with the number of taxa, making exhaustive searches impractical for many data sets (finding a most parsimonious tree is an NP-hard problem)
• Heuristic methods are used to search tree space for most parsimonious trees
• The trees found are not guaranteed to be the most parsimonious - they are best guesses
Finding most parsimonious trees - heuristics
• Stepwise addition
  • Asis - the order in the data matrix
  • Closest - starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length
  • Simple - the first taxon in the matrix is taken as a reference - taxa are added to it in the order of their decreasing similarity to the reference
  • Random - taxa are added in a random sequence; many different sequences can be used
• Recommend random with as many (e.g. 10-100) addition sequences as practical
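Random stepwise addition can be sketched as a greedy loop: shuffle the taxa, start from the first three, and attach each further taxon at the edge that gives the shortest tree. The example data are the five sequences from the "Extension to Many Letters" slide; the function names, the seed, and the number of replicates are illustrative assumptions, and no branch swapping is performed.

```python
import random


def fitch_length(tree, seqs):
    """Parsimony length of `tree` (leaf = taxon name, internal node = pair)."""
    def up(node, site):
        if isinstance(node, str):
            return {seqs[node][site]}, 0
        (ls, lc), (rs, rc) = up(node[0], site), up(node[1], site)
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
    n_sites = len(next(iter(seqs.values())))
    return sum(up(tree, site)[1] for site in range(n_sites))


def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for bigger in insertions(left, taxon):
            yield (bigger, right)
        for bigger in insertions(right, taxon):
            yield (left, bigger)


def stepwise_addition(seqs, order):
    """Greedy build: each taxon goes where it lengthens the tree least."""
    outgroup, subtree = order[0], (order[1], order[2])
    for taxon in order[3:]:
        subtree = min(insertions(subtree, taxon),
                      key=lambda t: fitch_length((outgroup, t), seqs))
    tree = (outgroup, subtree)
    return fitch_length(tree, seqs), tree


def random_addition(seqs, replicates=10, seed=0):
    """Best (length, tree) over several random addition sequences."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        order = list(seqs)
        rng.shuffle(order)
        result = stepwise_addition(seqs, order)
        if best is None or result[0] < best[0]:
            best = result
    return best


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
        "D": "TGCACT", "E": "TGCGTA"}
print(random_addition(seqs))
```

Each replicate is cheap, but the result depends on the addition order, which is the reason for running many replicates and following them with branch swapping.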
Finding most parsimonious trees - heuristics
Branch swapping:
• Nearest neighbor interchange (NNI)
• Subtree pruning and regrafting (SPR)
• Tree bisection and reconnection (TBR)
Finding most parsimonious trees - heuristics (1)
Nearest neighbor interchange (NNI)
[Figure: a tree on taxa A to G and the two trees obtained by exchanging subtrees across one internal branch]
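NNI neighbors can be generated with a short recursive sketch. An internal branch separates four subtrees; swapping one subtree from each side gives two alternative trees per internal branch. The encoding is an assumption: the tree is written as nested pairs, and an unrooted tree is represented by rooting it at one leaf (an outgroup) and passing the remaining subtree. The function name is hypothetical.

```python
def nni_neighbors(tree):
    """Yield all trees one nearest-neighbor interchange away from `tree`.

    `tree` is a taxon name or a pair (left, right). For every edge that
    joins a node to an internal child, the node's other child is swapped
    with each of the two grandchildren in turn.
    """
    if not isinstance(tree, tuple):
        return
    left, right = tree
    if isinstance(left, tuple):
        a, b = left
        yield ((right, b), a)   # swap a with right
        yield ((a, right), b)   # swap b with right
    if isinstance(right, tuple):
        a, b = right
        yield (a, (left, b))    # swap a with left
        yield (b, (a, left))    # swap b with left
    for changed in nni_neighbors(left):
        yield (changed, right)
    for changed in nni_neighbors(right):
        yield (left, changed)


# Unrooted tree on A..E, rooted at outgroup A: only the part below A is passed.
subtree = (("B", "C"), ("D", "E"))
for neighbor in nni_neighbors(subtree):
    print(neighbor)
```

A tree on n taxa has n-3 internal branches and therefore 2(n-3) NNI neighbors; the example with five taxa prints four trees.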
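Finding most parsimonious trees - heuristics (2)
Subtree pruning and regrafting (SPR)
[Figure: a tree on taxa A to G; a subtree is pruned and regrafted onto a different branch of the remaining tree]
Finding most parsimonious trees - heuristics (3)
Tree bisection and reconnection (TBR)
[Figure: a tree on taxa A to G is bisected into two subtrees, which are reconnected by a new branch joining a branch of one subtree to a branch of the other]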
Finding most parsimonious trees - heuristics - summary • Branch Swapping • Nearest neighbor interchange (NNI) • Subtree pruning and regrafting (SPR) • Tree bisection and reconnection (TBR) • The nature of heuristic searches means we cannot know which method will find the most parsimonious trees or all such trees. • However, TBR is the most extensive swapping routine and its use with multiple random addition sequences should work well.
Tree space may be populated by local minima and islands of most parsimonious trees
[Figure: tree length plotted across tree space. Three random addition sequence replicates are each followed by branch swapping; two end in local minima (FAILURE) and one reaches the global minimum (SUCCESS)]
Multiple most parsimonious trees • Many parsimony analyses yield multiple equally optimal trees • Multiple trees are due to either: • Alternative equally parsimonious optimizations of homoplastic characters • Missing data • Or both • We can further select among these trees with additional criteria, but • Most commonly relationships common to all the optimal trees are summarized with consensus trees
Consensus methods
• A consensus tree is a summary of the agreement among a set of fundamental trees
• There are many different consensus methods that differ in:
  • 1. the kind of agreement
  • 2. the level of agreement
• Consensus methods can be used with any type of tree - not just parsimony
Strict consensus methods • Strict consensus methods require agreement across all the fundamental trees • They show only those relationships that are unambiguously supported by the parsimonious interpretation of the data • The commonest method (strict component consensus) focuses on clades • This method produces a consensus tree that includes all and only those clades found in all the fundamental trees • Other relationships (those in which the fundamental trees disagree) are shown as unresolved polytomies
Strict consensus methods
[Figure: two fundamental trees on taxa A to G, and their strict component consensus tree]
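Strict component consensus amounts to intersecting the clade sets of the fundamental trees. The sketch below assumes rooted trees written as nested tuples; the two example trees are hypothetical (the topologies in the figure are not recoverable), and the function names are illustrative.

```python
def clades(tree):
    """Set of clades (frozensets of leaf names) of a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def strict_consensus(trees):
    """Clades present in every one of the given trees."""
    return set.intersection(*(clades(tree) for tree in trees))


def to_newick(clade, clade_set):
    """Write the tree implied by a set of nested clades in Newick-like form."""
    subclades = sorted((c for c in clade_set if c < clade),
                       key=lambda c: (-len(c), sorted(c)))
    children, covered = [], set()
    for c in subclades:
        if not (c & covered):        # largest sub-clades first, no overlap
            children.append(c)
            covered |= c
    parts = [to_newick(c, clade_set) for c in children]
    parts += sorted(clade - covered)  # leaves attached directly to this node
    return "(" + ",".join(parts) + ")"


# Two hypothetical fundamental trees that disagree about D and E.
tree1 = (("A", ("B", "C")), ("D", ("E", ("F", "G"))))
tree2 = (("A", ("B", "C")), (("D", "E"), ("F", "G")))

consensus = strict_consensus([tree1, tree2])
print(to_newick(max(consensus, key=len), consensus))
# prints: (((F,G),D,E),((B,C),A))
```

The disagreement about D and E shows up as a polytomy: D, E, and the clade (F,G) all attach to the same node.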
Majority-rule consensus methods • Majority-rule consensus methods require agreement across a majority of the fundamental trees • May include relationships that are not supported by the most parsimonious interpretation of the data • The commonest method focuses on clades • This method produces a consensus tree that includes all and only those clades found in a majority (>50%) of the fundamental trees • Other relationships are shown as unresolved polytomies • Of particular use in bootstrapping
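Majority-rule consensus replaces the intersection by a frequency count: a clade is kept if it occurs in more than half of the fundamental trees. The sketch assumes rooted trees written as nested tuples; the three example trees and the function names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Set of clades (frozensets of leaf names) of a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def majority_rule(trees):
    """Map each clade found in more than 50% of `trees` to its frequency."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: n / len(trees)
            for clade, n in counts.items()
            if n > len(trees) / 2}


# Three hypothetical fundamental trees on taxa A to E.
trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), (("C", "D"), "E")),
    ((("A", "B"), "C"), ("D", "E")),
]

for clade, frequency in sorted(majority_rule(trees).items(),
                               key=lambda item: (-len(item[0]), sorted(item[0]))):
    print("".join(sorted(clade)), f"{frequency:.0%}")
```

Here the clades DE and CDE each occur in two of the three trees and are kept, although a strict consensus of the same trees would retain only AB. Clades that each occur in more than half of the trees always share at least one tree, so they can never conflict with one another. When the input trees are bootstrap replicates, the frequencies are the bootstrap support values.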