Explore the concept of ultrametric trees, their properties, and use in phylogenetic analysis. Learn about ultrametric distances, matrices, and how they relate to identifying additive sets. Understand the construction process from matrices to trees and the application of ultrametric trees in solving the additive tree problem. Discover character-based methods like Maximum Parsimony in phylogenetic tree construction with practical examples.
Phylogenetic Trees (2), Lecture 13
Based on: Durbin et al. 7.4, Gusfield 17.1-17.3, Setubal & Meidanis 6.1.
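Ultrametric trees as special weighted trees
Definition: An ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth, i.e. at the same weighted distance from the root. Edge weights can be represented by the distances of internal vertices from the leaves.
Note: each internal vertex has at least two children.
[Figure: an ultrametric tree on the leaves A, E, D, B, C, with the leaves at height 0 and each internal vertex labeled by its height above the leaves.]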
LCA and distances in an ultrametric tree
Let LCA(i,j) denote the lowest common ancestor of leaves i and j. Let D(i,j) be the height of LCA(i,j), and let dist(i,j) be the distance from i to j.
Claim: For any pair of leaves i, j in an ultrametric tree: D(i,j) = ½ dist(i,j).
[Figure: an ultrametric tree on the leaves C, B, A, D, E with internal vertices at heights 8, 5, 3, 3.]
Identifying ultrametric distances
Definition: A symmetric L×L matrix D is ultrametric iff for every three indices i, j, k: D(i,j) ≤ max {D(i,k), D(j,k)}.
Theorem: The following conditions are equivalent for an L×L symmetric matrix D:
• D is ultrametric.
• There is an ultrametric tree with L leaves such that for each pair of leaves i, j: D(i,j) = height(LCA(i,j)) = ½ dist(i,j).
Note: the condition D(i,j) ≤ max {D(i,k), D(j,k)} is easier to check than the four-point condition. The theorem therefore implies that ultrametric additive sets are easier to characterize than arbitrary additive sets.
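The three-point condition in the definition above can be checked directly. The following is a minimal sketch, assuming D is given as a symmetric list of lists with a zero diagonal and non-negative entries; the function name is_ultrametric is chosen here for illustration and does not come from the lecture.

```python
from itertools import combinations

def is_ultrametric(D):
    """Check D(i,j) <= max(D(i,k), D(j,k)) for every triple of indices.

    Assumes D is symmetric with a zero diagonal and non-negative entries,
    so only triples of distinct indices need to be examined.
    """
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        a, b, c = D[i][j], D[i][k], D[j][k]
        # Each of the three values must be bounded by the larger of the other two,
        # which is the same as saying that the maximum is attained at least twice.
        if a > max(b, c) or b > max(a, c) or c > max(a, b):
            return False
    return True
```

The check examines every triple once, so it takes O(L³) time for an L×L matrix.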
Properties of ultrametric matrices used in the proof of the theorem
Definition: Let D be an L×L matrix, and let S ⊆ {1,...,L}. D[S] is the submatrix of D consisting of the rows and columns with indices from S.
Claim 1: D is ultrametric iff for every S ⊆ {1,...,L}, D[S] is ultrametric.
Claim 2: If D is ultrametric and max_{i,j} D(i,j) = m, then m appears in every row of D.
[Figure: a matrix in which one row has two unknown entries marked "?"; one of the "?" entries must be m.]
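Ultrametric tree ⇒ ultrametric matrix
Claim: If there is an ultrametric tree such that D(i,j) = height(LCA(i,j)) = ½ dist(i,j), then D is an ultrametric matrix.
Proof: By properties of lowest common ancestors in trees, any three leaves can be named i, j, k so that D(k,i) = D(j,i) ≥ D(k,j).
[Figure: a tree with three leaves i, j, k.]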
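Ultrametric matrix ⇒ ultrametric tree: induction base
Claim: If D is an ultrametric matrix, then D has an ultrametric tree.
Proof: By induction on L, the size of D.
Basis: L = 1: T is a leaf. L = 2: T is a tree with two leaves.
[Figure: the matrices for L = 1 and L = 2 with their trees: a single leaf i, and a root with the two leaves i and j.]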
Induction step
L > 2. Let m be the maximum entry of D; by Claim 2 it appears in row 1. Let S1 = {i : D(1,i) = m} and S2 = {1,...,L} − S1 (note: 0 < |S1| < L).
By Claim 1, D[S1] and D[S2] are ultrametric.
Construct a tree T1 for S1, with root labeled m1 ≤ m.
Construct a tree T2 for S2, with root labeled m2 < m (if m2 = 0 then T2 is a leaf).
Join T1 and T2 to T with a root labeled m.
[Figure: the construction when m1 = m.]
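Correctness proof
Need to prove: T is an ultrametric tree for D, i.e., D(i,j) is the label of the LCA of i and j in T.
If i and j are in the same subtree, this holds by induction.
Otherwise, say i ∈ S1 and j ∈ S2. Then D(1,i) = m and D(1,j) < m, so the condition D(1,i) ≤ max {D(1,j), D(i,j)} forces D(i,j) ≥ m, hence D(i,j) = m because m is the maximum entry.
[Figure: the tree T joined from T1 and T2.]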
Complexity analysis
Let f(L) be the time complexity for an L×L matrix. f(1) = f(2) = constant. For L > 2:
• Constructing S1 and S2: O(L). Let |S1| = k, |S2| = L−k.
• Constructing T1 and T2: f(k) + f(L−k).
• Joining T1 and T2 to T: constant.
Thus we have: f(L) ≤ max_k [f(k) + f(L−k)] + cL, 0 < k < L.
f(L) = cL² satisfies the above, since k² + (L−k)² + L ≤ L² whenever 0 < k < L and L ≥ 2. Need an appropriate data structure!
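The inductive construction can be sketched as a short recursive function. This is an illustration under stated assumptions rather than the lecture's own implementation: the input is assumed to be ultrametric already, all off-diagonal entries are assumed positive, a leaf is represented by its index, and an internal vertex by a pair (height, children). The names build_ultrametric_tree, leaves_of and lca_height are hypothetical. When m1 = m the sketch hangs T2 directly under the root of T1, which avoids a zero-length edge.

```python
def build_ultrametric_tree(D, leaves=None):
    """Build a tree for an ultrametric matrix D restricted to `leaves`.

    Returns a bare index for a leaf, or (height, [children]) for an internal vertex.
    Assumes D is ultrametric and D[i][j] > 0 for i != j.
    """
    if leaves is None:
        leaves = list(range(len(D)))
    if len(leaves) == 1:
        return leaves[0]
    first = leaves[0]
    # By Claim 2, the maximum of the submatrix appears in the row of `first`.
    m = max(D[first][j] for j in leaves)
    s1 = [j for j in leaves if D[first][j] == m]
    s2 = [j for j in leaves if D[first][j] < m]   # contains `first`
    t1 = build_ultrametric_tree(D, s1)
    t2 = build_ultrametric_tree(D, s2)
    if isinstance(t1, tuple) and t1[0] == m:
        # Case m1 = m: reuse the root of T1 and add T2 as one more child.
        return (m, t1[1] + [t2])
    return (m, [t1, t2])

def leaves_of(tree):
    """Set of leaf indices below a vertex."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves_of(child) for child in tree[1]))

def lca_height(tree, i, j):
    """Height of the lowest common ancestor of leaves i and j (0 if i == j)."""
    if not isinstance(tree, tuple):
        return 0
    for child in tree[1]:
        below = leaves_of(child)
        if i in below and j in below:
            return lca_height(child, i, j)
    return tree[0]
```

Each call scans one row restricted to the current leaf set, so the construction follows the recurrence f(L) ≤ f(k) + f(L−k) + cL. The helper lca_height is only there to check the result against D and is not optimized.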
Recall: identifying additive trees via ultrametric trees
We solve the additive tree problem by reducing it to the ultrametric problem as follows:
• Given an input matrix D = D(i,j) of distances, transform it to a matrix D' = D'(i,j), where D'(i,j) is the height of the LCA of i and j in the corresponding ultrametric tree T'.
• Construct the ultrametric tree, T', for D'.
• Reconstruct the additive tree T from T'.
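How D' is constructed from D
D'(i,j) should be the height of the lowest common ancestor of i and j in T', the ultrametric tree obtained by hanging T from the leaf k. Thus D'(i,j) = M − d(k,m), where m is the vertex at which the paths from k to i and from k to j separate; by additivity, d(k,m) = ½ (D(k,i) + D(k,j) − D(i,j)).
[Figure: an additive tree T on the leaves a, b, c, d with weighted edges, hung from one of its leaves.]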
The transformation D → D' → T' → T
[Figure: an example with M = 9, showing the matrices D and D', the ultrametric tree T' and the additive tree T on the leaves a, b, c, d.]
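The transformation from D to D' can be sketched as follows. This is an illustration only: it assumes D is additive, takes M to be the largest distance from the chosen leaf k (the lecture's exact choice of M is not recoverable from the text, and any constant at least that large would serve), and uses the standard additive-tree identity d(k,m) = ½ (D(k,i) + D(k,j) − D(i,j)) for the distance from k to the branching point m. The function name to_ultrametric_matrix is hypothetical.

```python
def to_ultrametric_matrix(D, k):
    """Return D' with D'(i,j) = M - d(k,m) for i != j and a zero diagonal.

    D is assumed to be an additive distance matrix (list of lists);
    k is the index of the leaf from which the tree is hung.
    """
    n = len(D)
    M = max(D[k])  # assumption: M is the largest distance from leaf k
    Dp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Distance from k to the point where the paths to i and j separate.
                d_km = (D[k][i] + D[k][j] - D[i][j]) / 2
                Dp[i][j] = M - d_km
    return Dp
```

As a usage example under these assumptions, take a tree on four leaves with pendant edges of weights 1, 2, 2, 4 and an internal edge of weight 3, which gives D = [[0,3,6,8],[3,0,7,9],[6,7,0,6],[8,9,6,0]]. Hanging it from leaf 0 yields a matrix D' in which every entry involving leaf 0 equals M = 8 and the remaining off-diagonal entries are 7, 7 and 4, which satisfies the three-point condition.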
Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (e.g. teeth structures) or molecular (e.g. homologous DNA sequences). One common approach is Maximum Parsimony.
Assumptions:
• Independence of characters (no interactions)
• The best tree is one in which the minimal number of changes takes place
1. Maximum Parsimony
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree.
[Figure: a tree whose internal nodes are all labeled AAA, with leaves GGA (2 substitutions), AGA (1), AAG (1) and AAA (0); total number of substitutions = 4.]
Example continued
There are many trees possible. For example:
[Figure: two candidate trees on the leaves AAG, AAA, GGA, AGA with labeled internal nodes; the left tree has a total of 3 substitutions, the right tree a total of 4.]
The left tree is preferred over the right tree. The total number of changes is called the parsimony score.
Example with one letter
• Suppose we have five species, such that three have 'C' and two have 'T' at a specified position.
• The minimal tree has one evolutionary change.
[Figure: a tree on five leaves labeled C, C, C, T, T with a single change between C and T.]
Extension to many letters
• What is the parsimony score of
A (Aardvark): CAGGTA
B (Bison): CAGACA
C (Chimp): CGGGTA
D (Dog): TGCACT
E (Elephant): TGCGTA
We do it character after character; each score is computed independently of the others.
[Figure: a tree on the leaves Aardvark, Bison, Chimp, Dog, Elephant.]
Weighted parsimony scores
• Each change is weighted by a score c(a,b).
• The weighted parsimony score reduces to the parsimony score when c(a,a) = 0 and c(a,b) = 1 for all b ≠ a.
Evaluating weighted parsimony scores
Each position is independent and is computed by itself. Use dynamic programming on a given tree.
Let S(k,a) be the minimum score of the subtree rooted at k when k has character a.
• If k is a node with children i and j, then S(k,a) = min_x (S(i,x) + c(a,x)) + min_y (S(j,y) + c(a,y)).
[Figure: a node k with score S(k,a) above its children i and j with scores S(i,x) and S(j,y).]
Evaluating parsimony scores
Dynamic programming on a given tree.
Initialization:
• For each leaf i, set S(i,a) = 0 if i is labeled by a; otherwise S(i,a) = ∞.
Iteration:
• If k is a node with children i and j, then S(k,a) = min_x (S(i,x) + c(a,x)) + min_y (S(j,y) + c(a,y)).
Termination:
• The cost of the tree is min_x S(r,x), where r is the root.
Comment: To reconstruct an optimal assignment, we need to keep in each node k, and for each character a, the two characters x, y that bring about the minimum when k has character a.
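The dynamic program above can be sketched in a few lines. This is an illustrative version under stated assumptions: the tree is binary and given as nested pairs, a leaf is the name of a species, and the names weighted_parsimony, relabel, parsimony_score and unit_cost are hypothetical. It returns only the score and does not keep the back-pointers needed to reconstruct an optimal assignment.

```python
import math

def unit_cost(a, b):
    """c(a,a) = 0 and c(a,b) = 1 for b != a: plain (unweighted) parsimony."""
    return 0 if a == b else 1

def weighted_parsimony(tree, cost, alphabet):
    """Score of one position. A leaf is a single character, an internal node a pair."""
    def score(node):
        if isinstance(node, str):
            # Initialization: S(i,a) = 0 if leaf i is labeled a, otherwise infinity.
            return {a: (0 if a == node else math.inf) for a in alphabet}
        left, right = score(node[0]), score(node[1])
        # Iteration: S(k,a) = min_x(S(i,x) + c(a,x)) + min_y(S(j,y) + c(a,y)).
        return {
            a: min(left[x] + cost(a, x) for x in alphabet)
               + min(right[y] + cost(a, y) for y in alphabet)
            for a in alphabet
        }
    # Termination: minimum over the characters at the root.
    return min(score(tree).values())

def relabel(tree, sequences, pos):
    """Replace each leaf name by the character of its sequence at position pos."""
    if isinstance(tree, str):
        return sequences[tree][pos]
    return (relabel(tree[0], sequences, pos), relabel(tree[1], sequences, pos))

def parsimony_score(tree, sequences, cost=unit_cost, alphabet="ACGT"):
    """Sum of the per-position scores; positions are treated independently."""
    length = len(next(iter(sequences.values())))
    return sum(
        weighted_parsimony(relabel(tree, sequences, p), cost, alphabet)
        for p in range(length)
    )
```

For the four sequences AAG, AAA, GGA, AGA of the earlier example, grouping AAG with AAA and GGA with AGA gives a score of 3, while grouping AAG with GGA and AAA with AGA gives 4, which matches the two totals quoted for that example.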
Cost of evaluating parsimony for binary trees
If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk²): for each node, each character and each of the k values a, the minimum is taken over k values of x and of y.
Of course, we still need to search over possible trees and find the best one. One usually resorts to heuristic search techniques.
2. Perfect Phylogeny
Data on species is given by a character state matrix: cell (p,i) has value j iff character i of object (species) p has state j.
Goal: construct an evolutionary tree for the species.
Motivation: evolution tree
Internal nodes correspond to speciation events, where some character (attribute) is acquired.
Assumptions:
1. No reversals (characters are not lost)
2. No convergences (a character is created only once)
Perfect phylogeny for a 0-1 matrix
A 0-1 matrix: each character is either 0 (absent) or 1 (present).
A perfect phylogeny for the matrix is a tree T with no convergences and no reversals:
• Each of the n objects labels exactly one leaf of T.
• Each of the m characters labels exactly one edge of T.
• Object p has exactly the characters labeling the path from p to the root.
[Figure: a perfect phylogeny with leaves A, B, C, D, E and edges labeled by the characters 1 to 5.]
The (binary) perfect phylogeny problem
Problem: Given a 0-1 matrix M, determine if it has a perfect phylogeny, and construct one if it does.
(Note: edges are labeled by characters: an edge labeled by i represents changing the state of character i from 0 to 1.)
[Figure: a perfect phylogeny with leaves A, B, C, D, E and edges labeled by the characters 1 to 5.]
Solution to the perfect phylogeny problem
Definition: Given a 0-1 matrix M, let O_k = {j : M_jk = 1}, i.e., O_k is the set of objects that have character k.
Theorem: M has a perfect phylogenetic tree iff the sets {O_i} are laminar, i.e., for all i, j, either O_i and O_j are disjoint, or one includes the other.
[Figure: an example of a laminar family of sets and an example of a family that is not laminar.]
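The laminarity condition of the theorem can be tested pair by pair. The following is a minimal sketch, assuming M is a list of rows (objects) of 0/1 entries with one column per character; the function names are chosen here for illustration. It only decides whether a perfect phylogeny exists and does not build the tree.

```python
from itertools import combinations

def character_sets(M):
    """O_k = set of row indices (objects) that have character k."""
    n_chars = len(M[0])
    return [{p for p, row in enumerate(M) if row[k] == 1} for k in range(n_chars)]

def is_laminar(sets):
    """True iff every two sets are disjoint or one includes the other."""
    for a, b in combinations(sets, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

def has_perfect_phylogeny(M):
    """Apply the theorem: M has a perfect phylogeny iff the sets O_k are laminar."""
    return is_laminar(character_sets(M))
```

With n objects and m characters this pairwise test takes O(m²n) time, which is enough for a sanity check; faster methods exist but are not sketched here.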
Proof (⇒): Assume M has a perfect phylogeny, and let i, j be given. Consider the edges labeled i and j.
Case 1: There is a root-to-leaf path containing both. Then one of O_i, O_j includes the other (characters 2 and 1 in the figure).
Case 2: Not case 1. Then O_i and O_j are disjoint (characters 2 and 3 in the figure).
[Figure: a perfect phylogeny with leaves A, B, C, D, E and edges labeled by the characters 1 to 5.]
Proof (cont.), (⇐): Assume that for all i, j, either O_i and O_j are disjoint, or one includes the other. We prove by induction on the number of characters that the matrix has a perfect phylogenetic tree.
Basis: one character. Then there are at most two objects, one with and one without this character.
[Figure: a tree with the two leaves A and B and one edge labeled 1.]
Proof (cont.): Induction step: Assume correctness for n−1 characters, and consider a matrix with n characters (non-zero columns). WLOG assume that O_1 is not contained in O_j for j > 1. Let S1 be the set of objects that have character 1, and let S2 be the remaining objects. Then each character belongs to objects in S1 or in S2, but not both (prove!). By induction there are trees T1 and T2 for S1 and S2. Combining them as in the figure gives the desired tree.
[Figure: a root with two subtrees: T1, attached by an edge labeled 1, and T2.]