Distance methods: p distances and the least squares (LS) approach
General concept of distance-based methods
• Two steps:
• Compute a distance D(i,j) between any two sequences i and j.
• Find the tree that agrees most with the distance table.
Simplest distance: the “p” distance
SEQ1 AACAAGCG
SEQ2 AACGAGCA
There are 2 differences, so the distance = 2. The problem is that a longer pair of sequences can show the same count:
SEQ3 AACAAGCGCCCTCAGTCCGCTCGCACAA
SEQ4 AACGAGCACCCTCAGTCCGCTCGCACAA
The distance is still 2, but the distance between SEQ3 and SEQ4 should be smaller than the distance between SEQ1 and SEQ2, because the same 2 differences are spread over 28 sites rather than 8.
Simplest distance: the “p” distance
SEQ1 AACAAGCG
SEQ2 AACGAGCA
There are 2 differences and the length is 8, so the distance is 2/8 = 0.25. This proportion of differing sites is called the p distance.
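The p distance can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function name `p_distance` is a hypothetical helper, not something from the slides.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the positions where the two characters disagree.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# The two pairs from the slides: same number of differences, different lengths.
print(p_distance("AACAAGCG", "AACGAGCA"))                      # 2 differences over 8 sites
print(p_distance("AACAAGCGCCCTCAGTCCGCTCGCACAA",
                 "AACGAGCACCCTCAGTCCGCTCGCACAA"))              # 2 differences over 28 sites
```

Dividing by the length is what makes the longer pair come out closer than the shorter pair.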
Distance estimation There are better and more accurate methods to compute the distance D(i,j) between any two sequences i and j. For example, one can take into account the different probabilities of transitions and transversions…
From a distance table to a tree
Each tree has branch lengths from which a “predicted” set of distances can be computed: d(i,j) (small d denotes the distance along the branches, unlike the observed pairwise distances D).
[Figure: three-leaf tree with branch lengths Human = 0.3, Gorilla = 0.41, Chimp = 0.25]
d(Human, Chimp) = 0.55
d(Human, Gorilla) = 0.71
d(Chimp, Gorilla) = 0.66
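As a small sketch of how the predicted distances follow from the branch lengths: in a three-leaf tree every path between two leaves runs over exactly their two branches, so d(i,j) is the sum of the two. The function name `predicted_distances` and the dictionary layout are assumptions made for this illustration.

```python
from itertools import combinations


def predicted_distances(branch):
    """d(i,j) for a three-leaf tree: the sum of the two leaf branches."""
    return {(i, j): branch[i] + branch[j] for i, j in combinations(branch, 2)}


# Branch lengths read off the figure on the slide.
branches = {"Human": 0.3, "Chimp": 0.25, "Gorilla": 0.41}
for pair, d in predicted_distances(branches).items():
    print(pair, round(d, 2))
```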
From a distance table to a tree
The question is: can we find branch lengths so that the d’s are equal to the D’s?
D(Human, Chimp) = 0.3
D(Human, Gorilla) = 0.4
D(Chimp, Gorilla) = 0.5
[Figure: three-leaf tree with unknown branch lengths X (Human), Y (Gorilla) and Z (Chimp)]
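From a distance table to a tree
[Figure: three-leaf tree with unknown branch lengths X (Human), Y (Gorilla) and Z (Chimp)]
D(Human, Chimp) = 0.3, D(Human, Gorilla) = 0.4, D(Chimp, Gorilla) = 0.5
d(Human, Chimp) = X+Z, d(Human, Gorilla) = X+Y, d(Chimp, Gorilla) = Y+Z
X+Z = 0.3
X+Y = 0.4
Y+Z = 0.5
Subtracting the first equation from the second gives Y-Z = 0.1. Together with Y+Z = 0.5 this gives Y = 0.3 and Z = 0.2, and then X = 0.1. So the answer is YES.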
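Is there always a solution?
[Figure: three-leaf tree with unknown branch lengths X (Human), Y (Gorilla) and Z (Chimp)]
D(Human, Chimp) = D1, D(Human, Gorilla) = D2, D(Chimp, Gorilla) = D3
d(Human, Chimp) = X+Z, d(Human, Gorilla) = X+Y, d(Chimp, Gorilla) = Y+Z
X+Z = D1
X+Y = D2
Y+Z = D3
We get 3 independent linear equations with 3 variables: there’s always a solution! (A branch length can still come out negative if the D’s violate the triangle inequality.)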
Exercise
[Figure: three-leaf tree with unknown branch lengths X (Human), Y (Gorilla) and Z (Chimp)]
D(Human, Chimp) = D1, D(Human, Gorilla) = D2, D(Chimp, Gorilla) = D3
X+Z = D1
X+Y = D2
Y+Z = D3
Show that for a 3-taxon tree there’s always a solution, and that it is given by:
X = 0.5(D1+D2-D3)
Y = 0.5(D2+D3-D1)
Z = 0.5(D1-D2+D3)
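The closed-form solution of the exercise can be checked numerically. The sketch below plugs in the Human/Chimp/Gorilla distances used earlier (D1 = 0.3, D2 = 0.4, D3 = 0.5); the function name `three_taxon_branches` is a hypothetical helper.

```python
def three_taxon_branches(d1, d2, d3):
    """Solve X+Z=d1, X+Y=d2, Y+Z=d3 and return the branch lengths (X, Y, Z)."""
    x = 0.5 * (d1 + d2 - d3)
    y = 0.5 * (d2 + d3 - d1)
    z = 0.5 * (d1 - d2 + d3)
    return x, y, z


x, y, z = three_taxon_branches(0.3, 0.4, 0.5)
print(x, y, z)
# Substituting back should reproduce the three observed distances.
print(x + z, x + y, y + z)
```

With these inputs the result agrees (up to floating-point rounding) with X = 0.1, Y = 0.3, Z = 0.2 found by hand.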
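Is there always a solution??
[Figure: four-leaf tree with leaves A, B, C, D and five branches V, W, X, Y, Z]
5 variables, 6 equations: it might be that there’s no solution.
D(A, B) = 2
D(A, D) = 3
D(A, C) = 3
D(B, C) = 3
D(B, D) = 3
D(C, D) = 4
An example where the fit fails: v=w=x=y=z=1 solves the first 5 equations but predicts d(C, D) = 2 instead of 4. Strictly speaking, these six distances can only be fitted exactly by shrinking the internal branch to 0 (with leaf branches of 1 for A and B and 2 for C and D), i.e. by a star tree rather than a resolved tree.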
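The all-ones check can be reproduced with a short sketch. It assumes the topology ((A,B),(C,D)), which is inferred here from the fact that unit branches fit D(A,B) = 2 and the four distances of 3; the branch-to-leaf assignment and the function names are assumptions, since the figure labels did not survive.

```python
def quartet_distances(a, b, c, d, internal):
    """Predicted distances for the unrooted tree ((A,B),(C,D)).

    a, b, c, d are the leaf branches and `internal` is the central branch.
    """
    return {
        ("A", "B"): a + b,
        ("A", "C"): a + internal + c,
        ("A", "D"): a + internal + d,
        ("B", "C"): b + internal + c,
        ("B", "D"): b + internal + d,
        ("C", "D"): c + d,
    }


def mismatches(predicted, observed):
    """Pairs whose predicted distance differs from the observed one."""
    return [pair for pair in observed if predicted[pair] != observed[pair]]


observed = {("A", "B"): 2, ("A", "D"): 3, ("A", "C"): 3,
            ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 4}

# All five branches set to 1: only the sixth equation fails.
print(mismatches(quartet_distances(1, 1, 1, 1, 1), observed))
```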
Is there always a solution?? In real data with n>3 sequences, there is practically never an exact solution. One can instead try to find the “best” solution.
Is there always a solution?? The simplest case in which the equations may have no solution: two equations with 1 parameter.
a = 2
a = 3
We want to find the “best” solution, i.e. the value of a that comes closest to satisfying both equations.
Is there always a solution?? Putting it another way:
a – 2 = 0
a – 3 = 0
Let’s write error terms instead of 0:
a – 2 = e1
a – 3 = e2
Ideally, we want e1 and e2 to be as small as possible (e1=e2=0 would be best, but it is impossible here because e1 – e2 = 1).
The least squares solution
a – 2 = e1
a – 3 = e2
We want the distance of the point (e1,e2) from (0,0) to be the smallest, i.e., we want to find the “a” for which $\sqrt{e_1^2 + e_2^2}$ is minimal.
The least squares solution The term $\sqrt{e_1^2 + e_2^2}$ reaches its minimum when the term $e_1^2 + e_2^2$ reaches its minimum. So for
a – 2 = e1
a – 3 = e2
we want to minimize $(a-2)^2 + (a-3)^2$.
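The least squares solution $(a-2)^2 + (a-3)^2 = 2a^2 - 10a + 13$. Dropping the constant 13 and dividing by 2 does not move the location of the minimum, so it is enough to minimize $a^2 - 5a$. $a^2 - 5a$ is a parabola that crosses the X axis at a=0 and a=5, and its minimum is at a=2.5.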
Is there always a solution??? So for the simplest case of two equations with 1 parameter
a = 2
a = 3
the “best” solution is a = 2.5, which makes sense.
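The same reasoning can be sketched in code for any number of conflicting equations a = t1, a = t2, …: setting the derivative of the sum of squared errors to zero gives the mean of the targets. The function names are hypothetical helpers for this illustration.

```python
def sum_of_squares(a, targets):
    """Sum of squared errors e_k = a - t_k over all equations a = t_k."""
    return sum((a - t) ** 2 for t in targets)


def least_squares_constant(targets):
    """Minimizer of sum_of_squares: d/da sum (a - t_k)^2 = 0 gives the mean."""
    return sum(targets) / len(targets)


targets = [2, 3]
best = least_squares_constant(targets)
print(best, sum_of_squares(best, targets))
# Any other value of a gives a larger sum of squares.
print(sum_of_squares(2.4, targets), sum_of_squares(2.6, targets))
```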
Back to phylogeny We have the D’s (“observed distances”), and we want to find the d’s (distances along the branches) that minimize the expression $Q = \sum_{i<j} \bigl(D(i,j) - d(i,j)\bigr)^2$
Back to phylogeny For each tree topology we get a different Q. The least squares (LS) method searches for the tree with the lowest Q.
Back to phylogeny The general formula for LS: $Q = \sum_{i<j} w_{ij}\,\bigl(D(i,j) - d(i,j)\bigr)^2$. The w’s are weights that differ between different least squares methods.
Back to phylogeny w’s used:
Cavalli-Sforza and Edwards (1967): $w_{ij} = 1$
Fitch and Margoliash (1967): $w_{ij} = 1/D(i,j)^2$
Beyer et al. (1974): $w_{ij} = 1/D(i,j)$
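A weighted least squares score can be sketched as follows. The sketch assumes the weightings usually attributed to these three methods, namely w = 1, w = 1/D² and w = 1/D, written as w = 1/D**power with power 0, 2 and 1. The function name `ls_score` and the example data (a four-taxon table with predicted distances from a tree whose branches are all 1) are assumptions for illustration.

```python
def ls_score(observed, predicted, power=0):
    """Q = sum over pairs of w * (D - d)^2, with w = 1 / D**power.

    power=0: unweighted (Cavalli-Sforza and Edwards)
    power=2: Fitch and Margoliash
    power=1: Beyer et al.
    """
    return sum((D - predicted[pair]) ** 2 / D ** power
               for pair, D in observed.items())


observed = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 3,
            ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 4}
# Predicted distances on ((A,B),(C,D)) with every branch equal to 1.
predicted = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 3,
             ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 2}

for power in (0, 1, 2):
    print(power, ls_score(observed, predicted, power))
```

Only the (C, D) pair contributes here, so the three scores differ purely through the weight given to that one misfit.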
Tree search Only general heuristic searches are available. No branch-and-bound method has been published so far. The problem was shown to be NP-complete.
Minimum Evolution For a given topology, Minimum Evolution (ME) estimates the branch lengths using the general formula for LS. But unlike LS, it chooses the topology that results in the minimal sum of branch lengths.
The Newick tree format is used to represent trees as strings
[Figure: tree with three leaves A, B and C]
In Newick format: (A,B,C)
The Newick tree format is used to represent trees as strings
[Figure: tree with four leaves A, B, C and D, where B and D are grouped together]
In Newick format: (A,C,(B,D)). Each pair of parentheses () encloses a monophyletic group, and the commas separate the members of the corresponding group.
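As a minimal sketch of how the nesting maps to the string, the function below turns nested Python tuples into the parenthesised form used on the slides. The tuple representation and the name `to_newick` are assumptions for this example; note that Newick strings stored in files are normally terminated with a semicolon, which the slides omit and this sketch omits too.

```python
def to_newick(tree):
    """Convert nested tuples of leaf names into a Newick-style string.

    A string is a leaf; a tuple is a group whose members are separated by commas.
    """
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


print(to_newick(("A", "B", "C")))
print(to_newick(("A", "C", ("B", "D"))))
```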
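Neighbor-joining is based on star decomposition
[Figure: a star tree with leaves A, B, C, D, E is resolved step by step; the best pair to group together at each step is shown in red. First C and B are joined, giving the star A, (C,B), D, E; then (C,B) and E are joined, giving A, ((C,B),E), D.]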
Neighbor-joining The neighbor-joining (NJ) method is used for reconstructing phylogenetic trees. Both the tree topology and the branch lengths are estimated. In each stage, the two nearest nodes of the tree (the term "nearest nodes" will be defined in the following paragraphs) are chosen and defined as neighbours in our tree. This is repeated until all of the nodes are paired together.
Neighbor-joining The algorithm was originally published by Saitou and Nei (1987). In 1988 a correction to the paper was published by Studier & Keppler. The correction was related to the main theorem in the algorithm. Studier and Keppler also suggested a slight change to the algorithm which reduced the running time to O(n^3). We will first describe the original algorithm, and then elaborate on the changes made by Studier & Keppler.
OTUs and HTUs Reminder: OTUs = operational taxonomic units, in other words the leaves of the tree. HTUs = hypothetical taxonomic units, in other words the internal nodes of the tree.
Neighbors, we are …
[Figure: four-leaf tree with leaves A, B, C and D]
What are neighbours? Neighbours are defined as a pair of OTUs that are connected through a single internal node. A and B are neighbours, C and D are neighbours, but A and C are not neighbours.
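Additive trees In an additive tree, the distance matrix exactly reflects the tree: every pairwise distance equals the sum of the branch lengths on the path between the two leaves.
[Figure: four-leaf tree with leaves A, B, C, D and internal nodes X and Y]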
Additive trees The NJ theorem: the NJ algorithm recovers the true tree if the distances are additive, i.e. if the distance matrix exactly reflects the tree.
NJ is an approximation of minimum evolution In the original article, Saitou and Nei defined the two nearest nodes as the pair of nodes that gives the minimal sum of branches when joined as neighbours in a tree.
NJ notations:
• First of all, some notations:
• D(i,j) is defined as the distance between leaves i and j (the observed distance which we have as an input from our distance matrix).
• L(X,Y) is defined as the sum of branch lengths between node X and node Y. L is used as a notation for distances between internal nodes, or between an internal node and a leaf.
L(X,Y) notation: We distinguish between L(X,Y) and D(A,B). The D’s are given as input to the algorithm; the L’s have to be inferred.
[Figure: four-leaf tree with leaves A, B, C, D and internal nodes X and Y]
NJ step:
• In each round we join as neighbours every possible pair of leaves in turn and evaluate the sum of branches for each resultant tree. This means we compare the sum of branches when 1 and 2 are joined as neighbours, denoted as S(1,2), to the sum of branches when 1 and 3 are joined as neighbours, S(1,3), and so on. We look for the pair i and j for which S(i,j) is minimal, where i and j denote numbers of leaves, and i<j.
• This is why NJ approximates ME (minimum evolution).
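Computing S(1,2) How can we evaluate S(1,2) from the input (the distance matrix)?
[Figure: leaves 1 and 2 attached to internal node X; leaves 3, 4 and 5 attached to internal node Y; X and Y joined by an internal branch]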
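Computing S(1,2)
[Figure: leaves 1 and 2 attached to internal node X; leaves 3, 4 and 5 attached to internal node Y; X and Y joined by an internal branch]
S(1,2) = L(1,X)+L(2,X)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)
The problem is that we don’t know the L’s. We only know the D’s…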
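Computing S(1,2)
[Figure: leaves 1 and 2 attached to internal node X; leaves 3, 4 and 5 attached to internal node Y; X and Y joined by an internal branch]
S(1,2) = L(1,X)+L(2,X)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)
Since our tree is additive, we can replace L(1,X)+L(2,X) with D(1,2):
S(1,2) = D(1,2)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)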
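Computing L(X,Y) in terms of the D’s
[Figure: leaves 1 and 2 attached to internal node X; leaves 3, 4 and 5 attached to internal node Y; X and Y joined by an internal branch. N denotes the number of leaves.]
$$L(X,Y) = \frac{1}{2(N-2)}\left[\sum_{k=3}^{N}\bigl(D(1,k)+D(2,k)\bigr) - (N-2)\bigl(L(1,X)+L(2,X)\bigr) - 2\sum_{i=3}^{N}L(i,Y)\right]$$
• L(1,X) is counted N-2 times in the first sum (once in each D(1,k)), and -L(1,X) is counted N-2 times in the second term, so L(1,X) is canceled out. The same holds for L(2,X).
• L(3,Y) is counted once in D(1,3) and once in D(2,3), and -L(3,Y) is counted 2 times in the last term, so L(3,Y) is canceled out. The same holds for the other leaves attached to Y.
• L(X,Y) is counted N-2 times in the D(1,k) terms and N-2 times in the D(2,k) terms, so altogether 2(N-2) times. Dividing by 2(N-2) we get L(X,Y).
Computing L(X,Y) in terms of the D’s
We still have to replace the term $\sum_{i=3}^{N} L(i,Y)$ by the D’s. In the sum $\sum_{3 \le i < j \le N} D(i,j)$, L(3,Y) is counted N-3 times: once in D(3,4), once in D(3,5), and so on up to D(3,N). The same holds for every other leaf attached to Y, so
$$\sum_{i=3}^{N} L(i,Y) = \frac{1}{N-3}\sum_{3 \le i < j \le N} D(i,j)$$
Computing L(X,Y) in terms of the D’s
Replacing L(1,X)+L(2,X) by D(1,2) as well, we get
$$L(X,Y) = \frac{1}{2(N-2)}\left[\sum_{k=3}^{N}\bigl(D(1,k)+D(2,k)\bigr) - (N-2)\,D(1,2) - \frac{2}{N-3}\sum_{3 \le i < j \le N} D(i,j)\right]$$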
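Putting the pieces together, S(1,2) = D(1,2) + L(X,Y) + Σ L(i,Y) can be written purely in terms of the D’s. Substituting the expressions for L(X,Y) and Σ L(i,Y) gives, as a derived form not shown on the slides, S(1,2) = D(1,2)/2 + Σ_k (D(1,k)+D(2,k)) / (2(N-2)) + Σ_{3≤i<j} D(i,j) / (N-2). The sketch below evaluates this for any pair and picks the pair with minimal S. The function names, the zero-based leaf indices and the example branch lengths are assumptions for illustration.

```python
def s_pair(D, i, j):
    """Sum of branch lengths S(i,j) when leaves i and j are joined as neighbours.

    D is a symmetric distance matrix (list of lists) with at least 3 leaves.
    """
    n = len(D)
    others = [k for k in range(n) if k not in (i, j)]
    # Distances from the joined pair to every other leaf.
    to_pair = sum(D[i][k] + D[j][k] for k in others)
    # Distances among the remaining leaves.
    among_others = sum(D[a][b]
                       for idx, a in enumerate(others)
                       for b in others[idx + 1:])
    return to_pair / (2 * (n - 2)) + D[i][j] / 2 + among_others / (n - 2)


def best_pair(D):
    """Pair (i, j), i < j, with the minimal S(i,j)."""
    n = len(D)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: s_pair(D, *p))


# Additive example: leaves 1,2 on X and 3,4,5 on Y (indices 0..4), with assumed
# branch lengths L(1,X)=1, L(2,X)=2, L(X,Y)=3, L(3,Y)=1, L(4,Y)=2, L(5,Y)=3.
D = [
    [0, 3, 5, 6, 7],
    [3, 0, 6, 7, 8],
    [5, 6, 0, 3, 4],
    [6, 7, 3, 0, 5],
    [7, 8, 4, 5, 0],
]
print(s_pair(D, 0, 1))  # should equal the total branch length 1+2+3+1+2+3
print(best_pair(D))
```

Because the example distances are additive, S for the true neighbours recovers the total tree length, which is the quantity minimum evolution tries to minimize.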