Understanding Phylogenetic Tree Construction Methods

Learn the fundamentals of building phylogenetic trees, including topology, rooted vs. unrooted trees, counting trees, UPGMA, Neighbor Joining, and algorithm complexities. Discover the applications, heuristics, and proofs behind tree construction methods.

mariannej

Presentation Transcript


  1. Building Phylogenetic Trees Yaw-Ling Lin (林耀鈴) Dept Computer Sci and Info Management Providence University, Taiwan E-mail: yllin@pu.edu.tw WWW: http://www.cs.pu.edu.tw/~yawlin

  2. Phylogenetic Tree (figure labels: branch, internal node, leaf) • Topology: bifurcating • Leaves: 1…N • Internal nodes: N+1…2N-2

  3. Orthologues / Paralogues

  4. Rooted / Unrooted Tree

  5. Counting Trees

  6. Counting Trees (figure: example tree topologies on taxa A to F) • (2N - 5)!! = number of unrooted trees for N taxa • (2N - 3)!! = number of rooted trees for N taxa
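
The two double-factorial formulas on this slide can be evaluated directly. The following is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """(2N - 5)!! bifurcating unrooted topologies; needs N >= 3."""
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa: int) -> int:
    """(2N - 3)!! bifurcating rooted topologies; needs N >= 2."""
    if n_taxa < 2:
        raise ValueError("a rooted bifurcating tree needs at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For 10 taxa this already gives 2,027,025 unrooted and 34,459,425 rooted topologies, which is why exhaustive search over trees quickly becomes impractical.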

  7. Rooting the tree (figure: an unrooted tree and the corresponding rooted tree on taxa A, B, C, D, with the root marked) • To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. • Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

  8. UPGMA -- Unweighted Pair Group Method with Arithmetic mean • Simplest method: uses a sequential clustering algorithm • Assumes rate constancy among lineages, which is often violated • Step 1: from the distance matrix (dAB, dAC, dBC), join the closest pair A and B; each branch to their common ancestor has length dAB / 2 • Step 2: join (AB) with C using d(AB)C = (dAC + dBC) / 2; the node joining (AB) and C sits at height d(AB)C / 2
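
The sequential clustering on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the presenter's code: the function name `upgma`, the nested-tuple tree representation, and the three-taxon distances in the example are all assumptions. Cluster distances are averaged weighted by cluster size, which reduces to (dAC + dBC) / 2 when A and B are single taxa.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    Returns (root, height): root is a nested tuple of labels and
    height maps every cluster (and leaf) to its height in the tree.
    """
    size = {name: 1 for name in names}        # live clusters and their sizes
    height = {name: 0.0 for name in names}    # leaves sit at height 0
    d = {frozenset(pair): value for pair, value in dist.items()}

    while len(size) > 1:
        # Pick the closest pair of live clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        height[merged] = d[frozenset((a, b))] / 2
        # Distance to every other cluster: size-weighted arithmetic mean.
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]

    (root,) = size
    return root, height


# Hypothetical three-taxon matrix, chosen only for illustration.
example = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 6.0}
root, height = upgma(["A", "B", "C"], example)
print(root, height[root])
```

In the example, A and B are joined first at height 1.0, then d(AB)C = (4 + 6) / 2 = 5 places the root at height 2.5.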

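  9. UPGMA step 1: combine B and C (figure: taxa a, b, c, d, e)

  10. UPGMA step 2: combine BC and D • Averaged distances shown: (10+12)/2 and (4+6)/2 (figure: branch-length labels 2, 2)

  11. UPGMA step 3: combine A and E (figure: branch-length labels 2, .5, .5, 2, 2)

  12. UPGMA step 4: combine AE and BCD (figure: branch-length labels 2, .5, 3.5, 3.5, .5, 2, 2)

  13. UPGMA Result (figure: branch-length labels 2, .5, 3.5, 2.5, 3.5, 3.5, .5, 2, 2)

  14. UPGMA Result (figure: branch-length labels 2, .5, 3.5, 2.5, 3.5, 3.5, .5, 2, 2)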

  15. UPGMA(1)

  16. UPGMA(2)

  17. UPGMA -- Illustrations

  18. When UPGMA fails …

  19. Neighbor Joining • Very popular method • Does not make the molecular clock assumption: a modified distance matrix is constructed to adjust for differences in the evolution rate of each taxon • Produces an unrooted tree • Assumes additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them • Like UPGMA, constructs the tree by sequentially joining subtrees

  20. Additivity

  21. Naïve NJ by Additivity? • O(n^2) (i,j) pairs • O(n^2) (k,l) pairs • (k,l) “rejects” (i,j) whenever additivity fails • O(n^4) to pick an (i,j) neighbor pair! • So O(n^5) time in total suffices
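
The O(n^4) neighbor search can be sketched as follows. The transcript does not preserve the additivity test itself, so this sketch assumes the standard four-point condition for additive distances: (i,j) is rejected by (k,l) unless d(i,j) + d(k,l) is strictly smaller than both d(i,k) + d(j,l) and d(i,l) + d(j,k). Function names and the example matrix are hypothetical.

```python
from itertools import combinations


def is_neighbor_pair(i, j, taxa, d):
    """Return True if no pair (k, l) of other taxa rejects (i, j).

    d maps frozenset({x, y}) to the distance between x and y.
    Assumes the distances are additive with positive internal edges.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    others = [t for t in taxa if t not in (i, j)]
    for k, l in combinations(others, 2):          # O(n^2) (k, l) pairs
        paired = dist(i, j) + dist(k, l)
        crossed = min(dist(i, k) + dist(j, l), dist(i, l) + dist(j, k))
        if not paired < crossed:                  # (k, l) rejects (i, j)
            return False
    return True


def naive_neighbor_pairs(taxa, d):
    """Check all O(n^2) (i, j) pairs, O(n^4) time in total."""
    return [(i, j) for i, j in combinations(taxa, 2)
            if is_neighbor_pair(i, j, taxa, d)]


# Hypothetical additive matrix from the tree ((A:1, B:2):3, (C:4, D:5)).
pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
d = {frozenset(p): v for p, v in pairs.items()}
print(naive_neighbor_pairs(["A", "B", "C", "D"], d))
```

Repeating this search for each of the O(n) joins gives the O(n^5) total stated on the slide.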

  22. Neighbor Joining: Once we know the correct (i,j) pair

  23. Neighbour Joining: why not pick the smallest (i,j) pair?

  24. Neighbor-Joining: Algorithm (figure: nodes i, j, k, m)

  25. Neighbor-Joining: Complexity • In each iteration, the method searches for the pair to join in O(n^2) time and updates the distance matrix in O(n^2) time. • Over O(n) iterations this gives a total time complexity of O(n^3) and a space complexity of O(n^2).
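
The search-then-update loop can be sketched as below. The transcript does not preserve the slides' formulas for S(i,j) and R(i), so this sketch assumes the commonly stated form of the criterion: with r(i) the row sum of the distance matrix, join the pair minimizing (n - 2) d(i,j) - r(i) - r(j). The function name, the internal node labels `u0`, `u1`, ..., and the example matrix are hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of an unrooted NJ tree.

    names: list of at least two taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    """
    d = {a: {} for a in names}
    for (a, b), value in dist.items():
        d[a][b] = d[b][a] = value

    edges = []
    next_internal = 0
    while len(d) > 2:
        n = len(d)
        r = {a: sum(d[a].values()) for a in d}           # row sums
        # O(n^2) search for the pair minimizing the join criterion.
        i, j = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"u{next_internal}"                          # new internal node
        next_internal += 1
        length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        length_j = d[i][j] - length_i
        edges += [(i, u, length_i), (j, u, length_j)]
        # Update the matrix: distances from u to every remaining node.
        d[u] = {}
        for k in d:
            if k not in (i, j, u):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        for gone in (i, j):
            del d[gone]
        for row in d.values():
            row.pop(i, None)
            row.pop(j, None)

    a, b = d                                             # last two nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical additive matrix from the tree ((A:1, B:2):3, (C:4, D:5)).
example = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
           ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
for edge in neighbor_joining(["A", "B", "C", "D"], example):
    print(edge)
```

On this additive example the sketch recovers the original branch lengths (1, 2, 4, 5 for the leaves and 3 for the internal edge). The matrix update here touches only one row per iteration; the O(n^2) search dominates each of the O(n) iterations, consistent with the O(n^3) total.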

  26. Reasoning about the NJ Method • Where do the ideas of S(i,j) and R(i) come from? • How correct is the algorithm? • Is it a heuristic or an exact solution?

  27. The “1-star” Sum of the Branch Lengths • D denotes the distance between OTUs and L the branch length between nodes • Each branch is counted N-1 times when all distances are added

  28. The “paired-2-star” Sum of the Branch Lengths

  29. The “paired-2-star” Tree Size

  30. Before the proof

  31. Before the proof (Cont.)

  32. Neighbor-Joining: The proof

  33. Lemma (figure labels: 1, 2, 3, 4)

  34. Proof (figure labels: 1, 2, 3, 4)

  35. Proof of the Theorem: by contradiction (figure: tree with leaves i, j, k, l, r, s, nodes u, v, w, attachment points x, and arcs marked A and B) • Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. • If m is a fifth OTU, then it joins the tree at point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u. • Type 1: A = -2Dux - 2Duv • Type 2: B = -4Dvx + 2Duv • For the sum in formula (b) to be nonnegative, type 2 terms should outnumber type 1 terms.

  36. Proof of the theorem (Cont.) • If m is of type 1, then the corresponding summand in formula (b) is -2Dux - 2Duv. If m is of type 2, then the corresponding summand in formula (b) is -4Dvx + 2Duv. • For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined. • Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. • The conclusions about u and w contradict each other, and the theorem follows.

  37. Speeding up Neighbor-Joining Tree Construction • In this paper, the authors present several heuristics for speeding up the NJ method. • The heuristics attempt to reduce the search time by using a quad-tree. • The worst-case time complexity remains O(n^3), and the space complexity after adding the quad-tree is still O(n^2). • The authors have implemented a tool, QuickJoin.

  38. Previous Work • The neighbor-joining method was introduced by Saitou and Nei. • The algorithm was later amended by Studier and Keppler, with a running time of O(n^3). • BIONJ -- Gascuel et al. produced an O(n^3) implementation of a variant of the NJ algorithm that produces more accurate trees in many cases. • QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.

  39. +/- of distance methods • Advantages: • easy to perform • quick calculation • suited to sequences with high similarity scores • Disadvantages: • the sequences are not considered as such (loss of information) • all sites are generally treated equally (differences in substitution rates are not taken into account) • not applicable to distantly diverged sequences.

  40. Parsimony

  41. Maximum Parsimony Method • Principle: search for the tree that requires the smallest number of character state changes between the OTUs • Informative sites: those that favor some trees over others • Operationally: at least two different kinds of residues at the site, each of which is found in at least two of the OTU sequences

        Site
  OTU   1  2  3  4  5  6  7  8  9  10
  1     T  C  A  G  A  T  C  T  A  G
  2     T  T  A  G  A  A  C  T  A  G
  3     T  T  C  G  A  T  C  G  A  G
  4     T  T  C  T  A  A  G  G  A  C
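
The operational definition of an informative site can be checked mechanically. The sketch below assumes the slide's table is read row by row as four OTU sequences of ten sites each; the function name is illustrative.

```python
from collections import Counter


def informative_sites(sequences):
    """Return the 1-based positions of parsimony-informative sites.

    A site is informative when at least two different residues each
    occur in at least two of the sequences.
    """
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        residues_seen_twice = sum(1 for c in counts.values() if c >= 2)
        if residues_seen_twice >= 2:
            sites.append(position)
    return sites


# The four OTU sequences from the slide's table (sites 1 to 10).
alignment = [
    "TCAGATCTAG",  # OTU 1
    "TTAGAACTAG",  # OTU 2
    "TTCGATCGAG",  # OTU 3
    "TTCTAAGGAC",  # OTU 4
]
print(informative_sites(alignment))  # [3, 6, 8]
```

Sites 3, 6, and 8 are the only columns where two residues each appear twice; every other column is either constant or has a single deviating OTU, so it cannot favor one tree over another.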

  42. Evaluating Parsimony Scores • How do we compute the Parsimony score for a given tree? • Traditional Parsimony • Each base change has a cost of 1 • Weighted Parsimony • Each change is weighted by the score c(a,b)

  43. Traditional Parsimony • Solved independently for each position • Linear time solution (figure: leaves a, g, a; internal nodes labeled {a,g} and {a})
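
The linear-time solution for a single position can be sketched with the set-based bottom-up pass usually attributed to Fitch; the slide does not name the algorithm, so that attribution, the function name, and the nested-tuple tree encoding are assumptions of this sketch.

```python
def traditional_parsimony(tree):
    """Return (state_set, changes) for one site on a rooted binary tree.

    tree is either a leaf (a one-character string) or a pair
    (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {tree}, 0                       # a leaf knows its own state
    left, right = tree
    left_states, left_cost = traditional_parsimony(left)
    right_states, right_cost = traditional_parsimony(right)
    common = left_states & right_states
    if common:
        # Children agree on some state: no extra change is needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_states | right_states, left_cost + right_cost + 1


# The slide's figure: leaves a and g under one node, joined with leaf a.
states, changes = traditional_parsimony((("a", "g"), "a"))
print(states, changes)  # {'a'} 1
```

The inner node gets the set {a, g} and the root gets {a}, matching the labels in the slide's figure, with one change in total. Each node is visited once, which is what makes the per-position cost linear in the number of nodes.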

  44. Traditional Parsimony

  45. Evaluating Weighted Parsimony (figure: node k with children i and j) • Dynamic programming on the tree • Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞ • Iteration: if k is a node with children i and j, then S(k,a) = min_b (S(i,b) + c(a,b)) + min_b (S(j,b) + c(a,b)) • Termination: the cost of the tree is min_a S(r,a), where r is the root
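
The recurrence can be transcribed almost literally into a recursive function. This is a minimal sketch for one site: the function name, the nested-tuple tree encoding, the DNA alphabet default, and the example cost functions are assumptions made for illustration.

```python
import math


def weighted_parsimony(tree, cost, alphabet="ACGT"):
    """Return the minimum weighted cost of one site on a rooted binary tree.

    tree is a leaf (a one-character string) or a pair (left, right);
    cost(a, b) is the penalty c(a, b) for changing state a into b.
    """
    def table(node):
        # Initialization: S(i, a) = 0 if leaf i is labeled a, else infinity.
        if isinstance(node, str):
            return {a: (0.0 if a == node else math.inf) for a in alphabet}
        # Iteration: combine the tables of the two children.
        left, right = (table(child) for child in node)
        return {
            a: min(left[b] + cost(a, b) for b in alphabet)
            + min(right[b] + cost(a, b) for b in alphabet)
            for a in alphabet
        }

    # Termination: best score over all states at the root.
    return min(table(tree).values())


def unit_cost(a, b):
    """Every change costs 1, which gives traditional parsimony."""
    return 0.0 if a == b else 1.0


def transition_transversion_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0.0
    return 1.0 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2.0


print(weighted_parsimony((("A", "G"), "A"), unit_cost))
print(weighted_parsimony((("A", "C"), "A"), transition_transversion_cost))
```

With unit costs the score equals the traditional parsimony count. Evaluating the two inner minima directly costs on the order of k^2 operations per node for an alphabet of size k.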

  46. Cost of Evaluating Parsimony • The score is evaluated on each position independently. Scores are then summed over all positions. • If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; evaluating the weighted recurrence directly costs O(nmk^2) • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node

  47. Weighted Parsimony

  48. Traditional Parsimony is not “complete”

  49. Parsimony: Searching over all trees by Branch and Bound

  50. Inferring trees – Maximum Likelihood method • Maximum likelihood supposes a model of evolution along tree branches. • Strategy: find the parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data. • Note: the model of evolution does not include indels! • In Phylip package: program PROTML
