
Algorithms in Computational Biology

gary

Presentation Transcript


  1. Algorithms in Computational Biology Building Phylogenetic Trees Department of Mathematics & Computer Science

  2. Phylogeny
     • All organisms on Earth had a common ancestor
       • Evidence comes from morphological, biochemical, and gene sequence data
     • Phylogeny
       • The history of organismal lineages as they change through time
     • Phylogenetic tree
       • A tree showing the evolutionary relationships among various biological species
       • All organisms living today, from the smallest microbes to the largest plants and animals, are connected by the passage of genes along the branches of the phylogenetic tree

  3. Phylogenetic Tree of Life

  4. Inferring Phylogenies
     • Traditionally
       • Use morphological characters (both from living and fossilized organisms)
     • 1962
       • Zuckerkandl & Pauling showed that molecular sequences can be used to infer phylogenies
       • Assumes that current sequences descended from some common ancestral gene in a common ancestral species

  5. Major Tree Building Algorithms
     • Distance based
     • Parsimony
     • Maximum likelihood

  6. Orthologue vs Paralogue
     • Both are homologous genes (homologues)
     • Orthologues are a set of genes that diverged from a common ancestor through speciation
       • Homologous genes from different species
     • Paralogues are a set of genes that diverged from a common ancestor through gene duplication
       • Homologous genes from the same species

  7. A Tree of Orthologues
     [Figure: a tree of orthologues based on a set of alpha hemoglobins]

  8. A Tree of Paralogues

  9. Background on Trees
     • Nodes and edges
       • Internal nodes: unobserved ancestors
     • Edge length
       • On average, corresponds to an evolutionary time period
       • Variations
         • Different proteins can change at different rates
         • The same sequence can evolve much faster in some organisms than in others
     • Root of a phylogenetic tree
       • Ultimate ancestor of all species in the tree
       • Some algorithms provide the location of the root, while others don't

  10. Counting and Labeling Trees
      • Counting
        • For a rooted tree with n leaves
          • As we move up the tree, the edges coalesce as each new node is reached
          • In addition to the n leaves, there are n-1 other nodes (internal nodes plus the root node)
          • A total of 2n-1 nodes
          • There are 2n-2 edges (discounting the edge above the root node)
        • For an unrooted tree with n leaves
          • Total number of nodes = 2n-2
          • Total number of edges = 2n-3
      • Labeling (for a rooted tree)
        • Label the leaves using 1 to n
        • Label the branch nodes using n+1 to 2n-2
        • Label the root using 2n-1

  11. Rooting an Unrooted Tree
      [Figure: unrooted and rooted trees on leaves 1, 2 and 3]

  12. How Many Possible Topologies?
      • # of unrooted trees with n leaves: (2n-5)!!
      • # of rooted trees with n leaves: (2n-3)!!
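
A minimal sketch of how quickly these counts grow, assuming the usual reading of the slide: (2n-5)!! counts unrooted and (2n-3)!! counts rooted binary topologies on n labeled leaves. The function names are illustrative and not part of the presentation.

```python
def double_factorial(m: int) -> int:
    """Return m * (m - 2) * (m - 4) * ... down to 1 or 2 (1 when m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary topologies on n >= 2 labeled leaves."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Already at n = 10 there are over two million unrooted topologies, which is why exhaustive search over trees only works for small n.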

  13. Making a Tree from Pairwise Distances
      • Distance measure
        • First find f, the fraction of sites that differ between two sequences, presupposing an alignment of the two sequences
        • The fraction of differences expected by chance (by random substitution) is about 3/4
        • Jukes-Cantor distance: $d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4f}{3}\right)$, which tends to infinity as f approaches 3/4
      • Clustering methods
        • UPGMA
        • Neighbor-joining
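
A minimal sketch of the distance measure, assuming the standard Jukes-Cantor formula d = -(3/4) ln(1 - 4f/3) and two already-aligned, gap-free sequences of equal length. The function names and the example sequences are illustrative only.

```python
import math


def fraction_different(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(f: float) -> float:
    """Jukes-Cantor corrected distance for an observed difference fraction f."""
    if not 0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75); the distance diverges at 3/4")
    if f == 0:
        return 0.0
    return -0.75 * math.log(1 - 4 * f / 3)


if __name__ == "__main__":
    f = fraction_different("ACGTACGTAC", "ACGTTCGTAA")
    print(f, jukes_cantor_distance(f))
```

The correction is small for similar sequences and grows without bound as f approaches the chance level of 3/4.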

  14. Unweighted Pair Group Method Using Arithmetic Average (UPGMA) [Sokal & Michener, 1958]
      Overview
      1. Cluster the sequences
      2. Amalgamate two clusters at each stage, creating a new node on the tree
      3. Assemble the tree upwards, each node being added above the others
      4. Each edge length is determined by the difference in the heights of the nodes at the top and bottom of the edge

  15. Distance Measure Used in UPGMA
      • The distance between two clusters Ci and Cj is the average distance between pairs of sequences, one from each cluster: $d_{ij} = \frac{1}{|C_i||C_j|}\sum_{p \in C_i,\, q \in C_j} d_{pq}$
      • If Ck is the union of two clusters Ci and Cj, the distance between Ck and any other cluster Cl is: $d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$

  16. Algorithm: UPGMA
      Initialization
        • Assign each sequence i to its own cluster Ci
        • Define one leaf of T for each sequence, and place it at height zero
      Iteration
        • Determine the two clusters i, j for which dij is minimal (if there are ties, pick one randomly)
        • Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l using the arithmetic average
        • Define a node k with daughter nodes i and j, and place it at height dij/2
        • Add k to the current clusters and remove i and j
      Termination
        • When only two clusters i, j remain, place the root at height dij/2
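
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's code: distances are stored in a dict keyed by `frozenset({a, b})`, internal nodes get hypothetical labels `node0`, `node1`, …, and ties are broken by taking the first minimal pair rather than a random one.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns (root_label, height, children), where height maps every node to
    its height and children maps each internal node to its two daughters.
    """
    d = dict(dist)                       # working copy, extended with new clusters
    size = {name: 1 for name in names}   # number of sequences in each cluster
    height = {name: 0.0 for name in names}
    children = {}
    active = list(names)
    next_id = 0

    while len(active) > 1:
        # Closest pair of current clusters (first minimal pair on ties).
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = f"node{next_id}"
        next_id += 1
        children[k] = (i, j)
        height[k] = d[frozenset((i, j))] / 2
        size[k] = size[i] + size[j]
        active = [c for c in active if c not in (i, j)]
        # Size-weighted arithmetic average to every remaining cluster.
        for l in active:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * size[i] + d[frozenset((j, l))] * size[j]
            ) / size[k]
        active.append(k)

    return active[0], height, children


if __name__ == "__main__":
    pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
             ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
    dist = {frozenset(p): v for p, v in pairs.items()}
    print(upgma(["A", "B", "C", "D"], dist))
```

The loop runs until a single cluster is left, so the final merge of the last two clusters places the root at height dij/2 exactly as in the termination step.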

  17. An Example

  18. An Example (cont'd)

  19. Molecular Clock Assumption in UPGMA
      • UPGMA produces a rooted tree
      • Edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
        • The sum of times down a path to the leaves from any node is the same, whatever the path
      • The distances dij are said to be ultrametric if, for any triplet of sequences xi, xj, xk, the distances dij, djk, dik are either all equal, or two are equal and the remaining one is smaller
        • True for a tree with a molecular clock
      • Implied additivity
        • The edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
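
A minimal sketch of the ultrametric test described above. For each triplet, "all equal, or two equal and the third smaller" is the same as saying the two largest of the three distances are equal. The dict-of-`frozenset` layout and the tolerance `tol` are assumptions made for this illustration.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition on every triplet of leaves.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    """
    for a, b, c in combinations(names, 3):
        smallest, middle, largest = sorted(
            [dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))]]
        )
        # The two largest distances must coincide (up to the tolerance).
        if abs(largest - middle) > tol:
            return False
    return True


if __name__ == "__main__":
    clock = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4}
    print(is_ultrametric("ABC", {frozenset(p): v for p, v in clock.items()}))
```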

  20. Molecular Clocks
      • Mutations may build up in any given stretch of DNA at a reliable rate
      • If the rate of mutation of a gene is reliable, this gene can be used as a molecular clock
      • Such a gene can be a powerful tool for estimating the dates of lineage-splitting events

  21. Example
      • The entire length of DNA of a gene changes at a rate of approximately one base per 25 million years

  22. What If the Molecular Clock Property Fails?
      [Figure: a tree on leaves 1, 2, 3, 4 (left) that is reconstructed incorrectly by UPGMA (right)]

  23. Additivity
      • Given a tree, its edge lengths are additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
      • Built-in assumption in UPGMA

  24. Test for Additivity
      • For every set of four leaves 1, 2, 3 and 4, two of the three sums d12 + d34, d13 + d24 and d14 + d23 must be equal and larger than or equal to the third
      [Figure: an unrooted tree on leaves 1, 2, 3, 4]
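
A minimal sketch of this four-point test. It checks, for every quadruple of leaves, that the two largest of the three pairwise sums agree within a tolerance. The data layout (a dict keyed by `frozenset` pairs), the tolerance and the example matrix are assumptions made for illustration.

```python
from itertools import combinations


def is_additive(names, dist, tol=1e-9):
    """Four-point condition on every set of four leaves.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    """
    def d(a, b):
        return dist[frozenset((a, b))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        # The two largest sums must be equal; the smallest is unconstrained.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


if __name__ == "__main__":
    pairs = {(1, 2): 3, (1, 3): 5, (1, 4): 6, (2, 3): 6, (2, 4): 5, (3, 4): 9}
    print(is_additive([1, 2, 3, 4], {frozenset(p): v for p, v in pairs.items()}))
```

Note that the example matrix is additive but not ultrametric, which is the situation where neighbor-joining is expected to succeed and UPGMA may fail.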

  25. Joining a Pair of Neighboring Leaves
      [Figure: node k joins leaf nodes i and j; m is any other leaf]
      • dim = dik + dkm
      • djm = djk + dkm
      • dij = dik + djk
      • Hence dkm = 0.5(dim + djm - dij)

  26. Closest Pairs of Leaves Are Not Necessarily Neighboring Leaves
      [Figure: a tree on leaves 1, 2, 3, 4 with edge lengths 0.1, 0.1, 0.1, 0.4 and 0.4, and its d table]

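  27. Compensation for Long Edges
      • Subtract the averaged distances to all other leaves: $D_{ij} = d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2}\sum_{k \in L} d_{ik}$
      • For the example tree: r1 = 0.7, r2 = 0.7, r3 = 1, r4 = 1
      [Table: D values for the example tree]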
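
A minimal sketch of the correction, assuming the standard neighbor-joining definitions D_ij = d_ij - (r_i + r_j) and r_i = (1/(|L|-2)) Σ_k d_ik. The distance matrix below is an assumption: it is rebuilt from the edge lengths on the previous slide under the guess that leaves 1 and 3 share one internal node and leaves 2 and 4 the other, which reproduces the slide's r values.

```python
def corrected_distances(leaves, d):
    """Return (r, D) for the neighbor-joining criterion.

    d maps frozenset({i, j}) -> distance; r[i] is the sum of distances from
    i to the other leaves divided by (len(leaves) - 2); D = d - (r_i + r_j).
    """
    n = len(leaves)
    r = {
        i: sum(d[frozenset((i, k))] for k in leaves if k != i) / (n - 2)
        for i in leaves
    }
    D = {pair: d[pair] - sum(r[x] for x in pair) for pair in d}
    return r, D


if __name__ == "__main__":
    # Assumed matrix: edges of 0.1 to leaves 1 and 2, 0.4 to leaves 3 and 4,
    # and an internal edge of 0.1 separating {1, 3} from {2, 4}.
    pairs = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
             (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
    d = {frozenset(p): v for p, v in pairs.items()}
    r, D = corrected_distances([1, 2, 3, 4], d)
    print(r)
    print(sorted(min(d, key=d.get)), sorted(min(D, key=D.get)))
```

Under this assumed matrix the smallest raw distance belongs to the non-neighboring pair (1, 2), while the smallest corrected value belongs to a true neighboring pair.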

  28. Algorithm: Neighbor-Joining
      Initialization
        • Define T to be the set of leaf nodes, one for each given sequence, and put L = T
      Iteration
        • Pick a pair i, j in L for which Dij = dij - (ri + rj) is minimal
        • Define a new node k and set dkm = 0.5(dim + djm - dij) for all m in L
        • Add k to T with edges of lengths dik = 0.5(dij + ri - rj) and djk = dij - dik, joining k to i and j, respectively
        • Remove i and j from L and add k
      Termination
        • When L consists of two leaves i and j, add the remaining edge between i and j, with length dij
      • Produces an unrooted tree
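
The algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions: distances live in a dict keyed by `frozenset({a, b})`, new internal nodes get hypothetical labels `node0`, `node1`, …, ties are broken by taking the first minimal pair, and the four-leaf matrix in the usage example is made up for the demonstration.

```python
from itertools import combinations


def neighbor_joining(leaves, dist):
    """Return the unrooted tree as a list of (node, node, edge_length)."""
    d = dict(dist)          # working copy, extended with new internal nodes
    active = list(leaves)   # the set L of the slide
    edges = []
    next_id = 0

    while len(active) > 2:
        n = len(active)
        # r_i: averaged distance from i to all other active nodes.
        r = {
            i: sum(d[frozenset((i, m))] for m in active if m != i) / (n - 2)
            for i in active
        }
        # Pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(active, 2),
            key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        k = f"node{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        active = [m for m in active if m not in (i, j)]
        # Distances from the new node k to every remaining node m.
        for m in active:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            )
        active.append(k)

    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    pairs = {(1, 2): 3, (1, 3): 5, (1, 4): 6, (2, 3): 6, (2, 4): 5, (3, 4): 9}
    dist = {frozenset(p): v for p, v in pairs.items()}
    for edge in neighbor_joining([1, 2, 3, 4], dist):
        print(edge)
```

On an additive matrix such as this one the recovered edge lengths reproduce the input distances exactly; on real data they are only estimates.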

  29. Rooting Trees
      • Outgroup
        • A species known to be more distantly related to each of the remaining species than they are to each other
      • Find the root by adding an outgroup
        • The point in the tree where the edge to the outgroup joins is expected to be the best root candidate
      • In the absence of a convenient outgroup, methods are quite ad hoc
        • E.g. picking the midpoint of the longest chain of consecutive edges, if the deviation from a molecular clock is not too great

  30. Assumptions Used by UPGMA and Neighbor-Joining
      • UPGMA (molecular clock with implied additivity)
        • The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
        • The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
        • The distance from an internal node to a leaf node is always the same, no matter what path is taken
      • Neighbor-Joining
        • It is possible for the molecular clock property to fail but for additivity to hold
        • Assumes additivity only

  31. Parsimony
      • One of the most widely used tree building approaches
      • It works by finding the tree that can explain the observed sequences with a minimum # of substitutions
      • Two components to the algorithm
        • The computation of a cost for a given tree T
        • A search through all trees to find the overall minimum of this cost

  32. Notations Used in Weighted Parsimony
      • Sk(a): the minimal cost for the assignment of a to node k
      • S(a, b): the cost for each substitution of a by b

  33. Algorithm: Weighted Parsimony [Sankoff & Cedergren 1983]
      Compute the minimum cost at site u
      Initialization
        • Set k = 2n-1, the number of the root node
      Recursion: compute Sk(a) for all a as follows
        • If k is a leaf node: set Sk(a) = 0 for a = xuk, and Sk(a) = ∞ otherwise
        • If k is not a leaf node: compute Si(b), Sj(b) for all b at the daughter nodes i, j, and define Sk(a) = minb(Si(b) + S(a, b)) + minb(Sj(b) + S(a, b))
      Termination
        • Minimal cost of tree = mina S2n-1(a)
      • Weighted parsimony reduces to traditional parsimony if S(a, a) = 0 for all a and S(a, b) = 1 for all a ≠ b
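
A minimal sketch of this recursion for a single site. It makes several assumptions not in the slides: the tree is a nested tuple of two subtrees, each leaf is the one-character residue observed at the site, every leaf residue belongs to `alphabet`, and the substitution cost is passed in as a hypothetical function `sub_cost(a, b)`.

```python
import math


def sankoff_costs(tree, alphabet, sub_cost):
    """Return {a: S_k(a)} for the node at the top of `tree`."""
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed residue, infinity for anything else.
        return {a: (0 if a == tree else math.inf) for a in alphabet}
    left, right = tree
    s_left = sankoff_costs(left, alphabet, sub_cost)
    s_right = sankoff_costs(right, alphabet, sub_cost)
    return {
        a: min(s_left[b] + sub_cost(a, b) for b in alphabet)
        + min(s_right[b] + sub_cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony_cost(tree, alphabet, sub_cost):
    """Minimal cost of the tree at this site: min over a of S_root(a)."""
    return min(sankoff_costs(tree, alphabet, sub_cost).values())


if __name__ == "__main__":
    def unit_cost(a, b):
        return 0 if a == b else 1

    tree = (("A", "C"), ("A", "C"))
    print(weighted_parsimony_cost(tree, "ACGT", unit_cost))
```

With the unit cost function the result coincides with the traditional parsimony count; any other non-negative cost function can be substituted without changing the recursion.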

  34. Algorithm: Traditional Parsimony [Fitch 1971]
      Initialization
        • Set C = 0 and k = 2n-1
      Recursion: to obtain the set Rk
        • If k is a leaf node: set Rk = {xuk}
        • If k is not a leaf node: compute Ri, Rj for the daughter nodes i, j of k, and set Rk = Ri ∩ Rj if this intersection is not empty, or else set Rk = Ri ∪ Rj and increment C
      Termination
        • Minimal cost of the tree = C
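
A minimal sketch of the same recursion, written so that the counter C is returned instead of being kept as a global. The nested-tuple tree with one-character leaves is an assumption made for this illustration.

```python
def fitch(tree):
    """Return (R_k, C): the residue set at the top node and the substitution count."""
    if isinstance(tree, str):
        return {tree}, 0           # leaf: the observed residue, no cost
    (r_i, c_i), (r_j, c_j) = fitch(tree[0]), fitch(tree[1])
    common = r_i & r_j
    if common:
        return common, c_i + c_j   # non-empty intersection: no new substitution
    return r_i | r_j, c_i + c_j + 1  # empty intersection: union, one more substitution


if __name__ == "__main__":
    print(fitch((("A", "C"), ("A", "C"))))
```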

  35. Parsimony Example
      [Figure: trees with leaves labeled A and B and internal nodes labeled A, B, X or {A, B}]
      • Minimum cost = 2, obtained by traditional parsimony

  36. Parsimony Example (cont'd)
      [Figure: a tree with nodes labeled A and B]
      • Minimum cost tree: not obtained by traditional parsimony

  37. Enumeration of Unrooted Trees
      • Enumerate all unrooted trees by an array [i3][i5][i7][i9]…[i2n-5]
      • Take the unrooted tree with 3 sequences x1, x2 and x3 and add an edge for x4 on the edge labeled by i3. Since the new edge divides the preexisting edge in two, the total number of edges is now 3 + 2 = 5. The value of i5 determines which of these 5 edges x5 is added to.
      • Think of [i3][i5][i7][i9]…[i2n-5] as an odometer …

  38. Counting Trees (cont'd)
      • Counting complete trees
        • The rightmost index advances until it reaches 2n-5
        • The next-to-rightmost index clicks forward by 1 when the rightmost index goes back to 1
        • The third-from-right index clicks forward by 1 when the next-to-rightmost index passes 2n-7 and goes back to 1
        • And so on …
      • Counting both complete and incomplete trees
        • Add 0 to the range of each array index, meaning that there is no edge of the order specified by the counter

  39. Selecting Labeled Branching Patterns by Branch and Bound
      • Start from the odometer setting [1][0][0]…[0]
      • Let the smallest cost so far for a complete tree be C
      • Branch and bound
        • Adding more leaves can only increase the cost
        • There is no point branching out if the current cost is larger than the minimum cost so far
      • Implementation trick
        • Whenever the cost of the current subtree T is more than C, we know that T is not part of the optimal tree
        • If all the counters to the right of a given non-zero counter are 0, instead of advancing them all to '1' we can click the rightmost non-zero counter one forward

  40. An Example of Branch-and-Bound
      [Figure: odometer settings 3…70000, 3…71111, …, 3…80000]
      • Skip 3…70001 to 3…7(2n-11)(2n-9)(2n-7)(2n-5) and go directly to 3…80000 if the cost of 3…70000 is higher than the minimum cost found so far

  41. Assessing the Trees: the Bootstrap
      • Bootstrapping (sampling with replacement)
        • Given a dataset consisting of an alignment of sequences, generate an artificial dataset of the same size by picking columns from the alignment at random with replacement
        • Generate a large number (on the order of thousands) of artificial alignment datasets
        • For each artificially generated dataset, build a tree
      • Assessing phylogenetic features
        • Find the frequency with which each phylogenetic feature appears in the thousands of trees generated above
        • The higher the frequency, the more confidence we have in that phylogenetic feature
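
A minimal sketch of the resampling loop. The alignment is assumed to be a list of equal-length strings, and `build_tree_features` is a hypothetical callable, standing in for any tree-building method, that returns the set of hashable features (for example leaf groupings) found in the tree built from one alignment.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    width = len(alignment[0])
    columns = [rng.randrange(width) for _ in range(width)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_tree_features, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each feature appears."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = {}
    for _ in range(replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for feature in build_tree_features(replicate):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: c / replicates for feature, c in counts.items()}


if __name__ == "__main__":
    aln = ["ACGT", "ACGA", "TCGA"]
    print(bootstrap_alignment(aln, random.Random(1)))
```

Each replicate keeps whole columns intact, so the relationship between sequences at a site is preserved while the weight given to each site varies.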

  42. Describe a New Hampshire Standard Tree
      [Figure: a rooted tree with leaves A, B, C, D and E]
      Tree file representation of the above rooted tree, starting at the beginning of the file:
      • Topology only: (B,(A,C,E),D);
      • With edge lengths: (B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
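
A minimal sketch of reading such a string. It handles only the subset shown on the slide (nested parentheses, plain names, optional `:length`, terminating `;`) and assumes no whitespace inside the tree, no quoted labels and no comments. The `(name, length, children)` tuple layout is an illustrative choice, not a standard.

```python
def parse_newick(text):
    """Parse a simple New Hampshire (Newick) string into nested tuples.

    Each node is (name, length, children); length is None when absent.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Leaf labels in left-to-right order."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


if __name__ == "__main__":
    tree = parse_newick("(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);")
    print(leaf_names(tree))
```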

  43. Visualize Trees: Phylip DrawTree

  44. Visualize Trees: Cladogram

  45. Visualize Trees: Phenogram

  46. Visualize Trees: Curve-O-Gram

  47. Visualize Trees: Eurogram

  48. Programs to Build Phylogenetic Trees
      • PAUP
        • Includes parsimony, maximum likelihood, and distance methods
      • Phylip
        • Includes parsimony, distance matrix, and likelihood methods, as well as bootstrapping and consensus trees
      • MrBayes
        • Bayesian estimation of phylogeny
        • Uses a simulation technique called Markov chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees
      • Notung
        • Incorporates duplication/loss parsimony into phylogenetic tasks
      • ……
