
Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures. Chapter 17.4-6: Strings and Evolutionary Trees Lecturer: Dr. Rose Slides by: Dr. Rose April 10, 2007. Ultrametric Problem Centrality. Four related tree problems: Ultrametric Additive Binary perfect phylogeny Tree compatibility


Presentation Transcript


  1. Bioinformatics Algorithms and Data Structures Chapter 17.4-6: Strings and Evolutionary Trees Lecturer: Dr. Rose Slides by: Dr. Rose April 10, 2007

  2. Ultrametric Problem Centrality Four related tree problems: • Ultrametric • Additive • Binary perfect phylogeny • Tree compatibility • All can be solved as ultrametric tree problems. • Recall tree compatibility reduces to perfect phylogeny. • Now we reduce additive tree & (binary) perfect phylogeny problems to the ultrametric tree problem.

  3. Ultrametric Problem: Additive Trees • Goal: reduce the additive tree problem to the ultrametric problem • Complexity: O(n^2) reduction • Approach: create a matrix D´ that is ultrametric iff D is additive. • We will start by describing a reduction that involves a tree T for D and a tree T´ for D´. • We will then describe a direct reduction of D to D´.

  4. Ultrametric Problem: Additive Trees • Assume that D is additive. • Assume that we know an additive tree T for D. • Assume that each of the n taxa in D labels a leaf of T. • Idea: label the nodes of T to create an ultrametric tree T´. • Q: How can we do this?

  5. Ultrametric Problem: Additive Trees A: we will do the following: • Select one node as the root • Stretch the leaf edges so that all leaves are equidistant from the root. • Let v be the row of D containing the largest entry. • Let mv denote the value of this entry. • Select node v as the root of T. • This creates a directed tree.

  6. Ultrametric Problem: Additive Trees Example: • A is the row of D containing the largest entry. • Select node A as the root of T.

  7. Ultrametric Problem: Additive Trees Stretch leaf edges: • for each leaf i, add mA - D(A, i) to the leaf edge. • All leaves are now equidistant from A, at distance mA.

  8. Ultrametric Problem: Additive Trees The resulting tree T´ is: • a rooted edge-weighted tree • distance mv from root to every leaf • each internal node is equidistant to the leaves in its subtree.

  9. Ultrametric Problem: Additive Trees Since each internal node is equidistant to the leaves in its subtree: • Label each internal node by this unique distance. • These labels can be used to define an ultrametric matrix D´. • D´(i, j) is the label at the least common ancestor of leaves i and j in T´. Q: How can we go directly from matrix D to matrix D´ without involving T and T´?

  10. Ultrametric Problem: Additive Trees Consider leaves i & j in T: • Let node w be their least common ancestor. • Let x be the distance from the root v to w. • Let y be the distance from node w to leaf i. • Let z be the distance from node w to leaf j.

  11. Ultrametric Problem: Additive Trees Q: What is the distance from w to i in T´? A: y + mv - D(v, i). Q: Where does mv - D(v, i) come from? A: Recall that we add mv - D(v, i) to the edge into leaf i when stretching the leaf edges.

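  12. Ultrametric Problem: Additive Trees Gusfield presents the following lemma: Without knowing T or T´ explicitly, we can deduce that D´(i, j) = mv + (D(i, j) - D(v, i) - D(v, j))/2 Q: Is this equation correct? A: Yes. With z the distance from w to leaf j, we have D(i, j) = y + z, D(v, i) = x + y and D(v, j) = x + z, so D´(i, j) = mv + ((y + z) - (x + y) - (x + z))/2 = mv - 2x/2 = mv - x. This agrees with the label at w, y + mv - D(v, i) = mv - x. Q: Should it instead be D´(i, j) = 2mv + D(i, j) - D(v, i) - D(v, j), i.e., 2mv - 2x? A: No. That expression is 2D´(i, j), the path distance between leaves i and j in T´, which is twice the label at their least common ancestor. Either way, the formula is not needed for the tree-based reduction (slide 9).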

  13. Ultrametric Problem: Additive Trees This brings us to the following Theorem: If D is an additive matrix, then D´ is ultrametric, where D´(i, j) = mv + (D(i, j) - D(v, i) - D(v, j))/2 Proof. We’ve shown that: D´(i, j) = y + mv - D(v, i), y = D(v, i) - x, and x = (D(v, i) + D(v, j) - D(i, j))/2. Putting it all together establishes the equation in the theorem. D´ satisfies the ultrametric requirement.

  14. Ultrametric Problem: Additive Trees Q: What is the value of y? A: y = D(v, i) - x. Q: What is the value of x in terms of values in D? A: x = (D(v, i) + D(v, j) - D(i, j))/2
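The direct reduction from D to D´ can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `to_ultrametric` and `is_ultrametric`, the list-of-lists matrix layout, and the example matrix are all assumptions. The example D is the leaf-to-leaf distance matrix of a small tree with leaf edges 2, 3, 4, 5 and one internal edge of weight 1. The check uses the standard three-point condition (in every triple, the two largest distances are equal).

```python
from itertools import combinations

def to_ultrametric(D):
    """Return (v, m_v, D') where D'(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j)) / 2."""
    n = len(D)
    # v is the first row holding the largest entry of D; m_v is that entry.
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    Dp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                Dp[i][j] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return v, m_v, Dp

def is_ultrametric(M, eps=1e-9):
    """Three-point check: the two largest values in every triple must be equal."""
    for i, j, k in combinations(range(len(M)), 3):
        a, b, c = sorted((M[i][j], M[i][k], M[j][k]))
        if c - b > eps:
            return False
    return True

# Hypothetical additive matrix over four taxa (indices 0..3).
D = [[0, 5, 7, 8],
     [5, 0, 8, 9],
     [7, 8, 0, 9],
     [8, 9, 9, 0]]
v, m_v, Dp = to_ultrametric(D)
print(v, m_v, is_ultrametric(Dp))
```

Note that D´(v, j) = mv for every j, since D(v, v) = 0: the root is at the maximum label from every other taxon.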

  15. Ultrametric Problem: Additive Trees So: D additive ⇒ D´ ultrametric. By contraposition: D´ not ultrametric ⇒ D not additive. Q: Does D´ ultrametric ⇒ D additive? A: Yes. Theorem: D´ ultrametric ⇒ D additive. Proof (constructive): • Let T´´ be the ultrametric tree for D´. • Assign weights to the edges of T´´. • Note: the sum of the edge weights from a leaf to an ancestor must match the ancestor’s label. • For each edge (p, q), where p and q also denote the labels of its endpoints (leaves have label 0), assign the weight |p-q|.

  16. Ultrametric Problem: Additive Trees • Assign weights to edges of T´´ continued • Note that the path distance between leaves i and j is twice the value labeling their least common ancestor. • Hence the path has length 2D´(i, j) = 2mv + D(i, j) - D(v, i) - D(v, j). • Now shrink the edge into each leaf i by mv - D(v, i). • The path from leaf i to leaf j now has length D(i, j). The result is an additive tree for matrix D, built from the ultrametric tree for D´. Putting all of this together results in a method for constructing an additive tree for an additive matrix.

  17. Ultrametric Problem: Additive Trees Additive Tree Algorithm • Create matrix D´ from D. • Create ultrametric tree T´´ from D´. • Create T from T´´: • Label edge (p, q) with the value |p-q|. • For each leaf i, shrink the leaf edge by mv - D(v, i). Note: no step takes more than O(n^2) time. Thm. An additive tree for an additive matrix can be constructed in O(n^2) time.
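The three steps of the algorithm can be traced end to end in a short Python sketch. Everything named here (`build_ultrametric_tree`, `additive_tree`, `tree_distance`, the node dictionaries) is a hypothetical illustration rather than the lecture's code. The tree builder is a simple recursive partition, assumed for clarity: it does not achieve the O(n^2) bound stated above, and it assumes D is additive with exactly representable values.

```python
def build_ultrametric_tree(Dp, leaves, nodes, parent=None):
    """Append the subtree for `leaves` to `nodes`; leaves get label 0."""
    if len(leaves) == 1:
        nodes.append({"label": 0.0, "parent": parent, "leaf": leaves[0]})
        return
    top = max(Dp[i][j] for i in leaves for j in leaves if i != j)
    nodes.append({"label": top, "parent": parent, "leaf": None})
    me = len(nodes) - 1
    remaining = list(leaves)
    while remaining:
        seed = remaining[0]
        # Leaves closer than `top` to the seed share a child subtree.
        group = [i for i in remaining if i == seed or Dp[seed][i] < top]
        remaining = [i for i in remaining if i not in group]
        build_ultrametric_tree(Dp, group, nodes, me)

def additive_tree(D):
    """Return (nodes, weight); weight[k] is the edge from node k to its parent."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    # Step 1: create D' from D.
    Dp = [[0.0 if i == j else m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
           for j in range(n)] for i in range(n)]
    # Step 2: create the ultrametric tree T'' from D'.
    nodes = []
    build_ultrametric_tree(Dp, list(range(n)), nodes)
    # Step 3: weight each edge (p, q) by |p - q|, then shrink the leaf edges.
    weight = []
    for node in nodes:
        if node["parent"] is None:
            weight.append(0.0)
            continue
        w = abs(nodes[node["parent"]]["label"] - node["label"])
        if node["leaf"] is not None:
            w -= m_v - D[v][node["leaf"]]
        weight.append(w)
    return nodes, weight

def tree_distance(nodes, weight, i, j):
    """Path length between leaves i and j through their least common ancestor."""
    leaf_id = {nd["leaf"]: k for k, nd in enumerate(nodes) if nd["leaf"] is not None}
    def climb(k):
        dist, seen = 0.0, {}
        while k is not None:
            seen[k] = dist
            dist += weight[k]
            k = nodes[k]["parent"]
        return seen
    up_i, up_j = climb(leaf_id[i]), climb(leaf_id[j])
    return min(up_i[a] + up_j[a] for a in up_i if a in up_j)

# Hypothetical additive matrix; the rebuilt tree should reproduce it.
D = [[0, 5, 7, 8], [5, 0, 8, 9], [7, 8, 0, 9], [8, 9, 9, 0]]
nodes, weight = additive_tree(D)
print(tree_distance(nodes, weight, 0, 2))
```

In this sketch the leaf edge of the root taxon v shrinks to length 0, so v sits on the root itself, which matches the choice of leaf v as the root of T.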

  18. Ultrametric Problem: Additive Trees Example: Given D, first find D´ Recall: D´(i, j) = mv + (D(i, j) - D(v, i) - D(v, j))/2

  19. Ultrametric Problem: Additive Trees Example: From D´ find T´´ Recall: label inner edges (p, q) by |p-q|

  20. Ultrametric Problem: Additive Trees Example: From T´´ find T Recall: shrink leaf edge i by mv - D(v, i)

  21. Ultrametric Problem: Additive Trees Example: Finally compare the derived T with the original tree as a sanity check.

  22. Ultrametric Problem: Perfect Phylogeny We now recast perfect phylogeny in terms of an ultrametric tree problem. Defn. DM: the n by n matrix of shared characters. More formally: given the n by m character matrix M, define the n by n matrix DM as follows: for each pair of objects p and q, set DM(p, q) to be the number of characters that p and q both possess.
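As a small illustration of this definition, DM can be computed directly from a 0/1 character matrix. The function name `shared_character_matrix` and the example matrix are assumptions made for this sketch. The diagonal entry DM(p, p) is simply the number of characters that p possesses.

```python
def shared_character_matrix(M):
    """DM[p][q] = number of characters (columns) that rows p and q both possess."""
    n = len(M)
    return [[sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
            for p in range(n)]

# Hypothetical character matrix: 3 objects (rows) by 3 binary characters (columns).
M = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
DM = shared_character_matrix(M)
print(DM)
```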

  23. Ultrametric Problem: Perfect Phylogeny Lemma: If M has a perfect phylogeny, then DM is a min-ultrametric matrix. Proof: convert M’s perfect phylogeny T to a min-ultrametric tree for DM • Let T be the perfect phylogeny for M. • Label T’s root with zero. • Traverse T from top to bottom, for each node v: • Let pv be the number labeling node v’s parent. • Let ev be the # of characters labeling the edge into v. • Label node v with the sum pv + ev

  24. Ultrametric Problem: Perfect Phylogeny • The label of node v is the number of characters common to all leaves in the subtree rooted at v. • If v is the immediate parent of leaves p and q, then the label of v is DM(p, q). • The numbers labeling nodes on any path from the root are strictly increasing. • The result is a min-ultrametric tree for matrix DM.

  25. Ultrametric Problem: Perfect Phylogeny Algorithm: perfect phylogeny via ultrametrics: • Create matrix DM from M. • Attempt to create a min-ultrametric tree T´ from DM. If not possible, then M has no perfect phylogeny. • If T´ was successfully created in step 2: • Attempt to label its edges with the m characters of M. • If not possible, then M has no perfect phylogeny. • O/w the modified T´ is the perfect phylogeny T. Note: T´ may be min-ultrametric but M may not have a perfect phylogeny, hence the check in step 3

  26. Ultrametric Problem: Perfect Phylogeny Final notes on the centrality of the ultrametric problem. We can see that the following problems: • perfect phylogeny • tree compatibility can be cast as ultrametric problems. This is not an efficient way to address these problems.

  27. Maximum Parsimony Maximum parsimony: • Perfect phylogeny is a special instance • Can be viewed as a Steiner tree problem on a hypercube Presentation Approach: • Introduce Steiner trees • Hypercube graphs • Maximum parsimony as a Steiner tree problem

  28. Maximum Parsimony Definitions: Let N be a set of nodes. Let E be a set of undirected edges with non-negative weights. Let G = (N, E) be an undirected graph. Let X ⊆ N be a subset of nodes. A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N-X. Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.

  29. Maximum Parsimony More Definitions: A hypercube of dimension d is an undirected graph with 2^d nodes, labeled 0..2^d-1. Adjacent nodes differ in exactly one bit position of their labels. The weighted Steiner tree problem on hypercubes: G must be a hypercube.
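The adjacency rule is easy to express with bit operations. In this sketch (the function names are assumptions), node labels are Python integers in the range 0..2^d-1, flipping one bit gives a neighbor, and the number of differing bits between two labels is their Hamming distance, which is also their shortest-path distance in the unweighted hypercube.

```python
def hypercube_neighbors(node, d):
    """All nodes of the d-dimensional hypercube adjacent to `node`."""
    return [node ^ (1 << k) for k in range(d)]

def hamming(a, b):
    """Number of bit positions in which labels a and b differ."""
    return bin(a ^ b).count("1")

print(hypercube_neighbors(0b000, 3))
print(hamming(0b101, 0b110))
```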

  30. Maximum Parsimony More Definitions: Maximum Parsimony: Occam’s razor applied to phylogenetic reconstruction. A preference for trees requiring fewer evolutionary events to explain data. Gusfield’s definition: The Maximum Parsimony problem is the unweighted Steiner tree problem on a d-dimensional hypercube.

  31. Maximum Parsimony More about the hypercube formulation of MP: • The input taxa X are described as d-length binary vectors. • Recall: adjacent nodes differ in exactly one bit position. • Correspondingly, taxa that differ by a single mutation will be adjacent. ∃ a Steiner tree for the nodes of X with l edges iff ∃ a corresponding phylogenetic tree that entails l character-state mutations.

  32. Steiner interpretation of Perfect Phylogeny Define a nontrivial binary character to be a character contained by some taxa but not all. Consider an MP dataset of d nontrivial binary characters Q: what is the minimal number of mutations in the MP tree? A: at least d.

  33. Steiner interpretation of Perfect Phylogeny Q: What is the relation to binary perfect phylogeny? A: the binary perfect phylogeny problem is equivalent to asking if there is an MP solution with a cost of exactly d. Q: What about generalized perfect phylogeny? A: It’s similar. The lower bound must reflect: • the number of character states in the input taxa. • a character having r states in the input taxa is allowed only r-1 transitions.
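The lower bound described here is a one-line computation: each character with r observed states contributes r-1 mutations. The sketch below assumes taxa are given as equal-length strings, one symbol per character; the function name and data are hypothetical.

```python
def parsimony_lower_bound(taxa):
    """Sum over characters (columns) of (number of observed states - 1)."""
    return sum(len(set(column)) - 1 for column in zip(*taxa))

# Hypothetical input: three taxa, three characters.
taxa = ["AAG", "ACG", "ATG"]
print(parsimony_lower_bound(taxa))
```

For d nontrivial binary characters every column has two states, so the bound equals d, matching the binary case.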

  34. Steiner interpretation of Perfect Phylogeny Complexity: • No known efficient solution for the Steiner tree problem on unweighted graphs. • Polynomial time solution for the generalized perfect phylogeny problem when r is fixed. ⇒ this particular Steiner tree problem can be answered in polynomial time.

  35. Steiner interpretation of Perfect Phylogeny MP approximations: • The weighted Steiner tree problem on hypercubes is NP-hard. • There is an approximate method with an error bound of a factor of 11/6. • Also MST can be used to find a Steiner tree with weight less than twice the optimal Steiner tree.
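The MST-based approximation can be sketched as follows, assuming taxa are integer node labels on the hypercube and that the distance between two taxa is their Hamming distance (their shortest-path distance in the unweighted hypercube). The sketch runs Prim's algorithm on the complete graph over the taxa. The function names and the example are assumptions for illustration.

```python
def hamming(a, b):
    """Number of differing bits, i.e. hypercube shortest-path distance."""
    return bin(a ^ b).count("1")

def mst_over_taxa(taxa):
    """Prim's algorithm on the complete graph over taxa with Hamming weights."""
    taxa = list(taxa)
    if not taxa:
        return 0, []
    # best[i] = (cheapest known distance from taxon i to the tree, tree endpoint)
    best = {i: (hamming(taxa[0], taxa[i]), 0) for i in range(1, len(taxa))}
    total, edges = 0, []
    while best:
        i = min(best, key=lambda k: best[k][0])
        w, j = best.pop(i)
        total += w
        edges.append((taxa[j], taxa[i], w))
        for k in best:
            d = hamming(taxa[i], taxa[k])
            if d < best[k][0]:
                best[k] = (d, i)
    return total, edges

# Hypothetical taxa on the 3-dimensional hypercube, pairwise at distance 2.
total, edges = mst_over_taxa([0b000, 0b011, 0b101, 0b110])
print(total, edges)
```

Each MST edge of weight w expands to a path of w hypercube edges, giving a Steiner tree of weight 6 here. For this input a 5-edge Steiner tree also exists (for example through the extra nodes 001 and 111), which is within the factor-of-two bound quoted above.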

  36. Phylogenetic Alignment Recall: • phylogenetic alignment was discussed in section 14.8 • The focus was on deriving a multiple alignment enlightened by evolutionary history. • The tree focused emphasis on specific alignment groupings • Internal node sequences were a secondary artifact

  37. Phylogenetic Alignment Phylogenetic alignment as a parsimony problem: In contrast: • we are now interested in the internal sequences • These sequences are waypoints in the evolutionary trajectory leading to the extant taxa • phylogenetic alignment is thus a parsimony problem

  38. Phylogenetic Alignment Hypothesis: optimal phylogenetic alignment describes evolutionary history. Assumptions: • Edit distance realistically models evolutionary distance • Globally optimal phylogenetic alignment captures the essence of the evolutionary process We will look at minimum mutation, a variant of phylogenetic alignment

  39. Fitch-Hartigan minimum mutation problem Defn. minimum mutation problem – variant of phylogenetic alignment problem. Input comprised of: • Tree • Strings labeling the leaves • A multiple alignment of those strings

  40. Fitch-Hartigan minimum mutation problem Q: If you are given the tree and the multiple alignment, what is left to compute? A: the mutations that account for the input data. These mutations should be: • a minimum sequence of site mutations that is • compatible with the given tree and • the given multiple alignment.

  41. Fitch-Hartigan minimum mutation problem Q: How is the input data used to determine the minimum sequence of mutations? • The multiple alignment associates each amino acid with a specific position. • The evolutionary history of the sequences is then treated as a combined but independent evolutionary history of each position. • The tree guides the order of mutations for each position.

  42. Fitch-Hartigan minimum mutation problem Assumptions: • Each column of the alignment can be solved separately • The strings labeling inner nodes adhere to the same alignment The problem reduces to a computation at a single position.

  43. Fitch-Hartigan minimum mutation problem Minimum mutation for a single position: Input: • rooted tree with n nodes • Each leaf is labeled by a single character Output: • Each interior node is labeled by a single character • The labeling minimizes the number of edges between nodes with different labels.

  44. Fitch-Hartigan minimum mutation problem Algorithmic approach: Dynamic Programming • Let Tv denote the subtree rooted at node v • Let C(v) be the cost of the optimal solution for Tv • Let C(v, x) be the cost when v must be labeled by x • Let vi denote the ith child of node v • Base case: for each leaf specify C(v) & C(v, x) for all x ∈ S. • C(v) = 0 & C(v, x) = 0 if leaf v is labeled by x. • C(v, x) = ∞ if leaf v is not labeled by x.

  45. Fitch-Hartigan minimum mutation problem When v is an internal node: • The recurrence relations start from the base cases. • They are evaluated bottom up from the leaves. • Backtracking is used after all C(v, x) are computed to extract the solution.

  46. Fitch-Hartigan minimum mutation problem Backtracking process: • The root is labeled by the character x s.t. C(r) = C(r,x) • The traversal is then top-down • If v is labeled x, then vi is labeled: • character x if C(vi) + 1 > C(vi,x) • o/w character y such that C(vi) = C(vi,y)
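The bottom-up pass and the backtracking pass can be sketched together in Python. The transcript does not preserve the recurrence for internal nodes, so this sketch assumes C(v, x) = sum over children vi of min(C(vi) + 1, C(vi, x)) and C(v) = min over x of C(v, x), which is consistent with the backtracking rule above. The function name, the dictionary-based tree encoding, and the example tree are hypothetical.

```python
import math

def minimum_mutation(children, leaf_label, root, alphabet):
    """Return (minimum number of mutations, labeling of every node)."""
    C, Cx, label = {}, {}, {}

    def bottom_up(v):
        kids = children.get(v, [])
        if not kids:
            # Base case: cost 0 for the leaf's own character, infinity otherwise.
            Cx[v] = {x: (0 if leaf_label[v] == x else math.inf) for x in alphabet}
        else:
            for c in kids:
                bottom_up(c)
            # Assumed recurrence: a child either keeps x or pays one mutation.
            Cx[v] = {x: sum(min(C[c] + 1, Cx[c][x]) for c in kids)
                     for x in alphabet}
        C[v] = min(Cx[v].values())

    def top_down(v, x):
        label[v] = x
        for c in children.get(v, []):
            if C[c] + 1 > Cx[c][x]:
                top_down(c, x)  # keeping x is strictly cheaper than mutating
            else:
                top_down(c, min(alphabet, key=lambda y: Cx[c][y]))

    bottom_up(root)
    top_down(root, min(alphabet, key=lambda y: Cx[root][y]))
    return C[root], label

# Hypothetical tree: root r with children a and b, each with two leaves.
children = {"r": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
leaf_label = {"l1": "A", "l2": "A", "l3": "A", "l4": "G"}
cost, label = minimum_mutation(children, leaf_label, "r", "AG")
print(cost, label)
```

The alphabet is passed as an ordered string so that ties are broken deterministically; any optimal labeling would serve equally well.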

  47. Fitch-Hartigan minimum mutation problem Let’s evaluate an example: C(v) = 0 & C(v, x) = 0 if leaf v is labeled by x, o/w C(v, x) = ∞.

  48. Fitch-Hartigan minimum mutation problem Time complexity: Bottom-up portion • Let s = |S| • Each node is evaluated wrt each x ∈ S. • For n nodes this gives O(ns) The backtracking portion is O(n) Overall O(ns)

  49. Maximum Parsimony • Most widely used tree building algorithm • Differs from distance-based algorithms: • Does not actually build trees from distances • Parsimony is used to compute the cost of a tree • A search strategy is used to search through all topologies • Goal: find the tree topology with the overall minimum cost

  50. Traditional Parsimony Algorithm: Traditional parsimony [Fitch 1971] • Goal: count the number of substitutions at a site. • Method: recursion, keeping track of • C, the current cost • Rk, the residues at k, the current node
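The transcript names only the two quantities tracked by the recursion, C and Rk. A minimal sketch of the commonly stated Fitch rule for a rooted binary tree follows, assuming that Rk is the intersection of the children's residue sets when that intersection is non-empty, and otherwise their union with C increased by one. The function name and tree encoding are hypothetical.

```python
def fitch_cost(tree, leaf_residue, root):
    """tree maps each internal node to its (left, right) children."""
    cost = 0
    R = {}

    def visit(k):
        nonlocal cost
        if k not in tree:
            R[k] = {leaf_residue[k]}  # a leaf carries its observed residue
            return
        left, right = tree[k]
        visit(left)
        visit(right)
        common = R[left] & R[right]
        if common:
            R[k] = common
        else:
            R[k] = R[left] | R[right]
            cost += 1  # one substitution is needed below node k

    visit(root)
    return cost, R

# Hypothetical site on a four-leaf tree.
tree = {"r": ("a", "b"), "a": ("l1", "l2"), "b": ("l3", "l4")}
residues = {"l1": "A", "l2": "A", "l3": "A", "l4": "G"}
print(fitch_cost(tree, residues, "r"))
```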
