Learn about the ultrametric problem in additive trees and how it can be solved using matrix transformations and tree reductions.
Bioinformatics Algorithms and Data Structures Chapter 17.4-6: Strings and Evolutionary Trees Lecturer: Dr. Rose Slides by: Dr. Rose April 17, 2003
Ultrametric Problem Centrality Four related tree problems: • Ultrametric • Additive • Binary perfect phylogeny • Tree compatibility • All can be solved as ultrametric tree problems. • Recall tree compatibility reduces to perfect phylogeny. • Now we reduce additive tree & (binary) perfect phylogeny problems to the ultrametric tree problem.
Ultrametric Problem: Additive Trees • Goal: reduce the additive tree problem to the ultrametric problem • Complexity: O(n^2) reduction • Approach: create a matrix D´ that is ultrametric iff D is additive. • We will start by describing a reduction that involves a tree T for D and a tree T´ for D´. • We will then describe a direct reduction of D to D´.
Ultrametric Problem: Additive Trees • Assume that D is additive. • Assume that we know of an additive tree T for D. • Assume that each of the n taxa in D labels a leaf of T. • Idea: label the nodes of T to create an ultrametric tree T´. • Q: How can we do this?
Ultrametric Problem: Additive Trees A: we will do the following: • Select one node as the root • Stretch the leaf edges so that the leaves are equidistant from the root. • Let v be the row of D containing the largest entry. • Let m_v denote the value of this entry. • Select node v as the root of T. • This creates a directed tree.
Ultrametric Problem: Additive Trees Example: • A is the row of D containing the largest entry. • Select node A as the root of T.
Ultrametric Problem: Additive Trees Stretch leaf edges: • for each leaf i, add m_A - D(A, i) to the leaf edge. • Leaves are now equidistant from A.
Ultrametric Problem: Additive Trees The resulting tree T´ is: • a rooted edge-weighted tree • distance m_v from root to every leaf • each internal node is equidistant to leaves in its subtree.
Ultrametric Problem: Additive Trees Since each internal node is equidistant to the leaves in its subtree: • Label each internal node by this unique distance. • These labels can be used to define an ultrametric matrix D´. • D´(i, j) is the label at the least common ancestor of leaves i and j in T´. Q: How can we go directly from matrix D to matrix D´ without involving T and T´?
Ultrametric Problem: Additive Trees Consider leaves i & j in T: • Let node w be their least common ancestor • Let x be the distance from the root v to w. • Let y be the distance from node w to leaf i. • Let z be the distance from node w to leaf j.
Ultrametric Problem: Additive Trees Q: What is the distance from w to i in T´? A: y + m_v - D(v, i). Q: Where does m_v - D(v, i) come from? A: Recall we add m_v - D(v, i) to stretch the leaf edges.
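Ultrametric Problem: Additive Trees Gusfield presents the following lemma: Without knowing T or T´ explicitly, we can deduce that D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2 Q: Is this equation correct? Substituting D(i, j) = y + z, D(v, i) = x + y and D(v, j) = x + z, where z is the distance from w to leaf j: D´(i, j) = m_v + ((y + z) - (x + y) - (x + z))/2 = m_v - x. This equals the distance from w to leaf i in T´, since y + m_v - D(v, i) = m_v - x, so the equation gives the label at w. Q: Should it instead be D´(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j), i.e., D´(i, j) = 2m_v - 2x? A: That doubled value is the path length between leaves i and j in T´. Either scaling yields an ultrametric matrix, so the choice is not necessary for the reduction (slide 9)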
Ultrametric Problem: Additive Trees This brings us to the following Theorem: If D is an additive matrix, then D´ is ultrametric, where D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2 Proof. We’ve shown that: • D´(i, j) = y + m_v - D(v, i) • y = D(v, i) - x • x = (D(v, i) + D(v, j) - D(i, j))/2 Putting it all together establishes the equation in the theorem. D´ satisfies the ultrametric requirement.
Ultrametric Problem: Additive Trees Q: What is the value of y? A: y = D(v, i) - x. Q: What is the value of x in terms of values in D? A: x = (D(v, i) + D(v, j) - D(i, j))/2
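The direct construction of D´ from D can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's code: the function names `additive_to_ultrametric` and `is_ultrametric`, the list-of-lists matrix layout, and the choice to leave the diagonal at 0 are all assumptions made for illustration. The check uses the three-point condition (in every triple of leaves the largest of the three values occurs at least twice).

```python
from itertools import combinations


def additive_to_ultrametric(D):
    """Return (v, m_v, D_prime) for a symmetric distance matrix D (list of lists)."""
    n = len(D)
    # v is the first row containing the largest entry; m_v is that entry.
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    D_prime = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption: the diagonal is left at 0
                D_prime[i][j] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return v, m_v, D_prime


def is_ultrametric(D):
    """Three-point condition: in each triple the largest value occurs at least twice."""
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted((D[i][j], D[i][k], D[j][k]))
        if b != c:  # exact comparison; use a tolerance for noisy float data
            return False
    return True


# Illustrative additive matrix: a star tree whose three leaf edges have lengths 1, 2, 3.
D = [[0, 3, 4],
     [3, 0, 5],
     [4, 5, 0]]
v, m_v, D_prime = additive_to_ultrametric(D)
```

For this small matrix, row 1 holds the largest entry (5), and D itself fails the three-point condition while D´ passes it. The double loop is O(n^2); the brute-force check is O(n^3) and is only there for illustration.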
Ultrametric Problem: Additive Trees So: D additive ⇒ D´ ultrametric. By contraposition: D´ not ultrametric ⇒ D not additive. Q: Does D´ ultrametric ⇒ D additive? A: Theorem: D´ ultrametric ⇒ D additive. Proof. (constructive) • Let T´´ be the ultrametric tree for D´ • Assign weights to edges of T´´ • Note: the sum of edge weights from a leaf to an ancestor must match the ancestor’s label. • For each edge (p, q), assign the weight |p - q|, the absolute difference of the labels at p and q (leaves have label 0).
Ultrametric Problem: Additive Trees • Assign weights to edges of T´´ continued • Note the path distance between leaves i and j is twice the value labeling their least common ancestor • Hence, 2D´(i, j) = 2m_v + D(i, j) - D(v, i) - D(v, j) • Now shrink the edge into each leaf i by m_v - D(v, i) • The path from leaf i to leaf j is now D(i, j) The result is an additive tree for matrix D from D´’s ultrametric tree. Putting all of this together results in a method for constructing an additive tree for an additive matrix.
Ultrametric Problem: Additive Trees Additive Tree Algorithm • Create matrix D´ from D. • Create ultrametric tree T´´ from D´ • Create T from T´´ • Label edge (p, q) with the value |p - q| • For each leaf i, shrink the leaf edge by m_v - D(v, i) Note: no step takes more than O(n^2) time. Thm. An additive tree for an additive matrix can be constructed in O(n^2) time.
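The arithmetic behind the shrink step can be checked without building T´´ at all. The sketch below is an illustration under assumptions (the function name `leaf_distance_after_shrinking` and the example matrix are made up here): it takes the leaf-to-leaf path length 2D´(i, j) in the ultrametric tree and subtracts the two leaf-edge reductions, which should give back D(i, j).

```python
def leaf_distance_after_shrinking(D, i, j):
    """Path length between leaves i and j after the shrink step, computed from D alone."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))  # row with the largest entry
    m_v = max(D[v])
    d_prime = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2  # label at lca(i, j)
    path_in_ultrametric_tree = 2 * d_prime             # up to the lca and back down
    shrink_i = m_v - D[v][i]
    shrink_j = m_v - D[v][j]
    return path_in_ultrametric_tree - shrink_i - shrink_j


# Illustrative additive matrix: leaves 0,1 hang off one internal node (edges 1, 2),
# leaves 2,3 hang off another (edges 3, 4), and the internal edge has length 5.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
recovered = [[leaf_distance_after_shrinking(D, i, j) if i != j else 0
              for j in range(4)] for i in range(4)]
```

The identity is purely algebraic, so it confirms the bookkeeping of the stretch and shrink amounts rather than testing whether D is additive; that test is the ultrametric check on D´.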
Ultrametric Problem: Additive Trees Example: Given D, first find D´. Recall: D´(i, j) = m_v + (D(i, j) - D(v, i) - D(v, j))/2
Ultrametric Problem: Additive Trees Example: From D´ find T´´. Recall: label edges (p, q) by |p - q|
Ultrametric Problem: Additive Trees Example: From T´´ find T. Recall: shrink leaf edge i by m_v - D(v, i)
Ultrametric Problem: Additive Trees Example: Finally compare the derived T with the original tree as a sanity check.
Ultrametric Problem: Perfect Phylogeny We now recast perfect phylogeny in terms of an ultrametric tree problem. Defn. D_M: the n by n matrix of shared characters. More formally: Given the n by m character matrix M, define the n by n matrix D_M: for each pair of objects p and q, set D_M(p, q) to be the number of characters that p and q both possess.
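The definition of the shared-character matrix translates directly into code. A minimal sketch, assuming M is given as a list of 0/1 rows (one row per object, one column per character); the function name `shared_character_matrix` and the example data are illustrative only.

```python
def shared_character_matrix(M):
    """D_M[p][q] = number of characters (columns of M) that objects p and q both possess."""
    n = len(M)
    return [[sum(a & b for a, b in zip(M[p], M[q])) for q in range(n)]
            for p in range(n)]


# Three objects, three binary characters (made-up data).
M = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
D_M = shared_character_matrix(M)
```

Note that with this definition the diagonal entry D_M[p][p] is simply the number of characters p possesses. Building the matrix this way costs O(n^2 m) time.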
Ultrametric Problem: Perfect Phylogeny Lemma: If M has a perfect phylogeny, then D_M is a min-ultrametric matrix. Proof: convert M’s perfect phylogeny T to a min-ultrametric tree for D_M • Let T be the perfect phylogeny for M. • Label T’s root with zero. • Traverse T from top to bottom, for each node v: • Let p_v be the number labeling node v’s parent. • Let e_v be the number of characters labeling the edge into v. • Label node v with the sum p_v + e_v
Ultrametric Problem: Perfect Phylogeny • The label of node v is the number of characters common to all leaves in the subtree rooted at v. • If v is the immediate parent of leaves p and q, then the label of v is D_M(p, q) • The numbers labeling nodes on any path from the root are strictly increasing. • The result is a min-ultrametric tree for matrix D_M.
Ultrametric Problem: Perfect Phylogeny Algorithm: perfect phylogeny via ultrametrics: • Create matrix D_M from M. • Attempt to create a min-ultrametric tree T´ from D_M. If not possible, then M has no perfect phylogeny. • If T´ was successfully created in step 2: • Attempt to label its edges with the m characters of M. • If not possible, then M has no perfect phylogeny. • O/w the modified T´ is the perfect phylogeny T. Note: T´ may be min-ultrametric but M may not have a perfect phylogeny, hence the check in step 3.
Ultrametric Problem: Perfect Phylogeny Final notes on the centrality of the ultrametric problem. We can see that the following problems: • perfect phylogeny • tree compatibility can be cast as ultrametric problems. This is not an efficient way to address these problems.
Maximum Parsimony Maximum parsimony: • Perfect phylogeny is a special instance • Can be viewed as a Steiner tree problem on a hypercube Presentation Approach: • Introduce Steiner trees • Hypercube graphs • Maximum parsimony as a Steiner tree problem
Maximum Parsimony Definitions: Let N be a set of nodes. Let E be a set of undirected edges with non-negative weights. Let G = (N, E) be an undirected graph. Let X ⊆ N be a subset of nodes. A Steiner tree ST for X is any connected subtree of G that contains all nodes of X and possibly nodes in N - X. Weighted Steiner Tree Problem: Given G and X, find the Steiner tree of minimum total weight.
Maximum Parsimony More Definitions: A hypercube of dimension d is an undirected graph with 2^d nodes, labeled 0..2^d - 1. Adjacent nodes differ in only one bit position of their labels. The weighted Steiner tree problem on hypercubes: G must be a hypercube.
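The hypercube definition is easy to make concrete with integer labels. In this sketch (function names are illustrative assumptions), the neighbors of a node are obtained by flipping one bit at a time, and the shortest-path distance between two nodes is the Hamming distance between their labels.

```python
def hypercube_neighbors(node, d):
    """Nodes adjacent to `node` in the d-dimensional hypercube: flip exactly one bit."""
    return [node ^ (1 << bit) for bit in range(d)]


def hamming(a, b):
    """Number of bit positions in which the labels a and b differ."""
    return bin(a ^ b).count("1")


neighbors_of_zero = hypercube_neighbors(0b000, 3)
distance_example = hamming(0b101, 0b110)
```

Every node has exactly d neighbors, which is why taxa differing by a single mutation end up adjacent in the parsimony formulation.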
Maximum Parsimony More Definitions: Maximum Parsimony: Occam’s razor applied to phylogenetic reconstruction. A preference for trees requiring fewer evolutionary events to explain data. Gusfield’s definition: The Maximum Parsimony problem is the unweighted Steiner tree problem on a d-dimensional hypercube.
Maximum Parsimony More about the hypercube formulation of MP: • The input taxa X are described as d-length binary vectors. • Recall: adjacent nodes differ in only one label bit position. • Correspondingly, taxa that differ by a single mutation will be adjacent. There is a Steiner tree for X with l edges iff there is a corresponding phylogenetic tree that entails l character-state mutations.
Steiner interpretation of Perfect Phylogeny Define a nontrivial binary character to be a character contained by some taxa but not all. Consider an MP dataset of d nontrivial binary characters Q: what is the minimal number of mutations in the MP tree? A: at least d.
Steiner interpretation of Perfect Phylogeny Q: What is the relation to binary perfect phylogeny? A: the binary perfect phylogeny problem is equivalent to asking if there is an MP solution with a cost of exactly d. Q: What about generalized perfect phylogeny? A: It’s similar. The lower bound must reflect: • the number of character states in the input taxa. • a character having r states in the input taxa is allowed only r-1 transitions.
Steiner interpretation of Perfect Phylogeny Complexity: • No known efficient solution for the Steiner tree problem on unweighted graphs. • Polynomial time solution for the generalized perfect phylogeny problem when r is fixed. Hence this particular Steiner tree problem can be answered in polynomial time.
Steiner interpretation of Perfect Phylogeny MP approximations: • The weighted Steiner tree problem on hypercubes is NP-hard. • There is an approximate method with an error bound of a factor of 11/6. • Also, a minimum spanning tree (MST) can be used to find a Steiner tree with weight less than twice that of the optimal Steiner tree.
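The MST bound can be illustrated on a tiny instance. The sketch below assumes the usual reading of the MST method: build the complete graph on the input taxa, weight each pair by its Hamming distance (the shortest-path length between the two nodes in the hypercube), and take a minimum spanning tree with Prim's algorithm. The function names and the three taxa are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mst_weight(taxa):
    """Weight of a minimum spanning tree over the taxa, using Hamming-distance weights.

    Simple O(n^2) Prim's algorithm on the complete graph.
    """
    n = len(taxa)
    if n == 0:
        return 0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection of each taxon to the tree
    best[0] = 0
    total = 0
    for _ in range(n):
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        total += best[u]
        for w in range(n):
            if not in_tree[w]:
                best[w] = min(best[w], hamming(taxa[u], taxa[w]))
    return total


taxa = ["110", "101", "011"]
weight = mst_weight(taxa)
```

Here every pair of taxa is at Hamming distance 2, so the spanning tree costs 4. A Steiner tree through the extra node 111 needs only 3 edges, which matches the lower bound of one mutation per nontrivial character, so the MST weight is indeed below twice the optimum.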
Phylogenetic Alignment Recall: • phylogenetic alignment was discussed in section 14.8 • The focus was on deriving a multiple alignment enlightened by evolutionary history. • The tree focused emphasis on specific alignment groupings • Internal node sequences were a secondary artifact
Phylogenetic Alignment Phylogenetic alignment as a parsimony problem: In contrast: • we are now interested in the internal sequences • These sequences are waypoints in the evolutionary trajectory leading to the extant taxa • phylogenetic alignment is thus a parsimony problem
Phylogenetic Alignment Hypothesis: optimal phylogenetic alignment describes evolutionary history. Assumptions: • Edit distance realistically models evolutionary distance • Globally optimal phylogenetic alignment captures the essence of the evolutionary process We will look at minimum mutation, a variant of phylogenetic alignment
Fitch-Hartigan minimum mutation problem Defn. minimum mutation problem – variant of phylogenetic alignment problem. Input comprised of: • Tree • Strings labeling the leaves • A multiple alignment of those strings
Fitch-Hartigan minimum mutation problem Q: If you are given the tree and the multiple alignment, what is left to compute? A: the mutations that account for the input data. These mutations should be: • a minimum sequence of site mutations that is • compatible with the given tree and • the given multiple alignment.
Fitch-Hartigan minimum mutation problem Q: How is the input data used to determine the minimum sequence of mutations? • The multiple alignment associates each amino acid with a specific position. • The evolutionary history of the sequences is then treated as a combined but independent evolutionary history of each position. • The tree guides the order of mutations for each position.
Fitch-Hartigan minimum mutation problem Assumptions: • Each column of the alignment can be solved separately • The strings labeling inner nodes adhere to the same alignment The problem reduces to a computation at a single position.
Fitch-Hartigan minimum mutation problem Minimum mutation for a single position: Input: • rooted tree with n nodes • Each leaf is labeled by a single character Output: • Each interior node is labeled by a single character • The labeling minimizes the number of edges between nodes with different labels.
Fitch-Hartigan minimum mutation problem Algorithmic approach: Dynamic Programming • Let T_v denote the subtree rooted at node v • Let C(v) be the cost of the optimal solution for T_v • Let C(v, x) be the cost when v must be labeled by x • Let v_i denote the ith child of node v • Base case: for each leaf v specify C(v) & C(v, x) for all x ∈ S. • C(v) = 0 & C(v, x) = 0 if leaf v is labeled by x. • C(v, x) = ∞ if leaf v is not labeled by x.
Fitch-Hartigan minimum mutation problem When v is an internal node: • The recurrence relations start from the base cases. • Bottom up from leaves • Backtracking is used after all C(v, x) are computed to extract the solution.
Fitch-Hartigan minimum mutation problem Backtracking process: • The root r is labeled by the character x s.t. C(r) = C(r, x) • The traversal is then top-down • If v is labeled x, then v_i is labeled: • character x if C(v_i) + 1 > C(v_i, x) • o/w character y such that C(v_i) = C(v_i, y)
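The bottom-up pass and the backtracking rule can be sketched together for a single alignment column. The recurrence for internal nodes is not spelled out in the text above, so this sketch assumes the standard form C(v, x) = sum over children v_i of min(C(v_i) + 1, C(v_i, x)) and C(v) = min over x of C(v, x), which is consistent with the backtracking rule. The function name, the dictionary-based tree encoding and the example tree are illustrative assumptions.

```python
import math


def minimum_mutation(children, leaf_label, root, alphabet):
    """Label every node so that the fewest edges join differently labeled nodes.

    children:   dict node -> list of child nodes (leaves may be absent or map to [])
    leaf_label: dict leaf -> character
    Returns (cost, labels).
    """
    C = {}   # C[v]    : optimal cost for the subtree rooted at v
    Cx = {}  # Cx[v][x]: optimal cost when v is forced to carry label x

    # Pre-order listing (parents before children); its reverse visits children first.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    for v in reversed(order):  # bottom-up pass
        kids = children.get(v, [])
        if not kids:  # base case: leaf
            Cx[v] = {x: (0 if leaf_label[v] == x else math.inf) for x in alphabet}
        else:         # assumed recurrence, see the lead-in
            Cx[v] = {x: sum(min(C[k] + 1, Cx[k][x]) for k in kids) for x in alphabet}
        C[v] = min(Cx[v].values())

    # Top-down backtracking, following the rule stated on the slide.
    labels = {root: min(alphabet, key=lambda x: Cx[root][x])}
    for v in order:
        x = labels[v]
        for k in children.get(v, []):
            if C[k] + 1 > Cx[k][x]:
                labels[k] = x  # keeping the parent's label is cheaper
            else:
                labels[k] = min(alphabet, key=lambda y: Cx[k][y])
    return C[root], labels


# Made-up example: root r with internal children u and w, four leaves.
children = {"r": ["u", "w"], "u": ["l1", "l2"], "w": ["l3", "l4"]}
leaf_label = {"l1": "A", "l2": "A", "l3": "A", "l4": "C"}
cost, labels = minimum_mutation(children, leaf_label, "r", "ACGT")
```

In this example a single mutation (on the edge into leaf l4) explains the data, and all interior nodes receive A. For a full alignment the function would be called once per column, in line with the assumption that positions evolve independently.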
Fitch-Hartigan minimum mutation problem Let’s evaluate an example: C(v) = 0 & C(v, x) = 0 if leaf v is labeled by x, o/w C(v, x) = ∞.
Fitch-Hartigan minimum mutation problem Time complexity: Bottom-up portion • Let s = |S| • Each node is evaluated w.r.t. each x ∈ S. • For n nodes this gives O(ns) The backtracking portion is O(n) Overall O(ns)
Maximum Parsimony • Most widely used tree building algorithm • Differs from distance-based algorithms: • Does not actually build trees from distances • Parsimony is used to compute the cost of a tree • A search strategy is used to search through all topologies • Goal: find the tree topology with the overall minimum cost
Traditional Parsimony Algorithm: Traditional parsimony [Fitch 1971] • Goal: count the number of substitutions at a site. • Method: recursion, keeping track of • C, the current cost • R_k, the residues at k, the current node