Explore the sum-of-pairs scoring scheme in multiple sequence alignment, its theoretical justification, computation using dynamic programming, and the impact of different parameters on the alignment.
Bioinformatics Algorithms and Data Structures Chapter 14.6-8: Multiple Alignment Lecturer: Dr. Rose Slides by: Dr. Rose February 28, 2003
Sum-of-Pairs Defn. The sum-of-pairs (SP) score of a multiple alignment is the sum of the scores of all the pairwise global alignments that it induces. From the previous example:
1 A A T - G G T T T
2 A A - C G T T A T
3 T A T C G - A A T
SP = 4 + 5 + 4 = 13
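The SP computation for the example can be sketched as follows. This is a minimal illustration, assuming unit mismatch and space costs and a zero match cost (the slides do not state the scoring scheme, but these values reproduce 4 + 5 + 4 = 13); the names `pair_score` and `sp_score` are made up for the sketch.

```python
from itertools import combinations

def pair_score(row_a, row_b, mismatch=1, space=1):
    """Distance of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column holding two spaces drops out of the induced alignment
        if a == "-" or b == "-":
            total += space
        elif a != b:
            total += mismatch
    return total

def sp_score(rows, mismatch=1, space=1):
    """Sum the induced pairwise scores over every pair of rows."""
    return sum(pair_score(a, b, mismatch, space) for a, b in combinations(rows, 2))

alignment = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
print(sp_score(alignment))  # 13 under the assumed unit costs
```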
Sum-of-Pairs Q: What theoretical justification is there for adopting the SP score? Wait for response….. A: None. Or rather, none more than for any other multiple alignment scoring scheme. In practice it is a good heuristic and is popular. Q: How can we compute a global alignment M with the minimum sum-of-pairs score? A: Why, dynamic programming, of course!
Sum-of-Pairs Assume that we want to align k strings, each of length n. Q: What is the time complexity of the DP solution? A: Θ(n^k). Exact SP alignment has been shown to be NP-complete. Q: So what should we do? A: Choose a small k. In practice, the NP-completeness of a problem often does not mean that the sky is falling.
Sum-of-Pairs Q: How will k affect the recurrence relation? The recurrence relation for k = 3 strings (from here on, i, j, and k index positions in the first, second, and third string) is:
D(i, j, k) = min[
  D(i-1, j-1, k-1) + ?,
  D(i-1, j-1, k) + ?,
  D(i-1, j, k-1) + ?,
  D(i, j-1, k-1) + ?,
  D(i-1, j, k) + ?,
  D(i, j-1, k) + ?,
  D(i, j, k-1) + ? ]
Sum-of-Pairs Let’s consider each term of the recurrence in turn: • D(i-1, j-1, k-1) is the diagonal cell in all three dimensions. Q: What should be the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k)? Recall that for k = 2, if S1(i) = S2(j) the cost is the match cost; otherwise S1(i) ≠ S2(j) and we incur the mismatch cost. A: The sum of the pairwise match comparisons, i.e., ij, jk, ik.
Sum-of-Pairs Let m(i, j) denote the pairwise character match function, defined as: m(i, j) = matchCost if the characters match; m(i, j) = mismatchCost if the characters mismatch. Then the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k) is m(i, j) + m(j, k) + m(i, k). Hence the term cost is: • D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k)
Sum-of-Pairs The next term: • D(i-1, j-1, k) is the diagonal cell in the first two dimensions. Q: What should be the SP transition cost for D(i-1, j-1, k) → D(i, j, k)? We have two types of cases to consider: • The pairwise diagonal case: (i-1, j-1) → (i, j) • The two pairwise space insertion cases: (i-1, k) → (i, k) and (j-1, k) → (j, k)
Sum-of-Pairs The cost will be the sum of the pairwise match and space insertion costs: • m(i, j) for (i-1, j-1) → (i, j), and • spacecost for (i-1, k) → (i, k) and spacecost for (j-1, k) → (j, k). Then the SP transition cost for D(i-1, j-1, k) → D(i, j, k) is m(i, j) + 2 * spacecost. Hence the term cost is: • D(i-1, j-1, k) + m(i, j) + 2 * spacecost
Sum-of-Pairs Similarly, the third and fourth term costs are: • D(i - 1, j, k - 1) + m(i, k) + 2 * spacecost, • D(i, j - 1, k - 1) + m(j, k) + 2 * spacecost Note the similarity in the fifth, sixth, and seventh terms: • D(i -1, j , k ) + ? • D(i, j - 1, k ) + ? • D(i, j , k - 1) + ? Q: What should be the cost for transitions from them?
Sum-of-Pairs For D(i-1, j, k) we have two types of cases to consider: • The pairwise no-change case: (j, k) → (j, k) • The two pairwise space insertion cases: (i-1, j) → (i, j) and (i-1, k) → (i, k). Then the SP transition cost for D(i-1, j, k) → D(i, j, k) is 0 + 2 * spacecost. Hence the term cost is: • D(i-1, j, k) + 2 * spacecost
Sum-of-Pairs Similarly, the sixth and seventh term costs are: • D(i, j-1, k) + 2 * spacecost, • D(i, j, k-1) + 2 * spacecost. Hence
D(i, j, k) = min[
  D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k),
  D(i-1, j-1, k) + m(i, j) + 2 * spacecost,
  D(i-1, j, k-1) + m(i, k) + 2 * spacecost,
  D(i, j-1, k-1) + m(j, k) + 2 * spacecost,
  D(i-1, j, k) + 2 * spacecost,
  D(i, j-1, k) + 2 * spacecost,
  D(i, j, k-1) + 2 * spacecost ]
Sum-of-Pairs Q: What about the boundary cells on the 3 faces of the table? • D(i, j, 0), • D(i, 0, k), • D(0, j, k) Observation: Each case degenerates into the familiar two-string alignment distance + space costs for the empty string argument. Approach: represent these cases in terms of pair-wise distance + space costs.
Sum-of-Pairs Let D_{1,2}(i, j) denote the pairwise distance between S1[1..i] and S2[1..j]. D_{1,3}(i, k) and D_{2,3}(j, k) are analogously defined. Consider D(i, j, 0): D(i, j, 0) = D_{1,2}(i, j) + ? * spaceCost. Q: What is the space cost, i.e., how many spaces? A: Each of the i characters of S1 and each of the j characters of S2 is paired with a space in the (empty) third string, hence: D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost
Sum-of-Pairs By this argument, the boundary cells are given by: • D(i, j, 0) = D_{1,2}(i, j) + (i + j) * spaceCost, • D(i, 0, k) = D_{1,3}(i, k) + (i + k) * spaceCost, • D(0, j, k) = D_{2,3}(j, k) + (j + k) * spaceCost, • D(0, 0, 0) = 0
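The recurrence and its boundary cells can be sketched in a few lines. This is an illustrative version under assumed costs (matchCost = 0, mismatchCost = 1, spaceCost = 1); the name `sp3_distance` is hypothetical. Rather than spelling out the seven terms, it enumerates the seven predecessor offsets and scores the column each one implies as the sum of its three pairwise costs, which gives the same term costs as above. Predecessors that fall outside the table are skipped, so the boundary faces need no special case.

```python
from itertools import product

def sp3_distance(s1, s2, s3, match=0, mismatch=1, space=1):
    """Minimum SP distance of three strings via the seven-term recurrence."""
    def pair(x, y):
        if x == "-" and y == "-":
            return 0                      # the pairwise "no change" case
        if x == "-" or y == "-":
            return space
        return match if x == y else mismatch

    n1, n2, n3 = len(s1), len(s2), len(s3)
    inf = float("inf")
    D = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # The seven predecessor offsets: every 0/1 vector except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue                  # predecessor lies outside the table
            a = s1[i - 1] if di else "-"  # character or space placed in each row
            b = s2[j - 1] if dj else "-"
            c = s3[k - 1] if dk else "-"
            cost = pair(a, b) + pair(a, c) + pair(b, c)
            D[i][j][k] = min(D[i][j][k], D[pi][pj][pk] + cost)
    return D[n1][n2][n3]

print(sp3_distance("A", "B", "C"))  # 3: one column with three pairwise mismatches
print(sp3_distance("AATGGTTT", "AACGTTAT", "TATCGAAT"))
```

The table has (n+1)^3 cells and each cell looks at seven predecessors, which is the Θ(n^k) behavior for k = 3.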
Sum-of-Pairs Speedup Q: How can we speed up our DP approach? A: Use forward dynamic programming. Note: so far we have used backward dynamic programming, i.e., cell (i, j, k) looks back to the seven cells that can influence its value. In contrast, forward DP sends the result of cell (i, j, k) forward to the seven cells whose values it could influence.
Sum-of-Pairs Speedup Q: How does this speed things up? A: It doesn’t, if we always send cell (i, j, k)’s value forward. The only significant way to improve on Θ(n^k) is to avoid computing all n^k cells in the DP table. We will use forward DP to reduce the number of cells that we compute in the DP table.
Sum-of-Pairs Speedup Let’s rethink this problem: • View the optimal alignment problem as the shortest path through the weighted edit distance graph. • We are looking for the shortest path from (0,0,0) to (n,n,n). • When node (i, j, k) is computed, we have the shortest path from (0,0,0) to (i, j, k). • The value of node (i, j, k) is sent forward to the seven neighboring nodes that it can influence
Sum-of-Pairs Speedup Let w be a node reached by an outgoing edge from (i, j, k). • The true shortest distance from (0,0,0) to w is the value computed after w has been updated by every node with an edge into it. • A queue is used to order the nodes for processing. • The final shortest distance for the node v at the head of the queue is set and node v is removed. • Every neighbor w of v is then updated; w is placed in the queue if it is not already there.
Sum-of-Pairs Speedup At this point we borrow an A*-like idea: if (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n), then avoid passing its value forward. More importantly, avoid putting its neighbors that are not already in the queue into the queue. The trick is deciding that (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n). Q: How do we pull this rabbit out of our hat?
Sum-of-Pairs Speedup Define d_{1,2}(i, j) to be the edit distance between suffixes S1[i..n] and S2[j..n]. Define d_{1,3}(i, k) and d_{2,3}(j, k) analogously. Note: these edit distances can be computed in O(n^2) time via DP on the reversed strings. Observation: any shortest path from (i, j, k) to (n,n,n) must have distance at least d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k)
Sum-of-Pairs Speedup Suppose we have an alignment (from somewhere) with an SP distance score z. Core idea: if D(i, j, k) + d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k) > z, then node (i, j, k) cannot be on any shortest path. • Do not pass its value forward. • Do not put its neighbors reached by outgoing edges onto the queue.
Sum-of-Pairs Speedup Benefits of being able to prune cell (i, j, k): • We automatically prune many of its descendants. • We don’t process all n^k cells in a k-string problem. Big win!!!! • The computation is still exact and will find the optimal alignment.
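The pruning idea can be sketched for three strings as below. This is not the MSA program's code, only an illustration under assumed unit costs; `suffix_distances` and `pruned_sp3` are hypothetical names, and the bound z must be supplied by the caller. A min-heap keyed on the cell coordinates stands in for the queue: every successor is lexicographically larger than its predecessor, so cells come off the heap in a valid topological order.

```python
import heapq
from itertools import product

def suffix_distances(a, b, mismatch=1, space=1):
    """d[i][j] = edit distance between the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = float("inf")
            if i < n:
                best = min(best, d[i + 1][j] + space)
            if j < m:
                best = min(best, d[i][j + 1] + space)
            if i < n and j < m:
                best = min(best, d[i + 1][j + 1] + (0 if a[i] == b[j] else mismatch))
            d[i][j] = best
    return d

def pruned_sp3(s1, s2, s3, z, mismatch=1, space=1):
    """Forward DP with pruning; returns (SP distance, number of cells expanded)."""
    def pair(x, y):
        if x == "-" and y == "-":
            return 0
        if x == "-" or y == "-":
            return space
        return 0 if x == y else mismatch

    d12 = suffix_distances(s1, s2, mismatch, space)
    d13 = suffix_distances(s1, s3, mismatch, space)
    d23 = suffix_distances(s2, s3, mismatch, space)
    goal = (len(s1), len(s2), len(s3))
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    best = {(0, 0, 0): 0}
    queue = [(0, 0, 0)]
    expanded = 0
    while queue:
        cell = heapq.heappop(queue)
        i, j, k = cell
        if cell == goal:
            return best[cell], expanded
        if best[cell] + d12[i][j] + d13[i][k] + d23[j][k] > z:
            continue  # pruned: value is not sent forward, neighbors are not queued
        expanded += 1
        for di, dj, dk in moves:
            nxt = (i + di, j + dj, k + dk)
            if nxt[0] > goal[0] or nxt[1] > goal[1] or nxt[2] > goal[2]:
                continue
            a = s1[i] if di else "-"
            b = s2[j] if dj else "-"
            c = s3[k] if dk else "-"
            value = best[cell] + pair(a, b) + pair(a, c) + pair(b, c)
            if nxt not in best:
                best[nxt] = value
                heapq.heappush(queue, nxt)
            elif value < best[nxt]:
                best[nxt] = value
    return None, expanded  # reached only if z is below the optimal distance

strings = ("AATGGTTT", "AACGTTAT", "TATCGAAT")
print(pruned_sp3(*strings, z=13))            # z taken from the SP = 13 example alignment
print(pruned_sp3(*strings, z=float("inf")))  # no pruning: every cell is expanded
```

Comparing the two printed cell counts shows how much of the table the bound removes, while the reported distance stays the same.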
Sum-of-Pairs Speedup The program called MSA implements the speedup we are discussing. Cold shower: • MSA can align 6 strings with n = ~200. • It is unlikely to be able to align tens or hundreds of strings. Still, the full table would have 200^6 cells (= 6.4 * 10^13 cells), which would otherwise be impossible.
Bounded-Error Approximation for SP-Alignment Q: Where do we get z from? A: We will use a bounded-error approximation method. Properties of the specific method we will discuss: • Polynomial worst-case time complexity • The SP-score is less than twice the optimal value.
Bounded-Error Approximation for SP-Alignment Idea: focus on alignments consistent with a tree. Q: What do we mean by “consistent with a tree”?
Bounded-Error Approximation for SP-Alignment Informal explanation: • A graph edge denotes a relation between two nodes. • Recall that D(S_i, S_j) is the optimal weighted edit distance between S_i and S_j. • We could let D(S_i, S_j) be the edge relation between the node labeled S_i and the node labeled S_j.
Bounded-Error Approximation for SP-Alignment Informal explanation continued: • Suppose we have a multiple alignment M. • Suppose we construct an unrooted tree from a subset of such edges between nodes labeled with strings from M. • We call the alignment of the strings represented in the tree consistent with the tree. Recall that D(S_i, S_j) is the edge relation.
Bounded-Error Approximation for SP-Alignment Example from text:
A X X _ Z
A X _ _ Z
A _ X _ Z
A Y _ _ Z
A Y X X Z
Bounded-Error Approximation for SP-Alignment Defn. More formally, let: • S be a set of distinct strings. • T be an unrooted tree whose nodes are labeled with strings from set S. • M be a multiple alignment of the strings in S. M is consistent with T if the induced pairwise alignment of S_i and S_j has score D(S_i, S_j) for each pair of strings (S_i, S_j) that label adjacent nodes in T.
Bounded-Error Approximation for SP-Alignment Thm. For any set of strings S and for any tree T whose nodes are labeled by distinct strings from set S, we can efficiently find a multiple alignment M(T) of S that is consistent with T. Proof sketch: construct M(T) of S one string at a time. Base case: • Pick two strings S_i and S_j labeling nodes adjacent in T. • Create M_2(T), a two-string alignment with distance D(S_i, S_j).
Bounded-Error Approximation for SP-Alignment Inductive Hypothesis: Assume the theorem holds for k ≥ 2 strings, i.e., M_k(T) is consistent with T. Inductive Step: show that the theorem holds for k + 1 strings. • Pick a string S_j not in M_k(T) such that it labels a node adjacent to a node labeled S_i already in M_k(T). • Optimally align S_j with S_i (S_i with its spaces from M_k(T)). • Add S_j (S_j with spaces) to M_k(T), creating M_{k+1}(T). Look at the detailed proof (pg. 348) to see how the issue of inserted spaces is handled.
Bounded-Error Approximation for SP-Alignment By construction: • S_j and S_i have distance D(S_i, S_j). • M_{k+1}(T) is consistent with T. By induction, M(T) of S is consistent with T and is efficiently computed.
Bounded-Error Approximation for SP-Alignment We need some more definitions at this point: Defn. The center string S_c ∈ S, where S is a set of k strings, is the string that minimizes M = Σ_{S_j ∈ S} D(S_c, S_j). Defn. The center star is a star tree of k nodes, with the center node labeled S_c and each of the k-1 remaining nodes labeled by a distinct string in S − {S_c}.
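Picking the center string takes one pairwise distance per pair of strings. The sketch below assumes unit mismatch and space costs for D; the function names and the four example strings are illustrative only (the strings are the ungapped rows of the earlier example with the repeated one dropped).

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a, b), keeping only two rows of the DP table."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def center_string(strings):
    """Return (S_c, M), where S_c minimizes M = sum over S_j of D(S_c, S_j)."""
    totals = [sum(edit_distance(s, t) for t in strings) for s in strings]
    c = min(range(len(strings)), key=totals.__getitem__)  # first minimizer on ties
    return strings[c], totals[c]

S = ["AXXZ", "AXZ", "AYZ", "AYXXZ"]
print(center_string(S))  # ('AXXZ', 4); 'AXZ' ties with the same total
```

For k strings of length about n this costs O(k^2 n^2) time, which is where the polynomial bound of the approximation method comes from.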
Bounded-Error Approximation for SP-Alignment Defn. The multiple alignment M_c of the strings in S is the multiple alignment consistent with the center star. Defn. Let d(S_i, S_j) denote the score of the pairwise alignment of strings S_i and S_j induced by M_c. Defn. Let d(M) denote the score of the alignment M. Observations: • d(S_i, S_j) ≥ D(S_i, S_j) • d(M_c) = Σ_{i<j} d(S_i, S_j).
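One way to build an alignment consistent with the center star is to align each string optimally with the center and merge the result column by column, keeping every space already placed in the center row. The sketch below assumes unit costs and that the center index `c` has already been chosen; `align_pair`, `add_row`, and `center_star` are hypothetical names, and this is not presented as the textbook's exact procedure.

```python
def align_pair(a, b, mismatch=1, space=1):
    """Optimal global alignment of a and b; returns (row_a, row_b, distance)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * space
    for j in range(1, m + 1):
        D[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub, D[i - 1][j] + space, D[i][j - 1] + space)
    row_a, row_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to (0, 0)
        sub = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + space:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), D[n][m]

def add_row(rows, center_row, new_row):
    """Merge the pairwise alignment (center_row, new_row) into rows; rows[0] is the center."""
    merged = [[] for _ in range(len(rows) + 1)]
    p = q = 0
    while p < len(rows[0]) or q < len(center_row):
        old_gap = p < len(rows[0]) and rows[0][p] == "-"
        new_gap = q < len(center_row) and center_row[q] == "-"
        take_old = old_gap or not new_gap  # consume a column of the existing alignment
        take_new = new_gap or not old_gap  # consume a column of the pairwise alignment
        for r, row in enumerate(rows):
            merged[r].append(row[p] if take_old else "-")
        merged[-1].append(new_row[q] if take_new else "-")
        p += int(take_old)
        q += int(take_new)
    return ["".join(r) for r in merged]

def center_star(strings, c):
    """Alignment consistent with the star centered on strings[c]; center row first."""
    rows = [strings[c]]
    for idx, s in enumerate(strings):
        if idx != c:
            center_row, new_row, _ = align_pair(strings[c], s)
            rows = add_row(rows, center_row, new_row)
    return rows

S = ["AXXZ", "AXZ", "AYZ", "AYXXZ"]
for row in center_star(S, 0):
    print(row)
```

Extra columns only ever pair a space in the center row with a space in another row, and such columns drop out of the induced pairwise alignment, so each string keeps its optimal distance to the center.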
Bounded-Error Approximation for SP-Alignment Defn. The triangle inequality with respect to a scoring scheme is the relation s(x, z) ≤ s(x, y) + s(y, z) for any three characters x, y, and z. We can extend the triangle inequality from the scoring scheme for characters to string alignment.
Bounded-Error Approximation for SP-Alignment Lemma. If a 2-string scoring scheme that satisfies the triangle inequality is used, then for any S_i and S_j: d(S_i, S_j) ≤ d(S_i, S_c) + d(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j). Proof sketch: Notice that for each column we have s(x, z) ≤ s(x, y) + s(y, z). The inequality in the lemma follows immediately. The equality holds since all strings are optimally aligned with S_c.
Bounded-Error Approximation for SP-Alignment We can now establish the bounded-error approximation: Defn. Let M* denote the optimal alignment of the k strings of S. Defn. Let d*(S_i, S_j) denote the pairwise alignment score of the strings S_i and S_j induced by M*.
Bounded-Error Approximation for SP-Alignment Thm. d(M_c)/d(M*) ≤ 2(k − 1)/k < 2. See the proof on page 350 for details (it depends mainly on the previous lemma). Corollary: kM/2 ≤ Σ_{i<j} D(S_i, S_j) ≤ d(M*) ≤ d(M_c) ≤ [2(k − 1)/k] Σ_{i<j} D(S_i, S_j) • Recall that M = Σ_{S_j ∈ S} D(S_c, S_j). • The alignment score D(S_i, S_j) is not based on M_c or M*. Observation: d(M_c)/Σ_{i<j} D(S_i, S_j) gives a measure of the goodness of M_c and is guaranteed to be less than 2.
Consensus Objective Functions First fact of consensus representations: There is no consensus as to how to define consensus. Consequently, we will look at several definitions. Steiner consensus strings: Defn. Given a set of strings S and a string S′, the consensus error of S′ relative to S is E(S′) = Σ_{S_j ∈ S} D(S′, S_j). S′ is not required to be a member of S.
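The consensus error is a sum of pairwise distances, so it is cheap to evaluate for any given candidate; the hard part is searching over all candidates. A small sketch, assuming unit mismatch and space costs for D and using illustrative strings and hypothetical function names:

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a, b), keeping only two rows of the DP table."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def consensus_error(candidate, strings):
    """E(S') = sum over S_j in S of D(S', S_j); the candidate need not be in S."""
    return sum(edit_distance(candidate, s) for s in strings)

S = ["AXXZ", "AXZ", "AYZ", "AYXXZ"]
print(consensus_error("AXZ", S))   # 4, candidate drawn from S
print(consensus_error("AYXZ", S))  # 4, candidate that is not a member of S
```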
Consensus Objective Functions Defn. Given a set of strings S, an optimal Steiner string S* for S minimizes the consensus error E(S*). S* is not required to be a member of S. Observations: • in S* we are trying to capture the essential common features in S. • Computing E(S*) appears to be a hard problem.
Consensus Objective Functions No efficient method for finding S* is known. • We will consider an approximate method. Lemma: Assume that S contains k strings and that the scoring scheme satisfies the triangle inequality. There exists a string S′ ∈ S such that E(S′)/E(S*) ≤ 2. Q: What does this lemma say? (Proof sketch next slide)
Consensus Objective Functions Proof sketch: For any i, D(S′, S_i) ≤ D(S′, S*) + D(S*, S_i), so E(S′) = Σ_{S_j ∈ S} D(S′, S_j) and Σ_{S_j ∈ S} D(S′, S_j) ≤ Σ_{S_j ≠ S′} [D(S′, S*) + D(S*, S_j)]. But Σ_{S_j ≠ S′} [D(S′, S*) + D(S*, S_j)] = (k−2) D(S′, S*) + E(S*). Therefore E(S′) ≤ (k−2) D(S′, S*) + E(S*)
Consensus Objective Functions Q: Where do we find a good candidate for S′? A: S_c, the center string. Recall that S_c minimizes Σ_{S_j ∈ S} D(S_c, S_j). Thm. E(S_c)/E(S*) ≤ 2 − 2/k, assuming the scoring scheme satisfies the triangle inequality. Proof. Follows immediately from the previous lemma and the observation that E(S_c) ≤ E(S′).
Consensus Objective Functions Consensus strings from multiple alignment Defn. Let M be a multiple alignment of strings S, the consensus character of column i of M is the character that minimizes the summed distance to all the characters in column i. Note: • the summed distance depends on the pairwise scoring scheme. • The plurality character is the consensus character for some scoring schemes.
Consensus Objective Functions Defn. Let d(i) denote the minimum sum in column i. Defn. The consensus string S_M derived from alignment M is the concatenation of the consensus characters for each column of M. Q: How can we evaluate the goodness of S_M? A: One possibility is Goodness(S_M) = Σ_i D(S_M, S_i), i.e., see how good a Steiner string S_M is. Consider a different approach…..
Consensus Objective Functions Defn. The alignment error of S_M, a consensus string containing q characters, is Σ_{i=1}^{q} d(i). Defn. The alignment error of M is defined as the alignment error of S_M, its consensus string. Example:
1 A A T - G - T T T
2 A A - C G T T A T
3 T A T C G - A A T
  A A T C G - T A T  Consensus (alignment error of ?)
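The column-wise definitions can be checked on the example with a short sketch. It assumes unit mismatch and space costs and restricts the candidate consensus characters to the symbols that appear somewhere in the alignment (including the space); `column_cost` and `consensus` are hypothetical names.

```python
def column_cost(c, column, mismatch=1, space=1):
    """Summed distance from candidate character c to every character in the column."""
    total = 0
    for x in column:
        if c == x:
            continue
        total += space if "-" in (c, x) else mismatch
    return total

def consensus(rows, mismatch=1, space=1):
    """Return (consensus string S_M, alignment error) for equal-length rows."""
    alphabet = sorted(set("".join(rows)))  # assumed candidate set, '-' included
    chars, error = [], 0
    for column in zip(*rows):
        costs = {c: column_cost(c, column, mismatch, space) for c in alphabet}
        best = min(alphabet, key=costs.__getitem__)  # consensus character of the column
        chars.append(best)
        error += costs[best]                         # d(i) for this column
    return "".join(chars), error

rows = ["AAT-G-TTT", "AA-CGTTAT", "TATCG-AAT"]
string, error = consensus(rows)
print(string)  # AATCG-TAT, matching the consensus row of the example
print(error)
```

Under these assumed costs every column of the example has a unique cheapest character, which is also its plurality character.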
Consensus Objective Functions Defn. The optimal consensus multiple alignment is a multiple alignment M whose consensus string S_M has the smallest alignment error over all possible multiple alignments of S.
Consensus Objective Functions The 3 notions of consensus we have discussed are: • The Steiner string S* defined from S. • The consensus string S_M derived from M, with goodness related to its function as a Steiner string. • The consensus string S_M derived from M, with goodness related to its ability to reflect the column-wise properties of M. Surprisingly (or not), they lead to the same multiple alignment.
Consensus Objective Functions Let’s investigate the assertion that these concepts result in the same multiple alignment. Let S be a set of k strings. Let T be the star tree with the Steiner string S* at the root and each of the k strings of S at a distinct leaf of T. Then: Defn. The multiple alignment consistent with S* is the multiple alignment of S ∪ {S*} consistent with T.