Bioinformatics Algorithms and Data Structures. Chapter 11, sections 4-7. Lecturer: Dr. Rose. Slides by: Dr. Rose. February 4 & 6, 2003.
Edit Graphs • Key idea: weighted edit graph • Defn. Given strings S1 and S2 of lengths n and m, respectively, a weighted edit graph has (n+1) by (m+1) nodes, labelled (i,j), 0 ≤ i ≤ n, 0 ≤ j ≤ m. The edges and edge weights are problem specific.
Edit Graphs • Example: edit distance problem • The weighted graph for the edit distance problem has directed edges from node (i, j) to the nodes (i + 1, j) , (i, j + 1) , and (i + 1, j + 1), provided they exist. • The weight of the directed edges to nodes (i + 1, j) , (i, j + 1) is 1. • The weight of the directed edge to (i + 1, j + 1) is t(i + 1, j + 1). • Figure 11.4 in the textbook shows an edit graph.
Edit Graphs • Thm. An edit transcript for strings S1 and S2 has the minimum number of edit operations if and only if it corresponds to a shortest path from (0,0) to (n,m) in the edit graph. • Cor. The set of all shortest paths from (0,0) to (n,m) in the edit graph specifies all optimal edit transcripts of S1 to S2.
Weight Edit Distance • There are two ways of assigning weight or costs to calculate edit distance: • By edit operation • By alphabet, i.e., different costs for different characters • Our initial approach was to assign weight by edit operation, i.e., 1 for insert, delete, replace, and 0 for match. • We can generalize our approach by assigning the weight d for an insertion or deletion, r for a replacement, and e for a match.
Weight Edit Distance • Q: What values for d, r, and e have we been using? • A: d = 1, r = 1, and e = 0. • Q: What would happen if r > 2*d? • A: Replacements would never occur, because a deletion followed by an insertion has the same effect at cost 2*d. • Defn. The operation-weight distance problem entails finding an edit transcript transforming S1 to S2 with the minimum total operation weight.
Weight Edit Distance • Q: What changes should we make to the definition of edit distance, D(i,j), to reflect operation weight? • We have to specify an operation-specific definition. • The base conditions become: • D(i,0) = i * d. Why? • D(0,j) = j * d. Why?
Weight Edit Distance • The general recurrence becomes: • D(i,j) = min[D(i,j-1) + d, D(i-1,j) + d, D(i-1,j-1) + t(i,j)] • Where t(i,j) = e if S1(i) = S2(j), otherwise t(i,j) = r • Q: Why? • A: The cost of • Delete (from i-1,j) is d • Insert (from i,j-1) is d • Match (from i-1,j-1) is e • Replace (from i-1,j-1) is r
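The base conditions and recurrence above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's code: the function name `operation_weight_distance` and the default values d = 1, r = 1, e = 0 are assumptions chosen to match the unweighted case.

```python
def operation_weight_distance(s1, s2, d=1, r=1, e=0):
    """Operation-weight edit distance D(n, m) between s1 and s2.

    d: cost of an insertion or deletion, r: cost of a replacement,
    e: cost of a match (parameter names follow the slides).
    """
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: D(i,0) = i*d and D(0,j) = j*d.
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # t(i,j) = e on a match, r on a mismatch.
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i][j - 1] + d,      # insert
                          D[i - 1][j] + d,      # delete
                          D[i - 1][j - 1] + t)  # match or replace
    return D[n][m]


print(operation_weight_distance("kitten", "sitting"))  # unit costs
print(operation_weight_distance("a", "b", d=1, r=3))   # r > 2*d
```

With r > 2*d the second call never uses a replacement: deleting `a` and inserting `b` is cheaper.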
Weight Edit Distance • The alternative to operation-weight edit distance is alphabet-weight edit distance. • Idea: different characters have different costs. • Q: How would we modify the edit distance function, D(i,j), to support alphabet-weight edit distance? • A: Let weight(x) denote the weight associated with character x for all x in the alphabet. • Then D(0,0) = 0 and D(i,0) = D(i-1,0) + weight(S1(i)), i.e., the total weight of the first i characters of S1 • And D(0,j) = D(0,j-1) + weight(S2(j)), i.e., the total weight of the first j characters of S2 • Q: What about the general recurrence D(i,j)?
Weight Edit Distance • A: D(i,j) = min[D(i,j-1) + weight(S2(j)), D(i-1,j) + weight(S1(i)), D(i-1,j-1) + t(i,j)] • Where t(i,j) = weight(S2(j)) if S1(i) ≠ S2(j), otherwise 0. • Note: for proteins, edit distance usually refers to alphabet-weight edit distance. • As the text mentions, the weights are usually derived from the PAM matrices of Dayhoff or the BLOSUM matrices of Henikoff. • Edit distance for DNA strings is usually either unweighted or operation-weighted edit distance.
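A minimal Python sketch of the alphabet-weight recurrence follows. It assumes the base conditions accumulate character weights (D(i,0) is the total weight of the first i characters of S1, and likewise for D(0,j)); the function name and the example weight table are hypothetical, not taken from PAM or BLOSUM.

```python
def alphabet_weight_distance(s1, s2, weight):
    """Alphabet-weight edit distance; `weight` maps each character to its cost."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed base conditions: cumulative weight of the deleted/inserted prefix.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight[s1[i - 1]]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight[s2[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # t(i,j) = weight(S2(j)) on a mismatch, 0 on a match.
            t = weight[s2[j - 1]] if s1[i - 1] != s2[j - 1] else 0
            D[i][j] = min(D[i][j - 1] + weight[s2[j - 1]],  # insert S2(j)
                          D[i - 1][j] + weight[s1[i - 1]],  # delete S1(i)
                          D[i - 1][j - 1] + t)              # match or replace
    return D[n][m]


toy_weight = {"a": 1, "b": 2, "c": 3}  # illustrative weights only
print(alphabet_weight_distance("a", "b", toy_weight))
```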
String Similarity • The relatedness of two strings can be expressed in terms of similarity. • This similarity is usually expressed in terms of alignment rather than in terms of edit distance. • Defn. Let Σ be the alphabet for strings S1 and S2. Let Σ′ be Σ with the additional character ‘-’ denoting a space. For any x and y in Σ′, let s(x,y) denote the value obtained by aligning character x with character y.
String Similarity Defn. The value of alignment A is defined as: $\sum_{i=1}^{l} s(S1´(i), S2´(i))$, where S1´ and S2´ denote the strings after the insertion of spaces and l denotes their common length. If s(x,y) is greater than or equal to zero when x and y match and negative when they mismatch, then we look for the alignment with the largest value.
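As a small illustration of this definition, the sketch below sums s(x,y) over the columns of an alignment. The scoring function `score` (+2 for a match, -2 for a mismatch, -1 for a space) is an assumed example, and both names are hypothetical.

```python
def score(x, y):
    """Assumed example scoring: +2 match, -2 mismatch, -1 if either is a space."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def alignment_value(a1, a2, s=score):
    """Value of an alignment given as two equal-length strings with '-' for spaces."""
    if len(a1) != len(a2):
        raise ValueError("aligned strings must have the same length")
    return sum(s(x, y) for x, y in zip(a1, a2))


print(alignment_value("axab-cs", "ax-bacs"))  # five matches, two spaces
```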
String Similarity Example: Σ = {a, g, c, t}. Let s(x,y) be defined by: [scoring matrix from the original slide]
Q: What is the value of the following alignment?
a t a - a c t g t
g t a g a c - g t
String Similarity Defn. Given a scoring matrix over the alphabet extended with the space character, the similarity of two strings S1 and S2 is the value of the alignment A of S1 and S2 that maximizes the total alignment value. This is also called the optimal alignment value of S1 and S2.
Computing Similarity Q: How can we compute the optimal alignment value of the strings S1 and S2? A: Use dynamic programming. Defn. Let V(i,j) denote the value of the optimal alignment of prefixes S1[1..i] and S2[1..j]. If strings S1 and S2 have lengths n and m, respectively, then the value of the optimal alignment of these strings is given by V(n,m). Q: What do you guess the time complexity will be? A: O(nm)
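Computing Similarity Define the general recurrence relation as: V(i,j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j) + s(S1(i),_), V(i, j - 1) + s(_, S2(j))] The optimal alignment value relation is defined similarly to the edit distance relation. Base Conditions: V(0,0) = 0, V(i,0) = V(i - 1, 0) + s(S1(i),_), and V(0,j) = V(0, j - 1) + s(_, S2(j)), i.e., a prefix aligned entirely against spaces.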
Computing Similarity V(i,j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j ) + s(S1(i),_), V(i, j - 1) + s(_, S2(j))] Q: What does this recurrence relation say? A: The optimal alignment of the prefixes S1[1..i] and S2[1..j] is the maximum of: • The optimal alignment of S1[1..i-1] and S2[1..j-1] extended by aligning S1(i) and S2(j). • The optimal alignment of S1[1..i-1] and S2[1..j] extended by aligning S1(i) with a space. • The optimal alignment of S1[1..i] and S2[1..j-1] extended by aligning a space with S2(j).
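The recurrence can be sketched in Python as below. The base conditions used here (a prefix aligned entirely against spaces) and the example scoring function are assumptions; `similarity` and `score` are hypothetical names.

```python
def score(x, y):
    """Assumed example scoring: +2 match, -2 mismatch, -1 if either is a space."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def similarity(s1, s2, s=score):
    """Optimal global alignment value V(n, m), computed in O(nm) time."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed base conditions: each prefix aligned against spaces only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(s1[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("-", s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),  # align S1(i), S2(j)
                          V[i - 1][j] + s(s1[i - 1], "-"),            # S1(i) with a space
                          V[i][j - 1] + s("-", s2[j - 1]))            # space with S2(j)
    return V[n][m]


print(similarity("axabcs", "axbacs"))
```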
Longest Common Subsequence Defn. A subsequence of a string S is a subset of the characters of S arranged in their original relative order. Example: S = interdepartmentaladministratorstaskforce has the subsequence idiots. Obviously every substring of S is also a subsequence of S. Defn. A common subsequence of two strings is a subsequence that appears in both strings.
Longest Common Subsequence Defn. The longest common subsequence problem entails finding a longest common subsequence (lcs) of two strings. Thm. If a scoring scheme is used in which each matching pair of characters scores 1 and each mismatch or space scores 0, then the matched characters of an optimal alignment form a longest common subsequence.
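A minimal sketch of this theorem in Python: run the similarity recurrence with match = 1 and mismatch or space = 0, then walk back through the table collecting the matched characters. The function name `lcs` is hypothetical.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    # V[i][j] = optimal alignment value of the prefixes under the 1/0 scoring.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0
            V[i][j] = max(V[i - 1][j - 1] + match, V[i - 1][j], V[i][j - 1])
    # Trace back, keeping only the columns that are matches.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("interdepartmentaladministratorstaskforce", "idiots"))
```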
Alignment Graphs Like distance, similarity can be viewed as a path problem: the graph that is analogous to the edit graph (section 11.4) is called an alignment graph. Defn. An alignment graph is a DAG similar to an edit graph in which the edge weights correspond to the values for aligning specific character pairs. The optimal alignment corresponds to the longest path, in terms of the sum of edge weights, from (0,0) to (n,m) of the dynamic programming table. The longest paths (optimal alignments) can be found in O(nm) time.
End-Space Free Alignment End-space free alignment: an alignment variant in which leading and trailing spaces contribute zero weight. Example:
e x a m p l e - h e c o u l d a - - - h a d a - - b e e r
- - - - - - - - h e w o u l d n t a s h o t h i s d e a r
The first eight spaces are free. This encourages (biases towards): • Alignment of one string inside the other or • Alignment of the prefix of one string with the suffix of the other
End-Space Free Alignment Q: When should interior or prefix/suffix matching be preferred? A: When it matches the nature of the problem being modeled. An example is shotgun sequence assembly: Explain! • Start with a large collection of partially overlapping substrings that come from multiple copies of one original, but unknown string. • Use comparisons of pairs of substrings to infer the original string.
End-Space Free Alignment Q: Would you expect substrings that overlap in the original string to show significant alignment? A: Perhaps. In any case, with some slop for sequencing errors, either: • one string would align inside the other or • the prefix of one string would align with the suffix of the other In contrast, a significant alignment of randomly selected substrings from this collection is unlikely. An End-Space Free Alignment would detect this difference and score overlapping substrings higher.
End-Space Free Alignment We can deduce candidate neighbor pairs by: • Computing End-Space Free Alignment for every pair of substrings. • High scoring alignments are likely neighbors. To compute this: • Use a recurrence for global alignment where spaces count. • Change the definition of V(i,0), V(0,j) to address leading spaces: V(i,0) = V(0,j) = 0 for all i and j. • Compute the alignment graph in O(mn) How?
End-Space Free Alignment Unlike global alignment, the value of the optimal alignment is not necessarily in cell (n,m). The optimal alignment value will now be found in • A cell in row n, if the last character of S1 contributes to the value of the alignment but the last characters of S2 do not. • A cell in column m, if the last character of S2 contributes to the value of the alignment but the last characters of S1 do not. • The value of the optimal alignment is the largest value over all cells in row n and column m.
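The two changes described for end-space free alignment (zero base conditions, and taking the answer from row n or column m) can be sketched as follows. The scoring function and the names `score` and `end_space_free_value` are assumptions for illustration.

```python
def score(x, y):
    """Assumed example scoring: +2 match, -2 mismatch, -1 if either is a space."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def end_space_free_value(s1, s2, s=score):
    """Optimal alignment value when leading and trailing spaces are free."""
    n, m = len(s1), len(s2)
    # V(i,0) = V(0,j) = 0: leading spaces cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + s(s1[i - 1], "-"),
                          V[i][j - 1] + s("-", s2[j - 1]))
    # Trailing spaces are free: take the best cell in row n or column m.
    last_row = max(V[n])
    last_col = max(V[i][m] for i in range(n + 1))
    return max(last_row, last_col)


print(end_space_free_value("xxxabc", "abcyyy"))   # suffix/prefix overlap
print(end_space_free_value("abc", "xxabcxx"))     # one string inside the other
```

Both calls report the value of the three matched characters only, since the unmatched ends are aligned with free spaces.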
And now for something completely different: Approximate Matching
Approximate Matching Basic idea: threshold-defined similarity. Defn. A substring T´ of T is an approximate occurrence of P if and only if the optimal alignment of P to T´ has value at least δ, the threshold parameter. Approach: • Use the standard recurrence for global alignment. • Do not charge the spaces that precede the occurrence in T: V(0,j) = 0 for all j. V(i,0) keeps its global-alignment value, the sum of s(P(k),_) for k ≤ i. • Leave backpointers while computing the table
Approximate Matching Q: How can we recognize an approximate occurrence of P in T from the table computation? A: If the length of P is n, then for some j, V(n,j) ≥ δ. More specifically: Thm. An approximate occurrence of P in T ends at position j of T ⇔ V(n,j) ≥ δ. This tells us where in T the approximate occurrence ends. Where in T does it start?
Approximate Matching Thm. (version 0) An approximate occurrence of P in T ends at position j of T ⇔ V(n,j) ≥ δ. This tells us where in T the approximate occurrence ends. Where in T does it start? We can find the start by following the path from cell (n,j) back to (0,k). k is the starting position in T. Thm. (version 1) T[k..j] is an approximate occurrence of P in T ⇔ V(n,j) ≥ δ and there is a path of backpointers from (n,j) to (0,k).
Approximate Matching The table computation takes O(nm). Consider: depending on the threshold δ, T may contain a great many approximate occurrences of P. Q: Can all approximate occurrences be explicitly output in O(nm)? A: Perhaps not. The textbook suggests locating all j s.t. V(n,j) ≥ δ and, for each, explicitly outputting a shortest approximate occurrence. • Traverse backpointers from (n,j) until reaching (0,k) • Choose vertical pointers over diagonal pointers • Choose diagonal pointers over horizontal pointers.
Approximate Matching How does this particular preference produce a shortest occurrence? • Choose vertical pointers over diagonal pointers • Choose diagonal pointers over horizontal pointers. Recall: • Horizontal edges correspond to inserting a space in P; each one consumes a character of T without consuming a character of P, which lengthens the occurrence. Clearly this is to be avoided. • Diagonal edges correspond to matches or mismatches. • Vertical edges correspond to inserting a space in T. At first there seems to be no obvious reason for choosing vertical over diagonal edges, although some preference is needed for tie-breaking. However, a vertical edge consumes no character of T, so choosing vertical edges results in the match that is shortest in T.
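The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions: spaces opposite a prefix of P are charged (V(i,0) accumulates s(P(k),_)) so that every traceback ends in row 0, backpointers are recomputed from the table rather than stored, and the scoring function and the names `score` and `approximate_occurrences` are hypothetical. With 0-based Python slicing, the occurrence reached by a path from (n,j) back to (0,k) is `T[k:j]`.

```python
def score(x, y):
    """Assumed example scoring: +2 match, -2 mismatch, -1 if either is a space."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def approximate_occurrences(P, T, delta, s=score):
    """For every end position j with V(n,j) >= delta, return (j, occurrence)."""
    n, m = len(P), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # V(0,j) = 0 for all j
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(P[i - 1], "-")  # assumption: charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(P[i - 1], T[j - 1]),
                          V[i - 1][j] + s(P[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]))
    hits = []
    for j in range(m + 1):
        if V[n][j] < delta:
            continue
        i, k = n, j
        while i > 0:
            # Preference order: vertical, then diagonal, then horizontal.
            if V[i][k] == V[i - 1][k] + s(P[i - 1], "-"):
                i -= 1
            elif k > 0 and V[i][k] == V[i - 1][k - 1] + s(P[i - 1], T[k - 1]):
                i, k = i - 1, k - 1
            else:
                k -= 1
        hits.append((j, T[k:j]))
    return hits


print(approximate_occurrences("abc", "xxabcxx", delta=6))
```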
Local Alignment • So far we have focused on global alignment. This makes sense if • We expect one string to be contained in the other or • We expect the strings to be closely related. • Example: comparing amino acid sequences from the same protein family.
Local Alignment • Local alignment exposes regions of high similarity. • This may be interesting even if we expect the strings to be globally dissimilar. • Can you think of examples? • Comparing proteins from different protein families • How about searching for lateral gene transfer from prokaryotic genomes to eukaryotic genomes? • Huh????
Local Alignment • Local alignment problem. Find substrings a and b of S1 and S2, respectively, whose optimal global alignment value is the maximum over all pairs of substrings. • Example from text: S1 = pqraxabcstvq, S2 = xyaxbacsll
a = a x a b - c s
b = a x - b a c s
• This global alignment is predicated on: • a score of 2 for a match • a score of -2 for a mismatch • a score of -1 for a space • Resulting in a value of 8.
Computing Local Alignment • Q: How can local alignment be computed? • Q: Can global alignment be used to find local alignment? • A: Not efficiently. Global alignment effectively averages out local similarity. • Use explicit search for local similarity.
Computing Local Alignment • Q: Assuming S1 and S2 have respective lengths n and m, how many pairs of substrings are there? • A: There are O(n²m²) pairs of substrings. • Q: If we wanted to, how could we show there are this many pairs?
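One way to answer the last question is a direct count, sketched here under the assumption that only non-empty substrings are counted. A substring is fixed by its start and end positions, so:

```latex
\[
\#\{\text{substrings of } S_1\} = \frac{n(n+1)}{2}, \qquad
\#\{\text{substrings of } S_2\} = \frac{m(m+1)}{2},
\]
\[
\#\{\text{pairs}\} = \frac{n(n+1)}{2}\cdot\frac{m(m+1)}{2} = \Theta(n^2 m^2).
\]
```

Allowing the empty substring adds one to each factor and does not change the order of growth.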
Computing Local Alignment • Observation: Computing a global alignment separately for each of the O(n²m²) pairs of substrings takes far more than O(nm) time. • Surprisingly, we can compute local alignment in O(nm) even though there are O(n²m²) pairs of substrings. • Assumption: the global alignment of two empty strings has value zero.
Computing Local Alignment • First consider a restricted version of local alignment. • Defn. The local suffix alignment problem entails finding a suffix a of S1[1..i] and a suffix b of S2[1..j] s.t. V(a,b) is the maximum over all pairs of suffixes of S1[1..i] and S2[1..j]. • Let v(i,j) denote the value of the optimal suffix alignment for the index pair i,j.
Computing Local Alignment • Local suffix alignment example: • S1 = abcxdex, S2 = xxcxdeabc, Score 2 for matches and –1 for mismatches or spaces • v(3,4) = 1, how? • The c’s match but there is an additional ‘-’ aligned with x. • v(4,4) = 4, how? • The c’s match and the final x’s match • v(5,4) = 3, how? • Same as v(4,4) but extended with d aligned with ‘-’
Computing Local Alignment • Observation: v(i,j) ≥ 0. • Q: Why is this true? • A: We can always choose a and/or b to be the empty string. • Let v* denote the value of the optimal local alignment for strings of length n and m. • Thm. v* = max[v(i,j): i ≤ n, j ≤ m]
Computing Local Alignment • We need to understand why this theorem, v* = max[v(i,j): i ≤ n, j ≤ m], is true. • Proof: • v* ≥ max[v(i,j): i ≤ n, j ≤ m] since any optimal local suffix alignment is also a local alignment.
Computing Local Alignment • WLOG assume v* is derived from the optimal solution involving substrings a and b with end indices i* and j*. Then a and b define a local suffix alignment for indices i* and j*, thus v* ≤ v(i*,j*) ≤ max[v(i,j): i ≤ n, j ≤ m] • From this it is clear that a solution to the local suffix alignment problem also solves the local alignment problem.
Computing Local Alignment Thm. v(i,j) = max[0, v(i - 1, j - 1) + s(S1(i), S2(j)), v(i - 1, j) + s(S1(i), _), v(i, j - 1) + s(_, S2(j))] • Where v(i, 0) = 0 and v(0, j) = 0 for all i,j Q: What does this recurrence say? A: The solution to the local suffix alignment problem v(i, j) is the largest of: • 0, punt and choose a and b to be empty strings • v(i - 1, j - 1) extended by aligning S1(i) and S2(j) • v(i - 1, j) extended by aligning S1(i) with ‘_’ • v(i, j - 1) extended by aligning ‘_’ with S2(j)
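A minimal Python sketch of this recurrence is given below. The function name `local_suffix_table` and the scoring function `score` are hypothetical; the scoring used (2 for a match, -1 for a mismatch or space) is the one from the local suffix alignment example with S1 = abcxdex and S2 = xxcxdeabc.

```python
def score(x, y):
    """Scoring from the suffix-alignment example: 2 for a match, -1 otherwise."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -1


def local_suffix_table(s1, s2, s=score):
    """Return the table v, where v[i][j] is the optimal local suffix value."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]  # v(i,0) = v(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,                                         # empty suffixes
                          v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),  # align S1(i), S2(j)
                          v[i - 1][j] + s(s1[i - 1], "-"),            # S1(i) with a space
                          v[i][j - 1] + s("-", s2[j - 1]))            # space with S2(j)
    return v


v = local_suffix_table("abcxdex", "xxcxdeabc")
print(v[3][4], v[4][4], v[5][4])
```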
Computing Local Alignment Q: What is the difference between the equations for global alignment and local suffix alignment? A: There are two differences: • The inclusion of 0 in the local suffix alignment recurrence • The base conditions for local suffix alignment, v(i,0) = 0 and v(0,j) = 0 for all i,j. The condition v(0,j) = 0 is shared with finding approximate occurrences, but neither condition holds for general global alignment.
Computing Local Alignment Approach to computing v*: • Compute the table for v(i, j). • Search the entire table for the largest value; let (i*, j*) denote the cell containing the largest value. • Follow backpointers from cell (i*, j*) to the first cell (i´, j´) that has the value zero. This gives the optimal local alignment. • The optimal local alignment substrings are then a = S1[i´+1..i*] and b = S2[j´+1..j*], since the path leaving cell (i´, j´) first consumes S1(i´+1) and S2(j´+1).
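The steps above can be sketched in Python as follows. This is an illustration, not the textbook's code: backpointers are recomputed from the table instead of being stored, ties are broken in favor of diagonal moves (an arbitrary choice), and `score` and `local_alignment` are hypothetical names. The scoring is the one from the textbook example (2 for a match, -2 for a mismatch, -1 for a space).

```python
def score(x, y):
    """Scoring from the textbook example: 2 match, -2 mismatch, -1 space."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def local_alignment(s1, s2, s=score):
    """Return (v*, a, b): the optimal local value and the two substrings."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,
                          v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          v[i - 1][j] + s(s1[i - 1], "-"),
                          v[i][j - 1] + s("-", s2[j - 1]))
            if v[i][j] > best:  # remember the cell (i*, j*) with the largest value
                best, bi, bj = v[i][j], i, j
    # Trace back from (i*, j*) until a cell with value zero is reached.
    i, j = bi, bj
    while v[i][j] > 0:
        if v[i][j] == v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]):
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + s(s1[i - 1], "-"):
            i -= 1
        else:
            j -= 1
    # 0-based slices s1[i:bi] and s2[j:bj] are S1[i'+1..i*] and S2[j'+1..j*].
    return best, s1[i:bi], s2[j:bj]


print(local_alignment("pqraxabcstvq", "xyaxbacsll"))
```

On the textbook strings this recovers the substrings axabcs and axbacs with value 8.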
Computing Local Alignment Analysis of computing v*: • We know that computing the table to solve v* takes time O(nm). • The table contains all optimal local alignments for v(i, j). An alignment can be found by locating a cell with v* and tracing back from it.