Longest Common Subsequence
• Strand of DNA: a string of molecules called bases
• Four possible bases: adenine (A), guanine (G), cytosine (C) and thymine (T)
• A DNA strand can be expressed as a string over the alphabet {A,C,G,T}
• Examples
  • organism 1: ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
  • organism 2: GTCGTTCGGAATGCCGTTGCTCTGTGTAAA
• “Similarity” of strands is used as a measure of similarity of organisms
• Similar could mean
  • one string is a substring of the other (text-matching algorithms)
  • few letter changes are needed to go from one to the other (edit distance, exercise 15-3)
  • the strands share a long common subsequence: a longest sequence that appears as a (not necessarily consecutive) subsequence of both sequences
• GTCGTCGGAAGCCGGCCGAA is an LCS of the above strands.
Formal Definitions
• Subsequence: Sequence Z = ⟨z_1, z_2, ..., z_k⟩ is a subsequence of sequence X = ⟨x_1, x_2, ..., x_m⟩ if there is a strictly increasing sequence of indices 1 ≤ i_1 < i_2 < ... < i_k ≤ m such that x_{i_j} = z_j for j = 1, ..., k.
• Example:
  • Z = ⟨B,C,D,B⟩ is a subsequence of X = ⟨A,B,C,B,D,A,B⟩
  • Index sequence: 2,3,5,7
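The definition can be checked mechanically by searching for a witnessing index sequence. The following is a minimal sketch, not part of the original slides; the function name `subsequence_indices` is hypothetical, and it greedily picks the earliest usable position for each symbol, which is one valid choice among possibly several.

```python
def subsequence_indices(z, x):
    """Return 1-based indices i_1 < ... < i_k with x[i_j] == z[j], or None."""
    indices = []
    pos = 0  # next 0-based position in x that may still be used
    for symbol in z:
        # Greedily take the earliest remaining occurrence of symbol in x.
        while pos < len(x) and x[pos] != symbol:
            pos += 1
        if pos == len(x):
            return None  # symbol cannot be matched, so z is not a subsequence
        indices.append(pos + 1)  # report 1-based indices, as in the definition
        pos += 1
    return indices


print(subsequence_indices("BCDB", "ABCBDAB"))  # [2, 3, 5, 7]
print(subsequence_indices("BDB", "ABC"))       # None
```

For the example on this slide the greedy search recovers the index sequence 2,3,5,7.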
Formal Definitions
• Common Subsequence: Sequence Z is a common subsequence of sequences X and Y if Z is a subsequence of X and Z is a subsequence of Y
• Example:
  • Z = ⟨B,C,A⟩ is a common subsequence of X = ⟨A,B,C,B,D,A,B⟩ and Y = ⟨B,D,C,A,B,A⟩
  • The Z given above is not a longest common subsequence of X and Y: ⟨B,C,B,A⟩ is a longer common subsequence of X and Y
  • In fact ⟨B,C,B,A⟩ is a longest common subsequence of X and Y.
Longest Common Subsequence Problem
• Input: Finite sequences X and Y
• Output: A maximum-length common subsequence of X and Y
• We will examine an efficient dynamic programming algorithm for this problem.
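As a baseline for the problem statement, a brute-force search can try every subsequence of X, longest first, and return the first one that is also a subsequence of Y. This sketch is an illustration only (the names `is_subsequence` and `lcs_brute_force` are hypothetical); it takes exponential time in the length of X, which is why a dynamic programming algorithm is worth having.

```python
from itertools import combinations


def is_subsequence(z, x):
    """True if z can be obtained from x by deleting zero or more symbols."""
    it = iter(x)
    # 'symbol in it' advances the iterator, so matches must occur in order.
    return all(symbol in it for symbol in z)


def lcs_brute_force(x, y):
    """Try subsequences of x from longest to shortest; exponential time."""
    for k in range(len(x), -1, -1):
        for idx in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return candidate
    return ""


result = lcs_brute_force("ABCBDAB", "BDCABA")
print(len(result))  # 4
```

Several different common subsequences of maximum length may exist, so only the length of the result is fixed by the problem.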
Longest Common Subsequence Characterization
• Definition: For each i with 0 ≤ i ≤ m, the i-th prefix of sequence X = ⟨x_1, x_2, ..., x_m⟩ is the sequence X_i = ⟨x_1, x_2, ..., x_i⟩
• Note that the 0th prefix X_0 is the empty sequence, which has length 0
• Example: If X = ⟨A,B,C,B,D,A,B⟩, then X_4 = ⟨A,B,C,B⟩
Optimal Substructure for LCS
• Theorem: Let X = ⟨x_1, x_2, ..., x_m⟩ and Y = ⟨y_1, y_2, ..., y_n⟩ be sequences and let Z = ⟨z_1, z_2, ..., z_k⟩ be any LCS of X and Y
  1. If x_m = y_n, then z_k = x_m = y_n and Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}.
  2. If x_m ≠ y_n and z_k ≠ x_m, then Z is an LCS of X_{m-1} and Y.
  3. If x_m ≠ y_n and z_k ≠ y_n, then Z is an LCS of X and Y_{n-1}.
• The importance of the above theorem is that it shows that an LCS of two sequences contains an LCS of prefixes of the sequences.
• Therefore, the LCS problem has the optimal-substructure property.
• A recurrence characterizing the length of an LCS of two sequences also follows from the theorem
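Optimal Substructure for LCS
• Given sequences X = ⟨x_1, x_2, ..., x_m⟩, Y = ⟨y_1, y_2, ..., y_n⟩ and integers i, j with 0 ≤ i ≤ m and 0 ≤ j ≤ n, let c[i,j] denote the length of a longest common subsequence of X_i and Y_j
• Then c[i,j] satisfies the following recurrence:

$$
c[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
c[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } x_i = y_j \\
\max\{\, c[i-1,j],\ c[i,j-1] \,\} & \text{if } i,j > 0 \text{ and } x_i \ne y_j
\end{cases}
$$

• A direct recursive implementation of the above recurrence would produce an exponential-time algorithm, because the same sub-problems are solved repeatedly (overlapping sub-problems)
• Since there are only Θ(mn) distinct sub-problems, we can use dynamic programming to compute the solutions bottom up.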
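To illustrate the overlapping sub-problems, the recurrence for c[i,j] can be written as a recursive function whose results are cached, so each pair (i, j) is evaluated at most once. This is a top-down sketch under assumptions, not the bottom-up algorithm of the slides: the name `lcs_length` is hypothetical, sequences are 0-indexed Python strings (so x_i is `x[i - 1]`), and the cases follow the standard LCS recurrence (0 on an empty prefix, diagonal plus one on a match, maximum of the two neighbours otherwise).

```python
from functools import lru_cache


def lcs_length(x, y):
    """Length of an LCS of x and y via the memoized recurrence for c[i, j]."""

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:           # an empty prefix has LCS length 0
            return 0
        if x[i - 1] == y[j - 1]:       # x_i = y_j: extend an LCS of X_{i-1}, Y_{j-1}
            return c(i - 1, j - 1) + 1
        return max(c(i - 1, j), c(i, j - 1))  # x_i != y_j: drop one last symbol

    return c(len(x), len(y))


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

Without the cache the same function still returns the right value but may revisit the same (i, j) pairs exponentially often. For long inputs the recursion depth can exceed Python's default limit, which is one practical reason to prefer the bottom-up table.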
Dynamic Programming Algorithm for LCS
• Input: Sequences a and b
• Output: Two-dimensional table c
• Postcondition: c[i,j] is the length of a longest common subsequence of the i-th prefix of a and the j-th prefix of b
Dynamic Programming Algorithm for LCS
LCS(a,b,c)
  m = a.last
  n = b.last
  for i = 0 to m
    c[i][0] = 0
  for j = 1 to n
    c[0][j] = 0
  for i = 1 to m
    for j = 1 to n
      if ( a[i] ≠ b[j] )
        c[i][j] = max { c[i-1][j], c[i][j-1] }
      else
        c[i][j] = 1 + c[i-1][j-1]
Running time: Θ(mn)
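The pseudocode above translates almost line for line into Python. The sketch below assumes 0-indexed Python sequences, so the slide's a[i] becomes `a[i - 1]`, and it returns the table instead of filling a table passed in as a parameter; the name `lcs_table` is an assumption made for illustration.

```python
def lcs_table(a, b):
    """Return table c where c[i][j] is the LCS length of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay 0: an empty prefix has no common symbols.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] != b[j - 1]:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
            else:
                c[i][j] = 1 + c[i - 1][j - 1]
    return c


c = lcs_table("ABCBDAB", "BDCABA")
print(c[7][6])  # 4
```

The two nested loops visit each of the m·n interior cells once and do constant work per cell, which gives the Θ(mn) running time.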
Dynamic Programming Algorithm for LCS
• Printing the longest common subsequence
LCS_print(a,m,n,c) {
  if ( c[m][n] == 0 )
    return
  if ( c[m][n] == c[m-1][n] )
    LCS_print(a,m-1,n,c)
  else if ( c[m][n] == c[m][n-1] )
    LCS_print(a,m,n-1,c)
  else {
    LCS_print(a,m-1,n-1,c)
    print( a[m] )
  }
}
Running time: O(m+n)
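The traceback can be sketched in Python as follows. To keep the example self-contained it rebuilds the table first; `lcs_table` and `lcs_string` are hypothetical names, the sequences are 0-indexed strings (the slide's a[m] is `a[m - 1]`), and the function returns the subsequence rather than printing it so the result is easy to check.

```python
def lcs_table(a, b):
    """Bottom-up table: c[i][j] is the LCS length of a[:i] and b[:j]."""
    c = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = 1 + c[i - 1][j - 1]
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_string(a, m, n, c):
    """Mirror of LCS_print that returns the subsequence instead of printing."""
    if c[m][n] == 0:
        return ""                       # nothing left to recover
    if c[m][n] == c[m - 1][n]:
        return lcs_string(a, m - 1, n, c)   # value came from the cell above
    if c[m][n] == c[m][n - 1]:
        return lcs_string(a, m, n - 1, c)   # value came from the cell to the left
    # Value differs from both neighbours, so a[m - 1] was matched here.
    return lcs_string(a, m - 1, n - 1, c) + a[m - 1]


a, b = "ABCBDAB", "BDCABA"
print(lcs_string(a, len(a), len(b), lcs_table(a, b)))  # BCBA
```

Each call decreases m, n, or both by one, so at most m + n calls are made. Checking the cell above before the cell to the left is what selects BCBA here; a different tie-breaking order could return another common subsequence of the same length.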
Example
• In the example that follows we will use the following symbols to indicate how the value in a given cell is computed:
  • ← : c[m][n] = c[m][n-1]
  • ↑ : c[m][n] = c[m-1][n]
  • ↖ : c[m][n] = 1 + c[m-1][n-1]
Example
a = ⟨A, B, C, B, D, A, B⟩, b = ⟨B, D, C, A, B, A⟩
[Figure sequence: table c filled in row by row for a and b]
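The filled-in table for this example can be regenerated with a short script. The sketch below is an assumption-laden illustration rather than the slides' own figure: it records next to each value an arrow for the neighbouring cell it was computed from (↖ for the diagonal, ↑ for the cell above, ← for the cell to the left), breaks ties in favour of ↑ to match the order of tests in LCS_print, and uses the hypothetical names `lcs_with_arrows` and `render`.

```python
def lcs_with_arrows(a, b):
    """Return (c, arrow): LCS lengths and the direction each value came from."""
    m, n = len(a), len(b)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    arrow = [[" "] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                c[i][j] = 1 + c[i - 1][j - 1]
                arrow[i][j] = "↖"
            elif c[i - 1][j] >= c[i][j - 1]:   # ties go to the cell above
                c[i][j] = c[i - 1][j]
                arrow[i][j] = "↑"
            else:
                c[i][j] = c[i][j - 1]
                arrow[i][j] = "←"
    return c, arrow


def render(a, b):
    """Format the table with b across the top and a down the left side."""
    c, arrow = lcs_with_arrows(a, b)
    rows = ["     " + " ".join(f" {ch}" for ch in b)]
    for i in range(len(a) + 1):
        label = a[i - 1] if i > 0 else " "
        cells = " ".join(f"{arrow[i][j]}{c[i][j]}" for j in range(len(b) + 1))
        rows.append(f"{label} {cells}")
    return "\n".join(rows)


print(render("ABCBDAB", "BDCABA"))
```

The bottom-right cell of the printed table holds the length of a longest common subsequence of a and b, and following the arrows back from it visits the cells used to recover one.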
Example: Computing the LCS
• We next use the previous table to find the longest common subsequence of the strings a = ⟨A, B, C, B, D, A, B⟩ and b = ⟨B, D, C, A, B, A⟩
Example
a = ⟨A, B, C, B, D, A, B⟩, b = ⟨B, D, C, A, B, A⟩
[Figure sequence: tracing back through table c from the bottom-right cell]
Longest Common Subsequence: BCBA
LCS Homework Page 348, # 2,3