The Longest Common Subsequence Problem for Arc-annotated Sequences
Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang
B89902003 林聖凱 B89902005 高海峰 B89902027 謝俊瑋
2004 L.K.H@NTUCSIE
Overview
• Arc-annotated sequence usage
  • The secondary and tertiary structure of RNA
  • Protein sequence
• Solve the open questions in
  • P.A. Evans, Algorithms and Complexity for Annotated Sequence Analysis, Ph.D. Thesis, University of Victoria, 1999.
  • P.A. Evans, Finding common subsequences with pseudoknots, in Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM'99), LNCS 1645, pp. 270-280.
Definitions (I)
• Symbol definition
  • S: sequence
  • P: arc set
• Arc definition
• The pair (S, P) is called an arc-annotated sequence
Definitions (II)
• Arc-preserving
  • The arc mapping is kept when performing LCS
• Cutwidth
  • The number of arcs crossing a given position
• Arc-cutwidth
  • The maximum cutwidth over all positions of the sequence
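As a concrete reading of the two cutwidth definitions, here is a minimal Python sketch. It assumes 1-based positions, arcs stored as (left, right) pairs, and that an arc counts at position p whenever left ≤ p ≤ right. The inclusive endpoints and the function names `cutwidth_at` and `arc_cutwidth` are illustrative assumptions, not taken from the slides.

```python
def cutwidth_at(arcs, p):
    """Number of arcs (l, r) covering position p (assumption: l <= p <= r)."""
    return sum(1 for l, r in arcs if l <= p <= r)


def arc_cutwidth(n, arcs):
    """Maximum cutwidth over positions 1..n of a sequence of length n."""
    return max((cutwidth_at(arcs, p) for p in range(1, n + 1)), default=0)


# Arcs (2, 8) and (4, 7) on a sequence of length 8.
arcs = [(2, 8), (4, 7)]
print(arc_cutwidth(8, arcs))
```

Positions 4 to 7 lie under both arcs, so under this convention the arc-cutwidth of the example is 2.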
Restrictions (I)
• The problem is NP-hard if there are no restrictions on the arc annotations
• Fortunately, RNA and protein sequences satisfy some constraints
Restrictions (II)
1. No sharing of endpoints
2. No crossing
3. No nesting
4. No arcs
Restrictions (III)
• Five levels
  • Unlimited: no restrictions
  • Crossing: restriction 1
  • Nested: restrictions 1, 2
  • Chain: restrictions 1, 2, 3
  • Plain: restriction 4
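The five levels can be checked mechanically. The sketch below returns the most restrictive level that an arc set satisfies. The function name `restriction_level` and the representation of arcs as pairs of 1-based positions are assumptions made for illustration.

```python
def restriction_level(arcs):
    """Most restrictive of: plain, chain, nested, crossing, unlimited."""
    arcs = [tuple(sorted(a)) for a in arcs]
    if not arcs:
        return "plain"                      # restriction 4: no arcs
    ends = [e for a in arcs for e in a]
    if len(ends) != len(set(ends)):
        return "unlimited"                  # restriction 1 violated
    crossing = nesting = False
    for x in range(len(arcs)):
        for y in range(x + 1, len(arcs)):
            (a, b), (c, d) = sorted((arcs[x], arcs[y]))   # now a < c
            if a < c < b < d:
                crossing = True             # restriction 2 violated
            elif c < d < b:
                nesting = True              # restriction 3 violated
    if crossing:
        return "crossing"
    if nesting:
        return "nested"
    return "chain"


print(restriction_level([(2, 8), (4, 7)]))
```

The levels are cumulative, so an arc set reported as "chain" is also a valid nested, crossing, and unlimited instance.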
Result (I)
• |S1| = n, |S2| = m
• [Results table not preserved]
Result (II)
• LCS (crossing, crossing)
  • 2-approximation algorithm
• LCS (crossing, plain)
  • MAX SNP-hard
• LCS (nested, plain)
  • Dynamic programming algorithm
LCS (crossing, crossing) - def (I)
• (S1, P1), (S2, P2): arc-annotated sequences
• Y: the result of the ordinary LCS() (arcs ignored)
  • |Y| = L
• M: the mapping between S1 and S2 induced by Y
  • M = {(i1, j1), …, (iL, jL)}
LCS (crossing, crossing) - def (II)
• Graph GM
  • Each matched pair of M, such as (ik, jk) or (il, jl), is a vertex
  • The maximum vertex degree of GM is at most 2
LCS (crossing, crossing) - Algo
• [Algorithm slide content not preserved]
LCS (crossing, crossing) - Result
• LCS (crossing, crossing) has a 2-approximation algorithm with time complexity O(nm)
• LCS (crossing, nested), LCS (crossing, chain), and LCS (crossing, plain) therefore also have a 2-approximation algorithm with time complexity O(nm)
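The algorithm slide did not survive, so the following Python sketch shows only one way the stated bound can be reached, under stated assumptions. It computes an ordinary LCS, builds a conflict graph on the matched pairs (assumption: two pairs conflict when exactly one of P1, P2 has an arc between them), and keeps the larger colour class of every component. With no shared endpoints each vertex has at most one P1-conflict and one P2-conflict, so cycles alternate and are even, which is why 2-colouring is assumed to work. All function names are hypothetical.

```python
from collections import deque


def lcs_matches(s1, s2):
    """Classical O(nm) LCS; returns the matched 1-based position pairs."""
    n, m = len(s1), len(s2)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    pairs, i, j = [], n, m
    while i and j:                          # trace one optimal mapping back
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def is_arc_preserving(pairs, p1, p2):
    a1 = {frozenset(a) for a in p1}
    a2 = {frozenset(a) for a in p2}
    return all((frozenset((i, k)) in a1) == (frozenset((j, l)) in a2)
               for x, (i, j) in enumerate(pairs) for (k, l) in pairs[x + 1:])


def two_approx(s1, p1, s2, p2):
    """Arc-preserving mapping with at least half the ordinary LCS length."""
    m = lcs_matches(s1, s2)
    adj = [[] for _ in m]
    for x in range(len(m)):                 # conflict graph G_M
        for y in range(x + 1, len(m)):
            if not is_arc_preserving([m[x], m[y]], p1, p2):
                adj[x].append(y)
                adj[y].append(x)
    color, keep = {}, []
    for start in range(len(m)):             # 2-colour each component
        if start in color:
            continue
        color[start], comp, queue = 0, [start], deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]
                    comp.append(w)
                    queue.append(w)
        zero = [v for v in comp if color[v] == 0]
        one = [v for v in comp if color[v] == 1]
        keep += zero if len(zero) >= len(one) else one
    return [m[v] for v in sorted(keep)]
```

Since the optimum of the arc-annotated problem cannot exceed the ordinary LCS length, keeping at least half of the LCS matches yields a factor of 2. The pairwise loop over matches is quadratic in L, which stays within O(nm) because L ≤ min(n, m).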
LCS (unlimited, plain)
• Prove that it cannot be approximated within ratio [ratio formula not preserved]
• Lemma 1
  • MaxIS-B is Max SNP-complete when B ≥ 3
• Lemma 2
  • MaxIS-Cubic is Max SNP-complete
Proof of Lemma 2 (I)
• L-reduction from MaxIS-3 to MaxIS-Cubic
• G(V, E): instance of MaxIS-3 with n vertices
  • i vertices of degree 1
  • j vertices of degree 2
  • n - i - j vertices of degree 3
• V': a maximum independent set (IS) of G
• opt(G) = |V'|
Proof of Lemma 2 (II)
• Trivially, opt(G) ≥ n/4
• i + j ≤ 4·opt(G)
• G': instance of MaxIS-Cubic
• opt(G'): the size of a maximum IS of G'
• Goal: construct G' from G and a special graph H
Proof of Lemma 2 (III)
• Graph H [figure not preserved]
  • Number of triangles: 2i + j
  • Cycle size: 2(2i + j)
Proof of Lemma 2 (IV)
• H has a maximum IS of size 2(2i + j)
• Construct G'
  • Connect each degree-1 vertex of G to two free vertices of H
  • Connect each degree-2 vertex of G to one free vertex of H
• G' is a cubic graph
Proof of Lemma 2 (V)
• G' has an independent set of size k' = opt(G) + 2(2i + j)
• Hence opt(G') ≥ opt(G) + 2(2i + j) ……(1)
Proof of Lemma 2 (VI)
• Conversely
  • V'': an independent set of G' with |V''| = k'
  • Deleting the vertices of V'' that lie in H yields an independent set of G of size k
  • At most 2(2i + j) vertices of V'' are in H
  • k ≥ k' - 2(2i + j) ……(2)
Proof of Lemma 2 (VII)
• From (1): [inequality not preserved]
• From (2): [inequality not preserved]
• The reduction is therefore an L-reduction
• MaxIS-Cubic is Max SNP-complete
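The two inequalities an L-reduction needs can be recovered from (1), (2) and the bound i + j ≤ 4·opt(G). The derivation below is a sketch of that step. Its constants (17 and 1) follow from the quantities defined on the earlier slides and are not necessarily written the way the original slide had them.

```latex
\begin{align*}
\mathrm{opt}(G') &= \mathrm{opt}(G) + 2(2i+j)
  && \text{(1), plus (2) applied to a maximum IS of } G' \\
 &\le \mathrm{opt}(G) + 4(i+j) \;\le\; 17\,\mathrm{opt}(G)
  && \text{since } i + j \le 4\,\mathrm{opt}(G) \\[4pt]
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) - k' + 2(2i+j)
  \;=\; \mathrm{opt}(G') - k'
  && \text{by (2) and the equality above}
\end{align*}
```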
Proof of LCS(unlimited, plain) (I)
• Show that MaxIS can be L-reduced to LCS(unlimited, plain)
• MaxIS cannot be approximated [bound not preserved]
Proof of LCS(unlimited, plain) (II)
• G(V, E): instance of MaxIS, n = |V|
• I: the constructed instance of LCS(unlimited, plain), consisting of
  • S1 = a^n with P1 = E
  • S2 = a^n with P2 = Ø
• An independent set V' = {v_i1, …, v_ik} corresponds one-to-one to the arc-preserving common subsequence consisting of the i1-th, …, ik-th a's of S1
• So LCS(unlimited, plain) includes MaxIS as a subproblem
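The correspondence can be checked by brute force on a tiny graph. The Python sketch below builds the instance (a^n, E) and (a^n, Ø) and compares a brute-force arc-preserving LCS with a brute-force MaxIS. Both searches are exponential and meant for illustration only. The function names are hypothetical, and "arc-preserving" is taken to mean that two matched pairs are joined by an arc in P1 exactly when they are joined by an arc in P2.

```python
from itertools import combinations


def max_independent_set(n, edges):
    """Size of a maximum independent set on vertices 1..n (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    for k in range(n, 0, -1):
        for subset in combinations(range(1, n + 1), k):
            if all(frozenset(p) not in edge_set
                   for p in combinations(subset, 2)):
                return k
    return 0


def lapcs_bruteforce(s1, p1, s2, p2):
    """Length of a longest arc-preserving common subsequence (brute force)."""
    a1 = {frozenset(a) for a in p1}
    a2 = {frozenset(a) for a in p2}
    for k in range(min(len(s1), len(s2)), 0, -1):
        for I in combinations(range(1, len(s1) + 1), k):
            for J in combinations(range(1, len(s2) + 1), k):
                if any(s1[i - 1] != s2[j - 1] for i, j in zip(I, J)):
                    continue
                if all((frozenset((I[x], I[y])) in a1)
                       == (frozenset((J[x], J[y])) in a2)
                       for x, y in combinations(range(k), 2)):
                    return k
    return 0


def reduce_maxis_to_lcs(n, edges):
    """Instance I from the slide: S1 = a^n with P1 = E, S2 = a^n with P2 empty."""
    return "a" * n, list(edges), "a" * n, []


cycle5 = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 5)]
print(max_independent_set(5, cycle5), lapcs_bruteforce(*reduce_maxis_to_lcs(5, cycle5)))
```

On the 5-cycle both searches return 2, as the one-to-one correspondence predicts.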
Corollary
• LCS(unlimited, chain), LCS(unlimited, nested), and LCS(unlimited, unlimited) cannot be approximated within ratio [ratio formula not preserved]
LCS(crossing, plain) is MAX SNP-hard
• Use an L-reduction from MaxIS-Cubic to LCS(crossing, plain)
• G(V, E) is a cubic graph, n = |V|
• For S1, construct a segment Tu of letters aaaabbccc for each vertex u ∈ V
• For each edge (u, v), introduce an arc between a "c" of Tu and a "c" of Tv; each letter c can be used only once
Instance I constructed from cubic graph G
• S2 is obtained by concatenating n identical segments of aaaacccbb
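The construction of instance I can be sketched in a few lines of Python. The assumptions are that vertices are numbered 1..n, that the segments Tu are concatenated in vertex order, and that the three c's of a segment are handed out to the incident edges in the order the edges are listed. The slides do not fix these choices, and `build_instance` is a hypothetical name.

```python
def build_instance(n, edges):
    """Return (S1, P1, S2) for a cubic graph on vertices 1..n.

    S1 is n copies of 'aaaabbccc' with one arc per edge joining unused
    c's of the two endpoint segments; S2 is n copies of 'aaaacccbb'
    and carries no arcs (plain).
    """
    s1 = "aaaabbccc" * n
    s2 = "aaaacccbb" * n
    used = {u: 0 for u in range(1, n + 1)}   # c's already consumed per segment

    def next_c(u):
        # Within a segment (1-based), the c's sit at offsets 7, 8, 9.
        pos = 9 * (u - 1) + 7 + used[u]
        used[u] += 1
        return pos

    p1 = []
    for u, v in edges:
        a, b = next_c(u), next_c(v)
        p1.append((min(a, b), max(a, b)))
    return s1, p1, s2


# K4 is the smallest cubic graph.
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
s1, p1, s2 = build_instance(4, k4)
print(len(s1), len(p1))
```

Because every vertex of a cubic graph has exactly three incident edges, every c receives exactly one arc, so P1 has no shared endpoints and the instance is at the crossing level.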
Proof (1)
• Opt(I) ≥ Opt(G) + 6n
• Assume Y is an arc-preserving common subsequence of length k' for (S1, P1) and (S2, P2)
  • (1) In every segment, the four "a"s should be matched
  • (2) Within a segment, if a "b" is matched then no "c" is matched, and vice versa
Proof (2)
• Define a subset V' of the vertices of G: for every segment Tu in S1, if all three of its "c"s are matched, put u in V'
• V' is an independent set of G; let k = |V'|
• k ≥ k' - 6n, n/4 ≤ opt(G) ≤ n/2
• Opt(I) = Opt(G) + 6n ≤ 25·Opt(G), since n ≤ 4·Opt(G) (a)
• |k - opt(G)| ≤ |k' - opt(I)| (b)
Proof (3)
• Inequalities (a) and (b) show that the reduction is an L-reduction, so LCS(crossing, plain) is MAX SNP-hard
• LCS(crossing, chain), LCS(crossing, nested), and LCS(crossing, crossing) are therefore all MAX SNP-hard
Notes
• With the additional constraint:
  • for any (i1, j1) in the mapping, if (i1, i2) ∈ P1 then (i2, j2) is in the mapping for some j2, and if (j1, j2) ∈ P2 then (i2, j2) is in the mapping for some i2
• Under this definition, LCS(crossing, crossing) is NP-hard and LCS(crossing, nested) is solvable in polynomial time
LCS(nested, plain)
• Input: a pair (S1, P1) and (S2, Ø) of arc-annotated sequences, with P1 nested
• Output: the length of a longest arc-preserving common (LAPC) subsequence for the pair (no arc appears on the LAPC subsequence)
Denote u(i)
• n = |S1|
• m = |S2|
• u(i) denotes the arc in P1 incident on position i of sequence S1
  • u(i)_l and u(i)_r are the left and right endpoints of u(i)
• If u(i) does not exist, position i is called "free"
• x(S1[i], S2[j]) = 1 if S1[i] = S2[j], and 0 otherwise
Dynamic Programming Algorithm
• Alas, I know little about dynamic programming
• But I know divide-and-conquer
• pang feng says DP is bottom-up, D&C is top-down
Divide and Conquer algorithm
• Two functions:
  • 弧DP(i1,i2;j,j') ("arc DP"): the length of a LAPC subsequence for S1[i1, i2] and (S2[j, j'], Ø); used if and only if i1 = u(i2)_l, that is, (i1, i2) is an arc
  • 無DP(i,i';j,j') ("no-arc DP"): the length of a LAPC subsequence for S1[i, i'] and (S2[j, j'], Ø); used if and only if i < u(i')_l or i' is free
• How?
Divide and Conquer algorithm
• 無DP(i,i';j,j'): if i' is free, use the simple LCS recurrence
  無DP(i,i';j,j') = max {
    無DP(i,i'-1;j,j'-1) + x(S1[i'], S2[j']),
    無DP(i,i'-1;j,j'),
    無DP(i,i';j,j'-1) }
無DP(i,i';j,j')
• Else if i' = u(i')_r and i < u(i')_l:
  無DP(i,i';j,j') = max over j ≤ j'' ≤ j'+1 of { 無DP(i, u(i')_l - 1; j, j''-1) + 弧DP(u(i')_l, i'; j'', j') }
• [Figure: S1[i, i'] is split just before u(i')_l, and S2[j, j'] is split at every possible j'']
無DP(i,i';j,j')
• Else (i = u(i')_l):
  • Just call 弧DP(i,i';j,j')
弧DP(i1,i2;j,j')
  弧DP(i1,i2;j,j') = max {
    無DP(i1+1, i2-1; j+1, j') + x(S1[i1], S2[j]),
    無DP(i1+1, i2-1; j, j'-1) + x(S1[i2], S2[j']),
    無DP(i1+1, i2-1; j, j'),
    弧DP(i1, i2; j, j'-1),
    弧DP(i1, i2; j+1, j') }
• Merge 弧DP and 無DP into a single DP
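The recurrences for 無DP and 弧DP can be put together as a memoized top-down procedure. The Python sketch below is one possible transcription, not the presenters' code. It assumes 1-based positions, arcs given as (left, right) pairs forming a nested arc set, an empty range contributing 0, and j'' ranging over j..j'+1. The names `free_dp` and `arc_dp` stand in for 無DP and 弧DP.

```python
from functools import lru_cache


def lapcs_nested_plain(s1, arcs, s2):
    """Length of a longest arc-preserving common subsequence of
    (s1, arcs) and (s2, no arcs), assuming the arcs are nested."""
    left_of = {r: l for l, r in arcs}        # right endpoint -> left endpoint

    def x(i, j):
        return 1 if s1[i - 1] == s2[j - 1] else 0

    @lru_cache(maxsize=None)
    def free_dp(i, i2, j, j2):               # plays the role of 無DP
        if i > i2 or j > j2:
            return 0
        if i2 not in left_of:
            # i2 is free. With nested arcs, the right end of a subproblem
            # is never a left endpoint, so this test is sufficient.
            return max(free_dp(i, i2 - 1, j, j2 - 1) + x(i2, j2),
                       free_dp(i, i2 - 1, j, j2),
                       free_dp(i, i2, j, j2 - 1))
        l = left_of[i2]
        if l == i:                           # (i, i2) is itself an arc
            return arc_dp(i, i2, j, j2)
        # Split S1 just before the arc and try every split point k of S2.
        return max(free_dp(i, l - 1, j, k - 1) + arc_dp(l, i2, k, j2)
                   for k in range(j, j2 + 2))

    @lru_cache(maxsize=None)
    def arc_dp(i1, i2, j, j2):               # plays the role of 弧DP
        if j > j2:
            return 0
        # S2 has no arcs, so at most one endpoint of (i1, i2) is matched.
        return max(free_dp(i1 + 1, i2 - 1, j + 1, j2) + x(i1, j),
                   free_dp(i1 + 1, i2 - 1, j, j2 - 1) + x(i2, j2),
                   free_dp(i1 + 1, i2 - 1, j, j2),
                   arc_dp(i1, i2, j, j2 - 1),
                   arc_dp(i1, i2, j + 1, j2))

    return free_dp(1, len(s1), 1, len(s2))


print(lapcs_nested_plain("ATGCATGC", [(2, 8), (4, 7)], "AT"))
```

With S1 = ATGCATGC, arcs (2,8) and (4,7), and S2 = AT, the call returns 2: positions 1 and 2 of S1 spell AT and no arc joins them. With an empty arc set the procedure reduces to the ordinary LCS recurrence.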
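Example
• Top-down approach
• S1 = A T G C A T G C (positions 1 to 8), S2 = A T
• Subproblem intervals of S1, from the top:
  • (1,8) ATGCATGC splits into (1,1) A and (2,8) TGCATGC
  • (2,8) leads to (3,7) GCATG
  • (3,7) splits into (3,3) G and (4,7) CATG
  • (4,7) leads to (5,6) AT
  • (5,6) leads to (5,5) A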
Example: bottom up
• (5,5): table T, where T[1,1] denotes DP(5,5;1,1), T[1,2] denotes DP(5,5;1,2), and T[2,2] denotes DP(5,5;2,2)
  • T[1,1] = 1, T[1,2] = 1, T[2,2] = 0
• (5,6): 6 is free, so the free-case recurrence of 無DP applies
  DP(5,6;1,2) = max {
    DP(5,5;1,1) + x(S1[6], S2[2]),
    DP(5,5;1,2),
    DP(5,6;1,1) }
  • Table for (5,6): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
Example: bottom up
• Table for (5,6): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
• (4,7): arc, so use 弧DP
  弧DP(4,7;1,2) = max {
    DP(5,6;2,2) + x(S1[4], S2[1]),
    DP(5,6;1,1) + x(S1[7], S2[2]),
    DP(5,6;1,2),
    DP(4,7;1,1),
    DP(4,7;2,2) }
  • Table for (4,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
Example: bottom up
• (3,7): 7 = u(7)_r and 3 < u(7)_l, so split before the arc
• Table for (3,3): T[1,1] = 0, T[1,2] = 0, T[2,2] = 0
• Table for (4,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
  DP(3,7;1,2) = max {
    DP(3,3;1,0) + DP(4,7;1,2),
    DP(3,3;1,1) + DP(4,7;2,2),
    DP(3,3;1,2) + DP(4,7;3,2) }
  • Table for (3,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
Example: bottom up
• Table for (3,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
• (2,8): arc, so use 弧DP
  弧DP(2,8;1,2) = max {
    DP(3,7;2,2) + x(S1[2], S2[1]),
    DP(3,7;1,1) + x(S1[8], S2[2]),
    DP(3,7;1,2),
    DP(2,8;1,1),
    DP(2,8;2,2) }
  • Table for (2,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
Example: bottom up
• (1,8): 8 = u(8)_r and 1 < u(8)_l, so split before the arc
• Table for (1,1): T[1,1] = 1, T[1,2] = 1, T[2,2] = 0
• Table for (2,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
  ANS: DP(1,8;1,2) = max {
    DP(1,1;1,0) + DP(2,8;1,2),
    DP(1,1;1,1) + DP(2,8;2,2),
    DP(1,1;1,2) + DP(2,8;3,2) }
  • Table for (1,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1
Time Complexity
• Table size: m(m-1)/2 = O(m^2)
• Number of tables, one per possible interval (i, i') of S1:
  • Arc: at most n/2 = O(n)
  • Inside an arc: at most as many as there are arcs
  • Free: at most O(n)
• Table entries in total:
  • O(n) · O(m^2) = O(nm^2)
Time Complexity
• Computing one entry costs at most O(m), because of the maximization over j'' in
  無DP(i,i';j,j') = max{ 無DP(i, u(i')_l - 1; j, j''-1) + 弧DP(u(i')_l, i'; j'', j') }
• Time complexity:
  • O(m) · O(nm^2) = O(nm^3)
Extend LCS(nested, plain) Algorithm
• Extend to LCS(nested, chain)
  • Add two new values α, β to DP(i,i';j,j')
  • DP(i,i';j,j';α,β)
• Extend to LCS(crossing, nested)
  • Restrict the cutwidth to a constant k
  • Add k pairs (αi, βi) to DP(i,i';j,j')
LCS(nested, chain) Notation
• "-": denotes nothing
• ρ: the rightmost position of [j, j'-1] other than α and β
• [Figure: positions j, α, β, j' marked on S2]
Modification (I)
• If i' is free and j' = u(j')_l:
  DP(i,i';j,j';α,-) = max {
    DP(i,i'-1;j,ρ;α',-) + x(S1[i'], S2[j']),
    DP(i,i'-1;j,j';α,-),
    DP(i,i';j,ρ;α',-) }
  DP(i,i';j,j';α,j') = DP(i,i';j,j';α,-)
• Here α' = α if α < ρ, and α' = - otherwise
Modification (II)
• If i' is free and j' = u(j')_r (≠ α):
  DP(i,i';j,j';α,-) = max {
    DP(i,i'-1;j,ρ;α,β') + x(S1[i'], S2[j']),
    DP(i,i'-1;j,j';α,-),
    DP(i,i';j,j'-1;α,-) }
  DP(i,i';j,j';α,j') = DP(i,i';j,j'-1;α,-)
• Here β' = u(j')_l if j ≤ u(j')_l < ρ, and β' = - otherwise