Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao

Dynamic Programming • Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. • Richard Bellman was one of the principal founders of this approach.

Two key ingredients • Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution: 2. overlapping subproblems 1. optimal substructures Subproblems are dependent. (otherwise, a divide-and-conquer approach is the choice.) Each substructure is optimal. (Principle of optimality)

Three basic components • The development of a dynamic-programming algorithm has three basic components: • The recurrence relation (for defining the value of an optimal solution); • The tabular computation (for computing the value of an optimal solution); • The traceback (for delivering an optimal solution).

= F 0 0 = F 1 1 = + F F F for i>1 . - - i i 1 i 2 Fibonacci numbers The Fibonacci numbers are defined by the following recurrence:

F8 F9 F7 F10 F7 F8 F6 How to computeF10？ ……

Tabular computation • The tabular computation can avoid recompuation.

Maximum-sum interval • Given a sequence of real numbers a1a2…an, find a consecutive subsequence with the maximum sum. 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

O-notation:an asymptotic upper bound • f(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0 cg(n) f(n) n0

How functions grow? function For large data sets, algorithms with a complexity greater than O(n log n) are often impractical! n (Assume one million operations per second.)

ai Maximum-sum interval(The recurrence relation) • Define S(i) to be the maximum sum of the intervals ending at position i. If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

Maximum-sum interval(Tabular computation) 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7 The maximum sum

Maximum-sum interval(Traceback) 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7 The maximum-sum interval: 6 -2 8 4

Two fundamental problems we recently solved (I) (joint work with Lin and Jiang) • Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm. U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

Joint work with Huang, Jiang and Lin • Xiaoqiu Huang (黃曉秋)Iowa State University, USA • Tao Jiang (姜濤) University of California Riverside, USA • Yaw-Ling Lin (林耀鈴)Providence University, Taiwan

Two fundamental problems we recently solved (II) (joint work with Lin and Jiang) • Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm. L = 4 3 2 14 6 6 2 10 2 6 6 14 2 1

Another example Given a sequence as follows: 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is the squared area, which has the average value 20/3. 2, 6.6, 6.6, 3, 7, 6, 7, 2

C+G rich regions • Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time. ATGACTCGAGCTCGTCA 00101011011011010 Search for an interval of length at least L with the highest average.

After I opened this problem in Academia Sinica in Aug. 2001 ... • JCSS (2002) – Lin, Jiang, Chao • A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U. • An O(n log L) time algorithm for a maximum-average segment with a lower bound L. • Two open problems raised: • O(n log L)  O(n)? • average (with an upper bound U)?

After I opened this problem in Academia Sinica in Aug. 2001 ... • Comb. Math. (2002) – Yeh et. al • An expected O(nL)-time algorithm • Comb. Math. (2002) – Lu • A linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

After I opened this problem in Academia Sinica in Aug. 2001 ... • WABI (2002) & JCSS – Goldwasswer, Kao, and Lu • O(n) for a lower bound L only • O(n log (U-L+1)) for L and U • IPL (2003) – Kim • O(n) for a lower bound L only(Reformulated as a Computational Geometry Problem)

After I opened this problem in Academia Sinica in Aug. 2001 ... • Bioinformatics (2003) – Lin, Huang, Jiang, Chao • A tool for finding k-best maximum-average segments • Bioinformatics (2003) – Wang, Xu • A tool for locating a maximum-sum segment with a lower bound L and an upper bound U

After I opened this problem in Academia Sinica in Aug. 2001 ... • CIAA(2003) – Lu et al. • An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats) • ESA (2003) – Chung, Lu • O(n log (U-L+1))  O(n) for L and U • ISAAC (2003) – Lin (R.R.), Kuo, Chao • O(nL) for a maximum-average path in a tree • O(n) for a complete tree • Some were omitted here... • Some are coming soon...

Length-unconstrained version • Maximum-average interval 3 2 14 6 6 2 10 2 6 6 14 2 1 The maximum element is the answer. It can be done in O(n) time.

Q:computing density in O(1) time? • prefix-sum(i) = S[1]+S[2]+…+S[i], • all n prefix sums are computable in O(n) time. • sum(i, j) = prefix-sum(j) – prefix-sum(i-1) • density(i, j) = sum(i, j) / (j-i+1) j i prefix-sum(j) prefix-sum(i-1)

A naive algorithm • A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time • Try L, L+1, L+2, ..., n. In total, O(n2).

An O(nL)-time algorithm (Huang, CABIOS’ 94) • Observing that there exists an optimal interval of length bounded by 2L, we immediately have an O(nL)-time algorithm. We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.

Good partners • Finding good partnerg(i) for each i=1, 2, …, n-L+1. i + L g(i) L maximing avg[i, g(i)]

Right-Skew Decomposition • Partition S into substrings S1,S2,…,Sk such that • each Si is a right-skew substring of S • the average of any prefix is always less than or equal to the average of the remaining suffix. • density(S1) > density(S2) > … > density(Sk) • [Lin, Jiang, Chao] • Unique • Computable in linear time.

1 1 1 0 1 1 0 1 0 1 1 0 0 Right-Skew Decomposition 1 > 2/3 > 3/5 > 1/3 Lu’s illustration

An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02) • Decreasingly right-skew decomposition (O(n) time) 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8

Right-Skew pointers p[ ] 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8 1 2 3 4 5 6 7 8 9 10 p[ ] 1 3 3 6 5 6 10 8 10 10

Illustration g(i) i+L L i+L L Lu’s illustration

An O(n log L)-time algorithm • Jumping tables that allows binary search. • O(log L) time for each position to find its good partner, therefore the total running time is O(n log L). • We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.

Pointer-jumping tables 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8 1 2 3 4 5 6 7 8 9 10 p(0)[ ] 1 3 3 6 5 6 10 8 10 10 p(1)[ ] 3 6 6 10 6 10 10 10 10 10 p(2)[ ] 10 10 10 10 10 10 10 10 10 10

Goldwasser, Kao, and Lu’s progress • AnO(n) time algorithm for the problem • Appeared in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002. • JCSS, accepted.

j i g(j) g(i) A new important observation • i < j < g(j) < g(i) implies • density(i, g(i)) is no more than density(j, g(j)) Lu’s illustration

j i g(j) g(i) Lu’s illustration

Searching for all g(i) in linear time Lu’s illustration

k non-overlapping maximum-average segments(Lin, Huang, Jiang, Chao, Bioinformatics) • Maintain all candidates in a priority queue. • A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

Longest increasing subsequence(LIS) • The longest increasing subsequence is to find a longest increasing subsequence of a given sequence of distinct integers a1a2…an. • e.g. 9 2 5 3 7 11 8 10 13 6 • 3 7 • 7 10 13 • 7 11 • 3 5 11 13 are increasing subsequences. We want to find a longest one. are not increasing subsequences.

A naive approach for LIS • Let L[i] be the length of a longest increasing subsequence ending at position i. L[i] = 1 + max j = 0..i-1{L[j] | aj < ai}(use a dummy a0 = minimum, and L[0]=0) 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 ?

A naive approach for LIS L[i] = 1 + max j = 0..i-1{L[j] | aj < ai} 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 4 5 6 3 The maximum length The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in O(n2) time.

Binary search • Given an ordered sequence x1x2 ... xn, where x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time. n/2 ... n/4 n

Binary search • How many steps would a binary search reduce the problem size to 1?n n/2 n/4 n/8 n/16 ... 1 How many steps? O(log n) steps.

An O(n log n) method for LIS • Define BestEnd[k] to be the smallest number of an increasing subsequence of length k. 9 2 5 3 7 11 8 10 13 6 9 2 2 2 2 2 2 2 2 BestEnd[1] 5 3 3 3 3 3 3 BestEnd[2] 7 7 7 7 7 BestEnd[3] 11 8 8 8 BestEnd[4] 10 10 BestEnd[5] 13 BestEnd[6]

An O(n log n) method for LIS • Define BestEnd[k] to be the smallest number of an increasing subsequence of length k. 9 2 5 3 7 11 8 10 13 6 9 2 2 2 2 2 2 2 2 2 BestEnd[1] 5 3 3 3 3 3 3 3 BestEnd[2] 7 7 7 7 7 6 BestEnd[3] 11 8 8 8 8 BestEnd[4] For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n). 10 10 10 BestEnd[5] 13 13 BestEnd[6]

Longest Common Subsequence (LCS) • A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. • The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

LCS For instance, Sequence 1: president Sequence 2: providence Its LCS is priden. president providence

LCS Another example: Sequence 1: algorithm Sequence 2: alignment One of its LCS is algm. a l g o r i t h m a l i g n m e n t

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences