250 likes | 367 Views
Comp. Genomics. Recitation 4 Multiple Sequence Alignment or Computational Complexity – What is it good for?. Outline. MSA is very expensive Speedup – Carillo Lipman Approximation algorithms Heuristics. MSA is expensive. The running time of DP MSA is O(2 N L N )
E N D
Comp. Genomics Recitation 4 Multiple Sequence Alignment or Computational Complexity – What is it good for?
Outline • MSA is very expensive • Speedup – Carillo Lipman • Approximation algorithms • Heuristics
MSA is expensive • The running time of DP MSA is O(2NLN) • E.g., to align 10 proteins of 50 residues each we need 1020 operations • If one operation takes one microsecond, it takes over a billion years to align these 10 proteins • How do we obtain practical algorithms?
Exercise 1 • Definitions: • D(xi,xj)-The optimal pairwise alignment between the ith and jth sequences • A*-The optimal SOP MSA (c(A*) = its cost) • A*i,j-The projection of the optimal SOP MSA on the ij plane (c(A*i,j) = its cost) • c‘ = an upper bound on c(A*) • Given D and c’, upper bound c(A*u,v).
The Carillo-Lipman bound A bound on the cost of the optimal MSA Cost of the optimal MSA The optimal Alignment is better than Any other Cost of the optimal MSA’s Projection on the ij plane Break sum
The Carillo-Lipman bound • Which cells can we cancel? • Those whose projection on any 2-dimensional plane falls in a cells such that the optimal 2-dimensional path through that cell costs more than the bound
Exercise 2 • We are building MSA between x,y and z of sizes n,m, and l: • D(x[1,..,i],y[1,..,j])=5, D(x[1,..,i],z[1,..,k])=7, D(y[1,..,j],z[1,..,k])=3 • D(x[i+1,…,n],y[j+1,…,m])=3, D(x[i+1,..,n],z[k+1,..,l])=7, D(y[j+1,..,m],z[K+1,..,l])=4 • Carillo Lipman gave us: C(Ax,y*)≤13, C(Ax,z*)≤14, C(Ay,z*)≤15 (we assume lower scores are better) • True or false: The cell (i,j,k) needs to be considered in the MSA
Solution • The cost of the optimal path through (i,j) on the xy plane is 8 • The Carillo-Lipman bound is 13. So the projection to the xy plane does not cancel the cell. • Similarly, the other bounds do not cancel the cell • The claim is true
Approximation algorithms • Carillo-Lipman is still impractical for many long sequences • Hence, our goal is to obtain faster algorithms • Approximation algorithms promise to remain a certain factor away from OPT • Good: constant factor algorithms (e.g., 1.785 approximation) • Worse: approximation ratio dependent on the input size (e.g., log(n)) • Not always good empirically, but important for inspiring good heuristics
Approximation ratio • Has to be maintained for anylegal input • Cost (score) of OPT is c(OPT) • Cost (score) of ALG is c(ALG) • Approximation ratio: c(ALG)/c(OPT)
Reminder: SP MSA • Input: strings S1, …,Sk of length n. • d(i,j)– The distance between SiandSjas induced by the MSA • Sum-of-pairs (SP) score: • Goal: find MSA with minimum SP score • We’ll look for minimal scores
The Center * algorithm • Assumptions: • The triangle inequality holds • σ(-,-)=0 • σ(x,y)=σ(y,x) • Input: strings S1,…,Sk. • The algorithm: • Find the string S*that minimizes • Iteratively add all other sequences to the alignment • Running time: O(k2n2)
Exercise 3 • Find the approximation ratio of the center star algorithm • Use the following definitions: • M* - An optimal alignment • M - The alignment produced by this algorithm • d(i,j) - The distance M induces on the pair Si,Sj • D(S,T) – min cost of alignment between S and T For all i: d(1,i)=D(S1,Si) (we perform optimal alignment between S’1 and Si and δ(-,-) = 0 )
Randomized approximation algorithms • Now we want to reduce the O(k2n2) time further • Randomized algorithms perform well on most inputs, and thus have a good performance on average • Approximation ratio of a randomized algorithm: E(c(ALG)/c(OPT)) • The expectancy is over the choices of the algorithm (coin tosses)
Center-star reminder • Algorithm: • Compute the sequence “closest-to-all-others” • Align all sequences to it • Approximation: • (2-1/k) approximation to the optimal sum-of-pairs (SP) alignment • Complexity: • O(k2) pairwise alignments to choose the center star- O(k2n2) – Bottleneck! • O(k) pairwise alignments to construct the MSA: O(kn2)
Random-star • What if instead of picking the best sequence as the starting point, we pick a random sequence from the group? • Ex4: Show that for any r, we’ll construct an example (choose cost function, sequences and number of sequences) such that the algorithm will be > r worse than OPT: d(Mb)/d(Mopt)>r
Solution • Bad sequence: CCC..C (k letters) • Other sequences: k AGAGAG…AG and k GAGAGA…GA • Costs: 1 for gap and 1 for mismatch
Solution • c(ALG): • k2 couples with k mismatches (AGAG vs GAGA). • 2k couples with k mismatches (CCCC vs AGAG\GAGA). • D(Mb) = k3 + 2k2. • C(OPT): • We have k2 couples with 2 gaps (AGAG vs GAGA). • k couples with k mismatches (CCCC vs AGAG). • k couples with k-1 mismatches and 2 gaps (CCCC vs GAGA). • D(Mb) = 4k2 + k • Ratio: (k3 + 2k2) / (4k2 + k)=O(k) • Choose k large enough…
Random-star • Apparently the idea is not “that-bad” • Ex5: Show that if we choose the starting sequence from the k-size group at random we get an expected 2 approximation: E(d(Mb)/d(Mopt))<=2
Random-star • Ex6: Use this idea for a O(kn2) algorithm! • Solution: Choose sequence at random and proceed as in center-star • Complexity: • No need for initial pairwise alignments • All subsequent alignments can be implemented in O(kn2)
Progressive alignment • A heuristic method for finding MSA • There is no analytic bound on accuracy • Development is guided by empirical results • Examples: Feng-Doolittle, CLUSTALW
Progressive alignment • General algorithm: • Globally align every pair of sequences • Use the alignment scores to construct a “guide tree” • Combine alignments (sequence-sequence,sequence-profile, profile-profile) according to the guide tree
Progressive alignment W-NW F- RF W-NW WL-W WLW WNW FRF