
Comp. Genomics




  1. Comp. Genomics Recitation 4 Multiple Sequence Alignment or Computational Complexity – What is it good for?

  2. Outline • MSA is very expensive • Speedup – Carrillo-Lipman • Approximation algorithms • Heuristics

  3. MSA is expensive • The running time of DP MSA is O(2^N·L^N) • E.g., to align 10 proteins of 50 residues each we need 2^10·50^10 = 10^20 operations • If one operation takes one microsecond, it takes over three million years to align these 10 proteins • How do we obtain practical algorithms?
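A quick sanity check of the arithmetic above (the figures N = 10 sequences, L = 50 residues, one microsecond per operation are the slide's; the variable names are mine):

```python
# Back-of-the-envelope cost of exact DP MSA: O(2^N * L^N) table updates.
N, L = 10, 50
ops = (2 ** N) * (L ** N)            # number of DP cell updates: (2*50)^10 = 10^20
seconds = ops * 1e-6                 # at 1 microsecond per operation
years = seconds / (3600 * 24 * 365)  # millions of years -- hopeless either way
print(f"{ops:.2e} operations, about {years:.2e} years")
```

Whatever the exact constant, the point stands: the exact DP is unusable beyond a handful of short sequences.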

  4. Exercise 1 • Show how we can save time by reducing the volume of the N-dimensional cube • Use the following definitions: • D(x_i,x_j) – the optimal pairwise alignment cost between the ith and jth sequences • A* – the optimal MSA • A*_{i,j} – the projection of the optimal MSA on the ij plane

  5. The Carrillo-Lipman bound • The SP cost of the optimal MSA decomposes over pairs: c(A*) = Σ_{i<j} c(A*_{i,j}) • The optimal alignment is better than any other, and each projection costs at least the optimal pairwise alignment: c(A*_{i,j}) ≥ D(x_i,x_j) • Break the sum at pair (i,j): for any upper bound z ≥ c(A*) (e.g., the cost of an alignment we already have), c(A*_{i,j}) ≤ z − Σ_{(k,l)≠(i,j)} D(x_k,x_l)

  6. The Carrillo-Lipman bound • Which cells can we cancel? • Those whose projection on some 2-dimensional plane falls in a cell such that the optimal 2-dimensional path through that cell costs more than the bound

  7. Exercise 2 • We are building an MSA of x, y and z, of lengths n, m, and l: • D(x[1,..,i],y[1,..,j])=5, D(x[1,..,i],z[1,..,k])=7, D(y[1,..,j],z[1,..,k])=3 • D(x[i+1,..,n],y[j+1,..,m])=3, D(x[i+1,..,n],z[k+1,..,l])=7, D(y[j+1,..,m],z[k+1,..,l])=4 • Carrillo-Lipman gave us: C(A*_{x,y})≤13, C(A*_{x,z})≤14, C(A*_{y,z})≤15 (we assume lower scores are better) • True or false: the cell (i,j,k) needs to be considered in the MSA

  8. Solution • The cost of the best xy path forced through (i,j) is 5+3=8 • The Carrillo-Lipman bound is 13, so the projection on the xy plane does not cancel the cell • Similarly for the other planes: xz: 7+7=14 ≤ 14, yz: 3+4=7 ≤ 15 • The claim is true
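The pruning test of Exercise 2 can be sketched in a few lines (function and key names are mine; "forward"/"backward" are the optimal pairwise prefix/suffix costs through the cell's projection on each plane):

```python
# Carrillo-Lipman pruning test, lower scores better: a cell must be
# considered only if, on EVERY pairwise plane, the best pairwise path
# forced through its projection stays within that plane's bound.
def cell_needed(forward, backward, bound):
    return all(forward[p] + backward[p] <= bound[p] for p in forward)

# Numbers from Exercise 2 ("xy", "xz", "yz" are the projection planes):
forward  = {"xy": 5, "xz": 7, "yz": 3}
backward = {"xy": 3, "xz": 7, "yz": 4}
bound    = {"xy": 13, "xz": 14, "yz": 15}
print(cell_needed(forward, backward, bound))  # True: the cell is kept
```

Tightening any one bound below the forced path cost (e.g., 13 instead of 14 on xz) would cancel the cell.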

  9. Consensus sequence of an optimal MA = Steiner string (up to spaces) • Build an MA consistent with the Steiner string S* • Its consensus error Σ_i D(S_i,S*) is minimal (by the definition of S*) • Remove S* from the alignment • This is an optimal consensus MA • Easy to prove: S* is its consensus string

  10. Approximation algorithms • Carrillo-Lipman is still impractical for many long sequences • Hence, our goal is to obtain faster algorithms • Approximation algorithms promise to remain within a certain factor of OPT • Good: constant-factor algorithms (e.g., a 1.785 approximation) • Worse: approximation ratio dependent on the input size (e.g., log(n)) • Not always good empirically, but important for inspiring good heuristics

  11. Approximation ratio • Has to be maintained for any legal input • Cost (score) of OPT is c(OPT) • Cost (score) of ALG is c(ALG) • Approximation ratio: c(ALG)/c(OPT)

  12. Simple example: Vertex cover • Input: an undirected graph G=(V,E) • Vertex cover V': a set of vertices such that every edge is covered by at least one vertex in the set • The problem: find the smallest vertex cover • NP-Hard, by reduction from 3-SAT/clique

  13. Approximating Vertex Cover • Start from E'=E and V'={ } and iterate: • Pick an edge (u,v) from E' • Add both of its endpoints, u and v, to V' • Remove all edges adjacent to u or v from E'
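The loop above can be sketched as follows (scanning the edges in a fixed order and skipping already-covered ones is equivalent to maintaining E' explicitly):

```python
# Matching-based 2-approximation for vertex cover: repeatedly take an
# uncovered edge and add BOTH of its endpoints to the cover.
def approx_vertex_cover(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # edge still in E'
            cover.update((u, v))               # add both endpoints
    return cover

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
cover = approx_vertex_cover(edges)
assert all(u in cover or v in cover for u, v in edges)  # it is a cover
print(cover)
```

The picked edges form a matching, so the cover always has even size: twice the number of picked edges.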

  14. Demo [figure: the algorithm run on a small example graph] (Taken from Computational Complexity slides)

  15. Approximating Vertex Cover • Complexity: O(|E|) with a proper implementation • V' is a vertex cover: every edge has to be "touched", or it would remain in E'

  16. Approximating Vertex Cover • This vertex cover has size at most 2× the minimum size (the OPT solution) • Let A be the set of edges the algorithm picked • No two edges in A share an endpoint (why?) • (A is a maximal matching…) • The number of vertices in the cover is c(ALG)=2|A| • Any vertex cover, OPT included, must contain at least one endpoint of each edge in A, and these endpoints are all distinct, so c(OPT)>=|A| • Approximation ratio: c(ALG)/c(OPT)<=2

  17. Reminder: SP MSA • Input: strings S1,…,Sk of length n • d(i,j) – the distance between S_i and S_j as induced by the MSA • Sum-of-pairs (SP) score: SP = Σ_{i<j} d(i,j) • Goal: find an MSA with minimum SP score • We'll look for minimal scores

  18. The Center Star algorithm • Assumptions: • The triangle inequality holds • σ(-,-)=0 (a space aligned with a space is free) • σ(x,y)=σ(y,x) • Input: strings S1,…,Sk • The algorithm: • Find the string S* that minimizes Σ_j D(S*,S_j) • Iteratively add all other sequences to the alignment • Running time: O(k^2n^2)
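A minimal sketch of the center-selection step, assuming unit-cost edit distance as the pairwise cost D (it satisfies the triangle inequality; the function names are mine, and the iterative merge step is omitted):

```python
# Center star, step 1: pick the string S* minimizing sum_j D(S*, S_j),
# using O(k^2) pairwise alignments.
def edit_distance(s, t):
    # classic O(|s||t|) DP, cost 1 per mismatch or per gap
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # gap in t
                           cur[-1] + 1,            # gap in s
                           prev[j - 1] + (a != b)))  # match/mismatch
        prev = cur
    return prev[-1]

def center_star(seqs):
    return min(seqs, key=lambda s: sum(edit_distance(s, t) for t in seqs))

print(center_star(["AGAT", "AGCT", "TGAT", "AGAA"]))  # AGAT
```

"AGAT" wins because it is one edit away from each of the other three, for a total of 3.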

  19. Exercise 3 • Find the approximation ratio of the center star algorithm • Use the following definitions: • M* – an optimal alignment • M – the alignment produced by this algorithm • d(i,j) – the distance M induces on the pair S_i,S_j • D(S,T) – min cost of alignment between S and T • For all i: d(1,i)=D(S1,S_i) (we perform an optimal alignment between the center S1 and S_i, and σ(-,-)=0)

  20. Solution • By the triangle inequality, d(i,j) ≤ d(i,1) + d(1,j) = D(S1,S_i) + D(S1,S_j) • Summing over pairs: c(M) = Σ_{i<j} d(i,j) ≤ (k−1)·Σ_{i>1} D(S1,S_i) • In the optimal alignment d*(i,j) ≥ D(S_i,S_j), and S1 is the center, so for every i: Σ_j D(S_i,S_j) ≥ Σ_j D(S1,S_j) • Hence 2c(M*) = Σ_i Σ_{j≠i} d*(i,j) ≥ k·Σ_{j>1} D(S1,S_j) • Ratio: c(M)/c(M*) ≤ 2(k−1)/k = 2 − 2/k

  21. Randomized approximation algorithms • Now we want to reduce the O(k^2n^2) time further • Randomized algorithms perform well on most of their random choices, and thus have a good performance on average • Approximation ratio of a randomized algorithm: E(c(ALG)/c(OPT)) • The expectation is over the choices of the algorithm (coin tosses)

  22. Max-Cut Approximation • Input: Undirected graph G • Cut: Partition of the graph nodes into two sets (A,B) • Size of the cut: the number of edges between nodes in A and nodes in B • Max-cut: find a cut with a maximum number of edges • NP-Hard

  23. Max-Cut Approximation • Simple (stupid) randomized algorithm: • Start with A={} and B={} • Randomly assign every node to either A or B • How big a cut do we get? • For every edge e=(v_i,v_j) let X_ij=1 iff e is included in the cut • Size of the cut = Σ_{i<j} X_ij • E[X_ij]=1/2 (Why?)

  24. Max-Cut Approximation • Reminder: linearity of expectation: E[Σ_{i<j} X_ij] = Σ_{i<j} E[X_ij] • The expected size of the cut is m/2 • The optimal cut has at most m edges, of course • Hence E[c(ALG)]/c(OPT) >= 1/2! • On the way we proved that there is always some cut with >= m/2 edges • More details: http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec1.pdf
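The random-assignment algorithm is a one-liner per node; here is a sketch on the complete graph K4 (6 edges, so the expected cut size is 3; the names and the seeded averaging loop are mine):

```python
import random

# Random max-cut: put each node on side A or side B with probability 1/2.
# Every edge crosses the cut with probability 1/2, so E[cut] = m/2.
def random_cut(nodes, edges, rng):
    side = {v: rng.random() < 0.5 for v in nodes}
    return sum(1 for u, v in edges if side[u] != side[v])

nodes = range(4)
edges = [(u, v) for u in nodes for v in nodes if u < v]  # K4: m = 6
sizes = [random_cut(nodes, edges, random.Random(seed)) for seed in range(1000)]
print(sum(sizes) / len(sizes))  # close to m/2 = 3
```

Averaging over many runs makes the m/2 guarantee visible empirically, even though any single run may do worse.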

  25. Center-star reminder • Algorithm: • Compute the sequence "closest-to-all-others" • Align all sequences to it • Approximation: • (2−2/k) approximation to the optimal sum-of-pairs (SP) alignment • Complexity: • O(k^2) pairwise alignments to choose the center star: O(k^2n^2) – bottleneck! • O(k) pairwise alignments to construct the MSA: O(kn^2)

  26. Random-star • What if, instead of picking the best sequence as the starting point, we pick a random sequence from the group? • Ex4: Show that for any r we can construct an example (choosing the cost function, the sequences and the number of sequences) such that the algorithm is more than r times worse than OPT: d(M_b)/d(M_opt) > r

  27. Solution • Bad center: CCC…C (k letters) • Other sequences: k copies of AGAG…AG and k copies of GAGA…GA (each of length k) • Costs: 1 for a gap and 1 for a mismatch

  28. Solution • c(ALG): • k^2 couples with k mismatches (AGAG… vs GAGA…) • 2k couples with k mismatches (CCC… vs AGAG…/GAGA…) • c(ALG) = k^3 + 2k^2 • c(OPT): • k^2 couples with 2 gaps (AGAG… vs GAGA…) • k couples with k mismatches (CCC… vs AGAG…) • k couples with k−1 mismatches and 2 gaps (CCC… vs GAGA…) • c(OPT) = 4k^2 + k • Ratio: (k^3 + 2k^2)/(4k^2 + k) = Θ(k) • Choose k large enough…
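The counting above is easy to double-check numerically (function names are mine; they just evaluate the two closed-form costs):

```python
# Costs from the bad-center construction, as functions of k.
def bad_center_cost(k):
    # k^2 AG-vs-GA pairs with k mismatches + 2k center pairs with k mismatches
    return k**3 + 2 * k**2

def opt_cost(k):
    # 2k^2 gap cost + k^2 + (k^2 + k) mismatch cost
    return 4 * k**2 + k

for k in (10, 100, 1000):
    print(k, bad_center_cost(k) / opt_cost(k))  # the ratio grows linearly in k
```

For k = 10 the ratio is already about 2.9, and it keeps growing with k, so no fixed r bounds it.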

  29. Random-star • Apparently the idea is not that bad • Ex5: Show that if we choose the starting sequence from the k-size group at random, we get an expected 2-approximation: E(d(M_b)/d(M_opt)) <= 2

  30. Solution • For center S_i, as in Exercise 3: c(M_i) ≤ (k−1)·Σ_{j≠i} D(S_i,S_j) • Averaging over the random choice of i: E[c(M_i)] ≤ (k−1)/k · Σ_i Σ_{j≠i} D(S_i,S_j) = 2(k−1)/k · Σ_{i<j} D(S_i,S_j) • Since c(M*) ≥ Σ_{i<j} D(S_i,S_j): E[c(M_i)/c(M*)] ≤ 2(k−1)/k ≤ 2

  31. Random-star • Ex6: Use this idea for an O(kn^2) algorithm! • Solution: choose a sequence at random and proceed as in center-star • Complexity: • No need for the initial O(k^2) pairwise alignments • All subsequent alignments can be implemented in O(kn^2)

  32. Progressive alignment • A heuristic method for finding MSA • There is no analytic bound on accuracy • Development is guided by empirical results • Examples: Feng-Doolittle, CLUSTALW

  33. Progressive alignment • General algorithm: • Globally align every pair of sequences • Use the alignment scores to construct a "guide tree" • Combine alignments (sequence-sequence, sequence-profile, profile-profile) according to the guide tree
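The first two steps can be sketched as a toy guide-tree builder: score all pairs, then greedily merge the two closest clusters. This is only a caricature of the scheme; real tools such as CLUSTALW use proper distance corrections and tree methods, and the names here are mine:

```python
# Toy guide tree: pairwise edit distances + greedy single-linkage merging.
def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    clusters = [(i,) for i in range(len(seqs))]
    dist = lambda c1, c2: min(edit_distance(seqs[i], seqs[j])
                              for i in c1 for j in c2)
    order = []  # the sequence of (profile, profile) merge steps
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: dist(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        order.append((a, b))
    return order

print(guide_tree(["WNW", "WLW", "FRF"]))
```

On these three sequences the closest pair (WNW, WLW) is merged first, and FRF joins their profile last, which is exactly the order a progressive aligner would follow.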

  34. Progressive alignment [figure: pairwise alignments of WNW, WLW and FRF are merged into a single MSA along the guide tree]

  35. Profile alignment scores • For two profile columns p and q (distributions over the alphabet): • Euclidean distance: sqrt(Σ_a (p_a − q_a)^2) • Pearson correlation coefficient between the vectors (p_a) and (q_a) • Kullback-Leibler divergence (relative entropy): Σ_a p_a·log(p_a/q_a)
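The three column-comparison scores above, for columns given as probability dictionaries (a minimal sketch; real profile aligners add pseudocounts so the KL term is always defined):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in p))

def pearson(p, q):
    # correlation of the two frequency vectors over the same alphabet
    n = len(p)
    mp, mq = sum(p.values()) / n, sum(q.values()) / n
    cov = sum((p[a] - mp) * (q[a] - mq) for a in p)
    sp = math.sqrt(sum((p[a] - mp) ** 2 for a in p))
    sq = math.sqrt(sum((q[a] - mq) ** 2 for a in q))
    return cov / (sp * sq)

def kl_divergence(p, q):
    # relative entropy; assumes q[a] > 0 wherever p[a] > 0
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

p = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}  # a conserved column
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # a uniform column
print(euclidean(p, q), kl_divergence(p, q))
```

Note the KL divergence is asymmetric and zero only when the two columns are identical, which is why it is often preferred for detecting conserved positions.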
