Comp. Genomics Recitation 4 Multiple Sequence Alignment or Computational Complexity – What is it good for?
Outline • MSA is very expensive • Speedup – the Carrillo-Lipman bound • Approximation algorithms • Heuristics
MSA is expensive • The running time of DP MSA is O(2^N · L^N) for N sequences of length L • E.g., to align 10 proteins of 50 residues each we need ~10^20 operations • If one operation takes one microsecond, it takes over three million years to align these 10 proteins • How do we obtain practical algorithms?
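A back-of-the-envelope check of these numbers:

```python
# N = 10 sequences of length L = 50, as in the slide.
N, L = 10, 50
ops = 2**N * L**N                        # ~1e20 DP cell updates
years = ops * 1e-6 / (365 * 24 * 3600)   # at one microsecond per operation
print(f"{ops:.1e} operations, {years:.1e} years")  # ~1.0e+20, ~3.2e+06
```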
Exercise 1 • Show how we can save time by reducing the volume of the N-dimensional cube • Use the following definitions: • D(x_i,x_j) - the cost of the optimal pairwise alignment between the i-th and j-th sequences • A* - the optimal MSA • A*_{i,j} - the projection of the optimal MSA on the ij plane
The Carrillo-Lipman bound • A bound on the cost of the optimal MSA's projection on the ij plane • The optimal alignment is better than any other, so for any alignment A (e.g., one produced by a heuristic): C(A*) ≤ C(A) • Break the SP sum over pairs: C(A*) = Σ_{k<l} C(A*_{k,l}), and each projection costs at least the optimal pairwise alignment: C(A*_{k,l}) ≥ D(x_k,x_l) • Together: C(A*_{i,j}) ≤ C(A) − Σ_{(k,l)≠(i,j)} D(x_k,x_l)
The Carrillo-Lipman bound • Which cells can we cancel? • Those whose projection on some 2-dimensional plane falls in a cell such that the optimal 2-dimensional path through that cell costs more than the bound
Exercise 2 • We are building an MSA of x, y and z, of lengths n, m, and l: • D(x[1..i],y[1..j])=5, D(x[1..i],z[1..k])=7, D(y[1..j],z[1..k])=3 • D(x[i+1..n],y[j+1..m])=3, D(x[i+1..n],z[k+1..l])=7, D(y[j+1..m],z[k+1..l])=4 • Carrillo-Lipman gives us: C(A*_{x,y})≤13, C(A*_{x,z})≤14, C(A*_{y,z})≤15 (we assume lower scores are better) • True or false: the cell (i,j,k) needs to be considered in the MSA
Solution • The cost of the best path through (i,j) on the xy plane is 5+3=8 • The Carrillo-Lipman bound is 13, so the projection on the xy plane does not cancel the cell • Similarly, on the xz plane 7+7=14 ≤ 14 and on the yz plane 3+4=7 ≤ 15, so the other bounds do not cancel it either • The claim is true
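A minimal sketch of this elimination test with the exercise's numbers (the dictionaries are just illustrative bookkeeping):

```python
# Cell-elimination test for Exercise 2 (lower scores are better).
# through[p]: cost of the best pairwise path forced through the cell's
# projection on plane p; bound[p]: the Carrillo-Lipman bound for that plane.
through = {"xy": 5 + 3, "xz": 7 + 7, "yz": 3 + 4}
bound   = {"xy": 13,    "xz": 14,    "yz": 15}

# The cell can be discarded iff some projection already exceeds its bound.
cancelled = any(through[p] > bound[p] for p in through)
print("cell (i,j,k) cancelled?", cancelled)   # False -> must be considered
```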
Consensus sequence of optimal MA = Steiner string (up to spaces) • Build an MA consistent with the Steiner string S* • Its consensus error Σ_i D(S_i,S*) is minimal (by definition of S*) • Remove S* from the alignment • This is the optimal consensus MA • Easy to prove: S* is its consensus string
Approximation algorithms • Carrillo-Lipman is still impractical for many long sequences • Hence, our goal is to obtain faster algorithms • Approximation algorithms promise to remain within a certain factor of OPT • Good: constant-factor algorithms (e.g., a 1.785 approximation) • Worse: an approximation ratio that depends on the input size (e.g., log(n)) • Not always good empirically, but important for inspiring good heuristics
Approximation ratio • Has to be maintained for any legal input • Cost (score) of OPT is c(OPT) • Cost (score) of ALG is c(ALG) • Approximation ratio: c(ALG)/c(OPT)
Simple example: Vertex cover • Input: an undirected graph G=(V,E) • Vertex cover V': a set of vertices such that every edge has at least one endpoint in the set • The problem: find the smallest vertex cover • NP-Hard by reduction from 3-SAT/clique
Approximating Vertex Cover • Start from E'=E and V'={ } and iterate: • Pick an edge (u,v) from E' • Add u,v - both of its endpoints - to V' • Remove all edges incident to u or v from E' (a sketch follows below)
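A minimal sketch of this procedure on an edge list; skipping edges whose endpoints are already covered plays the role of removing them from E' (names are illustrative):

```python
def vertex_cover_2approx(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # edge not yet covered
            cover.update((u, v))               # add both endpoints
    return cover

# A path a-b-c-d: the algorithm returns {a, b, c, d} (size 4), while the
# minimum cover is {b, c} (size 2) -- within the promised factor of 2.
print(vertex_cover_2approx([("a", "b"), ("b", "c"), ("c", "d")]))
```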
Demo • [figure: step-by-step run of the algorithm on an example graph with vertices A–G] (Taken from Computational Complexity slides)
Approximating Vertex Cover • Complexity: O(|E|) with a proper implementation • V' is a vertex cover: every edge has to be “touched”, or it would remain in E'
Approximating Vertex Cover • This vertex cover has size at most 2× the minimum size (the OPT solution) • Let A be the set of edges picked in step 1 • No two edges in A share an endpoint (why?) • (A is a maximal matching…) • The number of vertices in the cover is c(ALG)=2|A| • Any vertex cover, OPT included, must contain at least one endpoint of each edge in A, and these endpoints are distinct, thus c(OPT) ≥ |A| • Approximation ratio: c(ALG)/c(OPT) ≤ 2
Reminder: SP MSA • Input: strings S_1,…,S_k of length n • d(i,j) – the distance between S_i and S_j as induced by the MSA • Sum-of-pairs (SP) score: SP = Σ_{i<j} d(i,j) • Goal: find an MSA with minimum SP score • We'll look for minimal scores
The Center Star algorithm • Assumptions: • The triangle inequality holds • σ(-,-)=0 • σ(x,y)=σ(y,x) • Input: strings S_1,…,S_k • The algorithm: • Find the string S* that minimizes Σ_i D(S*,S_i) • Iteratively add all other sequences to the alignment (a sketch follows below) • Running time: O(k^2 n^2)
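A minimal sketch of the whole algorithm, assuming unit costs for gaps and mismatches (so the assumptions above hold); the merge step follows the usual "once a gap, always a gap" rule, and all names are illustrative:

```python
def nw(s, t):
    """Global alignment, cost 1 per gap/mismatch; returns (cost, s_al, t_al)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i
    for j in range(m + 1): D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    a, b, i, j = [], [], n, m                  # traceback
    while i or j:
        if i and j and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return D[n][m], ''.join(reversed(a)), ''.join(reversed(b))

def center_star(seqs):
    # Step 1 (the bottleneck): O(k^2) pairwise alignments to pick the center.
    ci = min(range(len(seqs)), key=lambda i: sum(nw(seqs[i], t)[0] for t in seqs))
    center = seqs[ci]
    master, rows = center, []                  # center row (with gaps) + others
    for idx, s in enumerate(seqs):
        if idx == ci:
            continue
        _, c_al, s_al = nw(center, s)
        new_m, new_s, i, j = [], [], 0, 0
        while i < len(master) or j < len(c_al):
            if i < len(master) and master[i] == '-' and \
                    (j >= len(c_al) or c_al[j] != '-'):
                new_m.append('-'); new_s.append('-'); i += 1    # old gap column
            elif j < len(c_al) and c_al[j] == '-' and \
                    (i >= len(master) or master[i] != '-'):
                pos = len(new_m)                                # new gap column:
                rows = [r[:pos] + '-' + r[pos:] for r in rows]  # pad old rows
                new_m.append('-'); new_s.append(s_al[j]); j += 1
            else:
                new_m.append(master[i]); new_s.append(s_al[j]); i += 1; j += 1
        master = ''.join(new_m)
        rows.append(''.join(new_s))
    return [master] + rows

print(*center_star(["WLW", "WNW", "FRF"]), sep='\n')
```

Everything after the center-selection loop costs only O(kn^2); the O(k^2) selection is the bottleneck discussed later.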
Exercise 3 • Find the approximation ratio of the center star algorithm • Use the following definitions: • M* - an optimal alignment • M - the alignment produced by this algorithm • d(i,j) - the distance M induces on the pair S_i,S_j • D(S,T) - the min cost of an alignment between S and T • For all i: d(1,i)=D(S_1,S_i) (we perform an optimal alignment between S_1 and S_i, and σ(-,-)=0)
Randomized approximation algorithms • Now we want to reduce the O(k^2 n^2) time further • Randomized algorithms perform well on most inputs, and thus have good performance on average • Approximation ratio of a randomized algorithm: E(c(ALG)/c(OPT)) • The expectation is over the choices of the algorithm (coin tosses)
Max-Cut Approximation • Input: Undirected graph G • Cut: Partition of the graph nodes into two sets (A,B) • Size of the cut: the number of edges between nodes in A and nodes in B • Max-cut: find a cut with a maximum number of edges • NP-Hard
Max-Cut Approximation • Simple (stupid) randomized algorithm: • Start with A={} and B={} • Randomly assign every node to either A or B • How big a cut do we get? • For every e=(v_i,v_j) let X_ij=1 iff e is included in the cut • Size of the cut = Σ_{i,j} X_ij • E[X_ij] = 1/2 (Why?)
Max-Cut Approximation • Reminder: linearity of expectation: E[Σ_{i,j} X_ij] = Σ_{i,j} E[X_ij] • The expected size of the cut is m/2 • The optimal cut has at most m edges, of course • Hence E[c(ALG)]/c(OPT) ≥ 1/2! • On the way we proved that there is always some cut with ≥ m/2 edges • More details: http://www.cs.cmu.edu/afs/cs/academic/class/15859-f04/www/scribes/lec1.pdf
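A minimal sketch of the random-cut algorithm; averaging many runs on a toy graph illustrates that the expected cut size is m/2 (the graph and names are made up):

```python
import random

def random_cut(vertices, edges):
    side = {v: random.random() < 0.5 for v in vertices}  # coin toss per node
    return sum(1 for u, v in edges if side[u] != side[v])

V = "abcd"
E = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")]
avg = sum(random_cut(V, E) for _ in range(10_000)) / 10_000
print(avg)  # close to m/2 = 2.5
```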
Center-star reminder • Algorithm: • Compute the sequence “closest-to-all-others” • Align all sequences to it • Approximation: • (2-2/k) approximation to the optimal sum-of-pairs (SP) alignment • Complexity: • O(k^2) pairwise alignments to choose the center star - O(k^2 n^2) - Bottleneck! • O(k) pairwise alignments to construct the MSA: O(kn^2)
Random-star • What if instead of picking the best sequence as the starting point, we pick a random sequence from the group? • Ex4: Show that for any r, one can construct an example (a choice of cost function, sequences, and number of sequences) such that the algorithm is more than a factor r worse than OPT: d(M_b)/d(M_opt) > r
Solution • Bad sequence: CCC..C (k letters) • Other sequences: k copies of AGAG…AG and k copies of GAGA…GA (k letters each) • Costs: 1 for a gap and 1 for a mismatch
Solution • c(ALG) (star centered at CCC…C): • k^2 couples with k mismatches (AGAG vs GAGA) • 2k couples with k mismatches (CCCC vs AGAG/GAGA) • d(M_b) = k^3 + 2k^2 • c(OPT): • k^2 couples with 2 gaps (AGAG vs GAGA) • k couples with k mismatches (CCCC vs AGAG) • k couples with k-1 mismatches and 2 gaps (CCCC vs GAGA) • d(M_opt) = 4k^2 + k • Ratio: (k^3 + 2k^2) / (4k^2 + k) = Θ(k), which is unbounded (see the check below) • Choose k large enough…
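A quick check that the ratio indeed grows linearly in k:

```python
# The ratio (k^3 + 2k^2) / (4k^2 + k) grows roughly like k/4:
for k in (10, 100, 1000):
    print(k, (k**3 + 2 * k**2) / (4 * k**2 + k))
# 10 ~2.93, 100 ~25.4, 1000 ~250.4
```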
Random-star • Apparently the idea is not “that bad” • Ex5: Show that if we choose the starting sequence from the k-size group at random, we get an expected 2-approximation: E(d(M_b)/d(M_opt)) ≤ 2
Random-star • Ex6: Use this idea for an O(kn^2) algorithm! • Solution: choose a sequence at random and proceed as in center-star • Complexity: • No need for the initial O(k^2) pairwise alignments • All subsequent alignments can be implemented in O(kn^2) total
Progressive alignment • A heuristic method for finding an MSA • There is no analytic bound on accuracy • Development is guided by empirical results • Examples: Feng-Doolittle, CLUSTALW
Progressive alignment • General algorithm: • Globally align every pair of sequences • Use the alignment scores to construct a “guide tree” • Combine alignments (sequence-sequence, sequence-profile, profile-profile) according to the guide tree (a guide-tree sketch follows below)
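A minimal sketch of step 2, assuming the pairwise scores from step 1 have already been turned into distances; clusters are merged by average linkage, which is an illustration rather than the exact Feng-Doolittle or CLUSTALW procedure (names and toy distances are made up):

```python
def guide_tree(names, dist):
    """names: sequence ids; dist[(a, b)]: pairwise distance (a before b).
    Returns the guide tree as nested tuples."""
    def D(x, y):
        return dist.get((x, y), dist.get((y, x)))
    nodes = list(names)
    while len(nodes) > 1:
        # Merge the closest pair of clusters.
        a, b = min(((x, y) for i, x in enumerate(nodes) for y in nodes[i+1:]),
                   key=lambda p: D(*p))
        nodes.remove(a); nodes.remove(b)
        for x in nodes:                      # average-linkage distance update
            dist[((a, b), x)] = (D(a, x) + D(b, x)) / 2
        nodes.append((a, b))
    return nodes[0]

# Toy distances: s1 and s2 are closest, so they are merged first.
tree = guide_tree(["s1", "s2", "s3"],
                  {("s1", "s2"): 1, ("s1", "s3"): 4, ("s2", "s3"): 5})
print(tree)  # ('s3', ('s1', 's2'))
```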
Progressive alignment • [figure: a guide tree over the sequences WLW, WNW and FRF, with the intermediate pairwise alignments WL-W / W-NW and F-RF shown at its internal nodes]
Profile alignment scores (for two profile columns p,q over the alphabet) • Euclidean: d(p,q) = √(Σ_a (p_a − q_a)^2) • Pearson correlation coefficient: r(p,q) = Σ_a (p_a − p̄)(q_a − q̄) / (σ_p · σ_q) • Kullback-Leibler divergence (relative entropy): D(p‖q) = Σ_a p_a log(p_a / q_a)
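A minimal sketch computing these three measures for two profile columns, each represented as a dictionary of letter frequencies (the example columns are made up):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((p[a] - q[a]) ** 2 for a in p))

def pearson(p, q):
    n = len(p)
    mp, mq = sum(p.values()) / n, sum(q.values()) / n
    cov = sum((p[a] - mp) * (q[a] - mq) for a in p)
    sp = math.sqrt(sum((v - mp) ** 2 for v in p.values()))
    sq = math.sqrt(sum((v - mq) ** 2 for v in q.values()))
    return cov / (sp * sq)

def kl(p, q):  # assumes q[a] > 0 wherever p[a] > 0
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

# Two made-up profile columns over the DNA alphabet.
p = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
q = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
print(euclidean(p, q), pearson(p, q), kl(p, q))
```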