Multiple Sequence Alignment. Algorithms in Computational Biology, Spring 2006. Most of the slides were created by Dan Geiger and Ydo Wexler and edited by Itai Sharon; others were created by Itai Sharon.
Multiple Sequence Alignment • S1 = AGGTC • S2 = GTTCG • S3 = TGAAC • [Figure: two possible alignments of S1, S2 and S3, each shown as a three-row matrix with gaps]
Motivation • Construction of phylogenetic trees • Requires that sites being compared are homologous • Extraction of conserved regions in proteins • Construction of profiles characteristic for a protein family • Repetitive sequences in DNA
Multiple Sequence Alignment (MSA) Definition • Given strings s1, s2, …, sk, an MSA algorithm maps them to strings s’1, s’2, …, s’k that may contain gaps, where: • |s’1| = |s’2| = … = |s’k| • The removal of gaps from s’i leaves si Note • It is usually convenient to represent an MSA as a matrix with k rows and |s’i| columns • No column may consist solely of gaps
Assigning scores to an MSA • We will consider additive functions only • Points to consider regarding a scoring function • It should not depend on the order of its arguments • It should reward the presence of many equal/strongly related residues and penalize unrelated residues and spaces • In pairwise alignment the score is simply the sum of similarity scores of corresponding letters • What is the “best” way to measure the similarity of k>2 letters?
Sum of Pairs (SP) • The sum-of-pairs score of an MSA is the sum of the scores of all pairwise alignments induced by it • Example: using a cost function δ(x, x) = 0 and δ(x, y) = 1 for x ≠ y (with the gap treated as an ordinary character, so two aligned gaps cost 0), the alignment
a c - c d b -
- c - a d b -
a - b c d a d
has an SP value of 4 + 6 + 2 = 12
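The SP computation in the example can be checked with a few lines of Python. This is a minimal sketch, not code from the course: the names `delta` and `sp_score` are made up for illustration, and the gap is assumed to behave like an ordinary character under the unit cost.

```python
from itertools import combinations

GAP = "-"

def delta(x, y):
    """Unit cost from the example: 0 for equal symbols, 1 otherwise.
    Assumption: the gap is an ordinary symbol, so delta(GAP, GAP) = 0."""
    return 0 if x == y else 1

def sp_score(rows, cost=delta):
    """Sum the cost of every pairwise alignment induced by the MSA rows."""
    assert len({len(r) for r in rows}) == 1, "all rows must have equal length"
    return sum(
        cost(a, b)
        for r1, r2 in combinations(rows, 2)  # every pair of rows
        for a, b in zip(r1, r2)              # every column of that pair
    )

msa = ["ac-cdb-", "-c-adb-", "a-bcdad"]
print(sp_score(msa))  # 12
```

Passing a different `cost` function switches from a distance to a similarity score without touching the summation.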
Sum of Pairs • SP tends to overcount mutations. For instance: • Assume that a column consists of (A, A, A, T) and that δ(x, x) = 1, δ(x, y) = -1 • The score for the column is 3·δ(A, A) + 3·δ(A, T) = 3 – 3 = 0, although the column could be explained by a single mutation [Figure: the four residues A, A, A, T]
How to Perform MSA? • Multidimensional dynamic programming • Tree alignments • Star alignments • Progressive alignment
Multidimensional DP Alignment • Given k strings of length n, there is a natural generalization of the DP algorithm • Instead of a 2-dimensional table, we now have a k-dimensional table to fill • For each cell V(i), i = (i1, …, ik), compute an optimal multiple alignment for the k prefix sequences s1(1, …, i1), …, sk(1, …, ik) • The adjacent cells are all cells V(i − b), where $b \in \{0, 1\}^k$ and $b \ne 0$ • Each cell depends on $2^k - 1$ adjacent cells • Use the SP-score for computing the score
Multidimensional DP Alignment • What’s the price? • Number of cells to fill: $O(n^k)$ • Number of dependencies of each cell: $2^k - 1$ • Time to compute the SP-score: $O(k^2)$ • Complexity: $O(k^2 2^k n^k)$ • In fact, the optimal SP-alignment problem was shown to be NP-complete! • Well, these sequences need to be aligned… what can we do?
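To make the k-dimensional recurrence concrete, here is a small sketch that fills the table lazily by memoized recursion. It assumes a unit-cost SP objective that is minimized (0 for equal symbols, 1 otherwise, gap treated as a symbol); the function names are illustrative only. It is practical only for a handful of short sequences, for exactly the complexity reasons listed above.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def column_cost(col):
    """SP cost of one column under the assumed unit cost: O(k^2) pairs."""
    return sum(0 if a == b else 1 for a, b in combinations(col, 2))

def msa_sp_cost(seqs):
    """Minimum SP cost of a multiple alignment of seqs (exponential in k)."""
    k = len(seqs)
    # All non-zero vectors b in {0,1}^k: the 2^k - 1 predecessor directions.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def V(idx):
        if not any(idx):          # cell (0, ..., 0): nothing to align
            return 0
        best = None
        for b in moves:
            if any(i < bi for i, bi in zip(idx, b)):
                continue          # would step before the start of a sequence
            # Sequences with b_j = 1 contribute a letter, the others a gap.
            col = tuple(s[i - 1] if bi else GAP
                        for s, i, bi in zip(seqs, idx, b))
            prev = tuple(i - bi for i, bi in zip(idx, b))
            cand = V(prev) + column_cost(col)
            if best is None or cand < best:
                best = cand
        return best

    return V(tuple(len(s) for s in seqs))
```

For k = 2 this reduces to the usual unit-cost edit distance, which gives a convenient sanity check.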
Time Saving Heuristics – Relevance Tests Idea: Avoid computing score(i) for irrelevant cells • Compute a lower bound L on the optimal alignment • Any efficient approximation algorithm can be used • For each cell V(i) compute an upper bound U on the best alignment that goes through it • Ignore the cell if U<L
Time Saving Heuristics – Relevance Tests How do we compute the upper bound U for cell V(i)? • For cell i = (…, iu, …, iv, …) do the following: • For each pair of indices 1 ≤ u < v ≤ k, compute the optimal score of a pairwise alignment of su and sv that goes via cell (iu, iv) • Compute U as the sum of these pairwise scores over all pairs u < v • Claim: U is an upper bound on the best MSA that goes through cell i
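One way to obtain the pairwise "best alignment through a cell" values is to combine a prefix table with a suffix table, where the suffix table is computed on the reversed strings. The sketch below assumes a similarity score that is maximized (match 1, mismatch -1, gap -1; these values are an assumption, not from the slides) and that two aligned gaps score 0 in an induced pairwise alignment. All function names are hypothetical.

```python
from itertools import combinations

def prefix_scores(s, t, match=1, mismatch=-1, gap=-1):
    """F[i][j] = best global alignment score of s[:i] and t[:j]."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F

def through_scores(s, t):
    """T[i][j] = best score of an alignment of s and t passing via (i, j)."""
    n, m = len(s), len(t)
    F = prefix_scores(s, t)              # prefixes s[:i], t[:j]
    B = prefix_scores(s[::-1], t[::-1])  # suffixes, via reversed strings
    return [[F[i][j] + B[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]

def upper_bound(seqs, cell):
    """U for one cell: sum over pairs u < v of the pairwise through-scores."""
    return sum(through_scores(seqs[u], seqs[v])[cell[u]][cell[v]]
               for u, v in combinations(range(len(seqs)), 2))
```

In a real relevance test the pairwise tables would be computed once and reused for every cell; they are recomputed here only to keep the sketch short.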
Time Saving Heuristics – Relevance Tests • How do we compute the optimal route? • Recall the space-efficient algorithm for pairwise alignment • Can we go over all cells and determine whether each one is relevant? • No. Start with (0, …, 0) and add relevant entries to the list until reaching (n1, …, nk) • What is the new time complexity? • For each potential cell we have added $O(k^2 n^2)$ operations • Depending on the quality of L, we have (hopefully) eliminated many cells
Tree Alignments – Structure Input • A set of k sequences S = {s1, s2, …, sk} • Topology of the tree T whose leaves are the members of S Algorithm • Find an assignment of sequences for the interior nodes of the tree that optimizes the overall score • For each edge e = (vi, vj) of T, its weight w(e) is the pairwise alignment score of vi and vj • The overall score is defined by $\mathrm{score}(T) = \sum_{e \in T} w(e)$
Tree Alignments – an Example • Suppose that we are given the following tree: [Figure: tree whose nodes are labeled CAT, CTG, CT, CG, GT and CG, with edge weights 1, 1, 1, 1 and 2] • Given that δ(x, x) = 1, δ(x, y) = 0 and δ(x, -) = -1, the overall score of the alignment is score(T) = 2 + 3 + 1 = 6
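Scoring a fully labeled tree is just a sum of pairwise alignment scores over its edges. The sketch below uses the scoring from this slide (match 1, mismatch 0, gap -1) on a small hypothetical tree; the topology and node names are made up for illustration and are not the tree from the figure.

```python
def pair_score(s, t, match=1, mismatch=0, gap=-1):
    """Best global alignment score of s and t (row-by-row DP)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def tree_score(labels, edges):
    """Sum of w(e) over all edges, with w(e) the pairwise alignment score."""
    return sum(pair_score(labels[u], labels[v]) for u, v in edges)

# Hypothetical tree: leaves a, b, c and interior nodes x, y.
labels = {"a": "CAT", "b": "CTG", "c": "CG", "x": "CT", "y": "CG"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("y", "c")]
print(tree_score(labels, edges))  # 1 + 1 + 1 + 2 = 5
```

The hard part of tree alignment, choosing the interior labels, is not addressed here; the sketch only evaluates a given assignment.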
Tree Alignments – Notes • The MSA can be recovered from the alignments on the different edges • Overall score of the alignment is not SP • The tree alignment problem is NP-hard • There exists an algorithm that finds an optimal alignment in time exponential in the number of sequences • Tree alignment algorithms are applicable only when a tree topology is known
Star Alignments – Structure [Figure: star topology over the sequences s1, …, s6] • Choose a sequence s* that will serve as the center of the star • How to choose: try all sequences, choose the one whose total distance from all the rest is the smallest, etc. • Add the other sequences by aligning them to s* • Add gaps to already aligned sequences when necessary • Never remove a gap (“Once a gap, always a gap”)
The Center Star Method • Publication • Gusfield, 1993 • Assumption • The cost function δ is a distance function that satisfies: δ(x, y) = δ(y, x) ≥ 0; δ(x, x) = 0; δ(x, z) + δ(z, y) ≥ δ(x, y) • Algorithm • Runs in polynomial time • The score of the resulting alignment is less than twice the score of the optimal alignment
The Center Star Method – Definitions • Definitions • M – the alignment produced by the algorithm • M* – the best alignment, namely the one that gets the lowest score • d(i, j), d*(i, j) – the distance induced by M (respectively M*) on (si, sj) • DP(si, sj) – minimum pairwise alignment score • v(M) – score for alignment M: $v(M) = \sum_{1 \le i < j \le k} d(i, j)$ • Note that it is always true that d(i, j) ≥ DP(si, sj)
The Center Star Method • Input • A set of k sequences S = {s1, …, sk} • Algorithm • Find the center $s^* = \arg\min_{s_i \in S} \sum_{j \ne i} DP(s_i, s_j)$. Suppose s* = s1 • for i = 2 to k do: • Suppose s1, …, si-1 are already aligned as s’1, …, s’i-1 • Align si against s’1 by running the DP algorithm to produce the alignment (s”1, s’i) • Adjust s’2, …, s’i-1 to s”1 by adding gaps to those columns where gaps were added to get s”1 from s’1 • Replace s’1 by s”1, add s’i • end for
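The loop above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the course's code: δ is taken to be the unit cost (0 for equal symbols, 1 otherwise), the gap is treated as a symbol so that aligning an existing gap of the center with a new gap costs 0, and the helper names `align` and `center_star` are made up.

```python
GAP = "-"

def delta(x, y):
    """Assumed unit cost; the gap is a symbol, so delta(GAP, GAP) = 0."""
    return 0 if x == y else 1

def align(a, b):
    """Min-cost global alignment of a (may hold gaps) and b.
    Returns (cost, a', b', new_gap); new_gap[c] is True where column c
    is a gap newly inserted into a."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(a[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                          D[i - 1][j] + delta(a[i - 1], GAP),
                          D[i][j - 1] + delta(GAP, b[j - 1]))
    ra, rb, new_gap = [], [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the last cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(a[i - 1], b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); new_gap.append(False)
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(a[i - 1], GAP):
            ra.append(a[i - 1]); rb.append(GAP); new_gap.append(False)
            i -= 1
        else:
            ra.append(GAP); rb.append(b[j - 1]); new_gap.append(True)
            j -= 1
    return D[n][m], "".join(ra[::-1]), "".join(rb[::-1]), new_gap[::-1]

def center_star(seqs):
    """Return the aligned rows in input order."""
    k = len(seqs)
    total = [sum(align(s, t)[0] for t in seqs) for s in seqs]
    c = min(range(k), key=total.__getitem__)   # index of the center s*
    order = [c] + [i for i in range(k) if i != c]
    rows = [seqs[c]]                            # rows[0] is always the center
    for i in order[1:]:
        _, _, new_row, new_gap = align(rows[0], seqs[i])
        widened = []
        for row in rows:                        # once a gap, always a gap
            chars = iter(row)
            widened.append("".join(GAP if g else next(chars) for g in new_gap))
        rows = widened + [new_row]
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = row
    return result
```

Returning the `new_gap` mask from `align` avoids having to guess, after the fact, which gaps in the center row are new and which were already there.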
The Center Star Method – Time Analysis • Choosing s* • Running the DP algorithm $\binom{k}{2}$ times: $O(k^2 n^2)$ • Adding s2, …, sk to the MSA • In step i the length of s’* is at most i·n • Aligning s’* with si takes $O(i \cdot n^2)$ time • Performing k − 1 such alignments takes $\sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2)$ time • Overall time complexity: $O(k^2 n^2)$
The Center Star Method – Error Analysis • [Equations: a chain of inequalities relating v(M) to v(M*), with steps annotated “triangle inequality”, “d(1, i) = DP(1, i)” and “definition of s1”]
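The surviving annotations match the standard argument for the factor-2 bound, which can be reconstructed roughly as follows. This is a sketch under assumptions rather than the slide's exact equations: v(M) is taken to be the sum of d(i, j) over all pairs i < j, s1 is the center, and v(M*) > 0.

```latex
\begin{align*}
v(M) &= \sum_{1 \le i < j \le k} d(i, j)
      \le \sum_{1 \le i < j \le k} \bigl[ d(1, i) + d(1, j) \bigr]
      && \text{(triangle inequality)} \\
     &= (k - 1) \sum_{l = 2}^{k} d(1, l)
      = (k - 1) \sum_{l = 2}^{k} DP(s_1, s_l)
      && \bigl(d(1, l) = DP(s_1, s_l)\bigr) \\[4pt]
v(M^*) &= \sum_{1 \le i < j \le k} d^*(i, j)
      \ge \sum_{1 \le i < j \le k} DP(s_i, s_j)
      = \frac{1}{2} \sum_{i = 1}^{k} \sum_{j \ne i} DP(s_i, s_j) \\
     &\ge \frac{k}{2} \sum_{l = 2}^{k} DP(s_1, s_l)
      && \text{(definition of } s_1\text{)} \\[4pt]
\frac{v(M)}{v(M^*)} &\le \frac{2(k - 1)}{k} < 2
\end{align*}
```

The second line uses the fact that each index l ≥ 2 appears in exactly k − 1 pairs and that d(1, 1) = 0; the fourth uses the fact that the center minimizes the sum of pairwise distances to all other sequences.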
Progressive Alignments • Idea • successively align pairs of sequences using pairwise alignment algorithms • General structure • Choose two sequences and align them using a pairwise alignment algorithm • Choose another sequence and align it to the current alignment • Repeat the previous stage as long as there are sequences left
Progressive Alignments • Differences between algorithms • Choosing the next sequence • Progression involves aligning sequences vs. alignments only, or also alignments vs. alignments • Scoring methods • Progressive alignment algorithms • Clustal W • T-Coffee
CLUSTAL W • Publication • Thompson et al., 1994 • The algorithm consists of three stages: • Distance matrix construction, by pairwise alignment of each pair of sequences • Guide tree construction from the distance matrix • Progressive alignment of the sequences according to the branches in the guide tree • More on ClustalW – next week…
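The first of the three stages, building a distance matrix from all pairwise alignments, can be sketched as below. Unit-cost edit distance is used purely as a stand-in: the slides do not say which distance measure CLUSTAL W derives from its pairwise alignments, and the function names here are hypothetical.

```python
def edit_distance(s, t):
    """Unit-cost global alignment distance (stand-in distance measure)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or substitution
                           prev[j] + 1,             # gap in t
                           cur[j - 1] + 1))         # gap in s
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric k x k matrix of pairwise distances, zero on the diagonal."""
    k = len(seqs)
    D = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            D[i][j] = D[j][i] = edit_distance(seqs[i], seqs[j])
    return D
```

A guide-tree construction would then take `D` as its input; that stage and the progressive alignment itself are outside the scope of this sketch.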