Multiple Sequence Alignment
Chitta Baral, Arizona State University
Motivation and Introduction
• Need for multiple sequence alignment
  • We have the sequences of several proteins that have similar functions in a number of different species.
  • We may want to know which parts of these sequences are similar and which parts are different.
• What is a multiple alignment?
  • Let s_1, …, s_k be a set of sequences over the same alphabet.
  • Spaces are inserted in s_1, …, s_k to make them all the same length.
  • When the extended sequences are aligned, no column may consist exclusively of spaces.
• An example
    M Q P I L L L
    M L R - L L -
    M K - I L L L
    M P P V L I L
• First important issue: defining the quality of an alignment.
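The definition above can be checked mechanically. The following is a minimal sketch, assuming that a space is written as "-" and that an alignment is given as a list of equal-length strings; the function name `is_valid_alignment` is hypothetical and not part of the slides.

```python
def is_valid_alignment(rows, gap="-"):
    """Check the two conditions from the definition of a multiple alignment.

    Assumption: `rows` holds the extended sequences, with `gap` marking a space.
    """
    if not rows:
        return False
    width = len(rows[0])
    # All extended sequences must have the same length.
    if any(len(row) != width for row in rows):
        return False
    # No column may consist exclusively of spaces.
    return all(any(row[c] != gap for row in rows) for c in range(width))


example = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(is_valid_alignment(example))
```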
The "sum-of-pairs" (SP) measure
• Requirements for a good measure of alignment quality
  • It should be an additive function.
  • It must be independent of the order of its arguments.
  • It should reward the presence of many equal or strongly related symbols in the same column, and penalize unrelated symbols and spaces.
• SP function: the sum of the pairwise scores of all pairs of symbols in the column.
  • SP-score(I, -, I, V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
  • s(-,-) = 0
• Theorem: Let α be a multiple alignment of the sequences s_1, …, s_k, and let α(i,j) denote the pairwise alignment of s_i and s_j induced by α. Then SP-score(α) = Σ_{i<j} score(α(i,j)).
  • This holds only if s(-,-) = 0,
  • because columns in which both sequences have a space are dropped from the induced pairwise alignment and so contribute nothing to its score.
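The theorem can be illustrated numerically. The sketch below assumes a simple symbol score that is not given in the slides (+1 for a match, -1 for a mismatch, -2 for a symbol against a space, and 0 for two spaces); the function names are hypothetical. It computes the SP-score column by column and again as a sum of induced pairwise scores.

```python
from itertools import combinations

GAP = "-"


def s(a, b):
    """Assumed symbol score: +1 match, -1 mismatch, -2 symbol vs. space, 0 for two spaces."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_column(column):
    """Sum of s over all unordered pairs of symbols in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """SP-score of a multiple alignment: the sum of its column scores."""
    return sum(sp_column(col) for col in zip(*rows))


def induced_pair_score(row_i, row_j):
    """Score of the induced pairwise alignment; columns with two spaces are dropped."""
    return sum(s(a, b) for a, b in zip(row_i, row_j)
               if not (a == GAP and b == GAP))


rows = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
total = sp_alignment(rows)
by_pairs = sum(induced_pair_score(x, y) for x, y in combinations(rows, 2))
print(total, by_pairs)  # the two values agree because s(-,-) = 0
```

If `s(GAP, GAP)` returned a nonzero value, the column-wise total would count the double-space columns while the induced pairwise scores would not, and the two sums could differ.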
Optimal alignment using dynamic programming
• Consider k sequences, each of length n.
• Use a k-dimensional array A with n+1 entries in each dimension.
• Initialize A[0, …, 0] = 0.
• A[i_1, …, i_k] ← max over b of { A[i - b] + SP-score(Column(s, i, b)) }
  • where b ranges over all nonzero binary vectors of k elements, and
  • Column(s, i, b) = (c_j), 1 ≤ j ≤ k,
  • with c_j = s_j[i_j] if b_j = 1 and c_j = - if b_j = 0.
  • Here s, i = (i_1, …, i_k), and b are k-tuples (set in boldface on the original slide).
• For k = 3: A[i_1, i_2, i_3] ← the maximum of
  • A[i_1, i_2, i_3-1] + SP-score(-, -, s_3[i_3])
  • A[i_1, i_2-1, i_3] + SP-score(-, s_2[i_2], -)
  • A[i_1, i_2-1, i_3-1] + SP-score(-, s_2[i_2], s_3[i_3])
  • A[i_1-1, i_2, i_3] + SP-score(s_1[i_1], -, -)
  • A[i_1-1, i_2, i_3-1] + SP-score(s_1[i_1], -, s_3[i_3])
  • A[i_1-1, i_2-1, i_3] + SP-score(s_1[i_1], s_2[i_2], -)
  • A[i_1-1, i_2-1, i_3-1] + SP-score(s_1[i_1], s_2[i_2], s_3[i_3])
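The recurrence can be written for arbitrary k by enumerating the binary vectors b directly. This is a minimal sketch rather than a tuned implementation: the symbol score `s` (+1, -1, -2, 0) is an assumption, the table is stored in a dictionary keyed by index tuples, and sequences of different lengths are allowed. Predecessor cells that would have a negative index are skipped.

```python
from itertools import combinations, product

GAP = "-"


def s(a, b):
    """Assumed symbol score: +1 match, -1 mismatch, -2 symbol vs. space, 0 for two spaces."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_column(column):
    return sum(s(a, b) for a, b in combinations(column, 2))


def msa_score(seqs):
    """Optimal SP-score of a multiple alignment of seqs (exponential in len(seqs))."""
    k = len(seqs)
    lengths = [len(seq) for seq in seqs]
    # All nonzero binary vectors b of k elements.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    origin = (0,) * k
    A = {origin: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor i - b has been filled before cell i is visited.
    for idx in product(*(range(n + 1) for n in lengths)):
        if idx == origin:
            continue
        best = None
        for b in moves:
            prev = tuple(i - bj for i, bj in zip(idx, b))
            if min(prev) < 0:
                continue  # this move would leave the table
            # Column(s, i, b): symbol s_j[i_j] where b_j = 1, a space otherwise.
            column = [seq[i - 1] if bj else GAP
                      for seq, i, bj in zip(seqs, idx, b)]
            candidate = A[prev] + sp_column(column)
            if best is None or candidate > best:
                best = candidate
        A[idx] = best
    return A[tuple(lengths)]


print(msa_score(["AT", "A", "T"]))
```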
Complexity analysis of the dynamic programming algorithm
• Running time:
  • The table has (n+1)^k entries.
  • For each entry we need to find the maximum of 2^k - 1 candidates.
  • Finding the SP-score of each candidate means adding O(k^2) numbers.
  • Total: O(k^2 2^k n^k), i.e., exponential in k.
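To get a feeling for the growth, the three factors can be multiplied out for concrete values. The helper below is a hypothetical illustration; it counts pairwise additions as k(k-1)/2 per candidate, which is one concrete reading of the O(k^2) term.

```python
from math import comb


def dp_work(n, k):
    """Rough operation count for the full k-dimensional dynamic program."""
    entries = (n + 1) ** k      # cells in the table
    candidates = 2 ** k - 1     # nonzero binary vectors per cell
    additions = comb(k, 2)      # pairwise scores per SP-score
    return entries * candidates * additions


for k in (2, 3, 4, 5):
    print(k, dp_work(100, k))
```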
A heuristic-based approach
• Outline of the approach
  • We have k sequences of length n each, and we want to compute an optimal alignment according to the SP measure.
  • We use dynamic programming, but try to avoid filling all entries of the k-dimensional array and fill only the "relevant" ones.
• Which cells are relevant, and why
  • Idea: look at the pairwise projections of cells.
  • Note: the pairwise projections of an optimal multiple alignment need not be optimal pairwise alignments. For example, the alignment
      A T
      A -
      - T
      A T
      A T
    is optimal, but its projection onto the second and third rows (A - over - T) is not an optimal pairwise alignment.
Heuristic-based approach (cont.)
• Recall that F(i,j) denoted the score of the best alignment between the initial segments x[1..i] and y[1..j]. Let us write it as sim(x[1..i], y[1..j]) and refer to it as a_xy[i,j].
  • That is, a_xy[i,j] = sim(x[1..i], y[1..j]).
• Let b_xy[i,j] = sim(x[i+1..n], y[j+1..m]).
  • It is computed like a_xy, but backwards.
• Let c_xy[i,j] = a_xy[i,j] + b_xy[i,j].
  • This is the highest score of an alignment of x and y that cuts at (i,j).
• Using the c matrix, it is easy to find an optimal alignment.
  • Find a path from [n,m] to [0,0] along which every cell has the value c_xy[n,m], which equals sim(x,y).
• Suppose we know a lower bound L_xy, i.e., we know for sure that sim(x,y) ≥ L_xy.
  • In that case, c_xy[i,j] < L_xy means that every alignment cutting through (i,j) scores below L_xy, so the cut through (i,j) does not lead to a best alignment.
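The matrices a_xy, b_xy, and c_xy can be computed with two passes of the usual pairwise dynamic program. The sketch below assumes a linear gap penalty of -2 and match/mismatch scores of +1/-1, none of which are fixed by the slides; the backward matrix is obtained by running the forward pass on the reversed strings. The function names are hypothetical.

```python
GAP_PENALTY = -2  # assumed linear gap penalty


def s(a, b):
    """Assumed symbol score: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def forward(x, y):
    """a[i][j] = sim(x[1..i], y[1..j])."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * GAP_PENALTY
    for j in range(1, m + 1):
        a[0][j] = j * GAP_PENALTY
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + GAP_PENALTY,
                          a[i][j - 1] + GAP_PENALTY)
    return a


def backward(x, y):
    """b[i][j] = sim(x[i+1..n], y[j+1..m]), via the forward pass on reversed strings."""
    n, m = len(x), len(y)
    r = forward(x[::-1], y[::-1])
    return [[r[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def cut_matrix(x, y):
    """c[i][j] = best score of an alignment of x and y that cuts at (i, j)."""
    a, b = forward(x, y), backward(x, y)
    return [[a[i][j] + b[i][j] for j in range(len(y) + 1)]
            for i in range(len(x) + 1)]


def relevant_cells(c, lower_bound):
    """Cells whose best cut score reaches the lower bound."""
    return [(i, j) for i, row in enumerate(c)
            for j, value in enumerate(row) if value >= lower_bound]


c = cut_matrix("ACGT", "AGT")
print(relevant_cells(c, c[0][0]))  # cells lying on some optimal alignment
```

With the lower bound set to sim(x,y) itself, only cells on optimal alignments survive; a weaker bound keeps more cells.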
Heuristic-based approach (cont.): a theorem
• Theorem: Let α be an optimal alignment involving s_1, …, s_k, and let α_ij be the pairwise alignment of s_i and s_j induced by α. If SP-score(α) ≥ L, then score(α_ij) ≥ L_ij, where L_ij = L - Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y).
• Proof:
  • SP-score(α) ≥ L iff Σ_{x<y} score(α_xy) ≥ L
  • iff Σ_{x<y, (x,y)≠(i,j)} score(α_xy) ≥ L - score(α_ij),
  • which implies Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y) ≥ L - score(α_ij), because sim(s_x, s_y) is the best pairwise score and hence is greater than or equal to score(α_xy),
  • iff score(α_ij) ≥ L - Σ_{x<y, (x,y)≠(i,j)} sim(s_x, s_y) = L_ij.
• Implication of this theorem:
  • Suppose we have a lower bound L on the SP-score of an optimal alignment.
  • Then a cell with index (i_1, …, i_k) is relevant if the best alignment (say α) that cuts through (i_1, …, i_k) has SP-score ≥ L.
  • By the theorem, this implies that for all x, y with 1 ≤ x < y ≤ k we have score(α_xy) ≥ L_xy,
  • which means c_xy[i_x, i_y] ≥ L_xy.
  • This is because α_xy cuts through (i_x, i_y), so its score is at most c_xy[i_x, i_y].
• Idea of the algorithm:
  • Pick a lower bound L; compute c_xy and L_xy for each pair x, y with 1 ≤ x < y ≤ k.
  • Start with (0, …, 0), expand its influence to dependent relevant cells, and continue until the final corner cell is reached.
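The pairwise bounds follow from L and the pairwise similarities by simple arithmetic. In this small sketch the helper name `pair_bounds` and the example values are assumptions: the similarities are those of the sequences AT, A, T under a +1/-1/-2 scoring, and L = -6 is taken as a known lower bound.

```python
def pair_bounds(lower_bound, sims):
    """Return L_xy = L - (sum of sim over all other pairs) for every pair.

    `sims` maps each pair (x, y) with x < y to sim(s_x, s_y).
    """
    total = sum(sims.values())
    return {pair: lower_bound - (total - value) for pair, value in sims.items()}


# Assumed example: pairwise similarities of "AT", "A", "T" with match +1,
# mismatch -1, gap -2, and an assumed lower bound L = -6 on the SP-score.
sims = {(0, 1): -1, (0, 2): -1, (1, 2): -1}
print(pair_bounds(-6, sims))
```

Each bound is at most the corresponding similarity whenever L does not exceed the sum of all pairwise similarities, which is itself an upper bound on any SP-score.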
The heuristic-based algorithm
• Input: s = (s_1, …, s_k) and a lower bound L
• Output: the value of an optimal alignment
• For all x, y with 1 ≤ x < y ≤ k: compute c_xy
• For all x, y with 1 ≤ x < y ≤ k: L_xy ← L - Σ_{p<q, (p,q)≠(x,y)} sim(s_p, s_q)
• pool ← {0}; a[0] ← 0, where 0 denotes the cell (0, …, 0)
• While pool is not empty do
  • i ← the lexicographically smallest cell in pool
  • pool ← pool \ {i}
  • If c_xy[i_x, i_y] ≥ L_xy for all x, y with 1 ≤ x < y ≤ k then
    • For all j dependent on i do
      • If j has not yet entered pool then pool ← pool ∪ {j}; a[j] ← a[i] + SP-score(Column(s, j, j - i))
      • else a[j] ← max(a[j], a[i] + SP-score(Column(s, j, j - i)))
• Return a[n_1, …, n_k]
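The pool-based algorithm can be sketched with a heap serving as the pool, since Python tuples compare lexicographically. This is an illustration under assumed scoring (+1 match, -1 mismatch, -2 against a space), with hypothetical function names. It returns None when the supplied L is not a valid lower bound and the corner cell is never reached.

```python
import heapq
from itertools import combinations, product

GAP = "-"


def s(a, b):
    """Assumed symbol score: +1 match, -1 mismatch, -2 symbol vs. space, 0 for two spaces."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_column(column):
    return sum(s(a, b) for a, b in combinations(column, 2))


def prefix_scores(x, y):
    """a[i][j] = sim(x[1..i], y[1..j]) under the assumed scoring."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = a[i - 1][0] + s(x[i - 1], GAP)
    for j in range(1, m + 1):
        a[0][j] = a[0][j - 1] + s(GAP, y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + s(x[i - 1], GAP),
                          a[i][j - 1] + s(GAP, y[j - 1]))
    return a


def cut_matrix(x, y):
    """c[i][j] = prefix score at (i, j) plus suffix score from (i, j)."""
    n, m = len(x), len(y)
    a = prefix_scores(x, y)
    r = prefix_scores(x[::-1], y[::-1])  # suffix scores via reversed strings
    return [[a[i][j] + r[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


def pruned_msa_score(seqs, lower_bound):
    k = len(seqs)
    lengths = [len(q) for q in seqs]
    pairs = list(combinations(range(k), 2))
    c = {p: cut_matrix(seqs[p[0]], seqs[p[1]]) for p in pairs}
    sims = {p: c[p][0][0] for p in pairs}  # c[0][0] equals sim(s_x, s_y)
    total = sum(sims.values())
    bound = {p: lower_bound - (total - sims[p]) for p in pairs}
    moves = [b for b in product((0, 1), repeat=k) if any(b)]
    origin = (0,) * k
    a = {origin: 0}
    pool = [origin]  # heap of index tuples; the smallest pops first
    while pool:
        i = heapq.heappop(pool)
        # Expand cell i only if every pairwise projection is relevant.
        if all(c[(x, y)][i[x]][i[y]] >= bound[(x, y)] for x, y in pairs):
            for b in moves:
                j = tuple(ix + bx for ix, bx in zip(i, b))
                if any(jx > n for jx, n in zip(j, lengths)):
                    continue
                column = [q[jx - 1] if bx else GAP
                          for q, jx, bx in zip(seqs, j, b)]
                candidate = a[i] + sp_column(column)
                if j not in a:
                    a[j] = candidate
                    heapq.heappush(pool, j)
                elif candidate > a[j]:
                    a[j] = candidate
    return a.get(tuple(lengths))


print(pruned_msa_score(["AT", "A", "T"], -6))
```

Every cell j that depends on i is lexicographically larger than i, so cells leave the heap in increasing order and a[i] is final by the time i is popped.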
Star alignment
• Let s_1, …, s_k be the k sequences that we want to align.
• Pick one of the sequences, s_c, as the center.
• For each index i ≠ c, find an optimal alignment between s_i and s_c.
• Aggregate these alignments using the "once a gap, always a gap" principle.
  • Start with one pairwise alignment and keep adding the alignment of another string, using s_c as a guide and adding gaps when necessary.
• One way to select s_c is to try all possibilities and pick the one that results in the best score.
• Another way is to compute all optimal pairwise alignments and select as the center the string s_c that maximizes Σ_{i≠c} sim(s_i, s_c).
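The steps above can be sketched as follows. The scoring (+1 match, -1 mismatch, -2 per space) and all function names are assumptions made for illustration; the center is chosen by the second rule, maximizing the sum of similarities to the other sequences.

```python
GAP = "-"


def s(a, b):
    """Assumed symbol score: +1 match, -1 mismatch, -2 symbol vs. space, 0 for two spaces."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def align(x, y):
    """Optimal global pairwise alignment: returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = a[i - 1][0] + s(x[i - 1], GAP)
    for j in range(1, m + 1):
        a[0][j] = a[0][j - 1] + s(GAP, y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          a[i - 1][j] + s(x[i - 1], GAP),
                          a[i][j - 1] + s(GAP, y[j - 1]))
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + s(x[i - 1], GAP):
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return a[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def merge(rows, c2, o2):
    """Add the pairwise alignment (c2, o2) to rows; rows[0] is the center row."""
    center, out, new = rows[0], [[] for _ in rows], []
    p = q = 0
    while p < len(center) or q < len(c2):
        if p < len(center) and center[p] == GAP:
            # Existing gap in the center stays: the new row receives a gap.
            for row, col in zip(rows, out):
                col.append(row[p])
            new.append(GAP); p += 1
        elif q < len(c2) and c2[q] == GAP:
            # New gap in the center: every existing row receives a gap.
            for col in out:
                col.append(GAP)
            new.append(o2[q]); q += 1
        else:
            for row, col in zip(rows, out):
                col.append(row[p])
            new.append(o2[q]); p += 1; q += 1
    return ["".join(col) for col in out] + ["".join(new)]


def star_alignment(seqs):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    totals = [sum(align(seqs[i], seqs[j])[0] for j in range(k) if j != i)
              for i in range(k)]
    c = max(range(k), key=lambda i: totals[i])
    rows, order = [seqs[c]], [c]
    for i in range(k):
        if i != c:
            _, c2, o2 = align(seqs[c], seqs[i])
            rows = merge(rows, c2, o2)
            order.append(i)
    result = [None] * k
    for row, idx in zip(rows, order):
        result[idx] = row
    return c, result


seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
center, rows = star_alignment(seqs)
print(center)
print("\n".join(rows))
```

In this sketch the induced alignment between the center and each other sequence remains the pairwise alignment that was merged in, so it keeps its optimal pairwise score; alignments between two non-center sequences carry no such guarantee.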
Tree alignment
• Motivation: Sometimes we have an evolutionary tree for the sequences involved.
  • In that case we can compute the overall similarity based on pairwise alignments along the tree edges.
• Input: k sequences and a tree whose leaves are these sequences.
• Goal: Find an assignment of sequences to the internal nodes of the tree such that the sum of the similarities between the sequences along each edge is maximized.
• Tree alignment is NP-hard, but approximation algorithms exist.
• Note: Star alignment can be viewed as a special case of tree alignment.
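The objective can be illustrated by scoring a tree whose internal nodes have already been labeled. The sketch below assumes a +1/-1/-2 scoring and a hypothetical representation of the tree as a list of edges plus a dictionary of node labels. For a star-shaped tree it also tries each input sequence as the label of the single internal node, which is only a restricted search and not a tree alignment algorithm.

```python
def sim(x, y, match=1, mismatch=-1, gap=-2):
    """Global pairwise similarity (assumed scoring), computed with two rolling rows."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[len(y)]


def tree_score(edges, labels):
    """Sum of the similarities between the sequences at the two ends of each edge."""
    return sum(sim(labels[u], labels[v]) for u, v in edges)


def best_leaf_label(edges, leaves, node="root"):
    """Try each leaf sequence as the label of one internal node; keep the best."""
    return max(leaves.values(),
               key=lambda seq: tree_score(edges, {**leaves, node: seq}))


# Hypothetical star-shaped tree with a single internal node named "root".
leaves = {"l1": "AT", "l2": "AT", "l3": "AC"}
edges = [("root", "l1"), ("root", "l2"), ("root", "l3")]
print(best_leaf_label(edges, leaves))
```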