330 likes | 513 Views
Lecture 6: Multiple sequence alignment. BioE 480 Sept 9, 2004. Comparing Multiple Sequences: Multiple Alignment. We often need to compare several protein sequences that have similar functions. These proteins could come from different or the same species.
E N D
Lecture 6: Multiple sequence alignment BioE 480 Sept 9, 2004
Comparing Multiple Sequences: Multiple Alignment • We often need to compare several protein sequences that have similar functions. These proteins could come from different or the same species. • Which parts are similar, and which parts are different. • An example: A multiple alignment of 8 fragments from immunoglobulin sequences. VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- • This alignment highlights: • Conserved residues: a cysteine forming the disulphide bridges, and a conserved tryptophan • Conserved regions: in particular, "Q.PG"at the end of the first 4 sequences. • More sophisticated patterns: the dominance of hydrophobic residues at fragment positions 1 and 3. The alternating hydrophobicity pattern is typical for the surface beta-strand at the beginning of each fragment.
Multiple alignments are helpful for protein structure prediction. • Multiple alignment also helps to infer the evolutionary history of the sequences. • The first 4 sequences and the last 4 sequences seem to be derived from 2 different common ancestors, which in turn are derived from a "root" ancestor. • 4 fragments are from the variable regions, and 4 fragments from the constant regions of immunoglobulins. • The sequences of the variable regions are about as conserved as the sequences of the constant regions, except for their antigen-binding subregions, which are composed of just a few amino acids each, and give the antibody its specificity
We need to inspect longer fragments than shown here to make phylogenetic observations that are statistically significant! • A Multiple Alignment of k sequences is a rectangular array, consisting of characters taken from the alphabet , that satisfies the following 3 conditions: 1. There are exactly k rows. 2. Ignoring the gap character, row number i is exactly the sequence si . 3. Each column contains at least one character different from "-".
Scoring Multiple Alignment Which one is better? VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- or VTISCTGSSSNIG-AGNHVKWYQQLPG VTISCTGTSSNIG--SITVNWYQQLPG LRLSCSSSGFIFS--SYAMYWVRQAPG LSLTCTVSGTSFD--DYYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNW--YVDG ATLVCLISDFYPG--AVTVAW--KADS AALGCLVKDYFPE--PVTVSW--NS-G VSLTCLVKGFYPS--DIAVEW--ESNG or VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCS-SSGFIFSS-YAMYWVRQAPG LSLTCT-VSGTSFDD-YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG--
Additive Functions: the alignment score is the sum of column scores. • Independent of the ordering of the sequences in the alignment: A column with “I, -, I, V”, should score the same as “V, I, I, -” • Reward the presence of many equal or strongly related residues, and penalize unrelated residues and spaces. • Sum-of-Pair (SP) Function: The sum of pairwise scores of all pairs of symbols in the column. • eg. SP_score( I, -, I, V ) = p( I,- ) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) • eg. • Here we used the "unit costs" for pairwise alignment, summing up all the costs of all possible pairs of letters, i.e. the sum of the unit costs of the pairs (1,2), (1,3), (1,4), ..., (1,8), (2,3), (2,4), ..., (2,8), (3,4), (3,5), ..., ..., ..., and (7,8). • In general, any cost/weight scheme could be used, it just needs to map pairs of characters to a numeric value.
p( -,- ) = 0: If we select two sequences from a multiple alignment, and ignore the rest, we have a pairwise alignment -- if we ignore columns with two spaces. This is called a projection of the multiple alignment. ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- becomes: ATLVCLISDFYPGAVTVAWKADS AALGCLVKDYFPEPVTVSWNSG • Projection of a multiple alignment a: if aij is the pairwise projection of a to sequence i and j SP_score(a) = S score(aij), i<j Here score(aij)is the score of a projected pairwise alignment.
Optimal Multiple Alignment is an alignment with minimum overall cost, or maximum overall similarity. • The Dynamic Programming Hyperlattice. For the alignment of 3sequences, every alignment can be seen as a unique path through a 3-dimensional lattice. • This path is denoted by listing at each node visited, component by component, the distance from the starting point in the bottom left (i.e. from the source (0,0,0) of the lattice): • (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1),(4,3,2) for the above example • The distance from the starting point is the number of letters already aligned: • For example, column 4 of the alignment corresponds to node (3,3,1): 3 letters from the first and second sequence are aligned at that point, and one letter from the third.
a b • Imagine light sources on the top, front, and right-hand side of the lattice, "shadows" of the alignment will be projected to the opposing faces (walls). • Assume that the light sources are farther away from the lattice, and the shadows are projected without distortion. • In Fig. a, only the light source on the right is "on", projecting the path onto the face on the left. • In Fig. b, all light sources are "on".
An optimal multiple alignment can be calculated by dynamic programming. • The special case is pairwise case alignment: • We can visualize dynamic programming as a calculation that visits every node in a 2-dimensional matrix, or 2-d lattice, in a way that obeys the order of dependencies between the nodes, as indicated by the arrows. • We have arranged the sequences such that the calculation of the alignment starts in the bottom-left corner, and not the top-left corner. In this setting, we start bottom-left, then move to the right until the bottom row is finished, then visit the node marked by an asteriks (*), move to the right as before, etc. • In three or more dimensions, we have to look at more nodes: • e.g. 7 nodes for three sequences. Correspondingly, the minimum needs to be taken from 7 possible values
Computational Complexity of Multiple Alignment byStandard Dynamic Programming • Each node in the k-dimensional hyperlattice is visited once, and therefore the running time must be proportional to the number of nodes in the lattice. • This number is the product of the lengths of the sequences. • eg. the 3-dimensional lattice as visualized. • How many steps does the algorithm ``rest'' at each node ? • Dynamic programming organizes the visiting of nodes in such a way that we just need to ``look back'' one single step, at the nodes that we've visited before, to look up the values we need for calculating the minimum. • The time we spend for retrieving the minima and calculating the sum does not depend on the length of the sequences. • However, it depends on the number of sequences. We've had 3 values for 2 sequences, 7 values for 3 sequences, 15 values for 4 sequences. This goes up exponentially: 2k-1 • The running time is exponential: O(2kP |si|), i = 1.. K • If the proportionality factor is 1 nanosecond, then for 6 sequences of length 100, we'll have a running time of 26 x 1006 x10-9, that's roughly 64,000 seconds (17 hours). Add two more sequences, then we will have to wait 2.6 x109 = 82 years!
The memory space requirement is even worse. To trace back the alignment, we need to store the whole lattice, a data structure the size of a multidimensional skyscraper. • In fact, space is the No.1 problem here, bogging down multiple alignment methods that try to achieve optimality. • Furthermore, incorporating a realistic gap model, we will further increase our demands on space and running time
If many sequences are to be aligned, the super lattice becomes too large, and we can no longer visit every node in it. • No needs to visit nodes close to the extreme ends of the lattice, like E1, E2, E3 or E4: • The extreme ends are reached if we introduce a lot of gaps • For example, the alignment path of AAAACCCCCC---- ----CCCCCCTTTT is 4 steps away from the main diagonal, because every gap introduced in the beginning moves us one step away.
For the "sum-of-pairs" model, if the sequences are rather similar, and the cost for introducing gaps is not too low, then the optimal multiple alignment will not be like: AAAA-------- ----CCCC---- --------TTTT • The optimal alignment path is expected to be contained in a "polyhedron" close to the main diagonal. • A polyhedron is a solid formed by planar faces. • Note the shadow of the polyhedron. • A tube symbolizes the optimal alignment path. • When visiting a node and looking for the minimum along all the incoming edges, we can ignore those edges that are "coming from outside the polyhedron”. • On its top-left side, the cube is "covered" by the polyhedron. The edges 1, 2, 3, 6 and 7 are coming from the inside, and edges 4 and 5 can be ignored (and are therefore not labeled in the figure).
How to obtain such a polyhedron? • If the optimal is maximum 30 units away, we can have a simple polyhedron of radius 30 around the heuristic alignment, and search for the optimal inside it. • But how are we to know it is 30 not 40 (or 50, 70)? • Obtain a polyhedron consisting of paths (and nodes they traverse) satisfying specific upper bounds of cost in all pair-wise projections. • One bound for each projection, (i.e. each possible pair of sequences). • For 3 sequences, we calculate 3 upper bounds, and for 4 sequences, we calculate 6 upper bounds. • These bounds are obeyed by the projections of the optimal multiple alignment: It is pointless to consider nodes which only paths with higher costs in some projection go through. • Project a fixed heuristic multiple alignment in the direction of the sequences 1 and 2: • The heuristic multiple alignment is more costly than the optimal one (otherwise the optimal alignment was misnamed). • Do not know how close the two paths of the projected heuristic and the projected optimal alignment are, and we do not know which projected alignment is more costly: • Starts with aligning the pair (1,2) optimally, and then adding the other sequences one by one. The projection of this heuristic multiple alignment to the first 2 sequences is the optimal pairwise alignment, and the projection of the optimal multiple alignment can only be more costly (otherwise, the optimal pairwise one was misnamed). • Aligning multiply puts constraints on the pairwise projections, they are no longer independent, they all need to fit together, forming the multiple alignment in a consistent way.
Back to Pairwise Dynamic Programming • Solving a problem by using already computed solutions for smaller instances of the same problem. • Concept: Given two sequences s and t, instead of determining similarity between s and t as whole sequences only, we build up the solution by determining all similarities between arbitrary prefixes of the two sequences. • We start with shorter prefixes and use the computed values of these to solve the problem of larger prefixes. • Let m be the size of s and n the size of t. There are m+1 prefixes of s and n+1 of t, including the empty string. We can arrange our calculations in an (m+1)x(n+1) matrix a, where element a(i,j) contains the similarity between s[1...i] and t[1...j].
t(j) s(i) • The matrix a for s = AAAC and t = AGC • The first row and first column: multiples of space penalty. • Only one alignment possible if one seq is empty: add spaces, with score -2k • Key point: the value for the entry a(i,j) can be obtained by looking at just three previous entries: those for (i-1,j), (i-1,j-1) and (i,j-1). • The reason is that there are only three ways to align s[1..i] and t[1..j] align s[1...i] with t[1...j-1] and match a space with t[j] align s[1...i-1] with t[1...j-1] and match s[i] with t[j]. align s[1...i-1] with t[1...j] and match s[i] with a space. • These are exhaustive possibilities, since we cannot have two spaces paired.
Here entries are computed row by row, and usually gap g<0 Algorithm to Compute Global Similarity: For example, the value of a[1,2] comes from one of the three: a[1,1] - 2 = -1; a[0,1] -1 = -3; a[0,2] - 2 = -6 • sim(s[1...i],t[1...j]) = maximum of sim s[1...i],t[1...j-1]-2 sim s[1...i-1],t[1...j-1]+ p(i,j) sim s[1...i-1],t[1...j]-2 • Since a(i,j) stores sim(i,j), the similarity of s[1...i] with t[1...j] is: a(s[1...i],t[1...j]) = maximum of a (i, j-1) - 2 a(i-1, j-1) + p(i,j) a(i-1, j) - 2 • Important: the order of the computing needs to make sure that a (i, j-1), a(i-1, j-1), and a(i-1, j) are available when computing a (i, j).
If sequences are of similar length n, then the time complexity is O(n2) . Time complexity: O(m) O(n) O(m n)
Optimal Alignment The arrows are not implemented explicitly. t(j) s(i) We can start at entry (m, n) and follow the arrows until we get to (0,0). Each arrow gives one column of alignment. Horizontal: space in s matches t[j] Vertical: space in t matches s[i] Diagonal: s[i] matches t[j] Call to Align(m,n, len) gives an optimal alignment, given matrix a, and strings s, and t
The algorithm returns just one, giving preferences to edges leaving (i,j) in counterclockwise order. Answers are given in vectors align-s and align-t, holding in 1..len the aligned characters, symbols or spaces. Length of the alignment is returned by len: max(|s|,|t|) <= len<= m+n Time complexity when a matrix is given: O(len), the size of the returned alignment or O(m+n) It is possible that several alignments may have the same scores: +1-1-2 -2+1-1 +1-2-1 A A A A A A A A A A G - - A G A - G Uppermost alignment: That is: if there are two or three choices, a column with space in t is preferred over a column with two symbols, which is preferred over a column with space over r. eg s -ATAT ATAT- t TATA- > -TATA
Local Comparison • An alignment between a substring of s and a substring of t. • Goal: to find the highest scoring local alignment. • Same Data structure: an array a[1..m+1][1..n+1] • a[i,j] : the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] • Because the empty string, which has a score 0, is always a valid suffix of a sequence, all entries >= 0. • First row and first column: initialize to 0. • The entry a(s[1...i],t[1...j]) = maximum of a (i, j-1) - g a(i-1, j-1) + p(i,j) a(i-1, j) - g 0 ------ an empty alignment
t(j) s(i)
Find the maximum entry in the whole array: this is the score of an optimal local alignment. • Start from any entry with this score value, and trace back until there is no arrow: • optimal local alignment. • In general, we are interested in not only the optimal local alignment, but also near optimal alignment. End Spaces in Alignments: End spaces are before the first character or after the last character. Consider the following alignment: C A G C A - C T T G G A T T C T C G G size 18 - - - C A G C G T G G - - - - - - - - size 8 -2-2-2 -2 -2-2-2-2-2-2-2-2 12x(-2) -1 -1 1 1 1 1 1 1 6x 1 There will be many spaces in any alignment because length differences, contributing to a large negative score (-19). The above alignment is pretty good, if end spaces are ignored: 6 matches, 1 mismatch, 1 space.
The alignment with the best score is: CAGCACTTGGATTCTCGG CAGC-----G-T----GG 10x(-2) +8 x 1 = -12 • Although this alignment gives a better score (-12 as compared to -18), it is not interesting because it is not finding similar regions • We are interested in regions in the longer sequence that are approximately the same as the shorter regions. • However, if we choose the first alignment and neglect all end spaces, then the score is +3. • Semiglobal Comparison!
t(j) s(i) • Ignore the end space after s: spaces after the last character has no cost. • In an optimal alignment, these spaces are matched to a suffix of t. Remove this final part of the alignment, we obtain an alignment between s and a prefix of t, with the same score. • Therefore need to find the best similarity between s and a prefix of t. • Since in the basic algorithm a[i,j] contains the similarity between s[1..i] and t[1..j], • Take the maximum value in the last row of a : sim(s, t) = max a[m, j], and j in [1, n] = a[m, k] The alignment can be obtained by tracing back, except we start from (m, k).
t(j) s(i) • Ignore the end space after t: spaces after the last character has no cost. • In an optimal alignment, these spaces match to a suffix of s. Remove this final part, we obtain an alignment between t and a prefix of s, with the same score. • Therefore need to find the best similarity between t and a prefix of s. • Since in the basic algorithm a[i,j] contains the similarity between s[1..i] and t[1..j], • Take the maximum value in the last COLUMN of a :sim(t, s) = max a[i, n], and i in [1, m] = a[k, n] The alignment can be obtained by tracing back, except we start from (k, n).
t(j) s(i) • Ignore the initial space before s: spaces before the first character has no cost. • This is equivalent to the best alignment between s and a suffix of t. • a[i,j] needs to contain the highest similarity between s[1..i] and a suffix of t[1..j], • Therefore, for s, with |s|=m and t, with |t|=n, we need to look at a[m,n]. The matrix can be filled the same way as the basic global algorithm, But the first row has to be 0: since initial spaces before s have no costs. The alignment can be obtained by tracing back from (m, n). • Ignore the initial spaces before t: Same, except the first column has to be 0. First row and col are 0s now.
In order to not penalize spaces at:Take the following action in a[,]: Beginning of s Initialize first row with zero End of s Find maximum in last row Beginning of t Initialize first col. with zero End of t Find for maximum in last column Summary of end gap conditions :
Affine Gap Penalty Function • If we are introducing k spaces together, the penalty should be less than that for k independent spaces. i.e.w(k) <= k w(1) or, w(k1+k2...kn) <= w(k1) +w(k2)+...w(kn). • A function which satisfies the above conditions is called a subadditive function. • An affine function is a function of the form, w(k) = h + gk, k>=1, where w(0) = 0 and h,g>0. This means that the first space in a gap costs h, while each subsequent space costs g.
Compare similar sequences Very similar sequences: For two sequences s and t with the same lengths, the scores of their optimal alignments between them are very close to the maximum possible. The dynamic programming matrix is square. The main diagonal gives an alignment without spaces. If that alignment is not optimal, need to add spaces. We add spaces in pairs, so s and t still have the same lengths: But now the alignment is off diagonal!
If the sequences are similar, the path of the best alignment should be very close to the main diagonal. Therefore, we may not need to fill the entire matrix, rather, we fill a narrow band of entries around the main diagonal. An algorithm that fills in a band of width 2k+1 around the main diagonal. S = GCGCATGGATTGAGCGA t = TGCGCCATGGATGAGCA The optimal alignment is: The path of the optimal alignment is not on the main diagonal, but twice removed. There are two spaces.
a[i,j] depends on a[i-1, j], a[i-1, j-1], and a[i,j-1]. • Do not use any a[i-1, j] and a[i,j-1] if they are outside the k-band. • No need to test a[i-1, j-1]: it will always be inside the k-band. • For a[i-1, j] and a[i,j-1] , we test because a[i,j] may be on the border of the band: InsideStrip(i, j, k) = (-k<=i-j<=k) ---------- if this is true: 1 • The entry a[n,n] contains the highest score of an alignment within the k-band. • The time complexity is O(kn), much better than O(n2) (k <<n). • If there are (k+1) or more space pairs, the best possible score is when all of the rest of the sequence match perfectly: match(n - k - 1) + 2(k + 1)g • If our k-band computation gives a score better than the above, than there is no need to increase k. • If not, we need to increase k, and repeat the calculation. • Usually, we double k and run the calculation again.