Sequence Alignment
Motivation: Types • Two sequences of the same length in which some characters differ (database search): aagtacggaga / aagcaccgaga • Two sequences of different lengths, with possible gaps in one of them (database search): aaccaccgaga / aa-caccgaga
Motivation: Types • Match the longest suffix of one sequence with a prefix of the other (fragment assembly): aaacgtcgata / gatacgatg • Local alignment: best-matching substrings of the two sequences (homolog search): gatacgatgctagtttacg / agagcgatgcataattcgaatga
Motivation: Types • Multiple sequence alignment (page 71) (Comparative studies of sequences)
Formalizing sequence comparison • Either a character matches the corresponding character in an alignment (+1), • Or it does not (-1), • Or a gap needs to be inserted (-2)
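The scoring scheme above can be sketched in a few lines of Python. The names `column_score` and `alignment_score` are illustrative only, and `-` is assumed to be the gap symbol, as in the motivating examples.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # values from the slide


def column_score(a, b):
    """Score one alignment column; '-' stands for a gap."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row_s, row_t):
    """Sum the column scores of two already-aligned rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(row_s, row_t))
```

For the two motivating pairs, `alignment_score("aagtacggaga", "aagcaccgaga")` gives 7 (nine matches, two mismatches) and `alignment_score("aaccaccgaga", "aa-caccgaga")` gives 8 (ten matches, one gap).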
Global Alignment • Needleman-Wunsch (1970) dynamic programming algorithm (Smith-Waterman, 1981, is the local-alignment variant) • Scoring matrix for alignment (p 31) • Initializing boundaries of the scoring matrix for gaps in front of either string • Meaning of an entry of the matrix • Corner element is the final score
Global Alignment • Three alternatives in each iteration • Ordering of calculation: row-wise or column-wise • The algorithm (p 52) • Recursive recovery process from the corner element (m and n, the string lengths, are constants) • Variable len returned by the algorithm • Convention for tie breaking
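A minimal sketch of the global dynamic-programming recurrence and the recovery step, assuming the +1/-1/-2 scheme from earlier. The function name `global_align` and the tie-breaking order (up, then diagonal, then left) are assumptions made for this example, not necessarily the convention of the referenced text.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundaries: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap

    def p(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill row by row; each cell takes the best of three alternatives.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j] + gap,         # s[i] against a gap
                          a[i - 1][j - 1] + p(i, j),  # s[i] against t[j]
                          a[i][j - 1] + gap)          # t[j] against a gap

    # Recover one optimal alignment, walking back from the corner.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return a[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))
```

The recovery here is iterative rather than recursive, which gives the same alignment without deep recursion on long strings. For `global_align("aaccaccgaga", "aacaccgaga")` the score is 8: ten matches and a single gap.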
Local alignment • The alignment may start and stop anywhere in either sequence • So the minimum score is zero, even on the boundaries • The best local alignment ends where the score is max in the matrix • Recovery starts from that max value and stops at a zero value
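The local variant changes only three things relative to the global recurrence: zeros on the boundaries, a floor of zero in every cell, and recovery from the maximum cell. A sketch, with `local_align` as a made-up name and the same assumed scoring values:

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, piece_of_s, piece_of_t) for the best local alignment."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]  # boundaries stay at zero
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                      # start a fresh alignment here
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Recovery: start at the maximum, stop at the first zero.
    row_s, row_t = [], []
    i, j = bi, bj
    while a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))
```

With the made-up inputs `"xxxgatacayyy"` and `"zzgatacaww"`, the call returns `(6, "gataca", "gataca")`: the shared substring is found and the unrelated flanks are ignored.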
Semi-global (as-required) alignment • Four alternatives: penalty-free gaps in front of string s, in front of t, at the back of s, at the back of t • Prefix-suffix matching by combining these alternatives • E.g., suffix of s with prefix of t: free gaps at the back of s and in front of t
Semi-global alignment • Example: p 56 • Gaps in front: zeros in row or column representing the string • Gaps at the back: recovery starts from the max of row or column representing the string • Above may be combined as required • Exercise: how to combine for matching suffix of s with prefix of t
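One possible answer to the exercise, as a sketch. It assumes rows index s and columns index t, so free gaps in front of t become zeros in the first column, and free gaps at the back of s mean the result is read off as the maximum of the last row. The name `overlap_align` is illustrative.

```python
def overlap_align(s, t, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a suffix of s with a prefix of t.

    Returns (score, j) where t[:j] is the prefix of t that takes part.
    """
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # First column stays 0: skipping a prefix of s costs nothing.
    # First row is charged as usual: t's prefix must really be aligned.
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
    # Free gaps at the back of s: the rest of t may hang over the end,
    # so take the best entry in the last row instead of the corner.
    best_j = max(range(m + 1), key=lambda j: a[n][j])
    return a[n][best_j], best_j
```

On the fragment-assembly example, `overlap_align("aaacgtcgata", "gatacgatg")` returns `(4, 4)`: the suffix `gata` of s overlaps the prefix `gata` of t.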
Generalized gap penalty • A block of k consecutive gaps may cost the same as a single gap, or follow some formula w(k) • Each block of gaps is considered as one unit (like a char) • Boundary (first row and column) initialization with w(k)
Generalized gap penalty • Three interacting matrices: • One for character matching with p(i, j) • One for gaps in s • One for gaps in t • Formula on p 63
Affine gap penalty • Generalized gap penalty with w(k) = h + gk: the first gap of a block costs h + g, each further gap costs g • Formula changes slightly with known w(k) • The gap matrices now look only at the immediately preceding cells, so the complexity drops from cubic to O(nm)
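A sketch of the three-matrix recurrence for affine gaps, returning the score only. It assumes w(k) = h + gk is a penalty that is subtracted; the defaults h = 2 and g = 1 and the name `affine_global_score` are illustrative choices, not values from the slides.

```python
NEG = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Global alignment score where a block of k gaps costs h + g*k."""
    n, m = len(s), len(t)
    # a: last column pairs two characters
    # b: last column is a gap in s (character of t against a space)
    # c: last column is a gap in t (character of s against a space)
    a = [[NEG] * (m + 1) for _ in range(n + 1)]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]
    c = [[NEG] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)   # boundary initialised with w(j)
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)   # boundary initialised with w(i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Extending a block costs g; opening a new one costs h + g.
            b[i][j] = max(a[i][j - 1] - (h + g),
                          b[i][j - 1] - g,
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])
```

Setting `h=0, g=2` reduces this to the plain -2-per-gap scheme, which is a convenient sanity check. With the defaults, `affine_global_score("aaccgg", "aagg")` is 0: four matches minus one block of two gaps costing 2 + 2.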
Multiple sequence alignment • Function for each column: a character or a gap for each sequence • Combinatorics: 2^k - 1 for k sequences (-1 because a column cannot consist of gaps only) • But . . .
Multiple sequence alignment • Order of arguments for the function should not matter: f(i, -, v) = f(i, v, -) • Score pairwise on a column • Combinatorics: (k choose 2) • For k = 10, 2^k - 1 = 1023, kC2 = 45 • We need gap-to-gap scoring now
Multiple sequence alignment • Total score can be measured either way: • Sum over all columns, or • Sum over all pairs of sequences • If p(-, -) = 0, then both totals are the same
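The equivalence of the two totals can be checked on a small example. This sketch assumes the +1/-1/-2 scheme with p(-, -) = 0; the function names and the three aligned rows in the usage note are made up for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Pairwise column score, with gap against gap scoring 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_by_columns(rows):
    """Sum, over columns, of the scores of all pairs inside the column."""
    return sum(pair_score(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))


def sp_by_pairs(rows):
    """Sum, over pairs of sequences, of their induced pairwise alignment score.

    Columns where both rows have a gap are dropped first, as they would be
    in a genuine two-sequence alignment.
    """
    total = 0
    for r1, r2 in combinations(rows, 2):
        induced = [(a, b) for a, b in zip(r1, r2) if (a, b) != ("-", "-")]
        total += sum(pair_score(a, b) for a, b in induced)
    return total
```

For `rows = ["ac-gt", "a--gt", "acc-t"]` both functions return -4. Dropping the gap-gap columns changes nothing precisely because those columns contribute 0.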
Multiple sequence alignment • Consider a 3-sequence alignment of s1, s2, and s3 • The (i, j, k)-th entry of the scoring matrix is for aligning s1[1..i], s2[1..j], s3[1..k] • 3D matrix of dimension (n x m x l), for |s1| = n, |s2| = m, |s3| = l
Multiple sequence alignment • Each entry in the scoring matrix sits at one corner of a 3D box • The optimal score is the max over the other 7 corners: A[i-1, j, k], A[i, j-1, k], A[i, j, k-1], A[i-1, j-1, k], A[i-1, j, k-1], A[i, j-1, k-1], A[i-1, j-1, k-1] [Vector(i, j, k) - bit-vector] • In each case the sum-of-pairs score of the column is added [EXAMPLE] • Initialization: A[i, 0, 0] = (-4)i for 1 <= i <= n, since each character of s1 faces two gaps; likewise for s2 and s3
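A sketch of the three-sequence recurrence, score only, assuming sum-of-pairs scoring with +1/-1/-2 and p(-, -) = 0. The seven predecessor corners are generated as non-zero bit-vectors; `align3_score` is an illustrative name. The boundary values are not initialised separately here: applying the same recurrence with only the in-range corners yields them, including A[i, 0, 0] = -4i.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(x, y, z):
    """Sum-of-pairs score of one three-row column."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def align3_score(s1, s2, s3):
    """Optimal sum-of-pairs score for aligning three sequences."""
    n, m, l = len(s1), len(s2), len(s3)
    # The 7 non-zero bit-vectors: which sequences contribute a character.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    a = {(0, 0, 0): 0}
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # that corner lies outside the matrix
                    col = (s1[i - 1] if di else "-",
                           s2[j - 1] if dj else "-",
                           s3[k - 1] if dk else "-")
                    cand = a[(pi, pj, pk)] + sp_column(*col)
                    if best is None or cand > best:
                        best = cand
                a[(i, j, k)] = best
    return a[(n, m, l)]
```

As a check on the initialization rule, `align3_score("aaa", "", "")` is -12, that is (-4) times 3. Time and memory grow with n·m·l, so this is only practical for short sequences.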
Multiple sequence alignment • For k sequences, k-dimensional matrix • Each entry is a calculation over 2^k –1 other corners of the “box” • Formula page 72
Alignment improvements • Alignment could be computed from the back as well: s[i+1..n], t[j+1..m] • Front and back alignment could be combined to "cut" the alignment: compute the two matrices, add them, align according to the added matrix
Alignment improvements • When the lengths of the two sequences are comparable and a good global alignment is expected: • Retrieval is mostly along the diagonal • Computation can focus on a strip of fixed width k around the diagonal: k-band • More efficient • Only the relevant cells are used
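A sketch of the k-band idea: only cells with |i - j| <= k are computed, and anything outside the strip is treated as unreachable. The name `kband_score` and the dictionary storage are choices made for this example. The result is the best score among alignments that stay inside the band, which equals the true optimum only when k is large enough.

```python
def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best global alignment score using only cells with |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the corner cell")
    neg = float("-inf")
    a = {(0, 0): 0}          # only in-band cells are ever stored

    def get(i, j):
        return a.get((i, j), neg)   # out-of-band cells count as -infinity

    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg
            if i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, get(i - 1, j - 1) + p)
            if i > 0:
                best = max(best, get(i - 1, j) + gap)
            if j > 0:
                best = max(best, get(i, j - 1) + gap)
            a[(i, j)] = best
    return a[(n, m)]
```

For `kband_score("aaccaccgaga", "aacaccgaga", 1)` the score is 8, the same as the full matrix gives, while touching roughly (2k + 1)·n cells instead of n·m.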
Multiple sequence alignment: Star alignment • One sequence at the center: all others are pairwise aligned against it • Which sequence to put at the center? • Try each: • create a 2D matrix of pairwise scores for all pairs and pick the row with the best sum (largest for similarity, least for distance) [page 79]
Multiple sequence alignment: Tree alignment • A spanning tree over the sequences: nodes are sequences • Each edge is labeled with the similarity between its pair of nodes • Total tree cost, the aggregate over its edges, should be max • Star is a special tree
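Choosing the center can be sketched as follows. This version assumes the matrix holds global similarity scores under the +1/-1/-2 scheme, so the best row is the one with the largest sum; with a distance measure the choice would flip to the smallest sum. The names `global_score` and `pick_center` are illustrative.

```python
def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score only, keeping two rows of the matrix."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + p, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Return (index of the center sequence, list of row sums)."""
    k = len(seqs)
    sim = [[0 if i == j else global_score(seqs[i], seqs[j]) for j in range(k)]
           for i in range(k)]
    totals = [sum(row) for row in sim]
    center = max(range(k), key=totals.__getitem__)  # similarity: larger is better
    return center, totals
```

With the made-up input `["acgt", "aggt", "acct"]`, `pick_center` returns `(0, [4, 2, 2])`: the first sequence is closest to the other two and would be placed at the center.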