Sequence Alignment Techniques and Applications

Sequence Alignment

Motivation: Types
• Two sequences of the same length in which some characters differ (database search):
  aagtacggaga
  aagcaccgaga
• Two sequences of different lengths, with possible gaps in one of them (database search):
  aaccaccgaga
  aa-caccgaga

Motivation: Types
• Match the longest prefix of one sequence with a suffix of the other (fragment assembly):
  aaacgtcgata
  gatacgatg
• Local alignment: the longest matching substring across the two sequences (homolog search):
  gatacgatgctagtttacg
  agagcgatgcataattcgaatga

Motivation: Types
• Multiple sequence alignment (page 71) (comparative studies of sequences)

Formalizing sequence comparison
• Either a character matches the corresponding character in the alignment (+1),
• or it does not (-1),
• or a gap needs to be inserted (-2)
• The score of an alignment is the sum of these values over all of its columns
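The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming the two rows of an alignment are given as equal-length strings with '-' marking a gap; the function names are hypothetical, not from the slides.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # values from the slide


def column_score(a: str, b: str) -> int:
    """Score one alignment column; '-' denotes a gap."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row_s: str, row_t: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(a, b) for a, b in zip(row_s, row_t))


# The two example pairs from the motivation slides.
print(alignment_score("aagtacggaga", "aagcaccgaga"))  # 9 matches, 2 mismatches
print(alignment_score("aaccaccgaga", "aa-caccgaga"))  # 10 matches, 1 gap
```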

Global Alignment
• Needleman-Wunsch (1970) dynamic programming algorithm (Smith-Waterman, 1981, is the local-alignment variant)
• Scoring matrix for alignment (p 31)
• Initializing the boundaries of the scoring matrix for gaps in front of either string
• Meaning of an entry of the matrix: entry (i, j) is the best score for aligning s[1..i] with t[1..j]
• The corner element is the final score

Global Alignment
• Three alternatives in each iteration: align the two characters, put a gap in s, or put a gap in t
• Ordering of calculation: row-wise or column-wise
• The algorithm (p 52)
• Recursive recovery of the alignment from the corner element (m and n, the string lengths, are constants)
• Variable len returned by the algorithm
• Convention for tie breaking
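The matrix fill and the recovery from the corner element might look as follows. This is a sketch under assumptions: it uses the +1/-1/-2 scores from the earlier slide, an iterative rather than recursive recovery, and a tie-breaking order (diagonal, then gap in t, then gap in s) chosen here for illustration; the slides do not say which convention the textbook uses.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundaries: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    # Row-wise fill; each entry is the best of three alternatives.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    # Recovery from the corner element back to (0, 0).
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return a[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, row_s, row_t = global_align("aaccaccgaga", "aacaccgaga")
print(score)
print(row_s)
print(row_t)
```

The length of the returned rows is the alignment length, which is at most m + n.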

Local alignment
• An alignment may start and stop anywhere
• So the minimum score is zero, even on the boundaries
• The best local alignment ends where the score is maximal in the matrix
• Recovery starts from that maximum value and stops at a zero value
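A minimal sketch of the local variant, assuming the same +1/-1/-2 scores: entries are clamped at zero, the maximum entry is tracked during the fill, and recovery runs from that entry until a zero is reached. The tie-breaking order is again an illustrative assumption.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for the best local alignment."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]  # boundaries stay at zero
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0, a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j
    # Recovery: start at the maximum, stop at the first zero entry.
    row_s, row_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(local_align("xxacgtyy", "zzacgtww"))
```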

Semi-global (as-required) alignment
• Four alternatives: penalty-free gaps in front of s, in front of t, at the back of s, at the back of t
• Prefix-suffix matching by combining these alternatives
• E.g., suffix of s with prefix of t: free gaps at the back of s and in front of t

Semi-global alignment
• Example: p 56
• Gaps in front: zeros in the row or column representing the string
• Gaps at the back: recovery starts from the maximum of the row or column representing the string
• The above may be combined as required
• Exercise: how to combine them for matching a suffix of s with a prefix of t
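One possible answer to the exercise, as a hedged sketch: with s along the rows and t along the columns, free gaps in front of t correspond to zeros in the first column, and free gaps at the back of s correspond to taking the maximum of the last row. The function name and the return convention (score plus the length of the matched prefix of t) are assumptions made for illustration.

```python
def suffix_prefix_overlap(s, t, match=1, mismatch=-1, gap=-2):
    """Best score for aligning some suffix of s with some prefix of t.

    Returns (score, j), where t[:j] is the prefix of t taking part in the overlap.
    """
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # First column stays zero: skipping a prefix of s is free.
    # First row is penalized: gaps in front of s are not free here.
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    # Last row: all of s is consumed; the unused suffix of t is free.
    best_j = max(range(m + 1), key=lambda j: a[n][j])
    return a[n][best_j], best_j


score, j = suffix_prefix_overlap("aaacgtcgata", "gatacgatg")
print(score, "gatacgatg"[:j])
```

On the fragment-assembly pair from the motivation slide, the overlap found is the shared "gata".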

Generalized gap penalty
• A block of k consecutive gaps is penalized by some formula w(k), possibly the same penalty as for a single gap
• Each block of gaps is to be considered as one unit (like a character)
• Boundary (first row and column) initialization with w(k)

Generalized gap penalty
• Three interacting matrices:
• one for character matching with p(i, j)
• one for gaps in s
• one for gaps in t
• Formula on p 63
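The three-matrix idea can be sketched as below. The recurrences are a standard formulation and are assumed, not copied from p 63: a holds alignments ending in a character pair, b those ending in a block of gaps in s, c those ending in a block of gaps in t, and w is any caller-supplied penalty function.

```python
NEG = float("-inf")


def general_gap_score(s, t, w, match=1, mismatch=-1):
    """Best global score when a block of k gaps costs w(k)."""
    n, m = len(s), len(t)
    a = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with s[i] against t[j]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap block in s
    c = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap block in t
    a[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # A block of k gaps in s must follow a character pair or a block in t.
            for k in range(1, j + 1):
                b[i][j] = max(b[i][j], max(a[i][j - k], c[i][j - k]) - w(k))
            # A block of k gaps in t must follow a character pair or a block in s.
            for k in range(1, i + 1):
                c[i][j] = max(c[i][j], max(a[i - k][j], b[i - k][j]) - w(k))
    return max(a[n][m], b[n][m], c[n][m])


print(general_gap_score("acgtacgt", "acgt", lambda k: 2 + k))
```

The inner loops over k also produce the boundary values, e.g. b[0][j] = -w(j). They are what makes this version cubic in the sequence lengths.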

Affine gap penalty
• Generalized gap penalty with w(k) = h + gk; the first gap of a block costs h + g, each further gap only g
• The formula changes slightly with the known w(k)
• The gap matrices look only at the immediately preceding elements, so the complexity reduces
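With w(k) = h + gk, each gap matrix needs only its immediate neighbour, as the sketch below illustrates. The recurrences are a standard three-matrix formulation assumed here for illustration, and the default values h = 2 and g = 1 are arbitrary choices, not taken from the slides.

```python
NEG = float("-inf")


def affine_gap_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Best global score with gap-block penalty w(k) = h + g*k."""
    n, m = len(s), len(t)
    a = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with s[i] against t[j]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap in s
    c = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with a gap in t
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a block costs h + g; extending an existing block costs g.
            b[i][j] = max(a[i][j - 1] - (h + g), b[i][j - 1] - g, c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g), b[i - 1][j] - (h + g), c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])


print(affine_gap_score("acgtacgt", "acgt"))  # one block of four gaps: 4 - (2 + 4)
```

Each cell now costs constant time, so the whole computation is proportional to m times n.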

Multiple sequence alignment
• Scoring function for each column: a character or a gap for each sequence
• Combinatorics: 2^k - 1 gap patterns per column for k sequences (-1 because a column cannot consist of gaps only)
• But . . .

Multiple sequence alignment
• Order of arguments for the function should not matter: f(I,-,v) = f(I,v,-)
• Score pairwise on a column
• Combinatorics: (k choose 2) pairs
• For k = 10: 2^k - 1 = 1023, kC2 = 45
• We need gap-to-gap scoring now
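The two counts for k = 10 can be checked by brute force. The sketch below enumerates the gap patterns of a column explicitly; the helper name is hypothetical.

```python
from itertools import product
from math import comb


def column_patterns(k):
    """All ways to mark each of k sequences as character (1) or gap (0),
    excluding the pattern with a gap in every sequence."""
    return [bits for bits in product((0, 1), repeat=k) if any(bits)]


k = 10
n_patterns = len(column_patterns(k))  # equals 2**k - 1
n_pairs = comb(k, 2)                  # number of sequence pairs per column
print(n_patterns, n_pairs)
```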

Multiple sequence alignment
• The total score can be computed either way:
• sum over all columns, or
• sum over all pairs of sequences
• If p(-, -) = 0, both ways give the same score
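A small sketch of why p(-, -) = 0 matters. It assumes that the pairwise total is taken over the induced pairwise alignments, in which columns holding a gap in both rows are dropped; the example rows and function names are made up for illustration.

```python
from itertools import combinations


def pair_score(a, b, gap_gap=0):
    """Score two symbols of one column; gap_gap is the value of p(-, -)."""
    if a == "-" and b == "-":
        return gap_gap
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_by_columns(rows, gap_gap=0):
    """Sum, over all columns, of the pairwise scores inside the column."""
    return sum(
        pair_score(a, b, gap_gap)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


def sp_by_pairs(rows):
    """Sum, over all pairs of rows, of the induced pairwise alignment score
    (columns with a gap in both rows are removed before scoring)."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(pair_score(a, b) for a, b in zip(r1, r2) if (a, b) != ("-", "-"))
    return total


rows = ["a-cg", "a-cg", "atcg"]
print(sp_by_columns(rows), sp_by_pairs(rows))           # equal when p(-, -) = 0
print(sp_by_columns(rows, gap_gap=-1), sp_by_pairs(rows))  # differ otherwise
```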

Multiple sequence alignment
• Consider the alignment of 3 sequences s1, s2, and s3
• The (i, j, k)-th entry of the scoring matrix is for aligning s1[1..i], s2[1..j], s3[1..k]
• 3D matrix of dimension (n x m x l), for |s1| = n, |s2| = m, |s3| = l

Multiple sequence alignment
• Each entry in the scoring matrix is at a corner of a 3D box
• The optimal score is the maximum over the other 7 corners: A[i-1, j, k], A[i, j-1, k], A[i, j, k-1], A[i-1, j-1, k], A[i-1, j, k-1], A[i, j-1, k-1], A[i-1, j-1, k-1] [Vector(i, j, k) - bit-vector]
• In each case the sum-of-pairs score of the column is added [EXAMPLE]
• Initialization: A[i, 0, 0] = (-4)i for 1 <= i <= n, since each character of s1[1..i] faces two gaps; likewise for s2 and s3
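The seven-corner recurrence can be sketched with one bit-vector per predecessor corner. This illustration assumes sum-of-pairs column scores with +1/-1/-2 and p(-, -) = 0, and lets the general recurrence produce the boundary values instead of initializing them separately; all names are hypothetical.

```python
from itertools import combinations, product

NEG = float("-inf")


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def align3_score(s1, s2, s3):
    """Best sum-of-pairs score for aligning three sequences."""
    n, m, l = len(s1), len(s2), len(s3)
    A = [[[NEG] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    A[0][0][0] = 0
    # The 7 non-zero bit-vectors: which sequences contribute a character.
    moves = [bits for bits in product((0, 1), repeat=3) if any(bits)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                if i == j == k == 0:
                    continue
                best = NEG
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # that corner lies outside the matrix
                    col = (
                        s1[i - 1] if di else "-",
                        s2[j - 1] if dj else "-",
                        s3[k - 1] if dk else "-",
                    )
                    col_score = sum(pair_score(a, b) for a, b in combinations(col, 2))
                    best = max(best, A[pi][pj][pk] + col_score)
                A[i][j][k] = best
    return A[n][m][l]


print(align3_score("acg", "ag", "acg"))
```

Along an axis the recurrence reproduces the (-4)i initialization: a character against two gaps scores -2 - 2 + 0.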

Multiple sequence alignment
• For k sequences, a k-dimensional matrix
• Each entry is a calculation over the 2^k - 1 other corners of the “box”
• Formula page 72

Alignment improvements
• Alignment could be computed from the back also: s[i+1..n], t[j+1..m]
• Front and back alignments could be combined to “cut” an alignment: compute the two matrices, add them, and align according to the summed matrix
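A sketch of the front-plus-back idea, assuming the +1/-1/-2 scores: the backward matrix is obtained here by running the forward computation on the reversed strings, and the sum of the two matrices at (i, j) is the best score of any alignment passing through that cell. The function names are hypothetical.

```python
def forward_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """a[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    return a


def backward_matrix(s, t):
    """b[i][j] = best score for aligning s[i:] with t[j:]."""
    n, m = len(s), len(t)
    r = forward_matrix(s[::-1], t[::-1])
    return [[r[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def best_cut_in_row(s, t, i):
    """Column j where an optimal alignment can be cut in row i, with its score."""
    a, b = forward_matrix(s, t), backward_matrix(s, t)
    j = max(range(len(t) + 1), key=lambda col: a[i][col] + b[i][col])
    return j, a[i][j] + b[i][j]


print(best_cut_in_row("aaccaccgaga", "aacaccgaga", 5))
```

Every alignment crosses every row of the matrix, so the best summed value in any row equals the global score.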

Alignment improvements
• When the lengths of the two sequences are comparable and a good global alignment is expected:
• Retrieval is mostly along the diagonal
• Computation can focus on a strip of fixed width k around the diagonal: the k-band
• More efficient
• Uses the relevant cells only
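A hedged sketch of the k-band restriction: only cells with |i - j| <= k are computed, so the result is the best score among alignments that stay inside the band. For simplicity this version still allocates the full matrix; a space-saving variant would store only the band. The function name is an assumption.

```python
NEG = float("-inf")


def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best global score using only cells within distance k of the diagonal."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the corner element")
    a = [[NEG] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:  # the diagonal neighbour is always in the band
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = a[i - 1][j - 1] + p
            if i > 0 and abs((i - 1) - j) <= k:  # upper neighbour, if in band
                best = max(best, a[i - 1][j] + gap)
            if j > 0 and abs(i - (j - 1)) <= k:  # left neighbour, if in band
                best = max(best, a[i][j - 1] + gap)
            a[i][j] = best
    return a[n][m]


print(kband_score("aagtacggaga", "aagcaccgaga", 0))  # diagonal only
print(kband_score("aaccaccgaga", "aacaccgaga", 1))
```

About (2k + 1) cells are filled per row, so the work grows with k times the sequence length rather than with m times n.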

Multiple sequence alignment: Star alignment
• One sequence at the center: all others are aligned pairwise against it
• Which sequence to put at the center?
• Try each:
• create a 2D similarity matrix for all pairs and pick the best row by summed score (the largest sum for similarities, the least for distances) [page 79]
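Choosing the center can be sketched as follows, assuming the matrix holds global similarity scores (+1/-1/-2), so that the best row is the one with the largest sum. The three input sequences are invented for illustration and the function names are hypothetical.

```python
def similarity(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score of s and t."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    return a[n][m]


def star_center(seqs):
    """Return (index of the center sequence, list of row sums)."""
    k = len(seqs)
    totals = [
        sum(similarity(seqs[i], seqs[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    center = max(range(k), key=lambda i: totals[i])
    return center, totals


print(star_center(["acgt", "acgg", "aggg"]))
```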

Multiple sequence alignment: Tree alignment
• A spanning tree over the sequences: nodes are sequences
• Each edge is labeled with the similarity between its pair of nodes
• The total tree score, aggregated over the edges, should be maximal
• A star is a special tree
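Read literally, the slide asks for a spanning tree of maximum total similarity. One way to build such a tree is a Prim-style greedy growth, sketched below; this is a swapped-in standard technique and not necessarily the textbook's tree-alignment procedure. The similarity matrix in the example is made up.

```python
def max_spanning_tree(sim):
    """Grow a spanning tree of maximum total weight from node 0.

    sim is a symmetric matrix of pairwise similarities.
    Returns (list of edges, total similarity over the edges).
    """
    k = len(sim)
    if k == 0:
        return [], 0
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        # Pick the most similar pair joining the tree to a node outside it.
        u, v = max(
            ((u, v) for u in in_tree for v in range(k) if v not in in_tree),
            key=lambda e: sim[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges, sum(sim[u][v] for u, v in edges)


sim = [
    [0, 2, 0],
    [2, 0, 2],
    [0, 2, 0],
]
print(max_spanning_tree(sim))
```

In this example the tree found is the star centered on node 1.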

PAM matrix for matching residues

BLAST search engine
