
Sequence Alignment


kalin



  1. Sequence Alignment

  2. Motivation: Types • Two sequences of the same length in which some characters differ (database search): aagtacggaga / aagcaccgaga • Two sequences of different lengths, with possible gaps in one of them (database search): aaccaccgaga / aa-caccgaga

  3. Motivation: Types • Match the longest suffix of one sequence with a prefix of the other (fragment assembly): aaacgtcgata / gatacgatg • Local alignment: best-matching substrings of the two sequences (homolog search): gatacgatgctagtttacg / agagcgatgcataattcgaatga

  4. Motivation: Types • Multiple sequence alignment (page 71) (Comparative studies of sequences)

  5. Formalizing sequence comparison • Either a character matches the corresponding character in an alignment (+1), • Or, it does not (-1), • Or, a gap needs to be inserted (-2)
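The scoring scheme above can be sketched in a few lines of Python. The function names `column_score` and `alignment_score` are illustrative choices, not names from the slides; the two rows of the alignment are assumed to be padded with `-` to the same length.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # values taken from the slide


def column_score(a, b):
    """Score one alignment column: gap, match, or mismatch."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row_s, row_t):
    """Sum the column scores of two equally long alignment rows."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(row_s, row_t))


if __name__ == "__main__":
    # The two example pairs from the "Motivation: Types" slide.
    print(alignment_score("aagtacggaga", "aagcaccgaga"))
    print(alignment_score("aaccaccgaga", "aa-caccgaga"))
```

The first pair has 9 matches and 2 mismatches; the second has 10 matches and 1 gap.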

  6. Global Alignment • Needleman-Wunsch (1970) dynamic programming algorithm (Smith-Waterman, 1981, is the local-alignment variant) • Scoring matrix for alignment (p 31) • Initializing the boundaries of the scoring matrix for gaps in front of either string • Meaning of an entry in the matrix • Corner element is the final score

  7. Global Alignment • Three alternatives in each iteration • Ordering of calculation: row-wise or column-wise • The algorithm (p 52) • Recursive recovery process from the corner element (constants m and n, the string lengths) • Variable len returned by the algorithm • Convention for tie breaking
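A minimal sketch of the global alignment steps listed above, assuming the +1 / -1 / -2 scores from slide 5. The name `global_align` and the tie-breaking order (diagonal first, then a gap in t, then a gap in s) are assumptions made for this illustration; the referenced textbook pages may use a different convention. The recovery is written iteratively rather than recursively.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundaries: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    # Fill row by row; each cell takes the best of three alternatives.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,   # align s[i] with t[j]
                          a[i - 1][j] + gap,     # s[i] against a gap
                          a[i][j - 1] + gap)     # t[j] against a gap
    # Recover one optimal alignment, starting from the corner element.
    rows_s, rows_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            rows_s.append(s[i - 1]); rows_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + gap:
            rows_s.append(s[i - 1]); rows_t.append("-"); i -= 1
        else:
            rows_s.append("-"); rows_t.append(t[j - 1]); j -= 1
    return a[n][m], "".join(reversed(rows_s)), "".join(reversed(rows_t))


if __name__ == "__main__":
    score, row_s, row_t = global_align("aaccaccgaga", "aacaccgaga")
    print(score, len(row_s))  # len(row_s) plays the role of the slide's "len"
    print(row_s)
    print(row_t)
```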

  8. Local alignment • An alignment may start and stop anywhere • So, the min score is zero, even on the boundaries • The best local alignment is where the score is max in the matrix • Recovery starts from that max value and stops at a zero value
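The local variant changes only three things relative to the global recurrence: zero boundaries, a floor of zero in every cell, and recovery from the maximum cell down to a zero. A hedged sketch, with `local_align` as an illustrative name and the slide-5 scores as defaults:

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]  # boundaries stay at zero
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                    # an alignment may start here
                          a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
            if a[i][j] > best:                  # remember the first maximum
                best, bi, bj = a[i][j], i, j
    # Recovery: start at the maximum, stop at the first zero entry.
    rows_s, rows_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            rows_s.append(s[i - 1]); rows_t.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + gap:
            rows_s.append(s[i - 1]); rows_t.append("-"); i -= 1
        else:
            rows_s.append("-"); rows_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(rows_s)), "".join(reversed(rows_t))


if __name__ == "__main__":
    # Toy input (not from the slides): only "acgt" is shared.
    print(local_align("xxacgtyy", "zzacgtww"))
```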

  9. Semi-global (as-required) alignment • Four alternatives: penalty-free gaps in front of string s, in front of t, at the back of s, at the back of t • Prefix-suffix matching by combining these alternatives • E.g., suffix of s with prefix of t: free gaps at the back of s and in front of t

  10. Semi-global alignment • Example: p 56 • Gaps in front: zeros in row or column representing the string • Gaps at the back: recovery starts from the max of row or column representing the string • Above may be combined as required • Exercise: how to combine for matching suffix of s with prefix of t
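One possible answer to the exercise above, as a sketch rather than the textbook's solution: to match a suffix of s with a prefix of t, leave the first column at zero (free gaps in front of t) and read the result from the maximum of the last row (free gaps at the back of s). The function name `suffix_prefix_overlap` and the returned pair are assumptions made for this example; rows index s and columns index t.

```python
def suffix_prefix_overlap(s, t, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a suffix of s with a prefix of t.

    Returns (score, j), where j is the length of the matched prefix of t.
    """
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    # First row: gaps in front of s are still penalized.
    for j in range(1, m + 1):
        a[0][j] = j * gap
    # First column stays zero: skipping a prefix of s is free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
    # Last row: the unmatched tail of t costs nothing, so take the max.
    best_j = max(range(m + 1), key=lambda j: a[n][j])
    return a[n][best_j], best_j


if __name__ == "__main__":
    # Fragment-assembly example from slide 3: "gata" is shared.
    print(suffix_prefix_overlap("aaacgtcgata", "gatacgatg"))
```

Swapping which boundary is zeroed and which edge is scanned for the maximum gives the other three alternatives.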

  11. Generalized gap penalty • A block of k consecutive gaps is penalized like a single gap, or by some function w(k) • Each block of gaps is treated as one unit (like a char) • Boundary (first row and col) initialization with w(k)

  12. Generalized gap penalty • Three matrices interplaying: • One for character matching with p(i,j) • One for gaps in s • One for gaps in t • Formula on p 63

  13. Affine gap penalty • Generalized gap penalty with w(k) = h + gk: the first gap of a block costs h + g, each further gap costs g • Formula changes slightly with known w(k) • The gap matrices only compare against the immediately preceding elements, so the complexity drops (from cubic to quadratic time)
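A sketch of the three-matrix recurrence for the affine penalty w(k) = h + gk. The matrix names `a`, `b`, `c`, the function name `affine_score`, and the default values h = 2, g = 1 are assumptions for illustration; the formula on the referenced textbook page may be written differently. Only the score is computed, not the alignment.

```python
NEG = float("-inf")


def affine_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Global alignment score with gap-block penalty w(k) = h + g*k."""
    n, m = len(s), len(t)
    # a: last column pairs s[i] with t[j]
    # b: last column is a gap in s (t[j] against '-')
    # c: last column is a gap in t (s[i] against '-')
    a = [[NEG] * (m + 1) for _ in range(n + 1)]
    b = [[NEG] * (m + 1) for _ in range(n + 1)]
    c = [[NEG] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(h + g * j)       # one gap block of length j
    for i in range(1, n + 1):
        c[i][0] = -(h + g * i)       # one gap block of length i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Opening a block costs h + g, extending one costs only g.
            b[i][j] = max(a[i][j - 1] - (h + g),
                          b[i][j - 1] - g,
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[n][m], b[n][m], c[n][m])


if __name__ == "__main__":
    # "aaccc" vs "aa---": 2 matches, one block of 3 gaps costing h + 3g.
    print(affine_score("aaccc", "aa"))
```

Each cell looks only at its three neighbours in each matrix, which is where the saving over a general w(k) comes from.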

  14. Multiple sequence alignment • Function for each column: a character or a gap for each sequence • Combinatorics: 2^k - 1 gap patterns per column for k sequences (-1 because a column cannot consist of gaps only) • But . . .

  15. Multiple sequence alignment • Order of arguments for the function should not matter: f(I,-,v) = f(I,v,-) • Score pairwise on a column • Combinatorics: (k choose 2) • For k=10, 2^k-1 = 1023, kC2 = 45 • We need gap-to-gap scoring now

  16. Multiple sequence alignment • Total score can be measured either way: • Sum over all columns, or • Sum over all pairs of sequences • If p(-, -) = 0, then both scorings give the same total
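The two ways of totalling the score can be checked on a small case. In this sketch the pairwise total is taken over the induced pairwise alignments, where columns holding two gaps are dropped; that projection step is an assumption about what "sum over all pairs" means here. The function names and the three-row toy alignment are illustrative.

```python
from itertools import combinations
from math import comb


def p(a, b):
    """Pair score: +1 match, -1 mismatch, -2 char vs gap, 0 gap vs gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_by_columns(rows):
    """Sum, over all columns, of the pairwise scores inside the column."""
    return sum(p(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def sp_by_pairs(rows):
    """Sum, over all pairs of rows, of their induced pairwise alignment score."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        # Induced alignment: drop columns where both rows have a gap.
        kept = [(a, b) for a, b in zip(r1, r2) if not (a == "-" and b == "-")]
        total += sum(p(a, b) for a, b in kept)
    return total


if __name__ == "__main__":
    rows = ["a-cg", "a--g", "atcg"]          # toy alignment, not from the slides
    print(sp_by_columns(rows), sp_by_pairs(rows))
    print(2 ** 10 - 1, comb(10, 2))          # column patterns vs pairs for k = 10
```

The dropped gap-gap columns contribute nothing only because p(-, -) = 0, which is why the two totals agree.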

  17. Multiple sequence alignment • Consider a 3-sequence alignment of s1, s2, and s3 • The (i, j, k)-th entry of the scoring matrix is for aligning s1[1..i], s2[1..j], s3[1..k] • 3D matrix of dimension (n x m x l), for |s1|=n, |s2|=m, |s3|=l

  18. Multiple sequence alignment • Each entry in the scoring matrix sits at a corner of a 3D box • The optimal score is the max over the other 7 corners: A[i-1, j, k], A[i, j-1, k], A[i, j, k-1], A[i-1, j-1, k], A[i-1, j, k-1], A[i, j-1, k-1], A[i-1, j-1, k-1] [Vector(i,j,k) - bit-vector] • In each case the sum-of-pairs score for the column is added [EXAMPLE] • Initialization: (-4)i for 1<=i<=n, for two gaps against substrings of s1, likewise for s2 and s3
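A sketch of the 3D recurrence, assuming sum-of-pairs column scores with +1 / -1 / -2 and p(-, -) = 0. The seven predecessor corners are enumerated as non-zero bit-vectors, as hinted on the slide. The names `align3_score`, `p`, and `sp` are illustrative, and only the score is returned. The boundary values are not set by hand; they fall out of the same recurrence and equal -4i along each axis.

```python
from itertools import combinations, product

NEG = float("-inf")


def p(a, b):
    """Pair score with gap-gap scored as zero."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp(col):
    """Sum-of-pairs score of one column."""
    return sum(p(a, b) for a, b in combinations(col, 2))


def align3_score(s1, s2, s3):
    """Optimal sum-of-pairs score for aligning three sequences."""
    n, m, l = len(s1), len(s2), len(s3)
    A = [[[NEG] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    A[0][0][0] = 0
    # The 7 non-zero bit-vectors: bit = 1 means "consume a character".
    moves = [b for b in product((0, 1), repeat=3) if any(b)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                if i == j == k == 0:
                    continue
                best = NEG
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # that corner lies outside the matrix
                    col = (s1[i - 1] if di else "-",
                           s2[j - 1] if dj else "-",
                           s3[k - 1] if dk else "-")
                    best = max(best, A[pi][pj][pk] + sp(col))
                A[i][j][k] = best
    return A[n][m][l]


if __name__ == "__main__":
    print(align3_score("acg", "ag", "acg"))
```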

  19. Multiple sequence alignment • For k sequences, k-dimensional matrix • Each entry is a calculation over 2^k –1 other corners of the “box” • Formula page 72

  20. Alignment improvements • Alignment could be from the back also: s[i+1..n], t[j+1..m] • Front and back alignment could be combined to “cut” alignment: compute the two matrices, add them, align according to the added matrix
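One way to read the "add the two matrices" idea, sketched under assumptions: the forward matrix scores s[1..i] against t[1..j], the backward matrix scores s[i+1..n] against t[j+1..m], and their sum at (i, j) is the best score of any global alignment that is cut at that cell. Here the backward matrix is obtained by running the forward fill on the reversed strings. The names `prefix_matrix` and `through_matrix` are illustrative.

```python
def prefix_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Full global-alignment matrix: entry [i][j] scores s[:i] vs t[:j]."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a[i][0] = i * gap
    for j in range(1, m + 1):
        a[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + gap, a[i][j - 1] + gap)
    return a


def through_matrix(s, t):
    """Entry [i][j] = best score of a global alignment cut at cell (i, j)."""
    n, m = len(s), len(t)
    fwd = prefix_matrix(s, t)
    # Backward scores for s[i:], t[j:] come from the reversed strings.
    rev = prefix_matrix(s[::-1], t[::-1])
    return [[fwd[i][j] + rev[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


if __name__ == "__main__":
    total = through_matrix("aaccaccgaga", "aacaccgaga")
    # Every row contains at least one cell on an optimal path.
    print([max(row) for row in total])
```

Cells whose summed value equals the global score are exactly the candidate cut points.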

  21. Alignment improvements • When the lengths of the two sequences are comparable and a good global alignment is expected: • Retrieval is mostly along the diagonal • Computation can focus on a strip of fixed width (k) around the diagonal: k-band • More efficient • Only the relevant cells are used
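A sketch of the k-band restriction for two sequences of equal length, which is a simplifying assumption of this example. Only cells with |i - j| <= k are stored, here in a dictionary; anything outside the strip counts as minus infinity. The name `kband_score` is illustrative.

```python
NEG = float("-inf")


def kband_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Best global score using only cells within k of the main diagonal."""
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes sequences of equal length")
    a = {(0, 0): 0}  # sparse matrix: only cells inside the band are stored

    def get(i, j):
        return a.get((i, j), NEG)  # outside the band means unreachable

    for i in range(n + 1):
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                best = get(i - 1, j - 1) + p
            if i > 0:
                best = max(best, get(i - 1, j) + gap)
            if j > 0:
                best = max(best, get(i, j - 1) + gap)
            a[(i, j)] = best
    return a[(n, n)]


if __name__ == "__main__":
    # With k = 0 only the diagonal is allowed; k = 1 admits one shift.
    print(kband_score("acgta", "cgtaa", 0), kband_score("acgta", "cgtaa", 1))
```

The band holds roughly (2k + 1) * n cells instead of n * n, so a small k that already reaches the unrestricted optimum saves most of the work.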

  22. Multiple sequence alignment: Star alignment • One sequence at the center: all others are pairwise aligned against it • Which sequence to put at the center? • Try each: • create a 2D matrix of pairwise scores for all pairs, and pick the row with the best summed score (the largest sum for similarity scores, the smallest for distances) [page 79]
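Choosing the center can be sketched as follows, assuming pairwise global similarity with the +1 / -1 / -2 scores, so the best row is the one with the largest sum. The names `sim` and `pick_center` and the three toy sequences are illustrative, not from the slides.

```python
def sim(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score, keeping only two rows of the matrix."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, x in enumerate(s, 1):
        cur = [i * gap]
        for j, y in enumerate(t, 1):
            p = match if x == y else mismatch
            cur.append(max(prev[j - 1] + p, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Return (index of the center sequence, summed score of every row)."""
    k = len(seqs)
    # 2D matrix of pairwise similarities; the diagonal is left at zero.
    scores = [[sim(seqs[i], seqs[j]) if i != j else 0 for j in range(k)]
              for i in range(k)]
    totals = [sum(row) for row in scores]
    center = max(range(k), key=totals.__getitem__)
    return center, totals


if __name__ == "__main__":
    print(pick_center(["aaaa", "aaat", "ttat"]))
```

Once the center is fixed, the remaining sequences would each be aligned against it and the pairwise alignments merged, which this sketch leaves out.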

  23. Multiple sequence alignment: Tree alignment • A spanning tree over the sequences: nodes are sequences • Each edge is labeled with the similarity between its pair of nodes • Total tree cost, i.e. the aggregate over the edges, should be max • Star is a special tree

  24. PAM matrix for matching residues

  25. BLAST search engine
