1 / 60

Sequence Alignment: Importance and Techniques

Finding similarity between biological sequences is crucial for protein homology, structure prediction, and more. Learn about global and local alignment methods, scoring functions, homology examples, and dynamic programming approaches.

wilsondavis
Download Presentation

Sequence Alignment: Importance and Techniques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Выравнивание двух последовательностей

  2. -4 -6 -2 -3 -4 -2 -6 -3 -2 -8 -5 -4 0 -2 1 -1 -1 0 -1 -1

  3. Sequence comparison: Motivation Finding similarity between sequences is important for many biological questions. • Find homologous proteins • Allows to predict structure and function • Locate similar subsequences in DNA • e.g: allows to identify regulatory elements • Locate DNA sequences that might overlap • Helps in sequence assembly

  4. Dot plots Not technically an “alignment” But gives picture of correspondence between pairs of sequences Dot represents similarity between segments of the two sequences

  5. Sequence Alignment • Input: two sequences over the same alphabet • Output: an alignment of the two sequences • Two basic variants of sequence alignment: • Global – all characters in both sequences participate • Needleman-Wunsch, 1970 • Local – find related regions within sequences • Smith-Waterman, 1981

  6. Sequence Alignment - Example -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A • Three elements: • Perfect matches • Mismatches • Insertions & deletions (indel) • Input: GCGCATGGATTGAGCGAandTGCGCCATTGATGACCA • Possible output:

  7. Scoring Function • Score each position independently: • Match: +1 • Mismatch: -1 • Indel: -2 • Score of an alignment is sum of position scores • Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23

  8. Homology Example: Evolution of the Globins

  9. Sequence vs. Structure Similarity Sequence 1 lcl|1A6M:_ MYOGLOBIN Length 151 (1..151) Sequence 2 lcl|1JL7:A MONOMER HEMOGLOBIN COMPONENT III Length 147 (1..147) Score = 31.6 bits (70), Expect = 10 Identities = 33/137 (24%), Positives = 55/137 (40%), Gaps = 17/137 (12%) Query: 2 LSEGEWQLVLHVWAKVEA--DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 59 LS + Q+V W + + AG G++ L + +HPE F + Sbjct: 2 LSAAQRQVVASTWKDIAGADNGAGVGKECLSKFISAHPEMAAVFG--------FSGASDP 53 Query: 60 DLKKHGVTVLTALGAI---LKKKGHHEAELKPLAQSH---ATKHKIPIKYLEFISEAIIH 113 + + G VL +G L +G AE+K + H KH I +Y E + +++ Sbjct: 54 GVAELGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGNKH-IKAEYFEPLGASLLS 112 Query: 114 VLHSRHPGDFGADAQGA 130 + R G A A+ A Sbjct: 113 AMEHRIGGKMNAAAKDA 129

  10. Example Alignment: Globins figure at right shows prototypical structure of globins figure below shows part of alignment for 8 globins (-’s indicate gaps)

  11. Insertions/Deletions and Protein Structure Why is it that two “similar” sequences may have large insertions/deletions? some insertions and deletions may not significantly affect the structure of a protein loop structures: insertions/deletions here not so significant

  12. Sequence vs. Structure Similarity Myoglobin and hemoglobin are similar, but slight differences in structure let them perform different functions. 1A6M: Myoglobin 1JL7: Hemoglobin

  13. Myoglobin & Hemoglobin Красивые ролики по структуре миоглобина и гемоглобина http://higheredbcs.wiley.com/legacy/college/boyer/0471661791/structure/HbMb/hbmb.htm

  14. The Space of Global Alignments some possible global alignments for ELV and VIS ELV VIS -ELV VIS- --ELV VIS-- ELV- -VIS E-LV VIS- ELV-- --VIS EL-V -VIS

  15. Number of Possible Alignments given sequences of length m and n assume we don’t count as distinct and we can have as few as 0 and as many as min{m, n} aligned pairs therefore the number of possible alignments is given by -C G- C- -G

  16. Number of Possible Alignments there are • possible global alignments for 2 sequences of length n • e.g. two sequences of length 100 have possible alignments • but we can use dynamic programming to find an optimal alignment efficiently

  17. Dynamic Programming Algorithmic technique for optimization problems that have two properties: • Optimal substructure: Optimal solution can be computed from optimal solutions to subproblems • Overlapping subproblems: Subproblems overlap such that the total number of distinct subproblems to be solved is relatively small

  18. Dynamic Programming Break problem into overlapping subproblems use memoization: remember solutions to subproblems that we have already seen 3 7 5 1 8 6 2 4

  19. Fibonacci example 1,1,2,3,5,8,13,21,... fib(n) = fib(n - 2) + fib(n - 1) Could implement as a simple recursive function However, complexity of simple recursive function is exponential in n

  20. Fibonacci dynamic programming Two approaches Memoization: Store results from previous calls of function in a table (top down approach) Solve subproblems from smallest to largest, storing results in table (bottom up approach) Both require evaluating all (n-1) subproblems only once: O(n)

  21. Dynamic Programming Graphs 1 2 3 4 5 6 graph for fib(6) Dynamic programming algorithms can be represented by a directed acyclic graph • Each subproblem is a vertex • Direct dependencies between subproblems are edges

  22. Global Alignment • Input: two sequences over the same alphabet • Output: an alignment of the two sequences in which all characters in both sequences participate • The Needleman-Wunsch algorithm finds an optimal global alignment between two sequences • Uses a scoring function • A dynamic programming algorithm

  23. Dynamic Programming Idea consider last step in computing alignment of AAAC with AGC three possible options; in each we’ll choose a different pairing for end of alignment, and add this to the best alignment of previous characters AAA AAA C C AG AGC - C AAAC - AG C consider best alignment of these prefixes score of aligning this pair +

  24. The Needleman-Wunsch (NW) Algorithm • Suppose we have two sequences: • s=s1…sn and t=t1…tm • Construct a matrix V[n+1, m+1] in which V(i, j) contains the score for the best alignment between s1…si and t1…tj. • The grade for cell V(i, j) is: V(i-1, j)+d V(i, j) = max V(i, j-1)+d V(i-1, j-1)+score(si, tj) • d- штраф за открытие разрыва (gap-open) - linear gap penalty • V(n,m) is the score for the best alignment between s and t

  25. NW Algorithm – An Example • Alphabet: • DNA, ∑ = {A,C,G,T} • Input: • s = AAAC • t = AGC • Scoring scheme: • Match: score(x, x) = 1 • Mismatch: score(x, y) = -1 • Gap Opening d = -2 V(i-1, j)+d V(i, j) = max V(i, j-1)+d V(i-1, j-1)+score(si, tj)

  26. Initializing Matrix: Global Alignment with Linear Gap Penalty A G C s 2s 3s 0 A s A 2s A 3s C 4s

  27. -4 -6 -2 -3 -4 -2 -6 -3 -2 -8 -5 -4 NW Algorithm – An Example Match: score(x, x) = 1 Mismatch: score(x, y) = -1 Gap Opening d = -2 V(i-1, j)+d V(i, j) = max V(i, j-1)+d V(i-1, j-1)+score(si, tj) -AGC AAAC 0 -2 1 -1 AG-C AAAC -1 0 -1 A-GC AAAC -1 Лучший вес по определению Обратный проход:движемся обратно по тем ячейкам, из которых было вычислено

  28. -4 -6 -2 -3 -4 -2 -6 -3 -2 -8 -5 -4 NW – Time and Space Complexity Time: • Filling the matrix: • Backtracing: • Overall: Space: • Holding the matrix: 0 -2 O(n·m) 1 -1 O(n+m) O(n·m) -1 0 -1 -1 O(n·m)

  29. Local Alignment Motivation useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere useful for comparing DNA sequences that share a similar motif but differ elsewhere useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence) more sensitive when comparing highly diverged sequences

  30. Structure of a genome a gene transcription pre-mRNA splicing mature mRNA translation Human 3x109 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding protein

  31. Structure of a genome gene D A B C Make D If B then NOT D If A and B then D short sequences regulate expression of genes lots of “junk” sequence e.g. ~50% repeats selfish DNA gene B D C Make B If D then B

  32. Cross-species genome similarity • 98% of genes are conserved between any two mammals • ~75% average similarity in protein sequence hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001 mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001 rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938 fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174 hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001 mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001 rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938 fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174 hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001 mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001 rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938 fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174 hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001 mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001 rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938 fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174 “atoh” enhancer in human, mouse, rat, fugu fish

  33. The local alignment problem Given two strings x = x1……xM, y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. x = aaaacccccgggg y = cccgggaaccaacc

  34. Smith-Waterman Algorithm • Два отличия от Нидлмана-Вунша • Для каждого элемента матрицы дана возможность принять значение, равное нулю, если все другие значения отрицательны • Выравнивание может заканчиваться в любом месте таблицы. Лучший вес - наибольшее значение всей матрицы. Оттуда и начинается обратный проход 0 V(i-1, j)+d V(i, j) = max V(i, j-1)+d V(i-1, j-1)+score(si, tj)

  35. empty G A T C A C C T 0 0 0 0 0 0 0 0 empty 0 0 0 0 0 G 0 0 0 0 0 A 0 0 0 0 1 T 0 0 0 0 0 A 0 0 C 0 0 0 0 C 0 0 0 C Local Alignment GATCACCT GATACCC GATCACCT GAT_ACCC • Let gap = -2 match = 1 mismatch = -1. 0 1 0 0 0 0 2 0 0 1 0 3 1 0 1 1 2 2 0 0 2 1 3 1 0 1 1 2 4 2 1 0 2 3 3

  36. Overlap Alignment Перекрывающиеся выравнивания Consider the following problem: • Find the most significant overlap between two sequences S,T ? • Possible overlap relations: a. b. Difference from local alignment: Here we require alignment between the endpoints of the two sequences. Мы хотим получить разновидность глобального выравнивания, но в котором нет штрафа за свисающие концы То есть выравнивание начиналось на левой или верхней границе матрицы, а заканчивалось на правой или нижней

  37. Overlap Alignment Формально: Исходя из S[1..n] , T[1..m] найти i,j такие что d - максимально, где d: d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) } . Решение: То же самое, что и глобальное выравнивание, за исключением того, что мы не штрафуем за висящие концы.

  38. local overlap global Overlap Alignment • Initialization:V[i,0]=0,V[0,j]=0 Recurrence:as in global alignment Score:maximum value at the bottom line and rightmost line

  39. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5

  40. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5

  41. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE Scoring scheme: • Match: +4 • Mismatch: -1 • Indel: -5

  42. Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5 -2 Overlap Alignment (Example) The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE-

  43. Динамическое программирование с более сложными моделями • До сих пор мы рассматривали простейшую модель разрывов, где штраф d - линейно зависел от его длины. Каждый следующий остаток наказывается так же, как и первый. (g)= - nd n - число остатков, d - штраф за открытие разрыва • Введем аффинную функцию. (n)= -d-(n-1)e n - число остатков, d - штраф за открытие разрыва, а e - штраф за его продолжение

  44. Dynamic Programming for the Affine Gap Penalty Case to do in time, need 3 matrices instead of 1 IGAxi LGVyi best score given that x[i] is aligned to y[j] best score given that x[i] is aligned to a gap AIGAxi GVyi-- best score given that y[j] is aligned to a gap GAxi-- SLGVyi

  45. Why Three Matrices Are Needed consider aligning the sequences WFP and FW using d= -4 (gap opening), e = -1 (gap extension) and the following values from the BLOSUM-62 substitution matrix: W F P WF FW 0 -5 -6 -7 -WF FW- -WFP FW-- F -5 1 1 -4 optimal alignment W -6 6 2 0 best alignment of these prefixes; to get optimal alignment, need to also remember S(F, W) = 1 S(W, W) = 11 S(F, F) = 6 S(W, P) = -4 S(F, P) = -4 • the matrix shows the highest-scoring partial alignment for each pair of prefixes

  46. Global Alignment DP for the Affine Gap Penalty Case d+e e d+e e M Ix Iy M Ix Iy M Ix Iy M Ix Iy M Ix Iy

  47. Global Alignment DP for the Affine Gap Penalty Case initialization d+e d+e • traceback • start at largest of • stop at any of • note that pointers may traverse all three matrices

  48. Global Alignment Example C A C T A M : 0 -∞ -∞ -∞ -∞ -∞ Ix : -3 -∞ -∞ -∞ -∞ -∞ Iy : -3 -4 -5 -6 -7 -8 -∞ 1 -5 -4 -7 -8 -4 -∞ -∞ -∞ -∞ -∞ A -∞ -∞ -3 -4 -5 -6 -∞ -3 0 -2 -5 -6 A -5 -3 -9 -8 -11 -12 -∞ -∞ -7 -4 -5 -6 -∞ -6 -4 -1 -3 -4 -6 -4 -4 -6 -9 -10 T -∞ -∞ -10 -8 -5 -6 ACACT AA--T ACACT A--AT ACACT --AAT three optimal alignments:

  49. Local Alignment DP for the Affine Gap Penalty Case d+e e d+e e

  50. Local Alignment DP for the Affine Gap Penalty Case initialization • traceback • start at largest • stop at

More Related