Sequence Analysis Methods

CZ5225: Modeling and Simulation in Biology Lecture 3: Sequence analysis methods Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore. Sequence Analysis Methods.


Presentation Transcript


  1. CZ5225: Modeling and Simulation in Biology. Lecture 3: Sequence analysis methods. Prof. Chen Yu Zong. Tel: 6874-6877. Email: csccyz@nus.edu.sg. http://xin.cz3.nus.edu.sg. Room 07-24, level 7, SOC1, National University of Singapore.

  2. Sequence Analysis Methods

  3. Gene and Protein Sequence Alignment as a Mathematical Problem. Example: Sequence a: ATTCTTGC. Sequence b: ATCCTATTCTAGC. Best alignment: ATTCTTGC placed against ATCCTATTCTAGC, with one gap marked. Bad alignment: a split into the pieces AT, TCTT, and GC against ATCCTATTCTAGC, with two gaps marked. What is a good alignment?

  4. How to rate an alignment? • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
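The scoring rule above can be written as a small function. The following is a minimal Python sketch, not code from the lecture: the names `w`, `alignment_score`, and `GAP` are illustrative, and the example rows are the alignment that slide 32 later reports as optimal.

```python
GAP = "-"  # gap symbol used in the aligned rows


def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score of one alignment column, using the weights from slide 4."""
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum w(x, y) over the columns of two equal-length gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```

Rating an alignment is then just a sum over its columns, so two candidate alignments of the same pair of sequences can be compared directly by their totals.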

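  5. Pairwise Alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. An alignment of a and b: C---TTAACT over CGGATCA--T, with a match, a mismatch, an insertion gap, and a deletion gap labeled.

  6. Alignment Graph. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: C---TTAACT over CGGATCA--T, with the insertion gap and the deletion gap labeled.

  7. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: path covering prefix C of b and prefix C of a.] Alignment: C---TTAACT over CGGATCA--T.

  8. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: path covering prefix CGGA of b and prefix C of a.] Alignment: C---TTAACT over CGGATCA--T.

  9. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: path covering prefix CGGAT of b and prefix CT of a.] Alignment: C---TTAACT over CGGATCA--T.

  10. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: path covering prefix CGGATCA of b and prefix CTTAAC of a.] Alignment: C---TTAACT over CGGATCA--T.

  11. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: path covering all of b and all of a.] Alignment: C---TTAACT over CGGATCA--T.

  12. Pathway of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T) with the full path drawn.] Alignment: C---TTAACT over CGGATCA--T.

  13. Graphic representation of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: CTTAACT- over CGGATCAT.

  14. Pathway of an alignment. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T) with the full path drawn.] Alignment: CTTAACT- over CGGATCAT.

  15. Use of graph to generate alignments. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: -CTTAACT over CGGATCAT.

  16. Use of graph to generate alignments. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: -C--TTAACT over CGGATC-AT-.

  17. Use of graph to generate alignments. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: CTTAACT - - - - - CGGATCAT.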

  18. Which pathway is better? Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Multiple pathways exist, each with its own alignment score.
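To get a feel for how many pathways there are, the sketch below counts the paths through the grid, assuming each step is a diagonal move (a pair of symbols), a move that consumes one symbol of a (gap in b), or a move that consumes one symbol of b (gap in a). The function name `count_pathways` is illustrative and not from the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_pathways(m, n):
    """Number of paths from the start corner to cell (m, n) of the grid."""
    if m == 0 or n == 0:
        return 1  # only one way along an edge: all gaps
    return (count_pathways(m - 1, n - 1)   # diagonal step
            + count_pathways(m - 1, n)     # consume a symbol of a
            + count_pathways(m, n - 1))    # consume a symbol of b


# CTTAACT has 7 symbols and CGGATCAT has 8.
print(count_pathways(7, 8))  # 108545
```

Even for these two short sequences there are over a hundred thousand pathways, which is why the slides that follow compute the best score cell by cell instead of scoring every pathway.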

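  19. Alignment Score. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: C---TTAACT over CGGATCA--T.

  20. Alignment Score. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: C---TTAACT over CGGATCA--T.

  21. Alignment Score. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: C---TTAACT over CGGATCA--T.

  22. Alignment Score. Sequence a: CTTAACT. Sequence b: CGGATCAT. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).] Alignment: C---TTAACT over CGGATCA--T. Alignment score: 6 + 8 = 14. With the weights on slide 4, this running total matches the alignment CTTAAC-T over CGGATCAT from slide 32; the alignment C---TTAACT over CGGATCA--T scores 12.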

  23. An optimal alignment is the alignment of maximum score. • Let A = a1a2…am and B = b1b2…bn. • Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj. • With proper initializations, Si,j can be computed as follows.

  24. Computing Si,j. [Figure: cell (i, j) of the grid is reached by one of three steps, scored w(ai, bj), w(ai, -), and w(-, bj); the final cell holds Sm,n.]
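Written out, the three steps give the following recurrence. This is a reconstruction from the worked options on slides 26 to 29, not a formula copied from the slide itself.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j)
\end{cases}
\qquad 1 \le i \le m,\quad 1 \le j \le n
```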

  25. Initializations (gap symbol: -3). S0,0 = 0. S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24. S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]

  26. Match: 8. Mismatch: -5. Gap symbol: -3. S1,1 = ? Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8. Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6. Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6. Optimal: S1,1 = 8.

  27. Match: 8. Mismatch: -5. Gap symbol: -3. S1,2 = ? Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8. Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9. Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5. Optimal: S1,2 = 5.

  28. Match: 8. Mismatch: -5. Gap symbol: -3. S2,1 = ? Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8. Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5. Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9. Optimal: S2,1 = 5.

  29. Match: 8. Mismatch: -5. Gap symbol: -3. S2,2 = ? Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3. Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2. Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2. Optimal: S2,2 = 3.
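The cell-by-cell computation on slides 26 to 29 can be sketched in Python as follows. This is an illustrative implementation of the three-option rule (the standard global alignment dynamic program), not the lecturer's code; the function name `global_score_table` is an assumption.

```python
def global_score_table(a, b, match=8, mismatch=-5, gap=-3):
    """Fill S[i][j], the best score for aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initializations: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # option 1: a_i with b_j
                          S[i - 1][j] + gap,       # option 2: a_i with a gap
                          S[i][j - 1] + gap)       # option 3: gap with b_j
    return S


S = global_score_table("CTTAACT", "CGGATCAT")
print(S[1][1], S[1][2], S[2][1], S[2][2])  # 8 5 5 3
```

The four printed values are the cells worked out by hand on slides 26 to 29. The table has (m + 1) × (n + 1) cells and each takes constant work, so the cost grows with the product of the two sequence lengths.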

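  30. S3,5 = ? [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]

  31. S3,5 = ? [Figure: grid of b (C G G A T C A T) against a (C T T A A C T), with the optimal score marked.]

  32. Alignment: CTTAAC-T over CGGATCAT. Score: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]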
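The alignment itself can be recovered by walking back from the final cell and asking, at each cell, which of the three options produced its value. The sketch below is one illustrative way to do this; the function name `global_align` and the order in which ties are resolved (diagonal first) are assumptions, not part of the lecture.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])   # a_i aligned with a gap
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")        # gap aligned with b_j
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


print(global_align("CTTAACT", "CGGATCAT"))  # (14, 'CTTAAC-T', 'CGGATCAT')
```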

  33. Local vs. Global Sequence Alignment. Example: DNA sequence a: ATTCTTGC. DNA sequence b: ATCCTATTCTAGC. Local alignment: ATTCTTGC placed against ATCCTATTCTAGC, with one gap marked. Gaps are ignored in local alignments. Global alignment: a split into the pieces AT, TCTT, and GC against ATCCTATTCTAGC, with two gaps marked. Gaps are counted in global alignments.

  34. Global Alignment vs. Local Alignment. • Global alignment: all sections are counted. • Local alignment: only local sections (normally separated by gaps) are counted.

  35. An optimal local alignment. • Si,j: the score of an optimal local alignment ending at ai and bj. • With proper initializations, Si,j can be computed as follows.
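The local recurrence differs from the global one only by a fourth option, 0, which lets an alignment start afresh at any cell. The form below is reconstructed from the four options worked out on slide 37, where the border cells S0,1 and S1,0 are taken as 0.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
0
\end{cases}
\qquad S_{i,0} = S_{0,j} = 0
```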

  36. Match: 8. Mismatch: -5. Gap symbol: -3. Initializations. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]

  37. Match: 8. Mismatch: -5. Gap symbol: -3. S1,1 = ? Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8. Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3. Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3. Option 4: S1,1 = 0. Optimal: S1,1 = 8.

  38. Local alignment. Match: 8. Mismatch: -5. Gap symbol: -3. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]

  39. Local alignment: A-C-T over ATCAT. Score: 8 - 3 + 8 - 3 + 8 = 18, the best score. [Figure: grid of b (C G G A T C A T) against a (C T T A A C T).]
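The local alignment above can be reproduced with a short sketch of the four-option rule (the scheme commonly known as Smith-Waterman). The function name `local_align` and the diagonal-first tie-breaking are illustrative choices, not taken from the lecture.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                       # option 4: start afresh
                          S[i - 1][j - 1] + pair,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```

Unlike the global case, the best score is taken over the whole table instead of being read from the bottom-right cell, and the traceback stops as soon as it reaches a zero.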

  40. BLAST Basic Local Alignment Search Tool Procedure: • Divide all sequences into overlapping constituent words (size k) • Build the hash table for Sequence a. • Scan Sequence b for hits. • Extend hits.

  41. BLAST Basic Local Alignment Search Tool Step 1: Hash table for sequence A
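A minimal sketch of Step 1, assuming exact DNA words and a plain dictionary as the hash table: each word of length k is mapped to the positions where it starts. The name `build_word_index` is illustrative; real BLAST implementations use more elaborate lookup structures.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Map every overlapping k-letter word of seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


index = build_word_index("CTTAACT", 2)
print(dict(index))
# {'CT': [0, 5], 'TT': [1], 'TA': [2], 'AA': [3], 'AC': [4]}
```

A word that occurs more than once, such as CT here, keeps all of its positions, so a later scan can report every place where it matches.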

  42. Amino acid similarity matrix PAM 120. Instead of the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.

  43. Amino acid similarity matrix PAM 250. This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse data sets and a larger number of statistical steps.

  44. Amino acid similarity matrix Blosum 45. The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, Blosum matrices should be more realistic for comparing distantly related proteins, but they may introduce error for conventional proteins.

  45. BLAST Basic Local Alignment Search Tool

  46. BLAST Basic Local Alignment Search Tool. Step 2: Use all of the 2-letter words in the query sequence to scan the database sequence and mark those with a score of at least 8. Note: Marked points can lie on the diagonal or off the diagonal. Examples: LN:LN = 9, NF:NY = 8, GW:PW = 10.

  47. BLAST. Step 2: Scan sequence b for hits.

  48. BLAST. Step 2: Scan sequence b for hits. Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments. [Figure: a hit being extended in both directions.] Terminate the extension when its score fades away.
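Steps 2 and 3 can be sketched for DNA with exact word matches and an ungapped extension. This is a simplified illustration, not the BLAST implementation: the names `find_hits` and `extend_hit`, the drop-off threshold `x_drop`, and the reuse of the +8 / -5 weights from slide 4 are all assumptions. "Fades away" is modeled as the running score falling more than `x_drop` below the best score seen so far.

```python
from collections import defaultdict

MATCH, MISMATCH = 8, -5  # weights borrowed from slide 4


def build_word_index(seq, k):
    """Step 1: hash table from each k-letter word to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_hits(a, b, k):
    """Step 2: scan b and report (pos_in_a, pos_in_b) for each shared word."""
    index = build_word_index(a, k)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]


def extend_hit(a, b, i, j, k, x_drop=10):
    """Step 3: extend a hit without gaps, to the right and then to the left."""
    best = score = k * MATCH
    end_a, end_b = i + k, j + k
    p, q = end_a, end_b
    while p < len(a) and q < len(b):          # extend to the right
        score += MATCH if a[p] == b[q] else MISMATCH
        p, q = p + 1, q + 1
        if score > best:
            best, end_a, end_b = score, p, q
        elif best - score > x_drop:
            break                             # the score has faded away
    score = best
    start_a, start_b = i, j
    p, q = i - 1, j - 1
    while p >= 0 and q >= 0:                  # extend to the left
        score += MATCH if a[p] == b[q] else MISMATCH
        if score > best:
            best, start_a, start_b = score, p, q
        elif best - score > x_drop:
            break
        p, q = p - 1, q - 1
    return best, (start_a, end_a), (start_b, end_b)


a, b = "ATTCTTGC", "ATCCTATTCTAGC"
hits = find_hits(a, b, 3)
print(hits)                          # [(0, 5), (1, 6), (2, 7)]
print(extend_hit(a, b, *hits[0], 3))  # (51, (0, 8), (5, 13))
```

With the sequences from slide 3, all three hits fall on the same diagonal, and extending the first one recovers the segment ATTCTTGC against ATTCTAGC (seven matches and one mismatch).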

  49. Multiple sequence alignment (MSA). • The multiple sequence alignment problem is to simultaneously align more than two sequences. Seq1: GCTC. Seq2: AC. Seq3: GATC. Alignment: GC-TC over A---C over G-ATC.
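A small sketch can check that the three rows above form a valid multiple alignment: every row must have the same length and must spell its original sequence once the gaps are removed. The helper names `is_valid_msa` and `columns` are illustrative and not from the lecture.

```python
def is_valid_msa(sequences, rows, gap="-"):
    """True if rows are equal-length gapped versions of the sequences."""
    if len(sequences) != len(rows):
        return False
    if len({len(row) for row in rows}) > 1:
        return False
    return all(row.replace(gap, "") == seq
               for seq, row in zip(sequences, rows))


def columns(rows):
    """Read the alignment column by column."""
    return ["".join(col) for col in zip(*rows)]


seqs = ["GCTC", "AC", "GATC"]
rows = ["GC-TC", "A---C", "G-ATC"]
print(is_valid_msa(seqs, rows))  # True
print(columns(rows))             # ['GAG', 'C--', '--A', 'T-T', 'CCC']
```

Reading by column shows what the alignment claims: the last column, CCC, places the final C of all three sequences at the same position.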

  50. Multiple sequence alignment MSA
