CZ5225: Modeling and Simulation in Biology Lecture 3: Sequence analysis methods Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore. Sequence Analysis Methods.
Gene and Protein Sequence Alignment as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap, marked by an arrow on the slide): ATTCTTGC against ATCCTATTCTAGC
Bad alignment (two gaps, marked by arrows on the slide): AT TCTT GC against ATCCTATTCTAGC
What is a good alignment?
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
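The scoring scheme above can be sketched in a few lines of Python. The names `w`, `alignment_score`, and `GAP` are this sketch's own choices, and treating a column of two gap symbols as a single gap penalty is an assumption the slides do not address. The example scores one alignment of the lecture's sequences CTTAACT and CGGATCAT.

```python
GAP = "-"

def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column under the match/mismatch/gap scheme."""
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch

def alignment_score(row_a, row_b):
    """Sum the column scores of a gapped alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))

# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```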
Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
  C---TTAACT
  CGGATCA--T
Labels on the slide: match, mismatch, insertion gap, deletion gap.
Alignment Graph
[Figure: grid with sequence b (C G G A T C A T) along the top and sequence a (CTTAACT) down the side; the alignment above is drawn as a path through the grid, with the insertion gap and the deletion gap labeled.]
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the alignment C---TTAACT / CGGATCA--T is traced through the graph step by step, one column at a time.]
Pathway of an alignment
[Figure: the complete path through the graph for this alignment.]
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
  CTTAACT-
  CGGATCAT
Pathway of an alignment
[Figure: the path through the graph for this second alignment.]
Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: several different paths through the graph, each read off as a different alignment of a and b.]
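Reading an alignment off a path can be sketched as follows. The move letters `D`, `H`, and `V` and the function name `path_to_alignment` are conventions invented for this sketch, assuming the layout of the slides with sequence b across the top and sequence a down the side.

```python
def path_to_alignment(a, b, moves):
    """Read an alignment off a path through the alignment graph.

    Move letters (this sketch's own convention):
      'D' diagonal   : align a[i] with b[j]
      'H' horizontal : consume b[j] only (gap symbol in the row of a)
      'V' vertical   : consume a[i] only (gap symbol in the row of b)
    """
    i = j = 0
    row_a, row_b = [], []
    for move in moves:
        if move == "D":
            row_a.append(a[i])
            row_b.append(b[j])
            i += 1
            j += 1
        elif move == "H":
            row_a.append("-")
            row_b.append(b[j])
            j += 1
        elif move == "V":
            row_a.append(a[i])
            row_b.append("-")
            i += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(a) or j != len(b):
        raise ValueError("path does not end at the bottom-right corner")
    return "".join(row_a), "".join(row_b)

print(path_to_alignment("CTTAACT", "CGGATCAT", "DHHHDDDVVD"))
# ('C---TTAACT', 'CGGATCA--T')
```

Every path from the top-left to the bottom-right corner yields one alignment, so choosing the best alignment amounts to choosing the best path.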
Which pathway is better?
Sequence a: CTTAACT
Sequence b: CGGATCAT
Multiple pathways are possible, each with its own alignment score.
Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the score is accumulated along a path through the graph, one column at a time.]
Alignment score: 6+8=14
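An optimal alignment: the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows:

$$
S_{i,j} = \max \begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j)
\end{cases}
$$

Computing Si,j
[Figure: cell (i, j) is reached from its diagonal, upper, and left neighbors by arrows labeled w(ai,bj), w(ai,-), and w(-,bj); the last cell, Sm,n, holds the score of an optimal alignment of A and B.]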
Initializations
Gap symbol: -3
• S0,0 = 0
• Row 0: S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
• Column 0: S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21
Match: 8 Mismatch: -5 Gap symbol: -3
S1,1 = ?
• Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
• Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
• Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
• Optimal: S1,1 = 8
S1,2 = ?
• Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
• Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
• Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
• Optimal: S1,2 = 5
S2,1 = ?
• Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
• Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
• Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
• Optimal: S2,1 = 5
S2,2 = ?
• Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
• Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
• Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
• Optimal: S2,2 = 3
S3,5 = ?
[Figure: the table is filled in cell by cell in the same way; the last cell, S7,8, holds the optimal score.]
  C T T A A C - T
  C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
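The table-filling procedure worked through above can be sketched in Python. This is a minimal illustration, not the lecturer's code: the function name `global_alignment`, the returned tuple, and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions of this sketch.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the table S for global alignment and trace back one optimal alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initializations: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    # Each cell is the best of the three options (diagonal, up, left).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Traceback from S[m][n] to S[0][0], preferring the diagonal on ties.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))

S, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(S[7][8])  # 14
print(row_a)    # CTTAAC-T
print(row_b)    # CGGATCAT
```

With these sequences the intermediate cells agree with the worked slides (S1,1 = 8, S1,2 = 5, S2,1 = 5, S2,2 = 3), and the cell asked about, S3,5, evaluates to 5.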
Local vs. Global Sequence Alignment
Example:
DNA sequence a: ATTCTTGC
DNA sequence b: ATCCTATTCTAGC
Local alignment (one gap, marked by an arrow on the slide): ATTCTTGC against ATCCTATTCTAGC. Gaps ignored in local alignments.
Global alignment (two gaps, marked by arrows on the slide): AT TCTT GC against ATCCTATTCTAGC. Gaps counted in global alignments.
Global Alignment vs. Local Alignment
• global alignment: All sections are counted
• local alignment: Only local sections (normally separated by gaps) are counted
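An optimal local alignment
• Si,j: the score of an optimal local alignment ending at ai and bj
• With proper initializations, Si,j can be computed as follows:

$$
S_{i,j} = \max \begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
0
\end{cases}
$$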
Match: 8 Mismatch: -5 Gap symbol: -3
Initializations
[Figure: table for a = CTTAACT (rows) and b = CGGATCAT (columns), with row 0 and column 0 set to 0.]
S1,1 = ?
• Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
• Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3
• Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3
• Option 4: S1,1 = 0
• Optimal: S1,1 = 8
local alignment
[Figure: the completed local alignment table.]
local alignment
  A - C - T
  A T C A T
The best score: 8 - 3 + 8 - 3 + 8 = 18
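The local variant differs from the global one only in the zero initialization, the extra option 0, and the traceback, which starts at the best cell and stops at the first 0. This scheme is commonly known as the Smith-Waterman algorithm. The sketch below is illustrative; the function name `local_alignment` and the tie-breaking order are assumptions, not taken from the slides.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, row_a, row_b) of one optimal local alignment."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap,
                          0)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))

print(local_alignment("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```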
BLAST: Basic Local Alignment Search Tool
Procedure:
• Divide all sequences into overlapping constituent words (size k)
• Build the hash table for Sequence a.
• Scan Sequence b for hits.
• Extend hits.
BLAST: Basic Local Alignment Search Tool
Step 1: Hash table for sequence a
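The first three steps of the procedure can be illustrated with a toy word index. This is a simplified sketch and not the actual BLAST implementation: it only finds exact word matches, and the function names and the word size k = 3 are assumptions chosen for the example.

```python
from collections import defaultdict

def build_word_index(seq, k):
    """Hash table mapping each overlapping word of size k to its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def find_hits(index, seq_b, k):
    """Scan seq_b word by word; a hit (i, j) is a word shared at a[i:i+k] and b[j:j+k]."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return hits

a = "ATTCTTGC"
b = "ATCCTATTCTAGC"
index = build_word_index(a, 3)
print(find_hits(index, b, 3))  # [(0, 5), (1, 6), (2, 7)]
```

All three hits lie on the same diagonal (j - i = 5), which is the kind of cluster the extension step then grows into a longer alignment.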
Amino acid similarity matrix: PAM 120
Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.
Amino acid similarity matrix: PAM 250
This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.
Amino acid similarity matrix: Blosum 45
The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, Blosum matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.
BLAST: Basic Local Alignment Search Tool
Step 2: Use all of the 2-letter words in the query sequence to scan against the database sequence, and mark those with score > 8.
Note: Marked points can be on the diagonal and off-diagonal.
Examples: LN:LN=9, NF:NY=8, GW:PW=10
BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits. Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
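One way to read "terminate if the score fades away" is an ungapped extension that stops once the running score has dropped more than a fixed amount below the best score seen so far. The sketch below makes that assumption explicit: the threshold `x_drop=10`, the function name `extend_hit`, and the reuse of the +8/-5 scores are illustrative choices, not values from the slides or from the BLAST source.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, x_drop=10):
    """Extend the exact word hit a[i:i+k] == b[j:j+k] without gaps.

    Returns (score, (a_start, a_end), (b_start, b_end)) with half-open ranges.
    """
    best = k * match  # score of the seed word itself

    def grow(p, q, step, best):
        """Walk in one direction; keep the farthest point that improved the score."""
        run, steps, kept = best, 0, 0
        while 0 <= p < len(a) and 0 <= q < len(b):
            run += match if a[p] == b[q] else mismatch
            steps += 1
            if run > best:
                best, kept = run, steps
            elif best - run > x_drop:
                break  # the score has faded too far below the best
            p += step
            q += step
        return best, kept

    best, right = grow(i + k, j + k, 1, best)
    best, left = grow(i - 1, j - 1, -1, best)
    return best, (i - left, i + k + right), (j - left, j + k + right)

print(extend_hit("ATTCTTGC", "ATCCTATTCTAGC", 0, 5, 3))
# (51, (0, 8), (5, 13))
```

Here the seed ATT grows across one mismatch (T against A) to cover ATTCTTGC against ATTCTAGC: seven matches and one mismatch give 7 × 8 - 5 = 51.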
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
Aligned:
  GC-TC
  A---C
  G-ATC