Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao
Bioinformatics and Computational Biology-Related Journals: • Bioinformatics (previously called CABIOS) • Bulletin of Mathematical Biology • Genome Research • Genomics • IEEE/ACM Transactions on Computational Biology and Bioinformatics • Journal of Bioinformatics and Computational Biology • Journal of Computational Biology • Journal of Molecular Biology • Nature • Nucleic Acids Research • Science
Bioinformatics and Computational Biology-Related Conferences: • Intelligent Systems for Molecular Biology (ISMB) • Pacific Symposium on Biocomputing (PSB) • The Annual International Conference on Research in Computational Molecular Biology (RECOMB) • Workshop on Algorithms in Bioinformatics (WABI) • The IEEE Computer Society Bioinformatics Conference (CSB)
Bioinformatics and Computational Biology-Related Books: • Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995) • Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995) • Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996) • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997) • Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000) • Introduction to Bioinformatics, by Arthur M. Lesk (2002)
Useful Websites • MIT Biology Hypertextbook: http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html • The International Society for Computational Biology: http://www.iscb.org/ • National Center for Biotechnology Information (NCBI, NIH): http://www.ncbi.nlm.nih.gov/ • European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/ • DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/
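Dot Matrix Sequence A: CTTAACT Sequence B: CGGATCAT [Figure: dot matrix with Sequence B (C G G A T C A T) labeling the columns and Sequence A (CTTAACT) labeling the rows; a dot marks each position where the two characters are identical.]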
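The dot matrix for the two example sequences can be sketched in a few lines of Python. The function name `dot_matrix` and the `*` / `.` rendering are choices made for this illustration, not part of the slides.

```python
def dot_matrix(a: str, b: str) -> list[str]:
    """Return one text row per character of `a`.

    Column j of row i is '*' when a[i] == b[j], and '.' otherwise.
    """
    return ["".join("*" if x == y else "." for y in b) for x in a]


A = "CTTAACT"   # Sequence A labels the rows
B = "CGGATCAT"  # Sequence B labels the columns

rows = dot_matrix(A, B)
print("  " + " ".join(B))
for ch, row in zip(A, rows):
    print(ch + " " + " ".join(row))
```

Diagonal runs of `*` in the printed grid correspond to stretches where the two sequences agree without gaps.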
Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B:
Sequence A: C---TTAACT
Sequence B: CGGATCA--T
Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B:
Sequence A: C---TTAACT
Sequence B: CGGATCA--T
[Figure: the same alignment with labels marking a match, a mismatch, a deletion gap, and an insertion gap.]
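Alignment Graph Sequence A: CTTAACT Sequence B: CGGATCAT [Figure: grid graph with Sequence B (C G G A T C A T) labeling the columns and Sequence A (CTTAACT) labeling the rows; the alignment C---TTAACT / CGGATCA--T corresponds to a path through the grid.]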
A simple scoring scheme • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
Sequence A: C - - - T T A A C T
Sequence B: C G G A T C A - - T
Alignment score: +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
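The column-by-column sum above can be checked with a short Python sketch. The function name `score_alignment` and its parameter names are assumptions made for this illustration; the default weights are the ones on the slide.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Sum the column scores of an alignment given as two equal-length rows.

    '-' denotes a gap symbol; every gap symbol is charged `gap`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("C---TTAACT", "CGGATCA--T"))  # 12
```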
An optimal alignment: the alignment of maximum score • Let A = a1a2…am and B = b1b2…bn. • Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj • With proper initializations, Si,j can be computed as follows: $S_{i,j} = \max\{\, S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$
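Computing Si,j [Figure: entry (i, j) of the matrix receives three incoming edges, weighted w(ai, bj), w(ai, -), and w(-, bj); the final entry Sm,n is the optimal alignment score.]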
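Initializations [Figure: matrix with Sequence B (C G G A T C A T) labeling the columns and Sequence A (CTTAACT) labeling the rows; the first row and first column are initialized.]
S3,5 = ? [Figure: the same matrix, partially filled, with entry S3,5 to be computed.]
S3,5 = 5 [Figure: the completed matrix, with the optimal score marked.]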
An optimal alignment:
Sequence A: C T T A A C - T
Sequence B: C G G A T C A T
Score: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14 [Figure: the corresponding path through the matrix.]
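The table-filling procedure for optimal global alignment can be sketched as follows. This is a minimal Python illustration under stated assumptions: the first row and column are initialized to multiples of the gap-symbol score, and the names `global_alignment`, `S`, `row_a`, and `row_b` are chosen for this example only.

```python
def global_alignment(a: str, b: str,
                     match: int = 8, mismatch: int = -5, gap: int = -3):
    """Fill the matrix S and trace back one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # a prefix of A against nothing
        S[i][0] = i * gap
    for j in range(1, n + 1):          # a prefix of B against nothing
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,   # aligned pair
                          S[i - 1][j] + gap,     # a_i against a gap
                          S[i][j - 1] + gap)     # b_j against a gap

    # Trace back from S[m][n]; ties are broken in favor of an aligned pair.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + w:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(S[3][5])   # 5
print(S[7][8])   # 14
print(row_a)
print(row_b)
```

Several alignments may share the optimal score, so the rows printed by the traceback depend on the tie-breaking order and need not match the alignment shown on the slide.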
Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal alignment?
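Initializations [Figure: matrix with Sequence B (G A A T C T G C) labeling the columns and Sequence A (CAATTGA) labeling the rows; the first row and first column are initialized.]
S4,2 = ? [Figure: the same matrix, partially filled, with entry S4,2 to be computed.]
S5,5 = ? [Figure: the same matrix, partially filled, with entry S5,5 to be computed.]
S5,5 = 14 [Figure: the completed matrix, with the optimal score marked.]
An optimal alignment:
Sequence A: C A A T - T G A
Sequence B: G A A T C T G C
Score: -5 +8 +8 +8 -3 +8 +8 -5 = 27 [Figure: the corresponding path through the matrix.]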
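Global Alignment vs. Local Alignment • global alignment: aligns the two sequences over their entire lengths • local alignment: aligns a substring of one sequence with a substring of the other [Figures: schematic of each kind of alignment.]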
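An optimal local alignment • Si,j: the score of an optimal local alignment ending at ai and bj • With proper initializations, Si,j can be computed as follows: $S_{i,j} = \max\{\, 0,\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$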
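Local alignment (Match: 8, Mismatch: -5, Gap symbol: -3) [Figure: local-alignment matrix with Sequence B (C G G A T C A T) labeling the columns and Sequence A (CTTAACT) labeling the rows.]
Local alignment (Match: 8, Mismatch: -5, Gap symbol: -3) [Figure: the same matrix with the best score marked.]
An optimal local alignment:
Sequence A: A - C - T
Sequence B: A T C A T
Score: 8 - 3 + 8 - 3 + 8 = 18 [Figure: the path ending at the best score.]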
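A minimal Python sketch of optimal local alignment, assuming the first row and column are initialized to 0 and that every entry is floored at 0 so an alignment may start anywhere. The name `local_alignment` and the tie-breaking order in the traceback are choices made for this example.

```python
def local_alignment(a: str, b: str,
                    match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return (best score, aligned row of a, aligned row of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:                 # remember where the best score ends
                best, bi, bj = S[i][j], i, j

    # Trace back from the best entry until a 0 entry is reached.
    row_a, row_b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, row_a, row_b = local_alignment("CTTAACT", "CGGATCAT")
print(best)    # 18
print(row_a)   # A-C-T
print(row_b)   # ATCAT
```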
Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal local alignment?
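Did you get it right? [Figure: local-alignment matrix with Sequence B (G A A T C T G C) labeling the columns and Sequence A (CAATTGA) labeling the rows.]
An optimal local alignment:
Sequence A: A A T - T G
Sequence B: A A T C T G
Score: 8 + 8 + 8 - 3 + 8 + 8 = 37 [Figure: the path ending at the best score.]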
Affine gap penalties • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-, x) = w(x, -) = -3) • Each gap is charged an extra gap-open penalty: -4.
Sequence A: C - - - T T A A C T
Sequence B: C G G A T C A - - T
Column scores: +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
Alignment score with two gaps, each charged -4: 12 - 4 - 4 = 4
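The gap-open charge can be added to a column-by-column scorer by detecting where each run of gap symbols begins. The Python sketch below is illustrative; `affine_alignment_score` and its parameter names are assumptions, and the default weights are those on the slide.

```python
def affine_alignment_score(row_a: str, row_b: str,
                           match: int = 8, mismatch: int = -5,
                           gap_symbol: int = -3, gap_open: int = -4) -> int:
    """Score an alignment, charging `gap_open` once per maximal run of gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    prev = None                       # kind of the previous column
    for x, y in zip(row_a, row_b):
        if x == "-":
            kind = "insertion"        # gap symbol in the row of A
        elif y == "-":
            kind = "deletion"         # gap symbol in the row of B
        else:
            kind = "pair"
        if kind == "pair":
            total += match if x == y else mismatch
        else:
            total += gap_symbol
            if kind != prev:          # a new gap starts in this column
                total += gap_open
        prev = kind
    return total


print(affine_alignment_score("C---TTAACT", "CGGATCA--T"))  # 4
```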
Affine gap penalties • A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty. • Three cases for alignment endings: 1. ...x / ...x (an aligned pair) 2. ...x / ...- (a deletion) 3. ...- / ...x (an insertion)
Affine gap penalties • Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. • Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. • Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
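Affine gap penalties (A gap of length k is penalized x + k·y.) • $D(i,j) = \max\{\, D(i-1,j) - y,\; S(i-1,j) - x - y \,\}$ • $I(i,j) = \max\{\, I(i,j-1) - y,\; S(i,j-1) - x - y \,\}$ • $S(i,j) = \max\{\, S(i-1,j-1) + w(a_i, b_j),\; D(i,j),\; I(i,j) \,\}$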
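Affine gap penalties [Figure: transitions among the D, I, and S matrices. Extending a gap within D or within I costs -y; opening a gap from S into D or I costs -x-y; an aligned pair adds w(ai, bj) to S.]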
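The three-matrix computation can be sketched in Python as below. This is an illustration under stated assumptions rather than the slides' own code: x and y are taken as positive penalties (x = 4, y = 3 match the earlier example), a leading gap of length k is initialized to -(x + k·y), and the name `affine_global_score` is made up for this example.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, x: int = 4, y: int = 3,
                        match: int = 8, mismatch: int = -5):
    """Optimal global alignment score when a gap of length k costs x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # a1..ai aligned against one leading gap
        D[i][0] = -x - i * y
        S[i][0] = D[i][0]
    for j in range(1, n + 1):          # b1..bj aligned against one leading gap
        I[0][j] = -x - j * y
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend an existing gap (-y) or open a new one (-x - y).
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


# With x = 0 the penalty is linear in the gap length, so the result
# agrees with the simple scoring scheme.
print(affine_global_score("CTTAACT", "CGGATCAT", x=0, y=3))  # 14
print(affine_global_score("CTTAACT", "CGGATCAT", x=4, y=3))
```

Keeping D and I separate from S is what lets the recurrence tell an extended gap from a newly opened one while still doing constant work per entry.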
Constant gap penalties • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: 0 (w(-, x) = w(x, -) = 0) • Each gap is charged a constant penalty: -4.
Sequence A: C - - - T T A A C T
Sequence B: C G G A T C A - - T
Column scores: +8 0 0 0 +8 -5 +8 0 0 +8 = +27
Alignment score with two gaps, each charged -4: 27 - 4 - 4 = 19
Constant gap penalties • Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. • Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. • Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
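The recurrences for this case are not preserved in the transcript. The Python sketch below assumes they take the usual form in which extending a gap is free and opening one costs the constant penalty; the name `constant_gap_global_score` and the parameter `gap_penalty` are hypothetical.

```python
NEG_INF = float("-inf")


def constant_gap_global_score(a: str, b: str, gap_penalty: int = 4,
                              match: int = 8, mismatch: int = -5):
    """Optimal global score when every gap costs `gap_penalty`, whatever its length."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):
        D[i][0] = S[i][0] = -gap_penalty             # one leading gap
    for j in range(1, n + 1):
        I[0][j] = S[0][j] = -gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extending a gap is free; opening one costs gap_penalty.
            D[i][j] = max(D[i - 1][j], S[i - 1][j] - gap_penalty)
            I[i][j] = max(I[i][j - 1], S[i][j - 1] - gap_penalty)
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


score = constant_gap_global_score("CTTAACT", "CGGATCAT")
print(score)  # at least 19, the score of the alignment shown on the slide
```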
Restricted affine gap penalties • A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c. • Five cases for alignment endings: 1. ...x / ...x (an aligned pair) 2. ...x / ...- (a deletion) 3. ...- / ...x (an insertion) 4. and 5. for long gaps
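The restricted penalty itself is a one-line function, since f(k) is simply min(k, c). In the Python sketch below the default values x = 4, y = 3, and c = 5 are chosen only for illustration; the slides do not fix c.

```python
def restricted_affine_penalty(k: int, x: int = 4, y: int = 3, c: int = 5) -> int:
    """Penalty x + f(k)*y for a gap of length k, with f(k) = k if k <= c, else c."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return x + min(k, c) * y


for k in (1, 2, 5, 9):
    print(k, restricted_affine_penalty(k))
```

Beyond length c the penalty stops growing, so long gaps cost the same as a gap of length c.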
D(i, j) vs. D’(i, j) • Case 1: if the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length <= c, then D(i, j) >= D’(i, j). • Case 2: if the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length >= c, then D(i, j) <= D’(i, j).
k best local alignments • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) • BLAST (Altschul et al., 1990; Altschul et al., 1997)
FASTA • Find runs of identities, and identify regions with the highest density of identities. • Re-score using a PAM matrix, and keep the top-scoring segments. • Eliminate segments that are unlikely to be part of the alignment. • Optimize the alignment in a band.
FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities. [Figure: dot plot of Sequence A against Sequence B.]
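One common way to locate runs of identities is to index the short words of one sequence and count word matches per diagonal. The Python sketch below is a simplified illustration of that idea, not the FASTA program's implementation; the word length `k = 2` and the name `diagonal_hits` are assumptions.

```python
from collections import Counter, defaultdict


def diagonal_hits(a: str, b: str, k: int = 2) -> Counter:
    """Count shared k-letter words on each diagonal d = i - j."""
    index = defaultdict(list)                 # word -> start positions in b
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    hits = Counter()
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):   # every occurrence of the word in b
            hits[i - j] += 1
    return hits


hits = diagonal_hits("CAATTGA", "GAATCTGC")
print(hits.most_common())
```

Diagonals with many hits are candidate regions with a high density of identities.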
FASTA Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.
FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.
FASTA Step 4: Optimize the alignment in a band.
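Restricting the dynamic program to a band around a diagonal keeps the work proportional to the band width times the sequence length. The Python sketch below is a simplified illustration: it computes a global score within a band centered on the main diagonal, using the simple scoring scheme from earlier in the deck. The name `banded_global_score` and the `width` parameter are assumptions, and the actual FASTA step differs in detail.

```python
NEG_INF = float("-inf")


def banded_global_score(a: str, b: str, width: int,
                        match: int = 8, mismatch: int = -5, gap: int = -3):
    """Best global alignment score using only entries with |i - j| <= width."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0
    for i in range(m + 1):
        lo, hi = max(0, i - width), min(n, i + width)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                w = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, S[i - 1][j - 1] + w)
            if i > 0:
                best = max(best, S[i - 1][j] + gap)   # -inf if outside the band
            if j > 0:
                best = max(best, S[i][j - 1] + gap)   # -inf if outside the band
            S[i][j] = best
    return S[m][n]


print(banded_global_score("CTTAACT", "CGGATCAT", width=1))  # 14
print(banded_global_score("CTTAACT", "CGGATCAT", width=0))  # -inf: the last entry lies outside the band
```

A band of width 1 already recovers the unrestricted optimum of 14 here, because an optimal path never strays more than one diagonal from the main one.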