Explore sequence alignments, dot matrix, and pairwise alignment concepts for bioinformatics in this chapter. Learn about global vs. local alignment, scoring schemes, and optimal alignment calculations using examples. Understand affine and constant gap penalties.
Alignments and Phylogenetic Tree
Reading: Arthur M. Lesk, Introduction to Bioinformatics, Fourth Edition, Chapter 5
Dot Matrix
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (CGGATCAT) along the top and Sequence A (CTTAACT) down the side]
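A minimal Python sketch of how such a dot matrix can be built, assuming a dot is placed wherever the two characters are identical (no window or threshold is applied). The names `dot_matrix` and `render` are illustrative and do not come from the slides.

```python
def dot_matrix(a, b):
    """Return a len(a) x len(b) grid; cell (i, j) is True when a[i] == b[j]."""
    return [[x == y for y in b] for x in a]


def render(a, b):
    """Draw the grid as text: B along the top, A down the side, '*' for a dot."""
    grid = dot_matrix(a, b)
    lines = ["  " + " ".join(b)]
    for ch, row in zip(a, grid):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("CTTAACT", "CGGATCAT"))
```

Diagonal runs of `*` in the printed grid correspond to stretches where the two sequences agree without gaps.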
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT (Sequence A)
CGGATCA--T (Sequence B)
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT
CGGATCA--T
Labels in the figure: Match (for example C over C), Mismatch (T over C), Insertion gap (the --- in the Sequence A row), Deletion gap (the -- in the Sequence B row)
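Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: alignment graph on the grid of Sequence B (columns) by Sequence A (rows), shown together with the alignment below]
C---TTAACT
CGGATCA--T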
A simple scoring scheme
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
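The column-by-column sum above can be checked with a short Python sketch. The constants follow the scoring scheme on this slide; the function names `w` and `alignment_score` are chosen for illustration.

```python
MATCH, MISMATCH, GAP = 8, -5, -3


def w(x, y):
    """Score one alignment column under the simple scheme."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```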
An optimal alignment: the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• S(i, j): the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, S(i, j) can be computed as follows.
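Computing S(i, j)
S(i, j) = max { S(i-1, j-1) + w(ai, bj), S(i-1, j) + w(ai, -), S(i, j-1) + w(-, bj) }
[Figure: cell (i, j) is reached by a diagonal edge of weight w(ai, bj), a vertical edge of weight w(ai, -), and a horizontal edge of weight w(-, bj); the optimal score is S(m, n)]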
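A minimal Python sketch of this dynamic program, assuming the simple scoring scheme (match +8, mismatch -5, gap symbol -3) and initializations in which row 0 and column 0 hold all-gap prefixes. The function name `global_align` and its return shape are illustrative, and the traceback returns one optimal alignment, which may differ from the one drawn on the slides when there are ties.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the table S and trace back one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # prefix of A against nothing
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):          # prefix of B against nothing
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,   # aligned pair
                          S[i - 1][j] + gap,        # a_i against a gap
                          S[i][j - 1] + gap)        # b_j against a gap

    # Trace back from (m, n), preferring the diagonal move on ties.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(S[7][8])  # 14
print(row_a)
print(row_b)
```

The table takes time and space proportional to m·n; only the score needs two rows, but the full table is kept here so that the traceback stays simple.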
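Initializations
[Figure: alignment table for B = CGGATCAT (columns) and A = CTTAACT (rows), with the first row and first column filled in]
S(3, 5) = ?
[Figure: the same table, partially filled]
S(3, 5) = 5
[Figure: the completed table, with the optimal score marked]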
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: the table for B = CGGATCAT (columns) and A = CTTAACT (rows) with this alignment]
Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal alignment?
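Initializations
[Figure: alignment table for B = GAATCTGC (columns) and A = CAATTGA (rows), with the first row and first column filled in]
S(4, 2) = ?
[Figure: the same table, partially filled]
S(5, 5) = ?
[Figure: the same table, partially filled]
S(5, 5) = 14
[Figure: the completed table, with the optimal score marked]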
C A A T - T G A
G A A T C T G C
-5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27
[Figure: the table for B = GAATCTGC (columns) and A = CAATTGA (rows) with this alignment]
Global Alignment vs. Local Alignment
• global alignment:
• local alignment:
An optimal local alignment
• S(i, j): the score of an optimal local alignment ending at ai and bj
• With proper initializations, S(i, j) can be computed as follows.
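A minimal Python sketch of the local-alignment computation, assuming the usual Smith-Waterman form: the same three choices as in the global case plus the option to restart at 0, with row 0 and column 0 initialized to 0. The scores (match 8, mismatch -5, gap symbol -3) follow the slides; the name `local_align` is illustrative.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,                         # start a new local alignment
                          S[i - 1][j - 1] + pair,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and S[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


score, seg_a, seg_b = local_align("CTTAACT", "CGGATCAT")
print(score)  # 18
print(seg_a)
print(seg_b)
```

Unlike the global case, the answer is read from the largest entry anywhere in the table rather than from the bottom-right corner.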
local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: local alignment table for B = CGGATCAT (columns) and A = CTTAACT (rows)]
local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: the same table with the best score marked]
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
[Figure: the same table with the best score and this alignment marked]
Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal local alignment?
Did you get it right?
[Figure: local alignment table for B = GAATCTGC (columns) and A = CAATTGA (rows)]
A A T - T G
A A T C T G
8 + 8 + 8 - 3 + 8 + 8 = 37
[Figure: the same table with this alignment marked]
Affine gap penalties
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
• Each gap is charged an extra gap-open penalty: -4.
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
The alignment has two gaps, each charged -4. Alignment score: 12 - 4 - 4 = 4
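The arithmetic on this slide can be reproduced with a short Python sketch that scores a given alignment and then charges the gap-open penalty once per gap. It assumes a gap is a maximal run of `-` within one row; the name `affine_alignment_score` and its parameters are illustrative.

```python
import re


def affine_alignment_score(row_a, row_b, match=8, mismatch=-5,
                           gap_symbol=-3, gap_open=-4):
    """Score two alignment rows: per-column scores plus one gap_open per gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap_symbol
        elif x == y:
            total += match
        else:
            total += mismatch
    # Each maximal run of '-' in either row counts as one gap.
    gaps = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    return total + gaps * gap_open


print(affine_alignment_score("C---TTAACT", "CGGATCA--T"))  # 4
```

Setting `gap_symbol=0` keeps only the per-gap charge, so a gap costs the same regardless of its length.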
Affine gap penalties
• A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
• Three cases for alignment endings:
1. ...x over ...x: an aligned pair
2. ...x over ...-: a deletion
3. ...- over ...x: an insertion
Affine gap penalties
• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.
• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.
• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
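Affine gap penalties (A gap of length k is penalized x + k·y.)
D(i, j) = max { D(i-1, j) - y, S(i-1, j) - x - y }
I(i, j) = max { I(i, j-1) - y, S(i, j-1) - x - y }
S(i, j) = max { S(i-1, j-1) + w(ai, bj), D(i, j), I(i, j) }
Affine gap penalties
[Figure: the D, I and S tables drawn as a graph; extending a gap within D or I costs -y, opening a gap from S costs -x - y, and the diagonal edge into S has weight w(ai, bj)]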
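A minimal Python sketch of the three-table computation for affine gap penalties, assuming the standard form in which a gap is either extended (cost y) or opened from S (cost x + y). The defaults x = 4 and y = 3 follow the earlier example; the name `affine_global` is illustrative and only the optimal score is returned.

```python
NEG_INF = float("-inf")


def affine_global(a, b, match=8, mismatch=-5, gap_open=4, gap_symbol=3):
    """Optimal global score when a gap of length k costs gap_open + k * gap_symbol."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # one deletion gap of length i
        D[i][0] = -gap_open - i * gap_symbol
        S[i][0] = D[i][0]
    for j in range(1, n + 1):          # one insertion gap of length j
        I[0][j] = -gap_open - j * gap_symbol
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - gap_symbol,              # extend the gap
                          S[i - 1][j] - gap_open - gap_symbol)   # open a new gap
            I[i][j] = max(I[i][j - 1] - gap_symbol,
                          S[i][j - 1] - gap_open - gap_symbol)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


print(affine_global("CTTAACT", "CGGATCAT"))  # 10
```

With `gap_open=0` the computation reduces to the simple scheme with a fixed cost per gap symbol, which gives a convenient consistency check.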
Constant gap penalties
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: 0 (w(-, x) = w(x, -) = 0)
• Each gap is charged a constant penalty: -4.
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8  0  0  0 +8 -5 +8  0  0 +8 = +27
The alignment has two gaps, each charged -4. Alignment score: 27 - 4 - 4 = 19
Constant gap penalties
• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.
• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.
• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
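A minimal Python sketch of how D, I and S might be filled under constant gap penalties. It assumes that extending an existing gap is free and that opening a gap costs the constant penalty (4 in the example above); the recurrences in the comments are this assumption written out, not taken from the slides, and the name `constant_gap_global` is illustrative.

```python
NEG_INF = float("-inf")


def constant_gap_global(a, b, match=8, mismatch=-5, gap_cost=4):
    """Optimal global score when every gap costs gap_cost regardless of length."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                # D(i, j) = max { D(i-1, j), S(i-1, j) - gap_cost }
                D[i][j] = max(D[i - 1][j], S[i - 1][j] - gap_cost)
            if j > 0:
                # I(i, j) = max { I(i, j-1), S(i, j-1) - gap_cost }
                I[i][j] = max(I[i][j - 1], S[i][j - 1] - gap_cost)
            diag = NEG_INF
            if i > 0 and j > 0:
                pair = match if a[i - 1] == b[j - 1] else mismatch
                diag = S[i - 1][j - 1] + pair
            S[i][j] = max(diag, D[i][j], I[i][j])
    return S[m][n]


print(constant_gap_global("ACGT", "AT"))  # 12: A/A, one gap of length 2, T/T
```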
Restricted affine gap penalties
• A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c.
• Five cases for alignment endings:
1. ...x over ...x: an aligned pair
2. ...x over ...-: a deletion
3. ...- over ...x: an insertion
4. and 5. for long gaps
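The penalty function itself is small enough to write out. In this Python sketch the defaults x = 4 and y = 3 are borrowed from the affine example, while the cutoff c = 3 is purely an assumed value for illustration.

```python
def restricted_affine_penalty(k, x=4, y=3, c=3):
    """Penalty x + f(k) * y, where f(k) = k up to the cutoff c and c beyond it."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return x + min(k, c) * y


for k in (1, 3, 10):
    print(k, restricted_affine_penalty(k))  # 7, 13, 13
```

Beyond length c the penalty stops growing, so long gaps are charged like a gap of length c.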
D(i, j) vs. D’(i, j)
• Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length <= c: D(i, j) >= D’(i, j)
• Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c: D(i, j) <= D’(i, j)
k best local alignments
• Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
• FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
• BLAST (Altschul et al., 1990; Altschul et al., 1997)
FASTA
• Find runs of identities, and identify regions with the highest density of identities.
• Re-score using PAM matrix, and keep top scoring segments.
• Eliminate segments that are unlikely to be part of the alignment.
• Optimize the alignment in a band.
FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: plot with axes Sequence B and Sequence A]
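A simplified Python sketch of the idea behind this step, not the FASTA program itself: index the short words (k-tuples) of Sequence A, look up each word of Sequence B, and count identical words per diagonal. The word length `ktup=2`, the 0-based positions, and the name `diagonal_hits` are assumptions made for illustration.

```python
from collections import defaultdict


def diagonal_hits(a, b, ktup=2):
    """Count identical k-tuples per diagonal d = (position in b) - (position in a)."""
    index = defaultdict(list)
    for i in range(len(a) - ktup + 1):
        index[a[i:i + ktup]].append(i)
    counts = defaultdict(int)
    for j in range(len(b) - ktup + 1):
        # .get avoids adding empty entries to the index for unseen words
        for i in index.get(b[j:j + ktup], ()):
            counts[j - i] += 1
    return dict(counts)


print(diagonal_hits("CAATTGA", "GAATCTGC"))
```

Diagonals with many hits are the candidate regions with a high density of identities; the later steps re-score and join them.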
FASTA Step 2: Re-score using PAM matrix, and keep top scoring segments.
FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.
FASTA Step 4: Optimize the alignment in a band.
BLAST
• Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
• The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
The maximal segment pair measure
• A maximal segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences. (for DNA: Identities: +5; Mismatches: -4)
• The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming.
• BLAST heuristically attempts to calculate the MSP score.
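One way to answer the "(How?)" is a gap-free dynamic program: the best segment pair ending at (ai, bj) either extends the best one ending at (ai-1, bj-1) or starts fresh. The Python sketch below assumes the DNA scores on this slide (+5, -4) and non-empty segments; the name `msp_score` is illustrative.

```python
def msp_score(a, b, identity=5, mismatch=-4):
    """Score of the highest-scoring pair of equal-length, ungapped segments."""
    best = None
    prev = [0] * (len(b) + 1)          # best scores ending in the previous row
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            s = identity if x == y else mismatch
            # Extend the segment pair on the same diagonal only if that helps.
            cur[j] = s + max(0, prev[j - 1])
            if best is None or cur[j] > best:
                best = cur[j]
        prev = cur
    return best


print(msp_score("CAATTGA", "GAATCTGC"))  # 15, from AAT over AAT
```

Every cell is visited once, so the running time is proportional to the product of the two lengths, which is still too slow when one of the sequences is an entire database.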
BLAST
• Build the hash table for Sequence A.
• Scan Sequence B for hits.
• Extend hits.
BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
For DNA sequences: Seq. A = AGATCGAT (positions 12345678)
AAA
AAC
..
AGA 1
..
ATC 3
..
CGA 5
..
GAT 2 6
..
TCG 4
..
TTT
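A minimal Python sketch of the DNA case, assuming exact 3-letter words and 1-based positions as in the table above. Only words that actually occur are stored, and the name `build_word_table` is illustrative. The protein case, where neighbouring words scoring at least T are also added, is not covered here.

```python
def build_word_table(seq, w=3):
    """Map each length-w word of seq to the list of its 1-based start positions."""
    table = {}
    for pos in range(len(seq) - w + 1):
        table.setdefault(seq[pos:pos + w], []).append(pos + 1)
    return table


table = build_word_table("AGATCGAT")
for word in sorted(table):
    print(word, *table[word])
```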
BLAST Step 2: Scan sequence B for hits.
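A minimal Python sketch of the scan for DNA, assuming exact word matches and 1-based positions. The query `TGATCC` used as Sequence B is a made-up example, and the function names are illustrative.

```python
def build_word_table(seq, w=3):
    """Map each length-w word of seq to the list of its 1-based start positions."""
    table = {}
    for pos in range(len(seq) - w + 1):
        table.setdefault(seq[pos:pos + w], []).append(pos + 1)
    return table


def scan_for_hits(table, seq_b, w=3):
    """Return (position in A, position in B) for every word of B found in the table."""
    hits = []
    for pos_b in range(len(seq_b) - w + 1):
        for pos_a in table.get(seq_b[pos_b:pos_b + w], []):
            hits.append((pos_a, pos_b + 1))
    return hits


table = build_word_table("AGATCGAT")
print(scan_for_hits(table, "TGATCC"))  # [(2, 2), (6, 2), (3, 3)]
```

Each hit is a seed; in the next step a seed is extended in both directions along its diagonal for as long as the score keeps improving.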