750 likes | 1.1k Views
COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA). Xingquan Zhu Dept. of Computer Science and Engineering. Outline. Dynamic Programming Dot Matrix Plot FASTA BLAST. Difficulty of PSA. Consider two sequences of length n Number of possible global alignments
E N D
COT 6930HPC and BioinformaticsPairwise Sequence Alignment (PSA) Xingquan Zhu Dept. of Computer Science and Engineering
Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST
Difficulty of PSA • Consider two sequences of length n • Number of possible global alignments • For n = 40, there are over 1023 possible alignments • Approaches • “Optimal” solution • Dynamic programming • Heuristic solutions • Dot matrix plot • FASTA • BLAST
Dynamic Programming (DP) • Alignment • Problem can be subdivided • Optimal alignment of n bases • 1. Incorporates optimal alignment of 1…n-1 bases • 2. Combine with best alignment for nth base • Bioinformatics application • Global alignment [Needleman-Wunsch 1970] • Local alignment [Smith-Waterman 1981] • Complexity • O(n2) or O(n × m) algorithm (sequences of length n, m) • Feasible for moderate sized sequences, not entire genomes
Dynamic Programming (DP) • Dynamic programming usually consists of three components. • Recursive relation • Tabular computation • Trace back • This efficient recursive method is used to search through all possible alignments and finding the one with the optimal score.
How DP is used for alignment? • An alignment is represented as a path through a matrix. To search through the matrix of all possible paths and find the optimal path DP is used.
Four possible outcomes in aligning two sequences [1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!)
Intuition of Dynamic Programming If we already have the optimal solution to: XY AB then we know the next pair of characters will either be: XYZ or XY- or XYZ ABCABC AB- (where “-” indicates a gap). So we can extend the match by determining which of these has the highest score.
Intuition of Dynamic Programming Considering sequences: “HEAGA” and “PA”
DP (Global Alignment) • Recursive formula • Notation • xi ith letter of string x • yj jth letter of string y • x1..i prefix of x from letters 1 through i • F matrix of optimal scores (DP Matrix) • F(i, j) represents optimal score lining up x 1..i with y1..j • d gap penalty • s scoring matrix
Constant vs. Non-constant Gap Penalty • Constant gap penalty • Non-constant gap penalty
DP (Global Alignment) • Algorithm • Initialize: F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) • Fill from top left to bottom right using recursive formula
DP Example • Sequences • Seq1: HEAGAWGHEE • Seq2: PAWHEAE • Gap Penalty: -8 (Constant) • Scoring Matrix (Blosum 50)
DP Example Algorithm Initialize F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50
DP Example Algorithm Initialize F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50
DP Example (Traceback) • Trace arrows from bottom right to top left • Diagonal – both match • Up – left sequence match a gap • Or insert a gap to top sequence • Left – top sequence match a gap • Or insert a gap to left sequence
Global alignment vs. local alignment • Global alignment: the entire sequence of each protein or DNA sequence is contained in the alignment. • Local alignment: only regions of greatest similarity between two sequences are aligned percent identity: ~26% RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Global vs. local alignment • Global alignment are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. • Global methods are useful when you want to force two sequences to align over their entire length • Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.
DP (local alignment) • Make 0 minimal score (i.e., start new alignment) • Alignment can start / end anywhere • Start at highest score(s) • End when 0 reached
DP (local alignment) Blosum 50
DP (Local Alignment) • Traceback • Start at highest score and trace arrows back to first 0
Multiple, Repeat, & Overlap Matches • Multiple alignments • “Optimal” global alignment may be best by small margin • Can report several high-scoring alignments • Repeat matches • Allow sections to match repeatedly • Find repeated patterns (domain / motif) • Overlap matches • Matching when the two sequences overlap • Does not penalize overhanging ends
Overlap Alignment Consider the following problem: • Find the most significant overlap between two sequences S,T ? • Possible overlap relations: a. b. Difference from local alignment: require alignment between the endpoints of the two sequences.
Overlap Alignment Formally: given S[1..n] , T[1..m] find i,j such that: d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) } is maximal. Solution: Same asGlobal alignment except we don’t not penalise overhanging ends.
Overlap Alignment • Initialization:F(i,0)=0,F(0,j)=0 • Recurrence:as in global alignment • Score:maximum value at the bottom line and rightmost line
Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5
Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5
Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme: • Match: +4 • Mismatch: -1 • Indel: -5
Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5 -2 Overlap Alignment (Example) The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE-
Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST
Dot Matrix Plot • Method • Put two sequences on vertical and horizontal axes of graph • Put dots wherever there is a match • Diagonal line is region of identity (local alignment) • Filter non-diagonals (minimal length, identity threshold)
The Dotplot is a Table for overview of the similarities between two sequences The palindromic sequence: MAXISTAYAWAYATSIXAM The stretches of similar residues show up as diagonals The palindromic sequence is very important part of DNA sequence. Many enzyme recognize these sequences
Dot Matrix Plot Example • Chimpanzee DNA aligned with itself • Filters needed to eliminate random matches filtered 6 of 10 filtered 8 of 10
Dot Matrix Plot • Advantage • Extremely simple • Can use scoring matrix • Can use multiple filters to rank matches • Example • Red > 50 bases, 95% identical (very close) • Blue > 25 bases, 75% identical (close) • Green > 10 bases, 50% identical (distant) • Limitations • O(n2) algorithm • Relies on visual analysis • Difficult to find “best” alignment • Despite multiple filters, difficult to compare alignments
Dot Matrix Plot • Demo • http://www.vivo.colostate.edu/molkit/dnadot/
Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST
G A A T T C A G T T A G 1 1 1 1 1 1 1 1 1 1 1 G 1 1 1 1 1 1 1 2 2 2 2 A 1 2 2 2 2 2 2 2 2 2 2 T 1 2 2 3 3 3 3 3 3 3 3 C 1 2 2 3 3 4 4 4 4 4 4 G 1 2 2 3 3 4 4 5 5 5 5 A 1 2 3 3 3 4 5 5 5 5 6 FASTA - Idea - • Problem of Dynamic Programming DP compute the score in a lot of useless area for optimal sequence • FASTA focuses on diagonal area
FASTA - Heuristic - • Heuristic Good local alignment should have some exact match subsequence. FASTA focus on this area
FASTA High Level Description Let q be a query max 0 For each sequence, s in DB compare q with s and compute a score, y if max < y max y; bestSequence s ; Return bestSequence
FASTA - Algorithm - • Step 1 Find all hot-spots // Hot spots is pairs of words of length k that exactly match k=4-6 for DNA, k=1-2 for Amino acid sequences Sequence 1 Hot Spots Sequence 2
G A A T T C A G T T A G * * Q Location G * * A 2,3,7,11 A * * * * C 6 T * * * * G 1,8 C * G * * T 4,5,9,10 A * * * * FASTA - Algorithm - • Step 1 in detail Use look-up Table Query : G A A T T C A G T T A Sequence: G G A T C G A Dot—Matrix Look-up Table
FASTA - Algorithm - • Step 2 Score the Hot-spot and locate the ten best diagonal run. // There is some scoring system; ex. PAM250
FASTA - Algorithm - • Step 3 Combine sub-alignments into one alignment with GAP GAP One of local alignment
FASTA - Algorithm - • Step 4 # Consider weighted direct graph. # Let node be a sub-alignment found in step 1 # Let u and v be nodes # Edge (u,v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) # Find the maximum weight path Sub-sequence Edge One Sequence
One of Sequence FASTA - Algorithm - GAP • Step 4 in detail Sub-alignment Gap -5 -3 -3 Max Weight Path
FASTA - Algorithm - • Step 5 Use the dynamic programming in restricted area around the best-score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter
FASTA - Algorithm - • Summary of Algorithm 1: Find all hot-spots // Hot spots is pairs of words of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment 4: Score Each alignment with gap penalty and pick up the best-score alignment 5: Use the dynamic programming in restricted area around the best-score alignment to find out the alignment greater than the best-score alignment.
FASTA - Complexity - • Complexity # Step 1 and 2 // select the best 10 diagonal run Let n be a sequence from DB O(n+m) because Step 1 just uses look up the table O(n+m) << O(mn) m,n = 100 to 200