430 likes | 613 Views
Traceback and local alignment. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com. Outline. Responses from last class Sequence alignment Motivation Scoring alignments Python.
E N D
Traceback and local alignment Prof. William Stafford Noble Department of Genome SciencesDepartment of Computer Science and Engineering University of Washington thabangh@gmail.com
Outline • Responses from last class • Sequence alignment • Motivation • Scoring alignments • Python
One-minute responses • Thank you for doing the one-minute responses and the revisions. • Liked that you let us solve the DP. • Liked going to lab in second half of lecture. • More slowly with Python programming. • Please explain more about string handling in Python, especially sys.argv. • sys.argv is now more clear. • Need more practical problems using sys. • I know how to compute the DP score but not how to get the alignment. • I still do not get how sequence alignment works. • Can we go over DP again? • Please don’t give us a test because our essay phase has begun already. • Please limit the number of questions in class. • Moving too slow with previous work and too fast with current work. • I understood about 95% of the lecture. • Can you give us some Python documentation or some interesting web site to improve and learn?
Revision • What two things are needed to score a pairwise alignment? • A substitution matrix and a gap penalty. • What does entry (i,j) in the DP matrix store? • The score of the best-scoring alignment up to those positions. • What are the three valid moves when filling in the DP matrix? • Horizontal and vertical, corresponding to gaps. Diagonal, corresponding to a substitution.
A small example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
Traceback • Start from the lower right corner and trace back to the upper left. • Each arrow introduces one character at the end of each aligned sequence. • A horizontal move puts a gap in the left sequence. • A vertical move puts a gap in the top sequence. • A diagonal move uses one character from each sequence.
A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. • Start from the lower right corner and trace back to the upper left. • Each arrow introduces one character at the end of each aligned sequence. • A horizontal move puts a gap in the left sequence. • A vertical move puts a gap in the top sequence. • A diagonal move uses one character from each sequence.
A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. • Start from the lower right corner and trace back to the upper left. • Each arrow introduces one character at the end of each aligned sequence. • A horizontal move puts a gap in the left sequence. • A vertical move puts a gap in the top sequence. • A diagonal move uses one character from each sequence. AAG- AAG- -AGC A-GC
GA-ATC CATA-C DP matrix
GAAT-C CA-TAC DP matrix
GAAT-C C-ATAC DP matrix
GAAT-C -CATAC DP matrix
Multiple solutions • When a program returns a sequence alignment, it may not be the only best alignment. GA-ATC CATA-C GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC
Traceback problem #1 Write down the alignment corresponding to the circled score.
GA CA Solution #1 Write down the alignment corresponding to the circled score.
Traceback problem #2 Write down three alignments corresponding to the circled score.
Solution #2 GAATC CA--- Write down three alignments corresponding to the circled score.
Solution #2 GAATC C-A-- GAATC CA--- Write down three alignments corresponding to the circled score.
Solution #2 GAATC -CA-- GAATC C-A-- GAATC CA--- Write down three alignments corresponding to the circled score.
Local alignment • A protein may be homologous to a region within a second protein. • Usually, an alignment that spans the complete length of both sequences is not required.
BLAST allows local alignments Global alignment Local alignment
Global alignment DP • Align sequence x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
Local alignment DP • Align sequence x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0 2 -5 -5 0 0
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0
Local alignment • Two differences with respect to global alignment: • No score is negative. • Traceback begins at the highest score in the matrix and continues until you reach 0. • Global alignment algorithm: Needleman-Wunsch. • Local alignment algorithm: Smith-Waterman.
A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. 0 AG AG
Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. 0
Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. 0
Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. AAG AAG 0
Summary • Local alignment finds the best match between subsequences. • Smith-Waterman local alignment algorithm: • No score is negative. • Trace back from the largest score in the matrix.
Sample problem #1 • Given: • Two letters • A substitution matrix written as three columns (letter, letter, value) A C 0 A D -2 • Return: • The value associated with the two letters • You must store the substitution matrix in memory. • You must account for the fact that the order of the letters doesn’t matter.
Solution outline • Read the substitution matrix into a dictionary. • Each line of the file has three values. A C 4 A D -2 • Each line is stored in a dictionary with the first two entries as a tuple key and the last entry as the value. substitionMatrix[(letter1, letter2)] = value
Sample problem #2 • Given: • A file containing two sequences of equal length (one per line) • A substitution matrix written as three columns (letter, letter, value) • Return: • The score of the ungapped alignment between the sequences • Test using input files from class web page. • Solutions: 1 = 69, 2 = 104, 3 = 153
Solution outline • Store the substitution matrix in memory, as before. • Check to be sure the sequences are the same length, and report an error if they are not. • Use a for loop to traverse both sequences at once.
One-minute response At the end of each class • Write for about one minute. • Provide feedback about the class. • Was part of the lecture unclear? • What did you like about the class? • Do you have unanswered questions? I will begin the next class by responding to the one-minute responses