450 likes | 556 Views
Pairwise sequence comparison. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com. One-minute responses. Need more clarification on the alignment of sequences.
E N D
Pairwise sequence comparison Prof. William Stafford Noble Department of Genome SciencesDepartment of Computer Science and Engineering University of Washington thabangh@gmail.com
One-minute responses • Need more clarification on the alignment of sequences. • More precise definitions, included on slides. • Python programming very useful. • A lot to learn at once. • I did not get the Python part, but it’s easy when I try it myself. • Please write on the board sometimes. • I liked the part about evolutionary theory. • I understood about 50% of the lecture. • You speak too fast. • I did not understand Moore’s law. • I liked the biology video and the analogy with Moore’s law. • The Python code is different from what we are used to. • Give a practical example of BLAST usage. • More explanation on sys.argv. • I like the summary of the previous lecture. • Sometimes you talk a long time before asking for questions. • Biology of cells and genomes wasn’t clear enough. • Take time explaining the biology concepts. • The last part of the lecture was not clear. • Give more examples. • The class could go a bit slower on difficult concepts. • It is a good habit to accept comments. • We may forget what you are saying because we are not taking notes. • I didn’t understand the mechanism by which we can physically read the DNA sequence. • This topic is outside the scope of this class to cover. If you’d like to read about this, check out “Overview of DNA sequencing strategies.”
Other questions • How many assignments will we do? • Will there be a test? • Is there a lot of mathematics here, or just Python and biology? • Are theoretical understanding of genetics and biochemistry vital for studying bioinformatics? • Are we going to write this feedback every day? • How can I make a dictionary from a .txt file? • When you compare two protein sequences, how do you guess where the first codon begins? • What is evolution? Is it dogma?
Outline • Responses from last class • Sequence alignment • Motivation • Scoring alignments • Python
Revision • What is the difference between the DNA and RNA alphabets? • In RNA, “T” is changed to “U.” • What is a codon, and why is it significant? • A set of three adjacent RNA nucleotides. One codon codes for a single amino acid. • What is the universal genetic code? • A set of rules for translating codons into amino acids. • What is the purpose of aligning two DNA or protein sequences? • To infer common ancestry, function or structure.
Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need • a method for scoring alignments, and • an algorithm for finding the alignment with the best score. • The alignment score is calculated using • a substitution matrix, and • gap penalties. • The algorithm for finding the best alignment is dynamic programming.
A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC.
Scoring alignments GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • We need a way to measure the quality of a candidate alignment. • Alignment scores consist of two parts: a substitution matrix, and a gap penalty. GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C
Scoring aligned bases A hypothetical substitution matrix: GAATC | | CATAC -5 + 10 + -5 + -5 + 10 = 5
BLOSUM62 A R N D C Q E G H I L K M FPST W Y V BZX A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
Scoring gaps • Linear gap penalty: every gap character receives a score of d. • Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e. GAAT-C d=-4 CA-TAC -5 + 10 + -4 + 10 + -4 + 10 = 17 G--AATC d=-4 CATA--C e=-1 -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC. • Use a linear gap penalty of -4. • Use the following substitution matrix:
How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • How many different alignments of two sequences of length n exist? GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C Too many to enumerate!
-G- CAT DP matrix The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -8
-G-A CAT- DP matrix Moving horizontally in the matrix introduces a gap in the sequence along the left edge.
-G-- CATA DP matrix Moving vertically in the matrix introduces a gap in the sequence along the top edge.
G - Introducing a gap
- C DP matrix
G C DP matrix
Three legal moves • A diagonal move aligns a character from the left sequence with a character from the top sequence. • A vertical move introduces a gap in the sequence along the top edge. • A horizontal move introduces a gap in the sequence along the left edge.
----- CATAC DP matrix
-G CA G- CA --G CA- DP matrix -4 -9 -12 0 -4 -4
DP matrix Find the optimal alignment, and its score.
DP in equation form • Align sequence x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
Summary • Scoring a pairwise alignment requires a substition matrix and gap penalties. • Dynamic programming is an efficient algorithm for finding the optimal alignment. • Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. • DP iteratively fills in the matrix using a simple mathematical rule.
A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
sys.argv • sys.argv is a list containing the strings given on the command line • Write a program that adds up all of the numbers on the command line. > add-numbers.py 1 2 3 6
sys.stdout versus sys.stderr • sys.stdoutand sys.stderrare two different streams to print to • Use sys.stdout for the primary output of your program. • Use sys.stderr to report errors or give status updates.
write() versus print() • The write() function differs from print() • write() does not automatically add an end-of-line. • write() requires that you specify the name of the file or stream to be written to. • write() only accepts a single string.
% • The “%” operator substitutes values into a string based on the presence of format strings. • %s = string • %d = integer • %g = float • Place “%” between a string and a tuple of values. >>> "%s %s %s" % ("larry", "curly", "moe") 'larry curly moe’ >>> "%d + %d" % (21, 15) '21 + 15'
Using sys.stderr • Write a program to divide one number by a second number. > divide.py 8 3 2.66667 • Print the usage message if exactly two numbers are not given. > ./divide.py USAGE: divide.py<value1> <value2> Divide <value1> by <value2> and report the result. • Stop and print an error if the second number is zero. > divide.py 8 0 Divide by zero error.
Using write() and % • Write a program that prints the command line arguments with “+” signs between, and then reports their sum. > add-numbers2.py 1 2 3 1 + 2 + 3 = 6
One-minute response At the end of each class • Write for about one minute. • Provide feedback about the class. • Was part of the lecture unclear? • What did you like about the class? • Do you have unanswered questions? • Sign your name I will begin the next class by responding to the one-minute responses