1 / 45

Pairwise sequence comparison

Pairwise sequence comparison. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com. One-minute responses. Need more clarification on the alignment of sequences.

dena
Download Presentation

Pairwise sequence comparison

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pairwise sequence comparison Prof. William Stafford Noble Department of Genome SciencesDepartment of Computer Science and Engineering University of Washington thabangh@gmail.com

  2. One-minute responses • Need more clarification on the alignment of sequences. • More precise definitions, included on slides. • Python programming very useful. • A lot to learn at once. • I did not get the Python part, but it’s easy when I try it myself. • Please write on the board sometimes. • I liked the part about evolutionary theory. • I understood about 50% of the lecture. • You speak too fast. • I did not understand Moore’s law. • I liked the biology video and the analogy with Moore’s law. • The Python code is different from what we are used to. • Give a practical example of BLAST usage. • More explanation on sys.argv. • I like the summary of the previous lecture. • Sometimes you talk a long time before asking for questions. • Biology of cells and genomes wasn’t clear enough. • Take time explaining the biology concepts. • The last part of the lecture was not clear. • Give more examples. • The class could go a bit slower on difficult concepts. • It is a good habit to accept comments. • We may forget what you are saying because we are not taking notes. • I didn’t understand the mechanism by which we can physically read the DNA sequence. • This topic is outside the scope of this class to cover. If you’d like to read about this, check out “Overview of DNA sequencing strategies.”

  3. Other questions • How many assignments will we do? • Will there be a test? • Is there a lot of mathematics here, or just Python and biology? • Are theoretical understanding of genetics and biochemistry vital for studying bioinformatics? • Are we going to write this feedback every day? • How can I make a dictionary from a .txt file? • When you compare two protein sequences, how do you guess where the first codon begins? • What is evolution? Is it dogma?

  4. Outline • Responses from last class • Sequence alignment • Motivation • Scoring alignments • Python

  5. Revision • What is the difference between the DNA and RNA alphabets? • In RNA, “T” is changed to “U.” • What is a codon, and why is it significant? • A set of three adjacent RNA nucleotides. One codon codes for a single amino acid. • What is the universal genetic code? • A set of rules for translating codons into amino acids. • What is the purpose of aligning two DNA or protein sequences? • To infer common ancestry, function or structure.

  6. Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need • a method for scoring alignments, and • an algorithm for finding the alignment with the best score. • The alignment score is calculated using • a substitution matrix, and • gap penalties. • The algorithm for finding the best alignment is dynamic programming.

  7. A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC.

  8. Scoring alignments GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • We need a way to measure the quality of a candidate alignment. • Alignment scores consist of two parts: a substitution matrix, and a gap penalty. GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C

  9. Scoring aligned bases A hypothetical substitution matrix: GAATC | | CATAC -5 + 10 + -5 + -5 + 10 = 5

  10. BLOSUM62 A R N D C Q E G H I L K M FPST W Y V BZX A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1

  11. Scoring gaps • Linear gap penalty: every gap character receives a score of d. • Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e. GAAT-C d=-4 CA-TAC -5 + 10 + -4 + 10 + -4 + 10 = 17 G--AATC d=-4 CATA--C e=-1 -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5

  12. A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC. • Use a linear gap penalty of -4. • Use the following substitution matrix:

  13. How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • How many different alignments of two sequences of length n exist? GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C Too many to enumerate!

  14. -G- CAT DP matrix The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -8

  15. -G-A CAT- DP matrix Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

  16. -G-- CATA DP matrix Moving vertically in the matrix introduces a gap in the sequence along the top edge.

  17. Initialization

  18. G - Introducing a gap

  19. - C DP matrix

  20. DP matrix

  21. G C DP matrix

  22. Three legal moves • A diagonal move aligns a character from the left sequence with a character from the top sequence. • A vertical move introduces a gap in the sequence along the top edge. • A horizontal move introduces a gap in the sequence along the left edge.

  23. ----- CATAC DP matrix

  24. DP matrix

  25. -G CA G- CA --G CA- DP matrix -4 -9 -12 0 -4 -4

  26. DP matrix

  27. DP matrix

  28. DP matrix

  29. DP matrix

  30. DP matrix Find the optimal alignment, and its score.

  31. DP matrix

  32. DP in equation form • Align sequence x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

  33. DP in equation form

  34. Summary • Scoring a pairwise alignment requires a substition matrix and gap penalties. • Dynamic programming is an efficient algorithm for finding the optimal alignment. • Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. • DP iteratively fills in the matrix using a simple mathematical rule.

  35. A simple example Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

  36. Some useful Python tidbits

  37. sys.argv • sys.argv is a list containing the strings given on the command line • Write a program that adds up all of the numbers on the command line. > add-numbers.py 1 2 3 6

  38. sys.stdout versus sys.stderr • sys.stdoutand sys.stderrare two different streams to print to • Use sys.stdout for the primary output of your program. • Use sys.stderr to report errors or give status updates.

  39. write() versus print() • The write() function differs from print() • write() does not automatically add an end-of-line. • write() requires that you specify the name of the file or stream to be written to. • write() only accepts a single string.

  40. % • The “%” operator substitutes values into a string based on the presence of format strings. • %s = string • %d = integer • %g = float • Place “%” between a string and a tuple of values. >>> "%s %s %s" % ("larry", "curly", "moe") 'larry curly moe’ >>> "%d + %d" % (21, 15) '21 + 15'

  41. Using sys.stderr • Write a program to divide one number by a second number. > divide.py 8 3 2.66667 • Print the usage message if exactly two numbers are not given. > ./divide.py USAGE: divide.py<value1> <value2> Divide <value1> by <value2> and report the result. • Stop and print an error if the second number is zero. > divide.py 8 0 Divide by zero error.

  42. Using write() and % • Write a program that prints the command line arguments with “+” signs between, and then reports their sum. > add-numbers2.py 1 2 3 1 + 2 + 3 = 6

  43. One-minute response At the end of each class • Write for about one minute. • Provide feedback about the class. • Was part of the lecture unclear? • What did you like about the class? • Do you have unanswered questions? • Sign your name I will begin the next class by responding to the one-minute responses

More Related