390 likes | 400 Views
Learn the basics of dynamic programming for sequence comparison in genomics and statistics, including alignment scoring. Dive into solving alignment problems and understanding substitution matrices.
E N D
Lecture 2A Thursday, January 10, 2008 Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble
One-minute responses • Lecture on sequence comparison was quick enough to remind us of the basic concepts. • It was nice that you checked in with a show of hands where we all were in terms of experience. • When doing global sequence alignment, how do you move or assign the “window” to calculate the score? • I only referred to the “window” because it would take too long to calculate the score in class for the whole alignment. In general, for a global alignment (which is what we’re talking about so far), you score the whole alignment, not just a window. • I’m not sure what this response should be, but I found the lecture informative and I think this class will be perfect for my needs. • Coming into this class, I feel somewhat intimidated, but this was a good introduction. The problems were challenging, but I liked being given time to try and figure it out without help and play around with python. • The lecture in general was a great introduction to the material. • I do have lots of experience programming, but I’m not worried about being bored in the class; I just figure it will give me more time to focus on the genomics and statistics. • I like your “What’s the motivation?” approach.
One-minute responses • The python intro was a the right level for a completely inexperienced user. • It was helpful to have experience with Python on the first day of class. • The pace is good for the first class. • You are a computer scientist, and I definitely am not. Possibly after reading the text a bit I will be OK. I know that for you, you are going slow, but I am not familiar with the language of programming at all. Hopefully, the class will continue slowly. • I suspect that the class will quickly become challenging, so I should just enjoy my first python program. • That was a reasonable explanation and the sample problems were entertaining. • Would be helpful if you defined “arg” and “sys” more fully. • My programs got the desired output by may have used slightly different syntax. • I enjoyed the class. It was good for me and really allowed me to try things, fail, and ask questions. It also taught me how to find mistakes, make manipulations and make things work.
Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need • ___________________, and • ___________________. • The alignment score is calculated using • ___________________, and • ___________________. • The algorithm for finding the best alignment is __________________.
Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need • a method for scoring alignments, and • an algorithm for finding the alignment with the best score. • The alignment score is calculated using • a substitution matrix, and • gap penalties. • The algorithm for finding the best alignment is dynamic programming.
Review • What does the 62 in BLOSUM62 mean? • Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? • What is the difference between a gap open and a gap extension penalty?
Review • What does the 62 in BLOSUM62 mean? • The sequences in the alignment used to generate the matrix share 62% pairwise sequence identity. • Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? • Leucine and isoleucine are biochemically very similar; consequently, a substitution of one for the other is much more likely to occur. • What is the difference between a gap open and a gap extension penalty? • When we assign a score to an observed gap in an alignment, we charge a larger penalty for the first gapped position than for subsequent positions in the same gap.
A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC. • Use a linear gap penalty of -4. • Use the following substitution matrix:
How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • How many different alignments of two sequences of length N exist? GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C
How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC • How many different alignments of two sequences of length n exist? GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C Too many to enumerate!
-G- CAT DP matrix The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -8
-G-A CAT- DP matrix Moving horizontally in the matrix introduces a gap in the sequence along the left edge.
-G-- CATA DP matrix Moving vertically in the matrix introduces a gap in the sequence along the top edge.
G - Introducing a gap
- C DP matrix
G C DP matrix
----- CATAC DP matrix
-G CA G- CA --G CA- DP matrix -4 -9 -12 0 -4 -4
DP matrix What is the alignment associated with this entry?
DP matrix -G-A CATA
DP matrix Find the optimal alignment, and its score.
GA-ATC CATA-C DP matrix
GAAT-C CA-TAC DP matrix
GAAT-C C-ATAC DP matrix
GAAT-C -CATAC DP matrix
Multiple solutions • When a program returns a sequence alignment, it may not be the only best alignment. GA-ATC CATA-C GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC
DP in equation form • Align sequence x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
Dynamic programming • Yes, it’s a weird name. • DP is closely related to recursion and to mathematical induction. • We can prove that the resulting score is optimal.
Summary • Scoring a pairwise alignment requires a substition matrix and gap penalties. • Dynamic programming is an efficient algorithm for finding the optimal alignment. • Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. • DP iteratively fills in the matrix using a simple mathematical rule.
Reading • Chapters 3 and 6 of Bioinformatics: Sequence and Genome Analysis by David Mount, 2nd edition.
Problem: find the best pairwise alignment of GAATC and CATAC. d = -4