Alignment methods
• April 21, 2009
• Quiz 1-April 23 (JAM lectures through today)
• Writing assignment topic due Tues, April 23
• Hand in homework #3
• Why has HbS stayed in the population?
• Learning objectives-Understand difference between global alignment and local alignment. Understand the Needleman-Wunsch algorithm. Understand the Smith-Waterman algorithm in global alignment mode.
• Workshop-Perform alignment of two nucleotide sequences
• Homework #4 due Tues, April 23
Evolutionary Basis of Sequence Alignment
Why are there regions of identity when comparing protein sequences?
1) Conserved function: amino acid residues participate in the reaction.
2) Structural (for example, conserved cysteine residues that form a disulfide linkage).
3) Historical: residues that are conserved solely due to a common ancestor gene.
Identity Matrix
Simplest type of scoring matrix:

    A  1
    C  0  1
    I  0  0  1
    L  0  0  0  1
       A  C  I  L
Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
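The identity matrix and the "something in between" question can be sketched in a few lines. This is a minimal illustration in Python (a choice made here, not something the slides specify). The function names are hypothetical, and the 0.5 given to the isoleucine/leucine pair is a placeholder picked only to show the idea; it is not a value from the lecture or from any published matrix.

```python
# Residues shown on the identity-matrix slide.
RESIDUES = ["A", "C", "I", "L"]


def identity_score(x, y):
    """Identity scoring: 1 if the residues are identical, 0 otherwise."""
    return 1 if x == y else 0


def lower_triangle(residues, score):
    """Build the lower-triangular matrix rows in the slide's layout."""
    return [
        [score(residues[r], residues[c]) for c in range(r + 1)]
        for r in range(len(residues))
    ]


# HYPOTHETICAL partial credit for similar residues (placeholder value).
SIMILAR_PAIRS = {frozenset(("I", "L")): 0.5}


def similarity_score(x, y):
    """Identity score, except that listed similar pairs get partial credit."""
    if x == y:
        return 1
    return SIMILAR_PAIRS.get(frozenset((x, y)), 0)
```

Calling `lower_triangle(RESIDUES, identity_score)` reproduces the triangle on the slide; swapping in `similarity_score` changes only the I/L cell.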
[Figure: alignment of two trypsin sequences]
One sequence is mouse trypsin and the other is crayfish trypsin. They are homologous proteins, and the sequences share 41% identity.
Evolutionary Basis of Sequence Alignment (Cont. 2)
Note: it is possible for two proteins to share a high degree of similarity but have different functions. For example, human gamma-crystallin is a lens protein that has no known enzymatic activity, yet it shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor, but their functions diverged. This is analogous to a railroad car and a diner: both have the same form but different functions.
Global Alignment Method
For example, the two hypothetical sequences

    abcdefghajklm
    abbdhijk

could be aligned like this

    abcdefghajklm
    || |   | ||
    abbd...hijk

As shown, there are 6 matches, 2 mismatches, and one gap of length 3.
Global Alignment Method Scored
The alignment is scored according to a payoff matrix

    $payoff = {match      => $match,
               mismatch   => $mismatch,
               gap_open   => $gap_open,
               gap_extend => $gap_extend};

For correct operation, the match value must be positive and the other payoff entries must be negative.
Global Alignment Method (cont. 3)
• Example
• Given the payoff matrix

    $payoff = {match      => 4,
               mismatch   => -3,
               gap_open   => -2,
               gap_extend => -1};
Global Alignment Method (cont. 4)
The sequences

    abcdefghajklm
    abbdhijk

are aligned and scored like this

                  a  b  c  d  e  f  g  h  a  j  k  l  m
                  |  |     |           |     |  |
                  a  b  b  d  .  .  .  h  i  j  k
    match         4  4     4           4     4  4
    mismatch            -3                -3
    gap_open                  -2
    gap_extend                -1 -1 -1

for a total score of 24 - 6 - 2 - 3 = 13.
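The arithmetic above can be checked with a short scoring routine. This is a sketch in Python (chosen here for illustration; the slides give the payoff matrix as a Perl hash), and `score_alignment` is a hypothetical helper name. It makes two assumptions that reproduce the slide's total: a gap costs gap_open once plus gap_extend for every gap position, and the unaligned overhang at the end of the longer sequence ("lm") is not penalized.

```python
# Payoff values copied from the slide; the dict mirrors the Perl hash.
PAYOFF = {"match": 4, "mismatch": -3, "gap_open": -2, "gap_extend": -1}


def score_alignment(top, bottom, payoff, gap_char="."):
    """Score an already-aligned pair of strings column by column.

    Assumptions (not stated on the slide, but they reproduce its total):
    - each run of gap characters pays gap_open once and gap_extend
      once per gap position;
    - zip() stops at the shorter string, so a trailing overhang is free;
    - adjacent gap runs in either sequence are treated as a single gap.
    """
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            if not in_gap:
                total += payoff["gap_open"]
                in_gap = True
            total += payoff["gap_extend"]
        else:
            in_gap = False
            total += payoff["match"] if x == y else payoff["mismatch"]
    return total


if __name__ == "__main__":
    print(score_alignment("abcdefghajklm", "abbd...hijk", PAYOFF))  # 13
```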
Global Alignment Method (cont. 5) The algorithm should guarantee that no other alignment of these two sequences has a higher score under this payoff matrix.
Let’s align the following with a simple payoff matrix:

    ABCNJRQCLCRPM and AJCJNRCKCRBP

Where
    match = 1
    mismatch = 0
    gap = 0
    gap extension = 0

Alignment A
    Sequence 1:  ABCNJ-RQCLCR-PM
    Sequence 2:  AJC-JNR-CKCRBP-
    Score:       101010101011010
    Total Score: 8

Alignment B
    Sequence 1:  ABC-NJRQCLCR-PM
    Sequence 2:  AJCJN-R-CKCRBP-
    Score:       101010101011010
    Total Score: 8
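The per-column score strings for Alignment A and Alignment B can be reproduced with a short check. This is a minimal Python sketch; `column_scores` is a hypothetical helper, and because gap and gap extension are both 0 here, it simply gives every gap column the single `gap` value.

```python
def column_scores(seq1, seq2, match=1, mismatch=0, gap=0, gap_char="-"):
    """Return the score of each column of an existing alignment."""
    scores = []
    for x, y in zip(seq1, seq2):
        if x == gap_char or y == gap_char:
            scores.append(gap)       # one sequence has a gap in this column
        elif x == y:
            scores.append(match)     # identical letters
        else:
            scores.append(mismatch)  # different letters
    return scores


ALIGNMENT_A = ("ABCNJ-RQCLCR-PM", "AJC-JNR-CKCRBP-")
ALIGNMENT_B = ("ABC-NJRQCLCR-PM", "AJCJN-R-CKCRBP-")

if __name__ == "__main__":
    for name, (s1, s2) in (("A", ALIGNMENT_A), ("B", ALIGNMENT_B)):
        scores = column_scores(s1, s2)
        print(name, "".join(str(v) for v in scores), sum(scores))
```

Both alignments give the same column string and the same total, which is why a scoring scheme this simple cannot choose between them.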
Three steps in Dynamic Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment
Matrix Fill (entire matrix)

    Sequence 1:  ABC-NJRQCLCR-PM
    Sequence 2:  AJCJN-R-CKCRBP-
    Score:       101010101011010
    Total Score: 8

    Sequence 1:  ABCNJ-RQCLCR-PM
    Sequence 2:  AJC-JNR-CKCRBP-
    Score:       101010101011010
    Total Score: 8
Smith-Waterman algorithm

$$
M_{i,j} = \max\left\{
\begin{array}{ll}
M_{i-1,j-1} + s_{i,j} & \text{(match or mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)} \\
0 &
\end{array}
\right\}
$$

Where M_{i-1,j-1} is the value in the cell diagonally adjacent to M_{i,j} (the i-1, j-1 cell is up and to the left of M_{i,j}).
Where s_{i,j} is the value for the match or mismatch in the M_{i,j} cell.
Where M_{i,j-1} is the value in the cell above M_{i,j}.
Where w is the value for the gap penalty.
Where M_{i-1,j} is the value in the cell to the left of M_{i,j}.
These directions imply that i counts columns (letters of sequence #1) and j counts rows (letters of sequence #2).
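The recurrence with the 0 floor can be sketched as a matrix fill. This is an illustrative Python version, not the lecture's own code: `local_fill` is a hypothetical name, the matrix is stored as a list of rows (so the slide's M(i, j) is `M[row][col]` with col = i and row = j), and the example values match = 4, mismatch = -3, w = -2 are assumptions borrowed loosely from the earlier payoff example.

```python
def local_fill(seq1, seq2, match, mismatch, w):
    """Fill a scoring matrix using the recurrence with a floor of 0.

    seq1 runs along the columns, seq2 along the rows.
    Returns the filled matrix and the highest cell value.
    """
    n_cols, n_rows = len(seq1) + 1, len(seq2) + 1
    M = [[0] * n_cols for _ in range(n_rows)]
    best = 0
    for row in range(1, n_rows):
        for col in range(1, n_cols):
            s = match if seq1[col - 1] == seq2[row - 1] else mismatch
            M[row][col] = max(
                M[row - 1][col - 1] + s,  # diagonal: match or mismatch
                M[row - 1][col] + w,      # from the cell above: gap in sequence #1
                M[row][col - 1] + w,      # from the cell to the left: gap in sequence #2
                0,                        # floor: a cell never goes negative
            )
            best = max(best, M[row][col])
    return M, best


if __name__ == "__main__":
    matrix, best = local_fill("ACGT", "CG", match=4, mismatch=-3, w=-2)
    for line in matrix:
        print(line)
    print("best local score:", best)
```

The floor is what lets a good stretch start anywhere in the matrix: a run of mismatches resets to 0 instead of dragging later cells negative.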
Initialization step:
Create a matrix with M + 1 columns and N + 1 rows, where M = number of letters in sequence 1 and N = number of letters in sequence 2. The first column (M-1) and first row (N-1) are filled with 0’s.
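A minimal Python sketch of this initialization, assuming the matrix is stored as a list of rows (`init_matrix` is a hypothetical name). Every cell starts at 0, so the first row and first column already hold the required zeros; the remaining cells are overwritten during the matrix fill.

```python
def init_matrix(seq1, seq2):
    """Create an (N + 1)-row by (M + 1)-column matrix of zeros.

    M = len(seq1) (columns), N = len(seq2) (rows).
    """
    n_cols = len(seq1) + 1
    n_rows = len(seq2) + 1
    # Build each row separately so that rows do not share storage.
    return [[0] * n_cols for _ in range(n_rows)]


if __name__ == "__main__":
    matrix = init_matrix("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
    print(len(matrix), "rows x", len(matrix[0]), "columns")
```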
Matrix fill step:
Each position M_{i,j} is defined to be the MAXIMUM score at position i,j

$$
M_{i,j} = \max\left\{
\begin{array}{ll}
M_{i-1,j-1} + s_{i,j} & \text{(match or mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right\}
$$

[Slide labels: row, column]
    Sequence 1:  ABCNJ-RQCLCR-PM
    Sequence 2:  AJC-JNR-CKCRBP-
    Score:       8
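The three steps (initialization, matrix fill, traceback) can be put together in one short sketch. This is an illustrative Python version under stated assumptions, not the lecture's implementation: `global_align` is a hypothetical name, a single gap value w stands in for gap and gap extension, and the first row and column are zero-filled as on the initialization slide. With the simple payoff (match = 1, mismatch = 0, gap = 0) the final cell reaches 8. The traceback returns one optimal alignment; since several alignments tie at 8, it need not be character-for-character the one printed above.

```python
def global_align(seq1, seq2, match=1, mismatch=0, w=0, gap_char="-"):
    """Fill the matrix without a 0 floor, then trace back one best path.

    seq1 runs along the columns, seq2 along the rows.
    Note: the zero-filled first row/column (as on the slide) means leading
    gaps are free; that is exact here because w = 0.
    """
    n_cols, n_rows = len(seq1) + 1, len(seq2) + 1
    M = [[0] * n_cols for _ in range(n_rows)]  # step 1: initialization

    def s(row, col):
        return match if seq1[col - 1] == seq2[row - 1] else mismatch

    for row in range(1, n_rows):               # step 2: matrix fill
        for col in range(1, n_cols):
            M[row][col] = max(
                M[row - 1][col - 1] + s(row, col),  # diagonal
                M[row - 1][col] + w,                # gap in sequence #1
                M[row][col - 1] + w,                # gap in sequence #2
            )

    out1, out2 = [], []                        # step 3: traceback
    row, col = n_rows - 1, n_cols - 1
    while row > 0 and col > 0:
        if M[row][col] == M[row - 1][col - 1] + s(row, col):
            out1.append(seq1[col - 1]); out2.append(seq2[row - 1])
            row, col = row - 1, col - 1
        elif M[row][col] == M[row - 1][col] + w:
            out1.append(gap_char); out2.append(seq2[row - 1])
            row -= 1
        else:
            out1.append(seq1[col - 1]); out2.append(gap_char)
            col -= 1
    while col > 0:                             # leftover letters of seq1
        out1.append(seq1[col - 1]); out2.append(gap_char); col -= 1
    while row > 0:                             # leftover letters of seq2
        out1.append(gap_char); out2.append(seq2[row - 1]); row -= 1

    return "".join(reversed(out1)), "".join(reversed(out2)), M[-1][-1]


if __name__ == "__main__":
    a1, a2, score = global_align("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
    print(a1)
    print(a2)
    print("Score:", score)
```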