490 likes | 846 Views
Stat 115 Lecture 3. Pairwise Sequence Alignment and Scoring Matrices. Xiaole Shirley Liu And Jun Liu. Mol Bio Quick Facts. Building block of DNA is deoxyribonucleic acid Building block of protein is amino acid Protein is a peptide, a long peptide
E N D
Stat 115 Lecture 3 Pairwise Sequence Alignmentand Scoring Matrices Xiaole Shirley Liu And Jun Liu
Mol Bio Quick Facts • Building block of DNA is deoxyribonucleic acid • Building block of protein is amino acid • Protein is a peptide, a long peptide • Only exons can code for functional proteins or RNAs • Introns are spliced out STAT 115
Outline • Motivation and introduction • Dynamic programming • Global sequence alignment • Needleman-Wunsch • 3 steps: Initialize, fill matrix, trace back • Gap penalties • Local sequence alignment, Smith-Waterman • Scoring matrices • PAM • BLOSUM STAT 115
gap match mismatch Pairwise Sequence Alignment • Given: two sequences, scoring for match/mismatch/gap • Goal: find pairing of letters in the two sequences that optimize the total score This is a hard example. That is another easy example. This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. STAT 115
Align Biological Sequences • DNA (4 nt + gap) TTGATCAC TTTA-CAC • Protein (20 aa + gap) RKVA--GMAKPNM RKIAVAAASKPAV • Sometimes > 4 nt for DNA and > 20 aa for proteins • A word on IUPAC STAT 115
A adenosine C cytidine G guanine T thymidine U uridine R G A (purine) Y T C (pyrimidine) K G T (keto) M A C (amino) S G C (strong) W A T (weak) B C G T (not A) D A G T (not C) H A C T (not G) V A C G (not T) N A C G T (any) – gap IUPAC for DNA STAT 115
A Ala B Asp or Asn C Cys D Asp E Glu F Phe G Gly H His I Iso K Lys L Leu M Met N Asn P Pro G Gln R Arg S Ser T Thr U Sel V Val W Try Y Tyr Z Glu or Gln X Any * Translation stop – Gap IUPAC for Protein STAT 115
Why Align Two Sequences • If two sequences are similar, they might share the same ancestor • If two sequences are similar, they may share the same structure, therefore similar function • In genome sequencing assembly, if two sequences have overlapping similar regions, they might be connected to represent longer sequenced region. STAT 115
Scoring Schemes • Match, mismatch, gap score determines the final alignment This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. • Match OK, mismatch costly, gap cheap. This is a-- h-ard---- example. Th--at is anothe-r easy example. • Match cheap, mismatch cheap, gap costly. This is a hard example.------ That is another easy example. STAT 115
Dot Matrix Approach • Naïve algorithm • Dot – match, find diagonal lines • Can’t afford more complex scoring • Visual analysis, hard to find optimal alignments STAT 115
Dynamic Programming • Essence of dynamic programming: • Store the sub-problem solutions for later use • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Earliest method, Needleman & Wunsch 1969 • Still the best (sensitive and optimal) algorithm for pair-wise alignment STAT 115
Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two Best previous alignment i j STAT 115
Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Repeat the process until reaching the two sequences’ ends New best alignment = Best previous alignment + align (i,j) i j STAT 115
Dynamic Programming Steps • Initialize NxM matrix • N, M are the length of the two sequences • Fill the matrix • Each element record the current best score, and pointer to the previous best alignment • Always search the previous column and row for the best previous alignment • Trace back to obtain optimal alignment STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A G G C STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G G C STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G C STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 STAT 115
Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)} A A T G C 0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 Trace back AATGC -AGGC AATGC AG-GC STAT 115
F(i-1,j-1) F(i,j-1) F(i-1,j) F(i,j) Alignment Recursion (linear gap penalty) C A T T G 0 -1 -2 -3 -4 -5 T C ATG -1 0 -1 0 -2 1 0 -1 E.g., -3 3 2 1 0 5 -4 -1 3 4 4 5 6 -5 -2 2 STAT 115
F(i,j) Affine Gap Penalty • Gap penalty function: • Typically a>b (e.g., a=12; b=2) • An order O(nm) algorithm. Key: keep 3 functions, each recording a directional optimum. STAT 115
Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G C g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115
Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G 0 0.4 1 2 1.4 C 0 0.3 0.4 1 3 g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115
End Gaps • Should we penalize gaps at the ends? ATCCGCATACCGGA --CCGCATAC---- • If two sequences similar length and entire sequences are supposed to be similar, penalize. • If two sequences have very different length, do not penalize (most of the time, ignore end gap penalties) STAT 115
Global vs. Local Alignment • Global: Needleman-Wunsch • Find best alignment for the whole 2 sequences • Could have no penalty for mismatches/gaps • Trace back from lower right corner to upper left corner • Local: Smith-Waterman • Find high scoring subsequences • E.g. Two proteins only share one similar functional domain • Can be achieved by modifying global alignment method STAT 115
Local Alignment Modifications • Use negative mismatch and gap penalties • The minimum score for [i, j] is 0 • If S[i,j] < 0, rewrite it to 0, point to self • If previous col/row is all 0, S[i,j] point to self • The best score is sought anywhere in the matrix • Not just last column and last row (should keep a global pointer to the best score) • Trace back until a cell pointing to itself (not necessary to the beginning of the two sequences) STAT 115
Matrix Filling in Smith-Waterman max {k < j} S(i,j-k) + g(k) S(i,j) = max S(i-1,j-1) + m(i,j) max {l < i} S(i-l,j) + g(l) 0 S(i,j-2) g(2) S(i-1,j-1) S(i,j) S(i-3,j) g(3) STAT 115
Example STAT 115
Finding suboptimal alignments STAT 115
Smith-Waterman • Negative mis-match & negative gaps • Scoring matrix >= 0 • Trace from maximum STAT 115
Scoring Matrices • For DNA, match + 5, mismatch – 4 • For proteins, different amino acid pairs receive different scores • Consideration: size, shape, electric charge, van der Waals interaction, ability to form salt, hydrophobic, and hydrogen bonds • Substitution matrices • Often symmetrical • + / – scores: functional similarity STAT 115
PAM Matrices • MO Dayhoff 1978 • PAM: percent accepted mutations • Database of 1572 changes in 71 groups of closely related proteins (> 85% similar) • Construct phylogenetic tree of each group, tabulate probability of amino acid changes between every pairs of aa • For statistician: Markov chain transition b/w aa pairs STAT 115
PAM Matrix Family • PAM-N • PAM-0: 1 on diagonal and 0 all the rest • PAM-N: what would happen if N out of 100 aa mutate • For statisticians: matrix multiplication N times • Bigger N, more diverged substitution matrice • Final matrix STAT 115
PAM250 • 250 substitutions in 100 residues, only 1/5 residue remain unchanged STAT 115
BLOSUM Matrices • BLOcks amino acid SUbstitution Matrices • Henikoff and Henikoff, PNAS. 1992, 89:10915-9 • Check >500 protein families in the Prosite database (Bairoch 1991) • Find ~2000 blocks of aligned segments • BLOCKS database • Characterize ungapped patterns 3-60 aa long • Check aligned columns for observed substitutions STAT 115
BLOSUM Matrix Entry • How frequently do aa appear • How often do we expect to see i, j together • How often do we actually see them together in all the alignments • BLOSUM entry STAT 115
BLOSUM62 Matrix STAT 115
BLOSUM Matrices • Blocks are grouped before looking at aa substitutions • BLOSUMN: if sequences > N% identical, their contributions are weighted to sum to 1 • Most widely used: BLOSUM62 and PAM250 STAT 115
More About Dynamic Programming • Example 1: Suppose I have x0 amount of savings at retirement, and also receive stamount of social security payment every year. Annual interest rate is qt, what is an optimal spending plan if I want to leave zero dollars at the end (say, year 5)? Maximizing total spending? STAT 115
Example 2: Secretary Problem • We get to observe the “qualities” of m secretaries: X1,…, Xm sequentially according to a random order. Our goal is to maximize the probability of finding the “best” candidate with no looking back! • Heuristic: We start our reasoning backwards. Suppose we have seen X1,…, Xj, should we stop or go on? STAT 115
Let’s start reasoning ... • What if we wait till the last person? • What if we wait till having two people left? • Strategy: if m-1st person is better than previous ones, take her; otherwise wait till the last one. • Get a recursion? If we let go j-1 people, and take the best-person-so-far starting from jth person ... Pj maximized at STAT 115
Final “answer” • We should reject the first 37% of the candidates and start to look: recruit the first person who is the best among all that have been interviewed. • The chance of getting the best one is ~37% ! STAT 115
Summary • Dynamic programming finds optimal alignment between two sequences • Keep subproblem solution for later use • Needleman & Wunsch, global • Smith & Waterman, local • Scoring sensitive to gap penalty and substitution matrices used • Substitution matrices capture aa similarity • PAM and BLOSUM matrices STAT 115
Question for Thoughts • Given a string of integers (both positive and negative) • E.g. 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6 • Can you read each number only ONCE, and tell from which number to which number you will get the largest sum? • 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6, -2 • Largest sum = 12 • Hint: dynamic programming STAT 115
Acknowledgement • Russ Altman • Theodor Hanekamp • Eric Rouchka STAT 115