BCB 444/544 Lecture 6
Finish Dynamic Programming
Scoring Matrices
Alignment Statistics
#6_Aug31
BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4 Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5 Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909 http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6 Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49

Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
Changes? Mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM (or sooner!)
Send via email to Pete Zaback, petez@iastate.edu (HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods
• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• Dynamic Programming Method - cont
• Gap penalties
• DP for Global Alignment
• DP for Local Alignment
• Scoring Matrices
• Amino acid scoring matrices
• PAM
• BLOSUM
• Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
• An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
• Homology is qualitative
• Sequence similarity:
• The direct result of observation from a sequence alignment
• Similarity is quantitative; can be described using percentages

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins: 20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S): e.g. Match: 1 Mismatch: 1 Gap: 0
S = (#matches) - (#mismatches) - (#gaps)
Note: I changed symbols & colors on this slide!

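The scoring function on this slide can be tried out on the nonsense alignment shown above. The following is a minimal sketch, not part of the lecture: the function name `score_alignment` and its keyword arguments are assumptions, and the weights are passed as positive magnitudes that are subtracted for mismatches and gaps.

```python
def score_alignment(row_a, row_b, match=1, mismatch=1, gap=1):
    """Score two equal-length gapped rows column by column.

    match is added per identical pair; mismatch and gap are positive
    magnitudes that are subtracted (hypothetical parameter names).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total -= gap        # a column containing a gap
        elif a == b:
            total += match      # identical residues
        else:
            total -= mismatch   # two different residues
    return total


gappy = "s--e-----qu---en--ce"
target = "sometimesquipsentice"

# 8 matches, 0 mismatches, 12 gaps
print(score_alignment(gappy, target))         # -4 when each gap costs 1
print(score_alignment(gappy, target, gap=0))  # 8 when gaps are free
```

With free gaps the nonsense alignment scores as well as 8 genuine matches would; charging for gaps is what pushes its score below zero.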
Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar), e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

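As a rough illustration of s(a,b), a substitution matrix can be stored as a symmetric lookup table. The sketch below is an assumption-laden toy: the six scores are illustrative values chosen only to reflect the Ser/Thr vs Trp/Ala contrast from the previous slide, not entries copied from a published matrix, and `SCORES` and `s` are hypothetical names.

```python
# Illustrative scores only (assumed values, not a published matrix).
SCORES = {
    ("S", "S"): 4, ("T", "T"): 5, ("W", "W"): 11, ("A", "A"): 4,
    ("S", "T"): 1,    # Ser/Thr: similar residues, mildly rewarded
    ("A", "W"): -3,   # Ala/Trp: dissimilar residues, penalized
}


def s(a, b):
    """Score for aligning residue a with residue b; s(a,b) == s(b,a)."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


print(s("T", "S"))  # same value as s("S", "T")
print(s("W", "A"))  # lower than the Ser/Thr score
```

Storing only one triangle of the table and checking both key orders keeps the matrix symmetric by construction.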
Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Local alignment vs Global alignment:
CTGTCGCTGCACG
-TG-C-C-G--TG

CTGTCGCTGCACG
-TGCCG-TG----

CTGTCGCTGCACG
-TGCCG-T----G
Which is better?

Global vs Local Alignment
Which should be used when? It is critical to choose correct method!
Global Alignment vs Local Alignment? Shout out the answers!!
Which should we use for:
• Searching for conserved motifs in DNA or protein sequences?
• Aligning two closely related sequences with similar lengths?
• Aligning highly divergent sequences?
• Generating an extended alignment of closely related sequences?
• Generating an extended alignment of closely related sequences with very different lengths? Hmmm - we'll work on that
Excellent!

Alignment Algorithms
3 major methods for pairwise sequence alignment:
• Dot matrix analysis
• Dynamic programming
• Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
[Figure: dot plot of two short DNA sequences; residue labels C G G A C A C A C G]
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)

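The dot matrix construction described on this slide can be sketched in a few lines. This is a minimal text-only version using the "identical match" rule; the function name `dot_plot` is an assumption, and the two sequences are the ones used in the dynamic programming example later in this lecture.

```python
def dot_plot(top, left):
    """Return one text row per residue of `left`.

    '*' marks a position where the residue of `left` is identical to
    the residue of `top` in that column; '.' marks a non-match.
    """
    return ["".join("*" if a == b else "." for b in top) for a in left]


top_seq = "CTCGCAGC"
left_seq = "CATTCAC"

print("  " + top_seq)
for residue, row in zip(left_seq, dot_plot(top_seq, left_seq)):
    print(residue, row)
```

Runs of `*` along a diagonal correspond to the "diagonal lines" the slide refers to; short sequences over a 4-letter alphabet also produce many isolated chance dots.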
Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
• e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Exploring Dot Plots

Dynamic Programming
For pairwise sequence alignment
C A T - T C A - C
|   |     | |   |
C - T C G C A G C
Idea: Display one sequence above another with spaces inserted in both to reveal similarity

Global Alignment: Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Score = (reward for matches × w) - (mismatch penalty × x) - (space/gap penalty × y)
w = #matches
x = #mismatches
y = #spaces
Note: I changed symbols & colors on this slide!

Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5

 C   T   G   T   C   G   -   C   T   G   C
 -   T   G   C   -   C   G   -   T   G   -
-5  10  10  -2  -5  -2  -5  -5  10  10  -5

Total = 11
We could have done better!!
Note: I changed symbols & colors on this slide!

Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
• Gap penalty functions
• Scoring matrices

Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to: the score of the best alignment ending just previous to those two positions (i.e., ending at i-1, j-1) PLUS the score for aligning xi and yj

Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N, e.g., x = AGC, N = 3
• Y = y1y2…yM of length M, e.g., y = AAAC, M = 4
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
e.g., S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]

Dynamic Programming - 4 Steps:
• Define score of optimum alignment, using recursion
• Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
• Calculate score of optimum alignment(s)
• Trace back through matrix to recover optimum alignment(s) that generated optimal score

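1- Define Score of Optimum Alignment using Recursion
Define: $S(i,j)$ = score of the best alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$
Initial conditions: $S(0,0) = 0$, $S(i,0) = -i\gamma$, $S(0,j) = -j\gamma$
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \{\, S(i-1,j-1) + \sigma(x_i, y_j),\ \ S(i-1,j) - \gamma,\ \ S(i,j-1) - \gamma \,\}$$
where $\sigma(x_i, y_j)$ is the score for aligning $x_i$ with $y_j$ and $\gamma$ is the gap penalty.
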
2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix:
[Figure: DP matrix indexed 0, 1, ... N along one axis and 0, 1, ... M along the other. Initialization: S(0,0) = 0 in the upper left corner. Recursion: cell S(i,j) is computed from its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1). S(N,M) is the lower right corner.]

2- cont: Fill in DP Matrix
[Figure: same DP matrix, from S(0,0) = 0 in the upper left corner to S(N,M) in the lower right corner, with cell S(i,j) and its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1) highlighted]
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).

3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
What happens in the last step in the alignment of x[1..i] to y[1..j]? 1 of 3 cases applies:

Case 1: xi aligns to yj
x1 x2 . . . xi-1 xi
y1 y2 . . . yj-1 yj
Score: $S(i-1,j-1) + \sigma(x_i, y_j)$

Case 2: xi aligns to a gap
x1 x2 . . . xi-1 xi
y1 y2 . . . yj   -
Score: $S(i-1,j) - \gamma$

Case 3: yj aligns to a gap
x1 x2 . . . xi   -
y1 y2 . . . yj-1 yj
Score: $S(i,j-1) - \gamma$

Here $\sigma(x_i, y_j)$ is the score for aligning xi with yj and $\gamma$ is the gap penalty.

Example
Case 1: Line up xi with yj (last column pairs xi with yj)
x: C A T T C A C
y: C - T T C A G
Case 2: Line up xi with space (last column pairs xi with a space)
x: C A T T C A - C
y: C - T T C A G -
Case 3: Line up yj with space (last column pairs a space with yj)
x: C A T T C A C -
y: C - T T C A - G

Fill in the matrix
+10 for match, -2 for mismatch, -5 for space
[Figure: DP matrix with λ C T C G C A G C along the top and λ C A T T C A C down the left side (λ = empty prefix). Top row initialized to 0 -5 -10 -15 -20 -25 -30 -35 -40; left column initialized to -5 -10 -15 -20 -25 -30 -35; the first two cells filled in are 10 and 5.]

Calculate score of optimum alignment
+10 for match, -2 for mismatch, -5 for space
[Figure: completed DP matrix for C A T T C A C vs C T C G C A G C (λ = empty prefix)]

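Steps 1 to 3 can be sketched as a short matrix-filling routine for the example above. This is a minimal sketch under stated assumptions: the name `nw_matrix` is hypothetical, the left-hand sequence (CATTCAC) indexes rows, and the gap value is passed as a negative number that is added, matching the "-5 for space" convention of this slide.

```python
def nw_matrix(x, y, match=10, mismatch=-2, gap=-5):
    """Fill the (len(x)+1) x (len(y)+1) global-alignment score matrix."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per residue
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Recursion: best of diagonal (pair residues), up (gap in y), left (gap in x)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S


S = nw_matrix("CATTCAC", "CTCGCAGC")
print(S[0])              # top row: 0, -5, ..., -40
print(S[1][1], S[1][2])  # first two filled cells: 10 5
print(S[-1][-1])         # S(N, M), the optimal global score: 33
```

The lower right cell is the score of the best global alignment; the alignment itself still has to be recovered by traceback.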
4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment
[Figure: completed DP matrix for C A T T C A C vs C T C G C A G C (λ = empty prefix) with traceback arrows; the two T's are marked with *]
Can have >1 optimum alignment; this example has 2

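The traceback can be sketched by re-checking, at each cell, which of the three moves produced its score. This is a minimal sketch rather than the lecture's own code: `needleman_wunsch` is a hypothetical name, x is taken as the left sequence and y as the top sequence, and ties are broken in the order diagonal, vertical, horizontal, so only one of the optimal alignments is returned.

```python
def needleman_wunsch(x, y, match=10, mismatch=-2, gap=-5):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Fill the score matrix (same recursion as in the earlier steps)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback from the lower right corner to the upper left corner
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            row_x.append(x[i - 1])   # diagonal: one character from each
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_x.append(x[i - 1])   # vertical: gap in top sequence y
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")        # horizontal: gap in left sequence x
            row_y.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


score, ax, ay = needleman_wunsch("CATTCAC", "CTCGCAGC")
print(score)  # 33
print(ax)     # CAT-TCA-C
print(ay)     # C-TCGCAGC
```

Preferring the horizontal move over the diagonal at the tied cell would return the second optimal alignment instead; a pointer matrix, as mentioned in step 2, would let both be enumerated.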
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
• Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
• Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
• Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
• Local sequence similarities may indicate "functional modules"
Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
Sequences: g g t c t g a g and a a a c g a
Match: +2   Mismatch or space: -1
Best local alignment:
c t g a
c - g a
Score = 5

Local Alignment: Algorithm
• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
• A score is never allowed to drop below 0: a cell whose best option is negative is set to 0, so a new local alignment can start there
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalties

Traceback - for Local Alignment
+1 for a match, -1 for a mismatch, -5 for a space
[Figure: local alignment DP matrix for C A T T C A C vs C T C G C A G C (λ = empty prefix) with traceback path]

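The local alignment procedure can be sketched with the earlier ggtctgag / aaacga example (match +2, mismatch or space -1). This is a minimal sketch under assumptions: `smith_waterman` is a hypothetical name, scores are floored at 0, the traceback starts at the first cell holding the highest score and stops at the first 0, and only one best local alignment is reported.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                      # restart the alignment here
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached
    i, j = best_pos
    row_x, row_y = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


score, ax, ay = smith_waterman("ggtctgag", "aaacga")
print(score)  # 5
print(ax)     # ctga
print(ay)     # c-ga
```

Compared with the global version, only three things change: the boundary is initialized with 0, the recursion takes a maximum with 0, and the traceback starts at the best cell rather than the lower right corner.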
Some Results re: Alignment Algorithms (for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

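To give a feel for the space result, the sketch below computes only the optimal global score while keeping two rows of the matrix in memory. It is not the method of [Myers88], which also recovers the alignment itself in linear space; this simpler score-only version is an illustration, and the name `nw_score_two_rows` is an assumption.

```python
def nw_score_two_rows(x, y, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score using two rows of the DP matrix."""
    if len(y) > len(x):
        x, y = y, x  # scoring is symmetric; keep the shorter sequence as the row
    prev = [j * gap for j in range(len(y) + 1)]  # row 0 of the full matrix
    for i in range(1, len(x) + 1):
        curr = [i * gap]                         # first column of row i
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # diagonal
                            prev[j] + gap,       # from the row above
                            curr[j - 1] + gap))  # from the left
        prev = curr                              # discard the older row
    return prev[-1]


print(nw_score_two_rows("CATTCAC", "CTCGCAGC"))  # 33, same as the full matrix
```

Run time is still proportional to the number of cells, O(mn); only the memory drops, because each row depends solely on the row directly above it.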
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

Statistical Significance of Sequence Alignment

Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of gap
h = gap opening penalty
g = gap extension penalty
Can also be solved in O(nm) time using dynamic programming
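The affine penalty formula can be written as a one-line function. In the sketch below the function name and the default values h = 10 and g = 1 are illustrative assumptions, not values from the lecture; a gap of length 0 is treated as no gap and costs nothing.

```python
def affine_gap_penalty(k, h=10, g=1):
    """Penalty h + g*k for one gap of length k (h, g: assumed example values)."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    if k == 0:
        return 0  # no gap, no penalty
    return h + g * k


one_long_gap = affine_gap_penalty(4)         # a single gap of length 4
four_short_gaps = 4 * affine_gap_penalty(1)  # four separate gaps of length 1
print(one_long_gap, four_short_gaps)         # 14 44
```

Because the opening cost h is paid once per gap, one long gap is much cheaper than the same number of gap positions scattered across the alignment, which is the behavior affine penalties are designed to produce.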