Ch 3 Sequence Alignment
Pairwise alignments • Introduction • Why do alignments? • Definitions • Dot matrix comparison • Scoring alignments • Alignment methods • Significance of alignments
Why do alignments? Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.
Over time, genes accumulate mutations • Environmental factors • Radiation • Oxidation • Mistakes in replication or repair • Deletions, Duplications • Insertions • Inversions • Point mutations
Deletions
• Codon deletion: ACG ATA GCG TAT GTA TAG CCG…
  • Effect depends on the protein, position, etc.
  • Almost always deleterious
  • Sometimes lethal
• Frame shift mutation:
  ACG ATA GCG TAT GTA TAG CCG…
  ACG ATA GCG ATG TAT AGC CG?…
  • Almost always lethal
Indels
• When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
  ACGTCTGATACGCCGTATCGTCTATCT
  ACGTCTGAT---CCGTATCGTCTATCT
Comparing two sequences
• Point mutations, easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT
• Insertions/deletions, must align:
  Unaligned:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT
  Aligned:
  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT
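The "point mutations, easy" case can be checked mechanically: with two gap-free sequences of equal length, a column-by-column comparison is enough. The following is a minimal sketch; the function name `point_mutations` is illustrative and not part of the slides.

```python
def point_mutations(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) for every differing column.

    Positions are 1-based. This only makes sense for gap-free sequences
    of equal length; sequences with indels must be aligned first.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length; align them first")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


# The two equal-length sequences from the point-mutation example.
seq1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
diffs = point_mutations(seq1, seq2)
print(diffs)
```

For the insertion/deletion pair the lengths differ (27 versus 20), so the function raises an error, which is exactly why an alignment step is needed.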
Russell F. Doolittle: Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.
• The sequence of platelet-derived growth factor from mammalian cells was virtually identical to the sequence of the retrovirus-encoded oncogene known as v-sis (a gene causing cancer in animals).
• The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event and had then produced a mutant that could alter the function of the normal protein when it infected another animal.
Definitions
An alignment is a mutual arrangement of sequences that shows where the sequences are similar and where they differ.
An optimal alignment is one that exhibits the most correspondences and the fewest differences. It is the alignment with the highest score. It may or may not be biologically meaningful.
Dot Matrix Comparison
A: TCAGAGGTCTG
B: TCAGAGCTG
Interpretation of dot matrix • Regions of similarity appear as diagonal runs of dots • Reverse diagonals (perpendicular to diagonal) indicate inversions • Can link or "join" separate diagonals to form alignment with "gaps"
More on Dot Matrix
• Detection of matching regions can be improved by filtering: use a sliding window to compare the two sequences. For example, print a dot at a matrix position only if one of these criteria is met:
  • 7 out of the next 11 positions in the sequences are identical
  • The similarity score of the next 11 positions in the sequences is greater than 5
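The windowed filter can be sketched in a few lines. This is a minimal illustration, not the code behind any particular dot-plot tool: `dot_matrix` and `render` are hypothetical names, and the window of 3 is chosen only because the two example sequences A and B from the dot matrix comparison are short. The slide's "7 out of 11" rule corresponds to `window=11, min_matches=7`.

```python
def dot_matrix(a: str, b: str, window: int = 1, min_matches: int = 1) -> list[list[bool]]:
    """dots[i][j] is True when the windows starting at a[i] and b[j]
    agree in at least min_matches of their `window` positions."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [
        [sum(a[i + k] == b[j + k] for k in range(window)) >= min_matches
         for j in range(cols)]
        for i in range(rows)
    ]


def render(dots: list[list[bool]], a: str, b: str) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    header = "  " + b[:len(dots[0])]
    body = [a[i] + " " + "".join("*" if hit else "." for hit in row)
            for i, row in enumerate(dots)]
    return "\n".join([header] + body)


seq_a = "TCAGAGGTCTG"
seq_b = "TCAGAGCTG"
unfiltered = dot_matrix(seq_a, seq_b)                          # one dot per identical base
filtered = dot_matrix(seq_a, seq_b, window=3, min_matches=3)   # only exact 3-mers
print(render(unfiltered, seq_a, seq_b))
print()
print(render(filtered, seq_a, seq_b))
```

In the filtered plot the isolated dots disappear, while the diagonal for the shared prefix and the shorter diagonal for the shared `CTG` ending remain.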
Sequence repeats
• Many sequences contain repetitive regions.
• Example: a retrovirus vector sequence compared against itself using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)
More on Dot Matrix • Dot matrix graphically presents regions of identity or similarity between two sequences • The use of windows and thresholds can reduce “noise” in dot matrix • Inversions and duplications have unique “signatures” in dot matrix
Demo: http://arbl.cvmbs.colostate.edu/molkit/dnadot/ • >gi|126143346|dbj|AB294237.1| Lethenteron japonicum mRNA for hemoglobin 4, complete cds AGATACCGGAGGTGTTCATCAGACCGTCTCTACTCGCCGAGTCCACCTCGCGCGCGCGTCGCTTCCTTCG CGGACATTCCCCAGCAGCCGGCAACCATGCCCATCGAAGACTCGGGGAGCAAGCCGGACTTCTCGGACGA CGAGAAGAAGGCCATCAAGGACTCGTGGAGTGGCGTCTACTCCGAATACGAGTCGACGAGCTCCGAGATT CTCATCAAGTTCTTCGTGGACAATCCATCCGCGCAGGACTTCTTCCCCAAGTTCAAGGACCTGGACTCGG AGGAGAAGCTCAAGGGCTCCACGGCGGTGCGCTGGCACGCCGAGAGGATCATCAACGCCGTCAACGACGC CATCTGGCTCCTGGACGAGCCCGAGAAGAACGCCAAGAAACTGAAGGAGCTGAGCGAGAAGCACGCGGTG CAGCTCAACGTCGATGCCAAGTTCTTCAAGGTCCTGGCTGAGGTGATCCTGGACAAGGTGGCAGAGAAGA ACGACGGCTCCTTCTCGGACAGCGCTCGCTCTGCGTGGGAGAAGCTGCTGACTTACATCTGCATCAGCCT CAAAGTCGCCTACTAAAAATTCATGTCGCTTTTCAAATAAAATGCAAGTCAAATTCCTGCTCAAACGAGC AGCAAAAAACACAAAACACCCACGTACATGCTACGGACAAAATCTTGGAATGTTGGTGCTTAAATTACAG AAATAAAGATGCATTCGAGGTTAT • >gi|126143341|dbj|AB294235.1| Lethenteron japonicum mRNA for hemoglobin 1, complete cds GACTTCAGACTCTCTTTGCAGGATCATCGCCGATCACCTCCAAAAGCAATCACCATGCCTATCGTCGACT CTGGAAGCGTTGCGCCTCTGAGCGCTGCTGAGAAGACCAAAATCCGCAGCGCGTGGGCCCCCGTGTACTC CAATTATGAGACCTCCGGTGTGGACATCCTGGTGAAGTTCTTCACCAGCACCCCCGCTGCTCAGGAATTC TTCCCCAAGTTCAAGGGGATGACTTCGGCTGATCAGCTCAAGAAATCCGCGGACGTGCGCTGGCACGCCG AGAGGATCATCAACGCCGTCAACGATGCCGTGGCGTCCATGGATGACACCGAGAAGATGAGCATGAAGCT AAGGGATCTGAGTGGCAAGCACGCCAAAAGCTTCCAGGTGGACCCCCAGTACTTCAAGGTCCTGGCCGCG GTCATTGCTGACACCGTTGCCGCAGGCGATGCCGGCTTCGAGAAGCTCATGAGCATGATTTGCATCCTGC TCAGGTCCGCCTACTAAACCAACGTGCACTGCCACTCCTGCAAGACTGCTACCCAACTGCCTTTGCCACA CGATTTGCACCAGTTTCTGTTGACAGAGATAACGAATGTCTTGTGAGAATGAAGCATATTTCCGAGTTGT TTTTGTCAATGACAATACAATGAATGTAACTTCCCAATGAAAATCAATAAAACTTTCATGCATCCAG
Scoring a sequence alignment
• Match score: +1
• Mismatch score: 0
• Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (-1)
Score = +11
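The tally can be reproduced with a short column-by-column scorer. This is a minimal sketch that assumes gaps are written as `-` and that every gap position is penalized independently (a linear gap penalty, as in the slide); the function name is illustrative.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Sum per-column scores of an already aligned pair of sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # column containing a gap
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
total = score_alignment(top, bottom)
print(total)  # 18 matches, 2 mismatches, 7 gap positions -> 11
```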
How to measure the similarity
Basically three kinds of changes can occur at any given position within a sequence:
• Mutation (substitution)
• Insertion
• Deletion
Insertions and deletions have been found to occur in nature at a significantly lower frequency than point mutations.
Scoring Matrices for Aligning DNA Sequences

Identity matrix
     A    T    C    G
A    1    0    0    0
T    0    1    0    0
C    0    0    1    0
G    0    0    0    1

BLAST matrix
     A    T    C    G
A    2   -3   -3   -3
T   -3    2   -3   -3
C   -3   -3    2   -3
G   -3   -3   -3    2

Transition-Transversion matrix
     A    T    C    G
A    1   -5   -5   -1
T   -5    1   -1   -5
C   -5   -1    1   -5
G   -1   -5   -5    1

Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
Transversions: substitutions between a purine (A/G) and a pyrimidine (C/T).
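All three DNA matrices follow one rule: a score for identical bases, one for transitions, and one for transversions. The sketch below builds them from that rule; the helper names are illustrative, and the numeric values are the ones shown in the matrices.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_kind(x: str, y: str) -> str:
    """Classify a pair of bases as match, transition, or transversion."""
    if x == y:
        return "match"
    pair = {x, y}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(match: int, transition: int, transversion: int,
                 alphabet: str = "ATCG") -> dict[tuple[str, str], int]:
    """Return a dict mapping (base, base) to its substitution score."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(x, y): scores[substitution_kind(x, y)]
            for x in alphabet for y in alphabet}


identity = build_matrix(match=1, transition=0, transversion=0)
blast = build_matrix(match=2, transition=-3, transversion=-3)
ts_tv = build_matrix(match=1, transition=-1, transversion=-5)

print(ts_tv[("A", "G")], ts_tv[("A", "T")])  # transition vs. transversion
```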
Amino Acid Substitution Matrices
• PAM (point accepted mutation): based on global alignments [evolutionary model]
• BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]
Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )

[Figure: part of the PAM 250 matrix]
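The log-odds ratio is simple to compute once the two probabilities are known. The sketch below uses made-up numbers purely for illustration; they are not taken from any PAM or BLOSUM table. It also assumes that the chance of a pair in unrelated proteins is the product of the two background frequencies, and that the logarithm is base 2. Published matrices additionally scale and round the result.

```python
import math


def log_odds(p_homologous: float, p_random: float, base: float = 2.0) -> float:
    """log(observed pair probability / probability expected by chance)."""
    return math.log(p_homologous / p_random, base)


# Hypothetical values, for illustration only.
observed_pair = 0.02            # chance to see the pair in homologous proteins
background_a = 0.05             # assumed background frequency of residue a
background_b = 0.10             # assumed background frequency of residue b
expected_pair = background_a * background_b

favoured = log_odds(observed_pair, expected_pair)   # pair seen more often than chance
disfavoured = log_odds(0.001, expected_pair)        # pair seen less often than chance
print(favoured, disfavoured)
```

A positive score means the pair occurs in homologous proteins more often than chance predicts, and a negative score means it occurs less often.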
PAM matrices

Probability of amino acid change:
Amino acid change    PAM 1     PAM 250
Phe to Ala           0.0002    0.04
Phe to Arg           0.0001    0.01
Phe to Asn           0.0001    0.02
Phe to Asp           0.0000    0.01
Phe to Cys           0.0000    0.01
...                  …         …

• The PAM 1 matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
• PAM 250: 250 units of evolution.
Limitations of PAM Matrices • Based on a small set of closely related proteins; • Constructed based on the phylogenetic relationships prior to scoring mutations; • Difficulty of determining ancestral relationships among sequences; • …
BLOSUM Matrices • Based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). The blocks are found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families. • The matrices are computed from the multiple alignments of the blocks. • The entries of the matrices are computed on the same principle used in PAM: the log-odds ratio.
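Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )

[Figure: part of the BLOSUM 62 matrix]

• BLOSUM62 was derived from blocks in which sequences at least 62% identical were clustered together, so its substitution counts come from comparisons between clusters rather than within them.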
PAM vs. BLOSUM
• PAM
  • Based on a mutational model of evolution (Markov process)
  • PAM1 is based on closely related sequences (at least 85% identical)
  • Designed to track evolutionary origins
• BLOSUM
  • Based on the multiple alignment of blocks
  • Well suited to comparing distantly related sequences
  • Designed to find conserved domains in proteins
Definitions
• Optimal alignment: one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.
• Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
• Local alignment: Smith-Waterman (1981) gives the highest scoring local match between two sequences.
Pairwise Global Alignment • Global alignment: Needleman-Wunsch (1970) • maximizes the number of matches between the sequences along the entire length of the sequences. • Reasons for making a global alignment: • Checking minor differences between two sequences • Analyzing polymorphisms (e.g., SNPs) between closely related sequences • …
Pairwise Global Alignment • Computationally: • Given: a pair of sequences (strings of characters) • Output: an alignment that maximizes the similarity score
How can we find an optimal alignment?
• Example alignment:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGAT---TCG-CATCGTC--T-ATCT
• How many possible alignments? C(27,7) gap positions = ~888,000 possibilities
• Dynamic programming: the Needleman & Wunsch algorithm
Time Complexity
Consider two sequences: AAGT and AGTC. How many possible alignments do the two sequences have?
For two sequences of length n: C(2n, n) = (2n)!/(n!)^2 = Θ(2^(2n)/√n) = Ω(2^n)
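Both counts are easy to check numerically. The sketch below uses Python's `math.comb`; the helper names are illustrative, and the comparison with 4^n/√(πn) is the standard Stirling estimate for C(2n, n), added here as a sanity check on the growth rate.

```python
import math

# Ways to place 7 gap positions among 27 alignment columns.
gap_placements = math.comb(27, 7)
print(gap_placements)


def central_binomial(n: int) -> int:
    """C(2n, n) = (2n)! / (n!)^2."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Asymptotic estimate 4^n / sqrt(pi * n) of C(2n, n)."""
    return 4 ** n / math.sqrt(math.pi * n)


# n = 4 corresponds to the AAGT / AGTC example.
for n in (4, 10, 20):
    print(n, central_binomial(n), round(stirling_estimate(n)))
```

Even at n = 20 the count is above 10^11, which is why enumerating alignments is not practical and dynamic programming is used instead.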
Elements of Dynamic Programming • The dynamic programming method is used to solve optimization problems whose optimal solutions depend on the optimal solutions to their subproblems. It involves four steps: • Characterize the structure of the optimal solutions • Recursively define the score of an optimal solution in terms of the scores of optimal solutions of subproblems • Compute the solution in a bottom-up fashion • Trace back the optimal solution
Needleman & Wunsch
Consider two sequences: AGAC and AGC. To find the optimal solution, consider the three ways the last column of the alignment can look:
• C is aligned with C:
  • need to find the best alignment between AGA and AG.
• The C of the 1st sequence is aligned with _:
  AGAC
  AGC_
  • need to find the best alignment between AGA and AGC.
• The C of the 2nd sequence is aligned with _:
  AGAC_
  AG  C
  • need to find the best alignment between AGAC and AG.
Dynamic Programming
Let n be the size of sequence s = AAAC and m be the size of sequence t = AGC. Consider subproblems: matching the prefixes of s and t.
• s has n+1 possible prefixes, including the empty string
• t has m+1 possible prefixes, including the empty string
Dynamic Programming
We would like to match s[1…i] and t[1…j]. There are three options:
• Align s[1…i] with t[1…j-1] and match a space with t[j]
• Align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
• Align s[1…i-1] with t[1…j] and match a space with s[i]

Similarity between s and t:

Score(s[1…i], t[1…j]) = max of
    Score(s[1…i], t[1…j-1]) + gap penalty
    Score(s[1…i-1], t[1…j-1]) + score(s[i], t[j])
    Score(s[1…i-1], t[1…j]) + gap penalty
Needleman & Wunsch • Place each sequence along one axis • Place score 0 at the upper-left corner • Fill in the 1st row & column with multiples of the gap penalty • Fill in the matrix with the max value of 3 possible moves: • Vertical move: Score + gap penalty • Horizontal move: Score + gap penalty • Diagonal move: Score + match/mismatch score • The optimal alignment score is in the lower-right corner • To reconstruct the optimal alignment, trace back where the max at each step came from, stopping when the origin is reached.
Example
• Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty     0     -2   -4   -6   -8
A        -2      1   -1   -3   -5
G        -4     -1    0   -2   -4
C        -6     -3   -2   -1   -1

Traceback moves: GAP in top sequence, GAP in low sequence, or ALIGN both positions.

Optimal alignments (score -1):

AAAC
A-GC

AAAC
-AGC
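The fill and traceback steps can be sketched as follows. This is a minimal illustration with a linear gap penalty, not a reference implementation; the function name and the tie-breaking order in the traceback (diagonal first, then horizontal, then vertical) are choices made here, so only one of the equally good alignments is returned.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment. Rows follow t, columns follow s, as in the example matrix.

    Returns (score matrix, aligned s, aligned t).
    """
    rows, cols = len(t) + 1, len(s) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):              # first row: multiples of the gap penalty
        F[0][j] = j * gap
    for i in range(1, rows):              # first column: multiples of the gap penalty
        F[i][0] = i * gap

    def pair(i: int, j: int) -> int:
        return match if t[i - 1] == s[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + pair(i, j),   # diagonal move
                          F[i - 1][j] + gap,              # vertical move
                          F[i][j - 1] + gap)              # horizontal move

    # Trace back from the lower-right corner to the origin.
    top, bottom = [], []
    i, j = len(t), len(s)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair(i, j):
            top.append(s[j - 1]); bottom.append(t[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            top.append(s[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(t[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, aligned_s, aligned_t = needleman_wunsch("AAAC", "AGC")
for row in F:
    print(row)
print(aligned_s)
print(aligned_t)
```

With the example's parameters the lower-right cell is -1, and this tie-breaking order yields the `AAAC` / `-AGC` alignment.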
Elements of Dynamic Programming • Characterize the structure of the optimal solutions • Recursively define the score of an optimal solution in terms of the scores of optimal solutions of subproblems • Compute the solution in a bottom-up fashion • Trace back the optimal solution
Time Complexity Analysis • Initialize matrix values: O(n), O(m) • Filling in rest of matrix: O(nm) • Traceback: O(n+m) • If strings are same length, total time O(n^2)
Local Alignment • Problem first formulated: • Smith and Waterman (1981) • Problem: • Find an optimal alignment between a substring of s and a substring of t • Algorithm: • is a variant of the basic algorithm for global alignment
Motivation • Searching for unknown domains or motifs within proteins from different families • Proteins encoded by homeobox genes (conserved in only one region, the homeodomain, 60 amino acids long) • Identifying active sites of enzymes • Comparing long stretches of anonymous DNA • Querying databases where the query is much smaller than the sequences in the database • Analyzing repeated elements within a single sequence
Local Alignment
• Let n be the size of sequence s = CATCACCT and m be the size of sequence t = GATACCC. Consider subproblems: matching the suffixes of s and t.
• s has n+1 possible suffixes, including the empty string
• t has m+1 possible suffixes, including the empty string
Local Alignment
Sequences: CATCACCT and GATACCC
• Let gap = -2, match = 1, mismatch = -1.

        empty   C   A   T   C   A   C   C   T
empty     0     0   0   0   0   0   0   0   0
G         0     0   0   0   0   0   0   0   0
A         0     0   1   0   0   1   0   0   0
T         0     0   0   2   0   0   0   0   1
A         0     0   1   0   1   1   0   0   0
C         0     1   0   0   1   0   2   1   0
C         0     1   0   0   1   0   1   3   1
C         0     1   0   0   1   0   1   2   2

CATCACCT
GAT_ACCC
Smith & Waterman • Place each sequence along one axis • Place score 0 at the upper-left corner • Fill in the 1st row & column with 0s • Fill in the matrix with the max value of 4 possible values: • 0 • Vertical move: Score + gap penalty • Horizontal move: Score + gap penalty • Diagonal move: Score + match/mismatch score • The optimal alignment score is the max in the matrix • To reconstruct the optimal alignment, trace back where the MAX at each step came from, stopping when a zero is hit
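These steps can be sketched as a small variation on the global algorithm. This is a minimal illustration with a linear gap penalty; the function name, the choice to keep the first maximum found, and the tie-breaking order in the traceback are decisions made here, not something the slides specify.

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Local alignment. Rows follow t, columns follow s.

    Returns (best score, aligned part of s, aligned part of t).
    """
    rows, cols = len(t) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]     # 1st row and column stay 0

    def pair(i: int, j: int) -> int:
        return match if t[i - 1] == s[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,                              # start a new local alignment
                          H[i - 1][j - 1] + pair(i, j),   # diagonal move
                          H[i - 1][j] + gap,              # vertical move
                          H[i][j - 1] + gap)              # horizontal move
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell until a zero is hit.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair(i, j):
            top.append(s[j - 1]); bottom.append(t[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] + gap:
            top.append(s[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(t[i - 1]); i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, local_s, local_t = smith_waterman("CATCACCT", "GATACCC")
print(score)
print(local_s)
print(local_t)
```

For CATCACCT and GATACCC with gap = -2, match = 1, mismatch = -1, the maximum cell holds 3 and the traceback stops at the first zero, giving `ACC` over `ACC`. Extending leftward through one gap to also pair `AT` with `AT` gives the same score of 3 under this scoring, so the two answers are tied.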
Semi-global Alignment
Example:

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.
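One common way to ignore end spaces is to leave the first row and column at 0 (leading gaps cost nothing) and to take the best value in the last row or last column (trailing gaps cost nothing). The sketch below assumes that all four end gaps are free and reuses the scoring values from the earlier examples (match = 1, mismatch = -1, gap = -2). The slide only says "some of the end spaces", so other variants are possible.

```python
def semi_global_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score when gaps at either end of either sequence are free."""
    rows, cols = len(t) + 1, len(s) + 1
    # Row 0 and column 0 stay 0: leading end gaps are not penalized.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if t[i - 1] == s[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trailing end gaps are free: the alignment may end anywhere in the
    # last row (t fully used) or the last column (s fully used).
    best_last_row = max(F[rows - 1])
    best_last_col = max(row[cols - 1] for row in F)
    return max(best_last_row, best_last_col)


print(semi_global_score("AACGTT", "CG"))
print(semi_global_score("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
```

Unlike local alignment, there is no floor at 0 inside the matrix: interior gaps and mismatches are still penalized, and only the overhanging ends are free.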