550 likes | 786 Views
BIO/CS 471 – Algorithms for Bioinformatics. Sequence Alignments and Dynamic Programming. Module II: Sequence Alignments. Finally a real bioinformatics problem! Problem: SequenceAlignment Input: Two or more strings of characters
E N D
BIO/CS 471 – Algorithms for Bioinformatics Sequence Alignmentsand Dynamic Programming
Module II: Sequence Alignments • Finally a real bioinformatics problem! • Problem: SequenceAlignment • Input: Two or more strings of characters • Output: The optimal alignment of the input strings, possibly including gaps, and an alignment score. • Example • Input: The calico cat was very sleepy. The cat was sleepy. • Output: The calico cat was very sleepy. The ------ cat was ---- sleepy. Sequence Alignments
Biological Sequence Alignment • Input: DNA or Amino Acid sequence • Output: Optimal alignment • Example • Input: ACTATAGCTATAGCTATAGCGTATCG ACGATTACAGGTCGTGTCG • Output: ACTATAGCTATAGCTATAGCGTATCG || || ||*|| | |||*||| ACGAT---TACAGGT----CGTGTCG • Score: +8 Sequence Alignments
Why align sequences? • Why would we want to align two sequences? • Compare two genes/proteins, e.g. to infer function • Database searches • Protein structure prediction • Why would we want to align multiple sequences? • Conservation analysis • Molecular evolution Sequence Alignments
How do we decide on a scoring function? • Objective: Sequences that are homologous or evolutionarily related, should align with high scores. Less related sequences should score lower. • Question: How do homologous sequences arise? Sequence Alignments
Transcription The Central Dogma DNA transcription RNA translation Proteins Sequence Alignments
DNA Replication • Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set • DNA polymerase is the enzyme that copies DNA • Reads the old strand in the 3´ to 5´ direction Sequence Alignments
Over time, genes accumulate mutations • Environmental factors • Radiation • Oxidation • Mistakes in replication or repair • Deletions, Duplications • Insertions • Inversions • Point mutations
Deletions • Codon deletion:ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal • Frame shift mutation:ACG ATA GCG TAT GTA TAG CCG…ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal Sequence Alignments
Indels • Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:ACGTCTGATACGCCGTATCGTCTATCTACGTCTGAT---CCGTATCGTCTATCT Sequence Alignments
The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA Sequence Alignments
Two kinds of homologs • Orthologs • Species A has a particular gene • Species A diverges into species B and C, each with the same gene • The two genes accumulate substitutions independently Sequence Alignments
Orthologs and Paralogs • Paralogs • A gene duplication event within a species results in two copies of a gene • One of the copies is now under less selective constraint • The copies can accumulate substitutions independently, before and after speciation Sequence Alignments
Scoring a sequence alignment • Match score: +1 • Mismatch score: +0 • Gap penalty: –1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Gaps: 7 × (– 1) Score = +11 Sequence Alignments
Origination and length penalties • We want to find alignments that are evolutionarily likely. • Which of the following alignments seems more likely to you?ACGTCTGATACGCCGTATAGTCTATCTACGTCTGAT-------ATAGTCTATCTACGTCTGATACGCCGTATAGTCTATCTAC-T-TGA--CG-CGT-TA-TCTATCT • We can achieve this by penalizing more for a new gap, than for extending an existing gap Sequence Alignments
Scoring a sequence alignment (2) • Match/mismatch score: +1/+0 • Origination/length penalty: –2/–1ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Matches: 18 × (+1) • Mismatches: 2 × 0 • Origination: 2 × (–2) • Length: 7 × (–1) Score = +7 Sequence Alignments
Optimal Substructure in Alignments • Consider the alignment:ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || ||||||||----CTGATTCGC---ATCGTCTATCT • Is it true that the alignment in the boxed region must be optimal? Sequence Alignments
A Greedy Strategy • Consider this pair of sequencesGAGCCAGC • Greedy Approach:G or G or -C - G • Leads toGAGC--- Better: GACG---CAGC CACG GAP = 1 Match = +1 Mismatch = 2 Sequence Alignments
A +1 CTCGA CAGTAGA -1 CTCG- ACAGTAG- -1 ACTCGA CAGTAG Breaking apart the problem • Suppose we are aligning:ACTCGACAGTAG • First position choices: Sequence Alignments
A Recursive Approach to Alignment • Choose the best alignment based on these three possibilities: align(seq1, seq2) { if (both sequences empty) {return 0;} if (one string empty) { return(gapscore * num chars in nonempty seq); else { score1 = score(firstchar(seq1),firstchar(seq2)) + align(tail(seq1), tail(seq2)); score2 = align(tail(seq1), seq2) + gapscore; score3 = align(seq1, tail(seq2) + gapscore; return(min(score1, score2, score3)); } } } Sequence Alignments
Time Complexity of RecurseAlign • What is the recurrence equation for the time needed by RecurseAlign? 3 3 n 3 3 3 9 … 27 3 3 3n Sequence Alignments
RecurseAlign repeats its work Sequence Alignments
How can we find an optimal alignment? • Finding the alignment is computationally hard:ACGTCTGATACGCCGTATAGTCTATCTCTGAT---TCG—CATCGTC--T-ATCT • C(27,7) gap positions = ~888,000 possibilities • It’s possible, as long as we don’t repeat our work! • Dynamic programming: The Needleman & Wunsch algorithm Sequence Alignments
What is the optimal alignment? • ACTCGACAGTAG • Match: +1 • Mismatch: 0 • Gap: –1 Sequence Alignments
Needleman-Wunsch: Step 1 • Each sequence along one axis • Mismatch penalty multiples in first row/column • 0 in [1,1] (or [0,0] for the CS-minded) Sequence Alignments
Needleman-Wunsch: Step 2 • Vertical/Horiz. move: Score + (simple) gap penalty • Diagonal move: Score + match/mismatch score • Take the MAX of the three possibilities Sequence Alignments
Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise… Sequence Alignments
Needleman-Wunsch: Step 2 (cont’d) • Fill out the rest of the table likewise… • The optimal alignment score is calculated in the lower-right corner Sequence Alignments
But what is the optimal alignment • To reconstruct the optimal alignment, we must determine of where the MAX at each step came from… Sequence Alignments
A path corresponds to an alignment • = GAP in top sequence • = GAP in left sequence • = ALIGN both positions • One path from the previous table: • Corresponding alignment (start at the end):AC--TCG ACAGTAG Score = +2 Sequence Alignments
Practice Problem • Find an optimal alignment for these two sequences: GCGGTT GCGT • Match: +1 • Mismatch: 0 • Gap: –1 Sequence Alignments
Practice Problem • Find an optimal alignment for these two sequences: GCGGTT GCGT GCGGTTGCG-T- Score = +2 Sequence Alignments
Semi-global alignment • Suppose we are aligning:GCGGGCG • Which do you prefer?G-CG -GCGGGCG GGCG • Semi-global alignment allows gaps at the ends for free. Sequence Alignments
Semi-global alignment • Semi-global alignment allows gaps at the ends for free. • Initialize first row and column to all 0’s • Allow free horizontal/vertical moves in last row and column Sequence Alignments
Local alignment • Global alignments – score the entire alignment • Semi-global alignments – allow unscored gaps at the beginning or end of either sequence • Local alignment – find the best matching subsequence • CGATGAAATGGA • This is achieved by allowing a 4th alternative at each position in the table: zero. Sequence Alignments
Local alignment • Mismatch = –1 this time CGATGAAATGGA Sequence Alignments
Food for thought… • What is the asymptotic time complexity of the Needleman & Wunsch algorithm? • Is this good or bad? • How about the space complexity? • Why might this be a problem? Sequence Alignments
Saving Space • Note that we can throw away the previous rows of the table as we fill it in: This row is based only on this one Sequence Alignments
Saving Space (2) • Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence: Scores for aligning aca with all prefixes of actcg Sequence Alignments
Divide and Conquer • By using a recursive approach, we can use only two rows of the matrix at a time: • Choose the middle character of the top sequence, i • Find out where i aligns to the bottom sequence • Needs two vectors of scores • Recursively align the sequences before and after the fixed positions i ACGCTATGCTCATAG CGACGCTCATCG Sequence Alignments
Finding where i lines up • Find out where i aligns to the bottom sequence • Needs two vectors of scores • Assuming i lines up with a character:alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t)) • Which character is best? • Can quickly find out the score for aligning ACGCTAT with every prefix of t. i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments
Finding where i lines up • But, i may also line up with a gap • Assuming i lines up with a gap:alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t)) i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments
Recursive Call • Fix the best position for I • Call align recursively for the prefixes and suffixes: i s: ACGCTATGCTCATAG t: CGACGCTCATCG Sequence Alignments
Complexity • Let len(s) = m and len(t) = n • Space: 2m • Time: • Each call to build similarity vector = m´n´ • First call + recursive call: i s: ACGCTATGCTCATAG t: CGACGCTCATCG j Sequence Alignments
G - - G or General Gap Penalties • Suppose we are no longer using simple gap penalties: • Origination = −2 • Length = −1 • Consider the last position of the alignment for ACGTA with ACG • We can’t determine the score forunless we know the previous positions! Sequence Alignments
A A C --- A TATCCG A C T AC A C T ACC T ------ C G C -- Scoring Blocks • Now we must score a block at a time • A block is a pair of characters, or a maximal group of gaps paired with characters • To score a position, we need to either start a new block or add it to a previous block Sequence Alignments
The Algorithm • Three tables • a – scores for alignments ending in char-char blocks • b – scores for alignments ending in gaps in the top sequence (s) • c – scores for alignments ending in gaps in the left sequence (t) • Scores no longer depend on only three positions, because we can put any number of gaps into the last block Sequence Alignments
The Recurrences Sequence Alignments
The Optimal Alignment • The optimal alignment is found by looking at the maximal value in the lower right of all three arrays • The algorithm runs in O(n3) time • Uses O(n2) space Sequence Alignments