Sequence Alignment Tutorial #2

Sequence Alignment Tutorial #2 © Ydo Wexler & Dan Geiger

Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: |alphabet| = 4
• Protein: |alphabet| = 20

Global Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
• A possible alignment:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

Global Alignment
Example (cont.):
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Three elements:
• Perfect matches
• Mismatches
• Insertions & deletions (indels)
Symmetric view of evolution.
[Figure: biological data, hypotheses space, best biological explanation]

Global Alignment: scoring scheme
Score each position independently:
• Match: +1
• Mismatch: -1
• Indel: -2
The score of an alignment is the sum of its position scores.
Example:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Score: (+1 x 13) + (-1 x 2) + (-2 x 4) = 3
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Score: (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
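The position-by-position scoring above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: the function name `alignment_score` and its parameter names are assumptions, and the two rows are expected to be already aligned, with `-` marking a gap.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two equal-length aligned rows ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += indel      # one letter aligned to a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # two different letters
    return total


print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Counting columns this way is also a quick check on hand-computed scores, since matches, mismatches and indels must add up to the alignment length.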

Sequence Alignment Variants
Two basic variants of sequence alignment:
• Global alignment (Needleman-Wunsch)
• Local alignment (Smith-Waterman)
Today we'll see:
• Overlap alignment
• Affine cost for gaps
We'll use the ideas of dynamic programming presented in the lecture.

Overlap Alignment
Consider the following problem:
• Find the most significant overlap between two sequences S and T.
• Possible overlap relations: (a) an end of one sequence overlaps an end of the other; (b) one sequence is contained in the other.
Difference from local alignment: here the alignment must extend to the endpoints of the two sequences.

Overlap Alignment
Formally: given S[1..n] and T[1..m], find i, j such that
    d = max{ D(S[1..i], T[j..m]), D(S[i..n], T[1..j]), D(S[1..n], T[i..j]), D(S[i..j], T[1..m]) }
is maximal. Here D(X, Y) is the global alignment score of X and Y.
Solution: same as global alignment, except that we do not penalise overhanging ends.

Overlap Alignment
• Initialization: V[i,0] = 0, V[0,j] = 0
• Recurrence: as in global alignment
• Score: the maximum value in the bottom row and the rightmost column
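The three steps above (zero initialization, the global-alignment recurrence, and taking the maximum over the last row and last column) can be sketched as follows. The function name `overlap_score` and its parameters are assumptions made for illustration; the sketch returns only the score and does no traceback.

```python
def overlap_score(s, t, match, mismatch, indel):
    """Best overlap-alignment score of s and t; overhanging ends are free."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: skipping a prefix of either sequence is free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # S[i] aligned to T[j]
                V[i - 1][j] + indel,    # S[i] aligned to a gap
                V[i][j - 1] + indel,    # T[j] aligned to a gap
            )
    # The alignment may end anywhere in the bottom row or the rightmost column.
    return max(max(V[n]), max(row[m] for row in V))


print(overlap_score("PAWHEAE", "HEAGAWGHEE", match=4, mismatch=-1, indel=-5))  # 11
```

Because the corner cells V[n][0] and V[0][m] are included in the maximum, the returned score is never negative: an empty overlap scores 0.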

Overlap Alignment (Example)
S = PAWHEAE
T = HEAGAWGHEE
Scoring scheme:
• Match: +4
• Mismatch: -1
• Indel: -5

Overlap Alignment (Example)
Scoring scheme:
• Match: +4
• Mismatch: -1
• Indel: -5
The best overlap (score 11: three matches and one mismatch) is:
    PAWHEAE------
    ---HEAGAWGHEE
Pay attention! A different scoring scheme could yield a different result, such as:
    ---PAW-HEAE
    HEAGAWGHEE-
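The sensitivity to the scoring scheme can be checked by scoring the two overlaps above under two indel penalties. The sketch below is illustrative only: the helper name `score_overlap_rows` and the alternative indel penalty of -2 are assumptions, and the code merely compares these two candidates; it does not search for the optimal overlap under either scheme.

```python
def score_overlap_rows(row_s, row_t, match, mismatch, indel):
    """Score two aligned rows ('-' is a gap); overhanging ends are not penalised."""
    columns = list(zip(row_s, row_t))
    # Columns in which both rows hold a letter. Everything before the first
    # and after the last such column is an overhanging end and is skipped.
    paired = [k for k, (a, b) in enumerate(columns) if a != "-" and b != "-"]
    if not paired:
        return 0
    total = 0
    for a, b in columns[paired[0] : paired[-1] + 1]:
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = ("PAWHEAE------", "---HEAGAWGHEE")
second = ("---PAW-HEAE", "HEAGAWGHEE-")
for indel in (-5, -2):  # -2 is a hypothetical alternative penalty
    print(indel,
          score_overlap_rows(*first, 4, -1, indel),
          score_overlap_rows(*second, 4, -1, indel))
# prints "-5 11 9" and then "-2 11 12"
```

With the cheaper indel the second overlap overtakes the first, since it contains one internal gap and more matches.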

Affine gap scores
• Observation: insertions and deletions often occur in blocks longer than a single nucleotide.
• Consequence:
  • The current scoring scheme gives a constant penalty per gap unit, so one long gap costs as much as the same number of scattered single-unit gaps.
  • This does not model the above phenomenon well.
Question: How do we modify the scheme to incorporate this?

Alignment with affine gap scores
• Penalty score for a gap of length g: d + (g-1)e, where
    d - penalty for introducing a gap
    e - penalty for elongating the gap by one unit
  Typically d > e.
• Problem: when aligning S[i] to a gap, we do not know whether to penalize by d or by e.
Solution: we compute 3 matrices simultaneously:
    M(i,j) - the best score of aligning S[1..i] with T[1..j], given that S[i] is aligned to T[j]
    IS(i,j) - the best score of aligning S[1..i] with T[1..j], given that S[i] is aligned to a gap
    IT(i,j) - the best score of aligning S[1..i] with T[1..j], given that T[j] is aligned to a gap
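To see what the affine scheme buys, compare the two penalty functions on a small case. The sketch assumes the common convention that the first gap position costs d and each further position costs e, so a gap of length g costs d + (g-1)e; the function names and the values d = 5, e = 1 are hypothetical.

```python
def linear_gap_penalty(g, per_unit):
    """Constant penalty per gap unit."""
    return g * per_unit


def affine_gap_penalty(g, d, e):
    """d for opening the gap plus e for each of the g - 1 further units (assumed convention)."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return d + (g - 1) * e


d, e = 5, 1  # hypothetical values with d > e
print(affine_gap_penalty(3, d, e))      # one gap of length 3: 7
print(3 * affine_gap_penalty(1, d, e))  # three separate gaps of length 1: 15
print(linear_gap_penalty(3, d))         # linear scheme: 15 in both cases
```

Under the affine scheme a single block of three gap positions is much cheaper than three scattered ones, which matches the observation that indels tend to occur in blocks.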

Affine gap scores
We assume that a deletion will not be followed directly by an insertion. This can be ensured by a suitable choice of the gap penalties relative to the mismatch scores.
• Initialization: depending on the problem (global, local, …)
• Recurrence: uses already known values M(i’,j’), IS(i’,j’), IT(i’,j’)
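One way the three-matrix computation might look for the global case is sketched below. Several things here are assumptions rather than the tutorial's own formulas: the gap convention (a gap of length g costs d + (g-1)e), the global-alignment initialization, and the function name `affine_global_score`. In line with the assumption above, there are no direct transitions between IS and IT.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match, mismatch, d, e):
    """Global alignment score with affine gaps (a gap of length g costs d + (g - 1) * e)."""
    n, m = len(s), len(t)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # S[i] aligned to T[j]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S[i] aligned to a gap
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # T[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        IS[i][0] = -d - (i - 1) * e   # leading gap covering S[1..i]
    for j in range(1, m + 1):
        IT[0][j] = -d - (j - 1) * e   # leading gap covering T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + sub
            # Either open a new gap after a match/mismatch (pay d) or extend one (pay e).
            IS[i][j] = max(M[i - 1][j] - d, IS[i - 1][j] - e)
            IT[i][j] = max(M[i][j - 1] - d, IT[i][j - 1] - e)
    return max(M[n][m], IS[n][m], IT[n][m])


# Six matches and one gap of length 3: 6 - (5 + 2 * 1) = -1
print(affine_global_score("AAATTTCCC", "AAACCC", match=1, mismatch=-1, d=5, e=1))  # -1
```

Keeping separate matrices is what resolves the "d or e" question: IS(i,j) and IT(i,j) remember that a gap is already open, so extending it costs e, while opening one from M costs d.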

Affine gap scores
• Simplification: Why are two matrices enough?
