1.12k likes | 1.46k Views
Pairwise Sequence Alignments. Topics to be Covered. Comparison methods Global alignment Local alignment. Introduction to Alignment. Analyze the similarities and differences at the individual base level or amino acid level
E N D
Topics to be Covered • Comparison methods • Global alignment • Local alignment
Introduction to Alignment • Analyze the similarities and differences at the individual base level or amino acid level • Aim is to infer structural, functional and evolutionary relationships among sequences
Sequence Alignment 982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGACCTCTAC 1022 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCA - - - TCTAC 1001 • Two sequences written out , one on top of the other • Identical or similar characters placed in same column • Nonidentical characters either placed in same column as mismatch or opposite gap in the other sequence • Overall quality of the alignment is then evaluated based on a formula that counts the number of identical (or similar) pairs minus the number of mismatches and gaps
Pairwise Sequence Alignments • Why to compare • Similarity search is necessary for: • Family assignment • Sequence annotation • Construction of phylogenetic trees • Learn about evolutionary relationships • Classify sequences • Identify functions • Homology Modeling
Essential Elements of an Alignment Algorithm • Defining the problem (Global, local alignment) • Scoring scheme (Gap penalties) • Distance Matrix (PAM, BLOSUM series)
Global and Local Alignments Global – attempt is made to align the entire sequence using as many characters as possible, up to both ends of the sequences Local – stretches of sequence with the highest density of matches are aligned • L G P S S K Q T G K G S – S R I W D N • | | | | | | | Global Alignment • L N – I T K S A G K G A I M R L G D A • - - - - - - T G K G - - - - - - • | | | Local Alignment • - - - - - - - A G K G - - - - - -
Local vs. Global Alignment (cont’d) • Global Alignment • Local Alignment—better alignment to find conserved segment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC |||||||||||| AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC
Global and Local Alignments • Global - When two sequences are of approximately equal length. Here, the goal is to obtain maximum score by completely aligning them • Local- When one sequence is a sub-string of the other or the goal is to get maximum local score • Protein motif searches in a database
Dynamic programming algorithm • Dynamic programming = • Build up optimal alignment using previous solutions for optimal alignments of subsequences
A T T A A T A T T A T A A T T A Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences vand w : v : w: • The Hamming distance: dH(v, w) = 8 is large but the sequences are very similar
T A T A A T T A A T T A A T T A Aligning Sequences with Insertions and Deletions By shifting one sequence over one position: v : -- w: -- • The edit distance: dH(v, w) = 2. • Hamming distance neglects insertions and deletions in DNA
Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v,w) = MIN number of elementary operations to transform vw
Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATATATAT W= TATATATA Hamming distance: d(v, w)=8 Computing Hamming distance is a trivial task.
Edit Distance vs Hamming Distance Edit distance may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT Just one shift Make it all line up W= TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)=8d(v, w)=2 Computing Hamming distance Computing edit distance is a trivial task is a non-trivial task
Edit Distance vs Hamming Distance Edit distance may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT W= TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)=8d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i ???
Edit Distance: Example • TGCATAT ATCCGAT in 5 steps • TGCATAT(delete last T) • TGCATA(delete last A) • TGCAT (insert A at front) • ATGCAT (substitute C for 3rdG) • ATCCAT (insert G before last A) • ATCCGAT (Done)
Edit Distance: Example • TGCATAT ATCCGAT in 5 steps • TGCATAT(delete last T) • TGCATA(delete last A) • TGCAT (insert A at front) • ATGCAT (substitute C for 3rdG) • ATCCAT (insert G before last A) • ATCCGAT (Done) • What is the edit distance? 5?
Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT(delete 6thT) ATGCATA (substitute G for 5thA) ATGCGTA (substitute C for 3rdG) ATCCGTA (Done)
Edit Distance: Example (cont’d) TGCATAT ATCCGAT in 4 steps TGCATAT (insert A at front) ATGCATAT(delete 6thT) ATGCAAT (substitute G for 5thA) ATGCGAT (substitute C for 3rdG) ATCCGAT (Done) Can it be done in 3 steps???
The Alignment Grid • Every alignment path is from source to sink
v A C G T A C T w 1 2 3 4 5 6 7 0 A 0 T 1 G 2 T 3 T 4 A 5 T 6 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - Corresponding path -
v A T C G T A C w A 1 2 3 4 5 6 7 0 0 T G 1 2 T 3 T 4 A 5 T 6 7 Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment:
v A T C G T A C w 1 2 3 4 5 6 7 0 A 0 T 1 G 2 T 3 T 4 A 5 T 6 7 Alignment as a Path in the Edit Graph Old Alignment 0122345677 v= AT_GTTAT_ w=ATCGT_A_C 0123455667 New Alignment 0122345677 v= AT_GTTAT_ w=ATCG_TA_C 0123445667
From LCS to Alignment: Change up the Scoring • The Longest Common Subsequence (LCS) problem—the simplest form of sequence alignment – allows only insertions and deletions (no mismatches). • In the LCS Problem, we scored 1 for matches and 0 for indels • Consider penalizing indels and mismatches with negative scores • Simplest scoring schema: • +1 : match premium • -μ : mismatch penalty • -σ : indel penalty
Simple Scoring • When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, • the resulting score is: • #matches – μ(#mismatches) – σ (#indels)
Dynamic programming algorithm • define a matrix Fij: • Fij is the optimal alignment of • subsequence A1...i and B1...j • iterative build up: F(0,0) = 0 • define each element i,j from • (i-1,j): gap in sequence A • (i, j-1): gap in sequence B • (i-1, j-1): alignment of Ai to Bj
Sequence Comparison Scoring Matrices • • The choice of a scoring matrix can strongly influence the outcome of sequence analysis • • Scoring matrices implicitly represent a particular theory of evolution • • Elements of the matrices specify the similarity or the • distance of replacing one residue (base) by another • • Distance and similarity matrices are inter-convertible by some mathematical transformation.
Protein Scoring Matrices • The two most popular matrices are the PAM and the BLOSUM matrix
Scoring Insertions and Deletions A T G T A A T G C A T A T G T G G A A T G A A T G T - - A A T G C A T A T G T G G A A T G A insertion / deletion The creation of a gap is penalized with a negative score value.
Why Gap Penalties? • The optimal alignment of two similar sequences is usually that which • maximizes the number of matches and • minimizes the number of gaps. • Permitting the insertion of arbitrarily many gaps can lead to high • scoring alignments of non-homologous sequences. • Penalizing gaps forces alignments to have relatively few gaps.
Why Gap Penalties? Gaps not permitted Score: 0 1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Match = 5 Mismatch = -4 Gaps allowed but not penalized Score: 88 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
Gap Penalties Linear gap penalty score: γ(g) = - gd Affine gap penalty score: γ(g) = -d - (g -1)e γ(g) = gap penalty score of a gap of length g d = gap opening penalty e = gap extension penalty g = gap length
Scoring Indels: Naive Approach • A fixed penalty σ is given to every indel: • -σ for 1 indel, • -2σ for 2 consecutive indels • -3σ for 3 consecutive indels, etc. • Can be too severe penalty for a series of 100 consecutive indels
This is more likely. This is less likely. Affine Gap Penalties • In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: ATA__GC ATATTGC ATAG_GC AT_GTGC Normal scoring would give the same score for both alignments
Accounting for Gaps • Gaps- contiguous sequence of spaces in one of the rows • Score for a gap of length x is: • -(ρ +σx) • where ρ >0 is the penalty for introducing a gap: • gap opening penalty • ρ will be large relative to σ: • gap extension penalty • because you do not want to add too much of a penalty for extending the gap.
Affine Gap Penalty Recurrences si,j = s i-1,j - σ max s i-1,j –(ρ+σ) si,j = s i,j-1 - σ max s i,j-1 –(ρ+σ) si,j = si-1,j-1 + δ(vi, wj) max s i,j s i,j Continue Gap in w (deletion) Start Gap in w (deletion): from middle Continue Gap in v (insertion) Start Gap in v (insertion):from middle Match or Mismatch End deletion: from top End insertion: from bottom
Scoring Insertions and Deletions match = 1 mismatch = 0 Total Score: 4 A T G T T A T A C T A T G T G C G T A T A Total Score: 8 - 3.2 = 4.8 A T G T - - - T A T A C T A T G T G C G T A T A Gap parameters: d = 3 (gap opening) e = 0.1 (gap extension) g = 3 (gap lenght) γ(g) = -3 - (3 -1) 0.1 = -3.2 insertion / deletion
1 ...VLSPADKFLTNV 12 |||| 1 VFTELSPAKTV.... 11 1 V...LSPADKFLTNV 12 | |||| | | | 1 VFTELSPA.K..T.V 11 gap opening penalty = 3 gap extension penalty = 0.1 score = 6.3 gap opening penalty = 0 gap extension penalty = 0.1 score = 11.3 Modification of Gap Penalties Score Matrix: BLOSUM62
Pairwise Sequence AlignmentLocal AlignmentSemi-Global Alignment
Local Alignment • A local Alignment between sequence s and • sequence t is an alignment with maximum • similarity between a substring of s and a • substring of t. • T. F. Smith & M. S. Waterman, “Identification of Common Molecular Subsequences”, J. Mol. Biol., 147:195-197
Why choose a local alignment algorithm? • More meaningful – point out conserved regions between two sequences • Aligns two sequences of different lengths to be matched • Aligns two partially overlapping sequences • Aligns two sequences where one is a subsequence of another 43
Dynamic ProgrammingLocal Alignment • Si,j = MAXIMUM • [ Si-1, j-1 + s(ai,bj)(match/mismatch in the diagonal), • Si,j-1 + w (gap in sequence #1), • Si-1,j + w (gap in sequence #2), • 0] 44