Pairwise alignment

Pairwise alignment Biology 162 Computational Genetics Todd VisionFall 2004 31 Aug 2004

Preview • General concepts in alignment • How to read a dotplot • Scoring matrices and gap penalties • Basic dynamic programming algorithms • Needleman-Wunsch • Smith-Waterman • Using more realistic gap penalties

Homology • Two sequences are homologous if they aredescended from a common ancestral sequence. • Homology is either all or nothing • Only similarity can be quantitative • An alignment is a model of positional homology between nucleotide or amino acid residues

Some applications of sequence alignment • Sequence/genome assembly • Locating exons within genomic sequences • Functional annotation by homology search against a database • Identification of conserved signatures/motifs/domains • Molecular evolution and phylogenetics • Structural homology modeling

Alignments classified by • Span • Global, encompassing full-length sequences • Local, restricted to conserved segments • Number of sequences • Pairwise, involving only two sequences • Multiple, involving more than two • Algorithm • Optimality guarantee • Heuristic

Amino acids versus DNA • DNA sequences give much worse alignments than amino acid sequences • Fewer letters • Less realistic scoring matrices • Some applications can align codons • How large would that scoring matrix be? • If that’s not possible • Use aa alignment to guide DNA alignment of coding sequences • Use conceptual translations (6 potential coding frames) for database searches

B C Dotplots:phage l cI vs. P22 c2 repressor A Window size 1 11 25 Stringency 1 7 15

Internal repeats:human LDL protein Window size = 23 bp Stringency = 7 bp

Inversions

Hanging ends y y x x Overlap Nesting x y

Twilight Zone 100 90 80 70 60 50 40 30 20 10 0 Percent amino acid identity

Scoring an alignment • Possible relationships at a position • Match (identity) • Mismatch (substitution) • Gap (insertion/deletion, or indel) • A scoring matrix is used for matches and mismatches • Typically binary for nucleic acids • PAM, BLOSUM, & others for amino acids • Gap penalties must be “tacked on” • The alignment score is the sum of the scores at each position in the alignment, including gaps

LOD scores • Let pabbe the expected frequency of aligned residue pair a and b among all aligned residues • Let qa be the frequency of individual amino acid a

Point Accepted Mutation (PAM) matrices • Trained on alignments of closely related proteins • PAM1 implies 1 substitution per 100 amino acids • PAM250 = (PAM1)250 • Training set strongly biased toward globular proteins (more suitable matrices are available for more specific protein classes)

Choosing the right PAM matrix • Low PAM values discriminate among amino acids more dramatically • As the exponent increases, values within a row converge on amino acid frequencies • Choice of matrix typically depends on observed % identity • Classic chicken and egg problem • PAM250 corresponds to 20% identity • Assumes substitution rate is equal among sites (Poisson model), which we know to be false

BLOSUM scoring matrices • Trained on ungapped alignments (blocks) of divergent sequences to capture ‘long-term’ substitution patterns • Named BLOSUMx, where x (from 0 to 100) is the minimum percent identity of the sequences in the alignment. (The smaller the value of x, the more divergent the sequences). • Note that numbers have opposite meanings for PAM and BLOSUM! • BLOSUM62 is in wide use (eg it is the default in BLAST)

BLOSUM62 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 –4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 –4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 –4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 –4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 –4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 –4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 –4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 –4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 –4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 –4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 –4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 –4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 –4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 –1 1 5 -2 –2 0 -1 -1 0 –4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 –4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 –4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 –4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 –4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 –4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 BLOSUM62

Gap penalties • Naïve score • Each gap position receives independent penalty of d • Affine scores • Score depends on length of contiguous gap • Gap opening penalty d • Gap extension penalty e

Dynamic programming • A problem solving technique that employs recursion to solve a larger problem by solving a nested set of similar subproblems

Application to pairwise alignment • Imagine • We know the score for the first i-1 and j-1 residues of sequences x and y • i-1 and j-1 are aligned in the optimal alignment • There are three possibilities for the next position in the alignment • A gap in sequence x • A gap in sequence y • A match or mismatch between i and j • The maximum scoring alignment among these has to be in the optimal global alignment

Overview of algorithm • We can use this fact to recursively fill out a matrix containing the score F(i,j) of the optimal alignment for every pair of residues i and j • We also store a pointer to one of three previously filled out cells in the matrix, forming a path graph • The optimal global alignment must be a path within the path graph • It can be found by performing a traceback from the final cell in the recursion

Path Graph Diagonal moves represent matches and mismatches Horizontal and vertical moves represent gaps (indels)

xi, yj match xialigned to gap yj aligned to gap A G A G 3 3 2 3 A C A C 3 2 ?C ?- ?- ?G Needleman-Wunsch recursion Let s(C,C)=1, s(C,G)=s(G,C)=-2, and d=-1 y A C 3 3 x A C 3 x: AC y: AC

Initialize F(0,0) = 0 Use pointers to remember path Match=+1, Mismatch=-1, Gap=-2 arbitrary order of precedence: , ,  Needleman-Wunsch c g t g c g t c t gtg a t

Needleman-Wunsch: completed path graph c g t g c g t c t gtg a t

Needleman-Wunsch: traceback c g t g c g t c t gtg a t cgtgcg-t | || | | c-tgtgat optimal global alignment:

Complexity of algorithm • For sequences of length m and n • We consider 3 options at each cell • We store mn scores and pointers • We trace back no more than m+n steps • 3mn +m+n in time, 3mn in memory • O(mn) in both time and memory • If m=n, O(n2)

Smith-Waterman algorithm • Local pairwise alignment • Cells with negative scores are set to zero • Traceback from highest scoring cell • Stop when 0 is encountered • Also O(nm)

score≤ 0 xi, yj match xialigned to gap yj aligned to gap Smith-Waterman recursion

Smith-Waterman algorithm c g t g c g t c t gtg a t cgtgcgt ||| ctgtgat optimal local alignment:

Guaranteeing a local alignment • Use of SW algorithm alone does not guarantee “local” behavior • Sensitive to the scoring function (should be negative for random sequences) • Use of LOD scores help ensure this • Gap penalties must also be chosen carefully • If it is cheaper to insert a gap than to tolerate a mismatch, then gaps will be inserted where no alignment is possible

More realistic gap penalties • General gap function g(g) • Requires O(n3) operations

Affine gap penalties • Gap score: g(g)=-d-(g-1)e • Can be done in O(mn) • We need to keep track of three scores (and pointers) at each cell

A theme with variations • Overlapping or nested sequences • Do not penalize hanging ends • Repeated sequences • Asymmetric algorithm can find multiple local alignments of x in y or y in x • The basic idea admits many variations

Things to keep in mind about pairwise alignments • There may be multiple optima • Optimality is only guaranteed with respect to the scoring function – the alignment may still be biologically wrong! • O(mn) is still too big when n is the size of a major sequence database

Summary • Dotplots are an excellent visual tool to decide whether and what kind of alignment is appropriate • PAM and BLOSUM series matrices provide empirical LOD values for scoring alignments • Different flavors of alignment are produced by variants on a basic dynamic programming algorithm • Needleman-Wunsch for global alignments • Smith-Waterman for local alignments • Affine gap penalties balance biological realism with computational feasibility

Reading assignment • Nicholas et al. (2002) Strategies for multiple sequence alignment. Biotechniques 32, 572-591 (handout)

Pairwise alignment