CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 8: Multiple Sequence Alignment

Roadmap • Homework? • Review of last lecture • Multiple sequence alignment

Homework • #1: dsDNA => mRNA => protein [Figure: double-stranded DNA with its coding strand and template strand; the template strand is transcribed to mRNA, and the mRNA is translated to protein using the genetic code]

Problem #2 • For two strings of lengths m and n, the number of alignments is equal to the number of paths from (0, 0) to (m, n) • The number of ways to get to (i, j) depends on the number of ways to get to its preceding neighbors (i-1, j), (i, j-1) and (i-1, j-1)
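The path-counting idea above can be sketched as a small dynamic program. This is a minimal illustration rather than the official homework solution; the function name `count_alignments` is made up for this example.

```python
def count_alignments(m, n):
    """Count paths from (0, 0) to (m, n) using right, down and diagonal steps."""
    # Along the first row and first column there is exactly one path (gaps only).
    ways = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A path reaches (i, j) from one of its three preceding neighbors.
            ways[i][j] = ways[i - 1][j] + ways[i][j - 1] + ways[i - 1][j - 1]
    return ways[m][n]


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, count_alignments(size, size))
```

For two strings of length 2 this gives 13 alignments, and the count grows very quickly with length.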

Problem #3 • Similar to problem #2 • But there are some limitations on certain paths • (i-1, j-1)→(i-1, j)→(i, j) is illegal • So is (i-1, j-1)→(i, j-1)→(i, j) • How many ways to get to (i-1, j) without using (i-1, j-1)→(i-1, j)? • How many ways to get to (i, j-1) without using (i-1, j-1)→(i, j-1)?
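One way to handle the restriction is to remember which kind of step was used to arrive at each cell. The sketch below is an assumption about how the count could be organized, not the official solution: it keeps three tables, and the names `M`, `X`, `Y` and `count_restricted_alignments` are introduced only for this example.

```python
def count_restricted_alignments(m, n):
    """Count paths to (m, n) in which a step in i never directly follows a step in j, or vice versa."""
    # M: arrived by a diagonal step; X: arrived from (i-1, j); Y: arrived from (i, j-1).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    X = [[0] * (n + 1) for _ in range(m + 1)]
    Y = [[0] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 1  # the empty path
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                # A diagonal step may follow any kind of step.
                M[i][j] = M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1]
            if i > 0:
                # (i-1, j-1) -> (i-1, j) -> (i, j) is illegal, so Y[i-1][j] is excluded.
                X[i][j] = M[i - 1][j] + X[i - 1][j]
            if j > 0:
                # (i-1, j-1) -> (i, j-1) -> (i, j) is illegal, so X[i][j-1] is excluded.
                Y[i][j] = M[i][j - 1] + Y[i][j - 1]
    return M[m][n] + X[m][n] + Y[m][n]


if __name__ == "__main__":
    print(count_restricted_alignments(2, 2))
```

For (1, 1) only the diagonal path survives, because both two-step paths are exactly the illegal patterns.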

Problem #4 • Implementation is easy • Histogram: how you bin it may affect your results • bin for each discrete value you observed in your scores • Scores related to base frequency? • Scores differ between global and local alignments? • Score distribution?

BLAST • Main idea: • Construct a dictionary of all the words in the query • Initiate an alignment between words with alignment score ≥ T • Alignment: ungapped extensions until the score falls below a statistical threshold • Output: all local alignments with score > statistical threshold [Figure: the query's word dictionary is used to scan the database (DB)]

BLAST • Example: k = 4, T = 4 • The matching word GGTC initiates an alignment • Extension to the left and right with no gaps until alignment falls < 50% • Output:
GTAAGGTCC
GTTAGGTCC
[Figure: dot matrix of the query A C G A A G T A A G G T C C A G T against the database sequence; database axis letters as extracted: C C C T T C C T G G A T T G C G A]
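The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: seeds are exact k-letter matches only, the stopping rule is a score drop-off (`xdrop`) swapped in for the slide's 50% rule, and the match/mismatch scores of +1/-1 are assumptions. The database string is a hypothetical one chosen so that it contains the slide's output GTTAGGTCC.

```python
from collections import defaultdict


def find_seeds(query, db, k):
    """Return (query_pos, db_pos) for every exact k-letter word shared by query and db."""
    index = defaultdict(list)  # dictionary of all words in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(db) - k + 1):  # scan the database
        for i in index.get(db[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, db, qpos, dpos, step, match, mismatch, xdrop):
    """Walk in one direction without gaps; return (best gain, length at the best gain)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qpos < len(query) and 0 <= dpos < len(db):
        score += match if query[qpos] == db[dpos] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > xdrop:  # fell too far below the best score: stop
            break
        qpos += step
        dpos += step
    return best, best_len


def extend_seed(query, db, qi, dj, k, match=1, mismatch=-1, xdrop=2):
    """Ungapped extension of a seed to the left and right (assumed scoring and drop-off)."""
    left_gain, left_len = extend_one_way(query, db, qi - 1, dj - 1, -1, match, mismatch, xdrop)
    right_gain, right_len = extend_one_way(query, db, qi + k, dj + k, 1, match, mismatch, xdrop)
    length = left_len + k + right_len
    q_start, d_start = qi - left_len, dj - left_len
    score = k * match + left_gain + right_gain
    return query[q_start:q_start + length], db[d_start:d_start + length], score


if __name__ == "__main__":
    query = "ACGAAGTAAGGTCCAGT"      # query from the slide
    db = "AGCGTTAGGTCCTTCCC"         # hypothetical database string for illustration
    print(find_seeds(query, db, 4))
    print(extend_seed(query, db, 9, 7, 4))
```

With these assumed settings, extending the GGTC seed recovers the pair GTAAGGTCC / GTTAGGTCC shown in the example output.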

Gapped BLAST • Added features: • Pairs of words can initiate alignment • Extensions with gaps in a band around anchor • Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
[Figure: dot matrix of the query A C G A A G T A A G G T C C A G T against the database sequence; database axis letters as extracted: C T G A T C C T G G A T T G C G A]

Advantages • Fast!!!! • A few minutes to search a database of $10^{11}$ bases • Disadvantages • Sensitivity may be low • Often misses weak homologies

New improvement • Make it even faster • But even less sensitive • Mainly for aligning very similar sequences or really long sequences • E.g. whole genome vs whole genome • Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds

Things we’ve covered so far • Global alignment • Needleman-Wunsch and variants • Improvement on space and time • Local Alignment • Smith-Waterman • Heuristic algorithms • BLAST families • Statistics for sequence alignment • Extreme value distribution

Commonality • They all deal with aligning two sequences • Pair-wise sequence alignment

Today • Aligning multiple sequences all together • Multiple sequence alignment

Motivation • A faint similarity between two sequences becomes very significant if present in many • Protein domains • Motifs responsible for gene regulation

Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment!

Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor [Figure: phylogenetic tree (evolution tree) with nodes x, y, z, w, v and an unknown ancestor marked "?"]

Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: • Ignore phylogenetic tree • Compute from pair-wise scores • Based on sum of all pair-wise scores • Based on scores with a consensus sequence

First assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column

Scoring Function: Sum Of Pairs • Definition: induced pairwise alignment: a pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped) • Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
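Inducing a pairwise alignment amounts to taking two rows and dropping the columns where both have a gap. A minimal sketch, with the helper name `induced_pairwise` invented for this example:

```python
def induced_pairwise(msa, a, b):
    """Return the pairwise alignment of rows a and b induced by the multiple alignment."""
    # Keep every column except those where both rows show a gap.
    kept = [(p, q) for p, q in zip(msa[a], msa[b]) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)


if __name__ == "__main__":
    msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
    for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
        print(a, b, induced_pairwise(msa, a, b))
```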

Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments • $S(m) = \sum_{k<l} s(m_k, m_l)$ • $s(m_k, m_l)$: score of induced alignment (k, l)

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
• Column 1: (A,A) + (A,G) x 2 = -1 • Column 2: (C,C) x 3 = 3 • Column 8: (-,A) x 2 + (A,A) = -1 • Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
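The column-by-column arithmetic in this example can be reproduced with a short script. The pair scores are not stated on the slide; the values used here (match = 1, mismatch = -1, residue against gap = -1, gap against gap = 0) are an assumption inferred from the column totals.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols (assumed scoring, inferred from the example)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(column):
    """Sum the scores of all pairs of symbols within one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(rows):
    """Columns are independent, so the alignment score is the sum over columns."""
    return sum(column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
    print([column_score(col) for col in zip(*rows)])
    print(sum_of_pairs(rows))
```

Summing per column gives the same total as summing the three induced pairwise alignment scores, because each pair of rows contributes once per column.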

Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • $S(m) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$ • $w_{kl}$: weight decreasing with distance [Figure: tree with leaves Human, Mouse, Duck, Chicken]

Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Consensus sequence: • Find optimal consensus string $m^*$ to maximize $S(m) = \sum_i s(m^*, m_i)$ • $s(m_k, m_l)$: score of pairwise alignment (k, l)

Multiple Sequence Alignments Algorithms

Multidimensional Dynamic Programming (MDP) • Generalization of Needleman-Wunsch: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses an N-dimensional matrix • As opposed to a two-dimensional array • Entry $F(i_1, \ldots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \ldots, s_N[1..i_N]$ • $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum is taken over all neighbors of the cell

Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • $2^3 - 1 = 7$ neighbors/cell of (i,j,k): (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)
$$F(i,j,k) = \max \begin{cases} F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\ F(i-1,j-1,k) + S(x_i, x_j, -) \\ F(i-1,j,k-1) + S(x_i, -, x_k) \\ F(i,j-1,k-1) + S(-, x_j, x_k) \\ F(i-1,j,k) + S(x_i, -, -) \\ F(i,j-1,k) + S(-, x_j, -) \\ F(i,j,k-1) + S(-, -, x_k) \end{cases}$$
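The three-sequence recurrence can be written out directly. The sketch below computes only the optimal score (no traceback) and assumes sum-of-pairs column scoring with match = 1, mismatch = -1, gap = -1 and gap against gap = 0; the function names are made up for this example.

```python
from itertools import combinations, product


def column_score(column, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one column (assumed scoring values)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for aligning three sequences by 3D dynamic programming."""
    # The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1) or takes a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    F = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = float("-inf")
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # neighbor lies outside the cube
                    column = (x[pi] if di else "-",
                              y[pj] if dj else "-",
                              z[pk] if dk else "-")
                    best = max(best, F[(pi, pj, pk)] + column_score(column))
                F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]


if __name__ == "__main__":
    print(align3_score("ACG", "ACG", "ACG"))
```

Every cell looks at up to seven neighbors, which is where the exponential factor in the running time comes from.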

Multidimensional Dynamic Programming (MDP) • Running Time: • Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences • Neighbors/cell: $2^N - 1$ • Therefore: $O(2^N L^N)$

Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Practical for about 6 sequences of length about 200-300.

Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment • Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned • Running Time: $O(N L^2)$ • Each alignment takes $O(L^2)$ • Repeat N times

Progressive Alignment • When evolutionary tree is known: • Align closest first, in the order of the tree • Example: order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) [Figure: tree with leaves x, y, z, w]

Progressive Alignment: CLUSTALW • CLUSTALW: most popular multiple protein alignment tool • Algorithm: • Find all dij: alignment distance between xi and xj • High alignment score => short distance • Construct a tree (neighbor-joining hierarchical clustering; to be discussed in a future lecture) • Align nodes in order of decreasing similarity • Plus a large number of heuristics
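The first step, computing all pairwise distances, can be sketched as follows. Plain edit distance is used here as a stand-in for an alignment-based distance; that choice is an assumption made for illustration and is not how CLUSTALW itself defines dij. The four short sequences are the ones used in the CLUSTALW example slides.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance with unit costs (stand-in for an alignment distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Distances for every unordered pair of named sequences."""
    return {(p, q): edit_distance(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}


if __name__ == "__main__":
    seqs = {"S1": "ALSK", "S2": "TNSD", "S3": "NASK", "S4": "NTSD"}
    dist = distance_matrix(seqs)
    for pair, d in dist.items():
        print(pair, d)
    # The pair with the smallest distance would be aligned first.
    print(min(dist, key=dist.get))
```

With this simple distance, (S1, S3) and (S2, S4) both come out at distance 2, but so does (S3, S4), so a real guide tree would need a finer-grained distance to break the tie.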

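CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD

CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Distance matrix

CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD [Figure: guide tree with leaves s1, s3, s2, s4]

CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Aligned pair (s1, s3):
-ALSK
NA-SK
[Figure: guide tree with leaves s1, s3, s2, s4]

CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Aligned pair (s1, s3):
-ALSK
NA-SK
• Aligned pair (s2, s4):
-TNSD
NT-SD
[Figure: guide tree with leaves s1, s3, s2, s4]

CLUSTALW example • Question: how do you align two alignments? • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Aligned pair (s1, s3):
-ALSK
NA-SK
• Aligned pair (s2, s4):
-TNSD
NT-SD
• Combined alignment (s1, s2, s3, s4):
-ALSK
-TNSD
NA-SK
NT-SD
[Figure: guide tree with leaves s1, s3, s2, s4]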

Aligning two alignments • You can treat each column in an alignment as a single letter • Remember that in the gene finder case, we aligned three nucleic acids at a time • How do we score it? • Naïve: compute Sum of Pairs • Better: only compute the cross terms • For the last column, we already have (K, K) and (D, D) • Need to add 4 x (K, D), one for each pairing of a K with a D
-ALSK
NA-SK
-TNSD
NT-SD
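Treating each column as a single letter, two alignments can be aligned with ordinary two-dimensional dynamic programming in which a column pair is scored by its cross terms only. The sketch below assumes match = 1, mismatch = -1, gap = -1 and gap against gap = 0; these values and the function names are illustrative assumptions, and a real tool would use a substitution matrix.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols (assumed scoring values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def cross_score(col_a, col_b):
    """Only the cross terms: every symbol of col_a against every symbol of col_b."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


def align_alignments(A, B):
    """Needleman-Wunsch over columns; returns (cross-term score, merged rows)."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + cross_score(cols_a[i - 1], gap_b), "up"
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + cross_score(gap_a, cols_b[j - 1]), "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (F[i - 1][j] + cross_score(cols_a[i - 1], gap_b), "up"),
                (F[i][j - 1] + cross_score(gap_a, cols_b[j - 1]), "left"),
            ]
            F[i][j], back[i][j] = max(choices, key=lambda c: c[0])
    merged, i, j = [], n, m
    while i > 0 or j > 0:  # traceback, building merged columns right to left
        move = back[i][j]
        if move == "diag":
            merged.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            merged.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1]); j -= 1
    merged.reverse()
    return F[n][m], ["".join(row) for row in zip(*merged)]


if __name__ == "__main__":
    score, rows = align_alignments(["-ALSK", "NA-SK"], ["-TNSD", "NT-SD"])
    print(score)
    print("\n".join(rows))
```

For the final column of the example, `cross_score(("K", "K"), ("D", "D"))` adds one (K, D) term for each of the four K/D pairings, while (K, K) and (D, D) are already accounted for inside the two existing alignments.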

CLUSTALW & the CINEMA viewer

Iterative Refinement • Problems with progressive alignment: • Depends on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes • Example (the x, y alignment is frozen!):
x: GAAGTT
y: GAC-TT
z: GAACTG
w: GTACTG
• Now clear: correct y should be GA-CTT

Iterative Refinement • Algorithm (Barton-Sternberg): 1. Align most similar xi, xj 2. Align xk most similar to (xixj) 3. Repeat 2 until (x1…xN) are aligned 4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN 5. Repeat 4 until convergence • Note: guaranteed to converge • Running time: $O(k N L^2)$, k: number of iterations
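Steps 4 and 5 can be sketched as a loop that removes one sequence, realigns it against the remaining rows (which stay fixed), and keeps the result only if the sum-of-pairs score improves. This is an illustrative sketch, not the Barton-Sternberg implementation: the scoring values (match = 1, mismatch = -1, gap = -1), the accept-only-if-better rule, the `max_rounds` cap and all function names are assumptions. It expects at least two non-empty rows.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """Sum-of-pairs score of a whole alignment."""
    return sum(pair_score(a, b) for col in zip(*rows)
               for i, a in enumerate(col) for b in col[i + 1:])


def realign(rows, idx):
    """Remove rows[idx] and realign it to the other rows, which stay fixed."""
    seq = rows[idx].replace("-", "")
    rest = [r for k, r in enumerate(rows) if k != idx]
    cols = [c for c in zip(*rest) if any(ch != "-" for ch in c)]  # drop all-gap columns
    gap_col = ("-",) * len(rest)
    n, m = len(seq), len(cols)

    def s(ch, col):  # cross terms between one symbol and one column
        return sum(pair_score(ch, c) for c in col)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + s(seq[i - 1], gap_col), "seq"
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + s("-", cols[j - 1]), "col"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + s(seq[i - 1], cols[j - 1]), "both"),
                (F[i - 1][j] + s(seq[i - 1], gap_col), "seq"),
                (F[i][j - 1] + s("-", cols[j - 1]), "col"),
                key=lambda t: t[0])
    new_seq, new_cols, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback
        move = back[i][j]
        if move == "both":
            new_seq.append(seq[i - 1]); new_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif move == "seq":
            new_seq.append(seq[i - 1]); new_cols.append(gap_col); i -= 1
        else:
            new_seq.append("-"); new_cols.append(cols[j - 1]); j -= 1
    new_seq.reverse()
    new_cols.reverse()
    new_rest = ["".join(r) for r in zip(*new_cols)]
    return new_rest[:idx] + ["".join(new_seq)] + new_rest[idx:]


def iterative_refinement(rows, max_rounds=10):
    """Repeat the remove-and-realign pass until no sequence improves the score."""
    rows = list(rows)
    for _ in range(max_rounds):
        improved = False
        for idx in range(len(rows)):
            candidate = realign(rows, idx)
            if sp_score(candidate) > sp_score(rows):  # accept strict improvements only
                rows, improved = candidate, True
        if not improved:
            break
    return rows


if __name__ == "__main__":
    start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
    print("\n".join(realign(start, 1)))
```

Accepting only strict improvements makes the score increase monotonically, which is one simple way to see why such a loop must stop.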

Iterative Refinement (cont’d) • For each sequence y: • Remove y • Realign y (while the rest stay fixed) [Figure: alignment path in the cube with axes x, y, z and its projection; y is allowed to vary while x and z stay fixed]

Iterative Refinement • Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
• After realigning y (+ 3 matches):
x: GAAGTTA
y: G-ACTTA
z: GAACTGA
w: GTACTGA

Iterative Refinement • Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
• Realigning any single yi changes nothing

Restricted MDP • Similar to bounded DP in pair-wise alignment • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m • Running Time: $O(2^N R^{N-1} L)$ [Figure: band of radius R around the alignment path in the cube with axes x, y, z]

Restricted MDP
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
• Within radius 1 of the optimal => Restricted MDP will fix it.

Other approaches • Profile Hidden Markov Models • Statistical learning methods • Will discuss in future

Multiple alignment tools • Clustal W (Thompson, 1994) • Most popular • PRRP (Gotoh, 1993) • HMMT (Eddy, 1995) • DIALIGN (Morgenstern, 1998) • T-Coffee (Notredame, 2000) • MUSCLE (Edgar, 2004) • Align-m (Walle, 2004) • PROBCONS (Do, 2004)
