Multiple Sequence Alignment


Multiple Alignment versus Pairwise Alignment • Up until now we have only tried to align two sequences. • What about more than two? And what for? • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Generalizing the Notion of Pairwise Alignment • Alignment of 2 sequences is represented as a 2-row matrix • In a similar way, we represent alignment of 3 sequences as a 3-row matrix:
A T - G C G -
A - C G T - A
A T C A C - A
• Score: the more conserved columns, the better the alignment
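As a rough illustration of "more conserved columns, better alignment", the sketch below counts the columns in which all rows carry the same base. The function name `conserved_columns` and the choice to treat any column containing a gap as not conserved are assumptions made for this example, not a scoring scheme given on the slide.

```python
def conserved_columns(rows):
    """Count columns where every row holds the same non-gap character."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        1
        for col in zip(*rows)
        if "-" not in col and len(set(col)) == 1
    )


# The 3-row alignment from the slide, one string per row.
alignment = ["AT-GCG-", "A-CGT-A", "ATCAC-A"]
print(conserved_columns(alignment))
```

Only the first column (A, A, A) is fully conserved here, so a better alignment of the same three sequences would have to raise this count.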

Aligning Three Sequences • Same strategy as aligning two sequences • Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align • For global alignments, go from source to sink

2-D vs 3-D Alignment Grid [Figure: 2-D edit graph, with axes V and W, next to a 3-D edit graph]

2-D Cell versus 3-D Alignment Cell • In 2-D, 3 edges enter the corner (i,j) of each unit square • In 3-D, 7 edges enter the corner (i,j,k) of each unit cube

Architecture of 3-D Alignment Cell [Figure: unit cube with corner (i,j,k) and its seven predecessor vertices (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]

si-1,j-1,k-1 + (vi, wj, uk) si-1,j-1,k + (vi, wj, _ ) si-1,j,k-1 + (vi, _, uk) si,j-1,k-1 + (_, wj, uk) si-1,j,k + (vi, _ , _) si,j-1,k + (_, wj, _) si,j,k-1 + (_, _, uk) Multiple Alignment: Dynamic Programming cube diagonal: no indels • si,j,k = max • (x, y, z) is an entry in the 3-D scoring matrix face diagonal: one indel edge diagonal: two indels

Multiple Alignment: Running Time • For 3 sequences of length n, the run time is $7n^3$; $O(n^3)$ • For k sequences, build a k-dimensional Manhattan grid, with run time $(2^k - 1)\,n^k$; $O(2^k n^k)$ • Conclusion: the dynamic programming approach for aligning two sequences extends easily to k sequences, but it is impractical because the running time is exponential in k
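To get a feel for how fast this grows, the small sketch below evaluates the edge count (2^k - 1) * n^k for sequences of length n = 100. The function name `edge_evaluations` and the chosen values of n and k are illustrative only.

```python
def edge_evaluations(n, k):
    """Number of edges the k-dimensional DP evaluates: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k


for k in (2, 3, 5, 10):
    print(k, edge_evaluations(100, k))
```

For k = 2 this is 30,000 edge evaluations and for k = 3 it is 7,000,000; each additional sequence multiplies the work by roughly 2n.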

Practically speaking • Multiple alignment is a hard problem • Yet, it is of extreme practical importance • Comparing several species is a very common task • Doing this for the entire genome is an increasingly common demand! • Several heuristic-based algorithms have been developed that employ greedy, divide-and-conquer, dynamic programming approaches in various combinations • The algorithms we will see today are actually used by current multiple aligners

Multiple Alignment Induces Pairwise Alignments • Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
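The induced pairwise alignment is obtained by keeping two rows and dropping every column in which both rows have a gap. A minimal sketch, with `induced_pairwise` as a hypothetical helper name and the multiple alignment stored as a dictionary from sequence name to gapped row:

```python
def induced_pairwise(msa, a, b):
    """Project a multiple alignment onto rows a and b.

    Columns where both rows hold a gap carry no information about the
    pair, so they are removed.
    """
    kept = [
        (x, y)
        for x, y in zip(msa[a], msa[b])
        if not (x == "-" and y == "-")
    ]
    row_a = "".join(x for x, _ in kept)
    row_b = "".join(y for _, y in kept)
    return row_a, row_b


msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    print(a, b, induced_pairwise(msa, a, b))
```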


Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments • Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C    x: AC-GCTGG-C    y: AC-GC-GAG
y: ACGC--GAC    z: GCCGCA-GAG    z: GCCGCAGAG
• Can we construct a multiple alignment that induces them? NOT ALWAYS • Pairwise alignments may be inconsistent

Inferring Multiple Alignment from Pairwise Alignments • From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal • It is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences

Profile Representation of Multiple Alignment
   -  A  G  G  C  T  A  T  C  A  C  C  T  G
   T  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  C  C  A  -  -  -  G
   C  A  G  -  C  T  A  T  C  A  C  -  G  G
   C  A  G  -  C  T  A  T  C  G  C  -  G  G

A  0  1  0  0  0  0  1  0  0 .8  0  0  0  0
C .6  0  0  0  1  0  0 .4  1  0 .6 .2  0  0
G  0  0  1 .2  0  0  0  0  0 .2  0  0 .4  1
T .2  0  0  0  0  1  0 .6  0  0  0  0 .2  0
- .2  0  0 .8  0  0  0  0  0  0 .4 .8 .4  0
• In the past we were aligning a sequence against a sequence • Can we align a sequence against a profile? • Can we align a profile against a profile?
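A profile is just the per-column frequency of each symbol. The sketch below recomputes the profile of the five-row alignment on this slide; the function name `profile` and the dictionary layout (symbol mapped to a list of column frequencies) are choices made for this example.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Return {symbol: [frequency in column 0, column 1, ...]}."""
    n_rows = len(rows)
    prof = {c: [] for c in alphabet}
    for col in zip(*rows):
        counts = Counter(col)
        for c in alphabet:
            prof[c].append(counts[c] / n_rows)
    return prof


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(rows)
for symbol, freqs in prof.items():
    print(symbol, freqs)
```

Every column of the result sums to 1, which is a quick sanity check when building profiles by hand.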

Multiple Alignment: Greedy Approach • Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat • This is a heuristic greedy method
k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG…
k-1 sequences/profiles (u1 and u3 combined into a profile):
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Progressive Alignment • Progressive alignment is a variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments • Use profiles to compare sequences • Gaps in the consensus string are permanent

ClustalW • Popular multiple alignment tool today • ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently). • Three-step process 1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive Alignment guided by the tree

Step 1: Pairwise Alignment • Aligns each sequence against each other, giving a similarity matrix • Similarity = exact matches / sequence length (percent identity)
     v1   v2   v3   v4
v1    -
v2  .17    -
v3  .87  .28    -
v4  .59  .33  .62    -
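Given one pairwise alignment, the similarity entry can be computed as below. The slide does not say which length goes in the denominator, so this sketch assumes the number of alignment columns; `percent_identity` is a hypothetical helper, not ClustalW's own routine.

```python
def percent_identity(row_a, row_b):
    """Exact matches divided by alignment length (assumed denominator)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(
        1 for a, b in zip(row_a, row_b) if a == b and a != "-"
    )
    return matches / len(row_a)


# One gapped pairwise alignment of two short sequences.
print(percent_identity("ACGCGG-C", "ACGC-GAG"))
```

Filling the whole matrix means running this on the pairwise alignment of every pair of sequences.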

Step 2: Guide Tree • Create Guide Tree using the similarity matrix • ClustalW uses the “neighbor-joining method” • Guide tree roughly reflects evolutionary relations

Step 2: Guide Tree (cont’d)
     v1   v2   v3   v4
v1    -
v2  .17    -
v3  .87  .28    -
v4  .59  .33  .62    -
[Figure: guide tree with leaves in the order v1, v3, v4, v2]
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
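The merge order on this slide can be reproduced with a very simple clustering rule. Note the substitution: the sketch below uses average-linkage agglomeration (repeatedly join the two clusters with the highest mean similarity) as a stand-in, not the neighbor-joining method that ClustalW uses. The function name `guide_tree` and the nested-tuple tree format are assumptions for this example.

```python
from itertools import combinations


def guide_tree(names, sim):
    """Greedy average-linkage tree; sim maps frozenset({a, b}) to a score."""
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]

    def avg_sim(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2), key=lambda pair: avg_sim(*pair)
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Similarity matrix from the slide.
sim = {
    frozenset(pair): score
    for pair, score in {
        ("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
        ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62,
    }.items()
}
tree = guide_tree(["v1", "v2", "v3", "v4"], sim)
print(tree)
```

On this matrix the rule first joins v1 and v3 (0.87), then adds v4 (mean similarity 0.605), and finally v2, which matches the order of the three alignment steps listed on the slide.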

Step 3: Progressive Alignment • Start by aligning the two most similar sequences • Following the guide tree, add in the next sequences, aligning to the existing alignment • Insert gaps as necessary
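One progressive step, adding a sequence to an existing alignment, can be sketched as a 2-D dynamic program over alignment columns and sequence positions. The scoring here (average pairwise match = 1, mismatch = -1, gap = -1 against each column) is an assumption for illustration and is much simpler than ClustalW's weighted scheme; `add_to_alignment` is a hypothetical name. Existing columns are never changed, only padded with new all-gap columns, which reflects the rule that gaps are permanent.

```python
def add_to_alignment(rows, seq, match=1, mismatch=-1, gap=-1):
    """Align seq to a fixed alignment; return the old rows plus the new row."""
    n_cols, n = len(rows[0]), len(seq)

    def col_score(i, x):
        # Average pairwise score of symbol x against alignment column i.
        total = 0
        for row in rows:
            c = row[i]
            if c == "-" and x == "-":
                continue
            if c == "-" or x == "-":
                total += gap
            elif c == x:
                total += match
            else:
                total += mismatch
        return total / len(rows)

    s = [[0.0] * (n + 1) for _ in range(n_cols + 1)]
    back = [[""] * (n + 1) for _ in range(n_cols + 1)]
    for i in range(1, n_cols + 1):
        s[i][0] = s[i - 1][0] + col_score(i - 1, "-")
        back[i][0] = "col"
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + gap
        back[0][j] = "seq"
    for i in range(1, n_cols + 1):
        for j in range(1, n + 1):
            s[i][j], back[i][j] = max(
                (s[i - 1][j - 1] + col_score(i - 1, seq[j - 1]), "both"),
                (s[i - 1][j] + col_score(i - 1, "-"), "col"),
                (s[i][j - 1] + gap, "seq"),
            )

    # Trace back from the sink, building all rows right to left.
    old = [[] for _ in rows]
    new = []
    i, j = n_cols, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move in ("both", "col"):
            for k, row in enumerate(rows):
                old[k].append(row[i - 1])
            i -= 1
        else:
            for chars in old:
                chars.append("-")  # new all-gap column in the old rows
        if move in ("both", "seq"):
            new.append(seq[j - 1])
            j -= 1
        else:
            new.append("-")
    return ["".join(reversed(r)) for r in old] + ["".join(reversed(new))]


print(add_to_alignment(["AT", "AT"], "ACT"))
```

Following the guide tree then amounts to calling a step like this (or its profile-to-profile analogue) once per internal node.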

Threaded Blockset Aligner (TBA) Blanchette et al. 2004 • A relatively new multiple alignment method • Proposes a new way to represent sequence similarity among multiple sequences • slightly different from usual multiple alignment • more suited for whole genome comparisons • Evaluation of alignment

Drawbacks of usual alignment • To align several full genomes, one common approach is to align a “reference genome” (e.g., human) to all others • Simple approach, easy to visualize • Drawback: if some sequence is conserved in a subset of species but absent from the reference, the alignment will not show it • Drawback: the choice of reference makes a difference; different references may result in different alignments

New approach • Instead of producing a standard multiple alignment (one large matrix of characters and gaps)… • … Produce a set of “blocks”. Each block is a local alignment of some or all of the sequences

Figure 1 (A) Blocks (alignments) of a hypothetical threaded blockset for sequences h (400 bp), m (400 bp) and r (350 bp) (B) Projection of the threaded blockset onto m Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715 Set of blocks


Definition: Blockset • Block: rectangular array (matrix) of symbols (bases and dashes). • Removing dashes from any row gives a substring of an original sequence • No column can be dashes only • A set of blocks is a blockset

Definition: Ref-blockset • “S-Ref” blockset: A blockset where every block has a row for species S. • S is called the reference for this blockset • Every position of the original sequence from S is in at most one block (not repeated) • Example: “human-ref blockset”, where a segment of the human chromosome is the reference

Definition: Threaded blockset • A sequence S “threads” a blockset if every position in S occurs precisely once in some block of the blockset • By definition, S threads an “S-ref blockset” • If a blockset is threaded by each of the original sequences, it is a “threaded blockset”
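The threading condition is easy to check mechanically. In the sketch below a block is reduced to the intervals it covers: a dictionary from sequence name to a half-open `(start, end)` range in that sequence. This representation, which ignores the gapped rows and strand, and the names `threads` and `is_threaded_blockset` are assumptions made for illustration, not the data structures used by TBA.

```python
def threads(blockset, name, length):
    """True if every position of sequence `name` lies in exactly one block."""
    covered = [0] * length
    for block in blockset:
        if name in block:
            start, end = block[name]
            for pos in range(start, end):
                covered[pos] += 1
    return all(count == 1 for count in covered)


def is_threaded_blockset(blockset, lengths):
    """True if the blockset is threaded by each of the original sequences."""
    return all(threads(blockset, name, n) for name, n in lengths.items())


# Toy example: h has 6 positions, m has 5.
lengths = {"h": 6, "m": 5}
blockset = [
    {"h": (0, 4), "m": (0, 3)},  # block aligning h[0:4] with m[0:3]
    {"h": (4, 6)},               # h-only block
    {"m": (3, 5)},               # m-only block
]
print(is_threaded_blockset(blockset, lengths))
```

Dropping the last block leaves positions 3 and 4 of m uncovered, so m no longer threads the blockset.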

Threaded blockset advantages • “Project” a threaded blockset onto any of the original sequences: • pick the blocks having that sequence • order them according to position in sequence • no inconsistency when projecting to different reference sequences • Threaded blockset allows for evolutionary operations like inversions and duplications • Traditional multiple alignment does not accommodate these
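Projection follows the two steps listed above directly: select the blocks that contain the chosen sequence, then sort them by their position in it. As a simplifying assumption, each block is represented here only by the half-open interval it covers in each sequence; `project` is a hypothetical helper name.

```python
def project(blockset, name):
    """Blocks containing `name`, ordered by start position in that sequence."""
    chosen = [block for block in blockset if name in block]
    return sorted(chosen, key=lambda block: block[name][0])


blockset = [
    {"m": (3, 5)},               # m-only block
    {"h": (4, 6)},               # h-only block
    {"h": (0, 4), "m": (0, 3)},  # block aligning h[0:4] with m[0:3]
]
for block in project(blockset, "m"):
    print(block)
```

Because the same blockset is reused for every projection, projecting onto h instead of m cannot produce a contradictory alignment of the shared block.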

Figure 2 (A) Alignments between the chloroplast genomes of Arabidopsis thaliana and Oenothera elata (evening primrose) Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715

Threaded Blockset Aligner (TBA) • Finds threaded blocksets • Uses an alignment algorithm called MULTIZ • MULTIZ uses ideas similar to progressive alignment


Figure 5 Pictorial representation of an application of MULTIZ Mathieu Blanchette et al. Genome Res. 2004; 14: 708-715 • M is a HUMAN-ref blockset • N is a COW-ref blockset • G is a pairwise HUMAN-ref blockset • MULTIZ combines M and N, using G as a guide • It finds a segment “h” of HUMAN in M and a segment “c” of COW in N, such that “h” and “c” are aligned according to G • It then aligns “h” and “c” using dynamic programming, using G as a guide

Evaluation • Evaluating alignment programs is very difficult • What is the benchmark here? • We have not witnessed the process of evolution, so we cannot say for certain what the true alignment of “extant” sequences should be • One approach: “simulate” evolution

Simulating evolution • Generate a random sequence and introduce realistic evolutionary changes to it, along branches of an assumed phylogeny • Evolutionary changes made, and their rates, must be realistic • Substitutions, insertions, deletions, insertion & deletion rates, duplications, introduction of repeat elements, etc.
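A toy version of such a simulation is sketched below. Each base carries a site identifier, so two bases in different leaf sequences are truly homologous exactly when their identifiers match; that gives the known alignment for free. Everything here is an assumption for illustration: the rates are arbitrary placeholders rather than realistic values, all branches use the same rates, duplications and repeat elements are not modelled, and the names `make_root`, `evolve`, `simulate` and `true_pairs` are invented.

```python
import random
from itertools import count


def make_root(length, rng, ids):
    """Random root sequence as a list of (site_id, base) pairs."""
    return [(next(ids), rng.choice("ACGT")) for _ in range(length)]


def evolve(seq, rng, ids, sub=0.1, ins=0.02, dele=0.02):
    """Apply substitutions, insertions and deletions along one branch."""
    out = []
    for site_id, base in seq:
        r = rng.random()
        if r < dele:
            continue  # deletion: the site disappears
        if r < dele + sub:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append((site_id, base))
        if rng.random() < ins:
            out.append((next(ids), rng.choice("ACGT")))  # new site
    return out


def simulate(tree, seq, rng, ids, **rates):
    """tree: nested tuples of leaf names. Returns {leaf: sequence}."""
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child in tree:
        child_seq = evolve(seq, rng, ids, **rates)
        leaves.update(simulate(child, child_seq, rng, ids, **rates))
    return leaves


def true_pairs(a, b):
    """Index pairs (i, j) whose bases descend from the same site."""
    pos_b = {sid: j for j, (sid, _) in enumerate(b)}
    return {(i, pos_b[sid]) for i, (sid, _) in enumerate(a) if sid in pos_b}


rng = random.Random(0)
ids = count()
root = make_root(30, rng, ids)
leaves = simulate((("human", "mouse"), "cow"), root, rng, ids)
for name, seq in leaves.items():
    print(name, "".join(base for _, base in seq))
print(len(true_pairs(leaves["human"], leaves["mouse"])))
```

Stripping the identifiers gives the plain sequences to hand to an aligner, while `true_pairs` supplies the reference to compare its output against.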

Evaluating alignment • Once the simulation is done, take all the sequences at the leaf nodes of the phylogeny (the simulation started at the root) • Align these sequences using the software under evaluation • Compare the computed alignment with the known (“true”) alignment
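One simple way to make "compare" concrete is to measure what fraction of the truly aligned residue pairs the computed alignment recovers. This particular measure and the names `aligned_pairs` and `pair_recall` are assumptions chosen for illustration; the slides do not specify the comparison metric.

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j): residue i of a is in the same column as residue j of b."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def pair_recall(computed, true):
    """Fraction of truly aligned pairs that the computed alignment recovers."""
    truth = aligned_pairs(*true)
    if not truth:
        raise ValueError("true alignment has no aligned pairs")
    return len(aligned_pairs(*computed) & truth) / len(truth)


true = ("AC-GT", "ACAGT")
computed = ("ACG-T", "ACAGT")
print(pair_recall(computed, true))
```

Here the computed alignment misplaces the G, so it recovers 3 of the 4 true pairs. For a multiple alignment the same measure can be averaged over all induced pairwise alignments.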
