Rapid Global Alignments

Rapid Global Alignments How to align genomic sequences in (more or less) linear time

Methods to CHAIN Local Alignments: Sparse Dynamic Programming, O(N log N)

The Problem: Find a Chain of Local Alignments • $(x, y) \to (x', y')$ requires $x < x'$ and $y < y'$ • Each local alignment has a weight • FIND the chain with highest total weight

Sparse DP for rectangle chaining • 1, …, N: rectangles • $(h_j, l_j)$: y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of chain ending in j • L: list of triplets $(l_j, V(j), j)$ • L is sorted by $l_j$ • L is implemented as a balanced binary tree (Figure: a rectangle with its y-extent marked h and l.)

Sparse DP for rectangle chaining Main idea: • Sweep through x-coordinates • To the right of b, anything chainable to a is chainable to b • Therefore, if V(b) > V(a), rectangle a is "useless": remove it • In L, keep rectangles j sorted with increasing $l_j$-coordinates $\Rightarrow$ sorted with increasing V(j) (Figure: rectangles a and b with scores V(a) and V(b).)

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
• When on the leftmost end of rectangle i, compute V(i):
  • j: rectangle in L with largest $l_j < h_i$
  • $V(i) = w(i) + V(j)$ (take $V(j) = 0$ if no such j exists)
• When on the rightmost end of i, possibly store V(i) in L:
  • j: rectangle in L with largest $l_j \le l_i$
  • If $V(i) > V(j)$:
    • INSERT $(l_i, V(i), i)$ in L
    • REMOVE all $(l_k, V(k), k)$ with $V(k) \le V(i)$ and $l_k \ge l_i$
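
The sweep above can be sketched in Python as follows. This is a minimal illustration, not the deck's implementation: the function name `chain_rectangles` and the tuple layout `(x_left, x_right, h, l, w)` are assumptions made here, and L is kept as parallel sorted Python lists searched with `bisect` rather than as a balanced binary tree. List deletion is linear in the worst case, so a real balanced tree would be needed to meet the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, w) with h < l the y-extent.
    Returns (best score, indices of the rectangles on the best chain)."""
    # One event per rectangle end; at equal x, left ends (0) sort before right ends (1).
    events = sorted((x, kind, i)
                    for i, (x1, x2, _, _, _) in enumerate(rects)
                    for x, kind in ((x1, 0), (x2, 1)))
    V = [0] * len(rects)
    pred = [None] * len(rects)
    ls, vs, ids = [], [], []          # list L: l_j, V(j), j, sorted by l_j
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:                 # left end: compute V(i)
            pos = bisect_left(ls, h)  # entries before pos have l_j < h_i
            V[i] = w + (vs[pos - 1] if pos else 0)
            pred[i] = ids[pos - 1] if pos else None
        else:                         # right end: possibly store V(i) in L
            pos = bisect_right(ls, l) # entries before pos have l_j <= l_i
            if pos == 0 or V[i] > vs[pos - 1]:
                start = bisect_left(ls, l)
                end = start
                # Dominated entries (l_k >= l_i, V(k) <= V(i)) are consecutive.
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1
                ls[start:end] = [l]
                vs[start:end] = [V[i]]
                ids[start:end] = [i]
    if not rects:
        return 0, []
    best = max(range(len(rects)), key=V.__getitem__)
    chain, j = [], best
    while j is not None:              # follow predecessor links back to the start
        chain.append(j)
        j = pred[j]
    return V[best], chain[::-1]

# Hypothetical input: rectangle 2 is heavy but cannot follow anything.
rects = [(0, 1, 0, 1, 5), (2, 3, 2, 3, 6), (4, 5, 0, 1, 10), (6, 7, 4, 5, 3)]
print(chain_rectangles(rects))
```

Keeping `vs` strictly increasing alongside `ls` is what makes a single binary search enough at each left end.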

Example (Figure: five rectangles on x and y axes, labeled with their weights, 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; x-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16.)

Time Analysis • Sorting the x-coords takes O(N log N) • Going through x-coords: N steps • Each of N steps requires O(log N) time: • Searching L takes log N • Inserting to L takes log N • All deletions are consecutive, so log N per deletion • Each element is deleted at most once: N log N for all deletions • Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Putting it All Together: Fast Global Alignment Algorithms
• FIND local alignments
• CHAIN local alignments

Program   FIND          CHAIN
GLASS     k-mers        hierarchical DP
MUMmer    suffix tree   sparse DP
AVID      suffix tree   hierarchical DP
LAGAN     CHAOS         sparse DP

LAGAN: Pairwise Alignment • FIND local alignments • CHAIN local alignments • DP restricted around chain

LAGAN • Find local alignments • Chain: O(N log N) L.I.S. (longest increasing subsequence) • Restricted DP
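
For intuition about the L.I.S. step, here is a hedged sketch of the simplest case: each anchor is reduced to a single point (x, y) with unit weight, so chaining becomes a longest strictly increasing subsequence problem. The function name `chain_points` and the point representation are assumptions of this sketch; the slides do not spell out LAGAN's own chaining in this form.

```python
from bisect import bisect_left

def chain_points(anchors):
    """Longest chain of (x, y) anchors with x and y both strictly increasing."""
    # Sort by x; ties broken by decreasing y so equal-x anchors never chain together.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tail_y = []      # tail_y[k]: smallest final y over all chains of length k + 1
    tail_idx = []    # index in pts of the anchor achieving tail_y[k]
    prev = [None] * len(pts)
    for n, (_, y) in enumerate(pts):
        k = bisect_left(tail_y, y)        # longest chain this anchor can extend
        prev[n] = tail_idx[k - 1] if k else None
        if k == len(tail_y):
            tail_y.append(y)
            tail_idx.append(n)
        else:
            tail_y[k] = y
            tail_idx[k] = n
    chain = []
    n = tail_idx[-1] if tail_idx else None
    while n is not None:                  # walk predecessor links backwards
        chain.append(pts[n])
        n = prev[n]
    return chain[::-1]

print(chain_points([(1, 1), (2, 5), (3, 2), (4, 3), (5, 4), (3, 9)]))
```

Each anchor costs one binary search, which gives O(N log N) after the initial sort.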

LAGAN: recursive call • What if a box is too large? • Recursive application of LAGAN, more sensitive word search

A trick to save on memory “necks” have tiny tracebacks …only store tracebacks

Multiple Sequence Alignments

Sequence Comparison • Introduction • Comparison • Homology -- Analogy • Identity -- Similarity • Pairwise -- Multiple • Scoring Matrices • Gap -- indel • Global -- Local • Manual alignment, dot plot • visual inspection • Dynamic programming • Needleman-Wunsch • exhaustive global alignment • Smith-Waterman • exhaustive local alignment • Multiple alignment • Database search • BLAST • FASTA

Sequence Comparison Multiple alignment (Multiple sequence alignment: MSA)

Overview • Definition • Scoring Schemes • Algorithms

Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve the pairwise alignments

Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model • More on phylogenetic models later (Figure: phylogenetic tree over sequences x, y, z, w, v with an unknown ancestor marked "?".)

Scoring Function • A comprehensive model would have too many parameters, too inefficient to optimize • Possible simplifications • Ignore phylogenetic tree • Statistically independent columns: $S(m) = G(m) + \sum_i S(m_i)$ • m: alignment matrix • $m_i$: column i of the alignment • G: function penalizing gaps

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped).
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

Sum Of Pairs (cont'd) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: $S(m) = \sum_{k<l} s(m_k, m_l)$ • $s(m_k, m_l)$: score of induced alignment (k, l)
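
The definition can be checked on the three-sequence example with a small sketch. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration; the slides do not fix a pairwise scoring scheme.

```python
from itertools import combinations

def induced_pair(row_a, row_b):
    """Pairwise alignment induced by two rows: drop columns where both have a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def pairwise_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score of a pairwise alignment under a simple linear scheme (assumed values)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        else:
            score += match if a == b else mismatch
    return score

def sum_of_pairs(rows):
    """S(m): sum over k < l of the score of the induced alignment (k, l)."""
    return sum(pairwise_score(*induced_pair(rows[k], rows[l]))
               for k, l in combinations(range(len(rows)), 2))

m = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pair(m[0], m[1]))  # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(m))           # 2 + (-1) + 4 = 5
```

A weighted variant would multiply each pair's score by a weight for that pair before summing.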

Sum Of Pairs (cont'd) • Heuristic way to incorporate evolution tree (Figure: tree with leaves Human, Mouse, Duck, Chicken.) • Weighted SOP: • $S(m) = \sum_{k<l} w_{kl} \, s(m_k, m_l)$ • $w_{kl}$: weight decreasing with distance

Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Find optimal consensus string m* to maximize $S(m) = \sum_i s(m^*, m_i)$ • $s(m_k, m_l)$: score of pairwise alignment (k, l)

Multiple Sequence Alignments Algorithms

Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity (Figure: dynamic programming lattice for three short protein sequences.)

Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity
Alignment of protein sequences with 200 amino acids using dynamic programming

# of sequences   CPU time (approx.)
2                1 sec
4                $10^4$ sec (2.8 hours)
5                $10^6$ sec (11.6 days)
6                $10^8$ sec (3.2 years)
7                $10^{10}$ sec (317 years)

Sequence Comparison Multiple alignment Approximate methods for MSA • Multidimensional dynamic programming (MSA, Lipman 1988) • Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG)) • Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others) • Iterative methods (e.g. PRRP, Gotoh 1996) • Statistical methods (e.g. Bayesian Hidden Markov Models)

Sequence Comparison Multiple alignment Multiple sequence alignment - Programs (Figure: taxonomy of MSA programs by method: progressive, multidimensional dynamic programming, iterative, genetic algorithms (GAs), HMMs; tree based vs. non tree based. Programs shown: Clustal, T-Coffee, DCA, MSA, Combalign, Dalign, OMA, Interalign, Prrp, SAGA, Sam, HMMER.)

Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity

Program    Seq type   Alignment      Method                  Comment
ClustalW   Prot/DNA   Global         Progressive             No format limitation; runs on Windows too
PileUp     Prot/DNA   Global         Progressive             Limited by the format; UNIX based
MultAlin   Prot/DNA   Global         Progressive/Iterative   Limited by the format
T-COFFEE   Prot/DNA   Global/local   Progressive             Can be slow

1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

1. Multidimensional Dynamic Programming
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k) = max{
  F(i-1, j-1, k-1) + S(x_i, x_j, x_k),
  F(i-1, j-1, k  ) + S(x_i, x_j, - ),
  F(i-1, j  , k-1) + S(x_i, - , x_k),
  F(i-1, j  , k  ) + S(x_i, - , - ),
  F(i  , j-1, k-1) + S( - , x_j, x_k),
  F(i  , j-1, k  ) + S( - , x_j, - ),
  F(i  , j  , k-1) + S( - , - , x_k) }
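
The three-sequence recurrence can be sketched directly. This is a hedged illustration: the column score S is taken to be a sum-of-pairs score with assumed values (match +1, mismatch -1, gap -1, gap against gap 0), and the names `align3` and `column_score` are chosen here, not taken from the slides.

```python
from itertools import product

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(a, b, c):
    """S(column): sum-of-pairs score of one column (an assumed choice of S)."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)

def align3(x, y, z):
    """Exact 3-sequence alignment by filling the full 3D table."""
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]  # the 7 neighbors
    F = {(0, 0, 0): 0}
    back = {}
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue                      # neighbor falls outside the cube
            col = (x[i - 1] if di else "-",
                   y[j - 1] if dj else "-",
                   z[k - 1] if dk else "-")
            cand = F[pi, pj, pk] + column_score(*col)
            if best is None or cand > best[0]:
                best = (cand, (di, dj, dk), col)
        F[i, j, k] = best[0]
        back[i, j, k] = best[1:]
    cols, cell = [], (len(x), len(y), len(z))
    while cell != (0, 0, 0):                  # trace back to the origin
        (di, dj, dk), col = back[cell]
        cols.append(col)
        cell = (cell[0] - di, cell[1] - dj, cell[2] - dk)
    cols.reverse()
    rows = ["".join(c[r] for c in cols) for r in range(3)]
    return F[len(x), len(y), len(z)], rows

score, rows = align3("GATTACA", "GATACA", "GTTACA")
print(score)
print("\n".join(rows))
```

`product` visits cells in lexicographic order, so every neighbor of a cell is already filled when the cell is reached.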

1. Multidimensional Dynamic Programming Running Time: • Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences • Neighbors/cell: $2^N - 1$ • Therefore: $O(2^N L^N)$
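
As a rough worked instance of this bound (an illustration only, counting neighbor evaluations and ignoring constant factors and boundary cells), take sequences of length $L = 200$:

```latex
\begin{align*}
N = 3:&\quad L^N = 200^3 = 8 \times 10^{6} \text{ cells},\quad 2^N - 1 = 7 \text{ neighbors}
      \;\Rightarrow\; 7 \cdot 8 \times 10^{6} = 5.6 \times 10^{7} \text{ evaluations} \\
N = 6:&\quad L^N = 200^6 = 6.4 \times 10^{13} \text{ cells},\quad 2^N - 1 = 63 \text{ neighbors}
      \;\Rightarrow\; 63 \cdot 6.4 \times 10^{13} \approx 4.0 \times 10^{15} \text{ evaluations}
\end{align*}
```

Each added sequence multiplies the table size by L and roughly doubles the neighbors per cell, which is why exact multidimensional DP stops being practical after a handful of sequences.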

2. Progressive Alignment
• Multiple Alignment is NP-complete
• Most used heuristic: Progressive Alignment
Algorithm:
• Align two of the sequences $x_i$, $x_j$
• Fix that alignment
• Align a third sequence $x_k$ to the alignment $x_i, x_j$
• Repeat until all sequences are aligned
Running Time: $O(N L^2)$
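
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are added in input order rather than by similarity, an alignment is aligned to the next sequence with a column-based Needleman-Wunsch using an assumed sum-of-pairs column score (match +1, mismatch -1, gap -1), and the names `align_alignments` and `progressive` are hypothetical.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Sum of pair scores between the characters of two alignment columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def align_alignments(A, B):
    """Align two fixed alignments (lists of equal-length gapped rows)."""
    cols_a = ["".join(col) for col in zip(*A)]
    cols_b = ["".join(col) for col in zip(*B)]
    n, m = len(cols_a), len(cols_b)
    gap_a, gap_b = "-" * len(A), "-" * len(B)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                          F[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                          F[i][j - 1] + column_score(gap_a, cols_b[j - 1]))
    out, i, j = [], n, m
    while i > 0 or j > 0:                     # traceback; earlier gaps stay fixed
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]):
            out.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + column_score(cols_a[i - 1], gap_b):
            out.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            out.append(gap_a + cols_b[j - 1]); j -= 1
    out.reverse()
    return ["".join(col[r] for col in out) for r in range(len(A) + len(B))]

def progressive(seqs):
    """Fix the alignment built so far and add one sequence at a time."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = align_alignments(aln, [s])
    return aln

print("\n".join(progressive(["GATTACA", "GATACA", "GTTACA"])))
```

Because earlier rows are only ever padded with whole gap columns, a gap placed early is never revisited.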

2. Progressive Alignment • When evolutionary tree is known: • Align closest first, in the order of the tree • Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) (Figure: tree with leaves x, y, z, w.)

CLUSTALW: progressive alignment CLUSTALW: most popular multiple protein alignment program Algorithm: • Find all $d_{ij}$: alignment distance between $x_i$ and $x_j$ • Construct a tree (neighbor-joining hierarchical clustering) • Align nodes in order of decreasing similarity • Plus a large number of heuristics

CLUSTALW & the CINEMA viewer

MLAGAN: progressive alignment of DNA • Given N sequences, phylogenetic tree • Align pairwise, in order of the tree (LAGAN) (Figure: tree with leaves Human, Baboon, Mouse, Rat.)

MLAGAN: main steps Given a collection of sequences, and a phylogenetic tree • Find local alignments for every pair of sequences x, y • Find anchors between every pair of sequences, similar to LAGAN anchoring • Progressive alignment • Multi-Anchoring based on reconciling the pairwise anchors • LAGAN-style limited-area DP • Optional refinement steps

MLAGAN: multi-anchoring To anchor the (X/Y) and (Z) alignments: (Figure: three panels, X vs. Z, Y vs. Z, and X/Y vs. Z.)

Heuristics to improve multiple alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …

Iterative Refinement
One problem of progressive alignment:
• Initial alignments are "frozen" even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT   (frozen!)
  z: GAACTG
  w: GTACTG
Once z and w are added, it is clear that the correct y is GA-CTT.

Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar $x_i$, $x_j$
2. Align $x_k$ most similar to $(x_i x_j)$
3. Repeat 2 until $(x_1 \ldots x_N)$ are aligned
4. For j = 1 to N, remove $x_j$ and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
5. Repeat 4 until convergence
Note: Guaranteed to converge

Iterative Refinement For each sequence y: • Remove y • Realign y (while rest fixed) (Figure: axes x, y, z; x and z are fixed at their projection while y is allowed to vary.)

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+ 3 matches)
  z: GAACTGA
  w: GTACTGA

Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
• Realigning any single $y_i$ changes nothing

Restricted MDP Here is another way to improve a multiple alignment: • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m Running Time: $O(2^N R^{N-1} L)$

Restricted MDP • Run MDP, restricted to radius R from m (Figure: axes x, y, z.) Running Time: $O(2^N R^{N-1} L)$

Restricted MDP
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
• Within radius 1 of the optimal $\Rightarrow$ Restricted MDP will fix it.
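
To make the idea of restricting DP to a radius R around an existing alignment concrete, here is a hedged sketch for the simplest case, N = 2. The names `path_of` and `restricted_nw`, the scoring values (match +1, mismatch -1, gap -1), and the use of a dictionary as a sparse table are all assumptions of this sketch; the multidimensional version would apply the same restriction in N dimensions.

```python
def path_of(aln_x, aln_y):
    """Lattice points (i, j) visited by an existing pairwise alignment."""
    i = j = 0
    pts = [(0, 0)]
    for a, b in zip(aln_x, aln_y):
        i += a != "-"          # advance in x unless this column is a gap in x
        j += b != "-"
        pts.append((i, j))
    return pts

def restricted_nw(x, y, path, R, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score using only cells within radius R of the path."""
    allowed = {(i + di, j + dj)
               for i, j in path
               for di in range(-R, R + 1)
               for dj in range(-R, R + 1)
               if 0 <= i + di <= len(x) and 0 <= j + dj <= len(y)}
    F = {}
    for i, j in sorted(allowed):   # lexicographic order: predecessors come first
        if (i, j) == (0, 0):
            F[0, 0] = 0
            continue
        cands = []
        if (i - 1, j - 1) in F:
            cands.append(F[i - 1, j - 1] + (match if x[i - 1] == y[j - 1] else mismatch))
        if (i - 1, j) in F:
            cands.append(F[i - 1, j] + gap)
        if (i, j - 1) in F:
            cands.append(F[i, j - 1] + gap)
        if cands:
            F[i, j] = max(cands)
    return F[len(x), len(y)]

# Start from the pairwise alignment of x and y used in the refinement example.
path = path_of("GAAGTTA", "GAC-TTA")
print(restricted_nw("GAAGTTA", "GACTTA", path, R=1))
```

With R = 0 only the original path is scored; increasing R widens the band, and the number of cells visited grows roughly linearly in R times the alignment length.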
