Rapid Global Alignments


  1. Rapid Global Alignments: How to align genomic sequences in (more or less) linear time

  2. Methods to CHAIN Local Alignments: Sparse Dynamic Programming, O(N log N)

  3. The Problem: Find a Chain of Local Alignments • (x,y) → (x’,y’) requires x < x’ and y < y’ • Each local alignment has a weight • FIND the chain with highest total weight

  4. Sparse DP for rectangle chaining • 1,…, N: rectangles • (hj, lj): y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of chain ending in j • L: list of triplets (lj, V(j), j) • L is sorted by lj • L is implemented as a balanced binary tree [Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

  5. Sparse DP for rectangle chaining. Main idea: • Sweep through x-coordinates • To the right of b, anything chainable to a is chainable to b • Therefore, if V(b) > V(a), rectangle a is “useless” – remove it • In L, keep rectangles j sorted with increasing lj-coordinates ⇒ sorted with increasing V(j) [Figure: rectangles a and b with scores V(a) and V(b)]

  6. Sparse DP for rectangle chaining
     Go through rectangle x-coordinates, from left to right:
     1. When on the leftmost end of rectangle i, compute V(i):
        a. j: rectangle in L with largest lj < hi
        b. V(i) = w(i) + V(j) (take V(j) = 0 if no such j exists)
     2. When on the rightmost end of i, possibly store V(i) in L:
        a. j: rectangle in L with largest lj ≤ li
        b. If V(i) > V(j):
           • INSERT (li, V(i), i) in L
           • REMOVE all (lk, V(k), k) with V(k) ≤ V(i) and lk ≥ li
     [Figure: rectangles j and i]
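The sweep on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function name `chain_rectangles` and the tuple layout `(x_left, x_right, h, l, w)` are made up for the example, h is taken to be the smaller y-coordinate, and L is kept in plain sorted lists rather than a balanced binary tree. List insertion and deletion cost O(N), so this version does not reach the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, w) with x_left <= x_right and h <= l.
    Returns (best total weight, indices of the rectangles in the best chain)."""
    events = []
    for i, (x_left, x_right, h, l, w) in enumerate(rects):
        events.append((x_left, 0, i))   # kind 0: left end (handled first at equal x)
        events.append((x_right, 1, i))  # kind 1: right end
    events.sort()

    ls, vs, ids = [], [], []            # the list L, sorted by l; V is increasing too
    V = [0] * len(rects)
    prev = [None] * len(rects)          # back-pointers for recovering the chain

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)      # ls[:p] holds every stored l_j < h_i
            V[i] = w + (vs[p - 1] if p else 0)
            prev[i] = ids[p - 1] if p else None
        else:
            p = bisect_right(ls, l)     # ls[:p] holds every stored l_j <= l_i
            if p == 0 or V[i] > vs[p - 1]:
                q = bisect_left(ls, l)  # first stored entry with l_k >= l_i
                end = q
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1            # dominated: l_k >= l_i and V(k) <= V(i)
                ls[q:end], vs[q:end], ids[q:end] = [l], [V[i]], [i]

    if not rects:
        return 0, []
    best = max(range(len(rects)), key=V.__getitem__)
    chain, j = [], best
    while j is not None:
        chain.append(j)
        j = prev[j]
    return V[best], chain[::-1]
```

Replacing the three parallel lists with a balanced search tree keyed on l would give the logarithmic search, insert, and delete costs that the time analysis relies on.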

  7. Example [Figure: five weighted rectangles in the x-y plane, labeled 1: 5, 2: 6, 3: 3, 4: 4, 5: 2, with axis tick marks 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

  8. Time Analysis • Sorting the x-coords takes O(N log N) • Going through x-coords: N steps • Each of N steps requires O(log N) time: • Searching L takes log N • Inserting to L takes log N • All deletions are consecutive, so log N per deletion • Each element is deleted at most once: N log N for all deletions • Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

  9. Putting it All Together: Fast Global Alignment Algorithms
     • FIND local alignments
     • CHAIN local alignments
     Program   FIND          CHAIN
     GLASS     k-mers        hierarchical DP
     MUMmer    suffix tree   sparse DP
     AVID      suffix tree   hierarchical DP
     LAGAN     CHAOS         sparse DP

  10. LAGAN: Pairwise Alignment • FIND local alignments • CHAIN local alignments • DP restricted around chain

  11. LAGAN • Find local alignments • Chain: O(N log N) L.I.S. (longest increasing subsequence) • Restricted DP
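The O(N log N) L.I.S. step mentioned here can be illustrated with the standard patience-sorting method. The sketch below is the unweighted version and only returns the length; it assumes the anchors have already been sorted by x, so that only their y-coordinates need to increase. The name `lis_length` is illustrative, and LAGAN's actual chaining also takes anchor scores into account.

```python
from bisect import bisect_left

def lis_length(ys):
    """Length of the longest strictly increasing subsequence of ys."""
    tails = []  # tails[k] = smallest possible last value of an increasing run of length k+1
    for y in ys:
        p = bisect_left(tails, y)  # first tail that is >= y
        if p == len(tails):
            tails.append(y)        # y extends the longest run found so far
        else:
            tails[p] = y           # y gives a smaller tail for runs of length p+1
    return len(tails)
```

Each element costs one binary search, which is where the N log N total comes from.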

  12. LAGAN: recursive call • What if a box is too large? • Recursive application of LAGAN, more sensitive word search

  13. A trick to save on memory: “necks” have tiny tracebacks … only store tracebacks

  14. Multiple Sequence Alignments

  15. Sequence Comparison • Introduction • Comparison • Homology -- Analogy • Identity -- Similarity • Pairwise -- Multiple • Scoring matrices • Gap -- indel • Global -- Local • Manual alignment, dot plot • visual inspection • Dynamic programming • Needleman-Wunsch • exhaustive global alignment • Smith-Waterman • exhaustive local alignment • Multiple alignment • Database search • BLAST • FASTA

  16. Sequence Comparison Multiple alignment (Multiple sequence alignment: MSA)

  17. Overview • Definition • Scoring Schemes • Algorithms

  18. Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve the pairwise alignments

  19. Scoring Function • Ideally: find the alignment that maximizes the probability that the sequences evolved from a common ancestor, according to some phylogenetic model • More on phylogenetic models later [Figure: phylogenetic tree over x, y, z, w, v with an unknown node “?”]

  20. Scoring Function • A comprehensive model would have too many parameters and be too inefficient to optimize • Possible simplifications: • Ignore phylogenetic tree • Statistically independent columns: $S(m) = G(m) + \sum_i S(m_i)$, where m: alignment matrix, $m_i$: column i of m, G: function penalizing gaps

  21. Scoring Function: Sum Of Pairs
      Definition: Induced pairwise alignment. A pairwise alignment induced by the multiple alignment (columns in which both sequences have a gap are dropped).
      Example:
        x: AC-GCGG-C
        y: AC-GC-GAG
        z: GCCGC-GAG
      Induces:
        x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
        y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG

  22. Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: $S(m) = \sum_{k<l} s(m^k, m^l)$, where $s(m^k, m^l)$ is the score of induced alignment (k,l)
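To make the definition concrete, the sketch below computes induced pairwise alignments and a sum-of-pairs score for the x, y, z example from the previous slide. The scoring values (match +1, mismatch -1, gap -2 per gapped column) are assumptions chosen for illustration, since the slides do not fix a scoring scheme, and the function names are hypothetical.

```python
from itertools import combinations

def induced_pair(a, b):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == '-' and q == '-')]
    return ''.join(p for p, _ in cols), ''.join(q for _, q in cols)

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Column-by-column score of a pairwise alignment (assumed linear gap cost)."""
    total = 0
    for p, q in zip(a, b):
        if p == '-' or q == '-':
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows):
    """Sum of the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b)) for a, b in combinations(rows, 2))

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # x, y, z from the example
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```

A weighted variant only needs a per-pair factor multiplied into each term of the sum.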

  23. Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree [Figure: tree with leaves Human, Mouse, Duck, Chicken] • Weighted SOP: $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$ • $w_{kl}$: weight decreasing with distance

  24. Consensus
        -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
        TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
        CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
        CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      • Find optimal consensus string m* to maximize $S(m) = \sum_i s(m^*, m^i)$, where $s(m^k, m^l)$ is the score of pairwise alignment (k,l)
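Under one simple assumption, the optimal consensus is easy to compute: if s just counts the columns where two rows carry the same character, then the most frequent character in each column maximizes the sum. The sketch below implements that plurality vote. The identity-count scoring and the name `consensus` are assumptions for illustration; with a real substitution matrix the best character per column has to be found by scoring every candidate.

```python
from collections import Counter

def consensus(rows):
    """Per-column plurality vote over equal-length aligned rows.
    Gap characters take part in the vote, so the result may contain '-'."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("aligned rows must have equal length")
    # zip(*rows) yields the alignment one column at a time
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*rows))
```

Ties within a column are broken by whichever character appears first among the rows.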

  25. Multiple Sequence Alignments Algorithms

  26. Sequence Comparison Multiple alignment Multiple sequence alignment - Computational complexity [Figure: example with short sequences built from the residues V, S, N, and A]

  27. Sequence Comparison Multiple alignment
      Multiple sequence alignment - Computational complexity
      Alignment of protein sequences with 200 amino acids using dynamic programming:
      # of sequences   CPU time (approx.)
      2                1 sec
      4                10^4 sec (2.8 hours)
      5                10^6 sec (11.6 days)
      6                10^8 sec (3.2 years)
      7                10^10 sec (317 years)

  28. Sequence Comparison Multiple alignment Approximate methods for MSA • Multidimensional dynamic programming (MSA, Lipman 1988) • Progressive alignments (ClustalW, Higgins 1996; PileUp, Genetics Computer Group (GCG)) • Local alignments (e.g. DiAlign, Morgenstern 1996; lots of others) • Iterative methods (e.g. PRRP, Gotoh 1996) • Statistical methods (e.g. Bayesian Hidden Markov Models)

  29. Sequence Comparison Multiple alignment Multiple sequence alignment - Programs [Figure: classification of MSA programs. Category labels: Progressive, Multidimensional dynamic programming, Tree based, Non tree based, Iterative, GAs, HMMs. Programs shown: Clustal, T-Coffee, DCA, MSA, Combalign, Dalign, OMA, Interalign, Prrp, GA, SAGA, Sam, HMMER]

  30. Sequence Comparison Multiple alignment
      Multiple sequence alignment - Computational complexity
      Program    Seq type   Alignment      Method                  Comment
      ClustalW   Prot/DNA   Global         Progressive             No format limitation; runs on Windows too!
      PileUp     Prot/DNA   Global         Progressive             Limited by the format and UNIX based
      MultAlin   Prot/DNA   Global         Progressive/Iterative   Limited by the format
      T-COFFEE   Prot/DNA   Global/local   Progressive             Can be slow

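  31. 1. Multidimensional Dynamic Programming. Generalization of Needleman-Wunsch: $S(m) = \sum_i S(m_i)$ (sum of column scores); $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$, with the maximum taken over all neighbors of the cell in the cube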

  32. 1. Multidimensional Dynamic Programming
      • Example: in 3D (three sequences):
      • 7 neighbors/cell
      F(i,j,k) = max{ F(i-1,j-1,k-1) + S(xi, xj, xk),
                      F(i-1,j-1,k  ) + S(xi, xj, - ),
                      F(i-1,j  ,k-1) + S(xi, - , xk),
                      F(i-1,j  ,k  ) + S(xi, - , - ),
                      F(i  ,j-1,k-1) + S( - , xj, xk),
                      F(i  ,j-1,k  ) + S( - , xj, - ),
                      F(i  ,j  ,k-1) + S( - , - , xk) }
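The seven-neighbor recurrence can be written out directly for three short sequences. The sketch below is illustrative only: the column score S is assumed to be a sum-of-pairs score with match +1, mismatch -1, and gap -2 (gap-gap pairs score 0), and the names `column_score` and `align3_score` are hypothetical. It returns the optimal score without a traceback and fills the whole cube, so it is practical only for very short inputs.

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Assumed sum-of-pairs score of one alignment column."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == '-' and q == '-':
            continue               # gap-gap pairs contribute nothing
        if p == '-' or q == '-':
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    moves = [m for m in product((0, 1), repeat=3) if any(m)]  # the 7 neighbors
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor cell is filled first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float('-inf')
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue           # neighbor lies outside the cube
            a = x[i - 1] if di else '-'
            b = y[j - 1] if dj else '-'
            c = z[k - 1] if dk else '-'
            best = max(best, F[pi, pj, pk] + column_score(a, b, c))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]
```

Generalizing to N sequences only changes `repeat=3` and the indexing, at the price of the exponential running time discussed on the next slide.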

  33. 1. Multidimensional Dynamic Programming Running Time: • Size of matrix: $L^N$, where L = length of each sequence, N = number of sequences • Neighbors/cell: $2^N - 1$ • Therefore: $O(2^N L^N)$
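A quick way to get a feel for this bound is to count cell updates directly. The helper below multiplies the number of cells by the number of neighbors per cell; the name `mdp_updates` is made up for the example, and the count ignores the cost of evaluating each column score.

```python
def mdp_updates(L, N):
    """Neighbor evaluations for full N-dimensional DP on sequences of length L."""
    cells = L ** N          # size of the DP matrix
    neighbors = 2 ** N - 1  # predecessors examined per cell
    return cells * neighbors

for n in range(2, 6):
    print(n, mdp_updates(200, n))
```

For sequences of length 200, each additional sequence multiplies the work by roughly 400, which is why exact multidimensional DP stops being feasible after a handful of sequences.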

  34. 2. Progressive Alignment • Multiple Alignment is NP-complete • Most used heuristic: Progressive Alignment. Algorithm: 1. Align two of the sequences xi, xj 2. Fix that alignment 3. Align a third sequence xk to the alignment xi,xj 4. Repeat until all sequences are aligned. Running Time: $O(N L^2)$

  35. 2. Progressive Alignment • When evolutionary tree is known: • Align closest first, in the order of the tree. Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) [Figure: tree with leaves x, y, z, w]
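The alignment order in this example is a post-order traversal of the tree: both subtrees of a node are aligned before the node itself. A minimal sketch, assuming the tree is given as nested pairs with leaf names as strings (the representation and the name `merge_order` are assumptions for illustration):

```python
def merge_order(tree):
    """tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    Returns the list of (left_group, right_group) alignments in the order performed."""
    order = []

    def visit(node):
        if isinstance(node, str):
            return node                    # a single sequence needs no alignment
        left = visit(node[0])              # finish the left subtree first
        right = visit(node[1])             # then the right subtree
        order.append((left, right))        # then align the two resulting groups
        return left + right                # group label, e.g. "x" + "y" -> "xy"

    visit(tree)
    return order

print(merge_order((("x", "y"), ("z", "w"))))
```

In a real progressive aligner each group would hold an alignment or profile rather than a concatenated label.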

  36. CLUSTALW: progressive alignment. CLUSTALW: most popular multiple protein alignment program. Algorithm: 1. Find all dij: alignment dist (xi, xj) 2. Construct a tree (Neighbor-joining hierarchical clustering) 3. Align nodes in order of decreasing similarity + a large number of heuristics

  37. CLUSTALW & the CINEMA viewer

  38. MLAGAN: progressive alignment of DNA • Given N sequences and a phylogenetic tree • Align pairwise, in order of the tree (LAGAN) [Figure: tree with leaves Human, Baboon, Mouse, Rat]

  39. MLAGAN: main steps Given a collection of sequences, and a phylogenetic tree • Find local alignments for every pair of sequences x, y • Find anchors between every pair of sequences, similar to LAGAN anchoring • Progressive alignment • Multi-Anchoring based on reconciling the pairwise anchors • LAGAN-style limited-area DP • Optional refinement steps

  40. MLAGAN: multi-anchoring. To anchor the (X/Y) and (Z) alignments: [Figure: the X-Z and Y-Z anchors are reconciled into anchors between X/Y and Z]

  41. Heuristics to improve multiple alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …

  42. Iterative Refinement
      One problem of progressive alignment:
      • Initial alignments are “frozen” even when new evidence comes
      Example:
        x: GAAGTT
        y: GAC-TT    Frozen!
        z: GAACTG
        w: GTACTG    Now clear: correct y = GA-CTT

  43. Iterative Refinement Algorithm (Barton-Sternberg): 1. Align most similar xi, xj 2. Align xk most similar to (xixj) 3. Repeat 2 until (x1…xN) are aligned 4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN 5. Repeat 4 until convergence. Note: Guaranteed to converge

  44. Iterative Refinement. For each sequence y: 1. Remove y 2. Realign y (while rest fixed) [Figure: y is allowed to vary while x and z stay fixed (projection)]

  45. Iterative Refinement
      Example: align (x,y), (z,w), (xy, zw):
        x: GAAGTTA
        y: GAC-TTA
        z: GAACTGA
        w: GTACTGA
      After realigning y:
        x: GAAGTTA
        y: G-ACTTA    + 3 matches
        z: GAACTGA
        w: GTACTGA

  46. Iterative Refinement
      Example not handled well:
        x:  GAAGTTA
        y1: GAC-TTA
        y2: GAC-TTA
        y3: GAC-TTA
        z:  GAACTGA
        w:  GTACTGA
      • Realigning any single yi changes nothing

  47. Restricted MDP. Here is another way to improve a multiple alignment: 1. Construct progressive multiple alignment m 2. Run MDP, restricted to radius R from m. Running Time: $O(2^N R^{N-1} L)$

  48. Restricted MDP • Run MDP, restricted to radius R from m [Figure: band of radius R around m in the x, y, z cube] Running Time: $O(2^N R^{N-1} L)$

  49. Restricted MDP
        x:  GAAGTTA
        y1: GAC-TTA
        y2: GAC-TTA
        y3: GAC-TTA
        z:  GAACTGA
        w:  GTACTGA
      • Within radius 1 of the optimal ⇒ Restricted MDP will fix it.
