
CS 5263 & CS 4233 Bioinformatics


Presentation Transcript


  1. CS 5263 & CS 4233 Bioinformatics Multiple Sequence Alignment

  2. Multiple Sequence Alignment
  • Motivation: a faint similarity between two sequences becomes very significant if present in many sequences
  • Definition: given N sequences x1, x2, …, xN, insert gaps (-) in each sequence xi such that
    • All sequences have the same length L
    • The score of the alignment is maximum
  • Two issues:
    • How to score an alignment?
    • How to find a (nearly) optimal alignment?

  3. Scoring Function: Sum of Pairs
  Definition: an induced pairwise alignment is the pairwise alignment obtained from a multiple alignment by keeping two rows and dropping every column in which both rows have gaps.
  Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
  induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
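The projection is easy to state in code. A minimal sketch (the helper name is mine; '-' is assumed to be the gap character):

```python
def induced_pairwise(row_a, row_b):
    """Project two rows of a multiple alignment onto the pairwise
    alignment they induce: drop every column where both rows have a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

# The x/y projection from the example above:
print(induced_pairwise("AC-GCGG-C", "AC-GC-GAG"))  # ('ACGCGG-C', 'ACGC-GAG')
```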

  4. Sum of Pairs (cont’d)
  • The sum-of-pairs (SP) score of a multiple alignment m is the sum of the scores of all induced pairwise alignments:
    S(m) = Σk<l s(mk, ml)
  • s(mk, ml): score of the pairwise alignment induced on sequences k and l

  5. Example
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
  With match +1, mismatch -1, gap vs. residue -1, and gap vs. gap 0:
  Column 1: s(A,A) + s(A,G) × 2 = -1
  Column 4: s(G,G) × 3 = 3
  Column 8: s(-,A) × 2 + s(A,A) = -1
  Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
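The column-by-column tally can be written as a small scorer. A sketch, assuming the simple scheme above (the function name is mine):

```python
from itertools import combinations

def sp_score(rows, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of a multiple alignment, summed column by column.
    Scheme assumed from the example: gap-gap pairs score 0."""
    def s(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

print(sp_score(["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]))  # 5
```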

  6. Multiple Sequence Alignment Algorithms
  • Alignments can be global or local; we only discuss global for now
  • A simple method:
    • Do pairwise alignment between all pairs
    • Combine the pairwise alignments into a single multiple alignment
  • Is this going to work?

  7. Compatible pairwise alignments
  Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
  Pairwise alignments:
  AAAATTTT----    AAAATTTT----    ----TTTTGGGG
  ----TTTTGGGG    AAAA----GGGG    AAAA----GGGG
  These are compatible and merge into one multiple alignment:
  AAAATTTT----
  ----TTTTGGGG
  AAAA----GGGG

  8. Incompatible pairwise alignments
  Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
  Pairwise alignments:
  AAAATTTT----    ----AAAATTTT    TTTTGGGG----
  ----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA
  These are incompatible: no single multiple alignment induces all three, so the simple method fails.

  9. Multidimensional Dynamic Programming (MDP)
  Generalization of Needleman-Wunsch:
  • Find the longest path in a high-dimensional cube, as opposed to a two-dimensional grid
  • Uses an N-dimensional matrix, as opposed to a two-dimensional array
  • Entry F(i1, …, iN) represents the score of the optimal alignment of the prefixes x1[1..i1], …, xN[1..iN]
  • F(i1, …, iN) = max over all neighbor cells of F(neighbor) + S(column of characters used to reach the current cell)

  10. Multidimensional Dynamic Programming (MDP)
  Example in 3D (three sequences x, y, z):
  • 2^3 - 1 = 7 neighbors per cell: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)
  F(i,j,k) = max of:
    F(i-1, j-1, k-1) + S(xi, yj, zk)
    F(i-1, j-1, k  ) + S(xi, yj, -)
    F(i-1, j,   k-1) + S(xi, -,  zk)
    F(i,   j-1, k-1) + S(-,  yj, zk)
    F(i-1, j,   k  ) + S(xi, -,  -)
    F(i,   j-1, k  ) + S(-,  yj, -)
    F(i,   j,   k-1) + S(-,  -,  zk)
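The 3-D recurrence can be sketched directly, reusing the same sum-of-pairs column scoring as in the earlier example (all names are mine; this computes the score only, with no traceback):

```python
from itertools import product

def mdp3(x, y, z, match=1, mismatch=-1, gap=-1):
    """Optimal sum-of-pairs score of three sequences via the 3-D
    Needleman-Wunsch generalization: L^3 cells, 2^3 - 1 = 7 moves each."""
    def s(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch
    def col(a, b, c):          # SP score of one alignment column
        return s(a, b) + s(a, c) + s(b, c)
    NEG = float("-inf")
    F = [[[NEG] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1),
                           range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = NEG
        # each sequence either consumes a residue (d=1) or takes a gap (d=0)
        for di, dj, dk in product((0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0) or di > i or dj > j or dk > k:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            best = max(best, F[i - di][j - dj][k - dk] + col(a, b, c))
        F[i][j][k] = best
    return F[len(x)][len(y)][len(z)]

print(mdp3("A", "A", "A"))  # 3
```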

  11. Multidimensional Dynamic Programming (MDP)
  Running Time:
  • Size of matrix: L^N, where L = length of each sequence and N = number of sequences
  • Neighbors per cell: 2^N - 1
  • Therefore: O(2^N L^N)

  12. Faster MDP
  • Carrillo & Lipman, 1988: branch and bound, plus other heuristics
  • Implemented in a tool called MSA
  • Practical for about 6 sequences of length about 200-300
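To see why roughly six sequences of a few hundred residues is the practical limit, plug the counts from slide 11 into a quick estimate (helper name mine; constants ignored):

```python
def mdp_cells_ops(num_seqs, seq_len):
    """Rough operation count for full MDP: (2^N - 1) scored moves
    for each of the L^N cells."""
    return (2 ** num_seqs - 1) * seq_len ** num_seqs

# Three short sequences are feasible; six sequences of length 250 are not:
print(f"{mdp_cells_ops(3, 100):.1e}")   # 7.0e+06
print(f"{mdp_cells_ops(6, 250):.1e}")   # 1.5e+16
```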

  13. Faster MDP
  • Basic idea: bounds on the score of the optimal multiple alignment can be pre-computed
  • Upper bound: the sum of optimal pairwise alignment scores, i.e. S(m) = Σk<l s(mk, ml) ≤ Σk<l s(k, l), where m is the optimal multiple alignment, s(mk, ml) is the score of the pairwise alignment induced by m, and s(k, l) is the score of the optimal pairwise alignment of sequences k and l
  • Lower bound: the score computed by any approximate algorithm (such as the ones discussed next)
  • For any partial path, if Scurrent + Sprospective < lower bound, that path can be abandoned
  • Guarantees optimality

  14. Progressive Alignment
  • Multiple alignment is NP-hard
  • Most widely used heuristic: progressive alignment
  Algorithm:
  • Align two of the sequences, xi and xj
  • Fix that alignment
  • Align a third sequence xk to the alignment of xi and xj
  • Repeat until all sequences are aligned
  Running time: O(N L^2), since each alignment takes O(L^2) and is repeated N times

  15. Progressive Alignment
  • When the evolutionary tree is known, align the closest sequences first, in the order given by the tree
  Example: for a tree grouping (x, y) and (z, w), the order of alignments is:
  1. (x, y)
  2. (z, w)
  3. (xy, zw)

  16. Progressive Alignment: CLUSTALW
  CLUSTALW: the most popular multiple protein alignment tool
  Algorithm:
  • Compute all pairwise alignment distances dij = dist(xi, xj); a high alignment score means a short distance
  • Construct a tree (similar to hierarchical clustering, discussed later)
  • Align nodes in order of decreasing similarity
  • Plus a large number of heuristics

  17. CLUSTALW example
  • S1: ALSK
  • S2: TNSD
  • S3: NASK
  • S4: NTSD

  18. CLUSTALW example
  Compute the distance matrix for S1-S4 [matrix shown as a figure in the slides]

  19. CLUSTALW example
  Build the guide tree: S1 groups with S3, and S2 groups with S4 [tree shown as a figure]

  20. CLUSTALW example
  Align the closest pair (S1, S3):
  -ALSK
  NA-SK

  21. CLUSTALW example
  Align the next pair (S2, S4):
  -ALSK    -TNSD
  NA-SK    NT-SD

  22. CLUSTALW example
  Align the two groups:
  -ALSK
  NA-SK
  -TNSD
  NT-SD
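The first CLUSTALW steps on this example (compute pairwise similarities, then merge the closest pairs first) can be sketched as follows. This is a score-only Needleman-Wunsch with an assumed +1/-1/-1 scheme; all names are mine:

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment score (Needleman-Wunsch, linear gaps),
    score only, using O(|b|) space."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for cb in b:
            diag = prev[len(cur) - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[len(cur)] + gap, cur[-1] + gap))
        prev = cur
    return prev[-1]

seqs = {"S1": "ALSK", "S2": "TNSD", "S3": "NASK", "S4": "NTSD"}
scores = {pair: nw_score(seqs[pair[0]], seqs[pair[1]])
          for pair in combinations(seqs, 2)}
# Highest-scoring (closest) pairs are merged first:
print(sorted(scores, key=scores.get, reverse=True)[:2])
```

Under this scheme the top-scoring pairs are (S1, S3) and (S2, S4), matching the guide tree in the slides.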

  23. Iterative Refinement
  Problems with progressive alignment:
  • It depends on pairwise alignments
  • If sequences are very distantly related, errors are much more likely
  • Initial alignments are “frozen”, even when new evidence comes in
  Example:
  x: GAAGTT
  y: GAC-TT
  z: GAACTG
  w: GTACTG
  The (x, y) alignment is frozen, even though z and w make it clear that the correct y should be GA-CTT

  24. Iterative Refinement
  Algorithm (Barton-Sternberg):
  1. Align the most similar pair xi, xj
  2. Align the sequence xk most similar to the current alignment
  3. Repeat step 2 until x1 … xN are all aligned (steps 1-3 are progressive alignment)
  4. For j = 1 to N: remove xj and realign it to the alignment of x1 … xj-1, xj+1 … xN
  5. Repeat step 4 until convergence

  25. Iterative Refinement (cont’d)
  For each sequence y:
  • Remove y
  • Realign y while the rest stays fixed (allow y to vary against the projection of the other sequences, e.g. x and z)
  Note: guaranteed to converge, because the score never decreases and is bounded above
  Running time: O(k N L^2), where k is the number of iterations

  26. Iterative Refinement
  Example: align (x, y), then (z, w), then (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
  After realigning y (+3 matches):
  x: GAAGTTA
  y: G-ACTTA
  z: GAACTGA
  w: GTACTGA
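The improvement can be checked numerically with a small sum-of-pairs scorer (scheme assumed: match +1, mismatch -1, gap vs. residue -1, gap vs. gap 0; names mine):

```python
from itertools import combinations

def sp(rows, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score under the simple scheme assumed above."""
    def s(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch
    return sum(s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))

before = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
after  = ["GAAGTTA", "G-ACTTA", "GAACTGA", "GTACTGA"]
print(sp(after) - sp(before))  # 6: realigning y raises the SP score
```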

  27. Iterative Refinement
  • An example that is not handled well:
  x: GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z: GAACTGA
  w: GTACTGA
  • Realigning any single yi changes nothing, because the other two copies hold the old alignment in place

  28. Restricted MDP
  • Similar to bounded DP in pairwise alignment
  • Construct a progressive multiple alignment m
  • Run MDP, restricted to a radius of R around m
  Running time: O(2^N R^(N-1) L)

  29. Restricted MDP
  x: GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z: GAACTGA
  w: GTACTGA
  • This alignment is within radius 1 of the optimal one, so restricted MDP will fix it

  30. Other approaches
  • Statistical learning methods
    • Profile hidden Markov models
  • Consistency-based methods
    • Still rely on pairwise alignment, but consider a third sequence when aligning two
    • If block A in sequence x aligns to block B in sequence y, and both align to block C in sequence z, we have higher confidence that the A-B alignment is reliable
    • Essentially: change the scoring system according to consistency, then apply DP as in the other approaches
    • Pioneered by a tool called T-Coffee

  31. Multiple alignment tools
  • Clustal W (Thompson, 1994): the most popular
  • T-Coffee (Notredame, 2000): another popular tool; consistency-based; slower than Clustal W, but generally more accurate for more distantly related sequences
  • MUSCLE (Edgar, 2004): iterative refinement; more efficient than most others
  • DIALIGN (Morgenstern, 1998, 1999, 2005): “local”
  • Align-m (Walle, 2004): “local”
  • PROBCONS (Do, 2004): probabilistic consistency-based; best accuracy on benchmarks
  • ProDA (Phuong, 2006): allows repeated and shuffled regions

  32. In summary
  • Multiple alignment scoring functions:
    • Sum of pairs
    • Other functions exist, but are less used
  • Multiple alignment algorithms:
    • MDP: optimal but too slow; branch & bound doesn’t solve the problem entirely
    • Heuristics: progressive alignment (CLUSTALW), iterative refinement, restricted MDP, consistency-based methods
