1 / 86

Multiple Global Alignment and Phylogenetic tree

Multiple Global Alignment and Phylogenetic tree. Outline. Multiple sequence alignment—MSA Motivation The sum of pairs method (SP) Phylogenetic tree Clustering Neighbour joining Clustalw. What is a Multiple Sequence Alignment. MSA is the alignment of more than two sequences.

garvey
Download Presentation

Multiple Global Alignment and Phylogenetic tree

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multiple Global Alignment and Phylogenetic tree

  2. Outline • Multiple sequence alignment—MSA • Motivation • The sum of pairs method (SP) • Phylogenetic tree • Clustering • Neighbour joining • Clustalw

  3. What is a Multiple Sequence Alignment • MSA is the alignment of more than two sequences VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWWSNG— * * An example of MSA alignment

  4. Dynamic Programming in 3D QUESTION: Which alignment would be generated For DQLF, DNVQ, QGL?

  5. Dynamic Programming in 3D D--Q-LF DNVQ--- ---QGL-

  6. si1 - si1si1 --si1 sj2 sj2 -sj2 - sj2 - sk3 sk3 sk3 - sk3 - - How many cases do we need to consider? • In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete • For three sequences s1, s2, s3 there are 7 possibilities: • For m sequences there are 2m -1 possibilities QUESTION: Why is it “2”?

  7. Complexity • For m sequences each of length n the matrix has nm cells and for each we must check 2m -1 possibilities: That’s prohibitive! • Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution

  8. A little excursion to Romania:A* Search Further reading Russel/Norvig, Artificial Intelligence, Chapter 4. Prentice-Hall

  9. Problem: Find the shortest path from Arad to Bucharest Neamt Oradea 71 Zerind 87 Iasi 75 151 92 Arad Optimal route is (140+80+97+101) = 418 miles 140 140 Vaslui 118 99 Sibiu Sibiu Faragas Timisoara 142 80 80 111 211 Lugoj Rimnicu Rimnicu 98 Urziceni Hirsova 70 86 97 97 Pitesti Pitesti Mehadia 146 101 101 Bucharest 86 75 138 Dobreta 90 120 Craiova Eforie Giurgui

  10. Straight Line Distances to Bucharest

  11. Greedy search Go to neighboring city v, which minimizes distance Fv to goal Oradea Zerind Arad Sibiu Faragas Timisoara Lugoj Rimnicu Urziceni Hirsova Pitesti Mehadia Bucharest Dobreta Craiova Eforie Giurgui

  12. Greedy search Go to neighboring city v, which minimizes distance Fv to goal Oradea Zerind QUESTION: Any problems? Why is it called “greedy” search? Arad Sibiu Faragas Timisoara Lugoj Rimnicu Urziceni Hirsova Pitesti Mehadia Bucharest Dobreta Craiova Eforie Giurgui

  13. Problems of greedy search • Not optimal • Greedy search from Arad to Bucharestvia Fagaras, optimum via Rimnicu • Problem: Greedy algorithm does not include distance already covered • A*: • Pursue best node first with scoring function of distance so far plus under estimate to goal (e.g. shortest line distance) • v is a node • SvBest score to go from start to node v • FvEstimate for going from v to goal • Tv = Sv + Fv Total score • Organize nodes to be visited sorted by total score(TODO list in next slides)

  14. A* search of the Romanian map featured in the previous slide. Note: Nodes are labelled with Tv = Sv+ Fv. However,we will be using the abbreviations T, S and F to make the notation simpler Zerind Oradea T= 0 + 366 T= 366 Arad Fagaras Sibiu Bucharest(2) Rimnicu Timisoara Bucharest Bucharest Bucharest Pitesti Craiova

  15. We begin with the initial state of Arad. The cost of reaching Arad from Arad (or S value) is 0 miles. The straight line distance from Arad to Bucharest (or F value) is 366 miles. This gives us a total value of ( T = S + F )366 miles. Expand the initial state of Arad. DONE = [] TODO = [Arad/366] Zerind Oradea T= 0 + 366 T= 366 Arad Fagaras Sibiu Bucharest(2) Rimnicu Timisoara Bucharest Bucharest Bucharest Pitesti Craiova

  16. Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles). DONE = [Arad] TODO = [Sibiu/393, Timisoara/447, Zerind/449] Zerind Oradea T= 75 + 374 T= 449 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 140 + 253 T= 393 Bucharest(2) T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

  17. We now expand Sibiu (that is, we expand the node with the lowest value of T). DONE = [Arad, Sibiu] TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

  18. We now expand Rimnicu (that is, we expand the node with the lowest value of T ). DONE = [Arad, Sibiu] TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti Craiova

  19. Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles DONE = [Arad, Sibiu, Rimnicu] TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  20. We now expand Pitesti (that is, we expand the node with the lowest value of T). DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  21. In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T ) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it. DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  22. We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found. DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  23. We now expand Fagaras (that is, we expand the node with the lowest value of T ). DONE = [Arad, Sibiu, Rimnicu, Pitesti] TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  24. Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ) has an T value of 418. The other node (Arad – Sibiu – Fagaras – Bucharest(2) ) has an T value of 450. We therefore move to the first Bucharest node and expand it. DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras] TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  25. We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search pattern ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ), is in fact the optimal solution. DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras] TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671] Zerind Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest Bucharest Bucharest Bucharest T= 418 + 0 T= 418 Pitesti T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  26. Additional optimization • Let‘s assume we have an (over)-estimate K for the best solution, i.e. the optimal solution will be better than K • Do not consider any node with total score Tv worse than K • If Tv > K then remove v from TODO list

  27. Additional optimization Zerind • Assume K = 430, then we can remove nodes Zerind, Oradea, Timisoara, Craiova Oradea T= 75 + 374 T= 449 T= 291 + 380 T= 671 T= 0 + 366 T= 366 Arad Fagaras Sibiu T= 239 + 178 T= 417 T= 140 + 253 T= 393 Bucharest(2) T= 220 + 193 T= 413 T= 118 + 329 T= 447 T= 450 + 0 T= 450 Rimnicu Timisoara Bucharest Bucharest Bucharest Bucharest T= 418 + 0 T= 418 Pitesti QUESTION: What if K is equal to optimum? What if K is poorely chosen? What if rule is “If Tv >= K then remove v“? Problem? T= 317 + 98 T= 415 Craiova T= 366 + 160 T= 526

  28. F must be under-estimate • For algorithm to work F must be an under-estimate • Example: Direct distance is always shorter than road QUESTION: What happens if F is not under-estimate?

  29. F must be under-estimate • For algorithm to work F must be an under-estimate • Example: Direct distance is always shorter than road • Then it cannot be guaranteed that optimal solution is found • E.g. FRiminicu = 10.000 in example for Riminicu? • Then TRiminicu = 10.220 > K = 450, so Riminicu would be removed, and optimal solution would not be found QUESTION: What happens if F is not under-estimate?

  30. From Romania to Dresden • So, what does that mean for multiple sequence alignment? QUESTIONS: What does a node (city) correspond to? What does an edge between nodes correspond to? What does the cost between two nodes correspond to? How could we define S? How could we define F? How could we define K?

  31. The Sum of Pairs Method • As in the pairwise case, not all MSA’s are equally good. We need a scoring method to determine when one MSA is better than another one • The Sum of Pairs (SP) method: • For each column in the alignment, sum up the score of each pair of residues. • M: a MSA of the sequences of (s1, s2, ...sm) • s’iis the projection of si, i.e. the sequence siwith gaps • S(s’i,s’j): the score of the projections • The final score is

  32. An Example of Using the SP Method • Example s1 = AVP s’1: A-VP- s2 = AVT s’2: A-V-T s3 = PSVPT s’3: PSVPT • Scores: • Match = 1 • Mismatch, insertion, deletion = -1 • S(-, -) = 0 to prevent the double counting of gaps. QUESTION: What is the score of the alignment?

  33. An Example of Using the SP Method • Example s1 = AVP s’1: A-VP- s2 = AVT s’2: A-V-T s3 = PSVPT s’3: PSVPT • Scores: • Match = 1 • Mismatch, insertion, deletion = -1 • S(-, -) = 0 to prevent the double counting of gaps. • Then the SP score is S(s’1,s’2) + S(s’1,s’3) + S(s’2, s’3) = 0 + (-1) + (-1) = -2

  34. 1 MSA vs. n SA • What is the difference between making one multiple sequence alignment to making many pairwise sequence comparisons? • The score S(s’i,s’j) for the alignment s’i,s’j in a multiple sequence alignment is less than score S(si,sj) for aligning si,sj directly • S(s’i,s’j) <= S(si,sj)

  35. Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible Can we rule out any cells? Let us assume that we know already that there is a known alignment of score K Let v = (i1,i2,….im) be a cell of the DP matrix for which want to determine whether we need to consider it (and its neighbours) or not Let Svbe the score of the best path from the start cell to cell v Let FVbe an upper bound for the highest-scoring alignment from v to the end of DP matrix, i.e. we can only find a path from v to the end which is less than FV Then we know the following: If Sv+ Fv< K, then v cannot lie on the path of the best alignment SV v <=Fv Pruning the search space

  36. Dynamic Pruning with Forward Recursion • D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbor w. • I.e. the overall score Sv+D(v,w) is sent to w. • The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbor cells. sj2 SV - g v SV - g SV + R(si1,sj2) si1 From cell v values are sent to all its neighbor cells

  37. One more thing: A queue • We need a data structure before we list the algorithm • A queue is a list of elements with two special operators • Push: to add an element at the end of the queue • Pop: to remove an element from the top of a queue

  38. proc F(v, hN) a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN begin v = h0; P(v) = 0; push(v,Q) push start cell on queue while Q is not empty do pop(v,Q); S(v) = P(v) v has got all values from its neighbours IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end for end while end const h0 the start cell of the DP matrix (H0,0…0) hN the end cell of the DP matrix (Hn1,n2…nm) K a lower bound for the score of the whole alignment var u, v,w denote cells S(u) the best score of an alignment from h0 to u P(u) the score of the best alignment from h0 to u found so far D(u, v) the score for extending the alignment from cell u to cell v Q a queue of the cells u for which a value for P(u) is found but u is not visited yet Algorithm: Forward-recursion with pruning

  39. Finding upper limits for scores • For any multiple sequence alignment M of sequences {s1,s2,….sm} we know that • the score for the multiple sequence alignment S(M) is less then the sum of pairwise comparisons of the sequences {s1,s2,….sm} • The procedure F should find an upper bound for the alignment of the subsequences s1i1+1…n1 , s2i2+1…n2 , ….. smim+1…nm • This can be done as follows: (4.6)

  40. Questions QUESTION: What is the score of the multiple sequence alignmentwhen the algorithm is done? QUESTION: How can we get alignment from algorithm?

  41. Answers • The score for the multiple sequence alignment is S(hN) • How can we get an alignment from the algorithm? • We need another variable Dir to store the direction from which we were coming • Let‘s assume we are at node v and its neighbour w is not pruned • If w is new in queue then Dir(w)={v} • If w is already in queue and S(v)+D(v,w)>P(w) then • P(w) = S(v)+D(v,w) and Dir(w) = {v} • If w is already in queue and S(v)+D(v,w)=P(w) then • Add v to Dir(w)

  42. proc F(v, hN) a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN begin v = h0; P(v) = 0; push(v,Q) push start cell on queue while Q is not empty do pop(v,Q); S(v) = P(v) v has got all values from its neighbours IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) Dir(w) = {v} else if S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) Dir(w) = {v} else if S(v)+D(v,w) = P(w) then Add v to Dir(w) end for end while end const h0 the start cell of the DP matrix (H0,0…0) hN the end cell of the DP matrix (Hn1,n2…nm) K a lower bound for the score of the whole alignment var u, v,w denote cells S(u) the best score of an alignment from h0 to u P(u) the score of the best alignment from h0 to u found so far D(u, v) the score for extending the alignment from cell u to cell v Q a queue of the cells u for which a value for P(u) is found but u is not visited yet Dir(w) stores nodes v from which best scores were obtained Algorithm: Forward-recursion with pruning

  43. Printing the alignment: printMSA(hN,0) • printMSA is recursive function, which takes a node v and a position k in the alignment to be generated as input • B is a matrix, which contains the aligment • printMSA(v,k): • If v = h0then print B • Else • Let i1,…,im be the indices of v • For all u in Dir(v) do • Let i‘1,…,i‘m be the indices of w • For j from 0 to m-1 do • If ij = i‘j then Bk,j = „-“ • Else Bk,j = sequence j at position ij • printMSA(u,k+1)

  44. Questions QUESTION: Why is Dir a set and not a single node? QUESTION: Does printMSA print one multiple sequence alignmentor all possible ones?

  45. Example Let’s align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1 <0,0,0> <3,3,2>

  46. Example • We need a lower bound for the overall result.Let’s assume we have got already the following alignment • What is K, the sum of pairs for this alignment? DQ-LF DNVQ- -QGL-

  47. Example • We need a lower bound for the overall result.Let’s assume we have got already the following alignment • K = -1 -4 + 3 = -2 DQ-LF DNVQ- -QGL-

  48. Example • Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1) • F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3 D--QLF DQ-LF DNVQ-- DNVQ-- -QGL- ---QGL +2 +3 -2

  49. begin v = h0; P(v) = 0; push(v,Q) while Q is not empty do pop(v,Q); S(v) = P(v) IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end end end end end Q: <0,0,0> P( <0,0,0> ) = 0 S( <0,0,0> ) = 0 Example

  50. begin v = h0; P(v) = 0; push(v,Q) while Q is not empty do pop(v,Q); S(v) = P(v) IfS(v) + F(v, hN) >= K then for all forward neighbours w of vdo ifw doesn’t belong to Q then push (w,Q); P(w) = S(v) + D(v,w) else P(w) = max( P(w), S(v)+D(v,w) ) end end end end end S( <0,0,0> ) + F( <0,0,0>, <3,3,2>) = 0+3 >= -2 Q: <0,0,1>, <0,1,0>, <0,1,1>, … , <1,1,1> P( <0,0,1> ) = 0 + -2 --Q P( <0,1,0> ) = 0 + -2 -D- P( <0,1,1> ) = 0 + -3 -DQ P( <1,0,0> ) = 0 + -2 D-- P( <1,0,1> ) = 0 + -3 D-Q P( <1,1,0> ) = 0 + 1 DD- P( <1,1,1> ) = 0 + 1 DDQ Example

More Related