1 / 39

Large-Scale Global Alignments Multiple Alignments

Large-Scale Global Alignments Multiple Alignments. Lecture 10, Thursday May 1, 2003. Rapid global alignment. Genomic regions of interest contain ordered islands of similarity E.g. genes Find local alignments Chain an optimal subset of them. Suffix Trees.

Download Presentation

Large-Scale Global Alignments Multiple Alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003

  2. Rapid global alignment Genomic regions of interest contain ordered islands of similarity • E.g. genes • Find local alignments • Chain an optimal subset of them Lecture 10, Thursday May 1, 2003

  3. Suffix Trees • Suffix trees are a method to find all maximal matches between two strings (and much more) Example: x = dabdac d a b d a c 1 a c b d a b 4 c d c a c c 3 2 6 5 Lecture 10, Thursday May 1, 2003

  4. Application: Find all Matches Between x and y • Build suffix tree for x, mark nodes with x • Insert y in suffix tree, mark all nodes y “passes from” with y • The path label of every node marked both 0 and 1, is a common substring Lecture 10, Thursday May 1, 2003

  5. Example of Suffix Tree Construction for x, y y y 2. Insert a b a d a $ 3. Insert b a d a $ a x y 4. Insert a d a $ y d 4 x y a 5. Insert d a $ 6 a $ 6. Insert a $ d d a 6. Insert $ 2 $ a 5 3 $ 1 x = d a b d a $ y = a b a d a $ d a b d a $ 1 1. Construct tree for x x x a $ b d a b 4 $ x d $ a 6 $ $ 3 2 5 Lecture 10, Thursday May 1, 2003

  6. Application: Online Search of Strings on a Database Say a database D = { s1, s2, …sn } (eg. proteins) Question: given new string x, find all matches of x to database • Build suffix tree for {s1,…, sn} • All new queries x take O( |x| ) time (somewhat like BLAST) Lecture 10, Thursday May 1, 2003

  7. Application: Common Substrings of k Strings • Say we want to find the longest common substring of s1, s2, …sn • Build suffix tree for s1,…, sn • All nodes labeled {si1, …, sik} represent a match between si1, …, sik Lecture 10, Thursday May 1, 2003

  8. Methods toCHAINLocal Alignments Sparse Dynamic Programming O(N log N)

  9. The Problem: Find a Chain of Local Alignments (x,y)  (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight Lecture 10, Thursday May 1, 2003

  10. Quadratic Time Solution • Build Directed Acyclic Graph (DAG): • Nodes: local alignments (xa,xb) (ya,yb) Directed Edges: any two local alignments that can be chained • Each local alignment is a node vi with alignment score si Lecture 10, Thursday May 1, 2003

  11. Quadratic Time Solution (cont’d) Dynamic programming: Initialization: Find each node va s.t. there is no edge (u,v0) Set score of V(a) to be sa Iteration: For each vi, optimal path ending in vi has total score: V(i) = max ( weight(vj, vi) + V(j) ) Termination: Optimal global chain: j = argmax ( V(j) ); trace chain from vj Worst case time: quadratic Lecture 10, Thursday May 1, 2003

  12. Faster Solution: Sparse DP • 1,…, N: rectangles • (hj, lj): y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of chain ending in j • L: list of triplets (lj, V(j), j) L is sorted by lj h l y Lecture 10, Thursday May 1, 2003

  13. Sparse DP x y Go through rectangle x-coordinates, from lowest to highest: • When on the leftmost end of i: • j: rectangle in L, with largest lj < hi • V(i) = w(i) + V(j) • When on the rightmost end of i: • j: rectangle in L, with largest lj li • If V(i) > V(j): • INSERT (li, V(i), i) in L • REMOVE all (lk, V(k), k) with V(k)  V(i) & lk li Lecture 10, Thursday May 1, 2003

  14. Example x 2 1: 5 5 6 2: 6 9 10 3: 3 11 12 14 4: 4 15 5: 2 16 y Lecture 10, Thursday May 1, 2003

  15. Time Analysis • Sorting the x-coords takes O(N log N) • Going through x-coords: N steps • Each of N steps requires O(log N) time: • Searching L takes log N • Inserting to L takes log N • All deletions are consecutive, so log N Lecture 10, Thursday May 1, 2003

  16. Putting it All Together:Fast Global Alignment Algorithms • FIND local alignments • CHAIN local alignments FINDCHAIN GLASS: k-mers hierarchical DP MumMer: Suffix Tree Sparse DP Avid: Suffix Tree hierarchical DP LAGAN CHAOS Sparse DP Lecture 10, Thursday May 1, 2003

  17. LAGAN: Pairwise Alignment FIND local alignments CHAIN local alignments DP restricted around chain

  18. LAGAN • Find local alignments • Chain -O(NlogN) L.I.S. • Restricted DP Lecture 10, Thursday May 1, 2003

  19. LAGAN: recursive call • What if a box is too large? • Recursive application of LAGAN, more sensitive word search Lecture 10, Thursday May 1, 2003

  20. 3. LAGAN: memory “necks” have tiny tracebacks …only store tracebacks Lecture 10, Thursday May 1, 2003

  21. Multiple Sequence Alignments

  22. Lecture 10, Thursday May 1, 2003

  23. Lecture 10, Thursday May 1, 2003

  24. Overview • Definition • Scoring Schemes • Algorithms Lecture 10, Thursday May 1, 2003

  25. Multiple Sequence Alignments Definition and Motivation

  26. Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment! Lecture 10, Thursday May 1, 2003

  27. Motivation • A faint similarity between two sequences becomes very significant if present in many • Protein domains • Motifs responsible for gene regulation Lecture 10, Thursday May 1, 2003

  28. Another way to view multiple alignments • Pairwise alignment: Good alignment  Evolutionary relationship • Multiple alignment: Something in common  Sequence in common “inverse” to pairwise alignment Lecture 10, Thursday May 1, 2003

  29. Multiple Sequence Alignments Scoring Function

  30. Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor x y z ? w v Lecture 10, Thursday May 1, 2003

  31. Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: • Ignore phylogenetic tree • Statistically independent columns: S(m) = G(m) + i S(mi) m: alignment matrix G: function penalizing gaps Lecture 10, Thursday May 1, 2003

  32. Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG Lecture 10, Thursday May 1, 2003

  33. Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l) Lecture 10, Thursday May 1, 2003

  34. Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree: Human Mouse Duck Chicken • Weighted SOP: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance Lecture 10, Thursday May 1, 2003

  35. Consensus -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l) Lecture 10, Thursday May 1, 2003

  36. Multiple Sequence Alignments Algorithms

  37. Multidimensional Dynamic Programming Generalization of Needleman-Wunsh: S(m) = i S(mi) (sum of column scores) F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr)) Lecture 10, Thursday May 1, 2003

  38. Multidimensional Dynamic Programming • Example: in 3D (three sequences): • 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk), F(i-1,j-1,k )+S(xi, xj, - ), F(i-1,j ,k-1)+S(xi, -, xk), F(i-1,j ,k )+S(xi, -, - ), F(i ,j-1,k-1)+S( -, xj, xk), F(i ,j-1,k )+S( -, xj, xk), F(i ,j ,k-1)+S( -, -, xk) } Lecture 10, Thursday May 1, 2003

  39. Multidimensional Dynamic Programming Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN) Lecture 10, Thursday May 1, 2003

More Related