Outline

Outline. Today’s topic: greedy algorithms; genome rearrangements; sorting by reversals; breakpoints; shotgun sequencing; fragment assembly (the shortest superstring problem); finishing problem. Next time: greedy algorithms for motif finding.

Presentation Transcript


  1. Outline • Today’s topic: greedy algorithms • Genome rearrangements • Sorting by reversals • Breakpoints • Shotgun sequencing • Fragment assembly: the shortest superstring problem • Finishing problem • Next time: greedy algorithms for motif finding

  2. Turnip vs Cabbage: Almost Identical mtDNA gene sequences • In the 1980s Jeffrey Palmer studied evolutionary change in plant organelles by comparing the mitochondrial genomes of the cabbage and the turnip • 99% similarity between genes • These nearly identical gene sequences differed in gene order • This helped pave the way to analyzing genome rearrangements in molecular evolution

  3. Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison:

  8. Transforming Cabbage into Turnip

  9. Genome rearrangements • [Figure: mouse X chromosome and human X chromosome, both descended from an unknown ancestor ~75 million years ago] • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other?

  10. Mouse vs Human Genome • Humans and mice have similar genomes, but their genes are ordered differently • ~245 rearrangements • Reversal, fusion, fission, translocation • Reversal: flipping a block of genes within a genomic sequence

  11. Reversals • We assume that genes in the genomic segment p do not have direction • p = p_1 … p_{i-1} p_i p_{i+1} … p_{j-1} p_j p_{j+1} … p_n • p' = p_1 … p_{i-1} p_j p_{j-1} … p_{i+1} p_i p_{j+1} … p_n • r(i, j) is defined as the operation that reverses the gene sequence of p (the original genomic segment) between p_i and p_j, turning p into p'

  12. Reversals: Example • Example: p = 1 2 3 4 5 6 7 8 r(3,5) p’= 1 2 5 4 3 6 7 8
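
The reversal r(i, j) can be sketched in a few lines of Python. This is only an illustration: the function name `reversal` and the use of a Python list for the permutation are assumptions, and positions are 1-based as on the slide.

```python
def reversal(p, i, j):
    """Return a copy of p with the block p_i .. p_j reversed (1-based, inclusive)."""
    # Elements before position i and after position j are left untouched.
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]


p = [1, 2, 3, 4, 5, 6, 7, 8]
print(reversal(p, 3, 5))  # [1, 2, 5, 4, 3, 6, 7, 8]
```

Returning a new list rather than modifying p in place makes it easy to compare the permutation before and after the reversal.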

  13. Reversals: Example 5’ ATGCCTGTACTA 3’ 3’ TACGGACATGAT 5’ Break and Invert 5’ ATGTACAGGCTA 3’ 3’ TACATGTCCGAT 5’

  14. Reversal Distance Problem • Goal: Given two permutations, find the shortest series of reversals that transforms one into the other • Input: Permutations p and s • Output: A series of reversals r_1, …, r_t transforming p into s, such that t is minimum • t = reversal distance between p and s • d(p, s) denotes the smallest possible value of t, given p and s

  15. Sorting By Reversals Problem • Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n) • Input: Permutation p • Output: A series of reversals r_1, …, r_t transforming p into the identity permutation, such that t is minimum

  16. Sorting By Reversals: Example • The minimal value of t is denoted d(p); this value is the reversal distance of permutation p • Example: • Input: p = 3 4 2 1 5 6 7 10 9 8 • Step 1: 4 3 2 1 5 6 7 10 9 8 • Step 2: 4 3 2 1 5 6 7 8 9 10 • Step 3: 1 2 3 4 5 6 7 8 9 10 • So d(p) <= 3 • Is this the smallest number of reversals?

  17. Sorting By Reversals: A Greedy Algorithm • When sorting the permutation p = 1 2 3 6 4 5, the first three numbers are already in order, so it does not make sense to break them apart. The number of already-sorted leading elements of p is denoted prefix(p) • Here prefix(p) = 3 • This suggests a greedy algorithm for sorting by reversals: increase prefix(p) at every step

  18. Greedy Algorithm: An Example • Doing so, p can be sorted: • 1 2 3 6 4 5 • 1 2 3 4 6 5 • 1 2 3 4 5 6 • d(p) = 2 • The number of steps needed to sort a permutation of length n is at most (n – 1)

  19. Greedy Algorithm: Pseudocode
      SimpleReversalSort(p)
        for i ← 1 to n – 1
          j ← position of element i in p (i.e., p_j = i)
          if j != i
            p ← p * r(i, j)
            output p
          if p is the identity permutation
            return
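
The pseudocode above might be written in Python roughly as follows. This is a sketch under stated assumptions: the permutation is a list containing 1..n, the name `simple_reversal_sort` is illustrative, and instead of printing each intermediate permutation the function collects them in a list.

```python
def simple_reversal_sort(p):
    """Greedy sort by reversals: at step i, move element i to position i.

    Returns the intermediate permutations, one per reversal performed.
    """
    p = list(p)                       # work on a copy
    n = len(p)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):             # i = 1 .. n-1, as in the pseudocode
        j = p.index(i) + 1            # 1-based position of element i
        if j != i:
            p[i - 1:j] = p[i - 1:j][::-1]   # apply r(i, j)
            steps.append(list(p))
        if p == identity:
            break
    return steps


for step in simple_reversal_sort([1, 2, 3, 6, 4, 5]):
    print(step)
# [1, 2, 3, 4, 6, 5]
# [1, 2, 3, 4, 5, 6]
```

The number of entries in the returned list is the number of reversals the greedy strategy used, which is an upper bound on d(p) but not necessarily d(p) itself.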

  20. Analyzing SimpleReversalSort • Greedy algorithm; does not guarantee the smallest number of reversals • For example, let p = 6 1 2 3 4 5… • SimpleReversalSort(p) takes five steps: • Step 1: 1 6 2 3 4 5 • Step 2: 1 2 6 3 4 5 • Step 3: 1 2 3 6 4 5 • Step 4: 1 2 3 4 6 5 • Step 5: 1 2 3 4 5 6

  21. Analyzing SimpleReversalSort (cont’d) • But it can be done in two steps: p = 6 1 2 3 4 5 • Step 1: 5 4 3 2 1 6 • Step 2: 1 2 3 4 5 6 • So SimpleReversalSort(p) is not optimal, and it can be far from optimal

  22. Pancake Flipping Problem Problem Description • The chef is sloppy; he prepares an unordered stack of pancakes of different sizes • The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) • He does it by flipping over several from the top, repeating this as many times as necessary

  23. Christos Papadimitriou and Bill Gates Flip Pancakes

  24. Pancake Flipping Problem: Formulation • Goal: Given a stack of n pancakes, what is the minimum number of flips required to rearrange them into the ordered stack? • Input: An unordered stack of n pancakes of distinct sizes 1 < 2 < 3 < … < n • Output: An ordered stack of pancakes, smallest on top, largest on the bottom

  25. Pancake Flipping Problem: Greedy Algorithm • Greedy approach: at most 2 prefix reversals to place a pancake in its right position, so at most 2n – 2 steps in total • William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved by at most 5/3 (n + 1) prefix reversals • The problem is still open…
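
A minimal Python sketch of the greedy approach, assuming the stack is a list of distinct sizes with the top pancake at index 0. The name `pancake_sort` and the convention that a flip of size k reverses the top k pancakes are illustrative choices, not taken from the slides.

```python
def pancake_sort(stack):
    """Greedy pancake sorting with prefix reversals.

    Returns (sorted_stack, flips); each entry k in flips reverses the top k pancakes.
    """
    s = list(stack)
    flips = []
    for size in range(len(s), 1, -1):   # fix the bottom-most unsorted slot first
        pos = s.index(max(s[:size]))    # largest pancake not yet in place
        if pos == size - 1:
            continue                    # already where it belongs
        if pos != 0:
            s[:pos + 1] = s[pos::-1]    # flip 1: bring it to the top
            flips.append(pos + 1)
        s[:size] = s[size - 1::-1]      # flip 2: send it down to slot `size`
        flips.append(size)
    return s, flips


ordered, flips = pancake_sort([3, 1, 4, 2])
print(ordered, flips)  # [1, 2, 3, 4] [3, 4, 2, 3]
```

Each pancake costs at most two flips and the last one needs none, which is where the 2n – 2 bound comes from.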

  26. Approximation Algorithms • Optimal algorithms are unknown for many problems • Approximation algorithms find approximate solutions rather than optimal solutions • The approximation ratio of an algorithm A on input p is A(p) / OPT(p) Where: • A(p) is the solution produced by algorithm A • OPT(p) is the optimal solution of the problem

  27. Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: the worst approximation ratio over all inputs of size n • For a minimization problem: approximation ratio = max_{|p| = n} A(p) / OPT(p) • Key to proving approximation ratios: a good lower bound on OPT

  28. Breakpoints • p = p_1 p_2 p_3 … p_{n-1} p_n • A pair of neighboring elements p_i and p_{i+1} is adjacent if p_{i+1} = p_i + 1 or p_{i+1} = p_i – 1 • For example: p = 1 9 3 4 7 8 2 6 5 • (3, 4), (7, 8) and (6, 5) are adjacent pairs

  29. Breakpoints: An Example • There is a breakpoint between any pair of neighboring elements that are not adjacent: p = 1 9 3 4 7 8 2 6 5 • Pairs (1, 9), (9, 3), (4, 7), (8, 2) and (2, 6) form the breakpoints of permutation p

  30. Extending Permutations • We put two blocks p_0 and p_{n+1} at the ends of p: p_0 = 0 and p_{n+1} = n + 1 • The goal is then to sort the elements between the end blocks into the identity permutation • Example: p = 1 9 3 4 7 8 2 6 5 • Extending with 0 and 10: p = 0 1 9 3 4 7 8 2 6 5 10 • Note: a new breakpoint, (5, 10), was created by extending

  31. Reversal Distance and Breakpoints • b(p) = number of breakpoints • Each reversal eliminates at most 2 breakpoints, hence d(p) >= b(p) / 2 (lower bound on OPT) • Example: p = 2 3 1 4 6 5 • 0 2 3 1 4 6 5 7, b(p) = 5 • 0 1 3 2 4 6 5 7, b(p) = 4 • 0 1 2 3 4 6 5 7, b(p) = 2 • 0 1 2 3 4 5 6 7, b(p) = 0
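
Counting breakpoints and evaluating the lower bound can be sketched as below. The function name `breakpoints` is an assumption, and the permutation is taken to be a list containing 1..n that the function extends with 0 and n + 1 itself.

```python
def breakpoints(p):
    """Number of breakpoints b(p) of p, after extending it with 0 and n + 1."""
    ext = [0] + list(p) + [len(p) + 1]
    # A neighboring pair is a breakpoint when the two values do not differ by 1.
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


p = [2, 3, 1, 4, 6, 5]
b = breakpoints(p)
lower_bound = (b + 1) // 2   # d(p) >= b(p) / 2, rounded up because d(p) is an integer
print(b, lower_bound)        # 5 3
```

For this p the slide sorts the permutation in three reversals, which matches the lower bound, so d(p) = 3 here.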

  32. Sorting By Reversals: A Better Greedy Algorithm
      BreakPointReversalSort(p)
        while b(p) > 0
          Among all possible reversals, choose r minimizing b(p * r)
          p ← p * r
          output p
        return

  33. Strips • Strip: an interval between two consecutive breakpoints in p • Decreasing strips: strips that are in decreasing order (e.g. 6 5 and 3 2) • Increasing strips: strips that are in increasing order (e.g. 7 8) • A single-element strip is both increasing and decreasing

  34. Strips: An Example • For permutation 1 9 3 4 7 8 2 5 6: • There are 7 strips: 0 1 | 9 | 3 4 | 7 8 | 2 | 5 6 | 10
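
Splitting an extended permutation into strips can be sketched as follows. The function name `strips` is hypothetical, and a strip is taken to be a maximal run of neighboring elements with no breakpoint inside it.

```python
def strips(p):
    """Split the extended permutation 0, p_1 .. p_n, n + 1 into strips."""
    ext = [0] + list(p) + [len(p) + 1]
    result = [[ext[0]]]
    for a, b in zip(ext, ext[1:]):
        if abs(a - b) == 1:
            result[-1].append(b)   # no breakpoint: b continues the current strip
        else:
            result.append([b])     # breakpoint: b starts a new strip
    return result


print(len(strips([1, 9, 3, 4, 7, 8, 2, 5, 6])))  # 7
```

The number of strips is always b(p) + 1, since every breakpoint separates two consecutive strips.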

  35. Things To Consider • Fact 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p * r) < b(p))

  36. Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2: • 0 1 4 6 5 7 8 3 2 9, b(p) = 5 • Choose the decreasing strip with the smallest element k in p (k = 2 in this case) • Find k – 1 in the permutation and reverse the segment between k – 1 and k: • Before: 0 1 4 6 5 7 8 3 2 9, b(p) = 5 • After: 0 1 2 3 8 7 5 6 4 9, b(p) = 4

  37. Things To Consider (cont’d) • Fact 2: If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i.e. b(p * r) >= b(p) for every reversal r) • Reversing an increasing strip leaves the number of breakpoints unchanged but creates a decreasing strip, so the number of breakpoints can be reduced in the next step

  38. Things To Consider (cont’d) • There are no decreasing strips in p, for: • p = 0 1 2 5 6 7 3 4 8, b(p) = 3 • p * r(6,7) = 0 1 2 5 6 7 4 3 8, b(p * r(6,7)) = 3 • r(6,7), which flips the increasing strip 3 4 at positions 6 and 7, does not change the number of breakpoints • r(6,7) creates a decreasing strip, guaranteeing that the next step will decrease the number of breakpoints

  39. ImprovedBreakpointReversalSorting
      ImprovedBreakpointReversalSort(p)
        while b(p) > 0
          if p has a decreasing strip
            Among all possible reversals, choose r that minimizes b(p * r)
          else
            Choose a reversal r that flips an increasing strip in p
          p ← p * r
          output p
        return
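
One way to turn this pseudocode into runnable Python is sketched below. Several details are assumptions rather than part of the slides: the function names, the brute-force search over all reversals, breaking ties by taking the first best candidate, treating single-element interior strips as decreasing, treating the strips that contain 0 or n + 1 as increasing, and flipping the leftmost interior strip when no decreasing strip exists.

```python
def count_breakpoints(ext):
    """Breakpoints of an already extended permutation (0 ... n + 1)."""
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def split_strips(ext):
    """Strips of ext as (start, end) index pairs, inclusive."""
    bounds, start = [], 0
    for k in range(1, len(ext)):
        if abs(ext[k] - ext[k - 1]) != 1:
            bounds.append((start, k - 1))
            start = k
    bounds.append((start, len(ext) - 1))
    return bounds


def apply_reversal(ext, i, j):
    """Reverse positions i..j of ext; index i in ext is position i of p."""
    return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]


def improved_breakpoint_reversal_sort(p):
    """Return the list of reversals (i, j) used to sort p."""
    n = len(p)
    ext = [0] + list(p) + [n + 1]
    reversals = []
    while count_breakpoints(ext) > 0:
        # Strips that do not contain the end blocks 0 and n + 1.
        interior = [(s, e) for s, e in split_strips(ext) if s > 0 and e < n + 1]
        decreasing = [(s, e) for s, e in interior if s == e or ext[s] > ext[e]]
        if decreasing:
            # Brute force: try every reversal and keep one minimizing b(p * r).
            candidates = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
            i, j = min(candidates,
                       key=lambda r: count_breakpoints(apply_reversal(ext, *r)))
        else:
            # No decreasing strip: flip an increasing strip to create one.
            i, j = interior[0]
        ext = apply_reversal(ext, i, j)
        reversals.append((i, j))
    return reversals


print(improved_breakpoint_reversal_sort([6, 1, 2, 3, 4, 5]))  # [(1, 6), (1, 5)]
```

The brute-force search costs on the order of n^3 work per step, which is fine for illustrating the idea but not for genome-scale inputs.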

  40. ImprovedBreakpointReversalSorting: Performance Guarantee • ImprovedBreakpointReversalSort is an approximation algorithm with a performance guarantee of at most 4 • It eliminates at least one breakpoint in every two steps, so it outputs at most 2b(p) reversals • An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) >= b(p) / 2 • Performance guarantee: 2b(p) / d(p) <= 2b(p) / (b(p) / 2) = 4

  41. Signed Permutations? • Up to this point, all permutations to sort were unsigned • But genes have directions (5’ to 3’), so we should consider signed permutations • p = 1 -2 3 -4

  42. Outline • Today’s topic: greedy algorithms • Genome rearrangements • Sorting by reversals • Breakpoints • Shotgun sequencing • Fragment assembly: the shortest superstring problem • Finishing problem • Next time: greedy algorithms for motif finding

  43. DNA Sequencing • Chain termination can be used to sequence DNA strings a few hundred nucleotides long • Start with single strand template • Complementary strand synthesis terminated with small probability • Lengths of fragments read by gel electrophoresis ATACGGA ATACGG ATACG ATAC ATA AT A • How to sequence longer DNA strings?

  44. Shotgun Method

  45. Shortest Superstring Problem • Given: a set of strings s_1, s_2, …, s_n (we may assume that no s_i is a substring of another s_j) • Find: the shortest string s containing each s_i as a substring • Simplified model for fragment assembly • Ignores experimental errors • May collapse repeated DNA sequences

  46. Shortest Superstring Example • Given: s_1 = ATAT, s_2 = TATT, s_3 = TTAT, s_4 = TATA, s_5 = TAAT, s_6 = AATA • Superstring of s_1, …, s_6: S = TTATTTAATATA

  47. Greedy Merging Algorithm • S = {s_1, s_2, …, s_n} • While |S| > 1 do • Find s, t in S with the longest overlap • S = ( S \ {s, t} ) U { s overlapped with t to the maximum extent } • Output the final string • Approximation factor no better than 2: • s_1 = ab^k, s_2 = b^k c, s_3 = b^{k+1} • Greedy output: ab^k c b^{k+1}, length = 2k+3 • Optimum: ab^{k+1} c, length = k+3
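
The merging loop might look like this in Python. It is a sketch under assumptions: the names `overlap_len` and `greedy_superstring` are illustrative, the input is a non-empty list of distinct strings, overlaps are found by brute force, and ties between equally long overlaps are broken arbitrarily by `max`.

```python
def overlap_len(s, t):
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the ordered pair of strings with the longest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        # Best (overlap, index of s, index of t) over all ordered pairs s != t.
        k, a, b = max(
            (overlap_len(pool[a], pool[b]), a, b)
            for a in range(len(pool))
            for b in range(len(pool))
            if a != b
        )
        merged = pool[a] + pool[b][k:]   # s overlapped with t to the maximum extent
        pool = [x for idx, x in enumerate(pool) if idx not in (a, b)] + [merged]
    return pool[0]


reads = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
result = greedy_superstring(reads)
print(result, all(r in result for r in reads))
```

Which string comes out depends on the tie-breaking rule. In the ab^k, b^k c, b^{k+1} example all three useful overlaps have length k, so only an unlucky choice among the ties (merging s_1 with s_2 first) produces the long output.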

  48. Overlap & Prefix of 2 strings • Overlap of s and t: the longest suffix of s that is a prefix of t • Prefix of s and t: s after removing overlap(s, t) • With k = |overlap(s, t)|, s = a_1 a_2 … a_|s| and t = b_1 … b_|t|: overlap(s, t) = a_{|s|-k+1} … a_|s| = b_1 … b_k, and prefix(s, t) = a_1 … a_{|s|-k}
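
Both definitions translate directly into Python. The function names `overlap` and `prefix` follow the slide; the brute-force search from the longest candidate downward is an implementation choice made here for clarity, not for speed.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t (empty string if none)."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return t[:k]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


print(overlap("ATAT", "TATT"), prefix("ATAT", "TATT"))  # TAT A
print(overlap("TATT", "TTAT"), prefix("TATT", "TTAT"))  # TT TA
```

Note that prefix(s, t) + t is exactly s merged with t to the maximum extent, so the length of prefix(s, t) is the cost of placing t right after s in a superstring.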

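  49. Prefix Graph (not all arcs shown) • [Figure: directed graph on the nodes ATAT, TATT, TTAT, TATA, TAAT and AATA; each arc from s to t is weighted by the length of prefix(s, t), with weights between 1 and 4]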

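  50. Lower Bound on OPT • Number the strings in the order in which they appear in an optimal superstring • OPT = |prefix(s_1, s_2)| + … + |prefix(s_{n-1}, s_n)| + |prefix(s_n, s_1)| + |overlap(s_n, s_1)| • The prefix terms add up to the cost of the tour 1 → 2 → … → n → 1 in the prefix graph, so OPT is at least the cost of this tour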
