Outline • Today’s topic: greedy algorithms • Genome rearrangements • Sorting by reversals • Breakpoints • Shotgun sequencing • Fragment assembly: the shortest superstring problem • Finishing problem • Next time: greedy algorithms for motif finding
Turnip vs Cabbage: Almost Identical mtDNA gene sequences • In the 1980s, Jeffrey Palmer studied evolutionary change in plant organelles by comparing the mitochondrial genomes of cabbage and turnip • 99% similarity between genes • These nearly identical gene sequences nevertheless differed in gene order • This helped pave the way to analyzing genome rearrangements in molecular evolution
Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison:
Genome rearrangements • [Figure: mouse X chromosome and human X chromosome, both descended from an unknown ancestor ~75 million years ago] • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other?
Mouse vs Human Genome • Humans and mice have similar genomes, but their genes are ordered differently • ~245 rearrangements • Reversal, fusion, fission, translocation • Reversal: flipping a block of genes within a genomic sequence
Reversals • We will assume that genes in genomic segment p do not have direction • p = p_1 … p_{i-1} p_i p_{i+1} … p_{j-1} p_j p_{j+1} … p_n • p’ = p_1 … p_{i-1} p_j p_{j-1} … p_{i+1} p_i p_{j+1} … p_n • r(i, j) is defined as the operation that reverses the gene sequence of p (the original genomic segment) between p_i and p_j, so p’ = p * r(i, j)
Reversals: Example • Example: p = 1 2 3 4 5 6 7 8 r(3,5) p’= 1 2 5 4 3 6 7 8
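The reversal r(i, j) can be sketched in a few lines of Python. This is only an illustration, assuming 1-based inclusive positions as in the slides; the function name `apply_reversal` is hypothetical and not part of the original deck.

```python
def apply_reversal(p, i, j):
    """Return p * r(i, j): p with positions i..j (1-based, inclusive) reversed."""
    if not 1 <= i <= j <= len(p):
        raise ValueError("need 1 <= i <= j <= len(p)")
    # Python slices are 0-based and half-open, hence i - 1 and j.
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]


p = [1, 2, 3, 4, 5, 6, 7, 8]
print(apply_reversal(p, 3, 5))  # [1, 2, 5, 4, 3, 6, 7, 8]
```

The function returns a new list instead of modifying p, which makes it easy to compare p and p’ side by side.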
Reversals: Example 5’ ATGCCTGTACTA 3’ 3’ TACGGACATGAT 5’ Break and Invert 5’ ATGTACAGGCTA 3’ 3’ TACATGTCCGAT 5’
Reversal Distance Problem • Goal: Given two permutations, find the shortest series of reversals that transforms one into the other • Input: Permutations p and s • Output: A series of reversals r_1, …, r_t transforming p into s, such that t is minimum • This minimum t is the reversal distance between p and s • d(p, s) denotes the smallest possible value of t, given p and s
Sorting By Reversals Problem • Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n) • Input: Permutation p • Output: A series of reversals r_1, …, r_t transforming p into the identity permutation such that t is minimum
Sorting By Reversals: Example • The minimal value of t is denoted d(p) and is called the reversal distance of permutation p • Example: input p = 3 4 2 1 5 6 7 10 9 8 • Output: 3 4 2 1 5 6 7 10 9 8 → 4 3 2 1 5 6 7 10 9 8 → 4 3 2 1 5 6 7 8 9 10 → 1 2 3 4 5 6 7 8 9 10 • So d(p) <= 3 • Is this the smallest number of reversals?
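For a permutation this small, the question can be checked by exhaustive search. The sketch below is a plain breadth-first search over reversals; it is not an algorithm from the deck, the name `reversal_distance` and the `max_depth` cut-off are assumptions made for illustration, and the approach is only practical for short permutations and small distances.

```python
def reversal_distance(p, max_depth=4):
    """Smallest number of reversals that sorts p, found by breadth-first search.

    Returns None if no solution with at most max_depth reversals exists.
    """
    start, target = tuple(p), tuple(sorted(p))
    if start == target:
        return 0
    n = len(start)
    seen = {start}
    frontier = {start}
    for depth in range(1, max_depth + 1):
        next_frontier = set()
        for q in frontier:
            for i in range(n):
                for j in range(i + 1, n):
                    # Reverse the block q[i..j] (0-based, inclusive).
                    r = q[:i] + q[i:j + 1][::-1] + q[j + 1:]
                    if r == target:
                        return depth
                    if r not in seen:
                        seen.add(r)
                        next_frontier.add(r)
        frontier = next_frontier
    return None


print(reversal_distance([3, 4, 2, 1, 5, 6, 7, 10, 9, 8]))  # 3
```

So for this input no series of one or two reversals works, and the three reversals shown are optimal.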
Sorting By Reversals: A Greedy Algorithm • If sorting permutation p = 1 2 3 6 4 5, the first three numbers are already in order, so it does not make sense to break them. The number of already sorted elements at the start of p is defined as prefix(p) • prefix(p) = 3 • This suggests a greedy algorithm for sorting by reversals: increase prefix(p) at every step
Greedy Algorithm: An Example • Doing so, p can be sorted: 1 2 3 6 4 5 → 1 2 3 4 6 5 → 1 2 3 4 5 6 • d(p) = 2 • The number of steps to sort a permutation of length n is at most (n – 1)
Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
  for i ← 1 to n – 1
    j ← position of element i in p (i.e., p_j = i)
    if j != i
      p ← p * r(i, j)
      output p
    if p is the identity permutation
      return
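The pseudocode translates almost line by line into Python. The version below is a sketch rather than a reference implementation: the name `simple_reversal_sort` and the choice to return the list of intermediate permutations are assumptions made here for illustration.

```python
def simple_reversal_sort(p):
    """Greedy sort by reversals; returns the permutation after each reversal."""
    p = list(p)
    n = len(p)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):                 # i = 1 .. n - 1
        j = p.index(i) + 1                # 1-based position of element i
        if j != i:
            # p <- p * r(i, j): reverse positions i..j
            p[i - 1:j] = p[i - 1:j][::-1]
            steps.append(list(p))
        if p == identity:
            break
    return steps


for step in simple_reversal_sort([1, 2, 3, 6, 4, 5]):
    print(step)
# [1, 2, 3, 4, 6, 5]
# [1, 2, 3, 4, 5, 6]
```

Each pass through the loop fixes at least one more element of the sorted prefix, which is why at most n – 1 reversals are ever used.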
Analyzing SimpleReversalSort • Greedy algorithm; does not guarantee the smallest number of reversals • For example, let p = 6 1 2 3 4 5… • SimpleReversalSort(p) takes five steps: • Step 1: 1 6 2 3 4 5 • Step 2: 1 2 6 3 4 5 • Step 3: 1 2 3 6 4 5 • Step 4: 1 2 3 4 6 5 • Step 5: 1 2 3 4 5 6
Analyzing SimpleReversalSort (cont’d) • But it can be done in two steps: p = 6 1 2 3 4 5 • Step 1: 5 4 3 2 1 6 • Step 2: 1 2 3 4 5 6 • So SimpleReversalSort(p) is not optimal: on this input it uses five reversals where two suffice
Pancake Flipping Problem Problem Description • The chef is sloppy; he prepares an unordered stack of pancakes of different sizes • The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom) • He does it by flipping over several from the top, repeating this as many times as necessary
Pancake Flipping Problem: Formulation • Goal: Given a stack of n pancakes, what is the minimum number of flips required to rearrange them into a sorted stack? • Input: An unordered stack of n pancakes of distinct sizes 1 < 2 < 3 < … < n • Output: An ordered stack of pancakes, smallest on top, largest on the bottom
Pancake Flipping Problem: Greedy Algorithm • Greedy approach: at most 2 prefix reversals place a pancake in its right position, so at most 2n – 2 steps in total • William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved by at most 5/3 (n + 1) prefix reversals • The exact number of flips needed in the worst case is still an open problem
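The greedy approach can be sketched as follows: bring the largest unplaced pancake to the top with one flip, then flip it down into place with a second. The function name `pancake_sort`, the convention that index 0 is the top of the stack, and the returned list of flip sizes are all assumptions made for this illustration.

```python
def pancake_sort(stack):
    """Sort a stack (index 0 = top) using prefix reversals only.

    Returns (sorted_stack, flips), where each flip is the number of
    pancakes turned over from the top.
    """
    stack = list(stack)
    flips = []

    def flip(k):
        stack[:k] = stack[:k][::-1]
        flips.append(k)

    # Place the largest pancake first, then the next largest, and so on.
    for size in range(len(stack), 1, -1):
        pos = stack.index(max(stack[:size]))  # largest among the top `size`
        if pos == size - 1:
            continue                          # already in its final position
        if pos != 0:
            flip(pos + 1)                     # bring it to the top
        flip(size)                            # flip it down into place
    return stack, flips


print(pancake_sort([2, 3, 1]))  # ([1, 2, 3], [2, 3])
```

Every pancake except the last costs at most two flips, which gives the 2n – 2 bound quoted above.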
Approximation Algorithms • Optimal algorithms are unknown for many problems • Approximation algorithms find approximate solutions rather than optimal solutions • The approximation ratio of an algorithm A on input p is A(p) / OPT(p) Where: • A(p) is the solution produced by algorithm A • OPT(p) is the optimal solution of the problem
Approximation Ratio/Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: worst approximation ratio over all inputs of size n • For a minimization problem: Approx ratio = max_{|p| = n} A(p) / OPT(p) • Key to proving approximation ratios: a good lower bound on OPT
Breakpoints • p = p_1 p_2 p_3 … p_{n-1} p_n • A pair of elements p_i and p_{i+1} is adjacent if p_{i+1} = p_i + 1 or p_{i+1} = p_i – 1 • For example: p = 1 9 3 4 7 8 2 6 5 • (3, 4), (7, 8) and (6, 5) are adjacent pairs
Breakpoints: An Example • There is a breakpoint between any pair of neighboring elements that are not adjacent: p = 1 9 3 4 7 8 2 6 5 • Pairs (1,9), (9,3), (4,7), (8,2) and (2,6) form breakpoints of permutation p
Extending Permutations • We put two blocks p_0 and p_{n+1} at the ends of p: p_0 = 0 and p_{n+1} = n + 1 • The goal is then to sort the elements between the end blocks into the identity permutation • Example: p = 1 9 3 4 7 8 2 6 5 • Extending with 0 and 10: p = 0 1 9 3 4 7 8 2 6 5 10 • Note: a new breakpoint, (5, 10), was created after extending
Reversal Distance and Breakpoints • b(p) = number of breakpoints • Each reversal eliminates at most 2 breakpoints, so d(p) >= b(p) / 2 (lower bound on OPT) • Example: p = 2 3 1 4 6 5 • 0 2 3 1 4 6 5 7, b(p) = 5 • 0 1 3 2 4 6 5 7, b(p) = 4 • 0 1 2 3 4 6 5 7, b(p) = 2 • 0 1 2 3 4 5 6 7, b(p) = 0
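Counting breakpoints is mechanical, so a short sketch may help check the values on this slide. The function names `breakpoints` and `reversal_lower_bound` are hypothetical; the code simply extends p with 0 and n + 1 and counts neighboring pairs that do not differ by exactly 1.

```python
def breakpoints(p):
    """b(p): number of breakpoints in p extended with 0 and n + 1."""
    ext = [0] + list(p) + [len(p) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def reversal_lower_bound(p):
    """d(p) >= b(p) / 2, rounded up because d(p) is an integer."""
    return -(-breakpoints(p) // 2)


for q in ([2, 3, 1, 4, 6, 5], [1, 3, 2, 4, 6, 5],
          [1, 2, 3, 4, 6, 5], [1, 2, 3, 4, 5, 6]):
    print(q, breakpoints(q))   # b(p) = 5, 4, 2, 0

print(reversal_lower_bound([2, 3, 1, 4, 6, 5]))  # 3
```

For p = 2 3 1 4 6 5 the bound gives d(p) >= 3, and the three reversals shown on the slide meet it, so d(p) = 3 for this example.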
Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
  while b(p) > 0
    Among all possible reversals, choose r minimizing b(p * r)
    p ← p * r
    output p
  return
Strips • Strip: an interval between two consecutive breakpoints in p • Decreasing strips: strips that are in decreasing order (e.g. 6 5 and 3 2) • Increasing strips: strips that are in increasing order (e.g. 7 8) • A single-element strip is both increasing and decreasing
Strips: An Example • For permutation 1 9 3 4 7 8 2 5 6: • There are 7 strips: 0 1 | 9 | 3 4 | 7 8 | 2 | 5 6 | 10
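Splitting a permutation into strips amounts to cutting the extended permutation at every breakpoint. The sketch below assumes that reading; the function name `strips` is hypothetical.

```python
def strips(p):
    """Split p, extended with 0 and n + 1, into strips (cut at each breakpoint)."""
    ext = [0] + list(p) + [len(p) + 1]
    result = [[ext[0]]]
    for a, b in zip(ext, ext[1:]):
        if abs(a - b) == 1:
            result[-1].append(b)   # adjacent pair: the current strip continues
        else:
            result.append([b])     # breakpoint: a new strip starts
    return result


found = strips([1, 9, 3, 4, 7, 8, 2, 5, 6])
print(found)       # [[0, 1], [9], [3, 4], [7, 8], [2], [5, 6], [10]]
print(len(found))  # 7
```

The number of strips is always b(p) + 1, since each breakpoint separates two neighboring strips.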
Things To Consider • Fact 1: • If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p * r) < b(p))
Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 • 0 1 4 6 5 7 8 3 2 9, b(p) = 5 • Choose the decreasing strip with the smallest element k in p (k = 2 in this case) • Find k – 1 in the permutation and reverse the segment between k and k – 1: • Before: 0 1 4 6 5 7 8 3 2 9, b(p) = 5 • After: 0 1 2 3 8 7 5 6 4 9, b(p) = 4
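The construction behind Fact 1 can be sketched directly. In this illustration, single-element strips count as decreasing (as stated on the Strips slide), except for the end strips containing 0 or n + 1, which is an assumption added here so that k – 1 always exists. The function name `breakpoint_reducing_reversal` is hypothetical; it returns 1-based positions (i, j) for r(i, j), or None if there is no decreasing strip.

```python
def breakpoint_reducing_reversal(p):
    """Find r(i, j) that places k next to k - 1, where k is the smallest
    element of any decreasing strip. Returns None if no such strip exists."""
    n = len(p)
    ext = [0] + list(p) + [n + 1]
    smallest = None
    start = 0
    for i in range(1, n + 2):
        if abs(ext[i] - ext[i - 1]) != 1:        # breakpoint closes a strip
            end = i - 1
            # Skip the strip containing 0; the strip containing n + 1 is
            # never closed inside this loop, so it is skipped as well.
            if start > 0 and (start == end or ext[start] > ext[end]):
                k = ext[end]                      # smallest element of the strip
                if smallest is None or k < smallest:
                    smallest = k
            start = i
    if smallest is None:
        return None
    a, c = ext.index(smallest - 1), ext.index(smallest)
    # Reverse the segment between k - 1 and k so that they become neighbors.
    return (a + 1, c) if a < c else (c + 1, a)


p = [1, 4, 6, 5, 7, 8, 3, 2]
i, j = breakpoint_reducing_reversal(p)
print((i, j))                                        # (2, 8)
print(p[:i - 1] + p[i - 1:j][::-1] + p[j:])          # [1, 2, 3, 8, 7, 5, 6, 4]
```

On the slide's example this reproduces the reversal that takes b(p) from 5 to 4.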
Things To Consider (cont’d) • Fact 2: • If there is no decreasing strip, no reversal r can reduce the number of breakpoints (i.e. b(p * r) >= b(p) for every reversal r) • Reversing an increasing strip leaves the number of breakpoints unchanged but creates a decreasing strip, so the number of breakpoints can be reduced in the next step
Things To Consider (cont’d) • There are no decreasing strips in p, for: • p = 0 1 2 5 6 7 3 4 8, b(p) = 3 • p * r(6,7) = 0 1 2 5 6 7 4 3 8, b(p) = 3 • r(6,7), which flips the increasing strip 3 4 at positions 6 and 7, does not change the number of breakpoints • r(6,7) creates a decreasing strip, guaranteeing that the next step will decrease the number of breakpoints
ImprovedBreakpointReversalSorting
ImprovedBreakpointReversalSort(p)
  while b(p) > 0
    if p has a decreasing strip
      Among all possible reversals, choose r that minimizes b(p * r)
    else
      Choose a reversal r that flips an increasing strip in p
    p ← p * r
    output p
  return
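A runnable sketch of this pseudocode is given below. It makes several assumptions that the slide leaves open: the best reversal is found by brute force over all pairs (i, j), single-element strips count as decreasing except for the end strips containing 0 or n + 1, and when no decreasing strip exists the first inner increasing strip is flipped. All function names are illustrative.

```python
def apply_reversal(p, i, j):
    """p * r(i, j) with 1-based inclusive positions."""
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]


def count_breakpoints(p):
    ext = [0] + list(p) + [len(p) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def inner_strips(p):
    """Strips as (start, end) positions, excluding those holding 0 or n + 1."""
    n = len(p)
    ext = [0] + list(p) + [n + 1]
    bounds, start = [], 0
    for i in range(1, n + 2):
        if abs(ext[i] - ext[i - 1]) != 1:
            bounds.append((start, i - 1))
            start = i
    bounds.append((start, n + 1))
    return [(s, e) for s, e in bounds if s > 0 and e < n + 1]


def improved_breakpoint_reversal_sort(p):
    """Returns (sorted permutation, list of reversals (i, j) that were applied)."""
    p = list(p)
    n = len(p)
    reversals = []
    while count_breakpoints(p) > 0:
        inner = inner_strips(p)
        # Position s in the extended permutation is p[s - 1].
        has_decreasing = any(s == e or p[s - 1] > p[e - 1] for s, e in inner)
        if has_decreasing:
            candidates = [(i, j) for i in range(1, n + 1)
                          for j in range(i + 1, n + 1)]
            i, j = min(candidates,
                       key=lambda r: count_breakpoints(apply_reversal(p, *r)))
        else:
            i, j = inner[0]          # flip an increasing strip
        p = apply_reversal(p, i, j)
        reversals.append((i, j))
    return p, reversals


result, used = improved_breakpoint_reversal_sort([1, 4, 6, 5, 7, 8, 3, 2])
print(result, len(used))
```

The brute-force search costs O(n^3) time per step, which is acceptable for a demonstration but not for long permutations.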
ImprovedBreakpointReversalSorting: Performance Guarantee • ImprovedBreakpointReversalSort is an approximation algorithm with a performance guarantee of at most 4 • It eliminates at least one breakpoint in every two steps, so it outputs at most 2b(p) reversals • An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) >= b(p) / 2 • Performance guarantee: [ 2b(p) / d(p) ] <= [ 2b(p) / (b(p) / 2) ] = 4
Signed Permutations? • Up to this point, all permutations to sort were unsigned • But genes have directions (5’ to 3’), so we should consider signed permutations • p = 1 -2 3 -4
DNA Sequencing • Chain termination can be used to sequence DNA strings a few hundred nucleotides long • Start with a single-strand template • Complementary strand synthesis is terminated with small probability • Lengths of fragments are read by gel electrophoresis • Example fragments: ATACGGA, ATACGG, ATACG, ATAC, ATA, AT, A • How to sequence longer DNA strings?
Shortest Superstring Problem • Given: a set of strings s_1, s_2, …, s_n (we may assume that no s_i is a substring of another s_j) • Find: the shortest string s containing each s_i as a substring • Simplified model for fragment assembly • Ignores experimental errors • May collapse repeat DNA sequences
Shortest Superstring Example • Given: • s_1 = ATAT • s_2 = TATT • s_3 = TTAT • s_4 = TATA • s_5 = TAAT • s_6 = AATA • Superstring of s_1, …, s_6: S = TTATTTAATATA • [Figure: each s_i aligned under its occurrence in S]
Greedy Merging Algorithm • S = {s_1, s_2, …, s_n} • While |S| > 1 do • Find s, t in S with the longest overlap • S = ( S \ {s, t} ) U { s overlapped with t to the maximum extent } • Output the final string • Approximation factor no better than 2: • s_1 = ab^k, s_2 = b^k c, s_3 = b^(k+1) • All three overlaps have length k; if greedy merges s_1 and s_2 first, its output is ab^k cb^(k+1), of length 2k+3 • Optimum: ab^(k+1) c, of length k+3
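A compact sketch of greedy merging follows. The names `overlap_len` and `greedy_superstring` are hypothetical, and ties between equally long overlaps are broken by iteration order over a sorted list, which is an arbitrary choice; a different tie-breaking rule can give a different (and possibly longer) output.

```python
def overlap_len(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with the longest overlap."""
    # Enforce the slide's assumption: drop duplicates and contained strings.
    unique = sorted(set(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not pool:
        return ""
    while len(pool) > 1:
        k, s, t = max(
            ((overlap_len(s, t), s, t) for s in pool for t in pool if s != t),
            key=lambda item: item[0],
        )
        merged = s + t[k:]
        pool.remove(s)
        pool.remove(t)
        # Drop anything the merged string now contains, then add it.
        pool = [u for u in pool if u not in merged] + [merged]
    return pool[0]


fragments = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
result = greedy_superstring(fragments)
print(result, len(result))
print(all(f in result for f in fragments))  # True
```

With this tie-breaking, the strings abbb, bbbc, bbbb (the k = 3 case of the example above) happen to merge into the optimum abbbbc; the length 2k+3 output arises only when the first merge is s_1 with s_2.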
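Overlap & Prefix of 2 strings • Overlap of s and t: longest suffix of s that is a prefix of t • Prefix of s and t: s after removing overlap(s,t) • If the overlap has length k: s = a_1 a_2 … a_{|s|-k} a_{|s|-k+1} … a_{|s|} and t = b_1 … b_k … b_{|t|} • prefix(s,t) = a_1 … a_{|s|-k} • overlap(s,t) = a_{|s|-k+1} … a_{|s|} = b_1 … b_k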
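Both definitions are short enough to state as code. The sketch below returns the strings themselves rather than their lengths; the function names mirror the slide's overlap(s,t) and prefix(s,t) but are otherwise an illustrative choice.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s[-k:] == t[:k]:
            return t[:k]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


print(overlap("ATAT", "TATT"), prefix("ATAT", "TATT"))  # TAT A
```

Note that s always equals prefix(s,t) followed by overlap(s,t), and that overlap is not symmetric: overlap("TATT", "ATAT") is empty.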
Prefix Graph (not all arcs shown) • [Figure: directed graph on the nodes ATAT, TATT, TTAT, TATA, TAAT, AATA; each arc from s to t is weighted by the length of prefix(s,t)]
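Lower Bound on OPT • Let s_1, s_2, …, s_n be the order in which the strings appear in an optimal superstring • OPT = prefix(s_1,s_2) … prefix(s_{n-1},s_n) prefix(s_n,s_1) overlap(s_n,s_1) • The total length of prefix(s_1,s_2) … prefix(s_n,s_1) is the cost of the tour 1 → 2 → … → n → 1 in the prefix graph • So |OPT| >= cost of this tour >= cost of a minimum-cost tour in the prefix graph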
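The identity behind this bound can be checked numerically for any fixed ordering: the string built from the prefixes has length equal to the tour cost plus the final overlap. The sketch below does this for one ordering of the six example strings; the ordering, like the names `tour_cost` and `merge_in_order`, is chosen here purely for illustration and is not claimed to be optimal.

```python
def overlap_len(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0


def tour_cost(order):
    """Sum of |prefix(s_i, s_{i+1})| around the cycle s_1 -> ... -> s_n -> s_1."""
    cycle = zip(order, order[1:] + order[:1])
    return sum(len(s) - overlap_len(s, t) for s, t in cycle)


def merge_in_order(order):
    """prefix(s_1,s_2) ... prefix(s_{n-1},s_n) followed by s_n itself."""
    parts = [s[:len(s) - overlap_len(s, t)] for s, t in zip(order, order[1:])]
    return "".join(parts) + order[-1]


order = ["TTAT", "TATT", "TAAT", "AATA", "ATAT", "TATA"]
superstring = merge_in_order(order)
print(superstring, len(superstring))                     # TTATTAATATA 11
print(tour_cost(order), overlap_len(order[-1], order[0]))  # 11 0
```

Here the tour costs 11 and the closing overlap is empty, so the merged string has exactly 11 characters; finding the ordering that minimizes the tour cost is a traveling-salesman-style problem on the prefix graph.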