Genomic Sorting with Length-Weighted Intervals • 236818 - Seminar in Bioinformatics: Advanced Algorithms in Computational Biology • Spring 2005, Technion • Asaf Merschon
What we saw so far • Current algorithms of genome rearrangements ignore the length of reversals; rather they only count their number. • Traditionally, such analysis assumes that each reversal is of unit cost.
Motivation • The assumption of unit-cost reversals is not completely defensible biologically: • A longer genomic reversal will cause more upheaval to the organism, resulting in a lower likelihood of the organism surviving to pass on the mutation. • The mechanics of genome reversal may suggest that the probabilities of reversals depend on their length (among other factors).
The topics covered • On top of the surface: • Introduction to Genomic Sorting with Length-Weighted Intervals. • Lower and upper bounds on complexity of solution. • Proofs (Partial). • Down under: • Improved bounds on Sorting with Length-Weighted Reversals (Extended Abstract). • Concept and examples. • Sorting by Length-weighted Reversals: Dealing with Signs and Circularity. • General approach to solutions.
Goal • Find an algorithm that efficiently sorts one sequence into another by reversals under length-sensitive cost models. • The focus is on sorting unsigned permutations by reversals. • The problem remains NP-hard in the new model, so we aim for approximation results.
Definitions (1) • Let the function f(l) denote the cost of a reversal of length l. • Traditionally, f(l) = 1. • We say a function f is: • Additive if f(x) + f(y) = f(x + y) • Subadditive if f(x) + f(y) ≥ f(x + y) • Superadditive if f(x) + f(y) ≤ f(x + y)
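The three classes can be checked numerically on small lengths. The sketch below is only an illustration: the helper name `classify`, the range `max_len` and the tolerance `tol` are assumptions made here, and a finite check cannot prove the property for all lengths.

```python
def classify(f, max_len=50, tol=1e-9):
    """Classify f on lengths 1..max_len as additive, subadditive or superadditive."""
    diffs = [f(x) + f(y) - f(x + y)
             for x in range(1, max_len + 1)
             for y in range(1, max_len + 1)]
    if all(abs(d) <= tol for d in diffs):
        return "additive"          # f(x) + f(y) == f(x + y) everywhere tested
    if all(d >= -tol for d in diffs):
        return "subadditive"       # splitting a reversal never gets cheaper
    if all(d <= tol for d in diffs):
        return "superadditive"     # splitting a reversal never gets dearer
    return "none"

print(classify(lambda l: 1))         # unit cost
print(classify(lambda l: l))         # linear cost
print(classify(lambda l: l ** 0.5))  # square-root cost
print(classify(lambda l: l ** 2))    # quadratic cost
```

Unit cost and square-root cost fall in the subadditive class, linear cost is additive, and quadratic cost is superadditive.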
Definitions (2) • A Reversal Graph of permutations of length n is a graph where: • The vertices are all the permutations of length n. • There is an edge (p1, p2) of weight f(l) if there exists a single reversal of length l that transforms the permutation p1 into the permutation p2.
Wanted Results • Minimize the cost sufficient to sort any permutation of n elements (actually achieving an upper bound). Equivalent to computing the diameter of the reversal graph under the shortest-path metric. • Approximate the minimum-cost reversal sequence for a given permutation. We would like a heuristic that assures the resulting sequence costs no more than a slowly growing function of n times that of the optimal sequence.
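For very small n the reversal graph can be explored exhaustively, which makes the "diameter under the shortest-path metric" concrete. The following is a brute-force sketch, not one of the algorithms from the talk; the function names `sorting_costs` and `diameter` are chosen here for illustration, and the state space grows as n!, so it is only usable for tiny n.

```python
import heapq

def sorting_costs(n, f):
    """Dijkstra from the identity over the reversal graph on permutations of 1..n.

    Every reversal is its own inverse, so the distance from the identity to p
    equals the cheapest cost of sorting p. Returns {permutation: cost}.
    """
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist[p]:
            continue                         # stale heap entry
        for i in range(n):
            for j in range(i + 2, n + 1):    # reverse p[i:j], length j - i >= 2
                q = p[:i] + p[i:j][::-1] + p[j:]
                nd = d + f(j - i)
                if nd < dist.get(q, float("inf")):
                    dist[q] = nd
                    heapq.heappush(heap, (nd, q))
    return dist

def diameter(n, f):
    """Largest sorting cost over all permutations of length n."""
    return max(sorting_costs(n, f).values())

print(diameter(3, lambda l: 1))   # unit cost
print(diameter(3, lambda l: l))   # linear cost
```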
Important notes • The relatively coarse bounds generated by the following techniques limit their direct application to biological data. • The work presented leads to interesting algorithmic results and raises some interesting questions as a basis for further bioinformatics studies.
Previous Work • Sorting by unit-cost, unsigned reversals was shown to be NP-hard by Caprara. Our problem inherits hardness under more general metrics from this result. • Kececioglu & Sankoff gave approximation algorithms for reversal distance that guarantee results at most 2 times optimal. • Bafna & Pevzner improved this to a factor of 7/4. • Berman et al. improved this factor to 1.375. • Minimum-cost unsigned reversal sorting has also been studied under models where cost increases so dramatically that only length-2 reversals are affordable. • Experiments were done on the mitochondrial genomes of two fungi as well as on random samples. They suggest that length may play an important role in biasing certain rearrangement patterns.
Goal 1 – Bounding the diameter of the Reversal Graph • By bounding the diameter of the Reversal Graph, we establish an upper bound on the cost of sorting any n-element permutation. • Standard sorting algorithms exhibit interesting performance on highly subadditive and superadditive functions, but not additive measures. The primary result of this section is a new reversal-based sorting algorithm which performs well on additive cost functions. (Examples in next slides).
Examples on Highly Subadditive & Superadditive functions • Subadditive: A reversal-based version of selection sort performs at most n-1 reversals, a fraction of which are potentially Θ(n) in length. Thus selection sort gives an O(n f(n)) diameter algorithm. • Especially efficient for highly subadditive f. • Superadditive: Bubble sort and insertion sort perform transpositions of neighboring elements, one for each inversion in the input permutation. This gives an O(n^2 f(2)) = O(n^2) diameter algorithm. • Particularly efficient for highly superadditive f.
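Both baselines are easy to express as sequences of reversals with an explicit cost counter. This is a minimal sketch assuming the input is a permutation of 1..n; the function names are illustrative and not taken from the talk.

```python
def selection_sort_by_reversals(perm, f):
    """Place elements 1, 2, ... in turn, one reversal each. Returns (sorted list, cost)."""
    p = list(perm)
    cost = 0
    for i in range(len(p) - 1):
        j = p.index(i + 1)              # current position of the element that belongs at i
        if j != i:
            p[i:j + 1] = reversed(p[i:j + 1])
            cost += f(j - i + 1)        # reversal of length j - i + 1
    return p, cost

def bubble_sort_by_reversals(perm, f):
    """Swap adjacent out-of-order neighbours; each swap is a reversal of length 2."""
    p = list(perm)
    cost = 0
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(p) - 1):
            if p[i] > p[i + 1]:
                p[i], p[i + 1] = p[i + 1], p[i]
                cost += f(2)
                swapped = True
    return p, cost

perm = [3, 1, 4, 2]
print(selection_sort_by_reversals(perm, lambda l: l))
print(bubble_sort_by_reversals(perm, lambda l: l))
```

Selection sort uses few but possibly long reversals, while bubble sort uses one length-2 reversal per inversion, which is why the two suit opposite ends of the cost-function spectrum.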
The interesting case • Additive functions, particularly f(l) = l. • Presented is an algorithm for sorting any permutation of n elements in O(n log^2 n) cost using divide and conquer. • The key operation is MedianEject.
Definitions (3) • Sorting a permutation involves putting element i in position i. • Let π(i) denote the element in the i-th position of the permutation. • Let π⁻¹(i) denote the position of element i in the permutation. • An element x is wrong-sided if x and π⁻¹(x) are on different sides of the median m, meaning x ≤ m < π⁻¹(x) or vice versa.
MedianEject • We apply MedianEject to portions of the permutation from position a to b. One round of MedianEject moves all wrong-sided elements in the interval [a,b] to the correct side relative to its median in the following manner: • MedianEject(a,b) =
  Identify r, the number of maximal runs of wrong-sided elements, and m, the median (a+b)/2.
  for (i = 1 to log r): reduce the number of wrong-sided runs by half using non-overlapping reversals, none crossing the median.
  With two reversals, move the remaining wrong-sided runs to the median boundary.
  Reverse the left and right wrong-sided runs using a single reversal.
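The first step of MedianEject, identifying the maximal runs of wrong-sided elements, can be sketched as follows. The sketch assumes that positions a..b (1-indexed) hold exactly the elements a..b and that the median is m = (a + b) // 2; both conventions, and the name `wrong_sided_runs`, are assumptions made here. It covers only the identification step, not the reversals themselves.

```python
def wrong_sided_runs(perm, a, b):
    """Return maximal runs of wrong-sided elements as (first_pos, last_pos) pairs.

    Element x at position pos is wrong-sided when exactly one of
    x <= m and pos <= m holds, with m = (a + b) // 2.
    """
    m = (a + b) // 2
    runs = []
    start = None
    for pos in range(a, b + 1):
        x = perm[pos - 1]
        wrong = (x <= m) != (pos <= m)
        if wrong and start is not None and pos == m + 1:
            # A run may not cross the median: close it and open a new one.
            runs.append((start, pos - 1))
            start = pos
        elif wrong and start is None:
            start = pos
        elif not wrong and start is not None:
            runs.append((start, pos - 1))
            start = None
    if start is not None:
        runs.append((start, b))
    return runs

print(wrong_sided_runs([5, 2, 6, 1, 3, 8, 7, 4], 1, 8))
print(wrong_sided_runs([1, 2, 5, 6, 3, 4, 7, 8], 1, 8))
```

The number of runs returned plays the role of r in the cost bound O(f(b-a) log r).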
Lemmas (1) • Lemma 1: MedianEject costs O(f(b-a) log r) for any additive cost function f. • Proof (intuitively): There are O(log r) rounds of reversals, since with each pass there are half as many maximal runs of wrong-sided elements on each side of the median. The non-overlapping reversals of one round together reverse at most b-a elements and hence, by additivity, cost O(f(b-a)), resulting in a total of O(f(b-a) log r).
Reversal Sort • MedianEject is the partitioning operation of the following Quicksort-like algorithm:
Lemmas (2) • Lemma 2: ReversalSort runs in O(f(n) log^2 n) time for any additive cost function f(n). • Proof: By the master theorem, the recurrence T(n) = 2T(n/2) + O(f(n) log n) evaluates to O(f(n) log^2 n).
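The bound can also be seen by unrolling the recurrence directly. The derivation below is a sketch under stated assumptions: n is a power of two, T(1) = 0, one call to MedianEject on n elements costs at most c f(n) log n for some constant c (Lemma 1 with r ≤ n), and additivity is used in the form 2^i f(n/2^i) = f(n).

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + c\,f(n)\log n \\
     &= \sum_{i=0}^{\log n - 1} 2^{i}\, c\, f\!\left(\frac{n}{2^{i}}\right) \log\frac{n}{2^{i}} \\
     &\le \sum_{i=0}^{\log n - 1} c\, f(n)\,\log n
        \qquad \text{since } 2^{i} f\!\left(n/2^{i}\right) = f(n) \\
     &= c\, f(n)\,\log^{2} n \;=\; O\!\left(f(n)\log^{2} n\right).
\end{align*}
```

For the linear cost function f(n) = n this gives the O(n log^2 n) diameter bound.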
Goal 2 – Approximating Distance • From a biological point of view, constructing the least expensive transformation from a given permutation A to another permutation B is more interesting than minimizing diameter. This is because we want to reconstruct the evolutionary history from A and B, a history which presumably took the most parsimonious possible path.
Definitions (4) • We now show that for all permutations, the reversal sorting algorithm yields a cost which is O(log^2 n) times optimal for any additive cost function. • Our analysis requires the definition of a weighted graph G(p) associated with a given permutation p. • The vertices of G(p) will be the n elements (positions) of p. There will be an edge (i,j) in G(p) where . The weight of this edge is .
Definitions (5) • G(p) may be used to provide lower bounds on the optimal cost of sorting. However, these bounds can be very coarse. • Instead, we bound the optimal cost in terms of the weight of the heaviest non-crossing matching M(G(p)). • We say a matching M(G(p)) (namely a group of edges from G(p)) is non-crossing if no two of its edges, viewed as intervals, overlap or nest. Such a maximal matching can be easily found using dynamic programming.
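One way to realise the dynamic program is the classic weighted-interval-scheduling recurrence. The sketch below takes the edges and their weights as given input, so it makes no assumption about how G(p) assigns them; it does assume that "non-crossing" means the intervals [i, j] are pairwise disjoint (no overlap, no nesting, no shared endpoint). The function name is illustrative.

```python
from bisect import bisect_left

def heaviest_noncrossing_matching(edges):
    """edges: iterable of (i, j, weight) with i < j.

    Returns (total_weight, chosen_edges) for a heaviest set of pairwise
    disjoint intervals, in O(m log m) time for m edges.
    """
    es = sorted(edges, key=lambda e: e[1])     # sort by right endpoint
    ends = [e[1] for e in es]
    best = [(0, [])]                           # best[k]: optimum over the first k edges
    for k, (i, j, w) in enumerate(es):
        # Number of earlier edges that end strictly before position i.
        compatible = bisect_left(ends, i, 0, k)
        take = best[compatible][0] + w
        if take > best[k][0]:
            best.append((take, best[compatible][1] + [(i, j, w)]))
        else:
            best.append(best[k])               # skipping this edge is at least as good
    return best[-1]

edges = [(1, 3, 2), (2, 5, 3), (4, 6, 2), (6, 8, 2)]
print(heaviest_noncrossing_matching(edges))
```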
Lemmas (3) • Theorem 3: The greedy breakpoint-merging heuristic can yield a reversal sequence whose cost is optimal. • Proof: Won’t be provided in this presentation. • Lemma 4: The weight of M(G(p)) is a lower bound on the reversal-sorting cost for permutation p under additive weight functions. • Proof: Consider the simpler task of just placing the elements defining edges from M(G(p)) into their proper position. This task can be done in cost f(w), where w is the total weight of M(G(p)), by performing the reversals defined by the edges in the matches. Because none of the intervals overlap or nest, no longer reversal can be helpful to move multiple elements into the proper position; because the cost function is additive we cannot benefit by using shorter reversals.
Lemmas (4) • To argue that the weight of M(G(p)) is a good lower bound, we will bound certain properties of p & G(p) in the size of this matching. • Lemma 5:Let denote the kth edge of M(G(p)), where . Let be a function which equals 1 if intersects the interval [i,…, j] and is zero otherwise. Then edge if • Proof: By definition, M(G(p)) is the maximum cost non-crossing matching. Hence such an edge (i, j) cannot exist in G(p), for if so we could remove all intersected matching edges and insert (i, j) into M(G(p)) to yield a higher cost non-crossing matching.
Lemmas (5) • Lemma 6: The number of out-of-position elements in p is at most . • Proof: Won’t be provided in this presentation. • Lemma 7: No element outside of the penumbra moves during the execution of MedianEject. • Definition: The penumbra is the set of positions where out-of-position elements potentially lie unioned with all positions overlapped by edges of M(G(p)). • Proof: Won’t be provided in this presentation. • Implied (By Lemma 7): Every round of non-overlapping reversals costs at most throughout the execution of ReversalSort.
Lemmas (6) • Corollary 1: The cost of each round of MedianEject is , and therefore ReversalSort costs . • Theorem 8: The ReversalSort heuristic solution is at most a factor of O(log^2 n) times the optimal solution. • Proof: Derived from the previous lemmas.
Coming Up Next • Improved bounds on Sorting with Length-Weighted Reversals (Extended Abstract). • Sorting by Length-weighted Reversals: Dealing with Signs and Circularity. • Conclusions, Suggestions & Questions raised. • Comments!?
Improved bounds on Sorting with Length-Weighted Reversals • We will now approach the problem of sorting integer sequences by length-weighted reversals using a wider range of cost functions. • For the cost function we consider a wide class of functions, namely f(l) = l^α for all α ≥ 0, where l is the length of the reversal. • So far we have mainly dealt with the case where α = 1.
Sorting Sequences of 0’s and 1’s • To sort a sequence of 0’s and 1’s: • Recursively sort the left and right halves. • Perform one more reversal across the median, for a sorting cost of: T(n) = 2T(n/2) + f(n). • Pinter and Skiena used this algorithm to obtain an upper bound of O(n log^2 n) on the diameter for linear-cost reversals, • as was shown in the first part of the presentation.
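The divide-and-conquer step can be sketched directly: once both halves are sorted, the 1s at the end of the left half and the 0s at the start of the right half form one contiguous block, and reversing that block sorts the whole range. The function name `sort01_by_reversals` and the half-open index convention are choices made for this illustration.

```python
def sort01_by_reversals(bits, f):
    """Sort a 0/1 list by reversals via divide and conquer. Returns (sorted list, cost)."""
    s = list(bits)
    cost = 0

    def rec(lo, hi):                    # sorts s[lo:hi] in place
        nonlocal cost
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        rec(lo, mid)
        rec(mid, hi)
        # Both halves now look like 0..01..1.
        start = lo + s[lo:mid].count(0)     # first 1 of the left half
        end = mid + s[mid:hi].count(0)      # one past the last 0 of the right half
        if start < mid and end > mid:       # only reverse if something is misplaced
            s[start:end] = s[start:end][::-1]
            cost += f(end - start)

    rec(0, len(s))
    return s, cost

print(sort01_by_reversals([0, 1] * 4, lambda l: l))
```

Each level of the recursion reverses disjoint blocks of total length at most n, which is where the f(n) term of the recurrence comes from for additive f.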
Bounds and Approximation Ratios for different values of α • The table summarizes the bounds and approximation ratios found for different values of α. • Proofs for some of the bounds and approximation ratios will be presented as proof of concept.
Upper Bounds on Diameter (1) • In the case of additive cost functions we saw that the upper bound on the cost of sorting any given permutation is O(n log^2 n). • Similarly, we would like to find such bounds for other functions in the class we are using (i.e. f(l) = l^α). • To do this, we will use the concept of sorting sequences of 0’s and 1’s.
Upper Bounds on Diameter (2) • Case 1 – 0 ≤ α < 1: • Consider the divide and conquer sorting algorithm described in the previous slide. The recursion relation for sorting the 0’s and 1’s becomes: T(n) = 2T(n/2) + O(n^α) = O(n). • For permutations, the cost for the recursion sorting becomes: T(n) = 2T(n/2) + O(n) = O(n log n). • Obviously, these results are upper bounds.
Upper Bounds on Diameter (3) • Case 2 – 1 < α < 2: • Consider the divide and conquer sorting algorithm described in the previous slide. The recursion relation for sorting the 0’s and 1’s becomes: T(n) = 2T(n/2) + O(n^α) = O(n^α). • For permutations, the cost for the recursion sorting becomes: T(n) = 2T(n/2) + O(n^α) = O(n^α). • Obviously, these results are upper bounds.
Upper Bounds on Diameter (4) • Case 3 – α ≥ 2: • This case has no use for reversals of more than two elements. As such, bubble sort is an asymptotically optimal solution. • As a result of this, a tight bound (upper and lower) on the diameter is: Θ(n^2).
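The growth rates for the different ranges of α can be checked numerically by evaluating the recurrences. This sketch assumes T(1) = 0, exact cost n^α for the merging reversal, and that sorting a permutation costs two recursive calls plus one 0/1 sort of the whole range (a quicksort-like partition around the median); these modelling choices and the name `make_recurrences` are assumptions made here.

```python
import math
from functools import lru_cache

def make_recurrences(alpha):
    """Return (t01, tperm) for the cost function f(l) = l ** alpha."""

    @lru_cache(maxsize=None)
    def t01(n):
        # 0/1 sequences: T(n) = 2 T(n/2) + n^alpha
        return 0.0 if n <= 1 else 2 * t01(n // 2) + n ** alpha

    @lru_cache(maxsize=None)
    def tperm(n):
        # Permutations (assumed model): T(n) = 2 T(n/2) + t01(n)
        return 0.0 if n <= 1 else 2 * tperm(n // 2) + t01(n)

    return t01, tperm

for alpha, label, growth in [
    (0.5, "n log n", lambda n: n * math.log2(n)),
    (1.0, "n log^2 n", lambda n: n * math.log2(n) ** 2),
    (1.5, "n^1.5", lambda n: n ** 1.5),
]:
    _, tperm = make_recurrences(alpha)
    ratios = [tperm(2 ** k) / growth(2 ** k) for k in (8, 12, 16, 20)]
    # Ratios that stay bounded as n grows are consistent with the claimed growth rate.
    print(alpha, label, [round(r, 3) for r in ratios])
```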
Lower Bounds on Diameter: Concept • Proving the lower bounds on the diameters for different values of α is much more complex than proving the upper bounds. • We will see the proof of a lower bound for a linear cost function f(l) = l. • Tighter than what we have already seen.
Lemmas (7) • Theorem 2.3: The cost to sort n elements by reversals with a linear cost function is Ω(n log n), even when all elements are 0’s and 1’s. • Thus, our bounds for sorting 0/1 sequences are tight (same upper and lower bounds), but a multiplicative gap of O(log n) exists for sorting permutations.
Proof of Lower Bound on Diameter for the Linear Cost Function (1) • We will approach the problem by exhibiting a difficult sorting instance. • Specifically, we will prove a lower bound of Ω(n log n) on the cost of sorting the length-n sequence 010101…01 by reversals. • The proof follows a potential function argument.
Definitions (6) • Before the sorting begins, we match the i-th 0 with the i-th 1. Throughout the sorting algorithm we will keep this matching. • Let d_i(t) be the current distance between the i-th 0 and the i-th 1 after the t-th reversal. • When there is no ambiguity, we abbreviate d_i(t) by d_i. • The potential function is: P(t) = Σ_i log d_i(t), summed over all n/2 matched pairs.
Lemmas (8) • Lemma 2.1: The initial value of the potential function is 0, and the final value is Ω(n log n). • We will show how a reversal affects the value of d_i in the potential function by considering the i-th (0,1) pair. • Observation 2.1: The distance d_i can only change when one element of the pair is inside the reversal and the other is outside. • Lemma 2.2: A reversal of length k increases the potential P(t) by at most 4k. • Together, these two lemmas imply Theorem 2.3: with a linear cost function, a reversal of cost k raises the potential by at most 4k, so reaching a potential of Ω(n log n) costs Ω(n log n).
Proof of Lower Bound on Diameter for the Linear Cost Function (2) • Proof: Suppose that for a reversal of length k, one of the elements of a (0,1) pair is inside the reversal and the other is outside, so that d_i is affected by the reversal. • At most, the distance between the two elements of this pair can increase by k, because each element is moved by a distance of at most k.
Proof of Lower Bound on Diameter for the Linear Cost Function (3) • [Figure: the sequence before the reversal and after the reversal]
Proof of Lower Bound on Diameter for the Linear Cost Function (4) • Let us assume by symmetry that the 0 is outside the reversed sequence and the 1 is inside. Suppose that the distance from the 0 to the closest element in the reversal is l. • The increase of the potential caused by the change in d_i for this pair is at most: log(l + k) - log l = log((l + k)/l).
Proof of Lower Bound on Diameter for the Linear Cost Function (5) • The distance l must be a natural number, and each value occurs at most twice in one reversal: once on the left side and once on the right side of the reversed sequence. • According to Observation 2.1, there are at most k such pairs whose distance changes the value of the potential function.
Proof of Lower Bound on Diameter for the Linear Cost Function (6) • As a result, the value of the potential function increases by at most: 2 Σ_{l=1..k} log((l + k)/l) = 2 log C(2k, k), where C(2k, k) = (2k)!/(k!)^2. • Notice that log((l + k)/l) grows as l gets smaller.
Proof of Lower Bound on Diameter for the Linear Cost Function (7) • By Stirling’s approximation, C(2k, k) ≈ 4^k / sqrt(πk) ≤ 4^k, therefore log C(2k, k) ≤ 2k (logarithms base 2) and the potential thus increases by at most 4k.
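The 4k bound of Lemma 2.2 can be sanity-checked numerically. The sketch below assumes one reading of the argument: a pair whose outside element sits at distance l from the reversal contributes at most log2((l + k)/l), and each l in 1..k is counted twice (once per side). That summation, its closed form via the central binomial coefficient, and the function name are reconstructions made here, not formulas quoted from the slides.

```python
import math

def max_potential_increase(k):
    """Assumed worst-case potential increase for one reversal of length k:
    2 * sum over l = 1..k of log2((l + k) / l)."""
    return 2 * sum(math.log2((l + k) / l) for l in range(1, k + 1))

for k in (1, 2, 10, 100):
    bound = max_potential_increase(k)
    closed = 2 * math.log2(math.comb(2 * k, k))   # same quantity via C(2k, k)
    # The last column is the 4k bound stated in Lemma 2.2.
    print(k, round(bound, 3), round(closed, 3), 4 * k)
```

Under these assumptions the sum never exceeds 4k, matching the statement of the lemma.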
Sorting by Length-weighted Reversals: Dealing with Signs and Circularity. • Abstract: • Sorting linear and circular permutations and 0/1 sequences by reversals in a length sensitive cost model. • We consider both the signed and unsigned case.
What Lies Ahead • Lower and upper bounds on the various cases. • Mentions of some approximations that guarantee the bounds shown. • Partial proofs of some of the bounds and approximations. • Cost functions are still of the class f(l) = l^α.
A Word (or Two) on Circularity • Circularity generally offers more opportunities to reduce the optimal cost to sort a given permutation by reversals. • At the same time, it presents a greater challenge of finding a more efficient solution. • A non unit cost model exacerbates these problems even further. • Take as an example the permutation . • One can sort it by using two reversals. • In the circular case, where the two ends of the permutation meet, one can sort it by using one reversal. • In the case of a unit cost model, the ratio of the costs is 2. • However, in the case of a linear cost model, the ratio is .
Relationship of Costs for the Different Cases • The following relationships hold for the four different cases:
Bounds and Approximation Ratios • Lower and upper bounds for sorting by reversals (SBR) of signed or unsigned and linear or circular 0/1 sequences and permutations. • Approximation ratios for SBR of signed linear as well as signed and unsigned circular 0/1 sequences and permutations.