Computer Science Department, Technion – Israel Institute of Technology • Genomic Sorting with Length-Weighted Reversals • Ron Y. Pinter (Technion) • Steve Skiena (SUNY Stony Brook)
Genome Rearrangement • events • duplication • translocation • reversal (inversion) • occur primarily during reproduction • allow large-scale genomic comparisons
Sorting by Reversals • genome represented as a permutation on 1, 2, …, n • n = # homologous genes among species • assumptions • can identify genes • genes are distinct • operation: reversal of a subsequence (of genes) • models inversion (occurs during crossover) • one of the permutations can be 1, 2, …, n • appropriately relabel others
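The reversal operation and the relabeling step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `reverse_segment` and `relabel` are hypothetical, indices are 0-based and inclusive, and genes are assumed distinct, as the slide states.

```python
def reverse_segment(perm, i, j):
    """Return a copy of perm with perm[i..j] (0-based, inclusive) reversed."""
    if not 0 <= i <= j < len(perm):
        raise ValueError("need 0 <= i <= j < len(perm)")
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def relabel(reference, other):
    """Relabel genes so that `reference` reads 1, 2, ..., n.

    The same relabeling is applied to `other`, so sorting the result
    corresponds to transforming `other` into `reference`.
    Assumes both genomes contain the same distinct genes.
    """
    label = {gene: k for k, gene in enumerate(reference, start=1)}
    return [label[gene] for gene in other]


# One reversal of the middle three genes sorts this permutation.
print(reverse_segment([1, 4, 3, 2, 5], 1, 3))
# Gene names are made up for illustration.
print(relabel(["b", "a", "c"], ["c", "a", "b"]))
```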
Example • 6 reversals • in our model (for f(l) = l): cost = 18
Our Model • unsigned • cost of reversal of a subsequence of length $l$ is $f(l)$ • total sorting cost (or distance) is $\sum_j f(\mathrm{length}(s_j))$, where the $s_j$ are the reversed subsequences
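To make the cost model concrete, the following sketch applies a sequence of reversals and sums f over their lengths. The helper name `apply_reversals` and the example permutation are assumptions made for illustration. The permutation is not the one from the Example slide.

```python
def apply_reversals(perm, reversals, f=lambda l: l):
    """Apply reversals (0-based inclusive index pairs) and total their cost.

    The cost of reversing a subsequence of length l is f(l); the total
    cost is the sum over all reversals performed.
    """
    p = list(perm)
    total = 0
    for i, j in reversals:
        if not 0 <= i <= j < len(p):
            raise ValueError("need 0 <= i <= j < len(perm)")
        p[i:j + 1] = p[i:j + 1][::-1]
        total += f(j - i + 1)
    return p, total


perm = [3, 2, 1, 5, 4]
steps = [(0, 2), (3, 4)]                     # lengths 3 and 2
print(apply_reversals(perm, steps))          # linear cost f(l) = l
print(apply_reversals(perm, steps, f=lambda l: 1))  # unit cost per reversal
```

Under f(l) = l the two reversals cost 3 + 2 = 5, while the unit-cost model counts them as 2.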
Cost Functions • additive: f(x+y) = f(x) + f(y) • subadditive: f(x+y) < f(x) + f(y) • superadditive: f(x+y) > f(x) + f(y) • other • e.g. bitonic
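The three classes above can be checked numerically on a finite range. The sketch below is only an empirical test over integer lengths up to an assumed `limit`, not a proof, and the name `classify` is hypothetical. It uses the strict inequalities exactly as written on the slide.

```python
def classify(f, limit=50, tol=1e-9):
    """Compare f(x+y) with f(x)+f(y) for all integers 1 <= x, y <= limit."""
    signs = set()
    for x in range(1, limit + 1):
        for y in range(1, limit + 1):
            diff = f(x + y) - (f(x) + f(y))
            if abs(diff) <= tol:
                signs.add(0)
            elif diff > 0:
                signs.add(1)
            else:
                signs.add(-1)
    if signs == {0}:
        return "additive"
    if signs == {-1}:
        return "subadditive"
    if signs == {1}:
        return "superadditive"
    return "other"   # mixed behaviour on the tested range


print(classify(lambda l: l))        # linear cost
print(classify(lambda l: 1))        # unit cost
print(classify(lambda l: l ** 2))   # quadratic cost
```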
Problems • algorithm to sort any permutation • worst-case min cost • approximate min cost for a given permutation
Extremal Costs • highly subadditive: e.g. unit cost, f(l) = 1 • NP-complete [Caprara, ’97] • series of approximation ratios: 2, 1.75, 1.375 • highly superadditive: $f(l) > l^2$ • essentially bubblesort
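The bubblesort remark can be illustrated as follows: when long reversals are expensive enough, sorting with adjacent swaps only (reversals of length 2) becomes attractive. The sketch below is a plain bubble sort that charges f(2) per swap. The function name and the cubic cost used in the example are assumptions for illustration.

```python
def bubble_sort_by_reversals(perm, f=lambda l: l):
    """Sort using only length-2 reversals (adjacent swaps); return (sorted, cost)."""
    p = list(perm)
    cost = 0
    for end in range(len(p) - 1, 0, -1):
        for i in range(end):
            if p[i] > p[i + 1]:
                p[i], p[i + 1] = p[i + 1], p[i]   # a reversal of length 2
                cost += f(2)
    return p, cost


cube = lambda l: l ** 3
# [4, 3, 2, 1] has 6 inversions, so bubble sort pays 6 * f(2).
print(bubble_sort_by_reversals([4, 3, 2, 1], f=cube))
# A single reversal of the whole permutation would pay f(4) instead.
print(cube(4))
```

With the cubic cost, the six adjacent swaps total 48, less than the 64 charged for one reversal of length 4.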
Our Results • additive cost function • specifically f(l) = l • QuickSort-like algorithm for worst-case • complexity: $O(n \lg^2 n)$ • min cost approximation ratio of $O(\lg^2 n)$
MedianEject(a,b) • find the r maximal blocks of wrong-sided elements with respect to the median • repeat lg r times: flip every other pair of blocks of wrong-sided and adjacent blocks • move wrong-sided blocks to the median boundary • reverse left and right blocks
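Only the first step of MedianEject, locating the maximal blocks of wrong-sided elements, is sketched here. The slide does not give the details, so several choices are assumptions: positions are 0-based and inclusive, the left half holds the ⌊n/2⌋ smallest values, elements are distinct, and a run that straddles the median boundary is split into two blocks. The function name is hypothetical.

```python
def wrong_sided_blocks(perm, a, b):
    """Return maximal runs (start, end) of wrong-sided elements in perm[a..b].

    An element is wrong-sided if it lies in the left half but belongs to
    the right half by value, or vice versa.
    """
    half = (b - a + 1) // 2              # size of the left half
    if half == 0:
        return []
    mid = a + half                       # first position of the right half
    median = sorted(perm[a:b + 1])[half - 1]   # largest value belonging on the left

    def wrong(i):
        return perm[i] > median if i < mid else perm[i] <= median

    blocks, start = [], None
    for i in range(a, b + 1):
        if i == mid and start is not None:     # assumption: split runs at the boundary
            blocks.append((start, i - 1))
            start = None
        if wrong(i):
            if start is None:
                start = i
        elif start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, b))
    return blocks


print(wrong_sided_blocks([1, 5, 2, 6, 3, 7, 4, 8], 0, 7))
```

The number of blocks returned plays the role of r in the slide's lg r rounds.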
Sample Run • complexity: O((b-a) lg r)
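ReversalSort(a,b) • MedianEject(a,b); then ReversalSort on the left half (from a to the median boundary); then ReversalSort on the right half (from the median boundary to b) • Complexity: $T(n) = 2T(n/2) + O(f(n) \lg n)$, which gives $O(f(n) \lg^2 n) = O(n \lg^2 n)$ for $f(n) \sim n$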
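The recursive structure can be tried out with a simpler stand-in for MedianEject. The sketch below is not the talk's MedianEject: it partitions around the median by recursively partitioning each half and then reversing the middle stretch (the large elements of the left half followed by the small elements of the right half). For f(l) = l this partition costs O(n lg n), so the overall recurrence has the same form as the one above. All names are hypothetical, ranges are half-open, and elements are assumed distinct.

```python
def partition_by_reversals(p, lo, hi, pivot, f):
    """Rearrange p[lo:hi] in place so values <= pivot come first.

    Uses reversals only; returns (number of values <= pivot, cost).
    """
    if hi - lo == 0:
        return 0, 0
    if hi - lo == 1:
        return (1 if p[lo] <= pivot else 0), 0
    mid = (lo + hi) // 2
    k_left, c_left = partition_by_reversals(p, lo, mid, pivot, f)
    k_right, c_right = partition_by_reversals(p, mid, hi, pivot, f)
    cost = c_left + c_right
    # Layout is now: small | LARGE | small | LARGE.
    # Reversing the middle "LARGE | small" stretch groups the small values.
    start, end = lo + k_left, mid + k_right
    if start < mid and k_right > 0:
        p[start:end] = p[start:end][::-1]
        cost += f(end - start)
    return k_left + k_right, cost


def reversal_sort(p, lo=0, hi=None, f=lambda l: l):
    """Sort p[lo:hi] in place by reversals; return the total cost."""
    if hi is None:
        hi = len(p)
    n = hi - lo
    if n <= 1:
        return 0
    pivot = sorted(p[lo:hi])[n // 2 - 1]     # median value: left half gets n // 2 items
    k, cost = partition_by_reversals(p, lo, hi, pivot, f)
    cost += reversal_sort(p, lo, lo + k, f)
    cost += reversal_sort(p, lo + k, hi, f)
    return cost


genome = [(7 * i) % 16 for i in range(16)]   # a permutation of 0..15
cost = reversal_sort(genome)
print(genome, cost)
```

Computing the median with `sorted` is a shortcut for the sketch: only the reversals are charged in this cost model, so the comparison work is ignored.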
Algorithmic Improvements • I: simplify “short” phases • II: merge the last 2 steps of MedianEject when possible (2p+q vs. 3p+q) • III: apply II recursively • [figure: blocks labeled p, q, p]
Approximation Ratio • $M(\pi)$ is the maximal total distance between pairs of out-of-order elements • Lemma 4: min cost is $\Omega(M(\pi))$ • but Lemma 6: # of out-of-order elements < $3M(\pi)$ • + Lemma 7: MedianEject touches only elements within linear range of out-of-order elements • yields: • each round of MedianEject takes $O(M(\pi) \lg^2 n)$ • ReversalSort costs $O(M(\pi) \lg^2 n)$ • ReversalSort is at most $O(\lg^2 n)$ times optimal
Bioinformatic “Validation” • use our cost (= distance) to build phylogenetic trees • 4 plants (chloroplastic genes): Cyanophora, Cyanidium, Guillardia, Porphyra • consistent with [Martin et al., PNAS Sept ’02] • work in progress [M. Shoham]
Open Problems: Algorithmic • weighted genes • tighter approximation ratio • close to O(lg n) • can get to constant? • other cost functions (incl. bitonic) • the signed case
Open Problems: Modeling • chromosomal ordering • what is the right cost function? • consider $\mathrm{cost}(l) = l^d$ • combine with constant-based models • restricted regions • “undesired” reversal sequences • deal with duplication and translocation events