Genome Rearrangement

Genomic Sorting with Length-Weighted Reversals. Ron Y. Pinter (Computer Science Department, Technion – Israel Institute of Technology) and Steve Skiena (SUNY Stony Brook). Genome rearrangement events (duplication, translocation, reversal/inversion) occur primarily during reproduction.


Presentation Transcript


  1. Genomic Sorting with Length-Weighted Reversals • Ron Y. Pinter, Computer Science Department, Technion – Israel Institute of Technology • Steve Skiena, SUNY Stony Brook

  2. Genome Rearrangement • events • duplication • translocation • reversal (inversion) • occur primarily during reproduction • allow large-scale genomic comparisons

  3. Sorting by Reversals • genome represented as a permutation on 1, 2, …, n • n = # homologous genes among species • assumptions • can identify genes • genes are distinct • operation: reversal of a subsequence (of genes) • models inversion (occurs during crossover) • one of the permutations can be 1, 2, …, n • appropriately relabel others

  4. Example • 6 reversals • in our model (for f(l) = l): cost = 18

  5. Our Model • unsigned • cost of reversal of subsequence of length l is f(l) • total sorting cost (or distance) is $\sum_j f(\mathrm{length}(s_j))$, where the $s_j$ are the reversed subsequences
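The cost model above can be illustrated with a minimal Python sketch. The function names, the 0-based half-open index convention `(i, j)`, and the sample permutation are assumptions made for illustration; they do not come from the slides.

```python
from typing import Callable, List, Sequence, Tuple


def apply_reversal(perm: Sequence[int], i: int, j: int) -> List[int]:
    """Return a copy of perm with the slice perm[i:j] reversed (j exclusive)."""
    p = list(perm)
    p[i:j] = p[i:j][::-1]
    return p


def sorting_cost(
    perm: Sequence[int],
    reversals: Sequence[Tuple[int, int]],
    f: Callable[[int], int] = lambda l: l,  # f(l) = l, the additive case
) -> Tuple[List[int], int]:
    """Apply the reversals in order; return the final permutation and
    the total cost, i.e. the sum of f(length) over all reversals."""
    current = list(perm)
    total = 0
    for i, j in reversals:
        current = apply_reversal(current, i, j)
        total += f(j - i)
    return current, total


# Hypothetical example: two reversals of lengths 3 and 2.
result, cost = sorting_cost([3, 2, 1, 5, 4], [(0, 3), (3, 5)])
```

With f(l) = l the two reversals cost 3 + 2; under unit cost f(l) = 1 the same sequence would cost 2.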

  6. Cost Functions • additive f(x+y) = f(x) + f(y) • subadditive f(x+y) < f(x) + f(y) • superadditive f(x+y) > f(x) + f(y) • other • e.g. bitonic
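As a rough illustration of these classes, the sketch below compares f(x+y) with f(x) + f(y) on a sample of lengths. It is an empirical check on a finite range, not a proof, and the `classify` helper and the example bitonic function are hypothetical.

```python
from typing import Callable


def classify(f: Callable[[int], float], max_len: int = 50) -> str:
    """Classify f by comparing f(x+y) with f(x)+f(y) for 1 <= x, y < max_len."""
    relations = set()
    for x in range(1, max_len):
        for y in range(1, max_len):
            lhs, rhs = f(x + y), f(x) + f(y)
            relations.add("=" if lhs == rhs else "<" if lhs < rhs else ">")
    if relations == {"="}:
        return "additive"
    if relations == {"<"}:
        return "subadditive"
    if relations == {">"}:
        return "superadditive"
    return "other"


def bitonic(l: int) -> int:
    """A made-up bitonic cost: rises up to length 10, then falls (floor of 1)."""
    return l if l <= 10 else max(20 - l, 1)


kinds = {
    "unit": classify(lambda l: 1),
    "linear": classify(lambda l: l),
    "quadratic": classify(lambda l: l * l),
    "bitonic": classify(bitonic),
}
```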

  7. Problems • algorithm to sort any permutation • worst-case min cost • approximate min cost for a given permutation

  8. Extremal Costs • highly subadditive: e.g. unit cost, f(l) = 1 • NP-complete [Caprara, ’97] • series of approximation ratios: 2, 1.75, 1.375 • highly superadditive: $f(l) > l^2$ • essentially bubblesort
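To see why the highly superadditive case resembles bubblesort: when long reversals are very expensive, one natural strategy is to use only reversals of length 2, which are adjacent swaps. The sketch below is an illustration of that idea under this assumption, not code from the presentation; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def bubble_sort_by_reversals(perm: Sequence[int]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Sort using only length-2 reversals (adjacent swaps).

    Returns the sorted list and the reversals used as half-open (i, j) pairs.
    """
    a = list(perm)
    reversals = []
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]  # reversal of a[i:i+2]
                reversals.append((i, i + 2))
    return a, reversals


sorted_perm, ops = bubble_sort_by_reversals([3, 1, 2])
# Every reversal has length 2, so the total cost is len(ops) * f(2).
cost_quadratic = sum((j - i) ** 2 for i, j in ops)
```

Each swap removes exactly one inversion, so the number of reversals equals the number of inversions in the input.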

  9. Our Results • additive cost function • specifically f(l) = l • QuickSort-like algorithm for worst-case • complexity: $O(n \lg^2 n)$ • min cost approximation ratio of $O(\lg^2 n)$

  10. MedianEject(a,b) • find r maximal blocks of wrong-sided elements with respect to median • for lg r do: flip every other pair of blocks of wrong-sided and adjacent blocks • move wrong-sided blocks to median boundary • reverse left and right blocks

  11. Sample Run • complexity: O((b-a) lg r)

  12. ReversalSort(a,b) • MedianEject(a,b); ReversalSort(a, m); ReversalSort(m, b), where m is the median position of the range • Complexity • $T(n) = 2\,T(n/2) + O(f(n) \lg n)$ • $O(f(n) \lg^2 n) = O(n \lg^2 n)$ for $f(n) \sim n$
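The recursive structure can be sketched in Python. Note that the partition step below is a simplified stand-in and not the authors' MedianEject: it recursively partitions each half around the median and then merges the two halves with a single reversal, which also costs O(n lg n) per call for f(l) = l. All names and the half-open index convention are assumptions, and distinct elements are assumed, as in the model.

```python
from typing import List, Optional, Tuple

Ops = List[Tuple[int, int]]


def partition_by_reversals(a: List[int], lo: int, hi: int, pivot: int, ops: Ops) -> int:
    """Rearrange a[lo:hi] by reversals so values <= pivot come first.

    Returns the index of the first value > pivot (the boundary).
    """
    if hi - lo <= 1:
        return lo + (1 if hi > lo and a[lo] <= pivot else 0)
    mid = (lo + hi) // 2
    left = partition_by_reversals(a, lo, mid, pivot, ops)   # a[left:mid] is large
    right = partition_by_reversals(a, mid, hi, pivot, ops)  # a[mid:right] is small
    if left < mid < right:
        # One reversal moves the small block in front of the large block.
        a[left:right] = a[left:right][::-1]
        ops.append((left, right))
    return left + (right - mid)


def reversal_sort(a: List[int], lo: int = 0, hi: Optional[int] = None,
                  ops: Optional[Ops] = None) -> Ops:
    """QuickSort-like in-place sort by reversals; returns the reversals used."""
    if hi is None:
        hi = len(a)
    if ops is None:
        ops = []
    if hi - lo <= 1:
        return ops
    mid = (lo + hi) // 2
    # Largest value that belongs in the left half (lookup only, not a reversal).
    pivot = sorted(a[lo:hi])[mid - lo - 1]
    partition_by_reversals(a, lo, hi, pivot, ops)
    reversal_sort(a, lo, mid, ops)
    reversal_sort(a, mid, hi, ops)
    return ops


def total_cost(ops: Ops) -> int:
    """Total cost under f(l) = l."""
    return sum(j - i for i, j in ops)


genome = [4, 1, 3, 6, 2, 5]
used = reversal_sort(genome)
```

In this sketch each partition call on a range of size n costs at most n per recursion level, so the total obeys the same recurrence T(n) = 2 T(n/2) + O(n lg n) as the slide.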

  13. Algorithmic Improvements • I. simplify “short” phases • II. merge the last 2 steps of MedianEject when possible (2p+q vs. 3p+q) • III. apply II recursively

  14. Approximation Ratio • $M(\pi)$ is the maximal total distance between pairs of out-of-order elements • Lemma 4: min cost is $\Omega(M(\pi))$ • but Lemma 6: # of out-of-order elements $< 3 \cdot M(\pi)$ • plus Lemma 7: MedianEject touches only elements within linear range from out-of-order elements • yields: • each round of MedianEject takes $O(M(\pi) \cdot \lg^2 n)$ • ReversalSort costs $O(M(\pi) \cdot \lg^2 n)$ • ReversalSort is at most $O(\lg^2 n)$ times optimal
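The way the lower and upper bounds combine can be written out as a short derivation. The constants $c_1, c_2 > 0$ and the names $\mathrm{OPT}$ and $\mathrm{cost}_{\mathrm{RS}}$ are introduced here for illustration only, and $\pi$ is assumed to be unsorted so that $M(\pi) > 0$.

```latex
\[
\mathrm{OPT}(\pi) \;\ge\; c_1\, M(\pi)
\qquad\text{and}\qquad
\mathrm{cost}_{\mathrm{RS}}(\pi) \;\le\; c_2\, M(\pi)\, \lg^2 n
\]
\[
\Longrightarrow\quad
\frac{\mathrm{cost}_{\mathrm{RS}}(\pi)}{\mathrm{OPT}(\pi)}
\;\le\; \frac{c_2\, M(\pi)\, \lg^2 n}{c_1\, M(\pi)}
\;=\; \frac{c_2}{c_1}\, \lg^2 n
\;=\; O(\lg^2 n).
\]
```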

  15. Bioinformatic “Validation” • use our cost (= distance) to build phylogenetic trees • 4 plants (chloroplastic genes): Cyanophora, Cyanidium, Guillardia, Porphyra • consistent with [Martin et al., PNAS Sept ‘02] • work in progress [M. Shoham]

  16. Open Problems: Algorithmic • weighted genes • tighter approximation ratio • close to O(lg n) • can get to constant? • other cost functions (incl. bitonic) • the signed case

  17. Open Problems: Modeling • chromosomal ordering • what is the right cost function? • consider $\mathrm{cost}(l) = l^d$ • combine with constant-based models • restricted regions • “undesired” reversal sequences • deal with duplication and translocation events
