Genome Rearrangements
Compared with other areas of bioinformatics, we still know very little about the rearrangement events that produced the existing variety of genomic architectures.
Some material of this lecture is borrowed from:
- Nipun Mehra, www.stanford.edu/class/cs374/Notes/lec17.ppt
- www.sna.csie.ndhu.edu.tw/~lung/seminar/20020502.ppt
- Bafna V. and P.A. Pevzner, "Sorting by reversals: genome rearrangements in plant organelles and evolutionary history of X chromosome."
- Hannenhalli S. and P.A. Pevzner, "Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals."
- P.A. Pevzner, "Computational Molecular Biology", MIT Press, chapter 10
Bioinformatics III

Processes of Evolution
- Substitution
- Insertion
- Deletion
- Translocation
- Inversion / Reversal
- Duplication

What is a reversal (= inversion)?
Double strand before the reversal:
A T G C C T G T A C T A
T A C G G A C A T G A T
Break and invert the middle segment:
A T G T A C A G G C T A
T A C A T G T C C G A T
Purines (A, G) and pyrimidines (C, T) switch strands.
Many organisms have highly similar genes but very different gene orders. Reversals are very prominent in prokaryotes, mitochondrial DNA, and the mammalian X chromosome.

Types of Genome Rearrangements
Two genomes may have many genes in common, but the genes may be arranged in a different order or be moved between chromosomes. Such differences in gene order result from rearrangement events that are common in molecular evolution.
In unichromosomal genomes, the most common rearrangement events are reversals, in which a contiguous interval of genes is put into the reverse order. In multichromosomal genomes, the most common rearrangement events are reversals, translocations, fissions, and fusions.
The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome into another via these rearrangement events.

Representation of a genome
We consider a unichromosomal genome to be a sequence of n genes. The genes are represented by the numbers 1, 2, ..., n. The two orientations of gene i are represented by i and -i. A genome is thus a signed permutation of the numbers 1, 2, ..., n.
For example, a unichromosomal genome with n = 5 genes is
5 -3 4 2 -1

Multichromosomal Genome
A multichromosomal genome consists of n genes spread over m chromosomes. We represent it as a signed permutation of 1, 2, ..., n, with delimiters "$" or ";" inserted between the chromosomes. For example, a genome with 12 genes spread over 3 chromosomes is
7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
The order of the chromosomes and the direction of each chromosome do not matter in the multichromosomal algorithms. Thus, we could represent this same genome by flipping the first chromosome (reversing the order of its entries and negating them) and then moving the last chromosome to the beginning:
11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $

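The equivalence of the two representations can be checked mechanically. The following is a minimal sketch, assuming a "$"-delimited text format as on the slide; the helper names `parse_genome`, `flip`, and `canonical` are hypothetical and not taken from any rearrangement program. It picks one representative per chromosome (the smaller of the chromosome and its flip) and sorts the chromosomes, so equivalent genomes get the same canonical form.

```python
from typing import List, Tuple

Chromosome = Tuple[int, ...]


def flip(chrom: Chromosome) -> Chromosome:
    """Reverse the gene order and negate every gene."""
    return tuple(-g for g in reversed(chrom))


def parse_genome(text: str) -> List[Chromosome]:
    """Split a '$'-delimited string into chromosomes of signed integers."""
    parts = [p.split() for p in text.split("$")]
    return [tuple(int(g) for g in p) for p in parts if p]


def canonical(genome: List[Chromosome]) -> Tuple[Chromosome, ...]:
    """Choose min(chromosome, flipped chromosome) and sort the chromosomes."""
    return tuple(sorted(min(c, flip(c)) for c in genome))


a = parse_genome("7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $")
b = parse_genome("11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $")
print(canonical(a) == canonical(b))  # True
```

Comparing canonical forms only tests whether two genomes are identical up to chromosome order and flipping; it says nothing about the rearrangement distance between genomes that differ.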
Unichromosomal genomes: sorting by reversal
A reversal in a signed permutation is an operation that takes an interval in a permutation, reverses the order of the numbers, and changes all their signs. For example, reversing the interval 3 2 -9 7 gives
5 1 3 2 -9 7 -4 6 8
5 1 -7 9 -2 -3 -4 6 8
The reversal distance between two genomes is the minimum number of reversals it takes to get from one genome to the other. For a given pair of genomes, the reversal distance is unique, but there are usually many possible reversal scenarios with this distance.
Note that this mathematical notion of reversal distance can underestimate the actual number of steps that occurred biologically.

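A signed reversal is easy to express in code. This is a minimal sketch, assuming 1-based inclusive positions; the function name `reverse_signed` is made up for illustration. It reproduces the example above.

```python
from typing import List


def reverse_signed(perm: List[int], i: int, j: int) -> List[int]:
    """Reverse positions i..j (1-based, inclusive) and flip the sign of each gene."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= len(perm)")
    block = [-g for g in reversed(perm[i - 1:j])]
    return perm[:i - 1] + block + perm[j:]


before = [5, 1, 3, 2, -9, 7, -4, 6, 8]
after = reverse_signed(before, 3, 6)
print(after)  # [5, 1, -7, 9, -2, -3, -4, 6, 8]
```

Applying the same reversal twice restores the original genome, which is why a reversal scenario can be read in either direction.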
Multichromosomal genomes: rearrangement operations
We treat four elementary rearrangement events in multichromosomal genomes: reversals, translocations, fusions, and fissions.
Reversal: an interval within a single chromosome may be reversed in the same fashion as a reversal acts in the unichromosomal case.
Before:
7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $
After:
7 -2 8 3 $
5 9 -12 1 6 $
11 4 10 $
Note: When the programs are run in unichromosomal mode, the genomes 3 1 2 and -2 -1 -3 are considered different (one reversal apart, distance = 1). In multichromosomal mode, those same genomes are considered equivalent (distance = 0), because flipping an entire chromosome gives an equivalent genome.

Translocation
Two chromosomes "A B" and "C D" may be rearranged into "A D" and "C B". (The letters A, B, C, D stand for sequences of genes.) Because flipping chromosomes does not alter a genome (only its representation is altered), "A -C" and "-B D" is another possible translocation. (-B means to reverse the order of the genes in sequence B and negate each one.)
For example, a translocation on chromosomes 1 and 3 is
Before:
7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $
After:
7 -2 8 -4 -11 $
5 9 -6 -1 12 $
-3 10 $

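The translocation example can be reproduced with a short sketch. The function `translocate` and its parameters `cut_x`, `cut_y`, and `flip_y` are hypothetical names chosen here; the sketch assumes that a chromosome is a list of signed integers and that one of the two chromosomes may be flipped before the tails are exchanged.

```python
from typing import List, Tuple

Chromosome = List[int]


def flip(chrom: Chromosome) -> Chromosome:
    """Reverse the gene order and negate every gene."""
    return [-g for g in reversed(chrom)]


def translocate(x: Chromosome, y: Chromosome, cut_x: int, cut_y: int,
                flip_y: bool = False) -> Tuple[Chromosome, Chromosome]:
    """Cut x after cut_x genes and y after cut_y genes, then swap the tails.

    With x = A B and y = C D this returns (A D, C B).
    If flip_y is True, y is flipped before it is cut.
    """
    if flip_y:
        y = flip(y)
    a, b = x[:cut_x], x[cut_x:]
    c, d = y[:cut_y], y[cut_y:]
    return a + d, c + b


chr1 = [7, -2, 8, 3]
chr3 = [11, 4, 10]
new1, new3 = translocate(chr1, chr3, cut_x=3, cut_y=1, flip_y=True)
print(new1)        # [7, -2, 8, -4, -11]
print(flip(new3))  # [-3, 10]
```

The second result is printed flipped only to match the representation used in the example; `[-10, 3]` and `[-3, 10]` denote the same chromosome.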
Fusion & Fission
Fusion: Two chromosomes may be fused together into a single chromosome. Because chromosomes can be flipped, there are four distinct fusions between each pair of chromosomes. Here is one of the fusions between chromosomes 1 and 3:
Before:
7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $
After:
7 -2 8 3 -10 -4 -11 $
5 9 -6 -1 12 $
Fission: A chromosome may be broken into two chromosomes between any pair of adjacent genes:
Before:
7 -2 8 3 $
5 9 -6 -1 12 $
11 4 10 $
After:
7 -2 8 3 $
5 9 $
-6 -1 12 $
11 4 10 $

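Both operations reduce to list concatenation and slicing. The sketch below uses hypothetical helpers `fuse` and `fission` and assumes chromosomes are lists of signed integers; the `flip_x` and `flip_y` flags select among the four fusions mentioned above.

```python
from typing import List, Tuple

Chromosome = List[int]


def flip(chrom: Chromosome) -> Chromosome:
    """Reverse the gene order and negate every gene."""
    return [-g for g in reversed(chrom)]


def fuse(x: Chromosome, y: Chromosome,
         flip_x: bool = False, flip_y: bool = False) -> Chromosome:
    """Join two chromosomes end to end, optionally flipping either one first."""
    left = flip(x) if flip_x else list(x)
    right = flip(y) if flip_y else list(y)
    return left + right


def fission(chrom: Chromosome, k: int) -> Tuple[Chromosome, Chromosome]:
    """Break a chromosome between gene k and gene k+1 (1-based count)."""
    if not 0 < k < len(chrom):
        raise ValueError("the cut must fall between two genes")
    return list(chrom[:k]), list(chrom[k:])


print(fuse([7, -2, 8, 3], [11, 4, 10], flip_y=True))
# [7, -2, 8, 3, -10, -4, -11]
print(fission([5, 9, -6, -1, 12], 2))
# ([5, 9], [-6, -1, 12])
```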
Signed and unsigned genomes
Most comparative mapping techniques determine the physical locations and relative order of genes in each chromosome, but do not determine which of the two orientations each gene has. Current sequencing methods do provide the orientations.
It turns out that the genome rearrangement problem (uni- and multichromosomal) for unsigned permutations is NP-hard, whereas the same problems for signed data can be solved in polynomial time.
Fortunately, with many genomes currently being sequenced, it is likely that many comparative maps (corresponding to unsigned permutations) will soon be replaced by sequencing data (corresponding to signed permutations).

Multichromosomal genomes: rearrangement operations
For example, turning the unsigned genome 1 2 3 4 5 into the unsigned genome 1 4 3 2 5 requires one unsigned reversal. Signs can be assigned in the source and destination genomes so that a signed reversal scenario requires this same number of steps. Here, we get
1 2 3 4 5
1 -4 -3 -2 5
which also takes one step. Note that there may be other sign assignments taking this minimum number of steps.

Multichromosomal genomes: rearrangement operations
It is possible that correctly signed data would have increased the number of steps:
1 2 3 4 5
1 -4 -3 -2 5
1 -4 3 -2 5
If the data collection method did not determine signs, it is impossible to know mathematically whether the one-step or the two-step scenario is more biologically accurate. The mathematical problem that the genome rearrangement programs solve is to find the signs giving the minimum possible distance.

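For a genome this small, the effect of the signs can be verified by exhaustive search. The following is a brute-force sketch, not one of the polynomial-time algorithms discussed in this lecture: it runs a breadth-first search over all signed permutations of 1..5 and then takes the minimum over all sign assignments of the unsigned target. The function names are hypothetical, and the approach is only feasible for very small n.

```python
from collections import deque
from itertools import product
from typing import Dict, Iterator, Tuple

Perm = Tuple[int, ...]


def signed_reversals(p: Perm) -> Iterator[Perm]:
    """Yield every permutation reachable from p by one signed reversal."""
    n = len(p)
    for i in range(n):
        for j in range(i, n):
            yield p[:i] + tuple(-g for g in reversed(p[i:j + 1])) + p[j + 1:]


def distances_from_identity(n: int) -> Dict[Perm, int]:
    """Breadth-first search over all signed permutations of 1..n (small n only)."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for q in signed_reversals(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def best_signing(unsigned: Perm, dist: Dict[Perm, int]) -> int:
    """Minimum signed distance over all sign assignments of the target."""
    return min(
        dist[tuple(s * g for s, g in zip(signs, unsigned))]
        for signs in product((1, -1), repeat=len(unsigned))
    )


dist = distances_from_identity(5)
print(dist[(1, -4, -3, -2, 5)])              # 1
print(dist[(1, -4, 3, -2, 5)])               # 2
print(best_signing((1, 4, 3, 2, 5), dist))   # 1
```

Fixing the source genome as 1 2 3 4 5 with all signs positive loses no generality here, since relabeling the genes moves any sign choice in the source into the destination.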
X-Alignments
The "X" factor, discovered by Eisen et al.: alignment of whole genomes of prokaryotes such as bacteria revealed X-like patterns in dot plots, called X-alignments.
Implication: the reversals took place with endpoints equidistant from the center of the chromosome.
Matches along the diagonal are orthologs between species. Matches along the anti-diagonal are duplicates separated by an inversion, within species.

A biological model case
Palmer and Herbon found that the mitochondrial genomes of cabbage and turnip have very similar gene sequences but fairly different gene orders. How can we design a "transformation" of cabbage into turnip?
The mitochondrial DNA of cabbage and turnip is composed of five conserved blocks of genes that are shuffled in cabbage as compared to turnip. Every conserved block has a direction, shown by a + or - sign.
[Figure: order and orientation of the conserved blocks in cabbage and in turnip]

Inversion, Transposition and Inverted Transposition
[Figure: three panels illustrating an inversion, a transposition, and an inverted transposition]

Oriented/Unoriented Blocks
- Oriented blocks: polynomial time
- Unoriented blocks: NP-hard
Remember that the unoriented case results in an NP-hard problem, whereas the oriented case can be solved in polynomial time.
[Figure: an example with oriented blocks and an example with unoriented blocks (2 1 3 7 5 4 8 6 versus 1 2 3 4 5 6 7 8)]

Sorting by Reversals
[Figure: step-by-step transformation of the cabbage block order into the turnip block order by reversals]

Permutation ($\pi$): an ordered arrangement of the set {1, 2, ..., n}.
Reversal ($\rho$): a rearrangement that inverts a block in $\pi$. Example:
{3 4 7 6 1 5 2} $\cdot\ \rho(3,6)$ = {3 4 5 1 6 7 2}
Signed permutation ($\pi$): a permutation in which the elements are oriented; a reversal switches the orientation of the elements it moves. Example:
{+3 -4 +7 -6 +1 -5 +2} $\cdot\ \rho(3,6)$ = {+3 -4 +5 -1 +6 -7 +2}

[Figure: a reversal scenario for the cabbage and turnip block orders; easy to do by eye for such a small example]
The scenario applies the reversals one after another, $\pi\rho_1,\ \pi\rho_1\rho_2,\ \pi\rho_1\rho_2\rho_3,\ \dots$, until $\pi\rho_1\rho_2\cdots\rho_t = \sigma$; equivalently, $\sigma\rho_t\cdots\rho_2\rho_1 = \pi$.

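Formal Approach: Sorting by Reversals
The order of genes in two organisms is represented by permutations $\pi = \pi_1\pi_2\dots\pi_n$ and $\sigma = \sigma_1\sigma_2\dots\sigma_n$.
A reversal $\rho(i,j)$ of an interval $[i,j]$ is the permutation
$$\rho(i,j) = \begin{pmatrix} 1 & 2 & \dots & i-1 & i & i+1 & \dots & j-1 & j & j+1 & \dots & n \\ 1 & 2 & \dots & i-1 & j & j-1 & \dots & i+1 & i & j+1 & \dots & n \end{pmatrix}$$
$\rho(i,j)$ has the effect of reversing the order of $\pi_i\pi_{i+1}\dots\pi_j$ and transforming $\pi_1\dots\pi_{i-1}\,\pi_i\dots\pi_j\,\pi_{j+1}\dots\pi_n$ into
$$\pi\cdot\rho(i,j) = \pi_1\dots\pi_{i-1}\,\pi_j\dots\pi_i\,\pi_{j+1}\dots\pi_n .$$
Given permutations $\pi$ and $\sigma$, the reversal distance problem is to find a series of reversals $\rho_1, \rho_2, \dots, \rho_t$ such that $\pi\cdot\rho_1\cdot\rho_2\cdots\rho_t = \sigma$ and $t$ is minimal. $t$ is called the reversal distance between $\pi$ and $\sigma$.
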
Breakpoint Graph
Sorting a permutation by reversals is a hard problem. Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor (1984), and correlations were noticed between the reversal distance and the number of breakpoints.
Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1\pi_2\dots\pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$. We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.
Example: the permutation 2 3 1 4 6 5 7 is extended to
0 2 3 1 4 6 5 7 8
which has 3 adjacencies, (2,3), (6,5), (7,8), and 5 breakpoints, (0,2), (3,1), (1,4), (4,6), (5,7).
As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints. The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance satisfies
$$d(\pi) \ge b(\pi) / 2$$
where $b(\pi)$ is the number of breakpoints in $\pi$. However, this lower bound is often far from tight.

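Counting breakpoints is a direct translation of the definition. This is a minimal sketch for unsigned permutations; the helper names `extend`, `breakpoints`, and `lower_bound` are chosen here for illustration.

```python
from typing import List, Sequence, Tuple


def extend(perm: Sequence[int]) -> List[int]:
    """Add the framing elements 0 and n+1."""
    return [0] + list(perm) + [len(perm) + 1]


def breakpoints(perm: Sequence[int]) -> List[Tuple[int, int]]:
    """Neighbouring pairs of the extended permutation that do not differ by 1."""
    ext = extend(perm)
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) != 1]


def lower_bound(perm: Sequence[int]) -> int:
    """Smallest integer that is at least b/2, since one reversal removes at most 2 breakpoints."""
    return (len(breakpoints(perm)) + 1) // 2


pi = [2, 3, 1, 4, 6, 5, 7]
print(breakpoints(pi))  # [(0, 2), (3, 1), (1, 4), (4, 6), (5, 7)]
print(lower_bound(pi))  # 3
```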
Breakpoint Graph
The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with $n + 2$ vertices $\{\pi_0, \pi_1, \dots, \pi_n, \pi_{n+1}\} = \{0, 1, \dots, n, n+1\}$.
We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$.
We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.
In other words, the breakpoint graph is obtained by superposing a black path that traverses the vertices 0, 1, ..., n, n+1 in the order given by the permutation and a gray path that traverses the vertices in the order given by the identity permutation.
[Figure: black path, gray path, and their superposition for the permutation 2 3 1 4 6 5 7]

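The construction can be sketched with two edge lists. This sketch follows the definition given above (black edges along the permutation, gray edges along the identity); the function names are hypothetical. The degree check at the end confirms that every vertex has as many black as gray edges.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def breakpoint_graph(perm: Sequence[int]) -> Tuple[List[Edge], List[Edge]]:
    """Return (black edges, gray edges) of the breakpoint graph of an unsigned permutation."""
    ext = [0] + list(perm) + [len(perm) + 1]
    black = [(ext[k], ext[k + 1]) for k in range(len(ext) - 1)]  # path along the permutation
    gray = [(v, v + 1) for v in range(len(ext) - 1)]             # path along the identity
    return black, gray


def is_balanced(black: List[Edge], gray: List[Edge]) -> bool:
    """True if every vertex has equally many black and gray edges."""
    black_deg = Counter(v for edge in black for v in edge)
    gray_deg = Counter(v for edge in gray for v in edge)
    return black_deg == gray_deg


black, gray = breakpoint_graph([2, 3, 1, 4, 6, 5, 7])
print(black)  # [(0, 2), (2, 3), (3, 1), (1, 4), (4, 6), (6, 5), (5, 7), (7, 8)]
print(is_balanced(black, gray))  # True
```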
Cycle decomposition
A cycle in an edge-colored graph G is called alternating if the colors of every two consecutive edges of this cycle are distinct. In the following, cycles will mean alternating cycles.
Cycle decomposition of the breakpoint graph: A vertex v in a graph G is called balanced if the number of black edges incident to v equals the number of gray edges incident to v. A balanced graph is a graph in which every vertex is balanced.
$G(\pi)$ is a balanced graph. Therefore, there exists a cycle decomposition of $G(\pi)$ into edge-disjoint alternating cycles (every edge in the graph belongs to exactly one cycle in the decomposition).
Cycles in a cycle decomposition may be self-intersecting. The previous breakpoint graph can be decomposed into 4 cycles, one of which is self-intersecting.
[Figure: the four alternating cycles of the breakpoint graph of 0 2 3 1 4 6 5 7]

Cycle decomposition
What is the decomposition of the breakpoint graph into a maximum number $c(\pi)$ of edge-disjoint alternating cycles? Here, $c(\pi) = 4$.
Cycle decompositions play an important role in estimating reversal distances. When a reversal is applied to a permutation, the number of cycles in a maximum decomposition can change by at most one (while the number of breakpoints can change by two).
Bafna & Pevzner (1996) proved the bound
$$d(\pi) \ge n + 1 - c(\pi)$$
which is much tighter than the bound in terms of breakpoints, $d(\pi) \ge b(\pi)/2$. For many biological problems, $d(\pi) = n + 1 - c(\pi)$.
Therefore, the reversal distance problem reduces to the problem of finding a maximum cycle decomposition.

Effects of reversals on cycles
(A) For reversals acting on two cycles, $\Delta(b - c) = 1$.
(B) For reversals acting on an unoriented cycle, $\Delta(b - c) = 0$.
(C) For reversals acting on an oriented cycle, $\Delta(b - c) = -1$.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Effect of reversals on gray edges
(a) A proper reversal on an oriented gray edge.
(b) A nonproper reversal on an unoriented gray edge.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Transform signed into unsigned permutation
(a) Optimal sorting of the permutation (3 5 8 6 4 7 9 2 1 10 11) by 5 reversals.
(b) Breakpoint graph of this permutation: black edges connect adjacent vertices that are not consecutive, gray edges connect consecutive vertices that are not adjacent.
(c) Transformation of a signed permutation into an unsigned permutation $\pi$ and the breakpoint graph $G(\pi)$.
(d) Interleaving graph H with two oriented components and one unoriented component.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

The Problems
Minimum Sorting by Reversals (MinSortRv): Given a permutation $\pi$, what is the shortest sequence $(\rho_1\rho_2\dots\rho_t)$ of reversals that sorts $\pi$? (Distance: $d(\pi)$.) This problem is NP-hard.
Minimum Signed Sorting by Reversals (SignedSortRv): Given a signed permutation $\pi$, what is the shortest sequence $(\rho_1\rho_2\dots\rho_t)$ of reversals that sorts $\pi$? Solvable in polynomial time.

Important developments
KS93 - Kececioglu and Sankoff, "Exact and approximation algorithms for the inversion distance between two chromosomes", 4th CPM
- studied MinSortRv
- introduced the notion of breakpoints
- 2-approximation algorithm
BP93 - Bafna and Pevzner, "Genome Rearrangements and Sorting by Reversals", 34th FOCS
- breakpoint graph and cycle decomposition
- introduced signed sorting (SignedSortRv)
- 3/2-approximation algorithm for SignedSortRv
- 7/4-approximation algorithm for MinSortRv
HP95 - Hannenhalli and Pevzner, "Transforming Cabbage into Turnip", 27th STOC
- SignedSortRv resolved
- $O(n^4)$ algorithm
- introduced hurdles and fortresses
- $d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$

KS93 - Breakpoints
Extend $\pi$ to include element 0 (L) on the left and element n+1 (R) on the right. A breakpoint occurs between two neighbouring elements that do not differ by 1.
Example: $\pi$ = {3 5 6 7 2 1 4 8} has 5 breakpoints ($b(\pi) = 5$), marked by "|":
L | 3 | 5 6 7 | 2 1 | 4 | 8 R
- Breakpoints partition the sequence into strips that are increasing or decreasing.
- Reversals add or remove breakpoints.
- A sorted permutation has 0 breakpoints.
i-reversal (i = 0, 1, 2): a reversal that decreases the number of breakpoints by i.
Theorem (KS): Let $\pi$ contain a decreasing strip. Then $\pi$ has a 1- or 2-reversal. If every reversal that removes a breakpoint of $\pi$ results in a permutation with no decreasing strips, then $\pi$ has a 2-reversal.

Algorithm KS($\pi$)
  i <- 0
  while $\pi$ contains a breakpoint
    i <- i + 1
    $\rho_i$ <- the reversal that removes the most breakpoints, resolving ties in favor of reversals that leave a decreasing strip
    $\pi$ <- $\pi \cdot \rho_i$
  return ($\rho_1, \dots, \rho_i$)
The optimal reversal distance is at least $b(\pi)/2$. KS returns a solution with at most $b(\pi)$ reversals, which is at most 2 times the optimum.

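The greedy rule can be tried out on small unsigned permutations with the sketch below. It is an illustration written for this text, not the authors' implementation: it enumerates all reversals by brute force, treats single-element strips as decreasing (an assumption about the strip convention), and uses hypothetical function names.

```python
from typing import List, Sequence, Tuple


def extend(perm: Sequence[int]) -> List[int]:
    return [0] + list(perm) + [len(perm) + 1]


def count_breakpoints(perm: Sequence[int]) -> int:
    ext = extend(perm)
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def has_decreasing_strip(perm: Sequence[int]) -> bool:
    """True if some element lies in a decreasing strip (singletons count as decreasing)."""
    ext = extend(perm)
    for k in range(1, len(ext) - 1):
        left, mid, right = ext[k - 1], ext[k], ext[k + 1]
        if left - mid == 1 or mid - right == 1:
            return True
        if abs(left - mid) != 1 and abs(mid - right) != 1:
            return True
    return False


def ks_greedy(perm: Sequence[int]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Greedy sorting: remove the most breakpoints, prefer leaving a decreasing strip."""
    perm = list(perm)
    steps: List[Tuple[int, int]] = []
    while count_breakpoints(perm) > 0:
        current = count_breakpoints(perm)
        best_key, best_move, best_perm = None, None, None
        for i in range(len(perm)):
            for j in range(i + 1, len(perm)):
                cand = perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]
                key = (current - count_breakpoints(cand), has_decreasing_strip(cand))
                if best_key is None or key > best_key:
                    best_key, best_move, best_perm = key, (i + 1, j + 1), cand
        steps.append(best_move)  # 1-based positions of the reversed interval
        perm = best_perm
    return steps, perm


steps, result = ks_greedy([3, 5, 6, 7, 2, 1, 4, 8])
print(result)      # [1, 2, 3, 4, 5, 6, 7, 8]
print(len(steps))  # number of reversals used by the greedy rule
```

The brute-force enumeration costs on the order of n^3 work per step, which is acceptable for illustration but not for real genomes.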
BP93 - Breakpoint Graph
Vertices: the elements of $\pi$ (plus 0 (L) and n+1 (R)).
THE DIAGRAM OF REALITY AND DESIRE
[Figure: diagram of reality and desire for the permutation 2 3 1 6 5 4 framed by L and R]

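Construction of a diagram of reality and desire
Each signed element is split into two points, giving the sequence
L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R
- Reality edges connect the points that are neighbours in the current permutation (reality).
- Desire edges connect the points that are neighbours in the sorted permutation (desire).
[Figure: reality edges and desire edges drawn on the sequence above]
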
[Figure: the diagram of reality and desire for L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R, drawn linearly and as a circular layout]

Effect of reversals on $c(\pi)$
$c(\pi)$ = number of cycles in a maximum cycle decomposition.
Observation: reversals affect $c(\pi)$.
Example: in {L [+1 -1] -2 +2 +3 -3 R}, the reversal of the bracketed interval removes 2 breakpoints and 1 cycle:
L +1 -1 -2 +2 +3 -3 R
L -1 +1 -2 +2 +3 -3 R

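$d(\pi) \ge b(\pi) - c(\pi)$
Cycles of length 4 are eliminated by 2-reversals. Let $c_4(\pi)$ = number of 4-cycles.
Each of the remaining $c(\pi) - c_4(\pi)$ cycles has length > 4 and includes at least three breakpoints, while each 4-cycle includes two, so $c(\pi) - c_4(\pi) \le (b(\pi) - 2c_4(\pi))/3$. Therefore
$$d(\pi) \ge b(\pi) - c_4(\pi) - \frac{b(\pi) - 2c_4(\pi)}{3}$$
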
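For signed permutations the diagram of reality and desire has exactly one reality edge and one desire edge at every point, so its cycles can be counted by simply walking along alternating edges. The sketch below assumes the common encoding +x -> (2x-1, 2x) and -x -> (2x, 2x-1) and counts all cycles, including the 2-cycles formed by adjacencies, so the lower bound appears as n + 1 minus the number of cycles. The function names and the reading of the slide sequence L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R as the signed permutation +3 -2 -1 +4 -5 are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def unsign(perm: Sequence[int]) -> List[int]:
    """Map +x to (2x-1, 2x) and -x to (2x, 2x-1), framed by 0 and 2n+1."""
    out = [0]
    for g in perm:
        a, b = 2 * abs(g) - 1, 2 * abs(g)
        out += [a, b] if g > 0 else [b, a]
    out.append(2 * len(perm) + 1)
    return out


def count_cycles(perm: Sequence[int]) -> int:
    """Number of alternating cycles in the diagram of reality and desire."""
    ext = unsign(perm)
    reality: Dict[int, int] = {}
    for k in range(0, len(ext), 2):      # reality edges join positions (0,1), (2,3), ...
        u, v = ext[k], ext[k + 1]
        reality[u], reality[v] = v, u
    seen = set()
    cycles = 0
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = reality[v]               # follow the reality edge
            seen.add(w)
            v = w + 1 if w % 2 == 0 else w - 1  # follow the desire edge (2i, 2i+1)
    return cycles


def cycle_lower_bound(perm: Sequence[int]) -> int:
    """Lower bound n + 1 - c on the signed reversal distance."""
    return len(perm) + 1 - count_cycles(perm)


print(count_cycles([3, -2, -1, 4, -5]))       # 3
print(cycle_lower_bound([3, -2, -1, 4, -5]))  # 3
```

Because 2-cycles correspond to adjacencies, n + 1 minus the number of all cycles equals b minus the number of cycles through breakpoints, which is the form of the bound used in the text.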
Algorithm BP($\pi$)
  while $\pi$ contains a breakpoint
    if $\pi$ has no decreasing strips
      if a 4-cycle C remains
        find a cycle C' that crosses C
        0-reversal on C', then 2-reversal on C
      else
        regular 0-reversal
    else
      regular greedy choice
Algorithm BP produces a solution that uses at most 3/2 times the optimal number of reversals.

HP95 - Hurdles and Fortress
[Figure: interleaving graph with components A, B, C, D, E, F]

Hurdles
A hurdle is a bad component that does not separate any other two bad components.
Separation is an important concept: a reversal through reality edges in different components A and C results in every component B that separates A and C being twisted. A bad component becomes good when twisted.
Classification: bad components are either non-hurdles or hurdles; hurdles are either simple hurdles or super hurdles.
[Figure: classification tree of the bad components A to F into non-hurdles, simple hurdles, and super hurdles]

Fortress
A permutation $\pi$ is called a fortress ($f(\pi) = 1$) when its reality and desire diagram contains an odd number of hurdles and all of them are super hurdles.
Fortresses are permutations that require one extra reversal to sort, due to their special structure.
[Figure: a smallest possible fortress]

Algorithm HP($\pi$)
  if there is a good component in RD($\pi$) then
    pick two divergent edges e, f in this component, making sure the corresponding reversal does not create any bad components
    return the reversal characterized by e and f
  else if $h(\pi)$ is even then
    return a merging of two opposite hurdles
  else if $h(\pi)$ is odd and there is a simple hurdle then
    return a reversal cutting this hurdle
  else  // fortress
    return a merging of any two hurdles
$d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$
$h(\pi)$: number of hurdles
$f(\pi)$: 1 if $\pi$ is a fortress, 0 otherwise

Hurdles
(a) Unoriented component U separates U' and U'' by virtue of the edge (0, 1).
(b) Hurdle U does not separate U' and U''.
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Effects of reversals on cycles
A reversal on a cycle C
(i) deletes vertex C from the interleaving graph;
(ii) changes the orientation of the vertices in V(C);
(iii) complements the subgraph induced by V(C).
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Merging hurdles
[Figure: merging of hurdles]
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Hannenhalli-Pevzner algorithm
[Figure: the Hannenhalli-Pevzner algorithm]
Hannenhalli, Pevzner, Journal of the ACM 46, 1 (1999)

Improvements of the Hannenhalli-Pevzner algorithm
Several websites offer programs to sort permutations by reversals. At their root is the Hannenhalli-Pevzner algorithm for sorting signed permutations by reversals. Later authors improved the algorithm.
- With the Hannenhalli & Pevzner algorithm, the distance computation takes time $O(n^4)$.
- Improvements developed by Haim Kaplan, Ron Shamir, and Robert E. Tarjan bring the time to compute the distance down to $O(n^2)$.
- GRAPPA was written by many authors. It reduces the distance computation time to $O(n)$ using improvements by David A. Bader, Bernard M.E. Moret, and Mi Yan. The main purpose of GRAPPA is to construct phylogenetic trees for multiple signed unichromosomal genomes; the distance computation on which we focus here is merely a subroutine in that context.

Algorithm by Kaplan, Shamir, Tarjan
The algorithm has three main stages:
1. Pre-process the permutation. This pre-processing contains 3 sub-stages:
   1a. Unsign the permutation, e.g., p will be unsigned to the permutation 0, (7,8), (4,3), (1,2), (5,6), (12,11), (9,10), 13.
   1b. Define the overlap graph of the permutation.
   1c. Find the connected components of the overlap graph.
2. Clear the hurdles. A hurdle is a problematic connected component of the overlap graph. In this stage each reversal merges two hurdles in distinct connected components into one non-hurdle component.
3. Generate a sequence of safe reversals. A safe reversal is defined as a reversal that reduces b - c (the number of breakpoints minus the number of cycles) without creating new hurdles.

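Stage 1a can be sketched in a few lines. The signed permutation p is not shown in the text; the value +4 -2 +1 +3 -6 +5 used below is inferred from the unsigned image quoted in stage 1a under the usual encoding +x -> (2x-1, 2x), -x -> (2x, 2x-1), and is an assumption. The function name `unsign_pairs` is hypothetical.

```python
from typing import List, Sequence, Tuple, Union

Item = Union[int, Tuple[int, int]]


def unsign_pairs(perm: Sequence[int]) -> List[Item]:
    """Replace +x by the pair (2x-1, 2x) and -x by (2x, 2x-1); frame with 0 and 2n+1."""
    out: List[Item] = [0]
    for g in perm:
        a, b = 2 * abs(g) - 1, 2 * abs(g)
        out.append((a, b) if g > 0 else (b, a))
    out.append(2 * len(perm) + 1)
    return out


p = [4, -2, 1, 3, -6, 5]  # assumed signed permutation, inferred from the example
print(unsign_pairs(p))
# [0, (7, 8), (4, 3), (1, 2), (5, 6), (12, 11), (9, 10), 13]
```

Keeping each gene as a pair makes the orientation visible at a glance: an increasing pair is a gene in forward orientation, a decreasing pair a reversed one.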
Multichromosomal genomes: more tricky
Word problems and insertions/deletions
So far we did not consider "word problems", in which some genes are repeated,
1 2 -1 3 4
nor did we allow gaps in the numbering (as may arise from insertions/deletions),
1 3 -9 -7 5
We distinguish between microrearrangements (e.g. intrachromosomal rearrangements with a span < 1 Mb) and macrorearrangements (e.g. intrachromosomal rearrangements of larger span as well as interchromosomal rearrangements). The existing rearrangement algorithms do not distinguish between these two types of rearrangements.
The first step is therefore to identify conserved synteny blocks (segments that can be converted into conserved segments by microrearrangements).

Genome Rearrangements: Synteny
(a) Human and mouse synteny blocks of conserved gene order. Every block corresponds to a rectangle, with a diagonal showing whether the arrangements of anchors in human and mouse (within the synteny block) are the same or reversed.
(b) Combining anchors into clusters by the GRIMM-Synteny algorithm at G = 100 kb. The edges in the anchor graph connect the closest ends of the anchors. The anchors are color-coded by the resulting clusters. At G = 1 Mb, this forms a single cluster, which in turn forms a synteny block (the lower right block in the human 18/mouse 17 rectangle in a).
Pevzner, Tesler, Genome Res 13, 37 (2003)