Non-breaking Similarity of Genomes with Gene Repetitions

Binhai Zhu
Computer Science Department, Montana State University
Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao

Background
• Computing the genomic distance between genomes is important in evolutionary molecular biology; the problem was first studied by Sturtevant and Dobzhansky in 1936.
• Much research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome exactly once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.

Background (cont.)
• On the other hand, gene repetition is very common in genomes, so computing genomic distances with gene repetitions is a more realistic problem.
• This is a typical optimization problem, so it makes sense to study its approximability.

Definitions
• Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F such that each element has a (+/-) sign. Example: F={a,b,c,d}, G’=-bd-cab-d-c.
• We will focus on unsigned sequences in this work.
• A genome G is said to be exemplar if every gene appears exactly once in G.

Definitions (cont.)
• Given exemplar genomes G and H over the same set of gene families, if the gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G. Example: for G=abcdefg and H=efgdcab, there are 3 breakpoints in G (and, symmetrically, in H).
• The number of breakpoints between G and H is called the breakpoint distance between G and H, denoted b(G,H).
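The breakpoint definition above can be checked on the slide's example with a minimal sketch. It assumes each gene is a single character of a Python string and that adjacencies are ordered substrings, exactly as the definition is worded; the function names are illustrative and not from the talk.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs (length-2 substrings)."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoints(g, h):
    """Count adjacencies of exemplar genome g that do not occur in h."""
    return len(adjacencies(g) - adjacencies(h))


# Example from the slide: bc, cd and de occur in G but not in H.
print(breakpoints("abcdefg", "efgdcab"))  # 3
print(breakpoints("efgdcab", "abcdefg"))  # 3 (gd, dc and ca)
```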

Exemplar Breakpoint Distance Problem
• Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H, obtained from G’ and H’ by deleting redundant gene copies, such that the breakpoint distance between G and H is minimized.
• We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).

Approximation Algorithms
• Given a minimization (maximization) problem Π with optimal solution value OPT, an approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α·OPT (at least OPT/α).
• Usually we say that A is a factor-α approximation for Π.

Prior Results (1)
• We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (that is, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006].
• This result holds for any genomic distance d( ) satisfying: G=H implies d(G,H)=0.
• Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any weak approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].

Prior Results (2)
• On the other hand, for the exemplar breakpoint distance problem, Sankoff [1999] used branch-and-bound and Nguyen, Tay and Zhang [2005] used divide-and-conquer on practical datasets to obtain good empirical results.
• In a related but slightly different effort, Chauve et al. [2006] studied exemplar genomic similarity problems whose measures do not satisfy "G=H implies d(G,H)=0", e.g., the exemplar common interval measure problem.

Background for this work
• We look at the complement of the breakpoint distance under the gene duplication model.
• As the problem is still hard to approximate, we follow Nguyen et al. and consider genomes satisfying some practical conditions.

Definitions
• Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears in both G and H. The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H). Example: for G=abcdefg and H=fegcdab there are two non-breaking points (ab and cd), so nbs(G,H)=2. Note that when |G|=|H|=n and G=H, nbs(G,H)=n-1.
• Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The corresponding measure is denoted enbs(G’,H’).

Example
G’ = abcadcefg, H’ = cfegcdabf
There are 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.
There are 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.
enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
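The example can be reproduced with a brute-force sketch that enumerates every exemplar genome and takes the best non-breaking similarity. This is exponential in the number of repeated genes and is only meant to illustrate the definitions; it is not the algorithm from the talk. Genes are assumed to be single characters, and the function names are illustrative.

```python
from itertools import product


def nbs(g, h):
    """Non-breaking similarity: ordered adjacent pairs shared by g and h."""
    pairs = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    return len(pairs(g) & pairs(h))


def exemplars(genome):
    """Yield each distinct exemplar genome obtained by keeping one copy per gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    seen = set()
    for choice in product(*positions.values()):
        candidate = "".join(genome[i] for i in sorted(choice))
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


def enbs(g_prime, h_prime):
    """Brute-force exemplar non-breaking similarity."""
    return max(nbs(g, h)
               for g in exemplars(g_prime)
               for h in exemplars(h_prime))


print(sorted(exemplars("abcadcefg")))   # the four exemplars of G'
print(enbs("abcadcefg", "cfegcdabf"))   # 2
```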

Inapproximability Result
Theorem 1. Given an exemplar genome G and another genome H’ such that all genes are from the same alphabet of size n and each gene appears in H’ at most twice, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor n^{1-ε}, unless P=NP.
Proof idea: a linear reduction from Independent Set (IS).

[Figure: input graph with N=5 vertices (v1, …, v5) and M=5 edges (e1, …, e5); N+M is even.]
G: v1v’1v2v’2v3v’3v4v’4v5v’5x1e1x’1x2e2x’2x3e3x’3x4e4x’4x5e5x’5
H’: Y_{N+M-1}Y_{N+M-3}…Y_1Y_{N+M}Y_{N+M-2}…Y_2 = x4x’4x2x’2v5e4e5v’5v3e1v’3v1e1e2v’1x5x’5x3x’3x1x’1v4e3e5v’4v2e2e3e4v’2
where Y_i = v_i A_i v’_i if i ≤ N, and Y_{N+i} = x_i x’_i if i ≤ M.
H: x4x’4x2x’2v5e5v’5v3v’3v1e1e2v’1x5x’5x3x’3x1x’1v4v’4v2e3e4v’2, which corresponds to the optimal independent set {v3,v4}.
The input graph has an IS of size K iff enbs(G,H’)=K.
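The construction can be replayed on the example as a small sketch. Two things here are assumptions rather than statements from the slide: the edge list is read off the example string for H’ (the figure itself is not reproduced), and A_i is taken to be the list of edges incident to v_i, which is what the example suggests. Genes are multi-character tokens, so genomes are Python lists, with an ASCII apostrophe standing in for the prime; all names are illustrative.

```python
from itertools import product

# Assumed edge list, inferred from the blocks of H' in the example.
EDGES = {1: (1, 3), 2: (1, 2), 3: (2, 4), 4: (2, 5), 5: (4, 5)}


def build_instance(n, edges):
    """Build (G, H') for a graph with n vertices; n + len(edges) is assumed even."""
    m = len(edges)
    g = []
    for i in range(1, n + 1):
        g += [f"v{i}", f"v'{i}"]
    for j in range(1, m + 1):
        g += [f"x{j}", f"e{j}", f"x'{j}"]
    blocks = {}
    for i in range(1, n + 1):
        incident = [f"e{j}" for j in sorted(edges) if i in edges[j]]
        blocks[i] = [f"v{i}"] + incident + [f"v'{i}"]
    for j in range(1, m + 1):
        blocks[n + j] = [f"x{j}", f"x'{j}"]
    total = n + m
    # Odd-indexed blocks in decreasing order, then even-indexed blocks.
    order = list(range(total - 1, 0, -2)) + list(range(total, 0, -2))
    return g, [gene for k in order for gene in blocks[k]]


def pairs(seq):
    """Ordered adjacent pairs of a genome given as a list of tokens."""
    return set(zip(seq, seq[1:]))


def enbs_brute_force(g, h_prime):
    """Try every way of keeping one copy of each gene of H'."""
    positions = {}
    for i, gene in enumerate(h_prime):
        positions.setdefault(gene, []).append(i)
    g_pairs = pairs(g)
    best = 0
    for choice in product(*positions.values()):
        h = [h_prime[i] for i in sorted(choice)]
        best = max(best, len(g_pairs & pairs(h)))
    return best


def max_independent_set(n, edges):
    """Size of a maximum independent set, by exhaustive search."""
    best = 0
    for mask in range(1 << n):
        chosen = {i + 1 for i in range(n) if mask >> i & 1}
        if all(not (u in chosen and v in chosen) for u, v in edges.values()):
            best = max(best, len(chosen))
    return best


G, H_PRIME = build_instance(5, EDGES)
print("".join(H_PRIME))
print(enbs_brute_force(G, H_PRIME), max_independent_set(5, EDGES))  # 2 2
```

Each edge gene appears in exactly two vertex blocks, so keeping one copy amounts to charging the edge to one endpoint; a pair v_i v’_i becomes adjacent only when every edge of v_i is charged elsewhere.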

Positive Results
Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g., …xyx…aba…

Positive Results
Definition:
occ(g,G’) is the number of occurrences of g in G’.
span(g,G’) is the maximum distance between two copies of g in G’.
totalocc(c,G’) = ∑_{gene g in G’ with span(g,G’) ≥ c} occ(g,G’)

Positive Results (cont.)
Example. G’=abcdaebd: span(a,G’)=4, span(b,G’)=5, span(d,G’)=4, totalocc(4,G’)=6.
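A minimal sketch of the three quantities, checked against the example. It assumes single-character genes and, as a convention not stated on the slide, gives a gene with a single copy a span of 0; the function names mirror the slide's notation.

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Largest distance between two copies of g (0 if g has fewer than two copies)."""
    positions = [i for i, gene in enumerate(genome) if gene == g]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Sum of occ(g) over the distinct genes g whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G_PRIME = "abcdaebd"
print(span("a", G_PRIME), span("b", G_PRIME), span("d", G_PRIME))  # 4 5 4
print(totalocc(4, G_PRIME))  # 6
```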

Positive Results
Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^{⌊t/3⌋} n^{c+2+ε}) time.

Positive Results (cont.)
Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”)≤c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in O(n^{c+2+ε}) time. Roughly speaking, write H”=H1H2H3 with |H2|=c, enumerate all solutions on H2, and recurse:
T(n) ≤ 2^{c+1}[2T(n/2+c)] + O(n) = O(n^{c+2+ε})
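One way to see why the recurrence in Idea 1 gives the stated bound is sketched below. The shifted function S is introduced here only for illustration and does not appear in the talk; c is treated as a constant and floors and ceilings are ignored.

```latex
\begin{align*}
T(n) &\le 2^{c+1}\bigl[\,2\,T(n/2 + c)\,\bigr] + O(n)
      = 2^{c+2}\,T(n/2 + c) + O(n), \\
S(m) &:= T(m + 2c)
      \;\Longrightarrow\;
      S(m) \le 2^{c+2}\,T(m/2 + 2c) + O(m) = 2^{c+2}\,S(m/2) + O(m), \\
S(m) &= O\bigl(m^{\log_2 2^{c+2}}\bigr) = O(m^{c+2})
      \;\Longrightarrow\;
      T(n) = O(n^{c+2}) \subseteq O(n^{c+2+\varepsilon}).
\end{align*}
```

The middle step uses the master theorem with branching factor 2^{c+2}, whose exponent c+2 dominates the linear term.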

Positive Results (cont.)
Idea 2: As t is considered a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most 4·3^{⌊t/3⌋} such combinations. Therefore, the total running time is 4·3^{⌊t/3⌋}·O(n^{c+2+ε}) = O(3^{⌊t/3⌋} n^{c+2+ε}).

Positive Results
Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’)>c, for some constant c. Then enbs(G’,H’) can be computed in O(3^{⌊t/3⌋} n^{2c+1+ε}) time.
Example. G’=abcadef, H’=bcedefad: shift(a,G’,H’)=6.

Conclusion
• We introduce non-breaking similarity, the complement of the famous breakpoint distance, for genome comparison.
• The general exemplar non-breaking similarity problem is hard to approximate.
• For some special cases, we can obtain polynomial-time solutions.
