630 likes | 800 Views
Computing close bounds on the minimum number of recombinations Dan Gusfield UCD. Y. Song, Y. F. Wu, D. Gusfield (ISMB2005) D. Gusfield, D. Hickerson (Dis. Appl. Math 2005) D. Gusfield, V. Bansal (Recomb 2005). Reconstructing the Evolution of Binary Bio-Sequences (SNPs) .
E N D
Computing close bounds on the minimum number of recombinationsDan Gusfield UCD Y. Song, Y. F. Wu, D. Gusfield (ISMB2005) D. Gusfield, D. Hickerson (Dis. Appl. Math 2005) D. Gusfield, V. Bansal (Recomb 2005)
Reconstructing the Evolution of Binary Bio-Sequences (SNPs) • Perfect Phylogeny (tree) model • Phylogenetic Networks (DAG) with recombination (ARG) • Efficiently Computed Lower and Upper Bounds on the number of recombinations needed. • Decomposition Theorem and open question.
Geneological or Phylogenetic Networks • The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations. • Some of the algorithmic and mathematical results also have phylogenetic applications, for example in hybrid speciation, lateral gene transfer. • Critical distinction is whether there is a fixed linear ordering of the sites (some results need it and some do not).
The Perfect Phylogeny Model for binary sequences sites 12345 Ancestral sequence 00000 1 4 Site mutations on edges 3 00010 The tree derives the set M: 10100 10000 01011 01010 00010 2 10100 5 10000 01010 01011 Extant sequences at the leaves
When can a set of sequences be derived on a perfect phylogenywith the all-0 root? Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1 This is the 4-Gamete Test
A richer model 10100 10000 01011 01010 00010 10101 added 12345 00000 1 4 3 00010 2 10100 5 pair 4, 5 fails the three gamete-test. The sites 4, 5 ``conflict”. 10000 01010 01011 Real sequence histories often involve recombination.
Sequence Recombination 01011 10100 S P 5 Single crossover recombination 10101 A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). called ``crossing over” in genetics
Network with Recombination 10100 10000 01011 01010 00010 10101 new 12345 00000 1 4 3 00010 2 10100 5 P 10000 01010 The previous tree with one recombination event now derives all the sequences. 01011 5 S 10101
Multiple Crossover Recombination 4-crossovers 2-crossovers = ``gene conversion”
A Phylogenetic Network 00000 4 00010 a:00010 3 1 10010 00100 5 00101 2 01100 S b:10010 P S 4 01101 c:00100 p g:00101 3 d:10100 f:01101 e:01100
Minimizing recombinations • Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. • However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations. • Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations . Find a network generating M that has some biologically-motivated structural properties.
Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004) A super-exponential-time method computes the exact min (Song and Hein). Works only for a small number of sequences. Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time. A full solution is in Gusfield, Eddhu, Langley O(nm + n^3) time (also root-unknown case solved later in the same time bound).
Practical and Accurate Lower Bound Computation • Unconstrained recombination. • Want efficient computation of lower bounds on the minimum number of recombinations needed. Many methods: HK, Haplotype, History, Connected Comps, Composite Interval, Gall-Interval Composite Method, Subsets, Recmin, Composite ILP, HapBound (-S,-M). Some require a linear order, and others are more general.
Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.
1 2 3 4 5 Incompatibility Graph a b c d e f g 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 4 M 1 3 2 5 Two nodes are connected iff the pair of sites are incompatible, i.e, fail the 4-gamete test.
The grandfather of all lower bounds - HK 1985 • Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. This bound requires a linear order. • The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.
Justification for HK If two sites are incompatible, there must have been some recombination where the crossover point is between the two sites.
HK Lower Bound 1 2 3 4 5
HK Lower Bound = 1 1 2 3 4 5
More general view of HK Given a set of intervals on the line, and for each interval I, a number N(I), define the composite problem: Find the minimum number of vertical lines so that every interval I intersects at least N(I) of the vertical lines. In HK, each incompatibility defines an interval I where N(I) = 1. The composite problem is easy to solve by a left-to-right myopic placement of vertical lines: Sort the intervals by right end-point; Process the intervals left to right in that order; when the right endpoint of an interval I is reached, place there (if needed) additional vertical so that N(I) lines intersect I.
If each N(I) is a ``local” lower bound on the number of recombinations needed in interval I, then the solution to the composite problem is a valid lower bound for the full sequences. The resulting bound is called the composite bound given the local bounds. This general approach is called the Composite Method (Simon Myers 2002).
Example: New bound The number of non-trivial connected components in the incompatibility graph is a lower bound on the needed number of recombinations (Gusfield, Hickerson; Bafna, Bansal). CC Bound = 2 in the prior example.
The CC bound can be arbitrarily larger than the HK bound, but can also be smaller. When the CC bound is used for the local bound N(I) in each interval I, then the resulting composite bound is provably as large as the HK bound, and is usually larger (in our simulations). Better composite bounds are obtained by using the ``haplotype bound” as the local bound.
Haplotype Bound (Simon Myers) • Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 (folklore) • Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound. • Generally really bad bound, often negative, when used on large intervals, but Very Good when used as local bounds in the Composite Interval Method, and other methods.
Composite Interval Method using RH bounds Compute Rh separately for each of the C(m,2) intervals of the m sites; let N(I) = Rh(I) be the local lower bound for interval I. Then compute the composite bound using these local bounds. Polynomial time and gives bounds that often double HK in our simulations.
Composite Subset Method (Myers) • Let S be subset of sites, and Rh(S) be the haplotype bound for subset S. If the leftmost site in S is L and the rightmost site in S is R, then use Rh(S) as a local bound N(I) for interval I = [S,L]. • Compute Rh(S) on many subsets, and then solve the composite problem to find a composite bound.
RecMin (Myers) • World Champion Lower Bound Program (available). Often RecMin gives a bound three times as large as HK. • Computes Rh on subsets of sites, but limits the size and the span of the subsets. Default parameters are s = 6, w = 15 (s = size, w = span). • Generally, impractical to set s = w = m, so generally one doesn’t know if increasing the parameters would increase the bound. (example: Myers bound of 70 on the LPL data).
Optimal RecMin Bound (ORB) • The Optimal RecMin Bound is the lower bound that RecMin would produce if both parameters were set to their maximum values (s = w= m). • In general, RecMin cannot compute (in practical time) the ORB. • Practical computation of the ORB is our first contribution.
Computing the ORB • Gross Idea: For each interval I, use ILP to find a subset of sites S that maximizes Rh(S) over all subsets in interval I. Call the result Opt(I). • Set N(I) = Opt(I), and compute the composite bound using those local bounds. • The composite bound is the ORB. -- the result one would get by using all 2^m subsets in RecMin, with s = w = m.
We have moved from doing an exponential number of simple computations (computing Rh for each subset), to solving a quadratic number of (possibly expensive) ILPs. Is this a good trade-off in practice? Our experience - very much so!
How to compute Opt(I) by ILP Create one variable Xi for each row i; one variable Yj for each column j in interval I. All variables are 0/1 variables. Define S(i,i’) as the set of columns where rows i and i’ have different values. Each column in S(i,i’) is a witness that rows i and i’ differ. For each pair of rows (i,i’), create the constraint: Xi + Xi’ <= 1 + ∑ [Yj: jin S(i,i’)] Objective Function: Maximize ∑ Xi - ∑ Yj -1
Alternate way to compute Opt(I) by ILP First remove any duplicate rows. Let N be the number of rows remaining. Create one variable Yj for each column j in interval I. All variables are 0/1 variables. S(i,i’) as before. For each pair of rows (i,i’) create the constraint: 1 <= ∑ [Yj: jin S(i,i’)] Objective Function: Maximize N - ∑(Yj) -1 Finds the smallest number of columns to distinguish the rows.
Two critical tricks Use the second ILP formulation to compute Opt(I). It solves much faster than the first (why?) 2) Reduce the number of intervals where Opt(I) is computed: I If the solution to Opt(I) uses columns that only span interval L, then there is no need to find Opt(I’) in any interval I’ containing L. Each ILP computation directly spawns at most 4 other ILPs. Apply this idea recursively. L
With the second trick we need to find Opt(I) for only 0.5% - 35% of all the C(m,2) intervals (empirical result). Surprisingly fast in practice (with either the GNU solver or CPLEX).
Bounds Higher Than the Optimal RecMin Bound • Often the ORB underestimates the true minimum, e.g. Kreitman’s ADH data: 6 v. 7 • How to derive sharper bound? Idea: In the composite method, check if each local bound N(I) = Opt(I) is tight, and if not, increase N(I) by one. • Small increases in local bounds can increase the composite bound.
Bounds Sharper Than Optimal RecMin Bound • A set of sequences M is called self-derivable if there is a network generating M where the sequence at every node (leaf and intermediate) is in M. • Observation: The haplotype bound for a set of sequences is tight if and only if the sequences are self-derivable. • So for each interval I where Opt(I) is computed, we check self-derivability of the sequences induced by columns S*, where S* are the columns specified by Opt(I). Increase N(I) by 1 if the sequences are not self-derivable.
Algorithm for Self-Derivability • Solution is easy when there are no mutations --only recombinations are allowed and one initial pair of sequences is chosen as ``reached”. • Two reached sequences S1 and S2 can reach a third sequence S3, if S3 can be created by recombining S1 and S2. • Do BFS to see if all sequences can be reached by successive application of the ``reach operation”. • Clearly polynomial time and can be optimized with suffix-trees etc. (Kececiouglu, Gusfield)
Self-derivability Test with mutations allowed • For each site i, construct set MUT(i) of sequence pairs (S1, S2) in M where S1 and S2 differ at only site. • Try each sequence in M as root (which is the only reached sequence initially). • For each root, enumerate all ways of choosing exactly one ordered pair of sequences (Sp, Sq) from each MUT(i). Sp is allowed to ``reach” Sq. • Run the prior self-derivability algorithm with these new permitted reach relations, to test if all sequences in M can be reached. If so, then M is self-derivable, otherwise it is not.
Checking if N(I) should be increased by two If the set of sequences is not self-derivable, we can test if adding one new sequence makes it self-derivable. the number of candidate sequences is polynomial and for each one added to M we check self-derivability. N(I) should be increased by two if none of these sets of sequences is self-derivable.
Program HapBound • HapBound –S. Checks each Opt(I) subset for self-derivability. Increase N(I) by 1 or 2 if possible. This often beats ORB and is still practical for most datasets. • HapBound –M. Explicitly examine each interval directly for self-derivability. Increase local bound if possible. Derives lower bound of 7 for Kreitman’s ADH data in this mode.
Example where RecMin has difficulity in Finding Optimal Bound on a 25 by 376 Data Matrix
Frequency of HapBound –S Bound Sharper Than Optimal RecMin Bound For every ms parameters, 1000 data sets are used.
Computing Upper Bounds • The method is an adaptation of the ``history” lower bound (Myers). • A non-informative columnis one with fewer than two 0’s or fewer than two 1’s.
Single History Computation • Set W = 0 • Collapse identical rows together, and remove non-informative columns. Repeat until neither is possible. • Let A be the data at this point. If A is empty, stop, else remove some row r from A, and set W = W + 1. Go to step 2). Note that the choice of r is arbitrary in Step 3), so the resulting W can vary.
History Lower Bound Theorem (Myers) Let W* be the minimum W obtained from all possible single history computations. Then W* is a valid lower bound on the number of recombinations needed. Naïve time: theta(n!) (RecMin), but can be reduced to theta(2^n) (Bafna, Bansal).
Converting the History Lower Bound to an Upper Bound • Given a set of rows A and a single row r, define w(r | A - r) as the minimum number of recombinations needed to create r from A-r (well defined in our application). • w(r | A-r) can be computed in linear time by a greedy-type algorithm.
Upper Bound Computation • Set W = 0 • Collapse identical rows together, and remove non-informative columns. Repeat until neither is possible. • Let A be the data at this point. If A is empty, stop, else remove some row r from A, and set W = W + W(r | A-r). Go to step 2). Note that the choice of r is arbitrary in Step 3), so the resulting W can vary. This is the Single History Computation, with a change in step 3).
Note, even a single execution of the upper bound computation gives a valid upper bound, and a way to construct a network. This is in contrast to the History Bound which requires finding the minimum W over all histories. However, we also would like to find the lowest upper bound possible with this approach. This can be done in O(2^n) time by DP. In practice, we can use branch and bound to find the lowest upper bound, but we have also found that branching on the best local choice, or randomizing gives good results.
Branch and Bound (Branching) In Step 3) choose r to minimize w(r | A-r) + L(A-r), where L(A-r) is some fast lower bound on the number of recombinations needed for the set A-r. Even HK is good for this purpose. (Bounding) Let C be the min for an full solution found so far; If W + L(A) >= C, then backtrack.