930 likes | 948 Views
Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination. Dan Gusfield UC Davis. Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yun Song, Yufeng Wu. Reconstructing the Evolution of Binary Bio-Sequences.
E N D
Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yun Song, Yufeng Wu.
Reconstructing the Evolution of Binary Bio-Sequences • Perfect Phylogeny (tree) model • Phylogenetic Networks (DAG) with recombination • Phylogenetic Networks with disjoint cycles: Galled-Trees • Phylogenetic Networks with unconstrained cycles: Blobbed-Trees • Combinatorial Structure and Efficient Algorithms • Efficiently Computed Lower and Upper bounds on the number of recombinations needed
Geneological or Phylogenetic Networks • The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations. • Also relates to phylogenetic-based haplotyping. • Some of the algorithmic and mathematical results have phylogenetic applications, for example in hybrid speciation, lateral gene transfer.
Why binary? Single nucleotide polymorphisms are the key data that we use. SNPs imply that the sequences are binary, and also that the order of the sites is fixed (on a chromosome). This is in contrast to a set of taxonomic characters, which may be binary, but where the given order is arbitrary.
The Perfect Phylogeny Model for binary sequences sites 12345 Ancestral sequence 00000 1 4 Site mutations on edges 3 00010 The tree derives the set M: 10100 10000 01011 01010 00010 2 10100 5 10000 01010 01011 Extant sequences at the leaves
The converse problem Given a set of sequences M we want to find, if possible, a perfect phylogeny that derives M. Remember that each site can change state from 0 to 1 only once. n will denote the number of sequences in M, and m will denote the length of each sequence in M. m 01101001 11100101 10101011 M n
When can a set of sequences be derived on a perfect phylogeny? Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1 This is the 4-Gamete Test
A richer model 10100 10000 01011 01010 00010 10101 added 12345 00000 1 4 3 00010 2 10100 5 pair 4, 5 fails the three gamete-test. The sites 4, 5 ``conflict”. 10000 01010 01011 Real sequence histories often involve recombination.
Sequence Recombination 01011 10100 S P 5 Single crossover recombination 10101 A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
Network with Recombination 10100 10000 01011 01010 00010 10101 new 12345 00000 1 4 3 00010 2 10100 5 P 10000 01010 The previous tree with one recombination event now derives all the sequences. 01011 5 S 10101
Multiple Crossover Recombination 4-crossovers 2-crossovers = ``gene conversion”
Elements of a Phylogenetic Network (single crossover recombination) • Directed acyclic graph. • Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. • A choice of ancestral sequence at the root. • Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. • A node with two edges into it is a ``recombination node”, with a recombination point r. One parent is P and one is S. • The network derives the sequences that label the leaves.
A Phylogenetic Network 00000 4 00010 a:00010 3 1 10010 00100 5 00101 2 01100 S b:10010 P S 4 01101 c:00100 p g:00101 3 d:10100 f:01101 e:01100
Which Phylogenetic Networks are meaningful? Given M we want a phylogenetic network that derives M, but which one? A: A perfect phylogeny (tree) if possible. As little deviation from a tree, if a tree is not possible. Use as little recombination or gene-conversion as possible.
Minimizing recombinations • Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. • However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations. • Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem). Find a network generating M that has some biologically-motivated structural properties.
Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004) Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time.
Recombination Cycles • In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. • The cycle specified by those two paths is called a ``recombination cycle”.
Galled-Trees A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle. A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.
A galled-tree generating the sequences generated by the prior network. 4 3 1 s p a: 00010 3 c: 00100 b: 10010 d: 10100 2 5 s p 4 g: 00101 e: 01100 f: 01101
Sales pitch for Galled-Trees Galled-trees represent a small deviation from true trees. There are sufficient applications where it is plausible that a galled tree exists that generates the sequences. Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree. The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination. A galled-tree for M is ``almost unique” - implications for reconstructing the correct history.
Old (Aug. 2003) Results • O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence. • Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution. • Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.
New work We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple crossover recombination. This also solves the problem of finding the ``most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.
Blobbed-trees: generalizing galled-trees • In a phylogenetic network a maximal set of intersecting cycles is called a blob. • Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. • So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.
Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship? Ugly tangled network inside the blob.
Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.
1 2 3 4 5 Incompatibility Graph a b c d e f g 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 4 M 1 3 2 5 Two nodes are connected iff the pair of sites are incompatible, i.e, fail the 4-gamete test. THE MAIN TOOL: We represent the pairwise incompatibilities in a incompatibility graph.
The connected components of G(M) are very informative • The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson). • When each blob is a single-cycle (galled-tree case) all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)
Simple Fact If sites two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j. (This is a general fact for all phylogenetic networks.) Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.
A Phylogenetic Network 00000 4 00010 a:00010 3 1 10010 00100 5 00101 2 01100 S b:10010 P S 4 01101 c:00100 p g:00101 3 d:10100 f:01101 e:01100
Simple Consequence of the simple fact All sites on the same (non-trivial) connected component of the incompatibility graph must be on the same blob in any blobbed-tree. Follows by transitivity. So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.
Key Result about Galls: For galls, the converse of the simple consequence is also true. Two sites that are in different (non-trivial) connected components cannot be placed on the same gall in any phylogenetic network for M. Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the incompatibility graph. All compatible sites can be put on edges outside of the galls. This is the key to the galled-tree solution.
Incompatibility Graph A galled-tree generating the sequences generated by the prior network. 4 4 3 1 3 2 5 1 s p a: 00010 2 c: 00100 b: 10010 d: 10100 2 5 s p 4 g: 00101 e: 01100 f: 01101
Motivated by the one-one correspondence between galls and non-trivial connected components, we ask: To what extent does this one-one correspondence hold in general blobbed-trees, i.e. with no constraints on how recombination cycles interweave?
The Decomposition Theorem (Recomb 2005) For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. A blobbed-tree with this structure is called fully-decomposed.
General Structure So, for any set of sequences M, there is a phylogenetic network T(M) that is fully decomposed. Moreover, the tree part of T(M) is unique. And it is easy to find the tree part.
A fully-decomposed network for the sequences generated by the prior network. Incompatibility Graph 4 4 3 1 3 2 5 1 s p a: 00010 2 c: 00100 b: 10010 d: 10100 2 5 s p 4 g: 00101 e: 01100 f: 01101
Moreover Since all sites from a single connected component must be together on some blob in any phylogenetic network, no network is more decomposed than the fully decomposed network.
Proof Ideas Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.
1 2 3 4 5 a b c d e f g 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 4 C2 C1 M 1 3 2 5 B1 B2 1 3 4 2 5 a b c d e f g 0 0 0 0 0 0 0 0 1 0 1 1 0 1 a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g 0 1 0 M[C1] M[C2]
Faux Proof Pick one site from each connected component C in G(M) to ``represent” C. No pair of those sites are incompatible, so by the NASC for a perfect phylogeny, there will be a perfect phylogeny T for the sites. Expand each node to a network generating the sequences in M[C]. Incorrect, because the structure of T can be wrong. We need to use information about all the sites in each C.
Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters. 1 2 3 4 5 6 7 8 1 3 4 2 5 a b c d e f g 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 5 5 5 5 6 7 8 a b c d e f g 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 2 3 4 3 3 3 a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g 01 0 W M[C1] M[C2]
Proof Ideas Lemma: No pair of supercharacters are incompatible. So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.
Proof Ideas For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.
Algorithmically, T is easy to find and is the tree resulting from contracting each blob in the fully-decomposed blobbed-tree T(M) for M. T can be constructed from M in O(nm^2) time.
Broader Biological Applications Our major interest is in recombination, but the proof of the decomposition theorem does not explicitly use recombination. So it holds for whatever biological phenomena caused the incompatibility of sites. For example, back or recurrent mutation, gene-conversion, lateral gene transfer etc.
The main open question The Decomposition Theorem says there is always a fully-decomposed blobbed-tree for any M, but Is there always a fully-decomposed blobbed-tree that minimizes the number of recombinations over all possible phylogenetic networks for M?
We conjecture the answer is yes. If true, then we can decompose the problem of minimizing the total number of recombinations into separate problems on each connected component, and also find lower bounds on the needed number of recombinations, in each component separately, adding those bounds to get a valid overall lower bound for M. This computation of lower bounds is known to be correct for certain lower bounds (Bafna, Bansal 2004).
Progress on Proving the Conjecture Definition: If N is a phylogenetic network for M, and a node v in N is labeled with a sequence in M, then v is said to be visible in N. Theorem: If every node in N is visible, then there is a fully-decomposed network for M where the number of recombinations is at most the number in N. Corollary: The conjecture is true for any M where the Haplotype or History lower bounds (S. Myers) on the number of recombinations needed to generate M, is tight.