Constructing a level-2 phylogenetic network from a dense set of input triplets
Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1, 2)
(1) Technische Universiteit Eindhoven (TU/e)
(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam
Email: S.M.Kelk@cwi.nl
Web: http://homepages.cwi.nl/~kelk
Triplet-based methods
Given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note that zw|x = wz|x.)
Find a tree from which each of the triplet subgraphs can be obtained as a minor by contracting and deleting edges.
[Figure: the input triplets, the algorithm, and the solution tree on leaves w, z, x, y.]
From trees to networks…
• The algorithm of Aho et al. (1981) can be used to construct trees from rooted triplets.
• But what if the algorithm fails? Why might it fail?
• Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors.
• Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like!
• Response: try to construct phylogenetic networks rather than phylogenetic trees.
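From trees to networks (2)
• For example, suppose the input is {xy|z, xz|y}. No tree is consistent with both triplets, but a network can be.
(Note that there are cases where a tree is not possible even if there is at most one triplet per 3 species.)
[Figure: the two input triplets and a network on leaves x, y, z.]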
Level-k phylogenetic networks
A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices.
[Figure: an example network on leaves x, y, z, with labels marking the root (only one!), a split-vertex, a recombination-vertex and a leaf-vertex.]
Level-1 Networks
• A set of input triplets is dense iff, for every subset of 3 species, there is at least one triplet corresponding to those 3 species.
• Therefore, a dense set of input triplets for n species contains Θ(n^3) triplets: at least one and at most three for each subset of 3 species.
• Jansson & Sung (2006) showed: Given a dense set of triplets T for a set L of species, it is possible to determine in polynomial time whether a level-1 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
• They later showed, together with Nguyen, how to do this in time linear in |T|. They also showed that, in the non-dense case, the problem is NP-hard.
• But what about level-2 networks, and higher?
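The density condition above is easy to check mechanically. The following is a minimal sketch, assuming (as an illustration only, not the authors' representation) that a triplet xy|z is stored as the tuple (x, y, z); the function name is hypothetical.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """Return True if every 3-subset of `leaves` carries at least one triplet.

    Assumed representation: the triplet xy|z is the tuple (x, y, z).
    Only the three leaves involved matter for density, not which one is cut off.
    """
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered for c in combinations(leaves, 3))


# The four triplets zw|x, yx|w, xy|z, wz|y from the first example.
example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(is_dense(["w", "x", "y", "z"], example))
```

On four species there are four subsets of size 3, and the example set covers each of them exactly once.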
Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
[Figure: an example of a level-2 network.]
Algorithm, basic idea
• The basic idea behind Aho’s algorithm for trees is that we are able to determine, recursively, which species belong to which of the two subtrees hanging from some root vertex.
• For level-1 and level-2 networks, if such a clear dichotomy again exists, we recurse on the two subsets.
[Figure: a root with two sub-networks hanging below it.]
Algorithm, basic idea
• The basic idea behind Aho’s algorithm for trees is that we are able to determine, recursively, which species belong to which of the two subtrees hanging from some root vertex.
• For level-1 networks, if such a clear dichotomy again exists, we recurse on the two subsets. Otherwise there must exist a network consisting of a backbone with sub-networks hanging from it. [Figure: a blue backbone network with five sub-networks hanging from it.]
• Find the partition of the species (leaves) into the sub-networks.
• Find the blue backbone network.
• Treat each of the partition elements (sub-networks) as leaves to be hung on the backbone.
• Recurse on the sub-networks.
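The tree case that this recursion generalizes can be sketched as follows. This is a minimal sketch in the spirit of the algorithm of Aho et al., not the authors' code: it assumes a triplet xy|z is stored as the tuple (x, y, z), joins x and y in an auxiliary graph, and recurses on the connected components. Unlike the two-subtree picture above, the sketch allows more than two components at a vertex. The function name and the nested-tuple output format are illustrative choices.

```python
def build_tree(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triplets whose three leaves all lie in the current leaf set.
    relevant = [t for t in triplets if set(t) <= leaves]
    adj = {v: set() for v in leaves}
    for x, y, _ in relevant:          # xy|z forces x and y into the same subtree
        adj[x].add(y)
        adj[y].add(x)
    # Connected components of the auxiliary graph are the child subtrees.
    components, unseen = [], set(leaves)
    while unseen:
        stack = [unseen.pop()]
        comp = set(stack)
        while stack:
            for w in adj[stack.pop()] - comp:
                comp.add(w)
                stack.append(w)
        unseen -= comp
        components.append(comp)
    if len(components) == 1:          # no dichotomy: no tree exists
        return None
    subtrees = [build_tree(c, relevant) for c in components]
    return None if any(s is None for s in subtrees) else tuple(subtrees)


# Succeeds on the four-triplet example, fails on {xy|z, xz|y}.
print(build_tree("wxyz", [("z", "w", "x"), ("y", "x", "w"),
                          ("x", "y", "z"), ("w", "z", "y")]))
print(build_tree("xyz", [("x", "y", "z"), ("x", "z", "y")]))
```

The failure case (a single component) is exactly the situation in which the network algorithms look for a backbone instead of a dichotomy.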
Algorithm, high-level idea
• For level-2 networks the idea is similar:
• Find the partition of the species (leaves) into the sub-networks. (There is a complication in level-2.)
• Find the blue backbone network. (There are more level-2 backbone forms.)
• Treat each of the partition elements (sub-networks) as (meta-)leaves to be hung on the backbone.
• Recurse on the sub-networks.
Definition: inducing new triplet sets from partitions of the leaf set
• Suppose I have a partition P = {P1, P2, …, Pt} of the leaf set L.
• Suppose I have a dense set of triplets T on the leaf set L.
• Let T’ be a new triplet set on leaf set {q1, q2, …, qt} defined as follows: qiqj|qk is in T’ if and only if i, j and k are pairwise distinct and there exists a triplet xy|z in T such that x is in Pi, y is in Pj and z is in Pk.
• Then we say that T’ is the triplet set induced by the partition P of L.
• Critically: if T is dense, then T’ is also dense.
• In some sense this can be perceived as a ‘coarsening’ of the input set.
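The induced triplet set can be computed in one pass over T. A minimal sketch, assuming a triplet xy|z is stored as the tuple (x, y, z) and that the meta-leaf qi is represented simply by the index i of its block in the partition (both are illustrative assumptions, as is the function name):

```python
def induced_triplets(partition, triplets):
    """Return the triplet set induced by `partition` (a list of disjoint leaf sets).

    The result uses block indices as meta-leaves: (i, j, k) stands for qi qj | qk.
    """
    block_of = {leaf: i for i, block in enumerate(partition) for leaf in block}
    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        if len({i, j, k}) == 3:       # the three leaves lie in three different blocks
            induced.add((min(i, j), max(i, j), k))  # qi qj | qk equals qj qi | qk
    return induced


T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(induced_triplets([{"w", "z"}, {"x"}, {"y"}], T))
```

Triplets with two leaves in the same block are dropped, which is why the induced set is a coarsening of the input.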
Definition: simple level-2 networks
Lemma: There are exactly 4 different backbone networks.
[Figure: the 4 backbone structures.]
A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures.
A picture description of the simple level-2 algorithm
Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.
Level-2 network algorithm
Assume some oracle gives us the partition of the leaves into sub-networks. Treat each sub-network as a leaf and construct a simple level-2 network.
The simple level-2 network algorithm
• Guess the right “recombination leaf”.
• Remove it and remove the triplets that contain this leaf. One recombination vertex is left, with a caterpillar below it.
Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination node. If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination nodes are needed).
Caterpillar set
• A caterpillar set with respect to a dense triplet set T is the set of leaves of a caterpillar subgraph of a network consistent with T.
• The empty set is also a caterpillar set.
[Figure: a caterpillar.]
Suppose we subsequently guess that the caterpillar with h now hangs below a recombination node in the new network. If we remove the h-caterpillar, and all triplets that contain leaves of it, then we know that a level-0 network must be possible on this new set of triplets (because now even fewer recombination nodes are needed.)
So now we have a tree. We are going to guess how to add the h-caterpillar back in, and then guess how to add leaf g back in.
Level-2 network algorithm
Assume some oracle gives us the partition of the leaves into sub-networks. Treat each sub-network as a leaf and construct a simple level-2 network.
The simple level-2 network algorithm
• Guess the right “recombination leaf”.
• Remove it and remove the triplets that contain this leaf. One recombination vertex is left, with a caterpillar below it.
• Guess the right “caterpillar set”.
• Remove it and remove the triplets that contain any element of this set.
• Construct the unique tree for the remaining triplets [Jansson & Sung 2006].
• Insert the caterpillar set and the recombination leaf in the tree in the correct way.
For each pair of guesses, try all 4 backbone structures.
Simple level-2 algorithm
Theorem: The simple level-2 network algorithm runs in O(|T|^3) time.
SN-sets to partition the set of leaves
• Jansson & Sung introduced the SN-set to partition the set of leaves.
• SN-sets are special subsets of the leaves L, and are defined w.r.t. T.
• All sets containing just a single leaf are SN-sets.
• Any other SN-set is obtained by taking the closure of some subset S of the leaves L w.r.t. the following operation: if x, y ∈ S and xz|y ∈ T or yz|x ∈ T, then z ∈ S.
• The SN-set that is equal to the total leaf set L is called the trivial SN-set.
• An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set.
• (If the network is a tree there are 2 maximal SN-sets: the leaves of the left subtree of the root and the leaves of the right subtree.)
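The closure operation that defines SN-sets can be sketched as a simple fixed-point loop. This is a naive illustration, not the efficient procedure of Jansson & Sung; it assumes a triplet xy|z is stored as the tuple (x, y, z), and the function name is hypothetical.

```python
def sn_closure(seed, triplets):
    """Close `seed` under: if x, y in S and xz|y in T or yz|x in T, then z in S.

    For a stored triplet (a, b, c), meaning ab|c, the rule fires when c and
    exactly one of a, b are already in S; the other one is then added.
    """
    closed = set(seed)
    changed = True
    while changed:                    # repeat until no triplet adds a new leaf
        changed = False
        for a, b, c in triplets:
            if c in closed and (a in closed) != (b in closed):
                closed.update((a, b))
                changed = True
    return closed


T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(sn_closure({"w", "z"}, T))      # stays {w, z}: one side of the tree
print(sn_closure({"w", "x"}, T))      # grows to the trivial SN-set
```

For the four-triplet tree example, the closure of {w, z} is {w, z} itself, while any pair that straddles the root closes to the whole leaf set, matching the remark that a tree has two maximal SN-sets.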
Definition: maximal SN-set
• Jansson and Sung proved that the maximal SN-sets indeed partition the leaf set L: no two maximal SN-sets overlap, and together they cover all input leaves.
• All SN-sets and all maximal SN-sets can be found in polynomial time.
• Jansson & Sung solved the level-1 problem by observing that each maximal SN-set hangs as a ‘meta-leaf’ on the level-1 backbone network; each maximal SN-set can be completely separated from the rest of the network by removing just one edge.
• In level-2 networks, there are maximal SN-sets that can hang under more than one edge!
Definition: highest cut-edge
• In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the undirected graph.
• A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf.
• A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
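These definitions translate directly into a brute-force check. The sketch below is for illustration only: it assumes the network is given as a list of directed edges (parent, child) with no parallel edges, tests each edge by deleting it and checking connectivity of the underlying undirected graph, and then filters for highest cut-edges. Function names and the example network are hypothetical.

```python
def cut_edges(edges):
    """Edges whose removal disconnects the underlying undirected graph."""
    nodes = {v for e in edges for v in e}

    def connected(remaining):
        adj = {v: set() for v in nodes}
        for u, v in remaining:
            adj[u].add(v)
            adj[v].add(u)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for w in adj[stack.pop()] - seen:
                seen.add(w)
                stack.append(w)
        return seen == nodes

    return [e for e in edges if not connected([f for f in edges if f != e])]


def highest_cut_edges(edges):
    """Cut-edges (x, y) with no other cut-edge (p, q) having a directed path q -> x."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)

    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            for w in children.get(stack.pop(), []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    cuts = cut_edges(edges)
    return [(x, y) for (x, y) in cuts
            if not any(x in reachable(q) for (p, q) in cuts if (p, q) != (x, y))]


# A small level-1 example: cycle r-a-c-b with leaf x below the recombination
# vertex c, leaf y below a, and a subtree s with leaves z1, z2 below b.
net = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
       ("a", "y"), ("b", "s"), ("s", "z1"), ("s", "z2")]
print(highest_cut_edges(net))
```

In the example, the edges into z1 and z2 are cut-edges but not highest, because they lie below the cut-edge (b, s).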
Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set.
• Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. For any two leaves p, q in L’ and any leaf r outside L’, neither pr|q nor qr|p can be in T, because the edge (x,y) forms a bottleneck. Since T is dense, pq|r must exist.
[Figure: a cut-edge (x,y) with leaves p and q in L’ below it and a leaf r outside L’, next to the triplet pq|r.]
So: each maximal SN-set can be expressed as the union of the leaves reachable from one or more highest cut-edges.
Central Theorem (simplified). Suppose there is a dense triplet set T consistent with some simple level-2 network N. Then there exists a level-2 network N’ (not necessarily simple) such that, with the exception of perhaps one maximal SN-set with respect to T, every maximal SN-set appears below a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it exists) will be equal to the union of leaves below two cut-edges.
In other words: there exists at most one maximal SN-set which is the union of the leaves below two highest cut-edges, whereas all other maximal SN-sets consist of the leaves below one highest cut-edge.
The algorithm
• Determine the maximal SN-sets.
• Guess the right SN-set to be split.
• Treat the maximal SN-sets and the two split sets as leaves {S1, S2, …, Sq}.
• Adapt T to a new triplet set T’: SiSk|Sh ∈ T’ if and only if there exist x ∈ Si, y ∈ Sk, z ∈ Sh such that xy|z ∈ T.
• Construct a simple level-2 network for T’.
• Recursively find the sub-networks for the sets S1, S2, …, Sq.
Conclusions & open problems
• So we know how to efficiently construct level-2 networks from dense triplet sets. What’s next?
• Applicability: how useful is it?
• Initial implementation: programming and fine-tuning
• Improving running time: in the spirit of the “SN-tree” of J&S&N
• Complexity: what about level-3 and higher?
• Bounds: worst-case, best-case scenarios
• Building all networks
• Properties of output networks as function of input
• Different triplet restrictions
• Confidence: how good are the solutions?
• Exponential-time exact algorithms for NP-hard problems