Phylogenetic networks: recent questions and results (or: constructing a level-2 phylogenetic network from a dense set of input triplets in polynomial time) Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1, 2) (1) Technische Universiteit Eindhoven (TU/e) (2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam Email: S.M.Kelk@cwi.nl Web: http://homepages.cwi.nl/~kelk
Part 1: Context
Phylogenetic tree reconstruction Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we only have limited input data about the ‘species’ concerned… At the intersection of biology, bioinformatics, computer science and mathematics. [Figure: example tree on Orangutan, Gorilla, Chimpanzee and Human; this tree borrowed from a presentation by Tandy Warnow.]
Dominant methods in phylogenetic reconstruction • Character-based methods • Maximum Parsimony (= Minimum Steiner Tree) • Maximum Likelihood • Bayesian methods (Markov Chain Monte Carlo) • Distance-based methods • Neighbour Joining • UPGMA • Quartet/triplet-based methods
Triplet-based methods (1) • Quartet-based methods are used for constructing unrooted evolutionary trees: there is no root (= most distant ancestor) and edges have no direction (e.g. an edge between species X and Y does not say whether X evolved into Y, or vice versa). • Triplet-based methods are used for constructing rooted evolutionary trees: there is a root and edges are directed. • The central idea: build a single, ‘big’ evolutionary tree for a set L of species by combining smaller evolutionary trees on subsets of L such that the big tree respects the structure of the smaller trees. • In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set L (and are called rooted triplets).
Triplet-based methods (2) • For example, suppose I want to reconstruct a plausible evolution for the species set {W,X,Y,Z}. • I am given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.) [Figure: the four input triplets are fed to the algorithm, which outputs a solution tree on the leaves w, z, x, y.]
From trees to networks… • The algorithm of Aho et al. (1981) can be used to construct trees from rooted triplets. • But… what if the algorithm fails? Why might the algorithm fail? • Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors. • Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like! • Response: try to construct not phylogenetic trees, but phylogenetic networks.
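To make the tree-building step concrete, here is a minimal Python sketch in the spirit of the algorithm of Aho et al.: join x and y for every triplet xy|z, split the leaves into the resulting connected components, and recurse; if only one component remains, no tree exists. The function name `build`, the tuple encoding `(x, y, z)` for xy|z, and the nested-tuple output are assumptions made for this illustration, not taken from the slides.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None.

    A triplet (x, y, z) encodes xy|z (illustrative encoding).
    """
    leaves = sorted(leaves)
    if len(leaves) == 1:
        return leaves[0]

    # Union-find over the current leaves.
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Only triplets lying entirely inside the current leaf set matter.
    leaf_set = set(leaves)
    relevant = [t for t in triplets if set(t) <= leaf_set]
    for x, y, _ in relevant:
        parent[find(x)] = find(y)  # x and y must stay together

    # Group the leaves by connected component.
    blocks = {}
    for v in leaves:
        blocks.setdefault(find(v), []).append(v)
    if len(blocks) == 1:
        return None  # everything is glued together: no tree is possible

    subtrees = []
    for block in blocks.values():
        sub = build(block, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(build("wxyz", example))  # (('w', 'z'), ('x', 'y'))
```

On the input {xy|z, xz|y} the same function returns None, which is exactly the kind of failure that motivates moving from trees to networks.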
From trees to networks (2) • For example, suppose the input is {xy|z, xz|y}. No tree can be consistent with both triplets. [Figure: the two input triplets and a network on leaves x, y, z.] (Note that there are cases when, even if there is at most one triplet per 3 species, a tree is not possible.)
Level-k phylogenetic networks A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices. This network here is a very simple example of a level-1 network. In a level-1 network, the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”. [Figure: a level-1 network on leaves x, y, z, with labels marking the root (only one!), a split-vertex, a recombination-vertex and a leaf-vertex.]
What Jansson & Sung (& Nguyen) did… • A set of input triplets is dense iff, for every subset of 3 species, there is at least one triplet corresponding to those 3 species. • A dense set of input triplets for n species thus contains O(n^3) triplets. • Jansson & Sung (2006) showed the following: Given a dense set of triplets T for a set L of species, it is possible to determine in polynomial time whether a level-1 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.) • They later showed, together with Nguyen, how to do this in time linear in |T|. They also showed that, in the non-dense case, the problem is NP-hard. • But what about level-2 networks, and higher?
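The density condition above is easy to check directly. The sketch below is an illustration only: the name `is_dense` and the tuple encoding `(x, y, z)` for xy|z are assumptions. Since each 3-subset admits three possible rooted triplets, a dense set on n species contains between C(n,3) and 3·C(n,3) distinct triplets.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """True iff every 3-subset of leaves is covered by at least one triplet."""
    # A triplet covers the unordered set of its three leaves.
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered
               for c in combinations(sorted(leaves), 3))


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(is_dense("wxyz", example))      # True: all four 3-subsets are covered
print(is_dense("wxyz", example[:3]))  # False: {w, y, z} is not covered
```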
Here is an example of a level-2 network. Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
Part 2: The algorithm
Algorithm, high-level idea • The algorithm is conceptually (fairly) simple, but the proof of correctness and the technical details are rather complex. • The high-level idea is as follows: (1) PARTITION the set of leaves (i.e. species) into a ‘correct’ partition P; (2) INDUCE a new set of triplets T’ where every block of the partition P becomes a single leaf (a kind of ‘meta-leaf’ if you like); (3) SOLVE a simpler version of the problem for T’ to get a network N’; (4) RECURSE inside each leaf of N’. • Step 3 is the critical part of the algorithm. It brings together two issues: (a) why is it sufficient to only solve a simpler version of the problem? (b) how do we solve this simpler version of the problem?
Definition: inducing new triplet sets from partitions of the leaf set • Suppose I have a partition P = {P1, P2, …, Pt} of the leaf set L. • Suppose I have a dense set of triplets T on the leaf set L. • Let T’ be a new triplet set on leaf set {q1, q2, …, qt} defined as follows: • qiqj|qk is in T’ if and only if i, j and k are pairwise distinct and there exists a triplet xy|z in T such that x is in Pi, y is in Pj and z is in Pk. • Then we say that T’ is the triplet set induced by the partition P of L. • Critically: if T is dense, then T’ is also dense. • In some sense this can be perceived as a ‘coarsening’ of the input set.
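The definition above translates almost line by line into code. In this minimal sketch the function name `induce_triplets`, the use of block indices 0, 1, … as the meta-leaves q1, q2, …, and the tuple encoding `(x, y, z)` for xy|z are all assumptions made for illustration.

```python
def induce_triplets(partition, triplets):
    """Return the triplet set induced by a partition of the leaf set.

    partition: list of disjoint sets of leaves (the blocks P1, ..., Pt)
    triplets:  iterable of (x, y, z), each encoding xy|z
    Meta-leaves are represented by the index of their block.
    """
    block_of = {leaf: i for i, block in enumerate(partition) for leaf in block}
    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        if len({i, j, k}) == 3:  # i, j, k pairwise distinct
            # qi qj | qk equals qj qi | qk, so store the pair in sorted order
            induced.add((min(i, j), max(i, j), k))
    return induced


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(induce_triplets([{"w", "z"}, {"x"}, {"y"}], example))  # {(1, 2, 0)}
```

Triplets whose leaves fall into fewer than three blocks simply disappear, which is the ‘coarsening’ effect mentioned above.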
Definition: simple level-2 networks A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures. Simple level-2 networks capture in some sense the essence of the complexity of general level-2 networks.
An example of a simple level-2 network Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.
Definition: SN-set • Jansson & Sung introduced the idea of the SN-set. • SN-sets are special subsets of the leaves L, and are defined with respect to triplet sets. • All sets containing just a single leaf are SN-sets. • More generally, an SN-set is any subset of leaves obtained by taking the closure of the following operation on some subset S of the leaves L: if there is some pair of leaves x,z in the set S such that xy|z is a triplet and y is not in the set S, add y to S, and repeat until no more leaves can be added. An SN-set is any set that can be constructed this way. [Figure: a subset S of the leaves containing x and z, with y outside S.]
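The closure operation in this definition can be sketched as a simple fixed-point loop. This is a straightforward (not optimised) illustration; the name `sn_set` and the tuple encoding `(x, y, z)` for xy|z are assumptions. Because xy|z and yx|z denote the same triplet, both orientations are checked.

```python
def sn_set(seed, triplets):
    """Close the leaf set `seed` under the SN-set operation."""
    s = set(seed)
    changed = True
    while changed:
        changed = False
        for x, y, z in triplets:
            if z not in s:
                continue
            # xy|z with x, z in S and y outside S: add y (and symmetrically x)
            if x in s and y not in s:
                s.add(y)
                changed = True
            elif y in s and x not in s:
                s.add(x)
                changed = True
    return frozenset(s)


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(sorted(sn_set({"w", "z"}, example)))  # ['w', 'z']
print(sorted(sn_set({"w", "x"}, example)))  # ['w', 'x', 'y', 'z']
```

In the example, {w, z} is already closed, while starting from {w, x} the closure grows to the whole leaf set.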
Definition: maximal SN-set • The SN-set that is equal to the total leaf set L is called the trivial SN-set. • An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set. • Jansson and Sung proved that the maximal SN-sets partition the leaf set L. So no two maximal SN-sets overlap, and they completely cover the set of input leaves. • All the SN-sets, and all the maximal SN-sets, can be found in polynomial time. • Jansson & Sung solved the level-1 problem by observing that they could treat the maximal SN-sets like ‘meta-leaves’, thus reducing the problem to recursively solving the problem on the triplets induced by the maximal SN-sets. • Our idea is similar, but SN-sets in level-2 networks are (unfortunately) rather more complex creatures than in level-1 networks.
Definition: (highest) cut-edges • In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the (underlying) graph. • A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf. • A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
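These definitions can be checked by brute force on a small network. The sketch below assumes the network is given as a list of directed edges `(parent, child)` without parallel edges; the names `cut_edges`, `reachable` and `highest_cut_edges`, and the example network itself, are made up for illustration. A path of length zero counts as a directed path, so a cut-edge hanging directly below another cut-edge is not highest.

```python
def cut_edges(edges):
    """Edges whose removal disconnects the underlying undirected graph."""
    nodes = {v for e in edges for v in e}

    def connected(active):
        adj = {v: set() for v in nodes}
        for a, b in active:
            adj[a].add(b)
            adj[b].add(a)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen == nodes

    return [e for e in edges if not connected([f for f in edges if f != e])]


def reachable(edges, src):
    """All vertices reachable from src along directed edges (src included)."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    seen, stack = {src}, [src]
    while stack:
        for w in children.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def highest_cut_edges(edges):
    """Cut-edges (x, y) with no other cut-edge (p, q) such that q reaches x."""
    cuts = cut_edges(edges)
    return [(x, y) for (x, y) in cuts
            if not any(x in reachable(edges, q)
                       for (p, q) in cuts if (p, q) != (x, y))]


# A level-1 network: cycle r-u-h-v with recombination vertex h,
# and a small subtree {y1, y2} hanging below h via the cut-edge (h, t).
net = [("r", "u"), ("r", "v"), ("u", "x"), ("u", "h"), ("v", "h"),
       ("v", "z"), ("h", "t"), ("t", "y1"), ("t", "y2")]
print(sorted(highest_cut_edges(net)))  # [('h', 't'), ('u', 'x'), ('v', 'z')]
```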
Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set. • Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. Why? Because it is not possible for there to be leaves p,q in L’ and r outside L’ such that pr|q is in the set of triplets: the edge (x,y) forms a bottleneck and would have to be used twice. [Figure: leaves p and q in L’ below the cut-edge (x,y), leaf r outside L’, and the triplet pr|q.] So: each maximal SN-set can be expressed as the union of the leaves reachable from one or more highest cut-edges.
A first attempt at reducing the problem to simple level-2 networks • Now, suppose we have a dense set of triplets T and there exists a level-2 network N such that all the triplets in T are consistent with N. (Of course we don’t know what N is yet…) • Suppose we construct a partition P of L as follows. The blocks of P are the sets of leaves reachable from highest cut-edges in N. (Each maximal SN-set of N thus corresponds to one or more blocks in P.) • Let T’ be the new set of triplets induced by the partition P. In other words, if we collapse the set of leaves below highest cut-edges into ‘meta-leaves’, T’ is the new set of triplets we get. (Nice property: the maximal SN-sets of T’ are in 1:1 correspondence with the maximal SN-sets of T.) • Critical fact 1: the only level-2 networks where all cut-edges are trivial are simple level-2 networks. • Critical fact 2: there exists some simple level-2 network N’ such that the triplets in T’ are consistent with N’. Furthermore, if we find such an N’, and then recursively construct networks within each meta-leaf, we obtain a network consistent with T!
But… that’s a non-deterministic argument • So, it looks like we can indeed reduce the problem – in some sense – to finding simple level-2 networks. • But that analysis was based on knowing where the highest cut-edges are in a hypothetical solution N. And we don’t know N… this is precisely what we’re looking for! • We can, however, compute the maximal SN-sets of the input triplet set T. • We need to be able to say something more about how maximal SN-sets of T relate to highest cut-edges in hypothetical solutions. Then we can base the recursion on maximal SN-sets, instead of highest cut-edges.
Central Theorem (simplified). Suppose there is a dense triplet set T consistent with some simple level-2 network N. Then there exists a level-2 network N’ (not necessarily simple) such that, with the exception of perhaps one maximal SN-set with respect to T, every maximal SN-set appears below a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it exists) will be equal to the union of leaves below two cut-edges.
[Figure: a network before and after the transformation.] Observe how SN-set {C,G,F} has been ‘pushed’ below a single cut-edge.
An existence argument • If some solution N exists for T, then a simple level-2 solution N’ exists for T’ (induced by the highest cut-edges of N) where the maximal SN-sets of T’ are tightly correlated with the maximal SN-sets of T. Finding N’ gives the starting point for a solution to T. • But by the Central Theorem, all (except maybe one) of the maximal SN-sets of N’ can be ‘pushed’ below highest cut-edges to give a solution N’’ for T’. • If we re-expand all the meta-leaves of N’’, we obtain a new solution N* for T. Crucially, all (except maybe one) of the maximal SN-sets of T will be beneath single cut-edges in N*. The odd-one-out will be beneath two cut-edges. • So if we substitute N* as N in the first step, we come to the following conclusion: • We can find a solution for T by finding a simple level-2 solution for the set of triplets induced by the maximal SN-sets of T, and recursing. We need to correctly guess the ‘odd-one-out’ maximal SN-set, however, and split that into two meta-leaves. Fortunately we can just try splitting each maximal SN-set in turn.
[Figure sequence: the subnetwork below a highest cut-edge is transformed step by step until the whole maximal SN-set S = {C,G,F} is below a cut-edge.]
Finding simple level-2 networks • So we know that, if we analyse the maximal SN-sets carefully, and construct an appropriate new set of triplets, we can recursively reduce the entire problem to finding simple level-2 networks. • But how do we algorithmically construct a simple level-2 network that is consistent with a given dense set of triplets?
Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination node. If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination nodes are needed.)
Suppose we subsequently guess that leaf h now hangs below a recombination node in the new network. If we remove h, and all triplets that contain h, then we know that a level-0 network (i.e. a tree) must be possible on this new set of triplets (because now even fewer recombination nodes are needed.) In such a case the resulting tree is UNIQUE (J&S).
So now we have a tree. We are going to guess how to add leaf h back in, and then guess how to add leaf g back in. This guessing is not a problem because we can simply try all possibilities.
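The ‘try all possibilities’ part of this guess-and-remove procedure can be sketched as a plain enumeration. The code below only illustrates the removal step and the enumeration of ordered guesses (g, h); it does not re-insert the leaves or construct any network. The names `without_leaf` and `candidate_reductions` and the tuple encoding `(x, y, z)` for xy|z are assumptions made for this illustration.

```python
from itertools import permutations


def without_leaf(triplets, leaf):
    """Drop every triplet that contains the given leaf."""
    return [t for t in triplets if leaf not in t]


def candidate_reductions(leaves, triplets):
    """Yield (g, h, reduced) for every ordered guess of two leaves.

    g is removed first, then h; `reduced` is the triplet set on which a
    tree would then have to exist if the guess was correct.
    """
    for g, h in permutations(sorted(leaves), 2):
        yield g, h, without_leaf(without_leaf(triplets, g), h)


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
print(without_leaf(example, "w"))                        # [('x', 'y', 'z')]
print(len(list(candidate_reductions("wxyz", example))))  # 12 ordered guesses
```

There are only n(n-1) ordered guesses for n leaves, which is why the guessing does not threaten the polynomial running time.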
Adding leaf h back in. And finally adding leaf g back in.
Conclusions & open problems • So we know how to efficiently construct level-2 networks from dense triplet sets. What’s next? • Applicability: how useful is it? • Initial implementation: programming and fine-tuning • Improving running time: in the spirit of the “SN-tree” of J&S&N • Complexity: what about level-3 and higher? • Bounds: worst-case, best-case scenarios • Building all networks • Properties of output networks as function of input • Different triplet restrictions • Confidence: how good are the solutions? • Exponential-time exact algorithms for NP-hard problems Thank you for your attention!