150 likes | 268 Views
Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees. Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2010. Gene B 1: A G C 2: T G T 3: A A C 4: A G T. Gene A 1: T C G
E N D
Close Lower and Upper Bounds for the MinimumReticulate Network of Multiple Phylogenetic Trees YufengWu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2010
Gene B 1: A G C 2: T G T 3: A A C 4: A G T Gene A 1: T C G 2: T C A 3: C G G 4: C C G Reticulate Networks • Gene trees: phylogenetic trees from gene sequences • Assume: Binary and rooted • - Different topologies at different genes TB TA • Reticulate evolution: one explanation • Hybrid speciation, horizontal gene transfer Keep two red edges Keep two black edges 1 2 3 4 1 3 2 4 Reticulate network: A directed acyclic graph displayingeach of the gene trees 1 2 3 4 Reticulation event(s): nodes with in-degree two or more
The Minimum Reticulation Problem Given: a set of K gene trees G. • NP complete: even for K=2 • Current approaches: • exact methods for K=2 case (see Semple, et al) • impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies Problem: reconstruct reticulate networks with Rmin(G), the minimum number, reticulation events displaying each gene tree. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 For simplicity: a reticulation node with exactly two incoming edges (our approach allows more general case) N 2 reticulation events. Minimum! 1 2 3 4
Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees Key idea: developing novel lower and upper bounds for Rmin(G): G is the set of K gene trees. Rmin(G) RH(G) < <SIT(G) RH(G): Lower bound Novel: first non-trivial bound Rmin(G): Minimum Challenging for K 3 SIT(G): Upper Bound Works for any K Bounds provides range of Rmin(G) If RH(G)=SIT(G), then Rmin(G) = RH(G) = SIT(G)
Pairwise Distance Pairwisedistances forT1, T2 and T3 T1 Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum reticulation in any reticulate network for T1 and T2 1 1 1 T2 T3 Rmin(T1,T2,T3) max(1,1,1) = 1 Question: can Rmin(T1,T2,T3) = 1? T3 T2 T1 1 2 3 4 1 2 4 3 1 2 3 4 ? Choosing same reticulate edge same gene trees Imaginary network with one reticulation node Rmin(T1, T2 and T3) 2! v
T’ T Display Vector 1 2 3 4 1 3 2 4 Tree T is displayed in a network Each tree has a display vector 1 0 0 1 v1 v2 1 2 3 4 • VT : Display vector of T, how T is displayed in the network • one bit per reticulation node length of display vector = number of reticulation nodes in the network • value 0/1: at each reticulation node, which edge (the 0-edge or 1-edge) is kept for T? VT: 0 1 VT’: 1 0 Intuition: display vectors can not be too similar Lemma: D(VT1,VT2) d(T1,T2) for any network displaying T1 an T2. D(VT1,VT2): Hamming distance of VT1 and VT2. d(T1,T2): pairwise reticulation distance of T1 and T2
T1 The RH Lower Bound 3 2 • Key: if R reticulation events possible, then exist K length R display vectors, satisfying the distance constraints: Hamming distance D(VT,VT’) d(T,T’) • Analogy: Selecting K points on R dimensional binary hypercube s.t. the points can not be too close • If such K points do not exist, then we must need at least R+1 reticulation events. 2 T2 T3 Question: can Rmin(T1,T2,T3) = 3? T3 T2? T1 RH lower bound: the smallest R s.t. K points can be selected on R-dimensional hypercube satisfying the distance constraints. Rmin(T1, T2, T3) 4! No polynomial time algorithm is known for general HPP problem. We use integer linear programming to solve it. Closed-form formula of RH bound for K=3.
Upper Bound • Problem: how to reconstruct a network for T1, T2, …, TK using small (may not be minimum) number, U, of reticulation events? • U: an upper bound • Key idea: sequentially insert gene trees Ti into a growing network N • Each step inserts a tree into N. • New reticulation events are needed to display Ti in N. • Minimize the new reticulation events at each step. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 N N N 1 2 3 4 1 2 3 4 1 2 3 4
SIT Upper Bound: Stepwise Insertion of Trees • Insertion of tree into a network: given a reticulate network N and a gene tree T, grow N by adding the minimum number of reticulation events to make T displayed in N • NP complete • Practical computation using integer linear programming • SIT bound • Try all ordering of T1, T2, …, TK • For each ordering, insert each tree Ti and compute the number of reticulation events needed for inserting each Ti. Obtain a network for each order. • SIT bound = the smallest reticulation events in these networks. • Heuristics when K is large or trees are large and different
Simulation • PIRN: a downloadable open-source software tool • Implemented in C++ and GNU GLPK (and CPLEX) • Generation of Simulation Data: a two-stage approach • Simulate a reticulate network N backwards in time for n species • Randomly select K trees embedded in N. • Evaluation Creteria: • How often exact minimum is found when lower and upper bounds match? • The gap between the lower and upper bounds • Average running time (see paper)
Performance of PIRN: % of Datasets Optimal Solution Found % LB=UB Horizontal axis: number of taxa Vertical axis: % of datasets lower = upper bounds Lower and upper bounds often match for many data K: number of gene trees r: level of reticulation Average over 100 datasets Number of taxa
Performance of PIRN: Gap of Bounds Gap Horizontal axis: number of taxa Vertical axis: gap between lower and upper bounds Gap between the lower and upper bounds is often small for many data K: number of gene trees Number of taxa
Reticulate Network for Five Poaceae Trees ndhF phyB rbcL rpoC2 ITS RH bound: 11 SIT bound: 13
Reticulate Network for Five Poaceae Trees ITS SIT bound: 13 reticulation events used in the network
Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by National Science Foundation