Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees

Close Lower and Upper Bounds for the MinimumReticulate Network of Multiple Phylogenetic Trees YufengWu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2010

Gene B 1: A G C 2: T G T 3: A A C 4: A G T Gene A 1: T C G 2: T C A 3: C G G 4: C C G Reticulate Networks • Gene trees: phylogenetic trees from gene sequences • Assume: Binary and rooted • - Different topologies at different genes TB TA • Reticulate evolution: one explanation • Hybrid speciation, horizontal gene transfer Keep two red edges Keep two black edges 1 2 3 4 1 3 2 4 Reticulate network: A directed acyclic graph displayingeach of the gene trees 1 2 3 4 Reticulation event(s): nodes with in-degree two or more

The Minimum Reticulation Problem Given: a set of K gene trees G. • NP complete: even for K=2 • Current approaches: • exact methods for K=2 case (see Semple, et al) • impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies Problem: reconstruct reticulate networks with Rmin(G), the minimum number, reticulation events displaying each gene tree. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 For simplicity: a reticulation node with exactly two incoming edges (our approach allows more general case) N 2 reticulation events. Minimum! 1 2 3 4

Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees Key idea: developing novel lower and upper bounds for Rmin(G): G is the set of K gene trees. Rmin(G) RH(G) < <SIT(G) RH(G): Lower bound Novel: first non-trivial bound Rmin(G): Minimum Challenging for K  3 SIT(G): Upper Bound Works for any K Bounds provides range of Rmin(G) If RH(G)=SIT(G), then Rmin(G) = RH(G) = SIT(G)

Pairwise Distance Pairwisedistances forT1, T2 and T3 T1 Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum reticulation in any reticulate network for T1 and T2 1 1 1 T2 T3 Rmin(T1,T2,T3)  max(1,1,1) = 1 Question: can Rmin(T1,T2,T3) = 1? T3 T2 T1 1 2 3 4 1 2 4 3 1 2 3 4 ? Choosing same reticulate edge  same gene trees Imaginary network with one reticulation node Rmin(T1, T2 and T3) 2! v

T’ T Display Vector 1 2 3 4 1 3 2 4 Tree T is displayed in a network Each tree has a display vector 1 0 0 1 v1 v2 1 2 3 4 • VT : Display vector of T, how T is displayed in the network • one bit per reticulation node  length of display vector = number of reticulation nodes in the network • value 0/1: at each reticulation node, which edge (the 0-edge or 1-edge) is kept for T? VT: 0 1 VT’: 1 0 Intuition: display vectors can not be too similar Lemma: D(VT1,VT2)  d(T1,T2) for any network displaying T1 an T2. D(VT1,VT2): Hamming distance of VT1 and VT2. d(T1,T2): pairwise reticulation distance of T1 and T2

T1 The RH Lower Bound 3 2 • Key: if R reticulation events possible, then exist K length R display vectors, satisfying the distance constraints: Hamming distance D(VT,VT’)  d(T,T’) • Analogy: Selecting K points on R dimensional binary hypercube s.t. the points can not be too close • If such K points do not exist, then we must need at least R+1 reticulation events. 2 T2 T3 Question: can Rmin(T1,T2,T3) = 3? T3 T2? T1 RH lower bound: the smallest R s.t. K points can be selected on R-dimensional hypercube satisfying the distance constraints. Rmin(T1, T2, T3) 4! No polynomial time algorithm is known for general HPP problem. We use integer linear programming to solve it. Closed-form formula of RH bound for K=3.

Upper Bound • Problem: how to reconstruct a network for T1, T2, …, TK using small (may not be minimum) number, U, of reticulation events? • U: an upper bound • Key idea: sequentially insert gene trees Ti into a growing network N • Each step inserts a tree into N. • New reticulation events are needed to display Ti in N. • Minimize the new reticulation events at each step. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 N N N 1 2 3 4 1 2 3 4 1 2 3 4

SIT Upper Bound: Stepwise Insertion of Trees • Insertion of tree into a network: given a reticulate network N and a gene tree T, grow N by adding the minimum number of reticulation events to make T displayed in N • NP complete • Practical computation using integer linear programming • SIT bound • Try all ordering of T1, T2, …, TK • For each ordering, insert each tree Ti and compute the number of reticulation events needed for inserting each Ti. Obtain a network for each order. • SIT bound = the smallest reticulation events in these networks. • Heuristics when K is large or trees are large and different

Simulation • PIRN: a downloadable open-source software tool • Implemented in C++ and GNU GLPK (and CPLEX) • Generation of Simulation Data: a two-stage approach • Simulate a reticulate network N backwards in time for n species • Randomly select K trees embedded in N. • Evaluation Creteria: • How often exact minimum is found when lower and upper bounds match? • The gap between the lower and upper bounds • Average running time (see paper)

Performance of PIRN: % of Datasets Optimal Solution Found % LB=UB Horizontal axis: number of taxa Vertical axis: % of datasets lower = upper bounds Lower and upper bounds often match for many data K: number of gene trees r: level of reticulation Average over 100 datasets Number of taxa

Performance of PIRN: Gap of Bounds Gap Horizontal axis: number of taxa Vertical axis: gap between lower and upper bounds Gap between the lower and upper bounds is often small for many data K: number of gene trees Number of taxa

Reticulate Network for Five Poaceae Trees ndhF phyB rbcL rpoC2 ITS RH bound: 11 SIT bound: 13

Reticulate Network for Five Poaceae Trees ITS SIT bound: 13 reticulation events used in the network

Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by National Science Foundation

Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees