1 / 15

Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees

Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees. Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2010. Gene B 1: A G C 2: T G T 3: A A C 4: A G T. Gene A 1: T C G

donal
Download Presentation

Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Close Lower and Upper Bounds for the MinimumReticulate Network of Multiple Phylogenetic Trees YufengWu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2010

  2. Gene B 1: A G C 2: T G T 3: A A C 4: A G T Gene A 1: T C G 2: T C A 3: C G G 4: C C G Reticulate Networks • Gene trees: phylogenetic trees from gene sequences • Assume: Binary and rooted • - Different topologies at different genes TB TA • Reticulate evolution: one explanation • Hybrid speciation, horizontal gene transfer Keep two red edges Keep two black edges 1 2 3 4 1 3 2 4 Reticulate network: A directed acyclic graph displayingeach of the gene trees 1 2 3 4 Reticulation event(s): nodes with in-degree two or more

  3. The Minimum Reticulation Problem Given: a set of K gene trees G. • NP complete: even for K=2 • Current approaches: • exact methods for K=2 case (see Semple, et al) • impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies Problem: reconstruct reticulate networks with Rmin(G), the minimum number, reticulation events displaying each gene tree. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 For simplicity: a reticulation node with exactly two incoming edges (our approach allows more general case) N 2 reticulation events. Minimum! 1 2 3 4

  4. Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees Key idea: developing novel lower and upper bounds for Rmin(G): G is the set of K gene trees. Rmin(G) RH(G) < <SIT(G) RH(G): Lower bound Novel: first non-trivial bound Rmin(G): Minimum Challenging for K  3 SIT(G): Upper Bound Works for any K Bounds provides range of Rmin(G) If RH(G)=SIT(G), then Rmin(G) = RH(G) = SIT(G)

  5. Pairwise Distance Pairwisedistances forT1, T2 and T3 T1 Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum reticulation in any reticulate network for T1 and T2 1 1 1 T2 T3 Rmin(T1,T2,T3)  max(1,1,1) = 1 Question: can Rmin(T1,T2,T3) = 1? T3 T2 T1 1 2 3 4 1 2 4 3 1 2 3 4 ? Choosing same reticulate edge  same gene trees Imaginary network with one reticulation node Rmin(T1, T2 and T3) 2! v

  6. T’ T Display Vector 1 2 3 4 1 3 2 4 Tree T is displayed in a network Each tree has a display vector 1 0 0 1 v1 v2 1 2 3 4 • VT : Display vector of T, how T is displayed in the network • one bit per reticulation node  length of display vector = number of reticulation nodes in the network • value 0/1: at each reticulation node, which edge (the 0-edge or 1-edge) is kept for T? VT: 0 1 VT’: 1 0 Intuition: display vectors can not be too similar Lemma: D(VT1,VT2)  d(T1,T2) for any network displaying T1 an T2. D(VT1,VT2): Hamming distance of VT1 and VT2. d(T1,T2): pairwise reticulation distance of T1 and T2

  7. T1 The RH Lower Bound 3 2 • Key: if R reticulation events possible, then exist K length R display vectors, satisfying the distance constraints: Hamming distance D(VT,VT’)  d(T,T’) • Analogy: Selecting K points on R dimensional binary hypercube s.t. the points can not be too close • If such K points do not exist, then we must need at least R+1 reticulation events. 2 T2 T3 Question: can Rmin(T1,T2,T3) = 3? T3 T2? T1 RH lower bound: the smallest R s.t. K points can be selected on R-dimensional hypercube satisfying the distance constraints. Rmin(T1, T2, T3) 4! No polynomial time algorithm is known for general HPP problem. We use integer linear programming to solve it. Closed-form formula of RH bound for K=3.

  8. Upper Bound • Problem: how to reconstruct a network for T1, T2, …, TK using small (may not be minimum) number, U, of reticulation events? • U: an upper bound • Key idea: sequentially insert gene trees Ti into a growing network N • Each step inserts a tree into N. • New reticulation events are needed to display Ti in N. • Minimize the new reticulation events at each step. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 N N N 1 2 3 4 1 2 3 4 1 2 3 4

  9. SIT Upper Bound: Stepwise Insertion of Trees • Insertion of tree into a network: given a reticulate network N and a gene tree T, grow N by adding the minimum number of reticulation events to make T displayed in N • NP complete • Practical computation using integer linear programming • SIT bound • Try all ordering of T1, T2, …, TK • For each ordering, insert each tree Ti and compute the number of reticulation events needed for inserting each Ti. Obtain a network for each order. • SIT bound = the smallest reticulation events in these networks. • Heuristics when K is large or trees are large and different

  10. Simulation • PIRN: a downloadable open-source software tool • Implemented in C++ and GNU GLPK (and CPLEX) • Generation of Simulation Data: a two-stage approach • Simulate a reticulate network N backwards in time for n species • Randomly select K trees embedded in N. • Evaluation Creteria: • How often exact minimum is found when lower and upper bounds match? • The gap between the lower and upper bounds • Average running time (see paper)

  11. Performance of PIRN: % of Datasets Optimal Solution Found % LB=UB Horizontal axis: number of taxa Vertical axis: % of datasets lower = upper bounds Lower and upper bounds often match for many data K: number of gene trees r: level of reticulation Average over 100 datasets Number of taxa

  12. Performance of PIRN: Gap of Bounds Gap Horizontal axis: number of taxa Vertical axis: gap between lower and upper bounds Gap between the lower and upper bounds is often small for many data K: number of gene trees Number of taxa

  13. Reticulate Network for Five Poaceae Trees ndhF phyB rbcL rpoC2 ITS RH bound: 11 SIT bound: 13

  14. Reticulate Network for Five Poaceae Trees ITS SIT bound: 13 reticulation events used in the network

  15. Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by National Science Foundation

More Related