1 / 13

Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013

An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees. Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013. Gene B 1: A G C 2: T G T 3: A A C 4: A G T. Gene A 1: T C G

vesna
Download Presentation

Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An Algorithm for Constructing ParsimoniousHybridization Networks with MultiplePhylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013

  2. Gene B 1: A G C 2: T G T 3: A A C 4: A G T Gene A 1: T C G 2: T C A 3: C G G 4: C C G Hybridization Networks • Gene trees: phylogenetichistory for individual genes • Inferred from gene sequences • Assume: Binary and rooted • - Different topologies at different genes TB TA • Reticulate evolution: one explanation • Hybrid speciation, horizontal gene transfer Keep two red edges Keep two black edges 1 2 3 4 1 3 2 4 Hybridization network: A directed acyclic graph displayingeachgene tree 1 2 3 4 Reticulation event(s): nodes with in-degree two or more

  3. The Minimum Hybridization Network Problem Given: a set of K gene trees G. • NP complete: even for K=2 • Most current approaches: • exact methods for K=2 case (see Semple, et al) • impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies Problem: reconstruct hybridization networks with Rmin(G), the minimum number, reticulation events displaying each gene tree. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 N 2 reticulation events. Minimum! 1 2 3 4

  4. What if K  3? The K 3 case is much harder than the two tree case The lowerand upper bounds approach (Wu, 2010) for Rmin(G): (G: Kgene trees) Rmin(G) LB(G) < <UB(G) If LB(G)=UB(G), then Rmin(G) = LB(G) = UB(G) Problem: if LB(G) < UB(G), then do not know the exact value of Rmin This talk: the first exact algorithm for constructing the most parsimonious hybridization network for the K 3 case.

  5. Backward in Time View and Ancestral Configurations 1 2 3 4 2 3 1 4 5 5 1 3 2 4 5 Hybridization Network Three input trees {j} j i {5,i} h {5,g,h} g T5 {5,a,d,g} f T4 {3,5,a,d,f} Backward in time: Two lineages coalesce into one lineage. One lineage reticulates into two lineages. d e T3 {3,5,a,c,d,e} Coalescence c T2 {3,4,5,a,c} Reticulation b a T1 {1,3,4,5,a,b} Ancestral configuration (AC): set of lineages in the network that are alive at time t. T0 {1,2,3,4,5} Time AC 1 2 5 3 4 Hybridization network = A series of ACs Search for ACs: guided by the input trees

  6. Lineage in AC: Display Input Subtrees T2 T3 T1             1 2 3 4 2 3 1 4 5 5 1 3 2 4 5 Display  T5 T4 Display  T3 Display 1 Subtrees labeled {1,} c T2 • Each lineage in network displays one or more input subtrees • Which reticulation edge to follow? • Lineage represented by the set of displayed subtrees a b T1 T0 1 2 5 3 4 • Progress of displayed subtrees when moving back: • The more move backward, the larger input subtrees obtained. • When a single lineage displays each complete input tree, done.

  7. Search for Optimal ACs T2 T3 T1             (1),(2),(3,),(4),(5) (1,),(2),(3),(4),(5) 1 2 3 4 2 3 1 4 5 5 (1),(2,),(3),(4),(5) 1 3 2 4 5 (1),(2),(3),(4,),(5) (1),(2,),(3),(4),(5) (1,),(2),(3),(4),(5) (1),(2),(3),(4,),(5) (1,),(2),(3),(4),(5) (1),(1),(2),(3),(4),(5) (1),(2),(2),(3),(4),(5) (1),(2),(3),(4),(4),(5) (1),(2),(3),(4),(5),(5) (1),(2),(3),(3),(4),(5) (1),(2),(3),(4),(5) . . . • Stop when reaching a final configuration displaying each complete input tree. Level 2 . . . . . . Level k: all ACs reachable from initial AC with kreticulation (and any number of coalescences) Level 1 ACs found by one or more coalescences ACs found by one reticulation from the initial AC Initial AC at level 0 High-level idea: breath-first style search for optimal ancestral configurations

  8. Issues in Searching for Optimal ACs The configuration search algorithm gives optimal network Efficiency: space of ACs is huge. For an AC with n lineages, • E.g. n = 30, up to 465 new ACs with one reticulation or coalescence. • Infeasible: for data with even moderate size • Prune infeasible ACs: sometimes a coalescence leads to an AC that is incompatible with the input trees • Key to make the AC search feasible for relatively large data • Works when Rmin is relatively small

  9. Techniques for Pruning ACs T1 T2 Compatible 1 2 3 4,b c a b Compatible 1 2 3 4 1 2 3 4 Coalesce 3 and 4 a,c 3 4 Compatible 4 3 3 1 2 Incompatible AC: if some input subtrees can not be displayed • A leaf under a lineage: covered • Incompatible:some leaf not covered by any lineage. • There are stronger rules (see paper). Incompatible Coalesce 1 and 2 Reticulate 3 b 1 2 Coalesce 3 and 4 1 2 3 4

  10. Implementation and Simulation • The algorithm is implemented in a downloadable open-source software tool: • An exact method: PIRNC : Can find exact Rmin when Rmin is relatively small (say 5 or less). • Also a heuristic method for larger data:PIRNCh. Search in a smaller space of ACs with a greedy approach. • Simulation Data: from Wu (2010) • Simulate a hybridization network N backwards in time for n species • Randomly select K trees embedded in N. • Evaluation Creteria: • Compare with the original lower and upper bound approach: do the bounds give optimal network?

  11. Performance of Exact Method: PIRNc Only datasets with Rmin 4 are used. 100 datasets in total. Number of taxa: fixed to 10. K: number of gene trees, between 3 to 5 LB: existing lower bound method UB: existing upper bond method # of datasets PIRNc is better PIRNc better: % of datasets PIRNc finds optimal Rmin but not the bounds approach. PIRNc: always find optimal solution (if run to end)

  12. Performance of Heuristic Method: PIRNch PIRNc becomes slow when Rmin increases. PIRNch: heuristic for larger data. Larger data with taxa number: 30, 40 or 50. 100 datasets each. PIRNch UB: # of datasets among 100 datasets PIRNch < Upper Bound PIRNch outperforms the original lower bound/upper bound approach for larger daaets among 100 datasets

  13. Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by US National Science Foundation

More Related