130 likes | 249 Views
An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees. Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013. Gene B 1: A G C 2: T G T 3: A A C 4: A G T. Gene A 1: T C G
E N D
An Algorithm for Constructing ParsimoniousHybridization Networks with MultiplePhylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA RECOMB 2013
Gene B 1: A G C 2: T G T 3: A A C 4: A G T Gene A 1: T C G 2: T C A 3: C G G 4: C C G Hybridization Networks • Gene trees: phylogenetichistory for individual genes • Inferred from gene sequences • Assume: Binary and rooted • - Different topologies at different genes TB TA • Reticulate evolution: one explanation • Hybrid speciation, horizontal gene transfer Keep two red edges Keep two black edges 1 2 3 4 1 3 2 4 Hybridization network: A directed acyclic graph displayingeachgene tree 1 2 3 4 Reticulation event(s): nodes with in-degree two or more
The Minimum Hybridization Network Problem Given: a set of K gene trees G. • NP complete: even for K=2 • Most current approaches: • exact methods for K=2 case (see Semple, et al) • impose topological constraints (e.g. galled networks, see Huson, et al.) or work on small-scale topologies Problem: reconstruct hybridization networks with Rmin(G), the minimum number, reticulation events displaying each gene tree. T3 T1 T2 1 2 3 4 1 2 4 3 1 2 3 4 N 2 reticulation events. Minimum! 1 2 3 4
What if K 3? The K 3 case is much harder than the two tree case The lowerand upper bounds approach (Wu, 2010) for Rmin(G): (G: Kgene trees) Rmin(G) LB(G) < <UB(G) If LB(G)=UB(G), then Rmin(G) = LB(G) = UB(G) Problem: if LB(G) < UB(G), then do not know the exact value of Rmin This talk: the first exact algorithm for constructing the most parsimonious hybridization network for the K 3 case.
Backward in Time View and Ancestral Configurations 1 2 3 4 2 3 1 4 5 5 1 3 2 4 5 Hybridization Network Three input trees {j} j i {5,i} h {5,g,h} g T5 {5,a,d,g} f T4 {3,5,a,d,f} Backward in time: Two lineages coalesce into one lineage. One lineage reticulates into two lineages. d e T3 {3,5,a,c,d,e} Coalescence c T2 {3,4,5,a,c} Reticulation b a T1 {1,3,4,5,a,b} Ancestral configuration (AC): set of lineages in the network that are alive at time t. T0 {1,2,3,4,5} Time AC 1 2 5 3 4 Hybridization network = A series of ACs Search for ACs: guided by the input trees
Lineage in AC: Display Input Subtrees T2 T3 T1 1 2 3 4 2 3 1 4 5 5 1 3 2 4 5 Display T5 T4 Display T3 Display 1 Subtrees labeled {1,} c T2 • Each lineage in network displays one or more input subtrees • Which reticulation edge to follow? • Lineage represented by the set of displayed subtrees a b T1 T0 1 2 5 3 4 • Progress of displayed subtrees when moving back: • The more move backward, the larger input subtrees obtained. • When a single lineage displays each complete input tree, done.
Search for Optimal ACs T2 T3 T1 (1),(2),(3,),(4),(5) (1,),(2),(3),(4),(5) 1 2 3 4 2 3 1 4 5 5 (1),(2,),(3),(4),(5) 1 3 2 4 5 (1),(2),(3),(4,),(5) (1),(2,),(3),(4),(5) (1,),(2),(3),(4),(5) (1),(2),(3),(4,),(5) (1,),(2),(3),(4),(5) (1),(1),(2),(3),(4),(5) (1),(2),(2),(3),(4),(5) (1),(2),(3),(4),(4),(5) (1),(2),(3),(4),(5),(5) (1),(2),(3),(3),(4),(5) (1),(2),(3),(4),(5) . . . • Stop when reaching a final configuration displaying each complete input tree. Level 2 . . . . . . Level k: all ACs reachable from initial AC with kreticulation (and any number of coalescences) Level 1 ACs found by one or more coalescences ACs found by one reticulation from the initial AC Initial AC at level 0 High-level idea: breath-first style search for optimal ancestral configurations
Issues in Searching for Optimal ACs The configuration search algorithm gives optimal network Efficiency: space of ACs is huge. For an AC with n lineages, • E.g. n = 30, up to 465 new ACs with one reticulation or coalescence. • Infeasible: for data with even moderate size • Prune infeasible ACs: sometimes a coalescence leads to an AC that is incompatible with the input trees • Key to make the AC search feasible for relatively large data • Works when Rmin is relatively small
Techniques for Pruning ACs T1 T2 Compatible 1 2 3 4,b c a b Compatible 1 2 3 4 1 2 3 4 Coalesce 3 and 4 a,c 3 4 Compatible 4 3 3 1 2 Incompatible AC: if some input subtrees can not be displayed • A leaf under a lineage: covered • Incompatible:some leaf not covered by any lineage. • There are stronger rules (see paper). Incompatible Coalesce 1 and 2 Reticulate 3 b 1 2 Coalesce 3 and 4 1 2 3 4
Implementation and Simulation • The algorithm is implemented in a downloadable open-source software tool: • An exact method: PIRNC : Can find exact Rmin when Rmin is relatively small (say 5 or less). • Also a heuristic method for larger data:PIRNCh. Search in a smaller space of ACs with a greedy approach. • Simulation Data: from Wu (2010) • Simulate a hybridization network N backwards in time for n species • Randomly select K trees embedded in N. • Evaluation Creteria: • Compare with the original lower and upper bound approach: do the bounds give optimal network?
Performance of Exact Method: PIRNc Only datasets with Rmin 4 are used. 100 datasets in total. Number of taxa: fixed to 10. K: number of gene trees, between 3 to 5 LB: existing lower bound method UB: existing upper bond method # of datasets PIRNc is better PIRNc better: % of datasets PIRNc finds optimal Rmin but not the bounds approach. PIRNc: always find optimal solution (if run to end)
Performance of Heuristic Method: PIRNch PIRNc becomes slow when Rmin increases. PIRNch: heuristic for larger data. Larger data with taxa number: 30, 40 or 50. 100 datasets each. PIRNch UB: # of datasets among 100 datasets PIRNch < Upper Bound PIRNch outperforms the original lower bound/upper bound approach for larger daaets among 100 datasets
Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by US National Science Foundation