Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress
Yufeng Wu
Dept. of Computer Science and Engineering, University of Connecticut, USA
Dagstuhl Seminar, 2010
Recombination
• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third, new sequence of the same length in the genealogy
• Spatial order is important: different parts of the genome are inherited from different ancestors
[Figure: a recombinant formed from a prefix and a suffix joined at a breakpoint; sequences shown: 110001111111001, 1100 00000001111, 000110000001111]
Ancestral Recombination Graph (ARG)
• Network model: beyond the tree model
• Assumption: at most one mutation per site
[Figure: an ARG with mutations and a recombination, shown for S1 = 00, S2 = 01, S3 = 10, S4 = 11, beside a second panel with S1 = 00, S2 = 01, S3 = 10, S4 = 10]
Reconstruction of Network-based Evolutionary History
Different formulations
• Input: DNA sequences (haplotypes) or phylogenetic trees
• Biology: meiotic recombination in populations, or reticulate evolutionary processes (horizontal gene transfer or hybrid speciation)
Same objective
• Reconstruct the network-based evolutionary history (and related problems)
• Efficiency
• Accuracy
Reconstructing ARGs by Parsimony
• Input: a set of binary sequences M
• Goal: reconstruct ARGs deriving M
• Parsimony formulation
• minARG: minimize the number of recombination events
• NP-complete (Wang, et al.)
[Figure: Kreitman’s data for the adh locus of D. melanogaster (1983)]
The minARG Problem
• Structurally constrained ARGs, e.g. galled trees (Wang, et al.; Gusfield, et al.)
• Simplified ARG topology
• Heuristic methods, e.g. program MARGARITA (Durbin, et al.), Song, et al., Parida, et al.
• Exact minARG by branch and bound (Lyngso, Song and Hein)
• Uniform sampling of minARGs by treating each minARG as equally likely (Wu)
• Estimating the range of minARGs: lower and upper bounds
minARG for Kreitman’s data
• Rmin(M): minimum number of recombinations for M
• L(M): lower bound on Rmin(M)
• U(M): upper bound on Rmin(M)
• Several lower bounds give L(M) = 7
• U(M) = 7 for Kreitman’s data (Song, Wu and Gusfield)
• Since L(M) ≤ Rmin(M) ≤ U(M), it follows that Rmin(M) = 7
Challenge: accurate inference of ARGs
ARG Induces Local Trees
• Local tree: the evolutionary history at a single genomic position
• To obtain it, trace backwards in time; at each recombination node, follow the branch that passes alleles to the recombinant at this position
[Figure: an ARG with mutations and a recombination over the data 0101, 0110, 1110, 1010, 0000, with the local tree near site 3 highlighted]
Local Trees Change Across the Genome
• Local trees change when moving across recombination breakpoints
• Spatial property: nearby local trees tend to be more similar
• How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees
[Figure: the ARG over the data 0101, 0110, 1110, 1010, 0000, with the local tree near site 2 highlighted]
Inferring Local Trees
• Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths)
• Key: local trees have different topologies because of recombination
• Trees or network? Do not reconstruct the full network; local trees are very informative
• Parsimony-based approaches
• Hein (1990, 1993), Song and Hein (2005)
• Wu (2010): shared topological features in nearby trees
• Accuracy: Robinson-Foulds distances between the inferred trees and the simulated trees
Challenge: how to improve the accuracy?
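The accuracy measure named above can be illustrated with a minimal sketch of a Robinson-Foulds distance. It assumes rooted trees written as nested tuples of leaf labels and uses the clade-based (rooted) variant: the number of clades present in exactly one of the two trees. The tree encoding and the function names are illustrative assumptions, not taken from the talk.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf labels.

    A tree is either a leaf label (str) or a tuple of subtrees (assumed encoding).
    """
    found = set()

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these leaves
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades that appear in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


# Hypothetical example trees on five taxa.
inferred = ((("a", "b"), "c"), ("d", "e"))
simulated = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(inferred, simulated))  # 2: clade {a, b} versus clade {a, c}
```

Because a tree is reduced to its set of clades, the same function also accepts non-binary trees, which is convenient when the inferred local trees are only partially resolved.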
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly non-binary) tree topology
• Initialize each tree to contain the split induced by the SNP
• Gradually refine the trees by adding new splits
• Splits are found by a set of rules (later)
• Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered
A Little Background: Compatibility
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
Matrix M (rows a to e, sites A, B, C):
    A B C
a   0 0 0
b   1 0 0
c   0 0 1
d   1 0 1
e   0 1 1
Sites A and B are compatible, but A and C are incompatible.
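The compatibility definition above is the four-gamete test, which can be sketched in a few lines. The matrix is the one on this slide (rows a to e, sites A, B, C); the function and variable names are illustrative assumptions.

```python
def compatible(matrix, p, q):
    """Four-gamete test: sites p and q are compatible unless all of 00, 01, 10, 11 occur."""
    gametes = {(row[p], row[q]) for row in matrix}
    return len(gametes) < 4


# Rows a..e of the slide's matrix M; columns are sites A, B, C.
M = [
    (0, 0, 0),  # a
    (1, 0, 0),  # b
    (0, 0, 1),  # c
    (1, 0, 1),  # d
    (0, 1, 1),  # e
]
A, B, C = 0, 1, 2
print(compatible(M, A, B))  # True
print(compatible(M, A, C))  # False
```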
Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites in which the SNPs are pairwise compatible
• May indicate that no topology-altering recombination occurred within the region
• Rule: for site s, add the split of any site in such a region containing s to the tree at s
• Compatibility is a very strong property and is unlikely to arise by chance
[Figure: sites A, B, C]
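As a rough sketch of how fully-compatible regions could be located, the code below enumerates the maximal intervals of consecutive, pairwise compatible sites that contain a given site s. It is a brute-force illustration, not the RENT implementation; the function names are assumptions, and the example reuses the matrix from the compatibility slide.

```python
from itertools import combinations


def compatible(matrix, p, q):
    """Four-gamete test on columns p and q."""
    return len({(row[p], row[q]) for row in matrix}) < 4


def fully_compatible(matrix, lo, hi):
    """True if every pair of sites in the interval lo..hi (inclusive) is compatible."""
    return all(compatible(matrix, p, q) for p, q in combinations(range(lo, hi + 1), 2))


def maximal_regions_containing(matrix, s):
    """All maximal fully-compatible intervals (lo, hi) with lo <= s <= hi."""
    n_sites = len(matrix[0])
    regions = []
    for lo in range(s + 1):
        if not fully_compatible(matrix, lo, s):
            continue
        hi = s
        while hi + 1 < n_sites and fully_compatible(matrix, lo, hi + 1):
            hi += 1
        # Intervals are found in order of increasing lo, so only containment
        # in an earlier interval needs to be checked.
        if not any(l <= lo and hi <= h for l, h in regions):
            regions.append((lo, hi))
    return regions


# Matrix from the compatibility slide: sites A=0, B=1, C=2.
M = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1)]
print(maximal_regions_containing(M, 1))  # [(0, 1), (1, 2)]
```

Site B lies in two maximal regions because A and C are incompatible with each other, so under the rule the tree at B could receive the splits of both A and C, while the trees at A and C would not receive each other's split.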
Split Propagation: More General Rule
• Three consecutive sites A, B and C. Sites A and B are incompatible. Does site C matter for the tree at site A?
• Trees at sites A and B are different.
• Suppose site C is compatible with sites A and B. Then?
• Site C may indicate a shared subtree in both trees at sites A and B.
• Rule: a split propagates in both directions until reaching an incompatible tree.
[Figure: sites A, B, C]
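The propagation rule can be sketched as follows, under the assumption that each tree is stored as a set of pairwise compatible splits and that a split is represented by the set of taxa on the side not containing taxon 0. The data, the representation, and the function names are hypothetical; the actual RENT rules may differ in detail.

```python
def split_of(column):
    """Split induced by a SNP column: the taxa whose allele differs from taxon 0's."""
    return frozenset(i for i, allele in enumerate(column) if allele != column[0])


def splits_compatible(a, b, n_taxa):
    """Two splits are compatible if some side of one is disjoint from some side of the other."""
    everyone = frozenset(range(n_taxa))
    return any(not (x & y) for x in (a, everyone - a) for y in (b, everyone - b))


def propagate(trees, origin, split, n_taxa):
    """Add `split` to neighboring trees in both directions.

    In each direction, stop at the first tree containing a split incompatible with it.
    """
    for step in (-1, 1):
        site = origin + step
        while 0 <= site < len(trees):
            if not all(splits_compatible(split, other, n_taxa) for other in trees[site]):
                break
            trees[site].add(split)
            site += step


# Hypothetical columns over six taxa: A and B are incompatible, C is compatible with both.
col_a = [0, 0, 1, 1, 0, 0]
col_b = [0, 1, 1, 0, 0, 0]
col_c = [0, 0, 0, 0, 1, 1]
trees = [{split_of(col_a)}, {split_of(col_b)}, {split_of(col_c)}]

propagate(trees, 2, split_of(col_c), 6)  # C's split reaches the trees at B and A
propagate(trees, 0, split_of(col_a), 6)  # A's split is blocked by the tree at B
```

After these two calls the trees at sites A and B both contain the split of C (the shared subtree), while the split of A stays out of the tree at B.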
Reticulate Networks
• Gene trees: phylogenetic trees from gene sequences
• Assume: binary and rooted
• Different topologies at different genes
• Reticulate evolution: one explanation
• Hybrid speciation, horizontal gene transfer
• Reticulate network: a directed acyclic graph displaying each of the gene trees
• Hybridization event: a node with in-degree two or more
Gene A: 1: 0 0 0, 2: 0 0 1, 3: 1 1 0, 4: 1 0 0
Gene B: 1: 0 0 0, 2: 1 0 1, 3: 0 1 0, 4: 0 0 1
[Figure: gene trees T and T’, each rooted at ρ, with leaf orders 1 2 3 4 and 1 3 2 4, and a reticulate network on taxa 1 2 3 4 annotated "Keep two red edges" and "Keep two black edges"]
The Minimum Reticulation Problem
• Given: a set G of K gene trees
• Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G), the minimum number of reticulation events
• NP-complete, even for K = 2
• Current approaches:
• exact methods for the K = 2 case (see Semple, et al.)
• impose topological constraints (e.g. galled networks, see Huson, et al.)
• Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees
• Close lower and upper bounds for an arbitrary number of trees (Wu, 2010)
[Figure: three gene trees T1, T2, T3 on taxa 1 to 4 and a reticulate network N on the same taxa]
Performance of PIRN: Optimal Solution
• Lower and upper bounds often match
[Plot: horizontal axis: number of taxa; vertical axis: % of datasets with LB = UB; K: number of trees; r: level of reticulation]
Performance of PIRN: Gap of Bounds
• The gap between the lower and upper bounds is often small
[Plot: horizontal axis: number of taxa; vertical axis: gap between lower and upper bounds; K: number of trees; r: level of reticulation]
Reticulate Network for Five Poaceae Trees
• Genes: ndhF, phyB, rbcL, rpoC2, ITS
• Lower bound: 11
• Upper bound: 13
Reticulate Network for Five Poaceae Trees
• Upper bound: 13 (used in this network)
Acknowledgement • More information available at: http://www.engr.uconn.edu/~ywu • Research supported by National Science Foundation and UConn Research Foundation
Coalescent with Recombination
• Coalescent theory: defines a probability distribution over genealogies
• Likelihood computation for the coalescent with recombination
• Likelihood: a sum over all ARGs of their probabilities under given parameters
• Challenging: too many ARGs (Lyngso, Song and Hein)
• Importance sampling approach: draw samples (ARGs) with respect to some probability distribution
• Works well with no recombination
• Does not work well with recombination
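The likelihood sum and its importance-sampling estimate can be written out as below. The notation is an assumption made here for illustration and does not appear on the slide: D is the data, θ the model parameters, G an ARG, and Q the proposal distribution from which the N sampled ARGs are drawn. For simplicity the ARG space is treated as discrete, ignoring the integration over branch lengths.

```latex
L(\theta) \;=\; P(D \mid \theta) \;=\; \sum_{G} P(D \mid G)\, P(G \mid \theta)
\;\approx\; \frac{1}{N} \sum_{i=1}^{N} \frac{P(D \mid G_i)\, P(G_i \mid \theta)}{Q(G_i)},
\qquad G_i \sim Q .
```

The estimate is only as good as the proposal: when Q puts little mass on the ARGs that dominate the sum, a few samples carry almost all the weight and the variance becomes large.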
Coalescent-based ARG Sampling
• Uniform sampling of minARGs (Wu, 2007)
• Treat each minARG as equally likely.
• Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)
• Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities.
• A related problem: compute the coalescent likelihood with recombination efficiently.
• Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009)
[Figure: probability of ARGs under certain parameters, with the minARGs marked]
The Mosaic Model
• M: input sequences
• Assumption: the input sequences are descendants of K (unknown) founder sequences
• Extant sequences: concatenations of exact copies of founder segments (no shift of position)
• Coloring: assign to each position of a sequence the founder (color) it comes from; need consistency
[Figure: M with K = 2: 0000, 0101, 0111, 1111, 1110; 5 breakpoints in total]
The Minimum Mosaic Problem
• Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints)
• And find the K founder sequences (not part of the input)
[Figure: inferred founders for data from Rastas and Ukkonen: 20 sequences, 40 sites; 55 breakpoints, the minimum number]
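To make the objective concrete, here is a brute-force sketch for tiny inputs. It tries every set of K candidate founders and, for each sequence, uses a small dynamic program to count the fewest color changes needed to spell the sequence from those founders. This is exponential in the number of sites and is meant only as an illustration; it is not any of the published algorithms, and all function names are assumptions. The example uses the five sequences shown with the mosaic model (K = 2).

```python
from itertools import combinations_with_replacement, product


def min_breakpoints(seq, founders):
    """Fewest color changes needed to copy seq from the founders, or None if impossible."""
    inf = float("inf")
    # cost[f]: fewest breakpoints so far if the current position is copied from founder f
    cost = [0 if f[0] == seq[0] else inf for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)
        # Either keep copying from the same founder, or switch (one more breakpoint).
        cost = [min(cost[i], best + 1) if f[j] == seq[j] else inf
                for i, f in enumerate(founders)]
    total = min(cost)
    return None if total == inf else total


def brute_force_min_mosaic(seqs, k):
    """Try every multiset of k binary founders; return (min breakpoints, founders)."""
    n_sites = len(seqs[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n_sites)]
    best = None
    for founders in combinations_with_replacement(candidates, k):
        costs = [min_breakpoints(s, founders) for s in seqs]
        if any(c is None for c in costs):
            continue  # some sequence cannot be written as a mosaic of these founders
        total = sum(costs)
        if best is None or total < best[0]:
            best = (total, founders)
    return best


seqs = ["0000", "0101", "0111", "1111", "1110"]
print(brute_force_min_mosaic(seqs, 2))  # (5, ('0000', '1111'))
```

On this input the search returns 5 breakpoints, in agreement with the total shown on the mosaic-model slide. Every column of the data contains both alleles, so with K = 2 the two founders must be complementary, which is why 0000 and 1111 come out as the founders.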
The Minimum Mosaic Problem
• Introduced by Ukkonen (2002)
• Simple and easier to visualize
• Main known results
• An algorithm that takes exponential time in general but runs in polynomial time for K = 2 (Ukkonen 2002)
• An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
• Haplovisual program and other extensions by Rastas and Ukkonen (2007)
• Heuristic algorithm by Roli and Blum (2009)
• Lower bounds on the minimum number of breakpoints needed (Wu, 2010)
• Challenges
• Polynomial-time algorithm for K ≥ 3?
• Concrete applications in biology?