This paper discusses different methods for sampling genealogies in complex models of divergence, exploring both genealogy sampling and summary statistics approaches. It introduces a new method that generates genealogy samples and approximates the posterior probability of the model parameters.
Methods for sampling genealogies in complex models of divergence Jody Hey Rutgers University
Acknowledgements • Model development: Rasmus Nielsen • Chimpanzee studies: Yong-Jin Won, Yong Wang, Sang Chul Choi
The Isolation with Migration Model • Descendant populations at present (the populations for data collection), with sizes N1 and N2 • Migration rates m1 and m2 • Splitting time t • Ancestral population in the past, with size NA • Θ includes six parameters
Treating genealogies as a nuisance variable Θ – parameters of the model (e.g. population sizes, migration rates) X – data G – genealogy (i.e. coalescent tree)
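In symbols, treating G as a nuisance variable usually means integrating it out of the likelihood. A standard form, assumed here and written in the notation of this slide, is:

```latex
L(\Theta \mid X) = P(X \mid \Theta) = \int_G P(X \mid G)\, P(G \mid \Theta)\, dG
```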
In practice • Recombination is assumed to be zero within loci and high between loci • The integral over genealogies must be approximated using samples of genealogies • Sampling genealogies is slow
Instead of sampling genealogies: approximate likelihoods using summary statistics • Summary statistic methods have become common because of the limitations of methods that sample genealogies • Can work with loci that have histories of recombination • Can be fast • But do not use all of the information in the data • So far, do not perform well with models and histories that include gene exchange
Competition between two lines of research: genealogy sampling, and summary statistics • Genealogy Sampling • limited by assumptions on recombination (so far) • Slow • Works well for estimating parameters • Summary Statistics • Not limited by recombination • Faster • Does not work so well (so far)
A new method for sampling genealogies • We would like a smaller MCMC state space, for which it is easier to design an MCMC updating scheme that leads to rapid convergence • We would like an approach that generates an analytic likelihood function in multiple dimensions • But one that avoids the frailties of that approach, which stem from using samples of G conditioned on a driving value of Θ, Θ0 (Kuhner et al., 1995) • Hey & Nielsen 2007 PNAS 104:2785–2790.
Reconsidering the integration over genealogies Consider an alternative expression, that also integrates over G, but that directly yields a posterior probability of Θ
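Written out in the notation of the earlier slides, the alternative expression is assumed here to take the form given in Hey & Nielsen (2007):

```latex
P(\Theta \mid X) = \int_G P(\Theta \mid G)\, P(G \mid X)\, dG
```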
This is an expectation of P(Θ|G) and can be approximated given a sample of genealogies drawn at random from the posterior distribution of G, P(G|X) • Evaluating P(Θ|G) does not depend on the data, X • All the information in the data is contained in the sample drawn from P(G|X) • Yields an analytic function
The key to generating samples of genealogies from P(G|X) and to approximating P(Θ|X) is the calculation of the prior probability of G, P(G) • In fact, this can be calculated analytically for the main demographic components of Θ.
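The role of P(G) can be made explicit with Bayes' rule. The following is a sketch under the assumption that Θ has a prior P(Θ) and that n genealogies G_1, ..., G_n have been sampled from P(G|X); the symbol n and the hat notation are introduced here for illustration only:

```latex
P(\Theta \mid G) = \frac{P(G \mid \Theta)\, P(\Theta)}{P(G)},
\qquad
P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta

\hat{P}(\Theta \mid X) = \frac{1}{n} \sum_{i=1}^{n} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)},
\qquad G_i \sim P(G \mid X)
```

Each term in the sum is a proper density in Θ, so the average is as well.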
Sequence of operations • Run a Markov chain over G and generate random samples from P(G|X) • For each G drawn from this distribution, save P(G) and all necessary information for calculating P(G|Θ) • Build a function that approximates the posterior density of Θ • This is an analytic function, and can be evaluated for any value of Θ • The function can be differentiated and searched for maxima.
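The last three steps can be illustrated with a toy sketch. It assumes a single population with one parameter θ, a uniform prior on (0, theta_max), and the standard coalescent prior P(G|θ) = (2/θ)^c exp(−2f/θ), where c is the number of coalescent events and f is the sum over intervals of k(k−1)/2 times the interval length. The function names and the (c, f) values are hypothetical stand-ins for output from a Markov chain over G; this is not the IMa implementation, and P(G) is integrated numerically here rather than analytically.

```python
import math


def log_prior_g(c, f, theta):
    """log P(G | theta) for one population: (2/theta)^c * exp(-2 f / theta)."""
    return c * math.log(2.0 / theta) - 2.0 * f / theta


def marginal_prior_g(c, f, theta_max, n_grid=4000):
    """P(G) under a uniform prior on (0, theta_max), by the midpoint rule."""
    h = theta_max / n_grid
    prior = 1.0 / theta_max
    return sum(
        math.exp(log_prior_g(c, f, (i + 0.5) * h)) * prior * h
        for i in range(n_grid)
    )


def build_posterior(genealogy_summaries, theta_max):
    """Return a function theta -> approximate posterior density P(theta | X).

    genealogy_summaries holds one (c, f) pair per sampled genealogy; P(G) is
    computed once per genealogy and stored, as in the sequence of operations.
    """
    prior = 1.0 / theta_max
    stored = [(c, f, marginal_prior_g(c, f, theta_max))
              for c, f in genealogy_summaries]

    def posterior(theta):
        if not 0.0 < theta < theta_max:
            return 0.0
        terms = (math.exp(log_prior_g(c, f, theta)) * prior / pg
                 for c, f, pg in stored)
        return sum(terms) / len(stored)

    return posterior


# Made-up (c, f) summaries standing in for genealogies sampled from P(G | X).
samples = [(9, 4.1), (9, 5.3), (9, 3.6), (9, 4.8)]
posterior = build_posterior(samples, theta_max=20.0)

# The function can be evaluated anywhere; here a coarse grid search for its peak.
grid = [0.05 * i for i in range(1, 400)]
theta_hat = max(grid, key=posterior)
```

Because the Markov chain output is reduced to a few stored numbers per genealogy, evaluating the function at a new θ requires no further sampling.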
Comparing the likelihood ratio for a true nested model with the likelihood for the full model • 100 data sets simulated under a model with just 2 population sizes and 1 migration rate • Figure: −2 × log-likelihood ratio for the 100 simulated data sets, compared with a χ² distribution with 2 degrees of freedom
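A minimal sketch of the test statistic behind this comparison, assuming the usual likelihood ratio test. The log-likelihood values are made up for illustration, and the function names are hypothetical. For 2 degrees of freedom the χ² upper tail probability has the closed form exp(−x/2), so no statistics library is needed.

```python
import math


def lrt_statistic(loglik_full, loglik_nested):
    """-2 x log-likelihood ratio of the nested model against the full model."""
    return -2.0 * (loglik_nested - loglik_full)


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square distribution with 2 df."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


# Made-up maximized log-likelihoods for one simulated data set.
stat = lrt_statistic(loglik_full=-1234.5, loglik_nested=-1236.0)
p_value = chi2_sf_2df(stat)
```

If the nested model is true, statistics computed this way across many simulated data sets should be approximately χ² distributed with 2 degrees of freedom.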
Chimpanzee Divergence: Posterior Density for Population Size (Ne) • Figure panels: original results of Won & Hey, and the new method • Curves in each panel: P. t. troglodytes, P. t. verus, Ancestor
Models for more than two populations • Assume that we know the species phylogeny
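For three sampled populations • Population sizes: N1, N2, N3 (sampled) and NA0, NA1 (ancestral) • Splitting times: t0, t1 • Migration rates: 8 (shown as m on the figure) • Θ includes 15 parameters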
Multi-population IMa – The Good News • Adding more populations does not introduce new mathematical issues • Building the application is mostly a programming problem, not a math problem • Can do any number of populations for a known phylogeny • Program will “work” for 10 populations (assuming a known phylogeny): 19 population size parameters, 162 migration rate parameters, 9 population splitting times
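The parameter counts quoted for 2, 3, and 10 populations can be reproduced with a short sketch. The counting rule is an assumption inferred from those numbers, not taken from the program: one size per sampled or ancestral population, one time per split, and one migration rate per direction for every pair of populations that coexist at some time, given ordered splitting times. The function name is hypothetical.

```python
def im_parameter_counts(k):
    """Count parameters for k sampled populations on a known rooted phylogeny."""
    if k < 2:
        raise ValueError("need at least two sampled populations")
    sizes = 2 * k - 1          # k sampled populations plus k - 1 ancestors
    splits = k - 1             # one splitting time per internal node
    pairs = k * (k - 1) // 2   # all pairs of sampled populations coexist
    # Going back in time, each split adds one new ancestral population, which
    # coexists with every other population present in that period.
    for n_present in range(k - 1, 1, -1):
        pairs += n_present - 1
    migration = 2 * pairs      # one rate per direction
    return {
        "sizes": sizes,
        "migration": migration,
        "splits": splits,
        "total": sizes + migration + splits,
    }


counts_10 = im_parameter_counts(10)
```

Under this rule the number of migration rates grows as 2(k − 1)², which is why migration parameters dominate the total for large k.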
Multiple populations – The Bad News • A lot of data will be required for many situations (hundreds of loci) • Models with many parameters introduce much more potential for model identifiability problems • Program is still slow, and applications with hundreds of loci will require new computing configurations
Chimpanzees in a four population Isolation with Migration Model • Pan paniscus (Bonobo) • P. troglodytes troglodytes (Central African Chimpanzee) • P. t. schweinfurthii (East African Chimpanzee) • P. t. verus (West African Chimpanzee)
Chimpanzee phylogeny* • Tips of the tree: P. t. schweinfurthii (Eastern), P. t. troglodytes (Central), P. t. verus (West), P. paniscus (Bonobo) • *Becquet et al. (2007) PLoS Genet 3:e66 (based on 310 microsatellite loci)
Data • Fischer et al., Curr. Biol. 16:1133-1138. • 26 loci, approx 20 gene copies per species, average length 700 bp • Yu et al., (2003) Genetics 164:1511-1518. • 42 loci, approx 10 gene copies per species, average length 400 bp • Deinard & Kidd (2000), HOXB6 and APOB • Single loci from mitochondria, X chromosome, Y chromosome • Total of 73 loci
Parameter Estimates for Four Chimpanzee Populations • Figure: tree with tips Western, Eastern, Central, Bonobo • Splitting times in years: 79,000; 440,000; 890,000 • Effective population sizes shown on the tree: 7,100; 26,000; 8,200; 7,800; 30,000; 6,900; 17,000 • Legend: migration significantly greater than zero