Methods for sampling genealogies in complex models of divergence

Jody Hey, Rutgers University

Acknowledgements • Model development Rasmus Nielsen • Chimpanzee studies Yong-Jin Won Yong Wang Sang Chul Choi

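The Isolation with Migration Model • [Figure: two descendant populations in the present (the populations sampled for data collection), with sizes N1 and N2, exchange migrants at rates m1 and m2; going back into the past, they join at splitting time t into an ancestral population of size NA] • Θ includes six parameters: N1, N2, NA, m1, m2, t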

Treating genealogies as a nuisance variable Θ – parameters of the model (e.g. population sizes, migration rates) X – data G – genealogy (i.e. coalescent tree)
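Treating G as a nuisance variable means integrating it out of the likelihood. A standard way to write this, using the symbols defined on this slide, is shown below. The exact form is an assumption here, taken from the usual coalescent-likelihood formulation rather than from the slide text.

```latex
L(\Theta \mid X) = \int P(X \mid G)\, P(G \mid \Theta)\, dG
```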

In practice • Recombination is assumed to be zero within loci and high between loci • The integral over genealogies must be approximated using samples of genealogies • The computation is slow

Instead of sampling genealogies: approximate likelihoods using summary statistics • Summary statistic methods have become common due to the limitations of methods that sample genealogies • Can work with loci that have histories of recombination • Can be fast • But they do not use all of the information in the data • So far they do not perform well with models and histories that include gene exchange

Competition between two lines of research: genealogy sampling, and summary statistics • Genealogy Sampling • limited by assumptions on recombination (so far) • Slow • Works well for estimating parameters • Summary Statistics • Not limited by recombination • Faster • Does not work so well (so far)

A new method for sampling genealogies • We would like a smaller MCMC state space, for which it is easier to design an MCMC updating scheme that leads to rapid convergence • We would like an approach that generates an analytic likelihood function in multiple dimensions • But one that avoids the frailties that stem from using samples of G conditioned on a driving value of Θ, Θ0 (Kuhner et al., 1995) • Hey & Nielsen (2007) PNAS 104:2785–2790.

Reconsidering the integration over genealogies Consider an alternative expression, that also integrates over G, but that directly yields a posterior probability of Θ

This is an expectation of P(Θ|G) and can be approximated given a sample of genealogies drawn at random from the posterior distribution of G, P(G|X) • Evaluating P(Θ|G) does not depend on the data, X • All the information in the data is contained in the sample drawn from P(G|X) • The approximation yields an analytic function of Θ
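Written out, the alternative expression and its sample-based approximation take roughly the following form. The symbols n (the number of sampled genealogies) and G_i (the i-th sampled genealogy) are introduced here for illustration. The second identity is Bayes' rule applied to P(Θ|G).

```latex
P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG
\;\approx\; \frac{1}{n} \sum_{i=1}^{n} P(\Theta \mid G_i),
\qquad
P(\Theta \mid G_i) = \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)},
\quad G_i \sim P(G \mid X)
```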

The key to generating samples of genealogies from P(G|X) and to approximating P(Θ|X) is the calculation of the prior probability of G, P(G) • In fact this can be calculated analytically for the main demographic components of Θ.

Sequence of operations • Run a Markov chain over G and generate random samples from P(G|X) • For each G drawn from this distribution, save P(G) and all necessary information for calculating P(G|Θ) • Build a function that approximates the posterior density of Θ • This is an analytic function, and can be evaluated for any value of Θ • The function can be differentiated and searched for maxima.
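The last three steps can be illustrated with a toy sketch. It assumes a single population with one parameter theta, a uniform prior on (0, theta_max], and genealogies summarized by a coalescent count c and a rate sum f so that P(G|theta) = (2/theta)^c exp(-f/theta). The saved summaries are made-up numbers standing in for MCMC output, and P(G) is obtained by numerical quadrature rather than the analytic calculation the slides describe. All function and variable names are hypothetical.

```python
import math

def log_p_g_given_theta(c, f, theta):
    """log P(G | theta) for the toy model: (2/theta)^c * exp(-f/theta)."""
    return c * math.log(2.0 / theta) - f / theta

def midpoints(theta_max, n_grid):
    """Midpoint-rule grid on (0, theta_max]; returns (points, step)."""
    h = theta_max / n_grid
    return [(i + 0.5) * h for i in range(n_grid)], h

def log_p_g(c, f, theta_max, n_grid=2000):
    """log P(G): P(G | theta) integrated against the uniform prior.
    Assumption: numerical quadrature stands in for an analytic P(G)."""
    grid, h = midpoints(theta_max, n_grid)
    prior = 1.0 / theta_max
    total = sum(math.exp(log_p_g_given_theta(c, f, t)) for t in grid) * prior * h
    return math.log(total)

def posterior_density(theta, saved, theta_max):
    """Approximate P(theta | X) as the average of P(G_i | theta) P(theta) / P(G_i)."""
    if not 0.0 < theta <= theta_max:
        return 0.0
    prior = 1.0 / theta_max
    terms = [math.exp(log_p_g_given_theta(c, f, theta) - lpg) for c, f, lpg in saved]
    return prior * sum(terms) / len(terms)

# Made-up (c, f) summaries standing in for genealogies sampled from P(G | X).
theta_max = 20.0
summaries = [(9, 38.0), (9, 45.5), (9, 41.2), (9, 52.7)]

# For each sampled genealogy, save what is needed for P(G | theta), plus log P(G).
saved = [(c, f, log_p_g(c, f, theta_max)) for c, f in summaries]

# The resulting function can be evaluated anywhere and searched for a maximum.
grid, _ = midpoints(theta_max, 400)
theta_hat = max(grid, key=lambda t: posterior_density(t, saved, theta_max))
```

Because the data enter only through which genealogies were sampled, `posterior_density` never touches X. A grid search is used here only for brevity; any optimizer could be substituted.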

Comparing the likelihood ratio for a true nested model with the likelihood for the full model • 100 data sets simulated under a model with just 2 population sizes and 1 migration rate • [Figure: distribution of –2 × log-likelihood ratio across the 100 simulated data sets, compared with a χ² distribution with 2 degrees of freedom]
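As a small worked illustration of this comparison, the sketch below computes the test statistic and its upper-tail probability under a chi-square distribution with 2 degrees of freedom, for which the tail has the closed form exp(-x/2). The log-likelihood values are invented for the example and the function names are hypothetical.

```python
import math

def lrt_statistic(loglik_full, loglik_nested):
    """-2 x log-likelihood ratio of the nested model against the full model."""
    return -2.0 * (loglik_nested - loglik_full)

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square distribution with 2 df."""
    return math.exp(-x / 2.0) if x > 0 else 1.0

# Invented log-likelihoods, for illustration only.
stat = lrt_statistic(loglik_full=-1234.5, loglik_nested=-1236.0)
p_value = chi2_sf_2df(stat)

# 5% critical value for 2 df, from inverting the closed-form tail.
critical_5pct = -2.0 * math.log(0.05)
```

If the approximation behaves as expected when the nested model is true, about 5 of 100 simulated data sets should give a statistic above `critical_5pct`.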

Chimpanzee Distributions

Chimpanzee Divergence: Posterior Density for Population Size (Ne) • [Figure: two panels, "Original results of Won & Hey" and "New Method", each showing curves for P. t. troglodytes, P. t. verus, and the ancestor]

Models for more than two populations • Assume that we know the species phylogeny

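For three sampled populations • [Figure: three sampled populations with sizes N1, N2, N3; two ancestral populations with sizes NA0 and NA1; splitting times t0 and t1; migration (m) in both directions between each pair of coexisting populations, eight rates in all] • Θ includes 15 parameters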

Multi-population IMa – The Good News • Adding more populations does not introduce new mathematical issues • Building the application is mostly a programming problem, not a math problem • Can do any number of populations for a known phylogeny • Program will “work” for 10 populations (assuming a known phylogeny), with: • 19 population size parameters • 162 migration rate parameters • 9 population splitting times
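The parameter counts quoted for 2, 3, and 10 populations can be reproduced with a short counting sketch. It assumes splits happen one at a time on a known bifurcating phylogeny, and that every pair of populations coexisting in some time period gets one migration rate in each direction. The function name is hypothetical.

```python
def im_parameter_counts(k):
    """Return (population sizes, migration rates, splitting times) for k sampled populations."""
    if k < 2:
        raise ValueError("need at least two sampled populations")
    sizes = 2 * k - 1   # k sampled populations plus k - 1 ancestral ones
    splits = k - 1      # one splitting time per ancestral population
    # Pairs coexisting in the most recent period, among the k sampled populations.
    pairs = k * (k - 1) // 2
    # Going back in time, each split leaves j populations, and the newly formed
    # ancestor pairs with the j - 1 others present in that period.
    for j in range(k - 1, 1, -1):
        pairs += j - 1
    migration = 2 * pairs  # one rate per direction for each coexisting pair
    return sizes, migration, splits

two_pop = im_parameter_counts(2)
three_pop = im_parameter_counts(3)
ten_pop = im_parameter_counts(10)
```

Under these assumptions the number of migration rates grows as 2(k - 1)², which is why migration parameters dominate the model as populations are added.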

Multiple populations – The Bad News • A lot of data will be required for many situations (hundreds of loci) • Models with many parameters introduce much more potential for model identifiability problems • Program is still slow, and applications with hundreds of loci will require new computing configurations

Chimpanzees in a four population Isolation with Migration Model • Pan paniscus (Bonobo) • P. troglodytes troglodytes (Central African Chimpanzee) • P. t. schweinfurthii (East African Chimpanzee) • P. t. verus (West African Chimpanzee)

Chimpanzee Distributions

Chimpanzee phylogeny* • [Figure: phylogeny with tips P. t. schweinfurthii (Eastern), P. t. troglodytes (Central), P. t. verus (West), and P. paniscus (Bonobo)] • *Becquet et al. (2007) PLoS Genet 3:e66 (based on 310 microsatellite loci)

Data • Fischer et al., Curr. Biol. 16:1133-1138: 26 loci, approx. 20 gene copies per species, average length 700 bp • Yu et al. (2003) Genetics 164:1511-1518: 42 loci, approx. 10 gene copies per species, average length 400 bp • Deinard & Kidd (2000): HOXB6 and APOB • Single loci from mitochondria, X chromosome, Y chromosome • Total of 73 loci

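Parameter Estimates for Four Chimpanzee Populations • [Figure: population tree with tips Western, Eastern, Central, and Bonobo, annotated with effective population sizes, splitting times in years, and arrows marking migration significantly greater than zero] • Effective population sizes shown: 7,100; 26,000; 8,200; 7,800; 30,000; 6,900; 17,000 • Splitting times shown: 79,000 yrs; 440,000 yrs; 890,000 yrs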
