Coalescent Models for Genetic Demography

What can the Coalescent do for you? Rosalind Harding, University of Oxford

Who was MtEve? • the most recent common ancestor (MRCA) to whom all mtDNA haplotype diversity, currently sampled, can be traced.

One possibility: first a bottleneck, then multiple lineages are established during expansion phases [Figure: genealogy labelled with MtEve]

But if there wasn’t a bottleneck? • Then our predecessors collecting data 20,000 years ago could have identified a different mtEve, an Eve from an earlier generation; • in 20,000 years’ time, a new generation will be likely to find their mtEve to be a grand^n-daughter of our mtEve. • While our mtEve may be special to us, for archaeogeneticists of past and future generations she will have no particular significance!

Insights from coalescent models [Figure: genealogy on a time axis running to the present, with several candidate ancestors each marked "Eve?"]

What is the coalescent? • a simple model which generates a probability distribution for gene genealogies sampled from a population.

Further definitions • simple models: abstractions from complex demographic reality, which preserve key features • population: all individuals within a generation with the potential to contribute to the gene pool (including individuals who are reproductively successful as well as those who are not.) • gene genealogies: lineages of transmission of copies of a gene from parents to offspring • coalescence: where two transmission lineages find a common ancestor, looking backwards in time • probability distribution: a set of probabilities for many possible alternative gene genealogies compatible with the model

Models and data • Interpreting genetic polymorphism data • consider a sample of genes from a contemporary population, with their allelic frequencies and sequence identities determined – these data do not reveal our genetic past directly, they must be interpreted. • Options for model choice • evolution as phylogeny, phylo-geography • evolution as a balance of mutation and genetic drift in a population with a specified demography (population size, mating pattern, offspring distribution)

Characteristics of polymorphism data • For a small proportion of sites in human DNA, a second allele is present in populations due to a relatively recent mutation; this is polymorphism. • Polymorphism constitutes a transient phase in evolution, intermediate between the occurrence of a mutation and the fixation of either allele at 100%. • MtDNA trees may distort frequencies of polymorphisms. They show sets of mutation events as a proxy for fixed differences; it is the new allele that is assumed to fix (attain 100%). • These potential sources of error for time scale estimates may be minor but could be substantial.

Ingman and Gyllensten (2003) Genome Research 13:1600-1606. Neighbor-joining phylogram of 101 mtDNA coding-region sequences. Is phylogenetic branching the right model? Note variable branch lengths and endpoints; yet all individuals were sampled in the present!

A phylogenetic model with added genealogical detail and molecular clock

Trajectories for neutral alleles

Understanding genetic drift as genealogy (Ne=10, constant over time) All of the gene copies in generation t+x descend from just two of the gene copies in generation t. This is the process of drift that leads eventually to either loss or fixation (100% frequency in the population) of new mutations.
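The drift picture above can be sketched as a small forward simulation. This is a minimal illustration, assuming Wright-Fisher reproduction (each offspring copy picks its parent copy uniformly at random) and treating the slide's Ne=10 as 10 gene copies per generation; the function name `surviving_founders` and its parameters are hypothetical.

```python
import random

def surviving_founders(n_copies=10, generations=20, seed=1):
    """Count generation-t gene copies that still have descendants later on."""
    rng = random.Random(seed)
    # founder[i] records which generation-t copy the i-th current copy descends from
    founder = list(range(n_copies))
    for _ in range(generations):
        # Wright-Fisher step: each offspring copy picks its parent uniformly at random
        founder = [founder[rng.randrange(n_copies)] for _ in range(n_copies)]
    return len(set(founder))

if __name__ == "__main__":
    for x in (0, 5, 20, 100):
        print(x, "generations:", surviving_founders(generations=x), "founding copies left")
```

As x grows the count falls towards one: every copy in generation t+x then traces back to a single copy in generation t, which is fixation seen as genealogy.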

Some advantages of coalescent models over phylogeny for interpreting polymorphism data • they make better use of molecular clocks and do not treat polymorphisms as fixed differences; • as models of populations they clarify the difference between • ‘absence of evidence’ (eg for Neanderthal ancestry) and • ‘evidence of absence’ (any single locus only represents such a small sample of ancestors from >50,000 years ago that with present data we don’t have the statistical power to rule out Neanderthal ancestry). • they incorporate some measure of our uncertainty about the evolution of allele frequencies (a mixed process of mutation and transmission in genealogies).

Assumptions of Kingman’s (1982) coalescent for interpreting polymorphism data (random sample) • Neutrality • All new mutations unique and informative • If individuals are diploid in a population of size N, the model applies to 2N independent, haploid copies of a gene • Random mating within a population • Constant population size, Ne • A very specific probability distribution for transmissions of gene copies to 0, 1, 2 … offspring • Non-overlapping generations
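The "very specific probability distribution" for offspring numbers can be made concrete under one common reading of these assumptions: Wright-Fisher reproduction, where every offspring copy chooses its parent copy at random. In that case the number of offspring per copy is binomial and, for large populations, close to Poisson with mean 1. The sketch below checks this numerically; the function names and the population size used are illustrative assumptions.

```python
import math
import random
from collections import Counter

def offspring_counts(n_copies, rng):
    """Number of offspring copies left by each parental copy in one generation."""
    counts = [0] * n_copies
    for _ in range(n_copies):
        counts[rng.randrange(n_copies)] += 1  # each offspring picks a random parent
    return counts

def compare_with_poisson(n_copies=100_000, seed=1, max_k=4):
    """Return (k, simulated fraction, Poisson(1) probability) for k = 0..max_k."""
    rng = random.Random(seed)
    tally = Counter(offspring_counts(n_copies, rng))
    rows = []
    for k in range(max_k + 1):
        simulated = tally[k] / n_copies
        poisson = math.exp(-1) / math.factorial(k)
        rows.append((k, simulated, poisson))
    return rows

if __name__ == "__main__":
    for k, simulated, poisson in compare_with_poisson():
        print(k, round(simulated, 4), round(poisson, 4))
```

Roughly 37% of copies leave no offspring at all in any one generation, which is why lineages are lost so readily by drift.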

Aims of coalescent modelling: to make inferences from genetic data • to simulate different demographies to see what to expect in polymorphism data; • to estimate parameters under an explicit demographic model, eg Kingman’s coalescent; • to estimate in which generation (and sub-population) particular lineages coalesced or mutations occurred, given explicit demographic assumptions; • to evaluate the uncertainty in our estimates; • to introduce new parameters to improve the model, judging by its fit to data, to learn about demography.

The ancestry of a sample composed of two copies of the gene in generation t0 Following the ancestry of a sample of two copies of a gene (gene A) from time t0, ie the present, backwards (red), we find their most recent common ancestor (MRCA) at generation t8.
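Tracing two copies backwards can be sketched generation by generation. Assuming random choice of parent copies among 2N copies (here the hypothetical parameter `n_copies`), the two lineages share a parent with probability 1/2N in each generation, so the wait to the MRCA is geometric with mean 2N generations.

```python
import random

def generations_to_coalescence(n_copies, rng):
    """Trace two gene copies back in time until they share a parent copy."""
    generations = 0
    while True:
        generations += 1
        # each lineage picks its parent copy independently and uniformly
        if rng.randrange(n_copies) == rng.randrange(n_copies):
            return generations

def mean_coalescence_time(n_copies=20, replicates=20_000, seed=1):
    """Average number of generations back to the MRCA of a sample of two."""
    rng = random.Random(seed)
    total = sum(generations_to_coalescence(n_copies, rng) for _ in range(replicates))
    return total / replicates

if __name__ == "__main__":
    print(mean_coalescence_time())  # close to n_copies = 2N
```

Any single run, like the t8 in the figure, can fall well short of or well beyond this average.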

Expected coalescence times Expected time to coalescence for n lineages: E(TMRCA) = 4N(1-1/n) generations. As the sample size increases towards 2N, E(TMRCA) approaches 4N generations, which equals the expected fixation time for a newly arisen neutral mutation.
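These expectations follow from the standard result that, with 2N gene copies, the expected time during which k lineages remain is 4N/(k(k-1)) generations. Summing over k = n, ..., 2 gives 4N(1-1/n). A short sketch with hypothetical function names:

```python
def expected_interval(k, N):
    """E(T_k): expected generations during which exactly k lineages remain."""
    return 4 * N / (k * (k - 1))

def expected_tmrca(n, N):
    """E(TMRCA) for a sample of n copies: the sum of E(T_k) for k = n down to 2."""
    return sum(expected_interval(k, N) for k in range(2, n + 1))

if __name__ == "__main__":
    N = 10_000
    print(expected_interval(2, N))  # 2N: the last two lineages take longest
    print(expected_interval(5, N))  # N/5: the first coalescence in a sample of 5 is quick
    print(expected_tmrca(5, N))     # 4N(1 - 1/5)
    print(expected_tmrca(2 * N, N)) # just under 4N
```

Half or more of the expected tree height is spent waiting for the final pair of lineages, so adding samples changes E(TMRCA) very little.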

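[Figure: genealogies for a sample of five under three demographies, drawn against population size (N0, N1) over time] • Constant N: E(T2)=2Ne, E(T5)=Ne/5, E(TMRCA)=4Ne(1-1/5) • N expanding • N reducing Thanks to Lounes for this slide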

Simulated genealogies with constant Ne [Figure: four simulated genealogies, numbered 1 to 4] • TMRCA, in units of 2Ne generations: 4.57, 2.93*, 1.48, 0.01 • *eg 2.93 x 2 x 10,000 x 20 = 1.2 million years (Ne = 10,000, 20 years per generation)
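Genealogies like these can be simulated in a few lines. The sketch below draws TMRCA under constant size in units of 2Ne generations, where k lineages coalesce at rate k(k-1)/2, and converts to years using the Ne = 10,000 and 20-year generation time of the worked example. The sample size of 5 and the function names are assumptions, since the slide does not state how many copies were sampled.

```python
import random

def simulate_tmrca(n, rng):
    """One draw of TMRCA for n lineages, in units of 2Ne generations."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        t += rng.expovariate(rate)
    return t

def to_years(t_units, ne=10_000, gen_years=20):
    """Convert coalescent time units to years under assumed Ne and generation time."""
    return t_units * 2 * ne * gen_years

if __name__ == "__main__":
    rng = random.Random(1)
    for i in range(1, 5):
        t = simulate_tmrca(5, rng)
        print(i, round(t, 2), "units =", round(to_years(t)), "years")
```

Repeated runs scatter widely around the expectation of 2(1-1/n) units, which is the variability between genealogies that the slide illustrates.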

Simulating recent expansion: not much variability in TMRCA between genealogies [Figure: four simulated genealogies, numbered 1 to 4] • TMRCA, in units of 2Ne generations: 1. 0.0026 2. 0.0029 3. 0.0028 4. 0.0027 • ~1000 years of human evolution
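The contrast with constant size can be reproduced with a simple growth model. The slide does not say which expansion model was simulated, so the sketch below assumes exponential growth: going back in time, N(t) = N0 exp(-beta t), with t in units of 2N0 generations and `beta` a hypothetical growth parameter. Waiting times are obtained by solving for the time at which the accumulated coalescence rate reaches an exponential draw.

```python
import math
import random

def tmrca_constant(n, rng):
    """TMRCA under constant size, in units of 2*N0 generations."""
    return sum(rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1))

def tmrca_growth(n, beta, rng):
    """TMRCA when N(t) = N0 * exp(-beta * t) going back in time."""
    t = 0.0
    for k in range(n, 1, -1):
        pairs = k * (k - 1) / 2
        e = rng.expovariate(1.0)
        # solve: integral of pairs * exp(beta * u) du from t to t + s equals e
        t += math.log(1.0 + beta * e * math.exp(-beta * t) / pairs) / beta
    return t

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(var) / mean

if __name__ == "__main__":
    rng = random.Random(2)
    const = [tmrca_constant(5, rng) for _ in range(5000)]
    grow = [tmrca_growth(5, 1000.0, rng) for _ in range(5000)]
    print("constant CV:", round(coefficient_of_variation(const), 2))
    print("growth CV:  ", round(coefficient_of_variation(grow), 2))
```

Under strong growth nearly all coalescences are squeezed into the short period when the population was small, so TMRCA is both much smaller and much less variable.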

1. A time scale is given by the coalescent model for the demography (drift history) 2. Add mutations

Infinite-sites mutation in a gene tree

The relationship between average pairwise sequence difference, π, and the parameter θ in Kingman’s Coalescent [Figure: two lineages coalescing, labelled 2N generations]
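The link between the two quantities is that two sampled copies are separated by two branches of expected length 2N generations each, so under infinite-sites mutation the expected number of differences is 2 x 2N x mu = 4N mu, the usual definition of theta for 2N gene copies (mu being the mutation rate per sequence per generation). The sketch below computes the average pairwise difference from aligned sequences and checks the expectation by simulation; the toy sequences, function names and theta value are illustrative assumptions, not data from the talk.

```python
import random
from itertools import combinations

def mean_pairwise_difference(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs]
    return sum(diffs) / len(pairs)

def simulate_pair_differences(theta, rng):
    """Infinite-sites differences between two copies under Kingman's coalescent."""
    t2 = rng.expovariate(1.0)  # coalescence time in units of 2N generations
    # mutations fall on the two branches as a Poisson process, total rate theta per unit
    count, clock = 0, rng.expovariate(theta)
    while clock < t2:
        count += 1
        clock += rng.expovariate(theta)
    return count

if __name__ == "__main__":
    print(mean_pairwise_difference(["AAAA", "AAAT", "AATT"]))
    rng = random.Random(1)
    draws = [simulate_pair_differences(5.0, rng) for _ in range(20_000)]
    print(sum(draws) / len(draws))  # close to theta = 5
```

The average settles near theta, but individual pairs vary a great deal, which is one reason a single pairwise comparison says little about demography.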

Data: Aboriginal Australian mtDNAs Model: Kingman’s coalescent [Figure: gene tree for MtDNA Coding DNA Sites: 9000 to 16000] • One colonization event, or several founding lineages at different times? • Note the non-uniform spacing of mutations

Another advantage of coalescent models over phylogeny • While the population bottlenecks implicitly assumed in phylogenetic and phylogeographic analyses can be explicitly assumed in a coalescent framework, alternative demographies may be assumed, or may be inferred. • (the relationship between coalescent nodes and colonization events is very ambiguous.)

Kingman’s coalescent as H0 • Kingman’s coalescent model is a starting point, available to us even before we collect any data. • Having collected data, we can test whether the data show goodness-of-fit to the expectations of our starting model. • If not, we should change or add parameters to improve the model. At present there are some options available (not many, but some!)

Variations from Kingman’s coalescent • Selection • Recurrent and back mutation • Recombination • *Non-random mating: eg geographic subdivision with specified migration between subpopulations • *Population size fluctuation, including bottlenecks and expansions • Non-’Poisson’ distributions of offspring numbers • Unequal generation intervals between lineages *similar model but additional parameters

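The coalescent with structure [Figure: two genealogies, one with much migration and one with little migration] • Each generation, m alleles are exchanged between sub-populations. • Discrete time: in each generation an allele migrates with probability m/2N. • Continuous time: the waiting time to migration is exponentially distributed with rate m (time in units of 2N generations).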
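A minimal sketch of the structured coalescent for a sample of two, under assumptions that go beyond the slide: two sub-populations of 2N gene copies each, symmetric migration, each lineage migrating at rate m per unit of 2N generations, and m > 0. Lineages can only coalesce while in the same sub-population.

```python
import random

def pair_coalescence_time(m, same_deme, rng):
    """Coalescence time (units of 2N generations) for two lineages, two demes."""
    t = 0.0
    while True:
        if same_deme:
            # competing events: coalescence (rate 1) or either lineage migrating (rate 2m)
            t += rng.expovariate(1.0 + 2.0 * m)
            if rng.random() < 1.0 / (1.0 + 2.0 * m):
                return t
            same_deme = False
        else:
            # lineages in different demes cannot coalesce; wait for a migration
            t += rng.expovariate(2.0 * m)
            same_deme = True

def mean_time(m, same_deme, replicates=20_000, seed=1):
    """Average coalescence time over many simulated pairs."""
    rng = random.Random(seed)
    total = sum(pair_coalescence_time(m, same_deme, rng) for _ in range(replicates))
    return total / replicates

if __name__ == "__main__":
    for m in (5.0, 0.5, 0.05):
        print(m, round(mean_time(m, True), 2), round(mean_time(m, False), 2))
```

Under these assumptions the expected time is 2 units for a pair from the same sub-population, whatever the migration rate, and 2 + 1/(2m) for a pair from different sub-populations, so little migration shows up as deep branches between sub-populations.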

Summary and points for discussion • Data drawn as gene trees show the relative ordering of coalescence events. • The length of time between coalescence events is a function of the number of mutation events inferred from the data AND the assumed demographic history. (Molecular clocks should NOT be applied directly.) • Present phylo-geographic methods fudge the data to circumvent thinking about demography. Consequently we do not learn anything about demography from them. Furthermore, these methods may be generating some highly inaccurate time estimates and they don’t provide satisfactory estimates of the uncertainty surrounding these estimates. • Coalescent modelling to date draws attention to many concerns, but to improve ‘phylo-geographic’ inference we need implementations of the structured coalescent appropriate for a colonization/extinction demography.

MtDNA Coding DNA Sites: 500 to 9000

Implications of drift as genealogy All the identical copies of a gene, eg all the copies of the MC1R-151 red hair allele, carried by thousands of people across Europe, have been inherited from a single common ancestor living some time in the past. Although mutation may have generated MC1R-151 alleles many times, all these mutations were quickly lost, except for one. On one occasion only, the new mutation increased in frequency, becoming a common polymorphism. Could this be true? (We think so!)
