Population assignment likelihoods in a phylogenetic and demographic model
Jody Hey, Rutgers University
DNA Barcoding is great!
• But it is useful to keep in mind that species taxa are provisional: they are hypotheses to be revised with more data
• Taxa are tools, not truth
• Mitochondrial-based DNA barcodes
  - Can be misleading due to chance factors (different genes have different histories)
  - Can be misleading due to deterministic factors (mitochondria are a large target for natural selection)
A general problem…
• You have some genetic data
  - For example, a gene sequenced multiple times
  - Or a microsatellite locus genotyped in a number of individuals
• Suppose you are willing to assume that positive or balancing selection has not played a big role in the history of the data
• What could you figure out about the history of the organisms from which the genes came?
A General Parameterization for questions on population demography, population divergence, speciation, population identification, etc.
X: genetic data (e.g. aligned sequences, microsatellites)
  - may (or may not) come with population labels
  - may (or may not) be given as diploid genotypes
  - may include multiple loci for each sampled organism
P: population phylogeny
T: splitting times, i.e. the times of branch points in the phylogeny P
Θ: demography, i.e. population size and migration rate parameters
I: population labels, i.e. the assignment of genes to populations (which genes came from which populations or species)
G: genealogy, i.e. the gene tree for the data
• G is a necessary 'nuisance' parameter: it provides a mathematical connection between X and (P, T, Θ, I)
• It is possible to calculate the probability of G as a function of P, T, Θ and I, p(G|P,T,Θ,I), using coalescent models
• It is possible to calculate the probability of a data set given G, p(X|G), using mutation models
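To make the coalescent term concrete, here is a minimal sketch of p(G|Θ) for the simplest possible case: a single constant-size population with no splitting and no migration, so P, T and I drop out. This is only an illustration under the standard Kingman coalescent; the function name, the use of a diploid size N, and time measured in generations are assumptions made for the example, not details taken from the slides.

```python
import math

def log_coalescent_density(intervals, N):
    """Log density of one genealogy (ranked topology plus coalescent times)
    in a single constant-size diploid population of size N.

    intervals[i] is the time, in generations, during which the sample
    had n - i ancestral lineages, where n = len(intervals) + 1.
    Assumption: no population structure, splitting or migration.
    """
    n = len(intervals) + 1
    pair_rate = 1.0 / (2.0 * N)          # coalescence rate for one specific pair
    logp = 0.0
    for i, t in enumerate(intervals):
        k = n - i                        # lineages present during this interval
        n_pairs = k * (k - 1) / 2        # pairs that could coalesce
        # one specific pair coalesces at the end of the interval,
        # and no pair coalesces during it
        logp += math.log(pair_rate) - n_pairs * pair_rate * t
    return logp

# Hypothetical example: 3 gene copies, N = 1000
example = log_coalescent_density([500.0, 2500.0], N=1000)
```

The full model replaces this single-population density with one that also accounts for the phylogeny, the splitting times, migration events and the population labels.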
Connecting Data to the General Model – Parts 1-3
For unlabeled data, without information on the number of populations or on which populations were sampled
Unlabeled Data
Sequence1 ACgTACgACgCACgAAT
Sequence2 ACgTACgACgCACgAAT
Sequence3 ACCTTCgACgTACgAGT
Sequence4 ACgTTCgACgTACgAAT
Sequence5 ACCTTCgACgTACgAAT
Sequence6 ACgTTCgACgTATgAAT
• Specify a random G with topology and branch lengths
[Figure: example genealogy G with tips ordered 5, 6, 4, 3, 1, 2]
• With a mutation model, and a value of G, we can calculate the probability of the data given G: p(X|G)
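As an illustration of how p(X|G) can be computed, the sketch below uses the Jukes-Cantor substitution model with Felsenstein's pruning algorithm. The slides do not say which mutation model is used, so Jukes-Cantor is a stand-in chosen for simplicity. The tree encoding (nested tuples), the function names and the branch lengths are all hypothetical; branch lengths are in expected substitutions per site.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, d):
    """Jukes-Cantor probability of base b at the end of a branch of
    length d (expected substitutions per site), starting from base a."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, site):
    """Map each base to the probability of the tips below `node`,
    given that `node` carries that base (Felsenstein pruning).
    A leaf is a sequence name; an internal node is the tuple
    (left_child, left_branch_length, right_child, right_branch_length)."""
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    left, d_left, right, d_right = node
    pl, pr = partial(left, site), partial(right, site)
    return {
        a: sum(jc_prob(a, b, d_left) * pl[b] for b in BASES)
           * sum(jc_prob(a, b, d_right) * pr[b] for b in BASES)
        for a in BASES
    }

def log_likelihood(tree, seqs):
    """log p(X|G), treating sites as independent, uniform root base frequencies."""
    length = len(next(iter(seqs.values())))
    total = 0.0
    for i in range(length):
        site = {name: s[i].upper() for name, s in seqs.items()}
        root = partial(tree, site)
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total

# First three sequences from the slide, on a hypothetical genealogy
seqs = {"s1": "ACgTACgACgCACgAAT", "s2": "ACgTACgACgCACgAAT", "s3": "ACCTTCgACgTACgAGT"}
tree = (("s1", 0.01, "s2", 0.01), 0.05, "s3", 0.06)
loglik = log_likelihood(tree, seqs)
```

Changing the topology or branch lengths of `tree` changes `loglik`, which is what lets the data discriminate among candidate genealogies.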
Connecting Data to the General Model – Parts 4&5
• Specify a random phylogeny P with multiple populations and with splitting times T
[Figure: example phylogeny in which Pop 2 and Pop 3 join at time T2 to form Pop (2,3), which joins Pop 1 at time T1 to form Pop (2,3),1; population sizes N1, N2, N3, N(2,3) and N(2,3),1]
• With a phylogeny that depicts populations in time, we can also pick random values for population sizes and migration rates: Θ = {N1, N2, …, m1>2, m2>1, …}
Connecting Data to the General Model – Parts 6-8
• Overlay the genealogy on the phylogeny
• Add implied migration events and other random migration events to the phylogeny
• Identify I, the data labels representing the populations containing the data
[Figure: genealogy with tips ordered 5, 6, 4, 3, 1, 2 overlaid on the phylogeny of Pop 1, Pop 2 and Pop 3]
Calculating the likelihood of P, T, Θ, and I, given the data
• If we can solve this then we can obtain maximum likelihood estimates of P, T, I and Θ
• We know how to calculate p(X|G) and p(G|P,T,I,Θ)
• The math is not the hard part
• The greatest challenge is finding efficient ways to sample the space of genealogies and the space of P, T, Θ, and I
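Written out, the two computable pieces combine by integrating over the nuisance genealogy G. The second line is the naive Monte Carlo approximation, included only to show why sampling genealogies is the bottleneck; the number of sampled genealogies M and the samples G_i are notation introduced here, not taken from the slides.

```latex
L(P, T, \Theta, I \mid X) = p(X \mid P, T, \Theta, I)
  = \int_{G} p(X \mid G)\, p(G \mid P, T, \Theta, I)\, dG

L(P, T, \Theta, I \mid X) \approx \frac{1}{M} \sum_{i=1}^{M} p(X \mid G_i),
  \qquad G_i \sim p(G \mid P, T, \Theta, I)
```

Drawing G_i directly from the coalescent prior is rarely efficient, because most sampled genealogies give a very small p(X|G_i). That is the sampling challenge the last bullet refers to.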
Genetic Data and different types of data labels
Often Population Labels are known (come with data)
Population Labels   Aligned DNA Sequences
A                   ACgTACgACgCACgAAT
A                   ACgTACgACgCACgAAT
B                   ACCTTCgACgTACgAGT
B                   ACgTTCgACgTACgAAT
C                   ACCTTCgACgTACgAAT
C                   ACgTTCgACgTATgAAT
Population labels are already known and do not need to be estimated. Parameter I (population labels) is not included in the model.
Case 1: Data has no labeling at all
Population Labels   Aligned DNA Sequences
?                   ACgTACgACgCACgAAT
?                   ACgTACgACgCACgAAT
?                   ACCTTCgACgTACgAGT
?                   ACgTTCgACgTACgAAT
?                   ACCTTCgACgTACgAAT
?                   ACgTTCgACgTATgAAT
Case 2: No population labels, but data comes in diploid genotype pairs
Population Labels   Genotype Pairs   Aligned DNA Sequences
?                   Individual #1    ACgTACgACgCACgAAT
?                   Individual #1    ACgTACgACgCACgAAT
?                   Individual #2    ACCTTCgACgTACgAGT
?                   Individual #2    ACgTTCgACgTACgAAT
?                   Individual #3    ACCTTCgACgTACgAAT
?                   Individual #3    ACgTTCgACgTATgAAT
Gene copies are identified in genotype pairs only. Parameter I (population labels) is unknown (?) and needs to be estimated.
Two kinds of data sets without population labels
1. Alleles or gene copies provided without any additional information on populations
  - e.g. locus may be haploid
  - or, for whatever reason, data not collected in a way that yields diploid genotypes
2. Alleles or sequences provided in diploid (genotype) pairs
  - This is a common situation for population assignment
Case 1: Alleles or gene copies come without any additional information on populations
• The only available information on population labels (parameter I) and all other parameters (P, T, Θ) is in the actual variation in the data
• This is a lot to ask of a single-locus data set
• With multiple loci, it can be possible to estimate P, T, Θ, and I
• Can include information from a database on the same locus (loci), i.e. DNA barcoding
Case 2: Data comes in diploid (genotype) pairs
• Such data contains two types of information for population identification:
  - Patterns of variation (as in case 1)
  - Knowledge that both gene copies from a single individual must come from the same population (assume no hybrids)
• This problem (identifying populations based on diploid genotypes) is traditionally called population assignment
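A small sketch of how much the genotype-pair constraint narrows the space of possible values of I. It simply enumerates labelings for the six-sequence example with three candidate populations; the function names and the choice of three populations are assumptions made for illustration. Labelings that differ only by renaming populations are counted separately here.

```python
from itertools import product

def gene_copy_labelings(n_copies, populations):
    """Case 1: every gene copy can be assigned to any population."""
    return list(product(populations, repeat=n_copies))

def genotype_labelings(n_individuals, populations):
    """Case 2: both gene copies of an individual share one label
    (no hybrids), so labels are chosen per individual."""
    labelings = []
    for per_individual in product(populations, repeat=n_individuals):
        # repeat each individual's label for its two gene copies
        labelings.append(tuple(lab for lab in per_individual for _ in range(2)))
    return labelings

case1 = gene_copy_labelings(6, "ABC")   # 3**6 candidate values of I
case2 = genotype_labelings(3, "ABC")    # 3**3 candidate values of I
```

With six gene copies and three populations, the pairing information cuts the candidate labelings from 729 to 27.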
Population Assignment based on diploid genotype data
• Many methods exist for population assignment, using allelic data, based on an assumption of Hardy-Weinberg equilibrium within populations
• These methods do not otherwise incorporate phylogenetics or population genetics (no P, T, or Θ)
• Have to overcome difficulty of not knowing the underlying allele frequencies
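For orientation, a minimal sketch of the Hardy-Weinberg style of assignment mentioned above. It sidesteps the difficulty named in the last bullet by assuming the allele frequencies in each candidate population are already known, and it assumes loci are independent. The function names, the data layout and the frequencies are hypothetical and do not represent any specific published method.

```python
import math

def hwe_genotype_prob(a1, a2, freqs):
    """Hardy-Weinberg genotype probability: p^2 for a homozygote,
    2pq for a heterozygote. Unknown alleles get frequency 0."""
    p1, p2 = freqs.get(a1, 0.0), freqs.get(a2, 0.0)
    return p1 * p1 if a1 == a2 else 2.0 * p1 * p2

def assign(genotype, pop_freqs):
    """genotype: list of (allele, allele) pairs, one per locus.
    pop_freqs: {population: [ {allele: frequency} per locus ]}.
    Returns the population with the highest log likelihood, and all scores."""
    scores = {}
    for pop, loci in pop_freqs.items():
        logl = 0.0
        for (a1, a2), freqs in zip(genotype, loci):
            p = hwe_genotype_prob(a1, a2, freqs)
            logl += math.log(p) if p > 0 else float("-inf")
        scores[pop] = logl
    return max(scores, key=scores.get), scores

# Hypothetical allele frequencies at two loci in two populations
pop_freqs = {
    "A": [{"x": 0.9, "y": 0.1}, {"u": 0.7, "v": 0.3}],
    "B": [{"x": 0.2, "y": 0.8}, {"u": 0.4, "v": 0.6}],
}
best, scores = assign([("x", "x"), ("u", "v")], pop_freqs)
```

Nothing in this calculation uses a phylogeny, splitting times or demography, which is the contrast the slide draws with the genealogical approach.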
Considering the probability of a particular genotype configuration, D
The actual configuration D that comes with the data is one of many possible configurations.
6 Sequences, 3 Genotype pairs
1 ACgTACgACgCACgAAT
2 ACgTACgACgCACgAAT
3 ACCTTCgACgTACgAGT
4 ACgTTCgACgTACgAAT
5 ACCTTCgACgTACgAAT
6 ACgTTCgACgTATgAAT
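To see what "one of many possible configurations" means for this example, the sketch below enumerates every way of grouping six gene copies into three unordered genotype pairs. The function name is hypothetical, and gene copies are treated as distinct even when their sequences are identical (as for sequences 1 and 2).

```python
def pairings(items):
    """All ways to split an even-sized collection into unordered pairs."""
    items = list(items)
    if not items:
        return [[]]
    first, rest = items[0], items[1:]
    result = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        # pair `first` with `partner`, then pair up everything that is left
        for sub in pairings(remaining):
            result.append([(first, partner)] + sub)
    return result

configurations = pairings(range(1, 7))   # gene copies numbered 1..6
```

For 2n gene copies the count is (2n)! / (2^n n!), which gives 15 configurations for the six sequences here; the observed D is one of them.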
Calculating the probability of a particular genotype configuration, D
• Assume that genes come together and form zygotes at random with respect to their time of common ancestry
• This is a genealogical version of the assumption of random mating that is usually made with respect to segregating alleles (e.g. in Hardy-Weinberg)
• Assume that both gene copies within an individual are in the same population
Given a genealogy G, some genotype configurations are more probable than others under an assumption of random union of gametes