Estimating Genealogies from Marker Data

This presentation outlines a Bayesian approach to estimating genealogies from marker data, including computational aspects and examples with linked and unlinked markers.


  1. Estimating Genealogies from Marker Data Dario Gasbarra Matti Pirinen Mikko Sillanpää Elja Arjas SCB Workshop 14.12.2006 Isaac Newton Institute, Cambridge Biometry Group Department of Mathematics and Statistics University of Helsinki

  2. Outline of the presentation • The Problem • Description of the method • Probability model • Computational aspects • Example 1 • unlinked markers • relatedness estimation (with a pedigree) • Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision) • Example 2 • linked markers • haplotyping • relatedness estimation (with IBD-alleles) • Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (under preparation)

  3. A Basic question in statistical genetics • Consider a population evolving in time • Inverse problem • Current state of the process is known • individuals alive at the moment • What was the path leading to this state? • family structures (pedigree) • inheritance patterns

  4. Why is the recent past important? • Relatedness estimation • In which parts of the genome does a group of individuals share alleles identical by descent (IBD)? • gene mapping • Haplotyping • Ancestral meioses have formed the haplotypes of the contemporary individuals

  5. Current methods on KNOWN pedigrees • Exact calculations on known pedigrees • Elston-Stewart algorithm • A few markers, not too complex pedigrees • Lander-Green algorithm • Small pedigrees, many markers • Approximate calculations on known pedigrees • McMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])

  6. What if the pedigree is not known? • There may be only partial pedigree data available. • Small pedigrees might share common ancestors already within a couple of generations backwards in time

  7. What we do … • Consider a sample of individuals from a population • Genotype data on (possibly linked) markers • Model the pedigree and the gene flow explicitly, applying a construction which proceeds backwards in time • Recombinations modelled based on genetic distance • Non-random mating allowed • Devise an McMC sampler with good mixing properties • For computational reasons, the reconstruction extends only tens of generations backwards in time

  8. … and what we hope to get • Obtain useful summary statistics • E.g. estimates of IBD-probabilities between pairs of sampled individuals • Use the algorithm to perform numerical integration over model unobservables • E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry

  9. The frame of study • Assume that we have fixed • A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10) • N sampled individuals from the current generation • Marker map with M markers and known recombination fractions • Allele frequencies at the population level for each of the markers

  10. A (prior) model for a possible history • A configuration C consists of • a pedigree • allelic paths • Specify probabilities for • Pedigree graph, Pg(C) • Recombination events, Pr(C) • Founder alleles, Pa(C) • The total probability for C is P(C) = Pg(C) × Pr(C) × Pa(C)

  11. A probability model for pedigrees • For fixed • number of generations, T-1, backwards in time • population size in each generation (number of ♂ and ♀) • sample of size N from the current generation • mating parameters α and β • To simulate a pedigree from the distribution • Proceed generation by generation from 0,…,T-1. • Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers. • Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.

  12. Children choosing fathers • Suppose k children have chosen their fathers from among N_m males of the population • Ch(m) is the number of children that have chosen male m • P(child k+1 chooses father m) ∝ α + Ch(m) • Small α implies dominant males • Large α implies that the number of offspring does not vary much between different males
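
The urn rule on slide 12 can be sketched as a sequential draw. This is a minimal illustration, not the authors' implementation: the function name `choose_fathers`, its arguments, and the use of Python's `random` module are assumptions made here.

```python
import random

def choose_fathers(n_children, n_males, alpha, rng=None):
    """Assign fathers one child at a time with
    P(next child picks male m) proportional to alpha + Ch(m).
    (Hypothetical helper; names are not from the slides.)"""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = rng or random.Random()
    counts = [0] * n_males      # Ch(m): children already assigned to male m
    fathers = []
    for _ in range(n_children):
        weights = [alpha + c for c in counts]
        m = rng.choices(range(n_males), weights=weights, k=1)[0]
        counts[m] += 1          # reinforcement: chosen males become more likely
        fathers.append(m)
    return fathers, counts

fathers, counts = choose_fathers(50, 10, alpha=0.1, rng=random.Random(1))
print(counts)
```

With a small `alpha` the counts tend to concentrate on a few males; with a large `alpha` they tend to spread out more evenly, matching the last two bullets of the slide.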

  13. Children choosing mothers • Suppose k children have chosen their mothers from among N_f females of the population • Ch(m,f) is the number of children who have chosen male m and female f as their parents • P(child k+1 chooses mother f | the father of k+1 is m) ∝ Ch(m,f) + β • Small β implies faithful males (monogamy in large populations) • Large β implies random mating
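
The conditional rule on slide 13 can be sketched the same way, taking the fathers as already chosen. The function name `choose_mothers` and its arguments are assumptions for illustration only.

```python
import random

def choose_mothers(fathers, n_females, beta, rng=None):
    """Given each child's father, assign mothers one child at a time with
    P(mother f | father m) proportional to Ch(m, f) + beta.
    (Hypothetical helper; names are not from the slides.)"""
    if beta <= 0:
        raise ValueError("beta must be positive")
    rng = rng or random.Random()
    pair_counts = {}            # Ch(m, f): children already assigned to couple (m, f)
    mothers = []
    for m in fathers:
        weights = [pair_counts.get((m, f), 0) + beta for f in range(n_females)]
        f = rng.choices(range(n_females), weights=weights, k=1)[0]
        pair_counts[(m, f)] = pair_counts.get((m, f), 0) + 1
        mothers.append(f)
    return mothers

print(choose_mothers([0, 0, 0, 1, 1, 2], n_females=5, beta=0.5, rng=random.Random(0)))
```

When `beta` is very small, a father's later children almost always reuse the mother of his first child, which is the "faithful males" regime of the slide.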

  14. Examples with different parameters • Left: a few dominant males + monogamy • Middle: a few dominant males • Right: Random mating

  15. Probability for allelic paths • For each non-founder haplotype in the pedigree form the expression [formula not preserved in this transcript] • Take the product of these over all haplotypes to obtain Pr(C) • Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C) (founders are assumed to be in H-W and linkage equilibrium)
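
The per-haplotype expression on slide 15 is missing from the transcript. One standard form, offered here only as an assumption, treats recombinations in adjacent marker intervals as independent (no interference). The notation is hypothetical: θ_j is the recombination fraction between markers j and j+1, R_j(h) is 1 if haplotype h switches grandparental origin between those markers and 0 otherwise, and p(a) is the population frequency of founder allele a. The original slide may include further factors (for example 1/2 for the grandparental origin at the first marker).

```latex
\[
  P_r(C) \;=\; \prod_{h \,\in\, \text{non-founder haplotypes}} \;
               \prod_{j=1}^{M-1} \theta_j^{R_j(h)} \,(1-\theta_j)^{1-R_j(h)},
  \qquad
  P_a(C) \;=\; \prod_{a \,\in\, \text{founder alleles}} p(a).
\]
```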

  16. Data • Assume that we also have • Genotype data of the sampled individuals on M markers • The (posterior) probability in our model is π(C) ∝ Pg(C) × Pr(C) × Pa(C) × I(C consistent with the data) • We are able to sample efficiently from the prior but not from the posterior

  17. Markov chain Monte Carlo sampling • We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm) • Highly dependent variables (close relatives and linked markers) require large block updates
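
A minimal sketch of a Metropolis-Hastings chain whose target is "prior weight × indicator of consistency with the data", as described on slides 16 and 17. Everything concrete here is an assumption for illustration: the toy integer state space, the uniform (hence symmetric) proposal, and the names `metropolis_hastings`, `prior_weight` and `consistent`. The block proposals on pedigrees used in the talk are not reproduced.

```python
import random

def metropolis_hastings(prior_weight, consistent, states, start, n_iter, rng=None):
    """Sample from pi(x) proportional to prior_weight(x) * I(consistent(x)).
    Assumes prior_weight(x) > 0 for every state."""
    rng = rng or random.Random()
    if not consistent(start):
        raise ValueError("start state must be consistent with the data")

    def target(x):
        # Unnormalised posterior: zero for states that contradict the data.
        return prior_weight(x) if consistent(x) else 0.0

    x = start
    samples = []
    for _ in range(n_iter):
        y = rng.choice(states)            # symmetric proposal: uniform over all states
        ratio = target(y) / target(x)     # Hastings ratio reduces to the target ratio
        if rng.random() < ratio:          # accept with probability min(1, ratio)
            x = y
        samples.append(x)
    return samples

# Toy use: prior weight x + 1 on {0,...,9}; the "data" allow only even states.
samples = metropolis_hastings(lambda x: x + 1, lambda x: x % 2 == 0,
                              list(range(10)), start=0, n_iter=2000,
                              rng=random.Random(0))
print(sorted(set(samples)))
```

Because inconsistent proposals have target value zero, they are never accepted, so the chain stays inside the set of configurations consistent with the data.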

  18. Proposals • Different versions of proposals • A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents • All children of a fixed father/mother choose (possibly new) mother/father and transmit their alleles to her/him • One child at a time chooses parent(s) and transmits alleles • All children within the group jointly choose new parents and transmit alleles • Pedigree is not changed but new allele paths are proposed

  19. Schematic representation of some updates in the MCMC algorithm

  20. Example 1: Relatedness estimation with unlinked markers • Simulated data • 20 generations ago a single founder population divided into 3 population isolates • Our sample contains 10 sibships of 3 individuals from each of the 3 populations (i.e. 90 individuals altogether)

  21. Relatedness matrix estimated from pedigrees

  22. Qualitative reconstruction with dendrogram

  23. Same data analyzed by STRUCTURE • Panels: 3 pop, 10 pop, 30 pop

  24. Real data example: individuals sampled from Eastern and Western Finland: 31 unlinked microsatellite markers

  25. Example 2: The case of linked markers • Simulated pedigree • 10 generations • Youngest generation • 39 individuals divided into • 13 nuclear families • Genotype data • 20 markers / 10 alleles • Recombination fraction 0.05

  26. Reconstruction • We gave the algorithm • The genotype data on the youngest generation • The (correct) marker map • The (correct) allele frequencies • The population structure • The algorithm was run for 500,000 iterations

  27. Reconstructing the pedigree

  28. Reconstructing the haplotypes • The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD) • SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes to the other • If correct haplotypes were (111111,222222) then • (111222,222111) has SD=1 • (112211,221122) has SD=2 • (121212,212121) has SD=5
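
The switch distance on slide 28 can be computed as below. This is a sketch under one assumption not stated on the slide: phase relations are compared between consecutive heterozygous loci, since homozygous loci carry no phase information. The function name `switch_distance` is chosen here for illustration.

```python
def switch_distance(pair_a, pair_b):
    """Number of phase relations between neighbouring heterozygous loci
    that differ between two haplotype pairs with the same genotypes."""
    (a1, a2), (b1, b2) = pair_a, pair_b
    n = len(a1)
    if not (len(a2) == len(b1) == len(b2) == n):
        raise ValueError("all haplotypes must have the same length")
    for i in range(n):
        if sorted((a1[i], a2[i])) != sorted((b1[i], b2[i])):
            raise ValueError("pairs must carry the same genotype at every locus")
    # Orientation at each heterozygous locus: does b1 carry the allele of a1?
    orientation = [b1[i] == a1[i] for i in range(n) if a1[i] != a2[i]]
    # A switch is a change of orientation between consecutive heterozygous loci.
    return sum(1 for u, v in zip(orientation, orientation[1:]) if u != v)

truth = ("111111", "222222")
print(switch_distance(truth, ("111222", "222111")))  # 1
print(switch_distance(truth, ("112211", "221122")))  # 2
print(switch_distance(truth, ("121212", "212121")))  # 5
```

The three calls reproduce the SD values listed on the slide, and the result does not depend on the order in which the two haplotypes of a pair are given.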

  29. Reconstructing the haplotypes • The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)

  30. Reconstructing the IBD sharing • We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time) • It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r) • It is also possible to compare the IBD sharing more accurately along the chromosome
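
The slide does not say how r is computed from a reconstructed configuration. One common convention, assumed here, labels every allele by the founder allele it descends from, takes the kinship at a marker as the fraction of the four cross-individual allele pairs that share a label, and sets r to twice the average kinship over markers (appropriate for non-inbred individuals). The function name `relatedness` and the data layout are hypothetical.

```python
def relatedness(ind_i, ind_j):
    """ind = (hap1, hap2); each haplotype is a sequence of founder-allele
    labels, one per marker. Returns 2 * mean kinship over markers."""
    n_markers = len(ind_i[0])
    if any(len(h) != n_markers for h in (*ind_i, *ind_j)):
        raise ValueError("all haplotypes must cover the same markers")
    total_kinship = 0.0
    for k in range(n_markers):
        # Count IBD pairs among the 2 x 2 allele comparisons at marker k.
        shared = sum(a[k] == b[k] for a in ind_i for b in ind_j)
        total_kinship += shared / 4.0
    return 2.0 * total_kinship / n_markers

# Two individuals sharing both founder alleles at marker 0 and none at marker 1.
sib1 = (["A", "C"], ["B", "D"])
sib2 = (["A", "E"], ["B", "F"])
print(relatedness(sib1, sib2))  # 0.5
```

Evaluating the same per-marker kinship without averaging gives the finer comparison of IBD sharing along the chromosome mentioned in the last bullet.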

  31. Comparison with IBS-based estimators • Distribution of L_2 errors (741 values) • Panel labels: Lynch (1988), Lynch and Ritland (1999), Wang (2002) • Sums: 1.93, 3.25, 3.27, 3.51

  32. Reconstructing IBD

  33. Future work • Possibility of fixing some parts of the pedigree • Extending partially known genotype data to the known pedigree • Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262

  34. Future work • Adding a QTL or phenotype model to the algorithm • Allowing for mutations and considering evolutionary time scales (Ancestral Recombination Graph) • Running many chains in parallel "at different temperatures" • McMcMC with 20 processors achieved a slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
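
Running chains "at different temperatures" is usually organised as Metropolis-coupled MCMC, where chain i targets π(x)^β_i and neighbouring chains occasionally propose to swap states. The sketch below shows only the standard swap-acceptance probability; the name `swap_acceptance` and the inverse temperatures are assumptions, and the slides do not describe how the authors' sampler handles the data-consistency constraint in heated chains.

```python
import math

def swap_acceptance(log_p_i, log_p_j, beta_i, beta_j):
    """Probability of accepting a swap between chain i (state with log
    target log_p_i, inverse temperature beta_i) and chain j.
    Equals min(1, exp((beta_i - beta_j) * (log_p_j - log_p_i)))."""
    log_ratio = (beta_i - beta_j) * (log_p_j - log_p_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# The cold chain (beta = 1) holds the worse state: the swap is always accepted.
print(swap_acceptance(-10.0, -5.0, 1.0, 0.5))
# The cold chain holds the better state: the swap is accepted only sometimes.
print(swap_acceptance(-5.0, -10.0, 1.0, 0.5))
```

Only the chain with inverse temperature 1 samples the posterior itself; the heated chains move more freely and pass good configurations to it through accepted swaps.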

  35. Thanks Matti Pirinen Dario Gasbarra Mikko Sillanpää
