Estimating Genealogies from Marker Data Dario Gasbarra Matti Pirinen Mikko Sillanpää Elja Arjas SCB Workshop 14.12.2006 Isaac Newton Institute, Cambridge Biometry Group Department of Mathematics and Statistics University of Helsinki
Outline of the presentation • The Problem • Description of the method • Probability model • Computational aspects • Example 1 • unlinked markers • relatedness estimation (with a pedigree) • Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision) • Example 2 • linked markers • haplotyping • relatedness estimation (with IBD-alleles) • Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (under preparation)
A Basic question in statistical genetics • Consider a population evolving in time • Inverse problem • Current state of the process is known • individuals alive at the moment • What was the path leading to this state? • family structures (pedigree) • inheritance patterns
Why is the recent past important? • Relatedness estimation • In which parts of the genome a group of individuals share alleles (identical-by-descent)? • gene mapping • Haplotyping • Ancestral meioses have formed the haplotypes of the contemporary individuals
Current methods on KNOWN pedigrees • Exact calculations on known pedigrees • Elston-Stewart algorithm • A few markers, not too complex pedigrees • Lander-Green algorithm • Small pedigrees, many markers • Approximate calculations on known pedigrees • McMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])
What if the pedigree is not known? • There may be only partial pedigree data available. • Small pedigrees might share common ancestors already within a couple of generations backwards in time
What we do … • Consider a sample of individuals from a population • Genotype data on (possibly linked) markers • Model the pedigree and the gene flow explicitly, applying a construction which proceeds backwards in time • Recombinations modelled based on genetic distance • Non-random mating allowed • Devise an McMC sampler with good mixing properties • For computational reasons, the reconstruction extends only tens of generations backwards in time
… and what we hope to get • Obtain useful summary statistics • E.g. estimates of IBD-probabilities between pairs of sampled individuals • Use the algorithm to perform numerical integration over model unobservables • E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry
The frame of study • Assume that we have fixed • A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10) • N sampled individuals from the current generation • Marker map with M markers and known recombination fractions • Allele frequencies at the population level for each of the markers
A (prior) model for a possible history • A configuration C consists of • a pedigree • allelic paths • Specify probabilities for • Pedigree graph, Pg(C) • Recombination events, Pr(C) • Founder alleles, Pa(C) • The total probability for C is P(C) = Pg(C) x Pr(C) x Pa(C)
A probability model for pedigrees • For fixed • number of generations, T-1, backwards in time • population size in each generation (number of ♂ and ♀) • sample of size N from the current generation • mating parameters α and β • To simulate a pedigree from the distribution • Proceed generation by generation from 0,…,T-1 • Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers • Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.
Children choosing fathers • Suppose k children have chosen their fathers from among N_m males of the population • Ch(m) is the number of children that have chosen male m • P(k+1 chooses father m) ~ α + Ch(m) • Small α implies dominant males • Large α implies that the number of offspring does not vary much between different males
Children choosing mothers • Suppose k children have chosen their mothers from among N_f females of the population • Ch(m,f) is the number of children who have chosen male m and female f as their parents • P(k+1 chooses mother f | the father of k+1 is m) ~ Ch(m,f) + β • Small β implies faithful males (monogamy in large populations) • Large β implies random mating
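The two urn steps above can be sketched in a few lines of Python. This is only an illustration of the stated proportionalities, P(father = m) ~ α + Ch(m) and P(mother = f | father = m) ~ β + Ch(m,f), for a single generation. The function name `choose_parents` and its arguments are hypothetical, and α, β are assumed to be strictly positive.

```python
import random

def choose_parents(n_children, n_males, n_females, alpha, beta, rng=None):
    """Sequentially assign a (father, mother) pair to each child (Polya-urn weights)."""
    rng = rng or random.Random()
    ch_m = [0] * n_males   # Ch(m): number of children who have chosen male m so far
    ch_mf = {}             # Ch(m, f): number of children who have chosen couple (m, f)
    parents = []
    for _ in range(n_children):
        # Father: weight alpha + Ch(m) for each male m
        m = rng.choices(range(n_males), weights=[alpha + c for c in ch_m])[0]
        # Mother given father m: weight beta + Ch(m, g) for each female g
        mother_weights = [beta + ch_mf.get((m, g), 0) for g in range(n_females)]
        f = rng.choices(range(n_females), weights=mother_weights)[0]
        ch_m[m] += 1
        ch_mf[(m, f)] = ch_mf.get((m, f), 0) + 1
        parents.append((m, f))
    return parents

# Example call: 30 children, 5 males, 6 females
parents = choose_parents(30, 5, 6, alpha=0.5, beta=0.5, rng=random.Random(1))
```

With a very small β, a male's later children almost always reuse the mother chosen by his first child, which is the "faithful males" regime described on the slide.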
Examples with different parameters • Left: a few dominant males + monogamy • Middle: a few dominant males • Right: Random mating
Probability for allelic paths • For each non-founder haplotype in the pedigree form the expression • Take the product of these over all haplotypes to obtain Pr(C) • Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C) (founders are assumed to be in H-W and linkage equilibrium)
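The per-haplotype expression is not preserved in this transcript. As a rough sketch only, the code below assumes a standard no-interference form: the first marker is copied from either parental haplotype with probability 1/2, and each interval between adjacent markers contributes θ if the copied parental haplotype switches and 1 - θ otherwise. The function names and data layout are hypothetical and may differ from the authors' exact expression.

```python
import math

def log_recomb_prob(origins, thetas):
    """Log-probability of one transmitted (non-founder) haplotype.

    origins[j] in {0, 1}: which parental haplotype marker j was copied from.
    thetas[j]: recombination fraction between markers j and j+1 (assumed known).
    """
    if len(thetas) != len(origins) - 1:
        raise ValueError("need one recombination fraction per marker interval")
    logp = math.log(0.5)  # assumed: either parental haplotype equally likely at marker 1
    for a, b, theta in zip(origins, origins[1:], thetas):
        logp += math.log(theta if a != b else 1.0 - theta)
    return logp

def log_founder_prob(founder_alleles, freqs):
    """Log of Pa(C): product of population frequencies of all founder alleles.

    founder_alleles: list of (marker_index, allele) pairs.
    freqs[marker_index][allele]: population allele frequency.
    """
    return sum(math.log(freqs[m][a]) for m, a in founder_alleles)

# One recombination across two intervals with theta = 0.05
lp = log_recomb_prob([0, 0, 1], [0.05, 0.05])
```

Summing `log_recomb_prob` over all non-founder haplotypes would give log Pr(C) under these assumptions; working in logs avoids underflow when many haplotypes and markers are multiplied together.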
Data • Assume that we also have • Genotype data of the sampled individuals on M markers • The (posterior) probability in our model is π(C) ∝ Pg(C) x Pr(C) x Pa(C) x I(C is consistent with the data) • We are able to sample efficiently from the prior but not from the posterior
Markov chain Monte Carlo sampling • We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm) • Highly dependent variables (close relatives and linked markers) require large block updates
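A generic Metropolis-Hastings loop for a target of the form prior x indicator(consistent with data) might look as follows. This is a toy sketch, not the sampler used in the talk: the callables `log_prior`, `consistent` and `propose` are hypothetical placeholders, and the usage example runs on six integer states rather than on pedigree configurations.

```python
import math
import random

def metropolis_hastings(init, log_prior, consistent, propose, n_iter, rng=None):
    """propose(state, rng) must return (candidate, log_q_forward, log_q_backward)."""
    rng = rng or random.Random()
    if not consistent(init):
        raise ValueError("initial state must be consistent with the data")
    state, samples = init, []
    for _ in range(n_iter):
        cand, log_q_fwd, log_q_bwd = propose(state, rng)
        # Inconsistent candidates have zero posterior mass and are always rejected
        if consistent(cand):
            log_ratio = log_prior(cand) - log_prior(state) + log_q_bwd - log_q_fwd
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                state = cand
        samples.append(state)
    return samples

# Toy usage: states 0..5, prior weight x + 1, state 3 treated as "inconsistent"
samples = metropolis_hastings(
    init=0,
    log_prior=lambda x: math.log(x + 1),
    consistent=lambda x: x != 3,
    propose=lambda x, r: (r.randrange(6), 0.0, 0.0),  # symmetric uniform proposal
    n_iter=20000,
    rng=random.Random(0),
)
```

Because the chain starts in a consistent state and never accepts an inconsistent one, its state space is exactly the set of configurations consistent with the data, as the slide requires.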
Proposals • Different versions of proposals • A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents • All children of a fixed father/mother choose (possibly new) mother/father and transmit their alleles to her/him • One child at a time chooses parent(s) and transmits alleles • All children within the group jointly choose new parents and transmit alleles • Pedigree is not changed but new allele paths are proposed
Schematic representation of some updates in the MCMC algorithm
Example 1: Relatedness estimation with unlinked markers • Simulated data • 20 generations ago a single founder population divided into 3 population isolates • Our sample contains 10 sibships of 3 individuals from each of the 3 populations (i.e. 90 individuals altogether)
Same data analyzed by STRUCTURE (panels: 3 populations, 10 populations, 30 populations)
Real data example: individuals sampled from Eastern and Western Finland: 31 unlinked microsatellite markers
Example 2: The case of linked markers • Simulated pedigree • 10 generations • Youngest generation • 39 individuals divided into • 13 nuclear families • Genotype data • 20 markers / 10 alleles • Recombination fraction 0.05
Reconstruction • We gave the algorithm • The genotype data on the youngest generation • The (correct) marker map • The (correct) allele frequencies • The population structure • The algorithm was run for 500,000 iterations
Reconstructing the haplotypes • The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD) • SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes to the other • If correct haplotypes were (111111,222222) then • (111222,222111) has SD=1 • (112211,221122) has SD=2 • (121212,212121) has SD=5
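The switch distance defined above can be computed directly. The sketch below follows the slide's definition and additionally assumes that only heterozygous loci are compared, since homozygous loci carry no phase information; the function name is hypothetical.

```python
def switch_distance(true_pair, est_pair):
    """Count phase switches between neighbouring heterozygous loci."""
    t1, t2 = true_pair
    e1, e2 = est_pair
    orientation = []
    for a, b, c, d in zip(t1, t2, e1, e2):
        if sorted((a, b)) != sorted((c, d)):
            raise ValueError("haplotype pairs imply different genotypes")
        if a != b:
            # True if the first estimated haplotype matches the first true haplotype here
            orientation.append(c == a)
    # Each change of orientation between consecutive heterozygous loci is one switch
    return sum(x != y for x, y in zip(orientation, orientation[1:]))

truth = ("111111", "222222")
print(switch_distance(truth, ("111222", "222111")))  # 1
print(switch_distance(truth, ("112211", "221122")))  # 2
print(switch_distance(truth, ("121212", "212121")))  # 5
```

The three calls reproduce the SD values listed on the slide.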
Reconstructing the haplotypes • The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)
Reconstructing the IBD sharing • We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time) • It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r) • It is also possible to compare the IBD sharing more accurately along the chromosome
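The slides do not give the formula for r. One common convention, assumed here, is r = 2 x (kinship coefficient), where the kinship at a marker is the probability that one allele drawn at random from each individual is IBD. In the sketch below each allele is represented by the label of the founder allele it traces back to; the function name and this data layout are hypothetical.

```python
from itertools import product

def relatedness(ind_a, ind_b):
    """Coefficient of relatedness from founder-allele labels.

    ind_a, ind_b: lists over markers of (label_1, label_2), where equal labels
    mean the two alleles descend from the same founder allele (IBD).
    """
    if len(ind_a) != len(ind_b) or not ind_a:
        raise ValueError("need the same non-empty set of markers for both individuals")
    kinship_sum = 0.0
    for ga, gb in zip(ind_a, ind_b):
        ibd_pairs = sum(x == y for x, y in product(ga, gb))  # among the 4 allele pairs
        kinship_sum += ibd_pairs / 4.0
    return 2.0 * kinship_sum / len(ind_a)

# Parent and child share exactly one founder allele at every marker
parent = [(1, 2), (5, 6)]
child = [(1, 3), (6, 7)]
print(relatedness(parent, child))  # 0.5
```

Averaging over markers gives the single genome-wide quantity; keeping the per-marker values instead would give the finer comparison along the chromosome mentioned in the last bullet.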
Comparison with IBS-based estimators • Distribution of L_2 errors (741 values) • IBS-based estimators: Lynch (1988), Lynch and Ritland (1999), Wang (2002) • Sums: 1.93, 3.25, 3.27, 3.51
Future work • Possibility of fixing some parts of the pedigree • Extending partially known genotype data to the known pedigree • Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262
Future work • Adding a QTL or phenotype model to the algorithm • Allowing for mutations and considering evolutionary time scales (Ancestral Recombination Graph) • Running many chains in parallel "at different temperatures" • McMcMC with 20 processors achieved a slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
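Running chains "at different temperatures" usually refers to Metropolis-coupled MCMC, in which chain i targets the posterior raised to a power β_i and neighbouring chains occasionally exchange states. The sketch below shows only the standard swap step under that assumption. The names are hypothetical, the inverse temperatures here are unrelated to the mating parameter β, and the slides do not describe the authors' actual implementation.

```python
import math
import random

def swap_accept_prob(log_post_i, log_post_j, beta_i, beta_j):
    """Acceptance probability for exchanging the states of chains i and j.

    Chain k targets posterior ** beta_k; log_post_k is the untempered
    log-posterior of the state currently held by chain k.
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def maybe_swap(states, log_posts, betas, i, j, rng):
    """Attempt one swap between chains i and j; returns True if accepted."""
    p = swap_accept_prob(log_posts[i], log_posts[j], betas[i], betas[j])
    if rng.random() < p:
        states[i], states[j] = states[j], states[i]
        log_posts[i], log_posts[j] = log_posts[j], log_posts[i]
        return True
    return False

# The cold chain (beta = 1) always accepts a better state from a hotter chain
p_up = swap_accept_prob(-10.0, -5.0, beta_i=1.0, beta_j=0.5)
p_down = swap_accept_prob(-5.0, -10.0, beta_i=1.0, beta_j=0.5)
```

Only samples from the chain with inverse temperature 1 would be used for inference; the hotter chains exist to help it move between modes.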
Thanks Matti Pirinen Dario Gasbarra Mikko Sillanpää