Estimating Genealogies from Marker Data

Estimating Genealogies from Marker Data • Dario Gasbarra, Matti Pirinen, Mikko Sillanpää, Elja Arjas • SCB Workshop, 14.12.2006, Isaac Newton Institute, Cambridge • Biometry Group, Department of Mathematics and Statistics, University of Helsinki

Outline of the presentation • The problem • Description of the method • Probability model • Computational aspects • Example 1 • unlinked markers • relatedness estimation (with a pedigree) • Gasbarra et al. (2006): Estimating Genealogies from Unlinked Marker Data: a Bayesian Approach (under revision) • Example 2 • linked markers • haplotyping • relatedness estimation (with IBD alleles) • Gasbarra et al. (2006): Estimating Genealogies from Linked Marker Data: a Bayesian Approach (in preparation)

A basic question in statistical genetics • Consider a population evolving in time • Inverse problem • The current state of the process is known • individuals alive at the moment • What was the path leading to this state? • family structures (pedigree) • inheritance patterns

Why is the recent past important? • Relatedness estimation • In which parts of the genome does a group of individuals share alleles that are identical by descent (IBD)? • gene mapping • Haplotyping • Ancestral meioses have formed the haplotypes of the contemporary individuals

Current methods on KNOWN pedigrees • Exact calculations on known pedigrees • Elston-Stewart algorithm • A few markers, not too complex pedigrees • Lander-Green algorithm • Small pedigrees, many markers • Approximate calculations on known pedigrees • MCMC methods (e.g. Simwalk2 [Sobel et al.], Loki [Heath])

What if the pedigree is not known? • There may be only partial pedigree data available. • Small pedigrees might share common ancestors already within a couple of generations backwards in time

What we do … • Consider a sample of individuals from a population • Genotype data on (possibly linked) markers • Model the pedigree and the gene flow explicitly, using a construction that proceeds backwards in time • Recombinations modelled based on genetic distance • Non-random mating allowed • Devise an MCMC sampler with good mixing properties • For computational reasons, the reconstruction extends only tens of generations backwards in time

… and what we hope to get • Obtain useful summary statistics • E.g. estimates of IBD probabilities between pairs of sampled individuals • Use the algorithm to perform numerical integration over model unobservables • E.g. in gene mapping, when combined with a phenotype model, to account for shared ancestry

The frame of study • Assume that we have fixed • A population whose size we know for T-1 (non-overlapping) generations backwards in time (T~10) • N sampled individuals from the current generation • Marker map with M markers and known recombination fractions • Allele frequencies at the population level for each of the markers

A (prior) model for a possible history • A configuration C consists of • a pedigree • allelic paths • Specify probabilities for • Pedigree graph, Pg(C) • Recombination events, Pr(C) • Founder alleles, Pa(C) • The total probability for C is P(C) = Pg(C) x Pr(C) x Pa(C)

A probability model for pedigrees • For fixed • number of generations, T-1, backwards in time • population size in each generation (number of ♂ and ♀) • sample of size N from the current generation • mating parameters α and β • To simulate a pedigree from the distribution • Proceed generation by generation from 0,…,T-1 • Let children choose parents according to a Pólya urn scheme, where α affects the correlation of choices of fathers and β affects the correlation of choices of mothers given the choices of fathers • Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.

Children choosing fathers • Suppose k children have chosen their fathers from among N_m males of the population • Ch(m) is the number of children that have chosen male m • P(k+1 chooses father m) ~ α + Ch(m) • Small α implies dominant males • Large α implies that the number of offspring does not vary much between different males
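
The father-choice rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it reads "~" as "proportional to", assumes α > 0, and the names `choose_fathers`, `n_children`, `n_males` and `rng` are hypothetical.

```python
import random

def choose_fathers(n_children, n_males, alpha, rng=None):
    """Sequentially assign a father to each child with a Polya urn rule.

    Child k+1 picks male m with probability proportional to
    alpha + Ch(m), where Ch(m) counts the earlier children of m.
    Assumes alpha > 0 (illustrative sketch only).
    """
    rng = rng or random.Random()
    ch = [0] * n_males            # Ch(m): children who already chose male m
    fathers = []
    for _ in range(n_children):
        weights = [alpha + c for c in ch]
        m = rng.choices(range(n_males), weights=weights, k=1)[0]
        ch[m] += 1
        fathers.append(m)
    return fathers

# Small alpha concentrates offspring on a few males,
# large alpha spreads them almost uniformly.
few_fathers = choose_fathers(50, 10, alpha=0.01, rng=random.Random(1))
many_fathers = choose_fathers(50, 10, alpha=1000.0, rng=random.Random(1))
print(len(set(few_fathers)), len(set(many_fathers)))
```

With a small α the first few choices are reinforced strongly, which is the "dominant males" regime on the slide; with a large α the counts Ch(m) barely change the weights.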

Children choosing mothers • Suppose k children have chosen their mothers from among N_f females of the population • Ch(m,f) is the number of children who have chosen male m and female f as their parents • P(k+1 chooses mother f | the father of k+1 is m) ~ β + Ch(m,f) • Small β implies faithful males (monogamy in large populations) • Large β implies random mating
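
Given the fathers, the mother-choice step can be sketched the same way. Again this is only an illustration under stated assumptions: "~" is read as "proportional to", β > 0, and `choose_mothers`, `fathers` and `n_females` are hypothetical names.

```python
import random
from collections import defaultdict

def choose_mothers(fathers, n_females, beta, rng=None):
    """Assign a mother to each child, given the already chosen fathers.

    A child with father m picks female f with probability proportional
    to beta + Ch(m, f), where Ch(m, f) counts the earlier children of
    the couple (m, f). Assumes beta > 0 (illustrative sketch only).
    """
    rng = rng or random.Random()
    ch = defaultdict(lambda: [0] * n_females)   # ch[m][f] = Ch(m, f)
    mothers = []
    for m in fathers:
        weights = [beta + c for c in ch[m]]
        f = rng.choices(range(n_females), weights=weights, k=1)[0]
        ch[m][f] += 1
        mothers.append(f)
    return mothers

# One male with 30 children: a tiny beta keeps him with one partner,
# a huge beta makes each choice close to uniform over the females.
faithful = choose_mothers([0] * 30, 10, beta=1e-9, rng=random.Random(2))
random_mating = choose_mothers([0] * 30, 10, beta=1e6, rng=random.Random(2))
print(len(set(faithful)), len(set(random_mating)))
```

Note that the rule only reinforces a male's earlier choice of partner; nothing in this sketch prevents the same female from being chosen by children of different males.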

Examples with different parameters • Left: a few dominant males + monogamy • Middle: a few dominant males • Right: Random mating

Probability for allelic paths • For each non-founder haplotype in the pedigree form the expression • Take the product of these over all haplotypes to obtain Pr(C) • Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C) (founders are assumed to be in Hardy-Weinberg and linkage equilibrium)
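
The per-haplotype expression itself is not reproduced in the slide text. One common form for such a transmission probability, given here only as an assumption, treats recombination in adjacent marker intervals as independent (no interference). The symbols are introduced for illustration and are not taken from the slides: θ_j is the known recombination fraction between markers j and j+1, r_j(h) is 1 if the allelic path of haplotype h switches grandparental origin in that interval and 0 otherwise, and the factor 1/2 accounts for the grandparental origin at the first marker.

```latex
P(h) \;=\; \frac{1}{2} \prod_{j=1}^{M-1} \theta_j^{\,r_j(h)} \,(1-\theta_j)^{\,1-r_j(h)},
\qquad
P_r(C) \;=\; \prod_{h \,\in\, \text{non-founder haplotypes}} P(h)
```

Under this form, a haplotype transmitted without any recombination over M markers with a common θ contributes (1/2)(1-θ)^{M-1}.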

Data • Assume that we also have • Genotype data of the sampled individuals on M markers • The (posterior) probability in our model is π(C) ~ Pg(C) x Pr(C) x Pa(C) x I(C consistent with the data) • We are able to sample efficiently from the prior but not from the posterior

Markov chain Monte Carlo sampling • We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm) • Highly dependent variables (close relatives and linked markers) require large block updates
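
The Metropolis-Hastings idea on this slide can be shown on a toy state space. The sketch below is not the sampler from the talk: it replaces pedigree configurations with a handful of labelled states, uses a uniform (symmetric) proposal instead of block updates, and the names `metropolis_hastings`, `prior` and `consistent` are hypothetical. The target is proportional to prior × indicator(consistent with the data), as in the posterior above.

```python
import random

def metropolis_hastings(prior, consistent, start, n_iter, rng=None):
    """Sample from a target proportional to prior[c] * I(consistent(c)).

    prior      : dict mapping each state to a positive prior weight
    consistent : function returning True for states compatible with the data
    start      : a consistent starting state
    Proposal: uniform over all states (symmetric), so the acceptance
    probability reduces to min(1, target(y) / target(x)).
    """
    rng = rng or random.Random()
    if not consistent(start):
        raise ValueError("the chain must start in a consistent state")
    states = list(prior)
    x = start
    samples = []
    for _ in range(n_iter):
        y = rng.choice(states)
        target_y = prior[y] if consistent(y) else 0.0   # inconsistent: never accepted
        if rng.random() < min(1.0, target_y / prior[x]):
            x = y
        samples.append(x)
    return samples

# Toy example: four states, only the even ones are "consistent with the data".
prior = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
draws = metropolis_hastings(prior, lambda c: c % 2 == 0, start=0,
                            n_iter=20000, rng=random.Random(0))
print(draws.count(2) / len(draws))   # should be near 0.3 / (0.1 + 0.3)
```

Because inconsistent proposals are rejected outright, the chain never leaves the consistent set. In the real problem that set is tiny relative to the prior's support, which is why proposals that rebuild whole blocks of the configuration consistently are needed.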

Proposals • Different versions of proposals • A (randomly chosen) group of children chooses (possibly new) parents and transmits their alleles to these parents • All children of a fixed father/mother choose (possibly new) mother/father and transmit their alleles to her/him • One child at a time chooses parent(s) and transmits alleles • All children within the group jointly choose new parents and transmit alleles • Pedigree is not changed but new allele paths are proposed

Schematic representation of some updates in the MCMC algorithm

Example 1: Relatedness estimation with unlinked markers • Simulated data • 20 generations ago a single founder population divided into 3 population isolates • Our sample contains 10 sibships of 3 individuals from each of the 3 populations (i.e. 90 individuals altogether)

Relatedness matrix estimated from pedigrees

Qualitative reconstruction with dendrogram

Same data analyzed by STRUCTURE • Panels: 3 pop, 10 pop, 30 pop

Real data example: individuals sampled from Eastern and Western Finland: 31 unlinked microsatellite markers

Example 2: The case of linked markers • Simulated pedigree • 10 generations • Youngest generation • 39 individuals divided into • 13 nuclear families • Genotype data • 20 markers / 10 alleles • Recombination fraction 0.05

Reconstruction • We gave the algorithm • The genotype data on the youngest generation • The (correct) marker map • The (correct) allele frequencies • The population structure • The algorithm was run for 500,000 iterations

Reconstructing the pedigree

Reconstructing the haplotypes • The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD) • The SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the second • If the correct haplotypes were (111111,222222), then • (111222,222111) has SD=1 • (112211,221122) has SD=2 • (121212,212121) has SD=5
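
The switch distance defined above can be computed with a short function. This is a minimal sketch, assuming haplotypes are given as equal-length strings of allele labels and that, as is usual for this measure, only heterozygous loci carry phase information; the name `switch_distance` is hypothetical.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase switches between neighboring heterozygous loci
    needed to turn one haplotype pair into the other.

    Both pairs must describe the same unordered genotype at every locus.
    """
    h1, h2 = true_pair
    g1, g2 = est_pair
    if not (len(h1) == len(h2) == len(g1) == len(g2)):
        raise ValueError("all haplotypes must have the same length")
    phase = []
    for i in range(len(h1)):
        if sorted((h1[i], h2[i])) != sorted((g1[i], g2[i])):
            raise ValueError(f"genotypes differ at locus {i}")
        if h1[i] != h2[i]:                  # homozygous loci carry no phase
            phase.append(g1[i] == h1[i])    # does g1 follow h1 at this locus?
    # A switch occurs wherever consecutive heterozygous loci disagree in phase.
    return sum(a != b for a, b in zip(phase, phase[1:]))

truth = ("111111", "222222")
print(switch_distance(truth, ("111222", "222111")))
print(switch_distance(truth, ("112211", "221122")))
print(switch_distance(truth, ("121212", "212121")))
```

The three calls reproduce the slide's examples and print 1, 2 and 5. The measure does not depend on the order in which the two haplotypes of a pair are listed.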

Reconstructing the haplotypes • The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)

Reconstructing the IBD sharing • We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time) • It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r) • It is also possible to compare the IBD sharing more accurately along the chromosome

Comparison with IBS-based estimators • Distribution of L_2 errors (741 values) • Estimators: Lynch (1988), Lynch and Ritland (1999), Wang (2002) • Sums: 1.93, 3.25, 3.27, 3.51

Reconstructing IBD

Future work • Possibility of fixing some parts of the pedigree • Extending partially known genotype data to the known pedigree • Pirinen, Gasbarra (2006): Finding consistent gene transmission patterns on large and complex pedigrees. IEEE Trans. Comp. Biol. Bioinf. 3:252-262

Future work • Adding a QTL or phenotype model to the algorithm • Allowing for mutations and considering evolutionary time scales (Ancestral Recombination Graph) • Running many chains in parallel "at different temperatures" • McMcMC with 20 processors achieved a slightly better accuracy in 12 hours (of wall-clock time) than a single processor in 5 days
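
Assuming McMcMC here means Metropolis-coupled MCMC (parallel tempering), the coupling between chains can be written down in its standard form. The notation is introduced for illustration and is not from the slides: chain i has temperature τ_i ≥ 1 and targets π(C)^{1/τ_i}, with τ_1 = 1 for the chain that samples the posterior itself. A proposed exchange of the current configurations C_i and C_j between chains i and j is accepted with probability

```latex
a_{ij} \;=\; \min\!\left\{ 1,\;
\left( \frac{\pi(C_j)}{\pi(C_i)} \right)^{\frac{1}{\tau_i} - \frac{1}{\tau_j}} \right\}
```

Heated chains (large τ) see a flattened target and move more freely between configurations, and accepted swaps pass those moves down to the τ = 1 chain.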

Thanks • Matti Pirinen • Dario Gasbarra • Mikko Sillanpää
