Bayesian Estimators of Time to Most Recent Common Ancestry

Bayesian Estimators ofTime to Most Recent Common Ancestry Bruce Walsh jbwalsh@u.arizona.edu Ecology and Evolutionary Biology BIO 5 Adjunct Appointments Molecular and Cellular Biology Plant Sciences Epidemiology & Biostatistics Animal Sciences

Definitions MRCA - Most Recent Common Ancestor TMRCA - Time to Most Recent Common Ancestor IBD - identical by descent Question: Given molecular marker information from a pair of individuals, what is the estimated time back to their most recent common ancestor? Application in Genealogy -- how closely are (say) Bill Walsh (former football coach) and I related?

Basic concept: The more closely related, the more genetic markers in common we should share With even a small number of highly polymorphic autosomal markers, trivial to assess zero (subject/ biological sample) and one (parent-offspring) MRCA For forensic applications: 13 highly polymorphic autosomal marker loci sufficient for individualization Thus, if issue of paternity ( Bill’s my dad) or a sibling, Easy.

Problems with Autosomal Markers Often we are very interested in MRCAs that are modest (5-10 generations) or large (100’s to 10,000’s of generations) Unlinked autosomal markers simply don’t work over these time scales. Reason: IBD probabilities for individuals sharing a MRCA 5 or more generations ago are extremely small and hence very hard to estimate (need VERY large number of markers).

mtDNA and Y Chromosomes So how can we accurately estimate TMRCA for modest to large number of generations? Answer: Use a set of completely linked markers With autosomes, unlinked markers assort each generation leaving only a small amount of IBD information on each marker, which we must then multiply together. IBD information decays on the order of 1/2 each generation. With completely linked marker loci, information on IBD does not assort away via recombination. IBD information decay is on the order of the mutation rate.

Y chromosome microsatellitemutation rates- I Estimates of human mutation rate in microsatellites are fairly consistent over both the Y and the autosomes

Y chromosome • Since Y = male, sons get their Y chromosome only from their father • Hence, Y chromosomes follow shared paternal lineages • No recombination on the Y, hence all markers passed along as a group • Thus, only way for father and son to differ in their Y genotype is via mutation

Basic Structure of the Problem What is the probability that the two marker alleles at a haploid locus from two related individuals agree given that their MRCA was t generation ago? Phrased another way, what is their probability of identity in state (IBS), given a TMRCA of t generations

Infinite Alleles Model The first step in estimating TMRCA requires us to assume a particular mutational model Our (initial) assumption will be the infinite alleles model (IAM) The key assumption of this model (originally due to Kimura and Crow, 1964) is that each new mutation gives rise to a new allele.

A B (1-u)t (1-u) t Pr(No mutation from MRCA->A) = (1-u)t Pr(No mutation from MRCA->B) = (1-u)t MRCA MRCA Key: Under the infinite alleles, two alleles that are identical in state that are also ibd have not experienced any mutations since their MRCA. Let q(t) = Probability two alleles with a MRCA t generations ago are identical in state If u = per generation mutation rate, then q(t) = (1-u)2t

Building the Likelihood Function for n Loci The probability that k of n marker loci are IBS is just a Binomial distribution with success parameter q(t) Likelihood function for t given k of n matches ( ) For any single marker locus, the probability of IBS given a TMRCA of t generations is q(t) = (1-u)2t ≈ e-2ut = e-t, t = 2ut

MLE for t is solution of ∂ L/∂t = 0 ^ ^ ( ) p = fraction of matches ML Analysis of TMRCA It would seem that we now have all the pieces in hand for a likelihood analysis of TMRCA given the marker data (k of n matches) Likelihood function (t = 2ut)

( Likewise, the precision of this estimator follows for the (negative) inverse of the 2nd derivative of the log-likelihood function evaluated at the MLE, - ^ - Var( t ) = ^ In particular, the MLE for t becomes

What about one-LOD support intervals (95% CI) ? With n=k, the value of the likelihood function is L(t) = (1-u)2tn ≈ e-2tun L has a maximum value of one under the MLE Hence, any value of t that gives a likelihood value of 0.1 or larger is in the one-LOD support interval Solving, the one-LOD support interval is from t=0 to t = (1/2n) [ -Ln(10)/Ln(1-u) ] ≈(1/n) [ Ln(10)/(2u) ] For u = 0.002, CI is (0, 575/n)

What about Hypothesis testing? Again recall that for k =n that the likelihood at t = t0 is L(t0) ≈ Exp(-2t0un) Hence, the LR test statistic for Ho: t = t0 is just LR = -2 ln [ L(t0)/ L(0) ] = -2 ln [ Exp(-2t0un) / 1 ] = 4t0un Thus the probability for the test that TMRCA = t0 is just Pr( c12 > 4t0un)

Trouble in Paradise The ML machinery has seem to have done its job, giving us an estimate, its approximate sampling error, approximate confidence intervals, and a scheme for hypothesis testing. Hence, all seems well. Problem: Look at k=n (= complete match at all markers). MLE (TMRCA) = 0 (independent of n) Var(MLE) = 0 (ouch!) One-LOD support interval for t is (0, 575/n)

The problem(s) with ML The expressions developed for the sampling variance, approximate confidence intervals, and hypothesis testing are all large-sample approximations Problem 1: Here our sample size is the number of markers scored in the two individuals. Not likely to be large (typically 10-60). Problem 2: These expressions are obtained by taking appropriate limits of the likelihood function. If the ML is exactly at the boundary of the admissible space on the likelihood surface, this limit may not formally exist, and hence the above approximations are incorrect.

The solution? “Ain’t Too Proud to Bayes” -- Brad Carlin

Why go Bayesian? Instead of simply estimating a point estimate (e.g., the MLE), the goal is the estimate the entire distributionfor the unknown parameter q given the data x p(q | x) = C * l(x | q) p(q) Why Bayesian? • Exact for any sample size • Marginal posteriors • Efficient use of any prior information

The Prior on TMRCA The first step in any Bayesian analysis is choice of an appropriate prior distribution p(t) -- our thoughts on the distribution of TMRCA in the absence of any of the marker data Standard approach: Use a flat or uninformative prior, with p(t) = a constant over the admissible range of the parameter. Can cause problems if the likelihood function is unbounded (integrates to infinity) In our case, population-genetic theory provides the prior: under very general settings, the time to MRCA for a pair of individuals follows a geometric distribution

Prior hyperparameter = 1/Ne - - - - Previous likelihood function (ignoring constants that cancel when we compute the normalizing factor C) Prior In particular, for a haploid gene, TMRCA follows a geometric distribution with mean 1/Ne. Hence, our prior is just p(t) = l(1-l)t≈l e-lt, where l = 1/Ne Hence, we can use an exponential prior with hyperparameter (the parameter fully characterizing the distribution) l = 1/Ne. The posterior thus becomes

The Normalizing constant where I ensures that the posterior distribution integrates to one, and hence is formally a probability distribution

What is the effect of the hyperparameter? If 2uk >> l, then essentially no dependence on the actual value of l chosen. Hence, if 2Neuk >> 1, essentially no dependence on (hyperparameter) assumptions of the prior. For a typical microsatellite rate of u = 0.002, this is just Nek >> 250, which is a very weak assumption. For example, with k =10 matches, Ne >> 25. Even with only 1 match (k=1), just require Ne >> 250.

- - - - = - - - - - Closed-form Solutions for the Posterior Distribution Complete analytic solutions to the prior can be obtained by using a series expansion (of the (1-ex)n term) to give Each term is just a * ebt, which is easily integrated

- - - - - - With the assumption of a flat prior, l = 0, this reduces to

- - Hence, the complete analytic solution of the posterior is Suppose k = n (no mismatches) In this case, the prior is simply an exponential distribution with mean 2un + l,

- - - Analysis of n = k case (complete match at n markers) Mean TMRCA and its variance: Cumulative probability: In particular, the time Ta satisfying P(t < Ta) = a is

For a flat prior (l = 0), the 95% (one-side) confidence interval is thus given by -ln(0.5)/(2nu) ≈ 1.50/(nu) Hence, under a Bayesian analysis for u = 0.002, the 95% upper confidence interval is given by ≈ 749/n Recall that the one-LOD support interval (approximate 95% CI) under an ML analysis is ≈ 575/n The ML solution’s asymptotic approximation significantly underestimates the true interval relative to the exact analysis under a full Bayesian model

Posteriors for n = 10 Posteriors for n = 20 Posteriors for n = 50 n = 100 93 n = 100 0.30 92 0.060 94 91 100 90 0.25 0.050 99 0.20 0.040 p( t | k ) p( t | k ) 98 89 0.15 97 0.030 96 95 0.10 0.020 0.05 0.010 0.00 0.000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 5 10 15 20 25 30 35 40 45 50 55 60 65 Time t to MRCA Time t to MRCA Sample Posteriors for u = 0.002

Key points • By using markers on a non recombining chromosomal section, we can estimate TMRCA over much, much longer time scales than with unlinked autosomal markers • By using the appropriate number of markers we can get accurate estimates for TMRCA for even just a few generations. 20-50 markers will do. • Hence, we have a fairly large window of resolution for TMRCA when using a modest number of completely linked markers.

Extensions. I: Different Mutation Rates Let marker locus k have mutation rate uk. Code the observations as xk = 1 if a match, otherwise code xk = 0 The posterior becomes:

Mutation 1 Mutation 2 Under IAM, score as a match, and hence no mutations In reality, there are two mutations Stepwise Mutation Model (SMM) Microsatellite allelic variants are scored by their number of repeat units. Hence, two “matching” alleles can actually hide multiple mutations (and hence more time to the MRCA) The Infinite alleles model (IAM) is not especially realistic with microsatellite data, unless the fraction of matches is very high.

Y chromosome microsatellitemutation rates- II The SMM model is an attempt to correct for multiple hits by fully accounting for the mutational structure. Good fit to array sizes in natural populations when assuming the symmetric single-step model • Equal probability of a one-step move up or down In direct studies of (Y chromosome) microsatellites 35 of 37 dectected mutations in pedigrees were single step, other 2 were two-step

- - Formally, the SMM model assumes the following transition probabilities Note that two alleles can match only if they have experienced an even number of mutations in total between them. In such cases, the match probabilities become

SMM0 model -- match/no match under SMM The simplest implementation of the SMM model is to simply replace the match probabilities q(t) under the IAM model with those for the SMM model. This simply codes the marker loci as match / no match We refer to this as the SMMO model

- - The zero-order modifed Type I Bessel Function Hence,

Under the SMM model, the prior hyperparameter can now become important. This is the case when the number n of markers is small and/or k/n is not very close to one Why? Under the prior, TMRCA is forced by a geometric with 1/Ne. Under the IAM model for most values this is still much more time than the likelihood function predicts from the marker data Under the SMM model, the likelihood alone predicts a much longer value so that the forcing effect of the initial geometric can come into play

n =5, k = 3, u = 0.002 IAM, both flat prior and Ne = 5000 SSMO, Ne = 5000 SSMO, flat prior Pr(TMRCA < t) Prior with Ne =5000 Time, t

The jth-order modifed Type I Bessel Function An Exact Treatment: SMME With a little work we can show that the probability two sites differ by j steps is just The resulting likelihood thus becomes Where nj is the number of sites that differ by k (observed) steps

- Number of exact matches Number of k steps differences With this likelihood, the resulting prior becomes This rearranges to give the general posterior under the Exact SMM model (SMME) as

Example Consider comparing haplotypes 1 and 3 from Thomas et al.’s (2000) study on the Lemba and Cohen Y chromosome modal haplotypes. Here six markers used, four exactly match, one differs by one repeat, the other by two repeats Hence, n = 6, k = 4 for IAM and SMM0 models n0 = 4, n1 = 1, n2 = 1, n = 6 under SMME model Assume Hammer’s value of Ne=5000 for the prior

IAM SMM0 P(t | markers) SMME Time to MRCA, t TMRCA for Lemba and Cohen Y

IAM SMM0 Pr(TMRCA < t) SMME Time to MRCA, t

Summary • Sets of non-recombining markers are optimal for estimation of TMRCA • Y (paternal lineage), mtDNA (maternal) • Simple bayesian estimators are easily developed and can be extended to allow for more realistic mutation models.

Bayesian Estimators of Time to Most Recent Common Ancestry