Lecture 8: Linkage Analysis I Date: 9/19/02 • General likelihood method for phase-known gametes • Backcross, F2 variants, mixed crosses • Statistical properties of the estimate of q, the recombination fraction
Limitations of Partitioned Test Statistics • Test statistics can be partitioned only when: • single mating type in all crosses (families) • same genotypes in all crosses (families) • In addition, the partitioning becomes cumbersome when the number of loci is large, as there is one partition for every locus.
General Likelihood Method • Example genotypes: locus A, Gj = A1A2; locus B, Gk = B1B1 • c = number of crosses (or families) • nAi = number of genotypes at locus A in cross i • nBi = number of genotypes at locus B in cross i • fijk = observed count of genotype jk in cross i • pijk = expected frequency of genotype jk in cross i, a function of qi
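The quantities defined on this slide combine into a single log likelihood. A minimal sketch, assuming the genotype counts within each cross are multinomial, the crosses are independent, and the constant multinomial term is dropped:

```latex
L(q_1,\dots,q_c) \;=\; \sum_{i=1}^{c}\;\sum_{j=1}^{n_{Ai}}\;\sum_{k=1}^{n_{Bi}} f_{ijk}\,\ln p_{ijk}(q_i)
```

Under the hypothesis of homogeneous linkage, all q_i are set to a common q and the same expression is maximized over that single parameter.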
A Test for Heterogeneity in Linkage • Perhaps the first thing you want to do is check for heterogeneity in linkage. If there is no heterogeneity, then the crosses (families) can be pooled. • The likelihood ratio (LR) test requires the MLE of q for each cross and for the pooled data, which makes it more tedious than a goodness-of-fit statistic in cases where the latter also applies.
Finding the MLE: General Approach • Only in the backcross (BC) is an analytic MLE available. • In general, numeric methods are required. After listing all observable genotypes/phenotypes, the following are needed for each: • List the observed counts. • Calculate the expected frequency in terms of q. • For EM, also need Pi(recombination | genotype). • For Newton-Raphson (NR) or confidence intervals, need the information I(q).
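The analytic backcross case can be sketched in a few lines. This assumes phase-known gametes, so every offspring is scored as parental or recombinant; the function name and the counts (160 parental, 40 recombinant) are hypothetical and chosen only for illustration.

```python
import math


def backcross_mle(n_parental, n_recombinant):
    """Return (q_hat, G) for a phase-known backcross.

    q_hat is the analytic MLE (recombinants / N) and G is the
    likelihood ratio statistic 2[lnL(q_hat) - lnL(0.5)].
    """
    n = n_parental + n_recombinant
    q_hat = n_recombinant / n

    def log_lik(q):
        # Binomial log likelihood without the constant term.
        # Empty classes are skipped so that log(0) is never evaluated.
        ll = 0.0
        if n_recombinant:
            ll += n_recombinant * math.log(q)
        if n_parental:
            ll += n_parental * math.log(1.0 - q)
        return ll

    g = 2.0 * (log_lik(q_hat) - log_lik(0.5))
    return q_hat, g


# Hypothetical counts, not taken from the lecture.
q_hat, g_stat = backcross_mle(n_parental=160, n_recombinant=40)
print(f"q_mle = {q_hat:.3f}, G = {g_stat:.2f}")
```

For every other cross type the same log likelihood has to be maximized numerically, which is where NR and EM come in.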
Newton-Raphson: Linkage Analysis • Obtain an expression for the score S(q) = dL(q)/dq, where L(q) is the log likelihood. • Obtain an expression for the information per individual, I(q) = –E[d²L(q)/dq²]/N. • Make a first guess q0. • Iterate qn+1 = qn + S(qn)/[N I(qn)] until |qn+1 – qn| < tolerance (e.g. 0.00001)
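A minimal sketch of this iteration for one concrete case: an F2 with two dominant markers in coupling, whose phenotype classes A-B-, A-bb, aaB-, aabb have expected frequencies (2+ψ)/4, (1−ψ)/4, (1−ψ)/4, ψ/4 with ψ = (1−q)². The step adds S/(N·I) because I is taken here as the positive expected information per individual, so strictly this is the Fisher scoring variant of NR. Function names, the starting value, and the counts are hypothetical.

```python
def f2_coupling_probs(q):
    """Expected frequencies of (A-B-, A-bb, aaB-, aabb) and their d/dq."""
    psi = (1.0 - q) ** 2
    p = [(2.0 + psi) / 4.0, (1.0 - psi) / 4.0, (1.0 - psi) / 4.0, psi / 4.0]
    d = (1.0 - q) / 2.0          # magnitude of dp/dq, same for every class
    dp = [-d, d, d, -d]
    return p, dp


def fisher_scoring(counts, q0=0.25, tol=1e-5, max_iter=100):
    """Iterate q <- q + S(q) / (N * I(q)) until the step is below tol."""
    n = sum(counts)
    q = q0
    for _ in range(max_iter):
        p, dp = f2_coupling_probs(q)
        # Total score: derivative of the multinomial log likelihood.
        score = sum(f * d / pi for f, d, pi in zip(counts, dp, p))
        # Expected information per individual.
        info = sum(d * d / pi for d, pi in zip(dp, p))
        q_new = q + score / (n * info)
        # Keep the estimate inside the admissible range for q.
        q_new = min(max(q_new, 1e-6), 0.5)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q


# Hypothetical phenotype counts (A-B-, A-bb, aaB-, aabb).
counts = [125, 18, 20, 34]
print(f"q_mle = {fisher_scoring(counts):.4f}")
```

The same loop works for any cross once `f2_coupling_probs` is replaced by that cross's expected frequencies and derivatives.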
EM Algorithm: Linkage Analysis • Make an initial guess qprevious. • Compute expected number of recombinants Ei = fi Pi(R|G) • Compute new maximum likelihood estimate qnew = (1/N) × Σi Ei • Iterate until |qnew – qprevious| < tolerance
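The EM steps above can be sketched for the same kind of cross, an F2 with two dominant markers in coupling. Here Pi(R|G) is taken as the expected proportion of recombinant gametes in an individual of phenotype class i; the expressions 2q/(3−2q+q²) for A-B-, 1/(2−q) for A-bb and aaB-, and 0 for aabb are derived under that coupling-phase assumption rather than copied from the lecture's tables. Names and counts are hypothetical.

```python
def prob_recombinant_given_class(q):
    """Expected proportion of recombinant gametes per phenotype class.

    Order of classes: A-B-, A-bb, aaB-, aabb (F2, dominant, coupling).
    """
    return [
        2.0 * q / (3.0 - 2.0 * q + q * q),  # A-B-
        1.0 / (2.0 - q),                    # A-bb
        1.0 / (2.0 - q),                    # aaB-
        0.0,                                # aabb: only ab/ab is possible
    ]


def em_f2_coupling(counts, q0=0.25, tol=1e-8, max_iter=1000):
    """EM iteration q_new = (1/N) * sum_i f_i * P_i(R|G)."""
    n = sum(counts)
    q_prev = q0
    for _ in range(max_iter):
        # E-step: expected recombinants in each class at the current q.
        p_rec = prob_recombinant_given_class(q_prev)
        expected = [f * p for f, p in zip(counts, p_rec)]
        # M-step: the new estimate is the expected recombinant proportion.
        q_new = sum(expected) / n
        if abs(q_new - q_prev) < tol:
            return q_new
        q_prev = q_new
    return q_prev


# Hypothetical phenotype counts (A-B-, A-bb, aaB-, aabb).
counts = [125, 18, 20, 34]
print(f"q_mle = {em_f2_coupling(counts):.4f}")
```

EM converges linearly, so a tighter tolerance than in NR is used here to reach comparable accuracy.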
Advantages of EM Algorithm for Linkage Analysis • No need to calculate the first and second derivative of the log likelihood. • The calculations are simpler.
Information and Variance • Recall that the MLE of q is approximately normally distributed, with mean q and variance 1/[N I(q)], where I(q) is the information per individual. • So, calculating I(q) is necessary for NR & variance estimates.
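As a worked instance, consider the phase-known backcross, where each offspring is recombinant (r = 1) with probability q. Assuming the usual asymptotic result Var(q̂) ≈ 1/[N I(q)], the per-individual information and the variance work out as follows; the numbers q = 0.2 and N = 200 are hypothetical.

```latex
\ell(q) = r\ln q + (1-r)\ln(1-q), \qquad
I(q) = -E\!\left[\frac{d^2\ell}{dq^2}\right] = \frac{1}{q} + \frac{1}{1-q} = \frac{1}{q(1-q)}

\operatorname{Var}(\hat q) \approx \frac{1}{N\,I(q)} = \frac{q(1-q)}{N}
\quad\Rightarrow\quad
\frac{0.2 \times 0.8}{200} = 0.0008, \qquad \mathrm{SE}(\hat q) \approx 0.028
```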
Three Things We Need • Expected frequencies pi. • Conditional probabilities of recombination Pi(R|G). • Score & information per individual.
ML Estimation in Mixed Populations: Hypothesis Tests • GF2 = 2[LF2(qmle) – LF2(0.5)] = 2(103.03 – 52.33) = 101.4 (p < 0.00001); qmle = 0.11 • GBC = 2[LBC(qmle) – LBC(0.5)] = 2(-200.16 + 277.1) = 154.18 (p < 0.00001); qmle = 0.20 • Gpool = 2[Lpool(qmle) – Lpool(0.5)] = 2(-101.15 + 224.77) = 247.24 (p < 0.00001); qmle = 0.17 • Gtotal = GF2 + GBC = 255.6 • Gheterogeneity = Gtotal – Gpool = 8.36 (p < 0.01)
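The partition of G on this slide can be checked with a short script. It assumes Gheterogeneity is referred to a chi-square distribution with c − 1 = 1 degree of freedom (two crosses pooled), which is the usual choice but is not stated on the slide; the helper name is hypothetical. Working from the unrounded sum gives 8.34 rather than the slide's 8.36, which starts from Gtotal rounded to 255.6.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df.

    Uses the identity P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))


# G statistics as reported on the slide.
g_f2, g_bc, g_pool = 101.4, 154.18, 247.24

g_total = g_f2 + g_bc              # sum of the per-cross statistics
g_het = g_total - g_pool           # heterogeneity component
p_het = chi2_sf_1df(g_het)         # assumes 1 degree of freedom

print(f"G_total = {g_total:.2f}")
print(f"G_heterogeneity = {g_het:.2f}, p = {p_het:.4f}")
```

A small p-value here argues against pooling the F2 and backcross data under a single q.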
Statistical Properties of qmle • In the backcross formulation, the number of recombinants is a binomial random variable, so qmle (recombinants/N) has an exact, scaled binomial distribution. • For all other crosses and mixes, the distribution of qmle is obtained from the asymptotic properties of maximum likelihood estimators, so the sample size needs to be large.
Empirical Variance Using Bootstrap • Sample from the data with replacement b times to generate b bootstrap data sets, such that the ith data set has genotype counts fi,A-B-, fi,aaB-, fi,A-bb, and fi,aabb.
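A minimal sketch of the bootstrap, assuming an F2 with two dominant markers in coupling so that the MLE for each resampled data set has a closed form (the positive root of a quadratic in ψ = (1−q)²). The counts, the number of replicates, the seed, and all function names are hypothetical.

```python
import math
import random


def mle_f2_coupling(counts):
    """Closed-form MLE of q from counts (A-B-, A-bb, aaB-, aabb)."""
    f1, f2, f3, f4 = counts
    n = f1 + f2 + f3 + f4
    # Score equation reduces to n*psi^2 - b*psi - 2*f4 = 0.
    b = f1 - 2 * (f2 + f3) - f4
    psi = (b + math.sqrt(b * b + 8 * n * f4)) / (2 * n)
    return 1.0 - math.sqrt(psi)


def bootstrap_q(counts, b=1000, seed=0):
    """Resample individuals with replacement and re-estimate q b times."""
    rng = random.Random(seed)
    n = sum(counts)
    classes = [0, 1, 2, 3]
    estimates = []
    for _ in range(b):
        # Drawing classes with weights equal to the observed counts is
        # equivalent to resampling the n individuals with replacement.
        draw = rng.choices(classes, weights=counts, k=n)
        boot_counts = [draw.count(c) for c in classes]
        estimates.append(mle_f2_coupling(boot_counts))
    estimates.sort()
    mean = sum(estimates) / b
    variance = sum((e - mean) ** 2 for e in estimates) / (b - 1)
    # Percentile 95% confidence interval.
    lower = estimates[int(0.025 * b)]
    upper = estimates[int(0.975 * b) - 1]
    return mean, variance, (lower, upper)


counts = [125, 18, 20, 34]          # hypothetical F2 phenotype counts
q_hat = mle_f2_coupling(counts)
boot_mean, boot_var, ci = bootstrap_q(counts)
print(f"q_mle = {q_hat:.4f}, bias = {boot_mean - q_hat:+.4f}")
print(f"bootstrap SE = {math.sqrt(boot_var):.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The difference between the bootstrap mean and the original estimate serves as a bias estimate, and the percentile interval needs no normality assumption.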
Confidence Intervals • Simulation studies have shown that bootstrap confidence intervals are narrower and have better coverage probabilities than intervals from the normal approximation when the sample size is small (100-200).
Power and Linkage Analysis • G(q) = 2[lnL(q) – lnL(1/2)] • Calculate the noncentrality parameter for a given q that you wish to be able to detect. That noncentrality parameter is given by EG = E[G(q)]. • The power is given by:
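A common way to write this step, assuming G(q) is approximately noncentral chi-square with 1 degree of freedom and noncentrality parameter EG under the alternative (an assumption, and not necessarily the lecture's exact expression):

```latex
\text{power} \;=\; 1-\beta \;=\; P\!\left[\chi^2_{1}(\lambda = EG) \;>\; \chi^2_{1,\alpha}\right]
```

Here χ²₁,α is the critical value of the central chi-square distribution at significance level α.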
Sample Size and Linkage Analysis • EG0 is the per-observation expected log likelihood ratio statistic, so EG = N × EG0 for N independent observations.
Sample Size and Cross sample size needed to detect given q with 95% power
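A sketch of such a sample-size calculation for the backcross only, where the per-observation expectation is EG0 = 2[q ln(2q) + (1−q) ln(2(1−q))]. It assumes a noncentral chi-square approximation with 1 degree of freedom and a significance level of 0.05; the lecture's own significance level is not given, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def eg0_backcross(q):
    """Per-observation expected LR statistic for a backcross, 0 < q < 0.5."""
    return 2.0 * (q * math.log(2.0 * q) + (1.0 - q) * math.log(2.0 * (1.0 - q)))


def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with noncentrality parameter ncp.

    A noncentral chi-square with 1 df is (Z + sqrt(ncp))^2, so the tail
    probability follows directly from the standard normal CDF.
    """
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # sqrt of the critical value
    shift = math.sqrt(ncp)
    return (1.0 - _STD_NORMAL.cdf(crit - shift)) + _STD_NORMAL.cdf(-crit - shift)


def sample_size_backcross(q, power=0.95, alpha=0.05):
    """Smallest N with N * EG0 large enough to reach the requested power."""
    eg0 = eg0_backcross(q)
    n = 1
    while power_1df(n * eg0, alpha) < power:
        n += 1
    return n


for q in (0.1, 0.2, 0.3, 0.4):
    print(f"q = {q:.1f}: N = {sample_size_backcross(q)}")
```

Other crosses would need their own EG0, computed from their expected genotype frequencies; the required N grows quickly as q approaches 0.5.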
The Problem of Dominant Markers • The success of techniques like RAPD and AFLP has created many dominant markers. • Dominant markers in repulsion phase have low information (require larger sample size to obtain same confidence). • The q estimate is also biased for dominant markers in repulsion.
Trans Dominant Linked Markers (TDLM) • Two dominant markers in linkage repulsion can be recoded as a codominant marker if they are linked closely enough.
Assumption Violation: Segregation Distortion • Assume no segregation distortion at individual loci. • Additive distortion P(Aa) += a. Then the false positive rate increases in F2 cross. • Penetrance distortion P(Aa) *= a. Then the power for detection decreases.
Summary • General likelihood method for phase-known gametes. • NR & EM for linkage analysis. • Backcross, F2 double codominant, F2 coupled dominant, F2 repulsion dominant, mixed populations • Bootstrap estimates of variance & bias. • Some problems: dominant markers, segregation distortion.
Self Test • Can you derive the expected frequencies and conditional probabilities given in the tables here?