Introduction to Haplotype Estimation Stat/Biostat 550
The Haplotype Problem • Suppose we genotype individuals at a number of tightly linked SNPs.
The Haplotype Problem • What do the types on the two chromosomes look like?
Haplotypes: who cares? Many people, for many different reasons… • LD mapping: increase power? • LD mapping: decrease genotyping? • Evolutionary studies: selection, recombination, gene conversion, population structure,…
The Haplotype Problem – potential solutions • Molecular methods • Collect family data • Statistical methods for population data
The Simplest Case • What do the types on the two chromosomes look like?
The Next Simplest Case • What do the types on the two chromosomes look like?
The first difficult case… • What do the types on the two chromosomes look like?
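A small sketch of why the cases get harder as heterozygous sites are added: a genotype with k heterozygous SNPs is compatible with 2^(k-1) unordered haplotype pairs (a single pair when k is 0 or 1). The 0/1/2 genotype coding and the function name `compatible_pairs` are assumptions made for this illustration, not part of the slides.

```python
from itertools import product

def compatible_pairs(genotype):
    """Return every unordered haplotype pair consistent with a genotype.

    genotype: sequence of 0, 1 or 2, the number of copies of the '1'
    allele at each SNP (an assumed coding). Haplotypes are tuples of 0/1.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both chromosomes: 0 -> allele 0, 2 -> allele 1.
    base = [g // 2 for g in genotype]
    pairs = set()
    # Try every assignment of alleles to the heterozygous sites on chromosome 1;
    # chromosome 2 carries the opposite allele at each of those sites.
    for alleles in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, alleles):
            h1[site], h2[site] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

no_het = compatible_pairs([0, 2, 0])    # nothing to resolve
one_het = compatible_pairs([0, 1, 2])   # still a single pair
two_het = compatible_pairs([1, 1, 0])   # first genuinely ambiguous genotype
```

With two heterozygous sites the function returns two distinct pairs, and genotype data alone cannot say which one the individual carries.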
Clark’s Method (1990) • Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.
[Figure, labelled 1, 2, 3, showing candidate configurations] Is it this configuration? …or this one? This one is more probable.
Clark’s Method (Clark, 1990) • Identify the unambiguous individuals. • Make a list of “known” haplotypes. • Go through list, and see whether ambiguous individuals can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
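The three steps above can be sketched in a few lines. This is a minimal illustration, not Clark's original implementation: genotypes are assumed to be coded 0/1/2 (copies of the '1' allele per SNP), haplotypes are tuples of 0/1, and the name `clark_phase` is hypothetical.

```python
def clark_phase(genotypes):
    """Greedy sketch of Clark's method.

    Returns (resolved, known): resolved maps an individual's index to a
    haplotype pair, known is the list of haplotypes seen so far.
    Individuals that never match a known haplotype stay unresolved.
    """
    known, resolved = [], {}

    # Steps 1-2: individuals with at most one heterozygous site are
    # unambiguous, so their haplotypes seed the "known" list.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(x // 2 for x in g)              # allele 0 at the het site
            h2 = tuple(x - a for x, a in zip(g, h1))   # the complement
            resolved[idx] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Step 3: keep sweeping until no further individual can be resolved.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for h in known:
                comp = tuple(x - a for x, a in zip(g, h))
                # h fits only if the complement is itself a valid haplotype.
                if all(c in (0, 1) for c in comp):
                    resolved[idx] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    progress = True
                    break
    return resolved, known

example_resolved, example_known = clark_phase([(0, 0, 0), (1, 1, 0), (2, 1, 1)])
```

In this example the third individual can only be resolved after the second one has added a new haplotype to the list. The first exact match always wins, which is why the outcome can depend on the order in which individuals and haplotypes are visited.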
Clark’s Method • [Figure, labelled 1, 2, 3, with the list of known haps.]
Clark’s Method: Problem 1 • [Figure, labelled 1, 2, 3, with the list of known haps.] • Answer depends on the order in which the list is considered… • …and frequency information is ignored.
Clark’s Method: Problem 2 • [Figure, labelled 1, 2, 3, with the list of known haps.] • Algorithm can fail to resolve all haplotypes… • …because it looks only for exact matches.
Clark’s Algorithm: Summary • Results may depend on the order in which individuals are considered. • Frequency information is ignored. • May fail to resolve all haplotypes. • Fails to assess uncertainty. • Looks only for exact matches. • Fast and intuitive(?).
Maximum Likelihood (EM Algorithm) • Idea: find haplotype frequencies (f1,…,fN) that maximise the probability of the observed genotype data (g1,…,gn).
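A minimal sketch of the EM iteration for this problem, under assumptions the slide does not state: random mating (Hardy-Weinberg proportions), genotypes coded 0/1/2 per SNP, a uniform starting point, and a fixed number of iterations instead of a convergence test. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = set()
    for alleles in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het, alleles):
            h1[site], h2[site] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    pairs_per_ind = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_per_ind for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pairs_per_ind:
            # E-step: posterior probability of each compatible pair.
            # A pair of two different haplotypes can arise in two orders.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

em_freq = em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (1, 1)])
```

In this example the double heterozygote is pulled towards the pair made of the two haplotypes already seen in the homozygous individuals, so the estimates approach 5/8 and 3/8 for those haplotypes and 0 for the other two.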
Bayesian version • Modify Clark’s algorithm: • Replace the single pass through the data with an iterative scheme. • Allow for uncertainty in resolution. • Use frequency information. Resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).
Example • [Figure: list of known haps.] One haplotype matches 1 known; the other does not match any. Assigned moderate probability.
Example • One haplotype matches 3 known; the other does not match any. Assigned higher probability.
Example • Neither haplotype matches any known. Assigned low probability.
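The ordering in these three examples can be reproduced with a simple weighting rule. The rule used here is an assumption for illustration: each candidate pair is weighted by the product of (number of copies in the known list + a small pseudo-count) for its two haplotypes, then normalised. The function name `pair_weights`, the value of `pseudo`, and the example haplotypes are all hypothetical.

```python
from collections import Counter

def pair_weights(candidate_pairs, known_haps, pseudo=0.1):
    """Normalised weights for candidate haplotype pairs of one individual.

    Each haplotype contributes (copies in known_haps + pseudo), so
    haplotypes never seen before still get a small, non-zero weight.
    """
    counts = Counter(known_haps)
    raw = [(counts[a] + pseudo) * (counts[b] + pseudo) for a, b in candidate_pairs]
    total = sum(raw)
    return [r / total for r in raw]

# Known list: "000" seen once, "001" seen three times.
known_list = ["000", "001", "001", "001"]
# Three of the pairs compatible with a triple heterozygote.
candidates = [("000", "111"),   # matches 1 known, other matches none
              ("001", "110"),   # matches 3 known, other matches none
              ("010", "101")]   # neither matches any
weights = pair_weights(candidates, known_list)
```

A sampler built on this rule would draw a pair at random according to these weights instead of committing to the first exact match, which is how uncertainty and frequency information enter.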
Problems with EM/naïve Gibbs • Potentially (very) large number of parameters to estimate, leading to inaccurate estimates. • Can be time-consuming for large problems. • Can “converge” to poor local optima (alleviated by multiple runs).
Further modification • Take into account “near misses”, as well as exact matches. (PHASE v1.0: Stephens, Smith and Donnelly 2001)
Example • [Figure: list of known haps.] One haplotype matches 1 known; the other differs by 2 from 3 known.
Example • One haplotype matches 3 known; the other differs by 2 from 1 known.
Example • One haplotype differs by 1 from 3 known; the other differs by 1 from 1 known. How to balance these possibilities?
The key question • What is the conditional distribution of the next haplotype, given a set of known haplotypes?
Example • [Figure: haplotypes labelled 1 and 2] Given the above haplotypes, what would you expect the next haplotype to look like?
Qualitative answer • The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype. • Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
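The qualitative answer can be turned into a small simulation. This is a sketch under stated assumptions, not the PHASE implementation: the number of mutations is taken to be geometric with continuation probability theta/(n + theta), where n is the number of known haplotypes and theta a scaled mutation rate, and each mutation flips the allele at a uniformly chosen site. The function name and the default value of `theta` are hypothetical.

```python
import random

def sample_next_haplotype(known_haps, theta=1.0, rng=random):
    """Draw a plausible 'next' haplotype given a list of known haplotypes.

    known_haps: list of tuples of 0/1 alleles (repeats encode frequency).
    """
    n = len(known_haps)
    # Copy a randomly chosen existing haplotype; common ones are chosen more often.
    hap = list(rng.choice(known_haps))
    # Add mutations one at a time; stopping has probability n / (n + theta),
    # so zero mutations is the most likely outcome when theta is small.
    while rng.random() < theta / (n + theta):
        site = rng.randrange(len(hap))
        hap[site] = 1 - hap[site]
    return tuple(hap)

example_known_haps = [(0, 0, 1), (0, 0, 1), (1, 1, 0)]
example_draw = sample_next_haplotype(example_known_haps, theta=1.0,
                                     rng=random.Random(0))
```

With `theta=0` the draw is always an exact copy of a known haplotype, which is the "exact matches only" behaviour of the earlier methods. Larger values make near misses more likely.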
Problems • Time-consuming for large problems. • Can “converge” to poor local optima. • Ignores recombination (decay of LD with distance). • How should uncertainty in haplotype estimates be treated?