
Introduction to Haplotype Estimation




  1. Introduction to Haplotype Estimation Stat/Biostat 550

  2. The Haplotype Problem • Suppose we genotype individuals at a number of tightly linked SNPs. A C G C C T T T G C G C G A A C C C C C A G G C

  3. The Haplotype Problem • Suppose we genotype individuals at a number of tightly linked SNPs. A C G C C T T T G C G C G A A C C C C C A G G C

  4. The Haplotype Problem • Suppose we genotype individuals at a number of tightly linked SNPs.

  5. The Haplotype Problem • What do the types on the two chromosomes look like?

  6. The Haplotype Problem • What do the types on the two chromosomes look like?

  7. The Haplotype Problem • What do the types on the two chromosomes look like?

  8. The Haplotype Problem • What do the types on the two chromosomes look like?

  9. The Haplotype Problem • What do the types on the two chromosomes look like?

  10. Haplotypes: who cares? Many people, for many different reasons… • LD mapping: increase power? • LD mapping: decrease genotyping? • Evolutionary studies: selection, recombination, gene conversion, population structure,…

  11. The Haplotype Problem – potential solutions • Molecular methods • Collect family data • Statistical methods for population data

  12. The Simplest Case • What do the types on the two chromosomes look like?

  13. The Next Simplest Case • What do the types on the two chromosomes look like?

  14. The Next Simplest Case • What do the types on the two chromosomes look like?

  15. The first difficult case… • What do the types on the two chromosomes look like?

  16. The first difficult case… • What do the types on the two chromosomes look like?
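
The progression from the simplest case to the first difficult case can be illustrated by counting the haplotype pairs consistent with a genotype. The sketch below is not from the slides: it assumes biallelic SNPs, codes each genotype as the number of copies (0, 1 or 2) of an arbitrarily chosen allele at each site, and uses a hypothetical helper name `compatible_pairs`. An individual with no heterozygous sites or one heterozygous site has a single possible pair; with k ≥ 2 heterozygous sites there are 2^(k-1) pairs, which is where the ambiguity starts.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: sequence of 0, 1 or 2 (copies of allele "1" at each SNP).
    Haplotypes are returned as tuples of 0/1. Illustrative coding only.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both chromosomes: 0 -> 0, 2 -> 1.
    base = [g // 2 for g in genotype]
    if not het_sites:
        h = tuple(base)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to allele 1 on the first haplotype
    # so that each unordered pair is listed exactly once.
    for rest in product((0, 1), repeat=len(het_sites) - 1):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, (1,) + rest):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    print(len(compatible_pairs((0, 2, 0))))  # no heterozygous sites
    print(len(compatible_pairs((1, 0, 2))))  # one heterozygous site
    print(len(compatible_pairs((1, 1, 0))))  # two heterozygous sites
```

With two heterozygous sites the function returns two candidate pairs, and genotype data alone cannot distinguish between them.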

  17. Clark’s Method (1990) • Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.

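  18. [Figure: candidate configuration, labels 1, 2, 3] Is it this configuration?

  19. [Figure: candidate configuration, labels 1, 2, 3] …or this one?

  20. [Figure: candidate configuration, labels 1, 2, 3] This one is more probable.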

  21. Clark’s Method (Clark, 1990) • Identify the unambiguous individuals. • Make a list of “known” haplotypes. • Go through list, and see whether ambiguous individuals can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
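
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Clark's original implementation: genotypes are coded as 0/1/2 allele counts at biallelic SNPs, the function names `complement` and `clark` are hypothetical, and individuals and known haplotypes are simply scanned in list order.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if impossible."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(a in (0, 1) for a in other) else None


def clark(genotypes):
    """Sketch of Clark-style resolution; returns (resolved, known)."""
    known = []      # list of "known" haplotypes, in order of discovery
    resolved = {}   # individual index -> (haplotype, haplotype)

    # Steps 1-2: individuals with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Step 3: resolve ambiguous individuals as known + complementary haplotype,
    # repeating until a full pass makes no further progress.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)
                    progress = True
                    break  # first exact match wins, so the result is order-dependent
    return resolved, known


if __name__ == "__main__":
    resolved, known = clark([(0, 0, 0), (1, 1, 0), (2, 1, 1)])
    print(resolved)
    print(known)
```

Because the first exact match is accepted, changing the order of individuals or of the known list can change the answer, and an individual whose genotype is compatible with no known haplotype is left unresolved.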

  22. Clark’s Method [Figure labels: 1, 2, 3; List of known haps.]

  23. Clark’s Method [Figure labels: 1, 2, 3; List of known haps.]

  24. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3]

  25. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3; List of known haps.]

  26. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3; List of known haps.]

  27. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3; List of known haps.]

  28. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3; List of known haps.]

  29. Clark’s Method: Problem 1 [Figure labels: 1, 2, 3; List of known haps.] • The answer depends on the order in which the list is considered… • …and frequency information is ignored.

  30. Clark’s Method: Problem 2 [Figure labels: 1, 2, 3]

  31. Clark’s Method: Problem 2 [Figure labels: 1, 2, 3; List of known haps.] • The algorithm can fail to resolve all haplotypes… • …because it looks only for exact matches.

  32. Clark’s Algorithm: Summary • Results may depend on order individuals are considered. • Frequency information is ignored. • May fail to resolve all haplotypes. • Fails to assess uncertainty. • Looks only for exact matches. • Fast and intuitive(?).

  33. Maximum Likelihood (EM Algorithm) • Idea: find the haplotype frequencies (f1,…,fN) that maximise the probability of the observed genotype data (g1,…,gn).
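
A minimal sketch of how such an EM iteration might look is given below. It makes assumptions the slide does not state: haplotypes pair at random within individuals (Hardy-Weinberg equilibrium), SNPs are biallelic and coded as 0/1/2 allele counts, there is no missing data, and the starting frequencies are uniform. The names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical. The E-step spreads each individual over its compatible haplotype pairs in proportion to the current frequencies; the M-step re-estimates the frequencies from those expected counts.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites: 0 -> 0, 2 -> 1
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    for rest in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het, (1,) + rest):
            h1[site], h2[site] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming random pairing (HWE)."""
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: probability of each pair given the genotype;
            # a pair of distinct haplotypes can arise in two orders.
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


if __name__ == "__main__":
    data = [(0, 0), (0, 0), (2, 2), (1, 1)]
    for hap, f in em_haplotype_frequencies(data).items():
        print(hap, round(f, 4))
```

In this toy data set the double heterozygote is pulled towards the pair made of the two haplotypes already seen in the homozygous individuals, which is the frequency information that Clark's method ignores.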

  34. Bayesian version Modify Clark’s algorithm: • Replace single pass through data, with iterative scheme. • Allow for uncertainty in resolution. • Use frequency information. Resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).

  35. Example [Figure labels: 1, 2, 3; List of known haps.] • Matches 1 known • Does not match any • Assigned moderate probability

  36. Example [Figure labels: 1, 2, 3; List of known haps.] • Matches 3 known • Does not match any • Assigned higher probability

  37. Example [Figure labels: 1, 2, 3; List of known haps.] • Does not match any • Does not match any • Assigned low probability

  38. Problems with EM/naïve Gibbs • Potentially (very) large number of parameters to estimate, leading to inaccurate estimates. • Can be time-consuming for large problems. • Can “converge” to poor local optima (alleviated by multiple runs).

  39. Further modification • Take into account “near misses”, as well as exact matches. (PHASE v1.0: Stephens, Smith and Donnelly 2001)

  40. Example [Figure labels: 1, 2, 3; List of known haps.] • Matches 1 known • Differs by 2 from 3 known

  41. Example [Figure labels: 1, 2, 3; List of known haps.] • Matches 3 known • Differs by 2 from 1 known

  42. Example [Figure labels: 1, 2, 3; List of known haps.] • Differs by 1 from 3 known • Differs by 1 from 1 known • How to balance these possibilities?

  43. The key question • What is the conditional distribution of the next haplotype, given a set of known haplotypes?

  44. Example [Figure labels: 1, 2] • Given the above haplotypes, what would you expect the next haplotype to look like?

  45. Qualitative answer • The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype. • Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
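
The qualitative answer above can be written as a formula. The block below is a sketch of the form this approximation takes in Stephens, Smith and Donnelly (2001), and should be checked against that paper. None of the symbols appear in the slides: H is the set of known haplotypes, E the set of distinct haplotype types in H, r the total number of haplotypes in H, r_α the number of type α, θ a scaled mutation rate, and P a mutation matrix whose (α, h) entry is the probability that a single mutation turns α into h.

```latex
\pi(h \mid H) \;=\; \sum_{\alpha \in E} \sum_{s=0}^{\infty}
  \frac{r_\alpha}{r}
  \left(\frac{\theta}{r+\theta}\right)^{s}
  \frac{r}{r+\theta}\,
  \left(P^{s}\right)_{\alpha h}
```

Read term by term: pick an existing haplotype α at random (probability r_α / r), apply s mutations where s is geometrically distributed with success probability r / (r + θ), and ask how likely those s mutations are to produce h. When r is large relative to θ, s = 0 dominates and the next haplotype is most likely an exact copy of a known one.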

  46. Comparisons on simulated data

  47. Problems • Time-consuming for large problems. • Can “converge” to poor local optima. • Ignores recombination (decay of LD with distance). • How should uncertainty in haplotype estimates be treated?

  48. … to be continued.
