
METHODS FOR HAPLOTYPE RECONSTRUCTION

METHODS FOR HAPLOTYPE RECONSTRUCTION. Andrew Morris Wellcome Trust Centre for Human Genetics March 6, 2003. Outline. Haplotypes and genotypes. Reconstruction in pedigrees. Reconstruction in unrelated individuals. Interpretation and LD assessment. Two stage analyses.


Presentation Transcript


  1. METHODS FOR HAPLOTYPE RECONSTRUCTION Andrew Morris Wellcome Trust Centre for Human Genetics March 6, 2003

  2. Outline • Haplotypes and genotypes. • Reconstruction in pedigrees. • Reconstruction in unrelated individuals. • Interpretation and LD assessment. • Two stage analyses.

  3. Haplotypes and genotypes (1)


  7. Haplotypes and genotypes (2) • Individuals that are homozygous at every locus, or heterozygous at just one locus, can be resolved. • Individuals that are heterozygous at k loci are consistent with $2^{k-1}$ configurations of haplotypes.
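The count on slide 7 can be checked by enumeration. The sketch below is an illustration only: the function name `haplotype_pairs` and the string encoding (one two-character genotype per locus, separated by spaces, as on the later example slides) are assumptions made here, not part of the presentation.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    genotype: string such as "00 01 01 00" (two single-character alleles per locus).
    """
    loci = genotype.split()
    het = [i for i, g in enumerate(loci) if g[0] != g[1]]
    pairs = []
    # The phase at the first heterozygous locus is fixed so that each
    # unordered pair is listed once; every other heterozygous locus
    # can be flipped or not, giving 2^(k-1) pairs for k heterozygous loci.
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        h1 = [g[0] for g in loci]
        h2 = [g[1] for g in loci]
        for locus, flip in zip(het[1:], flips):
            if flip:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

# Two heterozygous loci give 2^(2-1) = 2 configurations.
print(haplotype_pairs("00 01 01 00"))
```

An individual with zero or one heterozygous locus yields a single pair, which is the "can be resolved" case.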

  8. Why do we need haplotypes? • Correlation between alleles at closely linked loci… • Fine-scale mapping studies. • Association studies with multiple markers in candidate genes. • Investigating patterns of LD across genomic regions. • Inferring population histories.

  9. Molecular methods • Single molecule dilution. • Allele-specific long-range PCR. • Prone to errors. • Expensive and inefficient: low throughput.

  10. Simplex family data (1) 00 01 00 11 x 01 11 01 01 (M) (F) 00 01 01 01

  11. Simplex family data (1) 00 01 00 11 x 01 11 01 01 (M) (F) 00 01 01 01

  12. Simplex family data (1) 00 01 00 11 x 01 11 01 01 (M) (F) 00 01 01 01 Inferred haplotypes: 0001 / 0110

  13. Simplex family data (2) 00 01 00 01 x 01 01 00 01 (M) (F) 00 01 00 01 • Cannot be fully resolved…
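The inference on slides 10 to 13 can be sketched locus by locus: an offspring allele is assigned to a parent only when one assignment is compatible with both parental genotypes. The function name `phase_offspring`, the neutral labels `parent1` and `parent2`, and the use of "?" for an unresolved locus are assumptions made for this illustration.

```python
def phase_offspring(parent1, parent2, child):
    """Return (haplotype from parent1, haplotype from parent2) for the child.

    Genotypes are strings such as "00 01 00 11". A "?" marks a locus
    where the parental origin of the child's alleles cannot be determined.
    """
    from_p1, from_p2 = [], []
    for p1, p2, c in zip(parent1.split(), parent2.split(), child.split()):
        # Both ways of assigning the child's two alleles to the two parents,
        # keeping only those the parental genotypes allow.
        options = {(a, b) for a, b in ((c[0], c[1]), (c[1], c[0]))
                   if a in p1 and b in p2}
        if not options:
            raise ValueError("genotypes are inconsistent with Mendelian inheritance")
        if len(options) == 1:
            a, b = options.pop()
        else:
            a = b = "?"
        from_p1.append(a)
        from_p2.append(b)
    return "".join(from_p1), "".join(from_p2)

# Family from slides 10 to 12: fully resolved.
print(phase_offspring("00 01 00 11", "01 11 01 01", "00 01 01 01"))
# Family from slide 13: loci where all three are heterozygous stay unresolved.
print(phase_offspring("00 01 00 01", "01 01 00 01", "00 01 00 01"))
```

For the second family, the second and fourth loci are heterozygous in both parents and the child, which is why the configuration cannot be fully resolved.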

  14. Pedigree data (1) 11 01 11 01 11 x 00 00 11 11 11 01 01 11 11 11 x 01 00 00 01 00 01 01 01 01 01 11 01 01 01 01 00 00 01 11 01

  15. Pedigree data (1) 11111 / 10101 x 00111 / 00111 11111 / 00111 x 00010 / 10000 11111 / 00000 11111 / 10000 00111 / 00010


  17. Pedigree data (2) • Many combinations of haplotypes may be consistent with pedigree genotype data. • Complex computational problem. • Need to make assumptions about recombination. • SIMWALK and MERLIN.

  18. Statistical approaches to reconstruct haplotypes in unrelated individuals • Parsimony methods: Clark’s algorithm. • Likelihood methods: E-M algorithm. • Bayesian methods: PHASE algorithm. • Aims: reconstruct haplotypes and/or estimate population frequencies.

  19. Clark’s algorithm (1) • Reconstruct haplotypes in unresolved individuals via parsimony. • Minimise number of haplotypes observed in sample. • Microsatellite or SNP genotypes.

  20. Clark’s algorithm (2) (1) Search for resolved individuals, and record all recovered haplotypes. (2) Compare each unresolved individual with the list of recovered haplotypes. (3) If a recovered haplotype is identified, the individual is resolved. (4) The complementary haplotype is added to the list of recovered haplotypes. (5) Repeat steps (2)-(4) until all individuals are resolved or no more haplotypes can be recovered.
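A minimal sketch of these steps, run on the ten genotypes from the example that follows (slide 21). The function names and the tie-breaking rule (individuals are visited in input order, and the first compatible recovered haplotype is used) are choices made for this illustration; the slides do not prescribe them.

```python
def complement(loci, hap):
    """Haplotype that pairs with hap to give the genotype, or None if hap does not fit."""
    other = []
    for g, allele in zip(loci, hap):
        if allele == g[0]:
            other.append(g[1])
        elif allele == g[1]:
            other.append(g[0])
        else:
            return None
    return "".join(other)

def clark(genotypes):
    """genotypes: dict of name -> "aa bb cc ..."; returns (resolved, recovered)."""
    loci = {name: g.split() for name, g in genotypes.items()}
    resolved, recovered = {}, []

    def record(hap):
        if hap not in recovered:
            recovered.append(hap)

    # Resolved individuals have at most one heterozygous locus.
    for name, g in loci.items():
        if sum(x[0] != x[1] for x in g) <= 1:
            pair = ("".join(x[0] for x in g), "".join(x[1] for x in g))
            resolved[name] = pair
            record(pair[0])
            record(pair[1])

    # Match unresolved individuals against recovered haplotypes until
    # a full pass resolves nobody.
    progress = True
    while progress:
        progress = False
        for name, g in loci.items():
            if name in resolved:
                continue
            for hap in recovered:
                other = complement(g, hap)
                if other is not None:
                    resolved[name] = (hap, other)
                    record(other)
                    progress = True
                    break
    return resolved, recovered

sample = {
    "A": "00 01 01 00", "B": "00 00 00 00", "C": "00 01 00 00",
    "D": "01 11 01 11", "E": "00 11 01 01", "F": "01 11 11 00",
    "G": "00 01 11 01", "H": "00 01 01 11", "I": "00 00 00 00",
    "J": "00 00 00 11",
}
resolved, recovered = clark(sample)
for name in sorted(resolved):
    print(name, " / ".join(resolved[name]))
```

With this ordering the sketch reaches the same configuration as slide 27. A different ordering or tie-breaking rule can give another answer, for example resolving G with 0111 rather than 0110, which is the issue raised on slides 28 to 30.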

  21. Example (A) 00 01 01 00 (B) 00 00 00 00 (C) 00 01 00 00 (D) 01 11 01 11 (E) 00 11 01 01 (F) 01 11 11 00 (G) 00 01 11 01 (H) 00 01 01 11 (I) 00 00 00 00 (J) 00 00 00 11

  22. Example (A) 00 01 01 00 (B) 00 00 00 00 (C) 00 01 00 00 (D) 01 11 01 11 (E) 00 11 01 01 (F) 01 11 11 00 (G) 00 01 11 01 (H) 00 01 01 11 (I) 00 00 00 00 (J) 00 00 00 11

  23. Example (A) 00 01 01 00 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 00 11 01 01 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0100 0110 1110 0001

  24. Example (A) 00 01 01 00 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 00 11 01 01 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0100 0110 1110 0001

  25. Example (A) 0000 / 0110 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 00 11 01 01 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0111 0100 0110 1110 0001

  26. Example (A) 0000 / 0110 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 0100 / 0111 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0111 0100 0011 0110 1110 0001

  27. Example (A) 0000 / 0110 (B) 0000 / 0000 (C) 0000 / 0100 (D) 0111 / 1101 (E) 0100 / 0111 (F) 0110 / 1110 (G) 0110 / 0011 (H) 0001 / 0111 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0111 0100 0011 0110 1101 1110 0001

  28. Example: problem… (A) 0000 / 0110 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 0100 / 0111 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0111 0100 0011 0110 1110 0001

  29. Example: problem… (A) 0000 / 0110 (B) 0000 / 0000 (C) 0000 / 0100 (D) 01 11 01 11 (E) 0100 / 0111 (F) 0110 / 1110 (G) 00 01 11 01 (H) 00 01 01 11 (I) 0000 / 0000 (J) 0001 / 0001 Recovered haplotypes: 0000 0111 0100 0010 0110 1110 0001

  30. Clark’s algorithm: problems • Multiple solutions: try many different orderings of individuals. • No starting point for algorithm. • Algorithm may leave many unresolved individuals. • How to deal with missing data?

  31. E-M algorithm (1) • Maximum likelihood method for population haplotype frequency estimation. • Allows for the fact that unresolved genotypes could be constructed from many different haplotype configurations. • Microsatellite or SNP genotypes.

  32. E-M algorithm (2) • Observed sample of N individuals with genotypes, G. • Unobserved population haplotype frequencies, h. • Unobserved configurations, H, consisting of complementary haplotype pairs Hi = {Hi1,Hi2}.

  33. E-M algorithm (3) • Likelihood: $f(G \mid h) = \prod_k f(G_k \mid h) = \prod_k \sum_i f(G_k \mid H_i)\, f(H_i \mid h)$, where $f(H_i \mid h) = f(H_{i1} \mid h)\, f(H_{i2} \mid h)$ under Hardy-Weinberg equilibrium.

  34. E-M algorithm (4) • Numerical algorithm used to obtain maximum likelihood estimates of h. • Initial set of haplotype frequencies h(0). • Haplotype frequencies h(t) at iteration t updated from frequencies at iteration t-1 using Expectation and Maximisation steps. • Continue until h(t) has converged.

  35. Expectation step • Use haplotype frequencies, h(t), to calculate the probability of resolving each genotype, Gk, into each possible haplotype configuration, Hi. • $E(H_i \mid G_k, h^{(t)}) = \dfrac{f(G_k \mid H_i)\, f(H_{i1} \mid h^{(t)})\, f(H_{i2} \mid h^{(t)})}{f(G_k \mid h^{(t)})}$

  36. Maximisation step • Compute haplotype frequencies using a procedure equivalent to gene counting. • $h_s^{(t+1)} = \dfrac{1}{2N} \sum_k \sum_i Z_{si}\, E(H_i \mid G_k, h^{(t)})$ • Zsi = number of copies (0,1,2) of the sth haplotype in configuration Hi.
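The expectation and maximisation steps can be put together in a short sketch. Everything named here is an assumption made for illustration: the helper `haplotype_pairs`, the uniform starting frequencies h(0), the stopping tolerance, and the four toy genotypes (which are not taken from the slides). f(Gk|Hi) is treated as 1 for configurations consistent with the genotype and 0 otherwise, and missing data are not handled.

```python
from collections import defaultdict
from itertools import product

def haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype such as "00 01"."""
    loci = genotype.split()
    het = [i for i, g in enumerate(loci) if g[0] != g[1]]
    pairs = []
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        h1 = [g[0] for g in loci]
        h2 = [g[1] for g in loci]
        for locus, flip in zip(het[1:], flips):
            if flip:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

def em_haplotype_frequencies(genotypes, max_iter=100, tol=1e-8):
    configs = [haplotype_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in configs for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # h(0): uniform start (an assumption)
    n = len(genotypes)
    for _ in range(max_iter):
        counts = defaultdict(float)
        for pairs in configs:
            # E-step: probability of each configuration under Hardy-Weinberg.
            # Constant factors are the same for every configuration of one
            # genotype, so they cancel in the normalisation.
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: gene counting over 2N haplotypes.
        new_freq = {h: counts[h] / (2 * n) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq

# Toy data: two resolved "00 00" individuals and one double heterozygote.
freq = em_haplotype_frequencies(["00 00", "00 00", "01 01"])
for hap, f in freq.items():
    print(hap, round(f, 4))
```

In this toy sample the common haplotype 00 draws the double heterozygote towards the configuration 00 / 11, so the estimates approach 5/6 for 00 and 1/6 for 11.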

  37. E-M algorithm: comments • Can handle missing data. • For many loci, the number of possible haplotypes is large, so population frequencies are difficult to estimate: re-parameterisation. • Does not provide reconstructed haplotype configuration for unresolved individuals: can use “maximum likelihood” configuration.

  38. PHASE algorithm (1) • Treats haplotype configuration for each unresolved individual as an unobserved random quantity. • Evaluate the conditional distribution, given a sample of unresolved genotype data. • Microsatellite or SNP genotypes. • Reconstruction and population haplotype frequency estimation.

  39. PHASE algorithm (2) • Bayesian framework: goal is to approximate posterior distribution of haplotype configurations f(H|G). • Implements Markov chain Monte Carlo (MCMC) methods to sample from f(H|G): Gibbs sampling. • Start at random configuration. • Repeatedly select unresolved individuals at random, and sample from their possible haplotype configurations, assuming all other individuals to be correctly resolved.

  40. PHASE algorithm (3) • Initial haplotype configuration H(0). • Subsequent iterations obtain H(t+1) from H(t) using the following steps: (1) Select an unresolved individual, i, at random. (2) Sample Hi(t+1) from f(Hi|G,H-i(t)). (3) Set Hk(t+1) = Hk(t) for all k ≠ i. • On convergence, each sampled configuration represents a random draw from f(H|G).

  41. PHASE algorithm (4) How to obtain f(Hi|G,H-i)? • Base directly on sample frequency of observed haplotypes in configuration H-i. • Better to introduce prior model for population haplotype frequencies, f(h). • Coalescent process used to predict likely patterns of haplotypes occurring in populations.
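The sampling scheme can be sketched with the simplest choice listed above: a conditional based directly on haplotype counts among the other individuals. This is not the PHASE program itself, which uses a coalescent-based prior. The pseudocount, the function names, the lack of burn-in and the toy genotypes are all assumptions made for this illustration.

```python
import random
from collections import Counter
from itertools import product

def haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype such as "00 01"."""
    loci = genotype.split()
    het = [i for i, g in enumerate(loci) if g[0] != g[1]]
    pairs = []
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        h1 = [g[0] for g in loci]
        h2 = [g[1] for g in loci]
        for locus, flip in zip(het[1:], flips):
            if flip:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

def gibbs_phase(genotypes, n_sweeps=200, pseudocount=0.1, seed=0):
    rng = random.Random(seed)
    options = [haplotype_pairs(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]  # H(0): random start
    unresolved = [i for i, opts in enumerate(options) if len(opts) > 1]
    tallies = [Counter() for _ in genotypes]
    if not unresolved:
        return current, tallies
    for _ in range(n_sweeps * len(unresolved)):
        i = rng.choice(unresolved)
        # Haplotype counts in H_{-i}: everyone except individual i.
        counts = Counter(h for k, pair in enumerate(current) if k != i for h in pair)
        # Crude stand-in for f(Hi | G, H_{-i}): product of smoothed counts.
        weights = [(counts[a] + pseudocount) * (counts[b] + pseudocount)
                   for a, b in options[i]]
        current[i] = rng.choices(options[i], weights=weights)[0]
        tallies[i][current[i]] += 1
    return current, tallies

# Toy data: three resolved individuals and one double heterozygote.
current, tallies = gibbs_phase(["00 00", "00 00", "00 00", "01 01"])
print(tallies[3].most_common())
```

The tallies approximate how often each configuration is sampled for an unresolved individual. In practice early iterations would be discarded before summarising.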

  42. PHASE algorithm (5) Key principle… • Configuration Hi is more likely to consist of haplotypes Hi1 and Hi2 that are exactly the same as, or similar to, haplotypes in the configuration H-i.

  43. Example (1) • Resolved haplotypes 22544 and 22334. • Unresolved individual 22 22 35 34 44.

  44. Example (1) • Resolved haplotypes 22544 and 22334. • Unresolved individual 22 22 35 34 44. • Possible configurations… (1) 22334 / 22544 (2) 22534 / 22344

  45. Example (1) • Resolved haplotypes 22544 and 22334. • Unresolved individual 22 22 35 34 44. • Possible configurations… (1) 22334 / 22544 (2) 22534 / 22344 -1 and +1

  46. Example (1) • Resolved haplotypes 22544 and 22334. • Unresolved individual 22 22 35 34 44. • Possible configurations… (1) 22334 / 22544 (2) 22534 / 22344 -1 and +1 • Assign high probability to sampling configuration (1).
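The comparison in this example can be reproduced with a toy score: for each candidate configuration, sum the allele differences between each haplotype and its closest resolved haplotype. This score only illustrates the "exactly the same as, or similar to" principle; it is not the probability model used by PHASE. The helper names and the single-digit allele encoding are assumptions.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype such as "22 35 34"."""
    loci = genotype.split()
    het = [i for i, g in enumerate(loci) if g[0] != g[1]]
    pairs = []
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        h1 = [g[0] for g in loci]
        h2 = [g[1] for g in loci]
        for locus, flip in zip(het[1:], flips):
            if flip:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

def step_distance(hap, known):
    """Total allele difference between hap and the closest known haplotype."""
    return min(sum(abs(int(a) - int(b)) for a, b in zip(hap, k)) for k in known)

resolved_haps = ["22544", "22334"]
configs = haplotype_pairs("22 22 35 34 44")
scores = {pair: sum(step_distance(h, resolved_haps) for h in pair) for pair in configs}
best = min(scores, key=scores.get)
print(scores)
print("closest to resolved haplotypes:", " / ".join(best))
```

Configuration (1) scores 0 because both haplotypes are already present among the resolved haplotypes, while configuration (2) scores 2, one step for each haplotype.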
