
Haplotyping Algorithm

Haplotyping Algorithm. Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 6, 2008. Haplotyping….


Presentation Transcript


  1. Haplotyping Algorithm. Qunyuan Zhang, Division of Statistical Genomics. GEMS Course M21-621, Computational Statistical Genetics. Mar. 6, 2008

  2. Haplotyping… Using molecular and/or mathematical techniques to measure/infer haplotypes of a subject (or a set of subjects), given a set of genetic markers/loci (locus number L≥2)

  3. Questions WHAT is haplotype? WHY study haplotype? WHY use algorithm in haplotyping? HOW ? (Data, Hypotheses, Algorithms)

  4. WHAT is Haplotype? A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype." In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project. From http://en.wikipedia.org/wiki/Haplotype

  5. Haplotype = Genotype of Haploid. Haplotypes AB//ab give genotype Aa Bb; haplotypes Ab//aB also give genotype Aa Bb. Example: haplotype C G plus haplotype T A give genotype CT GA.

  6. WHY Study Haplotype? An efficient way of representing genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology • Population evolution • LD analysis • Missing genotype imputation • IBD estimation • Tag marker (SNP) selection • Multi-locus linkage & association • …

  7. WHY use algorithm in haplotyping? • Most current molecular genotyping techniques mix DNA pieces from the two homologous chromosomes and only provide diploid genotypes (mixtures of haplotypes): genotypes → haplotypes? • Some molecular techniques can directly measure haplotypes, but they are expensive (money, labor, time …), especially for genome-wide studies. • So, at least for now, we need algorithms …

  8. Ambiguity of Haplotype. Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown
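The ambiguity described above can be made concrete by enumerating every haplotype pair that is consistent with a genotype. The following is a minimal sketch, assuming biallelic loci written as two-character strings such as "Aa"; the function name `compatible_pairs` is hypothetical and not from the slides.

```python
from itertools import product

def compatible_pairs(genotype):
    """Return the unordered haplotype pairs consistent with a genotype.

    genotype: list of two-character strings, one per locus, e.g. ["Aa", "Bb"]
    (an illustrative encoding, not one used in the slides).
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))  # sort so h1//h2 equals h2//h1
    return sorted(pairs)

print(compatible_pairs(["Aa", "BB"]))  # one heterozygous locus: a single pair
print(compatible_pairs(["Aa", "Bb"]))  # two heterozygous loci: two pairs
```

With k ≥ 1 heterozygous loci the function returns 2^(k-1) pairs, so a genotype is unambiguous only when at most one locus is heterozygous.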

  9. Rule-based Approaches (Parsimony & Phylogeny). Search for an optimal set of haplotypes that satisfies some specific rules

  10. Parsimony Approaches. Parsimony rules: maximum resolution of genotypes and/or minimum set of haplotypes. Clark's Algorithm: 1. List all unambiguous haplotypes. 2. Resolve ambiguous individuals one by one using listed haplotypes. 3. If only half-resolved, add the new haplotype to the list. 4. Continue 2 & 3. 5. Until no one can be resolved. Example: haplotype list ABC, abc, abC, Abc; AaBbCC => ABC//abC; AABbCc => ABC//Abc; continue … until no one can be resolved. Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
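The five steps of Clark's algorithm can be sketched as follows. This is a simplified illustration, not Clark's original implementation: genotypes are assumed to be lists of two-character strings, the names `pairs_for` and `clark` are hypothetical, and ties are broken by taking the first matching pair in sorted order (the real algorithm's result also depends on processing order).

```python
from itertools import product

def pairs_for(genotype):
    """Unordered haplotype pairs consistent with a genotype like ["Aa", "Bb"]."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        out.add(tuple(sorted((h1, h2))))
    return sorted(out)

def clark(genotypes):
    known = []     # haplotype list, in order of discovery
    resolved = {}  # individual index -> (h1, h2)

    def remember(pair):
        for h in pair:
            if h not in known:
                known.append(h)

    # Step 1: individuals with a single compatible pair are unambiguous.
    for i, g in enumerate(genotypes):
        options = pairs_for(g)
        if len(options) == 1:
            resolved[i] = options[0]
            remember(options[0])

    # Steps 2-5: resolve ambiguous individuals with listed haplotypes,
    # adding each newly inferred haplotype, until nothing more changes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for pair in pairs_for(g):
                if pair[0] in known or pair[1] in known:
                    resolved[i] = pair
                    remember(pair)
                    progress = True
                    break
    return known, resolved

genotypes = [["AA", "BB", "CC"], ["aa", "bb", "cc"],
             ["Aa", "Bb", "CC"], ["AA", "Bb", "Cc"]]
known, resolved = clark(genotypes)
print(known)     # haplotype list after resolution
print(resolved)  # phase assigned to each individual
```

On this input the sketch reproduces the slide's example: AaBbCC is resolved as ABC//abC and AABbCc as ABC//Abc. Individuals that never match a listed haplotype stay unresolved, which is the known limitation of the rule.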

  11. Phylogeny Approaches Given a set of genotypes, find a set of explaining haplotypes, which defines a perfect phylogeny. Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination, infinite-site mutation) D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. In Comput. Mol. Biology, p166–175.

  12. Probability-based Approaches (EM & MCMC). Calculate the probability of a haplotype, conditional on genotypes. Pr(H|G)=?

  13. Pr(H|G)=? [Diagram labels: Mutation, Selection, Admixture, Drift (gene frequencies); Linkage, Recombination, LD (haplotype frequencies); HWE (genotype frequencies); Haplotype + Haplotype → Genotype. Epidemiologic data: Sample, Phenotype, Environment Factors; Pr(P|G,E)=?]

  14. Data Structure for Haplotyping. [Diagram labels: Loci (A, B, C, …) by Subjects (1, 2, 3, …); Genotypes → Haplotypes. Across loci: Linkage, Gene/haplotype frequencies, HWE, LD. Across subjects: Genetic Relationship.]

  15. HWE & LD. Hardy-Weinberg Equilibrium (HWE) / Hardy-Weinberg Disequilibrium (HWD). HWE: random combination of allelic genes (same locus). Under HWE, allele freq. determines genotype freq.: HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a). Linkage Equilibrium (LE) / Linkage Disequilibrium (LD). LE: random combination of genes from different loci. LD: association between genes from different loci. Under LE, allele freq. determines haplotype freq.: LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
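The HWE and LE product rules above translate directly into arithmetic. The sketch below assumes a biallelic locus with alleles A and a; the function names are hypothetical. The pairwise coefficient D = Pr(AB) - Pr(A)*Pr(B) is the standard LD measure and is included as an assumption, since it does not appear on the slide.

```python
def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2.0 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under LE: the product of its allele frequencies."""
    freq = 1.0
    for p in allele_freqs:
        freq *= p
    return freq

def ld_coefficient(p_AB, p_A, p_B):
    """Standard pairwise LD coefficient D (not shown on the slide); D = 0 under LE."""
    return p_AB - p_A * p_B

print(hwe_genotype_freqs(0.6))             # AA, Aa, aa frequencies
print(le_haplotype_freq([0.5, 0.4, 0.2]))  # Pr(ABC) under LE
print(ld_coefficient(0.5, 0.5, 0.5))       # positive D: A and B co-occur more than under LE
```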

  16. Genetic Relationship (R) & Linkage (r). [Figure: pedigree with genotypes AABB, aabb and AaBb.] AaBb: AB//ab or aB//Ab. Recombination rate (r): r=0, complete linkage; 0<r<0.5, incomplete linkage; r=0.5, no linkage. In the pedigree: AB//ab (if r=0); AB//ab or Ab//aB (if r>0)
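To see how r links a parent's phase to what the parent transmits, the sketch below computes two-locus gamete probabilities. It relies on the standard result, assumed here rather than taken from the slide, that each non-recombinant gamete has probability (1-r)/2 and each recombinant gamete r/2; `gamete_probs` is a hypothetical name.

```python
def gamete_probs(h1, h2, r):
    """Gamete probabilities for a parent with two-locus haplotypes h1//h2.

    r is the recombination rate between the two loci (0 <= r <= 0.5).
    """
    non_rec = (1.0 - r) / 2.0  # each parental (non-recombinant) gamete
    rec = r / 2.0              # each recombinant gamete
    probs = {}
    gametes = [(h1, non_rec), (h2, non_rec),
               (h1[0] + h2[1], rec), (h2[0] + h1[1], rec)]
    for gamete, p in gametes:
        # Identical gametes (possible when a locus is homozygous) are pooled.
        probs[gamete] = probs.get(gamete, 0.0) + p
    return probs

print(gamete_probs("AB", "ab", 0.0))  # complete linkage: only AB and ab
print(gamete_probs("AB", "ab", 0.5))  # no linkage: all four equally likely
```

At r=0 the parent AB//ab can transmit only AB or ab, which is why family data with tight linkage removes phase ambiguity in the offspring.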

  17. Haplotyping & Conditional Probability. AaBB: Pr(AB//aB)=1. AABb: Pr(AB//Ab)=1. AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5. Sample genotypes: AABB, aabb, AABB, aabb, AABB, AABb, aabb, AaBB, aabb, AABB, AABB, AABB, AABB, aabb, aabb, AABB, AABB, AABB, AaBb, AABB, aabb, aabb, AABB, AABB, aabb, AABB, aabb, AABB … Is Pr(AB//ab)=Pr(Ab//aB)=0.5 still true? P(H|G)=? HWE or HWD? LD or LE? P(H|G, R, r)=?

  18. EM Algorithm for unrelated individuals. Pr(Ha,b|G,F)=? LD: Pr(ABC)≠Pr(A)*Pr(B)*Pr(C). Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927. Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)

  19. Likelihood: L(G|F). Haplotypes (H), Haplotype Frequencies (F), Genotypes (G). Joint likelihood of G given F; probability of the k-th individual's G given F & HWE; haplotype-genotype compatibility index of the k-th individual. F=? => Max. L(G|F)
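The equations for this slide did not survive in the transcript. A plausible reconstruction from the three labels, in the form commonly used for EM haplotype-frequency estimation, is given below. The symbols n (number of individuals), f_a (frequency of haplotype a) and c_{k,ab} (compatibility index) are assumed notation, not taken from the slide.

```latex
% Joint likelihood of all genotypes G = (G_1, ..., G_n) given frequencies F
L(G \mid F) = \prod_{k=1}^{n} \Pr(G_k \mid F)

% Probability of the k-th individual's genotype given F and HWE,
% summing over ordered haplotype pairs (a, b)
\Pr(G_k \mid F) = \sum_{a} \sum_{b} c_{k,ab} \, f_a \, f_b

% Compatibility index of the k-th individual
c_{k,ab} =
\begin{cases}
1 & \text{if the pair } (a, b) \text{ is compatible with } G_k \\
0 & \text{otherwise}
\end{cases}
```

Because the sum runs over ordered pairs, a heterozygous pair a ≠ b contributes 2 f_a f_b, which matches the HWE rule Pr(Aa)=2*Pr(A)*Pr(a) stated earlier.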

  20. EM Algorithm: Maximum Likelihood Estimation of Haplotype Freq. [Equation labels: Lagrange multiplier; Partial Derivative Equations; EM Recursion.] Prior → Expectation → Maximization → E → M → E → M …

  21. Posterior Probability of Haplotype. Prior Prob. → Posterior Prob.
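The E and M steps and the posterior probability of a haplotype pair can be sketched as follows. This is a minimal illustration under stated assumptions (unrelated individuals, HWE, biallelic loci encoded as two-character strings, a uniform starting F, and a fixed number of iterations instead of a convergence test); the names `ordered_pairs` and `em_haplotype_freqs` are hypothetical and this is not the HAPLO implementation.

```python
from itertools import product
from collections import defaultdict

def ordered_pairs(genotype):
    """All ordered haplotype pairs (a, b) compatible with a genotype."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        a = "".join(g[c] for g, c in zip(genotype, choice))
        b = "".join(g[1 - c] for g, c in zip(genotype, choice))
        out.add((a, b))
    return sorted(out)

def em_haplotype_freqs(genotypes, n_iter=50):
    compat = [ordered_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in compat for pair in pairs for h in pair})
    f = {h: 1.0 / len(haps) for h in haps}  # prior: uniform starting frequencies
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in compat:
            # E-step: posterior of each compatible pair, proportional to f_a * f_b.
            weights = [f[a] * f[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                post = w / total
                counts[a] += post
                counts[b] += post
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        f = {h: counts[h] / (2.0 * len(genotypes)) for h in haps}
    return f

sample = [["AA", "BB"]] * 3 + [["aa", "bb"]] * 3 + [["Aa", "Bb"]]
for hap, freq in em_haplotype_freqs(sample).items():
    print(hap, round(freq, 4))
```

In this sample the six homozygous individuals make AB and ab common, so the posterior for the AaBb individual shifts almost entirely to AB//ab instead of staying at the uninformed 0.5.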

  22. Limitation of EM Algorithm • For a diploid (2n) organism, a genotype of L heterozygous markers may have 2^L possible haplotypes, so EM is impractical for large L • Only suitable for a small number of loci, 2~12 • When L=20, 2^L=1,048,576 … a large space of F • Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …

  23. MCMC: Markov Chain Monte Carlo Algorithm for unrelated individuals, by sampling from Pr(H|G,F). Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)

  24. MCMC Estimation (Markov Chain). Random sampling based on Pr(H|G,H_); repeat many times; after getting close to the stationary distribution of P(H|G), collect samples; average over samples
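The sampling loop above can be illustrated with a deliberately simplified Gibbs sampler. Two assumptions are made here and should be read as such: H_ is taken to mean the haplotypes currently assigned to all other individuals, and the conditional Pr(H|G,H_) is replaced by a simple pseudo-count (Dirichlet-style) weight, not the coalescent-based transition probability that PHASE uses. All names and parameters (`gibbs_phase`, `alpha`, `burn_in`) are hypothetical.

```python
import random
from itertools import product
from collections import Counter

def pairs_for(genotype):
    """Unordered haplotype pairs consistent with a genotype like ["Aa", "Bb"]."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        out.add(tuple(sorted((h1, h2))))
    return sorted(out)

def gibbs_phase(genotypes, n_sweeps=200, burn_in=50, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [pairs_for(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]  # random initial phases
    counts = Counter(h for pair in state for h in pair)
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            # Remove individual k so the counts describe everyone else (H_).
            for h in state[k]:
                counts[h] -= 1
            # Simplified conditional: favour haplotypes seen in other individuals.
            weights = [(counts[a] + alpha) * (counts[b] + alpha) for a, b in opts]
            state[k] = rng.choices(opts, weights=weights)[0]
            for h in state[k]:
                counts[h] += 1
        if sweep >= burn_in:  # collect samples only after burn-in
            for k, pair in enumerate(state):
                tallies[k][pair] += 1
    kept = n_sweeps - burn_in
    # Average over samples: estimated posterior of each phase per individual.
    return [{pair: c / kept for pair, c in t.items()} for t in tallies]

sample = [["AA", "BB"]] * 3 + [["aa", "bb"]] * 3 + [["Aa", "Bb"]]
print(gibbs_phase(sample)[6])  # estimated phase probabilities for the AaBb individual
```

The structure (update one individual at a time, discard a burn-in, average the rest) mirrors the slide; only the form of the conditional distribution is a stand-in.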

  25. Transition Probability. Subsetting loci to reduce time. Coalescent hypothesis, mutation rate, M haplotypes. Add the newly constructed haplotype to list H, pick Gk+1 …

  26. EM vs. MCMC

  27. EM Algorithm for family data (no recombination, r=0). Pr(Ha,b{fam.}|G,R,F)=? Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO). Becker et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP). O’Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)

  28. Haplotype Configuration of Family Genotypes. [Figure: family members with genotype AaBb and their candidate haplotypes AB//ab, Ab//aB.] Possible haplotype configurations: those requiring a recombinant are impossible or have very low probability when r=0 or nearly 0, and are ignored

  29. EM Algorithm: Haplotype Freq. Estimation using Nuclear Families. Tips: Only use parents to calculate haplotype freq. (f). Use parents' + children's info to determine compatibility (c)

  30. EM Algorithm: Haplotype Freq. Estimation for General Pedigrees. Tips: Only use founders to calculate haplotype freq. (f). Use all members (founders & non-founders) to determine compatibility (c). Discard the cases with too small probabilities to save time

  31. Posterior Probability of Haplotype Configuration. [Figure labels: Dad, Mom]

  32. A Middle Summary … Subject-oriented Algorithms: joint prob./likelihood across loci A, B, C, computed individual by individual (unrelated) or family by family (r=0). Large/General Pedigree & Allowing Recombination (r>0)?

  33. Next … Locus-oriented Algorithm (Lander-Green), for Large/General Pedigree Data & Allowing Recombination (r>0). Joint prob./likelihood for a pedigree, computed locus by locus (A, B, C, …)

  34. Inheritance Vector (V) of a pedigree. [Figure: pedigree at locus A with inheritance-vector probabilities.] Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367. Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER). Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN). Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)

  35. Inheritance Vector & Haplotype. [Figure: individual 5, genotype AaBb; inheritance vectors 1101, 1101 shown with AB//ab, and 1101, 1111 shown with Ab//aB.]

  36. Lander-Green Algorithm. One pedigree; loci A, B, C, … Hidden states (inheritance vectors): VA, VB, VC, … Transition Prob. = f(r): Pr(VB|VA), Pr(VC|VB), …, Pr(Vt+1|Vt). Observations (genotypes): GA, GB, GC. Emission Prob.: Pr(GA|VA), Pr(GB|VB), Pr(GC|VC)
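The hidden Markov structure above can be sketched with the forward algorithm. This is an illustrative outline, not the GENEHUNTER or MERLIN implementation: inheritance vectors are tuples of 0/1 bits (one per meiosis), the transition probability between adjacent loci is assumed to be r^d * (1-r)^(m-d) for d differing bits out of m, the prior over vectors is uniform, and the emission probabilities Pr(G_t|V_t) are supplied by the caller. The names `transition` and `forward_likelihood` are hypothetical.

```python
from itertools import product

def transition(v, w, r):
    """Pr(V_{t+1} = w | V_t = v): each differing bit costs r, each equal bit 1 - r."""
    d = sum(x != y for x, y in zip(v, w))
    return (r ** d) * ((1 - r) ** (len(v) - d))

def forward_likelihood(emissions, rs, n_bits):
    """Pedigree likelihood summed over all inheritance-vector paths.

    emissions[t][v] = Pr(G_t | V_t = v) for locus t (supplied by the caller);
    rs[t] = recombination rate between locus t and locus t + 1;
    n_bits = number of meioses in the pedigree.
    """
    states = list(product((0, 1), repeat=n_bits))
    prior = 1.0 / len(states)  # all inheritance vectors equally likely a priori
    alpha = {v: prior * emissions[0][v] for v in states}
    for t in range(1, len(emissions)):
        alpha = {
            w: emissions[t][w]
            * sum(alpha[v] * transition(v, w, rs[t - 1]) for v in states)
            for w in states
        }
    return sum(alpha.values())

states = list(product((0, 1), repeat=2))
# Toy emissions: locus A fits only vector (0, 1); locus B fits only vector (1, 1).
e_A = {v: 1.0 if v == (0, 1) else 0.0 for v in states}
e_B = {v: 1.0 if v == (1, 1) else 0.0 for v in states}
print(forward_likelihood([e_A, e_B], rs=[0.0], n_bits=2))  # needs a recombination
print(forward_likelihood([e_A, e_B], rs=[0.5], n_bits=2))
```

The state space has 2^n_bits vectors, so the cost grows linearly with the number of loci but exponentially with pedigree size, the reverse of the subject-oriented EM approach.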

  37. Lander-Green Algorithm Based (or Similar) Approaches. Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER): Viterbi algorithm, the best haplotype configuration. Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2): MCMC, Annealing & Metropolis Process. Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN): Allowing LD & Marker Cluster/Block

  38. Practices (1) If a child’s genotype of 4 loci is AaBbCcDD, list all possible haplotype pairs of the child, calculate the probability of each pair. (2) If you know his/her father’s genotype is also AaBbCcDD and mother is AaBbCCDD, list all possible haplotype configurations of his/her family, calculate the probability of each configuration. (Assume recombination rate r=0) (3) Randomly assign a frequency to each haplotype in (1), say, f(ABCD)=0.4,f(abcD)=0.2,…,etc. Make sure the sum=1. Take these frequencies as the true haplotype frequencies in population, recalculate the (posterior) probabilities in (1) and (2). Within a week, send your answers to (E-mail: qunyuan@wustl.edu)
