Haplotyping Algorithm. Qunyuan Zhang, Division of Statistical Genomics. GEMS Course M21-621, Computational Statistical Genetics. Mar. 6, 2008.
Haplotyping… Using molecular and/or mathematical techniques to measure/infer the haplotypes of a subject (or a set of subjects), given a set of genetic markers/loci (number of loci L≥2)
Questions WHAT is haplotype? WHY study haplotype? WHY use algorithm in haplotyping? HOW ? (Data, Hypotheses, Algorithms)
WHAT is Haplotype? A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype." In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project. From http://en.wikipedia.org/wiki/Haplotype
Haplotype = Genotype of Haploid. Example: haplotypes C-G and T-A give genotype CT GA. Haplotypes AB//ab give genotype Aa Bb; haplotypes Ab//aB also give genotype Aa Bb.
WHY Study Haplotype? An efficient way of representing genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology • Population evolution • LD analysis • Missing genotype imputation • IBD estimation • Tag marker (SNP) selection • Multi-locus linkage & association • …
WHY use algorithm in haplotyping? • Most current molecular genotyping techniques mix DNA pieces from the two homologous chromosomes and only provide diploid genotypes (a mixture of haplotypes): genotypes → haplotypes? • Some molecular techniques can directly measure haplotypes, but they are expensive (money, labor, time …), especially for genome-wide studies. • So, at least for now, we need algorithms …
Ambiguity of Haplotype: haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown
Rule-based Approaches (Parsimony & Phylogeny): Search for an optimal set of haplotypes that satisfies some specific rules
Parsimony Approaches
Parsimony rules: maximum resolution of genotypes and/or a minimum set of haplotypes.
Clark's Algorithm: 1. List all unambiguous haplotypes. 2. Resolve ambiguous individuals one by one using the listed haplotypes. 3. If an individual is only half-resolved, add the new haplotype to the list. 4. Repeat steps 2 and 3. 5. Stop when no one else can be resolved.
Example: haplotype list ABC, abc. AaBbCC => ABC//abC (abC joins the list); AABbCc => ABC//Abc (Abc joins the list). Continue … until no one else can be resolved.
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
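The steps of Clark's algorithm can be sketched in a few lines of Python. This is a minimal illustration, not Clark's original implementation: the genotype encoding (one two-character string per locus) and the function names `complement` and `clark` are assumptions made for this example, and the result depends on the order in which individuals and haplotypes are visited.

```python
def is_unambiguous(genotype):
    # With at most one heterozygous locus the phase is known.
    return sum(g[0] != g[1] for g in genotype) <= 1

def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`,
    or None if `hap` is not compatible with the genotype."""
    other = []
    for g, allele in zip(genotype, hap):
        if allele == g[0]:
            other.append(g[1])
        elif allele == g[1]:
            other.append(g[0])
        else:
            return None
    return "".join(other)

def clark(genotypes):
    known = []      # haplotype list, in order of discovery
    resolved = {}   # individual index -> (haplotype 1, haplotype 2)
    # Step 1: list all unambiguous haplotypes.
    for i, g in enumerate(genotypes):
        if is_unambiguous(g):
            pair = ("".join(x[0] for x in g), "".join(x[1] for x in g))
            resolved[i] = pair
            for h in pair:
                if h not in known:
                    known.append(h)
    # Steps 2-5: resolve ambiguous individuals until nothing changes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)  # step 3: new haplotype
                    progress = True
                    break
    return resolved, known

# The example from the slide: AABBCC, aabbcc, AaBbCC, AABbCc
genotypes = [("AA", "BB", "CC"), ("aa", "bb", "cc"),
             ("Aa", "Bb", "CC"), ("AA", "Bb", "Cc")]
resolved, known = clark(genotypes)
print(resolved)
print(known)
```

Individuals that are compatible with no listed haplotype simply stay out of `resolved`, which corresponds to the stopping condition in step 5.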
Phylogeny Approaches Given a set of genotypes, find a set of explaining haplotypes, which defines a perfect phylogeny. Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination, infinite-site mutation) D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. In Comput. Mol. Biology, p166–175.
Probability-based Approaches (EM & MCMC): Calculate the probability of haplotypes, conditional on genotypes. Pr(H|G)=?
Pr(H|G) = ? [Diagram. Factors: mutation, selection, admixture, drift (gene frequencies); linkage, recombination, LD (haplotype frequencies); HWE (genotype frequencies). Epidemiologic data: haplotype, genotype, sample, phenotype, environment factors.] Pr(P|G,E) = ?
Data Structure for Haplotyping [Diagram. Loci (A, B, C, …): linkage, gene/haplotype frequencies, HWE, LD. Subjects (1, 2, 3, …): genotypes, genetic relationship, haplotypes.]
HWE & LD
Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD). HWE: random combination of allelic genes (same locus). Under HWE, allele freq. determines genotype freq. HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD). LE: random combination of genes from different loci. LD: association between genes from different loci. Under LE, allele freq. determines haplotype freq. LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
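The two product rules can be checked numerically. The short sketch below only restates the HWE and LE formulas from the slide; the function names and the example allele frequencies are made up for illustration.

```python
from math import prod

def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under linkage equilibrium:
    the product of the allele frequencies at each locus."""
    return prod(allele_freqs)

# Hypothetical allele frequencies, chosen only for the example.
print(hwe_genotype_freqs(0.6))
print(le_haplotype_freq([0.5, 0.4, 0.1]))  # Pr(ABC) = Pr(A)*Pr(B)*Pr(C)
```

Under LD the second function no longer applies, which is why haplotype frequencies have to be estimated rather than derived from allele frequencies.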
Genetic Relationship (R) & Linkage (r). AaBb: AB//ab or aB//Ab. Recombination rate (r): r=0, complete linkage; 0<r<0.5, incomplete linkage; r=0.5, no linkage. [Pedigree figure with genotypes AABB, AaBb, AaBb, AaBb, AABB, aabb; phases shown: AB//ab; (if r=0) AB//ab; (if r>0) AB//ab, Ab//aB]
Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1. AABb: Pr(AB//Ab)=1. AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5.
Sample: AABB, aabb, AABB, aabb, AABB, AABb, aabb, AaBB, aabb, AABB, AABB, AABB, AABB, aabb, aabb, AABB, AABB, AABB, AaBb, AABB, aabb, aabb, AABB, AABB, aabb, AABB, aabb, AABB …
Given such a sample, is Pr(AB//ab)=Pr(Ab//aB)=0.5 still right? P(H|G)=? HWE or HWD? LD or LE? P(H|G, R, r)=?
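The equal-probability starting point can be made concrete by enumerating every haplotype pair compatible with a genotype. This is a minimal sketch; the encoding (one two-character string per locus) and the function names are assumptions for illustration. It uses no population information, which is exactly the limitation the questions on this slide point at.

```python
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with a genotype,
    given as one two-character string per locus, e.g. ("Aa", "Bb")."""
    pairs = set()
    # For each locus, choose which allele goes on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def equal_phase_probabilities(genotype):
    """Give every compatible pair the same probability
    (no frequency, relationship, or linkage information used)."""
    pairs = haplotype_pairs(genotype)
    return {p: 1 / len(pairs) for p in pairs}

print(equal_phase_probabilities(("Aa", "BB")))  # a single pair
print(equal_phase_probabilities(("Aa", "Bb")))  # two pairs
```

A genotype with h heterozygous loci yields 2^(h-1) unordered pairs, so the enumeration grows quickly with the number of heterozygous markers.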
EM Algorithm for unrelated individuals: Pr(Ha,b|G,F)=? LD: Pr(ABC)≠Pr(A)*Pr(B)*Pr(C) • Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927 • Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)
Likelihood: L(G|F). Notation: haplotypes (H), haplotype frequencies (F), genotypes (G). Components: the joint likelihood of G given F; the probability of the k-th individual's G given F & HWE; the haplotype-genotype compatibility index of the k-th individual. Goal: F=? => Max. L(G|F)
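The formulas on this slide were not preserved in the transcript. As a hedged reconstruction from the labels alone, the likelihood for n unrelated individuals under HWE is commonly written as below; the symbols n, f_a, and c_k(a,b) are assumed names and may differ from the author's notation.

```latex
L(G \mid F) = \prod_{k=1}^{n} \Pr(G_k \mid F),
\qquad
\Pr(G_k \mid F) = \sum_{a} \sum_{b} c_k(a,b)\, f_a f_b,
\qquad
c_k(a,b) =
\begin{cases}
1 & \text{if the haplotype pair } (a,b) \text{ is compatible with } G_k \\
0 & \text{otherwise}
\end{cases}
```

Here the double sum runs over ordered pairs, so a pair with a ≠ b is counted twice, which matches the HWE term 2 f_a f_b. The estimate of F maximizes L(G|F) subject to the frequencies summing to 1.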
EM Algorithm: Maximum Likelihood Estimation of Haplotype Freq. [Equations not preserved: Lagrange multiplier, partial derivative equations, EM recursion.] Prior → Expectation (E) → Maximization (M) → E → M → …
Posterior Probability of Haplotype Prior Prob. Posterior Prob.
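The E and M steps and the posterior probability of a haplotype pair can be sketched as follows. This is an illustrative implementation for unrelated individuals under HWE, written from the general description on these slides rather than from the HAPLO software; the uniform starting frequencies, the fixed iteration count, and all function names are assumptions.

```python
from itertools import product
from collections import defaultdict

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with a genotype
    given as one two-character string per locus."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def posterior(genotype, freq):
    """Pr(pair | genotype, F) under HWE: prior f_a*f_b (times 2 if a != b),
    renormalized over the pairs compatible with the genotype."""
    pairs = haplotype_pairs(genotype)
    weights = [(1 if a == b else 2) * freq.get(a, 0.0) * freq.get(b, 0.0)
               for a, b in pairs]
    total = sum(weights)
    return {p: w / total for p, w in zip(pairs, weights)}

def em_haplotype_freqs(genotypes, n_iter=50):
    haps = sorted({h for g in genotypes
                   for pair in haplotype_pairs(g) for h in pair})
    freq = {h: 1 / len(haps) for h in haps}   # uniform start (assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: expected haplotype counts given the current frequencies.
        for g in genotypes:
            for (a, b), p in posterior(g, freq).items():
                counts[a] += p
                counts[b] += p
        # M-step: new frequencies are expected counts per chromosome.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Toy sample: mostly AABB and aabb, plus one double heterozygote.
sample = [("AA", "BB")] * 5 + [("aa", "bb")] * 4 + [("Aa", "Bb")]
freq = em_haplotype_freqs(sample)
print(freq)
print(posterior(("Aa", "Bb"), freq))
```

In this toy sample the double heterozygote ends up almost entirely assigned to AB//ab, because Ab and aB are never observed in the unambiguous individuals.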
Limitation of EM Algorithm • For a diploid (2n) organism, a genotype with L heterozygous markers may have 2^L possible haplotypes, so EM is impractical for large L • Only suitable for a small number of loci, 2~12 • When L=20, 2^L=1,048,576 … a large space of F • Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …
MCMC: Markov Chain Monte Carlo Algorithm for unrelated individuals, by sampling from Pr(H|G,F) • Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)
MCMC Estimation (Markov chain): 1. Random sampling based on Pr(H|G,H_). 2. Repeat many times. 3. After getting close to the stationary distribution of P(H|G), collect samples. 4. Average over samples.
Transition Probability: subsetting loci to reduce time; coalescent hypothesis, mutation rate, M haplotypes. Add the newly constructed haplotype to list H, pick Gk+1 …
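The sampling loop can be illustrated with a deliberately simplified Gibbs sampler. Note the swap: instead of the coalescent-based transition probability used by PHASE, this sketch weights each compatible haplotype pair by how often its haplotypes currently appear in the other individuals, plus a small pseudo-count. The pseudo-count `alpha`, the sweep counts, and the function names are all assumptions for illustration.

```python
import random
from collections import Counter
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with a genotype
    given as one two-character string per locus."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def gibbs_phase(genotypes, n_sweeps=200, burn_in=50, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [haplotype_pairs(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]   # random start
    counts = Counter(h for pair in current for h in pair)
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            # Remove individual k, leaving the haplotypes of everyone else.
            for h in current[k]:
                counts[h] -= 1
            # Simplified weights (NOT the PHASE coalescent approximation).
            # All options of one genotype are either all heterozygous pairs
            # or a single homozygous pair, so no factor of 2 is needed.
            weights = [(counts[a] + alpha) * (counts[b] + alpha)
                       for a, b in opts]
            current[k] = rng.choices(opts, weights=weights)[0]
            for h in current[k]:
                counts[h] += 1
            if sweep >= burn_in:          # collect after burn-in
                tallies[k][current[k]] += 1
    kept = n_sweeps - burn_in
    # Average over samples: estimated Pr(pair | data) per individual.
    return [{p: c / kept for p, c in t.items()} for t in tallies]

sample = [("AA", "BB")] * 5 + [("aa", "bb")] * 4 + [("Aa", "Bb")]
for g, post in zip(sample, gibbs_phase(sample)):
    print(g, post)
```

The structure mirrors the slide: sample one individual conditional on the rest, repeat many times, discard the early sweeps, and average the collected samples.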
EM Algorithm for family data (no recombination, r=0): Pr(Ha,b{fam.}|G,R,F)=? • Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO) • Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP) • O'Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)
Haplotype Configuration of Family
Genotypes and possible haplotype configurations (one row per family member, one column per configuration):
AaBb: AB//ab | Ab//aB | AB//ab
AaBb: AB//ab | Ab//aB | AB//ab
AaBb: AB//ab | Ab//aB | Ab//aB
Recombinant configurations are impossible (r=0) or have very low probability (r nearly 0), and are ignored.
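For a father-mother-child trio with r=0, the non-recombinant configurations can be listed by brute force: try every parental haplotype pair and keep the cases where the child can receive one intact haplotype from each parent. This is a minimal sketch; the genotype encoding and the function names are assumptions for illustration, and the child's pair is reported in (paternal, maternal) order.

```python
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with a genotype
    given as one two-character string per locus."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def genotype_of(h1, h2):
    """Unphased genotype produced by two haplotypes."""
    return tuple("".join(sorted(pair)) for pair in zip(h1, h2))

def trio_configurations(father, mother, child):
    """Non-recombinant (r=0) configurations for a trio.
    Each result is (father pair, mother pair, (paternal, maternal))."""
    target = tuple("".join(sorted(g)) for g in child)
    configs = []
    for f_pair in haplotype_pairs(father):
        for m_pair in haplotype_pairs(mother):
            for hf in sorted(set(f_pair)):      # haplotype from father
                for hm in sorted(set(m_pair)):  # haplotype from mother
                    if genotype_of(hf, hm) == target:
                        configs.append((f_pair, m_pair, (hf, hm)))
    return configs

# A double-heterozygous father, a homozygous mother, and an AaBb child.
for cfg in trio_configurations(("Aa", "Bb"), ("aa", "bb"), ("Aa", "Bb")):
    print(cfg)
```

In this example the child's genotype fixes the father's phase, which is the kind of information the compatibility check extracts from relatives.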
EM Algorithm: Haplotype Freq. Estimation using Nuclear Families. Tips: Only use parents to calculate haplotype freq. (f). Use parents' + children's info to determine compatibility (c).
EM Algorithm: Haplotype Freq. Estimation for General Pedigrees. Tips: Only use founders to calculate haplotype freq. (f). Use all members (founders & non-founders) to determine compatibility (c). Discard the cases with very small probabilities to save time.
A Middle Summary … Subject-oriented Algorithms: the joint probability/likelihood over loci A, B, C is computed individual by individual for unrelated subjects, and family by family (r=0) for families. Large/general pedigrees & allowing recombination (r>0)?
Next … Locus-oriented Algorithm (Lander-Green) for large/general pedigree data & allowing recombination (r>0): the joint probability/likelihood is computed locus by locus (A, B, C, …) for a pedigree.
Prob. Inheritance Vector (V) of a pedigree A • Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367 • Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER) • Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN) • Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
Inheritance Vector & Haplotype [Pedigree figure: individual 5 has genotype AaBb; inheritance vectors shown are 1101 and 1111; haplotype pairs shown are AB//ab and Ab//aB]
Lander-Green Algorithm: loci A, B, C, … of one pedigree. Hidden states (inheritance vectors): VA, VB, VC, … Observations (genotypes): GA, GB, GC, … Transition prob. = f(r): Pr(VB|VA), Pr(VC|VB), in general Pr(Vt+1|Vt). Emission prob.: Pr(GA|VA), Pr(GB|VB), Pr(GC|VC).
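The hidden Markov structure can be sketched with a forward pass over inheritance vectors. The transition rule used here is the standard one for this model: each bit (meiosis) of the vector switches between adjacent loci independently with probability r. Everything else is a placeholder: the emission probabilities would normally be computed from the pedigree genotypes, and here they are made-up numbers; the function names are assumptions.

```python
from itertools import product

def transition_prob(v, w, r):
    """Pr(V_{t+1} = w | V_t = v): each of the n bits flips
    independently with probability r (recombination rate)."""
    d = sum(a != b for a, b in zip(v, w))   # number of flipped bits
    n = len(v)
    return (r ** d) * ((1 - r) ** (n - d))

def forward_likelihood(emissions, r_list, n_bits):
    """Likelihood of the genotypes at all loci.
    emissions[t][v] = Pr(G_t | V_t = v), supplied by the caller;
    r_list[t] = recombination rate between locus t and locus t+1."""
    states = list(product((0, 1), repeat=n_bits))
    # Uniform prior over inheritance vectors at the first locus.
    alpha = {v: emissions[0][v] / len(states) for v in states}
    for t in range(1, len(emissions)):
        r = r_list[t - 1]
        alpha = {
            w: emissions[t][w]
               * sum(alpha[v] * transition_prob(v, w, r) for v in states)
            for w in states
        }
    return sum(alpha.values())

# Two meioses (2 bits), three loci A, B, C; emissions are hypothetical.
states = list(product((0, 1), repeat=2))
emissions = [
    {v: 0.5 if v[0] == 0 else 0.1 for v in states},   # Pr(G_A | V_A)
    {v: 0.3 for v in states},                          # Pr(G_B | V_B)
    {v: 0.4 if v == (0, 1) else 0.2 for v in states},  # Pr(G_C | V_C)
]
print(forward_likelihood(emissions, r_list=[0.1, 0.2], n_bits=2))
```

The number of states is 2 to the power of the number of meioses, so this locus-by-locus recursion scales well with the number of loci but not with pedigree size.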
Lander-Green Algorithm Based (or Similar) Approaches • Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER): Viterbi algorithm, the best haplotype configuration • Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2): MCMC, annealing & Metropolis process • Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN): allowing LD & marker clusters/blocks
Practices (1) If a child’s genotype of 4 loci is AaBbCcDD, list all possible haplotype pairs of the child, calculate the probability of each pair. (2) If you know his/her father’s genotype is also AaBbCcDD and mother is AaBbCCDD, list all possible haplotype configurations of his/her family, calculate the probability of each configuration. (Assume recombination rate r=0) (3) Randomly assign a frequency to each haplotype in (1), say, f(ABCD)=0.4,f(abcD)=0.2,…,etc. Make sure the sum=1. Take these frequencies as the true haplotype frequencies in population, recalculate the (posterior) probabilities in (1) and (2). Within a week, send your answers to (E-mail: qunyuan@wustl.edu)