ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis. Ion Mandoiu (University of Connecticut) Alexander Zelikovsky (Georgia State University). Outline. Background on genetic variation Genotype phasing Error detection Disease association search
Outline • Background on genetic variation • Genotype phasing • Error detection • Disease association search • Disease susceptibility prediction
Single Nucleotide Polymorphisms • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • High density in the human genome: 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs • Example: … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …
Haplotypes and Genotypes • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele; 2 if the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g. 011100110 + 001000010 = 021200210
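The 0/1/2 encoding above can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` is hypothetical and chosen only for this example; it combines two 0/1 haplotype strings into the genotype they explain.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Same allele on both chromosomes: keep it (0 or 1).
    # Different alleles: the SNP is heterozygous, encoded as 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The two haplotypes shown on the slide give the slide's genotype.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```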
Why SNPs? • Identification and fine mapping of disease-related genes • Methods: Linkage analysis, allele-sharing, association studies • Genotype data: large pedigrees, sibling pairs, trios, unrelated
Challenges in SNP Data Analysis • Latest technologies deliver 1M SNP genotypes per sample, at low cost • Major challenges • Efficiency • Reproducibility Need simple methods!
Genotype Phasing • For a genotype with k 2’s there are 2^(k-1) possible pairs of haplotypes explaining it • Example: g: 0010212 is explained by either (h1: 0010111, h2: 0010010) or (h3: 0010011, h4: 0010110) • Computational approaches to genotype phasing • Statistical methods: PHASE, Phamily, PL, GERBIL … • Combinatorial methods: Parsimony, HAP, 2SNP, ENT …
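The count of explaining pairs can be checked by enumeration. The sketch below is illustrative only (`explaining_pairs` is a hypothetical helper name): it fixes the allele of the first haplotype at the first heterozygous SNP so that each unordered pair is produced exactly once.

```python
from itertools import product


def explaining_pairs(g: str):
    """List all unordered haplotype pairs that explain genotype g (0/1/2 string)."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]  # fully homozygous: a single pair of identical haplotypes
    pairs = []
    # Allele 0 is fixed on h1 at the first heterozygous SNP; the remaining
    # k-1 heterozygous SNPs are free, giving 2^(k-1) pairs.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


for pair in explaining_pairs("0010212"):
    print(pair)
```

For the genotype 0010212 this yields the two pairs shown on the slide.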
Minimum Entropy Genotype Phasing • Phasing: function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g • Coverage of h in f: number of times h appears in the image of f • Entropy of a phasing: Entropy(f) = −Σ_h (cov(h,f) / 2n) · log(cov(h,f) / 2n), where cov(h,f) is the coverage of h in f and n is the number of genotypes • Minimum Entropy Genotype Phasing [Halperin & Karp 04]: given a set of genotypes, find a phasing with minimum entropy
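A minimal sketch of the entropy computation, assuming the entropy of a phasing is the Shannon entropy of the haplotype coverages normalized by the total number of haplotype slots (two per genotype). The function name `phasing_entropy` and the choice of base-2 logarithm are assumptions made for this example.

```python
from collections import Counter
from math import log2


def phasing_entropy(phasing):
    """phasing: dict mapping a genotype id to its haplotype pair (h, h')."""
    # Coverage of h: how many times h appears across all assigned pairs.
    coverage = Counter(h for pair in phasing.values() for h in pair)
    total = sum(coverage.values())  # equals 2 * number of genotypes
    return -sum(c / total * log2(c / total) for c in coverage.values())


# Reusing the same haplotypes across genotypes lowers the entropy.
shared = {"g1": ("0010010", "0010111"), "g2": ("0010010", "0010111")}
distinct = {"g1": ("0010010", "0010111"), "g2": ("0010011", "0010110")}
print(phasing_entropy(shared), phasing_entropy(distinct))
```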
Iterative Improvement Algorithm [Gusev et al. 07] • Initialization: start with a random phasing • Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the largest decrease in entropy and re-phase it
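The local search described above can be sketched as follows. This is a simplified illustration, not the ENT implementation: it recomputes the entropy from scratch for every candidate (rather than updating the gain in O(1)), ignores windows and batching, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """All unordered haplotype pairs explaining a 0/1/2 genotype string."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos], h2[pos] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    """Shannon entropy of haplotype coverages in a list of haplotype pairs."""
    cov = Counter(h for pair in phasing for h in pair)
    total = sum(cov.values())
    return -sum(c / total * log2(c / total) for c in cov.values())


def ent_phase(genotypes, seed=0):
    rng = random.Random(seed)
    candidates = [explaining_pairs(g) for g in genotypes]
    phasing = [rng.choice(c) for c in candidates]  # random initial phasing
    while True:
        current = entropy(phasing)
        best_gain, best_move = 1e-12, None  # require a strict decrease
        for i, cands in enumerate(candidates):
            for pair in cands:
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                gain = current - entropy(trial)
                if gain > best_gain:
                    best_gain, best_move = gain, (i, pair)
        if best_move is None:
            return phasing  # local minimum: no single re-phasing helps
        i, pair = best_move
        phasing[i] = pair  # apply the re-phasing with the largest decrease


print(ent_phase(["0010212", "0010111", "0010010"]))
```

In this small example the two homozygous genotypes pull the heterozygous genotype toward the pair that reuses their haplotypes.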
Overlapping Window Approach • Entropy is computed over short windows of size l+f • l “locked” SNPs have been phased previously • f “free” SNPs are currently being phased • Only phasings consistent with the l locked SNPs are considered • (Figure: genotypes g1 … gn covered by consecutive windows 1, 2, 3, 4, …, each split into locked and free SNPs)
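A minimal sketch of how such windows could be laid out over k SNPs, assuming the free segments tile the SNPs left to right and each window is prefixed by up to l already-phased SNPs. The function name and the tuple layout are assumptions made for this example.

```python
def overlapping_windows(k, locked, free):
    """Return (lock_start, free_start, free_end) index triples, end exclusive."""
    windows = []
    for free_start in range(0, k, free):
        # Locked SNPs are the (at most `locked`) SNPs just before the free ones.
        lock_start = max(0, free_start - locked)
        windows.append((lock_start, free_start, min(free_start + free, k)))
    return windows


# 10 SNPs, l = 2 locked and f = 4 free SNPs per window.
print(overlapping_windows(10, locked=2, free=4))
```

The first window has no locked SNPs, and the last window may have fewer than f free SNPs.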
Time Complexity • n unrelated genotypes over k SNPs • k/f windows • n·2^f candidate haplotype pairs evaluated per window • O(1) time per pair to compute the entropy gain • Empirically, the number of iterations is linear in n, but is reduced to O(log^3 n) by re-explaining multiple genotypes per iteration (batching) • Total runtime O(n log^3 n · 2^f · k/f)
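The total is the product of the factors listed on the slide. A short worked derivation, taking the batched iteration count to be O(log^3 n):

```latex
\underbrace{\frac{k}{f}}_{\text{windows}}
\times \underbrace{O(\log^{3} n)}_{\text{iterations per window}}
\times \underbrace{n \cdot 2^{f}}_{\text{pairs per iteration}}
\times \underbrace{O(1)}_{\text{time per pair}}
= O\!\left(n \log^{3} n \cdot 2^{f} \cdot \frac{k}{f}\right)
```

Since 2^f / f grows quickly with f, the bound stays practical only for small numbers of free SNPs per window.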
Extension to General Pedigrees • Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs • Phasing related genotypes based on the no-recombination assumption • Algorithm modifications: • At each step re-explain an entire family • Cache the inheritance pattern given by the first window to speed up computations for subsequent windows • Entropy computation based on founder haplotypes only
Enumerating No-Recombination Phasings for a Pedigree • Gaussian elimination [Jiang et al.] • The [Gusev et al. 07] implementation is based on simple backtracking
Empirical Evaluation • International HapMap Project, Phase I & II datasets • 3.7 million SNP loci • Trio and unrelated genotypes from 4 different populations • Reference haplotypes obtained using PHASE • Accuracy measures • Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method • Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs
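The switch count behind the switching error measure can be sketched as below. This is an illustrative reading of the slide's definition: it returns the raw number of switches for one individual (any normalization into a relative error is left out), and it assumes the inferred and reference pairs explain the same genotype. The function name is hypothetical.

```python
def switch_count(inferred, reference):
    """Number of switches needed to turn the inferred pair into the reference pair."""
    (a1, _a2), (r1, r2) = inferred, reference
    # Only heterozygous SNPs carry phase information.
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # At each heterozygous SNP, record whether a1 carries the allele of r1.
    orientation = [a1[i] == r1[i] for i in het]
    # A switch occurs whenever the orientation changes between adjacent het SNPs.
    return sum(x != y for x, y in zip(orientation, orientation[1:]))


print(switch_count(("0011", "1100"), ("0000", "1111")))  # 1
print(switch_count(("0101", "1010"), ("0000", "1111")))  # 3
```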
Empirical Evaluation (cont.) • Compared algorithms • ENT [Gusev et al. 07] • 2SNP [Brinza & Zelikovsky 05] • Pure Parsimony Trio Phasing (PPTP) [Brinza et al. 05] • PHASE [Stephens et al. 01] • HAP [Halperin & Eskin 04] • FastPhase [Scheet & Stephens 06]
Results on HapMap Phase II Trio Populations • ENT needs only a few hours on a regular workstation to phase the entire HapMap Phase II dataset, whereas PHASE required months of CPU time on two clusters with a total of 238 nodes
Complex Pedigree Phasing • Exploiting pedigree info significantly improves accuracy!
Genotyping Errors • A real problem despite advances in technology & typing algorithms • 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005] • Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004] • In pedigrees, some errors detected as Mendelian Inconsistencies (MIs) • Many errors remain undetected • As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
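Detecting Mendelian inconsistencies in a trio is a per-SNP check in the 0/1/2 encoding. A minimal sketch, assuming no missing genotypes (the helper names are hypothetical):

```python
# Alleles carried by each genotype code: 0 = two major, 1 = two minor, 2 = one of each.
ALLELES = {"0": "00", "1": "11", "2": "01"}


def mendelian_consistent(mother: str, father: str, child: str) -> bool:
    """True if the child's genotype at one SNP can arise from one allele per parent."""
    return any(
        (a if a == b else "2") == child
        for a in ALLELES[mother]
        for b in ALLELES[father]
    )


def mendelian_inconsistencies(mother: str, father: str, child: str):
    """Indices of SNPs where the trio violates Mendelian inheritance."""
    return [
        j
        for j, (m, f, c) in enumerate(zip(mother, father, child))
        if not mendelian_consistent(m, f, c)
    ]


# Two major-homozygous parents cannot have a minor-homozygous child (SNP 0).
print(mendelian_inconsistencies("00", "01", "12"))  # [0]
```

A check like this only catches inconsistent errors; Mendelian-consistent errors pass it unnoticed, which is the gap the likelihood-based methods target.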
Effects of Undetected Genotyping Errors • Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based) • Error rates as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04] • 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
Related Work • Improved genotype calling algorithms • [Di et al. 05, Rabbee & Speed 06, Nicolae et al. 06] • Explicit modeling in analysis methods • [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06] • Computationally complex • Separate error detection step • [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06] • Detected errors can be retyped, imputed, or ignored in downstream analyses
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • Original trio T: Mother 012102 = h1 (011100) + h2 (010101); Father 022102 = h3 (000101) + h4 (011100); Child 022102 = h1 (011100) + h3 (000101) • L(T): likelihood of best phasing for original trio T
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • Modified trio T’: Mother 012102 = h’1 (010101) + h’2 (011100); Father 022102 = h’3 (000100) + h’4 (011101); Child (shown as 022102, with the genotype under test marked “?”) phased as h’1 (010101) + h’3 (000100) • L(T’): likelihood of best phasing for modified trio T’ • How does L(T’) compare with L(T), the likelihood of best phasing for original trio T?
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • Trio genotypes: Mother 012102, Father 022102, Child 022102 (genotype under test marked “?”) • Large change in likelihood suggests likely error • Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R = 10^4)
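The flagging rule can be sketched in a few lines. Working with log-likelihoods is an assumption made here for numerical convenience, since phasing likelihoods over many SNPs can be very small; the function name is hypothetical.

```python
import math


def is_flagged(log_l_original: float, log_l_modified: float, threshold: float = 1e4) -> bool:
    """Flag when L(T')/L(T) > R, given natural-log likelihoods of T and T'."""
    # log(L(T') / L(T)) = log L(T') - log L(T), compared against log R.
    return (log_l_modified - log_l_original) > math.log(threshold)


# Hypothetical likelihood values, chosen only to exercise both outcomes.
print(is_flagged(math.log(1e-12), math.log(1e-7)))  # ratio 1e5 exceeds R = 1e4
print(is_flagged(math.log(1e-12), math.log(1e-9)))  # ratio 1e3 does not
```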
Implementation in FAMHAP [Becker et al. 06] • Window-based algorithm • For each window including the SNP under test, generate list of H most frequent haplotypes (default H = 50) • Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes • Flag genotype as an error if L(T’)/L(T) > R for at least one window • Example window around the SNP under test: Mother …201012 1 02210..., Father …201202 2 10211..., Child …000120 2 21021...
Limitations of FAMHAP Implementation • Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values • False positives caused by nearby errors (due to the use of multiple short windows) • [Kennedy et al.]: • HMM model of haplotype diversity → all haplotypes are represented + no need for short windows • Alternate likelihood functions → scalable runtime
HMM Model • Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06] • Block-free model, paths with high transition probability correspond to “founder” haplotypes (Figure from Rastas et al. 07)
HMM Training • Previous works use EM training of the HMM based on unrelated genotype data • Two-step algorithm exploiting pedigree info [Kennedy et al. 07] • Step 1: Infer haplotypes using a pedigree-aware algorithm based on entropy minimization • Step 2: Train the HMM based on the inferred haplotypes, using Baum-Welch
Complexity of Computing Maximum Phasing Probability • How hard is it to compute the likelihood function of Becker et al.? • Theorem [Kennedy et al. 07] • Cannot approximate L(T) within O(n^(1/4−ε)) for any ε > 0, unless ZPP=NP, where n is the number of SNP loci • For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(n^(1/2−ε)) • Open: complexity for fixed number of founder haplotypes
Complexity of Computing Maximum Phasing Probability • Reductions from the clique problem
Alternate Likelihood Functions • Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio • Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes • Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
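The difference between a Viterbi (single best path) probability and a total (all paths) probability can be seen on a single haplotype. The sketch below is a simplification of the trio setting, which works with 4-tuples of paths; the HMM layout (`init`, `trans`, `emit`) and the function name are assumptions made for this example.

```python
def haplotype_probabilities(hap, init, trans, emit):
    """Return (viterbi_prob, total_prob) of emitting `hap` from a left-to-right HMM.

    init[s]        : probability of starting in state s at locus 0
    trans[j][s][t] : probability of moving from state s at locus j to t at locus j+1
    emit[j][s][a]  : probability that state s at locus j emits allele a (0 or 1)
    """
    K = len(init)
    best = [init[s] * emit[0][s][hap[0]] for s in range(K)]  # Viterbi table
    total = best[:]                                           # forward table
    for j in range(1, len(hap)):
        a = hap[j]
        # Viterbi keeps only the best predecessor; forward sums over all of them.
        best = [max(best[s] * trans[j - 1][s][t] for s in range(K)) * emit[j][t][a]
                for t in range(K)]
        total = [sum(total[s] * trans[j - 1][s][t] for s in range(K)) * emit[j][t][a]
                 for t in range(K)]
    return max(best), sum(total)


# Toy 2-state, 2-locus model with made-up parameters.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
emit = [[[0.9, 0.1], [0.1, 0.9]]] * 2
print(haplotype_probabilities([0, 0], init, trans, emit))
```

The Viterbi probability never exceeds the total probability, since it is one term of the sum that the forward algorithm computes.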
Efficient Computation of Viterbi Probability for Trios • For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm • K^3 speed-up by factoring common terms in the recurrence, whose terms are the maximum probability of emitting the SNP genotypes at locus j+1 from the given states and the transition probabilities
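One way to see where a K^3 speed-up can come from, assuming an HMM with K states per locus, N loci, and independent transitions for the four paths. The symbols V_j, E_{j+1}, and \gamma_j are illustrative names, not taken from the slides, and the factorization shown is a standard technique that may differ in detail from the authors' formulation.

```latex
% Direct recurrence: K^4 target tuples x K^4 source tuples = O(K^8) work per locus
V_{j+1}(t_1,t_2,t_3,t_4) = E_{j+1}(t_1,t_2,t_3,t_4)\,
  \max_{s_1,s_2,s_3,s_4} V_j(s_1,s_2,s_3,s_4)\prod_{i=1}^{4}\gamma_j(s_i,t_i)

% Factored form: all terms are non-negative, so each source state
% can be maximized out separately, one path at a time
\max_{s_4}\gamma_j(s_4,t_4)\Bigl[\max_{s_3}\gamma_j(s_3,t_3)\Bigl[
  \max_{s_2}\gamma_j(s_2,t_2)\Bigl[\max_{s_1}\gamma_j(s_1,t_1)\,
  V_j(s_1,s_2,s_3,s_4)\Bigr]\Bigr]\Bigr]

% Each of the 4 stages fills a table of K^4 entries with O(K) work per entry:
% O(K^5) per locus, i.e. O(N K^5) overall instead of O(N K^8)
```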
Overall Runtimes • Viterbi probability • Likelihoods of all 3N modified trios can be computed within time using forward-backward algorithm • Overall runtime for M trios • Probability of Viterbi haplotypes • Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms • Overall runtime • Total trio probability • Similar pre-computation speed-up & forward-backward algorithm • Overall runtime
Empirical Evaluation • Real dataset [Becker et al. 2006] • 35 SNP loci on chromosome 16 covering a region of 91kb • 551 trios • Synthetic datasets • 35 SNPs, 30-551 trios, same missing data pattern as real dataset • Haplotypes assigned to trios based on frequencies inferred from real dataset • 1% error rate, four error insertion models • Random allele • Random genotype • Heterozygous-to-homozygous • Homozygous-to-heterozygous
Comparison of Alternative Likelihood Functions (1% Random Allele Errors)
Parents vs. Children (1% Random Allele Errors) • FPs caused by same-locus errors in parents
“Combined” Detection Method • Compute 4 likelihood ratios • Trio • Mother-child duo • Father-child duo • Child (unrelated) • Flag as error if all ratios are above detection threshold
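The combined rule is a conjunction over the four likelihood ratios. A minimal sketch, in which the dictionary keys and the function name are hypothetical labels for the four ratios listed above:

```python
REQUIRED = ("trio", "mother_child", "father_child", "child")


def combined_flag(ratios: dict, threshold: float = 1e4) -> bool:
    """Flag a genotype only if every one of the four ratios exceeds the threshold."""
    return all(ratios[name] > threshold for name in REQUIRED)


# Made-up ratios: the second call fails because one duo ratio is below threshold.
print(combined_flag({"trio": 5e4, "mother_child": 2e4, "father_child": 3e4, "child": 1.5e4}))
print(combined_flag({"trio": 5e4, "mother_child": 2e3, "father_child": 3e4, "child": 1.5e4}))
```

Requiring all four ratios to agree trades some sensitivity for fewer false positives, compared with flagging on the trio ratio alone.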
Acknowledgements • Sasha Gusev, Justin Kennedy, Bogdan Pasaniuc • NSF funding (Awards 0546457 and 0543365) • Software available at http://www.engr.uconn.edu/~ion/SOFT/