ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis

Ion Mandoiu (University of Connecticut), Alexander Zelikovsky (Georgia State University)


Presentation Transcript


  1. ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis • Ion Mandoiu (University of Connecticut) • Alexander Zelikovsky (Georgia State University)

  2. Outline • Background on genetic variation • Genotype phasing • Error detection • Disease association search • Disease susceptibility prediction

  3. Single Nucleotide Polymorphisms • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • High density in the human genome: ≈ 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs • … ataggtccCtatttcgcgcCgtatacacgggActata … • … ataggtccGtatttcgcgcCgtatacacgggTctata … • … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g. haplotypes 011100110 + 001000010 give genotype 021200210
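
The 0/1/2 encoding above can be sketched in a few lines of Python. The function name `genotype_from_haplotypes` is a hypothetical helper, not part of any tool mentioned in the tutorial; it only illustrates how two 0/1 haplotype vectors combine into one genotype vector.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    out = []
    for a, b in zip(h1, h2):
        # Same allele on both chromosomes: keep it (0 or 1).
        # Different alleles: the site is heterozygous, coded 2.
        out.append(a if a == b else "2")
    return "".join(out)


# The example pair from the slide.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```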

  5. Why SNPs? • Identification and fine mapping of disease-related genes • Methods: Linkage analysis, allele-sharing, association studies • Genotype data: large pedigrees, sibling pairs, trios, unrelated

  6. Challenges in SNP Data Analysis • Latest technologies deliver 1M SNP genotypes per sample, at low cost • Major challenges • Efficiency • Reproducibility → Need simple methods!

  7. Genotype Phasing

  8. Genotype Phasing • Example: g: 0010212 is explained by either (h1: 0010111, h2: 0010010) or (h3: 0010011, h4: 0010110) • For a genotype with k 2’s there are 2^(k-1) possible pairs of haplotypes explaining it • Computational approaches to genotype phasing • Statistical methods: PHASE, Phamily, PL, GERBIL … • Combinatorial methods: Parsimony, HAP, 2SNP, ENT …
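
The count of explaining haplotype pairs can be checked by brute-force enumeration. The sketch below is illustrative only; `explaining_pairs` is a hypothetical name, and the enumeration is exponential in the number of heterozygous sites, so it is meant for tiny examples.

```python
from itertools import product


def explaining_pairs(g: str):
    """Return all unordered haplotype pairs (h, h') that explain genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        # A fully homozygous genotype has exactly one explanation.
        return [(g, g)]
    pairs = []
    # Fix allele 0 on h at the first heterozygous site so that
    # (h, h') and (h', h) are not both counted.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


# The genotype from the slide has k = 2 heterozygous sites, hence 2 pairs.
for pair in explaining_pairs("0010212"):
    print(pair)
```

On the slide's genotype this yields the two pairs shown there, (0010010, 0010111) and (0010011, 0010110).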

  9. Minimum Entropy Genotype Phasing • Phasing – function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g • Coverage of h in f – number of times h appears in the image of f • Entropy of a phasing: Entropy(f) = −Σ_h (Coverage(h,f) / 2n) · log(Coverage(h,f) / 2n), where n is the number of genotypes • Minimum Entropy Genotype Phasing [HalperinKarp 04]: Given a set of genotypes, find a phasing with minimum entropy
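
A minimal sketch of the entropy objective, assuming it is the Shannon entropy of the empirical haplotype distribution in which each haplotype has probability coverage / 2n. The base-2 logarithm and the name `phasing_entropy` are choices made for this illustration.

```python
from collections import Counter
from math import log2


def phasing_entropy(phasing):
    """Entropy (in bits) of a phasing given as {genotype_id: (h, h')}."""
    coverage = Counter()
    for h1, h2 in phasing.values():
        coverage[h1] += 1
        coverage[h2] += 1
    total = sum(coverage.values())  # equals 2 * number of genotypes
    return -sum(c / total * log2(c / total) for c in coverage.values())


# Two phasings of two individuals with genotype "22":
# reusing the same haplotypes gives the lower entropy.
shared = {"g1": ("00", "11"), "g2": ("00", "11")}
distinct = {"g1": ("00", "11"), "g2": ("01", "10")}
print(phasing_entropy(shared), phasing_entropy(distinct))
```

The phasing that reuses haplotypes across individuals scores lower, which is the intuition behind minimizing entropy.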

  10. Connection with Likelihood Maximization

  11. Iterative Improvement Algorithm [Gusev et al. 07] • Initialization: start with random phasing • Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it
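
The two steps above can be sketched as a naive greedy local search. This is not the ENT implementation: it ignores windows, recomputes the entropy from scratch for every candidate, and enumerates all explaining pairs, so it only suits toy inputs. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """All unordered haplotype pairs explaining a 0/1/2 genotype string."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    """Shannon entropy (bits) of the haplotypes used by a list of pairs."""
    cov = Counter(h for pair in phasing for h in pair)
    total = sum(cov.values())
    return -sum(c / total * log2(c / total) for c in cov.values())


def iterative_improvement(genotypes, seed=0):
    rng = random.Random(seed)
    candidates = [explaining_pairs(g) for g in genotypes]
    # Initialization: random phasing.
    phasing = [rng.choice(c) for c in candidates]
    while True:
        current = entropy(phasing)
        best_gain, best_move = 1e-12, None  # tolerance guards against float noise
        for i, cands in enumerate(candidates):
            old = phasing[i]
            for pair in cands:
                phasing[i] = pair
                gain = current - entropy(phasing)
                if gain > best_gain:
                    best_gain, best_move = gain, (i, pair)
            phasing[i] = old
        if best_move is None:
            return phasing  # no single re-phasing decreases the entropy
        i, pair = best_move
        phasing[i] = pair  # apply the re-phasing with the highest decrease


print(iterative_improvement(["0010212", "0010111", "0010010"]))
```

In the usage example the heterozygous genotype is resolved with the two haplotypes already present in the homozygous individuals, since that choice gives the lowest entropy.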

  12. Overlapping Window approach • [Figure: genotypes g1 … gn covered by overlapping windows 1, 2, 3, 4, …, each split into locked and free SNPs] • Entropy is computed over short windows of size l+f • l “locked” SNPs previously phased • f “free” SNPs are currently phased • Only phasings consistent with the l locked SNPs are considered

  13. Effect of Window Size

  14. Time Complexity • n unrelated genotypes over k SNPs • k/f windows • n·2^f candidate haplotype pairs evaluated per window • O(1) time per pair to compute the entropy gain • Empirically, the number of iterations is linear in n, but is reduced to O(log^3 n) by re-explaining multiple genotypes per iteration (batching) • Total runtime O(n · log^3 n · 2^f · k/f)

  15. Empirical Runtime

  16. Extension to general pedigrees • Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs • Phasing related genotypes based on the no-recombination assumption • Algorithm modifications: • At each step re-explain an entire family • Cache inheritance pattern given by first window to speed up computations for subsequent windows • Entropy computation based on founder haplotypes only

  17. Enumerating No-Recombination Phasings for a Pedigree • Gaussian elimination [Jiang et al.] • [Gusev et al. 07] implementation based on simple backtracking

  18. Empirical Evaluation • International HapMap Project, Phase I & II datasets • 3.7 million SNP loci • Trio and unrelated genotypes from 4 different populations • Reference haplotypes obtained using PHASE • Accuracy measures • Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method • Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs
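
The switching measure can be illustrated with a small counter. The sketch assumes both haplotype pairs explain the same genotype and returns the raw number of switches; how the tutorial normalizes this count into a relative error is not stated on the slide, so no normalization is applied. `switch_count` is a hypothetical name.

```python
def switch_count(inferred, reference):
    """Number of phase switches needed to turn the inferred pair into the reference pair."""
    (a1, _a2), (r1, r2) = inferred, reference
    # Only heterozygous sites carry phase information.
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # True where inferred haplotype a1 carries the same allele as reference r1.
    orientation = [a1[i] == r1[i] for i in het]
    # A switch occurs whenever the orientation flips between consecutive het sites.
    return sum(1 for x, y in zip(orientation, orientation[1:]) if x != y)


reference = ("0010111", "0010010")
print(switch_count(("0010011", "0010110"), reference))  # 1
print(switch_count(("0010010", "0010111"), reference))  # 0 (same pair, order swapped)
```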

  19. Empirical Evaluation (cont.) • Compared algorithms • ENT [Gusev et al. 07] • 2SNP [Brinza&Zelikovsky 05] • Pure Parsimony Trio Phasing (PPTP) [Brinza et al. 05] • PHASE [Stephens et al. 01] • HAP [Halperin&Eskin 04] • FastPhase [Scheet&Stephens 06]

  20. Results on HapMap Phase II Trio Populations • ENT needs only a few hours on a regular workstation to phase the entire HapMap Phase II dataset, compared to PHASE, which required months of CPU time on two clusters with a total of 238 nodes

  21. Complex Pedigree Phasing • Exploiting pedigree info significantly improves accuracy!

  22. Application of Phasing: Missing data recovery

  23. Genotype Error Detection

  24. Genotyping Errors • A real problem despite advances in technology & typing algorithms • 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005] • Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004] • In pedigrees, some errors detected as Mendelian Inconsistencies (MIs) • Many errors remain undetected • As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
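
A Mendelian inconsistency check for a trio can be sketched directly from the 0/1/2 genotype coding (0 and 1 homozygous, 2 heterozygous). The function names are hypothetical, and missing genotypes are not handled. The last call shows an altered child genotype that stays Mendelian consistent, the kind of error this check cannot catch.

```python
def transmittable(g: str):
    """Alleles a parent with genotype code g can pass on."""
    return {"0": {0}, "1": {1}, "2": {0, 1}}[g]


def mendelian_consistent(mother: str, father: str, child: str) -> bool:
    """Check one SNP of a mother/father/child trio."""
    possible = set()
    for a in transmittable(mother):
        for b in transmittable(father):
            # Equal alleles give a homozygous child, different ones code 2.
            possible.add(str(a) if a == b else "2")
    return child in possible


def mendelian_inconsistencies(mother: str, father: str, child: str):
    """Indices of SNPs where the trio violates Mendelian inheritance."""
    return [j for j, (m, f, c) in enumerate(zip(mother, father, child))
            if not mendelian_consistent(m, f, c)]


print(mendelian_inconsistencies("012102", "022102", "022102"))  # []
print(mendelian_inconsistencies("012102", "022102", "002102"))  # [1]
print(mendelian_inconsistencies("012102", "022102", "020102"))  # [] (change goes unnoticed)
```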

  25. Effects of Undetected Genotyping Errors • Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based) • Error rates as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker 04] • 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

  26. Related Work • Improved genotype calling algorithms • [Di et al. 05, Rabbee&Speed 06, Nicolae et al. 06] • Explicit modeling in analysis methods • [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06] • Computationally complex • Separate error detection step • [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06] • Detected errors can be retyped, imputed, or ignored in downstream analyses

  27. Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • Mother 0 1 2 1 0 2: h1 = 0 1 1 1 0 0, h2 = 0 1 0 1 0 1 • Father 0 2 2 1 0 2: h3 = 0 0 0 1 0 1, h4 = 0 1 1 1 0 0 • Child 0 2 2 1 0 2: h1 = 0 1 1 1 0 0, h3 = 0 0 0 1 0 1 • Likelihood of best phasing for original trio T

  28. Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • Trio genotypes (one genotype is marked “?” in the figure): Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2 • Mother: h’1 = 0 1 0 1 0 1, h’2 = 0 1 1 1 0 0 • Father: h’3 = 0 0 0 1 0 0, h’4 = 0 1 1 1 0 1 • Child: h’1 = 0 1 0 1 0 1, h’3 = 0 0 0 1 0 0 • Likelihood of best phasing for modified trio T’ is compared with the likelihood of best phasing for original trio T

  29. Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] • [Figure: Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2, with one genotype marked “?”] • Large change in likelihood suggests likely error • Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R = 10^4)
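
The flagging rule is a single comparison. Since trio likelihoods are typically tiny, the sketch below evaluates it on log-likelihoods. Working in log space and the name `flag_genotype` are choices made for this illustration. The likelihood values in the usage lines are made up.

```python
from math import log


def flag_genotype(log_l_original: float, log_l_modified: float, R: float = 1e4) -> bool:
    """Return True when L(T') / L(T) > R, evaluated as a difference of logs."""
    return (log_l_modified - log_l_original) > log(R)


# Made-up likelihoods: a ratio of 1e5 exceeds R = 1e4, a ratio of 1e2 does not.
print(flag_genotype(log(1e-12), log(1e-7)))   # True
print(flag_genotype(log(1e-12), log(1e-10)))  # False
```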

  30. Implementation in FAMHAP [Becker et al. 06] • [Figure: window around the SNP under test: Mother …201012 1 02210..., Father …201202 2 10211..., Child …000120 2 21021...] • Window-based algorithm • For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50) • Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes • Flag genotype as an error if L(T’)/L(T) > R for at least one window

  31. Limitations of FAMHAP Implementation • Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values • False positives caused by nearby errors (due to the use of multiple short windows) • [Kennedy et al.] • HMM model of haplotype diversity → all haplotypes are represented + no need for short windows • Alternate likelihood functions → scalable runtime

  32. HMM Model • Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06] • Block-free model, paths with high transition probability correspond to “founder” haplotypes (Figure from Rastas et al. 07)

  33. HMM Training • Previous works use EM training of HMM based on unrelated genotype data • Two-step algorithm exploiting pedigree info [Kennedy et al. 07] • Step 1: Infer haplotypes using pedigree-aware algorithm based on entropy minimization • Step 2: Train HMM based on inferred haplotypes, using Baum-Welch

  34. Complexity of Computing Maximum Phasing Probability • How hard is it to compute the likelihood function of Becker et al.? • Theorem [Kennedy et al. 07] • Cannot approximate L(T) within O(n^(1/4 − ε)), unless ZPP=NP, where n is the number of SNP loci • For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(n^(1/2 − ε)) • Open: complexity for fixed number of founder haplotypes

  35. Complexity of Computing Maximum Phasing Probability • Reductions from the clique problem

  36. Alternate Likelihood Functions • Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio • Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes • Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
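
The difference between a Viterbi (best-path) probability and a total probability can be seen on a toy HMM. The sketch below is a deliberate simplification: it scores a single haplotype rather than the four haplotypes of a trio, and the two-state model and all parameter values are made up for illustration.

```python
def emission(p_one: float, allele: int) -> float:
    """Probability that a state emitting allele 1 with probability p_one emits `allele`."""
    return p_one if allele == 1 else 1.0 - p_one


def haplotype_probabilities(hap, init, trans, emit):
    """Return (viterbi_prob, total_prob) of one haplotype under a toy HMM.

    init[k]        : probability of starting in state k at locus 0
    trans[j][p][k] : probability of moving from state p at locus j to state k at locus j+1
    emit[j][k]     : probability that state k at locus j emits allele 1
    """
    K = len(init)
    best = [init[k] * emission(emit[0][k], hap[0]) for k in range(K)]
    total = list(best)
    for j in range(1, len(hap)):
        e = [emission(emit[j][k], hap[j]) for k in range(K)]
        # Viterbi recursion: keep only the most probable path into each state.
        best = [max(best[p] * trans[j - 1][p][k] for p in range(K)) * e[k]
                for k in range(K)]
        # Forward recursion: sum over all paths into each state.
        total = [sum(total[p] * trans[j - 1][p][k] for p in range(K)) * e[k]
                 for k in range(K)]
    return max(best), sum(total)


init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
emit = [[0.1, 0.9], [0.1, 0.9]]
print(haplotype_probabilities([0, 0], init, trans, emit))
```

The Viterbi value never exceeds the total probability, because it keeps one path where the forward recursion sums over all of them.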

  37. Efficient Computation of Viterbi Probability for Trios • For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in time [formula missing] • K^3 speed-up by factoring common terms: [formula missing] • Where: [symbol missing] = maximum probability of emitting SNP genotypes at locus j+1 from states [symbols missing] • [symbol missing] = transition probability

  38. Overall Runtimes • Viterbi probability • Likelihoods of all 3N modified trios can be computed within time [formula missing] using forward-backward algorithm • Overall runtime for M trios: [formula missing] • Probability of Viterbi haplotypes • Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms • Overall runtime: [formula missing] • Total trio probability • Similar pre-computation speed-up & forward-backward algorithm • Overall runtime: [formula missing]

  39. Empirical Evaluation • Real dataset [Becker et al. 2006] • 35 SNP loci on chromosome 16 covering a region of 91kb • 551 trios • Synthetic datasets • 35 SNPs, 30-551 trios, same missing data pattern as real dataset • Haplotypes assigned to trios based on frequencies inferred from real dataset • 1% error rate, four error insertion models • Random allele • Random genotype • Heterozygous-to-homozygous • Homozygous-to-heterozygous

  40. Comparison of Alternative Likelihood Functions (1% Random Allele Errors)

  41. Parents vs. Children (1% Random Allele Errors) • FPs caused by same-locus errors in parents

  42. “Combined” Detection Method • Compute 4 likelihood ratios • Trio • Mother-child duo • Father-child duo • Child (unrelated) • Flag as error if all ratios are above detection threshold
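
The combination rule above amounts to a conjunction over the four likelihood ratios. In the sketch below the dictionary keys, the function name `combined_flag`, and the default threshold are assumptions made for illustration, and the ratio values are made up.

```python
def combined_flag(ratios: dict, threshold: float = 1e4) -> bool:
    """Flag a genotype only if all four likelihood ratios exceed the threshold."""
    required = ("trio", "mother_child", "father_child", "child")
    return all(ratios[name] > threshold for name in required)


# Made-up ratios: the second call fails because one duo ratio is below the threshold.
print(combined_flag({"trio": 3e5, "mother_child": 2e4, "father_child": 5e4, "child": 1.2e4}))  # True
print(combined_flag({"trio": 3e5, "mother_child": 2e4, "father_child": 9e3, "child": 1.2e4}))  # False
```

Requiring agreement of all four ratios trades some sensitivity for fewer false positives, which matches the motivation given for false positives caused by same-locus errors in parents.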

  43. Comparison with FAMHAP (Children)

  44. Comparison with FAMHAP (Parents)

  45. Acknowledgements • Sasha Gusev, Justin Kennedy, Bogdan Pasaniuc • NSF funding (Awards 0546457 and 0543365) • Software available at http://www.engr.uconn.edu/~ion/SOFT/
