
Highly Scalable Genotype Phasing by Entropy Minimization

Bogdan Pasaniuc and Ion Mandoiu. Computer Science & Engineering Department, University of Connecticut.


Presentation Transcript


  1. Highly Scalable Genotype Phasing by Entropy Minimization Bogdan Pasaniuc and Ion Mandoiu Computer Science & Engineering Department, University of Connecticut

  2. SNPs
     … ataggtccCtatttcgcgcCgtatacacgggActata … haplotype CCA
     … ataggtccGtatttcgcgcCgtatacacgggTctata … haplotype GCT
     … ataggtccCtatttcgcgcCgtatacacgggTctata … haplotype CCT
     • Genome variation: 0.1% of the DNA differs from one individual to another
       - 80% of the variation is represented by Single Nucleotide Polymorphisms (SNPs)
       - 2 possible nucleotides (alleles) for each SNP
     • Haplotype: description of the SNP alleles on one chromosome
       - 0/1 vector

  3. Notations
     • Diploid organisms: two copies of each chromosome
       - one from the mother and one from the father
     • Genotype: description of the alleles on both chromosomes
       - 0/1/2 vector
       - 0 (1): both chromosomes contain the dominant (resp. minor) allele
       - 2: the chromosomes contain different alleles
     • Example: two haplotypes per individual, one genotype for the individual
       011100110 + 001000010 = 021200210
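The 0/1/2 encoding on this slide can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is hypothetical and haplotypes are assumed to be strings over '0' and '1'; the example vectors are the ones shown on the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    A site where both haplotypes agree keeps that allele (0 or 1);
    a site where they differ is heterozygous and is encoded as 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The two haplotypes and the genotype from the slide.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```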

  4. Genotype Population Phasing
     • Example: g: 0010212 is explained by (h1: 0010111, h2: 0010010) or by (h3: 0010011, h4: 0010110); which pair is correct?
     • For a genotype with k 2’s there are 2^(k-1) possible pairs of haplotypes explaining it
     • Physical phasing is too expensive
     • Computational phasing is much cheaper
       - Statistical methods: PHASE, Phamily, PL, GERBIL …
       - Combinatorial methods: Parsimony, HAP, 2SNP, ENT …
     • Current genotyping platforms -> 500k SNPs in one experiment
     • Need for fast and accurate methods
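The count of 2^(k-1) explaining pairs can be checked by enumeration. The sketch below is illustrative only: `explaining_pairs` is a hypothetical helper, and it fixes the first heterozygous site to 0 in the first haplotype so that each unordered pair is produced once.

```python
from itertools import product


def explaining_pairs(g: str) -> list[tuple[str, str]]:
    """List all unordered haplotype pairs (h, h') that explain genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        # A fully homozygous genotype has a single explanation.
        return [(g, g)]
    pairs = []
    # The first ambiguous site is pinned to 0 in h1; the other k-1 are free.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


# The genotype from the slide has k = 2 ambiguous sites, so 2 pairs.
for pair in explaining_pairs("0010212"):
    print(pair)
```

For the slide's genotype this yields the two pairs (h2, h1) and (h3, h4) shown in the example.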

  5. Minimum Entropy Population Phasing
     • Phasing: a function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g
     • Coverage of h in f: the number of times h appears in the image of f
     • Entropy of a phasing: Entropy(f) = - Σ_h (Cov(h,f) / 2n) log (Cov(h,f) / 2n), where Cov(h,f) is the coverage of h in f and n is the number of genotypes
     • Minimum Entropy Population Phasing [Halperin&Karp 04]: given a set of genotypes, find a phasing with minimum entropy
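A small sketch of the objective, assuming the entropy of a phasing is the Shannon entropy of the haplotype frequencies, where the frequency of h is its coverage divided by 2n for n genotypes. The logarithm base (2 here) and the name `phasing_entropy` are assumptions, not taken from the slides.

```python
from collections import Counter
from math import log2


def phasing_entropy(phasing: list[tuple[str, str]]) -> float:
    """Entropy of a phasing given as one (h, h') pair per genotype."""
    # Coverage: how many times each haplotype is used across all pairs.
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum((c / total) * log2(c / total) for c in coverage.values())


# Reusing the same two haplotypes gives lower entropy than using four.
print(phasing_entropy([("00", "11"), ("00", "11")]))
print(phasing_entropy([("00", "11"), ("01", "10")]))
```

The first phasing reuses two haplotypes and scores lower than the second, which is the behaviour the minimum-entropy objective rewards.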

  6. Basic ENT Algorithm
     • Initialization: random phasing
     • Iterative improvement: while there exists a genotype whose re-phasing decreases the entropy of f, find the genotype that yields the highest decrease in entropy and re-phase it
     • Entropy is informative only for a small number of SNPs
       - a large number of SNPs ⇒ no common haplotypes
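The local search described on this slide can be sketched as follows. This is not the authors' implementation: it recomputes the entropy from scratch for every candidate instead of using a constant-time gain update, it enumerates all explaining pairs (feasible only for few ambiguous sites), and all function names, the base-2 logarithm, and the tolerance `1e-12` are assumptions.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """All unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    """Shannon entropy (base 2) of haplotype frequencies in a phasing."""
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum((c / total) * log2(c / total) for c in coverage.values())


def ent_local_search(genotypes, seed=0):
    rng = random.Random(seed)
    candidates = [explaining_pairs(g) for g in genotypes]
    # Initialization: a random explaining pair for every genotype.
    phasing = [rng.choice(c) for c in candidates]
    while True:
        current = entropy(phasing)
        best_gain, best_move = 1e-12, None
        # Find the single re-phasing with the largest entropy decrease.
        for i, cands in enumerate(candidates):
            for pair in cands:
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                gain = current - entropy(trial)
                if gain > best_gain:
                    best_gain, best_move = gain, (i, pair)
        if best_move is None:
            return phasing  # no re-phasing decreases the entropy
        i, pair = best_move
        phasing[i] = pair


print(ent_local_search(["00", "11", "22"]))
```

In this toy input the ambiguous genotype `22` ends up explained by the two haplotypes already present in the homozygous individuals.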

  7. Overlapping Window Approach
     [Figure: genotypes g1 … gn covered by overlapping windows numbered 1, 2, 3, 4, …, each split into "locked" and "free" SNPs]
     • Entropy is computed over short windows of size l+f
       - l “locked” SNPs, previously phased
       - f “free” SNPs, currently being phased
       - only phasings consistent with the l locked SNPs are considered
     • l and f are either
       - user-specified parameters, or
       - computed automatically inside the algorithm, based on the number of ambiguous SNPs (2’s) present
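One way to picture the window schedule is the sketch below. It assumes fixed l and f, windows that advance left to right by f SNPs, and a first window with no locked SNPs; the slides do not specify these details, and `window_schedule` is a hypothetical name.

```python
def window_schedule(k: int, l: int, f: int):
    """Return (locked, free) half-open SNP index ranges for each window.

    Each window phases f new "free" SNPs and keeps the l SNPs immediately
    before them "locked" to the phasing chosen by earlier windows.
    """
    if k <= 0 or f <= 0 or l < 0:
        raise ValueError("need k > 0, f > 0, l >= 0")
    schedule = []
    for start in range(0, k, f):
        locked = (max(0, start - l), start)
        free = (start, min(start + f, k))
        schedule.append((locked, free))
    return schedule


# 10 SNPs, 2 locked and 4 free per window: about k/f windows in total.
for locked, free in window_schedule(k=10, l=2, f=4):
    print("locked", locked, "free", free)
```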

  8. ENT Time Complexity
     • n unrelated genotypes over k SNPs
     • k/f windows
     • n·2^f candidate haplotype pairs per window are evaluated (pessimistic estimate)
     • Computing the entropy gain takes O(1) time per candidate pair
     • Empirically, the number of iterations is linear in n
     • Total runtime: O(n^2 · 2^f · k/f)
     • The number of iterations is reduced to a constant by re-explaining multiple genotypes in one step
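The total follows from multiplying the quantities listed on this slide; written out, under the slide's pessimistic estimate of n·2^f candidates per iteration:

```latex
\underbrace{\frac{k}{f}}_{\text{windows}}
\times \underbrace{O(n)}_{\text{iterations}}
\times \underbrace{n \cdot 2^{f}}_{\text{candidate pairs}}
\times \underbrace{O(1)}_{\text{gain per pair}}
= O\!\left(\frac{n^{2}\, 2^{f}\, k}{f}\right)
```

If the number of iterations is reduced to a constant, as the last bullet describes, the same product would give O(n · 2^f · k/f); this second bound is derived here and is not stated on the slide.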

  9. Extension to general pedigrees
     • Genotypes come from related individuals
     • At each step, re-explain an entire family
     • No-recombination assumption: a parent transmits one of its chromosomes to the child
     • A trio family (mother, father and child) is phased using 4 haplotypes

  10. Experimental Setting
     • Dataset I: 129 family trios over 103 SNPs [Daly et al. 2001]
       - from the trios, using the no-recombination assumption, we recovered partial haplotypes for the children
       - ENT was run on the children, treated as unrelated genotypes
       - the partial haplotypes were used to test the accuracy of our method
     • Switching error rate: given the true haplotypes (t,t’) and the inferred ones (h,h’), the switching error rate is the ratio, expressed as a percentage, between the number of times we have to switch from reading h to reading h’ to obtain t and the number of ambiguous SNPs
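The switching error rate defined on this slide can be computed as in the following sketch. It assumes fully known true haplotypes (the experiments use partial ones), treats the ambiguous SNPs as the sites where t and t’ differ, and returns 0.0 when there are none; `switching_error_rate` is a hypothetical name.

```python
def switching_error_rate(t: str, t2: str, h: str, h2: str) -> float:
    """Switching error rate (in percent) of (h, h2) against truth (t, t2)."""
    # Only heterozygous (ambiguous) sites carry phase information.
    het = [i for i in range(len(t)) if t[i] != t2[i]]
    if not het:
        return 0.0
    # For each ambiguous site, is t currently being read from h (else h2)?
    reads_h = [h[i] == t[i] for i in het]
    # A switch happens whenever that source changes between adjacent sites.
    switches = sum(a != b for a, b in zip(reads_h, reads_h[1:]))
    return 100.0 * switches / len(het)


# Truth 0101 / 1010; the inferred pair needs one switch after site 2.
print(switching_error_rate("0101", "1010", "0110", "1001"))  # 25.0
```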

  11. Daly dataset: different window sizes
     • ENT with the auto/auto setting is used in the following experiments

  12. Comparison with other methods: Daly dataset

  13. Dataset II
     • HapMap (hapmap.org) Phase I, release 16 datasets
       - The International HapMap Project
     • Two populations of 30 trios each:
       - CEU: Utah residents with ancestry from northern and western Europe
       - YRI: Yoruba people of Ibadan, Nigeria
     • Haplotypes obtained by PHASE
     • 2SNP [Brinza&Zelikovsky 05]
       - phasing based on genotype statistics collected only for pairs of SNPs
     • Pure Parsimony Trio Phasing (PPTP) [Brinza et al. 05]
       - minimizes the number of distinct haplotypes used for phasing
       - Integer Linear Programming based method

  14. HapMap Phase I, chromosome 22
     • All chromosomes, ENT: 3 h 20 min for 1,653,765 SNPs
     • All chromosomes, PHASE: over a month on two clusters with a combined total of 238 nodes

  15. Missing data recovery
     • We randomly deleted 1-10% of the genotype SNPs
     • Used the genotypes with missing data as input
     • Measured the percentage of correctly recovered alleles
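The evaluation protocol on this slide can be sketched as below. The recovery step itself (ENT) is not shown. The '?' marker, the function names, and the allele-level scoring rule (a deleted site contributes two alleles, and a recovered genotype earns credit for each allele it shares with the truth) are assumptions about how such a measurement could be done, not the authors' stated procedure.

```python
import random

MISSING = "?"
# Number of minor (1) alleles carried at a site for each genotype code.
MINOR_COUNT = {"0": 0, "1": 2, "2": 1}


def mask_genotypes(genotypes, fraction, seed=0):
    """Randomly delete a fraction of genotype entries; return (masked, hidden)."""
    rng = random.Random(seed)
    sites = [(i, j) for i, g in enumerate(genotypes) for j in range(len(g))]
    hidden = set(rng.sample(sites, round(fraction * len(sites))))
    masked = [
        "".join(MISSING if (i, j) in hidden else c for j, c in enumerate(g))
        for i, g in enumerate(genotypes)
    ]
    return masked, hidden


def recovery_accuracy(truth, recovered, hidden):
    """Percent of alleles at the deleted sites that were recovered correctly."""
    if not hidden:
        raise ValueError("no deleted sites to score")
    correct = sum(
        2 - abs(MINOR_COUNT[truth[i][j]] - MINOR_COUNT[recovered[i][j]])
        for i, j in hidden
    )
    return 100.0 * correct / (2 * len(hidden))


masked, hidden = mask_genotypes(["0120", "2210"], fraction=0.25)
print(masked, sorted(hidden))
```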

  16. Conclusions
     • ENT is several orders of magnitude faster than current methods
     • Phasing accuracy is close to that of the best methods
     • The current version handles any type of pedigree data
     • Code for download & Web server: http://dna.engr.uconn.edu/~software/ent/
     • Thanks: Alexander Gusev and NSF Grants 0546457 and 0543365
