Algorithms for SNP Data Collection and Analysis

Ion Mandoiu, Computer Science & Engineering Department, University of Connecticut. http://www.engr.uconn.edu/~ion/ Outline: biological background; algorithms for DNA tag set design; entropy-based SNP genotype phasing; conclusions.

Presentation Transcript


  1. Algorithms for SNP Data Collection and Analysis Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/

  2. Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions

  3. Single Nucleotide Polymorphisms • Human genome ≈ 3 × 10^9 base pairs • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • Total #SNPs ≈ 1 × 10^7 • Difference between any two individuals ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome) … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Diploids: two homologous copies of each chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g., 011100110 + 001000010 give genotype 021200210
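The 0/1/2 encoding on this slide can be sketched in a few lines of Python. The function name `genotype` is an illustrative choice, not something from the talk; the example reuses the two haplotypes shown on the slide.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype("011100110", "001000010"))  # 021200210
```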

  5. Why SNPs? • Association studies: identification of disease-related genes • Compare SNP genotypes/haplotypes in affected individuals and controls • If particular genotypes/haplotypes occur more frequently in affected individuals, a gene influencing the disease may be located nearby • Drug target discovery, personalized medicine, … • Challenges • Reducing genotyping cost, especially for user-selected SNPs → tag arrays • Current technologies produce genotypes, not haplotypes → genotype phasing problem

  6. Genotyping via direct hybridization • Labeled sample • SNP1 with alleles T/G • SNP2 with alleles A/G • Array with 2 probes/SNP • Hybridization • Optical scanning used to identify alleles present in the sample Images courtesy of Affymetrix.

  7. DNA Tag Arrays • “Programmable” arrays • Array consists of application-independent tags • Analysis carried out by a sequence of reactions involving application-specific oligonucleotides • Cost-effective AND flexible

  8. Genotyping with Tag Arrays (1) Images courtesy of Affymetrix.

  9. Genotyping with Tag Arrays (2) Images courtesy of Affymetrix.

  10. Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions

  11. Tag Set Design Problem • (H1) Tags hybridize strongly to complementary antitags • (H2) No tag hybridizes to a non-complementary antitag • Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1) and (H2) [Figure: tags t1, t2 paired with their antitags]

  12. Hybridization Models (1) • Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state • 2-4 rule: Tm = 2 × #(As and Ts) + 4 × #(Cs and Gs) • More accurate models exist, e.g., SantaLucia’s nearest-neighbor model, but are computationally complex
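As a small illustration of the 2-4 rule stated above, the following sketch counts bases and applies the formula directly. The function name `melting_temp_2_4` is hypothetical, and the rule is only the coarse estimate the slide describes.

```python
def melting_temp_2_4(seq: str) -> int:
    """Estimate Tm with the 2-4 rule: 2 per A/T plus 4 per C/G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("C") + seq.count("G")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * gc


print(melting_temp_2_4("CCAGATT"))  # 20
```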

  13. Hybridization Models (2) • Hamming distance model, e.g., [Marathe et al. 01] • Models rigid DNA strands • LCS/edit distance model, e.g., [Torney et al. 03] • Models infinitely elastic DNA strands • c-token model [Ben-Dor et al. 00]: • Duplex formation requires formation of nucleation complex between perfectly complementary substrings • Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (derived from 2-4 rule)

  14. c-h Code Problem • c-token: left-minimal DNA string of weight ≥ c, i.e., • w(x) ≥ c • w(x’) < c for every proper suffix x’ of x • A set of tags is a c-h code if • (C1) Every tag has weight ≥ h • (C2) Every c-token is used at most once • c-h Code Problem [Ben-Dor et al. 00]: Given c and h, find a maximum cardinality c-h code • [Ben-Dor et al. 00] gave an approximation algorithm based on de Bruijn sequences

  15. Token Content of a Tag • Example with c=4: the tag CCAGATT contains the c-tokens CC, CCA, CAG, AGA, GAT, GATT • Tag → sequence of c-tokens: CCAGATT → CC CCA CAG AGA GAT GATT
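The token content of a tag can be computed by scanning each end position and extending leftwards until the weight first reaches c. The sketch below follows the c-token definition (weight at least c, every proper suffix lighter than c) with wt(A)=wt(T)=1 and wt(C)=wt(G)=2; the names `WEIGHT` and `c_tokens` are illustrative.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag: str, c: int) -> list[str]:
    """Return the c-tokens of `tag`, ordered by their end position."""
    tokens = []
    for end in range(1, len(tag) + 1):
        total, start = 0, end
        # Extend leftwards one base at a time; stopping at the first point
        # where the weight reaches c guarantees every proper suffix is < c.
        while start > 0 and total < c:
            start -= 1
            total += WEIGHT[tag[start]]
        if total >= c:
            tokens.append(tag[start:end])
    return tokens


print(c_tokens("CCAGATT", 4))
# ['CC', 'CCA', 'CAG', 'AGA', 'GAT', 'GATT']
```

Under this representation, two tags violate the uniqueness condition of a c-h code exactly when their token lists share an element.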

  16. Layered c-token Graph [Figure: layered graph with source s, sink t, c-tokens c1, …, cN, and layers labeled c, c+1, …, h-1, h]

  17. Integer Program Formulation • Maximum integer flow problem w/ set capacity constraints • O(hN) constraints & variables, where N = #c-tokens

  18. Packing IP Formulation

  19. Garg-Konemann Algorithm • 1. x ← 0; y ← δ // y_i are variables of the dual LP • 2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i • 3. While weight(p) < 1 do: M ← max_i |p ∩ V_i|; x_p ← x_p + 1/M; for every i, y_i ← y_i (1 + ε |p ∩ V_i| / M); find min weight s-t path p, where weight(v) = y_i for v ∈ V_i • 4. For every p, x_p ← x_p / (1 − log_{1+ε} δ) • [GK98] The algorithm computes a factor (1 − ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations
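The multiplicative-update loop can be sketched on a toy instance. This is a simplification under stated assumptions, not the implementation from the talk: candidate s-t paths are listed explicitly instead of being found by shortest-path computations on the layered graph, each path is given as the list of set indices i of its vertices, and the starting value δ = (1+ε)((1+ε)N)^(-1/ε) is an assumed choice that the slide does not state.

```python
import math
from collections import Counter


def garg_konemann(paths, num_sets, eps=0.1):
    """Fractional path packing with unit capacity on every vertex set V_i.

    paths: explicit candidate paths, each a list of set indices (toy input).
    Returns one fractional value x_p per path.
    """
    # Assumed initial dual value; the slide only shows that y is initialized.
    delta = (1 + eps) * ((1 + eps) * num_sets) ** (-1.0 / eps)
    y = [delta] * num_sets
    x = [0.0] * len(paths)
    counts = [Counter(p) for p in paths]  # counts[k][i] = |p_k ∩ V_i|

    def path_weight(k):
        return sum(cnt * y[i] for i, cnt in counts[k].items())

    while True:
        k = min(range(len(paths)), key=path_weight)  # min weight path
        if path_weight(k) >= 1:
            break
        m = max(counts[k].values())
        x[k] += 1.0 / m
        for i, cnt in counts[k].items():
            y[i] *= 1 + eps * cnt / m  # multiplicative dual update
    scale = 1 - math.log(delta, 1 + eps)  # final scaling step
    return [v / scale for v in x]


# Three paths that pairwise share one set; the LP optimum here is 1.5.
toy_paths = [[0, 1], [1, 2], [2, 0]]
solution = garg_konemann(toy_paths, num_sets=3)
print(sum(solution))
```

Enumerating paths is only viable for very small examples; on the layered c-token graph the minimum would come from a shortest-path computation, as the slide indicates.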

  20. LP Based Tag Set Design [MT06] • Run Garg-Konemann’s algorithm and store the minimum weight paths in a list • Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags • Run alphabetic tree search algorithm to select additional tags consisting of not yet used c-tokens

  21. Periodic Tags • Key observation: c-token uniqueness constraint in c-h code formulation is too strong • A c-token should not appear in two different tags, but can be repeated in a tag • A tag t is called periodic if it is the prefix of (α)^∞ for some string α • Periodic strings make best use of c-tokens

  22. c-token factor graph (c=4, incomplete) [Figure: graph whose nodes include CC, AAG, AAC, AAAA, AAAT]

  23. Cycle Packing Algorithm [MT06] • Construct c-token factor graph G • T ← {} • For all cycles C, in increasing order of cycle length: • Add to T the tag defined by C • Remove C from G • Perform an alphabetic tree search and add to T tags consisting of unused c-tokens • Return T

  24. Experimental Results (h=28) Over 40% increase in the number of tags compared to LP-approx

  25. Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions

  26. Genotype Phasing • For a genotype with k 2’s there are 2^(k-1) possible pairs of haplotypes explaining it • Example: g = 0010212 is explained by (h1 = 0010111, h2 = 0010010) or by (h3 = 0010011, h4 = 0010110) • Computational approaches to genotype phasing • Statistical methods: PHASE, Phamily, PL, GERBIL … • Combinatorial methods: Parsimony, HAP, 2SNP, ENT … • Next-generation genotyping platforms yield millions of SNPs per experiment • Need algorithms that are both accurate and scalable
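The count of 2^(k-1) explaining pairs can be checked by enumeration. In this sketch (the function name `explaining_pairs` is illustrative), the first heterozygous site is fixed to 0 in the first haplotype so that each unordered pair is listed once.

```python
from itertools import product


def explaining_pairs(g: str) -> list[tuple[str, str]]:
    """List all unordered haplotype pairs that explain the 0/1/2 genotype g."""
    het = [i for i, a in enumerate(g) if a == "2"]
    if not het:
        return [(g, g)]  # homozygous everywhere: a single explanation
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        # The first heterozygous site is pinned to 0 in h1 to avoid listing
        # the same pair twice in swapped order.
        for pos, b in zip(het, ("0",) + bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(explaining_pairs("0010212"))
# [('0010010', '0010111'), ('0010011', '0010110')]
```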

  27. Minimum Entropy Genotype Phasing • Phasing: function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g • Coverage of h in f: number of times h appears in the image of f • Entropy of a phasing: [equation not preserved] • Minimum Entropy Genotype Phasing [Halperin & Karp 04]: Given a set of genotypes, find a phasing with minimum entropy
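The entropy formula itself did not survive in the transcript. The sketch below assumes the usual Shannon entropy over haplotype frequencies, where the frequency of h is its coverage divided by 2n for n genotypes; this reading is an assumption and may differ in detail from the definition used in the talk.

```python
import math
from collections import Counter


def phasing_entropy(phasing):
    """Entropy of a phasing given as a list of (h, h') pairs, one per genotype.

    Assumes frequency(h) = coverage(h) / (2 * number of genotypes).
    """
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum((c / total) * math.log2(c / total) for c in coverage.values())


# Reusing the same two haplotypes gives lower entropy than four distinct ones.
print(phasing_entropy([("00", "11"), ("00", "11")]))  # 1.0
print(phasing_entropy([("00", "11"), ("01", "10")]))  # 2.0
```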

  28. Iterative Improvement Algorithm • Initialization: start with a random phasing • Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it
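A naive version of this local search can be written as follows. It is a sketch under assumptions: entropy is taken to be the Shannon entropy of haplotype frequencies, every candidate move recomputes the entropy from scratch rather than using an incremental gain, and no windowing is applied, so it is only suitable for tiny inputs. All function names are illustrative.

```python
import math
import random
from collections import Counter
from itertools import product


def explaining_pairs(g):
    """All unordered haplotype pairs explaining the 0/1/2 genotype g."""
    het = [i for i, a in enumerate(g) if a == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos], h2[pos] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    """Assumed Shannon entropy of haplotype frequencies in the phasing."""
    cov = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum((c / total) * math.log2(c / total) for c in cov.values())


def iterative_improvement(genotypes, seed=0):
    rng = random.Random(seed)
    options = [explaining_pairs(g) for g in genotypes]
    phasing = [rng.choice(opts) for opts in options]  # random initial phasing
    while True:
        current = entropy(phasing)
        best_gain, best_move = 1e-12, None
        for i, opts in enumerate(options):
            for pair in opts:
                if pair == phasing[i]:
                    continue
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                gain = current - entropy(trial)
                if gain > best_gain:  # keep the largest entropy decrease
                    best_gain, best_move = gain, (i, pair)
        if best_move is None:  # no re-phasing lowers the entropy
            return phasing
        phasing[best_move[0]] = best_move[1]


print(iterative_improvement(["0010212", "0010010", "0010111"]))
```

In the usage line, the two homozygous genotypes fix haplotypes 0010010 and 0010111, so the search settles on explaining 0010212 with that same pair.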

  29. Overlapping Window Approach • Entropy is computed over short windows of size l+f • l “locked” SNPs previously phased • f “free” SNPs are currently phased • Only phasings consistent with the l locked SNPs are considered [Figure: genotypes g1 … gn covered by successive windows 1, 2, 3, 4, …, each split into locked and free SNPs]

  30. Effect of Window Size

  31. Time Complexity • n unrelated genotypes over k SNPs • k/f windows • n·2^f candidate haplotype pairs evaluated per window • O(1) time per pair to compute the entropy gain • Empirically, the number of iterations is linear in n, but is reduced to O(log3n) by re-explaining multiple genotypes per iteration (batching) • Total runtime O(n log3n 2^f k/f)

  32. Empirical Runtime

  33. Extension to general pedigrees • Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs • Phasing related genotypes is based on the no-recombination assumption • Algorithm modifications: • At each step re-explain an entire family • Cache inheritance pattern given by first window to speed up computations for subsequent windows • Entropy computation based on founder haplotypes only

  34. Enumeration of Family Phasings • Gaussian elimination [Jiang et al.] • Our current implementation uses a simple backtracking algorithm

  35. Experimental Setup • International HapMap Project, Phase II datasets • 3.7 million SNP loci • Trio and unrelated genotypes from 4 different populations • Reference haplotypes obtained using PHASE • Accuracy measures • Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method • Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs • Compared algorithms • ENT, our entropy minimization algorithm • 2SNP, phasing based on genotype statistics collected for pairs of SNPs [Brinza&Zelikovsky 05] • Pure Parsimony Trio Phasing (PPTP), minimizes the number of distinct haplotypes using integer programming [Brinza et al. 05]
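The switch count behind the switching-error measure can be illustrated as follows. This sketch only counts switches for one individual, assuming the inferred and reference pairs explain the same genotype; how the count is normalized into a relative error is not specified on the slide, so no normalization is applied. The function name `switch_count` is hypothetical.

```python
def switch_count(inferred, reference):
    """Number of switches needed to turn the inferred pair into the reference.

    Both arguments are (h, h') pairs of 0/1 strings for the same genotype.
    """
    a, _ = inferred
    r1, r2 = reference
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # At each heterozygous site, record which reference haplotype `a` follows.
    source = [0 if a[i] == r1[i] else 1 for i in het]
    # A switch occurs wherever consecutive heterozygous sites disagree.
    return sum(1 for s, t in zip(source, source[1:]) if s != t)


print(switch_count(("0011", "1100"), ("0000", "1111")))  # 1
print(switch_count(("0101", "1010"), ("0000", "1111")))  # 3
```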

  36. HapMap Phase II Trio Populations • ENT needs ~1/2 hour on a regular workstation to phase the entire HapMap Phase II dataset, whereas PHASE required months of CPU time on two clusters with a total of 238 nodes

  37. Complex Pedigree Phasing

  38. Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions

  39. Conclusions and Ongoing Work • New optimization problems arising in the area of SNP data collection and analysis • Combinatorial techniques yield highly scalable runtimes (order of magnitude faster than previous methods) and significant solution quality improvements • Ongoing work • Genotyping based on Sequencing by Hybridization (SBE) • Increased throughput by allele pooling • HMM based phasing and genotyping error detection

  40. Acknowledgments • Dragos Trinca (tag set design) • Sasha Gusev and Bogdan Pasaniuc (genotype phasing) • Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation
