Algorithms for SNP Data Collection and Analysis. Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/. Outline. Biological background Algorithms for DNA tag set design Entropy-based SNP genotype phasing Conclusions.
Algorithms for SNP Data Collection and Analysis Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/
Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions
Single Nucleotide Polymorphisms • Human Genome $\approx 3 \times 10^9$ base pairs • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • Total #SNPs $\approx 1 \times 10^7$ • Difference b/w any two individuals $\approx 3 \times 10^6$ SNPs ($\approx 0.1\%$ of entire genome) … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …
Haplotypes and Genotypes • Diploids: two homologous copies of each chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele; 2 if the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g., haplotypes 011100110 + 001000010 give genotype 021200210
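The 0/1/2 encoding can be checked on the three vectors shown on this slide. The following is a minimal sketch; the function name `genotype_from_haplotypes` is made up for illustration and is not from the talk.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); different alleles are encoded as 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype_from_haplotypes("011100110", "001000010"))  # prints 021200210
```

The mapping is many-to-one: several haplotype pairs can produce the same genotype, which is what makes phasing a non-trivial inference problem.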
Why SNPs? • Association studies: identification of disease-related genes • Compare SNP genotypes/haplotypes in affected individuals and controls • If particular genotypes/haplotypes occur more frequently in affected individuals, a gene influencing the disease may be located nearby → Drug target discovery, personalized medicine, … • Challenges • Reducing genotyping cost, especially for user-selected SNPs → Tag arrays • Current technologies produce genotypes, not haplotypes → Genotype phasing problem
Genotyping via direct hybridization • Array with 2 probes/SNP, e.g., SNP1 with alleles T/G and SNP2 with alleles A/G • Labeled sample is hybridized to the array • Optical scanning used to identify alleles present in the sample [Figure: labeled sample hybridizing to array probes. Images courtesy of Affymetrix.]
DNA Tag Arrays • “Programmable” arrays • Array consists of application-independent tags • Analysis carried out by a sequence of reactions involving application-specific oligonucleotides • Cost effective AND flexible
Genotyping with Tag Arrays (1) Images courtesy of Affymetrix.
Genotyping with Tag Arrays (2) Images courtesy of Affymetrix.
Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions
Tag Set Design Problem • (H1) Tags hybridize strongly to complementary antitags • (H2) No tag hybridizes to a non-complementary antitag • Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1) and (H2) [Figure: tags t1, t2 and their antitags]
Hybridization Models (1) • Melting temperature $T_m$: temperature at which 50% of duplexes are in hybridized state • 2-4 rule: $T_m = 2 \cdot \#(\text{As and Ts}) + 4 \cdot \#(\text{Cs and Gs})$ • More accurate models exist, e.g., SantaLucia’s nearest-neighbor model, but are computationally more complex
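As a small illustration of the 2-4 rule, the sketch below counts bases and applies the formula. The function name `tm_2_4_rule` is hypothetical, and the result is assumed to be in degrees Celsius, as is usual for this rule of thumb (the slide does not state units).

```python
def tm_2_4_rule(seq: str) -> int:
    """Estimate melting temperature with the 2-4 rule (assumed unit: degrees Celsius)."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    weak = seq.count("A") + seq.count("T")    # A/T bases contribute 2 each
    strong = seq.count("C") + seq.count("G")  # C/G bases contribute 4 each
    return 2 * weak + 4 * strong


print(tm_2_4_rule("CCAGATT"))  # 2*4 + 4*3 = 20
```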
Hybridization Models (2) • Hamming distance model, e.g., [Marathe et al. 01] • Models rigid DNA strands • LCS/edit distance model, e.g., [Torney et al. 03] • Models infinitely elastic DNA strands • c-token model [Ben-Dor et al. 00]: • Duplex formation requires formation of nucleation complex between perfectly complementary substrings • Nucleation complex must have weight $\ge c$, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (derived from 2-4 rule)
c-h Code Problem • c-token: left-minimal DNA string of weight $\ge c$, i.e., • $w(x) \ge c$ • $w(x') < c$ for every proper suffix $x'$ of $x$ • A set of tags is a c-h code if (C1) Every tag has weight $\ge h$ (C2) Every c-token is used at most once • c-h Code Problem [Ben-Dor et al. 00]: Given c and h, find maximum cardinality c-h code • [Ben-Dor et al. 00] gave approximation algorithm based on de Bruijn sequences
Token Content of a Tag • Example with c=4: tag CCAGATT → sequence of c-tokens CC, CCA, CAG, AGA, GAT, GATT
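The token content of a tag can be computed by scanning each end position and extending leftwards until the weight reaches c. This is a minimal sketch, assuming the weights wt(A)=wt(T)=1 and wt(C)=wt(G)=2 given earlier; the helper name `c_tokens` is made up for illustration.

```python
from typing import List

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag: str, c: int) -> List[str]:
    """Return the c-tokens of `tag`, ordered by their end position."""
    tokens = []
    for end in range(1, len(tag) + 1):
        # Extend the substring ending at `end` to the left until its weight reaches c.
        start, w = end, 0
        while start > 0 and w < c:
            start -= 1
            w += WEIGHT[tag[start]]
        # The shortest such substring has weight >= c, and every proper suffix
        # of it has weight < c, so it is a c-token.
        if w >= c:
            tokens.append(tag[start:end])
    return tokens


print(c_tokens("CCAGATT", 4))  # ['CC', 'CCA', 'CAG', 'AGA', 'GAT', 'GATT']
```

End positions whose whole prefix is lighter than c (here the first C) contribute no token.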
Layered c-token Graph [Figure: source s, sink t, and c-token vertices c1, …, cN arranged in layers c, c+1, …, h-1, h]
Integer Program Formulation • Maximum integer flow problem w/ set capacity constraints • O(hN) constraints & variables, where N = #c-tokens
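Garg-Konemann Algorithm
1. $x \leftarrow 0$; $y \leftarrow \delta$ // $y_i$ are variables of the dual LP
2. Find min weight s-t path $p$, where weight$(v) = y_i$ for every $v \in V_i$
3. While weight$(p) < 1$ do
   • $M \leftarrow \max_i |p \cap V_i|$
   • $x_p \leftarrow x_p + 1/M$
   • For every $i$, $y_i \leftarrow y_i \, (1 + \varepsilon \, |p \cap V_i| / M)$
   • Find min weight s-t path $p$, where weight$(v) = y_i$ for $v \in V_i$
4. For every $p$, $x_p \leftarrow x_p / (1 - \log_{1+\varepsilon} \delta)$
[GK98] The algorithm computes a factor $(1-\varepsilon)^2$ approximation to the optimal LP solution with $(N/\varepsilon) \log_{1+\varepsilon} N$ shortest path computations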
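The steps above can be sketched on a generic DAG as follows. This is an illustrative sketch, not the implementation from [MT06]: the function name, the input representation (`order`, `succ`, `group`) and the toy graph are made up, and the initial value of δ is an assumption, since the slide does not state it. It also assumes every s-t path visits at least one vertex belonging to some set V_i.

```python
from collections import Counter
from math import inf, log


def garg_konemann(order, succ, group, s, t, eps=0.1):
    """order: vertices in topological order; succ: vertex -> successors;
    group: vertex -> index i of the set V_i containing it (s and t have none)."""
    groups = set(group.values())
    # Assumed initial dual value (not given on the slide).
    delta = (1 + eps) * ((1 + eps) * len(groups)) ** (-1 / eps)
    y = {i: delta for i in groups}
    x = {}

    def min_weight_path():
        # Dynamic programming over the DAG; entering vertex v costs y_i for v in V_i.
        dist = {v: inf for v in order}
        prev = {}
        dist[s] = 0.0
        for u in order:
            if dist[u] == inf:
                continue
            for v in succ.get(u, ()):
                d = dist[u] + (y[group[v]] if v in group else 0.0)
                if d < dist[v]:
                    dist[v], prev[v] = d, u
        if dist[t] == inf:
            return None, inf
        path = [t]
        while path[-1] != s:
            path.append(prev[path[-1]])
        return tuple(reversed(path)), dist[t]

    p, w = min_weight_path()
    while p is not None and w < 1:
        hits = Counter(group[v] for v in p if v in group)  # |p ∩ V_i| for each i
        if not hits:
            raise ValueError("every s-t path must visit at least one set V_i")
        m_max = max(hits.values())
        x[p] = x.get(p, 0.0) + 1.0 / m_max
        for i, k in hits.items():
            y[i] *= 1 + eps * k / m_max
        p, w = min_weight_path()

    scale = 1 - log(delta) / log(1 + eps)  # 1 - log_{1+eps}(delta)
    return {path: value / scale for path, value in x.items()}


# Toy instance: two disjoint s-t paths, each through its own set V_i.
order = ["s", "a", "b", "t"]
succ = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
group = {"a": 0, "b": 1}
flow = garg_konemann(order, succ, group, "s", "t")
print(sum(flow.values()))  # close to the LP optimum of 2
```

After the final scaling, each set V_i carries total flow at most 1, so the returned fractional solution is feasible for the LP.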
LP Based Tag Set Design [MT06] • Run Garg-Konemann’s algorithm and store the minimum weight paths in a list • Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags • Run alphabetic tree search algorithm to select additional tags consisting of not yet used c-tokens
Periodic Tags • Key observation: c-token uniqueness constraint in c-h code formulation is too strong • A c-token should not appear in two different tags, but can be repeated within a tag • A tag t is called periodic if it is a prefix of $\alpha^\infty$ (the string $\alpha$ repeated indefinitely) for some string $\alpha$ • Periodic strings make best use of c-tokens
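A periodic tag can be generated by repeating a period string until the tag is heavy enough. The sketch below assumes that a tag must reach weight at least h (an assumption about the tag weight requirement) and uses the weights wt(A)=wt(T)=1, wt(C)=wt(G)=2; the function names are hypothetical.

```python
from itertools import cycle

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def periodic_tag(alpha: str, h: int) -> str:
    """Shortest prefix of alpha repeated indefinitely with weight >= h (assumed rule)."""
    if not alpha:
        raise ValueError("period must be non-empty")
    tag, w = [], 0
    for base in cycle(alpha):
        if w >= h:
            break
        tag.append(base)
        w += WEIGHT[base]
    return "".join(tag)


def has_period(tag: str, alpha: str) -> bool:
    """True if `tag` is a prefix of alpha repeated indefinitely."""
    reps = len(tag) // len(alpha) + 1
    return (alpha * reps).startswith(tag)


print(periodic_tag("AAC", 10))  # AACAACAA
```

For c=4 the tag AACAACAA contains only three distinct c-tokens (AAC, ACA, CAA), each used twice, which illustrates why repetition within a tag saves c-tokens.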
c-token factor graph (c=4, incomplete) [Figure: part of the graph, with vertices such as CC, AAG, AAC, AAAA, AAAT]
Cycle Packing Algorithm [MT06] • Construct c-token factor graph G • $T \leftarrow \{\}$ • For all cycles C, in increasing order of cycle length: • Add to T the tag defined by C • Remove C from G • Perform an alphabetic tree search and add to T tags consisting of unused c-tokens • Return T
Experimental Results (h=28) Over 40% increase in the number of tags compared to LP-approx
Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions
Genotype Phasing • For a genotype with $k \ge 1$ 2’s there are $2^{k-1}$ possible pairs of haplotypes explaining it • Example: g = 0010212 is explained by (h1, h2) = (0010111, 0010010) and by (h3, h4) = (0010011, 0010110) • Computational approaches to genotype phasing • Statistical methods: PHASE, Phamily, PL, GERBIL … • Combinatorial methods: Parsimony, HAP, 2SNP, ENT … • Next-generation genotyping platforms yield millions of SNPs per experiment • Need algorithms that are both accurate and scalable
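The count of explaining pairs can be checked by brute-force enumeration on the genotype g = 0010212 from this slide. This is a minimal sketch; the function name `explaining_pairs` is made up for illustration, and enumeration is only practical for small k.

```python
from itertools import product


def explaining_pairs(g: str):
    """List all unordered haplotype pairs (h, h') that explain genotype g."""
    het = [i for i, x in enumerate(g) if x == "2"]
    if not het:
        return [(g, g)]  # a fully homozygous genotype has a single explanation
    pairs = []
    # Fix the first heterozygous site of h to 0 so that (h, h') and (h', h)
    # are not both generated; the other k-1 sites range over all 0/1 choices.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


for pair in explaining_pairs("0010212"):
    print(pair)
# ('0010010', '0010111')
# ('0010011', '0010110')
```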
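Minimum Entropy Genotype Phasing • Phasing: function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g • Coverage of h in f, cov(h,f): number of times h appears in the image of f • Entropy of a phasing: $\mathrm{Entropy}(f) = -\sum_{h} \frac{\mathrm{cov}(h,f)}{2n} \log \frac{\mathrm{cov}(h,f)}{2n}$, where $n$ is the number of genotypes and the sum ranges over haplotypes with nonzero coverage • Minimum Entropy Genotype Phasing [HalperinKarp 04]: Given a set of genotypes, find a phasing with minimum entropy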
Iterative Improvement Algorithm • Initialization: start with a random phasing • Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it
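The greedy loop can be sketched on a toy input as follows, taking the entropy of a phasing to be the Shannon entropy of haplotype frequencies (coverage divided by 2n, base-2 logarithm assumed). This is not the ENT implementation: all names are hypothetical, candidate pairs are enumerated exhaustively, and the entropy is recomputed from scratch instead of using an O(1) gain update.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """All unordered haplotype pairs explaining the 0/1/2 genotype string g."""
    het = [i for i, x in enumerate(g) if x == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    """Shannon entropy (base 2, assumed) of haplotype frequencies in a phasing."""
    cov = Counter(h for pair in phasing for h in pair)
    total = sum(cov.values())  # equals 2n
    return sum(-c / total * log2(c / total) for c in cov.values())


def greedy_min_entropy_phasing(genotypes, seed=0):
    rng = random.Random(seed)
    candidates = [explaining_pairs(g) for g in genotypes]
    phasing = [rng.choice(c) for c in candidates]  # random initial phasing
    current = entropy(phasing)
    while True:
        best = None  # (entropy after re-phasing, genotype index, new pair)
        for i, cands in enumerate(candidates):
            for pair in cands:
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                e = entropy(trial)
                if e < current - 1e-12 and (best is None or e < best[0]):
                    best = (e, i, pair)
        if best is None:  # no single re-phasing decreases the entropy
            return phasing, current
        current, i, pair = best
        phasing[i] = pair


genotypes = ["0010212", "0010111", "0010010"]
phasing, value = greedy_min_entropy_phasing(genotypes)
print(phasing[0], value)
```

On this toy input the ambiguous genotype 0010212 ends up explained by the two haplotypes already present in the homozygous genotypes, giving entropy 1.0, since reusing haplotypes lowers the entropy.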
Overlapping Window Approach • Entropy is computed over short windows of size l+f • l “locked” SNPs were previously phased • f “free” SNPs are currently being phased • Only phasings consistent with the l locked SNPs are considered [Figure: genotypes g1 … gn covered by windows 1, 2, 3, 4, …, each split into locked and free SNPs]
Time Complexity • n unrelated genotypes over k SNPs • k/f windows • $n \cdot 2^f$ candidate haplotype pairs evaluated per window • O(1) time per pair to compute the entropy gain • Empirically, the number of iterations is linear in n, but is reduced to $O(\log^3 n)$ by re-explaining multiple genotypes per iteration (batching) • Total runtime $O(n \log^3 n \cdot 2^f \cdot k/f)$
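One way to read the bullets above as a product, assuming each of the batched iterations rescans all candidate pairs of the current window (an interpretation, not stated explicitly on the slide):

```latex
\[
\underbrace{\frac{k}{f}}_{\text{windows}}
\times \underbrace{O(\log^3 n)}_{\text{iterations per window}}
\times \underbrace{n \cdot 2^f}_{\text{pairs per iteration}}
\times \underbrace{O(1)}_{\text{time per pair}}
= O\!\left(n \log^3 n \cdot 2^f \cdot \frac{k}{f}\right)
\]
```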
Extension to general pedigrees • Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs • Phasing related genotypes is based on the no-recombination assumption • Algorithm modifications: • At each step re-explain an entire family • Cache inheritance pattern given by first window to speed up computations for subsequent windows • Entropy computation based on founder haplotypes only
Enumeration of Family Phasings • Gaussian elimination [Jiang et al.] • Our current implementation uses a simple backtracking algorithm
Experimental Setup • International HapMap Project, Phase II datasets • 3.7 million SNP loci • Trio and unrelated genotypes from 4 different populations • Reference haplotypes obtained using PHASE • Accuracy measures • Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method • Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs • Compared algorithms • ENT, our entropy minimization algorithm • 2SNP, phasing based on genotype statistics collected for pairs of SNPs [Brinza&Zelikovsky 05] • Pure Parsimony Trio Phasing (PPTP), minimizes the number of distinct haplotypes using integer programming [Brinza et al. 05]
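The switch count underlying the switching error can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the reported results; the function name is hypothetical, and any normalization used to make the error "relative" is left out because the slide does not specify it.

```python
def switch_count(inferred, reference):
    """Number of switches needed to turn the inferred pair into the reference pair."""
    (a1, a2), (r1, r2) = inferred, reference
    if not (len(a1) == len(a2) == len(r1) == len(r2)):
        raise ValueError("all haplotypes must have the same length")
    for i in range(len(r1)):
        if {a1[i], a2[i]} != {r1[i], r2[i]}:
            raise ValueError("both pairs must explain the same genotype")
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # At each heterozygous site, does inferred copy 1 agree with reference copy 1?
    same = [a1[i] == r1[i] for i in het]
    # A switch is needed wherever that agreement flips between consecutive sites.
    return sum(1 for x, y in zip(same, same[1:]) if x != y)


reference = ("0010111", "0010010")
print(switch_count(("0010011", "0010110"), reference))  # 1
print(switch_count(("0010010", "0010111"), reference))  # 0 (same pair, other order)
```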
HapMap Phase II Trio Populations • ENT needs ~1/2 hour on a regular workstation to phase the entire HapMap Phase II dataset, whereas PHASE required months of CPU time on two clusters with a total of 238 nodes
Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions
Conclusions and Ongoing Work • New optimization problems arising in the area of SNP data collection and analysis • Combinatorial techniques yield highly scalable runtimes (order of magnitude faster than previous methods) and significant solution quality improvements • Ongoing work • Genotyping based on Sequencing by Hybridization (SBE) • Increased throughput by allele pooling • HMM based phasing and genotyping error detection
Acknowledgments • Dragos Trinca (tag set design) • Sasha Gusev and Bogdan Pasaniuc (genotype phasing) • Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation