Algorithms for SNP Data Collection and Analysis

Ion Mandoiu, Computer Science & Engineering Department, University of Connecticut, http://www.engr.uconn.edu/~ion/

Outline • Biological background • Algorithms for DNA tag set design • Entropy-based SNP genotype phasing • Conclusions

Single Nucleotide Polymorphisms • Human genome ≈ 3 × 10^9 base pairs • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • Total #SNPs ≈ 1 × 10^7 • Difference between any two individuals ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome) • Example (SNP alleles capitalized):
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes • Diploids: two homologous copies of each chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele; 2 if the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g., 011100110 + 001000010 gives 021200210
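
The 0/1/2 encoding can be illustrated with a minimal Python sketch. The function name `genotype` is hypothetical and chosen only for this illustration; the input is the haplotype pair shown on the slide.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype("011100110", "001000010"))  # 021200210
```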

Why SNPs? • Association studies: identification of disease-related genes • Compare SNP genotypes/haplotypes in affected individuals and controls • If particular genotypes/haplotypes occur more frequently in affected individuals, a gene influencing the disease may be located nearby • Drug target discovery, personalized medicine, … • Challenges • Reducing genotyping cost, especially for user-selected SNPs → Tag arrays • Current technologies produce genotypes, not haplotypes → Genotype phasing problem

Genotyping via Direct Hybridization • Labeled sample: SNP1 with alleles T/G, SNP2 with alleles A/G • Array with 2 probes/SNP • Hybridization • Optical scanning used to identify alleles present in the sample [Figure: labeled sample hybridizing to the probe array] Images courtesy of Affymetrix.

DNA Tag Arrays • “Programmable” arrays • Array consists of application-independent tags • Analysis carried out by a sequence of reactions involving application-specific oligonucleotides • Cost effective AND flexible

Genotyping with Tag Arrays (1) Images courtesy of Affymetrix.

Genotyping with Tag Arrays (2) Images courtesy of Affymetrix.

Tag Set Design Problem • (H1) Tags hybridize strongly to complementary antitags • (H2) No tag hybridizes to a non-complementary antitag [Figure: tags t1, t2 and their antitags] • Tag Set Design Problem: find a maximum cardinality set of tags satisfying (H1) and (H2)

Hybridization Models (1) • Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state • 2-4 rule: Tm = 2 × #(As and Ts) + 4 × #(Cs and Gs) • More accurate models exist, e.g., SantaLucia’s nearest-neighbor model, but are computationally complex
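
As a small illustration of the 2-4 rule, the sketch below counts weak (A/T) and strong (C/G) bases. The function name `melting_temp_2_4` is an assumption made for this example, and the result is only the rough estimate the rule provides.

```python
def melting_temp_2_4(seq: str) -> int:
    """Estimate Tm with the 2-4 rule: 2 per A/T plus 4 per C/G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    cg = seq.count("C") + seq.count("G")
    if at + cg != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * cg


print(melting_temp_2_4("CCAGATT"))  # 20
```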

Hybridization Models (2) • Hamming distance model, e.g., [Marathe et al. 01] • Models rigid DNA strands • LCS/edit distance model, e.g., [Torney et al. 03] • Models infinitely elastic DNA strands • c-token model [Ben-Dor et al. 00]: • Duplex formation requires formation of nucleation complex between perfectly complementary substrings • Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (derived from 2-4 rule)

c-h Code Problem • c-token: left-minimal DNA string of weight ≥ c, i.e., • w(x) ≥ c • w(x’) < c for every proper suffix x’ of x • A set of tags is a c-h code if (C1) every tag has weight ≥ h and (C2) every c-token is used at most once • c-h Code Problem [Ben-Dor et al. 00]: given c and h, find a maximum cardinality c-h code • [Ben-Dor et al. 00] gave an approximation algorithm based on De Bruijn sequences
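
The c-token definition can be checked mechanically. The following is a minimal sketch, with hypothetical helper names (`weight`, `is_c_token`, `all_c_tokens`), that tests the two conditions and enumerates all c-tokens for a small c. Because weights are positive, it suffices to test the longest proper suffix.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}  # weights from the c-token model


def weight(s: str) -> int:
    return sum(WEIGHT[b] for b in s)


def is_c_token(x: str, c: int) -> bool:
    """w(x) >= c and every proper suffix has weight < c."""
    # The longest proper suffix x[1:] has the largest weight of all of them.
    return weight(x) >= c and weight(x[1:]) < c


def strings_of_weight_below(c: int) -> list:
    """All DNA strings (including the empty one) with weight < c."""
    result, frontier = [""], [""]
    while frontier:
        frontier = [s + b for s in frontier for b in "ACGT" if weight(s + b) < c]
        result.extend(frontier)
    return result


def all_c_tokens(c: int) -> list:
    """A c-token is one base followed by a string of weight < c, total >= c."""
    return sorted(b + s for s in strings_of_weight_below(c)
                  for b in "ACGT" if weight(b + s) >= c)


print(is_c_token("GATT", 4), is_c_token("CCAG", 4))  # True False
print(len(all_c_tokens(4)))  # 76
```

The count of c-tokens is the quantity N that appears in the size of the integer program.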

Token Content of a Tag (c=4) • Tag CCAGATT contains the c-tokens CC, CCA, CAG, AGA, GAT, GATT • Tag → sequence of c-tokens: CCAGATT → (CC, CCA, CAG, AGA, GAT, GATT)
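
The token sequence of a tag can be computed by scanning every end position and walking left until the accumulated weight reaches c. This is a minimal sketch; the function name `token_content` is an assumption made for illustration.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def token_content(tag: str, c: int) -> list:
    """Return the c-tokens of a tag, ordered by their end position."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        # Extend leftwards from `end`; the first time the weight reaches c
        # we have the shortest substring ending here with weight >= c.
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end])
                break
    return tokens


print(token_content("CCAGATT", 4))
# ['CC', 'CCA', 'CAG', 'AGA', 'GAT', 'GATT']
```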

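Layered c-token Graph [Figure: layered graph with source s and sink t; layers indexed c, c+1, …, h-1, h; each layer contains the c-tokens c1, …, cN]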

Integer Program Formulation • Maximum integer flow problem w/ set capacity constraints • O(hN) constraints & variables, where N = #c-tokens

Packing IP Formulation

Garg-Konemann Algorithm
1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
   M ← max_i |p ∩ V_i|
   x_p ← x_p + 1/M
   For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
   Find min weight s-t path p, where weight(v) = y_i for v ∈ V_i
4. For every p, x_p ← x_p / (1 − log_{1+ε} δ)
[GK98] The algorithm computes a factor (1 − ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations
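
The multiplicative-weights loop of the Garg-Konemann scheme can be sketched in Python under simplifying assumptions: the candidate paths are given explicitly as dictionaries mapping a set index i to |p ∩ V_i|, so the shortest-path computation is replaced by a minimum over that list, and the initial dual value δ = (1+ε)((1+ε)N)^(−1/ε) is the choice used in the standard analysis of packing LPs. The function name and data layout are hypothetical.

```python
import math


def garg_konemann(paths, num_sets, eps=0.1):
    """Approximate max sum(x_p) s.t. sum_p x_p * |p ∩ V_i| <= 1 for all i.

    paths: non-empty list of non-empty dicts {set index i: |p ∩ V_i|}.
    """
    delta = (1 + eps) * ((1 + eps) * num_sets) ** (-1.0 / eps)  # assumed choice
    y = [delta] * num_sets            # dual variables, one per set V_i
    x = [0.0] * len(paths)            # primal variables, one per path

    def path_weight(p):
        return sum(y[i] * cnt for i, cnt in p.items())

    while True:
        # Stand-in for the min weight s-t path computation.
        j = min(range(len(paths)), key=lambda k: path_weight(paths[k]))
        p = paths[j]
        if path_weight(p) >= 1:
            break
        m = max(p.values())
        x[j] += 1.0 / m
        for i, cnt in p.items():
            y[i] *= 1 + eps * cnt / m
    scale = 1 - math.log(delta, 1 + eps)   # final scaling restores feasibility
    return [v / scale for v in x]


# Three paths, each touching two of three sets; the LP optimum is 1.5.
paths = [{0: 1, 1: 1}, {1: 1, 2: 1}, {0: 1, 2: 1}]
x = garg_konemann(paths, num_sets=3)
print(round(sum(x), 3))
```

In the tag design setting the paths are not enumerated; each iteration instead runs a shortest-path computation on the layered c-token graph.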

LP Based Tag Set Design [MT06] • Run Garg-Konemann’s algorithm and store the minimum weight paths in a list • Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags • Run alphabetic tree search algorithm to select additional tags consisting of not yet used c-tokens

Periodic Tags • Key observation: c-token uniqueness constraint in c-h code formulation is too strong • A c-token should not appear in two different tags, but can be repeated in a tag • A tag t is called periodic if it is the prefix of (α)^∞ for some string α • Periodic strings make best use of c-tokens
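
A periodic tag can be built by repeating a period until the required weight h is reached. The sketch below, with hypothetical function names, also lists the distinct c-tokens such a tag uses, which shows why repetition inside a tag saves tokens.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def periodic_tag(period: str, h: int) -> str:
    """Shortest prefix of period repeated indefinitely with weight >= h."""
    if not period:
        raise ValueError("period must be non-empty")
    tag, w, i = [], 0, 0
    while w < h:
        base = period[i % len(period)]
        tag.append(base)
        w += WEIGHT[base]
        i += 1
    return "".join(tag)


def token_content(tag: str, c: int) -> list:
    """c-tokens of a tag: shortest substring of weight >= c per end position."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end])
                break
    return tokens


tag = periodic_tag("CCA", 10)
print(tag)                                  # CCACCA
print(sorted(set(token_content(tag, 4))))   # ['CAC', 'CC', 'CCA']
```

Here a tag of weight 10 consumes only three distinct 4-tokens, because the tokens CC and CCA recur within the tag.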

c-token factor graph (c=4, incomplete) [Figure: graph fragment with nodes including CC, AAG, AAC, AAAA, AAAT]

Cycle Packing Algorithm [MT06] • Construct c-token factor graph G • T ← {} • For all cycles C, in increasing order of cycle length: • Add to T the tag defined by C • Remove C from G • Perform an alphabetic tree search and add to T tags consisting of unused c-tokens • Return T

Experimental Results (h=28) Over 40% increase in the number of tags compared to LP-approx

Genotype Phasing • Example: g = 0010212 is explained by (h1 = 0010111, h2 = 0010010) or by (h3 = 0010011, h4 = 0010110) • For a genotype with k 2’s there are 2^(k-1) possible pairs of haplotypes explaining it • Computational approaches to genotype phasing • Statistical methods: PHASE, Phamily, PL, GERBIL … • Combinatorial methods: Parsimony, HAP, 2SNP, ENT … • Next-generation genotyping platforms yield millions of SNPs per experiment • Need algorithms that are both accurate and scalable
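
The 2^(k-1) count can be verified by enumerating the haplotype pairs that explain a genotype. This is a minimal sketch; the function name `explanations` is hypothetical, and the first heterozygous site of the first haplotype is fixed to 0 so that each unordered pair is listed once.

```python
from itertools import product


def explanations(g: str) -> list:
    """All unordered haplotype pairs (h, h') explaining the 0/1/2 genotype g."""
    het = [i for i, a in enumerate(g) if a == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        # First heterozygous site is always 0 in h1 to avoid mirrored pairs.
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(explanations("0010212"))
# [('0010010', '0010111'), ('0010011', '0010110')]
```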

Minimum Entropy Genotype Phasing • Phasing: function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g • Coverage of h in f: number of times h appears in the image of f • Entropy of a phasing f over n genotypes: Entropy(f) = −Σ_h (cov(h)/2n) log(cov(h)/2n), where cov(h) is the coverage of h in f • Minimum Entropy Genotype Phasing [HalperinKarp 04]: given a set of genotypes, find a phasing with minimum entropy
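
The entropy objective can be sketched as follows, treating cov(h)/2n as the empirical frequency of haplotype h among the 2n haplotypes of a phasing and summing −p log p over distinct haplotypes. The function name and the base-2 logarithm are assumptions made for this example.

```python
import math
from collections import Counter


def phasing_entropy(phasing) -> float:
    """Entropy of a phasing given as a list of (h, h') pairs, one per genotype."""
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    # Base-2 logarithm is an assumption; any base gives the same minimizer.
    return -sum((c / total) * math.log2(c / total) for c in coverage.values())


# Two genotypes "22": reusing haplotypes gives lower entropy.
print(phasing_entropy([("00", "11"), ("00", "11")]))  # 1.0
print(phasing_entropy([("00", "11"), ("01", "10")]))  # 2.0
```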

Iterative Improvement Algorithm • Initialization: start with a random phasing • Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it
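
The local search can be sketched on tiny inputs as below. This is a naive illustration, not the ENT implementation: it enumerates every explanation of every genotype and recomputes the entropy from scratch, which is only practical for a handful of short genotypes. All names, the seed parameter, and the base-2 entropy are assumptions.

```python
import math
import random
from collections import Counter
from itertools import product


def explanations(g):
    """Unordered haplotype pairs explaining the 0/1/2 genotype g."""
    het = [i for i, a in enumerate(g) if a == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def entropy(phasing):
    cov = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum((c / total) * math.log2(c / total) for c in cov.values())


def iterative_improvement(genotypes, seed=0):
    rng = random.Random(seed)
    options = [explanations(g) for g in genotypes]
    phasing = [rng.choice(opts) for opts in options]   # random initial phasing
    while True:
        current = entropy(phasing)
        best_gain, best_move = 1e-12, None             # require a real decrease
        for i, opts in enumerate(options):
            for pair in opts:
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                gain = current - entropy(trial)
                if gain > best_gain:
                    best_gain, best_move = gain, (i, pair)
        if best_move is None:
            return phasing                              # local minimum reached
        phasing[best_move[0]] = best_move[1]


print(iterative_improvement(["22", "00", "11"]))
# [('00', '11'), ('00', '00'), ('11', '11')]
```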

Overlapping Window Approach • Entropy is computed over short windows of size l+f • l “locked” SNPs previously phased • f “free” SNPs are currently phased • Only phasings consistent with the l locked SNPs are considered [Figure: genotypes g1, …, gn covered by overlapping windows 1, 2, 3, 4, …, each split into locked and free SNPs]
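
One way to lay out the windows is sketched below. It assumes that the free SNPs of consecutive windows partition the k SNPs, that the locked SNPs are the l positions immediately before the free block, and that the first window has no locked SNPs; the slide does not spell out these boundary details, and the function name is hypothetical.

```python
def windows(k: int, l: int, f: int):
    """Yield (locked, free) index ranges over SNPs 0..k-1."""
    if f < 1 or l < 0:
        raise ValueError("need f >= 1 and l >= 0")
    start = 0
    while start < k:
        free = range(start, min(start + f, k))     # SNPs phased in this window
        locked = range(max(0, start - l), start)   # SNPs phased earlier
        yield locked, free
        start += f


for locked, free in windows(k=10, l=2, f=4):
    print(list(locked), list(free))
# [] [0, 1, 2, 3]
# [2, 3] [4, 5, 6, 7]
# [6, 7] [8, 9]
```

With this layout the number of windows is k/f rounded up.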

Effect of Window Size

Time Complexity • n unrelated genotypes over k SNPs • k/f windows • n × 2^f candidate haplotype pairs evaluated per window • O(1) time per pair to compute the entropy gain • Empirically, the number of iterations is linear in n, but is reduced to O(log3n) by re-explaining multiple genotypes per iteration (batching) • Total runtime O(n log3n × 2^f × k/f)

Empirical Runtime

Extension to General Pedigrees • Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs • Phasing related genotypes is based on the no-recombination assumption • Algorithm modifications: • At each step re-explain an entire family • Cache inheritance pattern given by first window to speed up computations for subsequent windows • Entropy computation based on founder haplotypes only

Enumeration of Family Phasings • Gaussian elimination [Jiang et al.] • Our current implementation uses a simple backtracking algorithm

Experimental Setup • International HapMap Project, Phase II datasets • 3.7 million SNP loci • Trio and unrelated genotypes from 4 different populations • Reference haplotypes obtained using PHASE • Accuracy measures • Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method • Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs • Compared algorithms • ENT, our entropy minimization algorithm • 2SNP, phasing based on genotype statistics collected for pairs of SNPs [Brinza&Zelikovsky 05] • Pure Parsimony Trio Phasing (PPTP), minimizes the number of distinct haplotypes using integer programming [Brinza et al. 05]
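
The switch count behind the switching-error measure can be sketched as follows. The sketch assumes both haplotype pairs explain the same genotype and counts only the raw number of phase switches between consecutive heterozygous sites; how that count is normalized into a relative error is not specified here, so it is left out. The function name is hypothetical.

```python
def switch_count(inferred, reference) -> int:
    """Number of phase switches turning the inferred pair into the reference."""
    inf_h1, inf_h2 = inferred
    ref_h1, ref_h2 = reference
    switches, prev_phase = 0, None
    for a, b, r1, r2 in zip(inf_h1, inf_h2, ref_h1, ref_h2):
        if r1 == r2:
            continue                      # homozygous site carries no phase
        if {a, b} != {r1, r2}:
            raise ValueError("pairs explain different genotypes")
        phase = (a == r1)                 # does inferred h1 follow reference h1?
        if prev_phase is not None and phase != prev_phase:
            switches += 1
        prev_phase = phase
    return switches


ref = ("0000", "1111")
print(switch_count(("0011", "1100"), ref))  # 1
print(switch_count(("0101", "1010"), ref))  # 3
print(switch_count(("1111", "0000"), ref))  # 0
```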

HapMap Phase II Trio Populations • ENT needs ~1/2 hour on a regular workstation to phase the entire HapMap Phase II dataset, whereas PHASE required months of CPU time on two clusters with a total of 238 nodes

Complex Pedigree Phasing

Conclusions and Ongoing Work • New optimization problems arising in the area of SNP data collection and analysis • Combinatorial techniques yield highly scalable runtimes (order of magnitude faster than previous methods) and significant solution quality improvements • Ongoing work • Genotyping based on Sequencing by Hybridization (SBE) • Increased throughput by allele pooling • HMM based phasing and genotyping error detection

Acknowledgments • Dragos Trinca (tag set design) • Sasha Gusev and Bogdan Pasaniuc (genotype phasing) • Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation
