1 / 48

Design and Optimization of Universal DNA Arrays

Design and Optimization of Universal DNA Arrays. Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/. DNA Arrays. Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests

tomw
Download Presentation

Design and Optimization of Universal DNA Arrays

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Design and Optimization of Universal DNA Arrays Ion Mandoiu Computer Science & Engineering Department University of Connecticut http://www.engr.uconn.edu/~ion/

  2. DNA Arrays • Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests • Used in a variety of high-throughput genomic analyses • Transcription (gene expression) analysis • Single Nucleotide Polymorphism (SNP) genotyping • Alternative splicing, ChIP-on-chip, tiling arrays, genomic-based species identification, point-of-service diagnosis, … • Common array formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

  3. Universal DNA Arrays • “Programable” arrays • Array consists of application independent oligonucleotides • Analysis carried by a sequence of reactions involving application specific primers • Flexible AND cost effective • Universal array architectures: tag arrays, APEX arrays, SBE/SBH arrays,…

  4. Overview • Tag Array Design - Tag Set Design - Tag Assignment Algorithms • SBE/SBH Assays - Decoding and Multiplexing Algorithms • Conclusions

  5. T T T A G C C C C G G G G A A SNP Genotyping with Tag Arrays Tag Primer G + A G 2. Solution phase hybridization • Mix reporter probes with unlabeled genomic DNA C antitag 4. Solid phase hybridization 3. Single-Base Extension (SBE)

  6. Tag Array Advantages • Cost effective • Same array used in many analyses  can be mass produced • Easy to customize • Only need to synthesize new set of reporter probes • Reliable • Solution phase hybridization better understood than hybridization on solid support

  7. Tag Set Design Problem (H1) Tags hybridize strongly to complementary antitags (H2) No tag hybridizes to a non-complementary antitag (H3) Tags do not cross-hybridize to each other t1 t1 t2 t2 t1 t1 t2 Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

  8. Hybridization Models • Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state • 2-4 rule Tm = 2 #(As and Ts) + 4 #(Cs and Gs) • More accurate models exist, e.g., the near-neighbor model

  9. Hybridization Models (contd.) • Hamming distance model, e.g., [Marathe et al. 01] • Models rigid DNA strands • LCS/edit distance model, e.g., [Torney et al. 03] • Models infinitely elastic DNA strands • c-token model [Ben-Dor et al. 00]: • Duplex formation requires formation of nucleation complex between perfectly complementary substrings • Nucleation complex must have weight  c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

  10. c-h Code Problem • c-token:left-minimal DNA string of weight  c, i.e., • w(x)  c • w(x’) < c for every proper suffix x’ of x • A set of tags is a c-h code if (C1) Every tag has weight  h (C2) Every c-token is used at most once c-h Code Problem [Ben-Dor et al.00] Given c and h, find maximum cardinality c-h code

  11. Algorithms for c-h Code Problem • [Ben-Dor et al.00] approximation algorithm based on DeBruijn sequences • Alphabetic tree search algorithm • Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags • Easily modified to handle various combinations of constraints • [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming • Practical runtime using Garg-Koneman approximation and LP-rounding

  12. Token Content of a Tag c=4 CCAGATT CC CCA CAG AGA GAT GATT Tag  sequence of c-tokens End pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

  13. Layered c-token graph for length-l tags l-1 l c/2 (c/2)+1 … c1 t s cN

  14. Integer Program Formulation [MPT05] • Maximum integer flow problem w/ set capacity constraints • O(hN) constraints & variables, where N = #c-tokens

  15. Number of c-tokens

  16. Periodic Tags [MT05] • Key observation: c-token uniqueness constraint in c-h code formulation is too strong • A c-token should not appear in two different tags, but can be repeated in a tag • A tag t is called periodic if it is the prefix of () for some “period”  • Periodic strings make best use of c-tokens

  17. c-token factor graph, c=4 (incomplete) CC AAG AAC AAAA AAAT

  18. Vertex-disjoint Cycle Packing Problem • Given directed graph G, find maximum number of vertex disjoint directed cycles in G • [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2 • h-c/2+1 approximation factor for tag set design problem • [Salavatipour and Verstraete 05] • Quasi-NP-hard to approximate within (log1- n) • O(n1/2) approximation algorithm

  19. Cycle Packing Algorithm • Construct c-token factor graph G • T{} • For all cycles C defining periodic tags, in increasing order of cycle length, • Add to T the tag defined by C • Remove C from G • Perform an alphabetic tree search and add to T tags consisting of unused c-tokens • Return T • Gives an increase of over 40% in the number of tags compared to previous methods

  20. h Experimental Results

  21. Antitag-to-Antitag Hybridization • Additional constraint: antitags do not cross-hybridize, including self • Formalization in c-token hybridization model: (C3) No two tags contain complementary substrings of weight  c • Cycle packing and tree search extend easily

  22. h Results w/ Extended Constraints

  23. More Hybridization Constraints… t1 t1 t2 • Enforced during tag assignment by • - Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03] • - Exploiting availability of multiple primer candidates [MPT05]

  24. p’ t’ p t Assignable Primers • If primer p hybridizes to the complement of tag t’, at most one of the assignments (p,t’), (p,t) and (p’,t’) can be made • Set P of primers is assignable to a set T of tags if the condition above is satisfied for every p,p’ and t,t’

  25. Characterization of Assignable Sets • conflict graph: • G=(T  P,E), where (t,p) ∈ E if t and p hybridize • X = number of primers adjacent to a degree 1 tag • Y = number of degree 0 tags X=1 Y=2 • [Ben-Dor 04] Set P is assignable to T iff X+Y  |P|

  26. Finding Assignable Primer Sets Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets • Both problems are NP-hard [Ben-Dor 04] Maximum Assignable Primer Set Problem: given primer set P and tag set T, find a maximum size assignable subset of P

  27. Integration with Primer Selection • In practice, several primer candidates with equivalent functionality • In SNP genotyping, can pick primer from either forward and reverse strand • In gene expression/identification applications, many primers have desired length, Tm, etc.

  28. Pooled Array Multiplexing Problem Pooled Multiplexing Problem: Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets

  29. Pooled Multiplexing Algorithms • Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04] • Repeatedly delete primer of maximum potential until X+Y  #pools, where • Potential of tag t is 2-deg(t) • Potential of primer p is sum of potentials of conflicting tags • Subtract ½ if primer adjacent to a tag of degree 1

  30. Pooled Multiplexing Algorithms • Primer-Del = greedy deletion for pools similar to [Ben-Dor et al 04] • Primer-Del+ = same but never delete last primer from pool unless no other choice • Min-Pot = select primer with min potential from each pool, then run Primer-Del • Min-Deg = select primer with min degree, then run Primer-Del • Iterative ILP = iteratively find a maximum assignable pool set using integer linear program

  31. Results: GenFlex Tags, c=8

  32. Herpes B Gene Expression Assay GenFlex Tags Periodic Tags

  33. Overview • Tag Array Design - Tag Set Design - Tag Assignment Algorithms • SBE/SBH Assays - Decoding and Multiplexing Algorithms • Conclusions

  34. SBE/SBH Assay [MP 06] Primers T T A A T T TTGCA T CCATT A GATAA T hybridization to k-mer array (SBH) single-base extension (SBE)

  35. Some notations • P set of primers, X set of probes • Ep⊆ {A,C,T,G} the set of possible extensions for primer p • The spectrum of primer p, SpecX(p), is the set of probes hybridizing with p • The extended spectrum of primer p with extension set Ep,

  36. Decodable primer sets • Four parallel single-color SBE/SBH experiments  one type of extension in each SBE experiment • P is weakly decodable with respect to extension e if for every primer p • One SBE/SBH experiment with 4 colors (4 extensions) • P is weaklydecodable if for every primer p and every extension e ∈ Ep

  37. Strongly r-decodable primer sets • Hybridization involving labeled nucleotide is less predictable Informative probes should not rely on it • Signal from one SNP may obscure signal from another when read at the same probe due to differences in DNA amplification efficiency Informative probes cannot be shared between SNPs • P is strongly r-decodable if for every primer p where r = redundancy parameter

  38. MPPP • A set of primer pools P ={P1,…,Pn } is strongly r-decodable iff there is a primer pi in each pool Pi such that {p1,…,pn} is strongly r-decodable. Minimum Pool Partitioning Problem (MPPP) Given: • primer pools set P and extensions sets Ep, for every primer p • probe set X • redundancy r Find: partition of P into the min number of strongly r-decodable subsets

  39. MDPSP Maximum r-Decodable Pool Subset Problem (MDPSP) Given: • primer pools set P and extensions sets Ep, for every primer p • probe set X • redundancy r Find: • strongly r-decodable subset of P of maximum size

  40. Min-Greedy Algorithm for Maximum Induced Matching in General Graphs • Pick a vertex u of min degree • Pick a vertex v of min degree from among u’s neighbors • Add edge (u,v) to the matching • Delete all neighbors of u and v from the graph • Repeat the above steps until the graph becomes empty • [Duckworth 05] d-1 approximation factor for d-regular graphs

  41. Min-Greedy Algorithms for MDPSP • Bipartite hybridization graph G: • Primers in left side, probes in right side • Two types of edges: • N+(p)=SpecX(p) • N-(p)=SpecX(p,Ep) \ SpecX(p) • Two algorithm variants: • MinPrimerGreedy: pick primer first • MinProbeGreedy: pick probe first • Delete primer/probe if N+ degree drops below r/1

  42. Experimental results for k-mers (|Ep|=4, primer length=20)

  43. MDPSP Size vs Primer Length k=10

  44. Experimental results for c-tokens (|Ep|=4, primer length=20)

  45. MDPSP Size vs Primer Length c=13

  46. Overview • Tag Array Design - Tag Set Design - Tag Assignment Algorithms • SBE/SBH Assays - Decoding and Multiplexing Algorithms • Conclusions

  47. Conclusions and Ongoing Work • Combinatorial algorithms yield significant increases in multiplexing rates of universal arrays • New SBE/SBH architecture particularly promising based on preliminary simulation results • Ongoing work: • Extend methods to more accurate hybridization models, e.g., use NN melting temperature models • More complex (e.g., temperature dependent) DNA tag set non-interaction requirements for DNA self/mediated assembly • Probabilistic decoding in presence of hybridization errors • Application to novel domains, e.g., DNA barcoding

  48. Acknowledgments • Claudia Prajescu and Dragos Trinca • Funding from NSF (Awards 0546457 and 0543365) and UCONN Research Foundation

More Related