Engineering a Scalable Placement Heuristic for DNA Probe Arrays

A.B. Kahng, I.I. Mandoiu, P. Pevzner, S. Reda (all UCSD), A. Zelikovsky (GSU)

Outline • DNA probe arrays and unwanted illumination • Synchronous array design (2-D placement) • Asynchronous array design (3-D placement) • Experimental results • Extensions • Conclusions

DNA Probe Arrays • Used in a wide range of genomic analyses • Gene expression monitoring, SNP mapping, sequencing by hybridization, … • Arrays with up to 1000x1000 probes in commercial use, 10^8 probes envisioned for next-generation arrays • Highly scalable algorithms required for array design

Simplified DNA Array Flow • Input: gene sequences, position of SNPs, etc. • Probe Selection → Mask Design: Placement & Embedding → Mask Manufacturing → Array Manufacturing → Hybridization Experiment → Analysis of Hybridization Intensities • Stages are divided between the Soft/Computational Domain and the Hard/Biochemistry Domain

Array Manufacturing Process • Very Large-Scale Immobilized Polymer Synthesis: (1) Treat substrate with chemically protected “linker” molecules, creating a rectangular array (site size approx. 10x10 microns) (2) Selectively expose array sites to light; light deprotects the exposed molecules, activating further synthesis (3) Flush chip surface with a solution of protected A, C, G, T; binding occurs at the previously deprotected sites (4) Repeat steps 2 and 3 until the desired probes are synthesized

Photo-Deprotection Step • Our concern: diffraction → unwanted illumination → yield decrease

Probe Synthesis [Figure: placed probes, the nucleotide deposition sequence ACG, and the corresponding masks M1, M2, M3]

Measuring Unwanted Illumination • Unwanted illumination is measured by border length [Figure: placed probes, the nucleotide deposition sequence ACG, and masks M1, M2, M3 with a mask border highlighted]

Synchronous vs. Asynchronous Synthesis [Figure, four panels] (a) Periodic deposition sequence, divided into 4-groups (b) Synchronous embedding of CTG (c) Asynchronous leftmost embedding of CTG (d) Another asynchronous embedding
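The difference between panels (b) and (c) can be sketched in a few lines. This is only an illustration: the function names and the choice of ACTG as the period are assumptions, not taken from the slides. The synchronous embedding places the i-th nucleotide inside the i-th 4-group, while the leftmost embedding places every nucleotide at the earliest possible step.

```python
def synchronous_embedding(probe, period="ACTG"):
    """Deposition steps (0-based) when the i-th nucleotide of `probe`
    is synthesized inside the i-th repetition of `period`."""
    return [i * len(period) + period.index(nt) for i, nt in enumerate(probe)]


def leftmost_embedding(probe, deposition):
    """Deposition steps (0-based) when every nucleotide of `probe`
    is synthesized at the earliest step that still keeps the order."""
    steps = []
    pos = 0
    for nt in probe:
        # Advance to the next step that deposits this nucleotide.
        while pos < len(deposition) and deposition[pos] != nt:
            pos += 1
        if pos == len(deposition):
            raise ValueError("probe cannot be embedded in the deposition sequence")
        steps.append(pos)
        pos += 1
    return steps


deposition = "ACTG" * 3  # hypothetical periodic sequence (ACTG)^3
print(synchronous_embedding("CTG"))           # one nucleotide per 4-group
print(leftmost_embedding("CTG", deposition))  # all nucleotides as early as possible
```

Any increasing list of steps whose letters spell the probe is a valid asynchronous embedding; the leftmost one is just one canonical choice.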

Problem Formulation (Synchronous Case) • Synchronous Array Design (2-D Placement) Problem: • Minimize placement cost of Hamming graph H (vertices = probes, distance = Hamming) • On 2-dimensional grid graph G2 (N x N array, edges between distance-1 neighbors) [Figure: probes of H mapped to sites of G2]
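A minimal sketch of the placement cost in the synchronous case, assuming the probes are equal-length strings stored as a list of rows. The names `hamming` and `border_cost` are illustrative and do not come from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length probes differ."""
    return sum(x != y for x, y in zip(a, b))


def border_cost(grid):
    """Sum of Hamming distances over all horizontally and vertically
    adjacent pairs of sites in the grid."""
    total = 0
    for r, row in enumerate(grid):
        for c, probe in enumerate(row):
            if c + 1 < len(row):          # right neighbor
                total += hamming(probe, row[c + 1])
            if r + 1 < len(grid):         # lower neighbor
                total += hamming(probe, grid[r + 1][c])
    return total


grid = [["AC", "AG"],
        ["CC", "CG"]]
print(border_cost(grid))
```

Each grid edge is counted once, so the cost of a placement is the total weight of the edges of H that the placement maps onto edges of G2.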

2-D Placement Lower Bound • Sum, over all probes, of Hamming distances to the 4 closest neighbors in H, minus the weight of the 4N heaviest arcs [Figure: H and G2]

TSP + 1-Threading Placement • Hubbell, 1990s: • Find TSP tour/path over given probes w.r.t. Hamming distance • Thread TSP path in the grid row by row • Hannenhalli, Hubbell, Lipshutz, Pevzner '02: • Place the probes according to 1-Threading • Further decreases total border by 20%

Lexicographical Sorting + 1-Threading • Radix-sort the probes in lexicographical order • Thread on the chip [Figure: sorted probes threaded on the chip]
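The sort-then-thread idea can be sketched as follows. Two simplifications are assumed here: Python's built-in `sorted` stands in for the radix sort, and a plain serpentine (row-by-row, alternating direction) threading stands in for the 1-threading pattern, which is not spelled out on the slide. The function name is hypothetical.

```python
def sort_and_thread(probes, n):
    """Sort probes lexicographically and lay them out on an n x n grid
    in serpentine order, so consecutive probes in the sorted list
    always end up on adjacent sites."""
    if len(probes) != n * n:
        raise ValueError("expected exactly n*n probes")
    ordered = sorted(probes)  # stand-in for radix sort (assumption)
    grid = []
    for r in range(n):
        row = ordered[r * n:(r + 1) * n]
        if r % 2 == 1:
            row.reverse()     # odd rows run right to left
        grid.append(row)
    return grid


print(sort_and_thread(["TG", "AC", "CA", "AA"], 2))
```

Lexicographically close probes share long prefixes, so neighbors along the thread tend to have small Hamming distance.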

Matching-Based Probe Placement • Select an independent (mutually nonadjacent) set of placed probes • Re-embed using optimal perfect matching • Total cost can only decrease or remain the same • Runtime: roughly proportional to square of independent set size [Figure: independent set of sites and the re-assigned probes]
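A small sketch of one matching step, under stated assumptions: the independent set is passed in by the caller, and the optimal perfect matching is found by brute force over permutations, which is only practical for tiny sets. All function names are illustrative.

```python
from itertools import permutations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def adjacent(grid, r, c):
    """Yield coordinates of the up/down/left/right neighbors of (r, c)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]):
            yield rr, cc


def rematch(grid, sites):
    """Reassign the probes sitting on `sites` (assumed mutually
    nonadjacent) so that the total conflict with their fixed
    neighbors is minimized."""
    probes = [grid[r][c] for r, c in sites]

    def cost(probe, site):
        # Neighbors of an independent site are never in `sites`,
        # so they keep their probes during the reassignment.
        return sum(hamming(probe, grid[rr][cc]) for rr, cc in adjacent(grid, *site))

    # Brute force over all assignments (stand-in for a matching solver).
    best = min(permutations(probes),
               key=lambda perm: sum(cost(p, s) for p, s in zip(perm, sites)))
    new_grid = [row[:] for row in grid]
    for probe, (r, c) in zip(best, sites):
        new_grid[r][c] = probe
    return new_grid


grid = [["AA", "CC", "AA", "AA", "CC"]]
print(rematch(grid, [(0, 1), (0, 3)]))
```

Because the current assignment is one of the candidates, the cost cannot increase. For realistic set sizes a polynomial-time assignment solver (for example `scipy.optimize.linear_sum_assignment`) would replace the permutation search.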

Sliding Window Matching • Iterate SlidingWindowMatching over the chip until improvement drops below 0.1% • There is a trade-off between solution quality and size/overlap of windows

Effect of Window Size on Solution Quality • Increased window size/overlap decreases the number of conflicts, but increases runtime

Epitaxial Placement Algorithm • Simulates crystal growth • Start with an arbitrary probe placed at the center • Maintain a best probe candidate (i.e., a probe with the minimum number of conflicts with the already placed neighbors) for each border site • Iteratively fill the border site with the minimum increase in border length, giving priority to sites with more neighbors filled

Tile- and Row-Epitaxial • Tile-epitaxial: • Divide array into 100x100 tiles • Run Epitaxial within each tile • Take into account the border of already placed tiles • Row-epitaxial: • Place probes by a fast method, e.g., sort + 1-thread • Re-place probes row by row, sequentially filling sites within a row • Assign to each site a probe with the minimum number of conflicts among the unplaced probes from the following K rows
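The row-epitaxial pass can be sketched as below. Several details are assumptions made for the sake of a runnable example: candidates are the not-yet-fixed probes in the rest of the current row plus the next `k` rows, conflicts are counted against the already fixed left and upper neighbors, and the chosen probe is swapped with the one currently on the site. The function names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def row_epitaxial(grid, k):
    """Re-place probes row by row, filling each site with the candidate
    that has the fewest conflicts with its fixed neighbors."""
    grid = [row[:] for row in grid]  # work on a copy
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            # Probes already fixed next to this site (left and above).
            fixed = []
            if c > 0:
                fixed.append(grid[r][c - 1])
            if r > 0:
                fixed.append(grid[r - 1][c])
            # Candidate sites: rest of this row plus the following k rows.
            candidates = [(r, cc) for cc in range(c, len(grid[r]))]
            for rr in range(r + 1, min(len(grid), r + k + 1)):
                candidates += [(rr, cc) for cc in range(len(grid[rr]))]
            # Ties keep the earliest candidate, i.e. the current probe first.
            br, bc = min(candidates,
                         key=lambda s: sum(hamming(grid[s[0]][s[1]], f) for f in fixed))
            grid[r][c], grid[br][bc] = grid[br][bc], grid[r][c]
    return grid


start = [["AA", "CC"],
         ["AC", "AA"]]
print(row_epitaxial(start, k=1))
```

On this toy input the pass moves the second AA next to the first one, which lowers the total border cost from 6 to 4. A larger lookahead `k` widens the candidate pool at the price of more distance evaluations per site.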

2-D Placement Algorithm Comparison: Border Conflict

2-D Placement Algorithm Comparison: Runtime

Problem Formulation (Asynchronous Case) • Asynchronous synthesis: • Periodic nucleotide deposition sequence, e.g., (ACTG)^p • Every probe grows asynchronously → Border length = Hamming distance between embedded probes • Asynchronous Array Design (3-D Placement) Problem: • Minimize placement cost of embedded-probe Hamming graph H (vertices = probes, distance = Hamming between embedded probes) • On 2-dimensional grid graph G2 (N x N array, edges between neighbors) [Figure: probes of H mapped to sites of G2]

Lower Bound • Sum of distances to 4 closest neighbors minus weight of 4N heaviest arcs • Distance between two probes of length p = 2p - 2|Longest Common Subsequence| (two embeddings can share at most |LCS| deposition steps) • Non-tight bound: example with LB = 8 and best placement cost = 10 [Figure: probes AC, GA, CT, TG, nucleotide deposition sequence S = ACTGA, and the optimum placement]
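A sketch of the probe-to-probe distance used in this bound. The reasoning assumed here: each embedded probe of length p occupies p deposition steps, and two embeddings can coincide in at most |LCS| steps, so they differ in at least 2p - 2|LCS| steps. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b
    (standard dynamic program, one row kept at a time)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def async_distance_bound(a, b):
    """Lower bound on the Hamming distance between any embeddings of
    two probes of equal length p: 2p - 2*|LCS(a, b)|."""
    if len(a) != len(b):
        raise ValueError("probes must have equal length")
    return 2 * len(a) - 2 * lcs_length(a, b)


for pair in [("AC", "GA"), ("AC", "CT"), ("AC", "TG")]:
    print(pair, async_distance_bound(*pair))
```

For the probes on this slide, every pair that shares a letter gets a bound of 2, so the four edges of the 2x2 array give LB = 8. The bound is not tight because, with S = ACTGA, AC can only use steps 1-2 and GA only steps 4-5, so that pair actually differs in 4 steps.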

Optimal Probe Alignment • Find best alignment of probe w.r.t. embedded neighbors • Dynamic Programming: • Source-sink paths correspond to feasible embeddings • Runtime O[(probe length) x (deposition sequence length)] • Can be extended to simultaneous alignment of two adjacent probes (2x1), with runtime increased by a factor of O(probe length) [Figure: DP graph from source to sink for a probe and a deposition sequence]
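One way to realize the single-probe alignment DP is sketched below. The representation is an assumption: each neighbor's embedding is given as a set of 0-based deposition steps, and the cost of a step is the number of neighbors whose masked/exposed state differs from the probe's at that step. The function name is hypothetical.

```python
def optimal_embedding(probe, deposition, neighbor_embeddings):
    """Return (min_conflicts, steps) for embedding `probe` into
    `deposition` against neighbors with fixed embeddings."""
    m, n, k = len(probe), len(deposition), len(neighbor_embeddings)
    inf = float("inf")
    # on[j] = number of neighbors that receive a nucleotide at step j.
    on = [sum(j in e for e in neighbor_embeddings) for j in range(n)]
    # cost[j][i] = min conflicts over the first j steps with the
    # first i nucleotides of the probe already synthesized.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for j in range(n):
        for i in range(m + 1):
            if cost[j][i] == inf:
                continue
            # Mask the site at step j: conflicts with exposed neighbors.
            cost[j + 1][i] = min(cost[j + 1][i], cost[j][i] + on[j])
            # Expose the site at step j: conflicts with masked neighbors.
            if i < m and deposition[j] == probe[i]:
                cost[j + 1][i + 1] = min(cost[j + 1][i + 1],
                                         cost[j][i] + (k - on[j]))
    if cost[n][m] == inf:
        raise ValueError("probe cannot be embedded in the deposition sequence")
    # Trace one optimal source-sink path backwards.
    steps, i = [], m
    for j in range(n, 0, -1):
        if (i > 0 and deposition[j - 1] == probe[i - 1]
                and cost[j][i] == cost[j - 1][i - 1] + (k - on[j - 1])):
            steps.append(j - 1)
            i -= 1
    steps.reverse()
    return cost[n][m], steps


# A neighbor holds AC embedded in the second period of ACGACG.
print(optimal_embedding("AC", "ACGACG", [{3, 4}]))
```

The table has (deposition length + 1) x (probe length + 1) entries with constant work per entry once `on` is known, which matches the stated runtime. In the example the probe is shifted to the second period so that it coincides with its neighbor.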

3-D Placement Flows • Simultaneous placement and alignment • Asynchronous epitaxial (slow and low quality) • Synchronous placement followed by in-place probe alignment (analogous to the standard partitioning of VLSI flows) • Uses the previous DP to do in-place probe alignment • Synchronous placement followed by probe alignment with reshuffle (analogous to feedback loops in VLSI flows) • Asynchronous sliding window matching

Algorithms for In-Place Probe Alignment • Asynchronous re-embedding after 2-dim placement • Greedy Algorithm • While there exist probes to re-embed with gain • Optimally re-embed the probe with the largest gain • Batched greedy: speed-up by avoiding recalculations • Chessboard Algorithm • While there is gain • Re-embed probes in green sites • Re-embed probes in red sites

Comparison of In-Place Probe Alignments • Post-placement LB = sum of distances to adjacent probes • Distance between two probes of length p = 2p - 2|LCS| • Useful for assessing the quality of algorithms that change probe embeddings but do not change probe placement

3-D vs. 2-D Placement Results

3-D Placement Algorithm Comparison: Border Conflict

3-D Placement Algorithm Comparison: Runtime

Practical Extensions • Distance-dependent border conflict weights • Take into account conflicts between 2- and 3-hop neighbors rather than only immediate neighbors • Position-dependent border conflict weights • In the alignment DP for two sequences, take into account the importance of conflicts in the middle of probes: the alignment cost has weights on conflicts that depend on the conflict position • Polymorphic probes • Chip contains SNPs, e.g., pairs of probes that differ in a single position; they should be placed together, and the alignment DP should align them simultaneously

Alignment DP for 2-SNPs • Optimal embedding of A{C,T}T

Enhanced DNA Array Design Flow • Probe Selection • Design Rules & Parameters • Probe Pools • Deposition Mask Design • Test/Control Structure Design • Conflict Map • Mask Design: Placement & Embedding

Summary • Contributions: • Epitaxial placement reduces border length by an extra 10% over the previously best known method • Asynchronous placement problem formulation • Post-placement improvement by an extra 15.5-21.8% • Lower bounds • Scalable placements (1000x1000 in 20 min) • Ongoing work: • Comparison on industrial benchmarks • Experiments with algorithms for extended formulations (SNPs, distance-dependent weights, etc.) • Future directions: • Design flow enhancements • Nucleotide deposition sequence design • Partitioning and integration for manufacturing cost reduction

Thank you!
