Minimum PCR Primer Set Selection with Amplification Length and Uniqueness Constraints

Ion Mandoiu
University of Connecticut, CS&E Department
2004 GSU Biotech Symposium

Combinatorial Optimization Applications in Bioinformatics
• Fast-growing number of applications
  • Dynamic Programming & Integer Programming in sequence alignment
  • TSP and Euler paths in DNA sequencing
  • Integer Programming in Haplotype inference
  • Integer Programming & approximation algorithms for efficient pathogen identification (string barcoding)
  • …

High-Throughput Assay Design
• New source of combinatorial problems
  • Microarray probe selection
  • Mask design for Affy arrays
  • Universal tag arrays
  • Self-assembling microarrays
  • Quality control
  • …
• This talk: Multiplex PCR primer set selection
• Optimization goals
  • Improved speed
  • High reliability
  • Reduced COST

Uniplex PCR …

Primer Pair Selection Problem
[Figure: forward and reverse primers on opposite strands (5'→3' and 3'→5') flanking the amplification locus, with a span labeled L on each side]
• Given:
  • Genomic sequence around amplification locus
  • Primer length k
  • Amplification length upper bound L
• Find: forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperatures, secondary structure, cross hybridization, etc.)

Motivation for Primer Set Selection (1)
• Spotted microarray synthesis [Fernandes and Skiena’02]
  • Need unique pair for each amplification product, but primers can be reused to minimize cost
  • Potential to reduce #primers from $O(n)$ to $O(n^{1/2})$ for n products

Motivation for Primer Set Selection (2)
• SNP Genotyping
  • Thousands of SNPs that must be genotyped using hybridization-based methods (e.g., SBE)
  • Selective PCR amplification needed to improve accuracy of detection steps (whole-genome amplification not appropriate)
  • No need for unique amplification!
  • Primer minimization is critical
    • Fewer primers to buy
    • Fewer multiplex PCR reactions

Primer Set Selection Problem
• Given:
  • Genomic sequences around each amplification locus
  • Primer length k
  • Amplification length upper bound L
• Find:
  • Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other
  • For some applications: S should contain a unique pair of primers amplifying each locus

Previous Work (1)
• [Pearson et al.’96][Linhart&Shamir’02][Souvenir et al.’03]
  • Separately select forward and reverse primers
  • To enforce bound of L on amplification length, select only primers that are within a distance of L/2 of the target SNP
• Ignores half of the feasible primer pairs
  • Solution can increase by a factor of $O(n)$ by ignoring them!
• Greedy set cover algorithm gives $O(\ln n)$ approximation factor for this formulation
  • Cannot approximate better unless P=NP
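
The greedy set cover step used in this line of earlier work can be sketched as follows. This is a generic textbook greedy, not the cited authors' code: the function name `greedy_set_cover` and the input layout (a dict from each candidate primer to the set of target sequences it hybridizes to, already restricted to primers within L/2 of the SNP) are assumptions made for illustration.

```python
def greedy_set_cover(universe, candidates):
    """Pick primers until every target sequence in `universe` is covered.

    candidates: dict mapping a primer (hypothetical label) to the set of
    target sequences it covers. Ties are broken by dict iteration order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Primer covering the largest number of still-uncovered sequences.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("some sequences cannot be covered")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen
```

In this formulation the routine would be run once for the forward sequences and once for the reverse sequences, which is exactly the separation the slide criticizes.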

Previous Work (2)
• [Fernandes&Skiena’02] model primer selection as a minimum multicolored subgraph problem:
  • Vertices of the graph correspond to candidate primers
  • There is an edge colored by color i between primers u and v if they hybridize to the i-th forward and reverse sequences within a distance of L
  • Goal is to find a minimum size set of vertices inducing edges of all colors
• No non-trivial approximation factor known previously

Selection w/o Uniqueness Constraints
• Can be seen as a “simultaneous set covering” problem:
  • The ground set is partitioned into n disjoint sets, each with 2L elements
  • The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition
• Naïve modifications of the greedy set cover algorithm do not work
• Key idea: use a potential function $\Phi$ as a measure of infeasibility
  • For a partial solution P, $\Phi(P)$ = minimum number of elements that still need to be covered
  • Initially, $\Phi = nL$
  • For feasible solutions, $\Phi = 0$
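
One way to write the potential explicitly, consistent with the boundary values on the slide, is given below. The notation $c_i(P)$, the number of elements of the i-th partition covered by the primers in P, is introduced here for illustration and does not appear in the slides.

```latex
\[
  \Phi(P) \;=\; \sum_{i=1}^{n} \max\{\,0,\; L - c_i(P)\,\},
  \qquad
  \Phi(\emptyset) = nL,
  \qquad
  \Phi(P) = 0 \iff c_i(P) \ge L \ \text{for all } i .
\]
```

Each partition has 2L elements and needs at least L of them covered, so a partition stops contributing to the sum once it reaches L covered elements.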

Potential-Function Driven Greedy
• Select a primer that decreases the potential function $\Phi$ by the largest amount (breaking ties arbitrarily)
• Repeat until feasibility is achieved
• Lemma: Each greedy selection reduces $\Phi$ by a factor of at least (1-1/OPT)
• Theorem: The number of primers selected by the greedy algorithm is at most a factor of ln(nL) larger than the optimum
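
A minimal sketch of the greedy loop on an abstract "simultaneous set covering" instance is shown below. It assumes the potential is the sum over partitions of max(0, L - covered elements); the function name `greedy_potential` and the input layout (each primer maps partition indices to the elements it covers there) are hypothetical and are not taken from the authors' C/C++ implementation.

```python
def greedy_potential(primers, n, L):
    """Potential-function driven greedy (illustrative sketch).

    primers: dict mapping a primer label to {partition index: set of elements}.
    A partition is satisfied once at least L of its elements are covered.
    """
    covered = [set() for _ in range(n)]
    chosen = []

    def phi_after(extra):
        # Potential if the elements in `extra` were added to the cover.
        return sum(max(0, L - len(covered[i] | extra.get(i, set())))
                   for i in range(n))

    phi = phi_after({})
    while phi > 0:
        # Primer that leaves the smallest potential; ties go to the first seen.
        best = min((p for p in primers if p not in chosen),
                   key=lambda p: phi_after(primers[p]), default=None)
        if best is None or phi_after(primers[best]) >= phi:
            raise ValueError("instance is infeasible")
        chosen.append(best)
        for i, elems in primers[best].items():
            covered[i] |= elems
        phi = phi_after({})
    return chosen
```

The bound in the theorem follows from the lemma: the potential starts at nL and shrinks by a factor of (1-1/OPT) per pick, so after t picks it is below nL·e^(-t/OPT), which drops under 1 (and hence to 0, since the potential is an integer) once t reaches OPT·ln(nL).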

Selection w/ Uniqueness Constraints
• Can be modeled as a minimum multicolored subgraph problem: add an edge colored by color i between two primers if they amplify the i-th SNP and do not amplify any other SNP
• Trivial approximation algorithm: select 2 primers for each SNP
  • $O(n^{1/2})$ approximation since at least $n^{1/2}$ primers are required by every solution
• Non-trivial approximation?
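
The counting argument behind the trivial bound can be written out as follows, where p denotes the size of an arbitrary feasible primer set (a symbol introduced here for illustration). Each of the n SNPs needs its own distinct pair, and p primers form at most $\binom{p}{2}$ pairs.

```latex
\[
  n \;\le\; \binom{p}{2} \;\le\; \frac{p^{2}}{2}
  \;\Longrightarrow\;
  p \;\ge\; \sqrt{2n} \;\ge\; n^{1/2},
  \qquad
  \frac{2n}{\mathrm{OPT}} \;\le\; \frac{2n}{n^{1/2}} \;=\; 2\,n^{1/2} \;=\; O(n^{1/2}).
\]
```

The second expression compares the at most 2n primers chosen by the trivial algorithm with the lower bound on the optimum.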

Integer Program Formulation
• Variable $x_u$ for every vertex (candidate primer) u
  • $x_u$ set to 1 if u is selected, and to 0 otherwise
• Variable $y_e$ for every edge e
  • $y_e$ set to 1 if the corresponding primer pair is selected to amplify one of the SNPs
• Objective: minimize $\sum_u x_u$
• Constraints:
  • for each i, $\sum_{e \,:\, e \text{ amplifies SNP } i} y_e \ge 1$
  • $y_e \le x_u$ for every e incident to u

LP-Rounding Algorithm
• Solve linear programming relaxation
• Select node u with probability $x_u$
• Theorem: With probability of at least 1/3, the number of selected nodes is within a factor of $O(m^{1/2} \ln n)$ of the optimum, where m is the maximum number of edges sharing the same color
• For primer selection, $m \le L^2$, so the approximation factor is $O(L \ln n)$
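
The rounding step alone can be sketched as below, assuming a fractional solution x has already been obtained from some LP solver (solving the relaxation is not shown). The names `round_lp` and `colors_covered`, and the edge representation as (u, v, color) triples, are hypothetical. A single draw need not cover every color; how such draws are handled is not specified on the slide and is left to the caller.

```python
import random


def round_lp(x, rng):
    """Select each node u independently with probability x[u]."""
    return {u for u, xu in x.items() if rng.random() < xu}


def colors_covered(selected, edges):
    """Colors (SNPs) that have an edge with both endpoints selected.

    edges: iterable of (u, v, color) triples (assumed representation).
    """
    return {color for (u, v, color) in edges
            if u in selected and v in selected}
```

A caller might draw with `round_lp(x, random.Random(0))` and compare `colors_covered(...)` against the full set of colors before accepting the selection.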

Experimental Setting
• SNP sets extracted from NCBI databases + randomly generated
• C/C++ code run on a 2.8GHz Dell PowerEdge running Linux
• Compared algorithms
  • G-FIX: greedy primer cover algorithm of Pearson et al.
    • Primers restricted to be within L/2 of amplified SNPs
  • G-VAR: naïve modification of G-FIX
    • For each SNP, first selected primer can be up to L bases away from SNP
    • If first selected primer is $L_1$ bases away from the SNP, opposite sequence is truncated to a length of $L - L_1$
  • G-POT: potential function driven greedy algorithm
  • MIPS-PT: iterative beam-search heuristic of Souvenir et al. (WABI’03)

Experimental Results, NCBI tests

Experimental Results, k=8

Runtime, k=10

Conclusions
• New combinatorial optimization problems arising in the area of high-throughput assay design
• Theoretical insights (such as approximation results) give algorithms with significant practical improvements
• Choosing the proper problem model is critical to solution efficiency

Ongoing Work & Open Problems
• Allow degenerate primers
• Incorporate more biochemical constraints into the model (melting temperature, secondary structure, cross hybridization, etc.)
• Close gap between $O(\ln n)$ inapproximability bound and $O(L \ln n)$ approximation factor for minimum multicolored subgraph problem
• Approximation algorithms for partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)

Acknowledgments
• Kishori Konwar
• Alex Russell
• Alex Shvartsman
• Financial support from UCONN Research Foundation
