Improved Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints

Kishori M. Konwar, Ion I. Mandoiu, Alexander C. Russell, Alexander A. Shvartsman. CS&E Dept., Univ. of Connecticut

Combinatorial Optimization in Bioinformatics • Fast growing number of applications • Sequence alignment • DNA sequencing • Haplotype inference • Pathogen identification • … • High-throughput assay design • Microarray probe selection • Microarray quality control • Universal tag arrays • … • This talk: Multiplex PCR primer set selection

Outline • Background and problem formulation • “Potential function” greedy algorithm • Approximation guarantee • Experimental results • Conclusions

The Polymerase Chain Reaction • [Figure: target sequence, polymerase, and primers (Primer 1, Primer 2); repeat 20-30 cycles]

Primer Pair Selection Problem • [Figure: forward and reverse primers flanking the amplification locus, at most L bases apart] • Given: • Genomic sequence around amplification locus • Primer length k • Amplification upper bound L • Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)

PCR for SNP Genotyping • Thousands of SNPs to be genotyped using hybridization methods (e.g., SBE) • Selective PCR amplification needed to improve accuracy of detection steps • Whole-genome amplification not appropriate • Simultaneous amplification OK → Multiplex PCR

Multiplex PCR • How it works • Multiple DNA fragments amplified simultaneously • Each amplified fragment still defined by two primers • A primer may participate in amplification of multiple targets • Primer set selection • Currently done by time-consuming trial and error • An important objective is to minimize number of primers • Reduced assay cost • Higher effective concentration of primers → higher amplification efficiency • Reduced unintended amplification

Primer Set Selection Problem • Given: • Genomic sequences around n amplification loci • Primer length k • Amplification upper bound L • Find: • Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
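The feasibility condition in the problem statement can be sketched as a small check. The input model is an assumption made for illustration and is not taken from the slides: each locus is a hypothetical pair of maps from a primer name to the distances (in bases) at which it hybridizes on the forward and on the reverse side, and the amplification length is approximated by the sum of the two distances, ignoring the primer length k.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical input model (not from the slides): for each locus, a pair of
# maps primer -> distances (in bases) from the locus at which it hybridizes,
# one map for the forward side and one for the reverse side.
Locus = Tuple[Dict[str, List[int]], Dict[str, List[int]]]


def is_feasible(primers: Set[str], loci: Iterable[Locus], L: int) -> bool:
    """Return True if every locus has a forward and a reverse primer in
    `primers` whose distances from the locus sum to at most L."""
    for forward, reverse in loci:
        # The closest selected primer on each side gives the shortest amplicon.
        best_f = min((d for p in primers for d in forward.get(p, ())), default=None)
        best_r = min((d for p in primers for d in reverse.get(p, ())), default=None)
        if best_f is None or best_r is None or best_f + best_r > L:
            return False
    return True


# Toy instance with two loci; primer "A" hybridizes near both of them.
loci = [
    ({"A": [300], "B": [700]}, {"C": [600], "A": [900]}),
    ({"A": [100]}, {"C": [950], "D": [200]}),
]
print(is_feasible({"A", "C"}, loci, L=1000))       # second locus needs 100 + 950 > 1000
print(is_feasible({"A", "C", "D"}, loci, L=1000))  # both loci are covered
```

In the first locus the pair (A, C) is accepted with distances 300 and 600, even though 600 exceeds L/2; only the sum is constrained.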

Previous Work on Primer Selection • Well-studied problem: [Pearson et al.’96], [Linhart & Shamir’02], [Souvenir et al.’03], etc. • Almost all problem formulations decouple selection of forward and reverse primers • To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target • In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum • [Pearson et al.’96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation

Previous Work (2) • [Fernandes & Skiena’02] study primer set selection with uniqueness constraints • Minimum Multi-Colored Subgraph Problem: • Vertices correspond to candidate primers • Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus • Goal is to find minimum size set of vertices inducing edges of all colors
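A minimal sketch of the multi-colored subgraph formulation, for illustration only. The edge-list representation and the function names are assumptions, and the exhaustive search is exponential, so it is only meant to make the objective concrete on tiny instances; it is not the method of any of the cited papers.

```python
from itertools import combinations
from typing import Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str, int]  # (primer u, primer v, color = locus index)


def induces_all_colors(vertices: Iterable[str], edges: List[Edge], num_colors: int) -> bool:
    """True if the subgraph induced by `vertices` has an edge of every color."""
    chosen = set(vertices)
    seen = {c for (u, v, c) in edges if u in chosen and v in chosen}
    return seen == set(range(num_colors))


def min_multicolored_subgraph(edges: List[Edge], num_colors: int) -> Optional[Set[str]]:
    """Brute force: try vertex subsets by increasing size (tiny inputs only)."""
    all_vertices = sorted({u for (u, _, _) in edges} | {v for (_, v, _) in edges})
    for size in range(len(all_vertices) + 1):
        for subset in combinations(all_vertices, size):
            if induces_all_colors(subset, edges, num_colors):
                return set(subset)
    return None  # some color has no edge at all


# Two loci (colors 0 and 1); primer "a" can serve both.
edges = [("a", "b", 0), ("c", "d", 1), ("a", "d", 1)]
print(min_multicolored_subgraph(edges, num_colors=2))
```

On this toy graph the smallest vertex set has three primers (a, b, d) rather than the four that two disjoint pairs would need.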

The Set Cover Problem • Given: • Universal set U with n elements • Family of sets (S_x, x ∈ X) covering all elements of U • Find: • Minimum size subset X’ of X s.t. (S_x, x ∈ X’) covers all elements of U

Selection w/ Length Constraints • “Simultaneous set covering” problem: • Ground set partitioned into n disjoint sets S_i (one for each target), each with 2L elements • Goal is to select minimum number of sets (= primers) covering at least 1/2 of the elements in each partition • [Figure: SNP_i flanked by L bases on each side]

Greedy Setcover Algorithm • Greedy Algorithm: • Repeatedly pick the set with most uncovered elements • Classical result (Johnson’74, Lovász’75, Chvátal’79): the greedy setcover algorithm has an approximation factor of H(n) = 1 + 1/2 + 1/3 + … + 1/n < 1 + ln(n) • The approximation factor is tight • Cannot be approximated within a factor of (1-ε) ln(n) unless NP ⊆ DTIME(n^{O(log log n)})
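The greedy rule and the harmonic bound can be sketched as follows. The function names and the toy instance are illustrative assumptions; ties are broken by dictionary order, which the slides do not specify.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[Hashable], family: Dict[str, Set[Hashable]]) -> List[str]:
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


def harmonic(n: int) -> float:
    """H(n) = 1 + 1/2 + ... + 1/n, the greedy approximation factor."""
    return sum(1.0 / i for i in range(1, n + 1))


universe = {1, 2, 3, 4, 5, 6}
family = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
print(greedy_set_cover(universe, family))  # picks "a" (4 new), then "d" (2 new)
print(harmonic(len(universe)))
```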

Potential Functions • Set cover • Φ = #uncovered elements • Initially, Φ = n • For feasible solutions, Φ = 0 • Primer selection with length constraints • Φ = minimum number of elements that must still be covered = Σ_i max{0, L - #covered elements in S_i} • Initially, Φ = nL • For feasible solutions, Φ = 0
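A small sketch of the primer-selection potential, written so that it starts at nL and reaches 0 on feasible solutions: each partition S_i contributes L minus the number of its elements already covered, floored at 0. Representing ground-set elements as (target, offset) pairs and the `covers` map are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Element = Tuple[int, int]  # (target index i, offset inside S_i); hypothetical encoding


def potential(selected: Iterable[str], covers: Dict[str, Set[Element]],
              num_targets: int, L: int) -> int:
    """Number of elements that must still be covered, summed over targets."""
    covered: Set[Element] = set().union(*(covers[p] for p in selected))
    total = 0
    for i in range(num_targets):
        covered_in_i = sum(1 for (target, _) in covered if target == i)
        total += max(0, L - covered_in_i)
    return total


# Two targets, L = 2, so each S_i has 2L = 4 elements and needs 2 covered.
covers = {"p": {(0, 0), (0, 1), (1, 0)}, "q": {(1, 1), (1, 2)}}
print(potential([], covers, num_targets=2, L=2))          # n * L
print(potential(["p"], covers, num_targets=2, L=2))       # target 1 still needs one
print(potential(["p", "q"], covers, num_targets=2, L=2))  # feasible
```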

General setting • Potential function Φ(X’) ≥ 0 • Φ({}) = Φ_max • Φ(X’) = 0 for all feasible solutions • X’’ ⊆ X’ ⇒ Φ(X’’) ≥ Φ(X’) • If Φ(X’) > 0, then there exists x s.t. Φ(X’+x) < Φ(X’) • X’’ ⊆ X’ ⇒ ∆(x,X’’) ≥ ∆(x,X’) for every x, where ∆(x,X’) := Φ(X’) - Φ(X’+x) • Objective: find minimum size set X’ with Φ(X’) = 0

Generic Greedy Algorithm • X’ ← {} • While Φ(X’) > 0: • Find x with maximum ∆(x,X’) • X’ ← X’ + x • Theorem: The generic greedy algorithm has an approximation factor of 1 + ln ∆_max • Corollary: 1 + ln(nL) approximation for PCR primer selection
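The loop above can be sketched for any potential function passed in as a callable. This is an illustrative sketch under assumptions, not the authors' implementation: the toy `phi` uses a hypothetical (target, offset) encoding of the ground set, ties go to the first candidate, and ∆_max is taken to be the largest gain available from the empty set.

```python
import math
from typing import Callable, FrozenSet, List, Sequence


def generic_greedy(candidates: Sequence[str],
                   phi: Callable[[FrozenSet[str]], int]) -> List[str]:
    """Add the element with maximum gain Delta(x, X') until phi(X') == 0."""
    selected: List[str] = []
    current = phi(frozenset())
    while current > 0:
        chosen = frozenset(selected)
        best, best_gain = None, 0
        for x in candidates:
            gain = current - phi(chosen | {x})  # Delta(x, X')
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:
            raise ValueError("no candidate decreases the potential")
        selected.append(best)
        current -= best_gain
    return selected


# Toy primer-selection potential: 2 targets, L = 2, elements are (target, offset).
L, NUM_TARGETS = 2, 2
covers = {"p": {(0, 0), (0, 1), (1, 0)}, "q": {(1, 1), (1, 2)},
          "r": {(0, 2)}, "s": {(1, 3)}}


def phi(selected: FrozenSet[str]) -> int:
    covered = set().union(*(covers[x] for x in selected))
    return sum(max(0, L - sum(1 for (t, _) in covered if t == i))
               for i in range(NUM_TARGETS))


solution = generic_greedy(list(covers), phi)
delta_max = max(phi(frozenset()) - phi(frozenset({x})) for x in covers)
print(solution, 1 + math.log(delta_max))  # greedy solution and the 1 + ln(Delta_max) factor
```

Here "p" has the largest initial gain (3 of the nL = 4 units), after which "q" removes the last unit.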

Proof Sketch (1) • Let x_1, x_2, …, x_g be the elements selected by greedy, in the order in which they are chosen • Let x*_1, x*_2, …, x*_k be the elements of an optimum solution • Charging scheme: x_i charges to x*_j a cost of [formula lost in extraction], where δ_ij = ∆(x_i, {x_1,…,x_{i-1}} ∪ {x*_1,…,x*_j}) • Fact 1: Each x*_j gets charged a total cost of at most 1 + ln ∆_max

Proof Sketch (2) • Fact 2: Each x_i charges at least 1 unit of cost • Combining Facts 1 and 2: g ≤ total cost charged ≤ k (1 + ln ∆_max)

Experimental Setting • Datasets extracted from NCBI databases, L=1000 • Dell PowerEdge 2.8GHz Xeon • Compared algorithms • G-FIX: greedy primer cover algorithm [Pearson et al.] • MIPS-PT: iterative beam-search heuristic [Souvenir et al.] • Restrict primers to L/2 bases around amplification locus • G-VAR: naïve modification of G-FIX • First selected primer can be up to L bases away • Opposite sequence truncated after selecting first primer • G-POT: potential function driven greedy algorithm

Experimental Results, NCBI tests

[Figure: #primers, as percentage of 2n, versus n (l=8)]

[Figure: CPU seconds versus n (l=10)]

Conclusions • Numerous combinatorial optimization problems arising in the area of high-throughput assay design • Theoretical insights such as approximation results can lead to significant practical improvements • Choosing the proper problem model is critical to solution efficiency

Ongoing Work & Open Problems • Degenerate primers • Accurate hybridization model (melting temperature, secondary structure, cross hybridization, …) • In-silico MP-PCR simulator • Partition into multiple multiplexed PCR reactions (Aumann et al., WABI’03)

Acknowledgments • Financial support from UCONN’s Research Foundation
