Improved Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints. Kishori M. Konwar, Ion I. Mandoiu, Alexander C. Russell, Alexander A. Shvartsman. CS&E Dept., Univ. of Connecticut.
Combinatorial Optimization in Bioinformatics • Fast growing number of applications • Sequence alignment • DNA sequencing • Haplotype inference • Pathogen identification • … • High-throughput assay design • Microarray probe selection • Microarray quality control • Universal tag arrays • … • This talk: Multiplex PCR primer set selection
Outline • Background and problem formulation • “Potential function” greedy algorithm • Approximation guarantee • Experimental results • Conclusions
The Polymerase Chain Reaction [Figure: target sequence, polymerase, and primers (Primer 1, Primer 2); the cycle is repeated 20-30 times]
Primer Pair Selection Problem [Figure: forward and reverse primers on opposite strands (5' to 3'), at most L apart around the amplification locus] • Given: • Genomic sequence around amplification locus • Primer length k • Amplification upper bound L • Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)
PCR for SNP Genotyping • Thousands of SNPs to be genotyped using hybridization methods (e.g., SBE) • Selective PCR amplification needed to improve accuracy of detection steps • Whole-genome amplification not appropriate • Simultaneous amplification OK → Multiplex PCR
Multiplex PCR • How it works • Multiple DNA fragments amplified simultaneously • Each amplified fragment still defined by two primers • A primer may participate in amplification of multiple targets • Primer set selection • Currently done by time-consuming trial and error • An important objective is to minimize number of primers • Reduced assay cost • Higher effective concentration of primers → higher amplification efficiency • Reduced unintended amplification
Primer Set Selection Problem • Given: • Genomic sequences around n amplification loci • Primer length k • Amplification upper bound L • Find: • Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
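The feasibility condition in the problem statement can be sketched as a small check. This is only an illustration under a simplified model that is not in the slides: each locus is described by two hypothetical dictionaries mapping a primer to its distance from the locus on the forward and on the reverse side, and a pair is accepted when the two distances sum to at most L (primer length k is ignored here).

```python
def covers_all_loci(primers, forward_hits, reverse_hits, L):
    """Check that every locus has a forward and a reverse primer from
    `primers` whose distances to the locus sum to at most L.

    forward_hits[i] / reverse_hits[i] are hypothetical inputs: dicts mapping
    a primer to its distance from locus i on the forward / reverse side.
    """
    for fwd, rev in zip(forward_hits, reverse_hits):
        amplified = any(
            fwd[f] + rev[r] <= L
            for f in primers if f in fwd
            for r in primers if r in rev
        )
        if not amplified:
            return False
    return True


# Toy instance with two loci; primer "a" is usable on both sides of locus 1.
L = 10
forward_hits = [{"a": 3, "b": 8}, {"a": 6}]
reverse_hits = [{"c": 4}, {"c": 5, "a": 2}]

ok_ac = covers_all_loci({"a", "c"}, forward_hits, reverse_hits, L)
ok_bc = covers_all_loci({"b", "c"}, forward_hits, reverse_hits, L)
```

In the toy instance the set {a, c} is feasible because primer a serves as both the forward and the reverse primer at the second locus, which is the kind of primer reuse the minimization objective rewards.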
Previous Work on Primer Selection • Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc. • Almost all problem formulations decouple selection of forward and reverse primers • To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target • In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum • [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation
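To see why decoupling can be restrictive, the sketch below compares the primer pairs allowed by the coupled constraint (distances sum to at most L) with those allowed when each primer must lie within L/2 of the target. The distance dictionaries and function names are assumptions made for this illustration only.

```python
def coupled_pairs(fwd, rev, L):
    """Pairs (f, r) whose distances to the locus sum to at most L."""
    return {(f, r) for f, a in fwd.items() for r, b in rev.items() if a + b <= L}


def decoupled_pairs(fwd, rev, L):
    """Pairs (f, r) where each primer is individually within L/2 of the locus."""
    half = L / 2
    return {
        (f, r)
        for f, a in fwd.items() if a <= half
        for r, b in rev.items() if b <= half
    }


# Hypothetical distances (in bases) of candidate primers from one locus.
fwd = {"a": 1, "b": 8}
rev = {"c": 2, "d": 7}
L = 10

coupled = coupled_pairs(fwd, rev, L)
decoupled = decoupled_pairs(fwd, rev, L)
```

Every decoupled pair is also a coupled pair, but not the other way around: here (a, d) and (b, c) satisfy the length bound yet are discarded by the L/2 rule, which is how the decoupled formulation can end up needing more primers.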
Previous Work (2) • [Fernandes&Skiena’02] study primer set selection with uniqueness constraints • Minimum Multi-Colored Subgraph Problem: • Vertices correspond to candidate primers • Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus • Goal is to find minimum size set of vertices inducing edges of all colors
The Set Cover Problem • Given: • Universal set U with n elements • Family of sets (S_x, x ∈ X) covering all elements of U • Find: • Minimum size subset X’ of X s.t. (S_x, x ∈ X’) covers all elements of U
Selection w/ Length Constraints • “Simultaneous set covering” problem: • Ground set partitioned into n disjoint sets S_i (one for each target), each with 2L elements • Goal is to select minimum number of sets (i.e., primers) covering at least 1/2 of the elements in each partition [Figure: SNP i flanked by L bases on each side]
Greedy Setcover Algorithm • Greedy Algorithm: Repeatedly pick the set with most uncovered elements • Classical result (Johnson’74, Lovász’75, Chvátal’79): the greedy setcover algorithm has an approximation factor of H(n) = 1 + 1/2 + 1/3 + … + 1/n < 1 + ln(n) • The approximation factor is tight • Cannot be approximated within a factor of (1−ε) ln(n) unless NP ⊆ DTIME(n^O(log log n))
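A minimal sketch of the greedy rule stated above, assuming the family of sets is given as a dictionary from a set name to a Python set (the names and the toy instance are illustrative, not from the talk). The helper `harmonic` evaluates H(n), the approximation factor quoted on the slide.

```python
def greedy_set_cover(universe, sets):
    """Return the names of the sets picked by the greedy rule, in order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements
        # (ties are broken by dictionary order).
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


toy_universe = range(1, 7)
toy_sets = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}, "D": {5, 6}}
cover = greedy_set_cover(toy_universe, toy_sets)
```

On this toy instance greedy first takes A (four new elements) and then D (the remaining two), so it happens to match the optimum; in general it is only guaranteed to be within a factor H(n) of it.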
Potential Functions • Set cover • Φ = #uncovered elements • Initially, Φ = n • For feasible solutions, Φ = 0 • Primer selection with length constraints • Φ = minimum number of elements that must still be covered = Σ_i max{0, L − #covered elements in S_i} • Initially, Φ = nL • For feasible solutions, Φ = 0
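The primer-selection potential can be sketched as follows. The sketch counts covered elements per target, so that the potential starts at nL and reaches 0 once at least L of the 2L elements of every S_i are covered, as the slide requires. The `coverage` dictionary, which maps a primer to the (target, element) pairs it covers, is a hypothetical input format chosen for this example.

```python
def potential(selected, coverage, n_targets, L):
    """Phi = sum over targets of max(0, L - #covered elements in S_i)."""
    covered = [set() for _ in range(n_targets)]
    for primer in selected:
        for target, element in coverage[primer]:
            covered[target].add(element)
    return sum(max(0, L - len(c)) for c in covered)


# Toy instance: n = 2 targets, L = 2 (each S_i would have 2L = 4 elements).
coverage = {
    "p": {(0, 0), (0, 1), (1, 0)},
    "q": {(1, 1), (1, 2)},
}
phi_empty = potential([], coverage, n_targets=2, L=2)
phi_p = potential(["p"], coverage, n_targets=2, L=2)
phi_pq = potential(["p", "q"], coverage, n_targets=2, L=2)
```

Covering more than L elements of a target does not lower the potential further, which is what the max{0, ·} term expresses.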
General setting • Potential function Φ(X’) ≥ 0 • Φ({}) = Φmax • Φ(X’) = 0 for all feasible solutions • X’’ ⊆ X’ ⇒ Φ(X’’) ≥ Φ(X’) • If Φ(X’) > 0, then there exists x s.t. Φ(X’+x) < Φ(X’) • X’’ ⊆ X’ ⇒ ∆(x,X’’) ≥ ∆(x,X’) for every x, where ∆(x,X’) := Φ(X’) − Φ(X’+x) • Objective: find minimum size set X’ with Φ(X’) = 0
Generic Greedy Algorithm • X’ ← {} • While Φ(X’) > 0: • Find x with maximum ∆(x,X’) • X’ ← X’ + x • Theorem: The generic greedy algorithm has an approximation factor of 1 + ln ∆max • Corollary: 1 + ln(nL) approximation for PCR primer selection
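The generic greedy loop can be sketched for any potential function passed in as a callable. This is an illustrative sketch, not the authors' implementation: the helper `make_primer_potential`, the `coverage` format (primer mapped to the (target, element) pairs it covers), and the toy data are all assumptions, and the potential is recomputed from scratch at every step for clarity rather than speed.

```python
import math


def make_primer_potential(coverage, n_targets, L):
    """Build phi(selected) = sum_i max(0, L - #covered elements in S_i)."""
    def phi(selected):
        covered = [set() for _ in range(n_targets)]
        for primer in selected:
            for target, element in coverage[primer]:
                covered[target].add(element)
        return sum(max(0, L - len(c)) for c in covered)
    return phi


def generic_greedy(candidates, phi):
    """Repeatedly add the candidate with the largest potential decrease."""
    chosen = []
    current = phi(chosen)
    while current > 0:
        best, best_gain = None, 0
        for x in candidates:
            gain = current - phi(chosen + [x])  # Delta(x, chosen)
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:
            raise ValueError("no candidate decreases the potential")
        chosen.append(best)
        current -= best_gain
    return chosen


# Toy instance: n = 2 targets, L = 2.
n, L = 2, 2
coverage = {
    "p": {(0, 0), (0, 1), (1, 0)},
    "q": {(1, 1), (1, 2)},
    "r": {(0, 0), (1, 0)},
}
phi = make_primer_potential(coverage, n_targets=n, L=L)
picked = generic_greedy(["p", "q", "r"], phi)
bound = 1 + math.log(n * L)  # factor from the corollary, for reference
```

Here the first pick is p (it lowers the potential from 4 to 1) and the second is q, after which the potential is 0 and the loop stops.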
Proof Sketch (1) • Let x_1, x_2, …, x_g be the elements selected by greedy, in the order in which they are chosen • Let x*_1, x*_2, …, x*_k be the elements of an optimum solution • Charging scheme: x_i charges to x*_j a cost of [expression missing in source], where δ_ij = ∆(x_i, {x_1,…,x_{i−1}} ∪ {x*_1,…,x*_j}) • Fact 1: Each x*_j gets charged a total cost of at most 1 + ln ∆max
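Proof Sketch (2) • Fact 2: Each x_i charges at least 1 unit of cost • Combining with Fact 1: g ≤ total cost charged ≤ k(1 + ln ∆max), which gives the approximation factor stated in the theorem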
Experimental Setting • Datasets extracted from NCBI databases, L=1000 • Dell PowerEdge 2.8GHz Xeon • Compared algorithms • G-FIX: greedy primer cover algorithm [Pearson et al.] • MIPS-PT: iterative beam-search heuristic [Souvenir et al.] • Restrict primers to L/2 bases around amplification locus • G-VAR: naïve modification of G-FIX • First selected primer can be up to L bases away • Opposite sequence truncated after selecting first primer • G-POT: potential function driven greedy algorithm
Conclusions • Numerous combinatorial optimization problems arising in the area of high-throughput assay design • Theoretical insights such as approximation results can lead to significant practical improvements • Choosing the proper problem model is critical to solution efficiency
Ongoing Work & Open Problems • Degenerate primers • Accurate hybridization model (melting temperature, secondary structure, cross hybridization, …) • In-silico MP-PCR simulator • Partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)
Acknowledgments • Financial support from UCONN’s Research Foundation