520 likes | 734 Views
Greedy Approximation Algorithms for Covering Problems in Computational Biology. Ion Mandoiu Computer Science & Engineering Department University of Connecticut. Why Approximation Algorithms?. Most practical optimization problems are NP-hard
E N D
Greedy Approximation Algorithms for Covering Problems in Computational Biology Ion Mandoiu Computer Science & Engineering Department University of Connecticut
Why Approximation Algorithms? • Most practical optimization problems are NP-hard • Approximation algorithms offer the next best thing to an efficient exact algorithm • Polynomial time • Solutions guaranteed to be “close” to optimum • -approximation algorithm: solution cost within a multiplicative factor of of optimum cost • Practical relevance: insights needed to establish approximation guarantee often lead to fast, highly effective practical implementations
Why Computational Biology? • Exploding multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, … • Source of a fast growing number of combinatorial optimization applications: • TSP and Euler paths in DNA sequencing • Dynamic Programming in sequence alignment • Integer Programming in Haplotype inference • … • This talk: two “covering” problems in computational biology (primer set selection and string barcoding)
Overview • Potential function greedy algorithm - The set cover problem and the greedy algorithm - Potential function generalization • Primer Set Selection for Multiplex PCR • The String Barcoding Problem • Conclusions
The Set Cover Problem • Given: • Universal set U with n elements • Family of sets (Sx, xX) covering all elements of U • Find: • Minimum size subset X’ of X s.t. (Sx, xX’) covers all elements of U • Greedy Algorithm: - Start with empty X’, and repeatedly add x such that Sx contains the most uncovered elements until U is covered
Approximation Guarantee • Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of H(n)=1+1/2+1/3+…+1/n < 1+ln(n) • The approximation factor is tight • Cannot be approximated within a factor of (1-)ln(n) unless NP=DTIME(nloglog(n))
General setting • “Potential function” (X’) 0 • ({}) = max • (X’) = 0 for all feasible solutions • X’’ X’ (X’’) (X’) • If (X’)>0, then there exists x s.t. (X’+x) < (X’) • X’’ X’ ∆(x,X’) ∆(x,X’) for every x, where ∆(x,X’) := (X’) - (X’+x) • Problem: find minimum size set X’ with (X’)=0
Generic Greedy Algorithm • X’ {} • While (X’) > 0 Find x with maximum ∆(x,X’) X’ X’ + x • Theorem (Konwar et al.’05) The generic greedy algorithm has an approximation factor of 1+ln ∆max
Proof Idea Let x1, x2,…,xg be the elements selected by greedy, in the order in which they are chosen, and x*1, x*2,…,x*k be the elements of an optimum solution. Charging scheme: xi charges to x*j a cost of where ij = ∆(xi,{x1,…, xi-1}{x*1,…,x*j}) Fact 1: Each x*j gets charged a total cost of at most 1+ln ∆max
Proof of claim 2 Fact 2: Each xi charges at least 1 unit of cost
Overview • Potential function greedy algorithm • Primer set selection for multiplex PCR • Motivation and problem formulation • Greedy applied to primer set selection • Experimental results • The String Barcoding Problem • Conclusions
DNA Structure • Four nucleotide types: A,C,T,G • Normally double stranded • A’s paired with T’s • C’s paired with G’s
Target Sequence Polymerase Primers Primer 1 Primer 2 Repeat 20-30 cycles The Polymerase Chain Reaction
5' 3' Reverse primer L Forward primer 3' 5' amplification locus Primer Pair Selection Problem • Given: • Genomic sequence around amplification locus • Primer length k • Amplification upperbound L • Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)
Multiplex PCR • Multiplex PCR (MP-PCR) • Multiple DNA fragments amplified simultaneously • Each amplified fragment still defined by two primers • A primer may participate in amplification of multiple targets • Primer set selection • Typically done by time-consuming trial and error • An important objective is to minimize number of primers • Reduced assay cost • Higher effective concentration of primers higher amplification efficiency • Reduced unintended amplification
Primer Set Selection Problem • Given: • Genomic sequences around n amplification loci • Primer length k • Amplification upper bound L • Find: • Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
Applications • Single Nucleotide Polymorphism (SNP) genotyping • Up to thousands of SNPs genotyped simultaneously • Selective PCR amplification required for improved accuracy • Spotted microarray synthesis [Fernandes&Skiena’02] • Primers can be used multiple times • For each target, need a pair of primers amplifying that target and only that target (amplification uniqueness constraint) • Can still reduce #primers from 2n to O(n1/2)
Previous Work on Primer Selection • Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc. • Almost all problem formulations decouple selection of forward and reverse primers • To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target • In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum • [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation
Previous Work (contd.) • [Fernandes&Skiena’02] study primer set selection with uniqueness constraints • Minimum Multi-Colored Subgraph Problem: • Vertices correspond to candidate primers • Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus • Goal is to find minimum size set of vertices inducing edges of all colors • Can capture length amplification constraints too
Integer Program Formulation • 0/1 variable xu for every vertex • 0/1 variable ye for every edge e
LP-Rounding Algorithm (1) Solve linear programming relaxation (2) Select node u with probability xu (3) Repeat step 2 O(ln(n)) times and return selected nodes • Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most O(m1/2lnn) times larger than the optimum, where m is the maximum color class size, and n is the number of nodes • For primer selection, m L2 approximation factor is O(Llnn) • Better approximation? • Unlikely for minimum multi-colored subgraph problem
Selection w/o Uniqueness Constraints • Can be seen as a “simultaneous set covering” problem: • - The ground set is partitioned into n disjoint sets Si (one for each target), each with 2Lelements • The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition SNPi L L
Greedy Algorithm • Potential function = minimum number of elements that must be covered = i max{0, L - #uncovered elements in Si} • Initially, = nL • For feasible solutions, = 0 • ∆() nL (much smaller in practice) • Theorem [Konwar et al.’05]:The number of primers selected by the greedy algorithm is at most 1+ln(nL) larger than the optimum
Experimental Setting • Datasets extracted from NCBI databases, L=1000 • Dell PowerEdge 2.8GHz Xeon • Compared algorithms • G-FIX: greedy primer cover algorithm [Pearson et al.] • MIPS-PT: iterative beam-search heuristic [Souvenir et al.] • Restrict primers to L/2 bases around amplification locus • G-VAR: naïve modification of G-FIX • First selected primer can be up to L bases away • Opposite sequence truncated after selecting first primer • G-POT: potential function driven greedy algorithm
Overview • Potential function greedy algorithm • Primer Set Selection for Multiplex PCR • The String Barcoding Problem • - Problem Formulation • - Integer programming and greedy algorithms • - Experimental results • Conclusions
Motivation • Rapid pathogen detection • Given • Pathogen with unknown identity • Database of known pathogens • Problem • Identify unknown pathogen quickly • Ideal solution: determine DNA sequence of unknown pathogen
Real World • Not possible to quickly sequence an unknown pathogen • Only have sequence for pathogens in database • Can quickly test for presence of short substrings in unknown virus (substring tests) using hybridization • String barcoding [Borneman et al.’01, RashGusfield’02] • Use substring tests that uniquely identify each pathogen in the database
String Barcoding Problem Given: Genomic sequences g1,…, gn Find: Minimum number of distinguisher strings t1,…,tk Such that: For every gi gj, there exists a string tl which is substring of gi or gj, but not of both • At least log2n distinguishers needed • Fingerprints n distinguishers
Example • Given sequences: 1. cagtgc 2. cagttc 3. catgga • Feasible set of distinguishers: {tg, atgga} Row vectors: unique barcodes for each pathogen
Computational Complexity • [Berman et al.’04] Cannot be approximated within a factor of (1-)ln(n) unless NP=DTIME(nloglog(n))
Setcover Greedy Algorithm • Distinguisher selection as setcover problem • Elements to be covered are the pairs of sequences • Each candidate distinguisher defines a set of pairs that it separates • Another view: covering all edges of a complete graph with n vertices by the minimum number of given cuts • For n sequences, largest set can have O(n2) elements The setcover greedy guarantees ln(n2) = 2 ln n approximation
Integer Program Formulation • 0/1 variable for each candidate distinguisher • 1 candidate is selected • 0 candidate is not selected • For each pair of sequences, at least one candidate separating them is selected • Objective Function • Minimize #selected candidates
Practical Issues • Quadratic # of constraints, huge # of variables • Genome sizes range from thousands of bases for phage and viruses to millions for bacteria to billions for higher organisms • Many variables can be removed: • Candidates that appear in all sequences • Sufficient to keep a single candidate among those that appear in the same set of sequences • How to efficiently remove useless variables? • Rash&Gusfield use suffix trees
v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3} v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3} v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3} v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3} v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1} Suffix Tree Example • Strings: 1. cagtgc 2. cagttc 3. catgga
Integer Program Minimize V18 + V22 + V11 + V17 + V8 #objective function Such that V18 + V17 + V8 >= 1 #constraint to cover pair 1,2 V22 + V11 + V8 >= 1 #constraint to cover pair 1,3 V18 + V22 + V11 + V17 >= 1 #constraint to cover pair 2,3 Binaries #all variables are 0/1 V18 V22 V11 V17 V8 End
Limitations of Integer Program Method • Works only for small instances • 50-150 sequences • Average length ~1000 characters • Over 4 hours needed to come within 20% of optimum! • Scalable Heuristics?
Distinguisher 1 1 2 3 Distinguisher 2 n n-1 Distinguisher Induced Partition • Key idea [Berman et al. 04]:Keep track of the partition defined by distinguishers selected so far
Information Content Heuristic • = partition entropy = log2(#permutations compatible with current partition) • Initial partition entropy = log2(n!) n log2n • For feasible distinguisher sets, partition entropy = 0 • ∆() n : • log2(n!) - log2(k!(n-k)!) < log2(2n) = n • Information content heuristic (ICH) = greedy driven by partition entropy • Theorem [Berman et al.’04] ICH has an approximation factor of 1+ln(n)
ICH Limitations • Real genomic data has degenerate nucleotides • Ambiguous sequencing • Single nucleotide polymorphisms • For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization • Sure hybridization • Sure mismatch • Uncertain hybridization No partition to work with!
Practical Implementation • ICH and setcover greedy give nearly identical results on data w/o non-degenerate bases • Setcover greedy can also be extended to handle • degenerate bases in the sequences • redundancy requirements (each pair of sequences must be separated r times) • Two main steps for both algorithms: • Candidate generation • Greedy selection
Candidate Generation • Can be done using suffix trees • We use a simpler yet efficient incremental approach • Candidates that match all or only one sequence are removed from consideration • Solution quality is similar even when candidates are generated from a single sequence • Equivalent to considering only distinguisher sets that assign a barcode of (1,1,…,1) to the source sequence
Candidate Selection • Evaluate ∆() for all candidates and choose best • Speed-up techniques • Efficient gain computation using partition data-structure • Lazy gain update: if old ∆() is lower than best so far, do not recompute
Experimental Results mat mat part part # n lazy lazy dist 100 35.4 22.1 2.2 1.4 8.0 200 221.6 125.2 8.8 4.6 10.0 500 2168.8 1144.4 53.0 18.7 12.3 1000 5600.4 2756.4 113.6 31.7 14.1 • Averages over 10 testcases, sequence length = 10,000 • Barcodes for 100 sequences of length 1,000,000 computed in less than 10 minutes
Overview • Potential function greedy algorithm • Primer Set Selection for Multiplex PCR • The String Barcoding Problem • Conclusions
Conclusions • General potential function framework for designing and analyzing greedy covering algorithms • Improved approximation guarantees and practical performance for two important optimization problems in computational biology: primer set selection for multiplex PCR, and distinguisher selection for string barcoding