420 likes | 436 Views
Highly Scalable Algorithms for Robust String Barcoding. Bhaskar DasGupta * Kishori M. Konwar Ion Mandoiu Alex Shavartsman Computer Science & Engineering Department University of Connecticut Storrs, CT * Department of Computer Science University of Illinois at Chicago Chicago, IL.
E N D
Highly Scalable Algorithms for Robust String Barcoding Bhaskar DasGupta* Kishori M. Konwar Ion Mandoiu Alex Shavartsman Computer Science & Engineering Department University of Connecticut Storrs, CT *Department of Computer Science University of Illinois at Chicago Chicago, IL
Motivation • There are many critical situations when one needs to rapidly identify an unknown genomic sequence from among a given set of known sequences • Rapid identification of pathogens in epidemic outbreaks • Monitoring of microbial communities, e.g., in environmental studies • Fast database search
Possible Approaches • Sequencing based: sequence the unknown DNA sequence, then use similarity search programs such as BLAST to identify the unknown virus sequence for pathogens in database • Sequencing is prohibitively expensive and time consuming • Hybridization Based: identify the unknown sequence by testing for the presence of certain subsequences • Subsequence tests can be performed quickly and at low cost using a variety of hybridization based methods
Sequence Fingerprints • For each sequence, find a subsequence that appears in that sequence and only in that sequence • Sequence barcodes: 0/1 vectors • When using fingerprints, barcode length = #sequences
String Barcoding • (Borneman et al.’01, Rash & Gusfield’02): Unique occurrence of tested subsequences not needed, as long as 0/1 barcodes are unique • When using non-unique subsequences, barcode length can be much smaller than #sequences
Overview • Problem Formulation and Previous Work • Greedy Setcover Algorithm • Experimental Results • Conclusions
Problem Definition Given: Genomic sequences g1,…, gn Find: Minimum number of distinguisher strings t1,…,tk Such that: For every gi gj there exists a distinguisher tl which is a substring of gi or gj but not of both - At least log2n distinguishers needed - n distinguishers are always sufficient
Computational Complexity • [Berman et al.’04] Cannot be approximated within a factor of (1-)ln(n) unless NP=DTIME(nloglog(n))
Rash & Gusfield Integer Program • A non-redundant set of candidate distinguishers is generated using a suffix tree • One variable vi for each candidate distinguisher xi • vi = 1 xi is selected • vi = 0 xi is not selected
Integer Program Example Minimize VTG + VATGGA+ VCAGT + VTTC +VGTGC#objective function Such that VTG+ VTTC + VGTGC >= 1#constraint to cover pair 1,2 VATGGA+ VCAGT + VGTGC >= 1#constraint to cover pair 1,3 VTG+ VATGGA + VCAGT + VTTC >= 1#constraint to cover pair 2,3 Binaries #all variables are 0/1 VTG VATGGAVCAGTVTTCVGTGC End
Limitations of Integer Program Method • Works only for moderately sized datasets • 50-150 sequences • Average sequence length ~1000 nucleotides • Up to 4 hours needed to come within 20% of optimum
Information Content Heuristic • [Berman et al. 2004] • Keep track of the partition defined by distinguishers selected so far • In every step, choose candidate that reduces partition entropy by largest amount • Theorem: Information Content Heuristic is always finding a #distinguishers within 1+ln(n) of optimum
Limitations of ICH • Real genomic sequences contain degenerate nucleotides (e.g., N for any of {A,T,C,G}) due to sequencing errors and known single nucleotide polymorphisms • Distinguisher-to-sequence matches: • Perfect matches • Perfect mismatches • Uncertain matches • Information Content cannot be defined in the presence of uncertain matches
Other Heuristics • (Cazalis et al 2004): greedy setcover, simulated annealing, and genetic algorithms for distinguisher selection • To achieve practical running time, only a small random subset (2000 candidates) of all candidate distinguishers is considered • No data provided on the loss of solution quality due to this restriction
Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions
Setcover Greedy Heuristic • Phase I: Candidate Generation • Generate a representative set of candidate distinguishers from the source sequences • Phase II: Greedy Distinguisher Selection • In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs
Candidate Generation • A set of candidate distinguishers guaranteed to contain an optimum solution is generated from the sequences • We do not generate certain redundant candidates • A candidate is redundant if there is another candidate that appears exactly in the same set of sequences • For every sequence we generate only one of the substrings that appear exclusively in that sequence
Efficient Candidate Generation • Our implementation uses simple array datastructures • We generate candidates in increasing order of length • Exact match positions for candidates of length l-1 used to generate the exact matches for candidates of length l • Candidates that do not satisfy individual given biochemical constraints, such as minimum/maximumlength, GC content, melting temperature, are discarded
Setcover Greedy Heuristic • Phase I: Candidate Generation • Generate a set of candidate distinguishers from the source sequences • Phase II: Greedy Distinguisher Selection • In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs
Distinguisher Selection as Set Cover • Set Cover Problem: given a universal set U and a family of subsets, find a minimum number of subsets covering U • Distinguisher selection is a special case of set cover: • Elements to be covered are the pairs of sequences • Each candidate distinguisher defines a set of pairs that it separates • By a classical result, the greedy algorithm has an approximation factor of 1+ln(|U|) Setcover greedy has approximation factor of 2*ln(n) for distinguisher selection with n sequences
Distinguisher Selection • Start with an empty set D of distinguishers • While there are pairs of sequences not yet distinguished, do: • Compute for each remaining candidate c its coverage gain (c, D)– the number of not yet distinguished pairs of sequences that are distinguished by c • Add the candidate with maximum coverage gain to D • Return D
Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = { }
Computation of (c, D): c=TCAG CATCAGA TTCAGT TAT AATCAG AATAG D = { }
Computation of (c, D): c=TCAG (c, D)= 3 x (5 –3) = 6 CATCAGA TTCAGT TAT AATCAG AATAG D = { }
Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = {TCAG}
Computation of (c, D): CATCAGA TTCAGT TAT c=AAT AATCAG AATAG D = {TCAG}
Computation of (c, D): (c,D)= 1 x (2-1) + 1 x (3-1) = 3 CATCAGA TTCAGT TAT c=AAT AATCAG AATAG D = {TCAG}
Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = {TCAG,AAT}
Computation of (c, D) • S1, S2, …, Sk are the subsets in the partition defined by D • Mc is the set of matches of candidate c • Using simple datastructures, computation can be done in linear time (in the number of sequences)
Lazy Update of Gains • Coverage gains are monotonically non-increasingduring the algorithm • Re-compute coverage gain for a candidate only if last saved gain is higher than the gain of current best candidate • In practice this speeds-up the selection algorithm by a factor of ~2
Algorithm Extensions • Degenerate bases • A pair of sequences is separated by candidate c if • c has at least one perfect match with one of the sequences, and • c has perfect mismatches at all positions of the other sequence • Gain computation done in O(n2) time using a simple coverage matrix data-structure • Redundancy r • A pair of sequences is counted in the gain function until r distinguishers separate it • Distinguisher cross-hybridization • Minimum edit distance, or maximum common substring weight, bound for every pair of selected distinguishers • Candidates incompatible with a selected distinguisher removed from candidate list
Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions
Testcases • Randomly generated instances • Equal probabilities assigned to each of the four nucleotides • Microbial genomes extracted from NCBI databases • Sequence lengths between 490 Kbases to 4.75 Mbases • Small number of degenerate bases
Selection time, L=10k, r=1 basic – O(n2) computation of gains using matrix datastructure partition – O(n) computation of gains using partition-based datastructure
NCBI testcase • 20 NCBI microbial genomic sequences • Distinguisher melting temperature range of 55-60 oC • GC content range of 40-60% • Max common subsequence weight bound of 5 • weight(A)=weight(T)=1, weight(C)=weight(G)=2
NCBI testcase, r=1 GATTCGAACCCCCGA AACTGTCTCACGACGTTCTGAA CACAAGGAGTGAGTGTTGC GTGGATGCCTTGGCA AAGCGCGTCGCAAA GGACTACCAGGGTATCTAATCCTG AAAGAAGATAGAGCAGCAGCT CCATTGACAATTTCAACACC CGGTTTTGTGCTTCATGG
Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions
Conclusions • We provided highly scalable algorithms for the robust string barcoding problem, capable of handling whole genomic sequences of up to bacterial size • Distinguisher selection based whole genomic sequences results in a number of distinguishers nearly matching the information theoretic lower bounds for the problem • The software can be used online at http://dna.engr.uconn.edu/~software/DNA-BAR/