Implementing a novel method using string barcoding for rapid virus identification, reducing complexity through suffix trees and ILP formulation. Achieve efficient virus signatures for enhanced detection.
String Barcoding: Uncovering Optimal Virus Signatures • Sam Rash, Dan Gusfield • University of California, Davis
Motivation • Need for rapid virus detection • Given • unknown virus • database of known viruses • Problem • identify unknown virus quickly • Ideal solution • have sequence of • viruses in database • unknown virus • Solution • use BLAST (or any sequence similarity program/algorithm)
Motivation • Real World • only have sequences for pathogens in database • not possible to quickly sequence an unknown virus • can test for presence of small (<= 50 bp) strings in unknown virus • substring tests • Another Idea • String Barcoding • use substring tests to uniquely identify each virus in the database • acquire unique barcode for each virus in database
Similar Work • Borneman et al, 2001 • Work similar to String Barcoding • Focused on bacterial size data • used a different approach tailored to their needs
Problem Definition • Formal Definition • given • set of strings S • goal • find set of strings S’, the testing set • for each pair s1, s2 in S, there exists at least one u in S’ where u is a substring of exactly one of them (wlog, of only s1) • u is a signature substring • minimize |S’| • result • barcode for each element of S
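The definition above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function names `barcodes` and `is_testing_set` are hypothetical, the three strings are the ones used in the example slide later in the deck, and the testing set is picked by hand for illustration.

```python
from itertools import combinations

def barcodes(strings, tests):
    """Return one 0/1 tuple per string: bit j is 1 iff tests[j] occurs in it."""
    return [tuple(int(t in s) for t in tests) for s in strings]

def is_testing_set(strings, tests):
    """True iff every pair of strings differs on at least one substring test."""
    codes = barcodes(strings, tests)
    return all(a != b for a, b in combinations(codes, 2))

strings = ["cagtgc", "cagttc", "catgga"]
tests = ["tg", "gt"]  # hand-picked testing set (an assumption for illustration)

print(barcodes(strings, tests))        # one barcode per input string
print(is_testing_set(strings, tests))  # whether all barcodes are distinct
```

A set such as "tg", "tt" fails the check, because neither test separates the first string from the third.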
Problem Complexity • Complexity • unknown if NP-hard when size of any u in S’ is unbounded • Max-Length String Barcoding • additional parameter k, a maximum length of any u in S’ • this variant is NP-hard • reduction from Minimum Testing Set (Garey, Johnson, 1979) • means all real-world uses have to deal with an NP-hard problem
Implementation • Basic Idea: Formulate problem as an ILP • Enumerate some “useful” set of substrings from S • variable in ILP for each substring • Constraint for each pair of strings in S • means that at least one substring will be chosen to distinguish each pair • Objective Function • Minimize sum of variables in ILP
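As a rough sketch of this formulation (assumptions: the helper names `pair_constraints` and `to_lp_text` are hypothetical, candidates are given as a mapping from a variable name to the set of 1-based string IDs the substring occurs in, and the emitted text only loosely follows the layout shown on the example slide), the constraints can be generated like this:

```python
from itertools import combinations

def pair_constraints(occurs, n):
    """Map each pair (i, j) to the candidates occurring in exactly one of the two."""
    return {
        (i, j): sorted(v for v, ids in occurs.items() if (i in ids) != (j in ids))
        for i, j in combinations(range(1, n + 1), 2)
    }

def to_lp_text(occurs, n):
    """Render the objective and one covering constraint per pair as plain text."""
    names = sorted(occurs)
    lines = ["minimize", " " + " + ".join(names), "st"]
    for pair, cover in pair_constraints(occurs, n).items():
        if not cover:
            # No candidate separates this pair, so the ILP would be infeasible.
            raise ValueError(f"no candidate distinguishes pair {pair}")
        lines.append(" " + " + ".join(cover) + " >= 1")
    lines += ["binaries", " " + " ".join(names), "end"]
    return "\n".join(lines)

# Occurrence sets taken from the example slide (variable -> string IDs).
occurs = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
print(to_lp_text(occurs, 3))
```

A candidate appears in the constraint for a pair exactly when choosing it would give the two strings different barcode bits.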
Implementation • Key point: complexity of ILP primarily a function of the number of variables • reducing number of candidate substring tests reduces the number of variables in ILP • how to reduce? • Key to our method: suffix trees • finds minimum cardinality set of “useful” substrings for use as candidate signature substrings
Implementation: Suffix Trees • Key Properties of Suffix Tree built for set of strings S • tree with character sequences labeling edges • nodes labeled with a subset of original string IDs • every substring of original input set appears as a root-edge walk exactly once • a walk from the root to a node is treated as a root-edge walk ending on that node’s in-edge from its parent
Implementation: Suffix Trees • root-edge walk • Creates string • appears in exactly the strings that label the node at which it ends • 2 root-edge walks ending on the same edge • Both strings created by the walks occur in exactly the same set of original strings • Can use either string • Figure: example of a root-edge walk
Implementation: Solving • If two substrings occur in exactly the same set of original strings, only one need be considered • Use strings from suffix tree for each uniquely labeled node • Build ILP as discussed • Solve ILP using CPLEX • Acquire barcode and signatures for each original string • signature is the set of substring tests occurring in a string
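The first two bullets can be illustrated without a suffix tree. The sketch below is a naive substitute, not the suffix-tree method from the slides: it enumerates every substring directly, groups substrings by the set of strings they occur in, and keeps one representative per distinct set. The function name `candidate_substrings` and its `min_len`/`max_len` parameters are hypothetical, and the running time is far worse than what a suffix tree would give.

```python
def candidate_substrings(strings, min_len=1, max_len=None):
    """Keep one shortest substring per distinct set of containing string IDs (1-based)."""
    reps = {}
    n = len(strings)
    for s in strings:
        top = len(s) if max_len is None else min(max_len, len(s))
        for length in range(min_len, top + 1):
            for start in range(len(s) - length + 1):
                u = s[start:start + length]
                ids = frozenset(i + 1 for i, t in enumerate(strings) if u in t)
                if len(ids) == n:
                    continue  # occurs in every string, so it distinguishes no pair
                best = reps.get(ids)
                # Prefer shorter representatives, breaking ties alphabetically.
                if best is None or (len(u), u) < (len(best), best):
                    reps[ids] = u
    return reps

reps = candidate_substrings(["cagtgc", "cagttc", "catgga"])
for ids, u in sorted(reps.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ids), u)
```

On these three strings the grouping leaves five distinct occurrence sets, which agrees with the five variables in the ILP on the example slide.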
Implementation: Example • strings: 1. cagtgc 2. cagttc 3. catgga • Each node in the suffix tree has a corresponding set of string IDs below it
Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga
Figure 1.2 - table of string labels for each node in suffix tree from figure 1.1
v1 - {1,2,3}    v2 - {1,2,3}    v3 - {3}        v4 - {1}        v5 - {3}
v6 - {1,2}      v7 - {2}        v8 - {1}        v9 - {1,2,3}    v10 - {1,2,3}
v11 - {1,2}     v12 - {1}       v13 - {2}       v14 - {3}       v15 - {1,2,3}
v16 - {2}       v17 - {2}       v18 - {1,3}     v19 - {1}       v20 - {3}
v21 - {1,2,3}   v22 - {3}       v23 - {2}       v24 - {1,2}     v25 - {1}
Implementation: Example
minimize
 V18 + V22 + V11 + V17 + V8 #objective function
st
 V18 + V22 + V11 + V17 + V8 >= 2 #this is the theoretical minimum
 V18 + V17 + V8 >= 1 #constraint to cover pair 1,2
 V22 + V11 + V8 >= 1 #constraint to cover pair 1,3
 V18 + V22 + V11 + V17 >= 1 #constraint to cover pair 2,3
binaries #all variables are 0/1
 V18 V22 V11 V17 V8
end
Figure 1.3 - ILP constructed for suffix tree in figure 1.1 using no additional constraints (length, etc)
Figure 1.4 - barcodes
Figure 1.5 - signatures
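An ILP this small can be solved by plain enumeration, which gives a quick sanity check on the formulation. This is a sketch only: the slides use CPLEX, whereas the hypothetical `solve_by_enumeration` below simply tries subsets in order of increasing size, which is only workable for a handful of variables.

```python
from itertools import combinations

VARIABLES = ["V18", "V22", "V11", "V17", "V8"]

# Each constraint: (variables on the left-hand side, required minimum sum).
CONSTRAINTS = [
    ({"V18", "V22", "V11", "V17", "V8"}, 2),  # lower bound on number of tests
    ({"V18", "V17", "V8"}, 1),                # pair 1,2
    ({"V22", "V11", "V8"}, 1),                # pair 1,3
    ({"V18", "V22", "V11", "V17"}, 1),        # pair 2,3
]

def solve_by_enumeration(variables, constraints):
    """Return a smallest set of variables set to 1 that meets every constraint."""
    for size in range(len(variables) + 1):
        for chosen in combinations(variables, size):
            picked = set(chosen)
            if all(len(picked & terms) >= rhs for terms, rhs in constraints):
                return picked
    return None  # infeasible

solution = solve_by_enumeration(VARIABLES, CONSTRAINTS)
print(sorted(solution))
```

Several two-variable solutions exist; the enumeration returns the first one it meets, so a real solver may report a different optimum with the same objective value.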
Implementation: Extensions • minimum and maximum lengths on signature substrings • acquire barcodes/signatures for only a subset of input strings (with respect to the whole set) • minimum string edit distance between chosen signature substrings • redundancy • require r signature substrings to differentiate each pair • adds a higher level of confidence that signatures remain valid even with mutations
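The redundancy extension amounts to raising the right-hand side of each pair constraint from 1 to r. The sketch below illustrates this on the occurrence sets from the example slide; the names `distinguishing` and `smallest_redundant_set` are hypothetical, and exhaustive search stands in for an ILP solver.

```python
from itertools import combinations

def distinguishing(occurs, i, j):
    """Candidates that occur in exactly one of strings i and j."""
    return {name for name, ids in occurs.items() if (i in ids) != (j in ids)}

def smallest_redundant_set(occurs, n, r):
    """Smallest candidate set in which every pair is separated by at least r tests."""
    names = sorted(occurs)
    covers = [distinguishing(occurs, i, j)
              for i, j in combinations(range(1, n + 1), 2)]
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            picked = set(chosen)
            if all(len(picked & cover) >= r for cover in covers):
                return picked
    return None  # some pair has fewer than r distinguishing candidates

occurs = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
for r in (1, 2, 3):
    print(r, sorted(smallest_redundant_set(occurs, 3, r)))
```

With r signature substrings separating each pair, losing any r - 1 of them (for instance to mutations) still leaves that pair distinguishable.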
Results: Summary • Works quickly on most moderately sized datasets (especially when redundancy >= 2) • dataset properties • ~50k virus genomes taken from NCBI (Genbank) • 50-150 virus genomes • average length of each genome ~1000 characters • total input size ranged from approximately 50,000 – 150,000 characters • increasing dataset size scaled approximately linearly • reach 25% gap (at most 1/3 more than optimum) in just a few minutes • reach small gap (often < 1%) in 4 hours
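The "25% gap (at most 1/3 more than optimum)" remark follows from a short calculation. The symbols are assumptions introduced here: z_best is the best solution found so far, z_bound is the solver's lower bound, z* is the true optimum, and the gap is assumed to be measured relative to z_best, as MIP solvers commonly report it.

```latex
\[
\frac{z_{\text{best}} - z_{\text{bound}}}{z_{\text{best}}} \le 0.25
\;\Longrightarrow\;
z_{\text{best}} \le \frac{z_{\text{bound}}}{1 - 0.25}
= \tfrac{4}{3}\, z_{\text{bound}}
\le \tfrac{4}{3}\, z^{*}
\]
```

The last step uses z_bound <= z*, which holds for any valid lower bound on a minimization problem.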
Results: Summary • increasing redundancy greatly decreases run time and % gap at 4 hours in all cases tested Figure 2.1 - effect of redundancy on avg 25% gap Figure 2.2 - effect of redundancy on avg gap at 4 hours
Conclusion • Practically sized testing sets obtained on reasonably sized input datasets • testing set consisting of 50 – 270 substring tests on input sets of ~100 genomes • works well with reactions that have a high number of assays (substring tests) per reaction • GeneChip – 400 assays per reaction • Redundancy • Good concept in theory • Reduces solution space and hence computation time • GeneChip makes the higher number of assays needed cost-effective
Future Work • Expand to work on even larger datasets • Improve ILP solving • use other ILP approximations • Determine if unconstrained String Barcoding is NP-hard • More Applications?