Multiple Pattern Matching Revisited
Robert Susik¹, Szymon Grabowski¹, Kimmo Fredriksson²
¹ Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland
² University of Eastern Finland, School of Computing, Kuopio, Finland
PSC, Prague, Sept. 2014
Multiple pattern matching
• The problem:
  • report all positions i, 1 ≤ i ≤ n, in the text T[1..n] at which at least one of the r patterns, each of length m, occurs; the text and the patterns are over a common integer alphabet of size σ.
• Usage:
  • antivirus scanning,
  • intrusion detection,
  • web searches,
  • etc.
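As a point of reference for the problem statement, a brute-force baseline can be sketched as follows. This is only an illustration of what has to be reported, not one of the algorithms discussed in the talk; the function name `naive_multi_match` and the 0-based positions are assumptions made for the example.

```python
def naive_multi_match(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Report every (position, pattern) pair such that pattern occurs at position.

    Positions are 0-based here. The running time is O(n * r * m), which is
    exactly what the algorithms on the following slides try to avoid.
    """
    hits = []
    for i in range(len(text)):
        for p in patterns:
            # str.startswith with an offset compares text[i:i+len(p)] to p
            if text.startswith(p, i):
                hits.append((i, p))
    return hits


if __name__ == "__main__":
    print(naive_multi_match("ACGGATGGACTA", ["ATGG", "ACTA"]))
```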
Related work
• Aho–Corasick (1975), works in linear time,
• Commentz-Walter (1979), a suffix-based approach built on the Boyer–Moore (BM) algorithm,
• Fredriksson and Grabowski (2009), an average-optimal filtering variant of the classic AC algorithm,
• Wu and Manber (1994), based on backward matching over a sliding text window.
[Figures: Aho–Corasick trie for he, she, his and hers; Commentz-Walter trie for he, she, his and hers; the Wu and Manber Boyer–Moore approach.]
Images taken from: S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan, "Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification?", November 2012; Koloud Al-Khamaiseh, Int. Journal of Engineering Research and Applications, July 2014.
Related work
• DAWG-match (Crochemore et al., 1999) and MultiBDM (Crochemore & Rytter, 1994), based on backward matching, linear in the worst case, but complex,
• Multi-BNDM (Navarro & Raffinot, 1998), a simplified, bit-parallel version,
• Set Backward Oracle Matching (Allauzen & Raffinot, 1999), similar to the above but simpler and very efficient in practice,
• Succinct Backward DAWG Matching (Fredriksson, 2003), practical for huge pattern sets thanks to the use of a succinct index,
• Faro & Külekci (2012), use of SSE technology, e.g. the wsfp (word-size fingerprint instruction) operation, to identify text blocks that may contain a matching pattern,
• Salmela et al. (2006) tried an approach similar to ours (not very successful for short patterns in their tests).
Shift-Or (Baeza-Yates & Gonnet, 1992)
• Shift-Or simulates a non-deterministic finite automaton (NFA) using bit-parallelism
• Bit-parallelism:
  • Frequently used in stringology when the results of single operations are booleans or small integers
  • Many operations (up to w, the computer word size) can be performed in parallel
  • Reinvented several times, but BY-G (1992) is the best known
Shift-Or – at work
T = gcatcgcagagat, P = gcaga
Preprocessing:
  B[g] = 01101
  B[c] = 10111
  B[a] = 11010
  B[] – one bit-vector per alphabet symbol, m·σ bits in total (bit j of B[c] is 0 iff P[j] = c; bit 0 is written leftmost)
Search:
  V := ~0; i := 0
  while i < n do
    V := (V << 1) | B[T[i]]
    if the (m–1)-th bit of V is 0 then report a match ending at position i
    i := i + 1
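The pseudocode above can be turned into a short runnable sketch. The names `build_masks` and `shift_or` are chosen for this illustration only; because Python integers are unbounded, the state is masked to m bits explicitly, which a fixed-width machine word would do implicitly.

```python
def build_masks(pattern: str) -> dict[str, int]:
    """Preprocessing: bit j of masks[c] is 0 iff pattern[j] == c, else 1."""
    full = (1 << len(pattern)) - 1          # m one-bits
    masks: dict[str, int] = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, full) & ~(1 << j)
    return masks


def shift_or(text: str, pattern: str) -> list[int]:
    """Return the 0-based END positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    full = (1 << m) - 1
    masks = build_masks(pattern)
    state = full                             # V := ~0, restricted to m bits
    ends = []
    for i, c in enumerate(text):
        # Symbols that do not occur in the pattern get the all-ones mask.
        state = ((state << 1) | masks.get(c, full)) & full
        if state & (1 << (m - 1)) == 0:      # (m-1)-th bit is 0: match
            ends.append(i)
    return ends


if __name__ == "__main__":
    print(shift_or("gcatcgcagagat", "gcaga"))
```

For the slide's example, the single occurrence of gcaga starts at text position 5 and therefore ends at position 9 (0-based).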
Shift-Or
• Pros:
  • Fast: O(nm / w) time in the worst case
  • when m = O(w), the time is linear
• Cons:
  • The average case is the same as the worst case, although faster methods are possible
Average Optimal Shift-Or (AOSO) (Fredriksson & Grabowski, 2005, 2009)
• Motivation:
  • Improve the average case of Shift-Or
• Idea:
  • Sample T every k symbols: T’ = t0, tk, t2k, …
  • Need to match k subpatterns of P: P0, …, Pk–1, each sampled in the same way as T, starting from offsets 0, 1, …, k–1
  • When some subpattern matches, verify whether there is a true match
AOSO – example (k = 2)
T = gcatcgcagagat, P = gcagag
T’ = gaccggt
P0 = gaa
P1 = cgg
Processing (T’ is scanned left to right and both subpatterns are matched against it):
  T’ = g.a.c.c.g.g.t
  • At most alignments: no match of a subpattern.
  • P1 = cgg matches T’ at sampled positions 3..5 (text positions 6, 8, 10): match of a subpattern! Verification of P against T[5..10] = gcagag: success.
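The sampling-and-verification idea from the example can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it runs k separate Shift-Or automata (one per subpattern) instead of packing them into a single machine word, truncates every subpattern to ⌊m/k⌋ symbols, and uses the hypothetical name `aoso_search`.

```python
def aoso_search(text: str, pattern: str, k: int) -> list[int]:
    """Return the 0-based START positions of pattern in text, sampling every k-th symbol."""
    m = len(pattern)
    sub_len = m // k                 # every subpattern is truncated to this length
    if sub_len == 0:
        raise ValueError("k must not exceed the pattern length")
    full = (1 << sub_len) - 1

    # One Shift-Or mask table per subpattern P_j = p_j, p_{j+k}, p_{j+2k}, ...
    tables = []
    for j in range(k):
        masks: dict[str, int] = {}
        for pos in range(sub_len):
            c = pattern[j + pos * k]
            masks[c] = masks.get(c, full) & ~(1 << pos)
        tables.append(masks)

    states = [full] * k
    hits = []
    for e, c in enumerate(text[::k]):            # e indexes the sampled text T'
        for j in range(k):
            states[j] = ((states[j] << 1) | tables[j].get(c, full)) & full
            if states[j] & (1 << (sub_len - 1)) == 0:
                # P_j ends at T'[e]; translate back to a candidate start in T.
                start = (e - sub_len + 1) * k - j
                # Verification step: the filter may report false candidates.
                if start >= 0 and text.startswith(pattern, start):
                    hits.append(start)
    return sorted(hits)


if __name__ == "__main__":
    print(aoso_search("gcatcgcagagat", "gcagag", 2))
```

An occurrence starting at text position s is always caught by the subpattern with offset j = (−s) mod k, so each occurrence is reported once.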
AOSO
• Pros:
  • Faster than Shift-Or: O(n log(m) / m) time in the average case
• Cons:
  • Needs verification to exclude false matches, which is not a big problem in practice
Multi-pattern AOSO (MAOSO)
• Idea:
  • Merge the r input patterns into one superimposed pattern
  • Search only for the superimposed pattern, then exclude false matches
• Example (for r = 2):
  • P0 = ATGG, P1 = ACTA
  • Merging: P* = [A][TC][GT][GA]
MAOSO – some details
• In the manner of Shift-Or, set the bit in a symbol's bit-vector if any of the patterns has that symbol at the given position of the superimposed pattern
• Use AOSO for the superimposed pattern
• Problem: if r is large and (especially) σ is small, there are many verifications
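Superimposition and the resulting false matches can be illustrated with a small sketch. For brevity it searches with plain Shift-Or (that is, k = 1) rather than AOSO sampling, and verifies candidates with a set lookup; both choices, as well as the names `superimpose` and `multi_shift_or`, are assumptions made for this example.

```python
def superimpose(patterns: list[str]) -> list[set[str]]:
    """Character class at each position: the union of the patterns' symbols there."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    return [{p[j] for p in patterns} for j in range(m)]


def multi_shift_or(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start, pattern) pairs, filtering with the superimposed pattern."""
    classes = superimpose(patterns)
    m = len(classes)
    full = (1 << m) - 1
    # Bit j of masks[c] is 0 iff ANY pattern has symbol c at position j.
    masks: dict[str, int] = {}
    for j, cls in enumerate(classes):
        for c in cls:
            masks[c] = masks.get(c, full) & ~(1 << j)

    pattern_set = set(patterns)
    state = full
    hits = []
    for i, c in enumerate(text):
        state = ((state << 1) | masks.get(c, full)) & full
        if state & (1 << (m - 1)) == 0:
            window = text[i - m + 1 : i + 1]
            # Verification: the superimposed pattern also accepts mixtures
            # such as "ACGG" that are not among the input patterns.
            if window in pattern_set:
                hits.append((i - m + 1, window))
    return hits


if __name__ == "__main__":
    print(superimpose(["ATGG", "ACTA"]))
    print(multi_shift_or("ACGGATGGACTA", ["ATGG", "ACTA"]))
```

In the sample text, the window ACGG at position 0 passes the superimposed filter [A][TC][GT][GA] but is rejected by verification, which is the kind of false match the slide warns about.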
Q-grams
Idea: group q successive text characters into supersymbols. New alphabet size: σ^q. Enlarging the alphabet may reduce the number of comparisons between the text and the pattern.
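A minimal sketch of the supersymbol idea, assuming non-overlapping q-grams read in steps of q and a base-σ encoding; the function name `encode_qgrams` is hypothetical, and trailing characters that do not fill a whole q-gram are simply dropped in this illustration.

```python
def encode_qgrams(text: str, q: int, alphabet: str) -> list[int]:
    """Encode consecutive, non-overlapping q-grams as integers in [0, sigma**q)."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}   # symbol -> integer in [0, sigma)
    codes = []
    for start in range(0, len(text) - q + 1, q):
        code = 0
        for c in text[start : start + q]:
            code = code * sigma + rank[c]           # base-sigma positional code
        codes.append(code)
    return codes


if __name__ == "__main__":
    # sigma = 4 and q = 2 give a super-alphabet of 4**2 = 16 supersymbols
    print(encode_qgrams("acgtac", 2, "acgt"))
```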
Alphabet mapping
Map a large alphabet of σ symbols to a smaller alphabet of σ’ symbols. We achieve this with a bin-packing method.
[Figure: new alphabet (σ’ = 4)]
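One way to realise such a mapping is a greedy bin-packing heuristic: sort the symbols by decreasing frequency and put each one into the currently lightest bin. The slides do not spell out the exact heuristic, so the rule below is an assumption, as is the name `map_alphabet`.

```python
from collections import Counter


def map_alphabet(text: str, new_sigma: int) -> dict[str, int]:
    """Map each symbol of text to one of new_sigma bins with balanced total frequency."""
    freq = Counter(text)
    loads = [0] * new_sigma                  # total frequency assigned to each bin
    mapping: dict[str, int] = {}
    # Most frequent symbols first; ties broken alphabetically for determinism.
    for sym, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = loads.index(min(loads))   # greedy choice: the least loaded bin
        mapping[sym] = lightest
        loads[lightest] += count
    return mapping


if __name__ == "__main__":
    print(map_alphabet("aaaabbbccd", 2))
```

Balancing the bins keeps the reduced symbols roughly equiprobable, which is what a filter built on the smaller alphabet needs in order to stay selective.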
Multi AOSO on q-Grams (MAG)
• The super-alphabet reduces the number of verifications. The probability of a match is p = O(qr / σ^q), so the verification probability is O(p^⌊m/(kq)⌋) and the cost is O(rqm)
• q-gram based search makes the steps bigger (of size q); in other words, the text becomes shorter (n/q)
• FAOSO runs in O(n/k · ⌈(m/q)/w⌉) time in our case, where w is the number of bits in a computer word (typically 64).
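To get a feeling for how q affects the filter, the bound can be evaluated numerically, reading it as p = O(qr / σ^q) with verification probability O(p^⌊m/(kq)⌋). The sketch below takes the constant hidden in the O-notation to be 1 and caps p at 1; both are simplifying assumptions, and the name `verification_estimate` is hypothetical.

```python
def verification_estimate(r: int, sigma: int, q: int, m: int, k: int) -> float:
    """Rough estimate of the verification probability p ** floor(m / (k * q))."""
    p = min(1.0, q * r / sigma ** q)     # assumed constant factor of 1, capped at 1
    return p ** (m // (k * q))


if __name__ == "__main__":
    # r = 100 patterns of length m = 32 over sigma = 20 symbols, sampling step k = 2
    for q in (1, 2, 3):
        print(q, verification_estimate(100, 20, q, 32, 2))
```

With these illustrative parameters, q = 1 gives p = 1 (every sampled position triggers a verification), while q = 2 gives p = 0.5 and an estimate of 0.5^8 ≈ 0.0039.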
Simple Multi AOSO on q-Grams (SMAG)
• A simpler version of MAG: the whole text is encoded before the actual search starts, which makes the search itself more streamlined.
• Total complexity is Ω(n), the time to encode the text.
• Slightly faster search, but a much longer preprocessing phase.
• May be useful if the text is searched multiple times within a short period and there is space to store it in encoded form.
Experimental results
• Hardware: Intel Core i3 2100 3.1 GHz CPU, 128 KB L1, 512 KB L2 and 3 MB L3 cache, 4 GB of 1333 MHz DDR3 RAM
• Compiler: g++ version 4.8.1 with -O3 optimization
• OS: 64-bit Ubuntu with kernel 3.11.0-17
• Text: taken from the Pizza & Chili Corpus (http://pizzachili.dcc.uchile.cl), 200 MB each
• Tests: all source codes were obtained from their authors and compiled on the same test machine (some of them cannot handle long patterns, i.e., m = 64).
Conclusions
• Our work can be seen as a new and quite successful combination of known building blocks.
• The presented algorithm, MAG, usually outperforms its competitors on the three test datasets (english, proteins and dna).
• One of the key successful ideas was alphabet quantization (binning), which is performed in a greedy manner, after sorting the original alphabet by frequency.
Future work
• Different alphabet mapping techniques could improve efficiency.
• Is it possible to choose the algorithm’s parameters so as to reach average optimality (for m = O(w))?
• SSE instructions seem to offer great opportunities, especially for bit-parallel algorithms.
• Dense codes (e.g., ETDC) for words or q-grams not only serve to compress data (texts), but also enable faster pattern searches (our preliminary results are rather promising).