Variations of Forward-SBNDM

Hannu Peltola, Jorma Tarhio
Aalto University, Finland

Aims
• Tuning algorithms for exact string matching.
• Studying the effect of simultaneous 2-byte read.
Aug. 29, 2011

SBNDM: Simple Backward Nondeterministic DAWG Matching
• SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms.
• Text T = t_1...t_n, pattern P = p_1...p_m.
• At each alignment window of P in T, scan T from right to left until the suffix of the window is not a factor of P or an occurrence of P is found.

Shift of SBNDM
• No factor: m
• P found: 1
• Else: next alignment starts at the last factor

SBNDM, example
P = banana, T = antanabadbanana... (brackets mark the current alignment window)
• Alignment: [antana]badbanana. The suffixes a, na, ana are factors of P.
• Not a factor: tana. Next alignment: ant[anabad]banana.
• Not a factor: d. Next alignment: antanabad[banana].
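The scan and shift rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tuned C code: the function name `sbndm_search` is hypothetical, positions are 0-indexed, and Python's unbounded integers stand in for a machine word (a real implementation needs m to fit in the word size).

```python
def sbndm_search(pattern, text):
    """Return the 0-indexed start positions of pattern in text (SBNDM-style scan)."""
    m, n = len(pattern), len(text)
    # Occurrence vectors: bit m-1-k of B[c] is set when pattern[k] == c.
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))

    hits = []
    start = 0                          # current window is text[start:start+m]
    while start + m <= n:
        j = start + m - 1              # leftmost text index read so far
        D = B.get(text[j], 0)
        # Extend the suffix leftwards while it is still a factor of the pattern.
        while D != 0 and j > start:
            j -= 1
            D = (D << 1) & B.get(text[j], 0)
        if D != 0:                     # whole window is a factor of length m: a match
            hits.append(start)
            start += 1                 # "P found: 1"
        else:
            start = j + 1              # next alignment starts at the last factor
    return hits


print(sbndm_search("banana", "antanabadbanana"))
```

On the example text the windows start at positions 0, 3 and 9, matching the three alignments of the slide, and the single occurrence is reported at position 9.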

SBNDMq
• SBNDMq [6] is a tuned version of SBNDM.
• Processing of an alignment starts with checking a q-gram.
• Let q = 4. Consider an alignment at antana. Instead of testing four suffixes a, na, ana, tana, only tana is tested.
• Testing is done in a fast loop.
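A small sketch of the q-gram test, assuming the usual SBNDM occurrence vectors (bit m-1-k of B[c] is set when the k-th pattern character, counted from 0, equals c). The helper names `occurrence_vectors` and `qgram_state` are illustrative only. A zero result means the q-gram is not a factor of P, so the window can be shifted by m-q+1 at once.

```python
def occurrence_vectors(pattern):
    """Bit m-1-k of B[c] is set when pattern[k] == c."""
    m = len(pattern)
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))
    return B


def qgram_state(B, gram):
    """State vector after reading the whole q-gram in one step.

    The i-th character of the q-gram (counted from 0) contributes B[c] << i.
    """
    D = B.get(gram[0], 0)
    for i, c in enumerate(gram[1:], start=1):
        D &= B.get(c, 0) << i
    return D


B = occurrence_vectors("banana")
print(format(qgram_state(B, "tana"), "06b"))   # tana is not a factor of banana
print(format(qgram_state(B, "nana"), "06b"))   # nana is a factor of banana
```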

Forward-SBNDM
• Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2.
• Both FSB and SBNDM2 read a 2-gram x1x2 before a factor test.
• x1x2 is matched with the end of P in SBNDM2.
• Only x1 is matched with the end of P in FSB, and x2 is a lookahead character following the current alignment.
• FSB is faster than SBNDM2 for large alphabets.

Generalization of FSB: FSB(q,f)
• FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q-1.
• FSB(2,1) = FSB and FSB(q,0) = SBNDMq.
• Motivation: SBNDMq also works well on modern processors for q > 2.

FSB(q,f)
• Let UV be a q-gram, where |V| = f.
• After reading UV there are 3 alternatives:
  • If U is a suffix of P, reading continues leftwards.
  • Else if UV is a factor of P, reading continues leftwards.
  • Else the state vector is zero and P is shifted m-q+f+1 positions (f positions more than in SBNDMq).
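The three alternatives can be combined into one search loop. The sketch below is one possible reading of FSB(q,f), written under stated assumptions rather than taken from the authors' implementation: every occurrence vector is shifted left by f and gets f low bits set (so lookahead characters past the end of P always match), positions past the end of the text are treated like characters that do not occur in P, and the name `fsb_search` is hypothetical.

```python
def fsb_search(pattern, text, q=4, f=2):
    """Return 0-indexed occurrences of pattern, reading q-grams with f lookahead chars."""
    m, n = len(pattern), len(text)
    if not (0 <= f < q and q - f <= m):
        raise ValueError("need 0 <= f < q and q - f <= len(pattern)")

    low = (1 << f) - 1                     # f extra low bits, set in every vector
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, low) | (1 << (m - 1 - k + f))

    def vec(i):
        # Assumption: positions beyond the text behave like a character not in P.
        return B.get(text[i], low) if i < n else low

    hits = []
    start = 0                              # window is text[start:start+m]
    while start + m <= n:
        j = start + m - q + f              # first character of the q-gram UV
        D = vec(j)
        for i in range(1, q):
            D &= vec(j + i) << i
        # D != 0: U is a suffix of P or UV is (partly) a factor; keep reading leftwards.
        while D != 0 and j > start:
            j -= 1
            D = (D << 1) & vec(j)
        if D != 0:
            hits.append(start)
            start += 1
        else:
            start = j + 1                  # equals a shift of m-q+f+1 when the q-gram fails
    return hits


print(fsb_search("banana", "antanabadbanana", q=4, f=2))
```

With f = 0 the loop reduces to SBNDMq, which is consistent with FSB(q,0) = SBNDMq.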

Occurrence vectors in FSB(q,2)
• Example: P = banana
• SBNDMq: B[n] = 00001010
• FSB(q,2): B[n] = 00101011, B[a] = 01010111, B[x] = 00000011 (x is a character that does not occur in P)
• The two lowest bits are the extra bits.

State vectors in FSB(q,2) for q=4
• 4-gram nanx: the state vector is B[n] & (B[a]<<1) & (B[n]<<2) & (B[x]<<3)
    x  00000011
    n  00101011
    a  01010111
    n  00101011
       00001000

    4-gram   State vector   Conclusion
    nanx     00001000       na is a suffix of P (nanx is not a factor)
    xana     00000000       not a factor
    anan     01000000       factor of P
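The vectors and the nanx and xana rows can be recomputed with a short sketch. It assumes that each FSB(q,f) vector is the SBNDM vector shifted left by f with the f lowest bits set, which agrees with the values listed for banana; the function names are illustrative.

```python
def fsb_vectors(pattern, f):
    """Occurrence vectors with f extra low bits; returns (B, default vector)."""
    m = len(pattern)
    low = (1 << f) - 1
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, low) | (1 << (m - 1 - k + f))
    return B, low


def state_vector(B, low, gram):
    """AND of the vectors of the q-gram, the i-th one (from 0) shifted left by i."""
    D = B.get(gram[0], low)
    for i, c in enumerate(gram[1:], start=1):
        D &= B.get(c, low) << i
    return D


B, low = fsb_vectors("banana", 2)
print("B[n] =", format(B["n"], "08b"))
print("B[a] =", format(B["a"], "08b"))
print("B[x] =", format(B.get("x", low), "08b"))
for gram in ("nanx", "xana"):
    print(gram, format(state_vector(B, low, gram), "08b"))
```

For anan this sketch sets bit 6, which signals that anan is a factor of P; under the same conventions it also keeps bit 4, the alignment where ana ends P and the last n is a lookahead character.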

Benefits / drawbacks of lookahead characters and extra bits
• Benefits
  • Longer shifts => more speed
  • Combined suffix/factor test
• Drawback
  • More q-grams accepted => less speed

Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2)
• Factor tests of two 2-grams are done in one round.
• Let B2[x,y] denote the combined occurrence vector of characters x and y: B2[x,y] = B[x] & (B[y]<<1)

    next: D ← B2[t_i, t_{i+1}]
          if D = 0 then
              if B2[t_{i+m-1}, t_{i+m}] = 0 then
                  i ← i + 2*m - 2
                  goto next
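A runnable sketch of the greedy skip loop embedded in a complete search. The slide shows only the skip itself, so the rest (the leftward scan and the shifts after it) follows plain SBNDM2 and is an assumption of this sketch; `gsb2_search` is a hypothetical name, and the pattern must have at least two characters.

```python
def gsb2_search(pattern, text):
    """Return 0-indexed occurrences of pattern (len >= 2) using a greedy 2-gram skip."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("pattern must have at least 2 characters")
    B = {}
    for k, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - k))

    def b2(i):
        # Combined vector of the 2-gram text[i], text[i+1]: B[x] & (B[y] << 1).
        return B.get(text[i], 0) & (B.get(text[i + 1], 0) << 1)

    hits = []
    i = m - 2                          # the 2-gram text[i], text[i+1] ends the window
    while i + 1 < n:
        D = b2(i)
        if D == 0:
            i += m - 1                 # first 2-gram is not a factor
            if i + 1 >= n:
                break
            D = b2(i)                  # greedy: test the next alignment in the same round
            if D == 0:
                i += m - 1             # total skip of 2*m-2
                continue
        start = i - m + 2              # window is text[start:start+m]
        j = i
        while D != 0 and j > start:
            j -= 1
            D = (D << 1) & B.get(text[j], 0)
        if D != 0:
            hits.append(start)
            i += 1
        else:
            i = j + m - 1              # next window starts at j+1
    return hits


print(gsb2_search("banana", "antanabadbanana"))
```

In Python the two tests per round bring no speed benefit; the point is only to show the control flow that a C implementation would unroll.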

2-byte read
• Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop).
• Well suited to q-gram algorithms with even q.
• For experiments we made two versions of the algorithms:
  • Standard (1-byte read)
  • b-version using 2-byte read

2-byte read (cont.)
• Advantage: a part of the computation can be moved to the preprocessing phase.
  • Example: B2[x,y] = B[x] & (B[y]<<1)
• Speed-up factor of even more than 2.
• Drawback: extra 0.1 ms for preprocessing.
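The preprocessing idea can be sketched as a table with 2^16 entries indexed by the 16-bit value of two adjacent bytes. This is an illustration under assumptions: little-endian byte order is chosen explicitly, `struct.unpack_from` stands in for the single 16-bit load a C implementation would use, and the names `build_b2_table` and `read2` are hypothetical.

```python
import struct


def occurrence_vectors(pattern: bytes):
    """Bit m-1-k of B[c] is set when pattern[k] == c (c is a byte value)."""
    m = len(pattern)
    B = [0] * 256
    for k, c in enumerate(pattern):
        B[c] |= 1 << (m - 1 - k)
    return B


def build_b2_table(pattern: bytes):
    """Precompute B2[x,y] = B[x] & (B[y] << 1) for every byte pair."""
    B = occurrence_vectors(pattern)
    table = [0] * 65536
    for y in range(256):
        shifted = B[y] << 1
        for x in range(256):
            # Little-endian: the first byte x is the low byte of the 16-bit value.
            table[x | (y << 8)] = B[x] & shifted
    return table


def read2(text: bytes, i: int) -> int:
    """Read text[i] and text[i+1] as one little-endian 16-bit value."""
    return struct.unpack_from("<H", text, i)[0]


table = build_b2_table(b"banana")
text = b"antanabadbanana"
print(table[read2(text, 4)] != 0)   # 2-gram "na" is a factor of banana
print(table[read2(text, 7)] != 0)   # 2-gram "ad" is not
```

One table lookup then replaces two lookups, a shift and an AND in the skip loop, at the cost of filling 65536 entries during preprocessing.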

4-byte read?
• Many border crossings happen => slowdown
• Tables with 2^32 entries are too big in practice

Experimental results / KJV Bible
• In the recent comparison by S. Faro and T. Lecroq, The Exact String Matching Problem: a Comprehensive Experimental Evaluation (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4,...,20.

KJV: EBOM & Hash3 (on ThinkPad X61s)

KJV: EBOMb & Hash3b (with 2-byte read) added

KJV: SBNDM2b = FSB(2,0)b added

KJV: GSB2b added

KJV: FSB(4,i)b added, i = 0,1,2

KJV: Speed-up factors of 2-byte read

    Algorithm   Speed-up
    GSB2        1.32
    FSB(2,0)    1.34
    FSB(2,1)    1.24
    FSB(4,0)    1.72
    FSB(4,1)    2.15
    FSB(4,2)    2.03
    Hash3       1.05
    EBOM        1.17

Other experiments
• DNA and binary data were also tested.
• The gain from lookahead characters or the greedy loop was smaller than with the Bible data.
• The gain from 2-byte read was smaller with 64-bit code than with 32-bit code.

Conclusions
• Two new algorithms were presented:
  • FSB(q,f)
  • GSB2
• The new algorithms are faster than earlier algorithms on English data:
  • GSB2 for m = 4, …, 8
  • FSB(q,f) for m = 8, …, 20
• 2-byte read makes most string algorithms faster.

Web site for practical speed comparison
cse.aalto.fi/stringmatching
