Accelerating Boyer Moore Searches on Binary Texts
Shmuel Tomi Klein, Miri Kopel Ben-Nissan
Bar Ilan University, ISRAEL
Outline
Background and motivation
Boyer Moore algorithm
New binary variant
Analysis
Experiments
Summary
Pattern matching
Important application of automata: KMP, BDM, BM
Boyer & Moore: match backwards!
Text: this-is-a-sample-text---
Pattern: pattern
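Boyer-Moore Algorithm: mismatch, case 1 (delta1)
b does not occur in x.
Text y:    ... b u ...
Pattern x: ... a u
Shift: x moves entirely past b (the shifted x contains no b).

Boyer-Moore Algorithm: mismatch, case 2 (delta1)
b occurs in x.
Text y:    ... b u ...
Pattern x: ... a u
Shift: x moves until an occurrence of b in x is aligned with b in the text; the part of x to the right of that b contains no b.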
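
Cases 1 and 2 can both be read off a table of rightmost occurrences. The following is a minimal Python sketch, not taken from the slides: it assumes 0-based indices and expresses delta1 as a shift of x, and the helper names `last_occurrence` and `delta1_shift` are chosen here for illustration.

```python
def last_occurrence(x):
    """Map each character of pattern x to the index of its rightmost occurrence."""
    return {ch: i for i, ch in enumerate(x)}


def delta1_shift(x, j, b):
    """Shift of x after text character b mismatches x[j] (right-to-left scan).

    Case 1: b does not occur in x, so x moves entirely past b (shift j + 1).
    Case 2: b occurs in x, so the rightmost b in x is aligned with b.
    The result can be <= 0 when the rightmost b lies to the right of j;
    a full search then relies on delta2 (or a shift of 1) instead.
    """
    return j - last_occurrence(x).get(b, -1)


print(delta1_shift("example", 6, "s"))  # 's' is not in the pattern: 7
print(delta1_shift("example", 6, "p"))  # rightmost 'p' is at index 4: 2
```

Because the table only records the rightmost occurrence, a non-positive value is possible, which is one reason delta1 is combined with delta2.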
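Boyer-Moore Algorithm: mismatch, case 3 (delta2)
u reoccurs in x, preceded by c ≠ a.
Text y:    ... b u ...
Pattern x: ... c u ... a u
Shift: x moves until the occurrence c u is aligned with u in the text.

Boyer-Moore Algorithm: mismatch, case 4 (delta2)
Only a suffix v of u reoccurs in x.
Text y:    ... b u ...
Pattern x: v ... a u
Shift: x moves until v in x is aligned with the suffix v of u in the text.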
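
Cases 3 and 4 can be captured by one brute-force definition: take the smallest shift for which the shifted pattern agrees with the matched suffix u wherever they overlap, and does not bring the same character a back under the mismatch. The sketch below is an illustration under those assumptions (0-based indices, quadratic-time construction, function name `delta2_shift` invented here); real implementations build this table in linear time.

```python
def delta2_shift(x, j):
    """Smallest shift s >= 1 after a mismatch at x[j] with x[j+1:] matched.

    Condition 1: wherever the shifted pattern overlaps the matched suffix u,
                 it agrees with u (case 3: all of u; case 4: only a suffix v).
    Condition 2: the character brought under the mismatch differs from x[j]
                 (c != a), or the pattern has moved past that position.
    """
    m = len(x)
    for s in range(1, m + 1):
        suffix_ok = all(x[i - s] == x[i] for i in range(j + 1, m) if i - s >= 0)
        preceding_ok = j - s < 0 or x[j - s] != x[j]
        if suffix_ok and preceding_ok:
            return s
    return m  # not reached: s = m always satisfies both conditions


print(delta2_shift("cabab", 2))    # case 3: "ab" reoccurs preceded by 'c': 2
print(delta2_shift("example", 2))  # case 4: only the suffix "e" reoccurs: 6
```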
Boyer-Moore Example
Text: here is a simple example
Pattern: example
[Figure: five successive alignments of the pattern against the text, with the shifts labelled delta1 and delta2.]
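
The example can be replayed with a small search routine that records every alignment it tries. This is a sketch under stated assumptions, not the authors' code: the shift is taken as the larger of a rightmost-occurrence delta1 and a brute-force delta2, and the name `bm_alignments` is made up for the illustration.

```python
def bm_alignments(text, pat):
    """Return (alignments tried, match positions) for a Boyer-Moore scan."""
    m, n = len(pat), len(text)
    last = {ch: i for i, ch in enumerate(pat)}  # rightmost occurrences (delta1)

    def delta2(j):
        # Smallest shift compatible with the matched suffix pat[j+1:].
        for s in range(1, m + 1):
            if all(pat[i - s] == pat[i] for i in range(j + 1, m) if i >= s) \
                    and (j < s or pat[j - s] != pat[j]):
                return s
        return m

    tried, found, pos = [], [], 0
    while pos <= n - m:
        tried.append(pos)
        j = m - 1
        while j >= 0 and text[pos + j] == pat[j]:  # match backwards
            j -= 1
        if j < 0:
            found.append(pos)
            pos += delta2(-1)                      # shift by the pattern's period
        else:
            d1 = j - last.get(text[pos + j], -1)
            pos += max(d1, delta2(j))
    return tried, found


print(bm_alignments("here is a simple example", "example"))
# ([0, 7, 9, 15, 17], [17])
```

In this run the shifts of 7, 2 and 2 come from delta1, while the shift of 6 (from alignment 9 to 15) comes from delta2.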
Problems of Binary Boyer & Moore
Character text: this-is-a-sample-text---   Pattern: pattern
Binary text: 0100101101011101000100110101001   Pattern: 1101100
Character case: most work is done by delta1.
Binary case: delta1 is useless; bit-level processing.
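
One way to see why delta1 helps so little on a binary alphabet: the mismatching text bit is always the complement of the pattern bit, so the shift is just the distance to the nearest complementary bit on the left. The sketch below computes these shifts for the slide's pattern; it uses the "nearest occurrence to the left" form of the rule, which is an assumption made here for illustration, and the function name is hypothetical.

```python
def binary_bad_char_shifts(bits):
    """For each position j, the delta1-style shift when bits[j] mismatches.

    On a binary alphabet the mismatching text bit is the complement of
    bits[j], so the shift is the distance to the nearest complementary bit
    to the left of j, or j + 1 if there is none.
    """
    shifts = []
    for j, bit in enumerate(bits):
        other = "1" if bit == "0" else "0"
        nearest = bits.rfind(other, 0, j)  # -1 when absent
        shifts.append(j - nearest)
    return shifts


print(binary_bad_char_shifts("1101100"))  # [1, 2, 1, 1, 2, 1, 2]
```

For this pattern no mismatch position yields a shift larger than 2 bits, compared with shifts of up to the full pattern length in the character case.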
Need for Binary Boyer & Moore
Compressed matching: given E(T) and P, look for E(P) in E(T) rather than for P in D(E(T)).
Suggested solution: BBBMM (Blocked Binary Boyer Moore Matching)
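
The idea of searching for E(P) in E(T) can be sketched with a toy prefix code standing in for a Huffman code. Everything here is illustrative: the code table `CODE` is hypothetical, Python's `str.find` replaces the actual search algorithm, and the check that a match starts on a codeword boundary is an assumption of this sketch (the slides do not say how such false matches are handled).

```python
# Hypothetical prefix code, standing in for a Huffman code E.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(s):
    """E(s): concatenation of the codewords of the characters of s."""
    return "".join(CODE[ch] for ch in s)


def codeword_starts(s):
    """Bit offsets in E(s) at which each character of s begins."""
    starts, pos = [], 0
    for ch in s:
        starts.append(pos)
        pos += len(CODE[ch])
    return starts


def compressed_find(text, pat):
    """Character positions of pat in text, found by searching E(pat) in E(text)."""
    et, ep = encode(text), encode(pat)
    starts = {bit: i for i, bit in enumerate(codeword_starts(text))}
    hits, pos = [], et.find(ep)
    while pos != -1:
        if pos in starts:  # reject matches that do not start on a codeword
            hits.append(starts[pos])
        pos = et.find(ep, pos + 1)
    return hits


print(compressed_find("abacabad", "ab"))  # [0, 4]
print(compressed_find("ca", "b"))         # []: "10" occurs only across codewords
```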
BBBMM
[Figure: the text in blocks of k bits, Text[i], aligned against pattern blocks Pat[sh, j]; further labels: sh, sl.]
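
One plausible reading of the figure, offered as an assumption rather than as the authors' definition, is that Pat[sh, j] holds block j of the pattern when the pattern starts sh bits into a text block, so that comparisons can be made a whole block at a time. The sketch below precomputes such blocks together with masks for the partially covered first and last blocks. It deliberately tries every alignment and uses no delta tables, so it shows only the blocked comparison; the names `blocked_pattern` and `blocked_find` are hypothetical.

```python
K = 8  # block size in bits (the experiments use k = 8)


def blocked_pattern(pbits, k=K):
    """pat[sh][j] = (value, mask) of block j when the pattern starts sh bits into a block."""
    pat = []
    for sh in range(k):
        padded = "0" * sh + pbits
        mask = "0" * sh + "1" * len(pbits)
        pad = -len(padded) % k            # fill the last block with don't-care bits
        padded += "0" * pad
        mask += "0" * pad
        pat.append([(int(padded[i:i + k], 2), int(mask[i:i + k], 2))
                    for i in range(0, len(padded), k)])
    return pat


def blocked_find(text_blocks, pbits, k=K):
    """All bit positions of pbits in the text, comparing one block at a time."""
    pat, hits = blocked_pattern(pbits, k), []
    for i in range(len(text_blocks)):
        for sh in range(k):
            blocks = pat[sh]
            if i + len(blocks) > len(text_blocks):
                continue                  # pattern would run past the text
            # Compare right to left, as in Boyer-Moore.
            if all(text_blocks[i + j] & mask == value
                   for j, (value, mask) in reversed(list(enumerate(blocks)))):
                hits.append(i * k + sh)
    return hits


tbits = "01001011010111010001001101010010"
text_blocks = [int(tbits[i:i + K], 2) for i in range(0, len(tbits), K)]
print(blocked_find(text_blocks, "1101"))  # [6, 12, 22]
```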
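BBBMM: more information in binary case
ASCII: ffghabdgttiocb sbgghj
Binary: 01100010 01101010

BBBMM: extended delta1
[Figure: text blocks i - 1, i, i + 1 of T aligned against P; bit groups shown: 101, 101, 100, 101, 01.]

BBBMM
[Figure: T and P, with labels K, sl and k.]
Total size of delta1 tables: [formula not preserved]
If too large, use limit value.
Size of delta1 tables reduced to: [formula not preserved]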
BBBMM
Original delta1: increase of text pointer
BBBMM delta1: shift size
Mismatch not in last block: Correct[sh, j]
[Figure: T and P.]
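
A block-indexed delta1 that returns a shift size in bits can be sketched as follows. This is one possible construction under assumptions made here, not the tables defined in the talk: `delta1[r][b]` is the smallest shift s >= 1 for which the shifted pattern agrees with the whole text block b, where r is the offset of the last pattern bit inside that block. For brevity the text is a bit string whose length is a multiple of k, and the match test itself is done on the bit string rather than block-wise.

```python
def build_delta1(pbits, k=8):
    """delta1[r][b]: smallest shift s >= 1 (in bits) that agrees with text block b,
    when the last pattern bit currently sits at offset r inside that block."""
    m = len(pbits)
    table = []
    for r in range(k):
        row = []
        for b in range(2 ** k):
            block = format(b, "0{}b".format(k))
            s = 1
            # Block bit t faces pattern bit m - 1 - r + t - s after a shift of s.
            while not all(pbits[m - 1 - r + t - s] == block[t]
                          for t in range(k)
                          if 0 <= m - 1 - r + t - s < m):
                s += 1
            row.append(s)
        table.append(row)
    return table


def blocked_delta1_search(tbits, pbits, k=8):
    """Bit positions of pbits in tbits (len(tbits) a multiple of k), shifting by delta1 only."""
    m, n = len(pbits), len(tbits)
    delta1 = build_delta1(pbits, k)
    hits, e = [], m - 1                   # e: text position of the last pattern bit
    while e < n:
        if tbits[e - m + 1:e + 1] == pbits:
            hits.append(e - m + 1)
        i, r = divmod(e, k)
        e += delta1[r][int(tbits[i * k:(i + 1) * k], 2)]
    return hits


print(blocked_delta1_search("01001011010111010001001101010010", "1101"))  # [6, 12, 22]
print(build_delta1("1101100")[7][0])  # an all-zero block forces a shift of 7 bits
```

The shift is safe because any occurrence that overlaps the inspected block must agree with it, so no occurrence lies before the smallest agreeing shift. Each table has k * 2^k entries in this sketch.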
BBBMM: delta2
[Figure: T and P.]
Analysis
Assumption: random input (reasonable for compressed text)
Expected # comparisons till mismatch:
Bit-wise: [formula not preserved]
Blocked: [formula not preserved]
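
The formulas on this slide did not survive, so the following is only a standard calculation under the stated random-input assumption, not necessarily the authors' expression. If each comparison of k bits succeeds independently with probability 2^-k, the expected number of comparisons until the first mismatch (or until the pattern is exhausted) is the sum of 2^(-k*j) over the blocks. Partial blocks are treated as full ones, which is a simplification made here.

```python
import math


def expected_comparisons(m, k=1):
    """Expected comparisons until the first mismatch (or a full match) for a
    pattern of m bits compared k bits at a time, assuming every comparison
    succeeds independently with probability 2**-k."""
    blocks = math.ceil(m / k)
    # P(at least j + 1 comparisons are made) = 2**(-k * j)
    return sum(2.0 ** (-k * j) for j in range(blocks))


print(round(expected_comparisons(100, k=1), 4))  # bit-wise: 2.0
print(round(expected_comparisons(100, k=8), 4))  # blocked, k = 8: 1.0039
```

Under this model the bit-wise count approaches 2 and the blocked count approaches 1 / (1 - 2^-k) as the pattern grows.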
Analysis
Expected # bits shifted after mismatch:
Bit-wise: M
Blocked: M'
Experiments
Texts: English Bible (2.5MB), World Factbook (1.5MB), Huffman encoded
k = 8
Patterns: random substrings of lengths 10 to 500
Experiments: average # comparisons between shifts
[Plot: x-axis, length of pattern (100 to 500); y-axis ticks 1.1 to 1.5; curves: Bit-wise, Blocked.]

Experiments: average size of shifts
[Plot: x-axis, length of pattern (100 to 500); y-axis ticks 20 to 100; curves: Bit-wise, Blocked.]

Experiments: average # comparisons for 1000 bits
[Plot: x-axis, length of pattern (100 to 500); y-axis ticks 100 to 500; curves: Bit-wise, BDM, Blocked.]

Experiments: time to locate first occurrence (ms)
[Plot: x-axis, length of pattern (100 to 500); y-axis ticks 50 to 300; curves: Bit-wise, BDM, Turbo-BDM, Blocked.]
Summary
Blocked variant of BM
Faster than alternatives; overhead 1-10 K
Extensions: ASCII, words instead of characters