Filter based fast matching of long patterns by using SIMD instructions

M. Oğuzhan Külekci
TÜBİTAK-UEKAE National Research Institute of Electronics & Cryptology, Turkey
kulekci@uekae.tubitak.gov.tr
PSC'09, Prague, Czech Republic

Area of the research
• On-line exact pattern matching (searching without an index)
• Patterns longer than 32 bytes
• Methodology: filter-then-search
• Technology: SIMD parallelisation

Filter-then-search
• Two-phase search:
  • Filtration: detect potential match areas of the text via an easy-to-compute filter function.
  • Verification: verify the existence of the pattern only at those positions that pass the filtration.

Filter-then-search
• To be advantageous, computing the filter value of a text portion must be cheaper than performing a full scan for the pattern over the same area.

Filter-then-search
• The efficiency of a filter depends on two criteria:
  • Distinguishing power: the rate at which verification is called.
  • Computation cost: the time and space complexity of the filter calculation.
• Usually there is a trade-off between the two.

Previous work on filtering algorithms
• Filtering algorithms are especially useful in approximate string matching, e.g., factor filters, suffix filters, gapped q-grams, counting filters, ... (not included in this study)
• Lecroq's q-hash algorithm for exact matching (2007)
• Fredriksson & Grabowski's AOSO and FAOSO (2005)
• Bit-parallel algorithms may be used as filters

This work offers...
• This work proposes exploiting single instruction, multiple data (SIMD) parallelisation in pattern matching.
• To this end, it presents a filter that is easy to compute with SIMD intrinsics.

SIMD Technology
• Rather new, with less than a decade of history.
• The main target was multimedia applications (audio/video/image processing), as algorithms in those areas are highly data parallel.
• Not much addressed in the string algorithms area.

SIMD Technology
• 128-bit special registers (4 floats/integers, or 2 doubles, or 16 8-bit characters)
• Special instruction set dedicated to some operations
[Figure: a single operation Θ applied in parallel to four operand pairs (x1, y1) .. (x4, y4), producing z1 .. z4.]

Intel SIMD Technology
• SSE (Streaming SIMD Extensions)
• MMX
• SSE2
• SSE3, SSE3e
• SSE 4.1 & SSE 4.2 (special instructions dedicated to string matching)
• AVX (Advanced Vector Extensions, 2011?)

Basics...
• Text: [t0 .. tn-1], viewed as N = n / 16 blocks D0 .. DN-1 of 16 bytes each, where Di = [ti.16 .. ti.16+15].
• Pattern: [p0 .. pm-1], viewed as M = m / 16 blocks Q0 .. QM-1 of 16 bytes each, starting at p0, p16, .., p(M-1).16.

Basics...
• Let L denote the zero-based index of the last whole 16-byte block of the pattern: L = ⌊m/16⌋ - 1.
• Example: m = 58, Q = Q0 Q1 Q2 Q3. Blocks Q0, Q1, and Q2 hold 16 pattern bytes each; Q3 holds the remaining 10 bytes, and its last 6 bytes are null padded. L = ⌊58/16⌋ - 1 = 2.
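The block arithmetic above can be sketched in a few lines. This is an illustration in Python rather than the C the talk targets, and the names `split_blocks` and `last_whole_block` are hypothetical, not from the slides.

```python
BLOCK = 16  # bytes per 128-bit SSE register

def split_blocks(pattern: bytes) -> list[bytes]:
    """Cut the pattern into 16-byte blocks, null-padding the last one."""
    padded = pattern + b"\0" * (-len(pattern) % BLOCK)
    return [padded[i:i + BLOCK] for i in range(0, len(padded), BLOCK)]

def last_whole_block(m: int) -> int:
    """L = floor(m / 16) - 1, the zero-based index of the last whole block."""
    return m // BLOCK - 1

pattern = b"x" * 58                      # any 58-byte pattern
blocks = split_blocks(pattern)
print(len(blocks), last_whole_block(len(pattern)))   # prints: 4 2
```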

Main idea
• F = filter(Dz.L+L), for 0 ≤ z < N/L
• F indicates whether P may begin at any byte in the preceding 16-byte blocks Dz.L to Dz.L+(L-1).
• If so, call verification.
• Move right by L blocks.

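Main idea
• Following the same example, m = 58, L = 2.
[Figure: text blocks D0 .. D5 with the pattern blocks Q0 Q1 Q2 Q3 aligned beneath them; the first filter step covers pattern starts at text bytes 0 .. 31, i.e. blocks D0 (bytes 0 .. 15) and D1 (bytes 16 .. 31).]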

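Main idea
• Following the same example, m = 58, L = 2.
• Note that L is actually the shift amount!
[Figure: text blocks D0 .. D5 with the pattern blocks Q0 Q1 Q2 Q3 aligned beneath them; the second filter step covers pattern starts at text bytes 32 .. 63, i.e. blocks D2 (bytes 32 .. 47) and D3 (bytes 48 .. 63).]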
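The stepping shown in the two figures can be reproduced with a small sketch. The generator name `filter_schedule` is hypothetical, and the loop bound (keep the filter block inside the text) is an assumption about how the tail is handled.

```python
def filter_schedule(num_blocks: int, L: int):
    """Yield (filter_block, first_start, last_start) for each filter step.

    The filter on block D[z*L + L] covers pattern starts in blocks
    D[z*L] .. D[z*L + L - 1], i.e. text bytes 16*z*L .. 16*(z*L + L) - 1.
    """
    z = 0
    while z * L + L < num_blocks:        # assumed bound: filter block must exist
        yield z * L + L, 16 * z * L, 16 * (z * L + L) - 1
        z += 1

for step in filter_schedule(num_blocks=6, L=2):
    print(step)
# prints:
# (2, 0, 31)
# (4, 32, 63)
```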

Filter Calculation
Given a 16-byte block:
1. Shift each byte left by K bits
2. Concatenate the individual sign bits
[Figure: bytes 0 .. 15 of the block, each with bits b7 .. b0; after the shift, bit (7 - K) of every byte sits in the sign position, and the 16 sign bits together form the 16-bit filter value F.]

Shift left by K bits?
• Why do we shift each byte left by K bits?
  • To compose the filter from the most informative bits of the bytes → a more distinguishing filter
• How to determine the actual K?
  • According to the alphabet
  • According to the text (more powerful, but not practical)

Shift left by K bits?
• e.g. ASCII-coded DNA sequences:
  a = 0110 0001
  t = 0111 0100
  c = 0110 0011
  g = 0110 0111
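One way to read "most informative" for the DNA example is: pick a K for which the resulting sign bit splits the alphabet evenly. That criterion is an assumption made here for illustration, and the function names are hypothetical.

```python
def sign_bit_after_shift(byte: int, k: int) -> int:
    """Bit that lands in the sign position after a left shift by k (0 <= k <= 7)."""
    return (byte >> (7 - k)) & 1

def balanced_shifts(alphabet: bytes) -> list[int]:
    """Shifts k for which half of the symbols give sign bit 1 and half give 0."""
    return [k for k in range(8)
            if 2 * sum(sign_bit_after_shift(c, k) for c in alphabet) == len(alphabet)]

print(balanced_shifts(b"acgt"))   # prints: [5, 6]
```

For the four codes listed above, bits 7, 6, 5, and 3 are identical across all letters and so carry no information, while bit 2 (K = 5) and bit 1 (K = 6) each separate two letters from the other two.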

Filter computation via SSE2 intrinsics
• From the SSE2 instruction set:
  • tmp128 = _mm_slli_epi64(inp128, K); /* performs the shift */
  • F = _mm_movemask_epi8(tmp128); /* performs the sign bit concatenation */
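To check what the two intrinsics compute without SSE2 hardware, here is a pure-Python model of them. The functions `slli_epi64`, `movemask_epi8`, and `sse_filter` are stand-ins written for this sketch, not library calls; they assume the usual little-endian byte order of an SSE register.

```python
def slli_epi64(block: bytes, k: int) -> bytes:
    """Model of _mm_slli_epi64: shift each 64-bit lane left by k bits."""
    assert len(block) == 16
    lanes = []
    for lo in (0, 8):
        lane = int.from_bytes(block[lo:lo + 8], "little")
        lanes.append(((lane << k) & (2**64 - 1)).to_bytes(8, "little"))
    return b"".join(lanes)

def movemask_epi8(block: bytes) -> int:
    """Model of _mm_movemask_epi8: bit i of the result is the sign bit of byte i."""
    return sum((b >> 7) << i for i, b in enumerate(block))

def sse_filter(block: bytes, k: int) -> int:
    """16-bit filter value of one 16-byte block."""
    return movemask_epi8(slli_epi64(block, k))

print(hex(sse_filter(b"acgt" * 4, 5)))   # prints: 0xcccc
```

Although the shift works on 64-bit lanes and bits cross byte boundaries, the sign bit of each byte still comes from bit (7 - K) of the same byte, so for K between 0 and 7 the result matches the per-byte description on the previous slide.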

Preprocessing
• The pattern can align with the text block Di+L (i = 0 mod L), on which the filter is to be computed, in 16.L different ways.
[Figure: the 16.L alignments of the pattern against blocks Di .. Di+L. If P starts at byte 0 of Di, block Di+L covers [p16L .. p16L+15]; if P starts at byte 1, it covers [p16L-1 .. p16L+14]; and so on, down to the last alignment, where Di+L covers [p1 .. p16].]

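Preprocessing: FList (m = 58, L = 2)
• i = 0 → r = 32 → f([p32 .. p47])
• i = 1 → r = 31 → f([p31 .. p46])
• ...
• i = 31 → r = 1 → f([p1 .. p16])
[Figure: FList, an array of linked lists indexed by the filter values 0 .. 65535; each list holds the alignment indices i whose pattern window has that filter value and is terminated by null.]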
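The preprocessing step can be sketched as follows. A Python dictionary of lists replaces the 64K array of linked lists from the slide, and `byte_filter` and `build_flist` are hypothetical names; the filter is written per byte, which gives the same value as the shift-then-movemask pair for K between 0 and 7.

```python
from collections import defaultdict

def byte_filter(block: bytes, k: int) -> int:
    """Bit i of the result is bit (7 - k) of byte i (the post-shift sign bit)."""
    return sum(((b >> (7 - k)) & 1) << i for i, b in enumerate(block))

def build_flist(pattern: bytes, k: int) -> dict[int, list[int]]:
    """Map each filter value to the alignment indices i that produce it."""
    L = len(pattern) // 16 - 1
    flist = defaultdict(list)
    for i in range(16 * L):
        r = 16 * L - i                       # window [p_r .. p_(r+15)]
        flist[byte_filter(pattern[r:r + 16], k)].append(i)
    return flist

flist = build_flist(bytes(range(58)), k=7)   # toy 58-byte pattern, so L = 2
print(sum(len(v) for v in flist.values()))   # prints: 32
```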

SSEF Algorithm

SSEF Algorithm
• Following the same example (m = 58, L = 2), let's investigate the situation for i = 16.
• D16 = [t16.16 .. t16.16+15] = [t256 .. t271].
• Assume f(D16) = 217 and FList[217] → 1 → null.
• Remember that 1 means [p31 .. p46] may align with [t256 .. t271], so P may occur at [t225 .. t282].
• Note that [t(i-L).16+j .. t(i-L).16+j+m-1] = [t225 .. t282] for i = 16, L = 2, m = 58, j = 1.
• Call verification to check whether [p0 .. p57] = [t225 .. t282].
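Putting preprocessing and search together, a minimal end-to-end sketch might look like the following. It is a reconstruction from the slides, not the author's code: `ssef_search` and `byte_filter` are hypothetical names, the default K = 5 is an assumption suited to ASCII DNA, and plain slicing replaces the SSE2 instructions.

```python
def byte_filter(block: bytes, k: int) -> int:
    """Bit i of the result is bit (7 - k) of byte i (the post-shift sign bit)."""
    return sum(((b >> (7 - k)) & 1) << i for i, b in enumerate(block))

def ssef_search(text: bytes, pattern: bytes, k: int = 5) -> list[int]:
    """Return all start positions of pattern in text (filter-then-verify)."""
    n, m = len(text), len(pattern)
    if m < 32:
        raise ValueError("pattern must be at least 32 bytes long")
    L = m // 16 - 1
    # Preprocessing: filter value -> alignment indices i (window starts at p_(16L - i)).
    flist: dict[int, list[int]] = {}
    for i in range(16 * L):
        r = 16 * L - i
        flist.setdefault(byte_filter(pattern[r:r + 16], k), []).append(i)
    # Search: filter blocks D_L, D_2L, ... and verify only the listed alignments.
    hits = []
    b = L
    while 16 * b + 16 <= n:
        for i in flist.get(byte_filter(text[16 * b:16 * b + 16], k), ()):
            start = (b - L) * 16 + i
            if text[start:start + m] == pattern:   # verification
                hits.append(start)
        b += L
    return hits

pattern = (b"gattaca" * 9)[:58]                    # m = 58, so L = 2
text = b"c" * 225 + pattern + b"c" * 50
print(ssef_search(text, pattern))                  # prints: [225]
```

The usage example mirrors the slide: the occurrence at t225 is detected when the filter is computed on D16, through alignment index 1.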

Complexity
• Preprocessing
  • Space:
    • 64K FList + 16.L pattern filter nodes
    • Space consumption is O(16.L) → O(m), remembering that L = m/16 - 1
  • Time:
    • Exactly 16.L filter operations are performed on the pattern
    • O(16.L) → O(m)

Complexity
• Searching
  • The filter is computed over N 16-byte blocks in steps of L.
  • The total number of filtering operations is O(N/L).
  • After each filter computation, verification is called:
    • at most 16.L times
    • at least 0 times
    • on average ≈ 16.L / 64K times

Complexity
• Best case
  • No verification call, just the filter calculations
  • O(N/L) → O(n/m)
• Worst case
  • All 16.L possible alignments are verified at each filter
  • O(N/L + (N/L).(L.16).m) → O(n.m)
• Average case
  • At each filtering operation, verification is called with a probability of 16.L/64K
  • O(N/L + (N/L).(L.16/64K).m) → O(n/m + n.m/65536)
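As a quick numeric check of the average case, the expected number of verification calls can be computed directly. The sketch assumes, as the slide does, that the 16-bit filter values are uniformly distributed; the function name is hypothetical.

```python
def expected_verifications(n: int, m: int) -> float:
    """Expected verification calls over a text of n bytes, for pattern length m."""
    N = n // 16                  # number of 16-byte text blocks
    L = m // 16 - 1              # shift, in blocks
    filter_ops = N // L          # about N / L filter computations
    return filter_ops * (16 * L) / 65536

print(expected_verifications(n=1 << 20, m=58))   # prints: 16.0
```

For a 1 MiB text and m = 58, that is 32768 filter computations and only about 16 verification calls, roughly n / 65536.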

Experimental Results
• SSEF is compared with BLIM, QS, 3-hash, 8-hash, BOM2, and BSOM2
• 64-bit Intel Xeon machine, 3 GB memory, gcc with the -O3 option
• Small, medium, and large alphabet sets

Data sets

Grand Average Speed
• Very fast on binary alphabets and plain DNA sequences.
• On natural language text, the improvement is also significant.

Conclusions
• An initial attempt to benefit from SIMD parallelisation.
• Faster than the alternatives for all alphabet sizes.
• Its speed is not much affected by the alphabet size (similar to the q-hash filter).
• A strong new alternative for exact matching of long patterns on biological sequences.

Future work
• Can we do better with SSE4?
• What about shorter patterns (fewer than 32 bytes)?
• Are there other places to deploy SIMD parallelisation in the string algorithms area?

Thank you... Questions?
