Pattern Matching Using n-grams With Algebraic Signatures

Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]
[1] Université Paris Dauphine
[2] Santa Clara University

n-gram Search
• New pattern matching idea
  • Matches algebraic signatures
  • Preprocesses both the pattern and the string (record)
  • String preprocessing is a new idea, to the best of our knowledge
• Provides incidental protection of stored data
  • Important for P2P & grid systems
• Fast processing
  • Especially useful for DBs & longer patterns
  • ASCII, Unicode, DNA…
  • Should then often be faster than Boyer-Moore
  • Possibly the fastest known in this context

Algebraic Signature
• Symbols of the alphabet are elements of a Galois Field
  • Usually GF(256)
• We choose one primitive element $\alpha$ in it
  • Usually $\alpha = 2$
• The algebraic signature of the string of i symbols $p_1 \dots p_i$ is the sum: $p'_i = p_1\alpha + p_2\alpha^2 + \dots + p_i\alpha^i$
• Here the addition and the multiplication are the operations in the GF

Algebraic Signature
• In our GF($2^f$), where f = 8 or 16: p + q = p - q = p XOR q
• One method for multiplying nonzero p and q is, for f = 8: p * q = antilog((log p + log q) mod 255)
• The division is then: p / q = antilog((log p - log q) mod 255)
• The log and antilog are encoded in log and antilog tables with $2^f$ elements each
• Entry 0 of the log table is for element 0 of the GF and is by convention set to $2^f - 1$
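The table-based arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the generator polynomial, so the sketch assumes 0x11D (x^8 + x^4 + x^3 + x^2 + 1), for which α = 2 is primitive. The names `LOG`, `ANTILOG`, `gf_add`, `gf_mul` and `gf_div` are hypothetical.

```python
# Sketch of GF(2^8) arithmetic with log/antilog tables.
# ASSUMPTION: generator polynomial 0x11D (not given in the slides);
# alpha = 2 is a primitive element for this polynomial.
PRIM_POLY = 0x11D
ORDER = 255                      # 2^f - 1 for f = 8

ANTILOG = [0] * ORDER            # ANTILOG[e] = alpha^e
LOG = [ORDER] * 256              # LOG[0] stays 255, the slide's convention

x = 1
for e in range(ORDER):
    ANTILOG[e] = x
    LOG[x] = e
    x <<= 1                      # multiply by alpha = 2
    if x & 0x100:                # reduce modulo the generator polynomial
        x ^= PRIM_POLY


def gf_add(p, q):
    """Addition and subtraction are both XOR in GF(2^f)."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255); q must be nonzero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

The zero element has no logarithm, which is why the multiplication and division helpers test for it before touching the tables; the log entry 255 for 0 only serves as a marker.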

Cumulative Algebraic Signature
• We encode every symbol $p_i$ in a string into the signature of the prefix $p_1 \dots p_i$
• The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
• Matching a single symbol means prefix matching

Application of CASs
• Protection against involuntary data disclosure
  • On P2P & grid servers especially
• Numerous CAS-encoded string matching algorithms
  • Prefix match with O(1) complexity
  • Pattern match by signature only
    • Karp-Rabin like, linear O(L) complexity
  • Longest common string search
  • Longest common prefix search
  • …

CAS Properties
• O(K) encoding and decoding speed
  • For encoding, for instance: $p'_i = p'_{i-1} + p_i\alpha^i = \mathrm{CAS}(p_{i-1}) + p_i\alpha^i$
• Fast n-gram signature calculus
  • For $S_{k,l} = p_k \dots p_l$ with k > 1 and l - k + 1 = n, the signature of the n-gram taken as a standalone string of l - k + 1 symbols is: $AS(S_{k,l}) = (p'_l \oplus p'_{k-1}) / \alpha^{k-1}$
• Logarithmic Algebraic Signature (LAS): $LAS(S_{k,l}) = \log_\alpha AS(S_{k,l}) = (\log_\alpha(p'_l \oplus p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
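The following Python sketch illustrates these properties: CAS encoding and decoding in one pass, and the LAS of an n-gram obtained from just two CAS symbols. It assumes GF(256) with generator polynomial 0x11D and α = 2 (the polynomial is not given in the slides), and all function names are hypothetical.

```python
# ASSUMPTION: GF(2^8), generator polynomial 0x11D, alpha = 2.
ORDER = 255
ANTILOG, LOG = [0] * ORDER, [ORDER] * 256    # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= 0x11D


def times_alpha_pow(p, e):
    """p * alpha^e in GF(256); a negative e divides by alpha^|e|."""
    return 0 if p == 0 else ANTILOG[(LOG[p] + e) % ORDER]


def cas_encode(symbols):
    """p'_i = p'_(i-1) XOR p_i * alpha^i, with positions i starting at 1."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        sig ^= times_alpha_pow(p, i)
        cas.append(sig)
    return cas


def cas_decode(cas):
    """p_i = (p'_i XOR p'_(i-1)) / alpha^i."""
    out, prev = [], 0
    for i, c in enumerate(cas, start=1):
        out.append(times_alpha_pow(c ^ prev, -i))
        prev = c
    return bytes(out)


def las_from_cas(cas, k, l):
    """LAS of symbols k..l (1-indexed, inclusive) from two CAS symbols."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    if diff == 0:
        return ORDER                         # LAS(0) = 255 by convention
    return (LOG[diff] - (k - 1)) % ORDER


def las_direct(symbols):
    """LAS computed from the plain symbols, for cross-checking."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        sig ^= times_alpha_pow(p, j)
    return LOG[sig]
```

For a record such as `b"GATTACA"`, `las_from_cas(cas_encode(record), 3, 5)` gives the same value as `las_direct(record[2:5])`, without decoding any symbol of the record.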

The n-gram Search: Key Ideas
• Design a sublinear pattern match search
  • With speed about L / K (string length over pattern length)
• Apply to CAS-encoded DB
• New idea for string search algorithm with preprocessing
  • Justified for a DB
  • Store once, search many times

The n-gram Search: Key Ideas
• Preprocess the pattern to create a jump table
  • As in Boyer-Moore
• Use n-grams with n > 1 to increase the discriminative power of an attempt
• Comparison of a sample from the pattern
  • A single symbol for BM
  • An LAS of an n-gram for a CAS-encoded string

The n-gram Search: Key Ideas
• If the alphabet uses m symbols, the probability that a symbol matches is 1/m
  • Assuming all symbols equally likely
  • For usual ASCII pattern matching, m = 20-25
  • For DNA, m = 4
• A single symbol may often match without the whole pattern matching
  • E.g., 1/4 of the time for DNA on average
• Leading to small jumps
  • By m symbols on average

The n-gram Search: Key Ideas
• The probability of an n-gram matching may be: $\max(1/2^f, 1/m^n) = 1/\min(2^f, m^n)$
  • In our examples it can reach 1/256
• More discriminative sampling
• Longer jumps
  • By almost K or 256 symbols in general
• Useful for longer strings
  • DNA, text, images…

ASCII Example: Usual Alphabet
[Figure: search trace; 2-grams => 5 jumps, 1-gram => 6 jumps]

DNA Example: 4-letter Alphabet
[Figure: four search traces, taking 3, 4, 4 and 11 jumps]

The n-gram Search Preprocessing
• Encode every record (string) into its CAS
  • Done for incidental protection anyhow in SDDS-2006
• Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V
• Fill up the jump table T for every other n-gram in $S_K$
  • Calculate every LAS
  • For each LAS, store in T its rightmost offset with respect to the end of $S_K$

The n-gram Search Jump Table
• For GF(256), for every n-gram $S_{i,i+n-1}$ in the pattern other than the terminal one, with $j = LAS(S_{i,i+n-1})$:
  • T(j) = the offset of that n-gram from the end of the pattern
• T(j) = K - n + 1 otherwise
• Reminder: LAS(0) = 255
• T can also be a hash table
  • See the paper
  • Slower to use but possibly more memory efficient
  • Probably more useful for a larger GF

ASCII Example: pattern "Dauphine" (K = 8, n = 2)

Index in T    Jump
0             7
1             7
…             …
in''          1
…             …
au''          5
…             …
ph''          3
…             …
255           7

V = ne''
Notation: xy'' = LAS(xy)
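A possible way to fill such a table is sketched below in Python. It is a sketch under assumptions, not the authors' implementation: it uses GF(256) with generator polynomial 0x11D and α = 2 (the slides do not give the polynomial), so the numeric LAS indexes may differ from theirs, while the jump values for "Dauphine" come out as in the example. The names `las` and `build_jump_table` are hypothetical.

```python
# ASSUMPTION: GF(2^8), generator polynomial 0x11D, alpha = 2.
ORDER = 255
ANTILOG, LOG = [0] * ORDER, [ORDER] * 256    # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= 0x11D


def las(symbols):
    """Logarithmic algebraic signature of a short byte string."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + j) % ORDER]   # p_j * alpha^j
    return LOG[sig]                                # 255 when the signature is 0


def build_jump_table(pattern, n=2):
    """Return (T, V): the jump table and the LAS of the terminal n-gram."""
    K = len(pattern)
    if K < n:
        raise ValueError("pattern shorter than n")
    T = [K - n + 1] * 256                # default jump for absent n-grams
    # n-grams ending at 1-indexed positions n..K-1 (all but the terminal one);
    # scanning left to right keeps the rightmost, i.e. smallest, offset.
    for end in range(n, K):
        T[las(pattern[end - n:end])] = K - end
    V = las(pattern[K - n:])
    return T, V


T, V = build_jump_table(b"Dauphine", n=2)
print(T[las(b"in")], T[las(b"au")], T[las(b"ph")], T[255])
```

Two different n-grams of the pattern may share one LAS; keeping the smallest offset in that case makes the jump conservative, so no occurrence is skipped.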

The n-gram Search Processing
• Calculate the LAS of the current n-gram in the string
  • Start with the n-gram $S_{K-n+1,K}$
  • Continue depending on jump calculus
• Attempt to match V
• If it matches, calculate the LAS of the entire current, possibly matching, substring of length K ending with the current n-gram, and compare it with the LAS of the pattern
• If that matches too, resolve the possible collision
  • Either attempt to match all the K symbols
  • Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

The n-gram Search Processing
• Otherwise
  • Go to T using the LAS of the n-gram
  • Jump by the number of symbols found in T
  • Update the "current" position of the n-gram to attempt the match
  • Re-attempt the match as above
  • Unless the n-gram to attempt is beyond the end of the string
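The processing steps can be put together as in the Python sketch below. It is one reading of the slides, not the authors' code. Assumptions: GF(256) with generator polynomial 0x11D and α = 2; collisions are resolved by decoding and comparing all K symbols; after any attempt, successful or not, the search jumps by T of the current n-gram's LAS (a Horspool-style choice). All names are hypothetical.

```python
# ASSUMPTION: GF(2^8), generator polynomial 0x11D, alpha = 2.
ORDER = 255
ANTILOG, LOG = [0] * ORDER, [ORDER] * 256    # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x <<= 1
    if x & 0x100:
        x ^= 0x11D


def times_alpha_pow(p, e):
    """p * alpha^e in GF(256); a negative e divides."""
    return 0 if p == 0 else ANTILOG[(LOG[p] + e) % ORDER]


def cas_encode(symbols):
    """Cumulative algebraic signature of a record (done once, at store time)."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        sig ^= times_alpha_pow(p, i)
        cas.append(sig)
    return cas


def las(symbols):
    """LAS of a plain byte string (used for the pattern only)."""
    sig = 0
    for j, p in enumerate(symbols, start=1):
        sig ^= times_alpha_pow(p, j)
    return LOG[sig]


def las_from_cas(cas, k, l):
    """LAS of record symbols k..l (1-indexed) from two CAS symbols."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER


def symbol_at(cas, i):
    """Decode the plain symbol at 1-indexed position i."""
    prev = cas[i - 2] if i > 1 else 0
    return times_alpha_pow(cas[i - 1] ^ prev, -i)


def ngram_search(cas, pattern, n=2):
    """Return the 0-indexed start of every occurrence of pattern."""
    K, L = len(pattern), len(cas)
    if K < n:
        raise ValueError("pattern shorter than n")
    T = [K - n + 1] * 256                     # jump table
    for end in range(n, K):                   # all but the terminal n-gram
        T[las(pattern[end - n:end])] = K - end
    V, whole = las(pattern[K - n:]), las(pattern)
    hits, end = [], K                         # end: last position of the window
    while end <= L:
        g = las_from_cas(cas, end - n + 1, end)
        if g == V and las_from_cas(cas, end - K + 1, end) == whole:
            # Resolve a possible collision by comparing all K symbols.
            if all(symbol_at(cas, end - K + 1 + j) == pattern[j] for j in range(K)):
                hits.append(end - K)
        end += T[g]                           # jump; always at least 1
    return hits
```

A call such as `ngram_search(cas_encode(b"Paris Dauphine"), b"Dauphine")` works on the CAS-encoded record only; plain symbols are decoded solely for the final collision check.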

ASCII Example Again
[Figure: search trace; 2-grams => 5 jumps, 1-gram => 6 jumps]

DNA Example Again
[Figure: four search traces, taking 3, 4, 4 and 11 jumps]

n-grams / BM
• Average shifts with n-grams can typically be longer
• Calculating an attempt and a jump may be more expensive as well
  • About twice as long at first approach
  • The precise analysis remains to be done
• Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM

Experimental Results
• Searching large data of:
  • DNA
  • Typical ASCII
  • XML documents
• Patterns of 6 to 500 symbols (bytes)
• 1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors

Results Compared to BM
• DNA
  • Up to 72 times faster
• Typical ASCII
  • Up to about 11 times faster
• XML documents
  • Up to more than 5 times faster
• Search is faster for longer patterns
  • Average shifts are longer

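[Figure: experimental results for DNA]

[Figure: experimental results for ASCII]

[Figure: experimental results for XML]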

Related Work
• Implemented in SDDS-2006
• Applies best to
  • longer patterns
  • where many jumps occur
  • alphabets much smaller than the size of the GF used
• Instead of shifts of size m on average, one reaches almost min(K, $2^f$) per shift
  • up to almost 256 for DNA or ASCII with GF(256)
  • up to almost 64K for DNA or Unicode with GF(64K)
  • instead of 4 or 25 respectively
  • For Boyer-Moore especially

Related Work
• In SDDS-2006 & P2P or grid systems in general
• Wish to hide what is searched for?
  • Use the signature-only search
  • Usually slower, since it is only linear

Conclusion
• A new pattern matching algorithm
  • Uses algebraic signatures
  • Preprocesses both the pattern and the string
• Appears particularly efficient
  • For databases
  • For longer patterns
• Possibly faster in this context than any other algorithm known now
• But all these are only preliminary results

Future Work
• Performance analysis
  • Theoretical
    • Jump length
    • Median, average…
  • Experimental
    • Actual text
    • Non-uniform symbol distribution
    • DNA
    • Actual DNA strings

Future Work
• Variants
  • Jump table
  • Partial signatures of n-grams
    • Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
    • No more XORing & division to find this signature
    • Faster unsuccessful attempt to match
  • Approximate match
    • Tolerating match errors
    • E.g., at most 1 symbol

Thank You for Your Attention witold.litwin@dauphine.fr
