Pattern Matching Using n-grams With Algebraic Signatures


  1. Pattern Matching Using n-grams With Algebraic Signatures • Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2] • [1] Université Paris Dauphine • [2] Santa Clara University

  2. n-gram Search • New pattern matching idea • Matches algebraic signatures • Preprocesses both the pattern & the string (record) • String preprocessing is a new idea • To the best of our knowledge • Provides incidental protection of stored data • Important for P2P & grid systems • Fast processing • Especially useful for DBs & longer patterns • ASCII, Unicode, DNA… • Should then often be faster than Boyer-Moore • Possibly the fastest known in this context

  3. Algebraic Signature • Symbols of the alphabet are elements of a Galois Field • GF(256) usually • We choose one primitive element $\alpha$ of the field • Usually $\alpha = 2$ • The algebraic signature of the string of i symbols $p_1 \dots p_i$ is the sum: $p'_i = p_1\alpha + p_2\alpha^2 + \dots + p_i\alpha^i$ • Here the addition and the multiplication are the operations in the GF.
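
As a small worked example (the symbol values are chosen here purely for illustration and do not come from the slides), take $\alpha = 2$ and the two-symbol string $p_1 = 1$, $p_2 = 1$. Since addition in GF($2^f$) is XOR and $\alpha^2 = 4$ needs no reduction:

$$p'_2 = p_1\alpha + p_2\alpha^2 = 2 \;\mathrm{XOR}\; 4 = 6$$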

  4. Algebraic Signature • In our GF($2^f$), where f = 8 or 16: p + q = p − q = p XOR q • One method for multiplying (shown for f = 8) is: $p \cdot q = \mathrm{antilog}((\log p + \log q) \bmod 255)$ • The division is then: $p / q = \mathrm{antilog}((\log p - \log q) \bmod 255)$ • The log and antilog values are stored in log and antilog tables with $2^f$ elements each • Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
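
The table-based arithmetic above can be sketched as follows. This is only an illustration: the slides do not say which generator polynomial is used, so the polynomial 0x11D (for which $\alpha = 2$ is primitive) is an assumption, and the names `LOG`, `ANTILOG`, `gf_add`, `gf_mul` and `gf_div` are hypothetical.

```python
# Assumption (not from the slides): GF(256) is generated by the polynomial
# x^8 + x^4 + x^3 + x^2 + 1 (0x11D), for which alpha = 2 is primitive.
F = 8
ORDER = (1 << F) - 1       # 255 nonzero field elements
POLY = 0x11D

ANTILOG = [0] * (1 << F)   # ANTILOG[e] = alpha**e; ANTILOG[255] stays 0
LOG = [ORDER] * (1 << F)   # LOG[0] = 2^f - 1, the slide's convention

x = 1
for e in range(ORDER):
    ANTILOG[e] = x
    LOG[x] = e
    x <<= 1                # multiply by alpha = 2
    if x & (1 << F):       # reduce modulo the generator polynomial
        x ^= POLY


def gf_add(p, q):
    """Addition and subtraction are both XOR in GF(2^f)."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero is handled apart."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255); q must be nonzero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

Zero has no logarithm, so the sketch tests for it explicitly instead of relying on the conventional table entry.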

  5. Cumulative Algebraic Signature • We encode every symbol $p_i$ in a string into the signature $p'_i$ of the prefix $p_1 \dots p_i$ • The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones • Matching a single symbol means prefix matching

  6. Application of CASs • Protection against involuntary data disclosure • On P2P & grid servers especially • Numerous CAS-encoded string matching algorithms • Prefix match with O(1) complexity • Pattern match by signature only • Karp-Rabin-like, linear O(L) complexity • Longest common string search • Longest common prefix search • …

  7. CAS Properties • O(K) encoding and decoding speed • For encoding, for instance: $p'_i = p'_{i-1} + p_i\alpha^i = \mathrm{CAS}(p_{i-1}) + p_i\alpha^i$ • Fast n-gram signature calculus • For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k + 1 = n$, the signature of the n-gram taken as a string of $l - k + 1$ symbols is: $\mathrm{AS}(S_{k,l}) = (p'_l \;\mathrm{XOR}\; p'_{k-1}) / \alpha^{k-1}$ • Logarithmic Algebraic Signature (LAS): $\mathrm{LAS}(S_{k,l}) = \log \mathrm{AS}(S_{k,l}) = (\log(p'_l \;\mathrm{XOR}\; p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
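
A minimal sketch of CAS encoding, decoding and the n-gram LAS calculus follows. It assumes GF(256) with the generator polynomial 0x11D and $\alpha = 2$ (the slides do not name the polynomial); the function names are hypothetical. Positions `k` and `l` are 1-based and inclusive, as on the slide, and $p'_0$ is taken to be 0 so that `k = 1` also works.

```python
ORDER, POLY = 255, 0x11D              # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256   # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)   # multiply by alpha, reduce


def cas_encode(symbols):
    """Replace each symbol p_i by p'_i = p'_(i-1) XOR p_i * alpha^i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def cas_decode(cas):
    """Recover p_i = (p'_i XOR p'_(i-1)) / alpha^i."""
    symbols, prev = [], 0
    for i, c in enumerate(cas, start=1):
        d = c ^ prev
        symbols.append(ANTILOG[(LOG[d] - i) % ORDER] if d else 0)
        prev = c
    return symbols


def ngram_las(cas, k, l):
    """LAS(S_k,l) = (log(p'_l XOR p'_(k-1)) - (k - 1)) mod 255."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return (LOG[d] - (k - 1)) % ORDER if d else ORDER


cas = cas_encode(b"Dauphine")
# The LAS of an n-gram does not depend on where the n-gram sits:
same = ngram_las(cas, 7, 8) == ngram_las(cas_encode(b"ne"), 1, 2)
```

Both the encoder and the decoder touch each symbol once, and `ngram_las` costs one XOR, one table lookup and one subtraction whatever the value of n.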

  8. The n-gram Search: Key ideas • Design a sublinear pattern match search • With about L / K match attempts (L: string length, K: pattern length) • Apply to a CAS-encoded DB • New idea for a string search algorithm with preprocessing • Justified for a DB • Store once, search many times

  9. The n-gram Search: Key ideas • Preprocess the pattern to create a jump table • As in Boyer-Moore • Use n-grams with n > 1 to increase the discriminative power of an attempt • Comparison of a sample from the pattern • a single symbol for BM • an LAS of an n-gram for a CAS-encoded string

  10. The n-gram Search: Key ideas • If the alphabet uses m symbols, the probability that a symbol matches is 1/m • Assuming all symbols equally likely • For usual ASCII pattern matching m = 20-25 • For DNA m = 4 • A single symbol may often match without the whole pattern matching • e.g., 1/4 of the time for DNA on average • Leading to small jumps • by m symbols on average

  11. The n-gram Search: Key ideas • The probability of an n-gram matching may be: $\max(1/2^f, 1/m^n)$ (an LAS takes only $2^f$ values and there are only $m^n$ distinct n-grams) • In our examples it can reach 1/256 • More discriminative sampling • Longer jumps • By almost K or 256 symbols in general • Useful for longer strings • DNA, text, images…

  12. ASCII Example: Usual Alphabet • 2-grams => 5 jumps • 1-gram => 6 jumps

  13. DNA Example: 4-letter Alphabet • 3 jumps • 4 jumps • 4 jumps • 11 jumps

  14. The n-gram Search Preprocessing • Encode every record (string) into its CAS • Done for incidental protection anyhow for SDDS-2006 • Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V • Fill up the jump table T for every other n-gram in $S_K$ • calculate every LAS • for each LAS, store in T its rightmost offset with respect to the end of $S_K$

  15. The n-gram Search Jump Table • For GF(256), T has 256 entries indexed by LAS value • For every non-terminal n-gram $S_{i,i+n-1}$ in the pattern, with $v = \mathrm{LAS}(S_{i,i+n-1})$: T(v) = the rightmost offset of that n-gram from the end of the pattern • T(v) = K − n + 1 for every other value v • Reminder: LAS(0) = 255 • T can also be a hash table • See the paper • Slower to use but possibly more memory efficient • Probably more useful for a larger GF

  16. ASCII Example • Pattern: Dauphine • V = ne’’ • Jump table T (index => jump): 0 => 7 • 1 => 7 • … • in’’ => 1 • … • au’’ => 5 • … • ph’’ => 3 • … • 255 => 7 • Notation: xy’’ = LAS(xy)
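
The jump table for the pattern Dauphine with 2-grams can be sketched as below. The generator polynomial 0x11D for GF(256) is an assumption (the slides do not give it), so the LAS values themselves may differ from the authors', but the stored offsets do not depend on that choice as long as the LAS values of the pattern's n-grams are distinct. The names `las` and `build_jump_table` are hypothetical.

```python
ORDER, POLY = 255, 0x11D              # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256   # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)   # multiply by alpha, reduce


def las(symbols):
    """LAS of a short string taken on its own: log(p_1*a + p_2*a^2 + ...)."""
    sig = 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
    return LOG[sig]                   # yields 255 when the signature is 0


def build_jump_table(pattern, n):
    """Return (V, T): LAS of the terminal n-gram and the 256-entry table."""
    K = len(pattern)
    V = las(pattern[K - n:])
    T = [K - n + 1] * (ORDER + 1)     # default jump for an unseen LAS
    for start in range(K - n):        # every n-gram except the terminal one
        # Offset of this n-gram's end from the pattern's end. Scanning
        # left to right leaves the rightmost (smallest) offset in T.
        T[las(pattern[start:start + n])] = K - n - start
    return V, T


V, T = build_jump_table(b"Dauphine", n=2)
```

With these inputs the entries for in, ph and au hold 1, 3 and 5, and the default jump is K − n + 1 = 7, in line with the slide.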

  17. The n-gram Search Processing • Calculate the LAS of the current n-gram in the string • Start with the n-gram $S_{K-n+1,K}$ • Continue depending on the jump calculus • Attempt to match V • If it matches, calculate the LAS of the entire current possibly matching substring of length K ending with the current n-gram • If that matches too, resolve the possible collision • Either attempt to match all the K symbols • Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

  18. The n-gram Search Processing • Otherwise • Go to T using the LAS of the n-gram • Jump by the number of symbols found in T • Update the “current” position of the n-gram to attempt the match • Re-attempt the match as above • Unless the n-gram to attempt is beyond the end of the string
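
The processing steps can be put together in the following sketch. It is not the authors' implementation: the GF(256) polynomial 0x11D is assumed, all names are hypothetical, collisions are resolved by decoding and comparing all K symbols, and the jump after a terminal n-gram match is also read from T (a Horspool-style choice the slides do not spell out).

```python
ORDER, POLY = 255, 0x11D              # assumed GF(256) setup, alpha = 2
ANTILOG, LOG = [0] * 256, [ORDER] * 256   # LOG[0] = 255 by convention
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (POLY if x & 0x80 else 0)   # multiply by alpha, reduce


def cas_encode(symbols):
    """CAS: symbol i becomes the signature of the prefix p_1..p_i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(acc)
    return cas


def las(cas, k, l):
    """LAS of the substring at 1-based positions k..l of a CAS string."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return (LOG[d] - (k - 1)) % ORDER if d else ORDER


def decode(cas, k, l):
    """Recover the plain symbols p_k..p_l from the CAS string."""
    out = []
    for i in range(k, l + 1):
        d = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
        out.append(ANTILOG[(LOG[d] - i) % ORDER] if d else 0)
    return out


def ngram_search(cas, pattern, n=2):
    """Return 0-based start offsets of pattern in the CAS-encoded string."""
    K = len(pattern)                  # requires K >= n
    pcas = cas_encode(pattern)
    V = las(pcas, K - n + 1, K)       # LAS of the terminal n-gram
    whole = las(pcas, 1, K)           # LAS of the entire pattern
    T = [K - n + 1] * 256             # jump table, default jump K - n + 1
    for k in range(1, K - n + 1):     # non-terminal n-grams, left to right
        T[las(pcas, k, k + n - 1)] = K - n + 1 - k
    hits, pos = [], K                 # pos: 1-based end of current n-gram
    while pos <= len(cas):
        v = las(cas, pos - n + 1, pos)
        if (v == V and las(cas, pos - K + 1, pos) == whole
                and decode(cas, pos - K + 1, pos) == list(pattern)):
            hits.append(pos - K)      # collision ruled out symbol by symbol
        pos += T[v]                   # jump by the table entry
    return hits


text = b"Paris Dauphine and Dauphine again"
hits = ngram_search(cas_encode(text), b"Dauphine")
```

Here the string is encoded once by `cas_encode` and can then be searched many times; only the pattern is preprocessed per query.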

  19. ASCII Example Again • 2-grams => 5 jumps • 1-gram => 6 jumps

  20. DNA Example Again • 3 jumps • 4 jumps • 4 jumps • 11 jumps

  21. n-grams / BM • Average shifts with n-grams can typically be longer • Calculating an attempt & jump may be more expensive as well • About twice as long as a first approximation • The precise analysis remains to be done • Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.

  22. Experimental Results • Searching large data of: • DNA • Typical ASCII • XML documents • Patterns of 6 to 500 symbols (bytes) • 1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors

  23. Results Compared to BM • DNA • Up to 72 times faster • Typical ASCII • Up to about 11 times faster • XML documents • Up to more than 5 times faster • Search is faster for longer patterns • Average shifts are longer

  24. DNA

  25. ASCII

  26. XML

  27. Related Work • Implemented in SDDS-2006 • Applies best to • longer patterns • where many jumps occur • alphabets much smaller than the size of the GF used • Instead of shifts of size m on average, one reaches almost $\min(K, 2^f)$ per shift • up to almost 256 for DNA or ASCII with GF(256) • up to almost 64K for DNA or Unicode with GF(64K) • instead of 4 or 25 respectively • For Boyer-Moore especially

  28. Related Work • In SDDS-2006 & P2P or grid systems in general • Wish to hide what is searched for? • Use the signature-only search • Usually slower since it is only linear

  29. Conclusion • A new pattern matching algorithm • Uses algebraic signatures • Preprocesses both the pattern and the string • Appears particularly efficient • For databases • For longer patterns • Possibly faster in this context than any other known algorithm • But all these are only preliminary results

  30. Future Work • Performance Analysis • Theoretical • Jump Length • Median, Average… • Experimental • Actual text • Non uniform symbol distribution • DNA • Actual DNA strings

  31. Future Work • Variants • Jump Table • Partial signatures of n-grams • Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$ • No more XORing & division to find this signature • Faster unsuccessful attempt to match • Approximate match • Tolerating match errors • E.g., of at most 1 symbol

  32. Thank You for Your Attention witold.litwin@dauphine.fr
