
Fast Submatch Extraction using OBDDs

Liu Yang (1), Pratyusa Manadhata (2), William Horne (2), Prasad Rao (2), Vinod Ganapathy (1). (1) Rutgers University, (2) HP Laboratories.


Presentation Transcript


  1. Fast Submatch Extraction using OBDDs. Liu Yang (1), Pratyusa Manadhata (2), William Horne (2), Prasad Rao (2), Vinod Ganapathy (1). (1) Rutgers University, (2) HP Laboratories.

  2. Applications of Regular Expressions. Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures. (Diagram labels: signatures, NIDS, network traffic, alerts.)

  3. Applications of Regular Expressions (cont.). Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems. (Diagram labels: web security compliance, email security compliance, connectors (rule set), SIEM.)

  4. Submatch Extraction
     • Rule set: … username=(.*), hostname=(.*) …
     • Input: username=Bob, hostname=Foo
     • Submatch extraction: $1 = Bob, $2 = Foo
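The rule on this slide can be tried with any regex engine that supports capturing groups. The sketch below uses Python's standard `re` module (a backtracking engine, not the Submatch-OBDD technique of the talk); the names `RULE` and `extract` are illustrative.

```python
import re

# Rule from the slide; RULE and extract are illustrative names.
RULE = re.compile(r"username=(.*), hostname=(.*)")

def extract(line):
    """Return the captured submatches ($1, $2, ...) or None if the line does not match."""
    m = RULE.fullmatch(line)
    return m.groups() if m else None

print(extract("username=Bob, hostname=Foo"))  # ('Bob', 'Foo')
```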

  5. Signature Matching
     • Non-deterministic finite automata (NFAs)
       • Space efficient, time inefficient
     • Deterministic finite automata (DFAs)
       • Time efficient, but subject to state blow-up
     • Recursive backtracking
       • Fast in general
       • Vulnerable to algorithmic complexity attacks
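The DFA state blow-up can be seen on a textbook example that is not from the talk: the language (a|b)*a(a|b)^k has an NFA with k+2 states, while the subset construction reaches 2^(k+1) DFA states. A minimal sketch, with all function names chosen for illustration:

```python
def nfa_kth_from_last(k):
    """NFA for (a|b)*a(a|b)^k: state 0 loops, states 1..k+1 count symbols after the chosen 'a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for s in range(1, k + 1):
        delta[(s, "a")] = {s + 1}
        delta[(s, "b")] = {s + 1}
    return delta, k + 2  # transition map and number of NFA states

def count_dfa_states(delta, start=0, alphabet="ab"):
    """Run the subset construction and return the number of reachable DFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    work = [start_set]
    while work:
        cur = work.pop()
        for sym in alphabet:
            nxt = frozenset(t for s in cur for t in delta.get((s, sym), ()))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)

for k in range(1, 6):
    delta, nfa_states = nfa_kth_from_last(k)
    print(k, nfa_states, count_dfa_states(delta))
```

The NFA grows linearly in k while the DFA doubles with each step, which is the space side of the tradeoff the slide describes.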

  6. Motivation: Time/Space Tradeoff. (Plot with axes Time and Space, locating NFA (non-deterministic finite automaton), backtracking, DFA (deterministic finite automaton), our approach, and the ideal point.)

  7. Our Contributions
     • A novel way of annotating capturing groups: tagged NFAs
     • Design of a novel technique for submatch extraction (called Submatch-OBDD)
       • Extending Thompson’s algorithm
       • Using Boolean functions to represent tagged NFAs
       • Using ordered binary decision diagrams (OBDDs) to improve time efficiency
     • Evaluation and comparison with RE2 and PCRE
     Note: RE2 is a hybrid approach, using a mix of DFA and NFA, while PCRE uses recursive backtracking.

  8. Solution Overview. RegExps with capturing groups → tagged NFAs → Boolean representations → OBDD representations

  9. NFA Representation of RegExps. E = a*aa. (Figure: NFA of the regexp “a*aa” and its transition table T(x,i,y).)

  10. Submatch Tagging: tagged NFAs. E = (a*)aa, Tag(E) = (a*)_{t1}aa. (Figure: tagged NFA of “(a*)aa” with submatch tag t1, and the extended transition table T(x,i,y,t) of the tagged NFA.)

  11. Match Test. RegExp = (a*)aa; input: aaaa.
      • Figure: tagged NFA with states 1, 2, 3 (accept), transitions on a, and tag {t1}
      • Frontier after each input symbol: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3}
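The frontier sequence on this slide can be reproduced with a plain set-based NFA simulation. This is a minimal sketch that takes the three tagged transitions from the Boolean formula on slide 15; the names `TRANSITIONS`, `frontiers`, and `matches` are illustrative and do not come from the authors' toolchain.

```python
# Tagged transitions (x, i, y, tags) of the NFA for (a*)aa, taken from slide 15.
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}

def frontiers(text):
    """Return the frontier before any input, followed by one frontier per input symbol."""
    trace = [set(START)]
    for ch in text:
        cur = trace[-1]
        trace.append({y for (x, i, y, _tags) in TRANSITIONS if x in cur and i == ch})
    return trace

def matches(text):
    """The input matches if the last frontier contains an accept state."""
    return bool(frontiers(text)[-1] & ACCEPT)

print(frontiers("aaaa"))  # frontiers {1}, {1,2}, {1,2,3}, {1,2,3}, {1,2,3}
print(matches("aaaa"))    # True
```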

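  12. Submatch Extraction. RegExp = (a*)aa; input: aaaa; result: $1 = aa.
      • Figure: tagged NFA with states 1, 2, 3 (accept), transitions on a, and tag {t1}
      • Frontier after each input symbol: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3}
      • Any path from an accept state to a start state generates a valid assignment of submatches.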
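The backward walk from an accept state to a start state can be sketched with ordinary sets. The code below is an illustration under assumptions, not the authors' implementation: it records the frontiers in a forward pass, walks back through any predecessor that was in the previous frontier, and then reports the last consecutive run of characters tagged t1 (the rule stated on slide 20). All names are hypothetical.

```python
# Tagged transitions (x, i, y, tags) of the NFA for (a*)aa, taken from slide 15.
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}

def last_run(text, tags_per_pos, tag):
    """Return the last consecutive run of characters whose transition carried `tag`."""
    end = len(text)
    while end > 0 and tag not in tags_per_pos[end - 1]:
        end -= 1
    begin = end
    while begin > 0 and tag in tags_per_pos[begin - 1]:
        begin -= 1
    return text[begin:end]

def extract(text):
    """Return {"$1": submatch} for (a*)aa, or None if the input does not match."""
    # Forward pass: record every frontier.
    trace = [set(START)]
    for ch in text:
        cur = trace[-1]
        trace.append({y for (x, i, y, _tags) in TRANSITIONS if x in cur and i == ch})
    final = trace[-1] & ACCEPT
    if not final:
        return None
    # Backward pass: follow one path from an accept state back to a start state.
    state = min(final)
    tags_per_pos = [frozenset()] * len(text)
    for pos in range(len(text) - 1, -1, -1):
        for (x, i, y, tags) in TRANSITIONS:
            if y == state and i == text[pos] and x in trace[pos]:
                tags_per_pos[pos] = tags
                state = x
                break
    return {"$1": last_run(text, tags_per_pos, "t1")}

print(extract("aaaa"))  # {'$1': 'aa'}
```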

  13. Complexity of Tagged NFAs
      • Match test: [formula not preserved]
      • Submatch extraction: [formula not preserved]
      • n: size of the tagged NFA; l: length of the input string
      • Can we make the operations faster?

  14. Submatch-OBDD
      • Representing tagged NFAs using Boolean functions
      • Updating frontiers in one step using a single Boolean formula
      • Using OBDDs to manipulate Boolean functions

  15. Transitions as Boolean Functions. RegExp: (a*)aa. T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})

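  16. Match Test using Boolean Functions. Input: aaaa. At each step the current states, the input symbol, and the transition table T(x,i,y,t) are conjoined to give the intermediate transitions, from which the next states are obtained.
      • Start states {1}: {1} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {})
      • Current states {1,2}: {1,2} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
      • Current states {1,2,3}: {1,2,3} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
      • … Accept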
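The step "current states ∧ input symbol ∧ T(x,i,y,t)" can be mimicked by treating each Boolean function as a predicate and enumerating its satisfying assignments. This brute-force sketch stands in for the OBDD operations of the talk and is not how CUDD works internally; the symbol "b" is added only so that the alphabet contains a non-matching symbol, and all names are illustrative.

```python
from itertools import product

STATES = (1, 2, 3)
SYMBOLS = ("a", "b")  # "b" is an assumption, added as a non-matching symbol
TAGSETS = (frozenset(), frozenset({"t1"}))

def T(x, i, y, t):
    """Transition relation of the tagged NFA for (a*)aa as a Boolean function."""
    return ((x == 1 and i == "a" and y == 1 and t == frozenset({"t1"})) or
            (x == 1 and i == "a" and y == 2 and t == frozenset()) or
            (x == 2 and i == "a" and y == 3 and t == frozenset()))

def intermediate(frontier, sym):
    """Satisfying assignments of F(x) AND I(i) AND T(x,i,y,t)."""
    return {(x, i, y, t)
            for x, i, y, t in product(STATES, SYMBOLS, STATES, TAGSETS)
            if x in frontier and i == sym and T(x, i, y, t)}

def next_frontier(frontier, sym):
    """Quantify out x, i, and t, keeping only the target states y."""
    return {y for (_x, _i, y, _t) in intermediate(frontier, sym)}

frontier = {1}
for ch in "aaaa":
    frontier = next_frontier(frontier, ch)
print(frontier)  # {1, 2, 3}
```

Keeping the intermediate transitions for every input position is what makes the later backward pass possible, since each one records which tagged transitions were actually enabled at that step.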

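  17. Submatch Extraction using Boolean Functions. Input: aaaa. Start from the last input symbol and go backwards.
      • Symbol 4 (the last input symbol), accept state 3: a ∧ 3 ∧ intermediate transitions [4], where intermediate transitions [4] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {}). Result: 2 ∧ a ∧ 3 ∧ {}. The previous state of 3 is 2; no submatch tag is output. Rename the previous state as the current state and continue.
      • Symbol 3, state 2: a ∧ 2 ∧ intermediate transitions [3], where intermediate transitions [3] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {}). Result: 1 ∧ a ∧ 2 ∧ {}. The previous state of 2 is 1; no submatch tag is output.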

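  18. Submatch Extraction using Boolean Functions. Input: aaaa (continuing backwards).
      • Symbol 2, state 1: a ∧ 1 ∧ intermediate transitions [2], where intermediate transitions [2] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {}). Result: 1 ∧ a ∧ 1 ∧ t1. The previous state of 1 is 1; submatch tag t1 is output.
      • Symbol 1, state 1: a ∧ 1 ∧ intermediate transitions [1], where intermediate transitions [1] = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}). Result: 1 ∧ a ∧ 1 ∧ t1. The previous state of 1 is 1; submatch tag t1 is output.
      • The first two symbols of aaaa carry t1, so $1 = aa.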

  19. More Formal: Match Test
      • Finding new frontiers after processing an input symbol: Next frontiers = [formula not preserved]
      • Checking acceptance: [formula not preserved]
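One reading of the frontier update that is consistent with the worked example on slide 16 is given below. It is a reconstruction under assumptions, not the formula from the original slide: F(x) denotes the current frontier, I(i) the current input symbol, A(x) the set of accept states, and M the intermediate transitions; these names are chosen here for illustration.

```latex
\begin{align*}
M(x,i,y,t) &= F(x) \land I(i) \land T(x,i,y,t) \\
F'(y) &= \exists x\, \exists i\, \exists t.\; M(x,i,y,t) \\
\text{accept} &\iff \exists x.\; F_{\mathrm{last}}(x) \land A(x)
\end{align*}
```

After each step the variables y of F' are renamed to x so that F' can serve as the frontier for the next input symbol, and F_last is the frontier after the final symbol.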

  20. More Formal: Submatch Extraction
      • A back-traversal approach, starting from the last input symbol.
      • Submatch extraction: the submatch for tag ti is the last consecutive sequence of input characters that are assigned ti.

  21. Submatch-OBDD
      • Representation of tagged NFAs, match test, and submatch extraction using OBDDs
      • OBDD representations for
        • Transitions with submatch tags
        • Intermediate transitions
        • Submatch tags
        • Set of start states
        • Set of accept states
        • Set of frontiers
        • Input symbols
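To make "OBDD representation of the transitions" concrete, the sketch below builds a tiny reduced OBDD for T(x,i,y,t) of the (a*)aa example. It is a toy written for illustration and unrelated to the CUDD package used in the talk. The bit encoding is an assumption: states 1 to 3 use two bits each, symbol a is bit value 0, and tag t1 is one bit.

```python
class BDD:
    """Tiny reduced ordered BDD; node ids 0 and 1 are the terminals."""

    def __init__(self, nvars):
        self.nvars = nvars
        self.nodes = [None, None]  # index = node id, entry = (var, lo, hi)
        self.unique = {}           # (var, lo, hi) -> node id

    def mk(self, var, lo, hi):
        if lo == hi:               # redundant test: skip the node
            return lo
        key = (var, lo, hi)
        if key not in self.unique:  # share isomorphic subgraphs
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def build(self, f, prefix=()):
        """Shannon-expand predicate f over variables 0..nvars-1 (exponential, toy only)."""
        if len(prefix) == self.nvars:
            return 1 if f(prefix) else 0
        lo = self.build(f, prefix + (0,))
        hi = self.build(f, prefix + (1,))
        return self.mk(len(prefix), lo, hi)

    def evaluate(self, root, bits):
        node = root
        while node > 1:
            var, lo, hi = self.nodes[node]
            node = hi if bits[var] else lo
        return node == 1

def T(bits):
    """Variable order (x1, x0, i, y1, y0, t); the encoding is an assumption."""
    x = 2 * bits[0] + bits[1]
    i = bits[2]
    y = 2 * bits[3] + bits[4]
    t = bits[5]
    return i == 0 and (x, y, t) in {(1, 1, 1), (1, 2, 0), (2, 3, 0)}

bdd = BDD(6)
root = bdd.build(T)
print(len(bdd.nodes) - 2, "internal nodes for a function over 6 variables")
```

Because identical subgraphs are shared through the unique table, building the same function twice returns the same root, which is the canonicity property that OBDD packages rely on.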

  22. Implementation. Toolchain in C++, interfacing with CUDD*.
      • Pipeline: RegExps → RE2TNFA → tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH
      • Input to PATTERNMATCH: input strings / network traffic
      • Output: no match, or matched at reg# with submatches $1 = …, $2 = …
      *CUDD is a package for the manipulation of binary decision diagrams.

  23. Feasibility Study: data sets
      • Snort-2009
        • RegExps: 115 regexps with capturing groups from HTTP rules
        • Traces
          • 1.2GB department network traffic (average packet size 126 bytes)
          • 1.3GB Twitter traffic (average packet size 1202 bytes)
          • 1MB synthetic trace (average string length 311 bytes)
      • Snort-2012
        • RegExps: 403 regexps with capturing groups from HTTP rules
        • Traces
          • 1.2GB department network traffic (average packet size 126 bytes)
          • 1.3GB Twitter traffic (average packet size 1202 bytes)
          • 1MB synthetic trace (average string length 689 bytes)
      • Firewall-504
        • RegExps: 504 patterns from a commercial firewall F
        • Trace: 87MB of firewall logs (average line size 87 bytes)

  24. Experimental Setup
      • Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM
      • Two configurations for pattern matching
        • Conf. S
          • Patterns compiled individually
          • Compiled patterns matched sequentially against input traces
        • Conf. C
          • Patterns combined with UNION and compiled
          • Combined pattern matched against input traces
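The difference between the two configurations can be illustrated with Python's standard `re` module. This is only a sketch of the idea: the three patterns are hypothetical, and `re` is a backtracking engine rather than any of the systems measured in the talk.

```python
import re

# Hypothetical patterns, not taken from the Snort or firewall rule sets.
PATTERNS = [r"username=\w+", r"hostname=\w+", r"GET /\S*"]

# Conf. S: compile each pattern on its own and try them one after another.
COMPILED = [re.compile(p) for p in PATTERNS]

def match_sequential(line):
    """Return the index of the first pattern that matches, or None."""
    for idx, rx in enumerate(COMPILED):
        if rx.search(line):
            return idx
    return None

# Conf. C: combine the patterns with union (|) and compile once.
# Each alternative is wrapped in a named group so the matching pattern can be identified.
COMBINED = re.compile("|".join(f"(?P<r{idx}>{p})" for idx, p in enumerate(PATTERNS)))

def match_combined(line):
    """Return the index of the alternative that matched, or None."""
    m = COMBINED.search(line)
    return int(m.lastgroup[1:]) if m else None

print(match_sequential("hostname=Foo"), match_combined("hostname=Foo"))  # 1 1
```

In Conf. S the input is scanned once per pattern, whereas in Conf. C it is scanned once against a single, larger automaton or expression.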

  25. Performance Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set

  26. Performance Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set

  27. Performance Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set

  28. Related Work
      • NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]
      • RE2 [Cox, code.google.com/p/re2]
      • PCRE [www.pcre.org]
      • TNFA [Laurikari et al., SPIRE’00]
      • MDFA [Yu et al., ANCS’06]
      • Hybrid FA [Becchi and Crowley, CoNEXT’07]
      • XFA [Smith et al., Oakland’08]
      • More: see the paper for details

  29. Conclusion
      • A novel way of annotating capturing groups
      • Submatch-OBDD: a novel technique for submatch extraction using OBDDs
      • Feasibility study
        • Submatch-OBDD achieves ideal performance when patterns are combined
        • Faster than RE2 and PCRE when patterns are combined
