Fast Submatch Extraction using OBDDs
Liu Yang¹, Pratyusa Manadhata², William Horne², Prasad Rao², Vinod Ganapathy¹
¹Rutgers University, ²HP Laboratories

Applications of Regular Expressions
[Diagram: signatures and network traffic enter the NIDS, which raises alerts]
Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.

Applications of Regular Expressions (cont.)
[Diagram labels: Web security compliance, Email security compliance, Connectors (rule set), SIEM]
Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.

Submatch Extraction
Rule set: … username=(.*), hostname=(.*) …
Input: username=Bob, hostname=Foo
Submatch extraction: $1 = Bob, $2 = Foo

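The rule and log line above can be tried directly with Python's standard `re` module. This is only an illustration of what submatch extraction returns: `re` is a backtracking engine, not the OBDD-based technique of this talk, and the names `RULE`, `LINE`, and `extract` are made up for the example.

```python
import re

# Rule and input line are taken from the slide; all names here are illustrative.
RULE = re.compile(r"username=(.*), hostname=(.*)")
LINE = "username=Bob, hostname=Foo"

def extract(rule, line):
    """Return the captured submatches as a tuple, or None if the line does not match."""
    m = rule.fullmatch(line)
    return m.groups() if m else None

print(extract(RULE, LINE))  # ('Bob', 'Foo'), i.e. $1 = Bob, $2 = Foo
```
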
Signature Matching
• Non-deterministic finite automata (NFAs)
  • Space efficient, time inefficient
• Deterministic finite automata (DFAs)
  • Time efficient, but prone to state blow-up
• Recursive backtracking
  • Fast in general
  • Vulnerable to algorithmic complexity attacks

Motivation: Time/Space Tradeoff
[Plot: time versus space, positioning the NFA (non-deterministic finite automaton), backtracking, the DFA (deterministic finite automaton), our approach, and the ideal point]

Our Contributions
• A novel way of annotating capturing groups: tagged NFAs
• Design of a novel technique for submatch extraction (called Submatch-OBDD)
• Extending Thompson’s algorithm
• Using Boolean functions to represent tagged NFAs
• Using ordered binary decision diagrams (OBDDs) to improve time efficiency
• Evaluation and comparison with RE2 and PCRE
Note: RE2 is a hybrid approach that uses a mix of DFAs and NFAs, while PCRE uses recursive backtracking.

Solution Overview
RegExps with capturing groups → Tagged NFAs → Boolean representations → OBDD representations

NFA Representation of RegExps
E = a*aa
[Figure: NFA of the regexp “a*aa”]
[Table: transition table T(x,i,y)]

Submatch Tagging: Tagged NFAs
E = (a*)aa
Tag(E) = (a*)_{t1}aa
[Figure: tagged NFA of “(a*)aa” with submatch tag t1]
[Table: extended transition table T(x,i,y,t) of the tagged NFA]

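Match Test
RegExp = (a*)aa; input: aaaa
[Figure: run of the tagged NFA (states 1, 2, 3; state 3 accepts) on the input a a a a, with {t1} labels on the tagged transitions]
Frontier after each input symbol: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3}
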
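The frontier sequence above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the transition list is read off the Boolean form of T(x,i,y,t) given later in the deck, and the names `TRANSITIONS`, `frontiers`, and `matches` are illustrative.

```python
# Extended transition table T(x, i, y, t) of the tagged NFA for "(a*)aa":
# (current state, input symbol, next state, tags emitted on the transition).
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START = {1}
ACCEPT = {3}

def frontiers(text):
    """Return the initial frontier followed by the frontier after each symbol."""
    frontier = set(START)
    history = [frontier]
    for symbol in text:
        # Follow every transition that leaves a frontier state on this symbol.
        frontier = {y for (x, i, y, _tags) in TRANSITIONS
                    if x in frontier and i == symbol}
        history.append(frontier)
    return history

def matches(text):
    """The input matches if the final frontier contains an accept state."""
    return bool(frontiers(text)[-1] & ACCEPT)

print(frontiers("aaaa"))
print(matches("aaaa"), matches("a"))
```

Each step scans the whole transition list, which is the per-symbol cost that the Boolean-function representation later tries to reduce.
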
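Submatch Extraction
[Figure: the same run of the tagged NFA on the input a a a a, with {t1} labels on the tagged transitions]
Frontier after each input symbol: {1} → {1,2} → {1,2,3} → {1,2,3} → {1,2,3}
Result: $1 = aa
Any path from an accept state to a start state generates a valid assignment of submatches.
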
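A minimal sketch of this backward walk, under assumptions: the transition list is read off the Boolean form of T(x,i,y,t) shown later in the deck, the first suitable transition is picked at each step (any choice yields a valid assignment), and a tag's submatch is taken as its last consecutive run of tagged characters. All names are illustrative.

```python
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}

def extract_submatches(text, tags=("t1",)):
    """Return {tag: submatch} for a matching input, or None if there is no match."""
    # Forward pass: record the frontier before each symbol and after the last one.
    history = [set(START)]
    for symbol in text:
        history.append({y for (x, i, y, _t) in TRANSITIONS
                        if x in history[-1] and i == symbol})
    accepting = history[-1] & ACCEPT
    if not accepting:
        return None

    # Backward pass: walk one path from an accept state back to a start state.
    state = min(accepting)
    tags_at = [frozenset()] * len(text)
    for pos in range(len(text) - 1, -1, -1):
        for (x, i, y, t) in TRANSITIONS:
            # The source state must have been in the frontier before this symbol.
            if y == state and i == text[pos] and x in history[pos]:
                tags_at[pos], state = t, x
                break

    # A tag's submatch is its last consecutive run of tagged characters.
    result = {}
    for tag in tags:
        end = len(text)
        while end > 0 and tag not in tags_at[end - 1]:
            end -= 1
        begin = end
        while begin > 0 and tag in tags_at[begin - 1]:
            begin -= 1
        result[tag] = text[begin:end]
    return result

print(extract_submatches("aaaa"))  # {'t1': 'aa'}
```
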
Complexity of Tagged NFAs
Match test: [formula not preserved]
Submatch extraction: [formula not preserved]
n: size of the tagged NFA
l: length of the input string
Can we make the operations faster?

Submatch-OBDD
• Representing tagged NFAs using Boolean functions
• Updating frontiers in one step using a single Boolean formula
• Using OBDDs to manipulate Boolean functions

Transitions as Boolean Functions
RegExp: (a*)aa
T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})

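Match Test using Boolean Functions
Each step conjoins the current states, the input symbol, and the transition table T(x,i,y,t). The result is the set of intermediate transitions, and their target states are the next states.
Step 1 (start states {1}, first symbol of aaaa): {1} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{})
Step 2 (current states {1,2}, second symbol): {1,2} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})
Step 3 (current states {1,2,3}, third symbol): {1,2,3} ∧ a ∧ T(x,i,y,t) = (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})
…
Accept
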
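The frontier update can be mimicked in Python by treating T as a Boolean function over bit vectors and enumerating its satisfying assignments. This is a brute-force sketch only: a real OBDD package would do the conjunction and quantification symbolically. The bit encodings and all names below are assumptions made for the example, not taken from the slides.

```python
from itertools import product

# Assumed encodings: 2 bits per state, 1 bit per symbol, and 1 tag bit
# that is set when the transition emits t1.
STATE_BITS = {1: (0, 1), 2: (1, 0), 3: (1, 1)}
SYMBOL_BITS = {"a": (1,), "b": (0,)}
DECODE_STATE = {bits: q for q, bits in STATE_BITS.items()}
ACCEPT = {3}

def T(x, i, y, t):
    """Characteristic function of (1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})."""
    s, a = STATE_BITS, SYMBOL_BITS["a"]
    return (
        (x == s[1] and i == a and y == s[1] and t == (1,))
        or (x == s[1] and i == a and y == s[2] and t == (0,))
        or (x == s[2] and i == a and y == s[3] and t == (0,))
    )

def bit_vectors(width):
    """All bit vectors of the given width."""
    return list(product((0, 1), repeat=width))

def next_frontier(frontier, symbol):
    """Compute {y | exists x, i, t: F(x) and I(i) and T(x,i,y,t)}, then rename y to x."""
    F = {STATE_BITS[q] for q in frontier}   # characteristic set of the frontier
    I = SYMBOL_BITS[symbol]                 # encoding of the input symbol
    targets = {
        y
        for x, i, y, t in product(bit_vectors(2), bit_vectors(1),
                                  bit_vectors(2), bit_vectors(1))
        if x in F and i == I and T(x, i, y, t)
    }
    return {DECODE_STATE[y] for y in targets}

def match(text):
    """Update the frontier once per symbol, then test it against the accept states."""
    frontier = {1}
    for symbol in text:
        frontier = next_frontier(frontier, symbol)
    return bool(frontier & ACCEPT)

print(next_frontier({1}, "a"), next_frontier({1, 2}, "a"), match("aaaa"))
```
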
Submatch Extraction using Boolean Functions
Start from the last input symbol and go backwards. At each step, conjoin the input symbol and the current state with the intermediate transitions recorded for that step to find the previous state. Then rename the previous state as the current state and continue.
Step 4 (last symbol of aaaa, accept state 3): a ∧ 3 ∧ [(1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})] = 2∧a∧3∧{}. The previous state of 3 is 2; no submatch tag is output.
Step 3 (current state 2): a ∧ 2 ∧ [(1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})] = 1∧a∧2∧{}. The previous state of 2 is 1; no submatch tag is output.

Submatch Extraction using Boolean Functions (cont.)
Step 2 (current state 1): a ∧ 1 ∧ [(1∧a∧1∧t1) ∨ (1∧a∧2∧{}) ∨ (2∧a∧3∧{})] = 1∧a∧1∧t1. The previous state of 1 is 1; submatch tag t1 is output.
Step 1 (current state 1): a ∧ 1 ∧ [(1∧a∧1∧t1) ∨ (1∧a∧2∧{})] = 1∧a∧1∧t1. The previous state of 1 is 1; submatch tag t1 is output.
Result: the first two symbols of aaaa carry tag t1, so $1 = aa.

More Formal: Match Test
Finding new frontiers after processing an input symbol:
Next frontiers = [formula not preserved]
Checking acceptance: [formula not preserved]

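One way to write the match test that is consistent with the worked example on the earlier slides is sketched below. It is a reconstruction, not the authors' exact formulation: the symbols $F_k$ (frontier after $k$ symbols), $I_{\sigma_k}$ (encoding of the $k$-th input symbol), $M_k$ (intermediate transitions of step $k$), $A$ (accept states), and the renaming operator are notation assumed here, with $l$ the input length.

```latex
\begin{align*}
M_k(x,i,y,t) &= F_{k-1}(x) \land I_{\sigma_k}(i) \land T(x,i,y,t) \\
F_k(x)       &= \mathrm{Rename}_{y \to x}\bigl(\exists x,\, i,\, t.\; M_k(x,i,y,t)\bigr) \\
\text{accept} &\iff F_l(x) \land A(x) \ \text{is satisfiable}
\end{align*}
```

With $F_0 = \{1\}$ and $\sigma_1 = a$, the first line gives $(1 \land a \land 1 \land t_1) \lor (1 \land a \land 2 \land \{\})$ and the second gives $F_1 = \{1, 2\}$, matching the first step of the example.
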
More Formal: Submatch Extraction
A backward traversal approach: start from the last input symbol and work towards the first.
Submatch extraction: the submatch for tag ti is the last consecutive sequence of input characters that are assigned ti.

Submatch-OBDD
• Representation of tagged NFAs, match test, and submatch extraction using OBDDs
• OBDD representations for
  • Transitions with submatch tags
  • Intermediate transitions
  • Submatch tags
  • Set of start states
  • Set of accept states
  • Set of frontiers
  • Input symbols

Implementation
Toolchain in C++, interfacing with CUDD*
[Diagram: RegExps → RE2TNFA → tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH. PATTERNMATCH also takes input strings / network traffic and reports either "No match" or "Matched at reg#" with the submatches $1 = …, $2 = …]
*CUDD is a package for the manipulation of binary decision diagrams.

Feasibility Study
• Data sets
  • Snort-2009
    • RegExps: 115 regexps with capturing groups from HTTP rules
    • Traces
      • 1.2GB department network traffic (average packet size 126 bytes)
      • 1.3GB Twitter traffic (average packet size 1202 bytes)
      • 1MB synthetic trace (average string length 311 bytes)
  • Snort-2012
    • RegExps: 403 regexps with capturing groups from HTTP rules
    • Traces
      • 1.2GB department network traffic (average packet size 126 bytes)
      • 1.3GB Twitter traffic (average packet size 1202 bytes)
      • 1MB synthetic trace (average string length 689 bytes)
  • Firewall-504
    • RegExps: 504 patterns from a commercial firewall F
    • Trace: 87MB of firewall logs (average line size 87 bytes)

Experimental Setup
• Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM
• Two configurations for pattern matching
  • Conf. S
    • Patterns compiled individually
    • Compiled patterns matched sequentially against the input traces
  • Conf. C
    • Patterns combined with UNION and compiled
    • Combined pattern matched against the input traces

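The difference between the two configurations can be illustrated with Python's `re` module. This is a sketch of the idea only: the three patterns are hypothetical stand-ins for a rule set, and the experiments in the talk used RE2, PCRE, and Submatch-OBDD rather than `re`.

```python
import re

# Hypothetical rule set, used only to illustrate the two configurations.
PATTERNS = [r"username=\w+", r"hostname=\w+", r"GET /admin"]

def match_sequential(patterns, line):
    """Conf. S: compile each pattern on its own and try them one after another."""
    compiled = [re.compile(p) for p in patterns]
    return [k for k, rx in enumerate(compiled) if rx.search(line)]

def match_combined(patterns, line):
    """Conf. C: join the patterns with union (|) and scan the line with one regexp."""
    union = re.compile("|".join(f"(?P<r{k}>{p})" for k, p in enumerate(patterns)))
    hits = set()
    for m in union.finditer(line):
        # A named group that is not None tells us which pattern matched here.
        hits.update(k for k in range(len(patterns)) if m.group(f"r{k}") is not None)
    return sorted(hits)

line = "username=Bob, hostname=Foo"
print(match_sequential(PATTERNS, line))  # [0, 1]
print(match_combined(PATTERNS, line))    # [0, 1]
```

The combined scan reports only non-overlapping matches, so the two functions can disagree on inputs where patterns overlap; the example line avoids that case.
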
Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set]

Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set]

Performance
[Figure: execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set]

Related Work
• NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]
• RE2 [Cox, code.google.com/p/re2]
• PCRE [www.pcre.org]
• TNFA [Laurikari et al., SPIRE’00]
• MDFA [Yu et al., ANCS’06]
• Hybrid FA [Becchi and Crowley, CoNEXT’07]
• XFA [Smith et al., Oakland’08]
• More: see the paper for details

Conclusion
• A novel way of annotating capturing groups
• Submatch-OBDD: a novel technique for submatch extraction using OBDDs
• Feasibility study
  • Submatch-OBDD achieves ideal performance when patterns are combined
  • Faster than RE2 and PCRE when patterns are combined