A Hybrid Finite Automaton for Practical Deep Packet Inspection. CoNEXT 2007. Michela Becchi and Patrick Crowley.
Context
[Figure: incoming packets pass through a matching engine loaded with a RegEx set (FTP.OPEN.*, www.spyware, Host=, Server.*HTTP); packets without a match are forwarded as safe packets, packets whose payload matches (e.g. "xHost=", "ServerxHTTP") are flagged as malicious packets]
• Deep packet inspection
• Challenge: perform regular expression matching at line rate, given data-sets of hundreds (or thousands) of patterns
  • Processing time
  • Memory requirement

Deterministic vs. Non-Deterministic FA
• RegEx: (1) .*a+bc (2) .*bcd+ (3) .*cde
• Text: d a b c d
[Figure: NFA with states 0-9 (accepting states 3/1, 6/2, 9/3) next to the equivalent DFA with states 0-10 (accepting states 3/1, 6/2, 7/2, 10/3); DFA transition legend: a: 1-10, b: 2-10, c: 1,3,5-10]

Memory-time tradeoff
[Figure: memory vs. time plot locating the NFA and the DFA]
• NFA
  • limited size
  • potentially N_NFA states active in parallel
• DFA
  • one state traversal per character
  • size: potentially 2^N states, where N = N_NFA
• In practical cases a single DFA is infeasible!
• Idea
  • Hybrid automaton
  • Size comparable to NFA by preventing “state explosion”
  • Predictable and small memory bandwidth/processing time
  • Limit to the classes of RegEx found in Intrusion Detection Systems
  • Analyze state explosion scenarios

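The tradeoff can be made concrete with the three rules from the "Deterministic vs. Non-Deterministic FA" slide. The sketch below is illustrative only: the NFA state numbering is read off the figure residue, "x" is an assumed stand-in for every character outside a-e, and the helper names (`nfa_step`, `run_nfa`, `build_dfa`, `run_dfa`) are hypothetical, not code from the talk. It tracks the NFA active set per input character and builds the DFA by plain subset construction.

```python
ANY = None  # label used for the wildcard self-loop on state 0

# NFA for (1) .*a+bc  (2) .*bcd+  (3) .*cde; numbering assumed from the slide figure.
NFA = {
    0: [(ANY, 0), ("a", 1), ("b", 4), ("c", 7)],
    1: [("a", 1), ("b", 2)],
    2: [("c", 3)],
    4: [("c", 5)],
    5: [("d", 6)],
    6: [("d", 6)],
    7: [("d", 8)],
    8: [("e", 9)],
}
ACCEPT = {3: 1, 6: 2, 9: 3}   # accepting NFA state -> rule number
ALPHABET = "abcdex"           # "x" represents any other character


def nfa_step(active, ch):
    """Advance every active NFA state on one character."""
    return frozenset(t for s in active for label, t in NFA.get(s, ())
                     if label is ANY or label == ch)


def run_nfa(text):
    """Return (active states, matched rules) after each character."""
    active, trace = frozenset({0}), []
    for ch in text:
        active = nfa_step(active, ch)
        trace.append((sorted(active),
                      sorted(ACCEPT[s] for s in active if s in ACCEPT)))
    return trace


def build_dfa():
    """Plain subset construction; each DFA state is a set of NFA states."""
    start = frozenset({0})
    table, work = {}, [start]
    while work:
        cur = work.pop()
        if cur not in table:
            table[cur] = {ch: nfa_step(cur, ch) for ch in ALPHABET}
            work.extend(table[cur].values())
    return start, table


def run_dfa(text):
    """One table lookup per character; returns matched rules per position."""
    state, table = build_dfa()
    hits = []
    for ch in text:
        state = table[state][ch if ch in ALPHABET else "x"]
        hits.append(sorted(ACCEPT[s] for s in state if s in ACCEPT))
    return hits


if __name__ == "__main__":
    for ch, (active, rules) in zip("dabcd", run_nfa("dabcd")):
        print(ch, "active NFA states:", active, "matched rules:", rules)
    print("DFA states:", len(build_dfa()[1]))
```

On the text `dabcd` the NFA has up to four states active at once, while the DFA does a single lookup per character at the price of storing one more state than the NFA (11 versus 10 in this small case).
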
SNORT Regular expressions
pattern1.*pattern2.{n,m}[…]patternk[^cxcy]*[…]patternn
• Examples
  • Server\s+Guptachar\s+\d+\x2E\d+
  • User-Agent [^\r\n]*A-311\s+Server
  • Host[^\r\n]*wwp\.mirabilis\.com.*from=[^\r\n]*fromemail=[^\r\n]*subject=[^\r\n]*to=24962844
  • \sPARTIAL.*BODY\.PEEK[^\n]{1024}
• SNORT RegExs DO consist of:
  • Sequences of sub-patterns
  • Possibly containing (repetitions of) character ranges
  • Separated by dot-star terms and counting constraints
• SNORT RegExs DON’T normally contain:
  • Nested repetitions
  • Disjunctions of complex sub-expressions

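The separators named on this slide can be spotted mechanically. The following is a rough heuristic sketch, not the analysis tool used for the talk: the three detector patterns and the `classify` helper are assumptions made for illustration, and they would misread corner cases such as an escaped dot followed by a star.

```python
import re

# Heuristic detectors (assumed for illustration; they ignore escaping corner cases).
DOT_STAR = re.compile(r"\.\*")                       # ".*"
RANGE_STAR = re.compile(r"\[\^[^\]]+\]\*")           # "[^...]*"
COUNTING = re.compile(r"(?:\.|\[\^[^\]]+\])\{(\d+)(?:,(\d*))?\}")  # ".{n,m}" or "[^...]{n,m}"


def classify(regex):
    """Report which state-explosion-prone terms a regex appears to contain."""
    return {
        "dot_star": bool(DOT_STAR.search(regex)),
        "range_star": bool(RANGE_STAR.search(regex)),
        "counting": [int(m.group(1)) for m in COUNTING.finditer(regex)],
    }


# The four examples from the slide, copied as shown there.
EXAMPLES = [
    r"Server\s+Guptachar\s+\d+\x2E\d+",
    r"User-Agent [^\r\n]*A-311\s+Server",
    r"Host[^\r\n]*wwp\.mirabilis\.com.*from=[^\r\n]*fromemail=[^\r\n]*subject=[^\r\n]*to=24962844",
    r"\sPARTIAL.*BODY\.PEEK[^\n]{1024}",
]

if __name__ == "__main__":
    for rx in EXAMPLES:
        print(classify(rx), rx)
```

The `counting` entry lists the lower bound n of every counting constraint found, which is the quantity that drives the state explosion discussed on the later slides.
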
Dot-star terms
• Definition
  • Unconstrained repetitions of wildcards (.*) or large ranges [^c1c2..ck]*
• Examples
  • User-Agent[^\r\n]*ZC-Bridge
• On single regular expressions (from practical data-sets)
  • NO state blow-up
[Figure: NFA (states 0-4) and DFA (states 0; 0,1; 0,2; 0,2,3; 0,2,4) for RegEx ab.*cd]

Dot-star conditions (cont’d)
• Compiling together several RegEx
  • Duplication of “sub-DFAs” at “.*” states
  • NO exponential blow-up
• Example: ab.*cd and efgh compiled together
[Figure: combined DFA with states 0-12 (accepting states 8/1, 10/2, 12/2)]

Counting constraints
• Definition
  • Constrained repetition of wildcard .{n,m} or large ranges [^c1c2..ck]{n,m}
• Examples
  • AUTH\s[^\n]{100} (buffer overflow)
• Exponential state explosion:
  • Single regular expressions: all possible occurrences of the prefix in the counting constraint
  • Multiple regular expressions: additionally, all the possible occurrences of other RegEx in the counting constraint

Counting constraints (cont’d)
• Ex: ab.{3}cd
[Figure: NFA with 8 states (1-8) next to the corresponding DFA, whose states are numbered up to 18]

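The contrast between dot-star terms and counting constraints can be checked by counting DFA states directly. The sketch below is a minimal experiment under stated assumptions: patterns are unanchored and of the form prefix, gap, suffix; "x" stands for every character outside the pattern; and `linear_nfa` and `dfa_size` are hypothetical helper names.

```python
ANY = None  # wildcard label


def linear_nfa(prefix, gap, suffix):
    """NFA for .*<prefix><gap><suffix>; gap is "*" for .* or an int n for .{n}."""
    nfa = {0: [(ANY, 0)]}   # state 0 loops on everything (unanchored match)
    state = 0

    def add(label):
        nonlocal state
        nfa.setdefault(state, []).append((label, state + 1))
        state += 1

    for ch in prefix:
        add(ch)
    if gap == "*":
        nfa.setdefault(state, []).append((ANY, state))  # self-loop for .*
    else:
        for _ in range(gap):
            add(ANY)                                    # n counting states
    for ch in suffix:
        add(ch)
    return nfa, state + 1   # transitions and number of NFA states


def dfa_size(nfa, alphabet):
    """Number of DFA states reachable through plain subset construction."""
    def step(active, ch):
        return frozenset(t for s in active for label, t in nfa.get(s, ())
                         if label is ANY or label == ch)

    start = frozenset({0})
    seen, work = {start}, [start]
    while work:
        cur = work.pop()
        for ch in alphabet:
            nxt = step(cur, ch)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)


if __name__ == "__main__":
    nfa, n_states = linear_nfa("ab", "*", "cd")
    print("ab.*cd   NFA states:", n_states, "DFA states:", dfa_size(nfa, "abcdx"))
    for n in range(1, 9):
        nfa, n_states = linear_nfa("ab", n, "cd")
        print(f"ab.{{{n}}}cd NFA states:", n_states,
              "DFA states:", dfa_size(nfa, "abcdx"))
```

For `ab.*cd` the DFA stays within one state of the NFA size, whereas for `ab.{n}cd` the DFA must remember every position at which the prefix `ab` occurred inside the counting window, so its size grows much faster than the n extra NFA states.
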
First step: hybrid-FA
• Idea: stop subset construction at the state where state blow-up would occur
• Implication: hybrid-FA with a head-DFA, one or more tail-NFAs and one or more border-states
[Figure: the NFA for a four-rule example (accepting states 4/1, 8/2, 10/3, 13/4) and the corresponding hybrid-FA, whose head-DFA states are labeled with NFA state subsets such as 1,5,11]

Hybrid-FA traversal
[Figure: the same input traversed on the NFA and on the hybrid-FA]
• Functional equivalence (commonly reached accepting states)
• Hybrid-FA:
  • Limitation in size of active vector till border state is reached
  • No back activation from tail-NFAs to head-DFA

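Construction and traversal can be sketched together on a small case. The example below is an illustration under assumptions, not the authors' implementation: it uses the two rules ab.*cd and efgh, treats the ".*" state (numbered 2 here) as the only border state, maps every character outside a-h to "x", and checks the result against a plain NFA simulation. All function names are hypothetical.

```python
ANY = None  # wildcard label

# NFA for rules (1) ab.*cd and (2) efgh; numbering is illustrative.
NFA = {
    0: [(ANY, 0), ("a", 1), ("e", 5)],
    1: [("b", 2)],
    2: [(ANY, 2), ("c", 3)],   # the ".*" state: subset construction stops here
    3: [("d", 4)],
    5: [("f", 6)],
    6: [("g", 7)],
    7: [("h", 8)],
}
ACCEPT = {4: 1, 8: 2}          # accepting NFA state -> rule number
BORDER = {2}                   # border states joining head-DFA and tail-NFA
ALPHABET = "abcdefghx"         # "x" represents any other character


def nfa_step(active, ch):
    return frozenset(t for s in active for label, t in NFA.get(s, ())
                     if label is ANY or label == ch)


def build_head_dfa():
    """Subset construction that keeps border states but never expands them."""
    start = frozenset({0})
    table, work = {}, [start]
    while work:
        cur = work.pop()
        if cur in table:
            continue
        expandable = cur - BORDER
        table[cur] = {ch: nfa_step(expandable, ch) for ch in ALPHABET}
        work.extend(table[cur].values())
    return start, table


def run_hybrid(text):
    """One head-DFA lookup per character plus the (usually small) tail set."""
    head, table = build_head_dfa()
    tail, out = frozenset(), []
    for i, ch in enumerate(text):
        tail = nfa_step(tail, ch)                     # tail-NFA moves on its own
        head = table[head][ch if ch in ALPHABET else "x"]
        tail |= head & BORDER                         # border reached: activate tail
        out.extend((i, ACCEPT[s]) for s in (head - BORDER) | tail if s in ACCEPT)
    return sorted(out)


def run_nfa(text):
    """Reference: plain NFA simulation over the whole automaton."""
    active, out = frozenset({0}), []
    for i, ch in enumerate(text):
        active = nfa_step(active, ch)
        out.extend((i, ACCEPT[s]) for s in active if s in ACCEPT)
    return sorted(out)


if __name__ == "__main__":
    print(run_hybrid("abxxcdefgh"), run_nfa("abxxcdefgh"))
```

The tail set never feeds back into the head-DFA, which mirrors the "no back activation" point above, and it stays empty until the input actually contains `ab`.
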
Improving the worst case
• Size: hybrid-FA ≈ size of NFA
• Bandwidth:
  • Average case improved (in DFA)
  • Worst case dependent on tail-NFAs size
• Can we do better?

Dot-star terms: Tail-DFAs
• Idea: [Figure: a head-DFA whose tail-NFAs are each replaced by a tail-DFA]
• Problem:
  • Multiple border state traversals => multiple tail-DFA activations
• Fact:
  • In case of
    • sub_pattern1.*sub_pattern2
    • sub_pattern1[^c1…ck]*sub_pattern2 w/ c1,..,ck ∉ sub_pattern2
    subsequent activations of a tail-DFA can be safely ignored
• Implication
  • Each tail-DFA adds only 1 to the worst case bound

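Why repeated activations can be ignored is easiest to see in code. The sketch below assumes the pattern User-Agent[^\r\n]*ZC-Bridge from the dot-star slide, models the tail-DFA as a prefix-matching automaton over the suffix, and uses a simple string check as a stand-in for the head-DFA reaching the border state. The class and function names are hypothetical.

```python
def next_state(pattern, k, ch):
    """Longest prefix of pattern that is a suffix of pattern[:k] + ch."""
    seen = pattern[:k] + ch
    for n in range(min(len(pattern), len(seen)), 0, -1):
        if seen.endswith(pattern[:n]):
            return n
    return 0


class TailDFA:
    """Tail for [^excluded]*suffix, assuming no excluded char occurs in suffix."""

    def __init__(self, suffix, excluded):
        self.suffix, self.excluded = suffix, set(excluded)
        self.state = None       # None means the tail is inactive
        self.activations = 0    # activations that were not ignored

    def activate(self):
        # An already active tail covers every later activation, so skip it.
        if self.state is None:
            self.state, self.activations = 0, self.activations + 1

    def feed(self, ch):
        if self.state is None:
            return False
        if ch in self.excluded:
            self.state = None   # the [^...]* loop dies, and so does the tail
            return False
        self.state = next_state(self.suffix, self.state, ch)
        return self.state == len(self.suffix)


def scan(text, prefix, excluded, suffix):
    """Return (match end positions, number of effective tail activations)."""
    tail, ends = TailDFA(suffix, excluded), []
    for i, ch in enumerate(text):
        if tail.feed(ch):
            ends.append(i)
        if text.endswith(prefix, 0, i + 1):   # stand-in for the head-DFA border state
            tail.activate()
    return ends, tail.activations


if __name__ == "__main__":
    print(scan("User-Agent User-Agent foo ZC-Bridge",
               "User-Agent", "\r\n", "ZC-Bridge"))
```

Because a single DFA state is kept per tail regardless of how often the border state is crossed, each tail contributes one state traversal per character, which matches the "adds only 1" bound on the slide.
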
Counting Constraints: counter trick
[Figure: NFA for .{n}suffix, with n counting states b, b+1, ..., b+n-1, b+n chained by wildcard transitions before the suffix]
• Observation:
  • n “counting states” do not carry real next state information
• Idea:
  • Replace n counting states w/ auto-decrementing counter
  • At most 2 memory accesses per counter sufficient
• Optimization
  • Counting constraint at the end of the regular expression (no suffix) => ONE counter is enough

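For the end-of-expression case, the single-counter idea can be sketched as follows. This is one possible reading of the optimization, not the implementation from the talk: it assumes the rule shape AUTH\s[^\n]{n} from the counting-constraints slide, reports only the first match, and uses a string check in place of the head-DFA. The name `counter_match` is hypothetical.

```python
def counter_match(text, n, prefix="AUTH", excluded="\n"):
    """Index where prefix + whitespace + n non-excluded chars first ends, else None."""
    counter = None                      # None means no count in progress
    for i, ch in enumerate(text):
        if counter is not None:
            if ch in excluded:
                counter = None          # range violated: abandon the count
            else:
                counter -= 1            # auto-decrement on each admissible character
                if counter == 0:
                    return i            # n characters counted: the rule matches here
        # Stand-in for the head-DFA reaching the border state after "AUTH\s".
        # A border crossing while the counter is running is ignored: the running
        # count was started earlier and will therefore expire first.
        if counter is None and ch.isspace() and text.endswith(prefix, 0, i):
            counter = n
    return None


if __name__ == "__main__":
    print(counter_match("AUTH " + "A" * 100, 100))
```

One integer replaces the n counting states. When a suffix follows the counting constraint, overlapping activations can no longer be collapsed this way and several counters would be needed.
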
Rule-sets
• Distinct PCREs: 982
  • 25% w/ long counting constraints (generally at the end of the RegEx, n=100-1024)
  • 11.4% containing .* terms
  • 54.89% containing [^c1c2..ck]* terms
• Header-based grouping

Memory storage requirements
• Tail-DFAs and counter trick used (counters at end)

Memory bandwidth requirements
• Simulations on 12 packet traces
  • From 17 MB to 264 MB
  • 1-6 rules matched per trace
• Observations:
  • active set size: number of parallel active states

Conclusion
• Contributions:
  • Analysis of practical rule-sets
  • Proposal of hybrid-FA to
    • reduce memory storage requirement
    • limit average case memory bandwidth
  • Refinements: tail-DFAs and counter tricks
    • bound worst case memory bandwidth
• Experimental results:
  • Memory size: comparable to the corresponding NFA
  • Memory bandwidth:
    • Average case ≈ single (infeasible) DFA
    • Worst case dependent upon number of “problematic” RegEx
• Deployment observation
  • Head and tail-FAs independent
  • Hybrid-FA suitable for deployment on parallel architectures and FPGAs

Thanks Questions?
A SNORT rule
• HEADER MATCHING (protocol, source addr, source port, dest. addr, dest. port)
  alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS
• PAYLOAD INSPECTION: keywords (“content”) and regular expression (PCRE)
  (msg:"BACKDOOR a-311 death user-agent string detected"; flow:to_server,established; content:"User-Agent|3A|"; nocase; content:"A-311"; distance:0; nocase; content:"Server"; distance:0; nocase; pcre:"/^User-Agent\x3A[^\r\n]*A-311\s+Server/smi"; reference:url,www3.ca.com/securityadvisor/pest/pest.aspx?id=453076778; classtype:trojan-activity; sid:6396; rev:1;)

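The payload part of this rule can be approximated in a few lines. The sketch below is a simplification for illustration, not Snort's matching engine: header matching and flow state are left out, the `content` keywords are modeled as case-insensitive substrings searched in order, and the PCRE is translated to Python's `re` module with the /smi modifiers mapped to `re.S`, `re.M` and `re.I`. The helper names and the sample payloads are made up.

```python
import re

# Python translation of the rule's pcre option (/smi -> DOTALL, MULTILINE, IGNORECASE).
PCRE = re.compile(r"^User-Agent\x3A[^\r\n]*A-311\s+Server", re.S | re.M | re.I)

# The three content keywords; |3A| in the rule is the byte for ":".
CONTENTS = ["User-Agent:", "A-311", "Server"]


def contents_match(payload):
    """Case-insensitive keywords, each searched after the end of the previous one."""
    low, pos = payload.lower(), 0
    for keyword in CONTENTS:
        idx = low.find(keyword.lower(), pos)
        if idx < 0:
            return False
        pos = idx + len(keyword)   # approximates distance:0
    return True


def rule_matches(payload):
    """Cheap keyword check first, regular expression only if the keywords hit."""
    return contents_match(payload) and PCRE.search(payload) is not None


if __name__ == "__main__":
    sample = "GET / HTTP/1.1\r\nUser-Agent: A-311 Server\r\n\r\n"
    print(rule_matches(sample))
```

Running the keyword check before the regular expression reflects the rule's layout, where the `content` options precede the `pcre` option.
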
Problem
• Network Intrusion Detection Systems use regular expression matching for payload inspection
• Regular expression matching is performed in linear time through deterministic finite automata (DFAs)
• Several compression techniques have been put in place to reduce the memory requirement of a given DFA
BUT
• The complexity of the RegExs may make DFAs infeasible because of “state explosion”
• How can state explosion be prevented while preserving a worst-case bound on memory bandwidth?

Deterministic vs. Non-Deterministic FA
• RegEx: (1) .*abc; (2) .*bcd; (3) .*cde
[Figure: NFA with states 0-9 (accepting states 3/1, 6/2, 9/3) and part of the corresponding DFA, whose states are labeled with NFA state subsets such as 0,1; 0,4; 0,7]