Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection

Publisher: ANCS ’06
Authors: Fang Yu, Zhifeng Chen, Yanlei Diao, T.V. Lakshman, and Randy H. Katz
Presenter: Yu-Hsiang Wang
Date: 2010/11/17

Outline
• Introduction
• DFA Analysis for Individual Regular Expressions
• Regular Expression Rewrites
• Regular Expressions Grouping
• Evaluation results

Introduction
• A theoretical worst-case study [14] shows that a single regular expression of length n can be expressed as an NFA with O(n) states. When the NFA is converted into a DFA, it may generate O(Σ^n) states (Σ: the finite set of input symbols, 2^8 symbols for the ASCII code).
• The processing complexity for each input character is O(1) in a DFA, but O(n^2) in an NFA when all n states are active at the same time.
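The state blow-up described above can be reproduced on a small scale. The sketch below is an illustration only: the pattern `.*a.{k}` over a two-symbol alphabet is a textbook example chosen here, not one taken from the slides, and the helper names `build_nfa` and `count_dfa_states` are hypothetical. It runs the subset construction and counts the reachable DFA states.

```python
def build_nfa(k):
    """Transition function of an NFA for .*a.{k} (assumed example pattern).

    State 0 loops on every symbol and moves to state 1 on 'a';
    states 1..k advance on any symbol; state k+1 is accepting.
    """
    def step(state, ch):
        nxt = set()
        if state == 0:
            nxt.add(0)
            if ch == "a":
                nxt.add(1)
        elif state <= k:
            nxt.add(state + 1)
        return nxt
    return step


def count_dfa_states(k, alphabet="ab"):
    """Count the DFA states reachable by subset construction."""
    step = build_nfa(k)
    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            # One DFA state is the set of NFA states active after reading ch.
            nxt = frozenset(s for q in current for s in step(q, ch))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


for k in range(6):
    # The NFA has k + 2 states; the DFA state count doubles with each step of k.
    print(k, k + 2, count_dfa_states(k))
```

Each DFA state is one lookup per input character, whereas the NFA simulation has to update a whole set of active states, which is the trade-off the slide describes.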

Introduction
• To handle m regular expressions, two choices are possible:
  - processing them individually in m automata: O(m) per input character
  - compiling the m regular expressions into a composite DFA: O(1) per input character

Design Considerations
• Completeness of matching results (pattern: ab*, input: abbb):
  - Exhaustive matching: a, ab, abb, abbb
  - Non-overlapping matching: a (shortest match) or abbb (left-most longest match)
• DFA execution model for substring matching (patterns without ^ at the beginning):
  - Repeated search: start scanning from one position; if there is no match, start again at the next position
  - One-pass search: .* is prepended to each pattern without ^
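The two notions of completeness can be illustrated with Python's `re` module on the slide's own example (pattern ab*, input abbb). This is a minimal sketch: the helper names are hypothetical, and Python's greedy matching is used as a stand-in for left-most longest matching, which coincides with it for this pattern but not for every pattern.

```python
import re


def exhaustive_matches(pattern, text):
    """All non-empty substrings of text that match the whole pattern."""
    rx = re.compile(pattern)
    return [
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if rx.fullmatch(text, i, j)
    ]


def non_overlapping_matches(pattern, text):
    """Matches reported by a single left-to-right scan, without overlap."""
    return [m.group() for m in re.finditer(pattern, text)]


print(exhaustive_matches("ab*", "abbb"))        # ['a', 'ab', 'abb', 'abbb']
print(non_overlapping_matches("ab*", "abbb"))   # ['abbb']  (longest)
print(non_overlapping_matches("ab*?", "abbb"))  # ['a']     (shortest)
```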

DFA Analysis
• We use exhaustive matching and one-pass search
• Typical patterns in network payload scanning applications

Case 4: DFA of Quadratic Size
• If an input contains multiple Bs, the DFA needs to remember the number of Bs it has seen and their locations

Case 4 Rewrites
Rewrite Rule (1)
• Rewriting is enabled by relaxing the requirement of exhaustive matching to that of non-overlapping matching
• The new pattern essentially implements non-overlapping left-most shortest match
• Ex: ^SEARCH\s+[^\n]{1024} → ^SEARCH\s[^\n]{1024}
  Input: SEARCH\s\s ... \s aa ... a
• The number of states is linear in j (the repetition count, 1024 here) because the rewrite removes the ambiguity in matching \s
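The effect of Rewrite Rule (1) can be spot-checked with Python's `re` module. The sketch below is an assumption-laden illustration, not the authors' tooling: it uses a small hypothetical bound `N = 4` instead of 1024, and it only compares whether each pattern reports a match on a handful of sample inputs whose whitespace runs contain spaces only. Because `\s` also matches `\n`, inputs with a newline inside the whitespace run can behave differently and are not covered here.

```python
import re

N = 4  # hypothetical small stand-in for the 1024 used on the slide

ORIGINAL = re.compile(r"^SEARCH\s+[^\n]{%d}" % N)
REWRITTEN = re.compile(r"^SEARCH\s[^\n]{%d}" % N)


def reports_match(text):
    """Return (original_hit, rewritten_hit) for one input string."""
    return (
        ORIGINAL.search(text) is not None,
        REWRITTEN.search(text) is not None,
    )


samples = [
    "SEARCH " + "a" * N,           # one space, then enough payload
    "SEARCH" + " " * 4 + "aaaa",   # a run of spaces before the payload
    "SEARCH" + " " * 3 + "aa",     # extra spaces count toward the N characters
    "SEARCH  aa",                  # too short for either pattern
    "SEARCHaaaa",                  # no whitespace after the keyword
    "SEARCH aa\naa",               # newline inside the N-character window
]

for text in samples:
    print(repr(text), reports_match(text))
```

On these samples both patterns agree on whether a match exists; the rewritten pattern simply stops deciding how many characters belong to the whitespace run.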

Case 5: DFA of Exponential Size
• We need to remember all possible effects of the preceding As, as they may yield different results when combined with subsequent inputs
• Example (from the slide figure): AAB followed by BCD is marked O; ABA followed by BCD is marked X

Case 5: DFA of Exponential Size
• Often used for detecting buffer overflow attempts: .*AUTH\s[^\n]{100}
• The DFA needs to remember all the possible AUTH\s occurrences: DFA > 10000 states
  - A second AUTH\s can either match [^\n]{100} or be counted as a new match of the start of the pattern AUTH\s
• Cannot be efficiently processed by an NFA-based approach either
• Input: AUTH\sAUTH\sAUTH\s\s AUTH\s\s\s …
[Figure: NFA for .*AUTH\s[^\n]{100}, with transitions ε, A, U, T, H, \s followed by 100 states on [^\n]]

Case 5 Rewrites
• Only the first AUTH\s matters
  - If there is a '\n' within the next 100 bytes, none of the AUTH\s occurrences matches the pattern
  - Otherwise, the first AUTH\s and the following characters have already matched the pattern
• Rewrite the pattern to:
  ([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*AUTH\s[^\n]{100}
  which generates a DFA of only 106 states
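The rewritten pattern can be tried out next to the original with Python's `re` module. This is a sketch under stated assumptions: the bound is shrunk to a hypothetical `K = 3` instead of 100, `re.DOTALL` is used so that the leading `.*` of the original can cross newlines, and the comparison covers only a few hand-picked inputs. It does not prove the two patterns equivalent for all inputs.

```python
import re

K = 3  # hypothetical small stand-in for the 100 used on the slide

ORIGINAL = re.compile(r".*AUTH\s[^\n]{%d}" % K, re.DOTALL)
REWRITTEN = re.compile(
    r"(?:[^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,%d}\n)*"
    r"AUTH\s[^\n]{%d}" % (K - 1, K)
)


def compare(text):
    """Return (original_hit, rewritten_hit), both anchored at the input start."""
    return (
        ORIGINAL.match(text) is not None,
        REWRITTEN.match(text) is not None,
    )


samples = [
    "xxAUTH abc",        # plain match after a prefix
    "AUTH a\nbcdef",     # newline too early, no later AUTH
    "AUTH a\nAUTH xyz",  # first attempt cancelled by newline, second matches
    "AUTH AUTH ab",      # a second AUTH inside the window of the first
    "hello",             # no AUTH at all
    "AUTH ab",           # input ends before K characters
]

for text in samples:
    print(repr(text), compare(text))
```

The alternation in front of the final `AUTH\s[^\n]{K}` consumes everything that cannot be the first successful occurrence, which is how the rewrite encodes "only the first AUTH\s matters".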

Regular Expressions Grouping
• Some composite patterns generate DFAs of exponential size
• Interaction: two patterns interact with each other if their composite DFA contains more states than the sum of the two individual ones

Regular Expressions Grouping
Multi-core architectures (e.g. IXP 2800 NPU, 16 processing units)
• Goal: design an algorithm that divides regular expressions into several groups, so that one processing unit can run one or several composite DFAs
• The local memory of each processing unit is quite limited
  - Compute pairwise interaction results and form a graph
  - Pick the pattern with the fewest interactions to start a new group
  - Keep adding patterns until the limit is reached
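The three steps listed above can be sketched as a greedy loop. This is only an interpretation of the slide, not the authors' implementation: the function `group_patterns`, the `dfa_size` callback that estimates the composite DFA size of a group, the choice of the next candidate as the pattern with the fewest interactions with the current group, and the toy sizes in the example are all assumptions made for illustration.

```python
def group_patterns(patterns, interacts, dfa_size, limit):
    """Greedily split patterns into groups whose composite DFA fits the limit.

    patterns:  list of pattern identifiers
    interacts: set of frozenset({p, q}) pairs that interact
    dfa_size:  hypothetical callable, list of patterns -> estimated state count
    limit:     maximum number of states allowed per group
    """
    def degree(p, others):
        # Number of patterns in `others` that interact with p.
        return sum(1 for q in others if q != p and frozenset((p, q)) in interacts)

    remaining = list(patterns)
    groups = []
    while remaining:
        # Seed the new group with the pattern that has the fewest interactions.
        seed = min(remaining, key=lambda p: degree(p, remaining))
        remaining.remove(seed)
        group = [seed]
        while remaining:
            # Assumed choice: the candidate interacting least with the group.
            cand = min(remaining, key=lambda p: degree(p, group))
            if dfa_size(group + [cand]) > limit:
                break  # the group is full, start a new one
            group.append(cand)
            remaining.remove(cand)
        groups.append(group)
    return groups


# Toy example with made-up numbers: every pattern alone needs 10 states,
# and each interacting pair inside a group costs 100 extra states.
pairs = {frozenset(("p1", "p2"))}

def toy_size(group):
    extra = sum(
        100
        for i, p in enumerate(group)
        for q in group[i + 1:]
        if frozenset((p, q)) in pairs
    )
    return 10 * len(group) + extra

print(group_patterns(["p1", "p2", "p3", "p4"], pairs, toy_size, limit=40))
# [['p3', 'p1', 'p4'], ['p2']]
```

In the toy run, p1 and p2 end up in different groups because combining them would exceed the limit, while the non-interacting patterns share one group.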


Evaluation results
• Effect of Rule Rewriting
  - L7-filter: protocol identifiers (70 regular expressions)
  - Bro: intrusion patterns (2781 regular expressions)
  - SNORT: no regular expressions in April 2003; 1131 out of 4867 regular expressions as of Jan 2006

Evaluation results
• Effect of Grouping Multiple Patterns

