Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection

Fast and Memory-Efficient Regular ExpressionMatching for Deep Packet Inspection Authors:Fang Yu, Zhifeng Chen, Yanlei Diao, T.V. Lakshman and Randy H. Katz Publisher: ANCS'06, December 3–5, 2006 Present:Yu-Tso Chen Date:November, 6, 2007 Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.

Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion

Introduction • Three unique complex features • 1) Large numbers of wildcards can cause DFA to grow exponentially • 2) Wildcard are used with length restriction(‘?’, ‘+’) will increase the resource • 3) Groups of characters are also commonly used such interaction can result in highly complex state machine(ex.”^220[\x09-]*ftp”)

Introduction (cont.) • Make following contributions • 1) Analyze the computational and storage cost of building individual DFAs • 2) Two rewrite rules for specific regular expressions • 3) Combine multiple DFAs into a small number of group

Regular Expression Patterns • Compares the regular expressions used in two networking applications (Snort, Linux L-7 filter & XML filtering) • 1)Both types of app. Use wildcards (‘.’,’?‘,’+’,’*’) contain larger numbers of them • 2) Classes of characters (“[ ]”) are used only in packet scanning applications • 3) High percentage of scanning app. Have length restrictions on some of the classes or wildcards

Regular Expression Patterns

Solution Space for Regular Expression Matching • A single regular expression of length n can be expressed as an NFA with O(n) • When the NFA is converted into a DFA, it may generate states • The processing complexity for each character in the input is O(1) in DFA, but is O(n2) for an NFA when all n states are active at the same time

Solution Space for Regular Expression Matching (cont.) • To handle m regular expressions, two choices are possible： • Processing them individually in m automata • Compiling them into a single automaton

Problem Statement • DFA-based approaches in this paper • Our goal is to achieve O(1) computation cost • The focus of the study is to reduce memory overhead of DFA • There are two sources of memory usage in DFAs：states and transitions • We consider the number of states as the primary factor

Design Considerations • Define Completeness of Matching Results： • Exhaustive Matching：M(P,S)={substring S’ of S | S’ is accepted by the DFA of P} • It is expensive and often unnecessary to report all matching substrings • We propose a new concept, Non-overlapping Matching, that relaxes the requirements of exhaustive matching • Non-overlapping Matching： • Ex：ab* if input abbb non-overlapping matching will report one match instead of three • Exhaustive Matching will report, ab, abb, abbb

Design Considerations (cont.) • Define DFA Execution Model for Substring Matching：We focus on patterns without ‘^’ attached at the beginning • Repeater searches • One-pass search – this approach can truly achieve O(1) computation cost per character

DFA Analysis for Individual Regular Expressions • The study is based on the use of exhaustive matching & one-pass search

Case 4：DFA of Quadratic Size • The DFA needs to remember the number of Bs it has seen and their locations

Case 5：DFA of Exponential Size • An exponential number of states (22+1)are needed to represent these two wildcard characters AAB(AABBCD) is different from ABA(ABABCD) because a subsequence input BCD

Regular Expression Rewrites • Rewrite Rule(1) • “^SEARCH\s+[^\n]{1024}” to “^SEARCH\s [^\n]{1024}” • “Â+[A-Z]{j}” to “Â [A-Z]{j}” • We can prove match “Â+[A-Z]{j}” also match “Â [A-Z]{j}”

Regular Expression Rewrites (cont.) • Rewrite Rule(2) • We don’t need to keep track of the second AUTH\s • If there is a ‘\n’ within the next 100 bytes, the return character must also be within 100 bytes to the second AUTH\s • If there is no ‘\n’ within the next 100 bytes, the first already matched the pattern • “([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*AUTH\s[^\n]{100}”

Selective Grouping of Multiple Patterns • The composite DFA may experience exponential growth in size, although none of the individual DFA has an exponential component

Regular Expressions Grouping Algorithm • Definition of interaction：two patterns interact with each other if their composite DFA contains more states than the sum of two individual ones

Grouping Algorithm

Evaluation Result • Effect of Rule Rewriting

Evaluation Result (cont.)

Conclusion • Rewriting techniques – memory-efficient DFA-based approaches are possible • Selectively groups patterns together – speed up the matching process

Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection