270 likes | 442 Views
Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection. Authors: Fang Yu, Zhifeng Chen, Yanlei Diao, T.V. Lakshman and Randy H. Katz Publisher: ANCS'06, December 3–5, 2006 Present: Yu-Tso Chen Date: November, 6, 2007.
E N D
Fast and Memory-Efficient Regular ExpressionMatching for Deep Packet Inspection Authors:Fang Yu, Zhifeng Chen, Yanlei Diao, T.V. Lakshman and Randy H. Katz Publisher: ANCS'06, December 3–5, 2006 Present:Yu-Tso Chen Date:November, 6, 2007 Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Introduction • Three unique complex features • 1) Large numbers of wildcards can cause DFA to grow exponentially • 2) Wildcard are used with length restriction(‘?’, ‘+’) will increase the resource • 3) Groups of characters are also commonly used such interaction can result in highly complex state machine(ex.”^220[\x09-]*ftp”)
Introduction (cont.) • Make following contributions • 1) Analyze the computational and storage cost of building individual DFAs • 2) Two rewrite rules for specific regular expressions • 3) Combine multiple DFAs into a small number of group
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Regular Expression Patterns • Compares the regular expressions used in two networking applications (Snort, Linux L-7 filter & XML filtering) • 1)Both types of app. Use wildcards (‘.’,’?‘,’+’,’*’) contain larger numbers of them • 2) Classes of characters (“[ ]”) are used only in packet scanning applications • 3) High percentage of scanning app. Have length restrictions on some of the classes or wildcards
Solution Space for Regular Expression Matching • A single regular expression of length n can be expressed as an NFA with O(n) • When the NFA is converted into a DFA, it may generate states • The processing complexity for each character in the input is O(1) in DFA, but is O(n2) for an NFA when all n states are active at the same time
Solution Space for Regular Expression Matching (cont.) • To handle m regular expressions, two choices are possible: • Processing them individually in m automata • Compiling them into a single automaton
Problem Statement • DFA-based approaches in this paper • Our goal is to achieve O(1) computation cost • The focus of the study is to reduce memory overhead of DFA • There are two sources of memory usage in DFAs:states and transitions • We consider the number of states as the primary factor
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Design Considerations • Define Completeness of Matching Results: • Exhaustive Matching:M(P,S)={substring S’ of S | S’ is accepted by the DFA of P} • It is expensive and often unnecessary to report all matching substrings • We propose a new concept, Non-overlapping Matching, that relaxes the requirements of exhaustive matching • Non-overlapping Matching: • Ex:ab* if input abbb non-overlapping matching will report one match instead of three • Exhaustive Matching will report, ab, abb, abbb
Design Considerations (cont.) • Define DFA Execution Model for Substring Matching:We focus on patterns without ‘^’ attached at the beginning • Repeater searches • One-pass search – this approach can truly achieve O(1) computation cost per character
DFA Analysis for Individual Regular Expressions • The study is based on the use of exhaustive matching & one-pass search
Case 4:DFA of Quadratic Size • The DFA needs to remember the number of Bs it has seen and their locations
Case 5:DFA of Exponential Size • An exponential number of states (22+1)are needed to represent these two wildcard characters AAB(AABBCD) is different from ABA(ABABCD) because a subsequence input BCD
Regular Expression Rewrites • Rewrite Rule(1) • “^SEARCH\s+[^\n]{1024}” to “^SEARCH\s [^\n]{1024}” • “^A+[A-Z]{j}” to “^A [A-Z]{j}” • We can prove match “^A+[A-Z]{j}” also match “^A [A-Z]{j}”
Regular Expression Rewrites (cont.) • Rewrite Rule(2) • We don’t need to keep track of the second AUTH\s • If there is a ‘\n’ within the next 100 bytes, the return character must also be within 100 bytes to the second AUTH\s • If there is no ‘\n’ within the next 100 bytes, the first already matched the pattern • “([^A]|A[^U]|AU[^T]|AUT[^H]|AUTH[^\s]|AUTH\s[^\n]{0,99}\n)*AUTH\s[^\n]{100}”
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Selective Grouping of Multiple Patterns • The composite DFA may experience exponential growth in size, although none of the individual DFA has an exponential component
Regular Expressions Grouping Algorithm • Definition of interaction:two patterns interact with each other if their composite DFA contains more states than the sum of two individual ones
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Evaluation Result • Effect of Rule Rewriting
Outline • 1. Introduction • 2. Definitions and problem description • 3. Matching of Individual Patterns • 4. Selective Grouping of Multiple Patterns • 5. Evaluation Result • 6. Conclusion
Conclusion • Rewriting techniques – memory-efficient DFA-based approaches are possible • Selectively groups patterns together – speed up the matching process