Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions

Michela Becchi and Patrick Crowley, ACM CoNEXT 2008

Presentation Transcript


  1. Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions
     Michela Becchi and Patrick Crowley, ACM CoNEXT 2008

  2. Context
     • Network intrusion detection and prevention systems
     • Email monitoring
     • Content-based routing
     • Application-level prioritizing and filtering
     • …
     [Figure: a matching engine loaded with a RegEx set (FTP.OPEN.*, www.spyware, Host=, Server.*HTTP) splits incoming packets into safe packets and malicious packets. Sample payloads shown: Hosxyz, blaBLAbla, Safe_payload, ServerxHTTP, xHost=, ServerABHTTP.]

  3. Challenges
     • Networking context
       • Line-rate operation (several Gbps)
       • Parallel search over data sets consisting of hundreds or thousands of patterns
     • On memory-centric architectures
       • Bounded per-character processing (memory bandwidth)
       • Pre-computed large data structures (memory size)

  4. Challenges (cont’d) [CoNEXT’07]
     • Snort rule-set, November 2007 snapshot
       • 8536 rules
       • 5549 Perl-Compatible Regular Expressions
     • Note:
       • 99% with character ranges, e.g. mi[ck][ch]ela
       • 16.3% with dot-star terms, e.g. mi.*ela, mi[^\n\r]*ela
       • 44% with counting constraints, e.g. mi.{2,3}ela
       • 6% with back-references, e.g. mi(ch|k)elabec\1i [this paper]
     • Other features:
       • Lazy/greedy quantifiers
       • Positive/negative lookahead
       • Atomic groups
       • …
       • No expressive power added
       • Speed up text-based engines

  5. Deterministic vs. Non-Deterministic FA
     • RegEx: (1) a+bc (2) bcd+ (3) cde
     • Text: a b c d
     • MEMORY BANDWIDTH: # of state traversals per input character. Better for DFAs
     • MEMORY SIZE: # of states and transitions. Better for NFAs
     [Figure: the NFA (states 0 to 9, accepting states 3/1, 6/2, 9/3) and the equivalent DFA (states 0 to 10, accepting states 3/1, 6/2, 7/2, 10/3); on the sample text both report Match #1 and Match #2.]
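
The bandwidth side of this trade-off can be made concrete with a small simulation. The sketch below hand-builds an NFA for the three patterns on the slide (the state numbering 0 to 9 follows the figure as far as it can be read from the transcript and should be treated as an assumption) and reports, for each input character, how many states are active. A DFA for the same patterns would always have exactly one active state.

```python
# Hand-built NFA for (1) a+bc, (2) bcd+, (3) cde. State 0 has an implicit
# self-loop on every character so that a match can start anywhere in the text.
EDGES = [
    (0, "a", 1), (1, "a", 1), (1, "b", 2), (2, "c", 3),  # a+bc
    (0, "b", 4), (4, "c", 5), (5, "d", 6), (6, "d", 6),  # bcd+
    (0, "c", 7), (7, "d", 8), (8, "e", 9),               # cde
]
ACCEPTING = {3: 1, 6: 2, 9: 3}  # accepting state -> pattern number


def run_nfa(text):
    """Return one (char, active state count, matched patterns) tuple per char."""
    delta = {}
    for src, ch, dst in EDGES:
        delta.setdefault((src, ch), set()).add(dst)
    active = {0}
    trace = []
    for ch in text:
        nxt = {0}  # the self-loop keeps state 0 active
        for state in active:
            nxt |= delta.get((state, ch), set())
        active = nxt
        matched = sorted(ACCEPTING[s] for s in active if s in ACCEPTING)
        trace.append((ch, len(active), matched))
    return trace
```

On the text abcd this reports up to four simultaneously active NFA states, with pattern 1 matching on c and pattern 2 matching on d.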

  6. Counting constraints - NFA
     E.g.: a.{n}bc
     • Memory size
       • For large n, the number of states N_NFA is linear in n
     • Memory bandwidth
       • Input text: aaaaaaaa…aaabc ⇒ n states active in parallel
       • For large n: ~N_NFA memory accesses per input character
     [Figure: NFA with states 0, 1, 2, …, n+1, n+2, n+3; a chain of n ∑ transitions followed by b and c.]

  7. Counting constraints - DFA
     E.g.: a.{n}bc
     • Memory size
       • For large n, the number of states N_DFA is exponential in n
       • For large n, the DFA is practically infeasible
       • E.g. n=40: ~1000 billion states
     [Figure: the DFA shown next to the NFA.]
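
The exponential growth can be observed directly for small n by running the textbook subset construction on the unrolled NFA. This is a sketch under assumptions: the four-symbol alphabet "abcx" stands in for the full byte alphabet, and the NFA layout (states 0 to n+3) is the same illustrative one used for the NFA slide. For scale, 2**40 is 1,099,511,627,776, in line with the "~1000 billion states" quoted for n=40.

```python
def dfa_state_count(n, alphabet="abcx"):
    """Count DFA states reachable by subset construction for a.{n}bc."""

    def step(states, ch):
        nxt = {0}  # NFA state 0 loops on every character
        for s in states:
            if s == 0:
                if ch == "a":
                    nxt.add(1)
            elif s <= n:
                nxt.add(s + 1)
            elif s == n + 1:
                if ch == "b":
                    nxt.add(n + 2)
            elif s == n + 2:
                if ch == "c":
                    nxt.add(n + 3)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for ch in alphabet:
            successor = step(current, ch)
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return len(seen)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, dfa_state_count(n))
```

Each DFA state has to remember which of the most recent characters were an a, which is why the count at least doubles with every increment of n.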

  8. Counting-NFAs
     E.g.: a.{n}bc
     [Figure: counting-NFA with states 0 to 4, showing counter instantiation ("a, cnt"), counter increment (cnt++), and the conditional transitions "∑ | cnt≠n" and "b | cnt=n".]
     • Advantage: limited size (independent of n)
     • Functional equivalence: is one counter enough?
       • E.g. a.{3}bc:
         • text: axaybcz ⇒ match is detected
         • text: axaybzbc ⇒ match is missed!
       • ⇒ Multiple (up to n) counter instances necessary
     • n active counter instances ⇒ unmodified memory bandwidth requirement!
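
The single-counter failure can be reproduced with a small simulation. The sketch below is a simplified model, not the automaton from the figure: it assumes a counter is instantiated at 0 when an a is read, counts every following character, and is discarded once it has reached n. In single-counter mode a new a is ignored while a counter is already active.

```python
def match_counting(text, n, single_counter):
    """Return the indices at which a.{n}bc finishes matching."""
    counters = []    # active counter instances, oldest first
    after_b = False  # True if the previous char was a 'b' taken with cnt == n
    matches = []
    for i, ch in enumerate(text):
        if after_b and ch == "c":
            matches.append(i)
        # 'b' is accepted only if some instance has counted exactly n chars
        after_b = ch == "b" and n in counters
        # instances that reached n expire; the rest count this character
        counters = [c + 1 for c in counters if c < n]
        if ch == "a" and not (single_counter and counters):
            counters.append(0)  # counter instantiation
    return matches
```

With a.{3}bc, both modes find the match in axaybcz, but only the multi-instance mode finds the match in axaybzbc, where the relevant a is the second one.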

  9. Counting-NFAs: limiting memory bandwidth
     E.g.: a.{n}bc
     • Observation:
       • Counter instances are updated in parallel
       • The difference between c_i and c_j is constant over time
     • Idea:
       • Differential representation: store the oldest (and largest) instance c_i′ and, for j>i, Δc_j = c_j − c_(j−1)
     • Condition evaluation:
       • cnt=n: c_i′ = n
       • cnt≠n: c_i′ ≠ n OR another instance c_j exists
     • Advantage:
       • Even if n instances are active, only 2 must be queried/updated
     [Figure: worked example with n=10.]
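
One way to picture the differential representation is the sketch below. It is an assumption-laden illustration rather than the paper's data structure: the class name, the choice to track the youngest instance explicitly, and the use of positive gaps between consecutive instances are all hypothetical. The point it illustrates is that an increment touches only two stored values, however many instances are active.

```python
from collections import deque


class DiffCounters:
    """Counter instances stored as the oldest value plus gaps to younger ones."""

    def __init__(self, n):
        self.n = n
        self.oldest = None    # value of the oldest (largest) instance
        self.youngest = None  # value of the most recently created instance
        self.gaps = deque()   # gap between consecutive instances, oldest first

    def instantiate(self):
        if self.oldest is None:
            self.oldest = self.youngest = 0
        elif self.youngest > 0:
            self.gaps.append(self.youngest)  # distance to the new instance at 0
            self.youngest = 0

    def increment(self):
        if self.oldest is None:
            return
        if self.oldest == self.n:  # the oldest instance expires
            if self.gaps:
                self.oldest -= self.gaps.popleft()
            else:
                self.oldest = self.youngest = None
                return
        self.oldest += 1    # only these two values are updated,
        self.youngest += 1  # the gaps stay constant over time

    def eq_n(self):   # condition cnt = n
        return self.oldest == self.n

    def neq_n(self):  # condition cnt != n
        return self.oldest is not None and (self.oldest != self.n or bool(self.gaps))

    def values(self):
        """Expand to explicit counter values (for inspection only)."""
        if self.oldest is None:
            return []
        out = [self.oldest]
        for gap in self.gaps:
            out.append(out[-1] - gap)
        return out
```

Because every instance advances by one per character, the gaps never change, so both conditions can be answered from the oldest value and the presence of a second instance.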

  10. Counting-DFAs
      E.g.: a.{n}bc
      • Extended NFA-DFA transformation
        • Counting states
        • Instantiating transitions
        • Conditional transitions
      • Possible conditions:
        • cnt=n: c_i′ = n and c_i′ is the single instance
        • cnt≠n: c_i′ ≠ n
        • cnt=n and cnt≠n: c_i′ = n and another instance c_j exists
      • Consequences:
        • Limited memory bandwidth (1 state + 2 counter instances)
        • Limited size (independent of n)
      [Figure: the counting-NFA and the corresponding counting-DFA.]

  11. Combining multiple regex
      Patterns = {RE1, RE2, RE3, …, REn}
      • 1st solution: regex partitioning [Brodie, ISCA’06][Yu, ANCS’06]
        • k concurrent DFAs ⇒ k memory accesses/input char!
        • High parallelism and memory bandwidth: ASIC, FPGA
      [Figure: the patterns RE1 … REn are partitioned into groups, each compiled into its own DFA (DFA1, DFA2, …, DFAk); the single NFA and single DFA are shown for comparison.]

  12. Combining multiple regex (cont’d)
      • 2nd solution: hybrid-FA [Becchi, CoNEXT 2007]
      • Memory size:
        • Limited, independent of # of closure states
      • Memory bandwidth:
        • Average: only the head-DFA is active; one state traversal/character
        • Worst case: all tail-FAs are active; bandwidth = # DFA state traversals + 2 accesses per counter, per char
      • Low-medium parallelism and memory bandwidth: GPP, small CMP
      [Figure: a head-DFA connected to tail-NFA1 … tail-NFAk and to tail-DFA1 … tail-DFAk, with transitions labeled [^c1..ck] and cnt+.]

  13. Back-references
      • Idea: a given sub-expression must be matched multiple times with the same text
      • Examples
        • (abc|bcd).\1y matches abcdabcdy, does not match abcdaabcy
        • a([a-z]+)a\1y matches babacabacy
      • Observations
        • The alternatives in the referenced sub-expression may overlap
        • The captured sub-expression may overlap w/ previous/next char
        • The length of the referenced sub-expression may be variable
      • GOAL: preserve NFA-like operation:
        • Find all matches/stop at the first
        • Process each char once
        • Allow parallel RegEx processing
      • Memory needed
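
The slide's examples can be checked with any engine that supports back-references. The sketch below uses Python's standard `re` module, which is a backtracking, text-based engine and therefore not the automaton approach the talk proposes; it serves only to confirm which substrings match. The helper name `first_match` is illustrative.

```python
import re


def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_match(r"(abc|bcd).\1y", "abcdabcdy"))
    print(first_match(r"(abc|bcd).\1y", "abcdaabcy"))
    print(first_match(r"a([a-z]+)a\1y", "babacabacy"))
```

In abcdabcdy the match uses the bcd alternative starting at the second character, which illustrates the observation that the alternatives of the referenced sub-expression may overlap.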

  14. Extended-FA
      E.g.: (abc|bcd).\1y
      • Extensions:
        • Recording and conditional transitions, consuming states
        • Each state is associated with a set {PMk} of partial match strings
      [Figure: extended-FA for the example with states 0 to 8, showing recording transitions (a, b, c and b, c, d), ∑ transitions, a consuming state, the conditional transitions "\1 | s≠ε" and "\1 | s=ε", and a final y transition.]

  15. Extended-FA operation
      E.g.: (abc|bcd).\1y
      Text: a b c e a b c y
      [Figure: step-by-step run of the extended-FA (states 0 to 8) on the text, with states annotated by partial match strings such as a, ab, abc, b, bc, c.]
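
The operation on this slide can be approximated with a set-of-configurations simulation that reads each character once and carries the recorded string along with each active configuration. This is a sketch hard-wired to the example (abc|bcd).\1y; the tuple encoding of configurations is an assumption made for illustration and does not reproduce the state numbering of the figure.

```python
BRANCHES = ("abc", "bcd")  # alternatives of the captured group


def extended_fa_matches(text):
    """Return the indices at which (abc|bcd).\\1y finishes matching."""
    configs = set()
    matches = []
    for i, ch in enumerate(text):
        nxt = set()
        # The start state is always active: begin recording a new capture.
        for branch in BRANCHES:
            if branch[0] == ch:
                nxt.add(("rec", branch, 1))
        for cfg in configs:
            kind = cfg[0]
            if kind == "rec":    # recording transitions build the capture
                _, branch, pos = cfg
                if branch[pos] == ch:
                    done = pos + 1 == len(branch)
                    nxt.add(("dot", branch) if done else ("rec", branch, pos + 1))
            elif kind == "dot":  # '.' consumes any character
                nxt.add(("ref", cfg[1], 0))
            elif kind == "ref":  # consuming state: replay the recorded string
                _, cap, k = cfg
                if cap[k] == ch:
                    done = k + 1 == len(cap)
                    nxt.add(("y",) if done else ("ref", cap, k + 1))
            elif kind == "y" and ch == "y":
                matches.append(i)
        configs = nxt
    return matches
```

On the slide's text abceabcy the capture abc is recorded, e is consumed by the dot, abc is replayed against the recorded string, and the final y completes the match at index 7.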

  16. Results
      • Hybrid-FA has memory bandwidth from 10X to 100X lower
      [Figures: memory size; worst-case memory bandwidth.]

  17. Conclusion
      • Extended Finite Automata:
        • Dynamic state information: counters, partial matches
        • Manipulating states: counting, consuming states
        • Producing transitions: counter, partial match instantiation
        • Conditional transitions
      • Goals:
        • Limited memory size and bandwidth requirement
        • Integration w/ existing proposals
          • Multiple-DFAs
          • Hybrid-FA
          • DFA compression techniques
      • Future direction:
        • Use extensions for structured data parsing (e.g. XML, application protocols)

  18. Thank you!
      • Questions?

  19. Matching architecture
