Efficient Memory Utilization on Network Processors for Deep Packet Inspection
Piti Piyachon, Yan Luo
Electrical and Computer Engineering Department
University of Massachusetts Lowell

Our Contributions
• Study the parallelism of a pattern matching algorithm
• Propose the Bit-Byte Aho-Corasick Deterministic Finite Automaton (DFA)
• Construct a memory model to find the optimal settings that minimize the memory usage of the DFA
U Mass Lowell

DPI and Pattern Matching
• Deep Packet Inspection
  • Inspect: packet header and payload
  • Detect: computer viruses, worms, spam, etc.
  • Network intrusion detection applications: Bro, Snort, etc.
• Pattern matching requirements
  • Match multiple predefined patterns (keywords, or strings) at the same time
  • Keywords can be any size.
  • Keywords can be anywhere in the payload of a packet.
  • Match at line speed
  • Flexibility to accommodate new rule sets

Classical Aho-Corasick (AC) DFA: example 1
• A set of keywords: {he, her, him, his}
• Failure edges back to state 1 are shown as dashed lines.
• Failure edges back to state 0 are not shown.
[Figure: state diagram with one start state and four accept states]
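
Example 1 can be reproduced with a minimal sketch of a classical Aho-Corasick matcher. This is an illustration only, not the authors' implementation: the names `build_ac` and `search` are hypothetical, and the sketch follows failure links at run time instead of precomputing a full 256-entry next-state table per state.

```python
from collections import deque

def build_ac(keywords):
    """Build the keyword trie, then add failure links breadth-first."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    queue = deque(goto[0].values())  # depth-1 states fail to state 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    """Return (start offset, keyword) for every match in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_ac(["he", "her", "him", "his"])
hits = search("she is hers", automaton)
```

The trie for {he, her, him, his} has seven states (state 0 plus h, he, her, hi, him, his), which is why a table-based DFA with 256 pointers per state stays small here but grows quickly for thousands of keywords.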

Memory Matrix Model of AC DFA
• Snort (Dec’05): 2733 keywords
• 256 next state pointers
  • width = 15 bits
• > 27,000 states
• keyword-ID width = 2733 bits
• 27538 x (2733 + 256 x 15) bits = 22 MB
• 22 MB is too big for on-chip RAM.
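
As a worked check of the figure on this slide, the arithmetic can be written out as follows. The only assumption is that the product is in bits, which is what the keyword-ID and pointer widths imply.

```latex
\begin{aligned}
\text{bits per state} &= 2733 + 256 \times 15 = 6573 \\
\text{total} &= 27538 \times 6573 = 181{,}007{,}274 \ \text{bits} \\
&\approx 22.6 \times 10^{6} \ \text{bytes}
\end{aligned}
```

This rounds to the slide's 22 MB; more than half of each row (3840 of 6573 bits) is spent on the 256 next state pointers.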

Bit-AC DFA (Tan-Sherwood’s Bit-Split)
• Need 8 bit-DFAs

Memory Matrix of Bit-AC DFA
• Snort (Dec’05): 2733 keywords
• 2 next state pointers
  • width = 9 bits
• 361 states
• keyword-ID width = 16 bits
• 1368 DFA
• 1368 x 361 x (16 + 2 x 9) bits = 2 MB

Bit-AC DFA Techniques
• Shrinking the width of keyword-ID
  • From 2733 to 16 bits
  • By dividing 2733 keywords into 171 subsets
  • Each subset has 16 keywords
• Reducing next state pointers
  • From 256 to 2 pointers
  • By splitting each input byte into single bits
  • Need 8 bit-DFAs
• Extra benefits
  • The number of states (per DFA) drops from ~27,000 to ~300.
  • The width of the next state pointer drops from 15 to 9 bits.
• Memory
  • Reduced from 22 MB to 2 MB
• The number of DFA
  • With 171 subsets, each subset has 8 DFA.
  • Total DFA = 171 x 8 = 1,368
• What more can we do to reduce the memory usage?
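
The bit-split idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Tan and Sherwood's construction: each bit-DFA here is an Aho-Corasick machine built directly over the chosen bit of every keyword byte (with run-time failure links), and bit 0 is taken to be the least significant bit. The names `build_bit_dfa`, `step`, and `bit_split_search` are hypothetical. Each state carries a keyword-ID bitmask, and a keyword is reported only when all 8 machines agree.

```python
from collections import deque

def build_bit_dfa(keywords, bit):
    """AC machine over {0, 1} for one bit position; out[] holds keyword-ID bitmasks."""
    goto, fail, out = [{}], [0], [0]
    for idx, word in enumerate(keywords):
        state = 0
        for byte in word:
            sym = (byte >> bit) & 1
            if sym not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(0)
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        out[state] |= 1 << idx
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for sym, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and sym not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(sym, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def step(dfa, state, sym):
    """Advance one bit-DFA by one input bit."""
    goto, fail, _ = dfa
    while state and sym not in goto[state]:
        state = fail[state]
    return goto[state].get(sym, 0)

def bit_split_search(payload, keywords):
    """Run 8 bit-DFAs in lockstep and AND their keyword-ID bitmasks."""
    dfas = [build_bit_dfa(keywords, bit) for bit in range(8)]
    states = [0] * 8
    hits = []
    for pos, byte in enumerate(payload):
        mask = (1 << len(keywords)) - 1
        for bit in range(8):
            states[bit] = step(dfas[bit], states[bit], (byte >> bit) & 1)
            mask &= dfas[bit][2][states[bit]]
        for idx, word in enumerate(keywords):
            if (mask >> idx) & 1:
                hits.append((pos - len(word) + 1, word))
    return hits

subset = [b"he", b"her", b"him", b"his"]
hits = bit_split_search(b"she is hers", subset)
```

A single bit-DFA produces false candidates on its own, since many byte strings share the same bit pattern at one position; the AND across all 8 machines removes them, and the bitmask width equals the number of keywords in the subset.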

Classical AC DFA: example 2
• 28 states
• Failure edges are not shown.

Byte-AC DFA
• Considering 4 bytes at a time
• 4 DFA
• < 9 states / DFA
• 256 next state pointers!
• Similar to Dharmapurikar-Lockwood’s JACK DFA, ANCS’05

Bit-Byte-AC DFA
• 4 bytes at a time
• Each byte is divided into bits.
• 32 DFA (= 4 x 8)
• < 6 states / DFA
• 2 next state pointers

Memory Matrix of Bit-Byte-AC DFA
• Snort (Dec’05): 2733 keywords
• 4 bytes at a time
• < 36 states / DFA
• 2 next state pointers
  • width = 6 bits
• keyword-ID width = 3 bits
• 29152 DFA (= 911 x 32)
• 29152 x 36 x (3 + 2 x 6) bits = 1.9 MB
• 1.9 MB is only a little better than 2 MB because
  • This is not an optimal setting.
  • Each DFA has a different number of states.
  • There is no need to provide the same size of memory matrix for every DFA.
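
The three memory matrix calculations seen so far share one formula, which the following sketch evaluates with the numbers from the slides. The function name `matrix_bytes` is hypothetical, and reporting in units of 2^20 bytes is an assumption about how the slides round.

```python
def matrix_bytes(num_dfa, states, keyword_id_bits, pointers, pointer_bits):
    """Memory of num_dfa equally sized state tables, in bytes."""
    row_bits = keyword_id_bits + pointers * pointer_bits  # one table row per state
    return num_dfa * states * row_bits / 8

# Parameters copied from the memory matrix slides (Snort, Dec'05).
classical = matrix_bytes(1, 27538, 2733, 256, 15)
bit_ac = matrix_bytes(1368, 361, 16, 2, 9)
bit_byte = matrix_bytes(29152, 36, 3, 2, 6)

for name, size in [("classical AC", classical),
                   ("bit-AC", bit_ac),
                   ("bit-byte-AC", bit_byte)]:
    print(f"{name}: {size / 2**20:.2f} MB")
```

Because every DFA is charged for the same worst-case number of states, the formula overstates what a per-DFA layout would need, which is the point made in the last bullets of this slide.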

Bit-Byte-AC DFA Techniques
• Still keeping the width of keyword-ID as low as Bit-DFA
• Still keeping next state pointers as few as Bit-DFA
• Reducing states per DFA by
  • Skipping bytes
  • Exploiting more shared states than Bit-DFA
• Results of reducing states per DFA
  • From ~27,000 to 36 states
  • The width of the next state pointer drops from 15 to 6 bits.

Construction of Bit-Byte AC DFA
• 4 bytes are considered at a time.
• Failure edges are not shown.
• 32 bit-byte DFAs need to be constructed.
[Figure sequence: step-by-step construction, including the DFA for bit 3 of byte 0]
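
To make the 4 x 8 split concrete, the sketch below shows which input each of the 32 bit-byte DFAs would be built from for a single keyword. It covers only the splitting step, not DFA construction or the handling of keywords that start at arbitrary payload offsets. The name `lane_bits` is hypothetical, and numbering bit 0 as the least significant bit is an assumption, since the slides do not state the convention.

```python
def lane_bits(keyword, lane, bit, B=4):
    """Bit `bit` of every B-th byte of `keyword`, starting at byte index `lane`."""
    return [(byte >> bit) & 1 for byte in keyword[lane::B]]

keyword = b"memory"

# With B = 4, each byte lane sees every fourth byte of the keyword.
lanes = [keyword[lane::4] for lane in range(4)]

# One bit string per (lane, bit) pair: 4 x 8 = 32 inputs, one per bit-byte DFA.
projections = {(lane, bit): lane_bits(keyword, lane, bit)
               for lane in range(4) for bit in range(8)}
```

Each DFA sees at most two symbols of `memory` instead of six, which is where the reduction in states per DFA comes from: skipping bytes shortens every keyword path.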

Bit-Byte-DFA: Searching
• Failure edges are shown where necessary.
• Match => keyword ‘memory’
• A match is reported only when all 32 bit-DFAs find the match on their own.
[Figure sequence: step-by-step search]

Find the optimal settings to minimize memory
• k = keywords per subset
  • The width of keyword-ID = k bits
  • k = 1, 2, 3, …, K
  • K = the number of keywords in the whole set
  • Snort (Dec. 2005): K = 2733 keywords
• b = bit(s) extracted from each byte
  • b = 1, 2, 4, 8
  • # of next state pointers = 2^b
  • Example 2: b = 1
  • b > 8 would need more than 256 next state pointers.
• B = bytes considered at a time
  • B = 1, 2, 3, …
  • Example 2: B = 4
• Total memory (T) is a function of k, b, and B.
  • T = f(k, b, B)
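
The three parameters determine how many DFAs must be built. The sketch below counts them, assuming the pattern visible in the earlier slides (171 x 8 and 911 x 32) generalizes to ceil(K / k) subsets times 8 / b bit groups times B byte lanes; the slides only show the b = 1 cases, so the b > 1 generalization is an inference. The function names are hypothetical.

```python
import math

def num_dfa(K, k, b, B):
    """DFAs needed: one per (keyword subset, bit group, byte lane)."""
    subsets = math.ceil(K / k)
    return subsets * (8 // b) * B

def next_state_pointers(b):
    """Each DFA consumes b bits per step, so it needs 2**b next state pointers."""
    return 2 ** b

bit_ac_count = num_dfa(K=2733, k=16, b=1, B=1)
bit_byte_count = num_dfa(K=2733, k=3, b=1, B=4)
```

Raising b cuts the number of DFAs but doubles the pointers per state with each extra bit, which is why T has an interior optimum in b.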

T’s Formula
• T = total memory of all bit-ACs in all subsets
[Equations not preserved in this transcript]
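
One form of T that is consistent with the memory matrix slides is sketched below. It is a reconstruction under assumptions, not the authors' formula: s_{i,j} denotes the number of states of the j-th DFA in subset i, and the pointer width is taken to be the smallest number of bits that can index those states (which matches 15 bits for 27538 states, 9 bits for 361, and 6 bits for 36).

```latex
T(k, b, B) \;=\; \sum_{i=1}^{\lceil K/k \rceil} \; \sum_{j=1}^{B \cdot 8/b}
  s_{i,j} \left( k + 2^{b} \, w_{i,j} \right) \ \text{bits},
\qquad w_{i,j} = \lceil \log_2 s_{i,j} \rceil
```

Setting every s_{i,j} to the worst-case state count recovers the uniform memory matrix products used on the earlier slides.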

Find the optimal k
• Each pair of (b, B) has one optimal k for a minimal T.
[Figure: T versus keywords per subset; T_min at k = 12]

Find the optimal b
• Each setting of k, b, and B has a different optimal point.
• Only the optimal settings are compared.
• b = 2 is the best.
[Figure: T versus keywords per subset]

Find the optimal B
• b = 2
• T decreases, non-linearly, as B increases.
• For B > 16, T begins to increase.
• B = 16 is the best for Snort (Dec’05).
[Figure: T versus keywords per subset]
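
The search procedure described on the last three slides (find the best k for each (b, B) pair, then compare only those optima) can be sketched as an exhaustive search. The function `best_settings` is hypothetical, and `toy_memory` is a made-up placeholder cost chosen only so the example runs; it is not measured data, and a real run would build the DFAs for each setting and count their states.

```python
def best_settings(total_memory, ks, bs, Bs):
    """Return the optimal k for each (b, B) pair and the overall best (k, b, B)."""
    best_k = {}
    for b in bs:
        for B in Bs:
            best_k[(b, B)] = min(ks, key=lambda k: total_memory(k, b, B))
    # Compare only the per-pair optima, as the slides do.
    (b, B), k = min(best_k.items(),
                    key=lambda item: total_memory(item[1], *item[0]))
    return best_k, (k, b, B)

def toy_memory(k, b, B):
    """Placeholder cost with a single minimum; NOT derived from Snort."""
    return (k - 12) ** 2 + 10 * (b - 2) ** 2 + (B - 16) ** 2

best_k, best = best_settings(toy_memory,
                             ks=range(1, 33),
                             bs=[1, 2, 4, 8],
                             Bs=[1, 2, 4, 8, 16, 32])
```

Exhaustive search is affordable here because b has only four legal values and the useful ranges of k and B are small.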

Comparing with Existing Works
• Tan-Sherwood’s, Brodie-Cytron-Taylor’s, and ours
• Our Bit-Byte DFA when B = 16
  • The optimal point is at b = 2 and k = 12.
  • 272 KB
  • 14% of 2001 KB (Tan’s)
  • 4% of 6064 KB (Brodie’s)
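
The percentages on this slide follow directly from the three sizes, as this short check shows. The variable names are illustrative only.

```python
ours_kb, tan_kb, brodie_kb = 272, 2001, 6064  # sizes quoted on the slide

vs_tan = 100 * ours_kb / tan_kb          # share of Tan-Sherwood's memory
vs_brodie = 100 * ours_kb / brodie_kb    # share of Brodie-Cytron-Taylor's memory
saving_vs_tan = 100 - vs_tan             # reduction relative to Tan-Sherwood's

print(f"{vs_tan:.1f}% of Tan's, {vs_brodie:.1f}% of Brodie's, "
      f"{saving_vs_tan:.1f}% saved vs. Tan's")
```

The saving relative to Tan-Sherwood's design rounds to 86%.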

Comparing with Existing Works
• Tan-Sherwood’s and ours at B = 1
• Tan’s on ASIC
  • 2001 KB
  • k = 16 is not the optimal setting for B = 1.
  • Each bit-DFA uses the same storage capacity, sized to fit the largest one (worst case).
• Ours on NP
  • 396 KB < 2001 KB
  • k = 3 is the optimal setting for B = 1.
  • Each bit-DFA uses exactly the memory space needed to hold it.
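
The difference between worst-case and exact-fit storage can be illustrated with a small sketch. The state counts below are invented for illustration and are not taken from Snort; the function names are hypothetical, and sizing each DFA's pointer width to its own state count is an assumption about the exact-fit layout.

```python
def dfa_bits(states, k, b):
    """Bits for one DFA: states x (keyword-ID width + 2**b pointers)."""
    pointer_bits = max(1, (states - 1).bit_length())  # bits needed to index a state
    return states * (k + 2 ** b * pointer_bits)

def uniform_bits(state_counts, k, b):
    """Every DFA gets a table sized for the largest DFA (worst case)."""
    return len(state_counts) * dfa_bits(max(state_counts), k, b)

def exact_bits(state_counts, k, b):
    """Every DFA gets exactly the space it needs."""
    return sum(dfa_bits(s, k, b) for s in state_counts)

state_counts = [361, 120, 40, 12]  # hypothetical DFA sizes
uniform = uniform_bits(state_counts, k=16, b=1)
exact = exact_bits(state_counts, k=16, b=1)
```

The gap widens as the DFA sizes become more uneven, since the uniform layout pays for the largest DFA once per DFA.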

Results with an NP Simulator
• NePSim2
  • An open source IXP24xx/28xx simulator
• NP architecture based on IXP2855
  • 16 MicroEngines (MEs)
  • 512 KB
  • 1.4 GHz
• Bit-Byte AC DFA: b = 2, B = 16, k = 12
  • T = 272 KB
  • 5 Gbps

Conclusion
• The Bit-Byte DFA model can reduce memory usage by up to 86%.
• Implementing on an NP uses on-chip memory more efficiently, without wasting space, compared with an ASIC.
• An NP has the flexibility to accommodate
  • The optimal setting of k, b, and B
  • Different sizes of Bit-Byte DFA
  • New rule sets in the future
    • The optimal setting may change.
• The performance (using an NP simulator) satisfies line speed up to 5 Gbps throughput.

Thank you
Questions?
Piti_Piyachon@student.uml.edu
Yan_Luo@uml.edu