Performance Analysis of Packet Classification Algorithms on Network Processors Deepa Srinivasan, IBM Corporation Wu-chang Feng, Portland State University November 18, 2004 IEEE Local Computer Networks
Network Processors • Emerging platform for high-speed packet processing • Provide device programmability while preserving performance • Architectures differ, but common features include… • Multiple processing units executing in parallel • Instruction set customized for network applications • Binary image pre-determined at compile time
IXP Architecture • Multi-processor • StrongARM core for slow-path processing • 6 microengines for fast-path processing • Hardware support for multithreading • Each microengine has 4 thread contexts • Zero or minimal overhead context switch
Motivation for study • NPs offer a programmable, parallel alternative, but current packet processing algorithms are • Written for sequential execution or • Designed using custom, invariant ASICs • To use them on NPs • Algorithms must be mapped onto NPs in different ways with each mapping having varying performance
Our study • Examine several mappings of a packet classification algorithm onto NP hardware • Identify general problems in performing such mappings
Why packet classification? • Fundamental function performed by all network devices • Routers, switches, bridges, firewalls, IDS • Increasing complexity makes packet classification the bottleneck • Increase in size of rulesets • Increase in dimension of rulesets • Algorithms must perform at high-speed on the fast-path
Picking an algorithm • Many algorithms are sequential • Do not leverage inherent parallelism in NPs • Several parallel algorithms exist • BitVector [Lakshman98] • Parallel lookup implemented via FPGA • Maps well onto NP platform
Bit Vector algorithm • T.V. Lakshman, D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching”, SIGCOMM 1998. • Parallel search algorithm • Preprocessing phase • Two-stage classification phase • Perform lookup for each dimension in parallel • Combine results to determine matching rule
Example ruleset Number of rules (N) = 4 Number of dimensions (d) = 3 Width of dimension (W) = 4 (bits)
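The parameters on this slide are enough for a rough estimate of how much memory the preprocessed bit vectors take. The sketch below is illustrative only and covers two layouts, both assumptions rather than anything stated in the slides: an interval-based layout (N ranges have 2N endpoints, so each dimension splits into at most 2N + 1 intervals, each holding an N-bit vector) and a direct-indexed layout (one N-bit vector per possible field value). The function names are hypothetical.

```python
def interval_layout_bits(num_rules: int, num_dims: int) -> int:
    """Upper bound on bit-vector storage when each dimension is split
    into elementary intervals (at most 2N + 1 of them)."""
    max_intervals = 2 * num_rules + 1
    return num_dims * max_intervals * num_rules


def direct_index_bits(num_rules: int, num_dims: int, width: int) -> int:
    """Storage when each dimension is a table indexed directly by the
    field value (2**W entries), one N-bit vector per entry."""
    return num_dims * (1 << width) * num_rules


# Parameters from the example ruleset slide: N = 4, d = 3, W = 4.
print(interval_layout_bits(4, 3))   # 3 * 9 * 4  = 108 bits (upper bound)
print(direct_index_bits(4, 3, 4))   # 3 * 16 * 4 = 192 bits
```

The interval layout grows quadratically in N, while the direct-indexed layout grows exponentially in W, so the latter is only plausible for narrow fields such as the 4-bit ones in this example.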
BitVector example Packet = {6, 10, 2} Matching rule = r2
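The two phases of BitVector can be sketched in a few lines of Python. This is a minimal illustration, not the authors' microengine code. The ruleset table from the slides did not survive in this transcript, so `RULES` below is a hypothetical ruleset with the same shape (N = 4, d = 3, W = 4), chosen so that the packet {6, 10, 2} matches r2. The names `preprocess` and `classify` and the use of inclusive ranges are also assumptions.

```python
from bisect import bisect_right

W = 4  # bits per dimension, as in the example ruleset slide

# HYPOTHETICAL ruleset (not the one on the slide): one inclusive
# (lo, hi) range per dimension; earlier rules have higher priority.
RULES = [
    ((0, 3), (8, 11), (0, 7)),    # r1
    ((4, 7), (8, 15), (0, 3)),    # r2
    ((0, 15), (0, 15), (2, 2)),   # r3
    ((8, 15), (0, 7), (8, 15)),   # r4
]


def preprocess(rules, width):
    """Preprocessing phase: per dimension, build sorted interval start
    points and one bit vector per interval (bit i set = rule i matches)."""
    tables = []
    for d in range(len(rules[0])):
        points = {0}
        for rule in rules:
            lo, hi = rule[d]
            points.add(lo)
            if hi + 1 < (1 << width):
                points.add(hi + 1)
        starts = sorted(points)
        vectors = []
        for s in starts:
            v = 0
            for i, rule in enumerate(rules):
                lo, hi = rule[d]
                if lo <= s <= hi:
                    v |= 1 << i
            vectors.append(v)
        tables.append((starts, vectors))
    return tables


def classify(tables, packet):
    """Classification phase: look up each dimension independently, AND
    the bit vectors, and return the index of the lowest set bit."""
    combined = -1  # all bits set
    for (starts, vectors), value in zip(tables, packet):
        combined &= vectors[bisect_right(starts, value) - 1]
    if combined == 0:
        return None
    return (combined & -combined).bit_length() - 1


tables = preprocess(RULES, W)
print(classify(tables, (6, 10, 2)))  # 1, i.e. r2 in this hypothetical ruleset
```

The per-dimension lookups inside `classify` do not depend on each other, which is the property that lets them run on separate hardware units. Only the final AND needs all of the results.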
Two design mappings (recall: the IXP has 6 μEngines) • Consider multiple mappings of BitVector onto Intel’s IXP1200 microengines • Option 1: All processing for a single packet handled by one microengine (μEngine) - Parallel • Option 2: Processing for a single packet is split across μEngines - Pipelined
Evaluation platform • Intel IXP1200 Developer Workbench • Graphical IDE • Cycle-accurate simulator • Performance statistics • All experiments run within simulator • Configurable • Logging facility
Simulator configuration • IXP1200 chip • 1K microstore • Core frequency (~ 165 MHz) • 4 ports receive data • Simulations run until 75000 packets received by IXP • Simulator sends packets as fast as possible • Rulesets used • Experiments use a small, fixed set of rules • Availability of real-world firewall rulesets limited
Analysis • Overall, Parallel performs better than Pipelined • Pipelined: a single packet header in SDRAM is read multiple (3) times
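The difference in header reads can be made concrete with a toy model of the per-packet steps in each mapping. It is a sketch under stated assumptions: the slides report only that the Pipelined header is read 3 times, so the one-stage-per-dimension split, the engine labels, and the placement of the combine step are hypothetical. The model counts reads only and says nothing about cycles or latency.

```python
def parallel_steps(dims=3):
    """Parallel mapping: one microengine does everything for a packet,
    so the header is fetched from SDRAM once."""
    steps = [("engine0", "read header")]
    steps += [("engine0", f"lookup dim {d}") for d in range(dims)]
    steps.append(("engine0", "combine"))
    return steps


def pipelined_steps(dims=3):
    """Pipelined mapping (assumed: one stage per dimension): each stage
    runs on a different microengine and re-reads the header."""
    steps = []
    for d in range(dims):
        steps.append((f"engine{d}", "read header"))
        steps.append((f"engine{d}", f"lookup dim {d}"))
    steps.append((f"engine{dims - 1}", "combine"))
    return steps


def header_reads(steps):
    """Count SDRAM header reads in a per-packet step list."""
    return sum(1 for _, task in steps if task == "read header")


packets = 75000  # simulation length from the configuration slide
print(header_reads(parallel_steps()) * packets)   # 75000
print(header_reads(pipelined_steps()) * packets)  # 225000
```

Under these assumptions the Pipelined mapping issues three times as many header reads over the same run.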
Analysis • Aborted time is typically caused by branch instructions • Algorithms must reduce branch instructions to maximize throughput
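One place where branches can be removed is the last step of BitVector, which picks the highest-priority rule from the combined bit vector. The sketch below contrasts a loop with one conditional per rule against a branch-free bit manipulation. It only illustrates the transformation. Python does not model microengine aborted cycles, and the slides do not say whether the microengine instruction set offers an equivalent single operation. Both function names are hypothetical.

```python
def first_match_branchy(vector: int, num_rules: int) -> int:
    """Scan rule bits in priority order; one conditional per rule."""
    for i in range(num_rules):
        if vector & (1 << i):
            return i
    return -1


def first_match_bitwise(vector: int) -> int:
    """Isolate the lowest set bit with v & -v, then take its position.
    Returns -1 when no bit is set."""
    lowest = vector & -vector
    return lowest.bit_length() - 1


# 0b0110: rules 1 and 2 match, rule 1 has the higher priority.
print(first_match_branchy(0b0110, 4))  # 1
print(first_match_bitwise(0b0110))     # 1
```

The two functions agree on every 4-bit vector, including the all-zero case where no rule matches.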
Distribution of microengine time (figure with two panels: Parallel, Pipelined)
Analysis • High microengine idle time in Pipelined due to memory latency • Lower microengine aborted time in Pipelined due to what?
Discussion • Pipelined mappings can bottleneck through memory • Repeated memory reads to send work from μEngine to μEngine • Direct hardware support for pipelining required • IXP2xxx = next-neighbor registers • Currently re-examining our results on IXP2400 • Algorithms with fewer branch instructions result in better microengine utilization (lower aborted time)
Conclusion • Packet classification is a fundamental function • Parallel nature of NPs well-suited for parallel search algorithms
Conclusion • Network processors offer high packet processing speed and programmability • Performance of an algorithm depends on the design mapping chosen • Contributions • Demonstrated that mapping has considerable impact on performance • Pipelined mappings benefit from hardware support • Algorithms with fewer branch instructions result in better processor utilization
Future work • Analyze other mappings • Split work across different hardware threads in a single microengine • Placement of data structures in different memory banks • IXP2400 • Examine how hardware features change trade-offs in algorithm mapping • Algorithms designed specifically for network processors
Definitions • Packet classification: process of categorizing packets according to pre-defined rules • Classifier or ruleset: collection of rules • Dimension or field: packet header field used for classification • Rule: range of field values and action
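These definitions map directly onto a small data model. The sketch below is one possible encoding, with hypothetical names (`Rule`, `classify_linear`) and hypothetical rule values and action strings. It uses a plain linear search as a reference classifier, not the BitVector algorithm, and assumes that earlier rules take priority.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """A rule: one inclusive range per dimension, plus an action."""
    ranges: Tuple[Tuple[int, int], ...]
    action: str

    def matches(self, fields: Sequence[int]) -> bool:
        return all(lo <= f <= hi for (lo, hi), f in zip(self.ranges, fields))


def classify_linear(ruleset: Sequence[Rule], fields: Sequence[int]) -> Optional[str]:
    """Return the action of the first matching rule, or None."""
    for rule in ruleset:
        if rule.matches(fields):
            return rule.action
    return None


# Hypothetical two-dimensional classifier (ruleset).
ruleset = [
    Rule(ranges=((0, 7), (8, 15)), action="drop"),
    Rule(ranges=((0, 15), (0, 15)), action="forward"),
]
print(classify_linear(ruleset, (6, 10)))  # drop
print(classify_linear(ruleset, (9, 10)))  # forward
```

A linear scan like this costs one comparison pass per rule, which is the sequential behaviour that the parallel per-dimension lookup is meant to avoid.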
Packet classification algorithms (table legend) • N: number of rules • d: number of dimensions • W: maximum number of bits • l: number of levels occupied by a FIS-tree