Multi-dimensional Packet Classification on FPGA 100 Gbps and Beyond

Authors: Yaxuan Qi, Jeffrey Fong, Weirong Jiang, Bo Xu, Jun Li, Viktor Prasanna • Publisher: FPT 2010 • Presenter: Chun-Sheng Hsueh • Date: 2013/10/23

Introduction • An FPGA-based architecture, built on the HyperSplit algorithm, targets 100 Gbps packet classification. • Special logic allows two packets to be processed every clock cycle. • A node-merging algorithm is proposed to reduce the number of pipeline stages. • A leaf-pushing algorithm is designed to control memory usage.
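To make the 100 Gbps target concrete, the clock rate it implies can be worked out with a few lines of arithmetic. This is only a sketch: the 40-byte minimum packet size is an assumption commonly used in packet classification work and is not stated on the slide, and `required_clock_mhz` is a hypothetical helper name.

```python
def required_clock_mhz(target_gbps, packets_per_cycle=2, min_packet_bytes=40):
    """Clock frequency (MHz) needed to sustain target_gbps.

    Assumes worst-case traffic made only of minimum-size packets
    (40 bytes here, an assumption not taken from the slides).
    """
    bits_per_cycle = packets_per_cycle * min_packet_bytes * 8
    return target_gbps * 1e9 / bits_per_cycle / 1e6


# Two packets per cycle versus one packet per cycle at 100 Gbps.
print(required_clock_mhz(100))                       # dual-packet pipeline
print(required_clock_mhz(100, packets_per_cycle=1))  # single-packet pipeline
```

Under these assumptions, processing two packets per clock cycle halves the clock frequency the pipeline must close timing at.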

Introduction • Software solutions offer good flexibility and programmability, but they inherently lack high parallelism and abundant on-chip memory. • TCAM-based solutions can reach wire-speed performance, but they sacrifice scalability, programmability, and power efficiency. • TCAM-based solutions also suffer from the range-to-prefix conversion problem, which makes it difficult to support large and complex rule sets.
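The range-to-prefix conversion problem can be illustrated with a short sketch. A TCAM entry matches a prefix (fixed bits followed by wildcards), so an arbitrary range such as a port range has to be split into several prefixes. The function name `range_to_prefixes` and the `*` wildcard notation are illustrative choices, not taken from the paper.

```python
def range_to_prefixes(lo, hi, width):
    """Split the inclusive range [lo, hi] of width-bit values into prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo.
        size = lo & -lo if lo else 1 << width
        # Shrink the block until it fits in what is left of the range.
        while size > hi - lo + 1:
            size >>= 1
        bits = size.bit_length() - 1  # number of wildcard bits
        head = format(lo >> bits, "0{}b".format(width - bits)) if bits < width else ""
        prefixes.append(head + "*" * bits)
        lo += size
    return prefixes


# A 4-bit range [1, 14] needs six TCAM entries.
print(range_to_prefixes(1, 14, 4))
# The 16-bit port range [1024, 65535] also needs six entries.
print(len(range_to_prefixes(1024, 65535, 16)))
```

Because a rule with ranges in several fields needs the cross product of the per-field prefix lists, the number of TCAM entries can grow much faster than the number of rules.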

Background and Related Work • A. Problem Statement:

Background and Related Work • B. Packet Classification Algorithms:

Background and Related Work • C. Related Work • Jedhe et al. implemented the DCFL architecture on a Xilinx Virtex-2 Pro FPGA and achieved a throughput of 16 Gbps for 128 rules. • Luo et al. proposed a method called explicit range search to allow more cuts per node than the original HyperCuts algorithm. The tree height is reduced at the cost of extra memory consumption. • Jiang et al. proposed two optimization methods for the HyperCuts algorithm to reduce memory consumption. By deep pipelining, their FPGA implementation can achieve a throughput of 80 Gbps.

Architecture Design • A. Algorithm Motivation • Algorithm parallelism • Logic complexity • Memory efficiency

Architecture Design • A. Algorithm Motivation • HyperSplit is well suited to FPGA implementation: • First, the tree structure of the binary search can be pipelined to achieve a high throughput of one packet per clock cycle. • Second, the operation at each tree node is simple. Both the value comparison and the address fetch can be implemented efficiently with a small amount of logic.
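The per-node operation described above can be sketched in software as follows. This is a minimal illustration, not the authors' implementation: the `Node` and `Leaf` layouts, the `<=` comparison convention, and the tiny two-field example tree are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    rule_id: int  # index of the matching rule (hypothetical encoding)


@dataclass
class Node:
    dim: int    # which header field this node compares
    split: int  # threshold value for that field
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(root, packet):
    """Walk the tree: one comparison and one child fetch per level."""
    node, depth = root, 0
    while isinstance(node, Node):
        node = node.left if packet[node.dim] <= node.split else node.right
        depth += 1
    return node.rule_id, depth


# Two-field toy tree: split on field 0 at 127, then on field 1 at 79.
root = Node(0, 127, Node(1, 79, Leaf(0), Leaf(1)), Leaf(2))
print(classify(root, (10, 80)))  # (rule_id, levels visited)
print(classify(root, (200, 0)))
```

In a hardware pipeline each iteration of the loop would map to one stage, so a new packet can enter every clock cycle while earlier packets continue down the tree.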

Architecture Design • B. Basic Architecture

Architecture Design • B. Basic Architecture • When an update is initiated, a write bubble is inserted into the pipeline. Each write bubble is assigned a stage_identifier. • If the write enable bit is set, the write bubble uses the new content to update the memory at the stage specified by the stage_identifier.
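The write-bubble update can be modelled behaviourally in a few lines. This is a software sketch under stated assumptions, not the RTL from the paper: apart from `stage_identifier`, which the slide names, the field names (`write_enable`, `address`, `new_content`) and the `send_bubble` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WriteBubble:
    stage_identifier: int  # pipeline stage whose memory should be updated
    write_enable: bool     # the bubble only writes when this bit is set
    address: int           # hypothetical field: entry to overwrite
    new_content: int       # hypothetical field: new node contents


def send_bubble(stage_memories, bubble):
    """Pass the bubble through every stage in order; only the target writes."""
    written = []
    for stage, memory in enumerate(stage_memories):
        if bubble.write_enable and stage == bubble.stage_identifier:
            memory[bubble.address] = bubble.new_content
            written.append(stage)
    return written


# Three stages with four entries each; update entry 2 of stage 1.
memories = [[0] * 4 for _ in range(3)]
print(send_bubble(memories, WriteBubble(1, True, 2, 99)))
print(memories)
```

Because the bubble travels through the pipeline like an ordinary packet, packets ahead of it see the old tree and packets behind it see the new one, so lookups do not need to be stalled during an update.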

Architecture Design • C. Design Challenges • Number of pipeline stages: • The pipeline can be long for large rule sets. For example, with 10K ACL rules, the pipeline has 28 stages. • Block RAM usage: • Because the number of nodes is different at each level of the tree, the memory usage at each stage is not equal. In order to support on-the-fly rule update, the size of block RAMs for each pipeline stage needs to be determined during the implementation of the design.

ARCHITECTURE OPTIMIZATION • A. Pipeline Depth Reduction

ARCHITECTURE OPTIMIZATION • B. Controlled Block RAM Allocation • First, because the number of tree nodes in the top l levels is small (less than 4^l), it is not efficient to instantiate 1024-entry block RAMs for those levels. Instead, those entries are stored in distributed RAM. • Second, because pushing leaf nodes down does not change the search semantics of the original tree, and the number of leaf nodes is comparable to the number of non-leaf nodes, we can reshape the tree by pushing leaf nodes down to lower levels to reduce the memory usage at certain stages.
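The idea of reshaping the tree by pushing leaves down can be sketched with per-level node counts. The greedy policy below is an assumption chosen for illustration and is not claimed to be the paper's leaf-pushing algorithm; `push_leaves`, its arguments, and the example numbers are all hypothetical.

```python
def push_leaves(internal, leaves, capacity):
    """Greedy sketch: move excess leaves at level i down to level i + 1.

    internal[i], leaves[i]: node counts at tree level i.
    capacity[i]: number of memory entries available at pipeline stage i.
    Returns the leaf counts per level after pushing.
    """
    leaves = list(leaves)
    for i in range(len(leaves) - 1):
        excess = internal[i] + leaves[i] - capacity[i]
        if excess > 0:
            moved = min(excess, leaves[i])  # only leaves can be pushed down
            leaves[i] -= moved
            leaves[i + 1] += moved
    return leaves


# Level 2 holds 3 internal nodes and 5 leaves but has only 4 entries.
internal = [1, 2, 3, 0]
leaves = [0, 0, 5, 6]
capacity = [4, 4, 4, 16]
print(push_leaves(internal, leaves, capacity))
```

The total number of leaves is unchanged; only the stage at which each leaf is stored moves, which is why the per-stage memory size can be fixed in advance.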

PERFORMANCE EVALUATION • A. Test bed and data set • Rule sets range from 100 to 50,000 rules and are generated by ClassBench with acl1 seeds. • The FPGA device used in the tests is the Xilinx Virtex-6 (XC6VSX475T), containing 7,640 Kb of distributed RAM and 38,304 Kb of block RAM.

PERFORMANCE EVALUATION • B. Node-merging Optimization

PERFORMANCE EVALUATION • C. Leaf-pushing Optimization

PERFORMANCE EVALUATION • D. FPGA Performance
