
Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond


Presentation Transcript


  1. Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond Authors: Yaxuan Qi, Jeffrey Fong, Weirong Jiang, Bo Xu, Jun Li, Viktor Prasanna Publisher: FPT 2010 Presenter: Chun-Sheng Hsueh Date: 2013/10/23

  2. Introduction • An FPGA-based architecture based on HyperSplit, targeting 100 Gbps packet classification. • Special logic is designed to allow two packets to be processed every clock cycle. • A node-merging algorithm is proposed to reduce the number of pipeline stages. • A leaf-pushing algorithm is designed to control the memory usage.

  3. Introduction • Software solutions have good flexibility and programmability, but they inherently lack high parallelism and abundant on-chip memory. • Although TCAM-based solutions can reach wire-speed performance, they sacrifice scalability, programmability, and power efficiency. • TCAM-based solutions also suffer from a range-to-prefix conversion problem, making it difficult to support large and complex rule sets; a worked example follows below.
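The range-to-prefix problem is easy to see with a small worked example: covering the port range [1, 6] on a 3-bit field already requires four TCAM prefixes. The C sketch below is illustrative code, not from the paper; it expands an arbitrary range into a set of covering prefixes to show why range rules inflate TCAM entry counts.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical illustration (not from the paper): expand a range
 * [lo, hi] on a `width`-bit field into covering binary prefixes,
 * as a TCAM must. The range [1, 6] on 3 bits needs four prefixes:
 * 001, 01*, 10*, 110. */
static void expand_range(uint32_t lo, uint32_t hi, int width)
{
    while (lo <= hi) {
        /* Find the largest aligned block starting at lo inside [lo, hi]. */
        int bits = 0;
        while (bits < width &&
               (lo & ((1u << (bits + 1)) - 1)) == 0 &&
               lo + ((1u << (bits + 1)) - 1) <= hi)
            bits++;
        /* Print (width - bits) fixed bits followed by wildcards. */
        for (int i = width - 1; i >= bits; i--)
            putchar((lo >> i) & 1 ? '1' : '0');
        for (int i = 0; i < bits; i++)
            putchar('*');
        putchar('\n');
        lo += 1u << bits;
        if (lo == 0) break;   /* guard against 32-bit wrap-around */
    }
}

int main(void)
{
    expand_range(1, 6, 3);    /* emits 001, 01*, 10*, 110 */
    return 0;
}
```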

  4. Background and Related Work • A. Problem Statement:

  5. Background and Related Work • B. Packet Classification Algorithms:

  6. Background and Related Work • C. Related Work • Jedhe et al. implemented the DCFL architecture on a Xilinx Virtex-2 Pro FPGA and achieved a throughput of 16 Gbps for 128 rules. • Luo et al. proposed a method called explicit range search to allow more cuts per node than the original HyperCuts algorithm. The tree height is reduced at the cost of extra memory consumption. • Jiang et al. proposed two optimization methods for the HyperCuts algorithm to reduce memory consumption. By deep pipelining, their FPGA implementation can achieve a throughput of 80 Gbps.

  7. Architecture Design • A. Algorithm Motivation • Algorithm parallelism • Logic complexity • Memory efficiency

  8. Architecture Design

  9. Architecture Design • A. Algorithm Motivation • HyperSplit is well suited to FPGA implementation: • First, the tree structure of the binary search can be pipelined to achieve a high throughput of one packet per clock cycle. • Second, the operation at each tree node is simple: both the value comparison and the address fetch can be implemented efficiently with a small amount of logic, as modeled below.
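A minimal software model of one such tree node follows; the struct layout and field names are assumptions for illustration, not the paper's exact node encoding. The point is that each pipeline stage reduces to one comparison and one address select, which map to a comparator and a multiplexer in hardware.

```c
#include <stdint.h>

/* Assumed HyperSplit node layout: which field to test, the split
 * value, and the two child addresses in the next stage's memory. */
typedef struct {
    uint8_t  field;     /* which packet header field to test */
    uint32_t threshold; /* split point on that field */
    uint32_t left;      /* next-stage address for the <= subtree */
    uint32_t right;     /* next-stage address for the > subtree */
} hs_node_t;

/* One clock cycle of the pipeline: compare the selected field of the
 * 5-tuple key against the split value and fetch the next address. */
static uint32_t hs_stage(const hs_node_t *node, const uint32_t key[5])
{
    return (key[node->field] <= node->threshold) ? node->left
                                                 : node->right;
}
```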

  10. Architecture Design • B. Basic Architecture

  11. Architecture Design

  12. Architecture Design • B. Basic Architecture • When an update is initiated, a write bubble is inserted into the pipeline. Each write bubble is assigned a stage_identifier. • If the write-enable bit is set, the write bubble uses the new content to update the memory at the stage specified by its stage_identifier, as sketched below.
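A small C model of this update path, reusing the hs_node_t from the earlier sketch; the struct and field names here are illustrative, not the paper's register layout. The bubble travels down the pipeline like a packet, so search traffic never observes a half-updated tree.

```c
/* Assumed write-bubble contents: target stage, write enable,
 * node address within that stage's memory, and the new node. */
typedef struct {
    uint8_t   stage_identifier; /* stage whose memory to update */
    uint8_t   write_enable;     /* 1 = commit the write at that stage */
    uint32_t  address;          /* node slot within the stage memory */
    hs_node_t new_content;      /* replacement node */
} write_bubble_t;

/* Each stage inspects the bubble as it passes; only the stage whose
 * number matches stage_identifier (and with write_enable set)
 * commits the write to its local node memory. */
static void stage_update(int this_stage, hs_node_t *stage_mem,
                         const write_bubble_t *b)
{
    if (b->write_enable && b->stage_identifier == this_stage)
        stage_mem[b->address] = b->new_content;
}
```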

  13. Architecture Design • C. Design Challenges • Number of pipeline stages: • The pipeline can be long for large rule sets; for example, with 10K ACL rules the pipeline has 28 stages. • Block RAM usage: • Because the number of nodes differs at each level of the tree, memory usage is unequal across stages. To support on-the-fly rule updates, the size of the block RAM for each pipeline stage must be determined when the design is implemented.

  14. ARCHITECTURE OPTIMIZATION • A. Pipeline Depth Reduction
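The transcript carries no detail for this slide, so the sketch below is an assumption about the mechanism: it models depth reduction by merging a binary node with its two children into a single 4-way node, so one pipeline stage resolves two levels of the search tree and a 28-stage pipeline shrinks toward 14 stages. The merged layout is illustrative, not the paper's exact encoding.

```c
#include <stdint.h>

/* Assumed merged node: the parent test plus its two child tests,
 * and the four next-stage addresses for the combined outcomes. */
typedef struct {
    uint8_t  field[3];     /* fields tested by parent, left, right child */
    uint32_t threshold[3]; /* their split values */
    uint32_t child[4];     /* next-stage addresses for the four outcomes */
} merged_node_t;

/* One merged stage: the first comparison selects which child test to
 * apply, the second selects one of four subtrees. Both comparisons
 * still fit within a single pipeline stage. */
static uint32_t merged_stage(const merged_node_t *n, const uint32_t key[5])
{
    int right  = key[n->field[0]] > n->threshold[0];
    int idx    = right ? 2 : 1;   /* which child test to apply */
    int second = key[n->field[idx]] > n->threshold[idx];
    return n->child[(right << 1) | second];
}
```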

  15. ARCHITECTURE OPTIMIZATION

  16. ARCHITECTURE OPTIMIZATION

  17. ARCHITECTURE OPTIMIZATION • B. Controlled Block RAM Allocation • First, because the number of tree nodes in the top l levels is small (fewer than 4^l), it is not efficient to instantiate 1024-entry block RAMs for those levels. Instead, those nodes are stored in distributed RAM. • Second, because pushing down the leaf nodes does not change the search semantics of the original tree and the number of leaf nodes is comparable to that of non-leaf nodes, we can reshape the tree by pushing leaf nodes down to lower levels to reduce the memory usage at certain stages; a sketch follows below.
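A hedged sketch of the leaf-pushing idea: the recursive form, node encoding, and helper names below are assumptions, not the paper's code. A leaf above the target depth is replaced by a pass-through node whose only child is the original leaf, one level deeper, so every matching packet still reaches the same rule while the leaf's memory moves to a less crowded stage.

```c
#include <stdlib.h>

/* Assumed software tree node for the transformation. */
typedef struct node {
    int          is_leaf;
    int          rule_id;      /* valid when is_leaf */
    struct node *left, *right; /* valid when !is_leaf */
} tnode_t;

/* Push every leaf above target_depth one level at a time until it
 * reaches target_depth; search semantics are unchanged. */
static void push_leaves(tnode_t *n, int depth, int target_depth)
{
    if (!n) return;
    if (n->is_leaf && depth < target_depth) {
        /* Replace the leaf with a pass-through node; both outcomes
         * of the (now irrelevant) comparison reach the same leaf. */
        tnode_t *leaf = malloc(sizeof *leaf);
        *leaf = *n;
        n->is_leaf = 0;
        n->left  = leaf;
        n->right = leaf;
        push_leaves(n->left, depth + 1, target_depth); /* keep pushing */
        return;
    }
    if (!n->is_leaf) {
        push_leaves(n->left,  depth + 1, target_depth);
        push_leaves(n->right, depth + 1, target_depth);
    }
}
```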

  18. ARCHITECTURE OPTIMIZATION • B. Controlled Block RAM Allocation

  19. PERFORMANCE EVALUATION • A. Test bed and data set • Rule sets range in size from 100 to 50,000 rules and are generated by ClassBench with acl1 seeds. • The FPGA device used in the tests is a Xilinx Virtex-6 (XC6VSX475T), containing 7,640 Kb of distributed RAM and 38,304 Kb of block RAM.

  20. PERFORMANCE EVALUATION • B. Node-merging Optimization

  21. PERFORMANCE EVALUATION • C. Leaf-pushing Optimization

  22. PERFORMANCE EVALUATION • D. FPGA Performance

  23. PERFORMANCE EVALUATION • D. FPGA Performance
