
Design and Implementation of a Stateful Network Packet Processing Framework for GPUs

This presentation covers the design and implementation of GASPP, a modular, fast, and flexible framework for stateful packet processing on GPUs: efficient packet scheduling mechanisms, TCP processing and flow management, the trade-off between staging and zero-copy transfers, and handling dynamic irregularities in packet processing.


Presentation Transcript


  1. Design and Implementation of a Stateful Network Packet Processing Framework for GPUs Authors: Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, Sotiris Ioannidis Publisher/Conference: IEEE/ACM Transactions on Networking, Vol. 25, No. 1, February 2017 Referenced: 41 Presenter: 林宇翔 Date: 2019/04/10 Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C.

  2. Main features of GASPP • Fast user-space packet capturing • Modular and flexible • Efficient packet scheduling mechanisms • TCP processing and flow management support National Cheng Kung University CSIE Computer & Internet Architecture Lab

  3. Fast user-space packet capturing • How are packets transported from the CPU to GPU devices? (a) Network packets are transferred to the page-locked memory of the NIC, then copied to the page-locked memory of the GPU, and from there finally transferred to the GPU. (b) By modifying the netmap module: the shared packet buffer is added to the internal tracking mechanism of the CUDA driver, so that transfer calls are automatically accelerated, as the buffer can be accessed directly by the GPU.

  4. Why staging is better than zero-copy (for small packets) • To improve PCIe throughput, we batch several packets and transfer them at once. • Although making the NIC’s packet queue directly accessible to the GPU eliminates redundant copies, the fixed-size partitioning of the NIC’s queue leads to redundant data transfers for traffic with many small packets. • Occasionally it is better to copy packets back-to-back into a second buffer and transfer that buffer to the GPU. GASPP dynamically switches to the optimal approach by monitoring the actual utilization of the slots.

  5. Modular and flexible • Basic abstraction of processing: “modules” • __device__ uint processEth(unsigned pktid, ethhdr *eth, uint cxtkey); • __device__ uint processIP(unsigned pktid, ethhdr *eth, iphdr *ip, uint cxtkey); • __device__ uint processUDP(unsigned pktid, ethhdr *eth, iphdr *ip, udphdr *udp, uint cxtkey); • __device__ uint processTCP(unsigned pktid, ethhdr *eth, iphdr *ip, tcphdr *tcp, uint cxtkey); • __device__ uint processStream(unsigned pktid, ethhdr *eth, iphdr *ip, tcphdr *tcp, uchar *chunk, unsigned chunklen, uint cxtkey); • Modules are executed sequentially or in parallel.

  6. Modular and flexible - Batch Processing Pipeline

  7. Efficient packet scheduling mechanisms - SIMD • Schedule incoming packets onto GPU cores. • Threads within the same warp have to execute the same instructions for the computation to stay regular.

  8. Efficient packet scheduling mechanisms – Parallelism in packet processing • Network packets are processed in batches. • Whenever a new batch of packets is available, it is processed in parallel using a number of threads equal to the number of packets in the batch. • As threads within the same warp have to execute the same instructions, load imbalance and code-flow divergence within a warp can cause inefficiencies. • The mix of received network packets is highly dynamic.

  9. Efficient packet scheduling mechanisms - Dynamic Irregularities • Different packet lengths • Divergent parallel module processing: packets belonging to different protocols may be processed by different modules.

  10. Efficient packet scheduling mechanisms - Dynamic Irregularities

  11. Efficient packet scheduling mechanisms - Dynamic Irregularities • To group a batch of packets on the GPU, we adapted a GPU-based radix sort implementation (instead of fully sorting a batch of packets, we group them into bins, comparing only a subset of the bits). We assign a separate weight to each packet, consisting of the byte concatenation of the ip_proto field of its IP header, the value of the context key returned by the previously executed module, and its length.

  12. TCP processing and flow management support – TCP Flow State Management • GASPP enables applications to access the processed data through a global key-value store. The key-value store is implemented as a hash table that allows insertion, retrieval, deletion, and update of data objects from both the host and the device. • Internally, data objects are hashed and mapped to a given bucket. If the bucket is already occupied, the thread allocates a new bucket node, using the on-GPU malloc(), and inserts the entry at the front of the bucket’s list. • To enable GPU threads to add or remove nodes from the table in parallel, we associate an atomic lock with each bucket, so that only a single thread can make changes to a given bucket at a time.

  13. TCP processing and flow management support – TCP Flow State Management • Maintaining the state of TCP connections: 17-byte records • Hash key (4 bytes): hash of the source IP, destination IP, and TCP ports • State (1 byte): TCP connection state • Client sequence (4 bytes): most recently received client-side sequence number • Server sequence (4 bytes): most recently received server-side sequence number • Next (4 bytes): pointer to the next record (if any)

  14. TCP processing and flow management support – TCP Stream Reassembly • Since GASPP processes packets in batches, it is very likely that a batch contains several packets of the same connection. • Each packet is hashed twice: • hash(addr_s, addr_d, port_s, port_d, seq) • hash(addr_s, addr_d, port_s, port_d, seq + len) • Two packets x and y are consecutive if: • hash(4-tuple, seq_x + len_x) = hash(4-tuple, seq_y)
