
Design and Implementation of a Stateful Network Packet Processing Framework for GPUs

This presentation covers the design and implementation of GASPP, a modular, fast, and flexible framework for stateful packet processing on GPUs: efficient packet scheduling mechanisms, TCP processing and flow management, the trade-off between staging and zero-copy transfers, and handling dynamic irregularities in packet processing.


Presentation Transcript


  1. Design and Implementation of a Stateful Network Packet Processing Framework for GPUs Authors: Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, Sotiris Ioannidis Publisher/Conference: IEEE/ACM Transactions on Networking, Vol. 25, No. 1, February 2017 Referenced: 41 Presenter: 林宇翔 Date: 2019/04/10 Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C.

  2. Main features of GASPP • Fast user-space packet capturing • Modular and flexible • Efficient packet scheduling mechanisms • TCP processing and flow management support National Cheng Kung University CSIE Computer & Internet Architecture Lab

  3. Fast user-space packet capturing • How are packets transported from the CPU to GPU devices? (a) Network packets are transferred to the page-locked memory of the NIC, then copied to the page-locked memory of the GPU, and from there finally transferred to the GPU. (b) By modifying the netmap module: the shared packet buffer is added to the internal tracking mechanism of the CUDA driver, so that transfer calls are automatically accelerated, as the buffer can be accessed directly by the GPU.

  4. Why staging is better than zero-copy (for small packets) • To improve PCIe throughput, we batch several packets and transfer them at once. • Although making the NIC’s packet queue directly accessible to the GPU eliminates redundant copies, the fixed-size partitioning of the NIC’s queue leads to redundant data transfers for traffic with many small packets. • Occasionally it is better to copy packets back-to-back into a second buffer and transfer that buffer to the GPU. GASPP dynamically switches to the optimal approach by monitoring the actual utilization of the slots.

  5. Modular and flexible • Basic abstraction of processing: “modules” • __device__ uint processEth(unsigned pktid, ethhdr *eth, uint cxtkey); • __device__ uint processIP(unsigned pktid, ethhdr *eth, iphdr *ip, uint cxtkey); • __device__ uint processUDP(unsigned pktid, ethhdr *eth, iphdr *ip, udphdr *udp, uint cxtkey); • __device__ uint processTCP(unsigned pktid, ethhdr *eth, iphdr *ip, tcphdr *tcp, uint cxtkey); • __device__ uint processStream(unsigned pktid, ethhdr *eth, iphdr *ip, tcphdr *tcp, uchar *chunk, unsigned chunklen, uint cxtkey); • Modules are executed sequentially or in parallel.

  6. Modular and flexible - Batch Processing Pipeline

  7. Efficient packet scheduling mechanisms - SIMD • Schedule incoming packets onto GPU cores. • Threads within the same warp have to execute the same instructions for the computation to stay regular.

  8. Efficient packet scheduling mechanisms – Parallelism in packet processing • Network packets are processed in batches. • Whenever a new batch of packets is available, it is processed in parallel using a number of threads equal to the number of packets in the batch. • As threads within the same warp have to execute the same instructions, load imbalance and code-flow divergence within a warp can cause inefficiencies. • The mix of received network packets is highly dynamic.

  9. Efficient packet scheduling mechanisms - Dynamic Irregularities • Different packet lengths • Divergent parallel module processing: packets belonging to different protocols may be processed by different modules.

  10. Efficient packet scheduling mechanisms - Dynamic Irregularities

  11. Efficient packet scheduling mechanisms - Dynamic Irregularities • To group a batch of packets on the GPU, we adapted a GPU-based radix sort implementation (instead of fully sorting a batch of packets, we group them into bins, comparing only a subset of the bits). We assign a separate weight to each packet, consisting of the byte concatenation of the ip_proto field of its IP header, the value of the context key returned by the previously executed module, and its length.

  12. TCP processing and flow management support – TCP Flow State Management • GASPP enables applications to access the processed data through a global key-value store. The key-value store is implemented as a hash table that allows insertion, retrieval, deletion, and update of data objects from both the host and the device. • Internally, data objects are hashed and mapped to a given bucket. If the bucket is already occupied, the thread allocates a new bucket node, using the on-GPU malloc(), and inserts the entry at the front of the bucket’s list. • To enable GPU threads to add or remove nodes from the table in parallel, we associate an atomic lock with each bucket, so that only a single thread can make changes to a given bucket at a time.

  13. TCP processing and flow management support – TCP Flow State Management • Maintaining the state of TCP connections: 17-byte records • Hash key (4 bytes): hash of the source IP, destination IP, and TCP ports • State (1 byte): TCP connection state • Client sequence (4 bytes): most recently received client-side sequence number • Server sequence (4 bytes): most recently received server-side sequence number • Next (4 bytes): pointer to the next record (if any)

  14. TCP processing and flow management support – TCP Stream Reassembly • Since GASPP processes packets in batches, it is very likely that a batch contains several packets of the same connection. • Each packet is hashed twice: • hash(addr_s, addr_d, port_s, port_d, seq) • hash(addr_s, addr_d, port_s, port_d, seq + len) • Two packets x and y are consecutive if: • hash(4-tuple, seq_x + len_x) = hash(4-tuple, seq_y)
