
PacketShader: A GPU-Accelerated Software Router

PacketShader is a high-performance software router that uses GPU acceleration to reach 40 Gbps on a single machine. The talk explains how optimized packet I/O and GPU offloading overcome the CPU bottleneck of conventional software routers. It covers the GPU's advantages for packet processing (raw computation power, hiding of memory access latency, and memory bandwidth), the PacketShader design, a packet's journey through the pre-shader, shader, and post-shader stages, and scaling with multi-core CPUs. The evaluation reports GPU speedups for IPv6 forwarding, IPsec tunneling, and other network workloads.

Presentation Transcript


  1. PacketShader: A GPU-Accelerated Software Router • Sangjin Han† • In collaboration with: Keon Jang†, KyoungSoo Park‡, Sue Moon† • † Advanced Networking Lab, CS, KAIST • ‡ Networked and Distributed Computing Systems Lab, EE, KAIST

  2. PacketShader: A GPU-Accelerated Software Router • High-performance • Our prototype: 40 Gbps on a single box

  3. Software Router • Despite its name, not limited to IP routing • You can implement whatever you want on it. • Driven by software • Flexible • Friendly development environments • Based on commodity hardware • Cheap • Fast evolution

  4. Now 10 Gigabit NIC is a commodity • From $200 – $300 per port • Great opportunity for software routers

  5. Achilles’ Heel of Software Routers • Low performance • Due to CPU bottleneck • Not capable of supporting even a single 10G port

  6. CPU Bottleneck

  7. Per-Packet CPU Cycles for 10G • Cycles needed • IPv4: 1,200 (packet I/O) + 600 (IPv4 lookup) = 1,800 cycles • IPv6: 1,200 (packet I/O) + 1,600 (IPv6 lookup) = 2,800 cycles • IPsec: 1,200 (packet I/O) + 5,400 (encryption and hashing) = 6,600 cycles • Your budget: 1,400 cycles • Setup: 10G, min-sized packets, dual quad-core 2.66 GHz CPUs (x86; cycle numbers are from RouteBricks [Dobrescu09] and ours)
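The 1,400-cycle budget can be reproduced with a short calculation. The sketch below is not from the talk; it assumes that each 64-byte frame occupies an extra 20 bytes on the wire (8-byte preamble plus 12-byte inter-frame gap), which is the usual way to arrive at roughly 14.88 Mpps for 10 GbE. All names are illustrative.

```python
# Per-packet CPU cycle budget at 10 Gbps with minimum-sized packets.
# Assumption (not on the slide): each 64-byte frame also occupies 20 bytes
# on the wire (8-byte preamble + 12-byte inter-frame gap).

LINK_BPS = 10e9            # 10 GbE
FRAME_BYTES = 64           # minimum Ethernet frame
WIRE_OVERHEAD_BYTES = 20   # preamble + inter-frame gap (assumed)
CORES = 8                  # dual quad-core
CLOCK_HZ = 2.66e9          # 2.66 GHz per core


def packets_per_second(link_bps=LINK_BPS):
    """Packet rate of a saturated link carrying only minimum-sized frames."""
    return link_bps / ((FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8)


def cycle_budget(link_bps=LINK_BPS):
    """CPU cycles available per packet when all cores share the load."""
    return CORES * CLOCK_HZ / packets_per_second(link_bps)


# Cycle costs quoted on the slide: packet I/O plus the application-specific part.
COSTS = {"IPv4": 1200 + 600, "IPv6": 1200 + 1600, "IPsec": 1200 + 5400}

if __name__ == "__main__":
    budget = cycle_budget()
    print(f"packet rate: {packets_per_second() / 1e6:.2f} Mpps")
    print(f"budget: {budget:.0f} cycles per packet")
    for name, cost in COSTS.items():
        print(f"{name}: {cost} cycles, {cost / budget:.1f}x the budget")
```

Under these assumptions the budget comes out at about 1,430 cycles, which the slide rounds to 1,400, and every workload listed exceeds it.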

  8. Our Approach 1: I/O Optimization • Packet I/O: 1,200 reduced to 200 cycles per packet • Main ideas • Huge packet buffer • Batch processing • (Figure: per-packet cycle breakdown for IPv4, IPv6, and IPsec, as on the previous slide)

  9. Our Approach 2: GPU Offloading • GPU offloading for • Memory-intensive or • Compute-intensive operations • Main topic of this talk • (Figure: per-packet cycles besides packet I/O: IPv4 lookup 600, IPv6 lookup 1,600, encryption and hashing 5,400)

  10. What Is GPU?

  11. GPU = Graphics Processing Unit • The heart of graphics cards • Mainly used for real-time 3D game rendering • Massively-parallel processing capacity • (Image: Ubisoft’s Avatar, from http://ubi.com)

  12. CPU vs. GPU CPU: Small # of super-fast cores GPU: Large # of small cores

  13. “Silicon Budget” in CPU and GPU • Xeon X5550: 4 cores, 731M transistors • GTX480: 480 cores, 3,200M transistors • (Figure label: ALU)

  14. GPU for Packet Processing

  15. Advantages of GPU for Packet Processing • Raw computation power • Memory access latency • Memory bandwidth • Comparison between • Intel X5550 CPU • NVIDIA GTX480 GPU

  16. (1/3) Raw Computation Power • Compute-intensive operations in software routers • Hashing, encryption, pattern matching, network coding, compression, etc. • GPU can help! • Instructions/sec: CPU < GPU • CPU: 43×10^9 = 2.66 (GHz) × 4 (# of cores) × 4 (4-way superscalar) • GPU: 672×10^9 = 1.4 (GHz) × 480 (# of cores)
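As a quick check of the two peak figures, the following sketch multiplies out the numbers given on the slide. The function name and the `issue_width` parameter are illustrative; these are theoretical peak rates, not measured throughput.

```python
def peak_instructions_per_sec(clock_ghz, cores, issue_width=1):
    """Theoretical peak: clock rate x number of cores x instructions per cycle."""
    return clock_ghz * 1e9 * cores * issue_width


# Xeon X5550: 2.66 GHz, 4 cores, 4-way superscalar (figures from the slide).
cpu = peak_instructions_per_sec(2.66, 4, issue_width=4)
# GTX480: 1.4 GHz, 480 cores (figures from the slide).
gpu = peak_instructions_per_sec(1.4, 480)

if __name__ == "__main__":
    print(f"CPU: {cpu / 1e9:.1f} G instructions/s")
    print(f"GPU: {gpu / 1e9:.1f} G instructions/s")
    print(f"GPU/CPU ratio: {gpu / cpu:.1f}")
```

The ratio works out to a little under 16 in favor of the GPU.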

  17. (2/3) Memory Access Latency • Software router → lots of cache misses • GPU can effectively hide memory latency • (Figure: on a cache miss, a GPU core switches to Thread 2; on the next cache miss, to Thread 3)

  18. (3/3) Memory Bandwidth CPU’s memory bandwidth (theoretical): 32 GB/s

  19. (3/3) Memory Bandwidth • 1. RX: NIC → RAM • 2. RX: RAM → CPU • 3. TX: CPU → RAM • 4. TX: RAM → NIC • CPU’s memory bandwidth (empirical) < 25 GB/s

  20. (3/3) Memory Bandwidth • Your budget for packet processing can be less than 10 GB/s

  21. (3/3) Memory Bandwidth • Your budget for packet processing can be less than 10 GB/s • GPU’s memory bandwidth: 174 GB/s

  22. How to Use GPU

  23. Basic Idea • Offload core operations to GPU (e.g., forwarding table lookup)

  24. Recap • For GPU, more parallelism, more throughput GTX480: 480 cores

  25. Parallelism in Packet Processing • The key insight • Stateless packet processing = parallelizable • (Figure: RX queue → 1. Batching → 2. Parallel processing in GPU)

  26. Batching → Long Latency? • Fast link = enough # of packets in a small time window • 10 GbE link • up to 1,000 packets in only 67 μs • Much less time with 40 or 100 GbE
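The 67 μs figure follows from the line rate. The sketch below is an illustration, not part of the talk; it assumes minimum-sized 64-byte frames with 20 bytes of per-frame wire overhead (preamble plus inter-frame gap) on a saturated link.

```python
def batch_fill_time_us(n_packets, link_gbps, frame_bytes=64, overhead_bytes=20):
    """Microseconds a saturated link needs to deliver n_packets frames.

    frame_bytes and overhead_bytes are assumptions: minimum Ethernet frame
    plus preamble and inter-frame gap.
    """
    bits = n_packets * (frame_bytes + overhead_bytes) * 8
    return bits / (link_gbps * 1e9) * 1e6


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        t = batch_fill_time_us(1000, gbps)
        print(f"{gbps:>3} GbE: 1,000 packets arrive in {t:.1f} us")
```

With these assumptions a 1,000-packet batch fills in about 67 μs at 10 GbE, and proportionally faster at 40 or 100 GbE, so the latency added by batching shrinks as links get faster.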

  27. PacketShader Design

  28. Basic Design • Three stages in a pipeline: Pre-shader → Shader → Post-shader

  29. Packet’s Journey (1/3) • IPv4 forwarding example • Pre-shader: • Checksum, TTL • Format check • … • Collected dst. IP addrs • Some packets go to slow-path

  30. Packet’s Journey (2/3) • IPv4 forwarding example • Shader: 1. IP addresses → 2. Forwarding table lookup → 3. Next hops

  31. Packet’s Journey (3/3) • IPv4 forwarding example • Post-shader: Update packets and transmit
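The three stages of the IPv4 example can be sketched in plain Python. This is only an illustration of the data flow, not PacketShader's code: the `Packet` class, the function names, and the table format are hypothetical, the shader runs as an ordinary loop standing in for a GPU kernel, and sending packets with TTL ≤ 1 or a bad checksum to the slow path is an assumption about what the pre-shader checks imply.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Packet:  # hypothetical, minimal stand-in for a received IPv4 packet
    dst: str
    ttl: int
    checksum_ok: bool = True
    out_port: Optional[int] = None


def pre_shader(batch):
    """Check each packet; collect destination addresses for the shader."""
    fast, slow = [], []
    for pkt in batch:
        if pkt.checksum_ok and pkt.ttl > 1:
            fast.append(pkt)
        else:
            slow.append(pkt)  # assumed slow-path cases: bad checksum, TTL expiry
    return fast, [pkt.dst for pkt in fast], slow


def lookup(table, dst):
    """Longest prefix match by linear scan (fine for a sketch)."""
    addr, best = ip_address(dst), None
    for net, port in table:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None


def shader(dst_addrs, table):
    """Stand-in for the GPU kernel: one independent lookup per address."""
    return [lookup(table, dst) for dst in dst_addrs]


def post_shader(fast, next_hops):
    """Update packets with the lookup results and hand them to TX."""
    out = []
    for pkt, port in zip(fast, next_hops):
        if port is None:
            continue  # no route: dropped in this sketch
        pkt.ttl -= 1
        pkt.out_port = port
        out.append(pkt)
    return out


TABLE = [(ip_network("0.0.0.0/0"), 0),
         (ip_network("10.0.0.0/8"), 1),
         (ip_network("10.1.0.0/16"), 2)]

if __name__ == "__main__":
    batch = [Packet("10.1.2.3", 64), Packet("10.9.9.9", 64), Packet("8.8.8.8", 1)]
    fast, addrs, slow = pre_shader(batch)
    for pkt in post_shader(fast, shader(addrs, TABLE)):
        print(pkt)
```

Because each lookup in `shader` depends only on its own address, the whole batch can be processed in parallel, which is what makes this stage a natural fit for the GPU.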

  32. Interfacing with NICs • (Figure: Packet RX → Device driver → Pre-shader → Shader → Post-shader → Device driver → Packet TX)

  33. Scaling with a Multi-Core CPU • (Figure labels: Master core: Shader • Worker cores: Device driver, Pre-shader, Post-shader, Device driver)

  34. Scaling with Multiple Multi-Core CPUs Shader Device driver Pre-shader Post-shader Device driver Shader

  35. Evaluation

  36. Hardware Setup • CPU: quad-core, 2.66 GHz (total 8 CPU cores) • NIC: dual-port 10 GbE (total 80 Gbps) • GPU: 480 cores, 1.4 GHz (total 960 cores)

  37. Experimental Setup • Packet generator and PacketShader connected by 8 × 10 GbE links (up to 80 Gbps) • (Figure labels: Input traffic, Processed packets)

  38. Results (w/ 64B packets) • GPU speedup (bar labels in the figure): 1.4x, 4.8x, 2.1x, 3.5x

  39. Example 1: IPv6 forwarding • Longest prefix matching on 128-bit IPv6 addresses • Algorithm: binary search on hash tables [Waldvogel97] • 7 hashings + 7 memory accesses • (Figure: hash tables indexed by prefix length, with marks at 1, 64, 80, 96, and 128)
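The idea behind binary search on hash tables can be sketched as follows: keep one hash table per prefix length, probe the middle length first, go longer on a hit and shorter on a miss. This is a simplified CPU-side illustration of the [Waldvogel97] technique, not PacketShader's GPU implementation. The helper names are hypothetical, markers with a precomputed best match are inserted naively, and the probe count is logarithmic in the address width (this midpoint rule needs at most 8 probes for 128 bits, in line with the roughly log2(128) = 7 quoted on the slide).

```python
from ipaddress import IPv6Address, IPv6Network

W = 128  # IPv6 address width in bits


def _probe_path(length):
    """Prefix lengths the binary search visits before reaching `length`."""
    lo, hi, path = 1, W, []
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == length:
            break
        path.append(mid)
        if mid < length:
            lo = mid + 1
        else:
            hi = mid - 1
    return path


def _best_hop(prefixes, bits, length):
    """Next hop of the longest real prefix covering the `length`-bit string."""
    for l in range(length, 0, -1):
        hop = prefixes.get((bits >> (length - l), l))
        if hop is not None:
            return hop
    return None


def build(routes):
    """routes: {'prefix/len': next_hop}. Returns (per-length tables, default)."""
    prefixes = {}
    for cidr, hop in routes.items():
        net = IPv6Network(cidr)
        bits = int(net.network_address) >> (W - net.prefixlen)
        prefixes[(bits, net.prefixlen)] = hop
    tables = {}
    for bits, length in prefixes:
        if length == 0:
            continue  # the default route is kept outside the tables
        # Real entry at `length`, plus markers at shorter lengths on the path
        # so that the search is steered toward longer prefixes.
        for m in _probe_path(length) + [length]:
            if m <= length:
                key = bits >> (length - m)
                tables.setdefault(m, {})[key] = _best_hop(prefixes, key, m)
    return tables, prefixes.get((0, 0))


def lookup(tables, default, addr):
    """Returns (next_hop, number_of_hash_probes)."""
    a = int(IPv6Address(addr))
    lo, hi, best, probes = 1, W, default, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        table = tables.get(mid, {})
        key = a >> (W - mid)
        if key in table:          # hit: remember best match, try longer
            if table[key] is not None:
                best = table[key]
            lo = mid + 1
        else:                     # miss: only shorter prefixes can match
            hi = mid - 1
    return best, probes


if __name__ == "__main__":
    tables, default = build({"::/0": "D", "2001:db8::/32": "A",
                             "2001:db8:1::/48": "B", "2001:db8:1:2::/64": "C"})
    print(lookup(tables, default, "2001:db8:1:2::9"))
```

Each probe is one hash computation plus one memory access, and lookups for different packets are independent, which is why the workload maps well onto many GPU threads.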

  40. Example 1: IPv6 forwarding • Bounded by motherboard I/O capacity • (Routing table was randomly generated with 200K entries)

  41. Example 2: IPsec tunneling • ESP (Encapsulating Security Payload) tunnel mode • with AES-CTR (encryption) and SHA1 (authentication) • Original IP packet: IP header | IP payload • 1. AES over: IP header | IP payload | ESP trailer • 2. SHA1 over: ESP header | IP header | IP payload | ESP trailer • IPsec packet: New IP header | ESP header | IP header | IP payload | ESP trailer | ESP Auth.

  42. Example 2: IPsec tunneling • 3.5x speedup

  43. Kernel User

  44. Conclusions • GPU • a great opportunity for fast packet processing • PacketShader • Optimized packet I/O + GPU acceleration • scalable with • # of multi-core CPUs, GPUs, and high-speed NICs • Current Prototype • Supports IPv4, IPv6, OpenFlow, and IPsec • 40 Gbps performance on a single PC

  45. Future Work • Control plane integration • Dynamic routing protocols with Quagga or Xorp • Multi-functional, modular programming environment • Integration with Click? [Kohler99] • Opportunistic offloading • CPU at low load • GPU at high load • Stateful packet processing

  46. THANK YOU • Questions? (in CLEAR, SLOW, and EASY ENGLISH, please) • Personal AD (for professors): • I am a prospective grad student. • Going to apply this fall for my PhD study. • Less personal AD (for all): • Keon will present SSLShader in the poster session!

  47. Backup slides

  48. Latency: IPv6 forwarding

  49. Packet I/O Performance

  50. IPv4 Forwarding Performance
