Explore the development status of NVMe™/TCP protocol through a case study of SPDK user space solution. Discover key improvements, performance optimizations, and comparisons with other protocols. Learn about interrupt affinity, large transfer optimizations, throughput comparisons, scalability, and future enhancements. Delve into the impact of mixed workloads and commercial performance of NVMe™/TCP controllers.
NVMe™/TCP Development Status and a Case Study of SPDK User Space Solution. 2019 NVMe™ Annual Members Meeting and Developer Day, March 19, 2019. Sagi Grimberg, Lightbits Labs; Ben Walker and Ziye Yang, Intel
NVMe™/TCP Status • TP ratified @ Nov 2018 • Linux Kernel NVMe/TCP inclusion made v5.0 • Interoperability tested with vendors and SPDK • Running in large-scale production environments (backported though) • Main TODOs: • TLS support • Connection Termination rework • I/O Polling (leverage .sk_busy_loop() for polling) • Various performance optimizations (mainly on the host driver) • A few minor Specification wording issues to fixup
Performance: Interrupt Affinity • In NVMe™ we pay close attention to steering an interrupt to the application CPU core • In TCP Networking: • TX interrupts are usually steered to the submitting CPU core (XPS) • RX interrupt steering is determined by: Hash(5-tuple) • That is not local to the application CPU core • But, aRFS comes to the rescue! • RPS mechanism is offloaded to the NIC • NIC driver implements: .ndo_rx_flow_steer • The RPS stack learns which CPU core processes the stream and teaches the HW with a dedicated steering rule.
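The steering behavior described on this slide can be sketched as a toy model. This is only an illustration, not how any NIC or the Linux stack is implemented: the queue count, the `FlowSteeringTable` class, and the example flow are hypothetical, and `zlib.crc32` stands in for whatever hash the hardware really uses.

```python
import zlib

NUM_RX_QUEUES = 8  # hypothetical NIC with one RX queue (and IRQ) per core


def rss_queue(flow, num_queues=NUM_RX_QUEUES):
    """Default steering: hash the 5-tuple and derive the RX queue from it."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_queues  # crc32 is a stand-in hash


class FlowSteeringTable:
    """Toy model of per-flow steering rules of the kind aRFS installs."""

    def __init__(self, num_queues=NUM_RX_QUEUES):
        self.num_queues = num_queues
        self.rules = {}

    def learn(self, flow, app_core):
        # The stack observes which core consumes the flow and programs a rule.
        self.rules[flow] = app_core % self.num_queues

    def steer(self, flow):
        # A learned rule wins; otherwise fall back to the hash.
        if flow in self.rules:
            return self.rules[flow]
        return rss_queue(flow, self.num_queues)


# Illustrative flow: (protocol, source IP, source port, dest IP, dest port)
flow = ("tcp", "10.0.0.1", 51234, "10.0.0.2", 4420)
table = FlowSteeringTable()
before = table.steer(flow)     # wherever the hash happens to land
table.learn(flow, app_core=5)  # application runs on core 5
after = table.steer(flow)      # now local to the application core
print(f"hash-steered queue: {before}, after learned rule: {after}")
```

The point of the sketch is only that the hash result has no relation to where the application runs, while a learned per-flow rule does.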
Canonical Latency Overhead Comparison • The measurement tests the latency overhead for a QD=1 I/O operation • NVMe™/TCP is faster than iSCSI but slower than NVMe/RDMA
Performance: Large Transfer Optimizations • NVMe™ usually imposes minor CPU overhead for large I/O • <= 8K (two pages): only assign 2 pointers • > 8K: set up PRP/SGL • In TCP networking: • TX large transfers involve higher overhead for TCP segmentation and copy • Solution: TCP Segmentation Offload (TSO) and .sendpage() • RX large transfers involve higher overhead for more interrupts and copy • Solution: Generic Receive Offload (GRO) and Adaptive Interrupt Moderation • Still more overhead than PCIe though...
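The "two pointers up to 8K, PRP list beyond" rule can be worked out with a few lines of arithmetic. A minimal sketch, assuming 4 KiB pages; `prp_layout` is a hypothetical helper written for this illustration, not a function from any driver.

```python
PAGE_SIZE = 4096  # assumed memory page size


def prp_layout(length, offset=0, page_size=PAGE_SIZE):
    """Return (pages_touched, needs_prp_list, list_entries) for one transfer.

    PRP1 always points at the first page. If the buffer touches at most
    two pages, PRP2 points at the second page directly. Otherwise PRP2
    points at a PRP list holding one entry per remaining page.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= offset < page_size:
        raise ValueError("offset must lie within the first page")
    pages = (offset + length + page_size - 1) // page_size
    if pages <= 2:
        return pages, False, 0
    return pages, True, pages - 1


for size in (4096, 8192, 8193, 1 << 20):
    print(size, prp_layout(size))
```

Note that the 8K threshold only holds for page-aligned buffers: `prp_layout(8192, offset=512)` touches three pages and therefore already needs a list.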
Throughput Comparison • Single-threaded NVMe™/TCP achieves 2x better throughput than iSCSI • NVMe/TCP scales to saturate 100Gb/s with 2-3 threads, whereas iSCSI is blocked
NVMe™/TCP Parallel Interface • Each NVMe queue maps to a dedicated bidirectional TCP connection • No controller-wide sequencing • No controller-wide reassembly constraints
4K IOPs Scalability • iSCSI is serialized heavily and cannot scale with the number of threads • NVMe™/TCP scales very well reaching over 2M 4K IOPs
Performance: Read vs. Write I/O Queue Separation • Common problem with TCP/IP is head-of-queue (HOQ) blocking • For example, a small 4KB Read is blocked behind a large 1MB Write until its data transfer completes • Linux supports separate Queue mappings since v5.0 • Default Queue Map • Read Queue Map • Poll Queue Map • NVMe™/TCP leverages separate Queue Maps to eliminate HOQ Blocking. • In the future, Priority Based Queue Arbitration could reduce the impact even further
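The 4KB-Read-behind-1MB-Write example can be put into numbers with a deliberately crude model. This is a sketch under stated assumptions: each queue is a strict FIFO draining at full link rate, bandwidth sharing between connections is ignored, and the `Request` type and both functions are hypothetical names for this illustration. The selection order (poll, then read, then default) is meant to mirror the three queue maps listed above.

```python
from collections import namedtuple

# Hypothetical request descriptor for this sketch.
Request = namedtuple("Request", "op size_bytes polled", defaults=(False,))


def queue_type(req):
    """Pick one of the three queue maps for a request."""
    if req.polled:
        return "poll"
    if req.op == "read":
        return "read"
    return "default"


def completion_times_us(requests, link_gbps, separate_queues):
    """Per-queue FIFO: a request finishes once everything ahead of it is sent."""
    backlog = {}  # queue name -> bytes queued so far
    times = []
    for req in requests:
        queue = queue_type(req) if separate_queues else "shared"
        backlog[queue] = backlog.get(queue, 0) + req.size_bytes
        # 1 Gb/s equals 1000 bits per microsecond.
        times.append(backlog[queue] * 8 / (link_gbps * 1000))
    return times


workload = [Request("write", 1 << 20), Request("read", 4096)]
shared = completion_times_us(workload, link_gbps=100, separate_queues=False)
split = completion_times_us(workload, link_gbps=100, separate_queues=True)
print(f"read latency, shared queue:    {shared[1]:.2f} us")
print(f"read latency, separate queues: {split[1]:.2f} us")
```

In this model the read waits roughly 84 microseconds behind the write on a shared queue at 100Gb/s and well under one microsecond on its own queue. Real numbers depend on the network and the TCP stack, so treat the output as an order-of-magnitude illustration only.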
Performance: Read vs. Write I/O Queue Separation • NVMe™/TCP leverages separate Queue Maps to eliminate HOQ Blocking. • Future: Priority Based Queue Arbitration can reduce impact even further
Mixed Workloads Test • Test the impact of Large Write I/O on Read Latency • 32 “readers” issuing synchronous READ I/O • 1 Writer that issues 1MB Writes @ QD=16 • iSCSI Latencies collapse in the presence of Large Writes • Heavy serialization over a single channel • NVMe™/TCP is very much on-par with NVMe/RDMA
Commercial Performance Software NVMe™/TCP controller performance (IOPs vs. Latency)* * Commercial single 2U NVM subsystem that implements RAID and compression with 8 attached hosts
Commercial Performance – Mixed Workloads Software NVMe™/TCP Controller performance (IOPs vs. Latency)* * Commercial single 2U NVM subsystem that implements RAID and compression with 8 attached hosts
Slab, sendpage and kernel hardening • We never copy buffers on the NVMe™/TCP TX side (not even PDU headers) • As a proper blk_mq driver, our PDU headers were preallocated in advance • PDU headers were allocated as normal Slab objects • Can an original Slab allocation be sent to the network with Zcopy? • Linux-mm seemed to agree we can (Discussion)... • But, every now and then, under some workloads the kernel would panic...
kernel BUG at mm/usercopy.c:72!
CPU: 3 PID: 2335 Comm: dhclient Tainted: G O 4.12.10-1.el7.elrepo.x86_64 #1
...
Call Trace:
copy_page_to_iter_iovec+0x9c/0x180
copy_page_to_iter+0x22/0x160
skb_copy_datagram_iter+0x157/0x260
packet_recvmsg+0xcb/0x460
sock_recvmsg+0x3d/0x50
___sys_recvmsg+0xd7/0x1f0
__sys_recvmsg+0x51/0x90
SyS_recvmsg+0x12/0x20
entry_SYSCALL_64_fastpath+0x1a/0xa5
Slab, sendpage and kernel hardening • Root Cause: • At high queue depth, the TCP stack coalesces PDU headers into a single fragment • At the same time, userspace programs apply bpf packet filters (in this case dhclient) • Kernel Hardening applies heuristics to catch exploits: • In this case, panic if usercopy attempts to copy an skbuff that contains a fragment that crosses the Slab object boundary • Resolution: • Don’t allocate PDU headers from the Slab allocators • Instead, use a queue-private page_frag_cache • This resolved the panic issue • But also improved the page referencing efficiency on the TX path!
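The root cause can be illustrated with a toy model of the two ingredients: coalescing of back-to-back fragments, and a check that a copied range stays inside one slab object. This is not kernel code. The function names are hypothetical, and the 64-byte object size is an assumption chosen so that two headers sit exactly adjacent in memory.

```python
OBJECT_SIZE = 64  # hypothetical slab object size, one PDU header per object


def coalesce(frags):
    """Merge (start, length) fragments that are back to back in memory."""
    merged = []
    for start, length in frags:
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_length = merged[-1]
            merged[-1] = (prev_start, prev_length + length)
        else:
            merged.append((start, length))
    return merged


def within_one_object(start, length, object_size=OBJECT_SIZE):
    """True if [start, start + length) does not cross an object boundary."""
    return (start % object_size) + length <= object_size


# Two headers in neighbouring slab objects, queued one after the other.
headers = [(0, 64), (64, 64)]
print([within_one_object(s, n) for s, n in headers])            # each one passes
print([within_one_object(s, n) for s, n in coalesce(headers)])  # merged one fails
```

Each header on its own passes the check, but the coalesced 128-byte fragment spans two objects, which is the situation the hardening heuristic treats as suspicious.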
Ecosystem Linux kernel support is upstream since v5.0 (both host and NVM subsystem) • https://lwn.net/Articles/772556/ • https://patchwork.kernel.org/patch/10729733/ SPDK support (both host and NVM subsystem) • https://github.com/spdk/spdk/releases • https://spdk.io/news/2018/11/15/nvme_tcp/ NVMe™ compliance program • Interoperability testing started at UNH-IOL in the Fall of 2018 • Formal NVMe compliance testing at UNH-IOL planned to start in the Fall of 2019 For more information see: • https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/
Summary • NVMe™/TCP is a new NVMe-oF™ transport • NVMe/TCP is specified by TP 8000 (available at www.nvmexpress.org) • Since TP 8000 is ratified, NVMe/TCP is officially part of NVMe-oF 1.0 and will be documented as part of the next NVMe-oF specification release • NVMe/TCP offers a number of benefits • Works with any fabric that supports TCP/IP • Does not require a “storage fabric” or any special hardware • Provides near direct attached NAND SSD performance • Scalable solution that works within a data center or across the world
Storage Performance Development Kit • User-space C Libraries that implement a block stack • Includes an NVMe™ driver • Full featured block stack • Open Source • 3-clause BSD • Asynchronous, event loop, polling design strategy • Very different than traditional OS stack (but very similar to the new io_uring in Linux) • 100% focus on performance (latency and bandwidth) https://spdk.io
NVMe-oF™ History NVMe™ over Fabrics Target • July 2016: Initial Release (RDMA Transport) • July 2016 – Oct 2018: • Hardening, Feature Completeness • Performance Improvements (scalability) • Design changes (introduction of poll groups) • Jan 2019: TCP Transport • Compatible with Linux kernel • Based on POSIX sockets (option to swap in VPP) NVMe over Fabrics Host • December 2016: Initial Release (RDMA Transport) • July 2016 – Oct 2018: • Hardening, Feature Completeness • Performance Improvements (zero copy) • Jan 2019: TCP Transport • Compatible with Linux kernel • Based on POSIX sockets (option to swap in VPP)
NVMe-oF™ Target Design Overview • Target spawns one thread per core which runs an event loop • Event loop is called a “poll group” • New connections (sockets) are assigned to a poll group when accepted • Poll group polls the sockets it owns using epoll/kqueue for incoming requests • Poll group polls dedicated NVMe™ queue pairs on back end for completions (indirectly, via block device layer) • I/O processing runs in run-to-completion mode and is entirely lock-free.
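The poll group idea can be sketched in a few lines. This is a minimal illustration, not SPDK code: `PollGroup` and `Target` are hypothetical classes, round-robin assignment is an assumption, and Python's `selectors` module stands in for direct epoll/kqueue calls. The back-end NVMe queue pair polling is left out.

```python
import selectors
import socket


class PollGroup:
    """One event loop that owns a set of sockets (one group per core)."""

    def __init__(self, name):
        self.name = name
        self.selector = selectors.DefaultSelector()  # epoll/kqueue if available
        self.received = []

    def add(self, sock):
        sock.setblocking(False)
        self.selector.register(sock, selectors.EVENT_READ)

    def poll(self):
        """Check owned sockets once without sleeping; return events handled."""
        handled = 0
        for key, _events in self.selector.select(timeout=0):
            data = key.fileobj.recv(4096)
            if data:
                # A real target would process the request to completion here.
                self.received.append(data)
                handled += 1
        return handled


class Target:
    def __init__(self, num_groups):
        self.groups = [PollGroup(f"pg{i}") for i in range(num_groups)]
        self._next = 0

    def accept(self, sock):
        # Round-robin placement is an assumption made for this sketch.
        group = self.groups[self._next % len(self.groups)]
        self._next += 1
        group.add(sock)
        return group


target = Target(num_groups=2)
host_a, conn_a = socket.socketpair()
host_b, conn_b = socket.socketpair()
group_a = target.accept(conn_a)
group_b = target.accept(conn_b)
host_a.sendall(b"cmd-a")
host_b.sendall(b"cmd-b")
for group in target.groups:
    group.poll()
    print(group.name, group.received)
```

Because a socket belongs to exactly one poll group for its whole lifetime, no other thread ever touches it, which is what makes the lock-free claim on the slide plausible.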
Adding a New Transport • Transports are abstracted away from the common NVMe-oF™ code via a plugin system • Plugins are a set of function pointers that are registered as a new transport. • TCP Transport implemented in lib/nvmf/tcp.c • Socket operations are also abstracted behind a plugin system • POSIX sockets and VPP supported • (Diagram: Transport Abstraction. Transports: FC?, RDMA, TCP. Socket implementations under TCP: POSIX, VPP)
Future Work Better socket syscall batching! • Calling epoll_wait, readv, and writev over and over isn’t effective. Need to batch the syscalls for a given poll group. • Abuse libaio’s io_submit? io_uring? Can likely reduce number of syscalls by a factor of 3 or 4. Better integration with VPP (eliminate a copy) Integrate with TCP acceleration available in NICs NVMe-oF offload support