Sockets Direct Protocol Over InfiniBand
Dror Goldenberg, Senior Architect
Gilad Shainer, Technical Marketing
gdror@mellanox.co.il
shainer@mellanox.com
Agenda
• Introduction to InfiniBand
• Sockets Direct Protocol (SDP) overview
• SDP in the WinIB stack
• SDP performance
Introducing InfiniBand
[Fabric diagram: end nodes and I/O nodes connected through switches and a router]
• Standard interconnect, defined by the InfiniBand Trade Association
• Defined to facilitate low-cost and high-performance implementations
• From 2.5Gb/s to 120Gb/s
• Low latency
• Reliability
• Scalability
• Channel-based I/O
• I/O consolidation: communication, computation, management, and storage over a single fabric
InfiniBand Offload Capabilities
• Transport offload: reliable/unreliable, connection/datagram
• RDMA and atomic operations: direct access to remote node memory; network-resident semaphores
• Kernel bypass: direct access from the application to the HCA hardware
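To make the verbs-level offload concrete, here is a hedged sketch in C against the OpenFabrics libibverbs API (rather than the WinIB Verbs Provider Library) of registering a buffer and posting a one-sided RDMA write. It assumes `qp` is an already-connected reliable-connection queue pair and that the peer's buffer address and rkey were exchanged out of band.

```c
/* Hedged illustration only: posting an RDMA write with the OpenFabrics
 * libibverbs API. Assumes a connected RC queue pair and an out-of-band
 * exchange of the remote address and rkey. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    /* Register the local buffer so the HCA can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: peer posts no receive */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The work request goes straight to the HCA; no kernel transition is
     * needed on the fast path (kernel bypass). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```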
InfiniBand Today
[Ecosystem overview (partial lists): servers, InfiniBand blade servers, InfiniBand landed on motherboard, switches and infrastructure, native InfiniBand storage, clustering and failover, operating systems, and embedded, communications, military, and industrial deployments]
InfiniBand Roadmap
InfiniBand's roadmap outpaces other proprietary and standards-based I/O technologies in both raw performance and price/performance.
Commodities-Based Supercomputer
• Clustering efficiency: high-performance I/O that scales with CPU performance
• Scalability: 1,000s and 10,000s of nodes
• Price/performance: 20Gb/s fully offloaded, 40Gb/s in H2 2006; $69 per IC, $125 per adapter (volume OEM price, single-port SDR)
• Industry standard, horizontal market
Fabric Consolidation
• Single fabric fits all: communication, computation, storage, management
• Reduces fabric Total Cost of Ownership (TCO)
• Independent traffic virtual lanes
• Quality of service support
• Logical partitioning
• High capacity (higher bandwidth, lower latency)
InfiniBand truly enables fabric consolidation
Data Center Virtualization
[Roadmap timeline: today, 2006/2007, 2007/2008]
The InfiniBand Fabric
• Switch fabric, links from 2.5 to 120Gb/s
• Hardware transport protocol: reliable and unreliable, connected and datagram
• Kernel bypass: memory translation and protection tables; memory exposed to remote RDMA read and RDMA write
• Quality of service: process at the host CPU level, QP at the adapter level, virtual lane at the link level
• Scalability and flexibility: up to 48K nodes in a subnet, up to 2^128 in a network
• Network partitioning: multiple networks on a single wire
• Reliable, lossless, self-managing fabric: end-to-end flow control, link-level flow control, multicast support, congestion management, automatic path migration
• I/O virtualization with channel architecture: dedicated services to guest OSes, hardware-assisted protection and inter-process isolation, enables I/O consolidation
The Vision
• Computing and storage as a utility
• Exactly the same as electricity
Sockets Direct Protocol: Motivation
SDP enables existing socket-based applications to transparently utilize InfiniBand capabilities and achieve superior performance:
• Better bandwidth
• Lower latency
• Lower CPU utilization
SDP in the Network Stack
[Stack diagram: unmodified applications use the WinSock API in user mode; SDP sits in the kernel above the Access Layer and Verbs Provider Driver, on top of the HCA hardware]
• Standardized wire protocol: interoperable
• Transparent: no need for API changes or recompilation; socket semantics maintained
• Leverages InfiniBand capabilities: transport offload (reliable connection), zero copy using RDMA, kernel bypass*
* Implementation dependent
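Transparency is the central claim: the very same Winsock code an application already uses over TCP is what SDP carries over InfiniBand. A minimal sketch of such unchanged client code follows; the host name and port are placeholders.

```c
/* Ordinary Winsock client code; nothing here is SDP-specific. With an SDP
 * provider installed, this same code can run over InfiniBand transparently.
 * Host and port below are placeholders. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    struct addrinfo hints = {0}, *res = NULL;
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;     /* SDP accelerates stream sockets */
    if (getaddrinfo("server.example", "5001", &hints, &res) != 0)
        return 1;

    SOCKET s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (s == INVALID_SOCKET ||
        connect(s, res->ai_addr, (int)res->ai_addrlen) == SOCKET_ERROR)
        return 1;

    const char msg[] = "hello over SDP (or TCP; the code cannot tell)";
    send(s, msg, (int)sizeof(msg), 0);

    char reply[256];
    int n = recv(s, reply, sizeof(reply), 0);
    printf("received %d bytes\n", n);

    closesocket(s);
    freeaddrinfo(res);
    WSACleanup();
    return 0;
}
```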
SDP Data Transfer Modes
• Buffer copy (BCopy)
• Zero copy (ZCopy): Read ZCopy and Write ZCopy
Buffer Copy (BCopy)
[Diagram: the data source copies from its application buffer into SDP private buffers, SDP data messages carry the data over the fabric, and the data sink copies from its SDP private buffers into the application buffer]
• Transport offload
• SDP stack performs the data copy
Read ZCopy
[Message flow: the data source sends a SrcAvail message, the data sink issues an RDMA Read against the source application buffer, receives the read response, and returns an RdmaRdCompl message]
• Transport offloaded by the HCA
• True zero copy
Write ZCopy
[Message flow: the data sink sends a SinkAvail message advertising its application buffer, the data source performs an RDMA Write into it and returns an RdmaWrCompl message]
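The BCopy/ZCopy split is essentially a size trade-off: small messages are cheaper to copy through pre-registered private buffers, while large messages amortize the buffer-advertisement round trip of ZCopy. The sketch below illustrates that decision on a send path; the threshold value, structures, and helper functions are hypothetical and are not taken from the WinIB sources.

```c
/* Illustrative-only sketch: choosing between BCopy and Read ZCopy on the
 * send path by message size. Every name here (sdp_conn, ZCOPY_THRESHOLD,
 * the helpers) is hypothetical. */
#include <stddef.h>
#include <string.h>

#define ZCOPY_THRESHOLD (64 * 1024)   /* hypothetical crossover point */
#define SDP_BUF_SIZE    (16 * 1024)   /* matches the 16KB private buffers */

struct sdp_conn { int unused; };      /* stand-in for real connection state */

/* Stubs standing in for the real transport operations. */
static char g_private_buf[SDP_BUF_SIZE];
static void *sdp_next_private_buf(struct sdp_conn *c) { (void)c; return g_private_buf; }
static void sdp_post_data_msg(struct sdp_conn *c, const void *b, size_t n) { (void)c; (void)b; (void)n; }
static void sdp_post_srcavail(struct sdp_conn *c, const void *b, size_t n) { (void)c; (void)b; (void)n; }
static void sdp_wait_rdma_rd_compl(struct sdp_conn *c) { (void)c; }

static void sdp_send(struct sdp_conn *c, const void *app_buf, size_t len)
{
    if (len < ZCOPY_THRESHOLD) {
        /* BCopy: copy into pre-registered private buffers and send
         * SDP data messages over the reliable connection. */
        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off;
            if (chunk > SDP_BUF_SIZE)
                chunk = SDP_BUF_SIZE;
            void *priv = sdp_next_private_buf(c);
            memcpy(priv, (const char *)app_buf + off, chunk);
            sdp_post_data_msg(c, priv, chunk);
            off += chunk;
        }
    } else {
        /* Read ZCopy: advertise the application buffer with SrcAvail, let
         * the peer RDMA-Read it directly, then wait for RdmaRdCompl. */
        sdp_post_srcavail(c, app_buf, len);
        sdp_wait_rdma_rd_compl(c);
    }
}
```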
SDP Versus WSD
[Comparison table not reproduced]
* SDP on Windows XP SP2 and Windows Server 2003 SP1 is supported by Mellanox; it is unsupported by Microsoft.
WinIB
• Mellanox InfiniBand software stack for Windows
• Based on open source development (OpenFabrics)
• InfiniBand HCA verbs driver and Access Layer
• InfiniBand subnet management
• IP over InfiniBand (IPoIB) driver
• SDP driver
• WinSock Direct (WSD) provider
• SCSI RDMA Protocol (SRP) driver
• Windows Server 2003, Windows Compute Cluster Server 2003, Windows Server "Longhorn"*
* WinIB on Windows XP SP2 is supported by Mellanox; it is unsupported by Microsoft.
WinIB Software Stack
[Stack diagram. User mode: applications, MPI2* and management tools over the Winsock socket switch, with the SDP SPI provider, WSD SAN provider, and WinSock provider layered on the Access Layer Library and Verbs Provider Library (kernel-bypass path). Kernel mode: SDP and TCP/UDP/ICMP/IP over the IPoIB miniport (NDIS), the SRP miniport (StorPort), the Access Layer, and the Verbs Provider Driver. Hardware: HCA.]
* Windows Compute Cluster Server 2003
SDP in Windows
[Stack diagram: unmodified applications call the WinSock API; the Winsock socket switch and the SDP SPI provider run in user mode; the SDP module, IPoIB miniport (NDIS), Access Layer, and Verbs Provider Driver run in the Windows kernel on top of the HCA hardware]
SDP Socket Provider
• User-mode library implementing the Winsock Service Provider Interface (SPI)
• Supports SOCK_STREAM socket types
• A WSPxxx function for each socket call
• Socket switch implemented in the library: policy-based selection of SDP versus TCP
• SDP calls are redirected to the SDP kernel module (ioctl)
• Makes the routing decision and performs ARP
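For orientation, here is a hedged sketch of the SPI entry point such a provider exposes: WSPStartup fills a WSPPROC_TABLE with the provider's WSPxxx implementations. WSPStartup and the table layout come from the documented Winsock 2 SPI; the SdpWSP* stubs are hypothetical placeholders, not WinIB code.

```c
/* Hedged sketch of a Winsock SPI provider entry point. WSPStartup and
 * WSPPROC_TABLE come from the documented Winsock 2 SPI (ws2spi.h); the
 * SdpWSP* functions are hypothetical stubs. */
#include <ws2spi.h>

static SOCKET WSPAPI SdpWSPSocket(int af, int type, int protocol,
                                  LPWSAPROTOCOL_INFOW lpProtocolInfo,
                                  GROUP g, DWORD dwFlags, LPINT lpErrno)
{
    /* Real provider: decide SDP versus TCP by policy and create the socket. */
    (void)af; (void)type; (void)protocol; (void)lpProtocolInfo; (void)g; (void)dwFlags;
    *lpErrno = WSAEOPNOTSUPP;
    return INVALID_SOCKET;
}

static int WSPAPI SdpWSPSend(SOCKET s, LPWSABUF lpBuffers, DWORD dwBufferCount,
                             LPDWORD lpNumberOfBytesSent, DWORD dwFlags,
                             LPWSAOVERLAPPED lpOverlapped,
                             LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine,
                             LPWSATHREADID lpThreadId, LPINT lpErrno)
{
    /* Real provider: hand the buffers to the SDP kernel module (BCopy/ZCopy). */
    (void)s; (void)lpBuffers; (void)dwBufferCount; (void)lpNumberOfBytesSent;
    (void)dwFlags; (void)lpOverlapped; (void)lpCompletionRoutine; (void)lpThreadId;
    *lpErrno = WSAEOPNOTSUPP;
    return SOCKET_ERROR;
}

static WSPUPCALLTABLE g_upcalls;   /* upcalls back into ws2_32 */

int WSPAPI WSPStartup(WORD wVersionRequested, LPWSPDATA lpWSPData,
                      LPWSAPROTOCOL_INFOW lpProtocolInfo,
                      WSPUPCALLTABLE UpcallTable, LPWSPPROC_TABLE lpProcTable)
{
    (void)wVersionRequested; (void)lpProtocolInfo;
    g_upcalls = UpcallTable;

    lpWSPData->wVersion     = MAKEWORD(2, 2);
    lpWSPData->wHighVersion = MAKEWORD(2, 2);

    /* Hand ws2_32 the provider's WSPxxx entry points; the socket switch then
     * routes each application socket call to SDP or TCP by policy. */
    lpProcTable->lpWSPSocket = SdpWSPSocket;
    lpProcTable->lpWSPSend   = SdpWSPSend;
    /* ... remaining WSPxxx entries filled in the same way ... */

    return 0;   /* NO_ERROR */
}
```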
SDP Module
• Kernel module, implemented as a high-level driver
• Connection establishment/teardown: mapping of MAC address to GID through the IPoIB miniport, path record query, InfiniBand Connection Management (CM)
• Data transfer mechanism
• Operations are implemented asynchronously
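The user-mode provider reaches this kernel module through I/O control calls. The sketch below is purely hypothetical: the device name, IOCTL code, and request layout are invented placeholders to illustrate the mechanism, not the actual WinIB interface.

```c
/* Hypothetical sketch: a user-mode provider handing a connect request to a
 * kernel SDP module via DeviceIoControl. The device path, IOCTL code, and
 * request layout are invented placeholders. */
#include <windows.h>
#include <winioctl.h>

#define IOCTL_SDP_CONNECT_EXAMPLE \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

struct sdp_connect_req {            /* hypothetical request layout */
    unsigned long  dst_ipv4;
    unsigned short dst_port;
};

BOOL sdp_request_connect(unsigned long ip, unsigned short port)
{
    HANDLE h = CreateFileW(L"\\\\.\\SdpExampleDevice",   /* placeholder name */
                           GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    struct sdp_connect_req req = { ip, port };
    DWORD returned = 0;
    /* The kernel module would map the destination to a GID via IPoIB,
     * query the path record, and drive IB CM to establish the connection. */
    BOOL ok = DeviceIoControl(h, IOCTL_SDP_CONNECT_EXAMPLE,
                              &req, sizeof(req), NULL, 0, &returned, NULL);
    CloseHandle(h);
    return ok;
}
```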
Buffer Copy Implementation
• Only asynchronous mode is implemented in the kernel; synchronous calls are converted into overlapped operations that wait for completion
• SDP private buffers: mapped through a physical MR, 16KB buffers for send and receive
• Data copy is performed in the caller's context where possible, otherwise by a dedicated helper thread per process
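The synchronous-over-overlapped conversion mentioned above follows the standard Winsock pattern of issuing an overlapped operation and blocking until it completes. A minimal user-mode sketch of that pattern (not the WinIB kernel code itself):

```c
/* Sketch: emulate a blocking send on top of an overlapped WSASend by
 * waiting on the overlapped event. This mirrors the pattern described
 * above, not the actual WinIB kernel implementation. */
#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

int send_blocking(SOCKET s, const char *data, int len)
{
    WSAOVERLAPPED ov = {0};
    ov.hEvent = WSACreateEvent();
    if (ov.hEvent == WSA_INVALID_EVENT)
        return SOCKET_ERROR;

    WSABUF buf = { (ULONG)len, (CHAR *)data };
    DWORD sent = 0, flags = 0;

    int rc = WSASend(s, &buf, 1, &sent, 0, &ov, NULL);
    if (rc == SOCKET_ERROR && WSAGetLastError() == WSA_IO_PENDING) {
        /* The operation was queued; block until the provider completes it. */
        if (!WSAGetOverlappedResult(s, &ov, &sent, TRUE /* wait */, &flags))
            rc = SOCKET_ERROR;
        else
            rc = 0;
    }

    WSACloseEvent(ov.hEvent);
    return (rc == 0) ? (int)sent : SOCKET_ERROR;
}
```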
Current APIs and I/O Supported
• socket, WSASocket
• connect, WSAConnect
• bind, listen
• accept, AcceptEx
• close
• send, WSASend, recv, WSARecv: synchronous and overlapped, including I/O completion ports
• getsockname, getpeername
• getsockopt, setsockopt (partially)
• WSAIoctl
• Data transfer modes: buffer copy
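Completion-port I/O, listed above, is the usual way Windows applications drive many overlapped sockets at once, and SDP supports it through the same Winsock calls. Below is a small sketch of associating a socket with a completion port and reaping a receive completion; it assumes `s` is an already-connected stream socket.

```c
/* Sketch: overlapped WSARecv driven through an I/O completion port.
 * Assumes `s` is an already-connected stream socket. */
#include <winsock2.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

int recv_via_iocp(SOCKET s)
{
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    if (iocp == NULL)
        return -1;

    /* Associate the socket with the completion port; completions for any
     * overlapped operation on `s` are queued to `iocp`. */
    if (CreateIoCompletionPort((HANDLE)s, iocp, /*completion key*/ 1, 0) == NULL)
        return -1;

    char data[4096];
    WSABUF buf = { sizeof(data), data };
    DWORD flags = 0;
    WSAOVERLAPPED ov = {0};

    int rc = WSARecv(s, &buf, 1, NULL, &flags, &ov, NULL);
    if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
        return -1;

    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED done = NULL;
    /* Block until the receive completion is posted to the port. */
    if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE))
        return -1;

    printf("received %lu bytes\n", (unsigned long)bytes);
    CloseHandle(iocp);
    return (int)bytes;
}
```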
Future Plans
• Zero copy
• Improve administrable policy (SDP versus TCP)
• Performance tuning (latency, bandwidth, CPU%)
• Automatic Path Migration
• Quality of Service
• Additional functionality
Platform
• Hardware: two HP ProLiant DL145 G2 servers, each with dual 64-bit AMD Opteron 2.2GHz CPUs (1MB cache), 4GB RAM, NVIDIA nForce 2200 MCP chipset, and a Mellanox InfiniHost III Ex DDR HCA (firmware 4.7.600), connected by an InfiniBand 4x DDR 20Gb/s link
• Software: prediction for Windows Server "Longhorn", WinIB 1.3.0 (pre-release)
• Benchmarks: bandwidth with nttcp 2.5, latency with NetPIPE 3.6.2
Bandwidth
[Chart: bandwidth (MB/s) versus message size from 1 byte to 2MB, plotted for 1 socket and 2 sockets]
Summary of Results
• Latency: 17.90us for a 1-byte message
• Bandwidth: 1316 MB/s at 128KB, 2 sockets
Zero Copy
Adding ZCopy:
• Increases bandwidth
• Reduces CPU utilization
• Scales better with the number of sockets
Call to Action
• Download WinIB
• http://www.mellanox.com/products/win_ib.php
• http://windows.openib.org/downloads/binaries/
• OpenFabrics InfiniBand Windows drivers development: sign up to contribute
• http://windows.openib.org/openib/contribute.aspx
Additional Resources
• Web resources
• Specs: http://www.infinibandta.org/specs/
• White papers: http://www.mellanox.com/support/whitepapers.php
• Presentations: http://www.mellanox.com/support/presentations.php
• OpenFabrics
• http://www.openfabrics.org/
• https://openib.org/tiki/tiki-index.php?page=OpenIB+Windows
• http://openib.org/mailman/listinfo/openib-windows
• Feedback: gdror@mellanox.co.il
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.