OFED TCP Port Mapper Proposal
June 15, 2011
Overview
• The current NE020 Linux OFED driver uses the host TCP/IP stack's MAC and IP addresses for RDMA connections
• The hardware tags packets used for RDMA connection management so they are easy to identify
• Host TCP/IP stack services are used for address resolution and neighbour updates
• The RDMA CM claims a TCP port by creating a kernel socket when the unified portspace patch is applied and support is enabled via a module option: http://git.openfabrics.org/git?p=~amirv/ofed_1_5.git;a=blob;f=kernel_patches/fixes/cma_0100_unified_tcp_ports.patch;h=cfe1288041929f2940252de9b8ba15f2e35b2997;hb=ofed_kernel_1_5
• The unified portspace kernel patch is applied only when the OFED distribution is used intact
• At least one OSV is moving to a model where OFED kernel patches will not be applied
  • Red Hat, starting with RHEL 6.0
• iSCSI hardware acceleration has moved to a separate MAC/IP address that is not visible to the Linux TCP/IP stack (a private interface)
• The Linux community has strongly rejected previous attempts to include the portspace patch
• The Linux community suggests doing what iSCSI did
Goal of this presentation:
• Describe a solution to the iWARP TCP portspace issue using the Sockets Direct Protocol Port Mapper and Netlink sockets
Current OFED iWARP CM Flows (Listen)
• The application issues rdma_listen
  • For a userspace application, a transition to the kernel occurs
  • The local IP address is the Linux IP address (IP0)
• The OFED CM selects an interface and selects a local port from the appropriate portspace
  • Simple case: IP0 and TCP Port0
  • The local IP can be ANY; the CM then issues the listen on all interfaces
  • The local port can be ANY; the CM then picks a port
  • If both the local IP and port are ANY, connections on the chosen port must be accepted on all interfaces
• The portspace patch issues socket and bind calls for iWARP providers
  • This portion has not been accepted into the kernel
  • The patch exists in the OFED package
  • By default, the kernel CM simply picks a port independently of the host TCP/IP stack
Diagram steps:
1. rdma_listen(Local IP0, Local Port0)
2. Transition to kernel CM
3. Interface selected
4. Port selected
5. kern_socket, bind
6. create_listen
7. Set up hardware
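The port selection rules above can be modelled in a few lines. The sketch below is a hypothetical Python model, not the OFED CM code. `PortSpace`, `ANY_IP`, `ANY_PORT` and the ephemeral range 49152 to 65535 are assumptions chosen for illustration. It shows why a wildcard listen must find a port that is free on every interface.

```python
ANY_IP = "0.0.0.0"   # wildcard local address (assumed representation)
ANY_PORT = 0         # wildcard port: let the CM choose


class PortSpace:
    """Tracks which (interface IP, port) pairs are in use.

    Illustrative model only; not the OFED CM data structure.
    """

    def __init__(self, interfaces, ephemeral=range(49152, 65536)):
        self.interfaces = list(interfaces)
        self.ephemeral = ephemeral
        self.in_use = set()  # set of (ip, port)

    def _targets(self, ip):
        # A wildcard address means the port must be free on every interface.
        return self.interfaces if ip == ANY_IP else [ip]

    def _free(self, ip, port):
        return all((t, port) not in self.in_use for t in self._targets(ip))

    def reserve(self, ip, port=ANY_PORT):
        """Reserve a port for a listen on ip and return the port number."""
        if ip != ANY_IP and ip not in self.interfaces:
            raise ValueError(f"unknown interface {ip}")
        if port == ANY_PORT:
            # The CM picks the first port that is free on all target interfaces.
            port = next((p for p in self.ephemeral if self._free(ip, p)), None)
            if port is None:
                raise OSError("port space exhausted")
        elif not self._free(ip, port):
            raise OSError(f"port {port} already in use")
        for t in self._targets(ip):
            self.in_use.add((t, port))
        return port
```

A wildcard reservation such as `reserve(ANY_IP)` marks the chosen port as used on every interface. A later specific listen on any interface therefore skips that port.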
Current OFED iWARP CM Flows (Connect)
• The application issues rdma_connect
  • For a userspace application, a transition to the kernel occurs
  • The local and remote IP addresses are the Linux IP addresses (IP0, IP2)
• The OFED CM selects an interface and selects a local port from the appropriate portspace
  • The local IP can be ANY
  • The CM uses the Linux stack to pick an interface; this usually completes the neighbour update before the request reaches the provider
• The portspace patch issues socket and bind calls for iWARP providers
• The kernel provider is informed of (and can trigger) neighbour updates to stay in sync with the Linux TCP/IP stack
• The kernel provider mini-CM handles the TCP three-way handshake and the MPA exchange through dev_queue_xmit and a private receive path
Diagram steps:
1. rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2)
2. Transition to kernel CM
3. Interface selected
4. Port selected
5. kern_socket, bind
6. connect
7. Set up hardware
8. Neighbour update
9. CM packets
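The following is a minimal sketch of the neighbour synchronisation described above. It assumes the host stack pushes updates to the provider through a callback. `NeighbourCache`, `on_neigh_update` and `resolve_fn` are hypothetical names that stand in for the kernel's neighbour notification path; the slides do not show the real driver logic.

```python
class NeighbourCache:
    """Provider-side copy of host neighbour entries (hypothetical model)."""

    def __init__(self, resolve_fn):
        # resolve_fn(ip) asks the host stack to resolve an address; the
        # answer is expected to arrive later through on_neigh_update().
        self._resolve_fn = resolve_fn
        self._entries = {}

    def on_neigh_update(self, ip, mac):
        """Called when the host TCP/IP stack learns, changes or drops a neighbour."""
        if mac is None:
            self._entries.pop(ip, None)  # entry invalidated by the host stack
        else:
            self._entries[ip] = mac

    def lookup(self, ip):
        """Return the MAC for ip, or None after triggering host resolution."""
        mac = self._entries.get(ip)
        if mac is None:
            self._resolve_fn(ip)
        return mac
```

Because the provider never resolves addresses itself, its view cannot drift from the host stack's view. It either has the host's answer or asks the host for one.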
New OFED iWARP CM Architecture
• Similar to the current CM flow
• OFED gains a new iWARP Port Mapper daemon in userspace
• OFED gains a new netlink interface between userspace and the kernel
  • Introduced for statistics
  • Extended for iWARP providers and the new Port Mapper daemon
  • The netlink interface is roughly modeled after iSCSI
• Supports (but does not require) a second MAC/IP address on the local and on the remote peer (soft iWARP)
• Netlink messages:
  • Port Mapper netlink upcalls: Query PID, Add/Remove Mapping, Query Mapping
  • Provider netlink upcalls: Query PID, Connect, Listen, Resolve
  • Provider netlink downcalls: Inbound Connect, Operation Complete (for upcalls), Interface Down
• Three RNIC models are supported:
  • RNICs with the CM in the kernel/adapter
  • RNICs with the CM in userspace
  • Hybrid RNICs with a userspace CM that requires adapter assistance
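The message set above can be captured as a small catalogue that makes the direction of each message explicit. This is an illustrative Python sketch. The channel names and message identifiers are hypothetical, and splitting "Add/Remove Mapping" into two messages is an assumption.

```python
from enum import Enum


class Channel(Enum):
    """Netlink message groups from the slide (identifiers are hypothetical)."""
    PORT_MAPPER_UPCALL = "kernel -> Port Mapper daemon"
    PROVIDER_UPCALL = "kernel provider -> userspace provider library"
    PROVIDER_DOWNCALL = "userspace provider library -> kernel provider"


# "Add/Remove Mapping" is assumed to be two separate messages.
MESSAGES = {
    Channel.PORT_MAPPER_UPCALL: ("QUERY_PID", "ADD_MAPPING", "REMOVE_MAPPING", "QUERY_MAPPING"),
    Channel.PROVIDER_UPCALL: ("QUERY_PID", "CONNECT", "LISTEN", "RESOLVE"),
    Channel.PROVIDER_DOWNCALL: ("INBOUND_CONNECT", "OP_COMPLETE", "INTERFACE_DOWN"),
}


def validate(channel, name):
    """Reject a message that is not defined for the given channel."""
    if name not in MESSAGES[channel]:
        raise ValueError(f"{name} is not valid on channel {channel.name}")
    return True
```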
iWARP Port Mapper Concept
• The Port Mapper concept was introduced by the RDMA Consortium as part of the Sockets Direct Protocol specification
  • http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf
• Provides a mechanism for an iWARP port space that is separate from the Linux TCP port space
  • The iWARP port space can use an independent IP address or share a single IP address
• The Port Mapper service runs over TCP on a well-known port (3935) on the Linux IP addresses
  • The listen is issued at service startup
• Port Mapper service rdma_listen steps:
  • Register a mapping between the Linux IP address/TCP port and the iWARP IP address/TCP port with the Port Mapper service
• Port Mapper service rdma_connect steps:
  • Receive a query request from a Port Mapper service client
  • Connect to the remote peer on the well-known port
  • Query the RDMA peer's iWARP IP address/TCP port using the SDP Port Mapper protocol (PMRequest)
  • Return the information from the PMAccept message to the Port Mapper service client
• Port Mapper service peer query steps:
  • Accept the Port Mapper connection (port 3935 on the Linux IP address) from the node issuing the query
  • Receive the PMRequest message
  • Look up the IP address and port from the PMRequest in the local database populated in the rdma_listen step
  • Return the mapped IP address and port information in a PMAccept message
• After receiving the PMAccept message, the iWARP provider issues the iWARP connect using the iWARP local and remote IP address/TCP port "quad"
• Later slides show more detail
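The database side of the Port Mapper service can be sketched as follows. This in-memory model uses hypothetical `Endpoint`, `PMRequest`, `PMAccept` and `PortMapperService` types. It does not reproduce the SDP wire format. Returning `None` for an unknown address is a simplification, not the specified rejection behaviour.

```python
from dataclasses import dataclass

PORT_MAPPER_PORT = 3935  # well-known Port Mapper service port, per the slide


@dataclass(frozen=True)
class Endpoint:
    ip: str
    port: int


@dataclass(frozen=True)
class PMRequest:
    target: Endpoint  # Linux IP/TCP port the querying node wants to reach


@dataclass(frozen=True)
class PMAccept:
    mapped: Endpoint  # iWARP IP/TCP port to use for the actual connection


class PortMapperService:
    """In-memory mapping database (hypothetical model, no wire format)."""

    def __init__(self):
        self._db = {}  # Linux Endpoint -> iWARP Endpoint

    def register(self, linux_ep, iwarp_ep):
        """rdma_listen step: record Linux IP/port -> iWARP IP/port."""
        self._db[linux_ep] = iwarp_ep

    def unregister(self, linux_ep):
        self._db.pop(linux_ep, None)

    def handle_request(self, req):
        """Peer query step: look up the PMRequest target; None if unmapped."""
        mapped = self._db.get(req.target)
        return None if mapped is None else PMAccept(mapped)
```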
Pending Netlink Patch for OFED
• A patch was recently submitted to query RDMA connection information via netlink
  • Roland has rolled this patch into the linux-next patch set for late May
• The patch introduces a single OFED netlink port and an InfiniBand netlink infrastructure in ib_core
  • Supports 32 clients within OFED and 1024 operations per client
  • Only a single client is currently defined (rdma_cm)
• Components that want to add netlink capabilities to OFED can register with the InfiniBand netlink infrastructure
  • The Port Mapper daemon consumes one client
  • Each iWARP provider consumes an additional client
• The netlink dump operation is used to return data to the netlink client
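One way to fit 32 clients with 1024 operations each into a netlink message type is to pack a 5-bit client index above a 10-bit operation number. The result fits in the 16-bit `nlmsg_type` field. The bit layout and the `nl_type`, `nl_split` and `NetlinkRegistry` names below are assumptions for illustration, not the layout defined by the patch.

```python
CLIENT_BITS = 5   # 32 clients
OP_BITS = 10      # 1024 operations per client


def nl_type(client, op):
    """Pack a client index and operation into one message type (assumed layout)."""
    if not 0 <= client < (1 << CLIENT_BITS):
        raise ValueError("client index out of range")
    if not 0 <= op < (1 << OP_BITS):
        raise ValueError("operation out of range")
    return (client << OP_BITS) | op


def nl_split(msg_type):
    """Recover (client, op) from a packed message type."""
    client = (msg_type >> OP_BITS) & ((1 << CLIENT_BITS) - 1)
    return client, msg_type & ((1 << OP_BITS) - 1)


class NetlinkRegistry:
    """Routes incoming messages to per-client operation tables."""

    def __init__(self):
        self._clients = {}

    def register(self, client, ops):
        # ops maps an operation number to a handler(payload) callable.
        if client in self._clients:
            raise ValueError(f"client {client} already registered")
        nl_type(client, 0)  # reuse the range check on the client index
        self._clients[client] = dict(ops)

    def dispatch(self, msg_type, payload):
        client, op = nl_split(msg_type)
        handler = self._clients.get(client, {}).get(op)
        if handler is None:
            raise KeyError(f"no handler for client {client} op {op}")
        return handler(payload)
```

In this model, the Port Mapper daemon and each iWARP provider would each call `register` with their own client index.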
New OFED iWARP CM Flows (Listen: Userspace Provider CM)
• Similar to the current CM flow
• The CM can now reserve ports independently, since the Port Mapper lets providers use any provider-managed port number to represent the CM port number
• A netlink message is used to issue the listen to the userspace library
• The mini-CM or userspace TCP stack manages the provider "port space" to get a local TCP Port1 that is related to the CM local Port0
• The userspace library registers local IP1, Port1
• For compatibility, a bind could also be made on the existing MAC/IP stack; soft iWARP requires this, as do some customers
  • If the userspace provider library issues socket/bind to the Linux TCP/IP stack (as soft iWARP would), then IP0 = IP1 and Port0 != Port1
Diagram steps:
1. rdma_listen(Local IP0, Local Port0)
2. Transition to kernel CM
3. Interface selected
4. Port selected
5. create_listen
6. Netlink: Listen
7. Netlink: Register Port Map IP0, Port0 -> IP1, Port1
8. Netlink: Complete
9. Set up hardware (IP1, Port1)
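The provider side of this listen flow can be sketched as a handler for the Netlink Listen upcall. It picks Port1 from the provider's own port space and registers the mapping. `ProviderListener`, `register_fn` and the starting port 40000 are hypothetical. A simple counter stands in for real port allocation and never reuses ports.

```python
import itertools


class ProviderListener:
    """Hypothetical userspace provider library handling Netlink Listen."""

    def __init__(self, iwarp_ip, register_fn, first_port=40000):
        self.iwarp_ip = iwarp_ip                       # IP1; equals IP0 for soft iWARP
        self._ports = itertools.count(first_port)      # provider-managed port space
        self._register = register_fn                   # sends "Register Port Map"
        self.listens = {}                              # (IP0, Port0) -> (IP1, Port1)

    def on_listen(self, ip0, port0):
        """Allocate Port1, register IP0:Port0 -> IP1:Port1, return the mapping."""
        mapping = (self.iwarp_ip, next(self._ports))  # Port1 independent of Port0
        self._register((ip0, port0), mapping)
        self.listens[(ip0, port0)] = mapping
        return mapping                                 # reported in Netlink Complete
```

With a shared address (the soft iWARP case), IP0 equals IP1 while Port0 and Port1 differ. This matches the last bullet on the slide.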
New OFED iWARP CM Flows (Connect: Userspace Provider CM)
• Similar to the current CM flow
• Netlink is used to issue the connect to the userspace library
• The mini-CM or userspace TCP stack manages the provider "portspace" to get a local TCP Port1 that is related to the CM local Port0
• The userspace library resolves remote IP2, Port2 through the Port Mapper and gets the remote iWARP IP address and port number IP3, Port3
• The userspace provider CM issues the iWARP connect to IP3, Port3, including the MPA handshake
• The userspace mini-CM sends a Netlink Connect Complete message to the kernel provider with the new connection information: IP1:Port1, IP3:Port3
• The kernel driver sets up the RNIC hardware, including transitioning the QP to RTS
• The kernel CM issues the Connect Reply event
Diagram steps:
1. rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2)
2. Transition to kernel CM
3. Interface selected
4. Port selected
5. connect
6. Netlink: Connect
7. Netlink: Resolve Remote Port IP2, Port2 -> IP3, Port3
8. SDP Port Mapper protocol (IP0 <-> IP2)
9. Netlink: Connect Complete
10. Set up hardware
11. Connect Reply event
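In the userspace connect flow, the address bookkeeping reduces to turning the CM quad into the on-the-wire iWARP quad. In this hypothetical sketch, `resolve_fn` stands in for the Port Mapper query and `tcp_connect_fn` for the TCP handshake plus MPA exchange. Neither is an actual OFED interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int


def userspace_connect(cm_quad, local_iwarp, resolve_fn, tcp_connect_fn):
    """Hypothetical userspace-CM connect path.

    cm_quad        -- Quad(IP0, Port0, IP2, Port2) as passed to rdma_connect
    local_iwarp    -- (IP1, Port1) taken from the provider's own port space
    resolve_fn     -- maps (IP2, Port2) to (IP3, Port3) via the Port Mapper
    tcp_connect_fn -- performs the TCP handshake and MPA exchange on a quad
    Returns the quad reported in the Netlink Connect Complete message.
    """
    remote_iwarp = resolve_fn(cm_quad.remote_ip, cm_quad.remote_port)
    if remote_iwarp is None:
        raise LookupError("remote Port Mapper has no mapping")
    wire_quad = Quad(*local_iwarp, *remote_iwarp)  # IP1:Port1 -> IP3:Port3
    tcp_connect_fn(wire_quad)
    return wire_quad
```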
New OFED iWARP CM Flows (Accept: Userspace Provider CM)
• The userspace provider CM receives a connect request on IP1, Port1
  • The TCP three-way handshake and MPA request from the peer are received
• The userspace library issues a Connect Request netlink downcall to the kernel provider with these addresses:
  • Remote iWARP: IP3, Port3 (port mapped)
  • Remote TCP: unknown; use the port-mapped IP3, Port3
  • Local iWARP: IP1, Port1 (port mapped)
  • Local TCP: IP0, Port0 (from the listen)
• The kernel mini-CM delivers a Connect Request event to the iWARP CM with the new connection information: IP0:Port0, IP3:Port3
• The application is notified of the connection request and responds with an rdma_accept call
• The kernel CM issues an accept call to the kernel provider
• The kernel provider then sets up the RNIC hardware, including sending the MPA response and transitioning the QP to RTS
• The kernel provider issues an Established CM event
Diagram steps:
1. Netlink: Connect Request
2. Connect Request event
3. Transition to userspace CM
4. rdma_accept(Local IP0, Local Port0, Remote IP3, Remote Port3)
5. Transition to kernel CM
6. CM accept
7. Set up hardware
8. Established event
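On the accept side, the userspace library must fill in four addresses. One of them, the peer's Linux TCP address, is usually unknown. The sketch below assumes the provider kept the `{(IP0, Port0): (IP1, Port1)}` table from the listen step. `build_connect_request` and the dictionary keys are hypothetical.

```python
def build_connect_request(listens, local_iwarp, remote_iwarp, remote_tcp=None):
    """Assemble the addresses for a Netlink Connect Request downcall.

    listens      -- provider table {(IP0, Port0): (IP1, Port1)} from rdma_listen
    local_iwarp  -- (IP1, Port1) on which the MPA request arrived
    remote_iwarp -- (IP3, Port3) from which the peer connected
    remote_tcp   -- the peer's Linux address, normally unknown (None)
    """
    # Invert the listen table to map the local iWARP address back to IP0:Port0.
    reverse = {iwarp: tcp for tcp, iwarp in listens.items()}
    local_tcp = reverse.get(local_iwarp)
    if local_tcp is None:
        raise LookupError(f"no listen registered for {local_iwarp}")
    return {
        "local_iwarp": local_iwarp,
        "local_tcp": local_tcp,
        "remote_iwarp": remote_iwarp,
        # Fall back to the port-mapped remote address, as on the slide.
        "remote_tcp": remote_tcp if remote_tcp is not None else remote_iwarp,
    }
```

The application then sees `local_tcp` and `remote_tcp`, which corresponds to IP0:Port0 and IP3:Port3 in the Connect Request event.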
New OFED iWARP CM Flows (Kernel Provider CM)
• Changes are minimal for RNICs whose drivers handle connection management entirely in the kernel
• On listen requests, the kernel provider CM must issue the Register Port Map request to the iWARP Port Mapper daemon over netlink sockets
• On connect requests, the kernel provider CM must:
  • Issue the Resolve Remote Port netlink message to the iWARP Port Mapper daemon
  • On completion, issue the iWARP connect request using the local and remote iWARP IP addresses and port numbers (instead of the Linux IP addresses and port numbers from the connect request)
• On Connect Request events and accept request handling, map the local iWARP IP address and port number back to the original listen IP address and port number
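The three kernel-provider changes can be sketched as three hooks. `KernelProviderCM`, `register_fn`, `resolve_fn` and `hw_connect` are hypothetical. The callables stand in for the Register Port Map and Resolve Remote Port netlink messages and for programming the RNIC.

```python
class KernelProviderCM:
    """Hypothetical kernel-provider CM hooks talking to the Port Mapper daemon."""

    def __init__(self, register_fn, resolve_fn, hw_connect):
        self.register_fn = register_fn  # stands in for "Register Port Map"
        self.resolve_fn = resolve_fn    # stands in for "Resolve Remote Port"
        self.hw_connect = hw_connect    # issues the iWARP connect on the RNIC
        self.listens = {}               # iWARP endpoint -> original listen endpoint

    def listen(self, linux_ep, iwarp_ep):
        self.register_fn(linux_ep, iwarp_ep)
        self.listens[iwarp_ep] = linux_ep

    def connect(self, local_iwarp, remote_linux):
        remote_iwarp = self.resolve_fn(remote_linux)
        # Connect with the iWARP addresses, not the Linux ones from rdma_connect.
        return self.hw_connect(local_iwarp, remote_iwarp)

    def connect_request(self, local_iwarp):
        # Report the original listen address to the CM and the application.
        return self.listens[local_iwarp]
```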
New OFED iWARP CM Flows (Hybrid Provider CM)
• A hybrid RNIC has a userspace connection manager or private TCP stack that manages the iWARP IP address and port space, but does not take part in connection setup
• The listen flow for a hybrid RNIC is the same as the flow for the userspace stack
• The accept flow is the same as the flow for a kernel provider
• The connect flow is slightly different and is depicted on the following slide
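The mapping from RNIC model to flow can be written as a lookup table. This is a hypothetical summary of the slides, not code from the proposal.

```python
# Flow variant used by each RNIC model for each CM operation, per the slides.
FLOWS = {
    "kernel":    {"listen": "kernel",    "connect": "kernel",    "accept": "kernel"},
    "userspace": {"listen": "userspace", "connect": "userspace", "accept": "userspace"},
    # Hybrid RNICs borrow the userspace listen flow and the kernel accept flow,
    # but have their own connect flow.
    "hybrid":    {"listen": "userspace", "connect": "hybrid",    "accept": "kernel"},
}


def flow_for(model, operation):
    """Return which flow variant a given RNIC model uses for an operation."""
    try:
        return FLOWS[model][operation]
    except KeyError:
        raise ValueError(f"unknown model/operation: {model}/{operation}") from None
```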
New OFED iWARP CM Flows (Connect: Hybrid CM)
• Similar to the current CM flow
• Netlink is used to issue a resolve message to the userspace library
• The mini-CM or userspace TCP stack manages the provider "portspace" to get a local TCP Port1 that is related to the CM local Port0
• The userspace library resolves remote IP2, Port2 through the Port Mapper and gets the remote iWARP IP address and port number IP3, Port3
• This information is returned to the kernel provider CM in a Resolve Complete netlink message
• The kernel provider CM issues the iWARP connect to IP3:Port3 from IP1:Port1, including the MPA handshake
• The kernel driver sets up the RNIC hardware, including transitioning the QP to RTS
• The kernel CM issues the Connect Reply event, reporting IP0:Port0 and IP2:Port2 as the connection information
Diagram steps:
1. rdma_connect(Local IP0, Local Port0, Remote IP2, Remote Port2)
2. Transition to kernel CM
3. Interface selected
4. Port selected
5. connect
6. Netlink: Resolve
7. Netlink: Resolve Remote Port IP2, Port2 -> IP3, Port3
8. SDP Port Mapper protocol (IP0 <-> IP2)
9. Netlink: Resolve Complete
10. Set up hardware
11. Connect Reply event
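The hybrid connect path splits the work between a Resolve upcall and a later Resolve Complete downcall, so the kernel provider has to correlate the two. The following sketch assumes correlation by a sequence number. `ResolveTracker`, `send_upcall` and `on_resolved` are hypothetical, and the proposal does not specify how messages are matched.

```python
import itertools


class ResolveTracker:
    """Correlates Netlink Resolve upcalls with Resolve Complete downcalls."""

    def __init__(self, send_upcall):
        self._send = send_upcall       # send_upcall(seq, cm_quad): Netlink Resolve
        self._seq = itertools.count(1)
        self._pending = {}             # seq -> (cm_quad, on_resolved)

    def resolve(self, cm_quad, on_resolved):
        """Park a connect until the userspace library answers; return its seq."""
        seq = next(self._seq)
        self._pending[seq] = (cm_quad, on_resolved)
        self._send(seq, cm_quad)
        return seq

    def resolve_complete(self, seq, local_iwarp, remote_iwarp):
        """Handle Netlink Resolve Complete; unknown or repeated seq raises KeyError."""
        cm_quad, on_resolved = self._pending.pop(seq)
        # The wire connect uses the iWARP addresses, while cm_quad
        # (IP0:Port0, IP2:Port2) is kept for the Connect Reply event.
        return on_resolved(local_iwarp, remote_iwarp, cm_quad)
```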
Conclusions/Next Steps
• This proposal moves iWARP traffic to a port space independent of TCP/IP sockets applications, transparently to the RDMA verbs consumer
• The iWARP port space can remain on the same IP address (like soft iWARP) or move to a separate IP address (like iSCSI)
• Three different RNIC connection management models are supported
• The RDMA Consortium has published the wire protocol for mapping TCP port numbers to iWARP port numbers
• This proposal also resolves a port space issue with iSER targets and iWARP in OFED
• Backward compatibility can be ensured by using timeouts on the Port Mapper protocol to fall back to the current behavior
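The timeout-based fallback in the last bullet could look like the following sketch. `resolve_with_fallback` and `query_fn` are hypothetical. `query_fn` stands in for the PMRequest/PMAccept exchange, and a worker thread with a bounded wait models the timeout. If no answer arrives in time, the original Linux address is used, which corresponds to the current behavior.

```python
import queue
import threading


def resolve_with_fallback(query_fn, ip, port, timeout):
    """Return ((ip, port), mapped_flag), falling back to the input on timeout.

    query_fn(ip, port) -> (mapped_ip, mapped_port) is a hypothetical stand-in
    for the Port Mapper query to the remote peer.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", query_fn(ip, port)))
        except Exception as exc:  # a failed query also triggers the fallback
            result.put(("error", exc))

    # Daemon thread so a hung query cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = result.get(timeout=timeout)
    except queue.Empty:
        return (ip, port), False  # peer has no Port Mapper: use current behavior
    if status == "error":
        return (ip, port), False
    return value, True
```

Returning a flag alongside the address lets the caller decide whether to connect with the mapped iWARP quad or with the unmapped Linux address.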