Reliable Datagram IPC Richard.Frank@oracle.com, Zach.Brown@oracle.com Oracle Corporation @ OpenIB 8/05
Vision Statement
• A low-overhead, low-latency, high-bandwidth, ultra-reliable, supportable IPC protocol and transport system
• Matches Oracle's existing IPC models for RAC communication
• Optimized for transfers from 200 bytes to 8 MB
Goal and Objective
• Support for a reliable datagram IPC in OpenIB
• Based on the socket API
• Minimal code change and testing for Oracle
• Failover between HCAs and between ports within an HCA
• Runs over IB, Ethernet, iWARP, etc.
• 2-month validation / certification for RAC
Today's Situation
• TCP streams are used for connections to the database by external clients, app servers, etc.
• Reliable datagrams are used for internal database IPC (RAC)
• Thousands of processes
• 200k+ associations (not connections)
• 64 nodes
Parallel Query
• SQL is decomposed into an execution plan / tree
• Set of producer / consumer pipelined stages
• Based on the data accessed (number of rows, physical organization, logical operations such as hash or index)
• Each execution stage has producer / consumer slave groups (source, sink)
• Each group can contain many slaves (32)
Parallel Query
• The operation tree / plan is not aware of slave locality: communication may be local via shared memory or remote via IPC
• N:N, 1:N, and N:1 communication between groups (16 source, 16 sink, 16 nodes = ~65k associations for N:N communication in 1 query)
• The group organization / communication model may change at each stage of the plan
• Capable of 64k message sizes; 16k is typical today
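The ~65k figure above can be reproduced with a short calculation. This is a sketch under one assumption not stated on the slide: that "16 source, 16 sink, 16 nodes" means 16 source slaves and 16 sink slaves on each of 16 nodes, with every source talking to every sink. The function name is illustrative only.

```python
def nn_associations(sources_per_node: int, sinks_per_node: int, nodes: int) -> int:
    """Count associations when every source slave talks to every sink slave.

    Assumes (not stated in the slides) that the per-group counts are per node.
    """
    total_sources = sources_per_node * nodes
    total_sinks = sinks_per_node * nodes
    return total_sources * total_sinks


print(nn_associations(16, 16, 16))  # 65536
```

Under that reading, 256 sources times 256 sinks gives 65,536 associations for a single N:N stage of one query.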
Oracle Buffer Cache
• Distributed cache
• Client / server
• Client sends a request for a buffer
• Server sends back the buffer (DDP)
• Each node has a pool of servers
• Any client can ask any server
Oracle Buffer Cache
• Buffer size is 8k by default, but can range from 2k up to 32k
• Associations per server are (n - 1) × C
• C = clients per node, n = nodes
• (16 - 1) × 800 = 12k per server process
• 8 servers per node = 96k associations
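The association counts above follow directly from the formula on the slide. A small sketch of the arithmetic, with function names chosen here for illustration:

```python
def associations_per_server(nodes: int, clients_per_node: int) -> int:
    """(n - 1) * C: each server is reachable by every client on every other node."""
    return (nodes - 1) * clients_per_node


def associations_per_node(nodes: int, clients_per_node: int, servers_per_node: int) -> int:
    """Total associations held by all server processes on one node."""
    return servers_per_node * associations_per_server(nodes, clients_per_node)


print(associations_per_server(16, 800))   # 12000
print(associations_per_node(16, 800, 8))  # 96000
```

Note that the slide's "16-1*800" only yields 12k when read as (16 - 1) × 800.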
Oracle IPC Usage
• New database functionality will significantly increase IPC utilization
• Approaches database I/O rates
• Very large messages: 8 MB and up
Reliable Datagram IPC
• UDP: Oracle adds reliable delivery via a user-mode wire protocol engine
• Two sockets per process, thousands of messages on the wire
• Slow send times (windowing, acks, retransmissions)
• Holds together but degrades under CPU load
• Well tested!
Available Options
• uDAPL / itAPI: not supporting
• IPoIB: high CPU overhead, same unreliable delivery (UDP)
• SDP: connection oriented
• We want to take our existing, well-tested UDP module and shut off most of it to run over an O/S-provided RD IPC
Recommendation
• RD: Reliable Datagram IPC over IB
• 50% less CPU than IPoIB, UDP
• ½ the latency of UDP (no user-mode acks)
• Within 5% of uDAPL throughput using Oracle
• Minimal code change: reduced our UDP module by 70% (removed windowing, acks, retransmissions, etc.)
• RDS driver is ~1k lines of C (b-copy)
• Decoupled from user-mode CPU loading
• Passes all Oracle regression tests in < 2 weeks!
• Supports failover across and within HCAs
RDS IPC over IB
• Uses IB reliable connection (RC)
• Node-to-node level connection
• User-mode sockets share a small pool of node-to-node RCs
• Connections are formed either dynamically at send time or at system startup
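The sharing described above can be illustrated with a toy model in which many datagram sockets multiplex over one connection per remote node, created lazily on first send. This is a pure-Python simulation, not the RDS driver: the class names, the one-connection-per-node simplification (the slide says "small pool"), and the list standing in for an RC are all assumptions made for illustration.

```python
class NodeConnectionTable:
    """Hypothetical registry holding one shared connection per remote node."""

    def __init__(self):
        self._by_node = {}

    def get(self, remote_node):
        # Form the connection dynamically on first use (the "at send" case).
        if remote_node not in self._by_node:
            self._by_node[remote_node] = []  # a list stands in for an RC
        return self._by_node[remote_node]

    def count(self):
        return len(self._by_node)


class DatagramSocket:
    """Hypothetical user-mode socket that sends over the shared connections."""

    def __init__(self, local_port, table):
        self.local_port = local_port
        self.table = table

    def sendto(self, payload, remote_node, remote_port):
        connection = self.table.get(remote_node)
        # Ports identify the sockets; the connection itself is node to node.
        connection.append((self.local_port, remote_port, bytes(payload)))


table = NodeConnectionTable()
first = DatagramSocket(1, table)
second = DatagramSocket(2, table)
first.sendto(b"request", "node2", 7)
second.sendto(b"request", "node2", 8)
print(table.count())  # 1
```

The point of the design is that connection state scales with the number of nodes rather than with the number of socket-to-socket associations.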
[Chart: Oracle Block Service Rate]
[Chart: Service Response Time]
[Chart: CPU Cost Per Block Served]
RDS IPC
• Implemented in 3 phases
• B-copy
• Zero copy
• Z-copy directed sends / recvs (ES-API additions)
B-Copy
• Sends are copied and completed immediately
• Sends are not guaranteed to have reached the remote application
• If a send fails asynchronously after submission, the application must detect the loss
• A send can only fail if there is no path to the destination (the remote port / process is gone, or the path has failed and there is no alternate path)
B-Copy Send/Recv
• Recvs are buffered in the kernel and queued to the remote socket
• If the total buffers queued to a remote socket exceed a threshold, the sending socket is back-pressured (EWOULDBLOCK) when sending to that blocked remote socket
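The back-pressure rule above can be sketched as a bounded receive queue. This is a simulation of the behavior described on the slide, not the kernel implementation: the class name, the byte-based threshold, and raising `BlockingIOError` with `errno.EWOULDBLOCK` are assumptions chosen to mirror what a non-blocking socket caller would see.

```python
import errno
from collections import deque


class ReceiveQueue:
    """Hypothetical per-socket queue of buffered (copied) messages."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.queued_bytes = 0
        self.messages = deque()

    def deliver(self, payload):
        # Back-pressure the sender once the queued total would exceed the threshold.
        if self.queued_bytes + len(payload) > self.threshold_bytes:
            raise BlockingIOError(errno.EWOULDBLOCK, "remote socket is congested")
        # B-copy: take a private copy so the sender's buffer can be reused at once.
        self.messages.append(bytes(payload))
        self.queued_bytes += len(payload)

    def recv(self):
        # Draining the queue frees space and lifts the back pressure.
        payload = self.messages.popleft()
        self.queued_bytes -= len(payload)
        return payload
```

A sender that hits the error would typically retry after the receiver has drained some messages, for example once `recv()` has been called on the remote side.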
Z-Copy Send/Recv
• Dynamic registration of buffers larger than a threshold size
• The application is not required to do explicit registration
• Oracle IPC buffers are in shared memory and private heap, so they are impractical to pre-register
• The O/S must manage any caching of registrations
Directed Sends / Recvs (DDP)
• Key for the target buffer is returned from the RDS interface (get memory handle)
• Key is sent by the application to the remote side
• Remote side initiates a directed send, passing in the key of the remote target buffer
• Uses RDMA write to move data
• ES-API additions: definition in progress
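The key exchange above can be walked through with a toy model. Since the slide notes the ES-API definition was still being worked on, every name here (`Node`, `get_memory_handle`, `rdma_write`, `directed_send`) is hypothetical, and a plain `bytearray` stands in for registered memory; this is a sketch of the message flow only.

```python
import itertools


class Node:
    """Hypothetical node that hands out keys for its target buffers."""

    def __init__(self):
        self._buffers = {}
        self._next_key = itertools.count(1)

    def get_memory_handle(self, buffer):
        # Step 1: the interface returns a key naming the target buffer.
        key = next(self._next_key)
        self._buffers[key] = buffer
        return key

    def rdma_write(self, key, data):
        # Step 4: data lands directly in the buffer named by the key.
        target = self._buffers[key]
        if len(data) > len(target):
            raise ValueError("data does not fit in the target buffer")
        target[:len(data)] = data


def directed_send(target_node, key, data):
    """Step 3: the remote side initiates the send, passing the target's key."""
    target_node.rdma_write(key, data)


client = Node()
block = bytearray(8)
key = client.get_memory_handle(block)  # step 1
# Step 2: the application would send `key` to the server in an ordinary message.
directed_send(client, key, b"rowdata")  # steps 3 and 4, run by the server
print(bytes(block[:7]))
```

In the buffer cache case described earlier, the client would be the requester of a block and the server would place the block straight into the client's buffer.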
Next Steps?
• RDS bcopy supported in Oracle 10.2.0.2
• RDS from SilverStorm ported to OpenIB Gen2
• Preparing to test OpenIB + RDS at Oracle
Next Steps
• Work on the zcopy / directed send (DDP) specification now (ES-API)
• RD IPC docs from Oracle
• Richard.Frank@oracle.com
• RDS/eth
• Zach.Brown@oracle.com