IP Communication Fabric • Mike Polston, HP • michael.polston@hp.com
Agenda • Data Center Networking Today • IP Convergence and RDMA • The Future of Data Center Networking
Communication Fabric versus Communication Network • The Need • Fast, efficient messaging between two users of a shared network or bus • Predictable response and fair utilization for any two users of the 'fabric' • Examples of communication fabrics • Telephone switch • Circuit switch • ServerNet • Giganet • InfiniBand • RDMA over IP
How Many, How Far, How Fast? [Chart: number of systems connected (exponential scale) versus distance and 1/speed, positioning BUS, SAN, LAN, and Internet, with the "Fabrics" region highlighted]
Data Center Connections • Connections for: • Management • Public net access • Client (PC) access • Storage access • Server-to-server messaging • Load balancing • Server-to-server backup • Server to DBMS • Server-to-server HA
Fabrics Within the Data Center Today • Ethernet Networks • Pervasive Infrastructure • Proven Technology • IT Experience • Management Tools • Volume and Cost Leader • Accelerated Speed Improvements
Fabrics Within the Data Center Today • Clustering • High Availability • Computer Clustering • Some on Ethernet • Memory Channel • Other Proprietary • Async connections • Early standards (ServerNet, Giganet, Myrinet)
Fabrics Within the Data Center Today • Storage Area Networks • Fibre Channel • Mostly Standard • Gaining Acceptance • Record • File • Bulk Transfer
Fabrics Within the Data Center Today • Server Management • KVM Switches • HP RILOE, iLO • KVM over IP • Private IP nets
Business Growth … and the Need for Scale • Processors scale at Moore's Law • Doubling every 18 months • Networks scale at Gilder's Law • Doubling every 6 months • Memory bandwidth growth rate • Only 10-15% per year • Solution to scalability: Scale Up (partitionable systems) and Scale Out (a sea of servers)
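The gap between these curves is the heart of the scale-out argument. As a rough illustration (mine, not from the deck), the short C program below compounds the three quoted growth rates over a few years; after three years the network has grown roughly 64x while memory bandwidth has grown only about 1.5x.

```c
#include <math.h>
#include <stdio.h>

/* Rough illustration (not from the deck) of how the three growth rates quoted
 * above diverge: CPU performance doubling every 18 months, network bandwidth
 * doubling every 6 months, memory bandwidth growing ~15% per year. */
int main(void)
{
    for (int years = 1; years <= 5; years++) {
        double cpu = pow(2.0, years * 12.0 / 18.0);  /* Moore's Law */
        double net = pow(2.0, years * 12.0 / 6.0);   /* Gilder's Law */
        double mem = pow(1.15, years);               /* memory bandwidth */
        printf("%d yr: CPU x%5.1f  network x%7.1f  memory BW x%4.2f\n",
               years, cpu, net, mem);
    }
    return 0;
}
```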
Why Scale Out? • Provide benefits by adding, not replacing … • Fault Resilience • HA Failover • N + 1 Protection • Modular System Growth • Blades, Density • Investment Protection • Parallel Processing • HPTC • DBMS Processing • Tiered Architectures
Agenda • Data Center Networking Today • IP Convergence and RDMA • The Future of Data Center Networking
IP Convergence [Diagram: networking, storage, clustering, and remote management all converging onto IP]
Ethernet Bandwidth Evolution • 1973: 3 Mbps • 1979: 10 Mbps • 1994: 100 Mbps • 1998: 1 Gbps • 2002: 10 Gbps • 20xx: 1xx Gbps
Sockets Scalability: Where is the Overhead? • Traditional LAN Architecture Components [Diagram: Application → OSV API (Winsock) → user/kernel boundary → OSV API Kernel Service(s) → Protocol Stack(s) → Device Driver → LAN Media Interface; 50-150 ms one-way] • Send Message • 9000 instructions • 2 mode switches • 1 memory registration • 1 CRC calculation • Receive Message • 9000 instructions • Less with CRC & LSS offload • 2 mode switches • 1 buffer copy • 1 interrupt • 1 CRC calculation • Systemic Effects • Cache, scheduling • Single RPC Request = 2 sends & 2 receives
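To make the cost points concrete, here is a minimal sketch (mine, not from the deck) of the classic kernel sockets path the slide is counting: each send()/recv() is a system call, i.e. one of the two mode switches per message, and the kernel copies the payload between the application buffer and its own socket buffers. The echo service on 127.0.0.1:7007 is a hypothetical peer for illustration.

```c
/* Minimal sketch of the traditional kernel sockets data path: every
 * send()/recv() crosses the user/kernel boundary, and the kernel copies data
 * to/from its socket buffers. Assumes a hypothetical echo service is
 * listening on 127.0.0.1:7007. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);       /* kernel-owned object */
    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(7007);
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);
    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return 1;
    }

    char req[64] = "single RPC request";
    char rsp[64];
    /* One RPC = 2 sends + 2 receives across the two hosts; each call below is
     * a mode switch, with a buffer copy between user and kernel memory. */
    send(fd, req, strlen(req), 0);                  /* copy user -> kernel */
    ssize_t n = recv(fd, rsp, sizeof rsp, 0);       /* copy kernel -> user */
    if (n > 0)
        printf("got %zd bytes back\n", n);
    close(fd);
    return 0;
}
```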
What is RDMA? • Remote DMA (RDMA) • The ability to move data from the memory space of one process to another memory space with minimal use of the remote node's processor • Provides error-free data placement without CPU intervention or extra data copies at either node, a.k.a. Direct Data Placement (DDP) • Capable of being submitted and completed from user mode without subverting memory protection semantics (OS bypass) • Request processing for messaging and DMA is handled by the receiver without host OS/CPU involvement
The Need for RDMA • At 1 Gbps and above, memory copy overhead is significant, and it's not necessarily the CPU cycles • Server designs don't have 100 MBytes/sec of additional memory bandwidth to spare for buffer copying • RDMA makes each segment self-describing, so it can be landed in the right place without copying and/or buffering • Classic networking requires two CPUs to be involved in a request/response pair for data access • End-to-end latency includes kernel scheduling events at both ends, which can run to 10s-100s of milliseconds • TOE alone doesn't help with the kernel scheduling latency • RDMA initiates data movement from one CPU only, with no kernel transition; end-to-end latency is 10s of microseconds
Typical RDMA Implementation [Diagram: Applications and DBMS apps → OS vendor API (WinSock, MPI, other) → User Agent verbs (Open/Close/Map Memory, Send/Receive/Read/Write) operating directly on QPs and a CQ; a Kernel Agent and Kernel HW Interface handle setup down to the Fabric Media Interface (ServerNet, IB, Ethernet)]
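The verbs in the diagram map closely onto what an RDMA API exposes today. Below is a minimal sketch using the present-day libibverbs API (not an API named in the deck); the function name post_rdma_write is mine, and QP/CQ creation, memory registration, and exchange of the peer's advertised address and rkey are assumed to have happened during setup.

```c
/* Minimal sketch (libibverbs-style) of the user-mode data path: post an RDMA
 * Write work request on a queue pair (QP) and busy-poll the completion queue
 * (CQ). Connection setup, memory registration, and exchange of the peer's
 * remote address/rkey are assumed to have been done already. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,    /* local registered buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* place data in the peer's
                                                    advertised buffer */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* advertised by the peer */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))         /* no kernel transition */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)         /* completion also polled
                                                    from user mode */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```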
“Big Three” wins for RDMA • Accelerate legacy sockets apps • User space sockets -> SDP -> RDMA • Universal 25% - 35% performance gain in Tier 2-3 application communication overhead • Parallel commercial database • <100us latency needed to scale real world apps • Requires user space messaging and RDMA • IP based storage • Decades old block storage access model (iSCSI, SRP) • Command/RDMA Transfer/Completion • Emerging user space file access (DAFS, NFS, CIFS) • Compaq experiment identified up to 40% performance advantage. First lab test beat hand-tuned TPC-C run by 25%
Why IP versus IB? • Ethernet Hardware Continues to Advance • Speed • Low Cost • Ubiquity • TCP Protocol Continues to Advance • Management and Software Tools • Internet: Worldwide Trained Staff • World Standards: Power, Phone, IP
RDMA Consortium (RDMAC) • Formed in Feb 2002; went public in May 2002 • Founders were Adaptec, Broadcom, Compaq, HP, IBM, Intel, Microsoft, NetApp; added EMC and Cisco • Open group with no fees, working fast and furious • Deliverables include: • Framing, DDP, and RDMA specifications • Sockets Direct • SCSI mapping investigation • Deliverables to be submitted to the IETF as informational RFCs
The Stack (RDMA / DDP / MPA / TCP / IP) • RDMA – Converts RDMA Write, RDMA Read, and Send operations into DDP messages • DDP – Segments outbound DDP messages into one or more DDP segments; reassembles one or more DDP segments into a DDP message • MPA – Adds a backward marker at a fixed interval to DDP segments; also adds a length and CRC to each MPA segment • TCP – Schedules outbound TCP segments and satisfies delivery guarantees • IP – Adds necessary network routing information
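As a rough illustration of the MPA layer's job of length-and-CRC framing, the sketch below wraps a DDP segment with a 16-bit length prefix and a CRC-32C trailer (iWARP's MPA CRC is CRC-32C). The periodic backward markers and the exact padding rules of the real MPA specification are deliberately omitted, so the byte layout here is illustrative rather than wire-accurate.

```c
/* Simplified sketch (not byte-accurate to the MPA spec) of framing a DDP
 * segment for TCP: prepend a 16-bit length, append a CRC-32C over the frame.
 * Real MPA also inserts periodic backward markers and pads to a 4-byte
 * boundary; those details are omitted here. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Returns a newly allocated frame: [length][DDP segment][CRC-32C]. */
static uint8_t *frame_ddp_segment(const uint8_t *seg, uint16_t seg_len,
                                  size_t *out_len)
{
    size_t total = 2 + (size_t)seg_len + 4;
    uint8_t *fpdu = malloc(total);
    if (!fpdu)
        return NULL;
    fpdu[0] = (uint8_t)(seg_len >> 8);          /* length, network byte order */
    fpdu[1] = (uint8_t)(seg_len & 0xFF);
    memcpy(fpdu + 2, seg, seg_len);             /* DDP header + payload */
    uint32_t crc = crc32c(fpdu, 2 + seg_len);   /* CRC covers length + segment */
    fpdu[2 + seg_len + 0] = (uint8_t)(crc >> 24);
    fpdu[2 + seg_len + 1] = (uint8_t)(crc >> 16);
    fpdu[2 + seg_len + 2] = (uint8_t)(crc >> 8);
    fpdu[2 + seg_len + 3] = (uint8_t)crc;
    *out_len = total;
    return fpdu;
}
```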
RDMA Architectural Goals • Data transfer from the local to the remote system into an advertised buffer • Data retrieval from a remote system to the local system from an advertised buffer • Data transfer from the local to the remote system into a non-advertised buffer • Allow the local system to signal completion to the remote system • Provide reliable, sequential delivery from local to remote • Provide multiple stream support
RDMAP Data Transfer Operations • Send • Send with Invalidate • Send with Solicited Event (SE) • Send with SE and Invalidate • RDMA Write • RDMA Read • Terminate
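For readers mapping these onto an implementation, the annotated enumeration below is my own summary of what each operation does; the identifiers are invented for illustration and make no claim about the RDMAC wire encodings.

```c
/* Annotated summary (mine, not the RDMAC wire encoding) of the RDMAP data
 * transfer operations listed on the slide. */
enum rdmap_operation {
    RDMAP_SEND,                /* deliver a message into an untagged
                                  (anonymous) receive buffer at the peer */
    RDMAP_SEND_INVALIDATE,     /* Send that also asks the peer to invalidate
                                  a steering tag (STag) it had advertised */
    RDMAP_SEND_SE,             /* Send with Solicited Event: requests a
                                  completion notification at the data sink */
    RDMAP_SEND_SE_INVALIDATE,  /* combination of the two above */
    RDMAP_RDMA_WRITE,          /* place data directly into a tagged buffer
                                  the peer advertised; no peer CPU involved */
    RDMAP_RDMA_READ,           /* pull data from the peer's tagged buffer
                                  into a local tagged buffer */
    RDMAP_TERMINATE            /* report an error and tear down the stream */
};
```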
Direct Data Placement • Each segment contains placement information • Relative address • Record length • Tagged buffers • Untagged buffers • Allows NIC hardware to access application memory (Remote DMA) • Can be implemented with or without TOE
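To make the tagged-buffer idea concrete, here is a small sketch (mine, not from the deck) of what "placement" means on the receive side: the NIC resolves the steering tag (STag) carried in a segment to a pre-registered buffer, bounds-checks the tagged offset, and copies the payload straight to its final location with no intermediate buffering. The table and field names are invented for illustration.

```c
/* Illustrative sketch of tagged-buffer placement on the receive side: an
 * RNIC would look up the STag carried in a DDP segment, bounds-check the
 * tagged offset, and place the payload directly into application memory. */
#include <stdint.h>
#include <string.h>

struct reg_buffer {            /* one advertised (registered) buffer */
    uint32_t stag;             /* steering tag handed to the peer */
    uint8_t *base;             /* start of the application buffer */
    size_t   length;           /* registered length */
};

/* Place 'len' payload bytes at tagged offset 'to' inside the buffer
 * identified by 'stag'. Returns 0 on success, -1 if the tag is unknown or
 * the segment would fall outside the registered region. */
int ddp_place_tagged(const struct reg_buffer *table, size_t nbufs,
                     uint32_t stag, uint64_t to,
                     const uint8_t *payload, size_t len)
{
    for (size_t i = 0; i < nbufs; i++) {
        if (table[i].stag != stag)
            continue;
        if (to + len > table[i].length)
            return -1;                            /* placement out of bounds */
        memcpy(table[i].base + to, payload, len); /* direct placement, no
                                                     intermediate copy */
        return 0;
    }
    return -1;                                    /* unknown STag */
}
```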
RDMA over MPA/TCP Header Format [Diagram: a ULP PDU (logical operation: Send, RDMA Write, or RDMA Read targeting an RDMA or anonymous buffer) is carried as a DDP segment (DDP/RDMA header(s) + DDP/RDMA payload); MPA adds the frame PDU length and markers around the ULP payload; that becomes the TCP payload of a TCP segment (TCP header + TCP data), which rides in an IP datagram (IP header + IP data) inside an Ethernet frame (Ethernet header + data)]
Agenda • Data Center Networking Today • IP Convergence and RDMA • The Future of Data Center Networking
Two Customer Adoption Waves as Solutions Evolve [Chart: emerging fabric adoption (InfiniBand, then RDMA/TCP) over time] • Wave 1 • First fabric solutions available (InfiniBand) • Fabric evaluation within data centers begins • Wave 2 • Fabric computing pervasiveness • IP fabric solutions become the leading choice for data center fabric deployments • Leverage existing investment and improve infrastructure performance and utilization • InfiniBand used for specialized applications
Ethernet Roadmap: Continued Ethernet Pervasiveness in the Datacenter • Today's Ethernet Infrastructure • 1 Gigabit Ethernet • Improved Ethernet Performance and Utilization • 10 Gigabit Ethernet • TCP/IP offload & acceleration • IPsec security acceleration • Revolutionary IP Improvements & Advancements • Interconnect convergence • Scalability & performance • Resource virtualization • RDMA/TCP fabrics • iSCSI (storage over IP) • Lights-out management (KVM over IP)
hp Fabric Leadership: Bringing NonStop Technologies to Industry Standard Computing [Diagram: technology & expertise and high-volume knowledge building toward fabric computing] • ServerNet: introduced the first switched fabric • Fibre Channel: leading storage fabric • InfiniBand: drive fabric standards • RDMA/TCP: robust, scalable computing with breakthrough fabric economics
Fabrics Within Future Data Centers: Foundation for Future Adaptive Infrastructure Vision [Diagram: a data center fabric linking internet routing switches and an edge router/firewall; management systems (hp OpenView, ProLiant Essentials, hp utility data center: provisioning, monitoring, resource mgmt, by policy, service-centric); a storage fabric (Fibre Channel SAN via an IP-to-FC router, NAS, iSCSI SAN, virtualized functions); and a compute fabric (web servers, application servers, database servers, with an IP-to-IB router for UNIX)] • Heterogeneous fabric "islands" • Data center fabric connecting "islands" of compute & storage resources • RDMA/TCP enables practical fabric scaling across the datacenter • Protocol routers translate between islands • Move from "tiers" to "elements" • N-tier architecture, like DISA, replaced by element "pools" available over the fabric • Resource access managed by tools like ProLiant Essentials and hp OpenView • Centrally administered automation tools