This summary provides insights into leveraging Infiniband technology for high-performance storage solutions, covering basics, storage protocols, file system scenarios, and the benefits of RDMA, SRP, and iSER protocols. It explains how these technologies improve throughput, reduce latency, and simplify storage system architectures.
Native Infiniband Storage • John Josephakis, VP, Data Direct Networks • St. Louis, November 2007
Summary • Infiniband basics • IB Storage Protocols • File System Scenarios • DDN Silicon Storage Appliance
Infiniband Basics • Quick introduction • RDMA • Infiniband • RDMA Storage Protocols • SRP (SCSI RDMA Protocol) • iSER (iSCSI Extensions for RDMA)
RDMA • RDMA (Remote Direct Memory Access) • Enables direct access to a remote host's main memory without intermediate copies (zero-copy networking) • No CPU, cache or context switching overhead • High Throughput • Low Latency • Application data delivered directly to network
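The zero-copy idea on this slide can be illustrated by analogy in plain Python. This sketch is not RDMA and uses no Infiniband API; the buffer names are hypothetical and only stand in for the intermediate copies a non-RDMA path would make.

```python
# Analogy only: ordinary process memory, not RDMA or an Infiniband API.
app_buffer = bytearray(b"application data")

# Copying path: every intermediate buffer duplicates the payload.
socket_buffer = bytes(app_buffer)       # copy 1 (hypothetical socket buffer)
adapter_buffer = bytes(socket_buffer)   # copy 2 (hypothetical adapter buffer)

# Zero-copy path: a memoryview references the same bytes without duplicating them.
direct_view = memoryview(app_buffer)

# Changing the application buffer is visible through the view,
# but not through the copies made earlier.
app_buffer[0] = ord("A")
print(bytes(direct_view[:11]))   # reflects the change
print(adapter_buffer[:11])       # still the old contents
```

The view shares storage with the application buffer, which is the property a zero-copy transfer relies on: the payload is never duplicated on its way out.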
Infiniband • Collapses Network And Channel Fabrics Into One Consolidated Interconnect • Bidirectional, Double-data Rate (DDR) & Quad-data Rate (QDR) • Low-latency, Very High Performance, Serial I/O • Greatly Reduces Complexity and Cost • Huge Performance Reduces Storage System Count • Single Interconnect Adapter for Server to Server and Storage • Source: InfoStor
SRP (SCSI RDMA Protocol) • SRP == SCSI over Infiniband • Similar to FCP (SCSI over Fibre Channel) except that CMD Information Unit includes addresses to get/place data • Initiator drivers available with IB Gold and OpenIB
SRP (SCSI RDMA Protocol) • Advantages • Native Infiniband protocol • No new hardware required • Requests carry buffer information • All data transfers occur through Infiniband RDMA • No Need for Multiple Packets • No flow control for data packets necessary
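"Requests carry buffer information" can be sketched as a command that embeds a descriptor of the initiator's memory, so the target can move data with RDMA on its own. The 16-byte layout below (8-byte address, 4-byte memory key, 4-byte length, big-endian) is an assumption modeled on a direct data buffer descriptor; the function names and sample values are hypothetical, and the SRP specification should be consulted for the exact wire format.

```python
import struct

# Assumed layout: 8-byte virtual address, 4-byte memory key, 4-byte length.
DIRECT_BUFFER = struct.Struct(">QII")


def pack_buffer_descriptor(virtual_address, memory_key, length):
    """Encode where the target should read or place the data."""
    return DIRECT_BUFFER.pack(virtual_address, memory_key, length)


def unpack_buffer_descriptor(raw):
    """Decode a descriptor as the target side would see it."""
    virtual_address, memory_key, length = DIRECT_BUFFER.unpack(raw)
    return {
        "virtual_address": virtual_address,
        "memory_key": memory_key,
        "length": length,
    }


# Hypothetical values for a 64 KiB transfer.
descriptor = pack_buffer_descriptor(0x7F0000001000, 0x1234, 64 * 1024)
print(len(descriptor), unpack_buffer_descriptor(descriptor))
```

Because the address and key travel with the request, the target needs no further round trip to learn where the data belongs.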
iSER (iSCSI extensions for RDMA) • iSER leverages iSCSI management and discovery • Zero-Configuration, global storage naming (SLP, iSNS) • Change Notifications and active monitoring of devices and initiators • High-Availability, and 3 levels of automated recovery • Multi-Pathing and storage aggregation • Industry standard management interfaces (MIB) • 3rd party storage managers • Security (Partitioning, Authentication, central login control, ..)
iSCSI Mapping to iSER / RDMA Transport • [Diagram labels: Protocol frames (RDMA); iSCSI PDU fields BHS, AHS, HD, Data, DD; "X In HW" (twice); RC Send; RC RDMA Read/Write] • iSER eliminates the traditional iSCSI/TCP bottlenecks: • Zero copy using RDMA • CRC calculated by hardware • Work with message boundaries instead of streams • Transport protocol implemented in hardware (minimal CPU cycles per IO)
iSCSI protocol (read) • [Diagram labels: IB HCA; Send_Control + Buffer advertisement; Control_Notify; Send_Control (SCSI Read Cmd); iSCSI Initiator, iSER, HCA; HCA, iSER, Target, Target Storage; Data_Put (Data-In PDU) for Read; RDMA Write for Data; Control_Notify; Send_Control (SCSI Response)] • SCSI Reads • Initiator sends Command PDU (Protocol Data Unit) to Target • Target returns data using RDMA Write • Target sends Response PDU back when the transaction completes • Initiator receives the Response and completes the SCSI operation
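The four read steps on this slide can be walked through with a toy simulation. It is a sketch only: nothing here touches an HCA or a real iSER stack, the function and variable names are hypothetical, and the "RDMA Write" is modeled as the target filling a buffer the initiator advertised.

```python
def iser_read(target_disk, offset, length):
    """Simulate one SCSI read; return (data, trace of protocol steps)."""
    trace = []

    # Step 1: the initiator advertises a buffer and sends the Command PDU.
    initiator_buffer = bytearray(length)
    trace.append("initiator -> target: Send (SCSI Read Command PDU + buffer advertisement)")

    # Step 2: the target places the data directly into the advertised buffer.
    initiator_buffer[:] = target_disk[offset:offset + length]
    trace.append("target -> initiator: RDMA Write (read data)")

    # Step 3: the target reports completion with a Response PDU.
    trace.append("target -> initiator: Send (SCSI Response PDU)")

    # Step 4: the initiator completes the SCSI operation.
    trace.append("initiator: SCSI operation complete")
    return bytes(initiator_buffer), trace


disk = bytes(range(16))          # hypothetical 16-byte "disk"
data, steps = iser_read(disk, offset=4, length=8)
for step in steps:
    print(step)
```

Note that the data never travels inside a PDU in this flow: only the command and the response are sent as messages.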
IB/FC and IB/IP Routers • IB/FC and IB/IP routers make Infiniband integration easy in existing iSCSI fabrics • Both Service Location Protocol (SLP) and iSNS (Internet Storage Name Service) allow for smooth iSCSI discovery in the presence of Infiniband
iSCSI Discovery with Service Location Protocol (SLP) • Client broadcast: I’m xx, where is my storage? • FC Routers discover FC SAN • Relevant iSCSI Targets & FC gateways respond • Client may record multiple possible targets & portals • Portal: a network end-point (IP + port), indicating a path • [Diagram labels: iSCSI Client, IB to FC Routers, IB to IP Router, Native IB RAID, GbE Switch, FC Switch]
iSCSI Discovery with Internet Storage Name Service (iSNS) • FC Routers discover FC SAN • iSCSI Targets & FC gateways report to iSNS Server • Client asks iSNS Server: I’m xx, where is my storage? • iSNS responds with targets and portals • Resources may be divided into domains • Changes notified immediately (SCNs) • iSNS or SLP run over IPoIB or GbE, and can span both networks • [Diagram labels: iSCSI Client, iSNS Server, IB to FC Routers, IB to IP Router, Native IB RAID, GbE Switch, FC Switch]
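The register-then-query pattern on this slide can be sketched with a toy in-memory name server. This is not the iSNS wire protocol or any real client library; the class, the domain name, the target name and the addresses are all hypothetical, and a portal is modeled as an (IP, port) pair.

```python
from collections import defaultdict


class ToyNameServer:
    """In-memory stand-in for a discovery server with domains."""

    def __init__(self):
        self._targets = defaultdict(dict)   # domain -> {target name: [portals]}
        self._members = defaultdict(set)    # domain -> client names

    def register_target(self, domain, target_name, portals):
        """Targets and gateways report themselves to the server."""
        self._targets[domain][target_name] = list(portals)

    def add_client(self, domain, client_name):
        """Place a client in a domain so it can see that domain's targets."""
        self._members[domain].add(client_name)

    def query(self, client_name):
        """Answer "where is my storage?" with targets and their portals."""
        visible = {}
        for domain, clients in self._members.items():
            if client_name in clients:
                visible.update(self._targets[domain])
        return visible


server = ToyNameServer()
server.register_target("hpc", "iqn.example:raid0", [("10.0.0.5", 3260)])
server.add_client("hpc", "node1")
print(server.query("node1"))   # targets in node1's domain
print(server.query("node2"))   # node2 belongs to no domain
```

Dividing resources into domains means a client only ever learns about the targets it is allowed to use, which is what the query method enforces here.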
SRP vs iSER • Both SRP and iSER make use of RDMA • Source and Destination Addresses in the SCSI transfer • Zero memory copy • SRP Uses • Direct server connections • Small controlled environments • Products: DDN, LSI, Mellanox, … • iSER Uses • Large switch connected Networks • Discovery fully supported • Products: Voltaire/Falconstor, …
OpenIB Standards Partners • Products shown: Cisco SFS 7008, Qlogic 9024, Voltaire 9096, Dell PowerEdge 1855 Blade Server, IBM 1350 Cluster Server, HP ProLiant DL Server, DDN S2A9550 and S2A9900 • Other companies: AMD, Appro, Cisco Systems, Intel, LSI Corporation, Mellanox Technologies/Qlogic, Network Appliance, Silicon Graphics Inc., Sun Microsystems
Silicon Storage Appliance Storage System • The Storage System Difference
IB and S2A9550-S2A9900 • Native IB and RDMA (and FC-4) • Greatly Reduces File System Complexity and Cost • Huge Performance Reduces Storage System Count • PowerLUNs Reduce Number Of LUNs & Striping Required • Supports >1000 Fibre-Channel Disks and SATA Disks • Parallelism Provides Inherent Load-Balancing, Multi-Pathing and Zero-Time Failover • Supports All Open Parallel and Shared SAN File Systems (and Some that Aren’t So Open) • Supports All Common Operating Systems • Linux, Windows, IRIX, AIX, Solaris, Tru64, OS-X and more
The Storage System Difference • High Performance Scalability • Up to 2.4 GBytes/s per Couplet • Active/Active Controllers • Full Host to Disk Parallel Access • 8 IB-4X and/or 8 FC-4 Host Ports to 20 FC Disk Loops • High Speed LUN (PowerLUN) • No Performance Loss in Degraded Mode • RDMA Enabled: Low Latency Application Access • Large Capacity, High Density Scalability • Scale Up to 960 TB in Two Racks!!! • Fibre Channel or SATA Storage • RAID 6 (8+2) and Read & Write Error Checking • 60% Power Savings (Sleep Mode) • Best $ per Performance • Best $ per Capacity per Sq.Ft.
S2A Architecture, 8+2 • [Diagram: Tier 1, Tier 2 and Tier 3, each shown as disks A B C D E F G H P P; labels: RAID 0; RAID “3/5”; 8+2 Byte Stripe; 8 FC-4 and/or 4 IB 4X Parallel Host Ports; 2 x 10 FC Loops to Disks] • Double Disk Failure Protection • Implemented in Hardware State Machine • Equivalent 8+1 & 8+2 performance • Parity Computed Writes • Read Error Checking • Multi-Tier Storage Support, Fibre Channel or SATA Disks • Up to 1120 disks total • 896 formattable disks
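The two disk counts on this slide are consistent with the 8+2 layout, as a quick calculation shows. The sketch assumes that "formattable" means the data disks that remain once two parity disks are set aside in every ten-disk tier; the constant names are hypothetical.

```python
DATA_DISKS_PER_TIER = 8      # disks A..H in the diagram
PARITY_DISKS_PER_TIER = 2    # disks P, P in the diagram
TOTAL_DISKS = 1120           # "Up to 1120 disks total"

disks_per_tier = DATA_DISKS_PER_TIER + PARITY_DISKS_PER_TIER
tiers = TOTAL_DISKS // disks_per_tier
formattable_disks = tiers * DATA_DISKS_PER_TIER
parity_overhead = PARITY_DISKS_PER_TIER / disks_per_tier

print(tiers, formattable_disks, parity_overhead)
```

Under that assumption the system holds 112 tiers, 896 of the 1120 disks carry data, and parity costs 20% of raw capacity.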
Example of an 8GB/s sustained Tier 1/2 Multi-Fabrics Solution, FC4 & I/B together • Phasing out legacy Fibre Channel SANs in favor of zero-copy InfiniBand RDMA channeled I/O • leverages DataDirect Networks, Inc. Active/Active [FC4+Infiniband] host ports • can do Tier 1 and/or Tier 2: dozens of Terabytes to multiple Petabytes • non-blocking heterogeneous channel I/O architecture • native SRP eliminates memory copies and enables commodity I/O node usage • fewer pipes, fewer I/O nodes, more efficient I/O nodes
Infiniband-less File System Scenario • Standard SCSI Socket Block-Level Transfers • RAID System • Multiple Copies • I/O Server Nodes • Multiple Copies • [Diagram labels: CPU Nodes, File System, I/O Node, System Memory, Socket Buffer, Adapter Buffer, Legacy FC Fabric, Adapter Buffer, Socket Buffer, RAID Cache, RAID, Raw Disk]
Infiniband based File System Scenario 1 • RDMA Block-Level Transfers • Zero Memory Copy • Allows the storage system to RDMA data directly into file system memory space, from which it can then be RDMAed to the client side as well • [Diagram labels: CPU Nodes, Infiniband Fabric, File System, System Memory, I/O Node, RAID Cache, RAID]
Infiniband based File System Scenario 2 • RDMA Block-Level Transfers • Zero Memory Copy • Zero Server Hops • [Diagram labels: CPU Nodes, Infiniband Fabric, File System, MDS, RAID Cache, RAID, Raw Disk]
Infiniband Environments • 2.4+GB/s Throughput per HPCSS Data Building Block • Thousands of FileOps/s • [Diagram labels: SDR or DDR Infiniband Network; Standby; Customer Supplied 10/100/1000 Mbit Ethernet Network; HPCSS Metadata Service; HPCSS Data Building Block] • HPCSS Data Building Block: • Disk Configured Separately: Bundle Disk Solutions Available • 8 Host Ports Required • Customer Must Supply qty 4 PCI-E Infiniband Cards (one per OSS) • Customer Must Supply Separate Ethernet Management Network for Failover Services • HPCSS Metadata Service: • Metadata Storage Holds 1.5M Files in Standard Configuration: More Optional • Customer Must Supply 2 PCI-E Infiniband Cards (one per MDS) • Customer Must Supply Separate Ethernet Management Network for Failover Services
IB Performance Results • Different settings, but here is what we have • S2A9550 with Lustre or GPFS (S2A with SRP) • ~2.4-2.6 GB/sec per S2A9550 • One stream: about 650 MB/sec • We have observed similar performance with both, although GPFS is still behind Lustre in IB deployments • With the upcoming S2A9900 using DDR2 we expect: • ~5-5.6 GB/sec per S2A9900 • One stream: about 1.25 GB/sec via each IB PCIe bus with one HCA • Over the next 3-9 months we expect significant improvements in both, as file systems will better utilize larger clock sizes • At the present time Lustre is really the only file system that has an IB NAL • The above numbers are an average between FC and SATA disks, assuming best-case scenarios • For small random I/O, IB technology and SATA disks are not optimal, and those numbers can be as low as 1 GB/sec per S2A or 200 MB/sec per stream
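The per-system and per-stream figures above imply how many concurrent streams are needed to reach the quoted aggregate rate. The sketch below is only that arithmetic: it takes the low end of each quoted range, treats 1 GB as 1000 MB, and assumes streams scale linearly, which the slide does not claim. The function name is hypothetical.

```python
import math


def streams_to_saturate(aggregate_mb_per_s, per_stream_mb_per_s):
    """Smallest number of streams whose combined rate reaches the aggregate."""
    return math.ceil(aggregate_mb_per_s / per_stream_mb_per_s)


s2a9550_streams = streams_to_saturate(2400, 650)     # ~2.4 GB/s at ~650 MB/s per stream
s2a9900_streams = streams_to_saturate(5000, 1250)    # ~5 GB/s at ~1.25 GB/s per stream
small_random_streams = streams_to_saturate(1000, 200)  # worst case quoted above

print(s2a9550_streams, s2a9900_streams, small_random_streams)
```

On these assumptions, roughly four streams fill either system in the best case and five in the small random I/O case.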
Reference Sites • LLNL • Sandia • ORNL • NASA • DOD • NSF Sites • NCSA, PSC, TACC, SDSC, IU • EMEA • CEA, AWE, Dresden, Cineca, DKRZ