Playing Distributed Systems with Memory-to-Memory Communication Liviu Iftode Department of Computer Science University of Maryland
Outline • The M2M Game • M2M Toys: VIA, InfiniBand, DAFS • Playing with M2M • Software DSM • Intra-Server Communication • Fault-Tolerance and Availability • TCP Offloading • Conclusions Most of this work has been done in the Distributed Computing (Disco) Lab at Rutgers University, http://discolab.rutgers.edu
How it all started... From multicomputers to clusters of computers • Cost-effective alternative to multicomputers • Commodity networks of high-volume uniprocessor or multiprocessor systems • track technology trends best • low cost/performance ratio • Networking became the headache of this approach • large software overheads
Too much OS... • Applications interact with the network interface through the OS: exclusive access, protection, buffering, etc. • OS involvement increases latency & overhead • Multiple copies (App -> OS, OS -> App) reduce effective bandwidth [Figure: send and receive paths traverse the OS on both hosts, with a copy at each Application/OS boundary before reaching the network interface]
User-Level Protected Communication • Application has direct access to the network interface • OS involved only in connection setup to ensure protection • Performance benefits: zero-copy, low-overhead • Special support in the network interface [Figure: send and (optional) receive go from the application straight to the NIC; the OS sits off the data path]
Two User-Level Communication Models • Active Messages: send(local_buffer, remote_handler) invokes a handler on the receiving node • Memory-to-Memory: send(local_buffer, remote_buffer) deposits data directly into a remote buffer [Figure: in both models the OS is bypassed on the data path; Active Messages run a receiver-side handler, M2M writes straight into a receiver-side buffer]
Memory-to-Memory Communication send(local_buffer, remote_buffer) • Receive operation not required • Also called: (virtually) mapped comm, send-controlled comm, deliberate update, remote write, remote DMA, non-intrusive/silent comm • Application buffers must be (pre)registered with the NIC [Figure: the receiver runs export(rem_buf); the sender runs Rid = import(rem_buf), then send(local_buf1, Rid) and send(local_buf2, Rid); the NICs move the data with no OS involvement]
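To make the model concrete, here is a minimal C sketch of the exchange in the figure; the m2m_* calls are hypothetical stand-ins (stubbed so the example compiles), not any real library's API.

/* A minimal sketch of the M2M model on this slide; the m2m_* calls
 * are hypothetical stand-ins, stubbed here so the sketch compiles. */
#include <stdio.h>
#include <string.h>

typedef int m2m_rid_t;  /* handle for an imported remote buffer */

/* hypothetical library calls, stubbed for illustration */
static m2m_rid_t m2m_import(const char *peer, const char *name)
{ printf("import %s from %s\n", name, peer); return 1; }
static void m2m_send(const void *buf, size_t len, m2m_rid_t rid)
{ printf("RDMA-write %zu bytes to remote buffer %d\n", len, rid); }

int main(void)
{
    char local_buf[64] = "payload";

    /* sender: bind once to the buffer the receiver export()ed... */
    m2m_rid_t rid = m2m_import("receiver-host", "rem_buf");

    /* ...then write into remote memory directly: no receive() is
     * posted and the receiver's CPU is not interrupted ("silent") */
    m2m_send(local_buf, strlen(local_buf) + 1, rid);
    m2m_send(local_buf, strlen(local_buf) + 1, rid);
    return 0;
}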
M2M Communication History • Started both in universities (SHRIMP at Princeton, U-Net at Cornell) and in industry (Hamlyn at HP, Memory Channel at DEC) • First application: High-Performance Computing • Software DSM: HLRC (Princeton), Cashmere (Rochester) • Lightweight message-passing libraries • Lightweight transport layer for cluster-based servers and storage • Industrial Standards • Virtual Interface Architecture (VIA) • InfiniBand I/O Architecture • Direct Access File System (DAFS) Protocol
Outline • The M2M Game • M2M Toys: VIA, InfiniBand, DAFS • Playing with M2M • Software DSM • Intra-Server Communication • Fault Tolerance and Availability • TCP Offloading • Conclusions
What is VIA? • M2M communication architecture similar to U-Net and VMMC/SHRIMP • Standard initiated by Compaq, Intel, and Microsoft in 1997 for cluster interconnects • Point-to-point, connection-oriented protocol • Two communication models • send/receive: a pair of descriptor queues • M2M: RDMA write and RDMA read
Virtual Interface Architecture • Data transfer at user level • Polling or interrupt for completions • Setup & memory registration through the kernel [Figure: the application posts descriptors to send, receive, and completion queues through the VI User Library; only setup and memory registration go through the Kernel Agent down to the VI NIC]
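As an illustration of posting work at user level, a simplified sketch follows; the vi_* names are hypothetical stand-ins for the real VI Provider Library (VIPL) calls, whose actual descriptor structures are richer.

/* Sketch of the two VIA transfer models, using hypothetical vi_*
 * wrappers (the real VIPL uses Vip* calls and richer descriptors). */
#include <stddef.h>

typedef struct vi_conn vi_conn_t;                 /* one VI connection */
typedef struct { void *addr; size_t len; unsigned handle; } vi_seg_t;

extern unsigned vi_register(void *addr, size_t len); /* via Kernel Agent */
extern void vi_post_send(vi_conn_t *c, vi_seg_t *s); /* send queue */
extern void vi_rdma_write(vi_conn_t *c, vi_seg_t *local,
                          void *raddr, unsigned rhandle);

void transfer(vi_conn_t *c, void *buf, size_t len,
              void *raddr, unsigned rhandle)
{
    vi_seg_t s = { buf, len, vi_register(buf, len) };

    /* send/receive model: two-sided; the peer must have posted a
     * matching descriptor on its receive queue */
    vi_post_send(c, &s);

    /* RDMA-write model: one-sided; data lands directly in the peer's
     * registered buffer, consuming no receive descriptor */
    vi_rdma_write(c, &s, raddr, rhandle);
}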
InfiniBand: An I/O Architecture with M2M • Point-to-point, switch-based I/O interconnect to replace the bus-based I/O architecture for servers • more bandwidth • more protection • Trade association founded by Compaq, Dell, HP, IBM, Intel, Microsoft and Sun in 1999 • M2M communication similar to VIA • RDMA write, RDMA read • Remote atomic operations
InfiniBand I/O Architecture • Hardware protocols for message-passing between devices implemented in channel adapters • A channel adapter (CA) is a programmable DMA engine with special protection features that allow DMA operations to be initiated locally and remotely [Figure: the host's processor and memory attach through a Host Channel Adapter (HCA) to a switched I/O fabric; I/O modules attach through Target Channel Adapters (TCAs)]
M2M Communication in InfiniBand • Memory region: virtually contiguous area of memory registered with the channel adapter (L_key) • Memory window: protected remote access to a specified area of the memory region (R_key) • Remote DMA Read/Write {L_key, R_key} [Figure: an RDMA operation names its local buffer with the local key and the remote target with the remote key, mapping between the two nodes' registered regions of physical memory]
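For concreteness, here is how region registration and the two keys look in the modern libibverbs API (which postdates this talk); the protection-domain setup is assumed to have happened elsewhere.

/* Sketch of region registration with the (later) libibverbs API:
 * ibv_reg_mr() returns a memory region whose lkey authorizes local
 * access and whose rkey can be handed to a peer for RDMA. */
#include <infiniband/verbs.h>

struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len)
{
    /* allow the local HCA to write the buffer and remote peers to
     * RDMA-read and RDMA-write it */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    /* mr->lkey is used in local work requests; mr->rkey is sent to
     * the remote side (e.g., during setup) so it can name this region */
    return mr;
}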
Direct Access File System • Lightweight remote file access protocol designed to take advantage of M2M interconnect technologies • The DAFS Collaborative, a group of 85 companies, proposed the standard in 2001 • High performance • Optimized for high throughput and low latency • Transfers directly to/from user buffers • Efficient file sharing using lock caching • Network-attached storage solution for data centers
DAFS Model [Figure: the DAFS client exposes a file access API to application buffers in user space and runs over VIPL; the DAFS server runs a kernel-level DAFS file server over KVIPL; both sides sit on VI NIC drivers and VI NICs]
M2M Product Market • VIA: Emulex (formerly Giganet) • InfiniBand: Mellanox • DAFS software distributions: Duke, Harvard, British Columbia, Rutgers (soon)
Outline • The M2M Game • M2M Toys: VIA, InfiniBand, DAFS • Playing with M2M • Software DSM • Intra-Server Communication • Fault Tolerance and Availability • TCP Offloading • Conclusions
Software DSM over VIA • Execution model: one application process on each node of the cluster • Invalidation-based memory coherence at page granularity using VM page protection • Data and synchronization traffic using VIA [Figure: each node holds its own code and data; together the nodes' memories form a shared virtual address space over the VIA interconnect]
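A minimal sketch of the page-protection mechanism, assuming 4 KB pages and a POSIX system; note_dirty() is a hypothetical protocol hook.

/* Page-granularity coherence via VM protection: shared pages start
 * read-only, the first write faults, and the handler marks the page
 * dirty (e.g., twins it for later diffing) before re-enabling writes.
 * Error handling omitted. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SZ 4096
extern void note_dirty(void *page);     /* protocol hook: twin the page */

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SZ - 1));
    note_dirty(page);                   /* record the page for diffing */
    mprotect(page, PAGE_SZ, PROT_READ | PROT_WRITE); /* retry the write */
}

void install_dsm_handler(void)
{
    struct sigaction sa;
    sa.sa_sigaction = on_fault;         /* extended handler, gets si_addr */
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, (struct sigaction *)0);
}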
Home-based Data Coherency using VIA [Figure: Node 1 writes page A and RDMA-writes a diff to A's home node, which applies it; Node 2 receives an RDMA-written page invalidation, and on its next read of A fetches the whole page from the home via RDMA]
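A sketch of the diff-propagation step, assuming word-granularity comparison against a twin copy of the page; rdma_write_words() is a hypothetical M2M call.

/* Diff propagation in a home-based protocol: compare the dirty page
 * against its twin word by word and ship only the modified runs to
 * the home node's copy of the page. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS (4096 / sizeof(uint32_t))
extern void rdma_write_words(int home, size_t page_off,
                             const uint32_t *words, size_t n);

void flush_diff(int home, const uint32_t *page, const uint32_t *twin)
{
    for (size_t i = 0; i < PAGE_WORDS; ) {
        if (page[i] == twin[i]) { i++; continue; }
        size_t run = i;                      /* start of a modified run */
        while (run < PAGE_WORDS && page[run] != twin[run])
            run++;
        /* RDMA-write the run into the home copy; the home's CPU is
         * not interrupted (silent communication) */
        rdma_write_words(home, i * sizeof(uint32_t), &page[i], run - i);
        i = run;
    }
}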
Lessons about M2M from DSM • Silent communication: used for diff propagation • Low latency: 75% of messages are small • Copy avoidance: not always possible • Useful but not available • Scatter-gather support • Remote read (RDMA Read) • Broadcast support
Scatter-Gather Support • "True" scatter-gather can avoid multiple message latencies • Potential gain of 5-10% [Figure: three cases compared source-to-destination: what we need (one scatter-gather transfer), what VIA supports, and what we do today (multiple messages)]
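A sketch of the gather interface the protocol would like to have; m2m_send_gather() is hypothetical, since VIA descriptors gather only on the local side.

/* Gather support as the DSM protocol wants it: one descriptor
 * carrying several source segments, sent as a single message. */
#include <stddef.h>

struct seg { const void *base; size_t len; };
extern void m2m_send_gather(int rid, size_t remote_off,
                            const struct seg *segs, size_t nsegs);

void send_diffs(int rid, const struct seg *diffs, size_t ndiffs)
{
    /* one message instead of ndiffs messages: one wire latency */
    m2m_send_gather(rid, 0, diffs, ndiffs);
}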
RDMA Read • Allows fetching of data without involving the processor of the remote node • Potential gain of 10-20% [Figure: without RDMA Read, a page request incurs scheduling delay plus handling time at the remote node before the page is returned; with RDMA Read, the page is fetched directly]
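For illustration, posting a one-sided fetch with the (later) libibverbs API looks like this; the queue pair, keys, and remote address are assumed to come from connection setup.

/* Fetching a remote page with an RDMA read: the remote CPU never
 * sees the request; completion is later polled via ibv_poll_cq(). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int fetch_page(struct ibv_qp *qp, void *local, uint32_t lkey,
               uint64_t raddr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)local,
                           .length = (uint32_t)len, .lkey = lkey };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_READ;        /* one-sided fetch */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = raddr;      /* named by the remote key */
    wr.wr.rdma.rkey = rkey;
    return ibv_post_send(qp, &wr, &bad);
}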
Broadcast Support • Useful for the software DSM protocol • Eager invalidation propagation • Eager update of data • Previous research (Cashmere’00) speculates a gain of 10-15% from the use of broadcast
Outline • The M2M Game • M2M Toys: VIA, InfiniBand, DAFS • Playing with M2M • Software DSM • Intra-Server Communication • Fault-Tolerance and Availability • TCP Offloading • Conclusions
TCP is really bad for Intra-Server Communication • Can easily steal 30-50% of the host cycles: • from a 1 GHz processor, only 500-700 MHz are available to the application • processor saturates before the NIC • TCP Offloading Engine (TOE) solves the problem only partially • without TOE: 90% of the dual 1 GHz processors required to achieve 2x875 Mb/s bandwidth • with TOE: 52% of dual 1 GHz processors required to obtain 1.9 Gb/s Ethernet bandwidth • With M2M (Mellanox InfiniBand): 90% of the 3.8 Gb/s bandwidth using only 7% of an 800 MHz processor
Distributed Intra-Cluster Protocols using M2M • Direct Access File System (DAFS): network-attached storage over VIA/IB • Sockets Direct Protocol (SDP): lightweight transport protocol over VIA/IB • SCSI RDMA Protocol (SRP): connects servers to storage area networks over VIA/IB • Ongoing industry debate • "TCP or not TCP?" = "IP or M2M network?"
Distributed Intra-Cluster Server Applications using M2M • Cluster-Based Web Server • Storage Servers • Distributed File Systems
Cluster-based Web Server: Press • Location-aware web server with request forwarding and load balancing (Bianchini et al., Rutgers) [Figure: within each node, the main, recv, send, fs, and disk components cooperate; TCP over eth0 carries client traffic, while VIA over cLAN carries intra-cluster traffic]
Performance of VIA-based Press Web Server [Carrera et al, HPCA’02]
Lessons about M2M from Web Servers • M2M/VIA used for small messages (requests, cache summaries, load) and large messages (files) • low overhead is the most beneficial feature • trading off transparency for performance is necessary • zero copy is sometimes traded for fewer messages (in the absence of scatter-gather)
VI-Attached Storage Server • M2M for the database-storage interconnect (Zhou et al.) [Figure: a database server connects over VI, across the network, to storage servers, each with a VI interface and local disks]
Database Performance with VI-Attached Storage Server • FC driver highly optimized by the vendor • cDSA outperforms it by 18% [Zhou et al., ISCA'02]
Lessons about M2M from Storage Servers • Zero-copy, low-overhead: most beneficial • Trade off transparency for performance • extend the I/O API (asynchronous I/O, buffer registration) and/or relax I/O semantics (I/O completion) • requires application modifications • Missing VIA features • no flow control • no buffer management • Serious competition: iSCSI
Federated File System (FedFS) Global file namespace for distributed applications, built on top of autonomous local file systems [Figure: applications A1-A3 run across the cluster; each node layers FedFS over its local FS, and the nodes cooperate over the M2M interconnect]
Location Independent Global Naming • Virtual Directory (VD): union of local directories • created on demand (dirmerge) and volatile • Directory Table: local cache of VDs (analogous to a TLB) [Figure: the virtual directory /usr containing file1 and file2 is the union of two local /usr directories, one holding file1 and the other file2]
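A sketch of the dirmerge idea using POSIX directory calls; vd_add() is a hypothetical hook that unions an entry into the cached virtual directory.

/* dirmerge: build a virtual directory as the union of the same
 * pathname on every member node's local file system; run per node
 * (or over fetched listings). Duplicate handling lives in vd_add(). */
#include <dirent.h>

extern void vd_add(const char *vdir, const char *name, int node);

void dirmerge(const char *vdir, const char *local_path, int node)
{
    DIR *d = opendir(local_path);       /* this node's local directory */
    if (!d)
        return;                         /* node has no component: skip */
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        vd_add(vdir, e->d_name, node);  /* union into the virtual dir */
    closedir(d);
}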
Role of M2M in FedFS • Directory Table - Virtual Directory coherency • Cooperative caching • File migration • DAFS + VIA/IP = FedFS over the Internet [Figure: the same FedFS stack as before, with the nodes speaking DAFS to one another over VIA/IP]
Outline • The M2M Game • M2M Toys: VIA, InfiniBand, DAFS • Playing with M2M • Software DSM • Intra-Server Communication • Fault-Tolerance and Availability • TCP Offloading • Conclusions
M2M for Fault Tolerance and Availability • Use RDMA write to efficiently mirror an application's virtual address space in remote memory: Fast Cluster Failover [Zhou et al.] • fast checkpointing • fast failover • Use RDMA read for "silent" state migration: Migratory TCP • extract checkpoints from overloaded servers with zero overhead
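A sketch of checkpoint mirroring by RDMA write, assuming a preregistered mirror region laid out like the local address space; rdma_write() is a hypothetical M2M call.

/* Push each dirty page of the application's address space to a
 * mirror region on a backup node; the backup's CPU does no work
 * until failover. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096
extern void rdma_write(int backup, uint64_t remote_off,
                       const void *src, size_t len);

void mirror_checkpoint(int backup, const uint8_t *base,
                       const size_t *dirty_pages, size_t ndirty)
{
    for (size_t i = 0; i < ndirty; i++) {
        size_t off = dirty_pages[i] * PAGE_SZ;
        /* remote offsets mirror the local layout, so failover can
         * map the mirrored region at the same addresses */
        rdma_write(backup, off, base + off, PAGE_SZ);
    }
}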
TCP-based Internet Services • Adverse conditions affect service availability • internetwork congestion or failure • servers overloaded, failed, or under DoS attack • TCP has only one response • network delays => packet loss => retransmission • TCP limitations • early binding of the service to a server • client cannot dynamically switch to another server for sustained service
The Migratory TCP Model [Figure: a client holds a connection to Server 1; Migratory TCP can move the live connection to Server 2]
Migratory TCP: At a Glance • Migratory TCP's solution to network delays: migrate the connection to a "better" server • Migration mechanism is generic (not application specific), lightweight (fine-grain migration of per-connection state), and low-latency • Requires changes to the server application but is totally transparent to the client application • Interoperates with existing TCP
Per-connection State Transfer [Figure: each server holds per-connection state in the application and in M-TCP; on migration, that state moves from Server 1 to Server 2 via RDMA]
The Application / M-TCP "Contract" • Server application • Define per-connection application state • During connection service, export snapshots of the per-connection application state when it is consistent • Upon acceptance of a migrated connection, import the per-connection state and resume service • Migratory TCP • Transfer per-connection application and protocol state from the old to the new server and synchronize (here is where VIA/IP can help!)
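A sketch of the server side of this contract; the mtcp_* names are hypothetical (the actual M-TCP API may differ).

/* The server exports a snapshot whenever its per-connection state is
 * consistent; M-TCP ships it, along with protocol state, to the new
 * server, which imports it and resumes service. */
#include <stddef.h>

struct conn_state { long bytes_served; /* app-defined, per connection */ };

extern void  mtcp_export_state(int sock, const void *st, size_t len);
extern void *mtcp_import_state(int sock, size_t *len);

void serve_chunk(int sock, struct conn_state *st)
{
    /* ... send one application-level chunk on sock ... */
    st->bytes_served += 1;
    /* state is consistent between chunks: snapshot it */
    mtcp_export_state(sock, st, sizeof(*st));
}

void resume_migrated(int sock)
{
    size_t len;
    struct conn_state *st = mtcp_import_state(sock, &len);
    (void)st;  /* continue service from st->bytes_served onward */
}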
Lazy Connection Migration [Figure: (0) the client holds connection C to Server 1; (1) the client sends <SYN C,...> to Server 2; (2) Server 2 sends a <State Request> to Server 1; (3) Server 1 answers with a <State Reply> (in the future, an RDMA read instead); (4) Server 2 completes the handshake with <SYN + ACK> and resumes the connection as C']
Future Work: Connection Migration using M2M [Figure: the same handshake as above, but the per-connection state moves by RDMA write (eager) from Server 1, or by RDMA read (lazy) from Server 2]
Stream Server Experiment After migration, effective throughput stays close to the average rate seen before the server's performance degraded (even without VIA)