R2D2: Reliable and Rapid Data Delivery for DCs
Berk Atikoglu, Mohammad Alizadeh, Tom Yue, Balaji Prabhakar, Mendel Rosenblum
Motivation
• Unreliable packet delivery due to
  • Corruption
    • Dealt with via retransmission
  • Congestion
    • Particularly bad due to incast or fan-in congestion
• These losses increase the difficulty of reliable transmission
  • Loss of throughput
  • Increase in flow transfer times
Incast
• The client sends a request to several servers.
• The responses travel to the switch simultaneously.
• The switch buffer overflows from the amount of data. Some packets are dropped.
[Figure: client (C) and servers (S) around a switch, shown for steps 1–3]
Existing Approaches
• High-resolution timers
  • Reduce retransmission timeouts (RTO) to hundreds of µs
  • Proposed in Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
  • Spend a large number of CPU cycles on rapid interrupts or timer programming
  • In virtualized environments, the high cost of processing hardware interrupts means even higher overhead
• Large switch buffers
  • Reduce incast occurrences by caching enough packets
  • Increased packet latency
  • Complex implementation
  • Large caches are expensive
  • Increased power usage
Our Approach: R2D2
• R2D2: collapse all flows into a single “meta-flow”
  • Single wait queue holds packets sent by the host that are not yet ACKed
  • Single retransmission timer, no per-flow state
• Provides reliable packet delivery
• Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3
• Key observation: exploit the uniformity of data center environments
  • Path lengths between hosts are small (3–5 hops)
  • RTTs are small (100–400 µs)
  • Path bandwidths are uniformly high (1 Gbps, 10 Gbps)
  • Therefore, the amount of data “in flight” from a 1G/10G source is less than 64/640 KB
• Store source packets in R2D2 on the fly; rapidly retransmit dropped or corrupted packets
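The in-flight bound above follows from the bandwidth-delay product (link rate times RTT). The short sketch below checks the arithmetic for the link rates and RTT range quoted on the slide; the function name `bdp_bytes` is only an illustrative choice, not something from the talk.

```python
def bdp_bytes(link_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes for a link rate (bits/s) and RTT (µs)."""
    # bits in flight = link_bps * rtt_us / 1e6; divide by 8 to get bytes
    return link_bps * rtt_us // 8_000_000


if __name__ == "__main__":
    for label, bps in (("1 Gbps", 1_000_000_000), ("10 Gbps", 10_000_000_000)):
        for rtt_us in (100, 400):
            kb = bdp_bytes(bps, rtt_us) / 1000
            print(f"{label}, RTT {rtt_us} µs: {kb:g} KB in flight")
```

At the upper RTT of 400 µs this gives 50 KB for a 1 Gbps source and 500 KB for a 10 Gbps source, both under the 64/640 KB figures on the slide, so a small per-host buffer is enough to hold every unacknowledged packet.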
TCP
[Figure: protocol stack diagram with layers L3 and L2]
R2D2
[Figure: protocol stack diagram with layers L3, L2.5, and L2]
R2D2
• Outbound packet is intercepted by R2D2.
• A timer is started.
• A copy of the packet is placed in the wait queue.
• The returned TCP ACK removes all ACKed packets held in the wait queue.
• When a flow times out:
  • Retransmit first un-ACKed packet (fill the hole).
  • Back-off: double the flow’s timeout value.
• When an ACK comes in:
  • Reset the timeout back-off.
[Figure: R2D2 sender at Layer 2.5, between Layer 3 and Layer 2, with steps 1–4 marked]
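The sender-side behavior described on this slide can be sketched roughly as follows. This is a simplified illustration, not the authors' kernel module: it assumes a single sequence space for the whole meta-flow, a cumulative ACK, one shared timeout value, and a caller that passes in the current time. The class name `MetaFlowSender` and its method names are hypothetical. The defaults of 3 ms and 10 retransmissions are taken from the Algorithms slide.

```python
from collections import OrderedDict


class MetaFlowSender:
    """Toy model of a single wait queue with one retransmission timer."""

    def __init__(self, send, min_timeout=0.003, max_retx=10):
        self.send = send                  # callable that puts a packet on the wire
        self.min_timeout = min_timeout    # seconds
        self.timeout = min_timeout        # current (possibly backed-off) timeout
        self.max_retx = max_retx
        self.wait_queue = OrderedDict()   # seq -> [packet copy, retransmit count]
        self.deadline = None              # when the single timer fires

    def _rearm(self, now):
        self.deadline = now + self.timeout if self.wait_queue else None

    def on_outbound(self, seq, packet, now):
        # Intercept the packet, keep a copy, start the timer if it is idle.
        self.wait_queue[seq] = [packet, 0]
        if self.deadline is None:
            self.deadline = now + self.timeout
        self.send(packet)

    def on_ack(self, acked_through, now):
        # Remove every ACKed packet and reset the back-off.
        for seq in [s for s in self.wait_queue if s <= acked_through]:
            del self.wait_queue[seq]
        self.timeout = self.min_timeout
        self._rearm(now)

    def on_tick(self, now):
        # Called periodically; acts only when the timer has expired.
        if self.deadline is None or now < self.deadline:
            return
        seq, entry = next(iter(self.wait_queue.items()))
        if entry[1] >= self.max_retx:
            del self.wait_queue[seq]      # give up: reliable, not guaranteed
        else:
            entry[1] += 1
            self.send(entry[0])           # fill the hole
            self.timeout *= 2             # back off
        self._rearm(now)
```

Passing `now` explicitly keeps the sketch deterministic and easy to test; a real implementation would use kernel timers instead.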
Features
• Reliable, but not guaranteed, delivery
  • Maximum number of retransmissions before giving up
• State-sharing
  • Only one wait queue; all packets go in the same queue
• No change to network stack
  • Kernel module in Linux; driver in Windows
  • Hardware version is OS-independent
• Incremental deployability
  • Possible to protect a subset of flows
Implementation
• Implemented as a Linux kernel module on kernel 2.6.*
  • No need to modify kernel
  • Can be loaded/unloaded easily
• Incoming/outgoing TCP/IP packets are captured using Netfilter
• Captured packets are put into a queue
  • Only metadata is kept in the queue; the packet is cloned
• L2.5 thread processes the packets in the queue periodically
Test Setup
• 48 Dell PowerEdge 2950 servers
  • Intel Core 2 Quad Q9550 × 2
  • 16 GB ECC DRAM
  • Broadcom NetXtreme II 5708 1GbE NIC
  • CentOS 5.3 Final; Linux 2.6.28-10
• Switches
  • Netgear GS748TNA (48 ports, GbE)
  • Cisco Catalyst 4948 (48 ports, GbE)
  • BNT RackSwitch G8421 (24 ports, 10GbE)
[Figure: 1 rack, 48 servers, 1GbE / 10GbE]
Algorithms
• R2D2
  • Minimum timeout: 3 ms
  • Max retransmissions: 10
  • Delayed ACK disabled
• TCP: CUBIC TCP
  • minRTO: 200 ms
  • Segmentation offloading: disabled
  • TCP timestamps: disabled
Workload – 1 GbE switches
• Number of servers (N): 1, 2, 4, 8, 16, 32, 46
• File size (S): 1 MB, 20 MB
• Client:
  • Requests (S/N) MB from each server
  • Issues a new request when all servers have responded
• Measurements:
  • Goodput
  • Retransmission ratio = (retransmitted packets) / (total packets sent by TCP)
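As a small illustration of the workload parameters, the sketch below computes the per-server request size S/N for the server counts listed above and defines the retransmission ratio as stated on the slide. The function names are hypothetical, and returning 0.0 when nothing has been sent is an assumption made here to avoid a division by zero.

```python
SERVER_COUNTS = [1, 2, 4, 8, 16, 32, 46]  # N values from the workload slide


def per_server_request_mb(file_size_mb: float, num_servers: int) -> float:
    """Size of the block the client requests from each server (S/N), in MB."""
    return file_size_mb / num_servers


def retransmission_ratio(retransmitted_packets: int, total_packets_sent: int) -> float:
    """Retransmitted packets divided by total packets sent by TCP."""
    if total_packets_sent == 0:
        return 0.0  # assumption: define the ratio as 0 when nothing was sent
    return retransmitted_packets / total_packets_sent


if __name__ == "__main__":
    for n in SERVER_COUNTS:
        size = per_server_request_mb(20, n)
        print(f"N={n:2d}: {size:.3f} MB requested from each server (S = 20 MB)")
```

With S fixed, each server's share shrinks as N grows, so the total amount of data converging on the client's switch port per request stays the same while the number of simultaneous senders increases.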
Netgear Test – Goodput
[Figure: goodput plots for 1 MB and 20 MB files]
Netgear Test – Retransmission Ratio
[Figure: retransmission ratio plots for 1 MB and 20 MB files]
Netgear Test – Multiple Clients
• 6 clients (instead of 1 client)
• 32 servers
• Each client requests a file from each of the 32 servers
[Figure: plots for 1 MB and 20 MB files]
Catalyst 4948 Test – Goodput
[Figure: goodput plots for 1 MB and 20 MB files]
10GbE Test – Goodput
• File size: 10 MB
• Number of servers: 1, 5, 9, 13, 17, 21
Conclusion
• R2D2 is scalable and fast, and provides reliable delivery
  • No need to modify kernel
  • Can be loaded/unloaded easily
• Improves reliability in data center networks
• Hardware implementation in NIC can be much faster
  • Works well with TCP offload options like segmentation and checksum offloading
• Developing an FPGA implementation