
R2D2: Reliable and Rapid Data Delivery for DCs


Presentation Transcript


  1. R2D2: Reliable and Rapid Data Delivery for DCs
  Berk Atikoglu, Mohammad Alizadeh, Tom Yue, Balaji Prabhakar, Mendel Rosenblum

  2. Motivation
  • Unreliable packet delivery due to:
    • Corruption: dealt with via retransmission
    • Congestion: particularly bad due to incast, or fan-in, congestion
  • These losses increase the difficulty of reliable transmission:
    • Loss of throughput
    • Increase in flow transfer times

  3. Incast
  • The client sends a request to several servers.
  • The responses travel to the switch simultaneously.
  • The switch buffer overflows from the amount of data; some packets are dropped.
  [Diagram: many servers (S) respond through a single switch to one client (C), with numbered packets backing up at the switch port.]

  4. Existing Approaches
  • High-resolution timers
    • Reduce retransmission timeouts (RTO) to hundreds of µs
    • Proposed in Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
    • Spend a large number of CPU cycles on rapid interrupts or timer programming
    • In virtualized environments, the high cost of processing hardware interrupts means even higher overhead
  • Large switch buffers
    • Reduce incast occurrences by caching enough packets
    • Increased packet latency
    • Complex implementation
    • Large caches are expensive
    • Increased power usage

  5. Our Approach: R2D2
  • R2D2: collapse all flows into a single “meta-flow”
    • A single wait queue holds packets sent by the host that are not yet ACKed
    • Single retransmission timer, no per-flow state
    • Provides reliable packet delivery
    • Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3
  • Key observation: exploit the uniformity of data center environments
    • Path lengths between hosts are small (3–5 hops)
    • RTTs are small (100–400 µs)
    • Path bandwidths are uniformly high (1 Gbps, 10 Gbps)
    • Therefore, the amount of data “in flight” from a 1G/10G source is less than 64/640 KB (e.g., 1 Gbps × 400 µs = 50 KB)
  • Store source packets in R2D2 on the fly; rapidly retransmit dropped or corrupted packets (see the sketch below)
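As a rough illustration of the meta-flow idea, here is a minimal user-space sketch of the single wait queue. All names (r2d2_pkt, r2d2_enqueue, and so on) are hypothetical, and per-flow matching of ACKs is elided; this is a sketch of the technique, not the actual R2D2 source.

```c
/* Minimal sketch of the single "meta-flow" wait queue (hypothetical
 * names). One FIFO covers every flow on the host: no per-flow state. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct r2d2_pkt {
    struct r2d2_pkt *next;
    uint32_t seq_end;              /* sequence number just past this packet */
    size_t   len;
    uint8_t  data[];               /* copy of the outbound packet */
};

struct r2d2_queue {
    struct r2d2_pkt *head, *tail;  /* single FIFO shared by all flows */
    unsigned backoff;              /* exponential back-off multiplier */
};

/* Copy every outbound packet into the queue (called from the L2.5 shim). */
static void r2d2_enqueue(struct r2d2_queue *q, const uint8_t *pkt,
                         size_t len, uint32_t seq_end)
{
    struct r2d2_pkt *p = malloc(sizeof(*p) + len);
    if (!p)
        return;                    /* fail open: TCP's own recovery remains */
    memcpy(p->data, pkt, len);
    p->len = len;
    p->seq_end = seq_end;
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* A returned TCP ACK releases every queued packet it covers.
 * (A real queue would also match the ACK to its flow's 4-tuple.) */
static void r2d2_ack(struct r2d2_queue *q, uint32_t ack_seq)
{
    while (q->head && (int32_t)(q->head->seq_end - ack_seq) <= 0) {
        struct r2d2_pkt *p = q->head;
        q->head = p->next;
        free(p);
    }
    if (!q->head)
        q->tail = NULL;
    q->backoff = 1;                /* progress was made: reset back-off */
}
```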

  6. TCP
  [Diagram: the standard path; TCP data passes from Layer 3 to Layer 2 on the sender and from Layer 2 to Layer 3 on the receiver.]

  7. R2D2
  [Diagram: the same path with the Layer 2.5 shim inserted between Layer 3 and Layer 2 on the sender.]

  8. R2D2
  • Outbound packets are intercepted by R2D2 (event handling sketched below):
    • A timer is started.
    • A copy of the packet is placed in the wait queue.
    • The returned TCP ACK removes all ACKed packets held in the wait queue.
  • When a flow times out:
    • Retransmit the first un-ACKed packet (fill the hole).
    • Back off: double the flow’s timeout value.
  • When an ACK comes in:
    • Reset the timeout back-off.
  [Diagram: the R2D2 sender at Layer 2.5 holding copies of outbound packets 1–4 between Layer 3 and Layer 2.]
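A sketch of the timeout path, under the same hypothetical naming: on expiry, only the head of the wait queue (the first un-ACKed packet) is resent and the timeout doubles; an incoming ACK resets the back-off. The constants mirror the parameters reported on slide 12.

```c
/* Sketch of the R2D2 timeout and back-off logic (hypothetical names;
 * constants taken from the evaluation parameters on slide 12). */
#include <stdbool.h>
#include <stdint.h>

#define R2D2_MIN_TIMEOUT_MS 3   /* minimum retransmission timeout */
#define R2D2_MAX_RETX       10  /* give up after this many attempts */

struct r2d2_timer {
    uint64_t deadline_ms;       /* when the retransmission timer next fires */
    unsigned backoff;           /* doubles on every expiry */
    unsigned retx_count;        /* retransmissions of the head packet */
};

/* Called when the retransmission timer expires. Returns false once the
 * retransmission budget is exhausted (reliable, but not guaranteed). */
static bool r2d2_on_timeout(struct r2d2_timer *t, uint64_t now_ms,
                            void (*resend_head)(void))
{
    if (t->retx_count++ >= R2D2_MAX_RETX)
        return false;           /* give up; leave recovery to TCP */
    resend_head();              /* fill the hole: first un-ACKed packet */
    t->backoff *= 2;            /* exponential back-off */
    t->deadline_ms = now_ms + R2D2_MIN_TIMEOUT_MS * t->backoff;
    return true;
}

/* Called when an ACK arrives: progress was made, so reset the back-off. */
static void r2d2_on_ack(struct r2d2_timer *t, uint64_t now_ms)
{
    t->backoff = 1;
    t->retx_count = 0;
    t->deadline_ms = now_ms + R2D2_MIN_TIMEOUT_MS;
}
```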

  9. Features
  • Reliable, but not guaranteed, delivery
    • A maximum number of retransmissions before giving up
  • State sharing
    • Only one wait queue; all packets go in the same queue
  • No change to the network stack
    • Kernel module in Linux; driver in Windows
    • A hardware version is OS-independent
  • Incremental deployability
    • Possible to protect only a subset of flows

  10. Implementation
  • Implemented as a Linux kernel module on kernel 2.6.*
    • No need to modify the kernel
    • Can be loaded/unloaded easily
  • Incoming/outgoing TCP/IP packets are captured using Netfilter (see the sketch below)
  • Captured packets are put into a queue
    • Only metadata is kept in the queue; the packet is cloned
  • An L2.5 thread periodically processes the packets in the queue
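The slides do not show source, but an outbound capture hook on a 2.6.28-era kernel would look roughly like the sketch below. The hook name, the choice of NF_INET_POST_ROUTING, and the r2d2_queue_clone() helper are assumptions standing in for the module's actual queueing code.

```c
/* Sketch of the Netfilter capture described above, against the
 * kernel 2.6.28 hook API. r2d2_queue_clone() is a hypothetical helper
 * standing in for insertion into the L2.5 wait queue. */
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

static void r2d2_queue_clone(struct sk_buff *clone)
{
    /* hand the clone (plus metadata) to the L2.5 thread's queue */
    kfree_skb(clone);              /* placeholder: real code would enqueue */
}

/* Runs for every outbound IPv4 packet after routing. */
static unsigned int r2d2_out_hook(unsigned int hooknum,
                                  struct sk_buff *skb,
                                  const struct net_device *in,
                                  const struct net_device *out,
                                  int (*okfn)(struct sk_buff *))
{
    if (ip_hdr(skb)->protocol == IPPROTO_TCP) {
        struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);
        if (clone)
            r2d2_queue_clone(clone);   /* keep a copy for retransmission */
    }
    return NF_ACCEPT;                  /* never delay the original packet */
}

static struct nf_hook_ops r2d2_ops = {
    .hook     = r2d2_out_hook,
    .pf       = PF_INET,
    .hooknum  = NF_INET_POST_ROUTING,
    .priority = NF_IP_PRI_LAST,
};

static int __init r2d2_init(void)  { return nf_register_hook(&r2d2_ops); }
static void __exit r2d2_exit(void) { nf_unregister_hook(&r2d2_ops); }

module_init(r2d2_init);
module_exit(r2d2_exit);
MODULE_LICENSE("GPL");
```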

  11. Test Setup
  • 48 Dell PowerEdge 2950 servers
    • 2 × Intel Core 2 Quad Q9550
    • 16 GB ECC DRAM
    • Broadcom NetXtreme II 5708 1 GbE NIC
    • CentOS 5.3 Final; Linux 2.6.28-10
  • Switches
    • Netgear GS748TNA (48 ports, GbE)
    • Cisco Catalyst 4948 (48 ports, GbE)
    • BNT RackSwitch G8421 (24 ports, 10 GbE)
  [Diagram: one rack of 48 servers connected at 1 GbE / 10 GbE.]

  12. Algorithms
  • R2D2
    • Minimum timeout: 3 ms
    • Max retransmissions: 10
    • Delayed ACK: disabled
  • TCP: CUBIC
    • minRTO: 200 ms
    • Segmentation offloading: disabled
    • TCP timestamps: disabled

  13. Workload – 1 GbE switches
  • Number of servers (N): 1, 2, 4, 8, 16, 32, 46
  • File size (S): 1 MB, 20 MB
  • Client (sketched below):
    • Requests S/N MB from each server
    • Issues a new request when all servers respond
  • Measurements:
    • Goodput
    • Retransmission ratio = retransmitted packets / total packets sent by TCP
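For concreteness, the client loop might look like the hypothetical harness below; the server addresses, port, and "GET <bytes>" request line are all assumptions, not the actual test code. Goodput is the requested file size divided by the time until the last response completes.

```c
/* Hypothetical incast benchmark client: request S/N bytes from each of
 * N servers at once, wait for every response, and report goodput.
 * Addresses, port, and request format are assumptions for illustration. */
#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define N          4                    /* number of servers */
#define FILE_BYTES (1u * 1000 * 1000)   /* S = 1 MB total per request */
#define PORT       5001                 /* assumed server port */

int main(void)
{
    const char *servers[N] = { "10.0.0.1", "10.0.0.2",
                               "10.0.0.3", "10.0.0.4" };  /* assumed IPs */
    size_t chunk = FILE_BYTES / N;      /* each server sends S/N bytes */
    int fd[N];
    char buf[65536];
    struct timeval t0, t1;

    for (int i = 0; i < N; i++) {
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port   = htons(PORT) };
        inet_pton(AF_INET, servers[i], &a.sin_addr);
        fd[i] = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(fd[i], (struct sockaddr *)&a, sizeof(a)) < 0)
            return 1;
    }

    gettimeofday(&t0, NULL);
    for (int i = 0; i < N; i++)         /* fire all requests together... */
        dprintf(fd[i], "GET %zu\n", chunk);
    for (int i = 0; i < N; i++) {       /* ...so responses collide at the switch */
        size_t got = 0;
        while (got < chunk) {
            ssize_t n = read(fd[i], buf, sizeof(buf));
            if (n <= 0)
                break;
            got += (size_t)n;
        }
        close(fd[i]);
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("goodput: %.1f Mbit/s\n", FILE_BYTES * 8 / secs / 1e6);
    return 0;
}
```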

  14. Netgear Test – Goodput
  [Charts: goodput for 1 MB and 20 MB file sizes.]

  15. Netgear Test – Retransmission Ratio
  [Charts: retransmission ratio for 1 MB and 20 MB file sizes.]

  16. Netgear Test – Multiple Clients
  • 6 clients (instead of 1 client)
  • 32 servers
  • Each client requests a file from each of the 32 servers
  [Charts: results for 1 MB and 20 MB file sizes.]

  17. Catalyst 4948 Test – Goodput
  [Charts: goodput for 1 MB and 20 MB file sizes.]

  18. Catalyst 4948 Test – Retransmission Ratio
  [Charts: retransmission ratio for 1 MB and 20 MB file sizes.]

  19. Catalyst 4948 Test – Multiple Clients
  [Charts: results for 1 MB and 20 MB file sizes.]

  20. 10 GbE Test – Goodput
  • File size: 10 MB
  • Number of servers: 1, 5, 9, 13, 17, 21

  21. Conclusion
  • R2D2 is scalable and fast, and provides reliable delivery
    • No need to modify the kernel
    • Can be loaded/unloaded easily
  • Improves reliability in data center networks
  • A hardware implementation in the NIC can be much faster
    • Works well with TCP offload options such as segmentation and checksum offloading
    • An FPGA implementation is under development
