
Deconstructing Datacenter Packet Transport


Presentation Transcript


  1. Deconstructing Datacenter Packet Transport Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker (Stanford University, U.C. Berkeley/ICSI)

  2. Transport in Datacenters • Large-scale web application: a user request ("Who does she know? What has she done?") fans out across an app tier of many app-logic servers, over the fabric, to a data tier (pics, videos, apps) • Latency is King: web app response time depends on completion of 100s of small RPCs • But traffic is also diverse: mice AND elephants • Often, elephants are the root cause of latency

  3. Transport in Datacenters • Two fundamental requirements • High fabric utilization: good for all traffic, esp. the large flows • Low fabric latency (propagation + switching): critical for latency-sensitive traffic • Active area of research: DCTCP [SIGCOMM'10], D3 [SIGCOMM'11], HULL [NSDI'12], D2TCP [SIGCOMM'12], PDQ [SIGCOMM'12], DeTail [SIGCOMM'12] vastly improve performance, but are fairly complex

  4. pFabric in 1 Slide • Packets carry a single priority #, e.g., prio = remaining flow size • pFabric switches: very small buffers (e.g., 10-20KB); send highest-priority packets, drop lowest-priority packets • pFabric hosts: send/retransmit aggressively; minimal rate control, just enough to prevent congestion collapse
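To make the priority number concrete, here is a minimal sketch (hypothetical names; the slides specify only the idea, not an API) of a host stamping each outgoing packet with the flow's remaining size:

```python
# Minimal sketch: every packet carries one number that the fabric uses for
# both scheduling and dropping. In this sketch a SMALLER value means HIGHER
# priority, since less remaining data makes the flow more urgent.
def stamp_priority(pkt, remaining_bytes):
    pkt.priority = remaining_bytes   # prio = remaining flow size
    return pkt
```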

  5. DC Fabric: Just a Giant Switch! [Figure: hosts H1-H9 interconnected by the datacenter fabric]

  6. DC Fabric: Just a Giant Switch! [Figure: the same hosts H1-H9, with the fabric collapsed into a single giant switch]

  7. DC Fabric: Just a Giant Switch! [Figure: each host H1-H9 shown as both a TX (ingress) port and an RX (egress) port of the giant switch]

  8. DC Fabric: Just a Giant Switch! [Figure: bipartite view with TX ports H1-H9 on one side and RX ports H1-H9 on the other]

  9. DC transport = flow scheduling on a giant switch • Objective? Minimize average FCT (flow completion time) • Subject to ingress & egress port capacity constraints [Figure: flows between TX ports H1-H9 and RX ports H1-H9]

  10. “Ideal” Flow Scheduling • The problem is NP-hard [Bar-Noy et al.] • A simple greedy algorithm gives a 2-approximation (sketched below) [Figure: example schedule across ports 1, 2, 3]
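The greedy algorithm the slide refers to can be sketched as follows (a reconstruction under the giant-switch model of the previous slides, not the authors' code): in each scheduling step, consider flows in shortest-remaining-first order and serve a flow only if both its ingress (TX) and egress (RX) ports are still free.

```python
# Greedy shortest-remaining-first matching on the giant switch.
# Assumptions: 'flows' is a list of dicts with 'src', 'dst', and 'remaining'
# (bytes left); each TX and RX port can serve one flow at a time.
def greedy_schedule(flows):
    busy_tx, busy_rx, scheduled = set(), set(), []
    for f in sorted(flows, key=lambda f: f["remaining"]):   # shortest first
        if f["src"] not in busy_tx and f["dst"] not in busy_rx:
            scheduled.append(f)
            busy_tx.add(f["src"])
            busy_rx.add(f["dst"])
    return scheduled

# Example: H1->H2 (10KB) and H3->H2 (1MB) contend for H2's egress port,
# so only the shorter flow is served in this step.
flows = [{"src": "H1", "dst": "H2", "remaining": 10_000},
         {"src": "H3", "dst": "H2", "remaining": 1_000_000}]
print([f["src"] for f in greedy_schedule(flows)])   # ['H1']
```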

  11. pFabric Design

  12. pFabric Switch • Priority scheduling: send higher-priority packets first • Priority dropping: drop low-priority packets first • A small “bag” of packets per port, with prio = remaining flow size [Figure: a switch port holding a bag of packets with priorities 5, 9, 4, 3, 3, 2, 6, 1, 7]
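A toy model of such a port (my sketch, under the convention that a smaller priority value, i.e. less remaining flow size, means higher priority):

```python
from dataclasses import dataclass

@dataclass
class Packet:
    flow_id: int
    priority: int   # remaining flow size in bytes; smaller = higher priority

class PFabricPort:
    """Toy per-port buffer: a small, unordered 'bag' of packets."""
    def __init__(self, capacity_pkts=20):
        self.capacity = capacity_pkts   # very small buffer, roughly 1 BDP
        self.bag = []

    def enqueue(self, pkt):
        if len(self.bag) < self.capacity:
            self.bag.append(pkt)
            return
        # Buffer full: compare the arrival with the lowest-priority packet in
        # the bag (largest remaining flow size) and keep the more urgent one.
        worst = max(self.bag, key=lambda p: p.priority)
        if pkt.priority < worst.priority:
            self.bag.remove(worst)      # drop the lowest-priority packet
            self.bag.append(pkt)
        # else: the arriving packet is the lowest priority, so drop it

    def dequeue(self):
        if not self.bag:
            return None
        # Always transmit the highest-priority packet (smallest remaining size).
        best = min(self.bag, key=lambda p: p.priority)
        self.bag.remove(best)
        return best
```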

  13. Near-Zero Buffers • Buffers are very small (~1 BDP), e.g., C = 10Gbps, RTT = 15µs → BDP = 18.75KB • Today's switch buffers are 10-30x larger • Priority scheduling/dropping complexity • Worst case: minimum-size packets (64B) → 51.2ns to find the min/max of ~300 numbers • A binary-tree implementation takes 9 clock cycles; current ASICs: clock = 1-2ns
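As a worked check on these numbers (values taken directly from the slide):

```python
# Buffer sizing: bandwidth-delay product for a 10Gbps link and a 15us RTT.
link_bps, rtt_s = 10e9, 15e-6
bdp_bytes = link_bps * rtt_s / 8        # 18750.0 bytes = 18.75 KB

# Timing budget: a 64B packet occupies a 10Gbps link for
pkt_time_ns = 64 * 8 / link_bps * 1e9   # 51.2 ns

# A ~1 BDP buffer holds roughly
pkts_in_buffer = bdp_bytes / 64         # ~293 minimum-size packets
# so the min/max search is over ~300 priorities; a binary comparison tree
# needs ceil(log2(300)) = 9 levels, i.e. ~9 cycles at a 1-2ns clock, which
# fits comfortably within the 51.2ns per-packet budget.
```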

  14. pFabric Rate Control • Priority scheduling & dropping in the fabric also simplify rate control • Queue backlog doesn't matter • One task remains: prevent congestion collapse when elephants collide [Figure: hosts H1-H9; elephants colliding at one port cause 50% loss]

  15. pFabric Rate Control • Minimal version of TCP • Start at line rate: initial window larger than the BDP • No retransmission timeout estimation: fix the RTO near the round-trip time • No fast retransmission on 3 dupacks: allow packet reordering
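A rough sketch of such a minimal scheme (my interpretation of the bullets above, not the authors' implementation; the window reaction on timeout is an illustrative assumption):

```python
class MinimalRateControl:
    """Toy end-host rate control for a pFabric-style transport."""
    def __init__(self, bdp_pkts, rtt_s):
        self.cwnd = bdp_pkts + 1    # start at line rate: initial window > BDP
        self.rto = 2 * rtt_s        # fixed RTO near the RTT; no estimation

    def on_ack(self, seq):
        pass                        # no congestion-avoidance growth needed

    def on_dupack(self, seq):
        pass                        # reordering is tolerated: no fast
                                    # retransmit on 3 duplicate ACKs

    def on_timeout(self, seq):
        # Retransmit aggressively; back the window off only enough to avoid
        # congestion collapse when many elephants collide (assumed policy).
        self.cwnd = max(1, self.cwnd // 2)
        return seq                  # sequence number to retransmit
```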

  16. Why does this work? • Key observation: at any given time, the highest-priority packet destined for a port must be available at that port • Priority scheduling: high-priority packets traverse the fabric as quickly as possible • What about dropped packets? The lowest-priority packet is not needed until all other packets depart; with a buffer larger than the BDP, the sender has more than an RTT to retransmit it

  17. Evaluation • 54-port fat-tree: 10Gbps links, RTT ≈ 12µs • Realistic traffic workloads: web search, data mining (from Alizadeh et al. [SIGCOMM 2010]) • Flows <100KB: ~55% of flows, ~3% of bytes • Flows >10MB: ~5% of flows, ~35% of bytes

  18. Evaluation: Mice FCT (<100KB) [Plots: average and 99th-percentile FCT] • pFabric is near-ideal: almost no jitter

  19. Evaluation: Elephant FCT (>10MB) [Plot: elephant FCT; without rate control, congestion collapse occurs at high load]

  20. Summary • pFabric's entire design: near-ideal flow scheduling across the DC fabric • Switches: locally schedule & drop based on priority • Hosts: aggressively send & retransmit; minimal rate control to avoid congestion collapse

  21. Thank You!
