
Curbing Delays in Datacenters: Need Time to Save Time?


Presentation Transcript


  1. Curbing Delays in Datacenters: Need Time to Save Time? Mohammad Alizadeh, Sachin Katti, Balaji Prabhakar (Insieme Networks / Stanford University)

  2. Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency

  3. Datacenter Networks • Message latency is King: need very high throughput and very low latency • 10-40Gbps links, 1-5μs fabric latency, 1000s of server ports • Many services share the fabric: web, app, cache, db, map-reduce, HPC, monitoring

  4. Transport in Datacenters • TCP widely used, but has poor performance • Buffer hungry: adds significant queuing latency • Queuing latency spectrum: TCP ~1–10ms, DCTCP ~100μs, vs. a baseline fabric latency of 1-5μs • How do we get to ~zero queuing latency?

  5. Reducing Queuing: DCTCP vs TCP • Experiment: 2 flows (Win 7 stack) from senders S1…Sn, Broadcom 1Gbps switch • ECN marking threshold = 30KB • [Figure: queue length (KBytes) over time for TCP vs DCTCP]

  6. Towards Zero Queuing • [Figure: senders S1…Sn sharing a switch, with ECN marking at 90% utilization]

  7. Towards Zero Queuing • ns2 sim: 10 DCTCP flows, 10Gbps switch, ECN at 9Gbps (90% util) • [Figure: latency vs throughput, with "Target Throughput" and "Floor ≈ 23μs" annotations]

  8. Window-based Rate Control • Sender → Receiver, link capacity C = 1 pkt/unit time • RTT = 10 → C×RTT = 10 pkts • Cwnd = 1 → Throughput = 1/RTT = 10%

  9. Window-based Rate Control • RTT = 2 → C×RTT = 2 pkts • Cwnd = 1 → Throughput = 1/RTT = 50%

  10. Window-based Rate Control • RTT = 1.01 → C×RTT = 1.01 pkts • Cwnd = 1 → Throughput = 1/RTT = 99%

  11. Window-based Rate Control • RTT = 1.01 → C×RTT = 1.01 pkts • Two senders, Cwnd = 1 each, one receiver • As propagation time → 0: queue buildup is unavoidable
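To make slides 8-11 concrete, here is a minimal numeric sketch (not from the talk) of the fixed-window model they use: one sender achieves Cwnd/(C×RTT) of the link, and once the combined windows exceed C×RTT the excess packets have nowhere to sit but the queue. The helper functions and units are my assumptions.

```python
# Minimal sketch of fixed-window rate control, assuming link capacity C is
# measured in packets per unit time and Cwnd is measured in packets.

def throughput_fraction(cwnd_pkts, link_capacity_pkts, rtt):
    """Fraction of link capacity achieved by one fixed-window sender."""
    bdp_pkts = link_capacity_pkts * rtt          # bandwidth-delay product
    return min(1.0, cwnd_pkts / bdp_pkts)

def standing_queue(total_cwnd_pkts, link_capacity_pkts, rtt):
    """Packets that cannot fit 'in flight' and must sit in the queue."""
    bdp_pkts = link_capacity_pkts * rtt
    return max(0.0, total_cwnd_pkts - bdp_pkts)

if __name__ == "__main__":
    C = 1.0                                      # pkts per unit time, as in the slides
    for rtt in (10.0, 2.0, 1.01):
        tput = throughput_fraction(1, C, rtt)
        print(f"RTT={rtt:5.2f}, 1 sender, Cwnd=1 -> throughput={tput:.0%}")
    # Two senders with Cwnd=1 each and RTT close to one packet time:
    # windows cannot shrink below one packet, so the excess queues up.
    q = standing_queue(2, C, 1.01)
    print(f"RTT= 1.01, 2 senders, Cwnd=1 each -> standing queue ≈ {q:.2f} pkts")
```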

  12. So What? • Window-based rate control needs lag in the loop • Near-zero latency transport must: use timer-based rate control / pacing, and use small packet sizes • Both increase CPU overhead (not practical in software) • Possible in hardware, but complex (e.g., HULL, NSDI'12) • Or… change the problem!

  13. Changing the Problem… • [Figure: a switch port with a FIFO queue vs. a switch port with a priority queue] • FIFO queue: queue buildup is costly → need precise rate control • Priority queue: queue buildup is irrelevant → coarse rate control is OK

  14. pFabric

  15. DC Fabric: Just a Giant Switch • [Figure: hosts H1–H9 interconnected by the datacenter fabric]

  16. DC Fabric: Just a Giant Switch • [Figure: the fabric redrawn as one big switch, with a TX port and an RX port per host H1–H9]

  17. DC Fabric: Just a Giant Switch • [Figure: the giant-switch abstraction, one TX and one RX port per host H1–H9]

  18. DC transport = flow scheduling on a giant switch • Objective: minimize average flow completion time (FCT) • Subject to ingress & egress port capacity constraints • [Figure: TX and RX ports of the giant switch for hosts H1–H9]
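A tiny single-port example (my numbers, not from the slides) shows why the service order matters for average FCT: finishing short flows first strictly lowers the average, which is what SRPT-style scheduling exploits.

```python
# Hypothetical example: three flows with remaining sizes 1, 2, and 3 units
# share one egress port that serves one unit per time step. Average FCT
# depends only on the order in which the flows are served.

def avg_fct(sizes_in_service_order):
    """Average completion time when flows are served back to back."""
    finish, fcts = 0, []
    for size in sizes_in_service_order:
        finish += size
        fcts.append(finish)
    return sum(fcts) / len(fcts)

print("shortest-first:", avg_fct([1, 2, 3]))   # (1 + 3 + 6) / 3 ≈ 3.33
print("longest-first: ", avg_fct([3, 2, 1]))   # (3 + 5 + 6) / 3 ≈ 4.67
```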

  19. “Ideal” Flow Scheduling • Problem is NP-hard [Bar-Noy et al.] • Simple greedy algorithm: 2-approximation • [Figure: example flows between ports 1, 2, 3 of the giant switch]
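The slide only names the greedy 2-approximation; below is a hedged sketch of one natural way to read "simple greedy algorithm": each time step, scan flows in order of remaining size and admit a flow whenever both its ingress and egress ports are still free. The flow representation, step granularity, and field names are my assumptions; the exact algorithm in Bar-Noy et al. may differ.

```python
# Hedged sketch of a greedy scheduling step for the giant-switch model.

def greedy_step(flows):
    """flows: list of dicts with 'src', 'dst', 'remaining' (hypothetical names)."""
    busy_src, busy_dst, scheduled = set(), set(), []
    for f in sorted(flows, key=lambda f: f["remaining"]):
        if f["remaining"] > 0 and f["src"] not in busy_src and f["dst"] not in busy_dst:
            busy_src.add(f["src"])
            busy_dst.add(f["dst"])
            scheduled.append(f)
    return scheduled

def simulate(flows):
    """Run greedy steps to completion and return each flow's FCT."""
    t, fct = 0, {}
    while any(f["remaining"] > 0 for f in flows):
        t += 1
        for f in greedy_step(flows):          # admitted flows send one unit
            f["remaining"] -= 1
            if f["remaining"] == 0:
                fct[f["name"]] = t
    return fct

flows = [
    {"name": "A", "src": 1, "dst": 2, "remaining": 1},
    {"name": "B", "src": 1, "dst": 3, "remaining": 2},
    {"name": "C", "src": 2, "dst": 3, "remaining": 3},
]
print(simulate(flows))                         # {'A': 1, 'B': 3, 'C': 5}
```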

  20. pFabric in 1 Slide • Packets carry a single priority number, e.g., prio = remaining flow size • pFabric switches: very small buffers (~10-20 pkts for a 10Gbps fabric); send highest-priority packets, drop lowest-priority packets • pFabric hosts: send/retransmit aggressively; minimal rate control, just enough to prevent congestion collapse
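A minimal host-side sketch of the behavior this slide describes, assuming the priority stamped in each packet is simply the sender's remaining flow size and that "minimal rate control" amounts to backing off only on loss. Class and field names are hypothetical, not the paper's code.

```python
# Hedged host-side sketch. Assumptions (mine): packets carry a 'prio' field
# equal to the flow's remaining size, the sender starts at roughly line rate,
# and rate control does only enough (halve on loss) to avoid collapse.

class PFabricSender:
    def __init__(self, flow_size_bytes, initial_window_pkts=12):
        self.remaining = flow_size_bytes      # bytes not yet acknowledged
        self.window = initial_window_pkts     # start aggressively (~1 BDP)
        self.unacked = {}                     # seq -> packet in flight

    def send(self, seq, nbytes):
        # Priority = remaining flow size: fewer bytes left => higher priority.
        pkt = {"seq": seq, "prio": self.remaining, "len": nbytes}
        self.unacked[seq] = pkt
        return pkt

    def on_timeout(self, seq):
        # Retransmit aggressively; back off only enough to avoid collapse.
        self.window = max(1, self.window // 2)
        return self.unacked[seq]

    def on_ack(self, seq):
        self.remaining -= self.unacked.pop(seq)["len"]
```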

  21. Key Idea • Decouple flow scheduling from rate control • Switches implement flow scheduling via local mechanisms → queue buildup does not hurt performance → window-based rate control is OK • Hosts use simple window-based rate control (≈ TCP) to avoid high packet loss • [Figure: hosts H1–H9 connected through the fabric]

  22. pFabric Switch • Priority scheduling: send the highest-priority packet first • Priority dropping: drop the lowest-priority packets first • Small “bag” of packets per port; prio = remaining flow size • [Figure: a switch port holding a small bag of prioritized packets from hosts H1–H9]
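A hedged sketch of the per-port "bag", assuming a smaller prio value (fewer remaining bytes) means higher priority, consistent with "prio = remaining flow size". This is only an illustration of the two rules on the slide; the paper's switch adds refinements (e.g., for packets of the same flow) that this sketch omits.

```python
# Hedged sketch of the per-port bag (my data structures, not the ASIC design).

class PFabricPort:
    def __init__(self, capacity_pkts=16):         # "very small buffers"
        self.capacity = capacity_pkts
        self.bag = []                              # small unsorted bag of packets

    def enqueue(self, pkt):
        """Returns the dropped packet, if any."""
        if len(self.bag) < self.capacity:
            self.bag.append(pkt)
            return None
        # Buffer full: drop the lowest-priority packet (largest prio value),
        # which may be the arriving packet itself.
        worst = max(range(len(self.bag)), key=lambda i: self.bag[i]["prio"])
        if pkt["prio"] < self.bag[worst]["prio"]:
            dropped, self.bag[worst] = self.bag[worst], pkt
            return dropped
        return pkt

    def dequeue(self):
        """Send the highest-priority (smallest prio value) packet first."""
        if not self.bag:
            return None
        best = min(range(len(self.bag)), key=lambda i: self.bag[i]["prio"])
        return self.bag.pop(best)

# Tiny usage example: with a 2-packet bag, the arriving prio-7 packet
# displaces the prio-9 packet, and the prio-4 packet is sent first.
port = PFabricPort(capacity_pkts=2)
for prio in (9, 4, 7):
    port.enqueue({"prio": prio})
print(port.dequeue()["prio"])                      # 4
```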

  23. pFabric Switch Complexity • Buffers are very small (~2×BDP per port): e.g., C = 10Gbps, RTT = 15µs → buffer ~ 30KB • Today's switch buffers are 10-30x larger • Priority scheduling/dropping, worst case: minimum-size packets (64B) → 51.2ns to find the min/max of ~600 numbers • Binary comparator tree: 10 clock cycles • Current ASICs: clock ~ 1ns
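The slide's timing argument can be sanity-checked with back-of-the-envelope arithmetic (mine, derived from the figures quoted on the slide):

```python
import math

# Rough check of the slide's numbers: C = 10 Gbps, RTT = 15 us,
# per-port buffer ~ 2xBDP, minimum packet size 64B, ASIC clock ~ 1ns.
C_bps        = 10e9
rtt_s        = 15e-6
min_pkt_B    = 64
clock_ns     = 1.0

bdp_bytes    = C_bps * rtt_s / 8                    # ~18.75 KB
buffer_bytes = 2 * bdp_bytes                        # per-port buffer, ~2xBDP
pkt_time_ns  = min_pkt_B * 8 / C_bps * 1e9          # 51.2 ns per 64B packet
max_pkts     = int(buffer_bytes // min_pkt_B)       # ~585 packets ("~600 numbers")
tree_depth   = math.ceil(math.log2(max_pkts))       # 10 comparator levels

print(f"budget per packet : {pkt_time_ns:.1f} ns")
print(f"entries to scan   : ~{max_pkts}")
print(f"comparator tree   : {tree_depth} levels ≈ {tree_depth * clock_ns:.0f} ns "
      f"(well within the {pkt_time_ns:.1f} ns budget)")
```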

  24. Why does this work? • Invariant for ideal scheduling: at any instant, have the highest-priority packet (according to the ideal algorithm) available at the switch • Priority scheduling: high-priority packets traverse the fabric as quickly as possible • What about dropped packets? Lowest priority → not needed until all other packets depart; buffer > BDP → enough time (> RTT) to retransmit
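The last bullet can be checked with the same figures as slide 23 (my arithmetic, not from the talk): a packet dropped from the back of a 2×BDP buffer is not needed until the buffer ahead of it drains, which takes longer than one RTT.

```python
# Quick check of the retransmission timing argument.
C_bps, rtt_s = 10e9, 15e-6
bdp_bytes    = C_bps * rtt_s / 8          # ~18.75 KB
buffer_bytes = 2 * bdp_bytes              # ~2xBDP per port
drain_time_s = buffer_bytes * 8 / C_bps   # time to transmit a full buffer
print(f"buffer drain time ≈ {drain_time_s * 1e6:.0f} us vs RTT = {rtt_s * 1e6:.0f} us")
# ~30 us > 15 us, so the sender has time to detect the loss and retransmit.
```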

  25. Evaluation (144-port fabric; search traffic pattern) • Recall: “Ideal” is REALLY idealized! • Centralized with a full view of flows • No rate-control dynamics • No buffering • No packet drops • No load-balancing inefficiency

  26. Mice FCT (<100KB) • [Figure: flow completion times for small flows, average and 99th percentile]

  27. Conclusion • Window-based rate control does not work at near-zero round-trip latency • pFabric: simple, yet near-optimal • Decouples flow scheduling from rate control • Allows the use of coarse window-based rate control • pFabric is within 10-15% of “ideal” for realistic DC workloads (SIGCOMM’13)

  28. Thank You!
