
End-to-End Congestion Control for System Area Networks


Presentation Transcript


  1. End-to-End Congestion Control for System Area Networks Renato Santos, Yoshio Turner, John Janakiraman Linux Systems, Networks Department Internet Systems and Storage Laboratory CSC talk 06/13/2003

  2. Problem: Network Congestion
  • Cause: network congestion arises when injected traffic exceeds network capacity
  • Effect: performance degrades to levels below what could be achieved in the absence of congestion
  • Need a congestion control mechanism
  • Our focus: congestion control for System Area Networks (SANs)
  • Previous work focused on traditional TCP networks
  • SANs have several unique characteristics that make the congestion control problem unique in this environment

  3. Outline
  • Motivation
  • Part 1: Congestion Detection and Notification
  • Part 2: Source Response
  • Conclusion

  4. System Area Networks (SAN)
  • High-speed, low-latency interconnect for high-performance I/O and cluster communication
  • Data rates: 10s of Gb/s
  • Latency: 100s of ns to a few µs end-to-end delay
  • Examples of proprietary SANs: Myrinet, Quadrics, Memory Channel (HP), ServerNet (HP)
  • InfiniBand: industry standard for SANs

  5. SAN Characteristics and Congestion Control Implications
  • No packet dropping (link-level flow control) → need network support for detecting congestion
  • Low network latency (tens of ns cut-through switching) → simple logic for hardware implementation
  • Low buffer capacity at switches (e.g., an 8KB buffer per port can store only 4 packets of 2KB each) → TCP window mechanism inadequate (narrow operational range)
  • Input-buffered switches → alternative congestion detection mechanisms

  6. Problem: Congestion Spreading
  [Figure: three switches in series; the root link is the root of the congestion spreading tree]

  7. Avoiding Congestion Spreading
  • Congestion control mechanism (feedback-control loop)
  • Congestion detection mechanism (feedback): detect when congestion is forming
  • Source response (control): adjust flow injection rates based on feedback to avoid congestion (discussed in the 2nd part of this talk)

  8. Congestion Detection and Notification
  • Need network support for detecting congestion: cannot use packet loss at end nodes to detect congestion
  • ECN (Explicit Congestion Notification) approach:
  • Switch detects congestion when a switch buffer becomes full
  • Switch sets a congestion bit in the headers of packets in the full buffer (packet marking)
  • Destination node copies the congestion bit (mark) into the ACK packet
  • Source adjusts the flow rate according to the value of the congestion bit (mark) received in the ACK packet
  • What is unique in our ECN mechanism?
  • Packet marking appropriate for input-buffered switches
  • Simple to implement in hardware
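The ECN feedback loop on this slide can be sketched end to end. This is a minimal illustration, not the InfiniBand wire format; the class names, the `BUFFER_SLOTS` constant, and the response functions are invented for the example.

```python
from dataclasses import dataclass

BUFFER_SLOTS = 4  # e.g., an 8KB port buffer holding 2KB packets

@dataclass
class Packet:
    flow: int
    marked: bool = False  # ECN congestion bit in the packet header

@dataclass
class Ack:
    flow: int
    marked: bool          # congestion bit echoed by the destination

def switch_forward(buffer, pkt):
    """Naive marking: when the input buffer fills, mark every queued packet."""
    buffer.append(pkt)
    if len(buffer) >= BUFFER_SLOTS:
        for p in buffer:
            p.marked = True

def destination_ack(pkt):
    """Destination copies the congestion bit into the ACK."""
    return Ack(flow=pkt.flow, marked=pkt.marked)

def source_adjust(rate, ack, f_dec, f_inc):
    """Source applies the decrease/increase response per received ACK."""
    return f_dec(rate) if ack.marked else f_inc(rate)

# Filling the buffer marks all queued packets; the mark then flows back.
buffer = []
for i in range(BUFFER_SLOTS):
    switch_forward(buffer, Packet(flow=i))
```

The concrete decrease/increase functions (`f_dec`, `f_inc`) are the subject of Part 2 of the talk.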

  9. Naive Approach: Marking Packets in Full Buffers
  [Figure: input-buffered switch, with input buffers feeding the output links]
  • When an input buffer becomes full: mark all packets in that input buffer

  10. Simulation Scenario
  [Figure: switch A and switch B connected by an inter-switch link; local and remote flows contend for the congested root link; a victim flow shares the inter-switch link over a non-congested link]
  • Congestion spreading in this scenario: contention for the root link; the buffer used by remote flows fills up and blocks the inter-switch link; the victim flow cannot use the available inter-switch link bandwidth
  • Assumptions:
  • 10 local flows + 10 remote flows + 1 victim flow
  • All flows are greedy (try to use all available BW)
  • Buffer size: 4 packets per input port
  • Sources react to packet marking using an adequate response function (discussed later)

  11. Simulation Results
  [Plots: normalized rate over time (0–100 ms), no congestion control vs. naive packet marking with an adequate source response; curves for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), inter-switch link (IL)]
  • Effectively avoids congestion spreading
  • Unfairness (remote vs. local flows): the shared full buffer causes remote packets to be marked more frequently than local packets, so local flows get a higher share of BW

  12. Input-Triggered Packet Marking
  • Goal: improve fairness
  • Mark all packets using the congested link, not only packets in the full buffer
  • Marking triggered by a full input buffer: mark all packets in that input buffer
  • Identify root (congested) links: the destinations of packets in the full buffer
  • Mark any packet destined to a root link
  [Figure: input-buffered switch, with packets destined to the root link marked in every input buffer]

  13. Efficient Implementation
  • Use counters to avoid an expensive scan of all switch packets (when searching for packets destined to a congested link)
  • 2 counters per output port:
  • CNT_1: total number of packets in the switch destined to this output port
  • CNT_2: total number of packets destined to this output port that still need to be marked
  • On a full-input-buffer event, CNT_1 is copied into CNT_2
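The two-counter bookkeeping can be sketched in a few lines. This is our reading of the slide, assuming the counters are updated on packet arrival and departure; the class and method names are invented.

```python
class MarkingState:
    """Per-output-port counters for input-triggered marking (a sketch).

    cnt1[p]: packets currently in the switch destined to output port p (CNT_1)
    cnt2[p]: of those, how many should still be marked on departure (CNT_2)
    """
    def __init__(self, num_ports):
        self.cnt1 = [0] * num_ports
        self.cnt2 = [0] * num_ports

    def on_arrival(self, out_port):
        self.cnt1[out_port] += 1

    def on_full_input_buffer(self, root_ports):
        # An input buffer filled: every packet currently heading to a root
        # (congested) link must be marked, so snapshot CNT_1 into CNT_2.
        for p in root_ports:
            self.cnt2[p] = self.cnt1[p]

    def on_departure(self, out_port):
        """Returns True if the departing packet gets its ECN bit set."""
        self.cnt1[out_port] -= 1
        if self.cnt2[out_port] > 0:
            self.cnt2[out_port] -= 1
            return True
        return False
```

No scan is ever needed: marking decisions are made one packet at a time as packets leave, which is what makes the scheme cheap in switch hardware.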

  14. Input-Triggered Packet Marking
  [Plots: normalized rate over time (0–100 ms), naive vs. input-triggered marking; curves for LF, RF, VF, RL, IL]
  • Fairness improved (still some unfairness)
  • Marking still triggered by remote packets (biases marking towards remote packets)

  15. Input-Output-Triggered Packet Marking
  • Additional output-triggered mechanism: mark packets when the total number of packets destined to an output port exceeds a threshold
  • Still mark packets when an input buffer is full (input triggered), to avoid link blocking and congestion spreading

  16. Input-Output-Triggered Packet Marking
  [Plots: normalized rate over time (0–100 ms), input-triggered vs. input-output-triggered (threshold: 8 packets); curves for LF, RF, VF, RL, IL]
  • High bandwidth utilization
  • Better fairness than input-triggered

  17. Input-Output-Triggered Packet Marking
  [Plots: efficiency (root link utilization) and fairness (remote/local rate ratio) vs. buffer size, for thresholds of 4, 6, 8, and 16 packets and for no output marking]
  • The right threshold value needs to be tuned (a function of buffer size and traffic pattern)

  18. Proposed Packet Marking Mechanism
  • Input-triggered packet marking:
  • Improves fairness over the naive approach
  • Simple to implement
  • Does not require tuning

  19. Part 2: Source Response

  20. Source Response: Window or Rate Control
  • Flow source adjusts injection in response to ECN
  • Rate control: flow source adjusts a rate limit explicitly (e.g., enforced by adjusting the delay between packet injections)
  • Window control (e.g., TCP): flow source adjusts window = # of outstanding packets; corresponds to rate = window/RTT (round-trip time)
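The two knobs can be made concrete with a small calculation (the numbers below are illustrative, not from the talk): at SAN round-trip times even a window of 1 corresponds to a very high rate, which is why the slides call the window's operational range narrow.

```python
def window_rate(window_pkts, rtt_s):
    """Effective injection rate of a window-controlled flow, in packets/s."""
    return window_pkts / rtt_s

def inter_packet_delay(rate_pkts_s):
    """Delay a rate-controlled source inserts between injections, in seconds."""
    return 1.0 / rate_pkts_s

# Example: a SAN with a 1 microsecond round-trip time.
rtt = 1e-6
print(window_rate(1, rtt))          # window=1 already sustains ~1e6 packets/s
print(inter_packet_delay(250_000))  # rate control can pick any delay, e.g. ~4 us
```

A rate-controlled source can realize any rate between these extremes by tuning the inter-packet delay, while a window-controlled source can only step between window=1 and window=2.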

  21. Window Control
  • Advantages:
  • Self-clocked: congestion → ↑RTT → instant ↓rate (rate = window/RTT)
  • Window size bounds switch buffer utilization
  • Disadvantage: narrow operational range for SANs
  • Window=2 uses all bandwidth on the path in an idle network
  • Cut-through switching → the packet header reaches the destination before the source can transmit the last byte
  • Window=1 fails to prevent congestion spreading if # flows > # buffer slots at the bottleneck

  22. Congestion Spreading (Window=1)
  [Plot: normalized rate over time (0–100 ms) with 5 local flows, 5 remote flows, 4 buffer slots; curves for LF, RF, VF, RL, IL]

  23. Rate Control
  • Advantages:
  • Low buffer utilization possible (< 1 packet per flow)
  • Wide operational range
  • Disadvantage: not self-clocked

  24. Fixed Optimal Rates
  [Plot: normalized rate over time (0–100 ms) with 5 local flows, 5 remote flows, 4 buffer slots; curves for LF, RF, VF, RL, IL]

  25. Proposed Source Response Mechanism
  • Rate control with a fixed window limit (window = 1 packet)
  • Wide dynamic range of rate control
  • Self-clocking provided by the window (window=1 nearly saturates path bandwidth in a low-latency SAN)
  • Focus on the design of rate control functions

  26. Designing Rate Control Functions
  • Definition: when the source receives an ACK:
  • Decrease rate on a marked ACK: rnew = fdec(r)
  • Increase rate on an unmarked ACK: rnew = finc(r)
  • fdec(r) and finc(r) should provide:
  • Congestion avoidance
  • High network bandwidth utilization
  • Fair allocation of bandwidth among flows
  • Develop new sufficient conditions for fdec(r) and finc(r)
  • Exploit differences in packet marking rates across flows to relax the conditions
  • Requires a novel time-based formulation
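The per-ACK response interface above amounts to a small dispatcher. This is our illustration; the clamping to [Rmin, Rmax] is an assumption we add for safety, not something stated on the slide.

```python
def on_ack(rate, marked, f_dec, f_inc, r_min, r_max):
    """Apply the source response to one ACK: decrease if the ECN mark was
    echoed, increase otherwise, keeping the rate within [r_min, r_max]."""
    new_rate = f_dec(rate) if marked else f_inc(rate)
    return min(r_max, max(r_min, new_rate))

# Example with placeholder response functions (halve / grow by 10%):
r = 0.5
r = on_ack(r, True,  lambda x: x / 2, lambda x: x * 1.1, 0.01, 1.0)  # marked ACK
r = on_ack(r, False, lambda x: x / 2, lambda x: x * 1.1, 0.01, 1.0)  # unmarked ACK
```

The rest of the talk is about which `f_dec`/`f_inc` pairs satisfy the congestion-avoidance and fairness conditions.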

  27. Avoiding Congested State
  • Steady state: the flow rate oscillates around the optimal value in alternating phases of rate decrease and increase
  • Want to avoid time in the congested state
  • The magnitude of the response to a marked ACK is larger than or equal to the magnitude of the response to an unmarked ACK
  • Congestion Avoidance Condition: finc(fdec(r)) ≤ r

  28. Fairness Convergence
  • [Chiu/Jain 1989][Bansal/Balakrishnan 2001] developed convergence conditions assuming all flows receive feedback and adjust rates synchronously: each increase/decrease cycle must improve fairness
  • Observation: in the congested state, the mean number of marked packets for a flow is proportional to the flow rate
  • This bias promotes flow rate fairness
  • Enables a weaker fairness convergence condition
  • Benefit: fairness with faster rate recovery

  29. Fairness Convergence
  [Figure: Finc(t), rate as a function of time in the absence of marks; a rate decrease step to fdec(r) followed by recovery time Trec(r)]
  • Relax the condition: rate decrease-increase cycles need only maintain fairness in the synchronous case
  • If two flows receive marks, the lower-rate flow should recover earlier than, or at the same time as, the higher-rate flow
  • Fairness Convergence Condition: Trec(r1) ≤ Trec(r2) for r1 < r2

  30. Maximizing Bandwidth Utilization
  • Goal: as flows depart, the remaining flows should recover rate quickly to maximize utilization
  • Fastest recovery: use the limiting cases of the conditions
  • Congestion Avoidance Condition finc(fdec(r)) ≤ r: use finc(fdec(r)) = r for minimum rate Rmin
  • Recovery from a decrease event then requires only one unmarked ACK at rate Rmin (time = 1/Rmin)
  • Fairness Convergence Condition Trec(r1) ≤ Trec(r2): use Trec(r1) = Trec(r2) for higher rates
  • Maximum Bandwidth Utilization Condition: Trec(r) = 1/Rmin for all r

  31. Design Methodology
  [Figure: Finc(t) curves showing a decrease step to fdec(r), an increase step to finc(r) after time 1/r, and recovery time Trec = 1/Rmin]
  • Choose fdec(r), then find finc(r) satisfying the conditions
  • Use fdec(r) to derive Finc(t): Finc(t) = fdec(Finc(t + Trec)), with Trec = 1/Rmin
  • Use Finc(t) to find finc(r): finc(r) = Finc(tr + 1/r), where Finc(tr) = r
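As a worked instance of this methodology (our derivation, using a multiplicative decrease $f_{dec}(r) = r/m$; the result matches the FIMD increase function stated on the next slide):

```latex
F_{inc}(t) = f_{dec}\big(F_{inc}(t + T_{rec})\big),\qquad
f_{dec}(r) = r/m,\qquad T_{rec} = 1/R_{min}
\;\Longrightarrow\; F_{inc}(t + 1/R_{min}) = m\,F_{inc}(t),
```

so $F_{inc}(t) = R_{min}\, m^{R_{min} t}$ (taking $F_{inc}(0) = R_{min}$). With $F_{inc}(t_r) = r$, i.e. $m^{R_{min} t_r} = r/R_{min}$,

```latex
f_{inc}(r) = F_{inc}\!\left(t_r + \tfrac{1}{r}\right)
           = R_{min}\, m^{R_{min} t_r + R_{min}/r}
           = r\, m^{R_{min}/r}.
```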

  32. New Response Functions
  • Fast Increase Multiplicative Decrease (FIMD):
  • Decrease function: fdec_fimd(r) = r/m, constant m > 1 (same as AIMD)
  • Increase function: finc_fimd(r) = r · m^(Rmin/r)
  • Much faster rate recovery than AIMD
  • Linear Inter-Packet Delay (LIPD):
  • Decrease function: increases the inter-packet delay (ipd) by 1 packet transmission time, where r = Rmax/(ipd+1)
  • Increase function: finc_lipd(r) = r/(1 − Rmin/Rmax)
  • Large decreases at high rate, small decreases at low rate
  • Simple implementation: e.g., table lookup
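The FIMD pair can be written down directly and checked against the conditions from the previous slides. A sketch; the Rmin, Rmax, and m values are illustrative, and the clamping to [Rmin, Rmax] is our addition.

```python
R_MIN, R_MAX, M = 0.01, 1.0, 2.0   # illustrative parameters, not from the talk

def fdec_fimd(r):
    """FIMD decrease: multiplicative, the same form as AIMD's decrease."""
    return max(R_MIN, r / M)

def finc_fimd(r):
    """FIMD increase: large steps at low rates, small steps near R_MAX."""
    return min(R_MAX, r * M ** (R_MIN / r))

# Congestion Avoidance Condition finc(fdec(r)) <= r holds for r >= M * R_MIN,
# with equality at r = M * R_MIN: one unmarked ACK at R_MIN undoes the
# decrease there, since finc(R_MIN) = M * R_MIN.
for r in (0.05, 0.1, 0.5, 1.0):
    assert finc_fimd(fdec_fimd(r)) <= r + 1e-12
```

LIPD admits the "table lookup" implementation the slide mentions: keep the integer ipd, add 1 on a marked ACK, and read the next rate out of a precomputed Rmax/(ipd+1) table.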

  33. Increase Behavior Over Time: FIMD, AIMD, LIPD
  [Plot: Finc(t), normalized rate vs. time in units of packet transmission time (up to 65K), for FIMD (m=2), AIMD (m=2), and LIPD]

  34. Performance: Source Response Functions
  [Plots: normalized rate vs. buffer size (4–11 packets) for LIPD, AIMD, and FIMD; curves for root link (RL), inter-switch link (IL), local flows (LF), remote flows (RF)]

  35. Performance: Bursty Traffic
  [Plots: root link and inter-switch link utilization vs. mean source ON duration (0.01–100 ms) for LIPD, FIMD, and AIMD]
  • Each flow: ON/OFF periods exponentially distributed with equal means

  36. Summary
  • Proposed and evaluated a congestion control approach appropriate for the unique characteristics of SANs such as InfiniBand
  • ECN applicable to modern input-queued switches
  • Source response: rate control with a window limit
  • Derived new relaxed conditions for source response function convergence → functions with fast bandwidth reclamation
  • Based on the observation of packet marking bias
  • Two examples: FIMD and LIPD outperform AIMD
  • Submitted our proposal to the InfiniBand Trade Organization congestion control working group

  37. For Additional Information
  • "End-to-end congestion control for InfiniBand", IEEE INFOCOM 2003.
  • "Evaluation of congestion detection mechanisms for InfiniBand switches", IEEE GLOBECOM 2002.
  • "An approach for congestion control in InfiniBand", HPL-2001-277R1, May 2002.

