End-to-End Congestion Control for System Area Networks

Renato Santos, Yoshio Turner, John Janakiraman
Linux Systems, Networks Department
Internet Systems and Storage Laboratory

CSC talk, 06/13/2003
Problem: Network Congestion

• Cause: network congestion arises when injected traffic exceeds network capacity
• Effect: performance degrades to levels below what could be achieved in the absence of congestion
• Need a congestion control mechanism
• Our focus: congestion control for System Area Networks (SANs)
  • Previous work focused on traditional TCP/IP networks
  • SANs have several characteristics that make the congestion control problem unique in this environment
Outline

• Motivation
• Part 1: Congestion Detection and Notification
• Part 2: Source Response
• Conclusion
System Area Networks (SANs)

• High-speed, low-latency interconnect for high-performance I/O and cluster communication
  • Data rates: 10s of Gb/s
  • Latency: 100s of ns to a few µs end-to-end
• Examples of proprietary SANs: Myrinet, Quadrics, Memory Channel (HP), ServerNet (HP)
• InfiniBand: industry standard for SANs
SAN Characteristics and Congestion Control Implications

• No packet dropping (link-level flow control) ⇒ need network support for detecting congestion
• Low network latency (tens of ns cut-through switching) ⇒ need simple logic suitable for hardware implementation
• Low buffer capacity at switches (e.g., an 8KB buffer per port can store only 4 packets of 2KB each) ⇒ TCP window mechanism inadequate (narrow operational range)
• Input-buffered switches ⇒ need alternative congestion detection mechanisms
Problem: Congestion Spreading

[Figure: flows crossing three cascaded switches; the root link is the root of the congestion spreading tree]
Avoiding Congestion Spreading

• Congestion control mechanism (feedback-control loop):
  • Congestion detection mechanism (feedback)
    • Detect when congestion is forming
  • Source response (control)
    • Adjust each flow's injection rate based on feedback to avoid congestion
    • Discussed in the 2nd part of this talk
Congestion Detection and Notification

• Need network support for detecting congestion
  • Cannot use packet loss at end nodes to detect congestion
• ECN (Explicit Congestion Notification) approach (see the sketch below):
  • Switch detects congestion when a switch buffer becomes full
  • Switch sets a congestion bit in the headers of packets in the full buffer ("packet marking")
  • Destination node copies the congestion bit (mark) into the ACK packet
  • Source adjusts the flow rate according to the value of the congestion bit (mark) received in the ACK packet
• What is unique in our ECN mechanism?
  • Packet marking appropriate for input-buffered switches
  • Simple to implement in hardware
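A minimal C sketch of this ECN feedback path. The packet/ACK structures and placeholder response functions are illustrative assumptions, not the InfiniBand wire format or the talk's actual functions:

    #include <stdbool.h>

    typedef struct { int flow_id; bool ecn_mark; } packet_t;  /* ecn_mark = congestion bit */
    typedef struct { int flow_id; bool ecn_mark; } ack_t;

    /* Switch: set the congestion bit on a packet selected for marking. */
    void switch_mark(packet_t *pkt) { pkt->ecn_mark = true; }

    /* Destination: copy the mark from the data packet into its ACK. */
    ack_t dest_make_ack(const packet_t *pkt) {
        ack_t ack = { .flow_id = pkt->flow_id, .ecn_mark = pkt->ecn_mark };
        return ack;
    }

    /* Response functions from Part 2 (placeholders here, e.g., AIMD-like). */
    double fdec(double r) { return r / 2.0; }
    double finc(double r) { return r + 0.01; }

    /* Source: adjust the flow's rate on each ACK. */
    double source_on_ack(double rate, const ack_t *ack) {
        return ack->ecn_mark ? fdec(rate) : finc(rate);
    }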
Naive Approach: Marking Packets in Full Buffers

[Figure: input-buffered switch; each input buffer queues packets labeled with their destination output links]

• When an input buffer becomes full: mark all packets in that input buffer
Simulation Scenario

[Figure: remote flows and the victim flow enter SWITCH A and cross the inter-switch link to SWITCH B; local flows enter SWITCH B directly; local and remote flows contend for the congested root link, while the victim flow targets a non-congested link]

• Congestion spreading in this scenario: contention for the root link ⇒ the buffer used by remote flows fills up ⇒ the inter-switch link blocks ⇒ the victim flow cannot use the available inter-switch link bandwidth
• Assumptions:
  • 10 local flows + 10 remote flows + 1 victim flow
  • All flows are greedy (try to use all available BW)
  • Buffer size: 4 packets per input port
  • Sources react to packet marking using an adequate response function (discussed later)
Simulation Results

[Figure: normalized rate vs. time (ms) for two cases: no congestion control, and naive packet marking with an adequate source response; curves for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), and inter-switch link (IL)]

• Congestion spreading is effectively avoided
• Unfairness (remote vs. local flows):
  • The shared full buffer causes remote packets to be marked more frequently than local packets
  • Local flows get a higher share of the BW
Input-Triggered Packet Marking

• Goal: improve fairness
  • Mark all packets using the congested link, not only packets in the full buffer
• Marking triggered by a full input buffer:
  • Mark all packets in the full input buffer
  • Identify root (congested) links: the destinations of packets in the full buffer
  • Mark any packet destined to a root link

[Figure: input-buffered switch; packets destined to the congested output link are marked regardless of which input buffer holds them]
Efficient Implementation

• Use counters to avoid an expensive scan of all packets in the switch (when searching for packets destined to a congested link)
• 2 counters per output port:
  • CNT_1: total number of packets in the switch destined to this output port
  • CNT_2: total number of packets destined to this output port that still need to be marked
• On a full-input-buffer event: copy CNT_1 into CNT_2 (see the sketch below)
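A hedged C sketch of the counter scheme. The switch radix, the enqueue/dequeue hooks, and the way root ports are identified (here, the destinations of packets in the full buffer) are illustrative assumptions, not the hardware design:

    #include <stdbool.h>

    #define NUM_PORTS 8             /* illustrative switch radix */

    int cnt1[NUM_PORTS];            /* CNT_1: packets in switch destined to port p    */
    int cnt2[NUM_PORTS];            /* CNT_2: packets destined to port p left to mark */

    /* Maintain CNT_1 as packets enter the switch. */
    void on_packet_enqueue(int out_port) { cnt1[out_port]++; }

    /* Full-input-buffer event: for each output port identified as a root,
     * snapshot CNT_1 into CNT_2, so every packet currently in the switch
     * and destined to that port will be marked on departure. */
    void on_full_input_buffer(const int *root_ports, int n_roots) {
        for (int i = 0; i < n_roots; i++)
            cnt2[root_ports[i]] = cnt1[root_ports[i]];
    }

    /* When a packet departs through out_port: returns true if the
     * packet's ECN bit should be set. */
    bool on_packet_dequeue(int out_port) {
        cnt1[out_port]--;
        if (cnt2[out_port] > 0) {   /* packets remain to be marked */
            cnt2[out_port]--;
            return true;
        }
        return false;
    }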
Input-Triggered Packet Marking: Results

[Figure: normalized rate vs. time (ms) for naive vs. input-triggered marking; curves for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), and inter-switch link (IL)]

• Fairness improved (still some unfairness)
  • Marking is still triggered by remote packets (biases marking towards remote packets)
Input-Output-Triggered Packet Marking

• Additional output-triggered mechanism (a sketch of the threshold check follows):
  • Mark packets when the total number of packets destined to an output port exceeds a threshold
• Still mark packets when an input buffer is full (input-triggered)
  • To avoid link blocking and congestion spreading
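The output trigger layers naturally on the CNT_1 counter from the previous sketch; a minimal extension, with MARK_THRESHOLD as an illustrative tuning parameter:

    #define MARK_THRESHOLD 8        /* illustrative; needs tuning (see results below) */

    extern int cnt1[];              /* CNT_1 array from the previous sketch */

    /* Output trigger: mark a departing packet when its output port's
     * occupancy exceeds the threshold, independent of full-buffer events. */
    int output_trigger_mark(int out_port) {
        return cnt1[out_port] > MARK_THRESHOLD;
    }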
Input-Output-Triggered Packet Marking: Results

[Figure: normalized rate vs. time (ms) for input-triggered vs. input-output-triggered marking (threshold: 8 packets); curves for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), and inter-switch link (IL)]

• High bandwidth utilization
• Better fairness than input-triggered
Input-Output-Triggered Packet Marking: Efficiency and Fairness

[Figure: root link utilization (efficiency) and remote/local rate ratio (fairness) vs. buffer size in packets, for thresholds 4, 6, 8, 16 and for no output marking]

• The right threshold value needs to be tuned (a function of buffer size and traffic pattern)
Proposed Packet Marking Mechanism

• Input-triggered packet marking:
  • Improves fairness over the naive approach
  • Simple to implement
  • Does not require tuning
Part 2: Source Response
Source Response: Window or Rate Control

• Flow source adjusts injection in response to ECN
• Rate control:
  • Flow source adjusts its rate limit explicitly (e.g., enforced by adjusting the delay between packet injections; see the sketch below)
• Window control (e.g., TCP):
  • Flow source adjusts window = # of outstanding packets
  • Corresponds to rate = window/RTT (round-trip time)
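A minimal sketch of rate enforcement via inter-packet delay, assuming a hypothetical send_packet() injection hook and POSIX nanosleep for the spacing (illustrative, not the NIC implementation):

    #include <time.h>   /* nanosleep (POSIX) */

    /* Enforce a rate limit by spacing injections 1/rate seconds apart. */
    void rate_limited_send_loop(double rate_pkts_per_sec, int n_packets) {
        double gap_sec = 1.0 / rate_pkts_per_sec;   /* inter-packet delay */
        struct timespec gap = {
            .tv_sec  = (time_t)gap_sec,
            .tv_nsec = (long)((gap_sec - (time_t)gap_sec) * 1e9)
        };
        for (int i = 0; i < n_packets; i++) {
            /* send_packet();  -- hypothetical injection hook */
            nanosleep(&gap, NULL);
        }
    }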
Window Control

• Advantages:
  • Self-clocked: congestion ⇒ higher RTT ⇒ lower instantaneous rate (rate = window/RTT)
  • Window size bounds switch buffer utilization
• Disadvantage: narrow operational range for SANs
  • Window = 2 uses all the bandwidth on a path in an idle network
    • Cut-through switching ⇒ a packet's header reaches the destination before the source can transmit its last byte
  • Window = 1 fails to prevent congestion spreading if # flows > # buffer slots at the bottleneck
Congestion Spreading (Window = 1)

[Figure: normalized rate vs. time (ms) for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), and inter-switch link (IL); 5 local flows, 5 remote flows, 4 buffer slots]
Rate Control

• Advantages:
  • Low buffer utilization possible (< 1 packet per flow)
  • Wide operational range
• Disadvantage: not self-clocked
Fixed Optimal Rates

[Figure: normalized rate vs. time (ms) for local flows (LF), remote flows (RF), victim flow (VF), root link (RL), and inter-switch link (IL); 5 local flows, 5 remote flows, 4 buffer slots]
Proposed Source Response Mechanism

• Rate control with a fixed window limit (window = 1 packet)
  • Wide dynamic range of rate control
  • Self-clocking provided by the window (window = 1 nearly saturates path bandwidth in a low-latency SAN)
• Focus on the design of rate control functions
Designing Rate Control Functions

• Definition: when the source receives an ACK:
  • Decrease rate on a marked ACK: r_new = fdec(r)
  • Increase rate on an unmarked ACK: r_new = finc(r)
• fdec(r) and finc(r) should provide:
  • Congestion avoidance
  • High network bandwidth utilization
  • Fair allocation of bandwidth among flows
• Develop new sufficient conditions for fdec(r) and finc(r)
  • Exploit differences in packet marking rates across flows to relax the conditions
  • Requires a novel time-based formulation
Avoiding Congested State

• Steady state: the flow rate oscillates around the optimal value in alternating phases of rate decrease and increase
• Want to avoid time in the congested state
  • The magnitude of the response to a marked ACK must be larger than or equal to the magnitude of the response to an unmarked ACK
• Congestion Avoidance Condition: finc(fdec(r)) ≤ r (a worked example follows)
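As a hedged worked example with generic AIMD-style parameters (not values from the talk), take additive increase f_inc(r) = r + a and multiplicative decrease f_dec(r) = r/m. The condition becomes

    f_{inc}(f_{dec}(r)) = \frac{r}{m} + a \le r
    \quad\Longleftrightarrow\quad
    a \le r\left(1 - \frac{1}{m}\right)

so for m = 2 the additive step must satisfy a ≤ r/2, i.e., the condition binds most tightly at low rates.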
Fairness Convergence

• [Chiu/Jain 1989] and [Bansal/Balakrishnan 2001] developed convergence conditions assuming all flows receive feedback and adjust rates synchronously
  • Each increase/decrease cycle must improve fairness
• Observation: in the congested state, the mean number of marked packets for a flow is proportional to the flow rate (a short justification follows)
  • This bias promotes flow rate fairness
  • Enables a weaker fairness convergence condition
• Benefit: fairness with faster rate recovery
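A one-line justification of the observation, under the assumption (ours, for illustration) that each packet crossing the congested link is marked independently with some probability p:

    \mathbb{E}[\text{marked packets per second for flow } i] = p \, r_i \;\propto\; r_i

so higher-rate flows receive proportionally more decrease signals.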
Fairness Convergence (continued)

[Figure: Finc(t), the rate as a function of time in the absence of marks; after a decrease step from r to fdec(r), the flow takes recovery time Trec(r) to return to r]

• Relaxed condition: rate decrease-increase cycles need only maintain fairness in the synchronous case
• If two flows receive marks, the lower-rate flow should recover earlier than, or at the same time as, the higher-rate flow
• Fairness Convergence Condition: Trec(r1) ≤ Trec(r2) for r1 < r2
Maximizing Bandwidth Utilization

• Goal: as flows depart, the remaining flows should recover their rates quickly to maximize utilization
• Fastest recovery: use the limiting cases of the conditions
  • Congestion Avoidance Condition finc(fdec(r)) ≤ r: use finc(fdec(r)) = r at the minimum rate Rmin
    • Recovery from a decrease event then requires only one unmarked ACK at rate Rmin (time = 1/Rmin)
  • Fairness Convergence Condition Trec(r1) ≤ Trec(r2): use Trec(r1) = Trec(r2) for higher rates
• Maximum Bandwidth Utilization Condition: Trec(r) = 1/Rmin for all r
Design Methodology

[Figure: Finc(t) trajectories showing the decrease step fdec(r), the increase step finc(r) applied at the next ACK (1/r later), and the recovery time Trec = 1/Rmin]

• Choose fdec(r), then find finc(r) satisfying the conditions (a worked derivation follows):
  • Use fdec(r) to derive Finc(t): Finc(t) = fdec(Finc(t + Trec)), with Trec = 1/Rmin
  • Use Finc(t) to find finc(r): finc(r) = Finc(tr + 1/r), where tr satisfies Finc(tr) = r
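A worked instance of the methodology for the multiplicative decrease used by FIMD on the next slide (a sketch; the constant C in F_inc is left free, since only the per-ACK step matters). Choose f_dec(r) = r/m. Then

    F_{inc}(t) = f_{dec}(F_{inc}(t + 1/R_{min})) = \frac{F_{inc}(t + 1/R_{min})}{m}
    \;\Longrightarrow\; F_{inc}(t) = C \, m^{t R_{min}}

and evaluating at the next ACK, 1/r after the time t_r where F_inc(t_r) = r:

    f_{inc}(r) = F_{inc}(t_r + 1/r) = F_{inc}(t_r)\, m^{R_{min}/r} = r \, m^{R_{min}/r}

which matches the FIMD increase function below.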
New Response Functions

• Fast Increase Multiplicative Decrease (FIMD):
  • Decrease function: fdec_FIMD(r) = r/m, constant m > 1 (same as AIMD)
  • Increase function: finc_FIMD(r) = r · m^(Rmin/r)
  • Much faster rate recovery than AIMD
• Linear Inter-Packet Delay (LIPD):
  • Decrease function: increases the inter-packet delay (ipd) by 1 packet transmission time, where r = Rmax/(ipd + 1)
  • Increase function: finc_LIPD(r) = r / (1 - Rmin/Rmax)
  • Large decreases at high rates, small decreases at low rates
  • Simple implementation: e.g., table lookup (a C sketch follows)
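A hedged C sketch of both response functions, with rates normalized so Rmax = 1 and an illustrative Rmin; the direct arithmetic stands in for the table-lookup realization mentioned above:

    #include <math.h>

    #define R_MAX 1.0      /* normalized peak rate */
    #define R_MIN 0.01     /* illustrative minimum rate */
    #define M     2.0      /* FIMD/AIMD decrease factor */

    /* FIMD: fdec(r) = r/m, finc(r) = r * m^(Rmin/r). */
    double fimd_dec(double r) { return r / M; }
    double fimd_inc(double r) {
        double nr = r * pow(M, R_MIN / r);
        return nr > R_MAX ? R_MAX : nr;        /* clamp at link rate */
    }

    /* LIPD: rates form the table Rmax/(ipd+1); a decrease adds one packet
     * transmission time of delay, an increase is a constant multiplicative
     * step r/(1 - Rmin/Rmax). */
    double lipd_dec(double r) {
        double ipd = R_MAX / r - 1.0;          /* current inter-packet delay */
        return R_MAX / (ipd + 2.0);            /* ipd -> ipd + 1 */
    }
    double lipd_inc(double r) {
        double nr = r / (1.0 - R_MIN / R_MAX);
        return nr > R_MAX ? R_MAX : nr;
    }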
Increase Behavior Over Time: FIMD, AIMD, LIPD

[Figure: Finc(t), normalized rate vs. time in units of packet transmission time (up to 65K), for FIMD (m=2), AIMD (m=2), and LIPD]
Performance: Source Response Functions

[Figure: normalized rate vs. buffer size (4-11 packets) for LIPD, AIMD, and FIMD; curves for root link (RL), inter-switch link (IL), local flows (LF), and remote flows (RF)]
Performance: Bursty Traffic

[Figure: root link and inter-switch link utilization vs. mean source ON duration (0.01-100 ms) for LIPD, FIMD, and AIMD]

• Each flow: ON/OFF periods exponentially distributed with equal mean
Summary

• Proposed and evaluated a congestion control approach appropriate for the unique characteristics of SANs such as InfiniBand
  • ECN applicable to modern input-queued switches
  • Source response: rate control with a window limit
• Derived new relaxed convergence conditions for source response functions ⇒ functions with fast bandwidth reclamation
  • Based on the observation of packet marking bias
  • Two examples: FIMD and LIPD outperform AIMD
• Submitted our proposal to the InfiniBand Trade Association congestion control working group
For Additional Information

• "End-to-End Congestion Control for InfiniBand," IEEE INFOCOM 2003.
• "Evaluation of Congestion Detection Mechanisms for InfiniBand Switches," IEEE GLOBECOM 2002.
• "An Approach for Congestion Control in InfiniBand," HP Labs Technical Report HPL-2001-277R1, May 2002.