Congestion Management for Data Centers: IEEE 802.1 Ethernet Standard
Balaji Prabhakar, Departments of EE and CS, Stanford University
Background
• Data Centers see the true convergence of L3 and L2 transport
• While TCP is the dominant L3 transport protocol, and a significant amount of L2 traffic uses it, there is other L2 traffic; notably, storage and media
• This, among other reasons, has prompted the IEEE 802.1 standards body to develop an Ethernet congestion management standard
• In this lecture, we shall see the development of the QCN (Quantized Congestion Notification) algorithm for standardization in the IEEE 802.1 Data Center Bridging standards
• We will also review the technical background of congestion control research
• The lecture has 3 parts
  1. A brief overview of the relevant congestion control background
  2. A description of the QCN algorithm and its performance
  3. The Averaging Principle: a new control-theoretic idea underlying the QCN and BIC-TCP algorithms which stabilizes them when loop delays increase; very useful for operating high-speed links with shallow buffers, the situation in 10+ Gbps Ethernets
Managing Congestion
• Congestion is a standard feature of networked systems; in data networks, congestion occurs when links are oversubscribed, i.e. when traffic and/or link bandwidth changes
• A congestion notification mechanism allows switches/routers to directly control the rate of the ultimate sources of the traffic
• We've been involved in developing QCN (Quantized Congestion Notification) for standardization in the Data Center Bridging track of the IEEE 802.1 Ethernet standards
  • For deployment in 10 (and 40 and 100) Gbps Data Center Ethernets
  • Complete information on the QCN algorithm (pseudo-code, draft of standard, detailed simulations of many scenarios) is available at the IEEE 802.1 Data Center Bridging Task Group website
Congestion control in the Internet
• In the Internet
  • Queue management schemes (e.g. RED) at the links signal congestion by either dropping packets or marking them using ECN
  • TCP at the end-systems uses these signals to vary the sending rate
• There is a rich history of algorithm development, control-theoretic analysis and detailed simulation of queue management schemes and congestion control algorithms for the Internet
  • Jacobson, Floyd et al, Kelly et al, Low et al, Srikant et al, Misra et al, Katabi et al …
• TCP is excellent, so why look for another algorithm?
  • There is traffic on Ethernet other than TCP; so, native Ethernet congestion management is needed
  • TCP's "one size fits all" approach makes it too conservative for high bandwidth-delay product networks
  • A hardware-based algorithm is needed for the very high speeds of operation encountered at 10, 40 and 100 Gbps
  • Ethernet and the Internet have very different operating conditions
Switched Ethernet vs. the Internet
• Some significant differences …
  • No per-packet acks in Ethernet, unlike in the Internet
    • Not possible to know the round trip time!
    • So congestion must be signaled to the source by the switches
    • The algorithm is not automatically self-clocked (like TCP)
  • Links can be paused; i.e. packets may not be dropped
  • No sequence numbering of L2 packets
  • Sources do not start transmission gently (as TCP does with slow-start); they can potentially come on at the full line rate of 10 Gbps
  • Ethernet switch buffers are much smaller than router buffers (100s of KBs vs 100s of MBs)
  • Most importantly, the algorithm should be simple enough to be implemented completely in hardware
• Note: The QCN algorithm we have developed has Internet relatives; notably BIC-TCP at the source and the REM/PI controllers at switches
L2 Transport: IEEE 802.1
• IEEE 802.1 Data Center Bridging standards: enhancements to Ethernet
  • Reliable delivery (802.1Qbb): link-level flow control (PAUSE) prevents congestion drops
  • Ethernet congestion management (802.1Qau): prevents congestion spreading due to PAUSE
• Consequences
  • Hardware-friendly algorithms: can operate on 10-100 Gbps links
  • Partial offload of the CPU: no packet retransmissions
  • Corruption losses require abort/restart; 10G over copper uses short cables to keep the BER low
  • PAUSE absorption buffers: proportional to the bandwidth x delay of the links, requiring high memory bandwidth
• NOTE: Recent work addresses the last two points; this is not covered in the course
Stability
• Congestion control algorithms aim to
  • deliver high throughput, maintain low latencies/backlogs, be fair to all flows, be simple to implement and easy to deploy
• Performance is related to the stability of the control loop
  • "Stability" refers to the non-oscillatory or non-exploding behavior of congestion control loops; in real terms, it refers to the non-oscillatory behavior of the queues at the switch
  • If the switch buffers are short, oscillating queues can overflow (hence drop packets/pause the link) or underflow (hence lose utilization)
  • In either case, links cannot be fully utilized, throughput is lost, and flow transfers take longer
• So stability is an important property, especially for networks with high bandwidth-delay products operating with shallow buffers
Unit step response of the network
• The control loops are not easy to analyze
  • They are described by non-linear, delay differential equations which are usually impossible to analyze exactly
  • So linearized analyses are performed using Nyquist or Bode theory
• Is linearized analysis useful?
  • Yes! It is not difficult to tell whether a zero-delay non-linear system is stable. As the delay increases, linearization can be used to tell if the system is stable for delays (or numbers of sources) in some range; i.e. we get sufficient conditions
• The above stability theory is essentially studying the "unit step response" of a network
  • Apply many "infinitely long flows" at time 0 and see how long the network takes to settle them to the correct collective and individual rates; the first is about throughput, the second is about fairness
TCP--RED: A basic control loop
• (Figure: several TCP sources feed a RED queue; the drop probability p rises as the average queue length qavg moves between the thresholds minth and maxth)
• RED: the drop probability p increases as the congestion level goes up
• TCP: slow start + congestion avoidance
  • Congestion avoidance is AIMD: no loss, increase the window by 1; packet loss, cut the window by half
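To make the loop concrete, here is a minimal sketch (not the lecture's code; the threshold and probability values are illustrative placeholders) of the two halves of the loop: TCP's AIMD window update and RED's drop probability as a function of the average queue length.

# Minimal sketch of the TCP--RED loop components (illustrative parameters).

def aimd_update(cwnd, packet_lost):
    """TCP congestion avoidance: additive increase, multiplicative decrease."""
    if packet_lost:
        return max(1.0, cwnd / 2.0)   # pkt loss: cut window by half
    return cwnd + 1.0                  # no loss: increase window by 1 (per RTT)

def red_drop_probability(q_avg, min_th=5.0, max_th=15.0, p_max=0.1):
    """RED: drop probability grows linearly between min_th and max_th."""
    if q_avg < min_th:
        return 0.0
    if q_avg >= max_th:
        return 1.0                     # force drop above max_th
    return p_max * (q_avg - min_th) / (max_th - min_th)

# Example: a loss halves a 20-packet window; a 10-packet average queue
# gives a 5% drop probability with the illustrative thresholds above.
print(aimd_update(20.0, packet_lost=True))      # -> 10.0
print(red_drop_probability(10.0))               # -> 0.05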
TCP--RED
• Two ways to analyze and understand this control loop
  • Simulations: ns-2
  • Theory: delay-differential equations
• ns-2: a widely used event-driven simulator for the Internet
  • Very detailed and accurate
  • Different types of transport protocols: TCP, UDP, …
  • Router mechanisms and algorithms: RED, DRR, …
  • Web traffic: sessions, flows, power-law flow sizes, …
  • Different types of networks: wired, wireless, satellite, mobility, …
The simulation setup
• (Figure: three groups of TCP flows, grp1 and grp2 with up to 1200 flows each and grp3 with up to 600 flows, arrive and depart over the simulated interval and share two 100 Mbps links running RED)
TCP--RED: Analytical model
• (Block diagram: the TCP control generates a sending rate which feeds the queue, served at capacity C; the queue length q drives the RED control, which produces the drop probability p; p is fed back to the TCP control through the loop's time delay)
TCP--RED: Analytical model
• Users (TCP): W: window size; RTT: round trip time
• Network: C: link capacity; q: queue length; qa: average queue length; p: drop probability
• *By V. Misra, W. Gong and D. Towsley at SIGCOMM 2000
• *The fluid model concept originated with F. Kelly, A. Maulloo and D. Tan, Journal of the Operational Research Society, 1998
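For reference, the fluid model of Misra, Gong and Towsley takes roughly the following form; this is quoted from the published paper rather than reconstructed from the slide, so treat the exact form as indicative:

$$\frac{dW(t)}{dt} = \frac{1}{R(t)} - \frac{W(t)\,W(t-R)}{2\,R(t-R)}\,p(t-R), \qquad \frac{dq(t)}{dt} = \frac{N\,W(t)}{R(t)} - C,$$

where $R(t) = q(t)/C + T_p$ is the round trip time ($T_p$ the propagation delay), $N$ is the number of TCP flows, and the average queue $q_a$ driving the RED drop probability $p$ is an exponentially weighted moving average of $q$.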
Accuracy of analytical model
• Recall the ns-2 simulation from earlier: (plot of the delay at Link 1)
Why are the differential equation models so accurate?
• They were developed in physics, where they are called mean field models
• The main idea
  • It is very difficult to model large-scale systems exactly: there are simply too many events, too many random quantities
  • But it is quite easy to model the mean or average behavior of such systems
  • Interestingly, as the size of the system grows, its behavior gets closer and closer to that predicted by the mean-field model!
  • Physicists have been exploiting this feature to model large magnetic materials, gases, etc.
  • Just as a few electrons/particles don't have a very big influence on a large physical system, Internet resource usage is not heavily influenced by a few packets: aggregates matter more
TCP--RED: Stability analysis
• Given the differential equations, in principle one can figure out whether the TCP--RED control loop is stable
• However, the differential equations are very complicated
  • 3rd or 4th order, nonlinear, with delays
  • There is no general theory; specific case treatments exist
• "Linearize and analyze" (a sketch of the recipe follows below)
  • Linearize the equations around the (unique) operating point
  • Analyze the resultant linear, delay-differential equations using Nyquist or Bode theory
• End result:
  • Design stable control loops
  • Determine stability conditions (RTT limits, number of users, etc)
  • Obtain control loop parameters: gains, drop functions, …
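To make the recipe concrete, here is a generic sketch (the notation is ours, not the slides'): perturb the state around the operating point and study the resulting linear delay system,

$$W(t) = W_0 + \delta W(t), \qquad q(t) = q_0 + \delta q(t), \qquad p(t) = p_0 + \delta p(t).$$

Substituting into the fluid equations and dropping higher-order terms yields a linear delay-differential system of the form

$$\dot{x}(t) = A\,x(t) + B\,x(t - R_0), \qquad x(t) = \big(\delta W(t), \delta q(t)\big)^{\top},$$

whose stability can be checked by applying the Nyquist criterion to the loop transfer function, which contains the delay factor $e^{-s R_0}$.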
Instability of TCP--RED
• As the bandwidth-delay product increases, the TCP--RED control loop becomes unstable
• Parameters: 50 sources, link capacity = 9000 pkts/sec, TCP--RED
• Source: S. Low et al., Infocom 2002
Summary
• We saw a very brief overview of research on the analysis of congestion control systems
• As loop lags increase, the control loop becomes very oscillatory
  • This is true of any control scheme, not just congestion control schemes
• In networks, oscillatory queue sizes tend to underflow buffers, causing a loss of throughput; this is especially true for high BDP networks with shallow buffers
• This has led to much research on developing algorithms for high BDP networks; e.g. High-Speed TCP, XCP, RCP, Scalable TCP, BIC-TCP, etc
• We shall return to this later, after describing the QCN algorithm we have developed for the IEEE 802.1 standard
Quantized Congestion Notification (QCN): Congestion control for Ethernet
Joint work with:
Mohammad Alizadeh, Berk Atikoglu and Abdul Kabbani, Stanford University
Ashvin Lakshmikantha, Broadcom
Rong Pan, Cisco Systems
Mick Seaman, Chair, Security Group; Ex-Chair, Interworking Group, IEEE 802.1
Overview
• The description of QCN here is brief, restricted to the main points of the algorithm
  • A fuller description is available at the IEEE 802.1 Data Center Bridging Task Group's website, including extensive simulations and pseudo-code
• We will describe the congestion control loop
  • How is congestion measured at the switches?
  • What is the signal? And how does the switch send it? (Remember there are no per-packet acks in Ethernet)
  • What does the source do when it receives a congestion signal?
• Terminology:
  • Congestion Point (CP): where congestion occurs, mainly switches
  • Reaction Point (RP): the source of traffic, mainly rate limiters in Ethernet NICs
QCN: Congestion Point Dynamics
• Consider the single-source, single-switch loop, with the switch queue operating around a target occupancy Qeq
• Congestion Point (switch) dynamics: sample packets and compute the feedback
  Fb = -(Q - Qeq + w.dQ/dt) = -(queue offset + w.rate offset),
  quantize Fb to 6 bits, and reflect only negative Fb values back to the Reaction Point, with a reflection probability that increases (from Pmin to Pmax) in proportion to |Fb| (see the sketch below)
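A minimal sketch of the Congestion Point computation follows; the weight w, Qeq, the quantization range and the reflection probabilities are illustrative placeholders, not the exact constants of the 802.1Qau standard.

import random

# Minimal sketch of the QCN Congestion Point computation (illustrative
# constants; the 802.1Qau standard fixes the actual values and encodings).

Q_EQ   = 22        # target queue occupancy (packets)
W      = 2.0       # weight on the rate-offset term (assumed)
FB_MAX = 63        # feedback is quantized to 6 bits
P_MIN, P_MAX = 0.01, 0.10   # reflection probability range (assumed)

def compute_fb(q_now, q_old):
    """Fb = -(queue offset + w * rate offset); only negative Fb signals congestion."""
    q_off = q_now - Q_EQ           # queue offset
    dq    = q_now - q_old          # rate offset (queue growth since last sample)
    return -(q_off + W * dq)

def maybe_reflect(q_now, q_old):
    """Return a quantized |Fb| to send to the Reaction Point, or None."""
    fb = compute_fb(q_now, q_old)
    if fb >= 0:
        return None                            # no congestion: nothing is sent
    fb_q = min(int(-fb), FB_MAX)               # quantize to 6 bits
    p = P_MIN + (P_MAX - P_MIN) * fb_q / FB_MAX
    return fb_q if random.random() < p else None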
QCN: Reaction Point
• (Figure: when a congestion message is received, the Current Rate (CR) is cut by Rd while the Target Rate (TR) remembers the pre-decrease rate; CR then recovers towards TR in steps of Rd/2, Rd/4, Rd/8, …)
• Source (Reaction Point): transmits regular Ethernet frames. When a congestion message arrives:
  • Multiplicative Decrease: cut CR in proportion to the received |Fb|, and set TR to the pre-decrease rate
  • Fast Recovery, similar to BIC-TCP: repeatedly average CR towards TR; gives high performance in high bandwidth-delay product networks, while being very simple
  • Active Probing: after Fast Recovery, probe for extra bandwidth by increasing TR and continuing to average
• (A sketch of these rate updates follows below)
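A minimal sketch of the Reaction Point rate updates, assuming an illustrative decrease gain Gd and the increase step R_AI from the simulation setup; in the real rate limiter these cycles are driven by the byte counter and timer described on the next slide.

# Minimal sketch of the QCN Reaction Point rate updates (Gd is assumed;
# the standard's rate limiter is byte-counter/timer driven).

G_D  = 1.0 / 128    # decrease gain (assumed), so the maximum cut is about 50%
R_AI = 5e6          # active-increase step, 5 Mbps (from the simulation setup)

class ReactionPoint:
    def __init__(self, line_rate):
        self.cr = line_rate       # Current Rate
        self.tr = line_rate       # Target Rate

    def on_congestion_message(self, fb_q):
        """Multiplicative decrease: remember the pre-decrease rate in TR, cut CR."""
        self.tr = self.cr
        self.cr *= (1.0 - G_D * fb_q)

    def fast_recovery_cycle(self):
        """One Fast Recovery cycle: average CR towards TR (5 such cycles in QCN)."""
        self.cr = (self.cr + self.tr) / 2.0

    def active_increase_cycle(self):
        """Active probing: push TR up by R_AI, then average CR towards it."""
        self.tr += R_AI
        self.cr = (self.cr + self.tr) / 2.0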
Timer-supported QCN
• Byte-Counter
  • 5 cycles of FR (150KB per cycle)
  • AI cycles afterwards (75KB per cycle)
  • Fb < 0 (a congestion message) resets the byte-counter to FR
• Timer
  • 5 cycles of FR (T msec per cycle)
  • AI cycles afterwards (T/2 msec per cycle)
  • Fb < 0 resets the timer to FR
• Rate Limiter (RL)
  • In FR if both the byte-counter and the timer are in FR
  • In AI if only one of the byte-counter or the timer is in AI
  • In HAI if both the byte-counter and the timer are in AI
  • Note: the RL goes to HAI only after 500 pkts have been sent
• (A sketch of the RL state selection follows below)
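A small sketch of how the Rate Limiter state follows from the byte-counter and timer states, directly encoding the rules above (the state names are illustrative strings):

# Rate Limiter state selection from the byte-counter and timer states,
# following the rules above. Component states are 'FR' or 'AI'.

def rl_state(byte_ctr_state, timer_state, pkts_sent):
    if byte_ctr_state == 'FR' and timer_state == 'FR':
        return 'FR'                    # both still in Fast Recovery
    if byte_ctr_state == 'AI' and timer_state == 'AI' and pkts_sent >= 500:
        return 'HAI'                   # hyper-active increase, only after 500 pkts
    return 'AI'                        # otherwise, plain Active Increase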
Simulations: Basic Case
• Setup: 10 sources share a 10G link whose capacity drops to 0.5G during seconds 2-4
• Parameters
  • Max offered rate per source: 1.05G
  • RTT = 50 usec
  • Buffer size = 100 pkts (150KB); Qeq = 22
  • T = 10 msecs
  • RAI = 5 Mbps
  • RHAI = 50 Mbps
Recovery Time
• Recovery time = 80 msec (from the simulation plot)
Fluid Model for QCN
• Assume N flows pass through a single queue at a switch. The state variables are TRi(t), CRi(t), q(t), p(t)
• (Figure: the sampling/reflection probability p = Φ(Fb) increases with |Fb|, reaching 10% at the maximum quantized value of 63)
Accuracy: Equations vs. ns-2 simulations
• (Plots comparing the fluid model with ns-2 for: N = 10, RTT = 100 us; N = 100, RTT = 500 us; N = 10, RTT = 1 ms; N = 10, RTT = 2 ms)
Summary
• The algorithm has been extensively tested in deployment scenarios of interest
  • Esp. interoperability with link-level PAUSE and TCP
  • All presentations are available at the IEEE 802.1 website
• The theoretical development is interesting, most notably because QCN (and BIC-TCP) display strong stability in the face of increasing lags, or, equivalently, in high bandwidth-delay product networks
• While attempting to understand why these schemes perform so well, we have uncovered a method for improving the stability of any congestion control scheme; we present this next
Background to the AP
• When the lags in a control loop increase, the system becomes oscillatory and eventually becomes unstable
• Feedback compensation is applied to restore stability; the two main flavors of feedback compensation are:
  1. Determine the lags (round trip times) and apply the correct "gains" for the loop to be stable (e.g. XCP, RCP, FAST)
  2. Include higher-order queue derivatives in the congestion information fed back to the source (e.g. REM/PI, BCN)
• Method 1 is not suitable for us: we don't know RTTs in Ethernet
• Method 2 requires a change to the switch implementation
• The Averaging Principle is a different method
  • It is suited to Ethernet, where round trip times are unavailable
  • It doesn't need more feedback, hence switch implementations don't have to change
  • QCN and BIC-TCP already turn out to employ it
The Averaging Principle (AP)
• A source in a congestion control loop is instructed by the network to decrease or increase its sending rate (randomly) periodically
• AP: the source obeys the network whenever instructed to change its rate, and then voluntarily performs averaging: it moves its Current Rate (CR) halfway towards its Target Rate (TR), the rate it was using before the last instructed change
• (TR = Target Rate, CR = Current Rate; a sketch follows below)
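As a minimal sketch, assuming the half-way averaging with the pre-change rate described above (the rate values are illustrative), an AP source can be written as:

# Minimal sketch of an Averaging Principle (AP) source: obey every rate
# instruction, then voluntarily average back towards the pre-change rate.

class APSource:
    def __init__(self, rate):
        self.cr = rate      # Current Rate
        self.tr = rate      # Target Rate (rate before the last instructed change)

    def obey(self, new_rate):
        """Network instructs a rate change: remember the old rate, then obey."""
        self.tr = self.cr
        self.cr = new_rate

    def average(self):
        """Voluntary averaging step: move CR halfway towards TR."""
        self.cr = (self.cr + self.tr) / 2.0

# Example: instructed to halve its rate, the source obeys (10 -> 5 Gbps),
# then drifts back towards 10 Gbps in halving steps: 7.5, 8.75, ...
src = APSource(10.0)
src.obey(5.0); src.average(); src.average()
print(src.cr)   # -> 8.75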
Recall: QCN does 5 steps of Averaging
• (Figure: after a congestion message, CR is cut by Rd and then recovers towards TR in steps of Rd/2, Rd/4, Rd/8, … during Fast Recovery, followed by Active Probing)
• In the Fast Recovery portion of QCN, there are 5 steps of averaging
• In fact, QCN and BIC-TCP are the Averaging Principle applied to TCP!
Applying the AP to RCP: Rate Control Protocol (Dukkipati and McKeown)
• A router computes an upper bound R on the rate of all flows traversing it
• R is recomputed every T (= 10) msec as
  R(t) = R(t-T) [ 1 + (T/d0)(α(C - y(t)) - β q(t)/d0) / C ]
  where
  • d0: round trip time estimate (set constant = 10 msec in our case)
  • C: link capacity (= 2.4 Gbps)
  • q(t): current queue size at the switch
  • y(t): incoming rate
  • α = 0.1
  • β = 1
• A flow chooses the smallest advertised rate on its path
• We consider a scenario where 10 RCP sources share a single link (a small sketch of the router computation follows below)
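A minimal sketch of the router-side computation, assuming the standard RCP update as published by Dukkipati and McKeown with the parameter values listed above; the queue value in the example is an arbitrary illustration.

# Minimal sketch of the RCP router rate computation (parameters from the
# scenario above; the update follows the published RCP rule).

T     = 0.010          # update interval: 10 msec
D0    = 0.010          # round trip time estimate: 10 msec
C     = 2.4e9          # link capacity: 2.4 Gbps
ALPHA = 0.1
BETA  = 1.0

def rcp_update(r_prev, y, q):
    """Recompute the advertised rate R from the incoming rate y (bps) and queue q (bits)."""
    spare = ALPHA * (C - y) - BETA * q / D0     # spare capacity minus queue-drain term
    return r_prev * (1.0 + (T / D0) * spare / C)

# Example: a fully utilized link (y = C) with a standing queue worth 1 msec
# of traffic pushes the advertised rate down by about 10%.
r = C / 10                           # start from a fair-share guess for 10 flows
r = rcp_update(r, y=C, q=0.001 * C)
print(r / 1e6)                       # -> 216.0 (Mbps), i.e. 10% below 240 Mbps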
AP-RCP Stability
• (Simulation plots: RTT = 60 msec and RTT = 65 msec)
AP-RCP Stability cont'd
• (Simulation plots: RTT = 120 msec and RTT = 130 msec)
AP-RCP Stability cont'd
• (Simulation plots: RTT = 230 msec and RTT = 240 msec)
Understanding the AP
• As mentioned earlier, the two major flavors of feedback compensation are:
  1. Determine the lags, choose appropriate gains
  2. Feed back higher derivatives of the state
• We prove that the AP is, in a precise sense, equivalent to both of the above!
• This is great because we don't need to change network routers and switches
• And the AP is really very easy to apply; no lag-dependent optimization of gain parameters is needed
AP Equivalence: Single Source Case
• System 1: a source that performs the AP, driven by the feedback Fb. System 2: a regular (non-AP) source driven by the modified feedback 0.5 Fb + 0.25 T dFb/dt
• Systems 1 and 2 are discrete-time models for an AP-enabled source and a regular source, respectively
• Main Result: Systems 1 and 2 are algebraically equivalent. That is, given identical input sequences, they produce identical output sequences
• Therefore the AP is equivalent to adding a derivative term to the feedback and reducing the gain!
• Thus, the AP does both known forms of feedback compensation without knowing RTTs or changing switch implementations
AP-RCP vs PD-RCP
• (Simulation plots: RTT = 120 msec and RTT = 130 msec)
A Generic Control Example
• As an example, we consider the plant with transfer function: P(s) = (s+1)/(s^3 + 1.6s^2 + 0.8s + 0.6)
• (A small sketch for reproducing this plant's step response follows below)
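A minimal sketch, assuming scipy and matplotlib are available; it only builds the plant and computes its open-loop step response, it does not reproduce the slide's delayed-feedback/AP experiment.

# Build the example plant P(s) = (s+1)/(s^3 + 1.6 s^2 + 0.8 s + 0.6)
# and compute its open-loop step response.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

plant = signal.TransferFunction([1.0, 1.0], [1.0, 1.6, 0.8, 0.6])

t = np.linspace(0.0, 60.0, 2000)
t, y = signal.step(plant, T=t)

plt.plot(t, y)
plt.xlabel("time (s)")
plt.ylabel("output")
plt.title("Open-loop step response of P(s)")
plt.show()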
Step Response
• (Plot: two-step AP, delay = 25 seconds)
• The two-step AP is even more stable than the basic AP
Summary of AP
• The AP is a simple method for making many control loops (not just congestion control loops) more robust to increasing lags
• It gives a clear understanding of why the BIC-TCP and QCN algorithms have such good delay tolerance: they perform averaging repeatedly
• There is a theorem which deals explicitly with the QCN-type loop
• Variations of the basic principle are possible; e.g. average more than once, average by more than half-way, etc
• The theory is fairly complete in these cases