Tutorial Survey of LL-FC Methods for Datacenter Ethernet 101 Flow Control • M. Gusat • Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries • 26 Sept. 2006 • IBM Zurich Research Lab
Outline • Part I • Requirements of datacenter link-level flow control (LL-FC) • Brief survey of top 3 LL-FC methods • PAUSE, a.k.a. On/Off grants • credit • rate • Baseline performance evaluation • Part II • Selectivity and scope of LL-FC • per-what?: LL-FC’s resolution
Req’ts of .3x’: Next Generation of Ethernet Flow Control for Datacenters • Lossless operation: no-drop expectation of datacenter apps (storage, IPC) • Low latency • Selective: discrimination granularity: link, prio/VL, VLAN, VC, flow...? Scope: backpressure upstream one hop, k hops, e2e...? • Simple... PAUSE-compatible!!
Generic LL-FC System • One link with 2 adjacent buffers: TX (SRC) and RX (DST) • The round trip time (RTT) per link is the system’s time constant • LL-FC issues: • link traversal (channel BW allocation) • RX buffer allocation • pairwise communication between the channel’s terminations • signaling overhead (PAUSE, credit, rate commands) • backpressure (BP): • increase / decrease injections • stop and restart protocol
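To make the RTT-as-time-constant point concrete, here is a minimal discrete-time sketch of one flow-controlled link (a Python illustration with assumed names and delay values, not part of the tutorial): data takes half an RTT to reach the RX, feedback takes the other half to return, so any backpressure command acts one full RTT after the condition that triggered it.

```python
from collections import deque

RTT = 4                 # round trip of the link, in packet times (assumed value)
ONE_WAY = RTT // 2      # forward data delay == reverse feedback delay

class Link:
    """One link: a data pipe TX -> RX and a feedback pipe RX -> TX, each ONE_WAY deep."""
    def __init__(self):
        self.data_pipe = deque([None] * ONE_WAY)   # packets in flight toward RX
        self.fb_pipe = deque([None] * ONE_WAY)     # FC commands in flight toward TX

    def step(self, tx_packet, rx_feedback):
        """Advance one packet time; return (packet arriving at RX, feedback arriving at TX)."""
        self.data_pipe.append(tx_packet)
        self.fb_pipe.append(rx_feedback)
        return self.data_pipe.popleft(), self.fb_pipe.popleft()
```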
FC-Basics: PAUSE (On/Off Grants) • [Figure: TX queues (OQ) feed the data link into the RX buffer; when occupancy crosses the threshold (“over-run”), the RX sends STOP on the FC return path, then GO once it drains; Xbar and downstream links shown] • BP semantics: STOP / GO / STOP... *Note: Selectivity and granularity of FC domains are not considered here.
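A minimal sketch of the On/Off grant semantics above, assuming a single RX buffer with a high-water STOP threshold and a low-water GO threshold; the threshold values and names are illustrative, not taken from 802.3x.

```python
STOP_THRESHOLD = 8   # RX buffer occupancy at which a STOP is sent upstream (assumed)
GO_THRESHOLD = 4     # occupancy at which a GO re-enables the upstream TX (assumed)

def rx_feedback(occupancy, paused):
    """Return (FC command or None, new paused state) for the current RX occupancy."""
    if not paused and occupancy >= STOP_THRESHOLD:
        return "STOP", True      # over-run protection: halt the upstream transmitter
    if paused and occupancy <= GO_THRESHOLD:
        return "GO", False       # buffer has drained enough: restart the transmitter
    return None, paused          # no new backpressure command this cycle
```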
FC-Basics: Credits • [Figure: credit-based FC between the TX queues and the RX buffer, across the Xbar] *Note: Selectivity and granularity of FC domains are not considered here.
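For comparison, a sketch of the credit-based transmit side: the TX keeps a counter of RX buffer locations it may still consume, spends one per packet sent, and regains them as the RX returns credits on the FC path. Class and method names are assumptions for illustration.

```python
class CreditTx:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # = RX buffer locations reserved for this TX

    def can_send(self):
        return self.credits > 0          # never transmit without a guaranteed RX location

    def on_send(self):
        self.credits -= 1                # one location is now "owned" by an in-flight packet

    def on_credit_return(self, n=1):
        self.credits += n                # RX freed n locations and echoed the credits back
```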
Correctness: Min. Memory for “No Drop” • “Minimum”: to operate lossless => O(RTT_link) • Credit: 1 credit = 1 memory location • Grant: 5 (= RTT+1) memory locations • Credits • Under full load the single credit is constantly looping between RX and TX; with RTT = 4 => max. performance = f(up-link utilisation) = 25% • Grants • Determined by slow restart: if the last packet has left the RX queue, it takes an RTT until the next packet arrives
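A quick numeric check of the figures above (a back-of-the-envelope sketch, not a simulation): with C credits circulating over a link whose round trip is RTT packet times, at most C packets can be outstanding per RTT, so utilisation is bounded by C/RTT.

```python
def credit_utilization(credits, rtt):
    """Upper bound on link utilisation with `credits` buffer locations and round trip `rtt`."""
    return min(1.0, credits / rtt)

print(credit_utilization(1, 4))   # 0.25 -> the 25% figure for a single credit at RTT = 4
print(credit_utilization(4, 4))   # 1.0  -> RTT credits are enough for full utilisation
```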
PAUSE vs. Credit @ M = RTT+1 • “Equivalent” = ‘fair’ comparison • Credit scheme: 5 credits = 5 memory locations • Grant scheme: 5 (= RTT+1) memory locations • The performance loss for PAUSE/Grants is due to the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart). For performance equivalent to credits, M = 9 is required for PAUSE.
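The M = 9 figure follows directly from the rule stated above; the small sketch below only restates it (roughly M >= 2*RTT + 1 memory locations for PAUSE to stay work-conserving) with the RTT = 4 example.

```python
def pause_min_buffer(rtt):
    """Smallest RX buffer (in packets) for a work-conserving PAUSE link, per the rule above."""
    return 2 * rtt + 1

print(pause_min_buffer(4))   # 9 locations, matching the M = 9 requirement for RTT = 4
```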
FC-Basics: Rate • RX queue Q_i = 1 (full capacity). • Max. flow (input arrivals) during one timestep (Δt = 1) is 1/8. • Goal: update the TX probability T_i of any sending node during the time interval [t, t+1) to obtain the new T_i applied during the time interval [t+1, t+2). • Algorithm for obtaining T_i(t+1) from T_i(t) ... => • Initially the offered rate from source0 was set to .100, and from source1 to .025. All other processing rates were .125. Hence all queues show low occupancy. • At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity. • Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
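The concrete T_i update rule is elided on the slide; the sketch below stands in with a generic increase/decrease rule of the same shape (raise the transmit probability while the RX queue is short, cut it once the queue fills) purely to show the control loop. The rule and all constants are assumptions, not the tutorial's algorithm.

```python
QUEUE_CAPACITY = 1.0        # Q_i = 1 (full capacity), as above
MAX_RATE = 1.0 / 8          # max arrivals per timestep, Δt = 1
INCREASE = 0.01 * MAX_RATE  # additive step while uncongested (assumed)
DECREASE = 0.5              # multiplicative cut under congestion (assumed)

def update_rate(t_i, queue_occupancy):
    """Compute T_i(t+1) from T_i(t) and the observed RX queue occupancy."""
    if queue_occupancy < 0.25 * QUEUE_CAPACITY:
        return min(MAX_RATE, t_i + INCREASE)   # queue is short: probe for more bandwidth
    return max(0.0, t_i * DECREASE)            # queue is filling: back off multiplicatively
```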
Conclusion Part I: Which Scheme is “Better”? • PAUSE + simple + scalable (lower signalling overhead) - 2x M size required • Credits (absolute or incremental) + always lossless, independent of the RTT and memory size + adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...) - not trivial for buffer-sharing - protocol reliability - scalability • At equal M = RTT, credits show 30+% higher Tput vs. PAUSE *Note: Stability of both was formally proven here • Rate: in-between PAUSE and credits + adopted in adapters + potentially a good match for BCN (e2e CM) - complexity (cheap fast bridges)
Part II: Selectivity and Scope of LL-FC, “Per-Prio/VL PAUSE” • The FC-ed ‘link’ could be a • physical channel (e.g. 802.3x) • virtual lane (VL, e.g. IBA 2-16 VLs) • virtual channel (VC, see larger figure) • ... • Per-Prio/VL PAUSE is the often-proposed “PAUSE v2.0” ... • Yet, is it good enough for the next decade of datacenter Ethernet? • Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (PrizmaCi)
Already Implemented in IBA (and other ICTNs...) • IBA has 15 FC-ed VLs for QoS • SL-to-VL mapping is performed per hop, according to capabilities • However, IBA doesn’t have VOQ-selective LL-FC • “selective” = per switch (virtual) output port • So what? • Hogging - a.k.a. buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0) • How can we prove that hogging really occurs in IBA? • A. Back-of-the-envelope reasoning • B. Analytical modeling of stability and work-conservation (papers available) • C. Comparative simulations: IBA, PCI-AS etc. (next slides)
IBA SE Hogging Scenario • Simulation: parallel backup to a RAID across an IBA switch • TX / SRC • 16 independent IBA sources, e.g. 16 “producer” CPUs/threads • SRC behavior: greedy, using any communication model (UD) • SL: BE service discipline on a single VL • (the other VLs suffer from their own hogging) • Fabrics (single stage) • 16x16 IBA generic SE • 16x16 PCI-AS switch • 16x16 Prizma CI switch • RX / DST • 16 HDD “consumers” • t_0: initially each HDD sinks data at the full 1x rate (100%) • t_sim: during the simulation HDD[0] enters thermal recalibration or sector remapping; consequently • HDD[0] progressively slows down its incoming link throughput: 90, 80, ..., 10%
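A toy model of this scenario (an assumption-laden sketch, not the IBA/PCI-AS/Prizma simulations referenced here): 16 greedy sources share one switch buffer toward 16 sinks; once sink 0 slows to 10%, its packets monopolize the shared memory and crowd out the other 15 flows, which is the hogging effect in miniature.

```python
import random

N = 16
SHARED_BUFFER = 64                 # shared memory locations in the switch (assumed)
sink_rate = [1.0] * N              # packets each sink can drain per step
sink_rate[0] = 0.1                 # HDD[0] has slowed down to 10%

buffer = []                        # destination id of every packet currently buffered
drain_credit = [0.0] * N

for step in range(10_000):
    # every source offers one packet per step to a uniformly random destination
    for _ in range(N):
        if len(buffer) < SHARED_BUFFER:
            buffer.append(random.randrange(N))
    # every sink drains at its own rate
    for d in range(N):
        drain_credit[d] = min(drain_credit[d] + sink_rate[d], 2.0)
        while drain_credit[d] >= 1.0 and d in buffer:
            buffer.remove(d)
            drain_credit[d] -= 1.0

share = buffer.count(0) / max(1, len(buffer))
print(f"slow sink 0 holds {share:.0%} of the shared buffer")   # well over 90% here
```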
First: Friendly Bernoulli Traffic • 2 sources (A, B) sending @ (12x + 4x) to 16 * 1x end nodes (C..R) • [Figure from the IBA Spec: achievable performance vs. actual IBA aggregate throughput as the link 0 throughput is reduced; the gap is the throughput loss]
Myths and Fallacies about Hogging • Isn’t IBA’s static rate control sufficient? • No, because it is STATIC • IBA’s VLs are sufficient...?! • No. • VLs and ports are orthogonal dimensions of LL-FC • 1. VLs are for SL and QoS => VLs are assigned to prios, not ports! • 2. Max. no. of VLs = 15 << max (SE_degree x SL) = 4K • Can the SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems? • No. • 1. Partitioning makes sense only with status-based FC (per bridge output port - see PCIe/AS SBFC); IBA doesn’t have a native status-based FC • 2. Sizing becomes the issue => we need dedicated buffers per input and output port • M = O(SL * max{RTT, MTU} * N^2) => a very large number! • Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large
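To see why this bound explodes, here is a worked instance with illustrative (assumed) parameter values; the N^2 term is what makes dedicating buffers per (input, output, priority) impractical.

```python
SL = 16            # service levels / priorities (assumed)
RTT_BYTES = 5_000  # one link RTT worth of data, in bytes (assumed)
MTU = 2_048        # max transfer unit, in bytes (assumed)
N = 64             # switch element port count (assumed)

M = SL * max(RTT_BYTES, MTU) * N * N   # M = O(SL * max{RTT, MTU} * N^2)
print(f"{M / 2**20:.0f} MiB of dedicated buffering")   # ~312 MiB for this toy case
```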
Conclusion Part II: Selectivity and Scope of LL-FC • Despite 16 VLs, IBA/DCE is exposed to the “transistor effect”: any single flow can modulate the aggregate Tput of all the others • Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop) • Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC • Q: QoS violation as the price of ‘non-blocking’ LL-FC? • Possible granularities of LL-FC queuing domains: • A. In single-hop fabrics, CM can also serve as LL-FC • B. Introduce VOQ-FC: an intermediate, coarser grain; no. of VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs • Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)? Was proposed in 802.3ar...
LL-FC Between Two Bridges • [Diagram: Switch[k] TX Port[k,j] holds VOQ[1]..VOQ[n] and a TX scheduler; it sends packets over the data link (“send packet”) to Switch[k+1] RX Port[k+1,i], which holds the RX buffer and an RX management unit (buffer allocation); an LL-FC TX unit and LL-FC reception unit exchange the LL-FC token over the return path]
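A minimal sketch (assumed message and class names) of the per-VOQ exchange the diagram depicts: the downstream RX management unit allocates buffer per virtual output queue and returns LL-FC tokens on the reverse path, and the upstream TX scheduler only serves VOQs that currently hold a grant.

```python
from dataclasses import dataclass

@dataclass
class FcToken:           # travels Switch[k+1] -> Switch[k] on the LL-FC return path
    voq: int             # which virtual output queue the token refers to
    credits: int         # RX buffer locations granted to that VOQ

class TxScheduler:
    def __init__(self, n_voqs):
        self.grants = [0] * n_voqs       # per-VOQ grants received from downstream

    def on_token(self, token):
        self.grants[token.voq] += token.credits

    def eligible_voqs(self, backlog):
        """VOQs that both hold packets and hold downstream buffer grants."""
        return [v for v, q in enumerate(backlog) if q > 0 and self.grants[v] > 0]
```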