Tutorial Survey of LL-FC Methods for Datacenter Ethernet 101 Flow Control • M. Gusat • Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries • 26 Sept. 2006 • IBM Zurich Research Lab
Outline • Part I • Requirements of datacenter link-level flow control (LL-FC) • Brief survey of top 3 LL-FC methods • PAUSE, a.k.a. On/Off grants • credit • rate • Baseline performance evaluation • Part II • Selectivity and scope of LL-FC • per-what?: LL-FC’s resolution
Req’ts of .3x’: Next Generation of Ethernet Flow Control for Datacenters • Lossless operation: no-drop expectation of datacenter apps (storage, IPC) • Low latency • Selective: discrimination granularity: link, prio/VL, VLAN, VC, flow...? Scope: backpressure upstream one hop, k hops, e2e...? • Simple... PAUSE-compatible!!
Generic LL-FC System • One link with 2 adjacent buffers: TX (SRC) and RX (DST) • The round trip time (RTT) per link is the system’s time constant • LL-FC issues: • link traversal (channel BW allocation) • RX buffer allocation • pairwise communication between the channel’s terminations • signaling overhead (PAUSE, credit, rate commands) • backpressure (BP): • increase / decrease injections • stop and restart protocol
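To make the RTT-as-time-constant point concrete, here is a minimal discrete-time sketch of one flow-controlled link (a Python illustration with assumed names and delay values, not part of the tutorial): data takes half an RTT to reach the RX, feedback takes the other half to return, so any backpressure command acts one full RTT after the condition that triggered it.

```python
from collections import deque

RTT = 4                 # round trip of the link, in packet times (assumed value)
ONE_WAY = RTT // 2      # forward data delay == reverse feedback delay

class Link:
    """One link: a data pipe TX -> RX and a feedback pipe RX -> TX, each ONE_WAY deep."""
    def __init__(self):
        self.data_pipe = deque([None] * ONE_WAY)   # packets in flight toward RX
        self.fb_pipe = deque([None] * ONE_WAY)     # FC commands in flight toward TX

    def step(self, tx_packet, rx_feedback):
        """Advance one packet time; return (packet arriving at RX, feedback arriving at TX)."""
        self.data_pipe.append(tx_packet)
        self.fb_pipe.append(rx_feedback)
        return self.data_pipe.popleft(), self.fb_pipe.popleft()
```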
FC-Basics: PAUSE (On/Off Grants) • [Figure: TX queues (OQ) feed the data link into the RX buffer; when occupancy crosses the threshold (“over-run”), the RX sends STOP on the FC return path, then GO once it drains; Xbar and downstream links shown] • BP semantics: STOP / GO / STOP... *Note: Selectivity and granularity of FC domains are not considered here.
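A minimal sketch of the On/Off grant semantics above, assuming a single RX buffer with a high-water STOP threshold and a low-water GO threshold; the threshold values and names are illustrative, not taken from 802.3x.

```python
STOP_THRESHOLD = 8   # RX buffer occupancy at which a STOP is sent upstream (assumed)
GO_THRESHOLD = 4     # occupancy at which a GO re-enables the upstream TX (assumed)

def rx_feedback(occupancy, paused):
    """Return (FC command or None, new paused state) for the current RX occupancy."""
    if not paused and occupancy >= STOP_THRESHOLD:
        return "STOP", True      # over-run protection: halt the upstream transmitter
    if paused and occupancy <= GO_THRESHOLD:
        return "GO", False       # buffer has drained enough: restart the transmitter
    return None, paused          # no new backpressure command this cycle
```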
FC-Basics: Credits • [Figure: credit-based FC between the TX queues and the RX buffer, across the Xbar] *Note: Selectivity and granularity of FC domains are not considered here.
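For comparison, a sketch of the credit-based transmit side: the TX keeps a counter of RX buffer locations it may still consume, spends one per packet sent, and regains them as the RX returns credits on the FC path. Class and method names are assumptions for illustration.

```python
class CreditTx:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # = RX buffer locations reserved for this TX

    def can_send(self):
        return self.credits > 0          # never transmit without a guaranteed RX location

    def on_send(self):
        self.credits -= 1                # one location is now "owned" by an in-flight packet

    def on_credit_return(self, n=1):
        self.credits += n                # RX freed n locations and echoed the credits back
```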
Correctness: Min. Memory for “No Drop” • “Minimum”: to operate lossless => O(RTT_link) • Credit: 1 credit = 1 memory location • Grant: 5 (= RTT+1) memory locations • Credits • Under full load the single credit is constantly looping between RX and TX; with RTT = 4 => max. performance = f(up-link utilisation) = 25% • Grants • Determined by slow restart: if the last packet has left the RX queue, it takes an RTT until the next packet arrives
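A quick numeric check of the figures above (a back-of-the-envelope sketch, not a simulation): with C credits circulating over a link whose round trip is RTT packet times, at most C packets can be outstanding per RTT, so utilisation is bounded by C/RTT.

```python
def credit_utilization(credits, rtt):
    """Upper bound on link utilisation with `credits` buffer locations and round trip `rtt`."""
    return min(1.0, credits / rtt)

print(credit_utilization(1, 4))   # 0.25 -> the 25% figure for a single credit at RTT = 4
print(credit_utilization(4, 4))   # 1.0  -> RTT credits are enough for full utilisation
```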
PAUSE vs. Credit @ M = RTT+1 • “Equivalent” = ‘fair’ comparison • Credit scheme: 5 credits = 5 memory locations • Grant scheme: 5 (= RTT+1) memory locations • The performance loss for PAUSE/Grants is due to the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart). For performance equivalent to credits, M = 9 is required for PAUSE.
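The M = 9 figure follows directly from the rule stated above; the small sketch below only restates it (roughly M >= 2*RTT + 1 memory locations for PAUSE to stay work-conserving) with the RTT = 4 example.

```python
def pause_min_buffer(rtt):
    """Smallest RX buffer (in packets) for a work-conserving PAUSE link, per the rule above."""
    return 2 * rtt + 1

print(pause_min_buffer(4))   # 9 locations, matching the M = 9 requirement for RTT = 4
```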
FC-Basics: Rate • RX queue Q_i = 1 (full capacity). • Max. flow (input arrivals) during one timestep (Δt = 1) is 1/8. • Goal: update the TX probability T_i of any sending node during the time interval [t, t+1) to obtain the new T_i applied during the time interval [t+1, t+2). • Algorithm for obtaining T_i(t+1) from T_i(t) ... => • Initially the offered rate from source0 was set to .100, and from source1 to .025. All other processing rates were .125. Hence all queues show low occupancy. • At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity. • Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
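The concrete T_i update rule is elided on the slide; the sketch below stands in with a generic increase/decrease rule of the same shape (raise the transmit probability while the RX queue is short, cut it once the queue fills) purely to show the control loop. The rule and all constants are assumptions, not the tutorial's algorithm.

```python
QUEUE_CAPACITY = 1.0        # Q_i = 1 (full capacity), as above
MAX_RATE = 1.0 / 8          # max arrivals per timestep, Δt = 1
INCREASE = 0.01 * MAX_RATE  # additive step while uncongested (assumed)
DECREASE = 0.5              # multiplicative cut under congestion (assumed)

def update_rate(t_i, queue_occupancy):
    """Compute T_i(t+1) from T_i(t) and the observed RX queue occupancy."""
    if queue_occupancy < 0.25 * QUEUE_CAPACITY:
        return min(MAX_RATE, t_i + INCREASE)   # queue is short: probe for more bandwidth
    return max(0.0, t_i * DECREASE)            # queue is filling: back off multiplicatively
```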
Conclusion Part I: Which Scheme is “Better”? • PAUSE + simple + scalable (lower signalling overhead) - 2x M size required • Credits (absolute or incremental) + always lossless, independent of the RTT and memory size + adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...) - not trivial for buffer-sharing - protocol reliability - scalability • At equal M = RTT, credits show 30+% higher Tput vs. PAUSE *Note: Stability of both was formally proven here • Rate: in-between PAUSE and credits + adopted in adapters + potentially a good match for BCN (e2e CM) - complexity (cheap fast bridges)
Part II: Selectivity and Scope of LL-FC, “Per-Prio/VL PAUSE” • The FC-ed ‘link’ could be a • physical channel (e.g. 802.3x) • virtual lane (VL, e.g. IBA 2-16 VLs) • virtual channel (VC, see larger figure) • ... • Per-Prio/VL PAUSE is the often-proposed “PAUSE v2.0” ... • Yet, is it good enough for the next decade of datacenter Ethernet? • Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (PrizmaCi)
Already Implemented in IBA (and other ICTNs...) • IBA has 15 FC-ed VLs for QoS • SL-to-VL mapping is performed per hop, according to capabilities • However, IBA doesn’t have VOQ-selective LL-FC • “selective” = per switch (virtual) output port • So what? • Hogging - a.k.a. buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0) • How can we prove that hogging really occurs in IBA? • A. Back-of-the-envelope reasoning • B. Analytical modeling of stability and work-conservation (papers available) • C. Comparative simulations: IBA, PCI-AS etc. (next slides)
IBA SE Hogging Scenario • Simulation: parallel backup to a RAID across an IBA switch • TX / SRC • 16 independent IBA sources, e.g. 16 “producer” CPUs/threads • SRC behavior: greedy, using any communication model (UD) • SL: BE service discipline on a single VL • (the other VLs suffer from their own hogging) • Fabrics (single stage) • 16x16 IBA generic SE • 16x16 PCI-AS switch • 16x16 Prizma CI switch • RX / DST • 16 HDD “consumers” • t_0: initially each HDD sinks data at the full 1x rate (100%) • t_sim: during the simulation HDD[0] enters thermal recalibration or sector remapping; consequently • HDD[0] progressively slows down its incoming link throughput: 90, 80, ..., 10%
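A toy model of this scenario (an assumption-laden sketch, not the IBA/PCI-AS/Prizma simulations referenced here): 16 greedy sources share one switch buffer toward 16 sinks; once sink 0 slows to 10%, its packets monopolize the shared memory and crowd out the other 15 flows, which is the hogging effect in miniature.

```python
import random

N = 16
SHARED_BUFFER = 64                 # shared memory locations in the switch (assumed)
sink_rate = [1.0] * N              # packets each sink can drain per step
sink_rate[0] = 0.1                 # HDD[0] has slowed down to 10%

buffer = []                        # destination id of every packet currently buffered
drain_credit = [0.0] * N

for step in range(10_000):
    # every source offers one packet per step to a uniformly random destination
    for _ in range(N):
        if len(buffer) < SHARED_BUFFER:
            buffer.append(random.randrange(N))
    # every sink drains at its own rate
    for d in range(N):
        drain_credit[d] = min(drain_credit[d] + sink_rate[d], 2.0)
        while drain_credit[d] >= 1.0 and d in buffer:
            buffer.remove(d)
            drain_credit[d] -= 1.0

share = buffer.count(0) / max(1, len(buffer))
print(f"slow sink 0 holds {share:.0%} of the shared buffer")   # well over 90% here
```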
First: Friendly Bernoulli Traffic • 2 sources (A, B) sending @ (12x + 4x) to 16 * 1x end nodes (C..R) • [Figure from the IBA Spec: achievable performance vs. actual IBA aggregate throughput as the link 0 throughput is reduced; the gap is the throughput loss]
Myths and Fallacies about Hogging • Isn’t IBA’s static rate control sufficient? • No, because it is STATIC • IBA’s VLs are sufficient...?! • No. • VLs and ports are orthogonal dimensions of LL-FC • 1. VLs are for SL and QoS => VLs are assigned to prios, not ports! • 2. Max. no. of VLs = 15 << max (SE_degree x SL) = 4K • Can the SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems? • No. • 1. Partitioning makes sense only with status-based FC (per bridge output port - see PCIe/AS SBFC); IBA doesn’t have a native status-based FC • 2. Sizing becomes the issue => we need dedicated buffers per input and output port • M = O(SL * max{RTT, MTU} * N^2) => a very large number! • Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large
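To see why this bound explodes, here is a worked instance with illustrative (assumed) parameter values; the N^2 term is what makes dedicating buffers per (input, output, priority) impractical.

```python
SL = 16            # service levels / priorities (assumed)
RTT_BYTES = 5_000  # one link RTT worth of data, in bytes (assumed)
MTU = 2_048        # max transfer unit, in bytes (assumed)
N = 64             # switch element port count (assumed)

M = SL * max(RTT_BYTES, MTU) * N * N   # M = O(SL * max{RTT, MTU} * N^2)
print(f"{M / 2**20:.0f} MiB of dedicated buffering")   # ~312 MiB for this toy case
```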
Conclusion Part II: Selectivity and Scope of LL-FC • Despite 16 VLs, IBA/DCE is exposed to the “transistor effect”: any single flow can modulate the aggregate Tput of all the others • Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop) • Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC • Q: QoS violation as the price of ‘non-blocking’ LL-FC? • Possible granularities of LL-FC queuing domains: • A. In single-hop fabrics, CM can also serve as LL-FC • B. Introduce VOQ-FC: an intermediate, coarser grain; no. of VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs • Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)? Was proposed in 802.3ar...
LL-FC Between Two Bridges • [Diagram: Switch[k] TX Port[k,j] holds VOQ[1]..VOQ[n] and a TX scheduler; it sends packets over the data link (“send packet”) to Switch[k+1] RX Port[k+1,i], which holds the RX buffer and an RX management unit (buffer allocation); an LL-FC TX unit and LL-FC reception unit exchange the LL-FC token over the return path]
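A minimal sketch (assumed message and class names) of the per-VOQ exchange the diagram depicts: the downstream RX management unit allocates buffer per virtual output queue and returns LL-FC tokens on the reverse path, and the upstream TX scheduler only serves VOQs that currently hold a grant.

```python
from dataclasses import dataclass

@dataclass
class FcToken:           # travels Switch[k+1] -> Switch[k] on the LL-FC return path
    voq: int             # which virtual output queue the token refers to
    credits: int         # RX buffer locations granted to that VOQ

class TxScheduler:
    def __init__(self, n_voqs):
        self.grants = [0] * n_voqs       # per-VOQ grants received from downstream

    def on_token(self, token):
        self.grants[token.voq] += token.credits

    def eligible_voqs(self, backlog):
        """VOQs that both hold packets and hold downstream buffer grants."""
        return [v for v, q in enumerate(backlog) if q > 0 and self.grants[v] > 0]
```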