The Need for an Improved PAUSE
Mitch Gusat and Cyriel Minkenberg
IEEE 802, Dallas, Nov. 2006
IBM Zurich Research Lab GmbH
Outline
I) Overcoming PAUSE-induced Deadlocks
• PAUSE exposed to circular dependencies
• Two deadlock-free PAUSE solutions
II) PAUSE Interaction with Congestion Management
III) Conclusions
PAUSE Issues
• PAUSE-related issues interfere with BCN simulations
• Correctness: deadlocks
  • cycles in the routing graph (if multipath adaptivity is enabled): multiple solutions exist
  • circular dependencies (in bidirectional fabrics): BCN cannot help here => solutions required
• Performance (to be elaborated in a future report): low-order HOL blocking and memory hogging
  • non-selective PAUSE causes hogging, i.e., monopolization of common resources: e.g., a shared memory may be monopolized by frames for a congested port
  • consequences: at best, reduced throughput; at worst, unfairness, starvation, a saturation tree, collapse
  • properly tuned, BCN can address this problem
A 3-level Bidirectional Fat Tree, Unfolded
[Figure: the bidirectional fat tree unfolded around its root stage, the 'hinge']
• Using shared-memory switches with global PAUSE in a bidirectional fat-tree network can cause deadlock
• Circular dependencies (CD) != loops in the routing graph (STD)
• Deadlocks were observed in BCN simulations
PAUSE-caused Deadlocks in BCN Simulations
[Simulation plots: 16-node, 5-stage fabric under Bernoulli traffic; four configurations: shared memory (SM) with and without BCN, partitioned memory with and without BCN]
The Mechanism of PAUSE-induced CD Deadlocks
[Figure: two shared-memory switches, A and B, connected back to back]
• When incorrectly implemented, PAUSE-based flow control can cause hogging and deadlocks
• PAUSE deadlock in shared-memory switches:
  • Switches A and B are both full (within the granularity of an MTU or jumbo frame) => PAUSE thresholds exceeded
  • All traffic from A is destined to B, and vice versa
  • Neither can send; each waits on the other indefinitely: deadlock
• Note: traffic from A never takes the path from B back to A, and vice versa, due to shortest-path routing
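A minimal sketch, not from the deck, of the deadlock condition just described: two full shared-memory switches with global PAUSE, each holding only frames destined to the other. The class, the names, and the memory size are illustrative.

```python
# A and B are two shared-memory switches wired back to back. Each is full
# of frames destined only to the other and asserts global PAUSE on all
# inputs, so neither may transmit: a circular dependency, hence deadlock.
MEM_FRAMES = 8  # shared-memory capacity per switch, in MTU-sized frames

class Switch:
    def __init__(self, name):
        self.name = name
        self.mem = []           # frames currently held in shared memory

    def pause_asserted(self):
        # Global PAUSE: asserted toward every neighbor once memory is full
        return len(self.mem) >= MEM_FRAMES

a, b = Switch("A"), Switch("B")
a.mem = ["frame->B"] * MEM_FRAMES   # all of A's frames are destined to B
b.mem = ["frame->A"] * MEM_FRAMES   # all of B's frames are destined to A

# A may drain only if B has not paused it, and vice versa.
can_a_send = not b.pause_asserted()
can_b_send = not a.pause_asserted()
print("deadlock:", not can_a_send and not can_b_send)  # -> deadlock: True
```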
Two Solutions to Defeat the Deadlock
• I. Architectural: assert PAUSE on a per-input basis
  • No input is allowed to consume more than 1/N-th of the shared memory
  • All traffic in B's input buffer for A is guaranteed to be destined to a different port than the one leading back to A (and vice versa)
  • Hence, the circular dependency is broken! (confirmed by simulations)
  • Assert PAUSE on input i: occ_mem >= Th, or occ[i] >= Th/N
  • Deassert PAUSE on input i: occ_mem < Th, and occ[i] < Tl/N
  • Qeq = M / (2N)
• II. LL-FC: a bypass queue, PAUSEd distinctly
  • Achieves a similar result as (I), plus:
    • independent of switch architecture (and implementation)
    • required for IPC traffic (LD/ST, request/reply)
    • compatible with PCIe (device-driver compatibility)
[Figure: switches A and B with per-input PAUSE]
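A sketch of solution I's per-input threshold logic, transcribing the assert/deassert conditions above; M, N, the threshold values, and the helper name are illustrative choices, not from the deck.

```python
# Per-input PAUSE with hysteresis, following the conditions on the slide.
M = 1024            # shared-memory size in frames (illustrative)
N = 8               # number of inputs
Th, Tl = 0.875 * M, 0.75 * M   # high/low occupancy thresholds (Tl < Th)

def update_pause(occ, paused):
    """occ[i]: per-input occupancy; paused[i]: current PAUSE state of input i."""
    occ_mem = sum(occ)                       # total shared-memory occupancy
    for i in range(N):
        if occ_mem >= Th or occ[i] >= Th / N:
            paused[i] = True                 # assert PAUSE on input i
        elif occ_mem < Th and occ[i] < Tl / N:
            paused[i] = False                # deassert PAUSE on input i
        # otherwise: hysteresis, input i keeps its previous state
    return paused

# The per-input term occ[i] >= Th/N caps every input at ~1/N of the memory,
# which is exactly what breaks the A<->B circular dependency.
print(update_pause([120] * N, [False] * N))  # total 960 >= Th -> all paused
```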
Simulation of BCN with Deadlock-free PAUSE
• Observations
  • Qeq should be set to partition the shared memory
    • setting it higher promotes hogging
    • setting it lower wastes memory space
  • BCN works best with large buffers per port
    • the buffer size per port should be significantly larger than the mean burst size (256 frames per port here)
PAUSE Interaction with Congestion Management
• What is the effect of deadlock-free PAUSE on BCN?
  • Memory partitioning 'stiffens' the feedback loop
    • PAUSE triggers the backpressure tree earlier
  • Backrolling propagation speed depends not only on the available memory, but also on the switch service discipline
• Next: static analysis of the PAUSE-BCN interference, as a function of the switch service discipline
Simple Analytical Method
• Method used in this presentation
  • explicit assumptions
  • simple traffic scenario
  • reduced MIN topology, with static/deterministic (fixed) routing
• This 'model' considers
  • queuing: in the Ethernet Channel Adapter (ECA) and switch element (SE)
  • scheduling: in the ECA and SE
  • Ethernet's per-priority PAUSE-based LL-FC (aka backpressure, BP)
  • reactive CM a la BCN
• Linearization around the steady state => tractable static analysis
  • salient transients will be mentioned, but not computed
• Compute the cumulative effects of
  • scheduling,
  • LL-FC backpressure per priority (only one used here),
  • CM source throttling (rate adjustment)
• Do not compute the formulas for
  • blocking probability per stage and SE
  • variance of the service-time distribution
  • Lyapunov stability
Model and Traffic Assumptions
• Traffic = ∑(background + hot): "A total of 50% of link rate is attempted from 9 queues (8 background + 1 hot) from each ECA."
• Background traffic
  • 8 queues per ECA on the left; each of the 8 queues is connected to one of the 8 ECAs on the right => 64 background flows (8 queues/ECA x 8 ECAs), each injecting packets
  • "80% of these [total link rate] are background, that is 80% x 50% = 40% of link rate." => background traffic intensity λ = 0.4, uniformly distributed in space
• Hot traffic: "20% of these are hot, so hot traffic is 20% x 50% = 10% of link rate."
[Figure: 8x8 MIN with per-link rates: every link carries background load .4; the hot contributions merge pairwise from +.1 at the sources through +.2 and +.4 to +.8 at the hotspot]
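A quick sketch of the arithmetic quoted above, including the hot flow's pairwise aggregation across the three stages (the loop structure is illustrative):

```python
# Per-ECA load split, as quoted on the slide.
link_rate = 1.0
offered = 0.50 * link_rate     # 50% of link rate attempted per ECA
bgnd = 0.80 * offered          # 80% background -> 0.4 of link rate
hot = 0.20 * offered           # 20% hot        -> 0.1 of link rate
assert abs(bgnd - 0.4) < 1e-12 and abs(hot - 0.1) < 1e-12

# Eight ECAs each inject 0.1 of hot traffic; with pairwise merging at each
# of the three stages the hot rate doubles per stage: .1 -> .2 -> .4 -> .8.
rate = hot
for stage in range(3):
    rate *= 2
    print(f"after stage {stage + 1}: hot rate = {rate:.1f}")
# after stage 3: 0.8 of link rate arrives hot at SE(L2, S3)
```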
120% Link Load => 20% Overload: What Happens Next?
• Hotspot arrival intensity: λ_bgnd + λ_hot = .4 + .8 = 1.2 > 1 => overload, [mild] congestion factor cf = 1.2 at SE(L2,S3) ...next?
• BP and CM will react
  • if SE(L2,S3) is work-conserving, the 0.2 overload must be losslessly squelched by CM and BP
  • the exact sequence depends on the actual traffic, the SE architecture, and the threshold settings
    • irrelevant for static analysis, albeit important in operation
• Separation of concerns -> study the independent effects of BP (first) and CM (second)
  • iff the system is linear in steady state, superposition allows the effects to be composed
[Figure: the MIN (stages S1-S3, levels L1-L4) with background .4 on every link and hot contributions +.1/+.2/+.4 merging toward SE(L2,S3), where cf = 1.2 and both BP and CM act]
Link-Level FC will Back-Pressure: Whom? How Much? Who's First?
• Depends on the SE's service discipline
• The best-understood and most widely used disciplines
  • Round-Robin (RR) versions: strict (non-work-conserving) and work-conserving (skip invalid queues)
  • FIFO, aka FCFS, aka EDF (timestamps, aging)
  • Fair Queuing, WRR, WFQ
• A future 802.3x should standardize only the LL-FC, not its 'fairness'
[Figure: upstream SEs being asked 'Stop1?' / 'Stop2?' as buffers fill: the feeding links carry bgnd + hot' = .4 + .4 = .8, while the hotspot sees .8 + .4 = 1.2 > 1]
EDF-based BP: FCFS-type Fairness (a subset of max-min)
• New EDF-fair TX rates are backpropagated: λ' = (1 - θ) * λ ≈ 0.834 * λ, where θ = 1 - μ_j / (∑ λ_ij), by incremental upstream traversal rooted at SE(L2,S3)
• Hint: subtract the background traffic λ = .4 from the EDF-fair rates and compare with the previous hot rates
• Obs.: under moderate-to-severe congestion, θ -> 1 => λ' -> 0: blocking spreads across all ingress branches => neither parking-lot 'unfairness' nor flow decoupling is possible (a wide-canopy saturation tree)
• All flows sharing resources along the hot paths are backpressured in proportion to their respective contributions (not their traffic class). No flow isolation.
[Figure: EDF-fair rates after BP at stages S1-S3: the hotspot output stays at 1.0, while the upstream hot links drop to .734, .666, .566, .5, .483, .417]
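A numeric sketch of the EDF-fair cut at SE(L2,S3), applying the formula above to the slide's rates (variable names are illustrative):

```python
# EDF/FCFS-fair cut at the hot output of SE(L2, S3).
mu = 1.0                              # service (link) rate of the hot port
arrivals = {"bgnd": 0.4, "hot": 0.8}  # offered rates, summing to 1.2

theta = 1 - mu / sum(arrivals.values())          # 1 - 1/1.2 ~= 0.167
edf_fair = {f: (1 - theta) * r for f, r in arrivals.items()}
print(round(1 - theta, 3))   # 0.833 -> the slide's lambda' ~= 0.834 * lambda
print(edf_fair)              # bgnd -> ~0.333, hot -> ~0.667 (the .666 above)
# Every flow is scaled by the same factor (1 - theta), i.e., in proportion
# to its contribution and regardless of class: no flow isolation under EDF.
```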
RR-based BP: Proportional Fairness, Selective and Drastic
• New RR-fair TX rates are iteratively computed and backpropagated
  • 1. identify the inputs exceeding the RR quota, as members of N' ≤ N
  • 2. distribute the overload δ across N': δ_ij' = (N * λ_ij - μ_j) / (N * N'), with δ_ij' ≤ δ for work-conserving service
  • 3. recompute the new admissible arrival rates λ_ij' = λ_ij - δ_ij', incrementally, by upstream traversal rooted at SE(L2,S3)
  • 3'. under strict RR, δ_ij' ≤ δ no longer holds => the BP effects are drastic and focused!
• Hint: subtract the background traffic λ = .4 from the RR-fair rates and compare with the previous hot rates
• Obs. 1: only the selected branch is BP-ed (discrimination) => RR-BP blocking always discriminates between ingress branches
• Obs. 2: under severe congestion and/or many hops, the selected branches are swiftly choked down (bonsai: narrow trees)
[Figure: RR-fair rates after BP: the hot input at SE(L2,S3) drops to .6 (work-conserving) or .5 (strict), with upstream hot links at .8, .6, .5, .4/.25, .3, +.2/.15]
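A sketch contrasting strict and work-conserving RR at SE(L2,S3); the max-min helper is an illustrative stand-in for the iterative procedure above, but it reproduces the slide's .6 (work-conserving) and .5 (strict) hot rates:

```python
# RR-fair cut at SE(L2, S3): strict RR caps every input at the quota mu/N;
# work-conserving RR hands unused quota to busier inputs (max-min filling).
def rr_fair(arrivals, mu, work_conserving=True):
    n = len(arrivals)
    if not work_conserving:
        return {f: min(a, mu / n) for f, a in arrivals.items()}  # strict quota
    # Work-conserving: progressive (max-min) filling, smallest demand first
    rates, remaining, left = {}, mu, sorted(arrivals, key=arrivals.get)
    while left:
        f = left.pop(0)
        rates[f] = min(arrivals[f], remaining / (len(left) + 1))
        remaining -= rates[f]
    return rates

arr = {"bgnd": 0.4, "hot": 0.8}
print(rr_fair(arr, 1.0))                         # {'bgnd': 0.4, 'hot': 0.6}
print(rr_fair(arr, 1.0, work_conserving=False))  # {'bgnd': 0.4, 'hot': 0.5}
# Only the over-quota (hot) branch is throttled: RR-BP discriminates between
# ingress branches, matching the .6 / .5 rates on the slide.
```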
20% Overload: Reaction According to CM
• What is the effect of CM alone, with no LL-FC BP? Congestion factor cf = 1.2:
• 1. Marking by SE(L2,S3)
  • done at flow resolution (queue connection here)
  • based on SE queue occupancy and a set of thresholds (a single one here, at 8)
  • if fair, with p = 1%, BCN marking is pro-rated 33% (bgnd) + 67% (hot)
• 2. ECA sources adapt their injection rate
  • per e2e flow
• Desired result: convergence to proportionally fair, stable rates λ_bgnd + λ_CM_hot = O(.33 + .67), achievable by fair marking by CPID, proper tuning of the BCN parameters, and enhancements to self-increase (see the recent Stanford U. proposal)
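A short, illustrative check of the pro-rated marking split: with occupancy-proportional (fair) marking, the marks divide according to each class's share of the hot queue:

```python
# Arrival rates into the hot queue at SE(L2, S3), from the earlier slides.
bgnd, hot = 0.4, 0.8
total = bgnd + hot
print(bgnd / total, hot / total)  # ~0.333 and ~0.667: the 33% / 67% split
```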
20% Overload: Reaction According to LL-FC
• Strictly dependent on the service discipline; 802 shouldn't mandate scheduling to switch vendors, because:
  • Round-Robin (RR: strict, or work-conserving)
    • strong/proportional fairness
    • decouples flows
    • simple and scalable
    • globally unfair (parking-lot problem)
  • FIFO/EDF (timestamps)
    • temporally and globally fair: first come, first served
    • locally unfair => flow coupling (cannot isolate across partitions and clients)
    • complex to scale
• BP will impact the speed, strength, and locality (fairness) of the backpressure... (underlying the CM)
  • hence different behaviors of the CM loop
Observations
• PAUSE-induced deadlocks must be solved
  • two solutions were proposed
• PAUSE + BCN: two intercoupled feedback loops
  • BP/LL-FC modulates CM's convergence: the +/- phase and amplitude depend on topology, RTTs, traffic, and the SE
• Switch service disciplines impact (via PAUSE) BCN's stability margin and transient response
  • switches with RR service may require higher gains for w and Gd, or a higher Ps, than switches using EDF
  • ...how to signal this?
• CM should trigger earlier than BP => the two mechanisms, albeit 'independent', should be co-designed and co-tuned
  • the choice of thresholds depends on the link and e2e RTTs
Instead of a Conclusion: Improved PAUSE
• 10GigE is a discontinuity in the Ethernet evolution
  • an opportunity to address new needs and markets
  • however, improvements are needed
• Requirements for a next-generation PAUSE
  • correct by design, not by implementation
  • deadlock-free
  • no HOL1 blocking and, possibly, reduced HOL2 blocking (note: do not try to address high-order HOL blocking at the link layer)
  • configurable for both lossy and lossless operation
  • QoS / 802.1p support
  • enables virtualization / 802.1Q
  • beneficial or neutral to CM schemes (BCN, TCP, ...)
  • legacy-PAUSE compatible
  • simple for designers to understand and implement
  • a minimal number of flow-control domains: h/w queues and IDs in the Ethernet frame
  • compelling to use => always enabled...!