Scalable Multi-module Switches with Quality of Service: Thesis Defense. Santosh Krishnan, sk@cs.columbia.edu. May 1, 2006. Advisor: Prof. Henning G. Schulzrinne. Co-advisor: Dr. Fabio M. Chiussi
Outline • Problem Definition • Motivations, list of contributions • Switching Model: Components • Related work: Formal methods in switching • Buffered Clos Switches • Concept of functional equivalence • BCS: Throughput and Quality of Service • Single-path BCS: CIOQ, aggregation, pipelining • Multi-path BCS: Parallelization • Conclusions
Problem Definition Goals: • How to methodically construct a high-capacity switch? • How to design high-performance algorithms for such switches? Importance: • Physical layer improvements: 10-G Ethernet, OC-768 • Converged network requiring QoS: IPTV, MPLS VPN • Case for modular design: component reuse What exists: • Ad-hoc approach to switch design • No benchmarks, varying performance satisfaction • Non-blocking, 100% throughput, nominal capacity
Contributions
• Taxonomy of multi-module switches: Buffered Clos Switches
• Performance framework: Functional equivalence with ideal switch (mimics circuit-switching rigor)
Applications:
Combined I/O Queueing: • QoS: Online maximal matching • Throughput: Critical matching • Strict stability: Maximal matching, SOQF • Switched Fair Airport matching
Aggregation: • Shadow CIOQ and Decompose • Virtual Element Queueing
Pipelining: • Striping and Equal Dispatch • Concurrent Dispatch: 3D matching
Parallelization: • Flow-based PPS: Clos fitting • Cell-based PPS: Striping, Equal Dispatch
Memory-Space-Memory: • Combination methods • Recursive BCS
Switching Model • Basic property: Contention • Flows: Guaranteed QoS, Best-effort • Ideal Switch: Provide bandwidth trunks, sustain link capacity • Black box for network engineering purposes [Figure: inputs and outputs attached through PPUs to a switch fabric on the fast path, with a CPU on the slow path]
Switching Model: Components
Memory Element: Buffers; Link scheduling; Mesh. Constraints: Memory bandwidth, full-mesh circuitry. Monolithic form: OQ switch (ideal)
Space Element: Matching (2D); Conflict-free property. Constraint: Matching complexity. Monolithic form: IQ switch
• Architecture: Interconnect memory and space elements
• Algorithms: Meaningfully emulate the ideal switch for throughput and QoS
Background: Clos Networks • Strictly non-blocking: K ≥ 2M - 1 (Clos theorem) • Re-arrangeable: K ≥ M (Slepian-Duguid) [Figure: Clos network between inputs and outputs, labeled with M and K, showing one circuit and the fitting algorithms] Recognize: • Space-time duality • Fitting: matrix decomposition Inspiration: Replace selected elements with memory
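The two conditions on this slide can be checked mechanically. The sketch below assumes the usual Clos convention, which the slide does not spell out: M is the number of ports on each edge element and K is the number of middle-stage elements. The function name `classify_clos` is illustrative only.

```python
def classify_clos(m: int, k: int) -> str:
    """Classify a symmetric three-stage Clos network.

    m: ports per edge element (assumed meaning of M on the slide)
    k: number of middle-stage elements (assumed meaning of K)
    """
    if m < 1 or k < 1:
        raise ValueError("m and k must be positive")
    if k >= 2 * m - 1:
        # Clos theorem: a new circuit never forces existing ones to move.
        return "strictly non-blocking"
    if k >= m:
        # Slepian-Duguid: always routable if existing circuits may be rearranged.
        return "rearrangeable"
    return "blocking"


print(classify_clos(4, 7))
print(classify_clos(4, 4))
print(classify_clos(4, 3))
```

With M = 4, seven middle elements meet the strict condition, four meet only the rearrangeable one, and three meet neither.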
Background: CIOQ Switches
Pro: • Low memory bandwidth
Con: • Complexity of matching: • Switch size • Frequency • Reconfiguration rate
• Offline: Templates • Maximum, Maximal, Critical • Heuristics
Queue State:
3 0 5
7 0 1
0 5 0
Configuration:
0 0 1
1 0 0
0 1 0
What performance results when applied to a changing queue state?
Background: CIOQ Switch Results. Based on combinatorics and stability theory • QoS (Weller-Hajek ’97) • Throughput • Auxiliary Results: Envelope matching (Kar ’00), Packet-mode matching (Marsan ’02)
Framework: Buffered Clos Switches
Definition: • Switch size • Type of elements • Number in first stage • Number in second stage • Speedup
Parallelize: Pool memory resources (PPS)
Aggregate: Smaller elements (CIOQ-A, G-MSM)
Pipeline: Lower speed, complexity (CIOQ-P, G-MSM)
• Isomorphism: Non-blocking Clos network
• Properties: Multi-stage, fully connected, symmetric, uniform
Framework: Functional Equivalence
Characterize relative performance: Functional equivalence
• f1: Allocate known rates (Shape: Bandwidth trunks)
• f2: Relative stability for admissible traffic (Literature: 100% throughput)
• f3: Per-output relative stability (Work conserving)
• f4: Strict relative stability: all pairs
• f5: Exact emulation
• Emulate an ideal switch: exact, asymptotic
• Bandwidth trunks, independent throughput optimization
CIOQ: Bandwidth Trunks
Shaping plus online matching is sufficient for bandwidth guarantees
Offline: BVN Templates from the Rate Matrix. Cons: Template storage, centralized rate processing
Online: Weight Scheduler. Arbitrary arrivals are shaped/batched into VOQs, then matched online: Maximal (s=2) or Critical (s=1)
Split time into intervals: T = 1/GCD(R)
Batch traffic in each interval: Simple counters
• Extension of Weller-Hajek maximal matching theorem
• Clos analogy: Maximal matching as a strategy for orderly assignments
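The batching step can be illustrated with a small calculation. This is a sketch under stated assumptions, not the thesis implementation: rates are taken to be rational fractions of link capacity, the interval is taken as 1/GCD(R) slots (the timescale given on the later backup slide "CIOQ: Bandwidth Trunks"), and the helper names `rate_gcd` and `batch_quotas` are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def rate_gcd(rates):
    """Greatest common divisor of the non-zero rational rates in a rate matrix."""
    fracs = [Fraction(r) for row in rates for r in row if r != 0]
    if not fracs:
        raise ValueError("at least one non-zero rate is required")
    # For fractions in lowest terms: gcd(a/b, c/d) = gcd(a, c) / lcm(b, d).
    num = reduce(gcd, (f.numerator for f in fracs))
    den = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs))
    return Fraction(num, den)


def batch_quotas(rates):
    """Return the interval length in slots and the cell quota of each pair per interval."""
    interval = 1 / rate_gcd(rates)
    # Every rate is an integer multiple of the GCD, so each quota is a whole number.
    quotas = [[int(Fraction(r) * interval) for r in row] for row in rates]
    return interval, quotas


# Example rate matrix (assumed values, chosen so that the interval is a whole number).
R = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 4), Fraction(1, 2)]]
interval, quotas = batch_quotas(R)
print(interval, quotas)
```

Each quota can then be enforced with one counter per input-output pair that is reloaded at the start of every interval, which is one way to read "simple counters" on the slide.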
CIOQ: Admissible Traffic Best Throughput Results: • No speedup: MWM (McKeown et al.), Speedup 2: Maximal (Dai-Prabhakar) • Can a simple maximum size matching suffice for admissible traffic? Red Herring! Critical matching suffices for asymptotic 100% throughput (f2) [Figure: a queue state matrix whose maximum size matching (MSM) is augmented into a critical matching] Intuition: 2x2 line buckets (R1, R2, C1, C2, Max)
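To make the notion of a critical matching concrete, the sketch below finds the critical lines of a queue-state matrix (the rows and columns with the maximum sum, as defined on the backup slide "Matching Flavors (continued)") and checks whether a given matching serves all of them. The queue state is the 3x3 example from the backup slide "Matching Flavors"; the function names are illustrative, and the sketch only checks a matching rather than constructing one.

```python
def critical_lines(queue):
    """Return the sets of row and column indices whose sums equal the maximum line sum."""
    row_sums = [sum(row) for row in queue]
    col_sums = [sum(col) for col in zip(*queue)]
    peak = max(row_sums + col_sums)
    rows = {i for i, s in enumerate(row_sums) if s == peak}
    cols = {j for j, s in enumerate(col_sums) if s == peak}
    return rows, cols


def is_critical(queue, matching):
    """Check that a matching (dict: input -> output) serves every critical line."""
    if len(set(matching.values())) != len(matching):
        raise ValueError("two inputs are matched to the same output")
    # Only connections that find a non-empty queue actually serve a line.
    served = [(i, j) for i, j in matching.items() if queue[i][j] > 0]
    rows, cols = critical_lines(queue)
    return rows <= {i for i, _ in served} and cols <= {j for _, j in served}


queue_state = [[3, 0, 6],
               [7, 0, 1],
               [0, 5, 0]]
print(critical_lines(queue_state))
print(is_critical(queue_state, {0: 2, 1: 0, 2: 1}))
print(is_critical(queue_state, {0: 2, 2: 1}))
```

In this example the only critical line is column 0 (sum 10), so any matching that leaves output 0 idle is not critical, even if it is otherwise large.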
CIOQ: Strict Relative Stability • Maximal matching: Keeps under-subscribed outputs stable (f3) (s=2) • Shortest Output-Queue First: (f4) (s=3) • Output element scheduler: Identical to the one in emulated switch • Intuition: Give preference to less congested pairs at the output • Asymptotic emulation of an ideal switch: long-term fairness
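The slide gives only the intuition behind Shortest Output-Queue First. One possible reading, which is an assumption here and not necessarily the policy analyzed in the thesis, is a greedy maximal matching that serves outputs in order of increasing output-queue occupancy. The tie-break rules (lowest output index, then lowest input index) are also assumptions.

```python
def soqf_matching(voq, output_queue_len):
    """Greedy matching that visits outputs from the shortest output queue to the longest.

    voq[i][j]: cells waiting at input i for output j
    output_queue_len[j]: current occupancy of output queue j
    Returns a dict mapping each matched input to its output.
    """
    n_in = len(voq)
    free_inputs = set(range(n_in))
    matching = {}
    # Less congested outputs choose first (ties broken by output index).
    for j in sorted(range(len(output_queue_len)), key=lambda j: (output_queue_len[j], j)):
        candidates = [i for i in sorted(free_inputs) if voq[i][j] > 0]
        if candidates:
            i = candidates[0]  # assumed tie-break: lowest-numbered free input
            matching[i] = j
            free_inputs.remove(i)
    return matching


voq = [[3, 0, 6],
       [7, 0, 1],
       [0, 5, 0]]
print(soqf_matching(voq, [4, 0, 2]))
print(soqf_matching(voq, [0, 0, 0]))
```

Because every output is visited once and takes a free input whenever one has traffic for it, the result is always a maximal matching; only the visiting order encodes the preference for less congested outputs.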
Switched Fair Airport • Integrate two policies M1 and M2: • M1: Provides bandwidth trunks given rate reservations • M2: Optimizes throughput independent of above rates [Table: speedup required for M1 and M2 under Multi-phase Combination and Exclusive Combination] Maximal matching is additive to any other policy, hence needs the least speedup
CIOQ-A: Aggregation Advantages: Smaller space element; lower arbitration complexity; heterogeneous subports • Shadow-Decompose: CIOQ emulation (f5) • VEQ Matching: Less complex, only for admissible traffic (f2)
CIOQ-P: Pipelining
Advantages: Slower space element; lower arbitration complexity. Implement arbitrarily complex policies!
• Sequential Dispatch: CIOQ emulation (f5)
• Concurrent Dispatch: Limited candidates (stale-state issues); 3D Maximal Matching for relative stability
• Striping: Shadow on envelope basis
• Equal Dispatch: Explicitly equalize load; separate occupancy counters for each SE
G-MSM: Combination • Combination methods: CIOQ-A/P • No need for independent analysis • Recursion possible
PPS: Architecture [Figure: Demux, Core, Mux stages] Advantages: Reuse low-capacity core switch. Implement arbitrarily slow memories! provided Memoryless first and third stages. Performance: Emulates OQ switch • Pool the resources on several switching paths • Dual of a CIOQ-P switch • Matching algorithm replaced by load balancing • Sequence control might be necessary
PPS: Flow-based • Model for clustered routers: • Per-flow path assignment: explicit or hashed • No need for sequence control • Memory in first stage • High speedup (Clos fitting) • Unbalanced load assignment • Requires knowledge of loads Split flows
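A hashed per-flow path assignment can be sketched in a few lines. The example assumes flows are named by strings and uses CRC-32 as the hash; both choices are illustrative and are not taken from the slides. Because every cell of a flow hashes to the same core element, cells cannot overtake each other across paths, which is why no sequence control is needed; the price is that the per-core load depends on which flows collide.

```python
import zlib


def core_for_flow(flow_id: str, k: int) -> int:
    """Map a flow identifier to one of k core elements with a stable hash."""
    if k < 1:
        raise ValueError("k must be positive")
    return zlib.crc32(flow_id.encode("utf-8")) % k


def core_loads(flow_rates: dict, k: int) -> list:
    """Sum the offered rate that each core element receives under hashing."""
    loads = [0.0] * k
    for flow_id, rate in flow_rates.items():
        loads[core_for_flow(flow_id, k)] += rate
    return loads


# Hypothetical flows and rates, for illustration only.
flows = {"10.0.0.1->10.0.1.1": 0.5, "10.0.0.2->10.0.1.7": 0.25}
print(core_loads(flows, 4))
```

An explicit assignment would replace `core_for_flow` with a table filled in by a fitting algorithm that knows the loads, which matches the "requires knowledge of loads" remark on the slide.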
PPS: Cell-based • Uniformly distribute the load of each flow • Premise: Each core element receives 1/K cells of each flow • Equal dispatch and striping suffice for asymptotic OQ emulation • Bandwidth trunks: Large buffers required
Summary: A Recipe Book
• Taxonomy of multi-module switches: Buffered Clos Switches
• Performance framework: Functional equivalence with ideal switch
Applications:
Combined I/O Queueing: • QoS: Online maximal matching • Throughput: Critical matching • Strict stability: Maximal matching, SOQF • Switched Fair Airport matching
Aggregation: • Shadow and Decompose • Virtual Element Queueing
Pipelining: • Striping and Equal Dispatch • Concurrent Dispatch: 3D matching
Parallelization: • Flow-based PPS: Clos fitting • Cell-based PPS: Striping, Equal Dispatch
Memory-Space-Memory: • Combination methods • Recursive BCS
Avenues for Follow-on Research • Efficient policies for multicast • Similar treatment on other interconnection networks • Theory of backpressure: • Recent interest in buffered crossbars • Quality of stability: Average delay analysis • Short-timescale equivalence • Emulation of a finite-memory ideal switch • Interplay of buffer management with matching algorithms
Relevant Publications
Switching Algorithms:
• Dynamic Partitioning: Switch Memory Management, Infocom ’99
• Packet Switches with QoS Support, Hot Interconnects ’00
• Feedback Control for Distributed Scheduling, Globecom ’00
• Buffered Clos Switches, Columbia TR ’02
Parallel Switches:
• Inverse Multiplexing for Switches, Globecom ’98
• Switched Connections Inverse Multiplexing, Intl. Conf. ATM ’99
• Recognition of Parallel Packet Switches, GBN, Infocom ’01
• Stability Analysis of Parallel Packet Switches, ICC ’01
• Open-loop Schemes for Multi-path Switches, ICC ’03
Proposal Conjectures Proposal: six conjectures • Maximal matching is sufficient to isolate oversubscribed outputs: DONE • SOQF is sufficient for strict relative stability: DONE • Equal dispatch for strict stability in CIOQ-P: DONE • Equal dispatch plus decomposition for strict stability in G-MSM: DONE • Rate shaping plus maximal matching suffices for QoS in CIOQ: DONE • SOQF suffices for long-term fairness in CIOQ: DONE Plus many more to round out the work
Additional Contributions Background: Survey of formal methods in switching: a new perspective Applications Combined I/O Queueing Aggregation • Maximal Matching: Delay analysis • Perfect Sequences: Uniform Traffic • Multicast support using Recycling • Batch Decomposition (Optical) • Support for Heterogeneous Subports Pipelining Parallelization • Concurrent Dispatch: BVN and SPS • SMM Switches: PPS without backpressure • Fractional Dispatch for memoryless inputs
Matching Flavors
• Maximal matching: Non-idling, greedy
• Invariant (for a non-empty queue-state entry): At least one connection in the marked lines, that is, in the row or column of that entry
• Maximum-size matching: Maximum flow in a bipartite graph
• Ford-Fulkerson, Hopcroft-Karp
Queue State:
3 0 6
7 0 1
0 5 0
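The difference between the two flavors on this slide shows up on very small examples. The sketch below is illustrative: the greedy routine produces a maximal matching, and the second routine finds a maximum-size matching with simple augmenting paths (the Ford-Fulkerson idea specialized to bipartite graphs, not the faster Hopcroft-Karp algorithm). Function names and the 2x2 example are assumptions made for the illustration.

```python
def greedy_maximal_matching(queue):
    """Scan inputs in order and connect each to its first free non-empty output."""
    used_outputs = set()
    matching = {}
    for i, row in enumerate(queue):
        for j, cells in enumerate(row):
            if cells > 0 and j not in used_outputs:
                matching[i] = j
                used_outputs.add(j)
                break
    return matching


def maximum_size_matching(queue):
    """Maximum-size bipartite matching via augmenting paths."""
    n_out = len(queue[0])
    owner = [None] * n_out  # owner[j] is the input currently matched to output j

    def augment(i, visited):
        for j in range(n_out):
            if queue[i][j] > 0 and j not in visited:
                visited.add(j)
                # Take a free output, or re-route the input that holds it.
                if owner[j] is None or augment(owner[j], visited):
                    owner[j] = i
                    return True
        return False

    for i in range(len(queue)):
        augment(i, set())
    return {i: j for j, i in enumerate(owner) if i is not None}


small = [[1, 1],
         [1, 0]]
print(greedy_maximal_matching(small))
print(maximum_size_matching(small))
```

On the 2x2 example the greedy result has one connection and cannot be extended, so it is maximal but not maximum; the augmenting-path search re-routes input 0 to output 1 and reaches two connections.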
Matching Flavors (continued) • Critical Matching: Covers all critical rows and columns • Critical line: A line with the maximum sum • Perfect Matching: Each configuration is a permutation • Maximum Weight Matching: Use queue length as weights • Optimization problem: simplex method • Template Matchings: • BVN: Decompose rate matrix as convex combination of permutations • Double: Lower number of permutations, wasted slots • Min: N permutations will cover all entries, large number of wasted slots • Stable Matching: Gale-Shapley algorithm
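The BVN template idea can be sketched on an integer matrix whose rows and columns all have the same sum, which is a doubly stochastic rate matrix scaled by a common denominator. The code below is a minimal illustration under that assumption: it repeatedly finds a permutation inside the non-zero entries and peels it off with the largest possible weight. The helper names are hypothetical, and the perfect-matching step uses plain augmenting paths.

```python
def perfect_matching_on_support(m):
    """Find a permutation that uses only non-zero entries of the square matrix m."""
    n = len(m)
    owner = [None] * n  # owner[j] is the row currently assigned to column j

    def augment(i, visited):
        for j in range(n):
            if m[i][j] > 0 and j not in visited:
                visited.add(j)
                if owner[j] is None or augment(owner[j], visited):
                    owner[j] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, set()):
            raise ValueError("no perfect matching: all line sums must be equal")
    perm = [None] * n
    for j, i in enumerate(owner):
        perm[i] = j
    return perm


def bvn_decompose(matrix):
    """Split the matrix into (weight, permutation) terms that sum back to it."""
    m = [row[:] for row in matrix]
    n = len(m)
    terms = []
    while any(any(row) for row in m):
        perm = perfect_matching_on_support(m)
        weight = min(m[i][perm[i]] for i in range(n))
        for i in range(n):
            m[i][perm[i]] -= weight  # at least one entry drops to zero each round
        terms.append((weight, perm))
    return terms


scaled_rates = [[1, 2, 0],
                [2, 0, 1],
                [0, 1, 2]]
for weight, perm in bvn_decompose(scaled_rates):
    print(weight, perm)
```

Dividing each weight by the common line sum (3 in the example) gives the convex-combination coefficients, and each permutation is one switch configuration to be held for that share of the slots.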
Stability Theory • Lyapunov functions: Kumar-Meyn ’95 • Mechanism to extend Foster’s criterion to a system of queues • Weighted cartesian product of queue lengths • Symmetric and co-positive • Fluid limits: Dai-Prabhakar ’00 • Function of discrete time: Interpolate • Limit: Scale time to infinity • The scaling parameter may be drawn from an increasing sequence $r_n$: $F(t) = \lim_{r \to \infty} \frac{1}{r} f(rt)$
CIOQ: Bandwidth Trunks • Arrivals into GQ: Bounded admissible • Bandwidth Trunk: Timescale = 1/GCD(R) • Covers all entries in GQ before next batch • Delay comparable to BVN rate decomposition
CIOQ: Perfect Sequences • Sub-maximal Perfect Sequence: • A sequence of N permutations that covers the unit matrix • A repeating sequence guarantees 1/N to each pair • Suffices for 100% throughput to uniform traffic • Simple implementation: Staggered round-robin • Not even maximal! Concurrent SPS for CIOQ-P: K turns in KN slots Basis for iSLIP Basis for Atlanta arbitration
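A staggered round-robin is easy to write down and check. In the sketch below, slot t connects input i to output (i + t) mod N; this specific offset rule is an assumption about what "staggered round-robin" means on the slide, and "covers the unit matrix" is read as every input-output pair being served exactly once in N slots.

```python
def staggered_round_robin(n):
    """Return n permutations; in slot t, input i is connected to output (i + t) % n."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


def covers_every_pair_once(sequence):
    """Check that the sequence serves each input-output pair exactly once."""
    n = len(sequence[0])
    served = [(i, perm[i]) for perm in sequence for i in range(n)]
    return len(served) == n * n and len(set(served)) == n * n


sequence = staggered_round_robin(4)
for perm in sequence:
    print(perm)
print(covers_every_pair_once(sequence))
```

Repeating the sequence gives each pair one slot in every N, that is, a rate of 1/N, without looking at the queue state at all, which is why the configuration in a given slot need not be a maximal matching.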
CIOQ-P: Equal Dispatch • Explicitly equalize the load for each input-output pair • Implemented as counters • No mis-sequencing issues
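One simple way to equalize the load per input-output pair is a round-robin pointer kept for each pair, so that consecutive cells of a pair go to consecutive space elements. This is an illustrative reading of "implemented as counters"; the thesis may instead use the per-element occupancy counters mentioned on the pipelining slide. The class name `EqualDispatcher` is hypothetical.

```python
class EqualDispatcher:
    """Spread the cells of every input-output pair evenly over k space elements."""

    def __init__(self, n_ports: int, k_elements: int):
        if n_ports < 1 or k_elements < 1:
            raise ValueError("n_ports and k_elements must be positive")
        self.k = k_elements
        # One counter per (input, output) pair: the element that gets the next cell.
        self.next_element = [[0] * n_ports for _ in range(n_ports)]

    def dispatch(self, i: int, j: int) -> int:
        """Return the space element for the next cell from input i to output j."""
        element = self.next_element[i][j]
        self.next_element[i][j] = (element + 1) % self.k
        return element


dispatcher = EqualDispatcher(n_ports=2, k_elements=3)
first_pair = [dispatcher.dispatch(0, 1) for _ in range(6)]
other_pair = [dispatcher.dispatch(1, 0) for _ in range(2)]
print(first_pair, other_pair)
```

After any number of cells, the counts that a pair has sent to two different elements differ by at most one, which is the per-pair balance the slide asks for.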
CIOQ-P: 3D Maximal Matching • Concurrent traversal of queue state matrix • Pointers do not coincide with each other
Recursive G-MSM [Figure: stages labeled Any matching, SPS, SPS] • Memory element of a G-MSM: Replace with a CIOQ switch • Virtual Element Queues: Organized per space element