048866: Packet Switch Architectures
Scaling
Dr. Isaac Keslassy
Electrical Engineering, Technion
isaac@ee.technion.ac.il
http://comnet.technion.ac.il/~isaac/
Achieving 100% throughput
• Switch model
• Uniform traffic
  • Technique: Uniform schedule (easy)
• Non-uniform traffic, but known traffic matrix
  • Technique: Non-uniform schedule (Birkhoff-von Neumann)
• Unknown traffic matrix
  • Technique: Lyapunov functions (MWM)
• Faster scheduling algorithms
  • Technique: Speedup (maximal matchings)
  • Technique: Memory and randomization (Tassiulas)
  • Technique: Twist architecture (buffered crossbar)
• Accelerate scheduling algorithm
  • Technique: Pipelining
  • Technique: Envelopes
  • Technique: Slicing
• No scheduling algorithm
  • Technique: Load-balanced router
Outline
Up until now, we have focused on high-performance packet switches with:
• A crossbar switching fabric,
• Input queues (and possibly output queues as well),
• Virtual output queues, and
• A centralized arbitration/scheduling algorithm.
Today we'll talk about the implementation of the crossbar switch fabric itself: how crossbars are built, how they scale, and what limits their capacity.
Crossbar switch: Limiting factors
• N² crosspoints per chip (equivalently, N N-to-1 multiplexors).
• It's not obvious how to build a crossbar from multiple chips.
• Capacity of "I/O"s per chip:
  • State of the art: about 300 pins, each operating at 3.125 Gb/s, i.e. ~1 Tb/s per chip.
  • Only about 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup.
• Crossbar chips today are limited by "I/O" capacity.
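To make these numbers concrete, here is a quick back-of-the-envelope check (the pin count and per-pin rate are the figures quoted above, not vendor datasheet values):

```python
pins = 300                   # high-speed serial pins per chip (slide's figure)
gbps_per_pin = 3.125         # per-pin line rate in Gb/s (slide's figure)

raw = pins * gbps_per_pin                # 937.5 Gb/s, i.e. ~1 Tb/s
usable_lo, usable_hi = raw / 3, raw / 2  # after overhead and speedup

print(f"raw pin bandwidth: {raw:.1f} Gb/s")
print(f"usable capacity:   {usable_lo:.0f} to {usable_hi:.0f} Gb/s")
```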
Scaling
• Scaling line rate
  • Bit-slicing
  • Time-slicing
• Scaling time (scheduling speed)
  • Time-slicing
  • Envelopes
  • Frames
• Scaling number of ports
  • Naïve approach
  • Clos networks
  • Benes networks
Bit-sliced parallelism
[Figure: each linecard stripes its cells across k identical switch planes (1, 2, ..., k), all driven by one scheduler]
• Each cell is "striped" across k identical planes.
• The scheduler makes the same decision for all slices.
• However, this doesn't decrease the required scheduling speed.
• Other problem(s)?
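A minimal sketch of the striping step (cell size, k, and the byte-level representation are illustrative assumptions): each plane carries 1/k of every cell, which is why the line rate scales by k while the scheduling rate does not.

```python
def stripe(cell: bytes, k: int) -> list[bytes]:
    """Cut a fixed-size cell into k equal slices, one per plane."""
    assert len(cell) % k == 0, "cell size must divide evenly by k"
    w = len(cell) // k
    return [cell[i * w:(i + 1) * w] for i in range(k)]

def reassemble(slices: list[bytes]) -> bytes:
    """Inverse of stripe(): concatenate the k slices at the output."""
    return b"".join(slices)

cell = bytes(range(64))       # a 64-byte cell (illustrative)
slices = stripe(cell, 8)      # 8 planes carry 8 bytes each per cell time
assert reassemble(slices) == cell
```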
Time-sliced parallelism
[Figure: each linecard sends successive cells to planes 1, 2, ..., k in turn; one centralized scheduler]
• Each cell is carried by one plane, and takes k cell times.
• The centralized scheduler is unchanged: it works for each slice in turn.
• Problem: the required scheduling speed is unchanged.
Scaling
• Scaling line rate
  • Bit-slicing
  • Time-slicing
• Scaling time (scheduling speed)
  • Time-slicing
  • Envelopes
  • Frames
• Scaling number of ports
  • Naïve approach
  • Clos networks
  • Benes networks
Time-sliced parallelism with parallel scheduling
[Figure: k planes, each with its own slow scheduler; each linecard sends successive cells to the planes in turn]
• Scheduling is now distributed to each slice.
• Each scheduler has k cell times to compute its schedule.
• Problem(s)?
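A minimal sketch of the dispatch rule this relies on (the round-robin policy is an assumption for illustration): whole cells are dealt to the planes in turn, so each plane's scheduler only has to produce a matching once every k cell times.

```python
k = 4   # number of planes, each with its own slow scheduler (illustrative)

def plane_for_cell(t: int) -> int:
    """Round-robin dispatch: the whole cell arriving at cell time t goes
    to plane t mod k, so scheduler s decides only at times s, s+k, ..."""
    return t % k

for t in range(2 * k):
    print(f"cell time {t}: handled by plane/scheduler {plane_for_cell(t)}")
```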
Envelopes
[Figure: at each VOQ, cells accumulate into an envelope of k cells before requesting service from a slow scheduler]
• Envelopes of k cells [Kar et al., 2000].
• Problem: "Should I stay or should I go now?"
  • Waiting → starvation ("Waiting for Godot")
  • Timeouts → loss of throughput
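A sketch of the trade-off at a single VOQ (the policy details and names are mine, not from Kar et al.): wait for a full envelope and risk starvation, or time out and ship a partial envelope whose empty slots waste throughput.

```python
from collections import deque

class VOQEnvelope:
    """One VOQ that hands the scheduler k-cell envelopes, with a timeout
    so a partially filled envelope is not starved forever."""

    def __init__(self, k: int, timeout: int):
        self.k = k               # envelope size in cells
        self.timeout = timeout   # max cell times to wait on a partial envelope
        self.cells = deque()
        self.age = 0             # cell times since the oldest queued cell

    def enqueue(self, cell) -> None:
        self.cells.append(cell)

    def tick(self):
        """Called once per cell time; returns an envelope, or None."""
        if not self.cells:
            return None
        self.age += 1
        if len(self.cells) >= self.k:       # full envelope: go
            self.age = 0
            return [self.cells.popleft() for _ in range(self.k)]
        if self.age >= self.timeout:        # timed out: go half-empty
            self.age = 0
            partial = list(self.cells)
            self.cells.clear()
            return partial                  # unused slots = lost throughput
        return None                         # keep waiting (risk: starvation)
```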
Frames for scheduling
[Figure: at each VOQ, cells are grouped into frames of k cells; one slow scheduler]
• The slow scheduler simply makes its decision every k cell times and holds it for k cell times.
• Often associated with pipelining.
• Note: pipelined MWM is still stable (intuitively, the weight doesn't change much over a frame).
• Possible problem(s)?
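A minimal sketch of frame-based operation (names are illustrative): one matching is computed per frame and reused for k cell times, so the scheduler gets a k-fold time budget per decision.

```python
def run_frames(schedule_fn, num_cell_times: int, k: int):
    """Frame-based operation: compute one matching per frame of k cell
    times and hold it for the whole frame."""
    match = None
    for t in range(num_cell_times):
        if t % k == 0:
            match = schedule_fn(t)   # one slow decision per frame
        yield t, match               # the same matching is reused k times

# Stub scheduler for illustration; in practice this would be, e.g., pipelined MWM.
for t, match in run_frames(lambda t: f"match computed at t={t}", 6, k=3):
    print(t, match)
```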
Scaling a crossbar
• Conclusion:
  • Scaling the line rate is relatively straightforward (although the chip count and power may become a problem).
  • Scaling the scheduling decision is more difficult, and often comes at the expense of packet delay.
• What if we want to increase the number of ports?
  • Can we build a crossbar-equivalent from multiple stages of smaller crossbars?
  • If so, what properties should it have?
Scaling
• Scaling line rate
  • Bit-slicing
  • Time-slicing
• Scaling time (scheduling speed)
  • Time-slicing
  • Envelopes
  • Frames
• Scaling number of ports
  • Naïve approach
  • Clos networks
  • Benes networks
Scaling number of outputs: Naïve approach
[Figure: a 16x16 crossbar switch tiled from 4x4 building blocks]
• Building block: 4 inputs, 4 outputs.
• When tiled into the 16x16 fabric, each building block must also pass signals through: eight inputs and eight outputs required per chip!
3-stage Clos network
[Figure: m first-stage switches of size n x k, k middle-stage switches of size m x m, and m third-stage switches of size k x n; every first-stage switch connects to every middle switch, and every middle switch to every third-stage switch]
• N = n x m ports per side, with k ≥ n middle-stage switches.
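A small sketch of this parameterization (class and method names are mine), collecting the two non-blocking conditions that the following slides establish:

```python
from dataclasses import dataclass

@dataclass
class Clos:
    """Parameters of a symmetric 3-stage Clos network."""
    n: int   # ports per first-stage (and third-stage) switch
    m: int   # number of first-stage (and third-stage) switches
    k: int   # number of middle-stage switches

    @property
    def N(self) -> int:
        return self.n * self.m           # total input (= output) ports

    def rearrangeably_nonblocking(self) -> bool:
        return self.k >= self.n          # shown on the next slides

    def strictly_nonblocking(self) -> bool:
        return self.k >= 2 * self.n - 1  # Clos' theorem, later in this section

c = Clos(n=4, m=4, k=4)
print(c.N)                               # 16 ports
print(c.rearrangeably_nonblocking())     # True  (k = n)
print(c.strictly_nonblocking())          # False (k < 2n - 1)
```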
With k = n, is a Clos network non-blocking like a crossbar?
• Consider the example: the scheduler chooses to match (1,1), (2,4), (3,3), (4,2).
With k = n, is a Clos network non-blocking like a crossbar?
• Consider the example: the scheduler chooses to match (1,1), (2,2), (4,4), (5,3), ...
• By rearranging the existing matches, the new connections could be added.
• Q: Is this Clos network "rearrangeably non-blocking"?
With k = n, a Clos network is rearrangeably non-blocking
• Route matching is equivalent to edge-coloring in a bipartite multigraph:
  • Each vertex corresponds to an n x k or k x n switch.
  • Each match, e.g. (1,1), (2,4), (3,3), (4,2), becomes an edge.
  • Colors correspond to middle-stage switches.
  • No two edges at a vertex may be colored the same.
• König's edge-coloring theorem: a bipartite multigraph of maximum degree D can be edge-colored with D colors. (Remember the Birkhoff-von Neumann decomposition theorem.)
• Therefore, if k = n, a Clos network is rearrangeably non-blocking (and can therefore perform any permutation).
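A tiny worked version of the reduction, using the example matching above and assuming n = 2 ports per edge switch:

```python
n, m = 2, 2                                   # 2 ports per edge switch, 2 edge switches
matching = [(1, 1), (2, 4), (3, 3), (4, 2)]   # (input port, output port) pairs

# Each matched pair becomes one edge between its input switch and its
# output switch; ports 1..n sit on switch 0, ports n+1..2n on switch 1.
edges = [((i - 1) // n, (o - 1) // n) for i, o in matching]
print(edges)   # [(0, 0), (0, 1), (1, 1), (1, 0)]: a 2-regular multigraph,
               # so 2 colors (= k = n = 2 middle switches) suffice
```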
How complex is the rearrangement?
• Method 1: Find a maximum-size bipartite matching for each of the D colors in turn: O(DN^2.5).
  • Why does it work? Every regular bipartite graph contains a perfect matching (Hall's theorem), so peeling off one matching leaves a (D-1)-regular graph.
• Method 2: Partition the graph into Euler sets: O(N log D) [Cole et al. '00].
Euler partition of a graph
• An Euler partition of a graph G splits its edges into open paths and cycles, such that:
  • Each odd-degree vertex is at the end of exactly one open path.
  • Each even-degree vertex is at the end of no open path.
Euler split of a graph
[Figure: a graph G split into two subgraphs G1 and G2]
• Euler split of G into G1 and G2:
  • Scan each path in an Euler partition.
  • Place alternate edges into G1 and G2.
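A sketch of one Euler split in code (the graph representation and names are mine): greedily trace the trails of an Euler partition, then deal alternate edges of each trail to G1 and G2. In a bipartite multigraph all cycles have even length, so when every degree is even the split halves each vertex degree exactly.

```python
from collections import defaultdict

def euler_split(edges):
    """One Euler split of an undirected multigraph, given as a list of
    (u, v) pairs. Returns two edge lists G1, G2."""
    adj = defaultdict(list)           # vertex -> list of (neighbor, edge id)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    used = [False] * len(edges)

    def walk(start):
        """Greedily extend a trail from start over unused edges."""
        trail, u = [], start
        while adj[u]:
            v, i = adj[u].pop()
            if used[i]:
                continue              # stale mirror entry, skip it
            used[i] = True
            trail.append(edges[i])
            u = v
        return trail

    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    # Open trails start at odd-degree vertices; leftover edges form cycles.
    starts = [u for u in deg if deg[u] % 2 == 1] + list(deg)
    g1, g2 = [], []
    for s in starts:
        for j, e in enumerate(walk(s)):
            (g1 if j % 2 == 0 else g2).append(e)
    return g1, g2
```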
Edge-coloring using Euler sets
• Assume for simplicity that:
  • the graph is regular (all vertices have the same degree D), and
  • D = 2^i.
• Perform i Euler splits and 1-color each resulting graph (each is 1-regular, i.e. a perfect matching).
• This is log D rounds of splitting, each of cost O(E).
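Putting the pieces together, a sketch of the full coloring loop (it reuses euler_split from the previous sketch; the vertex labels are illustrative):

```python
def color_regular_bipartite(edges, D):
    """Edge-color a D-regular bipartite multigraph with D = 2^i colors by
    i rounds of Euler splits. After the last round every group is
    1-regular, i.e. a perfect matching (one color class)."""
    assert D & (D - 1) == 0, "sketch assumes D is a power of two"
    groups = [edges]
    while len(groups) < D:
        groups = [half for g in groups for half in euler_split(g)]
    return groups              # groups[c] holds the edges of color c

# The 2-regular example from the reduction sketch: one split, two colors.
edges = [(("I", 0), ("O", 0)), (("I", 0), ("O", 1)),
         (("I", 1), ("O", 1)), (("I", 1), ("O", 0))]
for color, matching in enumerate(color_regular_bipartite(edges, D=2)):
    print(color, matching)     # each color class is one middle switch
```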
Implementation
[Figure: the scheduler turns the request graph into a permutation; a routing step then turns the permutation into paths through the Clos network]
Implementation
Pros:
• A rearrangeably non-blocking switch can perform any permutation.
• A cell switch is time-slotted, so all connections are rearranged every time slot anyway.
Cons:
• Rearrangement algorithms are complex (in addition to the scheduler).
Can we eliminate the need to rearrange?
Strictly non-blocking Clos network
Clos' theorem: If k ≥ 2n - 1, then a new connection can always be added without rearrangement.
Clos theorem
[Figure: m first-stage switches I1...Im of size n x k, k middle-stage switches M1...Mk of size m x m, and m third-stage switches O1...Om of size k x n; N = n x m, k ≥ 2n - 1]
Clos theorem: proof sketch
[Figure: input switch Ia and output switch Ob, each with n - 1 connections already in use]
• Consider adding the n-th connection between first-stage switch Ia and third-stage switch Ob.
• We need to ensure that there is always some center-stage switch M available.
• The other n - 1 connections at Ia occupy at most n - 1 middle switches, and the other n - 1 connections at Ob occupy at most n - 1 more.
• If k > (n - 1) + (n - 1), there is always an M available, i.e. we need k ≥ 2n - 1.
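The counting argument is short enough to state as an assertion (the value of n is illustrative):

```python
n = 4                      # ports per edge switch (illustrative)
k = 2 * n - 1              # middle switches, per Clos' theorem

busy_via_input = n - 1     # middles used by Ia's other n-1 inputs (worst case)
busy_via_output = n - 1    # middles used by Ob's other n-1 outputs (worst case)

# Even if the two busy sets are disjoint, they block at most 2(n-1)
# middles, so k = 2n - 1 always leaves at least one middle switch free.
assert k - (busy_via_input + busy_via_output) >= 1
```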
Benes networks: Recursive construction
[Figure: an N-port Benes network built recursively, with 2x2 switches in the outer stages around two N/2-port Benes networks]
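A sketch of what this recursion buys (assuming 2x2 building blocks and N a power of two): an N-port Benes network has 2*log2(N) - 1 stages of N/2 switches, so its crosspoint count grows as N log N rather than N².

```python
import math

def benes_cost(N: int):
    """Stage/switch/crosspoint counts for an N-port Benes network built
    from 2x2 switches (assumes N is a power of two)."""
    stages = 2 * int(math.log2(N)) - 1   # log2(N) stages in, log2(N) - 1 back out
    switches = stages * (N // 2)         # N/2 two-by-two switches per stage
    crosspoints = 4 * switches           # each 2x2 switch has 4 crosspoints
    return stages, switches, crosspoints

for N in (8, 64, 1024):
    s, sw, xp = benes_cost(N)
    print(f"N={N}: {s} stages, {sw} switches, {xp} crosspoints "
          f"(flat crossbar: {N * N})")
```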
Scaling crossbars: Summary
• Scaling the bit rate through parallelism is easy.
• Scaling the scheduler is hard.
• Scaling the number of ports is harder.
• Clos network:
  • Rearrangeably non-blocking with k = n, but routing is complicated.
  • Strictly non-blocking with k ≥ 2n - 1, so routing is simple, but it requires more bisection bandwidth.
• Benes network: scaling with small components.