EE384Y: Packet Switch Architectures, Part II: Scaling Crossbar Switches. Nick McKeown, Professor of Electrical Engineering and Computer Science, Stanford University. nickm@stanford.edu, http://www.stanford.edu/~nickm
Outline: Up until now, we have focused on high-performance packet switches with: • A crossbar switching fabric, • Input queues (and possibly output queues as well), • Virtual output queues, and • A centralized arbitration/scheduling algorithm. Today we’ll talk about the implementation of the crossbar switch fabric itself: how is it built, how does it scale, and what limits its capacity?
Crossbar switch: Limiting factors • N^2 crosspoints per chip, or N x (N-to-1) multiplexors. • It’s not obvious how to build a crossbar from multiple chips. • Capacity of “I/O”s per chip. • State of the art: about 300 pins, each operating at 3.125 Gb/s ~= 1 Tb/s per chip. • About 1/3 to 1/2 of this capacity is available in practice because of overhead and speedup. • Crossbar chips today are limited by “I/O” capacity.
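The I/O arithmetic on this slide can be checked in a few lines. This is only a back-of-the-envelope sketch using the slide's own figures; the variable names are made up for illustration.

```python
# Figures quoted on the slide.
PINS_PER_CHIP = 300
GBPS_PER_PIN = 3.125

raw_gbps = PINS_PER_CHIP * GBPS_PER_PIN   # aggregate I/O rate of one chip
usable_low_gbps = raw_gbps / 3            # low end of the slide's 1/3 to 1/2 range
usable_high_gbps = raw_gbps / 2           # high end of the slide's 1/3 to 1/2 range

print(f"raw I/O capacity: {raw_gbps} Gb/s")
print(f"usable capacity:  {usable_low_gbps} to {usable_high_gbps} Gb/s")
```

The raw figure comes to 937.5 Gb/s, which the slide rounds to 1 Tb/s.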
Scaling number of outputs: Trying to build a crossbar from multiple chips [Figure: a 16x16 crossbar switch assembled from building-block chips, each with 4 inputs and 4 outputs.] Eight inputs and eight outputs required per building block!
Scaling line-rate: Bit-sliced parallelism • Cell is “striped” across multiple (k) identical planes. • Crossbar switched “bus”. • Scheduler makes same decision for all slices. [Figure: a linecard stripes each cell across planes 1 through 8; a single scheduler controls all planes.]
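A minimal sketch of bit-slicing, assuming byte-granularity striping for readability (real fabrics stripe at the bit or word level). The function names `stripe`, `reassemble`, and `switch_bit_sliced` are hypothetical. The point it illustrates is that every plane applies the same input-to-output match.

```python
def stripe(cell, k):
    """Split one cell into k slices; slice p holds bytes p, p+k, p+2k, ..."""
    return [cell[p::k] for p in range(k)]

def reassemble(slices):
    """Inverse of stripe(): interleave the k slices back into one cell."""
    k = len(slices)
    out = bytearray(sum(len(s) for s in slices))
    for p, piece in enumerate(slices):
        out[p::k] = piece
    return bytes(out)

def switch_bit_sliced(cells_by_input, match, k):
    """Carry each cell over k planes, using one scheduler decision for all planes.

    cells_by_input: {input_port: cell bytes}; match: {input_port: output_port}.
    Returns {output_port: cell bytes}.
    """
    planes = [{} for _ in range(k)]           # planes[p][output] = slice p of the cell
    for inp, cell in cells_by_input.items():
        for p, piece in enumerate(stripe(cell, k)):
            planes[p][match[inp]] = piece     # same match on every plane
    return {out: reassemble([planes[p][out] for p in range(k)])
            for out in planes[0]}
```

For example, `switch_bit_sliced({0: cell_a, 1: cell_b}, {0: 1, 1: 0}, 4)` delivers `cell_a` to output 1 and `cell_b` to output 0, with each plane carrying a quarter of every cell.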
Scaling line-rate: Time-sliced parallelism • Cell carried by one plane; takes k cell times. • Scheduler is unchanged. • Scheduler makes decision for each slice in turn. [Figure: a linecard sends successive whole cells to planes 1 through 8 in turn; a single scheduler serves the planes one after another.]
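Time-slicing can be sketched as a round-robin assignment of scheduler decisions to planes. The sketch below assumes that the match chosen at cell time t is carried by plane t mod k and occupies it for k cell times; the helper names are hypothetical.

```python
def time_sliced_plan(matches, k):
    """Assign the match chosen at cell time t to plane t mod k.

    Returns tuples (plane, start, end, match); the plane is busy during
    cell times [start, end), i.e. for k cell times.
    """
    return [(t % k, t, t + k, match) for t, match in enumerate(matches)]

def planes_never_overlap(plan):
    """Check that no plane is handed a new cell while still busy."""
    busy_until = {}
    for plane, start, end, _ in plan:
        if start < busy_until.get(plane, 0):
            return False
        busy_until[plane] = end
    return True
```

Because plane t mod k is next used at time t + k, exactly when it becomes free, the scheduler can keep making one decision per cell time.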
Scaling a crossbar • Conclusion: scaling the capacity is relatively straightforward (although the chip count and power may become a problem). • What if we want to increase the number of ports? • Can we build a crossbar-equivalent from multiple stages of smaller crossbars? • If so, what properties should it have?
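3-stage Clos Network [Figure: m first-stage switches of size n x k, k middle-stage switches of size m x m, and m third-stage switches of size k x n, connecting N inputs to N outputs.] N = n x m, k >= n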
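To get a feel for the parameters, the crosspoint count of a 3-stage Clos network can be compared with that of a single N x N crossbar. This is an illustrative sketch (the function names are hypothetical) that simply counts crosspoints stage by stage from the switch sizes given above.

```python
def clos_crosspoints(n, m, k):
    """Total crosspoints in a 3-stage Clos network with N = n * m ports."""
    first = m * (n * k)     # m switches of size n x k
    middle = k * (m * m)    # k switches of size m x m
    third = m * (k * n)     # m switches of size k x n
    return first + middle + third

def crossbar_crosspoints(N):
    """Crosspoints in a single N x N crossbar."""
    return N * N

for n, m in [(4, 4), (32, 32)]:
    N = n * m
    print(N, crossbar_crosspoints(N), clos_crosspoints(n, m, k=n))
```

With k = n the Clos network uses fewer crosspoints than the crossbar in both cases, and the gap widens as N grows.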
With k = n, is a Clos network non-blocking like a crossbar? Consider the example: scheduler chooses to match (1,1), (2,4), (3,3), (4,2)
With k = n is a Clos network non-blocking like a crossbar? Consider the example: scheduler chooses to match (1,1), (2,2), (4,4), (5,3), … By rearranging matches, the connections could be added. Q: Is this Clos network “rearrangeably non-blocking”?
With k = n, a Clos network is rearrangeably non-blocking • Routing matches is equivalent to edge-coloring in a bipartite multigraph. • Each vertex corresponds to an n x k or k x n switch. • Each edge corresponds to a match, e.g. (1,1), (2,4), (3,3), (4,2). • Colors correspond to middle-stage switches. • No two edges at a vertex may be colored the same. • König's theorem (1916; cf. Vizing ‘64 for general graphs): a bipartite multigraph of maximum degree D can be edge-colored with D colors. • Therefore, if k = n, a 3-stage Clos network is rearrangeably non-blocking (and can therefore perform any permutation).
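The edge-coloring view can be made concrete with a small sketch. It uses the textbook alternating-path argument (when no middle-stage switch is free at both ends, swap two colors along a path, which is exactly a rearrangement). This is not one of the faster methods discussed in these slides, and the names `route_clos` and `is_valid_routing` and the 0-based port numbering are assumptions for illustration.

```python
def route_clos(matches, n, m):
    """Pick a middle-stage switch for every match in a Clos network with k = n.

    matches: list of (input_port, output_port), 0-based, each port used at most once.
    Returns one middle-stage index (a "color" in range(n)) per match.
    """
    k = n
    # Edge e joins first-stage switch u and third-stage switch v.
    ends = [(i // n, o // n) for i, o in matches]
    at_in = [[None] * k for _ in range(m)]    # at_in[u][c]: edge using color c at u
    at_out = [[None] * k for _ in range(m)]   # at_out[v][c]: edge using color c at v
    color = [None] * len(ends)

    def put(e, c):
        u, v = ends[e]
        at_in[u][c] = at_out[v][c] = e
        color[e] = c

    def clear(e):
        u, v = ends[e]
        at_in[u][color[e]] = at_out[v][color[e]] = None

    for e, (u, v) in enumerate(ends):
        a = at_in[u].index(None)              # a color free at u
        if at_out[v][a] is not None:
            b = at_out[v].index(None)         # a color free at v
            # Collect the path that starts at v and alternates colors a, b, a, ...
            path, x, on_out, c = [], v, True, a
            while True:
                nxt = at_out[x][c] if on_out else at_in[x][c]
                if nxt is None:
                    break
                path.append(nxt)
                x = ends[nxt][0] if on_out else ends[nxt][1]
                on_out = not on_out
                c = b if c == a else a
            # Rearrange: swap a and b along the path, which frees a at v.
            old = [color[p] for p in path]
            for p in path:
                clear(p)
            for p, c_old in zip(path, old):
                put(p, b if c_old == a else a)
        put(e, a)
    return color

def is_valid_routing(matches, mids, n):
    """True if no first- or third-stage switch uses a middle switch twice."""
    seen = set()
    for (i, o), mid in zip(matches, mids):
        for link in (("in", i // n, mid), ("out", o // n, mid)):
            if not 0 <= mid < n or link in seen:
                return False
            seen.add(link)
    return True
```

Assuming the example on the slide uses n = 2 and m = 2 (the figure is not preserved here), its matches become `[(0, 0), (1, 3), (2, 2), (3, 1)]` in 0-based numbering, and `route_clos(matches, 2, 2)` returns a middle-stage switch for each one.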
How complex is the rearrangement? • Method 1: Find a maximum-size bipartite matching for each of the D colors in turn, O(D N^2.5). • Method 2: Partition the graph into Euler sets, O(N log D) [Cole et al. ‘00].
Edge-Coloring using Euler sets • Make the graph regular: modify the graph so that every vertex has the same degree, D [combine vertices and add edges; O(E)]. • For D = 2^i, perform i rounds of “Euler splits” and 1-color each resulting graph. This is log D operations, each of O(E).
Euler partition of a graph • Euler partiton of graph G: • Each odd degree vertex is at the end of one open path. • Each even degree vertex is at the end of no open path.
Euler split of a graph [Figure: a graph G and the two graphs G1 and G2 produced by the split.] • Euler split of G into G1 and G2: • Scan each path in an Euler partition. • Place alternate edges into G1 and G2.
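The split and the resulting coloring can be sketched as follows for the simplest case: a bipartite multigraph in which every vertex has even degree, so the Euler partition consists only of closed paths. The function names are hypothetical, and the sketch does not attempt the regularization step or degrees that are not powers of two.

```python
def euler_split(edges):
    """Split a bipartite multigraph with all degrees even into two halves.

    edges: list of (u, v), u a left vertex and v a right vertex.
    Returns (g1, g2): two lists of edge indices; each vertex has half of
    its edges in g1 and half in g2.
    """
    adj = {}
    for e, (u, v) in enumerate(edges):
        adj.setdefault(("L", u), []).append(e)
        adj.setdefault(("R", v), []).append(e)
    if any(len(es) % 2 for es in adj.values()):
        raise ValueError("this sketch needs every degree to be even")
    used = [False] * len(edges)
    halves = ([], [])
    for start in adj:
        while True:
            # Walk one closed path from start, placing alternate edges in g1, g2.
            x, step = start, 0
            while True:
                while adj[x] and used[adj[x][-1]]:
                    adj[x].pop()              # discard edges already walked
                if not adj[x]:
                    break                     # stuck: we are back at start
                e = adj[x].pop()
                used[e] = True
                halves[step % 2].append(e)
                step += 1
                u, v = edges[e]
                x = ("R", v) if x[0] == "L" else ("L", u)
            if step == 0:
                break                         # no unused edges left at start
    return halves

def color_regular(edges, D):
    """Edge-color a D-regular bipartite multigraph, D a power of two."""
    colors = [0] * len(edges)

    def rec(idx, d, base):
        if d == 1:                            # 1-regular: one color suffices
            for e in idx:
                colors[e] = base
            return
        g1, g2 = euler_split([edges[e] for e in idx])
        rec([idx[j] for j in g1], d // 2, base)
        rec([idx[j] for j in g2], d // 2, base + d // 2)

    rec(list(range(len(edges))), D, 0)
    return colors

def is_proper(edges, colors):
    """True if no two edges at the same vertex share a color."""
    seen = set()
    for (u, v), c in zip(edges, colors):
        if ("L", u, c) in seen or ("R", v, c) in seen:
            return False
        seen.update({("L", u, c), ("R", v, c)})
    return True
```

Each closed path in a bipartite graph has even length, so alternating along it gives every vertex on the path equally many edges in each half. That is why a 2^i-regular graph reaches 1-regular graphs (perfect matchings) after i rounds.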
Implementation [Figure: Request graph → Scheduler → Permutation → Route connections → Paths]
Implementation: Pros and cons. Pros: • A rearrangeably non-blocking switch can perform any permutation. • A cell switch is time-slotted, so all connections are rearranged every time slot anyway. Cons: • Rearrangement algorithms are complex (in addition to the scheduler). Can we eliminate the need to rearrange?
Strictly non-blocking Clos Network Clos’ Theorem: If k >= 2n – 1, then a new connection can always be added without rearrangement.
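[Figure: 3-stage Clos network with first-stage switches I1 … Im (n x k), middle-stage switches M1 … Mk (m x m), and third-stage switches O1 … Om (k x n), connecting N inputs to N outputs.] N = n x m, k >= n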
Clos Theorem [Figure: first-stage switch Ia and third-stage switch Ob, each with n ports and k links to the middle stage; n – 1 links already in use at input and output.] • Consider adding the n-th connection between 1st stage Ia and 3rd stage Ob. • We need to ensure that there is always some center-stage M available. • If k > (n – 1) + (n – 1), then there is always an M available, i.e. we need k >= 2n – 1.
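The counting argument can be exercised with a small sketch in which each new connection simply takes the lowest-numbered middle-stage switch that is free at both its first-stage and third-stage switch, with no rearrangement. The class name `ClosFabric` and the 0-based port numbering are assumptions for illustration.

```python
class ClosFabric:
    """3-stage Clos network that never rearranges existing connections."""

    def __init__(self, n, m, k):
        self.n, self.m, self.k = n, m, k
        self.in_busy = [set() for _ in range(m)]    # middle switches used by each Ia
        self.out_busy = [set() for _ in range(m)]   # middle switches used by each Ob
        self.route = {}                             # (input, output) -> middle switch

    def connect(self, inp, out):
        """Add a connection; return its middle switch, or None if blocked."""
        a, b = inp // self.n, out // self.n
        free = set(range(self.k)) - self.in_busy[a] - self.out_busy[b]
        if not free:
            return None
        mid = min(free)
        self.in_busy[a].add(mid)
        self.out_busy[b].add(mid)
        self.route[(inp, out)] = mid
        return mid

    def disconnect(self, inp, out):
        """Remove an existing connection and release its middle switch."""
        mid = self.route.pop((inp, out))
        self.in_busy[inp // self.n].discard(mid)
        self.out_busy[out // self.n].discard(mid)


def replay(k):
    """Replay one connect/disconnect sequence on an n = 2, m = 2 network."""
    sw = ClosFabric(n=2, m=2, k=k)
    sw.connect(0, 0)
    sw.connect(1, 2)
    sw.disconnect(0, 0)
    sw.connect(2, 0)
    return sw.connect(0, 1)    # input 0 and output 1 are both idle here

print(replay(k=2))   # k = n
print(replay(k=3))   # k = 2n - 1
```

With k = n = 2 the final request finds no middle switch free at both ends even though its ports are idle, so it would need a rearrangement. With k = 2n – 1 = 3 at most (n – 1) + (n – 1) = 2 middle switches can be ruled out, so one always remains.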
Scaling Crossbars: Summary • Scaling capacity through parallelism (bit-slicing and time-slicing) is straightforward. • Scaling number of ports is harder… • Clos network: • Rearrangeably non-blocking with k = n, but routing is complicated. • Strictly non-blocking with k >= 2n – 1, so routing is simple, but this requires more bisection bandwidth.