Chapter 10: Scalable Interconnection Networks (EECS 570: Fall 2003, rev1)
Goals • Low latency • to neighbors • to average/furthest node • High bandwidth • per-node and aggregate • bisection bandwidth • Low cost • switch complexity, pin count • wiring cost (connectors!) • Scalability
Requirements from Above • Communication-to-computation ratio implies bandwidth needs • local or global? • regular or irregular (arbitrary)? • bursty or uniform? • broadcasts? multicasts? • Programming Model • protocol structure • transfer size • importance of latency vs. bandwidth
Basic Definitions A switch is a device capable of transferring data from its input ports to its output ports in an arbitrary pattern. A network is a graph of V = {switches/nodes} connected by communication channels (aka links) C ⊆ V × V • direct – a node at each switch • indirect – nodes only at the edge (like the internet) • route: the sequence of links a message follows from node A to node B
What characterizes a network? • Topology • interconnection structure of the graph • Switching Strategy • circuit vs. packet switching • store-and-forward vs. cut-through • virtual cut-through vs. wormhole, etc. • Routing Algorithm • how is route determined • Control Strategy • centralized vs. distributed • Flow Control Mechanism • when does a packet (or a portion of it) move along its route • can't send two packets down same link at same time
Topologies Topology determines many critical parameters: • degree: number of input (output) ports per switch • distance: number of links in route from A to B • average distance: average over all (A,B) pairs • diameter: max distance between two nodes (using shortest paths) • bisection: min number of links to separate into two halves
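These parameters can be computed directly from the graph; below is a minimal sketch assuming an undirected adjacency-list representation (function names are illustrative; bisection is omitted because it requires a min-cut computation).

```python
# Illustrative sketch: degree, diameter, and average distance of a topology
# given as an undirected adjacency list {node: [neighbors]}.
from collections import deque

def hop_counts(adj, src):
    """BFS hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def topology_metrics(adj):
    dists = [d for n in adj for d in hop_counts(adj, n).values() if d > 0]
    return {
        "degree": max(len(adj[n]) for n in adj),
        "diameter": max(dists),
        "avg_distance": sum(dists) / len(dists),
    }

# Example: a 4-node ring
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(topology_metrics(ring))  # degree 2, diameter 2, avg distance ~1.33
```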
Bus • Fully connected network topology • bus plus interface logic is a form of switch • Parameters: • diameter = average distance = 1 • degree = N • switch cost O(N) • wire cost constant • bisection bandwidth constant (or worse) • Broadcast is free
Crossbar • Another fully connected network topology • one big switch of degree N connects all nodes • Parameters: • diameter = average distance = O(1) • degree = N • switch cost O(N²) • wire cost 2N • bisection bandwidth O(N) • Most switches in other topologies are crossbars inside
How to build a crossbar
Switches [Figure: switch microarchitecture – input ports with receivers and input buffers feed a crossbar, which drives output buffers and transmitters at the output ports, all under routing and scheduling control]
Linear Arrays and Rings • Linear Array • Diameter? N - 1 • Avg. Distance? N/3 • Bisection? 1 • Torus (ring): links may be unidirectional or bidirectional • Examples: FDDI, SCI, FiberChannel, KSR1
Multidimensional Meshes and Tori • d-dimensional k-ary mesh: N = k^d, i.e. k = N^(1/d) • k nodes in each of d dimensions • torus has wraparound, array doesn't; "mesh" ambiguous • general d-dimensional mesh • N = k_{d-1} × ... × k_0 nodes
Properties (N = k^d, k = N^(1/d)) • Diameter? d(k-1) • Average Distance? d × k/3 for mesh • Degree? • Switch Cost? • Wire Cost? • Bisection? k^(d-1)
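As a sanity check, these closed forms can be tabulated for different radix/dimension choices; a small sketch using exactly the formulas from this and the later "Equal cost" slide (the helper name is illustrative):

```python
# Illustrative sketch: closed-form parameters of a k-ary d-dimensional mesh,
# following the formulas on the slides (N = k**d).
def kary_d_cube(k, d):
    return {
        "nodes": k ** d,
        "diameter": d * (k - 1),
        "avg_distance_mesh": d * k / 3,
        "bisection_links": k ** (d - 1),   # per direction, = N/k
    }

# Same node count (1024), different dimension/radix trade-offs
for k, d in [(32, 2), (4, 5), (2, 10)]:
    print(k, d, kary_d_cube(k, d))
```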
Hypercubes
Trees Usually indirect, occasionally direct Diameter and avg. distance are logarithmic • k-ary tree, height d = log_k N Fixed degree Route up to common ancestor and down • benefit for local traffic Bisection?
Fat-Trees Fatter links (really more of them) as you go up, so bisection BW scales with N
Butterflies • Tree with lots of roots! • N log N switches (actually (N/2) × log N) • Exactly one route from any source to any dest • R = A xor B; at level i use the "straight" edge if r_i = 0, otherwise the cross edge • Bisection N/2 (vs. N^((d-1)/d) for a d-dimensional mesh)
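The single butterfly route can be written as a bit-by-bit decision; here is a sketch assuming an N-input butterfly with log2 N levels (the function name and edge labels are illustrative):

```python
# Illustrative sketch: butterfly routing. R = A xor B; at level i take the
# "straight" edge if bit i of R is 0, otherwise take the "cross" edge.
def butterfly_route(src, dst, levels):
    r = src ^ dst
    return ["straight" if (r >> i) & 1 == 0 else "cross"
            for i in reversed(range(levels))]   # most significant level first

print(butterfly_route(0b0110, 0b0011, levels=4))
# ['straight', 'cross', 'straight', 'cross'] -- crosses exactly where src and dst differ
```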
Benes Networks and Fat Trees • Back-to-back butterfly can route all permutations • offline • What if you just pick a random midpoint?
Relationship of Butterflies to Hypercubes • Wiring is isomorphic • Except that the butterfly always takes log N steps
Summary of Topologies (Diam/Ave Dist evaluated at N = 1024)

Topology | Degree | Diameter | Ave Dist | Bisection | Diam/Ave Dist (N=1024)
1D Array | 2 | N-1 | N/3 | 1 | 1023 / 341
1D Ring | 2 | N/2 | N/4 | 2 | 512 / 256
2D Array | 4 | 2(N^(1/2) - 1) | (2/3)N^(1/2) | N^(1/2) | 63 / 21
3D Array | 6 | 3(N^(1/3) - 1) | N^(1/3) | N^(2/3) | ~30 / ~10
2D Torus | 4 | N^(1/2) | (1/2)N^(1/2) | 2N^(1/2) | 32 / 16
k-ary n-cube | 2n | nk/2 | nk/4 | 2k^(n-1) | 15 / 7.5 (at n=3)
Hypercube | n = log N | n | n/2 | N/2 | 10 / 5
2D Tree | 3 | 2 log2 N | ~2 log2 N | 1 | 20 / ~20
4D Tree | 5 | 2 log4 N | ~2 log4 N - 2/3 | 1 | 10 / ~9
2D Fat Tree | 4 | 2 log2 N | ~2 log2 N | N | 20 / ~20
2D Butterfly | 4 | log2 N | log2 N | N/2 | 20 / 20
Choosing a Topology Cost vs. performance For fixed cost which topology provides best performance? • best performance on what workload? • message size • traffic pattern • define cost • target machine size Simplify tradeoff to dimension vs. radix • restrict to k-ary d-cubes • what is best dimension?
How Many Dimensions in a Network? d = 2 or d = 3 • Short wires, easy to build • Many hops, low bisection bandwidth • Benefits from traffic locality d ≥ 4 • Harder to build, more wires, longer average length • Higher switch degree • Fewer hops, better bisection bandwidth • Handles non-local traffic better Effect of # of hops on latency depends on switching strategy...
Store&Forward vs. Cut-Through Routing • messages typically fragmented into packets and pipelined • cut-through pipelines on flits
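The contrast shows up directly in the unloaded-latency model; the sketch below uses the conventional formulation (h hops, an L-bit packet, channel bandwidth b bits/cycle, per-switch routing delay delta; these symbols are the usual ones, not taken from the slide):

```python
# Illustrative sketch: unloaded latency under the two switching strategies.
def store_and_forward(L, b, h, delta):
    # Every switch receives the whole packet before forwarding it.
    return h * (L / b + delta)

def cut_through(L, b, h, delta):
    # Only the header pays the per-hop delay; the body is pipelined behind it.
    return L / b + h * delta

L, b, delta = 1024, 1.0, 8          # bits, bits/cycle, cycles per hop
for h in (2, 5, 10):
    print(h, store_and_forward(L, b, h, delta), cut_through(L, b, h, delta))
```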
Handling Output Contention What if the output is blocked? • virtual cut-through • a switch with a blocked output buffers the entire packet • degenerates to store-and-forward under contention • requires lots of buffering in switches • wormhole • leave flits strung out over the network (in buffers) • minimal switch buffering • one blocked packet can tie up lots of channels
Traditional Scaling: Latency(P) Assumes equal channel width • independent of node count or dimension • dominated by average distance
Average Distance but equal channel width is not equal cost! Higher dimension => more channels average distance = d(k-1)/2
In the 3-D world For n nodes, bisection area is O(n^(2/3)) For large n, bisection bandwidth is limited to O(n^(2/3)) • Dally, IEEE TPDS, [Dal90a] • For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better) • i.e., a few short fat wires are better than many long thin wires • What about many long fat wires?
Equal cost in k-ary n-cubes Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length? What do we know? switch degree: d diameter = d(k-1) total links = Nd pins per node = 2wd bisection = k^(d-1) = N/k links in each direction 2Nw/k wires cross the middle
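These identities are easy to tabulate when comparing design points at equal N; a minimal sketch that directly encodes the expressions above (the function name is illustrative):

```python
# Illustrative sketch: cost quantities for a k-ary d-cube with channel width w,
# using the identities from the slide (N = k**d).
def kary_dcube_costs(k, d, w):
    N = k ** d
    return {
        "N": N,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links_per_direction": k ** (d - 1),  # = N / k
        "bisection_wires": 2 * N * w // k,
    }

print(kary_dcube_costs(k=32, d=2, w=32))
print(kary_dcube_costs(k=4,  d=5, w=32))   # same N = 1024, higher dimension
```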
Latency(d) for P with Equal Width • total links(N) = Nd
Latency with Equal Pin Count Baseline d=2, has w = 32 (128 wires per node) fix 2dw pins => w(d) = 64/d distance down with increasing d, but channel time up
Latency with Equal Bisection Width • N-node hypercube has N bisection links • 2-d torus has 2N^(1/2) • Fixed bisection => w(d) = N^(1/d)/2 = k/2 • 1 M nodes, d=2 has w=512!
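The two normalizations can be compared side by side; a sketch under the stated baselines (128 pins per node for the equal-pin case), with a cut-through-style latency model whose hop count and constants are assumptions for illustration only:

```python
# Illustrative sketch: channel width under equal-pin and equal-bisection
# constraints, and the resulting unloaded latency for an L-bit packet.
def width_equal_pins(d, pins=128):
    return pins // (2 * d)              # fix 2*d*w pins  =>  w = pins/(2d)

def width_equal_bisection(N, d):
    k = round(N ** (1 / d))
    return max(k // 2, 1)               # w(d) = N**(1/d) / 2 = k/2

def latency(N, d, w, L=512, delta=1):
    k = round(N ** (1 / d))
    hops = d * (k - 1) / 2              # ~half the diameter (assumed average)
    return hops * delta + L / max(w, 1) # header hops + serialization time

N = 1024
for d in (2, 3, 5, 10):
    w = width_equal_pins(d)
    print(d, w, width_equal_bisection(N, d), round(latency(N, d, w), 1))
```

The output illustrates the slide's point: as d grows, hop count falls but the per-node pin budget forces narrower channels, so serialization time rises.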
Larger Routing Delay (w/equal pin) • ratio of routing to channel time is key
Topology Summary Rich set of topological alternatives with deep relationships Design point depends heavily on cost model • nodes, pins, area, … • Need for downward scalability tends to fix the dimension • high-degree switches are wasted in small configurations • grow the machine by increasing nodes per dimension Need a consistent framework and analysis to separate opinion from design Optimal point changes with technology • store-and-forward vs. cut-through • non-pipelined vs. pipelined signaling
Real Machines • Wide links, smaller routing delay • Tremendous variation
What characterizes a network? Topology • interconnection structure of the graph Switching Strategy • circuit vs. packet switching • store-and-forward vs. cut-through • virtual cut-through vs. wormhole, etc. Routing Algorithm • how is route determined Control Strategy • centralized vs. distributed Flow Control Mechanism • when does a packet (or a portion of it) move along its route • can't send two packets down same link at same time
Typical Packet Format Two basic mechanisms for abstraction • encapsulation • fragmentation
Routing Recall: routing algorithm determines • which of the possible paths are used as routes • how the route is determined • R: N × N → C, which at each switch maps the destination node n_d to the next channel on the route Issues: • Routing mechanism • arithmetic • source-based port select • table driven • general computation • Properties of the routes • Deadlock free
Routing Mechanism need to select output port for each input packet • in a few cycles • Simple arithmetic in regular topologies • ex: Δx, Δy routing in a grid • west (-x) Δx < 0 • east (+x) Δx > 0 • south (-y) Δx = 0, Δy < 0 • north (+y) Δx = 0, Δy > 0 • processor Δx = 0, Δy = 0 Reduce relative address of each dimension in order • Dimension-order routing in k-ary d-cubes • e-cube routing in n-cube
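A minimal sketch of the Δx/Δy mechanism for a 2-D grid, mirroring the port rules above (coordinate and port names are illustrative):

```python
# Illustrative sketch: dimension-order (x before y) output-port selection
# for a 2-D grid, following the delta-x / delta-y rules above.
def select_port(cur, dst):
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0: return "west"     # reduce x first
    if dx > 0: return "east"
    if dy < 0: return "south"    # x done, now reduce y
    if dy > 0: return "north"
    return "processor"           # arrived at the destination

print(select_port((2, 3), (0, 5)))  # 'west'
print(select_port((0, 3), (0, 5)))  # 'north'
```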
Routing Mechanism (cont) Source-based • message header carries a series of port selects (P3 P2 P1 P0) • used and stripped en route • CRC? header length … • CS-2, Myrinet, MIT Artic Table-driven • message header carries an index for the next port at the next switch • o = R[i] • table also gives the index for the following hop • o, i′ = R[i] • ATM, HPPI
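Both mechanisms fit in a few lines; the sketch below is illustrative only (the header layout and table shape are assumptions, not any particular machine's format):

```python
# Illustrative sketch: the two header-driven routing mechanisms above.

# Source-based: the header is a list of port selects; each switch uses and
# strips the leading one, forwarding the rest.
def source_route_step(header):
    return header[0], header[1:]

# Table-driven: the header carries an index i; the switch's table gives the
# output port o and the index i_next to present at the next switch.
def table_route_step(table, i):
    o, i_next = table[i]
    return o, i_next

hdr = [3, 0, 2, 1]                        # P3 P2 P1 P0-style port selects
print(source_route_step(hdr))             # (3, [0, 2, 1])
print(table_route_step({7: (2, 4)}, 7))   # (2, 4)
```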
Properties of Routing Algorithms Deterministic • route determined by (source, dest), not intermediate state (i.e. traffic) Adaptive • route influenced by traffic along the way Minimal • only selects shortest paths Deadlock free • no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom How can it arise? • necessary conditions: • shared resource • incrementally allocated • non-pre-emptible • think of a channel as a shared resource that is acquired incrementally • source buffer then dest. buffer • channels along a route How do you avoid it? • constrain how channel resources are allocated • ex: dimension order How do you prove that a routing algorithm is deadlock free?
Proof Technique Resources are logically associated with channels Messages introduce dependences between resources as they move forward Need to articulate possible dependences between channels Show that there are no cycles in Channel Dependence Graph • find a numbering of channel resources such that every legal route follows a monotonic sequence => no traffic pattern can lead to deadlock Network need not be acyclic, only channel dependence graph
Example: k-ary 2D array Theorem: dimension-order x,y routing is deadlock free • Numbering • +x channel (i,y) → (i+1,y) gets i • similarly for -x, with 0 at the most positive edge • +y channel (x,j) → (x,j+1) gets N+j • similarly for -y channels Any routing sequence: x direction, turn, y direction is increasing
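The numbering argument can be checked mechanically: number each channel as above and verify that every dimension-order route visits strictly increasing numbers. A sketch for a k-ary 2-D array (the numbering follows the slide; the helper names are illustrative):

```python
# Illustrative sketch: channel numbering for dimension-order (x then y)
# routing on a k-ary 2-D array, plus a monotonicity check on sample routes.
def channel_number(k, frm, to):
    (x0, y0), (x1, y1) = frm, to
    N = k * k
    if y0 == y1:                                     # an x channel
        return x0 if x1 > x0 else (k - 2 - x1)       # +x gets i; -x: 0 at most positive edge
    return N + y0 if y1 > y0 else N + (k - 2 - y1)   # +y gets N+j; -y numbered likewise

def xy_route(src, dst):
    """Dimension-order route as a list of channel hops (x first, then y)."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:
        nxt = x + (1 if dx > x else -1)
        hops.append(((x, y), (nxt, y))); x = nxt
    while y != dy:
        nxt = y + (1 if dy > y else -1)
        hops.append(((x, y), (x, nxt))); y = nxt
    return hops

k = 4
for src in [(0, 0), (3, 2)]:
    for dst in [(2, 3), (1, 0)]:
        nums = [channel_number(k, a, b) for a, b in xy_route(src, dst)]
        assert nums == sorted(set(nums)), (src, dst, nums)   # strictly increasing
print("all sampled dimension-order routes follow increasing channel numbers")
```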
Channel Dependence Graph
More examples Why is the obvious routing on X deadlock free? • butterfly? • tree? • fat tree? Any assumptions about routing mechanism? amount of buffering? What about wormhole routing on a ring?
Deadlock free wormhole networks? Basic dimension-order routing doesn't work for k-ary d-cubes • only for k-ary d-arrays (bi-directional, no wrap-around) Idea: add channels! • provide multiple "virtual channels" to break the dependence cycle • good for BW too! • Don't need to add links or crossbars, only buffer resources • This adds nodes to the CDG; remove edges?
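A standard instance of the idea is a "dateline" scheme on each ring of the torus: a packet starts on virtual channel 0 and moves to virtual channel 1 once it crosses the wrap-around link, which breaks the cycle in the channel dependence graph. A sketch (the dateline placement, direction, and names are assumptions for illustration):

```python
# Illustrative sketch: dateline virtual-channel selection on a k-node
# unidirectional ring. Crossing the wrap-around edge (k-1 -> 0) moves the
# packet from VC 0 to VC 1, so no cycle of waiting channels can form.
def next_hop(k, cur, vc):
    if cur == k - 1:     # this hop uses the wrap-around (dateline) link
        vc = 1
    return (cur + 1) % k, vc

def ring_route(k, src, dst):
    hops, cur, vc = [], src, 0
    while cur != dst:
        cur, vc = next_hop(k, cur, vc)
        hops.append((cur, vc))
    return hops

print(ring_route(8, 6, 2))
# [(7, 0), (0, 1), (1, 1), (2, 1)] -- the VC changes after the wrap link
```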
Breaking deadlock with virtual channels
Turn Restrictions in X, Y • XY routing forbids 4 of 8 turns and leaves no room for adaptive routing • Can you allow more turns and still be deadlock free?
Minimal turn restrictions in 2D [Figure: on the +x/-x/+y/-y grid, the turns forbidden by each minimal turn model, e.g. west-first, north-last, and negative-first]
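Turn-model restrictions like these reduce to a predicate over (incoming, outgoing) direction pairs; a sketch of the west-first rule used on the next slide (the direction encoding is an assumption for illustration):

```python
# Illustrative sketch: the west-first turn model -- all westward (-x) hops
# must come first, so no turn *into* west is allowed.
def allowed_turn_west_first(frm, to):
    return to != "west" or frm == "west"   # continuing west is not a turn

for frm, to in [("north", "west"), ("west", "north"),
                ("south", "east"), ("east", "west")]:
    verdict = "allowed" if allowed_turn_west_first(frm, to) else "forbidden"
    print(f"{frm} -> {to}: {verdict}")
# north -> west: forbidden, west -> north: allowed,
# south -> east: allowed,  east -> west: forbidden
```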
Example legal west-first routes • Can route around failures or congestion • Can combine turn restrictions with virtual channels