Near-Optimal Oblivious Routing for 3D-Mesh Networks ICCD 2008. Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering Department University of California, San Diego. Motivation: Networks-on-Chip. Chip-multiprocessors (CMPs) increasingly popular
Near-Optimal Oblivious Routing for 3D-Mesh Networks ICCD 2008 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering Department University of California, San Diego
Motivation: Networks-on-Chip • Chip-multiprocessors (CMPs) increasingly popular • 2D-mesh networks often used as on-chip fabric [Figure: Tilera Tile64 and Intel 80-core die layouts; annotated dimensions: 12.64mm, 21.72mm, single tile 1.5mm x 2.0mm, I/O areas]
Motivation: 3D Integrated Circuits [Figure: 2D to 3D] • 3D Benefits • Reduced wire delays • Enormous bandwidth • Heterogeneous system integration • Natural progression • 3D-mesh for 3D CMPs
Routing Algorithm Objectives • Maximize throughput • How much load the network can handle • Minimize hop count • Minimize routing delay between source and destination
Challenges • For the 2D case, a routing algorithm with near-optimal throughput and minimal hop count, called O1TURN, is known [Seo’05]. • Surprisingly, the optimality of O1TURN does not extend to the 3D case: its worst-case throughput degrades severely. • The only known routing algorithm with optimal throughput is Valiant (VAL) load-balancing, but VAL performs poorly on hop count (latency), which is twice that of minimal routing.
Main Contribution • Developed a new oblivious routing algorithm called “Randomized Partially Minimal” (RPM) routing. • RPM provably guarantees near-optimal worst-case throughput in the 3D case. • Optimal for even radix k (e.g. 8 x 8 x 8 mesh). • Within a factor of 1/k^2 of optimal for odd radix (e.g. 7 x 7 x 7 mesh). • Good latency performance. • Average hop count only 1.33 times that of minimal routing (much better than the 2x cost of VAL, the only other known routing algorithm with optimal throughput). • In practice, 3D meshes are asymmetric because the number of device layers is smaller than the number of tiles per edge. • e.g., for a 16 x 16 x 4 mesh (4 layers), RPM’s hop count is just 1.1 times that of minimal routing.
Outline • Motivation for our work • Existing 2D routing algorithms don’t extend well into 3D • RPM routing algorithm • Simulation results • Extensions and future work
Existing Routing Algorithms: The 2D case • Dimension-Ordered Routing (DOR): route minimal XY • Valiant load-balancing (VAL): route source → randomly chosen intermediate node → destination, minimal XY in both phases • ROMM: same as VAL, but intermediate node restricted to minimal direction • Orthogonal 1-TURN (O1TURN): route minimal XY and YX with equal probability
Existing Routing Algorithms: Extending to the 3D case … • Dimension-Ordered Routing (DOR): route minimal XYZ • Valiant load-balancing (VAL): route source → randomly chosen intermediate node → destination, minimal XYZ in both phases • ROMM: same as VAL, but intermediate node restricted to minimal direction • Orthogonal 1-TURN (O1TURN): route along one of 6 minimal orthogonal paths (XYZ, XZY, YXZ, YZX, ZXY, ZYX) with equal probability
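The four baseline algorithms in their 3D form can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the tuple representation of nodes, and the reading of ROMM's "minimal direction" as "intermediate node inside the source-destination bounding box" are all assumptions.

```python
import random
from itertools import permutations

def walk(src, dst, order):
    """Minimal path from src to dst, correcting dimensions in the given order."""
    path, cur = [tuple(src)], list(src)
    for dim in order:
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

def dor(src, dst):
    """Dimension-ordered routing: always X, then Y, then Z."""
    return walk(src, dst, (0, 1, 2))

def o1turn(src, dst, rng):
    """Pick one of the 6 dimension orders uniformly at random."""
    order = rng.choice(list(permutations((0, 1, 2))))
    return walk(src, dst, order)

def val(src, dst, k, rng):
    """Valiant: DOR to a uniformly random node of the k x k x k mesh, then DOR to dst."""
    mid = tuple(rng.randrange(k) for _ in range(3))
    return dor(src, mid) + dor(mid, dst)[1:]

def romm(src, dst, rng):
    """Two-phase ROMM (assumed form): intermediate node drawn from the
    bounding box of src and dst, so the overall path stays minimal."""
    mid = tuple(rng.randint(min(s, d), max(s, d)) for s, d in zip(src, dst))
    return dor(src, mid) + dor(mid, dst)[1:]
```

The hop count of a returned path is `len(path) - 1`; DOR, ROMM, and O1TURN always give the Manhattan distance, while VAL can exceed it.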
Worst-Case Throughput • Best theoretical normalized worst-case throughput known to be 50% (well-known result). • Worst-case throughput analysis can be reduced to a maximal weighted matching problem [Towles’02]. • VAL achieves this optimal throughput, but has poor latency. • As shown next, DOR, ROMM, and O1TURN are all far from optimal in 3D.
Poor Worst-Case Throughput [Figure: worst-case throughput plot; legend entry "VAL/Optimal"; annotation "Only 6-15%"]
How do 2D mesh algorithms fare in 3D? (8 x 8 x 8 Network) • Worst-case throughput of DOR, ROMM, and O1TURN is far from optimal • Average hop count of VAL is far from minimal • Need a routing algorithm that can trade latency for worst-case throughput
                                         VAL    DOR     ROMM    O1TURN
Normalized Worst-Case Throughput         0.5    0.063   0.132   0.15
Normalized Average-Case Throughput       0.5    0.316   0.454   0.513
Hop Count (normalized to minimal)        2      1       1       1
Why does O1TURN perform poorly in 3D? • O1TURN: worst-case throughput is optimal in 2D but more than 3 times worse than optimal in 3D • The difference • A 2D traffic matrix is “admissible” for a 2D mesh • In 3D, the projected traffic on each 2D plane is no longer admissible! • Can we transform the 3D routing problem into routing admissible traffic on each 2D plane?
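A small numeric check can make the admissibility argument concrete. The sketch below rests on assumptions not stated on the slide: "admissible" is taken to mean that no node of a plane sources or sinks more than one unit of traffic, each node injects one unit under a permutation pattern, and the helper names are hypothetical. It compares the XY traffic seen by one layer when XY routing happens in the source layer against the case where every flow is spread evenly over all k layers.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def dor_wc(node, k):
    """Permutation (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def source_layer_max_sink(k, perm, z):
    """XY routed in the source layer z: max traffic aimed at one (x, y) of that layer."""
    sink = Counter()
    for x, y in product(range(k), repeat=2):
        dst = perm((x, y, z), k)
        sink[dst[:2]] += Fraction(1)
    return max(sink.values())

def balanced_layer_max_loads(k, perm):
    """Every flow sends 1/k of its traffic through each layer:
    max traffic sourced and sunk per (x, y) on any single layer."""
    source, sink = Counter(), Counter()
    for src in product(range(k), repeat=3):
        dst = perm(src, k)
        source[src[:2]] += Fraction(1, k)
        sink[dst[:2]] += Fraction(1, k)
    return max(source.values()), max(sink.values())
```

With k = 4, `source_layer_max_sink(4, dor_wc, 0)` is 4 (k flows of one layer converge on the same XY position, so the layer's 2D traffic is not admissible), whereas `balanced_layer_max_loads(4, dor_wc)` is (1, 1).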
Outline • Motivation for our work • Existing 2D algorithms don’t extend well into 3D • RPM routing algorithm • Simulation results • Extensions and future work
Randomized Partially-Minimal Routing (RPM) • Phase 1: Z routing from the source to a random intermediate layer • XY or YX routing on the intermediate layer • Phase 2: Z routing from the intermediate layer to the destination [Figure: 3D mesh with X, Y, Z axes showing source, random intermediate layer, and destination]
Main Idea • Load-balance uniformly across the vertical layers • Minimal XY/YX routing used on each layer • Main Result: RPM has near-optimal worst-case throughput • Achieves optimal worst-case throughput when the network radix k is even • Within a factor of 1/k^2 of optimal when k is odd.
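The route construction described above (Z to a random layer, minimal XY or YX on that layer, Z to the destination) can be sketched as follows. This is a minimal illustration under stated assumptions: `rpm_route` and its parameters are hypothetical names, the intermediate layer is drawn uniformly from all layers, and XY versus YX is chosen with probability 1/2 each.

```python
import random

def rpm_route(src, dst, layers, rng):
    """Return the list of nodes visited by one randomly drawn RPM path."""
    zi = rng.randrange(layers)          # random intermediate layer
    xy_first = rng.random() < 0.5       # XY or YX on that layer
    path = [tuple(src)]

    def move(dim, target):
        # Step one hop at a time along `dim` until it matches `target`.
        cur = list(path[-1])
        step = 1 if target > cur[dim] else -1
        while cur[dim] != target:
            cur[dim] += step
            path.append(tuple(cur))

    move(2, zi)                         # phase 1: source to intermediate layer
    plane_order = (0, 1) if xy_first else (1, 0)
    for dim in plane_order:             # minimal XY or YX on the layer
        move(dim, dst[dim])
    move(2, dst[2])                     # phase 2: intermediate layer to destination
    return path
```

For example, `rpm_route((0, 0, 0), (3, 2, 1), layers=4, rng=random.Random(0))` always starts at the source and ends at the destination; only the Z portion can be non-minimal, which is why the scheme is called partially minimal.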
RPM achieves Near-Optimal Worst-Case Throughput (optimal for even radix) [Figure: worst-case throughput plot; legend entries "VAL/Optimal" and "RPM"]
Average-Case Throughput • RPM outperforms VAL, DOR, ROMM and O1TURN in average-throughput on randomly generated traffic.
Average Hop Count • Normalized hop count of RPM • Symmetric Meshes - 1.33 times minimal compared to 2x for VAL • Asymmetric 16x16x4 Mesh – 1.1 times minimal
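The 1.33 and 1.1 figures can be reproduced with a short calculation. The sketch assumes hop count is averaged over uniformly random source-destination pairs and that RPM picks its intermediate layer uniformly at random; the function names are hypothetical.

```python
from fractions import Fraction

def mean_dist(k):
    """Average |a - b| over all ordered pairs of positions on a k-node line."""
    total = sum(abs(a - b) for a in range(k) for b in range(k))
    return Fraction(total, k * k)

def rpm_hop_ratio(kx, ky, kz):
    """Average RPM hop count divided by average minimal hop count."""
    minimal = mean_dist(kx) + mean_dist(ky) + mean_dist(kz)
    # RPM is minimal in X and Y but makes two independent Z trips:
    # source layer -> random layer, then random layer -> destination layer.
    rpm = mean_dist(kx) + mean_dist(ky) + 2 * mean_dist(kz)
    return rpm / minimal

print(rpm_hop_ratio(8, 8, 8))    # 4/3
print(rpm_hop_ratio(16, 16, 4))  # 21/19, about 1.105
```

Under the same assumptions VAL makes two independent trips in every dimension, which gives the 2x figure.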
Outline • Motivation for our work • Existing 2D routing algorithms don’t extend well into 3D • RPM routing algorithm • Simulation results • Extensions and future work
Flit-Level Simulation • Ideal throughput evaluation assumes • Ideal single-cycle router • Infinite buffers • No contention in switches, no flow control • Flit-level simulation • PopNet network simulator • 4 stage router pipeline – Route computation, VC allocation, Switch arbitration, Link traversal • Credit-based flow control • 8 virtual channels, each 5 flits deep • Multi-flit packets injected into the network (5 flits/packet)
Flit-Level Simulation (cont’d) • Network configurations simulated • 4 x 4 x 4 Mesh • 8 x 8 x 8 Mesh • 16 x 16 x 4 Mesh • Routing algorithms compared: DOR, VAL, ROMM, O1TURN, DUATO, RPM • DUATO is a minimal adaptive routing algorithm implemented for comparison • Four different traffic traces used • Transpose traffic – (x,y,z) → (y,z,x) • Complement traffic – (x,y,z) → (k-x-1, k-y-1, k-z-1) • Uniform traffic • Worst Case traffic pattern for DOR (DOR-WC) – (x,y,z) → (k-z-1, k-y-1, k-x-1)
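The three permutation patterns listed above translate directly into code. The sketch assumes a symmetric k x k x k mesh with coordinates in 0..k-1 (how the patterns were adapted to the 16 x 16 x 4 mesh is not stated on the slide), and the function names are hypothetical.

```python
from itertools import product

def transpose(node):
    """Transpose traffic: (x, y, z) -> (y, z, x)."""
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_wc(node, k):
    """Worst-case pattern for DOR: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

def is_permutation(pattern, k):
    """True if every node of the k x k x k mesh is the destination of exactly one source."""
    nodes = list(product(range(k), repeat=3))
    return sorted(pattern(n) for n in nodes) == nodes
```

For instance, with k = 8 the node (0, 1, 2) sends to (1, 2, 0) under transpose, (7, 6, 5) under complement, and (5, 6, 7) under DOR-WC.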
Uniform Traffic [Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]
Transpose Traffic [Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]
Complement Traffic [Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]
DOR-WC Traffic [Figure panels: 8x8x8 Mesh, 16x16x4 Mesh]
To sum it up … • 3D IC technology is emerging. • Stacking cores in 3 dimensions offers several advantages over 2D placement of cores. • Minimal routing algorithms designed for 2D meshes have poor worst-case throughput in 3D, while VAL has a high latency penalty. • RPM trades off latency (partially minimal) for better worst-case performance (near-optimal).
Thank You Questions?