A Novel 3D Layer-Multiplexed On-Chip Network
Rohit Sunkam Ramanujam, Bill Lin
Electrical and Computer Engineering, University of California, San Diego
Networks-on-Chip
• Chip-multiprocessors (CMPs) increasingly popular
• 2D-mesh networks often used as on-chip fabric
[Figure: chip layouts of the Tilera Tile64 and the Intel 80-core processor; labels: 12.64 mm, 21.72 mm, single tile 1.5 mm × 2.0 mm, I/O areas]
3D Integrated Circuits
[Figure: device layer 1 and device layer 2 connected by a through-silicon via; annotations: ≥ 2 active device layers, short inter-layer distances]
• Reduced chip footprint
• Reduced wire delays
• High inter-layer bandwidth
• Heterogeneous system integration
Natural Progression: 3D Mesh for 3D CMPs
[Figure: a 2D mesh next to a 3D mesh]
What routing algorithms to use for 3D mesh networks?
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
Oblivious Routing Objectives
• Maximize throughput
  • Distribute traffic evenly on network links
  • Maximize worst-case throughput, as traffic is application dependent
• Minimize hop count
  • Minimize routing delay between source and destination
  • Reduce power
Routing Algorithms for 3D Mesh Networks
• Valiant Routing (VAL)
  • Optimal worst-case throughput
  • Poor latency
• Dimension Ordered Routing (DOR)
  • Minimal latency
  • Poor worst-case throughput
• O1TURN Routing
  • Minimal latency
  • Poor worst-case throughput
• Ideal routing algorithm
  • Minimal latency
  • Maximum worst-case throughput
[Figure: average hop count (normalized to minimal; ticks 1, 2) versus worst-case throughput (fraction of network capacity; ticks 0.25, 0.5), with points labeled VAL, DOR, O1TURN, and IDEAL]
Randomized Partially-Minimal Routing (RPM)
[Figure: 3D mesh with X, Y, Z axes showing a source, a destination, and a random intermediate layer]
• Phase-1 Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2 Z: intermediate layer to the destination
Main Idea
• Load-balance uniformly across the vertical layers
• 2 phases of vertical routing
• Minimal XY/YX routing used on each layer
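The three steps above (Z to a random layer, XY or YX on that layer, Z to the destination) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `rpm_route`, the `(x, y, z)` node encoding, and the 50/50 choice between XY and YX are hypothetical.

```python
import random

def rpm_route(src, dst, num_layers, rng=random):
    """Return the nodes visited by one RPM-style route from src to dst.

    Nodes are (x, y, z) tuples; z indexes the device layer.
    Assumption: XY and YX are chosen with equal probability.
    """
    mid = rng.randrange(num_layers)   # random intermediate layer, uniform over layers
    xy_first = rng.random() < 0.5     # assumed 50/50 split between XY and YX

    path = [src]

    def walk(axis, target):
        # Move one hop at a time along `axis` until that coordinate equals `target`.
        while path[-1][axis] != target:
            node = list(path[-1])
            node[axis] += 1 if target > node[axis] else -1
            path.append(tuple(node))

    walk(2, mid)                      # phase 1: Z, source layer to intermediate layer
    order = [(0, dst[0]), (1, dst[1])] if xy_first else [(1, dst[1]), (0, dst[0])]
    for axis, target in order:
        walk(axis, target)            # minimal XY or YX on the intermediate layer
    walk(2, dst[2])                   # phase 2: Z, intermediate layer to destination layer
    return path

if __name__ == "__main__":
    print(rpm_route((0, 1, 3), (3, 2, 0), num_layers=4))
```

The route is minimal in X and Y; only the Z dimension may take a detour, which is what "partially minimal" refers to.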
Routing Algorithms for 3D Mesh Networks
• Randomized Partially Minimal Routing (RPM)
  • Near-optimal worst-case throughput
  • Low latency
[Figure: average hop count (normalized to minimal; ticks 1, 1.1, 2) versus worst-case throughput (fraction of network capacity; ticks 0.25, 0.5), with points labeled VAL, RPM, DOR, O1TURN, and IDEAL; the 1.1 tick marks RPM]
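The latency side of this trade-off can be checked with a small calculation. The sketch below computes exact average hop counts for minimal routing (DOR and O1TURN), Valiant routing, and RPM. It assumes uniform traffic over all distinct source-destination pairs, a uniformly random intermediate node for VAL, and a uniformly random intermediate layer for RPM; the function names are hypothetical, and the resulting ratios need not match the values plotted on the slide, whose setup is not stated.

```python
from itertools import product

def manhattan(a, b):
    """Hop distance between two mesh nodes."""
    return sum(abs(p - q) for p, q in zip(a, b))

def average_hops(k, layers):
    """Average hop counts (minimal, VAL, RPM) on a k x k x layers mesh."""
    nodes = list(product(range(k), range(k), range(layers)))
    minimal = valiant = rpm = 0.0
    pairs = 0
    for src in nodes:
        for dst in nodes:
            if src == dst:
                continue
            pairs += 1
            minimal += manhattan(src, dst)
            # VAL: expected length via a uniformly random intermediate node
            valiant += sum(manhattan(src, m) + manhattan(m, dst) for m in nodes) / len(nodes)
            # RPM: minimal in X and Y, expected Z length via a random intermediate layer
            planar = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
            vertical = sum(abs(src[2] - m) + abs(m - dst[2]) for m in range(layers)) / layers
            rpm += planar + vertical
    return minimal / pairs, valiant / pairs, rpm / pairs

if __name__ == "__main__":
    dor, val, rpm = average_hops(4, 4)
    print(f"VAL/minimal = {val / dor:.2f}, RPM/minimal = {rpm / dor:.2f}")
```

Because RPM detours only in Z, its average hop count always lies between the minimal and the Valiant values.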
RPM has Near-optimal Worst-case Throughput
RPM is optimal for even radix k and within 1/k² of optimal for odd radix k.
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed (LM) 3D architecture
• Evaluation
Unique Features of 3D ICs
• Inter-layer distances are very small (~50 μm)
  • Order of magnitude lower than distances between adjacent tiles on a 2D plane (~1500 μm)
• Vertical interconnects implemented using Through-Silicon-Vias (TSVs) have very low delay
• Vertical bandwidth abundant, as TSVs can be densely packed in 2D with small via pitch (~4 μm)
• Number of device layers likely to remain small (4-5 layers) due to thermal and manufacturing issues
[Figure: a TSV spanning ~50 μm between layers compared with ~1500 μm between adjacent tiles; a TSV array with ~4 μm pitch in both directions]
RPM on a 3D Mesh
[Figure: 3D mesh with X, Y, Z axes showing a source, a destination, and a random intermediate layer]
• Phase-1 Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2 Z: intermediate layer to the destination
Proposed Layer-Multiplexed Architecture
[Figure: layer-multiplexed architecture with X, Y, Z axes; nodes labeled P1-P4; a source, a destination, and a random intermediate layer are marked]
• Phase-1 Z: source to the intermediate layer
• XY or YX routing on the intermediate layer
• Phase-2 Z: intermediate layer to the destination
• RPM routing adapted to the LM architecture: RPM-LM
Power and Area Savings
[Figure: P1-P4 in a conventional 3D mesh versus the layer-multiplexed architecture, with the packet injection demultiplexer and packet ejection multiplexer marked]
• Decouple vertical routing from horizontal routing
• Restrict vertical routing to packet injection and packet ejection
• 5x5 crossbar in LM vs. 7x7 crossbar in 3D mesh
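A back-of-the-envelope view of the crossbar comparison: a 3D mesh router needs 7 ports (four planar directions, up, down, and the local port), while an LM router needs only 5 (four planar directions and the local port). The sketch below uses the crosspoint count, ports squared, as a rough proxy for crossbar size. That proxy is an assumption for illustration only; it is not the Orion 2.0 model used in the evaluation, and it ignores buffers and the added injection and ejection logic.

```python
def crosspoints(ports):
    """Crosspoints in a full ports x ports crossbar (rough size proxy)."""
    return ports * ports

mesh_3d = crosspoints(7)   # 4 planar + up + down + local
lm = crosspoints(5)        # 4 planar + local

if __name__ == "__main__":
    print(f"3D mesh: {mesh_3d} crosspoints, LM: {lm} crosspoints")
    print(f"crossbar-only reduction: {1 - lm / mesh_3d:.0%}")  # prints 49%
```

The crossbar alone shrinks by roughly half under this proxy; the overall router savings reported later in the deck are smaller because other components are unchanged or added.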
Single Hop Vertical Communication
• Single hop vertical routing more power efficient than one-layer-per-hop routing
• Leverages short inter-layer distances in 3D ICs
• Better utilizes available vertical bandwidth
Packet Injection Demultiplexer
[Figure: block diagram of the packet injection demultiplexer. Inputs: P1-P4. Blocks: route selection/load balancing, VC allocation, flit counters, switch arbitration. Credits come in from the injection ports of the routers on layers 1-4. Outputs: the injection ports of the Layer 1 through Layer 4 routers.]
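A minimal sketch of two of the blocks in the diagram, layer selection and credit tracking, may help make the data flow concrete. Everything here is a simplifying assumption: the class name `InjectionDemux`, the uniform random layer choice, and the credit count of 5 are illustrative, and VC allocation, flit counters, and switch arbitration among P1-P4 are omitted.

```python
import random

class InjectionDemux:
    """Toy model: pick an intermediate layer, then send flits while credits last."""

    def __init__(self, num_layers=4, credits_per_layer=5, rng=None):
        self.num_layers = num_layers
        # One credit per free buffer slot at each layer's router injection port
        # (the value 5 is illustrative, not taken from the design).
        self.credits = [credits_per_layer] * num_layers
        self.rng = rng or random.Random()

    def select_layer(self):
        # Oblivious load balancing: a uniformly random layer per packet (assumed).
        return self.rng.randrange(self.num_layers)

    def try_send_flit(self, layer):
        # A flit may leave only if the downstream injection port has a free slot.
        if self.credits[layer] == 0:
            return False
        self.credits[layer] -= 1
        return True

    def credit_in(self, layer):
        # The router on `layer` freed a buffer slot and returned a credit.
        self.credits[layer] += 1

if __name__ == "__main__":
    demux = InjectionDemux(rng=random.Random(1))
    layer = demux.select_layer()
    sent = sum(demux.try_send_flit(layer) for _ in range(6))
    print(f"layer {layer}: sent {sent} of 6 flits before running out of credits")
```

With 5 credits, the sixth flit stalls until the router on that layer returns a credit.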
Packet Ejection Multiplexer
[Figure: block diagram of the packet ejection multiplexers for P1-P4. The multiplexer for P1 has inputs L1-P1 (router on layer 1), L2-P1 (packets from layer 2), L3-P1 (packets from layer 3), and L4-P1 (packets from layer 4), blocks labeled Arbiter and VCID, and credits out for L1-P1, L2-P1, L3-P1, and L4-P1. The multiplexer for P4 is drawn the same way, with inputs L1-P4 through L4-P4 and credits out for L1-P4, L2-P4, L3-P4, and L4-P4.]
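The diagram shows an arbiter choosing among the four layer inputs of each ejection multiplexer but does not name the arbitration policy. The sketch below assumes round-robin arbitration purely for illustration; the class name `RoundRobinArbiter` is hypothetical.

```python
class RoundRobinArbiter:
    """Grant one of several requesting inputs, rotating priority after each grant."""

    def __init__(self, num_inputs=4):
        self.num_inputs = num_inputs
        self.last = num_inputs - 1   # so that input 0 has top priority at start

    def grant(self, requests):
        """requests[i] is True if layer i has a flit for this processor.

        Returns the granted input index, or None if nothing is requesting.
        """
        for offset in range(1, self.num_inputs + 1):
            i = (self.last + offset) % self.num_inputs
            if requests[i]:
                self.last = i
                return i
        return None

if __name__ == "__main__":
    arb = RoundRobinArbiter()
    requests = [True, True, False, True]   # layers 1, 2, and 4 have packets
    print([arb.grant(requests) for _ in range(4)])   # [0, 1, 3, 0]
```

Rotating the priority keeps any one layer from starving the others when several layers deliver packets to the same processor.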
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
  • Power and Area
  • Performance
Power and Area Evaluation
• Used Orion 2.0 models for router power and area estimation
• 65 nm process at 1 V and 1 GHz
• Buffers
  • 4 VCs/port, 5 flits/VC for routers
  • 5 flits/port for packet injection demultiplexer
  • 5 flits/port for each packet ejection multiplexer
Power Comparison
• 3D mesh
  • One 7-port router per tile
• LM
  • One 5-port router per tile
  • One packet injection demultiplexer for every 4 tiles
  • One packet ejection multiplexer per tile
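To make the component inventory concrete, the sketch below counts the parts each design needs for a given mesh size, using only the per-tile ratios listed above. The function name and dictionary keys are hypothetical, and the counts say nothing about per-component power, which the deck takes from Orion 2.0.

```python
def component_counts(k_x, k_y, layers):
    """Component counts for a k_x x k_y x layers network under both designs."""
    tiles = k_x * k_y * layers
    mesh_3d = {"7-port routers": tiles}
    lm = {
        "5-port routers": tiles,
        "injection demultiplexers": tiles // 4,   # one for every 4 tiles
        "ejection multiplexers": tiles,           # one per tile
    }
    return mesh_3d, lm

if __name__ == "__main__":
    mesh_3d, lm = component_counts(4, 4, 4)
    print(mesh_3d)   # 64 routers
    print(lm)        # 64 routers, 16 demultiplexers, 64 multiplexers
```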
Power Evaluation
• 27% power reduction
Area Evaluation
• 26.5% area reduction
Outline
• Oblivious routing on a 3D mesh
• Layer-multiplexed 3D architecture
• Evaluation
  • Power and Area
  • Performance
RPM on a 3D mesh vs. RPM-LM
• Worst-case throughput
  • RPM-LM achieves the same (near-optimal) worst-case throughput as RPM
• Average-case throughput
Flit-Level Simulation
• Ideal throughput evaluation assumes
  • Ideal single-cycle router
  • Infinite buffers
  • No contention in switches, no flow control
• Flit-level simulation
  • PopNet network simulator
  • 5-stage router pipeline
  • Credit-based flow control
  • 8 virtual channels, each 5 flits deep
  • Multi-flit packets injected into the network (5 flits/packet)
Flit-Level Simulation (cont’d)
• Network configurations simulated
  • 4 x 4 x 4 mesh
  • 8 x 8 x 4 mesh
• Four different traffic traces used
  • Uniform traffic
  • Transpose traffic: (x,y,z) → (y,z,x)
  • Complement traffic: (x,y,z) → (k-x-1, k-y-1, k-z-1)
  • Worst-case traffic pattern for DOR (DOR-WC): (x,y,z) → (k-z-1, k-y-1, k-x-1)
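The four traffic patterns translate directly into destination functions. The sketch below assumes a cubic k x k x k mesh, as in the 4 x 4 x 4 configuration, because the formulas use a single radix k; how they are applied to the 8 x 8 x 4 mesh is not stated on the slide. The function names are hypothetical.

```python
import random
from itertools import product

def uniform(node, k, rng=random):
    """Uniform traffic: a destination drawn uniformly at random."""
    return (rng.randrange(k), rng.randrange(k), rng.randrange(k))

def transpose(node, k):
    """Transpose traffic: (x, y, z) -> (y, z, x)."""
    x, y, z = node
    return (y, z, x)

def complement(node, k):
    """Complement traffic: (x, y, z) -> (k-x-1, k-y-1, k-z-1)."""
    x, y, z = node
    return (k - x - 1, k - y - 1, k - z - 1)

def dor_wc(node, k):
    """Worst-case pattern for DOR: (x, y, z) -> (k-z-1, k-y-1, k-x-1)."""
    x, y, z = node
    return (k - z - 1, k - y - 1, k - x - 1)

if __name__ == "__main__":
    k = 4
    src = (0, 1, 3)
    print(transpose(src, k), complement(src, k), dor_wc(src, k))
    # (1, 3, 0) (3, 2, 0) (0, 2, 3)
```

The three deterministic patterns are permutations: every node sends to exactly one destination and receives from exactly one source.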
Summary of Contributions
• Proposed a 3D layer-multiplexed architecture, which is an optimization of a 3D mesh
• Exploits the optimality of RPM together with the high vertical bandwidth enabled by 3D technology
• LM architecture consumes 27% less power and occupies 26% less area than a 3D mesh
• RPM-LM has comparable (marginally better) performance to RPM on a 3D mesh