The NoX Router
Mitchell Hayenga, Mikko Lipasti
Overview
• New low-latency router technique
  • Don’t arbitrate or speculate! Encode.
  • XOR property: (A^B) ^ B = A
  • Hides arbitration latency
  • Eliminates dead cycles
• The NoX Router
  • Single-cycle/wormhole/mesh implementation
  • Frequency competitive with pure speculative
  • 2.7%-34.4% better ED² on application traces
  • Up to 9.9% better throughput on synthetic traffic
[Figure: router block diagram with control, input channel, and switch fabric]
Motivation
• Modern On-Chip Networks
  • Bandwidth plentiful, latency critical
• Control
  • Complex, speculative, critical path
• Datapath
  • Fast, simple, wire-dominated
• NoX Tradeoff
  • Marginal increase in datapath complexity
  • Hide control latency
[Figure: Virtual Channel Router Pipeline Evolution (stages BW, RC/NRC, VA, SA, ST, LT); one pipeline is labeled Intel Teraflops Router]
Switch Arbitration Techniques
• Non-Speculative
  • Arbitration occurs before switch traversal
• Speculative Switch Traversal [Mullins ISCA 2004]
  • Assume contention doesn’t happen
  • Wasted cycle in the event of contention
  • Arbiter decides what gets sent on the next cycle
[Figure: switch fabric and timing diagram (cycles 0-4; signals clk, port 0, port 1, grant, valid out, data out) for speculative switch traversal, showing the no-contention and contention cases]
Switch Arbitration Techniques (continued)
• Encoding
  • Blindly transmit, XOR within switch fabric
  • No contention: data sent unmodified
  • Contention: data sent XOR’d
  • Arbiter decides what was sent
[Figure: switch fabric and timing diagram (cycles 0-4; signals clk, port 0, port 1, grant, valid out, data out) for encoding; under contention the output carries A^B, followed by the flit that lost arbitration]
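The encoding bullets can be illustrated with a small behavioral model. This is a minimal sketch, not the authors' hardware: the function names `switch_output` and `drain`, the use of Python integers as flit payloads, and the fixed-priority arbitration order are all assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def switch_output(requests):
    """XOR together every flit requesting this output port in one cycle.

    `requests` maps input-port number -> flit payload (an int).
    Returns (value, coded): with one requester the flit passes unmodified;
    with several, the link carries their XOR and `coded` is True.
    """
    flits = list(requests.values())
    value = reduce(xor, flits, 0)
    return value, len(flits) > 1


def drain(requests, priority):
    """Toy model of one output port until all colliding flits are sent.

    `priority` is a hypothetical fixed arbitration order (list of ports).
    Each cycle every pending flit is driven blindly; the arbiter then
    retires one winner, so the next cycle's XOR has one fewer term.
    """
    pending = dict(requests)
    sent = []
    while pending:
        sent.append(switch_output(pending))
        winner = next(p for p in priority if p in pending)
        del pending[winner]
    return sent
```

For three colliding flits `0xA`, `0xB`, `0xC` on ports 0 to 2, `drain({0: 0xA, 1: 0xB, 2: 0xC}, [0, 1, 2])` returns `[(0xD, True), (0x7, True), (0xC, False)]`: the link carries a useful value on every cycle, so three flits leave in three cycles with no dead cycle.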
Receive Logic
• Relies on a simple XOR property
  • (A^B^C) ^ (B^C) = A
• Simple decode
  • Always able to decode by XORing two sequential values
• Maintains previous router’s arbitration order/fairness
[Figure: receive logic with a flit buffer and a per-entry coded bit; the received values A^B^C, B^C, and C decode to A, B, and C]
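The decode rule can be sketched in a few lines. This is a simplified software model under stated assumptions: the function name `decode_stream` and the `(value, coded)` pair representation are hypothetical, and the real receive logic operates on flit buffer entries in hardware rather than on a Python list.

```python
def decode_stream(received):
    """Recover original flits from (value, coded) pairs in arrival order.

    A coded entry is decoded by XORing it with the entry that follows it;
    an uncoded entry is already the original flit.
    """
    flits = []
    for i, (value, coded) in enumerate(received):
        if coded:
            if i + 1 >= len(received):
                raise ValueError("coded flit has no successor to decode against")
            value ^= received[i + 1][0]
        flits.append(value)
    return flits
```

With A = `0xA`, B = `0xB`, C = `0xC`, the received sequence A^B^C = `0xD`, B^C = `0x7`, C = `0xC` decodes as `decode_stream([(0xD, True), (0x7, True), (0xC, False)])`, giving `[0xA, 0xB, 0xC]`. The flits come out in the order the previous router's arbiter granted them, which is how the arbitration order is preserved.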
Tradeoffs and Scaling
• Arbitration
  • O(log n) delay for most arbiters
• Decode logic
  • Constant with respect to # of ports
• Switch Fabric
  • XOR delay scales slightly worse than a mux/tristate-based solution
  • May not be an issue (control latency)
[Figure: router block diagrams with control, input channel, and switch fabric]
The NoX Router
• Network of XORs
• Implementation Details
  • 8x8 mesh, 2mm long 64-bit links
  • Single cycle (router + link)
  • Wormhole
  • Dimension-ordered routing
  • Minimally buffered
Baseline Designs
• Non-Speculative
  • Serial arbitration & switch logic
  • Long cycle time
  • Efficient link utilization
• Speculative Techniques [Mullins ISCA 2004]
  • Hides arbitration latency
  • Potential for wasted link bandwidth
  • Spec-Fast & Spec-Accurate [Mullins ASP-DAC 2006]
Frequency Analysis
• Overheads present in all designs
  • 248ps SRAM delay
  • 98ps link latency
Synthetic Traffic - Latency
[Figure: two latency plots; x-axis: bandwidth (MB/s/node)]
Synthetic Traffic – ED²
[Figure: two ED² plots; x-axis: bandwidth (MB/s/node)]
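The slides do not define ED². Assuming the conventional energy-delay-squared product, with E the energy consumed and D the delay (symbols chosen here for illustration), the metric is:

```latex
\[
  ED^2 = E \cdot D^2
\]
```

Lower is better. Because delay enters squared, this metric weights latency more heavily than energy.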
Power @ Fixed Bandwidth
• Traffic Pattern
  • Uniform random
  • 2GB/s/node injection rate
  • Spec-Fast saturated
• Switch/link glitching in speculative
• Marginal additional decode power
[Figure: power breakdown chart; annotation: decode negligible]
Area Floorplanning
• NoX Router: ~17% more area than the Standard Router
[Figure: floorplans of the Standard Router and the NoX Router, each with five 64x4 SRAM input buffers (Ports 0-4); block labels: Crossbar, XOR Switch, Decoding and Masking; dimension labels: 161.2 µm, 140 µm, 101.0 µm, 102.2 µm, 70 µm, 28 µm]
Going Further
• Input Speedup
  • What if we could drive two values from an input buffer in a single cycle?
  • Final decode step has 2 values available
  • Last packet sees no additional delay from contention at the previous router
• Multi-hop encoded forwarding
  • Don’t decode at every hop; decode when packets diverge
  • Allow new collisions with the “head” flit
  • Requires additional sideband info
[Figure: flit buffer holding A^B and B feeding the switch fabric]
Conclusion
• New encoding-based low-latency router technique
• Hides arbitration latency
• Comparable frequency to speculative switch traversal techniques
• Eliminates wasted interconnect bandwidth
• Promising application to multiple router architectures
Virtual Channels
• Future Work
  • Physical Channels vs. Virtual Channels
• VC Router Benefits
  • Dynamic bandwidth sharing (performance)
• VC Router Negatives
  • Increased arbitration delay (performance)
  • Increased buffer energy (power)
  • Large unified crossbar (area, power)
• Possible but tradeoffs need to be re-evaluated
  • Structuring of input buffers/decode logic
  • VC credit accounting
Multi-Flit Support
• Current support is conservative
  • Performs similarly to speculative routers if multi-flit packets collide
• Not all bad though
  • ~70% of packets are single-flit coherence packets
  • Only head-flit collisions matter
  • Requests all single-flit
• Alternatives
  • Fragment multi-flit packets
  • Provide sufficient buffering space