George Michelogiannakis, James Balfour, William J. Dally

Elastic-Buffer Flow-Control for On-Chip Networks George Michelogiannakis, James Balfour, William J. Dally Computer Systems Laboratory Stanford University

Introduction • Elastic-buffer (EB) flow-control uses the channels as distributed FIFOs • Input buffers at routers are not needed • Can provide 12% more throughput per unit power • Equal zero-load latency • Reduces router cycle time by 18% • Compared to VC routers

Outline • Building elastic-buffered channels • By using what is already there • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation

The Idea • Use the network channels as distributed FIFOs • Use that storage instead of input buffers at routers • To remove input buffer area and power costs [Figure: pipelined channel; channel as FIFO]

Building an Elastic Buffer • To build an EB in a pipelined channel with master-slave flip-flops (FFs): • Use latches for storage by driving their enables independently [Figure: master-slave FF; elastic buffer]

How Elastic Buffer Channels Work • Ready/valid handshake between elastic buffers • Ready: At least one free storage slot • Valid: Non-empty (driving valid data) [Figure: handshake example, cycles 1 to 6]
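
A minimal sketch of the ready/valid handshake described on this slide, written as a cycle-level Python model rather than the latch-level circuit. The names `ElasticBuffer` and `step` are hypothetical, and the capacity of two flits per EB is an assumption (one flit per latch of the former master-slave FF).

```python
from collections import deque


class ElasticBuffer:
    """Behavioral model of one elastic buffer (hypothetical, not the circuit)."""

    CAPACITY = 2  # assumption: one flit in each of the two latches

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):
        # Valid: non-empty, so the buffer is driving valid data.
        return len(self.slots) > 0


def step(channel, incoming=None, sink_ready=True):
    """Advance a chain of EBs by one cycle.

    Returns (flit delivered to the sink or None, whether `incoming` was accepted).
    """
    # Sample all handshake signals at the start of the cycle.
    valid = [eb.valid for eb in channel]
    ready = [eb.ready for eb in channel]

    delivered = None
    if valid[-1] and sink_ready:
        delivered = channel[-1].slots.popleft()

    # A flit advances when the sender is valid and the receiver is ready.
    for i in range(len(channel) - 2, -1, -1):
        if valid[i] and ready[i + 1]:
            channel[i + 1].slots.append(channel[i].slots.popleft())

    accepted = False
    if incoming is not None and ready[0]:
        channel[0].slots.append(incoming)
        accepted = True
    return delivered, accepted
```

In this model, when the sink stalls, flits pile up in the channel itself until every EB is full, which is the "channel as distributed FIFO" behavior; when the sink resumes, flits drain in order.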

Control Logic Area Overhead • Control logic is implemented as a four-state FSM with 10 gates and 2 FFs • Cost is amortized over channel width • Example: control logic increases area of a 64-bit channel by 5%

Outline • Building elastic-buffered channels • Router microarchitecture • Use EB flow-control through the router • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation

Use EB Flow-Control Through the Router • Input buffer replaced by input EB • VC & SW allocators removed; per-output arbiters used instead • Three-slot output EB to cover for arbitration done one cycle in advance • Look-ahead (LA) routing also applicable to EB networks [Figure: VC input-buffered router; EB router]

Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • How to provide isolation without VCs • Load-sensing for adaptive routing • Evaluation

Deadlock Avoidance: Duplicate Channels • No input buffers means no virtual channels • Three types of deadlock are possible; the first two are: • Protocol deadlock • Cyclic flit dependency in the network • Solution: Duplicate physical channels

Deadlock Avoidance: No Interleaving • Interleaving deadlock • New head flits require destination registers • Occupied destination registers depend on tail flits • Tail flits cannot bypass the new head flit • Solution: Disallow packet interleaving
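
One way to picture "disallow packet interleaving" is a per-output arbiter that, once it grants a head flit, stays locked to that input until the tail flit has passed. The sketch below is illustrative only: `Flit`, `PacketLockingArbiter`, and the round-robin choice among head flits are assumptions, not details from the slides.

```python
from collections import namedtuple

# Hypothetical flit descriptor: only the head/tail markers matter here.
Flit = namedtuple("Flit", ["head", "tail"])


class PacketLockingArbiter:
    """Per-output arbiter that never interleaves flits of different packets."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.locked_input = None  # input that currently owns the output
        self.next_priority = 0    # round-robin pointer (assumed policy)

    def grant(self, requests):
        """requests maps input index -> Flit at the front of that input.

        Returns the granted input index, or None if the output idles.
        """
        if self.locked_input is not None:
            flit = requests.get(self.locked_input)
            if flit is None:
                # The packet in flight is stalled upstream; other inputs
                # must still wait, otherwise packets would interleave.
                return None
            winner = self.locked_input
        else:
            winner = None
            for offset in range(self.num_inputs):
                candidate = (self.next_priority + offset) % self.num_inputs
                if candidate in requests and requests[candidate].head:
                    winner = candidate
                    break
            if winner is None:
                return None
            flit = requests[winner]
            self.next_priority = (winner + 1) % self.num_inputs

        # Release the output only after the tail flit has been granted.
        self.locked_input = None if flit.tail else winner
        return winner
```

With two inputs both presenting head flits, input 0 wins first and keeps the output through its body and tail flits, even across a cycle in which it has nothing to send; input 1 is granted only afterwards.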

Duplicating Channels Between Routers • Duplicate channels with neckdown • Small improvement (still one switch port), large cost • Duplicate channels with duplicate switch ports • Excessive cost (switch quadratic cost)

Dividing Into Sub-Networks Is More Efficient • Divide into sub-networks • Double the bandwidth for double the cost • However, becomes more beneficial once the datapath is narrowed to normalize for throughput or power • Again, due to the quadratic cost of the switch
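
The quadratic-cost argument can be checked with a back-of-the-envelope model. The sketch below assumes crossbar cost grows as (ports × datapath width)², ignores channels and all other router components, and uses hypothetical function names and default parameters; it is an illustration of the reasoning, not the cost model used in the evaluation.

```python
def crossbar_cost(ports, width_bits):
    """Assumed cost model: crossbar cost grows as (ports * width)^2."""
    return (ports * width_bits) ** 2


def relative_costs(ports=5, width_bits=64):
    """Switch cost of each option relative to a single full-width network."""
    base = crossbar_cost(ports, width_bits)
    return {
        # Baseline: one network, one switch port per channel.
        "single network": 1.0,
        # Duplicate channels with duplicate switch ports: 2x the ports.
        "duplicate switch ports": crossbar_cost(2 * ports, width_bits) / base,
        # Two sub-networks at full width: double bandwidth, double cost.
        "two sub-networks, full width": 2 * crossbar_cost(ports, width_bits) / base,
        # Two sub-networks at half width: same aggregate bandwidth.
        "two sub-networks, half width": 2 * crossbar_cost(ports, width_bits // 2) / base,
    }


if __name__ == "__main__":
    for option, cost in relative_costs().items():
        print(f"{option}: {cost:.2f}x")
```

Under this assumption, duplicating switch ports costs 4x, two full-width sub-networks cost 2x, and two half-width sub-networks cost 0.5x while matching the baseline's aggregate bandwidth.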

Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Propose a load metric for EB networks • Evaluation

Output Channel Occupancy Load Metric • Flit-buffered networks use credit count • EB networks measure output channel occupancy • At a certain segment of the output channel (shown in red) • Occupancy decremented when flits leave that segment • Incremented by a packet’s length when its routing decision is made • Packets see other routing decisions made in the same cycle
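
A minimal sketch of the occupancy counters described on this slide. The class and function names are hypothetical, and two details are assumptions: the adaptive choice is "pick the candidate output with the lowest count", and decisions within a cycle are processed one after another so that later packets see earlier ones.

```python
class OutputOccupancy:
    """Per-output occupancy counters used as a load estimate (illustrative)."""

    def __init__(self, num_outputs):
        self.count = [0] * num_outputs

    def on_route(self, output, packet_length_flits):
        # Incremented by the packet's length when the routing decision is made.
        self.count[output] += packet_length_flits

    def on_flit_leaves_segment(self, output):
        # Decremented when a flit leaves the monitored channel segment.
        if self.count[output] == 0:
            raise ValueError("occupancy counter underflow")
        self.count[output] -= 1

    def least_loaded(self, candidates):
        # Assumed policy: prefer the candidate output with the lowest count.
        return min(candidates, key=lambda out: self.count[out])


def route_cycle(occupancy, packets):
    """Route all packets that arrive in one cycle.

    `packets` is a list of (candidate_outputs, length_in_flits). Counters are
    updated immediately, so each packet sees decisions made earlier in the
    same cycle.
    """
    choices = []
    for candidates, length in packets:
        out = occupancy.least_loaded(candidates)
        occupancy.on_route(out, length)
        choices.append(out)
    return choices
```

For example, two 4-flit packets that can each use outputs 0 or 1 and arrive in the same cycle are spread across the two outputs instead of both choosing output 0.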

Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation • Compare throughput, power, area, latency, cycle time

Evaluation Methodology • Used a modified version of BookSim • Area/power estimations from a 65nm library • Input buffers modeled as SRAM cells • Throughput/power optimal number of VCs and buffer depth • Two sub-networks: request and reply • Averaged over a set of 6 traffic patterns • Constant packet size (512 bits) • Swept channel width from 28 to 192 bits • Low-swing channels: 0.3 of the full-swing repeated wire traversal power

Throughput-Power Gains in 2D Mesh • EB network improvement: • Same power: 10% increased throughput • Same throughput: 12% reduced power [Figure: throughput gain]

Throughput-Area Gains in 2D Mesh 2% improvement for EB networks

Latency-Throughput in 2D Mesh Zero-load latency equal

Power Breakdown: No Input Buffer Power

Area Breakdown: No Input Buffer Area

Router RTL Implementation • No buffers, VCs, allocators, credits • VC router had look-ahead routing • Buffers: FF arrays. 2 VCs, 8 slots each • 45nm, LP-CMOS, worst-case • Mesh 5x5 routers. DOR. 64-bit datapath

Conclusions • EB flow-control uses channels as distributed FIFOs • Removes input buffers from routers • Uses duplicate physical channels instead of VCs • Increases throughput per unit power up to 12% for low-swing • Depends on what fraction of the overall cost input buffers constitute • Reduces router cycle time by 18% • Flow-control choice depends on design parameters and priorities

Thanks for your attention Questions?
