Elastic-Buffer Flow-Control for On-Chip Networks George Michelogiannakis, James Balfour, William J. Dally Computer Systems Laboratory Stanford University
Introduction • Elastic-buffer (EB) flow-control uses the channels as distributed FIFOs • Input buffers at routers are not needed • Compared to virtual-channel (VC) routers, EB networks: • Provide up to 12% more throughput per unit power • Have equal zero-load latency • Reduce router cycle time by 18%
Outline • Building elastic-buffered channels • By using what is already there • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation
The Idea • Use the network channels as distributed FIFOs • Use that storage instead of input buffers at routers • To remove input buffer area and power costs • [Figure: a pipelined channel compared with the same channel used as a FIFO]
Building an Elastic Buffer • To build an EB in a pipelined channel with master-slave flip-flops (FFs): • Use the master and slave latches as separate storage slots by driving their enables independently • [Figure: master-slave FF compared with an elastic buffer]
How Elastic Buffer Channels Work • Ready/valid handshake between elastic buffers • Ready: At least one free storage slot • Valid: Non-empty (driving valid data) • A flit advances when the sender is valid and the receiver is ready • [Figure: flits advancing through an EB channel over cycles 1 to 6]
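The handshake above can be sketched as a cycle-level model. This is a minimal illustration, not the authors' RTL: it assumes each EB holds two flits (one per latch), that ready/valid are sampled at the start of each cycle, and the names `ElasticBuffer`, `step` and `run` are hypothetical.

```python
from collections import deque

SLOTS = 2  # assumption: master and slave latch each hold one flit


class ElasticBuffer:
    """One EB stage in a channel: a tiny FIFO with ready/valid signalling."""

    def __init__(self, slots=SLOTS):
        self.slots = slots
        self.fifo = deque()

    def ready(self):
        # Ready: at least one free storage slot
        return len(self.fifo) < self.slots

    def valid(self):
        # Valid: non-empty, i.e. driving a valid flit downstream
        return len(self.fifo) > 0


def step(chain, source, sink_ready):
    """Advance a chain of EBs by one clock cycle.

    Handshakes are sampled from the state at the start of the cycle; a flit
    moves from stage i to stage i + 1 only if stage i is valid and stage i + 1
    is ready. Returns the flit handed to the sink this cycle, or None.
    """
    n = len(chain)
    ready = [eb.ready() for eb in chain]
    valid = [eb.valid() for eb in chain]
    downstream_ready = ready[1:] + [sink_ready]
    transfer = [valid[i] and downstream_ready[i] for i in range(n)]

    delivered = None
    for i in range(n):
        if transfer[i]:
            flit = chain[i].fifo.popleft()
            if i + 1 < n:
                chain[i + 1].fifo.append(flit)
            else:
                delivered = flit
    # The source injects into the first stage only if that stage was ready.
    if source and ready[0]:
        chain[0].fifo.append(source.popleft())
    return delivered


def run(num_stages, flits, sink_ready_pattern):
    """Push `flits` through `num_stages` EBs; the pattern gives the sink's ready per cycle."""
    chain = [ElasticBuffer() for _ in range(num_stages)]
    source = deque(flits)
    received = []
    for sink_ready in sink_ready_pattern:
        flit = step(chain, source, sink_ready)
        if flit is not None:
            received.append(flit)
    return received, chain


received, _ = run(3, list(range(6)), [True] * 9)
print(received)
```

With three stages and an always-ready sink, the first flit reaches the sink in cycle 4 and one flit arrives per cycle after that. If the sink deasserts ready, flits pile up two per stage in the channel itself and none is dropped, which is exactly the storage that replaces router input buffers.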
Control Logic Area Overhead • Control logic is implemented as a four-state FSM with 10 gates and 2 FFs • Cost is amortized over channel width • Example: control logic increases area of a 64-bit channel by 5%
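Why the cost is "amortized over channel width" can be shown with a simple model. The sketch assumes (this is not stated on the slide) that the control FSM has a fixed area and the datapath area grows linearly with width, calibrated to the quoted 5% at 64 bits; `control_overhead` is a hypothetical helper.

```python
def control_overhead(width_bits, ref_width=64, ref_overhead=0.05):
    """Fractional area added by EB control logic to a channel of `width_bits`.

    Assumed amortization model: fixed control area, datapath area linear in
    width, calibrated so a 64-bit channel sees the quoted 5% overhead.
    """
    control_area = ref_overhead * ref_width  # in units of one datapath bit-slice
    return control_area / width_bits


for width in (32, 64, 128, 192):
    print(f"{width:4d}-bit channel: {control_overhead(width):.1%} overhead")
```

Under this model the overhead halves every time the channel width doubles, so it is largest for narrow channels and drops below 2% at 192 bits.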
Outline • Building elastic-buffered channels • Router microarchitecture • Use EB flow-control through the router • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation
Use EB Flow-Control Through the Router • [Figure: VC input-buffered router compared with the EB router] • Input buffer replaced by an input EB • VC and switch (SW) allocators removed; per-output arbiters used instead • Three-slot output EB covers for arbitration performed one cycle in advance • Look-ahead (LA) routing is also applicable to EB networks
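Per-output arbiters are enough because, without VCs, each input presents at most one head flit requesting one output, so no input-side matching is needed. The sketch below assumes a round-robin policy, which the slide does not specify; `RoundRobinArbiter` and `allocate` are hypothetical names.

```python
class RoundRobinArbiter:
    """Hypothetical per-output arbiter with rotating priority."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.next_priority = 0  # input that has the highest priority this cycle

    def arbitrate(self, requests):
        """requests: one bool per input. Returns the granted input, or None."""
        for offset in range(self.num_inputs):
            i = (self.next_priority + offset) % self.num_inputs
            if requests[i]:
                # The winner gets the lowest priority next time.
                self.next_priority = (i + 1) % self.num_inputs
                return i
        return None


def allocate(requested_output, arbiters):
    """requested_output[i] is the output input i wants, or None.

    Each output arbitrates independently; since every input requests a single
    output, an input can never receive two grants.
    """
    grants = {}
    for out, arbiter in enumerate(arbiters):
        winner = arbiter.arbitrate([r == out for r in requested_output])
        if winner is not None:
            grants[out] = winner
    return grants


arbiters = [RoundRobinArbiter(5) for _ in range(5)]
requests = [1, None, 1, 1, 0]  # inputs 0, 2, 3 want output 1; input 4 wants output 0
for cycle in range(3):
    print(allocate(requests, arbiters))
```

Output 1 rotates its grant among inputs 0, 2 and 3 on successive cycles, while input 4 keeps winning the uncontested output 0.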
Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • How to provide isolation without VCs • Load-sensing for adaptive routing • Evaluation
Deadlock Avoidance: Duplicate Channels • No input buffers, therefore no virtual channels • Three types of possible deadlocks: • Protocol deadlock • Cyclic flit dependency in network • Solution: Duplicate physical channels
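The slide does not say how packets are assigned to the duplicated channels. As an assumed illustration, the sketch applies the classic dateline scheme to a unidirectional ring: it builds the channel dependency graph from every route and checks it for cycles, once with a single physical channel per link and once with two copies. The ring, the dateline rule and all function names are hypothetical.

```python
def ring_routes(n):
    """Channel sequence of every packet on a unidirectional n-node ring.

    Channel c links node c to node (c + 1) % n.
    """
    for src in range(n):
        for dst in range(n):
            if src != dst:
                hops = (dst - src) % n
                yield [(src + k) % n for k in range(hops)]


def dependency_graph(n, duplicate):
    """Edges (channel, copy) -> (channel, copy) between consecutive hops.

    With duplicate=True, a packet moves from physical copy 0 to copy 1 after
    it crosses the wraparound channel n - 1 (an assumed dateline rule).
    """
    edges = {}
    for route in ring_routes(n):
        copy = 0
        labelled = []
        for c in route:
            labelled.append((c, copy))
            if duplicate and c == n - 1:
                copy = 1
        for a, b in zip(labelled, labelled[1:]):
            edges.setdefault(a, set()).add(b)
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GREY
        for v in edges.get(u, ()):
            state = color.get(v, WHITE)
            if state == GREY or (state == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(edges))


print(has_cycle(dependency_graph(4, duplicate=False)))
print(has_cycle(dependency_graph(4, duplicate=True)))
```

With one channel per link the dependencies wrap around the ring and form a cycle; with two physical copies, packets never return from copy 1 to copy 0, so the graph becomes acyclic. This is the role VCs would otherwise play.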
Deadlock Avoidance: No Interleaving • Interleaving deadlock • New head flits require destination registers • Occupied destination registers depend on tail flits • Tail flits cannot bypass the new head flit • Solution: Disallow packet interleaving
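Disallowing interleaving amounts to holding an output for one packet from its head flit to its tail flit. A minimal sketch, assuming flits carry head/tail markers; `Flit` and `OutputPort` are hypothetical names, not the authors' design.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    packet: int
    is_head: bool
    is_tail: bool


class OutputPort:
    """Hypothetical output port that never interleaves flits of different packets."""

    def __init__(self):
        self.owner = None  # input whose packet currently holds the port

    def try_send(self, input_id, flit):
        """Return True if the flit may use the port this cycle."""
        if self.owner is None:
            if not flit.is_head:
                raise ValueError("body or tail flit without a preceding head")
            if not flit.is_tail:  # single-flit packets release immediately
                self.owner = input_id
            return True
        if input_id != self.owner:
            return False  # a new head must wait until the current tail has passed
        if flit.is_tail:
            self.owner = None
        return True


port = OutputPort()
print(port.try_send(0, Flit(packet=1, is_head=True, is_tail=False)))
print(port.try_send(1, Flit(packet=2, is_head=True, is_tail=False)))
```

Because a head flit from another input is refused until the current tail leaves, the tail never ends up stuck behind a new head, which is the dependency that causes interleaving deadlock.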
Duplicating Channels Between Routers • Duplicate channels with neckdown • Small improvement (still one switch port), large cost • Duplicate channels with duplicate switch ports • Excessive cost (switch cost grows quadratically)
Dividing Into Sub-Networks Is More Efficient • Divide into sub-networks • Double bandwidth, double the cost • However, more beneficial when the datapath is narrowed to normalize for throughput or power • Again, due to the quadratic cost of the switch
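The "quadratic cost" argument behind the last two slides can be worked through numerically. This sketch assumes crossbar area grows with the square of (ports × width) and uses a 5-port, 64-bit router as a hypothetical baseline; only switch cost is modeled, not channels.

```python
def switch_cost(ports, width):
    """Relative crossbar cost, assuming area grows with (ports * width) ** 2."""
    return (ports * width) ** 2


P, W = 5, 64  # assumed baseline: 5-port mesh router, 64-bit datapath
baseline = switch_cost(P, W)

relative_cost = {
    "duplicate switch ports (2x bandwidth)": switch_cost(2 * P, W) / baseline,
    "two sub-networks (2x bandwidth)": 2 * switch_cost(P, W) / baseline,
    "two half-width sub-networks (1x bandwidth)": 2 * switch_cost(P, W // 2) / baseline,
}
for option, cost in relative_cost.items():
    print(f"{option}: {cost:.2f}x switch cost")
```

Doubling switch ports quadruples switch cost, two full-width sub-networks double it, and two half-width sub-networks deliver the baseline bandwidth at half the switch cost, which is why narrowing the datapath makes sub-networks attractive.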
Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Propose a load metric for EB networks • Evaluation
Output Channel Occupancy Load Metric • Flit-buffered networks use credit count • EB networks measure output channel occupancy • Over a certain segment of the output channel • Occupancy decremented when flits leave that segment • Incremented by a packet’s length when the routing decision is made, so packets see other routing decisions made in the same cycle
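The bookkeeping above can be sketched as a per-output counter. Choosing the candidate with the lowest occupancy, and breaking ties by candidate order, are assumptions for illustration; `OccupancyLoadSensor` is a hypothetical name.

```python
class OccupancyLoadSensor:
    """Hypothetical per-router tracker of output channel occupancy."""

    def __init__(self, num_outputs):
        self.occupancy = [0] * num_outputs  # flits charged to each output segment

    def route(self, packet_length, candidate_outputs):
        """Adaptively pick the least-occupied candidate output.

        The whole packet length is charged immediately, so a packet routed
        later in the same cycle already sees this decision.
        """
        best = min(candidate_outputs, key=lambda o: self.occupancy[o])
        self.occupancy[best] += packet_length
        return best

    def flit_left_segment(self, output):
        """Called when a flit leaves the measured segment of `output`."""
        self.occupancy[output] -= 1


sensor = OccupancyLoadSensor(num_outputs=4)
print(sensor.route(4, [1, 2]))  # both idle: first candidate wins
print(sensor.route(4, [1, 2]))  # same cycle: output 1 is now charged 4 flits
```

Charging the full packet length at decision time, rather than waiting for flits to arrive, is what keeps two packets routed in the same cycle from both picking the same lightly loaded output.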
Outline • Building elastic-buffered channels • Router microarchitecture • Deadlock avoidance • Load-sensing for adaptive routing • Evaluation • Compare throughput, power, area, latency, cycle time
Evaluation Methodology • Used a modified version of BookSim • Area/power estimations from a 65nm library • Input buffers modeled as SRAM cells • VC networks use the throughput/power-optimal number of VCs and buffer depth • Two sub-networks: request and reply • Results averaged over a set of 6 traffic patterns • Constant packet size (512 bits) • Channel width swept from 28 to 192 bits • Low-swing channels: traversal power is 0.3 of that of full-swing repeated wires
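Because packet size is fixed while channel width is swept, packet length in flits changes across the sweep. The sketch assumes one flit equals one channel width and ignores any header overhead; `flits_per_packet` is a hypothetical helper.

```python
import math

PACKET_BITS = 512  # constant packet size from the methodology


def flits_per_packet(channel_width_bits):
    """Flits needed per packet, assuming flit width equals channel width."""
    return math.ceil(PACKET_BITS / channel_width_bits)


for width in (28, 64, 128, 192):
    print(f"{width:3d}-bit channel: {flits_per_packet(width)} flits per packet")
```

Narrow channels therefore carry long packets (19 flits at 28 bits), while the widest channels need only 3 flits per packet.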
Throughput-Power Gains in 2D Mesh • [Figure: throughput gain of the EB network] • EB network improvement: • Same power: 10% increased throughput • Same throughput: 12% reduced power
Throughput-Area Gains in 2D Mesh • 2% improvement in throughput per unit area for EB networks
Latency-Throughput in 2D Mesh • Zero-load latency is equal for EB and VC networks
Router RTL Implementation • EB router: no buffers, VCs, allocators, or credits • VC router: look-ahead routing; buffers implemented as FF arrays, 2 VCs with 8 slots each • Setup: 45nm LP-CMOS, worst case; mesh with 5x5 routers; dimension-order routing (DOR); 64-bit datapath
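The storage the EB router avoids can be estimated from the VC router's parameters. This assumes every one of the 5 input ports is buffered with 2 VCs of 8 slots and that one slot holds one 64-bit flit; it is a back-of-the-envelope figure, not a number reported on the slide.

```python
PORTS = 5         # 5x5 mesh router
VCS_PER_PORT = 2  # 2 VCs per input port
SLOTS_PER_VC = 8  # 8 flit slots per VC
FLIT_BITS = 64    # 64-bit datapath, assumed one flit per slot

buffer_bits = PORTS * VCS_PER_PORT * SLOTS_PER_VC * FLIT_BITS
print(f"VC router input-buffer storage: {buffer_bits} flip-flops")
```

Under these assumptions the VC router carries 5120 bits of input-buffer flip-flops, plus the VC state and credit logic that the EB router also drops.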
Conclusions • EB flow-control uses channels as distributed FIFOs • Removes input buffers from routers • Uses duplicate physical channels instead of VCs • Increases throughput per unit power by up to 12% for low-swing channels • The gain depends on what fraction of the overall cost input buffers constitute • Reduces router cycle time by 18% • Flow-control choice depends on design parameters and priorities
Thanks for your attention Questions?