Design of a High-Throughput Distributed Shared-Buffer NoC Router
Rohit Sunkam Ramanujam*, Vassos Soteriou†, Bill Lin*, Li-Shiuan Peh‡
*Dept. of Electrical Engineering, UCSD, USA
†Dept. of Electrical Engineering, CUT, Cyprus
‡Dept. of Electrical Eng. and Computer Science, MIT, USA
Chip Multiprocessors are a reality …
• Power wall
• Frequency wall
• ILP wall
• Non-Recurring Engineering costs
• Time to market
[Figure: a uniprocessor next to a chip multiprocessor. Sources: Intel Inc. and Tilera Inc.]
The need for a Network on Chip (NoC)
[Figure: compute units, each attached to a router]
• Scalable communication
• Modular design
• Efficient use of wires
• A new way to organize and build VLSI systems
The Problem: Delivering high throughput in NoCs
• Why care?
  • NoCs in CMPs connect general-purpose processors.
  • Future applications are unknown → traffic is unknown.
  • Exploiting parallelism needs fine-grained interaction between cores.
• Can expect high traffic volume for current and future applications running on many-core processors.
  • E.g. cache coherence among a large number of distributed shared L2 caches.
An important design choice that affects throughput
• Router microarchitecture
• How well does a router multiplex packets onto its output links?
NoC routers: Current design
Input Buffered Routers (IBRs): flits are buffered at the input ports.
[Figure: Inputs 1 and 2 connected through a crossbar to Outputs 1 and 2, shown for cycles 1 to 3]
• Maximal matching: Input 2 → Output 1
• Maximal matching: Input 1 → Output 1
• Output 2 is unutilized in cycle 3 although there is a flit destined for Output 2.
Bottleneck: the maximal matching used for arbitration is not good enough (70-80% efficiency).
Output queueing to the rescue …
Output Buffered Router (OBR): flits are buffered at the output ports.
[Figure: Inputs 1 and 2 connected through a crossbar to Outputs 1 and 2, shown for cycles 1 to 3]
• Output links are always utilized when there are flits available.
• Better multiplexing of flits onto output links ⇒ higher throughput.
How much difference does it make? Uniform traffic
[Plot] A throughput gap of 18%!
How much difference does it make? Complement traffic
[Plot] A throughput gap of 12%!
How much difference does it make? Tornado traffic
[Plot] A throughput gap of 22%!
Output Buffering is great …
• OBRs offer much higher throughput than IBRs.
• OBRs have predictable delay.
  • Queuing delay can be modeled using M/D/1 queues.
  • Packet delays are not predictable for IBRs.
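To make the M/D/1 remark concrete, here is a minimal sketch of the standard mean queueing delay for an M/D/1 queue. The function name and the assumption that an output link serves one flit per cycle (service time 1.0) are illustrative and not taken from the slides.

```python
def md1_mean_wait(arrival_rate: float, service_time: float = 1.0) -> float:
    """Mean time spent waiting in an M/D/1 queue, excluding the flit's own service.

    Uses the standard result W_q = rho * S / (2 * (1 - rho)), where
    rho = arrival_rate * service_time is the link utilization.
    """
    rho = arrival_rate * service_time
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must lie in [0, 1) for a stable queue")
    return rho * service_time / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Assumed loads in flits per cycle on one output link.
    for load in (0.2, 0.5, 0.8, 0.95):
        print(f"load={load:.2f}  mean wait={md1_mean_wait(load):.2f} cycles")
```

The delay depends only on the offered load of the output link, which is why an output-queued router is easy to reason about analytically.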
So why aren't OBRs used in NoCs?
[Figure: Inputs 1 … P-1 connected through a crossbar to Outputs 1 … P-1]
• Implementing output buffering requires either:
  • A crossbar speedup of P, where P is the number of ports. Not practical for aggressively clocked designs.
  • Output buffers with P write ports and a P × P² crossbar. Has huge area and power penalties.
Our approach: Emulate output queueing without any speedup
[Figure: Inputs 1 to 3 → Crossbar 1 → Middle Memories → Crossbar 2 → Outputs 1 to 3, animated for current time = 1 to 6, with flits labeled 4, 5 and 6]
• Step 1: Timestamp the flits. Assign a future time at which a flit would depart the router assuming output buffering.
• Step 2: Find a conflict-free middle memory.
• Step 3: Move flits from the input buffers to the middle memories.
• Step 4: When current time == timestamp, read the flit from the middle memory to the output port.
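The four steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' implementation: the class name `DSBSketch`, the rule "timestamp = one cycle after the later of now and the last promised departure", first-fit memory selection, and unbounded memory depth are all illustrative choices.

```python
from typing import Dict, List, Optional, Tuple


class DSBSketch:
    """Toy model of timestamping and middle-memory assignment (illustrative only)."""

    def __init__(self, num_ports: int, num_memories: Optional[int] = None) -> None:
        self.num_ports = num_ports
        # Default to the 2P-1 memories that avoid all conflicts.
        self.num_memories = 2 * num_ports - 1 if num_memories is None else num_memories
        # Latest departure time already promised on each output port.
        self.last_departure = [0] * num_ports
        # Per memory: timestamp -> output port of the flit stored for that cycle.
        self.memories: List[Dict[int, int]] = [dict() for _ in range(self.num_memories)]

    def accept_cycle(self, now: int, requests: List[int]) -> List[Optional[Tuple[int, int]]]:
        """Place flits arriving in cycle `now`; requests[i] is flit i's output port.

        Returns (timestamp, memory index) per flit, or None if the flit must retry.
        """
        written_this_cycle = set()
        placements: List[Optional[Tuple[int, int]]] = []
        for out_port in requests:
            # Step 1: first-come-first-served departure time for this output.
            ts = max(now + 1, self.last_departure[out_port] + 1)
            # Step 2: usable memories have no write this cycle (arrival conflict)
            # and no flit already scheduled to leave at ts (departure conflict).
            mem = next(
                (m for m in range(self.num_memories)
                 if m not in written_this_cycle and ts not in self.memories[m]),
                None,
            )
            if mem is None:
                placements.append(None)  # too few memories: flit waits at the input
                continue
            # Step 3: commit the timestamp and write the flit into the memory.
            self.last_departure[out_port] = ts
            self.memories[mem][ts] = out_port
            written_this_cycle.add(mem)
            placements.append((ts, mem))
        return placements

    def depart(self, now: int) -> List[Tuple[int, int]]:
        """Step 4: read out every flit whose timestamp equals `now`."""
        leaving = []
        for mem, slots in enumerate(self.memories):
            if now in slots:
                leaving.append((slots.pop(now), mem))  # (output port, memory index)
        return leaving


if __name__ == "__main__":
    router = DSBSketch(num_ports=3)
    print(router.accept_cycle(1, [0, 0, 1]))  # [(2, 0), (3, 1), (2, 2)]
    print(router.depart(2))                   # [(0, 0), (1, 2)]
```

In the usage example, two flits contend for output 0 and receive timestamps 2 and 3, while the flit for output 1 also receives timestamp 2 and is therefore steered to a memory that holds no other flit leaving in cycle 2.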
Arrival and Departure Conflicts
• Arrival conflicts: with P input ports, a flit can have an arrival conflict with up to P-1 other flits.
• Departure conflicts: with P output ports, a flit can have a departure conflict with up to P-1 other flits.
• Together these conflicts can rule out at most 2P-2 middle memories, so by the pigeonhole principle 2P-1 middle memories are needed to avoid all arrival and departure conflicts.
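As a worked instance of this counting argument, the block below writes out the bound for N middle memories and evaluates it for the 5-port mesh router used in the evaluation. The symbol N is notation assumed here for the number of middle memories.

```latex
\[
\underbrace{(P-1)}_{\text{arrival conflicts}}
+ \underbrace{(P-1)}_{\text{departure conflicts}}
= 2P-2 \;<\; N
\quad\Longrightarrow\quad N \ge 2P-1 ,
\qquad
P = 5:\; 4 + 4 = 8 < 9 .
\]
```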
The Distributed Shared-Buffer Router (DSB)
• Aims at emulating the packet servicing scheme of an OBR with limited buffers and no speedup.
  • First-Come-First-Served servicing of flits.
Objectives:
• Close the performance gap between OBRs with infinite buffers and IBRs (high throughput).
• Make a feasible design → low power and area overhead.
• Make packet delays more predictable for delay-sensitive NoC applications.
DSB Router Innovations
• Router pipeline with new stages for:
  • Timestamping flits
  • Finding a conflict-free middle memory
• Complexity- and delay-balanced pipeline stages for a high-clocked, high-performance implementation.
• New flow control to prevent packet dropping when resources are unavailable.
• Evaluation of the power-performance tradeoff of DSB architectures with fewer than 2P-1 middle memories.
Evaluation
• Cycle-accurate, flit-level simulator.
• Mesh topology: each router has 5 ports, NSEW + Injection/Ejection.
• Dimension Ordered Routing (DOR): decouples the effects of the routing algorithm on network performance.
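A minimal sketch of route computation under dimension-ordered routing on a mesh follows. It assumes X-then-Y ordering and that north means increasing y; the slides specify neither, and the function names are illustrative.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def dor_output_port(cur: Coord, dst: Coord) -> str:
    """Return the port a flit at `cur` takes toward `dst` under XY routing."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:                      # finish the X dimension first
        return "E" if dx > cx else "W"
    if dy != cy:                      # then correct the Y dimension
        return "N" if dy > cy else "S"
    return "EJECT"                    # arrived: hand the flit to the local tile


def dor_path(src: Coord, dst: Coord) -> List[str]:
    """List the output ports used hop by hop from `src` to `dst`."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    cur, ports = src, []
    while True:
        port = dor_output_port(cur, dst)
        ports.append(port)
        if port == "EJECT":
            return ports
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])


if __name__ == "__main__":
    print(dor_path((0, 0), (2, 1)))  # ['E', 'E', 'N', 'EJECT']
```

Because the path is fully determined by source and destination, differences in measured throughput can be attributed to the router rather than to routing decisions.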
Evaluation: Traffic traces
• 3 synthetic traffic traces:
  • Uniform
  • Bit Complement (Complement)
  • Tornado
• Real traffic/memory traces from running multiple threads (49 threads ⇒ 7x7 mesh) of eight SPLASH-2 benchmarks:
  • Complex 1D FFT, LU decomposition, Water-nsquared, Water-spatial, Ray tracer, Barnes-Hut, Integer Radix sort, Ocean simulation.
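For readers unfamiliar with the synthetic patterns, the sketch below gives destination functions using common textbook definitions on a k x k mesh. These definitions are assumptions, since the slides do not spell them out: uniform excludes the source itself, complement mirrors each coordinate, and tornado shifts ceil(k/2) - 1 positions along X only.

```python
import random
from typing import Tuple

Coord = Tuple[int, int]


def uniform_dest(src: Coord, k: int, rng: random.Random) -> Coord:
    """Pick a destination uniformly at random among the other k*k - 1 nodes (k >= 2)."""
    while True:
        dst = (rng.randrange(k), rng.randrange(k))
        if dst != src:
            return dst


def complement_dest(src: Coord, k: int) -> Coord:
    """Mirror each coordinate; equals the bitwise complement when k is a power of two."""
    return (k - 1 - src[0], k - 1 - src[1])


def tornado_dest(src: Coord, k: int) -> Coord:
    """Shift ceil(k/2) - 1 positions along X, wrapping around the row."""
    return ((src[0] + (k + 1) // 2 - 1) % k, src[1])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    src = (0, 0)
    print("uniform   ", uniform_dest(src, 8, rng))
    print("complement", complement_dest(src, 8))  # (7, 7)
    print("tornado   ", tornado_dest(src, 8))     # (3, 0)
```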
Performance on Uniform traffic
[Plot] A throughput gap of just 9%
Performance on Complement traffic
[Plot] A throughput gap of just 4%
Performance on Tornado traffic
[Plot] A throughput gap of just 8%
Performance of DSB on SPLASH-2 benchmarks
• The small difference in packet latency between the OBR and DSB routers is mainly due to the limited buffering in the DSB router.
• The Raytrace, Barnes and Ocean traces have very little contention. For these traces, the IBR has lower latency because of its shorter pipeline.
• Performance of the DSB router is very close to that of an OBR with the same number of pipeline stages.
• Huge performance improvements over the IBR in traces exhibiting high contention and demanding high bandwidth.
[Plot annotations: 64%, 72%, 97%]
Input Buffered Router (IBR) pipeline
[Figure: pipeline RC → VA → SA → ST → LT; Inputs 1 and 2 connected through a crossbar to Outputs 1 and 2]
• Route Computation (RC): determine the output port of the flit based on the destination coordinates.
• Virtual Channel Allocation (VA): reserve an output virtual channel (buffering) at the next-hop router.
• Switch Arbitration (SA): acquire access to the output port through the crossbar.
• Switch Traversal (ST): traverse the crossbar to reach the output link.
• Link Traversal (LT): traverse the link to reach the input buffer of the next-hop router.
Distributed Shared-Buffer Router pipeline
[Figure: pipeline RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT, with a path annotated "If CR or VA fails"; Inputs 1 and 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Outputs 1 and 2]
• Route Computation (RC): determine the output port of the flit based on the destination coordinates.
• Timestamp Allocation (TS): assign a timestamp to a flit for the output port requested. The timestamp is the future time (cycle) at which the flit can depart the middle memory buffer.
• Conflict Resolution + Virtual Channel Allocation (CR + VA):
  • Conflict Resolution: find a conflict-free middle memory.
  • Virtual Channel Allocation: reserve a virtual channel at the input of the next-hop router.
• Crossbar 1 + Middle Memory Write (XB1 + MM_WR): the flit traverses the first crossbar and is written into the assigned middle memory.
• Middle Memory Read + Crossbar 2 (MM_RD + XB2): when the current time equals the timestamp, the flit is read from the middle memory and traverses the second crossbar.
• Link Traversal (LT): the flit traverses the output link to reach the input buffer of the next-hop router.
Higher throughput: At what cost?
[Figure: DSB pipeline RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT; Inputs 1 and 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Outputs 1 and 2]
• Two crossbars instead of one: with N middle memories, need one P×N and one N×P crossbar.
• Middle memory buffers: can have fewer input buffers to compensate for the extra middle memory buffers.
• TS stage instead of the Switch Arbitration stage in IBRs.
• Extra stage for Conflict Resolution.
• Extra power!
Power-Performance tradeoff
• Theoretically, 2P-1 middle memories are needed to resolve all conflicts.
• For a 5-port mesh router, this means 9 middle memories plus a 5x9 and a 9x5 crossbar: a large power overhead.
• What is the impact of using fewer than 2P-1 middle memories?
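To get a rough feel for how crossbar size scales with the number of middle memories, the sketch below counts crosspoints. Crosspoint count is only a crude stand-in for area and power, chosen here for illustration; it is not the power model behind the results in the slides, and the function names are hypothetical.

```python
def ibr_crosspoints(num_ports: int) -> int:
    """Crosspoints in the single P x P crossbar of an input-buffered router."""
    return num_ports * num_ports


def dsb_crosspoints(num_ports: int, num_memories: int) -> int:
    """Crosspoints in the P x N plus N x P crossbars of a DSB router."""
    return num_ports * num_memories + num_memories * num_ports


if __name__ == "__main__":
    P = 5  # 5-port mesh router: N, S, E, W plus injection/ejection
    print("IBR:", ibr_crosspoints(P))                      # 25
    for n in (5, 9):                                       # reduced vs. full 2P-1
        print(f"DSB with {n} memories:", dsb_crosspoints(P, n))  # 50, then 90
```

Under this proxy, dropping from 9 to 5 middle memories nearly halves the crossbar cost, which is the motivation for asking how much throughput is lost with fewer than 2P-1 memories.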
Power and Area Comparison
• Router power overhead of 50% for the DSB-5 router.
• If the NoC consumes 10% of tile power, tile power overhead is only 3.5% for the DSB-5 router.
• If the NoC consumes 20% of tile power, tile power overhead is only 7% for the DSB-5 router.
Thank you
Questions?