Heterogeneous NoC Router

Heterogeneous NoC Router • Moti Mor, Tomer Gal • Instructor: Yaniv Ben Itzhak • Final Presentation, 25.03.2014

Project Goals • Research different heterogeneous NoC architectures • Design a heterogeneous router architecture • Implement the architecture • Basic measurements of speed and performance: • Latency • Throughput • Power • Area • Maximum frequency

Introduction • Network-on-Chip (NoC) is a new approach to designing the communication subsystem of SoCs and chip multiprocessors (CMPs). • Clients communicate through a network of routers. • Overcomes bus bottlenecks and improves performance. • [Figure: grid of routers (R), each attached to a client (C)]

Background • The SoC units communicate through a network of routers • Each router is assigned to a single unit • Supports many simultaneous connections • Credit-based flit-level flow control
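
Credit-based flow control can be illustrated with a minimal sketch. It assumes one credit per free slot in the downstream buffer; the class name `CreditLink` and its methods are hypothetical and not taken from the project.

```python
class CreditLink:
    """Upstream view of one downstream buffer (hypothetical model)."""

    def __init__(self, buffer_depth: int) -> None:
        # Assumption: one credit per free slot in the downstream buffer.
        self.credits = buffer_depth

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self) -> None:
        # Sending a flit consumes one credit.
        if not self.can_send():
            raise RuntimeError("no credits: downstream buffer is full")
        self.credits -= 1

    def return_credit(self) -> None:
        # Called when the downstream router frees a slot.
        self.credits += 1


link = CreditLink(buffer_depth=2)
link.send_flit()
link.send_flit()
blocked = not link.can_send()   # both slots are in use
link.return_credit()
resumed = link.can_send()       # one slot was freed downstream
```

The sender never needs to see the downstream buffer itself; the credit counter alone tells it when sending is safe.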

Background – XY mesh NoC • Fewer bottlenecks • [Figure: XY mesh of routers and clients; legend: C = Client, R = Router]

Architectures Review • Intermediate buffer architectures considered: • Input Buffer • Shared Memory • Shared Buffer

Architectures Review - Cont. • Shared Buffer – chosen architecture • In-ports store incoming flits in the shared buffer; out-ports read flits from the shared buffer • Each incoming flit is assigned a Time Stamp (TS) and a Shared Buffer Allocation

Architectures Review - Cont. • Shared Buffer: • Eliminates the need for linked-list management • Decouples in-ports and out-ports (a flit can acquire any shared buffer, and each shared buffer can be connected to any out-port) • Buffers are shared among all the ports, so buffer utilization is better

Top Architecture • [Figure: block diagram with labels In-ports, Input Buffers, Control Signals, Crossbar 1, Crossbar 2, Out-ports; pipeline stages RC, VCA, TS, SBA, XB1, XB2, Link Traversal]

Pipeline Stages • Stage 1 – Buffer Write: • Incoming flits are written into the input buffers. • The input buffers are segmented according to the number of VCs of each input port.

Pipeline Stages • Stage 2 – Routing Calculation: • This stage is relevant only for the head flit • The output port is determined from the destination coordinates carried by the flit

Pipeline Stages • Stage 3.1 – VC Allocation: • This stage is relevant only for the head flit • Arbitrates for free virtual channels at the input of the next-hop router • Manages a list of free VCs for each output port

Pipeline Stages • Stage 3.2 – Time Stamping (TS): • Assigns ingress flits to the shared buffer by resolving the departure conflict • Assigns time slots in a cyclical fashion • Assigns the earliest departure time to as many flits as possible

Pipeline Stages • Stage 4 – Shared Buffer Allocation: • Flits that were assigned time slots in the TS stage are assigned to a specific shared buffer • Responsible for maintaining the order of flits from the same packet • Must respect the write constraints of the shared buffers (a violation causes an Arrival Conflict) • A flit whose allocation fails re-enters the TS stage

Pipeline Stages - Conflicts • Departure Conflict: • Occurs for out-port O when more than ELWO flits are assigned the same time stamp. • Arrival Conflict: • Occurs when trying to write more flits than allowed to a given shared buffer
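
The two conflict definitions above can be checked with a short sketch. ELWO is not defined on the slide, so the code treats it as a per-out-port limit on flits per time stamp; the function names and data layout are hypothetical.

```python
from collections import Counter


def departure_conflicts(assignments, limit_per_port):
    """Return the (out_port, time_stamp) pairs that are over-subscribed.

    assignments: iterable of (out_port, time_stamp), one entry per flit.
    limit_per_port: dict mapping out_port to its flit limit per time stamp
                    (stands in for ELWO, which is an assumption here).
    """
    counts = Counter(assignments)
    return {key for key, n in counts.items() if n > limit_per_port[key[0]]}


def arrival_conflicts(writes, max_writes_per_buffer):
    """Return the shared buffers that receive too many writes in one cycle.

    writes: iterable of shared-buffer ids, one entry per flit written.
    """
    counts = Counter(writes)
    return {buf for buf, n in counts.items() if n > max_writes_per_buffer}


# Four flits want to leave out-port 2 at time stamp 5, but only 3 may do so.
dep = departure_conflicts([(2, 5), (2, 5), (2, 5), (2, 5), (1, 5)], {1: 1, 2: 3})
# Two flits target shared buffer 0 in the same cycle, but only 1 write is allowed.
arr = arrival_conflicts([0, 0, 1], max_writes_per_buffer=1)
```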

Pipeline Stages • Stage 5 – Crossbar 1 (XB1) & SB Write: • Flits traverse the first crossbar and are written into the shared buffers.

Pipeline Stages • Stage 6 – SB Read & Crossbar 2 (XB2): • Flits stored in time-slot 0 are read from the shared buffer and traverse the second crossbar. • Each time-slot i advances to time-slot i-1.
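
The time-slot shift in this stage can be sketched as follows. This is a simplified model that keeps one list of flits per time slot; the function name `read_and_advance` and the use of a deque are assumptions, not the project's implementation.

```python
from collections import deque


def read_and_advance(slots):
    """Read the flits in time-slot 0 and shift every slot i to slot i-1.

    slots: deque of lists, where slots[i] holds the flits stamped with slot i.
    An empty slot is appended so the number of slots stays constant.
    """
    departing = slots.popleft()
    slots.append([])
    return departing


slots = deque([["a"], ["b", "c"], []])
out = read_and_advance(slots)
# "a" departs through XB2; "b" and "c" now sit in time-slot 0.
```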

Pipeline Stages • Stage 7 – Link Traversal: • The flits are transmitted to the downstream router

NoC Heterogeneity • Modular parameters: • Number of virtual channels per port • In-port & out-port width • Number of FIFOs in shared buffers • Shared buffer length/size • Speed-up
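
One way to picture these parameters is as a per-router configuration record. The sketch below is illustrative only: the class and field names are hypothetical, the widths and shared-buffer sizes are made-up values, and only the per-port VC counts (1,1,2,2,2) are borrowed from the medium VCA configuration in this presentation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PortConfig:
    num_vcs: int  # number of virtual channels on this port
    width: int    # port width (unit assumed: flits per cycle)


@dataclass(frozen=True)
class RouterConfig:
    in_ports: Tuple[PortConfig, ...]
    out_ports: Tuple[PortConfig, ...]
    num_shared_fifos: int      # number of FIFOs in the shared buffers
    shared_buffer_depth: int   # shared buffer length/size
    speedup: int

    def __post_init__(self) -> None:
        # Every parameter must be a positive count.
        values = [self.num_shared_fifos, self.shared_buffer_depth, self.speedup]
        for port in self.in_ports + self.out_ports:
            values += [port.num_vcs, port.width]
        if any(v < 1 for v in values):
            raise ValueError("all router parameters must be >= 1")


# Heterogeneous example: ports differ in their number of VCs.
ports = tuple(PortConfig(num_vcs=n, width=1) for n in (1, 1, 2, 2, 2))
cfg = RouterConfig(ports, ports, num_shared_fifos=4, shared_buffer_depth=8, speedup=1)
total_vcs = sum(p.num_vcs for p in cfg.in_ports)
```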

Router Blocks With Simulation and Synthesis Results

Testbench Environment Overview • An individual testbench was created for each unit/block of the router. • We ran different tests on each block, checking its functionality across the heterogeneous parameters. • All functional simulations were done in ModelSim 10.3.

Synthesis Environment Overview • Vivado 2013.4 targeting the Virtex-7 VC709 Evaluation Platform board • The largest I/O pin count (among Xilinx’s newest boards) • Wrapper entity for each unit • Fewer ports (only one of each kind) • Deals with large ports • Keep attribute to avoid “optimizing out” similar paths

Synthesis Environment Overview • [Figure: wrapper unit reducing 6 Type 1 in-ports to one Type 1 in-port and 3 Type 2 in-ports to one Type 2 in-port]

Synthesis Environment Overview • [Figure: wrapper unit for 7 out-ports of the same kind]

Input Buffer • Each in-port consists of one input buffer unit, which instantiates pointer-based FIFOs • Parameters assigned to each unit: • Number of VCs • Speedup • Bandwidth • FIFO depth • These parameters (implemented as generics) determine the number of FIFOs, which are instantiated with the VHDL "if generate" construct. • The buffers operate according to the virtual-channel flow control convention.

Input Buffer - Operation • The input buffer interfaces with the following router blocks: • Out-ports block (in the previous router) • VCA block • TS block • SBA block • XB1 block • When flits are ready for a certain stage, the input buffer sends a request to the corresponding block.

Input Buffer FIFOs • The flits are stored in a two-dimensional FIFO array • Dimension 1: derived from the number of VCs • Dimension 2: derived from the speedup and the bandwidth • The flits stay in the FIFOs throughout the pipeline stages until they depart to the shared buffer in the XB1 stage. • Each cell in the FIFO is of type “Flit Extended Type”: [Figure: Flit Extended Type definition]

Input Buffer FIFO Pointers • The FIFOs are cyclic, saving power by switching addresses/pointers instead of moving the data itself. • Each pipeline stage has a pointer to the next flit that needs to pass through that stage • The following pointers are used: • write_addr - write address • read_addr - read address • rc_addr - RC (Routing Calculation) address • vca_addr - VCA (Virtual Channel Allocation) address • ts_addr - TS (Time Stamping) address • sba_addr - SBA (Shared Buffer Allocation) address • If one of the stages (VCA, TS, SBA) fails, the FIFO receives a request from the input buffer to reverse the corresponding pointer.
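
The pointer scheme can be sketched in software as a cyclic buffer whose cells never move. The pointer names follow the slide; the class `StagedFifo`, its methods, and the omission of full/empty checks are simplifying assumptions rather than the VHDL design.

```python
class StagedFifo:
    """Cyclic FIFO whose cells stay put; only the per-stage pointers move."""

    POINTERS = ("read_addr", "rc_addr", "vca_addr", "ts_addr", "sba_addr")

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.cells = [None] * depth
        self.write_addr = 0
        self.ptr = {name: 0 for name in self.POINTERS}

    def write(self, flit) -> None:
        # No full/empty checks here; a real FIFO would need them.
        self.cells[self.write_addr] = flit
        self.write_addr = (self.write_addr + 1) % self.depth

    def advance(self, pointer: str):
        """Return the flit the stage pointer refers to and move the pointer on."""
        flit = self.cells[self.ptr[pointer]]
        self.ptr[pointer] = (self.ptr[pointer] + 1) % self.depth
        return flit

    def reverse(self, pointer: str, steps: int = 1) -> None:
        """Roll a stage pointer back, e.g. after a failed VCA, TS or SBA attempt."""
        self.ptr[pointer] = (self.ptr[pointer] - steps) % self.depth


fifo = StagedFifo(depth=2)
fifo.write("head")
fifo.write("body")
first = fifo.advance("ts_addr")
second = fifo.advance("ts_addr")
fifo.reverse("ts_addr")            # the TS attempt for "body" failed
retried = fifo.advance("ts_addr")  # the same flit is offered to TS again
```

Reversing a pointer costs one address update, which is the point of the scheme: the flit data itself is never copied.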

Input Buffer - Simulation • Wave overview of FIFO status during simulation: [Figure: simulation waveform]

Input Buffer - Synthesis • Small configuration: • Number of VCs in port: 1 • Depth: 2 • Port BW: 1 • Speed-up: 1 • Number of FIFOs per VC: 1

Input Buffer - Synthesis • Large configuration: • Number of VCs in port: 2 • Depth: 2 • Port BW: 2 • Speed-up: 2 • Number of FIFOs per VC: 2

Input Buffer - Synthesis • Utilization summary for both configurations: [Figure: utilization summary table] • Longest path delays: • 18.14 ns in the small configuration • 18.74 ns in the large configuration

Routing Calculation Unit • Responsible for directing incoming flits to the router’s output ports so that they reach their destination. • Whenever a head flit enters an input buffer, it is sent to the RC unit.

Routing Calculation Unit • Algorithm in pseudo code: • if (dest.x < local.x) then output port = 4; • else if (dest.x > local.x) then output port = 2; • else if (dest.y < local.y) then output port = 1; • else if (dest.y > local.y) then output port = 3; • else output port = 0; • Example of a head flit traveling from (2,1) to (0,2): [Figure: route through the mesh]
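
The pseudo code translates directly into a small function. The port numbers come from the slide; the mapping from port number to the neighbour's coordinates in `STEP` is an assumption inferred from the comparisons (for example, port 4 is chosen when the destination has a smaller x, so it is taken to lead to x-1), and the helper `trace_ports` is hypothetical.

```python
def route_xy(dest, local):
    """XY routing: resolve the x offset first, then the y offset."""
    dest_x, dest_y = dest
    local_x, local_y = local
    if dest_x < local_x:
        return 4
    if dest_x > local_x:
        return 2
    if dest_y < local_y:
        return 1
    if dest_y > local_y:
        return 3
    return 0  # the flit has arrived: deliver it to the local client


# Assumed coordinate change when leaving through each port.
STEP = {4: (-1, 0), 2: (1, 0), 1: (0, -1), 3: (0, 1)}


def trace_ports(src, dest):
    """List the output port chosen at every router along the path."""
    ports, pos = [], src
    while True:
        port = route_xy(dest, pos)
        ports.append(port)
        if port == 0:
            return ports
        dx, dy = STEP[port]
        pos = (pos[0] + dx, pos[1] + dy)


path = trace_ports((2, 1), (0, 2))
```

For the slide's example, a head flit from (2,1) to (0,2) leaves through port 4 twice, then through port 3, and is finally delivered on port 0.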

Routing Calculation Unit - Simulation • Simultaneous requests from all 5 IBs. The requests are (4,3), (3,3), (3,1), (2,1), (4,4) from IBs 0, 1, 2, 3, 4 respectively.

Routing Calculation Unit - Synthesis • RTL schematic of the RC unit: [Figure: RTL schematic]

Routing Calculation Unit - Synthesis • Utilization summary of the RC unit: [Figure: utilization summary table] • The longest path delay was 1.34 ns

Virtual Channel Allocation Unit • Responsible for allocating virtual channels in the destination router’s input buffer • Deals only with head flits • Receives an array of up to 5 requests (one per input buffer), each containing: • The output port number the requesting flit wishes to traverse to (the RC result) • Looks for a free VC in that output port: • Succeeds → returns that VC’s index • Fails → returns an invalid indication to the requesting input buffer • Uses round-robin arbitration for fairness
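
A minimal sketch of this allocation step is shown below. The slide does not say how the round-robin priority is updated, so rotating the starting input buffer by one after every call is an assumption, as are the function name `allocate_vcs` and the data layout.

```python
def allocate_vcs(requests, free_vcs, start):
    """Grant free VCs to requesting input buffers in round-robin order.

    requests: list with one entry per input buffer; the requested out-port
              number (the RC result) or None if that buffer has no request.
    free_vcs: dict mapping out-port number to its list of free VC indices;
              granted VCs are removed from these lists in place.
    start:    index of the input buffer with the highest priority this round.
    Returns (results, next_start); results[i] is the granted VC index, or
    None as the invalid indication.
    """
    n = len(requests)
    results = [None] * n
    for k in range(n):
        ib = (start + k) % n
        port = requests[ib]
        if port is not None and free_vcs.get(port):
            results[ib] = free_vcs[port].pop(0)
    return results, (start + 1) % n


# Four head flits all request out-port 2, which has only 3 free VCs.
free = {2: [0, 1, 2]}
granted, next_start = allocate_vcs([2, 2, 2, 2], free, start=0)
```

With priority starting at input buffer 0, the first three buffers are granted VCs 0, 1 and 2, and the fourth receives the invalid indication and must retry.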

Virtual Channel Allocation Unit • Example (before and after allocation): • Router A has 4 input buffers with one VC each. • Each IB holds a head flit (marked A, B, C, D) that wishes to traverse to router B. • The relevant IB in router B has 3 free VCs.

Virtual Channel Allocation Unit - Simulation • Simulation example: • Allocation of one VC • Input buffer 4 requests a VC allocation from output port 2, which has 3 free VCs • The VCA result is stored in the vca_calc_arr signal array • After the allocation, out-port 2 has only 2 free VCs

Virtual Channel Allocation Unit - Synthesis • Small configuration: • Number of VCs per port: 1,1,1,1,1 • VC depth (number of buffers in each VC): 2,2,2,2,2 • RTL schematic: [Figure: RTL schematic]

Virtual Channel Allocation Unit - Synthesis • Medium configuration: • Number of VCs per port: 1,1,2,2,2 • VC depth (number of buffers in each VC): 2,2,2,4,4 • RTL schematic: [Figure: RTL schematic]

Virtual Channel Allocation Unit - Synthesis • Large configuration: • Number of VCs per port: 2,2,2,2,2 • VC depth (number of buffers in each VC): 4,4,4,4,4 • RTL schematic: [Figure: RTL schematic]

Virtual Channel Allocation Unit - Synthesis • Utilization summary of the VCA unit: [Figure: utilization summary table] • Longest path delays: • 11.93 ns in the small configuration • 26.5 ns in the medium configuration • 32.29 ns in the large configuration
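
The reported longest path delays can be turned into a rough bound on clock frequency. The sketch assumes that a block's longest path must fit in one clock cycle and ignores setup time, clock skew and routing between blocks, so the results are estimates rather than measured figures; the function name is hypothetical.

```python
def max_frequency_mhz(longest_path_ns: float) -> float:
    """Upper bound on clock frequency if one cycle must cover the longest path."""
    if longest_path_ns <= 0:
        raise ValueError("path delay must be positive")
    return 1000.0 / longest_path_ns  # 1/ns is GHz, so 1000/ns is MHz


# Longest path delays of the VCA unit, as reported on the slide.
vca_longest_path_ns = {"small": 11.93, "medium": 26.5, "large": 32.29}
vca_fmax_mhz = {name: max_frequency_mhz(t) for name, t in vca_longest_path_ns.items()}

for name, f in vca_fmax_mhz.items():
    print(f"{name}: {f:.1f} MHz")
```

Under this assumption the VCA unit alone would cap the clock at roughly 83.8, 37.7 and 31.0 MHz for the small, medium and large configurations.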

Time Stamp Unit • Allocates columns of the shared buffer • [Figure: TS unit assigning in-port flits to shared-buffer time-slot columns; out-port BW = 3]

Time Stamp Unit • Resolves the departure conflict • [Figure: shared-buffer time-slot columns; out-port BW = 3]

Time Stamp Unit - Design • Interface • With IB: • Flits’ requested output port and VC • Source VC • Number of flits to time-stamp • With out-ports unit: • Free slots counter (for each VC in each output port) • With SBA: • Failed flits • Last successful TS • Returns: • Number of successful TS allocations • Allocated TS (per flit)

Time Stamp Unit - Design • Internally maintained data • Last TS allocated (per packet) • Keeps the flit allocation in order • Number of allocated flits (per out-port, per TS) • Used to resolve the departure conflict

Time Stamp Unit • TS calculation • Next TS: max{last TS of the packet, current TS in SB + 3} • Number of flits sharing the same TS: min{out-port’s FSC (free slots counter), out-port’s BW, number of requesting flits}
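
The two formulas can be written out as a short sketch. The offset of 3 is taken from the slide; the function and parameter names are hypothetical.

```python
def next_time_stamp(last_ts_of_packet: int, current_ts_in_sb: int) -> int:
    """Next TS = max(last TS of the packet, current TS in SB + 3)."""
    return max(last_ts_of_packet, current_ts_in_sb + 3)


def flits_sharing_ts(free_slots_counter: int, out_port_bw: int, num_requesting: int) -> int:
    """Flits that may share one TS = min(FSC, out-port BW, requesting flits)."""
    return min(free_slots_counter, out_port_bw, num_requesting)


ts_a = next_time_stamp(last_ts_of_packet=2, current_ts_in_sb=4)   # SB term wins
ts_b = next_time_stamp(last_ts_of_packet=9, current_ts_in_sb=4)   # packet term wins
same_ts = flits_sharing_ts(free_slots_counter=5, out_port_bw=3, num_requesting=4)
```

In the last call the out-port bandwidth of 3 is the binding limit, so one of the four requesting flits has to take a later time stamp.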

Time Stamp Unit • Speculative approach • Avoiding speculation would require either: • Keeping track of the SB’s exact occupancy, or • Constantly communicating with the SB • As a result, a TS allocation can fail • On failure, the relevant internally maintained data is reset
