
Heterogeneous NoC Router



Presentation Transcript


  1. Heterogeneous NoC Router • Moti Mor, Tomer Gal • Instructor: Yaniv Ben Itzhak • Final Presentation, 25.03.2014

  2. Project Goals • Research on different heterogeneous NoC architectures • Design an architecture for a heterogeneous router • Architecture implementation • Basic measurements of speed and performance: • Latency • Throughput • Power • Area • Maximum frequency

  3. Introduction • Network-on-Chip (NoC) is a new approach to designing the communication subsystem of SoCs and Chip Multi-Processors (CMPs) • Clients communicate through a network of routers • Overcomes bus bottlenecks and improves performance (Figure: 3x3 mesh of routers (R), each attached to a client (C))

  4. Background • The SoC units communicate through a network of routers • Each router is assigned to a single unit • Supports many simultaneous connections • Credit-based flit-level flow control (a sketch follows below)
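Credit-based flow control can be illustrated with a minimal per-VC credit counter. The sketch below is only an assumption-laden illustration (the entity, generic and signal names are hypothetical and not taken from the project's RTL): a flit may be sent downstream only while credits remain, and a credit is returned whenever the downstream buffer frees a slot.

    library ieee;
    use ieee.std_logic_1164.all;

    entity credit_counter is
      generic (CREDITS : natural := 4);          -- assumed depth of the downstream VC buffer
      port (
        clk, rst      : in  std_logic;
        flit_sent     : in  std_logic;           -- a flit left on this VC this cycle
        credit_return : in  std_logic;           -- the downstream router freed one slot
        can_send      : out std_logic            -- high while credits remain
      );
    end entity;

    architecture rtl of credit_counter is
      signal credits : natural range 0 to CREDITS := CREDITS;
    begin
      can_send <= '1' when credits > 0 else '0';

      process (clk)
      begin
        if rising_edge(clk) then
          if rst = '1' then
            credits <= CREDITS;
          elsif flit_sent = '1' and credit_return = '0' then
            credits <= credits - 1;              -- one buffer slot consumed downstream
          elsif credit_return = '1' and flit_sent = '0' then
            credits <= credits + 1;              -- one buffer slot released downstream
          end if;                                -- both or neither: the count is unchanged
        end if;
      end process;
    end architecture;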

  5. Background – XY Mesh NoC • C = Client, R = Router • Fewer bottlenecks (Figure: 3x3 XY mesh of routers, one client per router)

  6. Architectures Review • Intermediate buffer architectures considered: • Input Buffer • Shared Memory • Shared Buffer

  7. Architectures Review - Cont. • Shared Buffer – the chosen architecture • In-ports store incoming flits into the shared buffer; out-ports read flits from the shared buffer • Each incoming flit is assigned a Time Stamp (TS) and a Shared Buffer Allocation (SBA)

  8. Architectures Review - Cont. • Shared Buffer: • Eliminates the need for linked-list management • Decouples in-ports and out-ports (a flit can acquire any shared buffer, and each shared buffer can be connected to any out-port) • Buffers are shared among all the ports, so better buffer utilization is achieved

  9. Top Architecture (Block diagram: in-ports with input buffers, control stages RC, VCA, TS and SBA, Crossbar 1 (XB1), shared buffers, Crossbar 2 (XB2), link traversal and out-ports, connected by control signals)

  10. Pipeline Stages • Stage 1 – Buffer Write: • Incoming flits are written into the input buffers • The input buffers are segmented according to the number of VCs of each input port

  11. Pipeline Stages • Stage 2 – Routing Calculation: • This stage is relevant only for the head flit • The output port is determined from the flit’s destination coordinates

  12. Pipeline Stages • Stage 3.1 – VC Allocation: • This stage is relevant only for the head flit • Arbitrates for free virtual channels at the input of the next-hop router • Maintains a free-VC list for each output port

  13. Pipeline Stages • Stage 3.2 – Time Stamping (TS): • Assigns ingress flits to shared-buffer time slots, resolving the departure conflict • Assigns time slots in a cyclic fashion • Assigns the earliest possible departure time to as many flits as possible

  14. Pipeline Stages • Stage 4 – Shared Buffer Allocation: • Flits that were assigned time slots in the TS stage are assigned to a specific shared buffer • Responsible for maintaining the order of flits from the same packet • Must respect the write constraints of the shared buffers (which can cause an arrival conflict) • If allocation fails, the flit re-enters the TS stage

  15. Pipeline Stages - Conflicts • Departure conflict: • Occurs for out-port O when more than ELW_O flits (the link bandwidth of out-port O) are assigned the same time stamp; for example, with BW = 3, assigning four flits to one time slot of that out-port is a departure conflict • Arrival conflict: • Occurs when trying to write more flits than allowed into a certain shared buffer

  16. Pipeline Stages • Stage 5 – Crossbar 1 (XB1) & SB Write: • Flits traverse the first crossbar and are written into the shared buffers

  17. Pipeline Stages • Stage 6 – SB Read & Crossbar 2 (XB2): • Flits stored in time slot 0 are read from the shared buffer and traverse the second crossbar • Each time slot i advances to time slot i-1

  18. Pipeline Stages • Stage 7 – Link Traversal: • The flits are transmitted to the downstream router

  19. NoC Heterogeneity • Modular parameters: • Number of virtual channels per port • In-port & out-port width • Number of FIFOs in the shared buffers • Shared buffer length/size • Speed-up

  20. Router Blocks With Simulation and Synthesis Results

  21. Testbench Environment Overview • An individual testbench was created for each unit/block of the router • Different tests were run on each block, exercising its functionality as well as its heterogeneous parameters • All functional simulations were done in ModelSim 10.3

  22. Synthesis Environment Overview • Vivado 2013.4 targeting the Virtex-7 VC709 Evaluation Platform board • The largest IO pin count among Xilinx’s newest boards • A wrapper entity for each unit: • Fewer ports (only one of each kind) • Deals with large ports • The keep attribute avoids “optimizing out” similar paths (see the sketch below)
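The sketch below illustrates how the keep synthesis attribute can be attached in VHDL so that Vivado does not merge or remove replicated paths; the entity and signal names are hypothetical and do not come from the project's wrappers.

    library ieee;
    use ieee.std_logic_1164.all;

    entity keep_demo is
      port (a, b : in  std_logic_vector(7 downto 0);
            y, z : out std_logic_vector(7 downto 0));
    end entity;

    architecture rtl of keep_demo is
      signal path_a, path_b : std_logic_vector(7 downto 0);
      attribute keep : string;
      attribute keep of path_a : signal is "true";   -- keep each replicated path distinct
      attribute keep of path_b : signal is "true";   -- so similar logic is not optimized out
    begin
      path_a <= not a;
      path_b <= not b;
      y <= path_a;
      z <= path_b;
    end architecture;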

  23. Synthesis Environment Overview (Diagram: a wrapper unit exposes one Type 1 in-port and one Type 2 in-port, internally driving 6 Type 1 in-ports and 3 Type 2 in-ports)

  24. Synthesis Environment Overview (Diagram: a wrapper unit driving 7 out-ports of the same kind)

  25. Input Buffer • Each in-port consists of one input buffer unit, which instantiates pointer-based FIFOs • Parameters assigned to each unit: • Number of VCs • Speedup • Bandwidth • FIFO depth • These parameters, implemented as VHDL generics, determine how many FIFOs are created using the VHDL “if generate” construct (a sketch follows below) • The buffers operate according to the virtual-channel flow-control convention
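A minimal sketch of how such generics can drive the number of instantiated FIFOs is shown below. It is illustrative only: the generic names, the port list and the pointer_fifo component are assumptions, and a for ... generate loop is used here where the actual design uses if generate conditions.

    library ieee;
    use ieee.std_logic_1164.all;

    entity input_buffer is
      generic (
        NUM_VCS    : positive := 2;    -- VCs on this in-port
        BANDWIDTH  : positive := 1;    -- flits accepted per cycle
        SPEEDUP    : positive := 1;
        FIFO_DEPTH : positive := 4;
        FLIT_W     : positive := 64
      );
      port (
        clk, rst : in  std_logic;
        wr_en    : in  std_logic_vector(NUM_VCS*BANDWIDTH*SPEEDUP-1 downto 0);
        wr_flit  : in  std_logic_vector(FLIT_W-1 downto 0);
        rd_en    : in  std_logic_vector(NUM_VCS*BANDWIDTH*SPEEDUP-1 downto 0);
        rd_flit  : out std_logic_vector(NUM_VCS*BANDWIDTH*SPEEDUP*FLIT_W-1 downto 0)
      );
    end entity;

    architecture rtl of input_buffer is
      component pointer_fifo is                        -- assumed FIFO building block
        generic (DEPTH : positive; WIDTH : positive);
        port (clk, rst, wr_en, rd_en : in  std_logic;
              din  : in  std_logic_vector(WIDTH-1 downto 0);
              dout : out std_logic_vector(WIDTH-1 downto 0));
      end component;
      constant NUM_FIFOS : positive := NUM_VCS * BANDWIDTH * SPEEDUP;
    begin
      -- The generics fix how many FIFOs are elaborated: one per VC per lane.
      gen_fifos : for i in 0 to NUM_FIFOS-1 generate
        u_fifo : pointer_fifo
          generic map (DEPTH => FIFO_DEPTH, WIDTH => FLIT_W)
          port map (clk => clk, rst => rst,
                    wr_en => wr_en(i), rd_en => rd_en(i),
                    din   => wr_flit,
                    dout  => rd_flit((i+1)*FLIT_W-1 downto i*FLIT_W));
      end generate;
    end architecture;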

  26. Input Buffer - Operation • The input buffer interfaces with the following router blocks: • Out-ports block (in the upstream router) • VCA block • TS block • SBA block • XB1 block • When flits are ready for a certain stage, the input buffer sends a request to the corresponding block

  27. Input Buffer FIFOs • The flits are stored in a two-dimensional FIFO array • Dimension 1: determined by the number of VCs • Dimension 2: determined by the speedup and the bandwidth • The flits stay in the FIFOs throughout the pipeline stages, until they depart to the shared buffer in the XB1 stage • Each FIFO cell is of the “Flit Extended Type”

  28. Input Buffer FIFO Pointers • The FIFOs are cyclic: power is saved by switching addresses/pointers instead of moving the entire chunk of data • Each pipeline stage has a pointer to the next flit that needs to pass that stage • The following pointers are used: • write_addr - write address • read_addr - read address • rc_addr - RC (Routing Calculation) address • vca_addr - VCA (Virtual Channel Allocation) address • ts_addr - TS (Time Stamping) address • sba_addr - SBA (Shared Buffer Allocation) address • If one of the stages (VCA, TS, SBA) fails, the FIFO receives a request from the input buffer to reverse the corresponding pointer (see the sketch below)
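The following sketch shows one per-stage cyclic pointer with the reverse (rollback) behavior described above; the entity, generic and signal names are hypothetical and only illustrate the idea of advancing addresses instead of moving data.

    library ieee;
    use ieee.std_logic_1164.all;

    entity stage_pointer is
      generic (DEPTH : positive := 4);                 -- assumed FIFO depth
      port (
        clk, rst : in  std_logic;
        advance  : in  std_logic;                      -- the flit at this address passed the stage
        rollback : in  std_logic;                      -- the stage (VCA/TS/SBA) failed
        addr     : out natural range 0 to DEPTH-1
      );
    end entity;

    architecture rtl of stage_pointer is
      signal ptr : natural range 0 to DEPTH-1 := 0;
    begin
      addr <= ptr;

      process (clk)
      begin
        if rising_edge(clk) then
          if rst = '1' then
            ptr <= 0;
          elsif advance = '1' and rollback = '0' then
            ptr <= (ptr + 1) mod DEPTH;                -- cyclic advance; no data is moved
          elsif rollback = '1' and advance = '0' then
            ptr <= (ptr + DEPTH - 1) mod DEPTH;        -- reverse the pointer after a failure
          end if;
        end if;
      end process;
    end architecture;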

  29. Input Buffer - Simulation • Wave overview of the FIFO status during simulation:

  30. Input Buffer - Synthesis • Small configuration: • Number of VCs per port: 1 • Depth: 2 • Port BW: 1 • Speed-up: 1 • Number of FIFOs per VC: 1

  31. Input Buffer - Synthesis • Large configuration: • Number of VCs per port: 2 • Depth: 2 • Port BW: 2 • Speed-up: 2 • Number of FIFOs per VC: 2

  32. Input Buffer - Synthesis • Utilization summary for both configurations: • The longest path delays were: • 18.14[ns] in the small configuration • 18.74[ns] in the large configuration

  33. Routing Calculation Unit • Responsible for directing incoming flits to the router’s output ports so that the flits reach their destination • Whenever a head flit enters an input buffer, it is sent to the RC unit

  34. Routing Calculation Unit • Algorithm in pseudocode (an RTL sketch follows below): • if (dest.x < local.x) then output_port = 4; • else if (dest.x > local.x) then output_port = 2; • else if (dest.y < local.y) then output_port = 1; • else if (dest.y > local.y) then output_port = 3; • else output_port = 0; • Example of a head flit traveling from (2,1) to (0,2):
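As referenced above, the same decision can be written as a small combinational VHDL process; the entity name, coordinate ranges and generic names here are hypothetical, and the port numbering simply copies the pseudocode.

    library ieee;
    use ieee.std_logic_1164.all;

    entity rc_unit is
      generic (LOCAL_X : natural := 2;                 -- this router's coordinates
               LOCAL_Y : natural := 1);
      port (dest_x, dest_y : in  natural range 0 to 7; -- head flit's destination coordinates
            out_port       : out natural range 0 to 4);
    end entity;

    architecture rtl of rc_unit is
    begin
      process (dest_x, dest_y)
      begin
        if dest_x < LOCAL_X then
          out_port <= 4;                               -- route toward smaller x
        elsif dest_x > LOCAL_X then
          out_port <= 2;                               -- route toward larger x
        elsif dest_y < LOCAL_Y then
          out_port <= 1;                               -- x matches: route toward smaller y
        elsif dest_y > LOCAL_Y then
          out_port <= 3;                               -- route toward larger y
        else
          out_port <= 0;                               -- destination reached: local port
        end if;
      end process;
    end architecture;

For the slide's example, a head flit at router (2,1) destined for (0,2) has dest.x < local.x, so it keeps leaving through port 4 until its x coordinate matches, and only then is it routed in the y dimension through port 3.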

  35. Routing Calculation Unit - Simulation • Simultaneous requests from all 5 IBs. The requests are (4,3), (3,3), (3,1), (2,1), (4,4) from IBs 0, 1, 2, 3, 4 respectively.

  36. Routing Calculation Unit - Synthesis • RTL schematic of the RC unit:

  37. Routing Calculation Unit - Synthesis • Utilization summary of the RC unit: • The longest path delay was 1.34[ns]

  38. Virtual Channel Allocation Unit • Responsible for allocating virtual channels in the next-hop router’s input buffer • Deals only with head flits • Receives an array of up to 5 requests (one per input buffer), each containing the output port the requesting flit wishes to traverse to (the RC result) • Looks for a free VC in that output port: • Success: returns that VC’s index • Failure: returns an invalid indication to the requesting input buffer • Implements round-robin fairness among the input buffers (a behavioral sketch follows below)
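The behavioral sketch below illustrates one possible round-robin grant loop under several assumptions: the free-VC lists are reduced to per-out-port counters, the request encoding (3 bits per input buffer), the port and signal names, and the scan-based arbitration are hypothetical, and VC release at packet completion is omitted.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity vca_unit is
      generic (VCS_PER_OUT : positive := 2);            -- assumed free VCs per out-port after reset
      port (
        clk, rst  : in  std_logic;
        req_valid : in  std_logic_vector(0 to 4);       -- one request per input buffer
        req_oport : in  std_logic_vector(14 downto 0);  -- 3 bits per IB: requested out-port (0..4)
        grant     : out std_logic_vector(0 to 4)        -- '1' = a VC was allocated this cycle
      );
    end entity;

    architecture rtl of vca_unit is
      type count_t is array (0 to 4) of natural range 0 to VCS_PER_OUT;
      signal free_vcs   : count_t := (others => VCS_PER_OUT);  -- free-VC count per out-port
      signal last_grant : natural range 0 to 4 := 4;
    begin
      process (clk)
        variable ib, op : natural range 0 to 7;
        variable free   : count_t;
      begin
        if rising_edge(clk) then
          if rst = '1' then
            free_vcs   <= (others => VCS_PER_OUT);
            last_grant <= 4;
            grant      <= (others => '0');
          else
            free  := free_vcs;
            grant <= (others => '0');
            for k in 0 to 4 loop
              ib := (last_grant + 1 + k) mod 5;          -- round-robin scan order
              if req_valid(ib) = '1' then
                op := to_integer(unsigned(req_oport(3*ib+2 downto 3*ib)));
                if op <= 4 then
                  if free(op) > 0 then
                    free(op)   := free(op) - 1;          -- take one VC from that out-port
                    grant(ib)  <= '1';
                    last_grant <= ib;                    -- last grantee gets lowest priority next
                  end if;                                -- else: invalid indication (grant stays '0')
                end if;
              end if;
            end loop;
            free_vcs <= free;
            -- Releasing a VC when its packet's tail departs downstream is omitted in this sketch.
          end if;
        end if;
      end process;
    end architecture;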

  39. Virtual Channel Allocation Unit • Example (before and after allocation): • Router A has 4 input buffers with one VC each • In each IB there is a head flit (marked A, B, C, D) that wishes to traverse to router B • The relevant IB in router B has 3 free VCs

  40. Virtual Channel Allocation Unit - Simulation • Simulation example: allocation of one VC • Input buffer 4 requests a VC allocation from output port 2, which has 3 free VCs • The VCA result is stored in the vca_calc_arr signal array • After the allocation, out-port 2 has only 2 free VCs

  41. Virtual Channel Allocation Unit - Synthesis • Small configuration: • Number of VCs per port: 1,1,1,1,1 • VC depth (number of buffers in each VC): 2,2,2,2,2 • RTL schematics:

  42. Virtual Channel Allocation Unit - Synthesis • Medium configuration: • Number of VCs per port: 1,1,2,2,2 • VC depth (number of buffers in each VC): 2,2,2,4,4 • RTL schematics:

  43. Virtual Channel Allocation Unit - Synthesis • Large configuration: • Number of VCs per port: 2,2,2,2,2 • VC depth (number of buffers in each VC): 4,4,4,4,4 • RTL schematics:

  44. Virtual Channel Allocation Unit - Synthesis • Utilization summary of the VCA unit: • The longest path delays were: • 11.93[ns] in the small configuration • 26.5[ns] in the medium configuration • 32.29[ns] in the large configuration

  45. Time Stamp Unit • Allocates columns (time slots) of the shared buffer (Figure: shared-buffer time-slot columns 0–9 between an in-port and an out-port with BW = 3)

  46. Time Stamp Unit • Resolves the departure conflict (Figure: flits are spread across time-slot columns 0–9 so that no more than BW = 3 flits share a slot for the out-port)

  47. Time Stamp Unit - Design • Interface: • With the IB: • The flits’ requested output port and VC • Source VC • Number of flits to time-stamp • With the out-ports unit: • Free-slots counter (FSC), one per VC in each output port • With the SBA: • Failed flits • Last successful TS • Returns: • Number of successful time stamps • Allocated TS (per flit)

  48. Time Stamp Unit - Design • Internally maintained data: • Last TS allocated (per packet) – keeps the flit allocation in order • Number of allocated flits (per out-port, per TS) – used to deal with the departure conflict

  49. Time Stamp Unit • TS calculation: • Next TS = max{packet’s last TS, current TS in the SB + 3} • Number of flits sharing that TS = min{out-port’s FSC, out-port’s BW, number of requesting flits}
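As a hedged numeric illustration of the two formulas (the numbers are hypothetical, not taken from the project's simulations): suppose a packet’s last allocated TS is 6, the shared buffer’s current TS is 2, the out-port’s FSC is 4, its BW is 3, and 5 flits are requesting. Then the next TS = max{6, 2 + 3} = 6, and min{4, 3, 5} = 3 flits are assigned TS 6; the remaining flits must be time-stamped in a later attempt.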

  50. Time Stamp Unit • Speculative approach • Avoiding speculation would require either keeping track of the SB’s exact occupancy or constantly communicating with the SB • Therefore the TS allocation is speculative and can fail • On a failure, the relevant internally maintained data is reset
