Router Architecture Z. Lu / A. Jantsch / I. Sander
[Figure: a grid of switches (S) connecting terminal nodes (T)] Network-on-Chip • Information in the form of packets is routed via channels and switches from one terminal node to another • The interface between the interconnection network and the terminals (clients) is called the network interface SoC Architecture
Router Architecture • The discussion concentrates on a typical virtual-channel router • Modern routers are pipelined and work at the flit level • Head flits proceed through pipeline stages that perform routing and virtual channel allocation • All flits pass through the switch allocation and switch traversal stages • Most routers use credits to allocate buffer space
A typical virtual channel router • A router’s functional blocks can be divided into • Datapath: handles storage and movement of a packet’s payload • Input buffers • Switch • Output buffers • Control plane: coordinates the movement of packets through the resources of the datapath • Route Computation • VC Allocator • Switch Allocator
A typical virtual channel router • The input unit • contains a set of flit buffers • maintains the state for each virtual channel • G = Global State • R = Route • O = Output VC • P = Pointers • C = Credits
Virtual channel state fields (Input)
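The per-virtual-channel input state can be sketched as a small data structure. This is a hypothetical illustration: the field names G, R, O, P, C and the global states follow the slides, while the class and everything else are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from collections import deque

class GlobalState(Enum):
    IDLE = "I"        # waiting for a head flit
    ROUTING = "R"     # route computation in progress
    WAITING_VC = "V"  # waiting for an output virtual channel
    ACTIVE = "A"      # forwarding flits

@dataclass
class InputVCState:
    G: GlobalState = GlobalState.IDLE        # global state
    R: int = -1                              # route: selected output port
    O: int = -1                              # output VC granted by the VC allocator
    P: deque = field(default_factory=deque)  # pointers: buffered flits
    C: int = 0                               # credits available downstream

# A head flit arrives and the VC starts route computation (G = I -> R).
vc = InputVCState()
vc.P.append("head")
vc.G = GlobalState.ROUTING
```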
A typical virtual channel router • During route computation the output port for the packet is determined • Then the packet requests an output virtual channel from the virtual-channel allocator
A typical virtual channel router • Flits are forwarded via the virtual channel by allocating a time slot on the switch and output channel using the switch allocator • Flits are forwarded to the appropriate output during this time slot • The output unit forwards the flits to the next router in the packet’s path
Virtual channel state fields (Output)
Packet Rate and Flit Rate • The control of the router operates at two distinct frequencies • Packet Rate (performed once per packet) • Route computation • Virtual-channel allocation • Flit Rate (performed once per flit) • Switch allocation • Pointer and credit count update
The Router Pipeline • A typical router pipeline includes the following stages • RC (Routing Computation) • VA (Virtual Channel Allocation) • SA (Switch Allocation) • ST (Switch Traversal) • The cycle-by-cycle walkthrough on the following slides assumes no pipeline stalls
The Router Pipeline • Cycle 0 • Head flit arrives and the packet is directed to a virtual channel of the input port (G = I)
The Router Pipeline • Cycle 1 • Routing computation • Virtual channel state changes to routing (G = R) • Head flit enters the RC stage • First body flit arrives at the router
The Router Pipeline • Cycle 2: Virtual Channel Allocation • Route field (R) of the virtual channel is updated • Head flit enters the VA stage • First body flit enters the RC stage • Second body flit arrives at the router
The Router Pipeline • Cycle 2: Virtual Channel Allocation • The result of the routing computation is input to the virtual channel allocator • If successful, the allocator assigns a single output virtual channel • The state of the virtual channel is set to active (G = A)
The Router Pipeline • Cycle 3: Switch Allocation • All further processing is done on a flit basis • Head flit enters the SA stage • Any active VC (G = A) that contains buffered flits (indicated by P) and has downstream buffers available (C &gt; 0) bids for a single-flit time slot through the switch from its input VC to the output VC
The Router Pipeline • Cycle 3: Switch Allocation • If successful, the pointer field is updated • The credit field is decremented
The Router Pipeline • Cycle 4: Switch Traversal • Head flit traverses the switch • Cycle 5: • Head flit starts traversing the channel to the next router
The Router Pipeline • Cycle 7: • Tail flit traverses the switch • Output VC is set to idle • Input VC is set to idle (G = I) if the buffer is empty • Input VC is set to routing (G = R) if another head flit is in the buffer
The Router Pipeline • Only the head flits enter the RC and VA stages • The body and tail flits are stored in the flit buffers until they can enter the SA stage
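The no-stall walkthrough above can be reproduced as a toy timing trace. This is an illustrative sketch, assuming a 4-flit packet (head, two body flits, tail): the head occupies RC, VA, SA, ST in consecutive cycles, while body and tail flits wait in the buffer and use only SA and ST.

```python
def pipeline_trace(num_flits=4):
    """Cycle in which each flit of one packet occupies each stage,
    assuming no pipeline stalls (matches the slide walkthrough)."""
    trace = {}
    # Head flit: arrives in cycle 0, then RC, VA, SA, ST back to back.
    trace["head"] = {"arrive": 0, "RC": 1, "VA": 2, "SA": 3, "ST": 4}
    for i in range(1, num_flits):
        name = "tail" if i == num_flits - 1 else f"body{i}"
        # Each following flit arrives one cycle later and, with no stalls,
        # performs SA and ST one cycle after its predecessor.
        trace[name] = {"arrive": i, "SA": 3 + i, "ST": 4 + i}
    return trace

t = pipeline_trace()
print(t["head"])   # head traverses the switch in cycle 4
print(t["tail"])   # tail traverses the switch in cycle 7
```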
Pipeline Stalls • Pipeline stalls can be divided into • Packet stalls • can occur if the virtual channel cannot advance to its R, V, or A state • Flit stalls • occur if a virtual channel is in the active state and a flit cannot successfully complete switch allocation due to • lack of a flit • lack of a credit • losing arbitration for the switch time slot
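The switch-allocation bid condition behind the flit-stall cases above can be written as a one-line predicate. A minimal sketch, assuming the state encoding from the earlier slides (G = "A", buffered flits P, credits C); the function name is illustrative.

```python
def can_bid(G, flits_buffered, credits):
    """An input VC may bid for a switch time slot only if it is active,
    holds at least one buffered flit, and has a downstream credit."""
    return G == "A" and flits_buffered > 0 and credits > 0

# The first two flit-stall causes map onto the failing conditions:
assert can_bid("A", flits_buffered=1, credits=1) is True
assert can_bid("A", flits_buffered=0, credits=1) is False  # lack of a flit
assert can_bid("A", flits_buffered=1, credits=0) is False  # lack of a credit
# Losing switch arbitration is decided later, among all VCs whose bid holds.
```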
Example for Packet Stall • Virtual-channel allocation stall • The head flit of packet A cannot enter the VA stage until the tail flit of packet B completes switch allocation and releases the virtual channel
Example for Flit Stalls • Switch allocation stall • The second body flit fails to allocate the requested connection in cycle 5
Example for Flit Stalls • Buffer empty stall • Body flit 2 is delayed three cycles. However, since it does not have to enter the RC and VA stages, the output is only delayed one cycle!
Credits • A buffer is allocated in the SA stage of the upstream (transmitting) node • To reuse the buffer, a credit is returned over a reverse channel after the same flit departs the SA stage of the downstream (receiving) node • When the credit reaches the input unit of the upstream node, the buffer can be reused
Credits • The credit loop can be viewed as a token that • starts at the SA stage of the upstream node • travels downstream with the flit • reaches the SA stage of the downstream node • returns upstream as a credit
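The credit loop above reduces to a single counter per virtual channel: decrement when the upstream SA stage allocates a downstream buffer, increment when the credit comes back. A minimal sketch; the class and method names are illustrative, not from the slides.

```python
class CreditChannel:
    def __init__(self, num_buffers):
        self.credits = num_buffers  # one credit per downstream flit buffer

    def send_flit(self):
        # Upstream SA stage: a downstream buffer is allocated per flit sent.
        if self.credits == 0:
            return False            # credit stall: no downstream buffer free
        self.credits -= 1
        return True

    def receive_credit(self):
        # Credit returned after the flit departs the downstream SA stage.
        self.credits += 1

ch = CreditChannel(num_buffers=2)
assert ch.send_flit() and ch.send_flit()
assert not ch.send_flit()   # both buffers in flight: credit stall
ch.receive_credit()         # downstream buffer freed, credit comes back
assert ch.send_flit()       # sending resumes
```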
Credit Loop Latency • The credit loop latency tcrt, expressed in flit times, gives a lower bound on the number of flit buffers needed on the upstream side for the channel to operate at full bandwidth • tcrt in flit times is given by tcrt = tf + tc + 2Tw + 1, where tf is the flit pipeline delay, tc the credit pipeline delay, and Tw the one-way wire delay
Credit Round-trip Time and Credit Stall • [Timing diagram: virtual-channel router with 4 flit buffers; tf = 4, tc = 2, Tw = 2 ⇒ tcrt = 11; credit transmit and credit update points marked; white: upstream pipeline stages, grey: downstream pipeline stages]
Credit Loop Latency • If the number of buffers available per virtual channel is F, the duty factor of the channel will be d = min(1, F / tcrt) • The duty factor will be 100% as long as there are sufficient flit buffers to cover the round-trip latency
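Plugging in the numbers from the timing-diagram slide (tf = 4, tc = 2, Tw = 2, F = 4 flit buffers) shows how short a 4-flit buffer falls of full bandwidth. A small worked example; the function names are illustrative.

```python
def credit_round_trip(tf, tc, Tw):
    """tcrt = tf + tc + 2*Tw + 1, in flit times (formula from the slides)."""
    return tf + tc + 2 * Tw + 1

def duty_factor(F, tcrt):
    """d = min(1, F / tcrt): fraction of time the channel can carry flits."""
    return min(1.0, F / tcrt)

tcrt = credit_round_trip(tf=4, tc=2, Tw=2)
print(tcrt)                            # 11 flit times
print(round(duty_factor(4, tcrt), 2))  # 0.36: only ~36% channel utilization
print(duty_factor(11, tcrt))           # 1.0: F >= tcrt gives full bandwidth
```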
Flit and Credit Encoding • Flits and credits can be sent over separate lines with separate widths • Alternatively, flits and credits can be transported over the same lines. This can be done by • including credits in flits • multiplexing flits and credits at the phit level
Network Interface Z. Lu / A. Jantsch / I. Sander
Network Interface • Different terminals with different interfaces shall be connected to the network • The network uses a specific protocol and all traffic on the network has to comply with the format of this protocol • [Figure: a terminal node (resource) attached through a network interface to a switch of the network]
Network Interface • The network interface plays an important role in a network-on-chip • it shall translate between the terminal protocol and the protocol of the network • it shall enable the client to communicate at the speed of the network • it shall not further reduce the available bandwidth of the network • it shall not increase the latency imposed by the network • A poorly designed network interface is a bottleneck and can increase the latency considerably
Network Interfaces • For message passing: symmetric • Processor-Network Interface • For shared memory: asymmetric, load &amp; store • Processor-Network Interface • Memory-Network Interface • Line-card interface connecting an external network channel with an interconnection network used as a switching fabric
Network Interfaces for message passing • Two-register interface • Descriptor-based interface • Message reception
Two-Register Interface • For sending, the processor writes to a specific Net-out register • For receiving, the processor reads a specific Net-in register • Pro: • Efficient for short messages • Cons: • Inefficient for long messages • Processor acts as DMA controller • Not safe, because the network is not protected from SW running on the processor • A misbehaving processor can send the first part of a message and then delay indefinitely sending the end of the message • A processor can tie up the network by failing to read a message from the input register • [Figure: register file R0–R31 with dedicated Net-out and Net-in registers connected to the network]
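The two-register interface above can be modeled in a few lines, which also makes the cost visible: the processor itself moves every word. An illustrative sketch; the class, method names, and the list standing in for the network channel are assumptions.

```python
class TwoRegisterNI:
    def __init__(self):
        self.channel = []  # stands in for the channel into the network

    def write_net_out(self, word):
        # Sending: the processor writes each word of the message itself,
        # effectively acting as its own DMA controller.
        self.channel.append(word)

    def read_net_in(self):
        # Receiving: the processor must keep reading, or the network
        # backs up behind the unread message.
        return self.channel.pop(0) if self.channel else None

ni = TwoRegisterNI()
for word in ("hdr", "payload0", "payload1"):
    ni.write_net_out(word)          # fine for short messages,
                                    # one instruction per word for long ones
assert ni.read_net_in() == "hdr"
```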
Descriptor-Based Interface • The processor composes a message in a set of dedicated message descriptor registers • Each descriptor contains • an immediate value, or • a reference to a processor register, or • a reference to a block of memory • A co-processor steps through the descriptors and composes the message • Safe, because the network is protected from the processor’s SW • [Figure: descriptor list (Start, Immediate, RN, Addr/Length, End) drawing from the register file and memory]
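The descriptor walk can be sketched as follows. The three descriptor kinds (immediate, register reference, memory block) come from the slide; the tuple encoding, the sample register/memory contents, and the function name are illustrative assumptions.

```python
regs = {"R5": 42}            # sample register-file contents
memory = {0x100: [7, 8, 9]}  # sample memory block at address 0x100

descriptors = [
    ("IMMEDIATE", 1),        # an immediate value
    ("REGISTER", "R5"),      # a reference to a processor register
    ("MEMORY", 0x100, 3),    # a reference to a block of memory: addr, length
    ("END",),
]

def compose_message(descriptors):
    # The co-processor, not the CPU, steps through the descriptors,
    # so the network is protected from misbehaving processor software.
    message = []
    for d in descriptors:
        if d[0] == "END":
            break
        if d[0] == "IMMEDIATE":
            message.append(d[1])
        elif d[0] == "REGISTER":
            message.append(regs[d[1]])
        elif d[0] == "MEMORY":
            addr, length = d[1], d[2]
            message.extend(memory[addr][:length])
    return message

print(compose_message(descriptors))   # [1, 42, 7, 8, 9]
```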
Receiving Messages • A co-processor or a dedicated thread is triggered upon reception of an incoming message • It unpacks the message and stores it in local memory • It informs the receiving task via an interrupt or a status register update
Shared Memory Interfaces • The interconnection network is used to transmit memory read/write transactions between processors and memories • We will further discuss • Processor-Network Interface • Memory-Network Interface
Processor-Network Interface • Load/store requests are stored in the request register • Type: read/write, cacheable or uncacheable, etc. • Requests are tagged, usually encoding how the reply is to be handled, e.g., store in register R10 • In case of a cache miss, requests are stored in an MSHR (miss status holding register)
Processor-Network Interface • Consider a read operation: • An uncacheable read request results in a pending read • After the message is formed and transmitted, the status changes to read requested • When the network returns the reply, the status changes to read complete • Completed MSHRs are forwarded to the reply register and their status changes to idle
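The status sequence above is a small state machine per MSHR entry. A minimal sketch: the state names follow the slide, while the class and method names are illustrative.

```python
class MSHR:
    def __init__(self, tag):
        self.tag = tag            # e.g. "store the reply in R10"
        self.status = "idle"

    def accept_request(self):     # uncacheable read arrives
        self.status = "read pending"

    def message_sent(self):       # request message formed and transmitted
        self.status = "read requested"

    def reply_received(self):     # network returns the reply
        self.status = "read complete"

    def forward_to_reply_register(self):
        self.status = "idle"      # entry is free for the next miss

m = MSHR(tag="R10")
m.accept_request()
m.message_sent()
m.reply_received()
assert m.status == "read complete"
m.forward_to_reply_register()     # reply handed to the reply register
assert m.status == "idle"
```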
Processor-Network Interface • Cache coherence protocols change the operation of the processor-network interface • Complete cache lines are loaded into the cache • Protocol requires a larger vocabulary of messages • Exclusive read request • Invalidation and updating of cache lines • Cache coherence protocol requires the interface to send messages and update state in response to received messages
Memory-Network Interface • The interface receives memory request messages and sends replies • Messages received from the network are stored in the TSHR (transaction status holding register)
Memory-Network Interface • The request queue is used to hold request messages when all TSHRs are busy • The TSHR tracks messages in the same way as the MSHR • The Bank Control and Message Transmit units monitor changes in the TSHR
Memory-Network Interface • Consider a read operation: • A read request initializes a TSHR entry with status read pending • The subsequent memory access changes the status to bank activated • Right before the first word is returned from the memory bank, the status is changed to read complete • The message transmit unit formats the reply message and injects it into the network, and the TSHR entry is marked idle
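The memory-side read flow mirrors the processor-side MSHR as another short status sequence. A minimal illustrative sketch; the state names follow the slide, the function is an assumption.

```python
# TSHR status sequence for a read, in order (ends back at idle).
TSHR_READ_FLOW = ["read pending", "bank activated", "read complete", "idle"]

def next_status(status):
    """Advance one step through the TSHR read flow."""
    i = TSHR_READ_FLOW.index(status)
    return TSHR_READ_FLOW[(i + 1) % len(TSHR_READ_FLOW)]

assert next_status("read pending") == "bank activated"   # memory access starts
assert next_status("bank activated") == "read complete"  # first word returning
assert next_status("read complete") == "idle"            # reply injected
```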
Memory-Network Interface • Cache coherence protocols can be implemented with this structure; however, the TSHR must be extended, e.g., with directory state
Summary • Network interfaces bridge the processor with the network, and the memory with the network • Message passing interfaces • Two-register interface • Descriptor-based interface • Shared memory interfaces, complicated by cache coherence • Processor-Network Interface • Memory-Network Interface