960 likes | 1.13k Views
Switch Design a unified view of micro-architecture and circuits. Giorgos Dimitrakopoulos Electrical and Computer Engineering Democritus University of Thrace (DUTH) Xanthi, Greece dimitrak@ee.duth.gr http://utopia.duth.gr/~dimitrak. System abstraction. Algorithms-Applications.
E N D
Switch Designa unified view of micro-architecture and circuits Giorgos Dimitrakopoulos Electrical and Computer Engineering Democritus University of Thrace (DUTH) Xanthi, Greece dimitrak@ee.duth.gr http://utopia.duth.gr/~dimitrak
System abstraction Algorithms-Applications • Processors for computation • Memories for storage • IO for connecting to the outside world • Network for communication and system integration Operating System Processors Memory Instruction Set Architecture Microarchitecture Network Register-Transfer Level Logic design Circuits IO Devices Switch Design - NoCs 2012
Logic, State and Memory • Datapath functions • Controlled by FSMs • Can be pipelined • Mapped on silicon chips • Gate-level netlist from a cell library • Cells built from transistors after custom layout • Memory macros store large chunks of data • Multi-ported register files for fast local storage and access of data Switch Design - NoCs 2012
On-Chip Wires • Passive devices that connect transistors • Many layers of wiring on a chip • Wire width, spacing depends on metal layer • High density local connections, Metal 1-5 • Upper metal layers > 6 are wider and used for less dense low-delay global connections Switch Design - NoCs 2012
Future of wires: 2.5D – 3D integration Evolution Switch Design - NoCs 2012
Optical wiring • Optical connections will be integrated on chip • Useful when the power of electrical connections will limit the available chip IO bandwidth • A balanced solution that involves both optical and electrical components will probably win Switch Design - NoCs 2012
Let’s send a word on a chip • Sender and receiver on the same clock domain • Clock-domain crossing just adds latency • Any relation of the sender-receiver clocks is exploited • Mesochronous interface • Tightly coupled synchronizers [AMD Zacate] Switch Design - NoCs 2012
Point-to-point links: Flow control Data • Synchronous operation • Data on every cycle • Sender can stall • Data valid signal • Receiver can stall • Stall (back-pressure) signal • Either can stall • Valid and Stall backpressure • Partially decouple Sender and Receiver by adding a buffer at the receive side S R Data S Valid R Data S R Stall Data S Valid R Stall Switch Design - NoCs 2012
Sender and Receiver decoupled by a buffer • Receiver accepts some of the sender’s traffic even if the transmitted words are not consumed • When to stop? How is buffer overflow avoided? • Let’s see first how to build a buffer • Clock-domain crossing can be tightly coupled within the buffer Switch Design - NoCs 2012
Buffer organization • A FIFO container that maintains order of arrival • 4 interfaces (full, empty, put, get) • Elastic • Cascade of depth-1 stages • Internal full/empty signals • Shift register in/Parallel out • Put: shift all entries • Get: tail pointer • Circular buffer • Memory with head / tail pointers • Wrap around array implementation • Storage can be register based Switch Design - NoCs 2012
Buffer implementation • The same basic structure evolves with extra read/write flexibility • Multiplexers and head/tail pointers handle data movement and addressing Elastic Shift In/Parallel Out Circular array Switch Design - NoCs 2012
Link-level flow control: Backpressure • Link-level flow control provides a closed feedback loop to control the flow of data from a sender to a receiver • Explicit flow control (stall-go) • Receiver notifies the sender when to stop/resume transmission • Implicit flow control (credits) • Sender knows when to stop to avoid buffer overflow • For unreliable channels we need extra mechanisms for detecting and handling transmission errors Switch Design - NoCs 2012
STALL-GO flow control • One signal STALL/GO is sent back to the receiver • STALL=0 (G0) means that the sender is allowed to send • STALL=1 (STALL) means that the sender should stop • The sender changes its behavior the moment it detects a change to the backpressure signal • Data valid (not shown) is asserted when new data are available Stall Switch Design - NoCs 2012
STALL-GO flow control: example • In-flight words will be dropped or they will replace the ones that wait to be consumed • In every case data arelost • STALL and GO should be connected with the buffer availability of the receiver’s queue • The example assumes that the receiver is stalled or released for other network reasons Stall Switch Design - NoCs 2012
Buffering requirements of STALL&GO • STALL should be asserted early enough • Not drop words in-flight • Timing of STALL assertion guarantees lossless operation • GO should be asserted late enough • Have words ready-to-consume before new words arrive • Correct timing guarantees high throughput • Minimum buffering for full throughput and lossless operation should cover both STALL&GO re-action cycles If not available the link remains idle Stall Switch Design - NoCs 2012
STALL&GO on pipelined and elastic links • Traffic is “blind” during a time interval of Round-trip Time (RTT) • the source will only learn about the effects of its transmission RTT after this transmission has started • the (corrective) effects of a contention notification will only appear at the site of contention RTT after that occurrence Switch Design - NoCs 2012
Credit-based flow control • Sender keeps track of the available buffer slots of the receiver • The number of available slots is called credits • The available credits are stored in a credit counter • If #credits > 0 sender is allowed to send a new word • Credits are decremented by 1 for each transmitted word • When one buffer slot is made free in the receive side, the sender is notified to increase the credit count • An example where credit update signal is registered first at the receive side Switch Design - NoCs 2012
Credit-based flow control: Example Available Credits Credit Update 0* means that credit counter is incremented and decremented at the same cycle (ways and stays at 0) Switch Design - NoCs 2012
Credit-based flow control: Buffers and Throughput Switch Design - NoCs 2012
Condition for 100% throughput Credit loop • The number of registers that the data and the credits pass through define the credit loop • 100% throughput is guaranteed only when the number of available buffer slots at the receive side equals the registers of the credit loop • Changing the available number of credits can reconfigure maximum throughput at runtime • Credit-based FC is lossless with any buffer size > 0. • Stall and Go FC requires at least one loop extra buffer space than credit-based FC Switch Design - NoCs 2012
Link-level flow control enhancements • Reservation based flow control • Separate control and data functions • Control links race ahead of the data to reserve resources • When data words arrive, they can proceed with little overhead • Speculative flow control • The sender can transmit cells even without sufficient credits • Speculative transmissions occur when no other words with available credits is eligible for transmission • The receiver drops an incoming cell if its buffer is full • For every dropped word a NACK is returned to the sender • Each cell remains stored at the sender until it is positively acknowledged • Each cell may be speculatively transmitted at most once • All retransmissions must be performed when credits are available • The sender consumes credit for every cell sent, i.e., for speculative as well as credited transmissions. Switch Design - NoCs 2012
Send a large message(packet) • Send long packet of 1Kbit over a 32-bit-wire channel • Serialize the message to 16 words of 32 bits • Need 16 cycles for packet transmission • Each packet is transmitted word-by-word • When the output port is free, send the next word immediately • Old fashioned Store-and-forward required the entire packet to reach each node before initiating next transmission Switch Design - NoCs 2012
Buffer allocation policies • Each transmitted word needs a free downstream buffer slot • When the output of the downstream node is blocked the buffer will hold the arriving words • How much free buffering is guaranteed before sending the first word of a packet? • Virtual Cut Through (VCT): The available buffer slots equal the words of the packet • Each blocked packet stays together and consumes the buffers of only one node • Wormhole: Just a few are enough • Packet inevitably occupies the buffers of more nodes • Nothing is lost due to flow control backpressure policy Switch Design - NoCs 2012
VCT and Wormhole in graphics Switch Design - NoCs 2012
Link sharing • The number of wires of the link does not increase • One word can be sent on each clock cycle • The channel should be shared • A multiplexer is needed at the output port of the sender • Each packet is sent un-interrupted • Wormhole, and VCT behave this way • Connection is locked for a packet until the tail of the packet passes the output port Switch Design - NoCs 2012
Who drives the select signals? • The arbiter is responsible for selecting which packet will gain access to the output channel • A word is sent if buffer slots are available downstream • It receives requests from the inputs and grants only one of them • Decisions are based on some internal priority state Switch Design - NoCs 2012
Arbitration for Wormhole and VCT • In wormhole and VCT the words of each packet are not mixed with the words of other packets • Arbitration is performed once per packet and the decision is locked at the output for all packet duration • Even if a packet is blocked downstream the connection does not change until the tail of the packet leaves the output port • Buffer utilization managed by flow control mechanism Switch Design - NoCs 2012
How can I place my buffers? Switch Design - NoCs 2012
Let’s add some complexity: Networks • A network of terminal nodes • Each node can be a source or a sink • Multiple point-to-point links connected with switches • Parallel communication between components Switch Source/Sink Terminal Node Switch Design - NoCs 2012
Complexity affects the switches • Multiple input-output permutations should be supported • Contention should be resolved and non-winning inputs should be handled • Buffered locally • Deflected to the network • Separate flow control for each link • Each packet needs to know/compute the path to its destination Switch Design - NoCs 2012
How are the terminal nodes connected to the switch? • More than one terminal nodes can connect per switch • Concentration good for bursty traffic • Local switch isolates local traffic from the main network Switch Design - NoCs 2012
Switch design: IO interface Separate flow control per link Switch Design - NoCs 2012
Switch design: One output port per-output requests Let’s reuse the circuit we already have for one output port Switch Design - NoCs 2012
Switch design: Input buffers Data from input#1 Requests for output #0 • Move buffers to the inputs Switch Design - NoCs 2012
Switch design: Complete output ports • How are the output requests computed? Switch Design - NoCs 2012
Routing computation • Routing computation generates per output requests • The header of the packet carries the requests for each intermediate node (source routing) • The requests are computed/retrieved based on the packet’s destination (distributed routing) Switch Design - NoCs 2012
Routing logic • Routing logic translates a global destination address to a local output port request • To reach node X from node Y should use output port #2 of Y • A Lookup-table is enough for holding the request vector that corresponds to each destination Switch Design - NoCs 2012
Switch building blocks Switch Design - NoCs 2012
Running example of switch operation • Switches transfer packets • Packets are broken to flits • Head flit only knows packet’s destination • The wires of each link equals the bits of each flit Switch Design - NoCs 2012
Buffer access • Buffer incoming packets per link • Read the destination of the head of each queue Switch Design - NoCs 2012
Routing Computation/Request Generation • Compute output requests and drive the output arbiters Switch Design - NoCs 2012
Arbitration-Multiplexer path setup • Arbitrate per output • The grant signals • Drive the output multiplexers • Notify the inputs about the arbitration outcome Switch Design - NoCs 2012
Switch traversal • Words H will leave the switch on the next clock edge provided they have at least one credit Switch Design - NoCs 2012
Link traversal • Words going to a non-blocked output leave the switch • The grants of a blocked output (due to flow control) are lost • An output arbiter can also stall in case of blocked output Switch Design - NoCs 2012
Head-Of-Line blocking: performance limiter • The FIFO order of the input buffers limit the throughput of the switch • The flit is blocked by the Head-of-Line that lost arbitration • A memory throughput problem Switch Design - NoCs 2012
Wormhole switch operation • The operations can fit in the same cycle or they can be pipelined • Extra registers are needed in the control path • Registers in the input/output ports already present • LT at the end involves a register write • Body/tail flits inherit the decisions taken by the head flits Switch Design - NoCs 2012
Look-ahead routing • Routing computation is based only on packet’s destination • Can be performed in switch A and used in switch B • Look-ahead routing computation (LRC) Switch Design - NoCs 2012
Look-ahead routing • The LRC is performed in parallel to SA • LRC should be completed before the ST stage in the same switch • The head flit needs the output port requests for the next switch Switch Design - NoCs 2012
Look-ahead routing details • The head flit of each packet carries the output port requests for the next switch together with the destination address Switch Design - NoCs 2012
Low-latency organizations SA ST LT • Baseline • SA precedes ST (no speculation) • SA decoupled from ST • Predict or Speculate arbiter’s decisions • When prediction is wrong replay all the tasks (same as baseline) • Do in different phases • Circuit switching • Arbitration and routing at setup phase • At transmit only ST is needed since contention is already resolved • Bypass switches • Reduce latency under certain criteria • When bypass not enabled same as baseline LRC SA LT ST LRC SA ST LT LRC Setup Setup ST LT Xmit Xmit ST LT Switch Design - NoCs 2012