On-FPGA Communication Architectures

On-FPGA Communications • Must provide high-bandwidth, reliable data transfer between modules. • Can also be used as an interconnect backbone for different coarse-grain components • Provides a plug-and-play style of modularity. • Problem: • Growing number of embedded components • → Communication bandwidth: main factor in performance. • → Need scalable and high-performance architectures.

Communication Architectures Classification • On-chip communication: P2P interconnect, bus, NoC. • [Figure: classification tree; sub-categories shown: custom, uniform, homogeneous, heterogeneous, hierarchical, shared bus, split bus, custom, segmented] [Mak06]

Point-to-Point Interconnect • P2P (Direct) Architectures: • Modules communicate over dedicated physical wires configured at compile-time • Configuration of the channels remains unchanged until next full configuration. • Configuration defines: • set of physical lines, • their direction, • their bandwidth • their terminals (modules)

P2P Communication: Example • 1D Example: • Line 3 • used by C2 for I/O • fed through C1 • C1 should provide channels for the signals to cross • Line 4 • used by C1 and C2 for direct communication • ….

Point-to-Point Interconnect • Advantages: • Simple • → Widely used • Deterministic latency and performance • Reason: Channels are not shared • Disadvantages: • Puts restrictions on the design of components. • Dedicated channels must be foreseen to allow signals to cross. • Placer must deal with restrictions such as the availability of wires. • → Feasible only for offline placement (at compile time). • Not scalable: • As the number of channels grows, the number of wires required increases rapidly. • Routing becomes very difficult. • Low wire utilization for low-bandwidth channels. • High hardware overhead.
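A rough illustration of the scalability problem, as a minimal sketch. It assumes the worst case where every pair of modules gets its own dedicated channel; the function name and the 32-wire channel width are hypothetical and not taken from the slides.

```python
def p2p_wire_count(num_modules: int, wires_per_channel: int = 32) -> int:
    """Wires needed if every pair of modules has its own dedicated channel.

    Assumption: fully connected modules, one channel per pair,
    each channel wires_per_channel wide (illustrative value).
    """
    channels = num_modules * (num_modules - 1) // 2  # number of module pairs
    return channels * wires_per_channel


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, "modules ->", p2p_wire_count(n), "wires")
```

The channel count grows quadratically with the number of modules, which is why routing quickly becomes difficult.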

Bus-Based Communication • Communication between reconfigurable modules via a common bus. • Long wires are grouped to form a single communication channel which is shared among different logical channels. • Needs an arbitration mechanism to control sharing. • Advantages: • Significantly reduces total wire length. • Reduces hardware area for interfaces. • Disadvantage: • Delay by bus arbitration.

Bus-Based Communication • Xilinx: • uses CoreConnect bus architecture (from IBM) • for both hard-core and soft-core processors • Virtex-II Pro and Virtex 4.

Circuit Switching • Circuit Switching: • Dynamically establishes a connection between two PEs. • Uses a set of physical lines connected by switches. • PEs arranged in a mesh. • Switches available at column/row intersections to allow a longer connection • → Two PEs can be connected at run-time by setting the switches on the path • Once the connection is established, data can be transferred in one clock. • Example: • Connection mechanism in most FPGAs (fine grained idea). • PACT-XPP

Circuit Switching • Advantage (application): • In fine-grained image computing systems: • Dynamically changes the topology of a parallel computer to accommodate the best structure of the application. • Disadvantages: • Long delay: • When the connection must go through many processors. • (must pass through many switches). • Dynamic computation of routes: • Needs run-time routing (when placement is changed dynamically) • Very time consuming → long overall computation time. • Exclusive use of chip space (see next slide).

Circuit Switching • Exclusive use of chip space: • A hard module uses all resources in the area (including interconnects) • → Placing a module destroys the route. • Modules can be placed only in restricted areas (not used by routes)

1D Circuit Switching • Reconfigurable Multiple Bus (RMB) [Bobda05] • Communication structure: • Switches, each locally attached to a PE • Connection between switches through a bus.

1D Circuit Switching • Procedure (connection from Pk to Pt): • Pk sends a request to its own switch sk. • sk forwards the request to sk+1 • .... st • Each switch checks whether it has a channel available • If yes, the switch sets a connection and sends an ack. • from st to … sk • If not, it rejects or queues the request • When the sender receives the ack, it starts communication.
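The request/ack procedure can be sketched as follows. This is a simplified model, not the RMB implementation: it assumes k <= t, tracks only a count of free channels per switch, rejects instead of queueing, and reserves channels on the acknowledgment pass so that a rejected request leaves nothing to roll back. The name rmb_connect is hypothetical.

```python
def rmb_connect(free_channels: list, k: int, t: int) -> bool:
    """Try to set up a connection from PE k to PE t (k <= t) on a 1D chain.

    free_channels[i] is the number of free channels on switch s_i.
    On success one channel is reserved on every switch from s_k to s_t.
    On failure nothing is reserved (the request is rejected, not queued).
    """
    path = range(k, t + 1)
    # Forward pass: the request travels s_k -> s_t; each switch checks availability.
    for i in path:
        if free_channels[i] == 0:
            return False
    # Backward pass: the ack travels s_t -> s_k; each switch sets its connection.
    for i in reversed(path):
        free_channels[i] -= 1
    return True


if __name__ == "__main__":
    switches = [1, 1, 0, 1]
    print(rmb_connect(switches, 0, 1), switches)  # first request on switches 0..1
    print(rmb_connect(switches, 0, 1), switches)  # same request again
```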

RMB on chip • RMBoC implementation: • On a column-wise reconfigurable device (Virtex), the RMB provides a modular communication infrastructure. • The device is segmented in a set of horizontal slots • Each slot can accommodate a module at run-time. • For larger modules, two/more consecutive slots. • Bus macros at the slot boundaries • A hardware module which does not allow the established connection to be destroyed during the reconfiguration.

RMBoC • Crosspoints (switches) • set the connection between the segments at the run-time

RMBoC Crosspoint • Controller: • Manages the switch according to requests from left/right crosspoints and local modules: • Commands (locally processed): • REQUEST, REPLY, CANCEL, DESTROY. • Procedure: • Communication starts with a REQUEST from the sender to its local crosspoint with the destination address, …. • REPLY is sent back as an ack. • If a processor cannot establish a connection, CANCEL is sent back. • After a successful connection, at the end of communication, the sender sends DESTROY to its crosspoint, …. • Each crosspoint frees the data channel after sending DESTROY.

RMBoC Crosspoint • Data Network: • Connects data channels according to the configurations modified by the controller. • Original RMB transferred within one clock cycle → slow clock. • RMBoC uses pipelined communication (registers between slots)

RMBoC Crosspoint • FIFOs: • provide buffer for commands coming from different sides • Round-Robin order: left, right, local.

Network on Chip

NoC • NoC: • Consists of a set of network clients (DSP, memory, peripheral controller, custom logic) that communicate on a packet basis (instead of using direct connections).

NoC • Modules (network clients) placed at fixed locations on the chip can exchange packets over the common network. • Advantage: • Very high flexibility • because no route has to be computed before components can start communicating. • Components just send packets and do not care how the packets are routed in the network. • Example: • QuickSilver (FPL 2004)

NoC Characteristics • An NoC architecture is characterized by: • number of routers, • each attached to PE in the array, • bandwidth of the communication channels between the routers, • topology of the network • the mechanism used for packet forwarding. • Major components: • Router • PE

NoC vs. Macro Network • NoC must have little area overhead, • especially for fine-grain architectures (e.g. FPGA). • Only a few registers are used as buffers in on-chip routers.

Network Topologies • 2-D Mesh • Torus

Router • Buffers • Controller • Arbiter

Router Components • Buffers: • Usually implemented as FIFOs. • Temporarily store messages coming from five directions. • Each router (willing to send a message in a given direction) copies it into the FIFO of the neighbor router in that direction. • The data are then placed on the data lines, and the control signals are used to handshake between neighbor routers.

Router Components • Controller: • determines how to forward the packet, • usually according to the destination address. • Output arbiters: • For four directions and PE. • manage the assignment of the message to output channels.

FIFO • Characterized by: • Data width: number of bits in a register. • FIFO depth: number of registers in a FIFO. • Types: • Synchronous: • a common clock is used for reading and writing. • Asynchronous: • Two different clocks for reading and writing.
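A minimal behavioral sketch of a synchronous FIFO with the two parameters named above. The class name SyncFifo and its methods are hypothetical; it models only the buffering behavior, with each call standing for one cycle of the common clock, and does not model an asynchronous (two-clock) FIFO.

```python
from collections import deque


class SyncFifo:
    """Behavioral model of a synchronous FIFO (single clock domain)."""

    def __init__(self, data_width: int, depth: int):
        self.mask = (1 << data_width) - 1  # data width: bits per register
        self.depth = depth                 # depth: number of registers
        self.regs = deque()

    def full(self) -> bool:
        return len(self.regs) == self.depth

    def empty(self) -> bool:
        return len(self.regs) == 0

    def write(self, word: int) -> bool:
        """Store one word; returns False (word not accepted) when full."""
        if self.full():
            return False
        self.regs.append(word & self.mask)  # truncate to the data width
        return True

    def read(self):
        """Return the oldest word, or None when the FIFO is empty."""
        return None if self.empty() else self.regs.popleft()
```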

Controller • Each router is identified through its position in the network. • The (x,y)-coordinate of its PE. • Messages are sent in packets: Payload (Data) Destination Address Control Bits • Determines the direction to send the packet. • An address decoder that decodes the address into (x,y) coordinate of destination router or PE.

Controller • Packet format: Payload (Data) | Destination Address | Control Bits • E.g. XY routing: • A comparator compares the (x,y) of the destination PE to that of the router to compute the direction (LOCAL, EAST, WEST, SOUTH, or NORTH). • The packet is written into the input FIFO of the corresponding neighbor router (if not full). • If full, the controller decides to: • block all incoming packets or • send the packet in another direction to decongest a given data line.

Output Arbiter • For high performance, FIFOs must be read concurrently. • Controller decides the direction to send the packets. • Contention occurs if it decides to forward many packets in the same direction, • because there is only one output data line. • → Arbiter at each output port • Simple arbiter: • A MUX + an FSM

Output Arbiter • A simple arbiter: • Round-Robin fashion. • The incoming packets from the EAST will be written before the one coming from the WEST, …. • LOCAL not considered because it does not send back in the same direction as received.
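A round-robin output arbiter can be sketched as below. This is an illustrative model with a hypothetical class name and interface; the input order is an assumption, and the rotating pointer plays the role of the FSM while the returned name stands for the MUX select.

```python
class RoundRobinArbiter:
    """Grants one requesting input per call, rotating the priority."""

    def __init__(self, inputs):
        self.inputs = list(inputs)  # e.g. the input FIFOs competing for one output
        self.next_idx = 0           # input with the highest priority right now

    def grant(self, requests):
        """requests: set of input names holding a packet for this output.

        Returns the granted input name, or None if nobody is requesting.
        """
        n = len(self.inputs)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            name = self.inputs[idx]
            if name in requests:
                self.next_idx = (idx + 1) % n  # granted input gets lowest priority next
                return name
        return None


if __name__ == "__main__":
    arb = RoundRobinArbiter(["EAST", "WEST", "NORTH", "SOUTH"])
    for _ in range(3):
        print(arb.grant({"EAST", "WEST"}))
```

With EAST and WEST both requesting, the grants alternate, so neither input can starve the other.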

Processing Element • PE can be: • processor core, • memory block, • embedded programmable logic, • custom hardware block, • …. • PE is connected to network through wrapper. • Wrapper: • controls all the transactions on the network and • provides a simple interface for PE to access the network.

Wrapper • Function: • Decoding the received packets • removes the address before passing the data to PE • Encoding sent packets • adds the address of the destination PE to the payload and formats the packet before giving it to the connected router. • Implementation: • PE is instantiated as functional block within the wrapper.
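The wrapper's encode and decode steps might look like the sketch below. The field widths (4-bit x, 4-bit y, 2 control bits) and the function names are assumptions chosen for illustration; only the field order (payload, destination address, control bits) follows the packet format on the slides.

```python
X_BITS = 4      # assumed width of the x coordinate
Y_BITS = 4      # assumed width of the y coordinate
CTRL_BITS = 2   # assumed number of control bits
ADDR_BITS = X_BITS + Y_BITS


def encode_packet(payload: int, dest_x: int, dest_y: int, ctrl: int = 0) -> int:
    """Add the destination address and control bits to the payload."""
    assert 0 <= dest_x < (1 << X_BITS) and 0 <= dest_y < (1 << Y_BITS)
    assert 0 <= ctrl < (1 << CTRL_BITS)
    addr = (dest_x << Y_BITS) | dest_y
    return (payload << (ADDR_BITS + CTRL_BITS)) | (addr << CTRL_BITS) | ctrl


def decode_packet(packet: int):
    """Split a packet into (payload, (x, y), ctrl).

    A wrapper would pass only the payload on to the PE.
    """
    ctrl = packet & ((1 << CTRL_BITS) - 1)
    addr = (packet >> CTRL_BITS) & ((1 << ADDR_BITS) - 1)
    payload = packet >> (ADDR_BITS + CTRL_BITS)
    dest = (addr >> Y_BITS, addr & ((1 << Y_BITS) - 1))
    return payload, dest, ctrl
```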

NoC Design Constraints • Design constraints to be considered in NoC design: • Area overhead: • depends on the bandwidth requirements: • Packet size, • Determines the width of connection between routers. • Proportional to the amount of internal wire required. • Buffer size, • Determines the amount of memory used for storing the packets within the router before forwarding. • Complexity of the control algorithm. • Determines how much additional resources the router consumes.

NoC Design Constraints • Latency: • the time a message needs from its source to its destination. • Components: • the time needed to setup a route • In circuit switching: request and acknowledgment latency, • in packet routing: no such set up time. • + the time needed to transfer the payload to destination.

Latency • Latency: • Only the address flit takes initial setup time to reach the destination (based on the routing algorithm). • Thereafter, one data flit is delivered to the destination every cycle (in a deadlock-free network). • Latency for diagonal nodes: • 16 cycles

Performance Metrics • Latency: • The time a message needs from its source to its destination: tlast - tfirst • tlast:the time when the last packet of the message arrives at destination • tfirst:the time when the first packet of the message is output from the source. • Throughput: • maximum traffic a network can accept per unit of time, • typically measured as bytes or packets per node per cycle.
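The two metrics can be computed from a simulation trace roughly as follows. This is a sketch with hypothetical function names and trace format; the throughput function reports the traffic accepted in one measured run, so the maximum throughput would be the largest such value observed before the network saturates.

```python
def message_latency(packets) -> int:
    """packets: list of (t_out, t_arrive) pairs, one per packet of a message.

    t_first is when the first packet leaves the source,
    t_last is when the last packet arrives at the destination.
    """
    t_first = min(t_out for t_out, _ in packets)
    t_last = max(t_arrive for _, t_arrive in packets)
    return t_last - t_first


def accepted_throughput(packets_delivered: int, num_nodes: int, cycles: int) -> float:
    """Accepted traffic in packets per node per cycle for one run."""
    return packets_delivered / (num_nodes * cycles)
```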

Routing Techniques

Routing Techniques • Routing Algorithms: • Circuit Switching • Store-and-Forward • Virtual Cut-Through • Wormhole Routing • ….

Circuit Switching • A communication path is created from the source to the destination before any data are transmitted. • A routing probe traverses the network and reserves the links needed to transmit the data. • The probe contains the source and destination addresses. • Once the routing probe reaches the destination address, an acknowledgment is sent back to the source address. • The data are transferred at the full bandwidth of the hardware. • The circuit remains operational until all data have been transmitted. • The lock on the links may be released once all the data have reached the destination, by sending another acknowledgment back to the source along the same route.

Circuit Switching • Disadvantage: • long time to establish a dedicated link • Useful when tsu << tmsg • i.e. when long messages are present.
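The condition tsu << tmsg can be made concrete with a small calculation. This is a sketch: the function name is hypothetical, and the "efficiency" measure (fraction of total time spent transferring data) is an illustrative choice rather than a metric defined on the slides.

```python
def circuit_switch_efficiency(t_su: float, t_msg: float) -> float:
    """Fraction of the total time spent on the message itself.

    t_su:  time to set up the circuit (request plus acknowledgment)
    t_msg: time to transfer the message once the circuit exists
    """
    return t_msg / (t_su + t_msg)


if __name__ == "__main__":
    for t_msg in (10, 100, 1000):
        print(t_msg, round(circuit_switch_efficiency(10, t_msg), 3))
```

For a fixed setup time, the efficiency approaches 1 as messages get longer, which is why circuit switching suits long messages.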

Store-and-Forward (SAF) • At each node: • the packets are stored in memory. • the routing information is examined to determine which output channel to direct the packet. • the packet is sent to the neighbor. • Latency: Nr * tr • Nr: number of routers through which the packet must travel • tr: time to transfer the packet between the routers

Virtual Cut-Through (VCT) • As the routing information is carried in the header, the packet need not be stored in the current node’s memory if an output buffer is available. • The packet simply cuts through the router of the node to an available output channel. • Advantage: • Less memory is used along the path. • But enough memory has to be allocated in case an output channel is not available. • At high volumes of messages on the network: VCT ≈ SAF

Wormhole Routing • Addresses the deficiency in VCT: • If an output channel is not available, the packet must be stored in the current node’s memory. • Divides a message into flits: • flow-control digits, which are smaller than packets. • Each message contains one header flit and many data flits. • Header: carries the routing and control information • Procedure: • If an output channel is available, the header flit is routed • The remaining data flits follow in a pipelined fashion.

Wormhole Routing • Advantages: • Smaller memory requirement per node: • Buffers hold flits rather than whole packets. • Very low latency. • Disadvantage: • Blocking and deadlock • Needs the virtual channel technique: • Several virtual channels share a single physical channel.
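To compare the switching techniques, here is a first-order, contention-free latency sketch. The store-and-forward formula is the Nr * tr expression from the SAF slide, with tr taken as the time to move all flits of a packet across one link. The wormhole formula is a common textbook approximation and not taken from the slides: the header needs one flit time per router, and the remaining flits follow in a pipeline. Function and parameter names are hypothetical.

```python
def saf_latency(num_routers: int, flits_per_packet: int, t_flit: int = 1) -> int:
    """Store-and-forward: the whole packet is transferred at every router."""
    t_r = flits_per_packet * t_flit  # time to move one packet between routers
    return num_routers * t_r


def wormhole_latency(num_routers: int, flits_per_packet: int, t_flit: int = 1) -> int:
    """Wormhole (no contention): header latency plus pipelined body flits."""
    header = num_routers * t_flit
    body = (flits_per_packet - 1) * t_flit
    return header + body


if __name__ == "__main__":
    for hops in (2, 4, 8):
        print(hops, saf_latency(hops, 8), wormhole_latency(hops, 8))
```

Under these assumptions, SAF latency scales with the product of distance and packet length, while wormhole latency scales with their sum.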

Deadlock and Livelock • Deadlock: • A packet is waiting for an event that can never happen because of a circular dependence on resources. • Livelock: • Packets continue to move, but never reach their destination.
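The circular dependence behind a deadlock can be detected as a cycle in a wait-for graph. The sketch below assumes a hypothetical dictionary representation in which each packet (or channel) maps to the ones it is waiting on; it uses a standard depth-first search and is not tied to any particular NoC.

```python
def has_circular_wait(waits_for: dict) -> bool:
    """Return True if the wait-for graph contains a cycle (potential deadlock).

    waits_for maps each node to an iterable of nodes it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node) -> bool:
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: we returned to the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))
```

For example, three packets that each hold a channel the next one needs form a cycle, and none of them can ever advance.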

Routing Algorithms • Optimality: • Algorithm should determine the optimal routing path • Metrics: • high performance, • low overhead, • deadlock and livelock free, • fault-tolerance, • flexibility. • Classification: • Deterministic routing • Provides a unique path from a source to destination. • Adaptive routing • The direction where to send an incoming packet is not fixed a priori.

Deterministic Routing: XY Routing • XY Routing (dimension-order routing): • Routes packets along the X-axis first. • Once the packet reaches the destination’s column, routes it along the Y-axis (until the destination’s row). • No packet moving in the Y-direction returns to the X-direction. • Disadvantage: • Routes the packets based on the destination address only, irrespective of the traffic pattern on the link and the link delay.

Deterministic Routing: XY Routing • Router action: • Compares its own address to the destination address of a packet. • If Xrouter < Xdest, • packet is sent to east • If Xrouter > Xdest, • packet is sent to west • If Xrouter = Xdest and Yrouter > Ydest, • packet is sent to south • If Xrouter = Xdest and Yrouter < Ydest, • packet is sent to north • If Xrouter = Xdest and Yrouter = Ydest, • packet is sent to the local PE
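The router action above translates directly into a comparison chain. In this sketch the function names are hypothetical, and the coordinate convention (EAST increases x, NORTH increases y) is inferred from the rules on the slide.

```python
def xy_route(router, dest) -> str:
    """Output direction chosen by the router at `router` for a packet to `dest`."""
    rx, ry = router
    dx, dy = dest
    if rx < dx:
        return "EAST"
    if rx > dx:
        return "WEST"
    # Same column from here on: only Y moves remain.
    if ry > dy:
        return "SOUTH"
    if ry < dy:
        return "NORTH"
    return "LOCAL"


MOVES = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def xy_path(src, dest) -> list:
    """Sequence of directions taken from src until delivery at dest."""
    pos, path = src, []
    while True:
        direction = xy_route(pos, dest)
        path.append(direction)
        if direction == "LOCAL":
            return path
        step = MOVES[direction]
        pos = (pos[0] + step[0], pos[1] + step[1])


if __name__ == "__main__":
    print(xy_path((0, 0), (2, 1)))
```

Because all X moves are completed before any Y move, a packet never turns from the Y-direction back into the X-direction.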
