Reconfigurable Computing • Chapter 1 (Cont.): Architectures • Dr. Christophe Bobda • CSCE Department • University of Arkansas
Agenda • Motivation • Coarse-Grained Reconfigurable Devices • DataFlow machines • The PACT XPP • The NEC DRP • The PicoChip • Network-Based architectures • The Quicksilver ACM • Embedded PLDs • The IPflex DAP/DNA • Tensilica reconfigurable processor • The Stretch Processor
1. Recall • Brief historical development (Estrin Fix-Plus and Rammig machine) • Programmable Logic • PALs and PLAs • CPLDs • FPGAs • Technology • Architecture by means of examples • Actel • Xilinx • Altera
1. Once again: General purpose vs Special purpose • With LUTs as function generators, FPGAs can be seen as general purpose devices • Like any general purpose device, they are flexible and "inefficient" • Flexible because any n-variable Boolean function can be implemented in an n-input LUT (see the sketch below) • Inefficient since complex functions must be implemented in many LUTs at different locations • The connections among the LUTs are made through the routing matrix, which increases the signal delays • A LUT implementation is usually slower than direct "wiring"
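To make the LUT-as-function-generator point concrete, here is a minimal C sketch (not from the slides): the LUT configuration is just the function's truth table stored as a bitmask, and evaluation is a single table lookup. The 3-input majority function used as the configuration is an arbitrary example.

```c
#include <stdint.h>
#include <stdio.h>

/* An n-input LUT modeled as a bitmask: row i of the truth table is
 * bit i of `config`. Evaluation is one lookup, which is why any
 * n-variable Boolean function fits in a single n-input LUT. */
static int lut_eval(uint32_t config, unsigned inputs)
{
    return (config >> inputs) & 1u;
}

int main(void)
{
    /* Example configuration: 3-input majority. Rows 3 (011), 5 (101),
     * 6 (110), 7 (111) output 1, so config = 0b11101000 = 0xE8. */
    uint32_t majority3 = 0xE8;
    for (unsigned in = 0; in < 8; in++)
        printf("in=%u%u%u -> %d\n",
               (in >> 2) & 1, (in >> 1) & 1, in & 1,
               lut_eval(majority3, in));
    return 0;
}
```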
1. Once again: General purpose vs Special purpose • Example: implement a function F of four inputs A, B, C, D using 2-input LUTs (a hypothetical instance is worked below) • LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB • Connections inside an LB are efficient (direct) • Connections outside LBs are slow (through the connection matrix) [Figure: four-input function F mapped onto 2-input LUTs connected through the connection matrix]
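The slide's exact function F lives in the figure, so this C sketch uses a hypothetical stand-in, F = (A AND B) OR (C AND D), to show the two-level decomposition into 2-input LUTs. The level-2 LUT receives its inputs through the connection matrix, which is where the extra routing delay comes from.

```c
#include <stdint.h>
#include <stdio.h>

/* A 2-input LUT: 4-row truth table in the low 4 bits of `config`. */
static int lut2(uint8_t config, int a, int b)
{
    return (config >> ((a << 1) | b)) & 1;
}

int main(void)
{
    /* Hypothetical target: F = (A AND B) OR (C AND D).
     * AND: only row 11 is 1 -> config 0b1000 = 0x8.
     * OR:  rows 01, 10, 11  -> config 0b1110 = 0xE. */
    const uint8_t LUT_AND = 0x8, LUT_OR = 0xE;

    for (unsigned v = 0; v < 16; v++) {
        int a = (v >> 3) & 1, b = (v >> 2) & 1;
        int c = (v >> 1) & 1, d = v & 1;
        int t0 = lut2(LUT_AND, a, b);   /* level 1, inside one LB     */
        int t1 = lut2(LUT_AND, c, d);   /* level 1, inside another LB */
        /* Level 2: t0 and t1 travel through the connection matrix,
         * adding the routing delay discussed above. */
        int f  = lut2(LUT_OR, t0, t1);
        printf("A=%d B=%d C=%d D=%d -> F=%d\n", a, b, c, d, f);
    }
    return 0;
}
```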
1. Once again: General purpose vs Special purpose • Idea: implement frequently used blocks as hard-core modules in the device [Figure: the same function F provided as a hard-core module, bypassing the connection matrix]
1. Coarse grained reconfigurable devices • Overcome the inefficiency of FPGAs by providing coarse grained functional units (adders, multipliers, integrators, etc.) that are efficiently implemented • Advantage: very efficient in terms of speed (no need for connections over a connection matrix for basic operators) • Advantage: direct wiring instead of LUT implementation • A coarse grained device is usually an array of programmable and identical processing elements (PEs) capable of executing a few operations like addition and multiplication • Depending on the manufacturer, the functional units communicate via busses or can be directly connected using programmable routing matrices
1. Coarse grained reconfigurable devices • Memory exists between and inside the PEs • Several other functional units may be present, depending on the manufacturer • A PE is usually an 8-bit, 16-bit or 32-bit tiny ALU which can be configured to execute only one operation over a given period (until the next configuration), as sketched below • Communication among the PEs can be either packet oriented (on busses) or point-to-point (using crossbar switches) • Since each vendor has its own implementation approach, the study will proceed by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, picoChip, IPflex DAP/DNA
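As a rough illustration (names and opcodes are invented, not taken from any particular vendor), a coarse-grained PE can be modeled as a small ALU whose single operation is fixed by configuration rather than fetched every cycle:

```c
#include <stdint.h>
#include <stdio.h>

/* A 16-bit PE: the configured opcode stays fixed until the next
 * reconfiguration, unlike a processor that fetches a new instruction
 * every cycle. */
typedef enum { PE_ADD, PE_SUB, PE_MUL, PE_AND } pe_op_t;

typedef struct {
    pe_op_t op;  /* set by configuration, not by a program counter */
} pe_t;

static void pe_configure(pe_t *pe, pe_op_t op) { pe->op = op; }

/* Executed every cycle on the data streaming through the PE. */
static uint16_t pe_execute(const pe_t *pe, uint16_t a, uint16_t b)
{
    switch (pe->op) {
    case PE_ADD: return a + b;
    case PE_SUB: return a - b;
    case PE_MUL: return a * b;
    case PE_AND: return a & b;
    }
    return 0;
}

int main(void)
{
    pe_t pe;
    pe_configure(&pe, PE_MUL);                      /* one reconfiguration...   */
    printf("%u\n", (unsigned)pe_execute(&pe, 6, 7)); /* ...then many executions */
    return 0;
}
```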
2.1 The PACT XPP – Overall structure • XPP (eXtreme Processing Platform) is a hierarchical structure consisting of: • An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs) • Processing Array Clusters (PACs), each a PA together with a Configuration Manager (CM) • A hierarchical configuration tree • Local CMs manage the configuration at the PA level • The local CMs access the local configuration memory, while the Supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device
2.1 The PACT XPP – Overall structure • The PAE: two types of PAEs exist • The ALU PAE • The RAM PAE • The ALU PAE: • Contains an ALU which can be configured to perform basic operations • The Back Register (BREG) provides routing channels for data and events from bottom to top • The Forward Register (FREG) provides routing channels from top to bottom
2.1 The PACT XPP – Overall structure • DataFlow Registers (DF-REGs) can be used at the object outputs for buffering data • Input registers can be preloaded with configuration data • The RAM PAE: • Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM (modeled in the sketch below) • Useful for data storage • Data is written or read after an address is applied at the RAM inputs • BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE
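A minimal model of the dual-ported RAM-PAE idea; the word width and depth are arbitrary assumptions, not PACT specifications. Data moves only once an address appears at the RAM inputs, and the two ports are independent.

```c
#include <stdint.h>

#define RAM_WORDS 256   /* depth chosen for illustration only */

typedef struct {
    uint16_t mem[RAM_WORDS];
} ram_pae_t;

/* Because the RAM is dual-ported, a read on one port and a write on
 * the other can be serviced in the same cycle; here they are simply
 * two independent functions, each driven by an address input. */
static uint16_t ram_pae_read(const ram_pae_t *r, uint8_t addr)
{
    return r->mem[addr];
}

static void ram_pae_write(ram_pae_t *r, uint8_t addr, uint16_t data)
{
    r->mem[addr] = data;
}
```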
2.1 The PACT XPP – Overall structure • Routing in the PACT XPP: • Two independent networks • One for data transmission • The other for event transmission • A configuration bus exists besides the data and event networks (very little information is available about the configuration bus) • All objects can be connected to horizontal routing channels using switch objects • Vertical routing channels are provided by the BREG and FREG • BREGs route from bottom to top • FREGs route from top to bottom [Figure: vertical and horizontal routing channels]
2.1 The PACT XPP - Interface • Interfaces are available inside the chip • Number and type of interfaces vary from device to device • On the XPP42-A1: 6 internal interfaces consisting of: • 4 identical general purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right) • One configuration manager interface (not shown on the picture) • One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary scan interface for testing purposes [Figure: interfaces]
2.1 The PACT XPP - Interface • The I/O interfaces can operate independently of each other. Two operation modes: • The RAM mode • The streaming mode • RAM mode: • Each port can access external static RAM (SRAM) • Control signals for the SRAM transactions are available • No additional logic is required
2.1 The PACT XPP - Interface • Streaming mode: • For high speed streaming of data to and from the device • Each I/O element provides two bidirectional ports for data streaming • Handshake signals are used for synchronization of data packets at the external ports
2.1 The NEC DRP – Architecture • The NEC Dynamically Reconfigurable Processor (DRP) consists of: • A set of byte oriented processing elements (PEs) • A programmable interconnection network for communication among the PEs • A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process • Memory around the device for storing configuration and computation data • Various interfaces
2.1 The NEC DRP - The Processing Element • ALU: ordinary byte arithmetic/logic operations • DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations • An instruction dictates ALU/DMU operations and inter-PE connections • Source/destination operands can come from/go to either • its own register file • or other PEs (i.e., flow through) • The instruction pointer (IP) is provided by the STC (state transition controller)
2.1 The NEC DRP - The Processing Element • The instruction pointer (IP) from the STC identifies a datapath plane • Spatial computation using a customized datapath plane • When the IP changes, the datapath plane switches instantaneously • The PE instructions collectively behave like an extreme VLIW • Sequencing through instructions => dynamic reconfiguration [Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out, with task selection by descriptor under control]
2.1 The NEC DRP – Reconfiguration Process • 1. Identify the instruction to be executed (the IP selects one of the instructions pre-loaded in each PE) • 2. Decode the instruction in the ALU plane • 3. Configure the ALU plane according to the instruction [Figure: a PE array of ALU/DMU elements, each holding several instructions, switching planes as the IP changes] A sketch of this multi-context mechanism follows below.
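The following C sketch (array size, plane count, and field names are assumptions, not NEC specifications) models the multi-context mechanism: every PE holds one pre-loaded instruction per datapath plane, so when the STC changes the IP, all PEs switch planes in a single step.

```c
#include <stdint.h>

#define NUM_PES     64   /* illustrative array size        */
#define NUM_PLANES  16   /* contexts held on-chip, assumed */

/* Each PE stores one instruction per datapath plane. */
typedef struct {
    uint8_t inst[NUM_PLANES];   /* pre-loaded instructions */
} drp_pe_t;

typedef struct {
    drp_pe_t pe[NUM_PES];
    unsigned ip;                /* broadcast by the STC's FSM */
} drp_array_t;

/* Step 1: the STC identifies the plane by setting the IP. Because the
 * instructions are already on-chip, this is the whole reconfiguration
 * cost, which is why the plane switch is effectively instantaneous. */
static void drp_set_ip(drp_array_t *a, unsigned ip)
{
    a->ip = ip % NUM_PLANES;
}

/* Steps 2-3: each PE decodes and configures itself from its own slot
 * of the currently selected plane. */
static uint8_t drp_current_inst(const drp_array_t *a, unsigned pe)
{
    return a->pe[pe].inst[a->ip];
}
```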
2.1 The picoChip - Architecture • Hundreds of array elements, each with a versatile 16-bit processor and local data memory • Heterogeneous architecture with four types of elements optimized for different tasks (DSP or wireless functions) • Interfaces for: • SRAM • Host communication • External systems • Inter-picoChip communication
2.2 The Quicksilver ACM - Architecture • Quicksilver ACM (Adaptive Computing Machine) • Fractal-like structure • Hierarchical grouping of four nodes with full communication among the nodes • 4 lower level nodes are grouped into a higher level node • The lowest level consists of 4 heterogeneous processing nodes • The connections are made through a Matrix Interconnect Network (MIN) • A system controller • Various I/Os
2.2 The Quicksilver ACM – The processing node • An ACM processing node consists of: • An algorithmic engine, which is unique to each node type and defines the operations performed by the node • The node memory, for data storage at the node level • A node wrapper, which is common to all nodes and is used to hide the complexity of the heterogeneous architecture
2.2 The Quicksilver ACM – The processing node • Four types of nodes exist: • The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general purpose registers • The Adaptive Execution Node (AXN) provides variable size MAC and ALU operations • The Domain Bit Manipulation (DBM) node provides bit manipulation and byte oriented operations • The External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces [Figure: ACM PSN node]
2.2 The Quicksilver ACM – The processing node [Figures: ACM AXN node and ACM DBM node]
2.2 The Quicksilver ACM – The node wrapper • The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features: • A MIN interface to support the communication among nodes via the MIN network • A hardware task manager for task management at the node level • A DMA engine • Dedicated I/O circuitry • Memory controllers • Data distributors and aggregators [Figure: the ACM node wrapper]
2.2 The Quicksilver ACM – The node wrapper • The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip • Hierarchically organized: the MIN at a given level connects many lower-level MINs • The MIN root is used for: • Off-chip communication • Configuration • Supporting the communication among nodes • Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc. [Figure: example of an ACM chip configuration]
2.2 The Quicksilver ACM – The System Controller • The system controller is in charge of the system management • Loads tasks into the nodes' ready-to-run queues for execution (sketched below) • Statically or dynamically sets the communication channels between the processing nodes • Carries out the reconfiguration of nodes on a clock cycle-by-clock cycle basis • The ACM chip features a set of I/O interface controllers like: • PCI • PLL • SDRAM and SRAM [Figures: the system controller and the interface controllers]
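A minimal sketch of the per-node ready-to-run queue idea; all names and the queue depth are assumptions, not Quicksilver's API. The system controller enqueues tasks, and the node's hardware task manager dequeues them for execution.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8   /* depth chosen for illustration only */

typedef struct {
    int    task_id[QUEUE_DEPTH];
    size_t head, tail, count;
} run_queue_t;

/* Called by the system controller to schedule work on a node. */
static int enqueue_task(run_queue_t *q, int task_id)
{
    if (q->count == QUEUE_DEPTH)
        return -1;                        /* node queue full */
    q->task_id[q->tail] = task_id;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return 0;
}

/* Called by the node's hardware task manager to pick the next task. */
static int dequeue_task(run_queue_t *q)
{
    if (q->count == 0)
        return -1;                        /* nothing ready to run */
    int id = q->task_id[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return id;
}
```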
2.3 The IPflex DAP/DNA - Structure • The IPflex DAP/DNA has the structure of a System on Chip (SoC) with an embedded FPGA. It features: • An integrated RISC core, which • Carries out some of the computation • Controls the reconfiguration process • A Distributed Network Architecture (DNA) matrix (a matrix of configurable operation units) • Communication over an internal bus • Different caches for data, instructions and configuration • I/O and memory interface controllers
2.3 Tensilica Xtensa • 32-bit synthesizable CPU core for System-on-Chip (SoC) designs • High-performance, low-power • Tensilica Instruction Extension ("TIE") language • Designers can modify the feature set of the processor: cache sizes, address and data bus widths, and other core parameters • Automatic generation of a complete software development tool environment for a specific processor instance • "Application-specific extensions" are added at design time
2.3 Tensilica Xtensa • Instruction Set Architecture (ISA) • Consists of: • A base set of instructions (~80 instructions, a superset of traditional RISC) • A set of configurable options • Attributes: • Enables configurability • Minimizes code size • Reduces power requirements • Maximizes performance
2.3 Tensilica Xtensa • The user gets: • The resulting customized synthesizable hardware description • Additional logic to build a complete Xtensa CPU on an FPGA • A full set of diagnostics to verify the RTL • CAD tool scripts to assist with embedding the core in your design • A customized compiler toolchain tuned to the core (a usage sketch follows below): • GNU C/C++ compiler, assembler, and profiler • GDB and the DDD debugger GUI • XMON for debugging on a live FPGA
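As a hedged illustration of what design-time instruction extension buys at the software level: the Xtensa toolchain exposes TIE instructions to C code as intrinsics, but the intrinsic below, xt_mac16, is an invented name, modeled here in plain C so the sketch stays self-contained rather than depending on any generated header.

```c
#include <stdint.h>

/* Stand-in for a TIE-generated multiply-accumulate intrinsic. On a
 * real customized core this would compile to one custom instruction;
 * the name and signature are hypothetical. */
static inline int32_t xt_mac16(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Application code using the "custom instruction": a 16-bit dot
 * product that issues one MAC per tap. */
int32_t dot16(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = xt_mac16(acc, x[i], y[i]);
    return acc;
}
```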
2.3 Stretch software configurable processor • Xtensa V RISC CPU from Tensilica • Run-time instruction set extension through embedded programmable logic • Instruction Set Extension Fabric (ISEF) • SW-controlled • Wide load/store support for any alignment • Wide register file • Up to 3 wide operands • 1 or 2 wide results
2.3 Stretch software configurable processor [Figures: the Stretch S5000 ISEF and the Stretch S5000 configurable processor]
Device size • Device size is usually measured by the number of transistors used in the device • This is not so helpful for reconfigurable devices, since the number of transistors does not reflect the number of usable resources in the chip. For example, FPGAs are among the most complex chips (more complex than Pentium processors), but their capacity is smaller than that of their ASIC counterparts • The capacity of an FPGA is usually measured in terms of the number of gate equivalents a design needs to be implemented • A gate equivalent is a unit of measure: 1 gate equivalent = one 2-input NAND gate • A one-million-gate FPGA is able to implement the equivalent of a circuit containing 1 million 2-input NAND gates (see the arithmetic sketch below)
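A quick back-of-the-envelope use of the gate-equivalent metric in C; the design size below is an arbitrary example. Since a 2-input CMOS NAND gate uses 4 transistors, transistor counts and gate equivalents convert by a factor of four.

```c
#include <stdio.h>

int main(void)
{
    const long transistors_per_ge = 4;   /* one 2-input CMOS NAND gate */
    long design_transistors = 2800000;   /* hypothetical design size   */
    long gate_equivalents = design_transistors / transistors_per_ge;

    printf("%ld transistors ~= %ld gate equivalents\n",
           design_transistors, gate_equivalents);
    /* So this hypothetical 2.8M-transistor design (700,000 gate
     * equivalents) would fit in a one-million-gate FPGA. */
    return 0;
}
```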