Reconfigurable Computing • Chapter 1 (Cont.): Architectures • Dr. Christophe Bobda • CSCE Department • University of Arkansas
Agenda • Motivation • Coarse-Grained Reconfigurable Devices • DataFlow machines • The PACT XPP • The NEC DRP • The PicoChip • Network-Based architectures • The Quicksilver ACM • Embedded PLDs • The IPflex DAP/DNA • Tensilica reconfigurable processor • The Stretch Processor
1. Recall • Brief historical development (Estrin Fix-Plus and Rammig machine) • Programmable Logic • PALs and PLAs • CPLDs • FPGAs • Technology • Architecture by means of examples • Actel • Xilinx • Altera
1. Once again: General purpose vs Special purpose • With LUTs as function generators, FPGAs can be seen as general purpose devices • Like any general purpose device, they are flexible and "inefficient" • Flexible because any n-variable Boolean function can be implemented in an n-input LUT (see the sketch below) • Inefficient since complex functions must be implemented in many LUTs at different locations • The connections among the LUTs are made through the routing matrix, which increases the signal delays • A LUT implementation is usually slower than direct "wiring"
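To make the LUT-as-function-generator point concrete, here is a minimal C sketch (not from the slides): the LUT configuration is just the function's truth table stored as a bitmask, and evaluation is a single table lookup. The 3-input majority function used as the configuration is an arbitrary example.

```c
#include <stdint.h>
#include <stdio.h>

/* An n-input LUT modeled as a bitmask: row i of the truth table is
 * bit i of `config`. Evaluation is one lookup, which is why any
 * n-variable Boolean function fits in a single n-input LUT. */
static int lut_eval(uint32_t config, unsigned inputs)
{
    return (config >> inputs) & 1u;
}

int main(void)
{
    /* Example configuration: 3-input majority. Rows 3 (011), 5 (101),
     * 6 (110), 7 (111) output 1, so config = 0b11101000 = 0xE8. */
    uint32_t majority3 = 0xE8;
    for (unsigned in = 0; in < 8; in++)
        printf("in=%u%u%u -> %d\n",
               (in >> 2) & 1, (in >> 1) & 1, in & 1,
               lut_eval(majority3, in));
    return 0;
}
```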
1. Once again: General purpose vs Special purpose • Example: implement a function F of four inputs A, B, C, D using 2-input LUTs (a hypothetical instance is worked below) • LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB • Connections inside an LB are efficient (direct) • Connections outside LBs are slow (through the connection matrix) [Figure: four-input function F mapped onto 2-input LUTs connected through the connection matrix]
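The slide's exact function F lives in the figure, so this C sketch uses a hypothetical stand-in, F = (A AND B) OR (C AND D), to show the two-level decomposition into 2-input LUTs. The level-2 LUT receives its inputs through the connection matrix, which is where the extra routing delay comes from.

```c
#include <stdint.h>
#include <stdio.h>

/* A 2-input LUT: 4-row truth table in the low 4 bits of `config`. */
static int lut2(uint8_t config, int a, int b)
{
    return (config >> ((a << 1) | b)) & 1;
}

int main(void)
{
    /* Hypothetical target: F = (A AND B) OR (C AND D).
     * AND: only row 11 is 1 -> config 0b1000 = 0x8.
     * OR:  rows 01, 10, 11  -> config 0b1110 = 0xE. */
    const uint8_t LUT_AND = 0x8, LUT_OR = 0xE;

    for (unsigned v = 0; v < 16; v++) {
        int a = (v >> 3) & 1, b = (v >> 2) & 1;
        int c = (v >> 1) & 1, d = v & 1;
        int t0 = lut2(LUT_AND, a, b);   /* level 1, inside one LB     */
        int t1 = lut2(LUT_AND, c, d);   /* level 1, inside another LB */
        /* Level 2: t0 and t1 travel through the connection matrix,
         * adding the routing delay discussed above. */
        int f  = lut2(LUT_OR, t0, t1);
        printf("A=%d B=%d C=%d D=%d -> F=%d\n", a, b, c, d, f);
    }
    return 0;
}
```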
1. Once again: General purpose vs Special purpose • Idea: implement frequently used blocks as hard-core modules in the device [Figure: the same function F provided as a hard-core module, bypassing the connection matrix]
1. Coarse grained reconfigurable devices • Overcome the inefficiency of FPGAs by providing coarse grained functional units (adders, multipliers, integrators, etc.) that are efficiently implemented • Advantage: very efficient in terms of speed (no need for connections over a connection matrix for basic operators) • Advantage: direct wiring instead of LUT implementation • A coarse grained device is usually an array of programmable and identical processing elements (PEs) capable of executing a few operations like addition and multiplication • Depending on the manufacturer, the functional units communicate via busses or can be directly connected using programmable routing matrices
1. Coarse grained reconfigurable devices • Memory exists between and inside the PEs • Several other functional units may be present, depending on the manufacturer • A PE is usually an 8-bit, 16-bit or 32-bit tiny ALU which can be configured to execute only one operation over a given period (until the next configuration), as sketched below • Communication among the PEs can be either packet oriented (on busses) or point-to-point (using crossbar switches) • Since each vendor has its own implementation approach, the study will proceed by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, picoChip, IPflex DAP/DNA
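As a rough illustration (names and opcodes are invented, not taken from any particular vendor), a coarse-grained PE can be modeled as a small ALU whose single operation is fixed by configuration rather than fetched every cycle:

```c
#include <stdint.h>
#include <stdio.h>

/* A 16-bit PE: the configured opcode stays fixed until the next
 * reconfiguration, unlike a processor that fetches a new instruction
 * every cycle. */
typedef enum { PE_ADD, PE_SUB, PE_MUL, PE_AND } pe_op_t;

typedef struct {
    pe_op_t op;  /* set by configuration, not by a program counter */
} pe_t;

static void pe_configure(pe_t *pe, pe_op_t op) { pe->op = op; }

/* Executed every cycle on the data streaming through the PE. */
static uint16_t pe_execute(const pe_t *pe, uint16_t a, uint16_t b)
{
    switch (pe->op) {
    case PE_ADD: return a + b;
    case PE_SUB: return a - b;
    case PE_MUL: return a * b;
    case PE_AND: return a & b;
    }
    return 0;
}

int main(void)
{
    pe_t pe;
    pe_configure(&pe, PE_MUL);                      /* one reconfiguration...   */
    printf("%u\n", (unsigned)pe_execute(&pe, 6, 7)); /* ...then many executions */
    return 0;
}
```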
2.1 The PACT XPP – Overall structure • XPP (eXtreme Processing Platform) is a hierarchical structure consisting of: • An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs) • Processing Array Clusters (PACs), each a PA together with a Configuration Manager (CM) • A hierarchical configuration tree • Local CMs manage the configuration at the PA level • The local CMs access the local configuration memory, while the Supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device
2.1 The PACT XPP – Overall structure • The PAE: two types of PAEs exist • The ALU PAE • The RAM PAE • The ALU PAE: • Contains an ALU which can be configured to perform basic operations • The Back Register (BREG) provides routing channels for data and events from bottom to top • The Forward Register (FREG) provides routing channels from top to bottom
2.1 The PACT XPP – Overall structure • DataFlow Registers (DF-REGs) can be used at the object outputs for buffering data • Input registers can be preloaded with configuration data • The RAM PAE: • Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM (modeled in the sketch below) • Useful for data storage • Data is written or read after an address is applied at the RAM inputs • BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE
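A minimal model of the dual-ported RAM-PAE idea; the word width and depth are arbitrary assumptions, not PACT specifications. Data moves only once an address appears at the RAM inputs, and the two ports are independent.

```c
#include <stdint.h>

#define RAM_WORDS 256   /* depth chosen for illustration only */

typedef struct {
    uint16_t mem[RAM_WORDS];
} ram_pae_t;

/* Because the RAM is dual-ported, a read on one port and a write on
 * the other can be serviced in the same cycle; here they are simply
 * two independent functions, each driven by an address input. */
static uint16_t ram_pae_read(const ram_pae_t *r, uint8_t addr)
{
    return r->mem[addr];
}

static void ram_pae_write(ram_pae_t *r, uint8_t addr, uint16_t data)
{
    r->mem[addr] = data;
}
```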
2.1 The PACT XPP – Overall structure • Routing in the PACT XPP: • Two independent networks • One for data transmission • The other for event transmission • A configuration bus exists besides the data and event networks (very little information is available about the configuration bus) • All objects can be connected to horizontal routing channels using switch objects • Vertical routing channels are provided by the BREG and FREG • BREGs route from bottom to top • FREGs route from top to bottom [Figure: vertical and horizontal routing channels]
2.1 The PACT XPP - Interface • Interfaces are available inside the chip • Number and type of interfaces vary from device to device • On the XPP42-A1: 6 internal interfaces consisting of: • 4 identical general purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right) • One configuration manager interface (not shown on the picture) • One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary scan interface for testing purposes [Figure: interfaces]
2.1 The PACT XPP - Interface • The I/O interfaces can operate independently of each other. Two operation modes: • The RAM mode • The streaming mode • RAM mode: • Each port can access external static RAM (SRAM) • Control signals for the SRAM transactions are available • No additional logic is required
2.1 The PACT XPP - Interface • Streaming mode: • For high speed streaming of data to and from the device • Each I/O element provides two bidirectional ports for data streaming • Handshake signals are used for synchronization of data packets at the external ports
2.1 The NEC DRP – Architecture • The NEC Dynamically Reconfigurable Processor (DRP) consists of: • A set of byte oriented processing elements (PEs) • A programmable interconnection network for communication among the PEs • A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process • Memory around the device for storing configuration and computation data • Various interfaces
2.1 The NEC DRP - The Processing Element • ALU: ordinary byte arithmetic/logic operations • DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations • An instruction dictates ALU/DMU operations and inter-PE connections • Source/destination operands can come from/go to either • its own register file • or other PEs (i.e., flow through) • The instruction pointer (IP) is provided by the STC (state transition controller)
2.1 The NEC DRP - The Processing Element • The instruction pointer (IP) from the STC identifies a datapath plane • Spatial computation using a customized datapath plane • When the IP changes, the datapath plane switches instantaneously • The PE instructions collectively behave like an extreme VLIW • Sequencing through instructions => dynamic reconfiguration [Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out, with task selection by descriptor under control]
2.1 The NEC DRP – Reconfiguration Process • 1. Identify the instruction to be executed (the IP selects one of the instructions pre-loaded in each PE) • 2. Decode the instruction in the ALU plane • 3. Configure the ALU plane according to the instruction [Figure: a PE array of ALU/DMU elements, each holding several instructions, switching planes as the IP changes] A sketch of this multi-context mechanism follows below.
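The following C sketch (array size, plane count, and field names are assumptions, not NEC specifications) models the multi-context mechanism: every PE holds one pre-loaded instruction per datapath plane, so when the STC changes the IP, all PEs switch planes in a single step.

```c
#include <stdint.h>

#define NUM_PES     64   /* illustrative array size        */
#define NUM_PLANES  16   /* contexts held on-chip, assumed */

/* Each PE stores one instruction per datapath plane. */
typedef struct {
    uint8_t inst[NUM_PLANES];   /* pre-loaded instructions */
} drp_pe_t;

typedef struct {
    drp_pe_t pe[NUM_PES];
    unsigned ip;                /* broadcast by the STC's FSM */
} drp_array_t;

/* Step 1: the STC identifies the plane by setting the IP. Because the
 * instructions are already on-chip, this is the whole reconfiguration
 * cost, which is why the plane switch is effectively instantaneous. */
static void drp_set_ip(drp_array_t *a, unsigned ip)
{
    a->ip = ip % NUM_PLANES;
}

/* Steps 2-3: each PE decodes and configures itself from its own slot
 * of the currently selected plane. */
static uint8_t drp_current_inst(const drp_array_t *a, unsigned pe)
{
    return a->pe[pe].inst[a->ip];
}
```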
2.1 The picoChip - Architecture • Hundreds of array elements, each with a versatile 16-bit processor and local data memory • Heterogeneous architecture with four types of elements optimized for different tasks (DSP or wireless functions) • Interfaces for: • SRAM • Host communication • External systems • Inter-picoChip communication
2.2 The Quicksilver ACM - Architecture • Quicksilver ACM (Adaptive Computing Machine) • Fractal-like structure • Hierarchical grouping of four nodes with full communication among the nodes • 4 lower level nodes are grouped into a higher level node • The lowest level consists of 4 heterogeneous processing nodes • The connections are made through a Matrix Interconnect Network (MIN) • A system controller • Various I/Os
2.2 The Quicksilver ACM – The processing node • An ACM processing node consists of: • An algorithmic engine, which is unique to each node type and defines the operations performed by the node • The node memory, for data storage at the node level • A node wrapper, which is common to all nodes and is used to hide the complexity of the heterogeneous architecture
2.2 The Quicksilver ACM – The processing node • Four types of nodes exist: • The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general purpose registers • The Adaptive Execution Node (AXN) provides variable size MAC and ALU operations • The Domain Bit Manipulation (DBM) node provides bit manipulation and byte oriented operations • The External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces [Figure: ACM PSN node]
2.2 The Quicksilver ACM – The processing node [Figures: ACM AXN node and ACM DBM node]
2.2 The Quicksilver ACM – The node wrapper • The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features: • A MIN interface to support the communication among nodes via the MIN network • A hardware task manager for task management at the node level • A DMA engine • Dedicated I/O circuitry • Memory controllers • Data distributors and aggregators [Figure: the ACM node wrapper]
2.2 The Quicksilver ACM – The node wrapper • The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip • Hierarchically organized: the MIN at a given level connects many lower-level MINs • The MIN root is used for: • Off-chip communication • Configuration • Supporting the communication among nodes • Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc. [Figure: example of an ACM chip configuration]
2.2 The Quicksilver ACM – The System Controller • The system controller is in charge of the system management • Loads tasks into the nodes' ready-to-run queues for execution (sketched below) • Statically or dynamically sets the communication channels between the processing nodes • Carries out the reconfiguration of nodes on a clock cycle-by-clock cycle basis • The ACM chip features a set of I/O interface controllers like: • PCI • PLL • SDRAM and SRAM [Figures: the system controller and the interface controllers]
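A minimal sketch of the per-node ready-to-run queue idea; all names and the queue depth are assumptions, not Quicksilver's API. The system controller enqueues tasks, and the node's hardware task manager dequeues them for execution.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8   /* depth chosen for illustration only */

typedef struct {
    int    task_id[QUEUE_DEPTH];
    size_t head, tail, count;
} run_queue_t;

/* Called by the system controller to schedule work on a node. */
static int enqueue_task(run_queue_t *q, int task_id)
{
    if (q->count == QUEUE_DEPTH)
        return -1;                        /* node queue full */
    q->task_id[q->tail] = task_id;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return 0;
}

/* Called by the node's hardware task manager to pick the next task. */
static int dequeue_task(run_queue_t *q)
{
    if (q->count == 0)
        return -1;                        /* nothing ready to run */
    int id = q->task_id[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return id;
}
```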
2.3 The IPflex DAP/DNA - Structure • The IPflex DAP/DNA has the structure of a System on Chip (SoC) with an embedded FPGA. It features: • An integrated RISC core, which • Carries out some of the computation • Controls the reconfiguration process • A Distributed Network Architecture (DNA) matrix (a matrix of configurable operation units) • Communication over an internal bus • Different caches for data, instructions and configuration • I/O and memory interface controllers
2.3 Tensilica Xtensa • 32-bit synthesizable CPU core for System-on-Chip (SoC) designs • High-performance, low-power • Tensilica Instruction Extension ("TIE") language • Designers can modify the feature set of the processor: cache sizes, address and data bus widths, and other core parameters • Automatic generation of a complete software development tool environment for a specific processor instance • "Application-specific extensions" are added at design time
2.3 Tensilica Xtensa • Instruction Set Architecture (ISA) • Consists of: • A base set of instructions (~80 instructions, a superset of traditional RISC) • A set of configurable options • Attributes: • Enables configurability • Minimizes code size • Reduces power requirements • Maximizes performance
2.3 Tensilica Xtensa • The user gets: • The resulting customized synthesizable hardware description • Additional logic to build a complete Xtensa CPU on an FPGA • A full set of diagnostics to verify the RTL • CAD tool scripts to assist with embedding the core in your design • A customized compiler toolchain tuned to the core (a usage sketch follows below): • GNU C/C++ compiler, assembler, and profiler • GDB and the DDD debugger GUI • XMON for debugging on a live FPGA
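As a hedged illustration of what design-time instruction extension buys at the software level: the Xtensa toolchain exposes TIE instructions to C code as intrinsics, but the intrinsic below, xt_mac16, is an invented name, modeled here in plain C so the sketch stays self-contained rather than depending on any generated header.

```c
#include <stdint.h>

/* Stand-in for a TIE-generated multiply-accumulate intrinsic. On a
 * real customized core this would compile to one custom instruction;
 * the name and signature are hypothetical. */
static inline int32_t xt_mac16(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Application code using the "custom instruction": a 16-bit dot
 * product that issues one MAC per tap. */
int32_t dot16(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = xt_mac16(acc, x[i], y[i]);
    return acc;
}
```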
2.3 Stretch software configurable processor • Xtensa V RISC CPU from Tensilica • Run-time instruction set extension through embedded programmable logic • Instruction Set Extension Fabric (ISEF) • SW-controlled • Wide load/store support for any alignment • Wide register file • Up to 3 wide operands • 1 or 2 wide results
2.3 Stretch software configurable processor [Figures: the Stretch S5000 ISEF and the Stretch S5000 configurable processor]
Device size • Device size is usually measured by the number of transistors used in the device • This is not so helpful for reconfigurable devices, since the number of transistors does not reflect the number of usable resources in the chip. For example, FPGAs are among the most complex chips (more complex than Pentium processors), but their capacity is smaller than that of their ASIC counterparts • The capacity of an FPGA is usually measured in terms of the number of gate equivalents a design needs to be implemented • A gate equivalent is a unit of measure: 1 gate equivalent = one 2-input NAND gate • A one-million-gate FPGA is able to implement the equivalent of a circuit containing 1 million 2-input NAND gates (see the arithmetic sketch below)
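A quick back-of-the-envelope use of the gate-equivalent metric in C; the design size below is an arbitrary example. Since a 2-input CMOS NAND gate uses 4 transistors, transistor counts and gate equivalents convert by a factor of four.

```c
#include <stdio.h>

int main(void)
{
    const long transistors_per_ge = 4;   /* one 2-input CMOS NAND gate */
    long design_transistors = 2800000;   /* hypothetical design size   */
    long gate_equivalents = design_transistors / transistors_per_ge;

    printf("%ld transistors ~= %ld gate equivalents\n",
           design_transistors, gate_equivalents);
    /* So this hypothetical 2.8M-transistor design (700,000 gate
     * equivalents) would fit in a one-million-gate FPGA. */
    return 0;
}
```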