ENG6530 Reconfigurable Computing Systems

ENG6530 Reconfigurable Computing Systems Reconfigurable Architectures ENG6530 RCS

Topics
• Coupling of RCS Systems
• Implementation Approaches
• Advantages/Disadvantages
• Mapping Large Designs?
• Parallel, Serial, Semi-Serial
• Floating Point, Fixed Point
• Run Time Reconfigurations
• Static vs. Dynamic Reconfiguration
• Support for RTR

References
• “FPGA-Based System Design”, Wayne Wolf.
• “Reconfigurable Computing: Accelerating Computation with FPGAs”, Maya Gokhale, 2005.
• “Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications”, C. Bobda, 2007.
• “Reconfigurable System Design and Verification”, Pao-Ann Hsiung, 2009.
• “Reconfigurable Computing: A Survey of Systems and Software”, Scott Hauck, 2002.

Reconfigurable Computing: Definition
• Reconfigurable Computing (RC) is a computing paradigm
  • in which programmable logic devices are used to accelerate computations or applications by exploiting parallelism at different levels (bit, instruction, architectural), and
  • in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks.
• What is meant by temporal and spatial implementations?

Spatial vs. Temporal Computing
[Figure: temporal vs. spatial computing]

Spatial & Temporal Definitions
• There are several perspectives on the meaning of spatial and temporal:
• Cluster of microprocessors
  • Temporal = function runs within one processor
  • Spatial = function spread across many microprocessor nodes
• Traditional embedded computing hardware
  • Temporal = using a microprocessor
  • Spatial = implementing dedicated ASIC accelerators
• FPGAs
  • Temporal = using the same logic resources for multiple functions
  • Spatial = parallelizing and pipelining a function across the FPGA fabric

Spatially Programmed Connections
• “Hardware” customized to the specifics of the problem.
• Direct mapping of the problem-specific
  • datapath,
  • control.
• Circuits “adapt” as problem requirements change.

How Does RC Enhance Performance?
• Performance enhancement is achieved by hardware execution itself, which overcomes the following limitations:
  • The overhead of software execution (instruction fetch, loading data into registers, etc.).
  • The overhead of using fixed-size data.
  • The overhead of sequencing (i.e., using branches).
• However, these benefits are not large, because embedded CPUs and DSPs are already highly optimized.
• The key to performance improvement is pipelining and parallel processing.

Issues in Configurable Design
• Reconfigurable hardware architecture (FPGA)
  • Choice and granularity of computational elements
  • Issues related to performance, area, and power consumption
• Design entry techniques
  • Low level (VHDL)
  • High level (ESL, e.g., Handel-C)
• Support of efficient CAD tools
  • High-level synthesis, logic optimization
  • Mapping, place & route
• Coupling approaches
  • Tightly coupled vs. loosely coupled
• Area versus performance
  • Serial, semi-parallel, parallel
  • Floating point, fixed point
• Reconfiguration time and rate
  • Static versus dynamic reconfiguration (area, performance, approaches)

Coupling Approaches for Reconfigurable Hardware (RH)
RH can be coupled to a general-purpose processor (GP) as:
• A functional unit (tightly coupled)
• A coprocessor
• An attached processing unit
• A standalone processing unit (loosely coupled)

Different Levels of Coupling
[Figure: a workstation with CPU, caches, memory, and I/O interface, showing the coupling options from tightly coupled to loosely coupled: functional unit (FU), coprocessor, attached processing unit, standalone processing unit]

1. Functional Unit
• Part of the datapath of a host machine
• Examples
  • Chimaera (Hauck97a)
  • XiRisc architecture
  • Tensilica ASIP
[Figure: CPU, memory, caches, and FU]

Functional Unit: Reconfigurable Instruction Set Processors
[Figure: processor pipeline (Fetch, Decode, Issue) feeding the Integer, FP, Branch, LD/ST, and Reconfigurable units]
• Features:
  • Customized instructions may change over time
  • Registers hold inputs and outputs
  • The reconfigurable unit (RU) is a functional unit

Example of RPU integrated into CPU

Architecture
• Duplicated instruction decode logic (2 symmetrical data channels)
• Duplicated commonly used functional units (ALU and shifter)
• All other functional units are shared (DSP operations, memory handler)
• A tightly coupled pipelined configurable gate array

PiCoGA
• PiCoGA: a Pipelined Configurable Gate Array
• Embedded functional unit for dynamic extension of the instruction set
• Two-dimensional array of LUT-based reconfigurable logic cells
• Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor
• Up to 4 × 32-bit input data and up to 2 × 32-bit output data from/to the register file

Tensilica Xtensa Processor
• Tensilica’s Xtensa processors are synthesizable processors that are configurable and extensible.

Tensilica Xtensa Architecture

Automated Design Process

XPRES Compiler

2. Coprocessor
• As a coprocessor:
  • Does not share the datapath of the general-purpose processor (GPP)
  • Runs without constant supervision by the GPP
  • Similar to a floating-point unit (FPU)
  • Might share cache/memory
  • The GPP initializes the RH
• Independent parallel computation
• More communication overhead
  • Several cycles
[Figure: coprocessor and CPU]

Coprocessor: Garp Architecture
• For general-purpose loop acceleration
• Loops are extracted by a compiler and converted to hardware

Coprocessor Design: Cray XR1
• A Cray XR1 reconfigurable blade has two nodes, consisting of a single AMD Opteron processor coupled with two RPUs.
  • This connection is made directly with HyperTransport.
  • This delivers low-latency, high-bandwidth communication between the processing elements.
• Offers users orders-of-magnitude speedup on select applications.
• Many Xilinx Virtex-4 FPGAs can be integrated into a single system and applied effectively against demanding problems.
[Figure: Cray XR1 blade: 2x AMD Opteron + 2x Virtex LX200 FPGAs; Cray XT5h Supercomputer]

3. Attached Processing Unit
• Behaves as an additional processor
• Independent computation
• Higher delay to communicate with the CPU
• DMA-type overlap
• No sharing of cache
[Figure: attached processing unit, CPU, memory, and caches]

Attached RPU
• Similar to a multiprocessor environment
• Allows transfer and computation of large amounts of data
• Communication with the CPU is via memory

Attached RPU: Zynq-7000

Zynq-7000 AP

4. Standalone Processing Unit
• The most loosely coupled to the GP.
• Infrequent communication with the GP.
• Independent computation for long periods of time.
• Communication is expensive!
[Figure: standalone processing unit connected through the I/O interface to the CPU, memory, and caches]

System-Level View of the Splash 2 Architecture
• Interface board:
  1. Connects Splash 2 to the host
  2. Extends the address and data buses
  • The Sun host can read/write the memories and memory-mapped control registers of Splash 2 via these buses.
• Splash 2 processing board:
  • Processing elements (PEs) X1-X16: each PE has 512 KB of memory, and the host can read/write this memory
  • PE X0 controls the data flow into the processor board

Pros/Cons of Coupling Approaches
• Tight integration
  • Less communication overhead
  • RH cannot operate “alone” for any significant period of time
  • Amount of reconfigurable logic is limited
• Loose integration
  • Greater parallelism
  • Greater independence
  • RH can operate “alone” for long periods of time
  • Higher and more expensive communication overhead

Benefits/Drawbacks of the Coupling
[Figure: spectrum from Functional Unit through Coprocessing Unit and Attached RPU to Standalone RPU, annotated with dependency, reconfiguration speed, communication overhead, amount of logic capacity, size, and maintainability]

Summary
• The degree of coupling plays an important role in terms of
  • performance and cost,
  • communication overhead,
  • maintenance and reconfiguration speed.
• Several architectures have been proposed in academia and industry.
• New tools are required to help designers explore the design space and choose among the different coupling approaches for their specific application.

Issues with Reconfigurable Computing

How to Manage Large Designs?
• Use the largest FPGA available.
• Use multiple FPGAs to accommodate the entire design.
• Optimize your design as well as synthesis and place & route.
• Customize your architecture to fit in the available FPGA:
  • use serial implementations, or
  • semi-parallel implementations.

FPGAs: Space/Speed Trade-offs
• Q = (A × B) + (C × D) + (E × F) + (G × H) can be implemented in parallel.
[Figure: inputs A through H feed four multipliers whose products are summed by adders to produce Q]
• How can we make this more area efficient yet still achieve performance?

Cont … Example: Semi-Parallel
• Y = (A * B) + (C * D) + (E * F) + (G * H);
• Can we make this more area efficient?

Cont … Example: Serial
• Y = (A * B) + (C * D) + (E * F) + (G * H);
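The parallel, semi-parallel, and serial options for Y = (A * B) + (C * D) + (E * F) + (G * H) can be compared with a small behavioral model. The sketch below is Python rather than HDL; the function names, the step counting, and the choice of two multipliers for the semi-parallel case are illustrative assumptions, not the lecture's own design.

```python
# Behavioral model only: counts multipliers (a proxy for area) and
# multiply steps (a proxy for latency). Adder and register costs are ignored.

def mac_schedule(pairs, n_mult):
    """Compute sum(a*b) with n_mult multipliers that are reused on every step."""
    acc = 0
    steps = 0
    for i in range(0, len(pairs), n_mult):
        # One step: up to n_mult products are formed side by side,
        # then added into the accumulator register.
        acc += sum(a * b for a, b in pairs[i:i + n_mult])
        steps += 1
    return acc, n_mult, steps

def parallel(pairs):
    return mac_schedule(pairs, len(pairs))   # one multiplier per product

def semi_parallel(pairs):
    return mac_schedule(pairs, 2)            # assumption: two shared multipliers

def serial(pairs):
    return mac_schedule(pairs, 1)            # a single multiplier, reused

if __name__ == "__main__":
    operands = [(1, 2), (3, 4), (5, 6), (7, 8)]   # (A,B), (C,D), (E,F), (G,H)
    for name, impl in [("parallel", parallel),
                       ("semi-parallel", semi_parallel),
                       ("serial", serial)]:
        y, mults, steps = impl(operands)
        print(f"{name}: Y={y}, multipliers={mults}, multiply steps={steps}")
```

All three variants return the same Y; what changes is how many multipliers are instantiated and how many steps are needed, which is the area/speed trade-off in miniature.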

Customize Architectures to Suit Your Ideal Algorithms
• FPGAs allow area (cost) / performance trade-offs.
[Figure: parallel, semi-parallel, and serial implementations built from multipliers, adders, and D-Q registers, arranged along an "optimized for" axis running between speed and area]

Floating Point vs. Fixed Point

Floating Point Representation
• Floating-point arithmetic is widespread in scientific computing, DSP applications, machine learning, communication systems, optimization, ….
• Floating-point arithmetic is widely used because it has many practical advantages:
  • It provides a familiar approximation to the real numbers, with useful properties like automatic scaling.
  • It is widely available on different computers and is well supported by programming languages.
  • Current workstations have highly optimized native floating-point arithmetic, sometimes faster than native integer arithmetic.
• Single precision vs. double precision.

FP Number Representation
• Form: Mantissa × R^Exponent, e.g., 5.234 × 10^-28
• S: sign of mantissa
• Range (roughly)
  • Single: 10^-38 to 10^38
  • Double: 10^-307 to 10^307
• Precision (roughly)
  • Single: 7 significant decimal digits
  • Double: 15 significant decimal digits
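To make the sign / mantissa / exponent split concrete, the following sketch pulls the three fields out of a 32-bit float. It assumes the IEEE 754 single-precision layout (radix R = 2, 8-bit exponent with bias 127, 23 stored mantissa bits), which the slide does not name explicitly; the function names are hypothetical.

```python
import struct

def decode_single(x):
    """Store x as a 32-bit float and split it into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # S: 1 bit
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent (bias 127)
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits
    return sign, exponent, fraction

def value_of_normal(sign, exponent, fraction):
    """Rebuild a normalized value: (-1)^S * (1 + fraction / 2^23) * 2^(exponent - 127)."""
    mantissa = 1 + fraction / 2**23
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

if __name__ == "__main__":
    fields = decode_single(-0.15625)   # -1.25 * 2^-3
    print(fields, value_of_normal(*fields))
```

The reconstruction only covers normalized numbers; zeros, subnormals, infinities, and NaNs use reserved exponent values and are left out of this sketch.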

Fixed-Point Arithmetic
• Uses integers to represent fractional numbers.
• Layout: sign | integer bits (IB) | fractional bits (FB)
  • Register width: RW = 1 + IB + FB (typically 16 or 32)
  • Example (RW = 9, IB = FB = 4): 0 0011 0011 (binary) = 0011.0011 (binary) = 3.1875 (decimal)
• Operations
  • Multiplication: (a · b) >> FB
  • Addition: a + b
• Dynamic range: -2^IB ... 2^IB - 1
  • Much smaller than in floating point: risk of overflow
• Problem: for a given application, choose IB (and thus FB) to avoid overflow.
• Are there tools that automatically choose the application-dependent “best” IB (and thus FB) for linear DSP kernels?
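The two fixed-point operations and the overflow risk can be tried out with ordinary integers. This is a minimal sketch under stated assumptions: values are held as two's complement integers scaled by 2^FB, the helper names are hypothetical, and IB = FB = 4 follows the RW = 9 example.

```python
IB, FB = 4, 4                  # integer and fractional bits (RW = 1 + IB + FB = 9)

def to_fixed(x, fb=FB):
    """Scale a real number by 2^fb and round to the nearest integer."""
    return round(x * (1 << fb))

def to_real(q, fb=FB):
    """Interpret the integer q as a value with fb fractional bits."""
    return q / (1 << fb)

def fx_add(a, b):
    return a + b               # operands share a scale, so plain integer addition

def fx_mul(a, b, fb=FB):
    return (a * b) >> fb       # the raw product has 2*fb fractional bits; shift back to fb

def fits(q, ib=IB, fb=FB):
    """True if q is representable with a sign bit, ib integer bits, and fb fractional bits."""
    return -(1 << (ib + fb)) <= q < (1 << (ib + fb))

if __name__ == "__main__":
    print(bin(to_fixed(3.1875)))                            # bit pattern of the example value
    print(to_real(fx_mul(to_fixed(1.5), to_fixed(2.0))))    # product of 1.5 and 2.0
    print(fits(fx_mul(to_fixed(8.0), to_fixed(8.0))))       # 8 * 8 = 64 exceeds the range
```

The shift in `fx_mul` truncates toward negative infinity; a hardware design might round instead, which is a separate design decision not covered here.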

Fixed Point
ADVANTAGES:
• Addition and multiplication with integers are much quicker than floating-point addition and multiplication.
• The hardware is less complex,
• The hardware is cheaper, and
• The hardware requires less power.
DISADVANTAGES:
• DSP algorithms require fractional numbers (complex for the developer).
• Rounding 0.03 to 0 will cause your filter to fail.
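The "rounding 0.03 to 0" failure is easy to reproduce. The sketch below quantizes a coefficient to a chosen number of fractional bits; the helper names and the choice of 4 versus 8 fractional bits are assumptions made for illustration.

```python
def quantize(x, fb):
    """Round x to the nearest multiple of 2^-fb (the fixed-point resolution)."""
    scale = 1 << fb
    return round(x * scale) / scale

def scale_signal(samples, coeff):
    """Apply a single filter coefficient to every sample."""
    return [coeff * s for s in samples]

if __name__ == "__main__":
    for fb in (4, 8):
        c = quantize(0.03, fb)
        print(f"FB={fb}: coefficient 0.03 becomes {c}, output {scale_signal([1.0, 2.0], c)}")
```

With 4 fractional bits the resolution is 1/16 = 0.0625, so 0.03 falls below half a step and the coefficient (and therefore the output) collapses to zero; with 8 fractional bits it survives as 0.03125.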

Addition Units: Some Trade-offs
Floating-point vs. fixed-point
• Area: 7x-15x
• Speed: 0.8x-1x
• Power: 5x-10x

Fixed Point or Floating Point?
Fixed Point
• Very fast when base 2
• No complicated logic
• Radix point not encoded
• Fixed accuracy
• Can only represent a small set of numbers
Floating Point
• Slower
• Accuracy varies
• Can represent a very large set of numbers
• Radix point encoded
• Complex logic required

Conversion Programs: Development Procedure
[Figure: flow diagram with the elements Floating-Point C Program, Range Estimator, Range Estimation C Program, Execution, IWL information, Manual specification, Floating-Point to Fixed-Point C Program Converter, and Fixed-Point C Program]
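The range-estimation step of such a conversion flow can be approximated in a few lines. This is a sketch under assumptions, not the tool from the slide: it reads "IWL" as integer word length, instruments a floating-point computation by hand, and picks the smallest number of integer bits whose range covers the largest magnitude observed. All names are hypothetical.

```python
import math

def estimate_iwl(observed):
    """Smallest number of integer bits n such that every observed |value| < 2^n."""
    peak = max(abs(v) for v in observed)
    if peak == 0:
        return 0
    return max(0, math.floor(math.log2(peak)) + 1)

def trace_accumulator(pairs):
    """Run the floating-point computation and record every value the accumulator takes."""
    acc, seen = 0.0, []
    for a, b in pairs:
        acc += a * b
        seen.append(acc)
    return seen

if __name__ == "__main__":
    samples = [(1.5, 2.0), (0.5, -3.0), (2.0, 2.5)]
    history = trace_accumulator(samples)
    print(history, "-> integer bits:", estimate_iwl(history))
```

An estimate obtained this way only covers the inputs that were simulated; other inputs may still overflow, which is one reason a manual specification path is useful alongside the automatic one.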

Static and Dynamic Reconfiguration

Reconfigurability
• Reconfiguration is either static (execution is interrupted), semi-static (also called time-shared), or dynamic (in parallel with execution):
• Static reconfiguration involves hardware changes at the slow rate of days or weeks, typically used by hardware engineers to:
  • Evaluate prototype chip implementations,
  • Implement an architecture on an entire FPGA fabric or multiple FPGAs.
• Semi-static reconfiguration: if an application can be pipelined, it might be possible to implement each phase in sequence on the reconfigurable hardware.
  • The switch between phases is on command: a single FPGA performs a series of tasks in rapid succession, reconfiguring itself between each one.
  • Such designs operate the chip in a time-sharing mode and swap between successive configurations rapidly.
• Dynamic reconfiguration: the most powerful form of reconfigurable computing.
  • The hardware reconfigures itself on the fly as it executes a task.
  • While some modules are executing, others might be swapped in or out.

Static Implementation
• Static or Compile Time Reconfiguration (CTR)
  • Static implementation strategy
  • Single system-wide configuration
  • Configuration doesn’t change during computation
  • Similar to using an ASIC for application acceleration
[Figure: CONFIGURE, then EXECUTE]

Compile Time Configuration
• Compile time configuration is an important feature of SRAM-based FPGAs that allows functionality to be changed as needed.
• It enables benefits such as:
  • Flexibility: designs are loaded when required
  • Hardware reuse: the currently required design replaces the old one on the same fabric
  • Power savings
• Drawbacks of compile-time reconfiguration:
  • The entire fabric is reconfigured even for slight design changes
  • System execution stalls completely
  • The time to load a design onto the fabric from external memory (reconfiguration time) increases with bitstream size
[Figure: Designs A, B, and C are stored in external memory; a configuration controller loads whichever design is required onto the FPGA fabric]
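The dependence of reconfiguration time on bitstream size can be put into a back-of-the-envelope formula: time is roughly the number of bits to load divided by the throughput of the configuration port. The sketch below is an idealized lower bound that ignores controller and memory overhead; the function name and every number in the example (bitstream sizes, a 32-bit port, a 100 MHz clock) are hypothetical and not taken from any particular device.

```python
def reconfig_time_s(bitstream_bytes, port_width_bits, port_clock_hz):
    """Idealized load time: bits to transfer divided by configuration-port throughput."""
    bits = bitstream_bytes * 8
    return bits / (port_width_bits * port_clock_hz)

if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the linear scaling.
    for size in (1_000_000, 4_000_000, 8_000_000):
        t = reconfig_time_s(size, port_width_bits=32, port_clock_hz=100e6)
        print(f"{size} bytes -> {t * 1e3:.2f} ms")
```

Because the whole fabric is rewritten under compile-time reconfiguration, the full bitstream size enters this estimate even when only a small part of the design has changed.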
