Hardware/Software Codesign of Embedded Systems Reconfigurable Computing Voicu Groza SITE Hall, Room 5017 562 5800 ext. 2159 Groza@SITE.uOttawa.ca
Outline • Introduction • Enabling Technologies • Fixed, configurable, reconfigurable ... • Reconfigurable Architectures • Run-Time-Reconfigurable System-on-Chip • Conclusion and Future Work • References
1. Introduction • Reconfigurable computing – Definition • Why reconfigurable computing ?
Reconfigurable Computing - Definition • Reconfigurable Computing (RC) = presence of hardware (HW) that can be reconfigured (reconfigware - RW) • 1960: Gerald Estrin, “The UCLA Fixed-Plus-Variable (F+V) Structure Computer” • DeHon and Wawrzynek: “computing via a postfabrication and spatially programmed connection of processing elements.” • The architecture used in the computation is determined postfabrication and can therefore adapt to the characteristics of the executed algorithms. • The computation is spatial, in contrast to the more temporal style associated with microprocessors.
Re-inventing the wheel... wire your own computer
Why reconfigurable computing ? • Is your belt long enough? • Embedded hand-held devices need to reduce • power consumption, • packaging and manufacturing costs, • time-to-market • High-performance computing • Today’s computationally intensive applications require more processing power: • streaming video, • image recognition and processing, • highly interactive services, • telecommunications, • genomics • Cray introduced its entry-level XD1 supercomputer, which combines AMD Opteron processors with FPGAs for compute acceleration in a Linux environment.
2. Enabling Technologies • Programmable ICs: CPLD and FPGA (Xilinx 1984) • HW Abstractions • Fine-grained Reconfiguration is at the gate and register level. • By reconfiguration of registers, gates, and their interconnections, the internal structure of functional units is changed. • 2 major technologies: • Complex Programmable Logic Devices (CPLD) – EEPROM based • Field-Programmable Gate Arrays (FPGA) – SRAM based • Coarse-grained Reconfiguration is based on a set of fixed blocks, like functional units, processor cores, and memory tiles. • The reconfiguration is merely the reprogramming of the interconnections between the fixed blocks.
Complex Programmable Logic Devices (CPLD) • Supplied with no predetermined logic function. • Programmed by user to implement any digital logic function. • Requires specialized computer software for design and programming. • Complex PLD (CPLD) = A PLD that has several programmable sections with internal interconnections between the sections. • The basic building block of a CPLD is a macrocell which implements a logic function that is synthesized into a sum of product equations, followed by a D-type register. • Macrocells are grouped into logic blocks which are connected via a centralized interconnect array.
Field-Programmable Gate Array (FPGA) • Reconfigurable functional units • coarse-grained - ALUs and storage • fine-grained - small lookup tables • Interconnection network • Universal gates and/or storage elements • Switches
Basic ingredient: Look-Up Table (LUT) • Universal gate = look-up table = memory • Memory elements: SRAM [Figure: logic cell built from a look-up table; address inputs a0 and a1, stored data 0 0 0 1, i.e. an AND gate]
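The idea that a look-up table is just a small memory addressed by the gate inputs can be sketched in a few lines of Python. This is an illustrative model only: the function name `make_lut` and the bit ordering (a0 as the least significant address bit) are assumptions, not part of any vendor tool.

```python
def make_lut(truth_table):
    """Return a gate whose output is read from a small memory.

    truth_table[i] is the stored output for the input combination whose
    bits, read as a binary number with a0 as the least significant bit,
    equal i. (Bit ordering is an assumption made for this sketch.)
    """
    n_inputs = (len(truth_table) - 1).bit_length()
    if len(truth_table) != 2 ** n_inputs:
        raise ValueError("truth table length must be a power of two")

    def lut(*inputs):
        if len(inputs) != n_inputs:
            raise ValueError(f"expected {n_inputs} inputs")
        # The inputs form the address; the memory content is the output.
        address = sum(bit << i for i, bit in enumerate(inputs))
        return truth_table[address]

    return lut


# The same "hardware" becomes a different gate by changing the stored bits.
and2 = make_lut([0, 0, 0, 1])   # contents 0 0 0 1 -> a0 AND a1
xor2 = make_lut([0, 1, 1, 0])   # contents 0 1 1 0 -> a0 XOR a1
```

Reconfiguring the device amounts to rewriting `truth_table`; the lookup logic itself never changes.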
Configurable Logic Blocks (CLB - Xilinx) / Logic Array Blocks (LAB - Altera) • HW abstractions • 2 logic cells = 1 slice (Xilinx) or 1 Adaptive Logic Module (ALM - Altera) • 2 slices = 1 Configurable Logic Block (CLB - Xilinx) [Figure: Xilinx Spartan-II CLB]
Xilinx - Spartan II Architecture • IOBs provide the interface between the package pins and the internal logic • CLBs provide the functional elements for constructing most logic • Dedicated block RAM memories (4096 bits each) • Clock DLLs for clock-distribution delay compensation and clock domain control • Versatile multi-level interconnect structure
Xilinx Virtex FPGA Model [Figure: logic block (CLB), IO mux, switch matrices, line segments, SRAM, buffer, and Programmable Interconnect Points (PIPs)]
Virtex-II Architecture Overview • 1 CLB = 4 slices • 1 slice contains 2 function generators, F and G, each configurable as • a 4-input look-up table (LUT), or • a 16-bit shift register, or • 16 bits of distributed SelectRAM memory • DCM = Digital Clock Manager • Block SelectRAM = 18 Kbit of dual-port RAM (e.g., 2K x 9 bits) • Multiplier blocks: 18-bit x 18-bit
3. Fixed, configurable, reconfigurable ... • A simple classification: • Non-configurable computing • Configurable computing • Reconfigurable computing • Each has its own characteristics, (dis)advantages and applications
3.1. Non-Configurable Computing • Uses fixed hardware such as ASICs or custom VLSI circuits (e.g., microprocessors such as x86, SPARC, DEC Alpha, PowerPC) • Long product turnaround time, usually around 3-6 months • Optimized for performance • Can be quite costly • Hardwired, so there is no room for error, rework, or improvement after fabrication
3.2. Configurable Computing [Figure: Configuring Host, Bitstream, Execute] • The configuring host supervises the reconfiguration of the FPGA with a new bitstream • A bitstream is a sequence of bits that represents the configuration to be loaded for a Hardware Block (HB), e.g., a synthesized, placed-and-routed design
3.2. Configurable Computing (Cont’d) Advantages: • Uses configurable hardware such as FPGAs or CPLDs • PLDs are soft-wired, so static hardware resources can be re-used • Cost effective • Quick turnaround time • Flexible, easy design process Disadvantages: • Inefficient use of hardware resources: idle FPGA area cannot be used during run-time • Slow reconfiguration, because the entire FPGA is reconfigured even for a single Hardware Block (HB) • Execution must therefore stop while a new Hardware Block is configured
3.3. Reconfigurable Computing [Figure: Configuring Host, Bitstream, Execute] • A placement algorithm could also be used to fit all requested HBs into the FPGA
3.3. Reconfigurable Computing (Cont’d) Advantages: • Same as Configurable Computing • No need to completely stop execution while reconfiguring the FPGA with a new HB • Efficient use of static hardware resources: HBs can be swapped out or moved around to fit new HBs on the FPGA, with no need for a larger FPGA or a second one • Fast reconfiguration times • Run-time reconfiguration on the fly • Lower power consumption, since HBs can be swapped out Disadvantages: • Routing HBs can be a heavy overhead for the configuring host, especially if HBs are too large or when defragmentation is necessary
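The placement and defragmentation issue mentioned above can be illustrated with a toy first-fit placer. This is a sketch under stated assumptions, not the placement algorithm used by the authors: the fabric is modeled as a row of columns, each HB occupies a contiguous run of full-height columns, and the names `ColumnFabric`, `place` and `remove` are hypothetical.

```python
def first_fit(free_columns, width):
    """Return the leftmost start of a run of `width` free columns, or None."""
    run = 0
    for col, free in enumerate(free_columns):
        run = run + 1 if free else 0
        if run == width:
            return col - width + 1
    return None


class ColumnFabric:
    """Toy FPGA model: one owner (HB name or None) per column."""

    def __init__(self, n_columns):
        self.owner = [None] * n_columns

    def place(self, name, width):
        """Place an HB with first-fit; return its start column or None."""
        start = first_fit([o is None for o in self.owner], width)
        if start is None:
            return None
        for col in range(start, start + width):
            self.owner[col] = name
        return start

    def remove(self, name):
        """Swap an HB out, freeing its columns."""
        self.owner = [None if o == name else o for o in self.owner]


fabric = ColumnFabric(8)
start_a = fabric.place("A", 3)
start_b = fabric.place("B", 2)
start_c = fabric.place("C", 3)
fabric.remove("A")
fabric.remove("C")
# Six columns are free, but they are split into two runs of three,
# so a four-column HB cannot be placed without moving B first.
start_d = fabric.place("D", 4)
free_count = fabric.owner.count(None)
```

The failed placement of `D` is the fragmentation case: enough total area is free, yet the configuring host would have to relocate `B` before the new HB fits.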
What is Run-Time Reconfiguration (RTR) ? • On-the-fly flexibility • Combines characteristics of co-processors with those of reconfigurable computing • Introduces overhead to reconfigure the co-processor, but offsets it with faster execution (faster in H/W!)
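The trade-off between reconfiguration overhead and faster hardware execution can be made concrete with a small break-even calculation. The model below is an assumption made for illustration (one reconfiguration amortised over a number of calls, constant per-call times); the timing values are hypothetical and not measurements from this work.

```python
import math


def rtr_speedup(t_sw, t_hw, t_reconfig, n_calls=1):
    """Speedup of the HW path over SW when one reconfiguration
    is amortised over n_calls executions."""
    return (n_calls * t_sw) / (t_reconfig + n_calls * t_hw)


def break_even_calls(t_sw, t_hw, t_reconfig):
    """Smallest number of calls for which HW is at least as fast as SW,
    or None if the hardware version is not faster per call."""
    if t_hw >= t_sw:
        return None
    return math.ceil(t_reconfig / (t_sw - t_hw))


# Hypothetical times in arbitrary units: SW call 10, HW call 2,
# reconfiguration 40.
n_min = break_even_calls(10, 2, 40)
speedup_single = rtr_speedup(10, 2, 40, n_calls=1)
speedup_at_break_even = rtr_speedup(10, 2, 40, n_calls=n_min)
```

With these numbers a single call is slower in hardware once reconfiguration is counted, and the overhead is only recovered after several calls to the same HB.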
4. Reconfigurable Architectures • External stand-alone processing unit • Attached processing unit • Reconfigurable functional unit • Co-processor • Processor embedded in a reconfigurable fabric (Compton & Hauck)
External stand-alone processing unit RPU coupled to the I/O system bus • The RECON System • John Reid Hauser • John Wawrzynek • Randy H. Katz • (University of California, Berkeley) • Consists of a SUN SparcStation host and a reconfigurable coprocessor board (The board exploits a XC4010 FPGA as the reconfigurable processor unit).
Attached processing unit RPU coupled to the local bus • TKDM • Marco Platzner • ETH Zurich • FPGA module that uses the DIMM (dual inline memory module) bus for high-bandwidth communication with the host CPU. • It is integrated with the Linux host OS; • offers functions for data communication and FPGA reconfiguration.
Attached processing unit (Cont.) • MorphoSys • Nader Bagherzadeh • University of California, Irvine • Consists of a RISC processor core combined with an array of coarse-grain reconfigurable cells (RC array) • Uses a DMA controller to load the configuration data (context) into the Context Memory • Coarse grain: MorphoSys operates on 8/16-bit data. • Configuration: the RC array is configured by context words, which specify an instruction opcode for the RC. • Depth of programmability: the Context Memory can store up to 32 planes of configuration. • Dynamic reconfiguration: contexts are loaded into the Context Memory without interrupting RC operation. • Local/Host Processor: the control processor (Tiny RISC) and the RC array reside on the same chip. • Fast Memory Interface: through the DMA controller.
Reconfigurable functional unit RPU integrated in the CPU • Chimaera • S. Hauck • University of Washington, Seattle • The system treats the reconfigurable logic as a cache for RPU instructions. • Instructions that have recently been executed, or that can otherwise be predicted to be needed soon, are kept in the reconfigurable logic. • If another instruction is required, it is brought into the RPU by overwriting one or more of the currently loaded instructions.
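The "reconfigurable logic as a cache" idea can be sketched as follows. This is not Chimaera's actual mechanism: the least-recently-used replacement policy, the fixed slot count, and the names `RpuInstructionCache` and `load_config` are assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class RpuInstructionCache:
    """Toy model of reconfigurable logic holding a few RPU instructions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()   # instruction id -> configuration
        self.reconfigurations = 0

    def execute(self, instr_id, load_config):
        """Run an RPU instruction, loading its configuration on a miss."""
        if instr_id in self.loaded:
            self.loaded.move_to_end(instr_id)   # mark as recently used
            return "hit"
        if len(self.loaded) >= self.capacity:
            # Overwrite the least recently used instruction (assumed policy).
            self.loaded.popitem(last=False)
        self.loaded[instr_id] = load_config(instr_id)
        self.reconfigurations += 1
        return "miss"


cache = RpuInstructionCache(capacity=2)
trace = ["a", "b", "a", "c", "b"]
outcomes = [cache.execute(i, lambda name: f"config-{name}") for i in trace]
resident = list(cache.loaded)
```

Every miss stands for a reconfiguration of part of the logic, so the hit rate of this cache directly determines how much reconfiguration overhead the processor pays.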
Co-processor RPU coupled to the CPU • GARP • Hauser & Wawrzynek • University of California, Berkeley • A reconfigurable architecture that combines reconfigurable hardware with a standard MIPS processor on the same die to achieve better performance. • Only one configuration can be active on its reconfigurable array at a time, which can significantly reduce the overall performance of the system.
5. RTR-SoC System Architecture [Figure: RTR-SoC system architecture. Block annotations: execution unit of HBs; allows dedicated OMA-RPU access; stores program and data code; runs software instructions; stores HB bitstreams; IBM OPB]
Application and Reconfiguration Flows • While the application flow runs on AE, RE sends RTR_PREP_HB to the ICAP controller to start loading the first HB bitstream onto the RPU. • Once this HB is ready in the RPU, the ICAP sends an RTR_ACK back to the RE. • The newly implemented HB on the RPU starts to work as soon as it is ENABLEd by the reconfiguration flow on RE. • Upon completion, the HB sets the flag RTR_DONE to make the application flow aware that it is ready for use. • Once the application flow on AE has prepared the data that the HB needs, AE asserts the flag DATA_READY. • The HB asserts EXE_DONE when it has finished its task and has prepared the results to be read by the application flow on AE. • When the application flow needs these results, it checks the flag EXE_DONE and waits if it is not yet set. • The application flow gets the results and then asserts DATA_ACK to acknowledge to the HB that it has received the data.
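The flag sequence above can be replayed in software to check the ordering of the two flows. This is a sequential sketch, not the hardware protocol itself: the signal names come from the slide, while the function `run_handshake`, the log format, and the use of a Python callable to stand in for the HB are assumptions.

```python
def run_handshake(data, hb_function):
    """Replay the slide's flag sequence for one HB invocation."""
    flags = {name: False for name in
             ("RTR_ACK", "ENABLE", "RTR_DONE",
              "DATA_READY", "EXE_DONE", "DATA_ACK")}
    log = []

    def signal(sender, name):
        # RTR_PREP_HB is a request rather than a flag, so it is only logged.
        if name in flags:
            flags[name] = True
        log.append(f"{sender}: {name}")

    # Reconfiguration flow
    signal("RE", "RTR_PREP_HB")    # RE asks the ICAP controller to load the HB
    signal("ICAP", "RTR_ACK")      # HB bitstream is in place on the RPU
    signal("RE", "ENABLE")         # RE enables the new HB
    signal("HB", "RTR_DONE")       # HB reports it is ready for use

    # Application flow
    assert flags["RTR_DONE"]       # a real AE would wait here instead
    signal("AE", "DATA_READY")     # input data prepared for the HB
    result = hb_function(data)     # stand-in for the HB computation
    signal("HB", "EXE_DONE")       # results are ready to be read
    assert flags["EXE_DONE"]       # a real AE would wait here instead
    signal("AE", "DATA_ACK")       # AE confirms it has the results
    return result, log


result, log = run_handshake([1, 2, 3], sum)
```

In the real system the two flows run concurrently, so the `assert` statements would be replaced by polling or blocking on the corresponding flag.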
Physical Layer Overview • A physical layer has already been developed in JBits in order to evaluate RTR on a Xilinx Virtex device • The physical layer has 3 main functions: • modeling the FPGA resources, • running a placement algorithm for the different Hardware Blocks, and • managing the physical resources of the FPGA and any on-board peripherals. RTR Execution Model • Bitstream(s) are read by the JBits application • The JBits application configures the Virtex RC HW located in the PCI slot using the XHWIF API. • XHWIF (Xilinx HardWare InterFace) is a standard Java interface for communicating with FPGA-based boards; it enables run-time reconfiguration of the Virtex device. • JBits is a set of Java APIs and classes that provide a high-level-language approach to developing reconfigurable systems, including run-time reconfiguration.
Hardware Block (HB) Architecture [Figure: HB block diagram; labeled blocks include HBDU, HBIU, CU, Packer, Dispatcher, I-Buffer, O-Buffer, Register Decoders, HB I/F, MAB, PEs, and LMs] • An HB is a functional hardware module that contains its own configuration (i.e. the bitstream), and state information (e.g. status and control registers) that define its current state. • It is divided into two major components: • The HB Dependent Unit (HBDU) encompasses several components that vary in functionality and magnitude depending on the functions supported by a particular HB. • The HB Independent Unit (HBIU) is designed as a core and hence follows a standardized implementation scheme for all HBs.
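The description of an HB as "configuration plus state" can be captured in a small data structure. This is a hypothetical software model, not the authors' implementation: the class name, the dictionary-based registers, and the `snapshot`/`restore` helpers are assumptions illustrating how state might be kept when an HB is swapped out and back in.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareBlock:
    """Software stand-in for an HB: its bitstream plus its register state."""
    name: str
    bitstream: bytes
    control: dict = field(default_factory=dict)   # control registers
    status: dict = field(default_factory=dict)    # status registers

    def snapshot(self):
        """Copy the register state, e.g. before the HB is swapped out."""
        return {"control": dict(self.control), "status": dict(self.status)}

    def restore(self, saved):
        """Put back a previously saved register state."""
        self.control = dict(saved["control"])
        self.status = dict(saved["status"])


hb = HardwareBlock(name="fir", bitstream=b"\x00\x01")
hb.control["ENABLE"] = 1
saved = hb.snapshot()
hb.control["ENABLE"] = 0      # state changes while the HB is away
hb.restore(saved)
```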
Hardware Block Reconfiguration [Figure: ICAP, FPGA configuration memory, control logic, MicroBlaze, BRAM, OPB bus] • The HBs are partially reconfigured by the aforementioned Reconfigurable Processing Unit (RPU). • The reconfiguration process is enabled by means of a Self-Reconfiguration Platform (SRP). • It enables the FPGA to be dynamically reconfigured under the control of an embedded microprocessor. • It is divided into a H/W component and a S/W component. • The H/W component consists of four primary parts: the Internal Configuration Access Port (ICAP), some control logic, a small configuration cache in Block RAM (BRAM), and an embedded processor. • The S/W component implements an API that defines methods for accessing the configuration logic through the ICAP.
PR Methodology: Xilinx Virtex II Architecture • The Virtex II FPGA fabric is composed of an array of Configurable Logic Blocks (CLBs), • Block RAMs (BRAMs), • Input/Output Blocks (IOBs), and • special function blocks such as multipliers and Digital Clock Managers (DCMs). • Each CLB contains four slices. • Each slice contains two 4-input look-up tables and two D-type flip-flops to implement combinational and sequential circuits.
PR Methodology • Bus Macros (BMs) are required between the active and static modules of the design. • The size and location of the reconfigurable (active) module are always fixed. • The reconfigurable module is always the full height of the device. • All logic resources located within the width of the module are considered part of the reconfigurable module’s bitstream frames. This includes slices, tri-state buffers (TBUFs), block RAMs (BRAMs), multipliers, input/output blocks (IOBs), and all routing resources.
PR Methodology Bus Macro Block Diagram • Bus Macros (BMs) are predefined physical routing bridges that connect the active module to the static one. • Any connection from active to static logic should always go through a bus macro. • We chose slice-based bus macros (over TBUF-based ones) because they give a higher density of communication bits per CLB. • A bus macro allows data to move in only one direction, either left-to-right or right-to-left.
PR Methodology Final Design Layout Design contains only one active module. All other logic components are on the static module.
PR Methodology Xilinx Internal Configuration Access Port (ICAP) • Provides a configuration interface to the FPGA fabric. • Cache BRAM to hold at least one frame. • Control logic for the OPB bus interface. • API calls to allow SW to read/write configuration memory.
PR Methodology • A partial bitstream is generated for the active (dynamic) part of the FPGA • The device remains in full operation while the new partial bitstream is downloaded • The full bitstream configuration must already be programmed into the device before the partial bitstream is downloaded • Multiple bitstreams can be generated, one for every variation of a partially reconfigurable module • Partial bitstreams must be generated with the option –g ActiveReconfig:Yes • Failing to use this option will assert the global set/reset (GSR) during configuration, resetting the entire design
PR Methodology • Virtex-II configuration memory is arranged in vertical frames that are one bit wide and stretch from the top edge of the device to the bottom. • These frames are the smallest addressable segments of the Virtex-II configuration memory space; therefore, all operations must act on whole configuration frames. • The length of a Virtex-II frame is not fixed and depends on the size of the device. • The number of frames per column type, however, is constant for all devices.
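Because a frame is the smallest addressable unit, changing even a single configuration bit means reading a whole frame, modifying it, and writing the whole frame back. The sketch below models that read-modify-write cycle in plain Python; the class `ConfigMemory`, the helper `set_bit`, and the tiny frame size are hypothetical and do not correspond to the real ICAP API or to actual Virtex-II frame lengths.

```python
class ConfigMemory:
    """Toy configuration memory: a list of equal-length frames."""

    def __init__(self, n_frames, frame_bits):
        self.frame_bits = frame_bits
        self.frames = [[0] * frame_bits for _ in range(n_frames)]
        self.frame_writes = 0

    def read_frame(self, index):
        """Return a copy of one whole frame (e.g. into a frame cache)."""
        return list(self.frames[index])

    def write_frame(self, index, frame):
        """Write one whole frame; partial-frame writes are not possible."""
        if len(frame) != self.frame_bits:
            raise ValueError("only whole frames can be written")
        self.frames[index] = list(frame)
        self.frame_writes += 1


def set_bit(mem, frame_index, bit_index, value):
    """Change one configuration bit via read-modify-write of its frame."""
    frame = mem.read_frame(frame_index)   # 1. read the frame into a cache
    frame[bit_index] = value              # 2. modify the cached copy
    mem.write_frame(frame_index, frame)   # 3. write the whole frame back


mem = ConfigMemory(n_frames=4, frame_bits=8)
set_bit(mem, frame_index=2, bit_index=5, value=1)
```

This is also why a cache able to hold at least one frame is useful next to the configuration port: it is the working buffer for step 2.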
Reconfigurable Processing Unit [Figure: the RPU high-level block diagram]
Preliminary Results • Xilinx Virtex-II Platform FPGAs were used to implement this system. • Preliminary results were generated using ModelSim SE 5.7f. [Figure: simulation results for the HB I/F] • The results illustrate how the HB I/F is used to enable proper synchronization between the reconfiguration flow and the application flow.
6. Conclusion and Future Work • A novel architecture for an RTR SoC is introduced • The RPU and the HBs are designed • This design targets adaptive embedded systems, DSP-related and low-power applications • Application functions are implemented as HBs and can be exploited in a multi-purpose environment. For example, the RTR SoC may execute various tasks to perform DSP-related functions and subsequently be reconfigured into a high-performance measurement processing system • Future designs would allow the user more flexibility by auto-reconfiguring the RPU depending on the computational and functional needs of its respective applications • Real-time applications are our future target, as idle HBs are swapped out of the RPU to save power or to allow for updates to the HBs
References • M. Platzner, “Reconfigurable Computer Architectures,” e&i Elektrotechnik und Informationstechnik, 115(3):143-148, 1998. Springer. • Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J. Stockwood, “Hardware-Software Co-Design of Embedded Reconfigurable Architectures,” Proceedings of the 37th Design Automation Conference (DAC), pp. 507-512, June 5-9, 2000. • J. P. Heron, R. Woods, S. Sezer, and R. H. Turner, “Development of a run-time reconfiguration system with low reconfiguration overhead,” Journal of VLSI Signal Processing, 28(1/2):97-113, May 2001. • “Xilinx Microblaze Soft Processor Core,” http://www.xilinx.com/ise/embedded/edk6_2docs/mb ref_guide.pdf, last accessed on October 19, 2004. • G. Aggarwal, N. Thaper, K. Aggarwal, M. Balakrishnan, and S. Kumar, “A Novel Reconfigurable Co-Processor Architecture,” Proceedings of the Tenth International Conference on VLSI Design, pp. 370-375, January 1997. • G. Haug and W. Rosenstiel, “Reconfigurable Hardware as Shared Resource in Multipurpose Computers,” in R. W. Hartenstein and A. Keevallik, editors, Field-Programmable Logic: From FPGAs to Computing Paradigm, Springer-Verlag, Berlin, pp. 149-158, August/September 1998. • “Xilinx Virtex-II Platform FPGAs: Complete Data Sheet,” DS031, 14 Oct. 2003. • D. Wo and K. Forward, “Compiling to the Gate Level for a Reconfigurable Co-Processor,” Proceedings of FPGAs for Custom Computing Machines, pp. 147-154, 1994. • V. Groza, R. Abielmona, M. El-Kadri, N. Sakr, and M. Elbadri, “A Reconfigurable Co-Processor for Adaptive Embedded Systems,” Workshop on Intelligent Solutions in Embedded Systems, Graz, Austria, June 2004. • “IBM On-Chip Peripheral Bus,” http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/ 9A7AFA74DAD200D087256AB30005F0C8/$file/OpbBus.pdf, last accessed on October 19, 2004. • R. Abielmona, V. Groza, N. Sakr, and J. Ho, “Low-Level Run-Time Reconfiguration of FPGAs for Dynamic Environments,” IEEE Canadian Conference on Electrical and Computer Engineering (CCECE 2004), Niagara Falls, May 2004. • B. Blodget, P. James-Roxby, E. Keller, S. McMillan, and P. Sundararajan, “A Self-Reconfiguring Platform,” Proceedings of the International Conference on Field Programmable Logic, Lisbon, Portugal, Sept. 2003.