High-Level Programming Issues for Reconfigurable Computing Systems. Mark Jones ECE Virginia Tech Blacksburg, Virginia mtj@vt.edu www.ccm.ece.vt.edu. The Virginia Tech Configurable Computing Lab.
High-Level Programming Issues for Reconfigurable Computing Systems Mark Jones ECE Virginia Tech Blacksburg, Virginia mtj@vt.edu www.ccm.ece.vt.edu
The Virginia Tech Configurable Computing Lab • Focus on devices, architectures, applications, and programming issues for configurable computing • 30+ undergraduates, graduates, and post-docs • Variety of public and private sponsors • Peter Athanas & Mark Jones
Overview • Run-time reconfiguration (RTR) • Obstacles to RTR • Recent developments enabling RTR • New hardware • New bitstream generation tools • New runtime control software • RTR applications • Summary and predictions • Disclaimer: In the interests of time, I am not mentioning all of the relevant projects.
Run-Time Reconfiguration • Adaptive computing devices (e.g., FPGAs) • Hardware configurations can be changed • Speed of reconfiguration varies by device • Reconfigure during the runtime of one or more applications, in less than 1 ms • Goals of the DARPA ACS program include • Development of hardware supporting fast RTR • Creation of software to control RTR hardware • Applications that demonstrate the computational benefits of RTR in size, weight, and power
Types of RTR • Virtual Hardware • Provide programmer with an abstraction of unlimited hardware, similar to Virtual Memory • Useful abstraction which, like virtual memory, provides portability between devices • OS is responsible for directing the chip to context-switch user “hardware” (may include multiple processes) • Requires fast context-switching capability and software to effectively partition user hardware • Virtual co-processor work (e.g. DISC @ BYU) can be thought of in a similar fashion
Types of RTR (continued) • Data-driven RTR • Based on the data encountered, the hardware is reconfigured to process it • e.g., for a given DES key, the hardware is reconfigured to a DES core specific to the key • Can provide increased speed in a small package • Hardware must be able to reconfigure quickly and (in most cases) direct its own reconfiguration based on data encountered
Device Reconfiguration Methods • Entire device via a single bitstream • e.g. Xilinx 4K series • Long reconfiguration times • Logic-unit addressable reconfiguration • e.g. Xilinx 6200 • Significant chip area devoted to this function • Context-based reconfiguration • Sanders CSRC chip • Significant chip area devoted to this function
Device Reconfiguration Methods (continued) • Partial reconfiguration • e.g. Xilinx Virtex • Must reconfigure a column at a time • Stream-based reconfiguration • e.g. Colt/Stallion • Appropriate for stream-based computation • Pipeline-oriented reconfiguration • e.g. PipeRench • Appropriate for deeply pipelined applications
Types of Reconfigurable Apps • Stream-oriented applications • Intelligent network devices, software radios, video processing • Reconfiguration must occur quickly enough, and without disrupting the hardware, to avoid losing data in the stream (buffering is too expensive in many situations) • “Batch”-type applications • Number-crunching simulations, off-line analysis of data • Reconfiguration must simply be cost-effective when trading off processing time for reconfiguration time
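The batch-type trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the timing figures are assumptions, not measurements from any project in this talk. It computes the smallest batch size at which a reconfiguration pays for itself.

```python
def break_even_items(t_reconfig_ns, t_item_before_ns, t_item_after_ns):
    """Smallest number of items for which reconfiguring is a net win.

    All times are integer nanoseconds (hypothetical values supplied by the caller).
    Returns None when the new configuration is not faster per item.
    """
    gain_ns = t_item_before_ns - t_item_after_ns
    if gain_ns <= 0:
        return None
    # We need n * gain > t_reconfig, so take the next integer above the ratio.
    return t_reconfig_ns // gain_ns + 1


# Assumed figures: 1 ms to reconfigure, 2 us per item before, 1 us per item after.
print(break_even_items(1_000_000, 2_000, 1_000))  # prints 1001
```

Under these assumed numbers, a batch shorter than about a thousand items would not recover the cost of reconfiguring.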
Prior Obstacles to RTR • Lack of hardware devices that support RTR in an appropriate fashion • Provide fast reconfiguration without sacrificing performance • Lack of software to support RTR • Generate and modify bitstream configurations during runtime • The following slides will survey projects which are overcoming these obstacles • These projects really represent evolutionary advances on previous research projects
Virtual Hardware: PipeRench (CMU) • Many applications, particularly stream-based applications, can be deeply pipelined to improve performance • PipeRench is built as a reconfigurable pipeline of n units • The programmer views PipeRench as a programmable pipeline of m units where m is arbitrarily large
PipeRench (CMU) • PipeRench supports this Virtual Hardware abstraction by reconfiguring the n physical units so that they step through the m stages of the abstract pipeline
PipeRench (CMU) • Only one stage must be reconfigured at each step • Allows for fast reconfiguration because only part of the chip must be reconfigured • Defines a scalable architecture series • No changes to code are needed as hardware increases in size • A VLSI realization exists, as do compiler tools
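The idea of mapping m virtual stages onto n physical units, one reconfiguration per step, can be sketched as a simple round-robin schedule. This is a simplified model, not a description of the actual PipeRench hardware or compiler; the function names and the round-robin policy are assumptions made for illustration.

```python
def virtual_pipeline_schedule(m, n, steps):
    """List (step, physical_unit, virtual_stage) for each reconfiguration step.

    Assumed policy: virtual stages are loaded in order and wrap around,
    and physical units are overwritten round-robin, one per step.
    """
    return [(step, step % n, step % m) for step in range(steps)]


def resident_stages(schedule, n):
    """Return which virtual stage each physical unit holds after the schedule."""
    resident = [None] * n
    for _step, unit, stage in schedule:
        resident[unit] = stage  # only this one unit changes at this step
    return resident


# Example: a 5-stage virtual pipeline on 3 physical units.
schedule = virtual_pipeline_schedule(m=5, n=3, steps=7)
print(schedule[3])                   # (3, 0, 3): unit 0 is reused for stage 3
print(resident_stages(schedule, 3))  # [1, 4, 0]
```

Because the schedule depends only on m and n, the same virtual pipeline description runs unchanged when n grows, which is the scalability point made on the slide.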
Runtime Generation of Bitstreams: Loki Project (Xilinx and Virginia Tech) • [Diagram; labels: Application Program, New State, Functionality, Place & Route, State, Connectivity, Resources]
Loki Project (continued) • JBits provides an API to the Xilinx bitstream for the 4K and Virtex parts • Java-based API at the LUT/pip level • Executing a Java program with the JBits API can create or modify a bitstream • The Loki project builds on this API to provide a design environment • Focus is on Run-Time Parameterizable cores
Loki Project (continued) • RTP cores (tens of cores at this point) • Finite state machines • KCMs • CAMs • Execution time for customizing bitstreams • Milliseconds (or less) for modification of LUTs in an existing bitstream • Challenge is to provide similar speeds when routing is required
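A constant-coefficient multiplier (KCM) illustrates why a LUT-only change can retarget a core. The sketch below models a common table-based KCM scheme in software; it is not the Loki or JBits implementation, and the function names, the 4-bit digit size, and the 16-bit unsigned input width are all assumptions.

```python
def build_kcm_tables(constant, digit_bits=4):
    """Precompute constant * d for every possible input digit d.

    In hardware these entries would be LUT contents; changing the constant
    means rewriting the table, not the surrounding adder structure.
    """
    return [constant * d for d in range(1 << digit_bits)]


def kcm_multiply(tables, x, width=16, digit_bits=4):
    """Multiply unsigned x (< 2**width) by the constant baked into tables."""
    mask = (1 << digit_bits) - 1
    product = 0
    for shift in range(0, width, digit_bits):
        digit = (x >> shift) & mask
        product += tables[digit] << shift  # partial product, weighted by position
    return product


tables = build_kcm_tables(37)
print(kcm_multiply(tables, 1000))  # prints 37000
```

Switching to a new constant only requires calling `build_kcm_tables` again, which mirrors the slide's point that LUT modifications are the fast case and routing changes are the hard one.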
Loki Project (continued) • The RTP core-based approach is hierarchical • Routing & placement is handled within the core, so a full chip-wide P&R is not required • The JBits & RTP-based approach in the Java environment makes development of new tools much easier • Simulator for Virtex devices • Visualization of routing delays • Visualization of core layout and runtime execution
BoardScope Core View • [Figure; labels: Output Shift Register (Vertical); 3 Input Shift Registers (Horizontal), Center Register Highlighted; Evolved Synchronous Circuit]
Runtime Hardware Control: SLAAC & DRACS (Sanders, Virginia Tech, USC/ISI-East) • The new hardware that supports fast RTR requires new runtime control software to reduce/eliminate the software overhead associated with reconfiguration • Need to provide the programmer with an abstraction for RTR that is easy to use, yet doesn’t incur runtime overhead
Runtime Hardware Control: Target Hardware • The SLAAC-1V board • 3 Virtex 1000 chips capable of partial reconfiguration • On-board configuration controller (Virtex 100) with a local memory cache • The Sanders RCM board • 2 CSRC chips capable of context-switching • PowerPC and Xilinx 4085 with local memory cache
Runtime Hardware Control: Virtual Hardware • Consider an OS that is swapping hardware configurations in/out of a chip (microseconds) • Partial configurations in and out of the Virtex parts on the SLAAC-1V • Switching contexts on the RCM board • Cannot afford to have the configurations sent by the OS to the board on every configuration swap • The transfer would overwhelm the microsecond cost of the swap itself
Runtime Hardware Control: Virtual Hardware (continued) • Most programs exhibit temporal locality • Exploit this in a way similar to virtual memory • Both the SLAAC-1V and the Sanders RCM provide the memory and the control capability to build a configuration cache • Instead of sending configurations to the board, control signals are sent invoking reconfiguration from the cache • Transparent to the programmer
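A configuration cache of this kind can be sketched in software as follows. This is a minimal model, not the SLAAC or DRACS runtime: the class name, the `fetch_from_host` callback, and the least-recently-used eviction policy are assumptions chosen to show how temporal locality turns most swaps into on-board hits.

```python
from collections import OrderedDict


class ConfigurationCache:
    """Board-local cache of configurations with (assumed) LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()  # config_id -> bitstream, oldest first
        self.hits = 0
        self.misses = 0

    def activate(self, config_id, fetch_from_host):
        """Return the bitstream for config_id, fetching from the host on a miss."""
        if config_id in self._slots:
            self._slots.move_to_end(config_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self._slots) >= self.capacity:
                self._slots.popitem(last=False)  # evict least recently used
            self._slots[config_id] = fetch_from_host(config_id)
        return self._slots[config_id]


cache = ConfigurationCache(capacity=2)
for cfg in ["A", "B", "A", "C", "B"]:
    cache.activate(cfg, fetch_from_host=lambda cid: f"bitstream-{cid}")
print(cache.hits, cache.misses)  # prints 1 4
```

The caller only names the configuration it wants; whether the bitstream came from the cache or from the host is hidden, which corresponds to the "transparent to the programmer" bullet.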
Runtime Hardware Control: Data-Driven RTR • Data-driven RTR requires extremely fast reconfiguration and virtually no overhead in the control of RTR • Little benefit to clock-cycle RTR (CSRC) if the control software takes longer • Must execute control of RTR near the chip • Need an abstraction for programmers to target
Runtime Hardware Control: Data-Driven RTR (continued) • Using a Finite State Machine (FSM) provides a suitable solution • The FSM monitors the data encountered, triggering changes in state • State change in the FSM reconfigures the chip from the configuration cache • FSM can execute in a small space (e.g., a fraction of a Xilinx 4085) local to the board • Interface familiar to most programmers
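The FSM abstraction can be sketched as a transition table plus a mapping from states to cached configurations. The class, the state names, and the packet-type symbols below are hypothetical, chosen only to illustrate the control flow; they do not come from the SLAAC or DRACS software.

```python
class ReconfigurationFSM:
    """FSM whose state changes select a configuration from an on-board cache."""

    def __init__(self, transitions, state_to_config, initial_state):
        self.transitions = transitions          # (state, symbol) -> next state
        self.state_to_config = state_to_config  # state -> cached configuration id
        self.state = initial_state
        self.loaded = state_to_config[initial_state]
        self.reconfigurations = 0

    def step(self, symbol):
        """Consume one data symbol; reconfigure only if the state changes."""
        next_state = self.transitions.get((self.state, symbol), self.state)
        if next_state != self.state:
            self.state = next_state
            self.loaded = self.state_to_config[next_state]
            self.reconfigurations += 1
        return self.loaded


# Hypothetical example: switch processing cores based on packet type.
fsm = ReconfigurationFSM(
    transitions={("idle", "tcp"): "tcp_mode", ("idle", "udp"): "udp_mode",
                 ("tcp_mode", "udp"): "udp_mode", ("udp_mode", "tcp"): "tcp_mode"},
    state_to_config={"idle": "cfg_idle", "tcp_mode": "cfg_tcp", "udp_mode": "cfg_udp"},
    initial_state="idle",
)
print([fsm.step(s) for s in ["tcp", "tcp", "udp", "tcp"]])
print(fsm.reconfigurations)  # prints 3
```

Repeated symbols that leave the state unchanged cost nothing, so reconfiguration happens only when the data stream actually calls for a different core.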
Application: DES Core (Xilinx) • The circuitry for DES computation can be significantly reduced if a specific key is “folded into” the circuitry • This reduction allows for a smaller, faster hardware realization of DES • Of course, a DES implementation that is specific to a single key isn’t useful unless it can be reconfigured…
DES Core (continued) • A DES core was implemented using JBits • A new core for each key is generated at runtime • Requires only changes to LUTs to configure for a new key • This implementation is faster than the current ASIC DES champion from Sandia • Technique being exploited for other encryption methods at Xilinx
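One way to picture "folding" a key into LUTs: in DES the round key is XORed with the data before the S-box lookup, so for a fixed key the XOR can be absorbed into the lookup table itself. The sketch below uses a toy 4-bit substitution table as a stand-in; it is not the Xilinx core, and `TOY_SBOX`, `generic_lookup`, and `fold_key` are hypothetical names.

```python
# Toy 4-bit substitution table (an arbitrary permutation used as a stand-in).
TOY_SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
            0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]


def generic_lookup(data, key):
    """Key-agnostic version: needs XOR logic in front of the table."""
    return TOY_SBOX[data ^ key]


def fold_key(key):
    """Build a key-specific table so that no separate XOR stage is needed.

    Generating a table for a new key changes only the table contents,
    which corresponds to rewriting LUTs in the bitstream.
    """
    return [TOY_SBOX[data ^ key] for data in range(16)]


folded = fold_key(0b1010)
print(folded[0b0011] == generic_lookup(0b0011, 0b1010))  # prints True
```

The folded table has the same size as the original, so the key-specific circuit drops the XOR stage entirely while keeping the lookup unchanged in structure.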
[Figure: EPIC View of 16 Rounds. Courtesy Cameron Patterson]
[Figure: Comparing Fully Unrolled and Pipelined Designs. Courtesy Cameron Patterson]
Application: Number Crunching (Virginia Tech) • Traditional “numerical-analysis” style computation has focused on the use of IEEE-compliant floating-point arithmetic on general purpose CPUs • Two trends are forcing a refocus • Intel (and others) do not focus design on this market • Embedded processing is becoming increasingly complex, requiring more “number-crunching”
Number Crunching (cont.) • Cannot do away with key features of IEEE-compliant arithmetic (too many algorithms depend on it) • Floating-point units, however, are large and expensive • Can customize hardware to provide performance in a reasonable package • Reconfiguration is key
Number Crunching (cont.) • Use constant floating-point multipliers • e.g., as coefficients in an FIR • These multipliers are smaller and faster than two-input multipliers • Analysis provides bounds on the size of IEEE-compliant implementations
Summary • Obstacles to practical RTR are being overcome • New hardware devices, experimental and commercial, are now available • New software is coming online to allow run-time bitstream generation • And now for some predictions…
RTR Predictions • Security of reconfigurable devices comes into question and changes are made to address this issue • APIs to commercial FPGA bitstreams become commonplace, allowing more widespread innovation in RTR software • RTR hardware becomes an essential aspect of SOC solutions which, by their nature, avoid the “scale by adding more hardware” aspect of PCs • RTR will proliferate in industries that need low-cost, low-power, small solutions (e.g., cellular phones)