180 likes | 270 Views
A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead. Heng Tan and Ronald F. DeMara University of Central Florida. Agenda. Introduction Previous work on reducing Partial Reconfiguration overhead Develop a multi-step physical area management strategy
E N D
A Physical Resource Management Approach toMinimizing FPGA Partial Reconfiguration Overhead Heng Tan and Ronald F. DeMaraUniversity of Central Florida
Agenda • Introduction • Previous work on reducing Partial Reconfiguration overhead • Develop a multi-step physical area management strategy • Experimental case studies: • 4 simple circuits: serial, parallel, block multiplier resource,high fanout carry chain LUT multiplier • 2 large circuits: SECDED, MD5 step function (new) • Conclusion and future work
Motivation For partial reconfiguration operations, especially those running on a SOC configuration, • storage space and • reconfiguration speed are crucial. Find way to optimize them . . .
Previous Work • Architecture-level [Ganesan2000] • Pipelining • overlap execution of one temporal partition with reconfiguration of another • Logical level [Raghuraman2005] • Minimization of frames • relating the number of frames that need to be downloaded into FPGAs to the number of minterms of a specially constructed logic function • Hardware level [Compton2002][Hauck1999] • Requires dedicated HW component • Compression or defragmentation performed externally at runtime • Physical-level resource management? • Not specifically addressed for partial reconfiguration modules in literature • But can be done … and … also cascaded with above techniques for multiplicative benefit
Module Based Flow Basic concept and capability for PR proposed by Xilinx • Including Fixed modules and PR modules • Using Bus Macro • Suitable for potential full automation • Primary PR technique to study
Column-Level Configuration Format • For Virtex II/Pro series • Including IOB, IOI, CLB, GCLK, BlockRAM, and BlockRAM Interconnect • Labeling with 32-bit addresses composed of a Block Address (BA), a Major Address (MJA), a Minor Address (MNA), and a byte number • Only contents of CLB column are studied in this research
Bitstream Compression • CLB columns program the configurable logic blocks, routing, and most interconnect resources. • Each CLB column contains 2 columns of slices. • 22 frames are utilized within the bitstream for each column of slices, describing the logic and routing information respectively. • Each frame occupies 424 bytes. • In bitstream of PR modules, a compression technique is already used by Xilinx to represent the unused CLB frame. • For unused CLB frames, only 10 bytes are used instead of 424 to describe empty contents.
Optimization Strategy • Primary goal: minimize the number of columns of slices utilized, including routing resources • Secondary goal: do not incur increase in propagation delay after area optimization • Physical area management procedure: • Region Allocation: define size and boundary • Pin Assignment: top/bottom edge preferred • Column Alignment: fill column of slices at a time • Choke-Point Elimination: resolve high fanout cases • Repeat
Case Study • The hardware platform is Xilinx Virtex II Pro VP7 device. • Module-based partial reconfiguration flow is adopted to generate the partial reconfiguration bitstream. • The Xilinx ISE 6.3 is used to support the module based flow. • The area constraints are entered directly into User Constrain File (.ucf)before map and routing. • Four representative small cases and two larger size cases studies are tested for the strategy. • Similar or identical external pin arrangement. • Hypothesis: larger circuits can achieve more savings
4-LUT Design Optimization (parallel) • Simple 4-LUT elements • Parallel logic path with direct input from external signals • LUTs feed outputs straight though to flip flops • Best strategy is by locating them in a single column close to the external pins
Shifter Optimization (serial) • All logic elements are cascaded in contiguous string of CLBs • Attributes of this serial circuit functionality will have best arrangement from input to output in single column serially
Block Multiplier Optimization • Block multiplier resource involved • Requires balancing the routing between two paths. • One path is between the block multiplier and the LUTs. The other path is from the LUTs to the external pins. • Decreased savings compared to use of LUT-only circuits
LUT Multiplier Optimization • High fan-out occurs because of the carry chains • Single column style is not optimal • Arrange related LUTs around each other using adjacent columns
Larger Case Studies • SECDED as full PR module and MD5 step functions were developed as PR module vs. SHA family • During the optimization process, not every slice has been specifically placed because of the large number of resources involved • Only the slices on the critical path are constrained. • These are comparatively larger modules, increased bitstream savings of 33% and 30% are achieved
Area Optimization Results • A physical resource area management strategy is proposed to minimize the reconfiguration overhead. • Experiments show that up to one-third of size reduction can be achieved for partial reconfiguration modules. • The maximum propagation delay has also been decreased slightly in most cases (not increased). • On the other hand, the larger the module is, the more complicated and time consuming the process becomes. • Providing autonomous area optimization capabilities is future work for integrating into our Multi-Layer Runtime Reconfiguration Architecture FGPA framework
Current Work:Direct Bitstream Management • Change one-bit full adder to a one-bit full subtracter • Both have three one-bit inputs and two one-bit outputs. • Both used 2 LUTs with identical logic interconnections between LUTs and I/O signals. • Only difference between them is actually only one truth table stored inside one LUT, changing from 0xE8 to 0x8E Combined MD5 / SHA-1 Step Function – Area Utilization Original AlgorithmModule Based Frame Based SHA-1 192(slices) 65(slices)32(slices) MD5 881(slices) 168(slices) 32(slices) Combined 1068(slices) 324(slices) 32(slices)