Floating-Point FPGA (FPFPGA) Architecture and Modeling (A paper review) Jason Luu ECE University of Toronto Oct 27, 2009
Motivation • Goal: Build faster, cheaper, lower-power FPGAs • How? Fixed-functionality (hard) blocks! • FPGA reconfigurability comes at the price of area, delay, and power • Some reconfigurability is unnecessary; removing it yields savings
What to Make Hard? • What hard blocks to use? • If not used, a hard block is wasted silicon • Industry suggests including memories and multipliers • Paper suggests adding floating-point units (FPU) • Given a hard block, how fractured should it be? • E.g., Stratix III FPGA multipliers can be configured as a set of four 18x18 multipliers or as one 36x36 multiplier • How fractured should the FPU be?
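To make the idea of a "fractured" multiplier concrete, here is a minimal arithmetic sketch of how one 36x36 product can be assembled from four 18x18 products. It assumes unsigned operands and is only an illustration of the decomposition; the function names are hypothetical and it does not describe the actual Stratix III circuit.

```python
MASK18 = (1 << 18) - 1  # low 18 bits


def mul18(a: int, b: int) -> int:
    """Stand-in for one 18x18 unsigned hard multiplier."""
    assert 0 <= a <= MASK18 and 0 <= b <= MASK18
    return a * b


def mul36(a: int, b: int) -> int:
    """36x36 unsigned multiply built from four 18x18 partial products."""
    assert 0 <= a < (1 << 36) and 0 <= b < (1 << 36)
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    return (
        (mul18(a_hi, b_hi) << 36)
        + ((mul18(a_hi, b_lo) + mul18(a_lo, b_hi)) << 18)
        + mul18(a_lo, b_lo)
    )
```

The same four multipliers can therefore serve either four independent narrow products or one wide product, which is the trade-off the slide asks about for the FPU.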
Introducing FPFPGA • Contains soft and hard blocks • Soft blocks are composed of standard LUTs, FFs • Hard blocks are FPUs called Coarse-grained units (CGU) • CGU characteristics: • Floating-point (FP) adds and multiplies only • Bus-based LUT operations using “wordblock” • Dedicated output registers • Accessible to soft blocks and vice-versa
CGU parameters • # of each type of FP block • Bus Width • Number of Input Buses • Number of Output Buses • Number of Feedback Paths
Modeling Methodology • Need to measure how “good” FPFPGA is • Use an empirical measurement method: • FPFPGA + Benchmark Circuit → Commercial CAD Flow → Measure Quality of Results (very nice!) • Problem: commercial tools are unaware of FPFPGA, so the authors introduce the “VEB” as a solution
Virtual Embedded Block (VEB) Flow • Manually map benchmark circuit into • CGU • Soft logic • Put VEB representing CGU into commercial CAD tool • Compile • Gather area and timing measurements
VEB • Create a standard-cell ASIC implementation of the CGU and get its area/timing numbers • Reproduce the area and timing of the ASIC CGU using soft logic of a commercial FPGA (different functionality, but similar silicon timing, area, and pin demand) • Assumes all internal paths == critical path to simplify timing of the soft-logic implementation
VEB Details • Model delay with carry-chains • Model area with shift registers • Use LUT inputs and outputs for pin demand • Note: Area and delay models use independent resources
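As a rough sketch of how the delay and area models above might be sized, the snippet below converts an ASIC delay into a carry-chain length and an ASIC area into a number of shift-register cells. Every name and parameter here (`size_veb`, `carry_stage_delay_ns`, `area_per_cell_um2`) is a hypothetical placeholder, not a value or procedure taken from the paper, and pin demand is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VebSize:
    carry_stages: int     # carry-chain stages that model the CGU delay
    shift_reg_cells: int  # shift-register cells that model the CGU area


def size_veb(asic_delay_ns: float, asic_area_um2: float,
             carry_stage_delay_ns: float, area_per_cell_um2: float) -> VebSize:
    """Size the two independent VEB models from ASIC numbers.

    carry_stage_delay_ns and area_per_cell_um2 are assumed, device-specific
    calibration constants supplied by the user.
    """
    if carry_stage_delay_ns <= 0 or area_per_cell_um2 <= 0:
        raise ValueError("calibration constants must be positive")
    stages = math.ceil(asic_delay_ns / carry_stage_delay_ns)
    cells = math.ceil(asic_area_um2 / area_per_cell_um2)
    return VebSize(carry_stages=stages, shift_reg_cells=cells)
```

Because the two models use independent resources, the delay model can be tuned without disturbing the area model, and vice versa.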
VEB Placement Challenge • Hard block locations are fixed on an FPGA • Commercial tools can’t enforce that for a VEB, since it is just a group of clustered soft logic constrained to be placed at particular relative distances from each other • Solution: • Let the commercial tools place the VEBs anywhere • Then manually move each VEB to a fixed location
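The slide describes the second step as manual. Purely as an illustration of what that step does, the sketch below snaps each freely placed VEB to the nearest unused fixed site using a greedy Manhattan-distance rule. The greedy rule, the function name, and the coordinates are assumptions for illustration, not the authors' procedure.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def snap_to_sites(placed: Dict[str, Point], sites: List[Point]) -> Dict[str, Point]:
    """Assign each VEB to the closest still-free fixed site (greedy)."""
    if len(placed) > len(sites):
        raise ValueError("not enough fixed sites for all VEBs")
    free = list(sites)
    result: Dict[str, Point] = {}
    for name, (x, y) in placed.items():
        # Manhattan distance from the tool's placement to each free site
        best = min(free, key=lambda s: abs(s[0] - x) + abs(s[1] - y))
        free.remove(best)
        result[name] = best
    return result


# Hypothetical example: two VEBs, four fixed hard-block sites
example = snap_to_sites({"veb0": (1, 1), "veb1": (9, 2)},
                        [(0, 0), (0, 8), (10, 0), (10, 8)])
```

After snapping, the design would need to be re-routed and re-timed, since moving a VEB changes the routing to the surrounding soft logic.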
VEB Quality • 11% delay error when modeling an embedded multiplier (a non-FP block, chosen so the VEB can be compared with an existing hard multiplier) • Area is reported as accurate (no number given) • Important repeatability hint: timing must be determined post-bitstream because of significant false paths (most CGUs do not use the longest path, and this is only detected post-bitstream)
Benchmarks • 32-bit single-precision floating-point • 8 benchmarks • 5 Core computation blocks • 1 application • 2 synthetic
Experimental Settings • Xilinx Virtex-II: XC2V3000-6-FF1152 • 16 CGUs, each implemented as a VEB • Each CGU takes up 122 logic cells • 2 FP multipliers, 2 FP adders, 5 wordblocks • In the order: W M A W W M A W W • 4 input buses • 3 output buses • 3 feedback registers
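The CGU parameters and the experimental settings above can be captured in a small record, which also makes it easy to check that the block order agrees with the stated block counts. This is a minimal sketch; the class and field names are hypothetical, and the bus width is left unset because the slide does not state it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CguConfig:
    block_order: str            # W = wordblock, M = FP multiplier, A = FP adder
    input_buses: int
    output_buses: int
    feedback_registers: int
    bus_width: Optional[int] = None  # not given on the slide

    def block_counts(self) -> Counter:
        """Count how many blocks of each type appear in the order string."""
        return Counter(self.block_order.split())


# Values taken from the Experimental Settings slide
EXPERIMENT_CGU = CguConfig("W M A W W M A W W", input_buses=4,
                           output_buses=3, feedback_registers=3)
NUM_CGUS = 16
LOGIC_CELLS_PER_CGU = 122

counts = EXPERIMENT_CGU.block_counts()
total_veb_logic_cells = NUM_CGUS * LOGIC_CELLS_PER_CGU
print(dict(counts), total_veb_logic_cells)
```

The order string yields 5 wordblocks, 2 multipliers, and 2 adders, matching the slide, and the 16 VEBs together occupy 16 × 122 = 1952 logic cells.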
Results • Average area reduced by 25x • Average delay reduced by • 3.6x for single precision • 4.3x for double precision • Results are comparable to Kuon FPGA vs ASIC measurements • Critical path of all circuits is in FPU
Reason for Good Results • Removed reconfiguration bits (area reduction) • Efficient directional routing • Embedded FP operators
Contributions • Exploration of FPGA architectures with embedded floating-point cores • VEB methodology to leverage commercial tools to explore new embedded hard blocks even when commercial tools are unaware of those new hard blocks
Weaknesses • Significant amounts of speculation • Claims scope for material that belongs in future work • Especially weak was the paper’s analysis of an FPFPGA compiler, which is out of scope and should be listed as such
My 2 Cents • Primary advantage of FPFPGA vs GPU in the floating-point high computation domain is low latency • Several applications demand very low latency and very high computational power • Plant monitoring of high-speed reactions • Financial automatic buy-sell algorithms • Secondary advantage is energy consumed to perform the same computations.
My 2 Cents • Comparison is unfair • Most FPGA designers would convert floating-point to fixed-point rather than leave it as floating-point • Double-precision FP add requires 701 slices • Fixed-point add: 64 LUTs == 16 slices • That the critical path is in the FPU suggests the benchmark circuits are unusually geared toward FPU cores, and the authors admit this
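The size of the gap behind this fairness argument follows directly from the two figures on the slide. The short calculation below uses only those numbers; the variable names are illustrative.

```python
FP_DOUBLE_ADD_SLICES = 701  # double-precision FP add, from the slide
FIXED_ADD_LUTS = 64         # fixed-point add, from the slide
FIXED_ADD_SLICES = 16       # same adder expressed in slices, from the slide

# Implied packing on the slide: LUTs per slice
luts_per_slice = FIXED_ADD_LUTS // FIXED_ADD_SLICES

# How many times more soft-logic area the FP adder needs
area_ratio = FP_DOUBLE_ADD_SLICES / FIXED_ADD_SLICES
print(f"{luts_per_slice} LUTs/slice, FP add is about {area_ratio:.1f}x larger")
```

A soft-logic floating-point adder is roughly 44 times larger than the fixed-point adder a designer might use instead, so a baseline built on soft floating-point makes any hard FPU look disproportionately good.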