
Design Space Exploration for Field-Programmable Compressor Trees

Seyed Hosein Attarzadeh Niaki (1), Alessandro Cevrero (2), Philip Brisk (3), Chrysostomos Nicopoulos (3), Frank K. Gurkaynak (4). (1) Royal Institute of Technology, School of Information and Communication Technology, Stockholm, Sweden.


Presentation Transcript


  1. Design Space Exploration for Field-Programmable Compressor Trees
     Seyed Hosein Attarzadeh Niaki (1), Alessandro Cevrero (2), Philip Brisk (3), Chrysostomos Nicopoulos (3), Frank K. Gurkaynak (4), Yusuf Leblebici (2), Paolo Ienne (3)
     (1) Royal Institute of Technology, School of Information and Communication Technology, Stockholm, Sweden
     (2) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Engineering, Lausanne, Switzerland
     (3) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland
     (4) Swiss Federal Institute of Technology, Zurich, Microelectronics Design Center, Zurich, Switzerland

  2. Project Overview
     • Goal
       • Accelerate multi-input addition on FPGAs
         • H.264 motion estimation
         • 3G wireless base station channel cards
         • FIR filters
         • Exposed via systematic dataflow transformations [Verma et al., TCAD 2008]
     • Field Programmable Compressor Tree (FPCT) [Cevrero et al., FPGA 2008]
       • More flexibility than DSP blocks
         • Can benefit from dataflow transformations
         • DSP blocks cannot
       • Better performance than LUT-based logic

  3. Dataflow Transformations: Example
     [Figure: ADPCM example computing vpdiff. The original dataflow graph (steps 0-3, built from >>, &, =, SEL, and + operations on delta and step) is shown next to the transformed graph, in which the additions are merged into a single compressor tree (∑) followed by one adder.]

  4. Contribution
     • Design space exploration
       • Tune the design of an FPCT to match the needs of a representative set of arithmetically intensive benchmarks

  5. Outline
     • Arithmetic Tutorial: Compressor Trees
     • Field Programmable Compressor Tree
     • Design Space Exploration
     • Results
     • Conclusion

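  6. Compressor Trees
     [Figure: three uses of a compressor tree, each producing a sum (S) and a carry (C) output that a carry-propagate adder (CPA) combines into the final result.]
     • Multi-input adder
     • Parallel multiplier: a partial product generator (PPG) produces mn bits, which the compressor tree and CPA reduce to an (m+n)-bit product
     • Multiply-accumulate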
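The split between compressor tree and CPA can be illustrated with a small bit-parallel sketch. It repeatedly applies a 3:2 (carry-save) step to whole operands until only a sum vector S and a carry vector C remain, and then performs a single carry-propagate addition. The function names are illustrative and do not come from the presentation; operands are assumed to be non-negative integers.

```python
def compress_3_2(a, b, c):
    """One carry-save step: three operands in, (sum, carry) vectors out."""
    s = a ^ b ^ c                                # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved up one rank
    return s, carry


def compressor_tree_sum(operands):
    """Reduce many operands to two with 3:2 steps, then add them once."""
    ops = list(operands)
    while len(ops) > 2:
        s, carry = compress_3_2(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, carry]
    return sum(ops)                              # stands in for the final CPA
```

Only the last line propagates carries; every earlier step works on each bit position independently, which is why a compressor tree can be faster than a tree of ordinary adders.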

  7. Arithmetic on FPGAs
     • DSP blocks
       • Fixed-bitwidth multiply/MAC
         • FPGA logic can be faster when there are bitwidth mismatches [Kuon and Rose, TCAD 2007]
       • Cannot bypass the PPG
         • No multi-input addition
         • Cannot exploit dataflow transformations that expose large compressor trees [Verma et al., TCAD 2008]
     • FPGA logic
       • 3-ary addition
         • LUTs + carry chains
         • Altera Stratix II-IV, Xilinx Virtex-5
       • Compressor tree synthesis [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
         • Faster than 3-ary adder trees
         • Does not use carry chains

  8. Field Programmable Compressor Tree
     • Programmable core integrated into an FPGA
     • Supports multi-input addition
       • Unlike DSP blocks
       • Can exploit dataflow transformations [Verma et al., TCAD 2008]
     • Programmable to match the input operands
       • More flexible than a DSP block
     • Multiplication/MAC
       • FPGA logic generates the partial products

  9. Parallel Counters and Generalized Parallel Counters (GPCs)
     • m:n counter
       • Counts the number of input bits set to 1
       • m input bits
       • n = ⌈log2(m+1)⌉ output bits
       • Example: 6:3 counter, with 6 input bits of rank i and 3 output bits of ranks i, i+1, i+2
     • GPC
       • Input bits may have different ranks
       • Example: (2, 5, 6; 5) GPC, with 6 input bits of rank i, 5 input bits of rank i+1, 2 input bits of rank i+2, and 5 output bits of ranks i, …, i+4
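The output widths quoted on this slide can be checked with a short sketch. It computes the number of output bits of an m:n counter and of a GPC from the number of input bits at each rank, and evaluates a GPC as a weighted bit count. The function names and the list-based representation (index k holds the inputs of rank i+k) are assumptions made for illustration.

```python
def counter_output_bits(m):
    """Output width of an m:n counter: n = ceil(log2(m + 1))."""
    return m.bit_length()            # equals ceil(log2(m + 1)) for m >= 0


def gpc_output_bits(inputs_per_rank):
    """Output width of a GPC; inputs_per_rank[k] = number of inputs of rank i+k."""
    max_sum = sum(n << k for k, n in enumerate(inputs_per_rank))
    return max_sum.bit_length()


def gpc_eval(bits_per_rank):
    """Weighted count; bits_per_rank[k] is the list of 0/1 inputs of rank i+k."""
    return sum(sum(bits) << k for k, bits in enumerate(bits_per_rank))
```

For the (2, 5, 6; 5) GPC the largest possible sum is 6·1 + 5·2 + 2·4 = 24, which needs 5 output bits, matching the slide. An m:n counter is the special case in which all inputs have rank i.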

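  10. FPCT Motivation (1/2)
      [Figure: a chain of counters feeding a CPA: 15 bits → 15:4 counter → 4 bits → 4:3 counter → 3 bits → 3:2 counter → 2 bits → CPA.]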

  11. FPCT Motivation (2/2)
      [Figure: four side-by-side columns, each with a 15:4, a 4:3, and a 3:2 counter, all feeding one carry-propagate adder (CPA).]

  12. Compressor Slice (CSlice) Architecture
      [Figure: CSlice datapath, top to bottom: 16 inputs → Input Configuration Circuit → GPC Configuration Circuit → 31 bits → 31:5 counter → 5:3 counters → 3:2 counters → CPA → Register.]
      • The Input Configuration Circuit can independently drive each input bit to 0
      • The 31:5 counter can implement a variety of 16-input, 5-output GPCs
      • The CSlice can be configured to produce multiple output bits
      • Driving all carry-in bits to zero breaks the carry chain
      • Either the carry-save outputs or the output of the final CPA can be chosen
      • The carry-save or CPA output is stored in a bypassable register
      • Depending on the configuration, different carry-out bits are propagated to the next CSlice

  13. FPCT Results
      [Figure: delay (ns) of multiplier-based benchmarks and multi-input addition benchmarks, with no transformations and transformed [Verma et al., TCAD 2008], for three implementations: 3-ary adder tree, GPC mapping, and FPCT.]
      • No transformations: use DSP blocks for multiplication
      • Transformed: no multipliers, but benefits from transformations

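  14. CSlice Design Space
      [Figure: the CSlice datapath (16 inputs, Input Configuration Circuit, GPC Configuration Circuit, 31:5 counter, 5:3 counters, 3:2 counters, CPA, Register) annotated with the three explored parameters.]
      1. GPC Configuration Circuit / Input Configuration Circuit (GPCCC/ICC): {enumerate}
      2. First counter size (FCS): {15:4, 31:5}
      3. Max. Output Rank Config. (MORC): {1, 2, 3}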

  15. GPC Configuration Circuit
      • Inputs can be configured as rank-0 or rank-1
      [Figure: GPC configuration circuit in front of a 15:4 counter, with labels for the rank-0 inputs and the configuration bit.]
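One way to picture a per-input rank configuration is sketched below. The slide does not describe the circuit internals, so this rests on an assumption: a rank-1 input is realized by wiring the same bit to two inputs of the first counter, so that it contributes weight 2 to the count. The function name, the parameter names, and the default counter size are hypothetical.

```python
def gpc_config_count(input_bits, rank1_config, counter_inputs=15):
    """Count produced by a counter whose inputs are configured as rank-0 or rank-1.

    Assumption (not taken from the slides): a rank-1 input occupies two
    counter inputs, a rank-0 input occupies one.
    """
    if len(input_bits) != len(rank1_config):
        raise ValueError("one configuration bit is needed per input")
    wires = []
    for bit, is_rank1 in zip(input_bits, rank1_config):
        wires.extend([bit, bit] if is_rank1 else [bit])
    if len(wires) > counter_inputs:
        raise ValueError("configuration needs more counter inputs than available")
    wires += [0] * (counter_inputs - len(wires))   # unused inputs driven to zero
    return sum(wires)
```

Under this assumption, 16 inputs with 15 of them configured as rank-1 would occupy exactly 31 counter inputs, which is consistent with the 16-input, 31:5 arrangement shown for the CSlice, although the presentation does not state that this is how the circuit is built.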

  16. Benchmark Circuits
      • Always generate a sufficient number of CSlices for each benchmark

      Benchmark   Description                    FCSs Mapped
      mul5x5      5x5-bit multiplication         31:5, 15:4
      mul18x18    18x18-bit multiplication       31:5
      mul36x18    36x18-bit multiplication       31:5
      add8x32     Add 8 32-bit integers          31:5, 15:4
      add16x16    Add 16 16-bit integers         31:5
      FIR         3-tap FIR filter               31:5
      SAD         Sum-of-absolute-differences    31:5, 15:4

  17. Mapping
      [Flowchart]
      • Inputs: FCS, MORC, input bit pattern
      • All GPCCCs enumerated?
        • Yes: done
        • No: enumerate next GPCCC → generate CSlice HDL → map input bit pattern → synthesize mapped FPCT → delay/area, then repeat
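The loop in this flowchart can be sketched as a small driver. The three callables stand in for the HDL generator, the mapper, and the synthesis tool, none of which are described in the presentation; their names and signatures are hypothetical placeholders.

```python
def explore_gpcccs(fcs, morc, bit_pattern, gpcccs,
                   generate_hdl, map_pattern, synthesize):
    """Evaluate every GPCCC for a fixed FCS, MORC, and input bit pattern.

    generate_hdl, map_pattern, and synthesize are caller-supplied stand-ins
    for the real tools; synthesize must return a (delay, area) pair.
    """
    results = []
    for gpccc in gpcccs:                          # "Enumerate next GPCCC"
        hdl = generate_hdl(fcs, morc, gpccc)      # "Generate CSlice HDL"
        mapped = map_pattern(hdl, bit_pattern)    # "Map input bit pattern"
        delay, area = synthesize(mapped)          # "Synthesize mapped FPCT"
        results.append((gpccc, delay, area))      # "Delay/Area"
    return results                                # all GPCCCs enumerated: done
```

Passing the tools in as arguments keeps the sketch runnable with simple stubs and makes no claim about how the authors' flow was actually scripted.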

  18. Delay Results (31:5)
      [Figure: average delay (ns) for FCS = 31:5, plotted per MORC and GPC configuration circuit, with the best and worst points marked.]

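  19. Area Results (31:5)
      [Figure: average area (m2) for FCS = 31:5, plotted per MORC and GPC configuration circuit, with the best and worst points marked and the points labeled with their delay ranking: 1, 4, 2, 3, 6, 5, 7.]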

  20. Utilization
      • Input utilization (Uin)
        • Fraction of first-counter inputs used
        • Unused inputs are driven to zero
      • Output utilization (Uout)
        • Fraction of CSlice outputs used if MORC > 1
      • I/O utilization (U = Uin × Uout)
        • Acceptable due to the correlation between Uin and Uout
      • Prune the search space with utilization
        • Only synthesize FPCTs for which utilization is high
        • Reduces the cost of searching the entire space
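The utilization metric and the pruning step can be written down as a minimal sketch. It reads U as the product of Uin and Uout and keeps only the highest-utilization candidates for synthesis. The function names, the tuple layout, and the `keep` parameter are assumptions for illustration; the slides do not give a threshold or a selection rule beyond "utilization is high".

```python
def io_utilization(used_inputs, total_inputs, used_outputs, total_outputs):
    """U = Uin * Uout, each a fraction between 0 and 1."""
    u_in = used_inputs / total_inputs        # fraction of first-counter inputs used
    u_out = used_outputs / total_outputs     # fraction of CSlice outputs used
    return u_in * u_out


def prune_by_utilization(candidates, keep):
    """Keep the `keep` candidates with the highest utilization.

    candidates: list of (name, utilization) pairs.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:keep]
```

Only the candidates that survive the pruning would then be passed on to synthesis, which is the expensive step.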

  21. Correlation Between Input/Output Utilization
      [Figure: mul36x18 I/O utilization per GPC configuration circuit, with series Uin and Uout for MORC = 1, 2, and 3.]

  22. I/O Utilization Generally Finds the Best Data Points per Benchmark
      [Figure: mul36x18 design space, delay (ns) vs. area (m2), with points grouped by MORC = 1, 2, and 3. Annotation: 4 points with maximum utilization for MORC = 2 and 3 respectively.]

  23. Conclusion
      • FPCT
        • Programmable compressor tree integrated into an FPGA for improved arithmetic performance
      • FPCT design space exploration
        • Tune the FPCT CSlice architecture to a set of benchmarks
        • Prune the design space with utilization
      • Two Pareto-optimal design points found
        1. Best average delay, near the middle in terms of average area
        2. Six virtually indistinguishable points: 2nd-7th best average delays, 1st-6th best average areas
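For readers unfamiliar with the term, a Pareto-optimal design point is one that no other point beats on both delay and area at once. The sketch below filters a list of points accordingly; the function name and tuple layout are illustrative, and any numbers fed to it in an example would be made up rather than taken from the presentation's results.

```python
def pareto_front(points):
    """Return the names of the non-dominated points.

    points: list of (name, delay, area); lower is better for both metrics.
    A point is dominated if another point is no worse in both metrics and
    strictly better in at least one.
    """
    front = []
    for name, delay, area in points:
        dominated = any(
            d2 <= delay and a2 <= area and (d2 < delay or a2 < area)
            for _, d2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front
```

With three hypothetical points A (delay 1.0, area 5.0), B (2.0, 3.0), and C (2.5, 4.0), point C is dominated by B, so the front consists of A and B.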

  24. References
      • Cevrero, A., et al. Architectural improvements for field programmable counter arrays: enabling efficient synthesis of compressor trees on FPGAs. FPGA, February 2008, pp. 181-190.
      • Kuon, I., and Rose, J. Measuring the gap between FPGAs and ASICs. IEEE TCAD, February 2007, pp. 203-215.
      • Parandeh-Afshar, H., Brisk, P., and Ienne, P. Efficient synthesis of compressor trees on FPGAs. ASPDAC, January 2008, pp. 138-143.
      • Parandeh-Afshar, H., Brisk, P., and Ienne, P. Improving synthesis of compressor trees on FPGAs via integer linear programming. DATE, April 2008, pp. 1256-1261.
      • Verma, A. K., Brisk, P., and Ienne, P. Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits. IEEE TCAD, October 2008, pp. 1761-1774.
