Design Space Exploration for Field-Programmable Compressor Trees

Seyed Hosein Attarzadeh Niaki (1), Alessandro Cevrero (2), Philip Brisk (3), Chrysostomos Nicopoulos (3), Frank K. Gurkaynak (4), Yusuf Leblebici (2), Paolo Ienne (3)

(1) Royal Institute of Technology, School of Information and Communication Technology, Stockholm, Sweden
(2) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Engineering, Lausanne, Switzerland
(3) Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland
(4) Swiss Federal Institute of Technology, Zurich, Microelectronics Design Center, Zurich, Switzerland

Project Overview
• Goal
  • Accelerate multi-input addition on FPGAs
    • H.264 motion estimation
    • 3G wireless base station channel cards
    • FIR filters
    • Exposed via systematic dataflow transformations [Verma et al., TCAD 2008]
• Field Programmable Compressor Tree (FPCT) [Cevrero et al., FPGA 2008]
  • More flexibility than DSP blocks
    • Can benefit from dataflow transformations
    • DSP blocks cannot
  • Better performance than LUT-based logic

Dataflow Transformations: Example
[Figure: ADPCM dataflow graph computing vpdiff from delta and step (steps 0 to 3) with shift (>>), AND (&), comparison (=), SEL, and add (+) operations; in the transformed graph the additions are merged into a single compressor tree (∑) followed by a final adder.]

Contribution
• Design Space Exploration
  • Tune the design of an FPCT to match the needs of a representative set of arithmetically intensive benchmarks

Outline
• Arithmetic Tutorial: Compressor Trees
• Field Programmable Compressor Tree
• Design Space Exploration
• Results
• Conclusion

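Compressor Trees
[Figure: three uses of a compressor tree followed by a carry-propagate adder (CPA). Panels: Multi-input Adder, Parallel Multiplier, Multiply-Accumulate. Blocks: Partial Product Generator (PPG), Compressor Tree with sum (S) and carry (C) outputs, CPA. Bit-width labels: m, n, m+n bits.]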
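As a rough illustration of the multi-input adder case, the sketch below reduces a list of integers with word-level 3:2 compressors (carry-save adders) until only two words remain, then performs a single carry-propagate addition. The names `csa` and `compressor_tree_sum` are invented for this sketch, and it models non-negative Python integers of unbounded width, not any particular hardware.

```python
def csa(a, b, c):
    """Word-level 3:2 compressor: returns (s, cy) with a + b + c == s + cy."""
    s = a ^ b ^ c                                # per-bit sum, carries ignored
    cy = ((a & b) | (a & c) | (b & c)) << 1      # per-bit majority, moved up one rank
    return s, cy


def compressor_tree_sum(operands):
    """Sum non-negative integers with layers of 3:2 compressors and one final CPA."""
    rows = list(operands)
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3         # rows that form complete triples
        nxt = []
        for i in range(0, full, 3):
            s, cy = csa(rows[i], rows[i + 1], rows[i + 2])
            nxt += [s, cy]
        nxt += rows[full:]                       # leftover rows pass to the next layer
        rows = nxt
    return sum(rows)                             # the only carry-propagate addition


total = compressor_tree_sum([3, 5, 7, 11, 13, 17, 19, 23])
```

No carry ripples across bit positions inside the loop; carry propagation is paid once, at the end.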

Arithmetic on FPGAs
• DSP blocks
  • Fixed-bitwidth multiply/MAC
    • FPGA logic can be faster when there are bitwidth mismatches [Kuon and Rose, TCAD 2007]
  • Cannot bypass PPG
    • No multi-input addition
    • Cannot exploit dataflow transformations that expose large compressor trees [Verma et al., TCAD 2008]
• FPGA logic
  • 3-ary addition
    • LUTs + carry chains
    • Altera Stratix II-IV, Xilinx Virtex-5
  • Compressor tree synthesis [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
    • Faster than 3-ary adder trees
    • Does not use carry chains

Field Programmable Compressor Tree
• Programmable core integrated into an FPGA
• Supports multi-input addition
  • Unlike DSP blocks
  • Can exploit dataflow transformations [Verma et al., TCAD 2008]
• Programmable to match the input operands
  • More flexible than DSP block
• Multiplication/MAC
  • FPGA logic generates partial products

Parallel Counters and Generalized Parallel Counters (GPCs)
• m:n counter
  • Counts the number of input bits set to 1
  • m input bits
  • n = ⌈log2(m+1)⌉ output bits
  • Example: 6:3 counter, with 6 input bits of rank i and 3 output bits of rank i, i+1, i+2
• GPC
  • Input bits may have different ranks
  • Example: (2, 5, 6; 5) GPC, with 6 input bits of rank i, 5 input bits of rank i+1, 2 input bits of rank i+2, and 5 output bits of rank i, …, i+4
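The counter and GPC definitions can be checked with a small behavioural model. This is a sketch only: `counter_outputs` and `gpc` are names made up here, and ranks are given relative to rank i, lowest rank first (the reverse of the order used in the (2, 5, 6; 5) notation).

```python
def counter_outputs(m):
    """Output bits of an m:n counter, n = ceil(log2(m + 1)).

    For m >= 1 this equals the bit length of m, which avoids floating point.
    """
    return m.bit_length()


def gpc(bits_by_rank):
    """Behavioural GPC model.

    bits_by_rank[r] is the list of 0/1 input bits of rank i + r.
    Returns (weighted count, number of output bits needed).
    """
    value = sum(sum(bits) << r for r, bits in enumerate(bits_by_rank))
    max_value = sum(len(bits) << r for r, bits in enumerate(bits_by_rank))
    return value, max_value.bit_length()


# (2, 5, 6; 5) GPC with every input set to 1: 6*1 + 5*2 + 2*4 = 24, 5 output bits.
value, n_out = gpc([[1] * 6, [1] * 5, [1] * 2])
```

An m:n counter is the special case in which all input bits share one rank, for example `gpc([[1] * 6])` for the 6:3 counter.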

FPCT Motivation (1/2)
[Figure: counter chain 15 → 15:4 → 4 → 4:3 → 3 → 3:2 → 2 → CPA.]

FPCT Motivation (2/2)
[Figure: rows of 15:4, 4:3, and 3:2 counters (four of each shown) feeding a Carry Propagate Adder (CPA).]
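The 15:4, 4:3, 3:2, CPA chain can be simulated column by column. The sketch below assumes at most 15 non-negative operands of `width` bits and applies one counter per column at each stage; all function names are invented for illustration and nothing here reflects the actual FPCT wiring.

```python
def to_columns(operands, width):
    """columns[r] lists the rank-r bit of every operand."""
    return [[(x >> r) & 1 for x in operands] for r in range(width)]


def compress(columns, m):
    """Apply m:n counters to each column; a counter on rank r drives ranks r .. r+n-1."""
    n = m.bit_length()                            # n = ceil(log2(m + 1))
    out = [[] for _ in range(len(columns) + n)]
    for r, col in enumerate(columns):
        for k in range(0, len(col), m):           # one counter per group of up to m bits
            count = sum(col[k:k + m])
            for j in range(n):
                out[r + j].append((count >> j) & 1)
    return out


def column_value(columns):
    """Weighted sum of all bits; stands in for the final carry-propagate adder."""
    return sum(sum(col) << r for r, col in enumerate(columns))


def add_with_counter_chain(operands, width):
    assert len(operands) <= 15, "sketch assumes one 15:4 counter per column"
    cols = to_columns(operands, width)
    for m in (15, 4, 3):                          # 15:4, then 4:3, then 3:2
        cols = compress(cols, m)
    assert max(len(col) for col in cols) <= 2     # two rows remain for the CPA
    return column_value(cols)


ops = [200, 13, 255, 0, 77, 91, 128, 64, 33, 254, 1, 2, 3, 4, 5]
result = add_with_counter_chain(ops, 8)
```

Each stage preserves the weighted sum of the bits while shrinking the tallest column from 15 to 4, then 3, then 2 bits.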

Compressor Slice (CSlice) Architecture
[Figure: CSlice datapath: 16 inputs → Input Configuration Circuit → GPC Configuration Circuit → 31 bits → 31:5 counter → 5:3 counters → 3:2 counters → CPA → Register.]
• Input Configuration Circuit: independently drive each input bit to 0
• GPC Configuration Circuit: the 31:5 counter can implement a variety of 16-input, 5-output GPCs
• The CSlice can be configured to produce multiple output bits
• Drive all carry-in bits to zero to break the carry chain
• Choose the carry-save outputs or the output of the final CPA
• Store the carry-save or CPA output in a bypassable register
• Depending on the configuration, different carry-out bits are propagated to the next CSlice

FPCT Results
[Figure: delay (ns) of 3-ary adder tree, GPC mapping, and FPCT, without transformations and with the transformations of [Verma et al., TCAD 2008], for multiplier-based benchmarks and multi-input addition benchmarks. Annotations: "Use DSP blocks for multiplication"; "No multipliers, but benefits from transformations".]

CSlice Design Space
1. GPCCC/ICC: {enumerate}
2. First counter size (FCS): {15:4, 31:5}
3. Max. Output Rank Config. (MORC): {1, 2, 3}
[Figure: CSlice datapath: 16 inputs → Input Configuration Circuit → GPC Configuration Circuit → 31 bits → 31:5 counter → 5:3 counters → 3:2 counters → CPA → Register.]

GPC Configuration Circuit
[Figure: GPC Configuration Circuit in front of a 15:4 counter. Inputs can be configured as rank-0 or rank-1 by a configuration bit. Labels: Rank-0 inputs, Configuration Bit, 15:4 Counter.]

Benchmark Circuits
• Always generate a sufficient number of CSlices for each benchmark

Benchmark   Description                    FCSs Mapped
mul5x5      5x5-bit multiplication         31:5, 15:4
mul18x18    18x18-bit multiplication       31:5
mul36x18    36x18-bit multiplication       31:5
add8x32     Add 8 32-bit integers          31:5, 15:4
add16x16    Add 16 16-bit integers         31:5
FIR         3-tap FIR filter               31:5
SAD         Sum-of-absolute-differences    31:5, 15:4

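[Flowchart: inputs are FCS, MORC, and the input bit pattern. While not all GPCCCs have been enumerated: enumerate next GPCCC → generate CSlice HDL → map input bit pattern (produces the mapping) → synthesize mapped FPCT (produces delay/area). When all GPCCCs have been enumerated: done.]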
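The exploration flow amounts to a nested enumeration. The sketch below shows only the loop structure: `explore`, `toy_evaluate`, and the GPCCC names are hypothetical, and `toy_evaluate` is a stand-in for the HDL generation, mapping, and synthesis steps that returns made-up numbers, not results from this work.

```python
from itertools import product

FCS_OPTIONS = ("15:4", "31:5")    # first counter sizes in the design space
MORC_OPTIONS = (1, 2, 3)          # maximum output rank configurations


def explore(gpccc_options, evaluate):
    """Evaluate every (FCS, MORC, GPCCC) combination and collect delay and area."""
    results = []
    for fcs, morc in product(FCS_OPTIONS, MORC_OPTIONS):
        for gpccc in gpccc_options:           # "enumerate next GPCCC"
            delay, area = evaluate(fcs, morc, gpccc)
            results.append(
                {"fcs": fcs, "morc": morc, "gpccc": gpccc,
                 "delay": delay, "area": area}
            )
    return results


def toy_evaluate(fcs, morc, gpccc):
    """Placeholder cost model with made-up numbers; NOT measured delay or area."""
    m = int(fcs.split(":")[0])
    return float(m + morc), float(m * morc + len(gpccc))


results = explore(["gpccc_a", "gpccc_b"], toy_evaluate)
best_delay = min(results, key=lambda r: r["delay"])
```

In a real flow, `evaluate` would wrap the HDL generator, the mapper, and the synthesis tool; passing it as a parameter keeps the enumeration independent of those tools.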

Delay Results (31:5)
[Figure: average delay (ns) for FCS = 31:5, plotted per MORC and GPC Config. Circuit combination; best and worst points marked.]

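Area Results (31:5)
[Figure: average area (m2) for FCS = 31:5, plotted per MORC and GPC Config. Circuit combination; best and worst points marked. Delay ranking of the plotted points, in plot order: 1, 4, 2, 3, 6, 5, 7.]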

Utilization
• Input Utilization (Uin)
  • Fraction of first counter inputs used
  • Unused inputs driven to zero
• Output Utilization (Uout)
  • Fraction of CSlice outputs used if MORC > 1
• I/O Utilization (U = Uin × Uout)
  • Acceptable due to correlation between Uin and Uout
• Prune the search space with utilization
  • Only synthesize FPCTs for which utilization is high
  • Reduce cost of searching entire space
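The utilization metric and the pruning step can be sketched as follows. The function names, the candidate record layout, and the threshold value are assumptions made for illustration; the slides do not state what counts as "high" utilization.

```python
def io_utilization(used_inputs, total_inputs, used_outputs, total_outputs):
    """U = Uin * Uout, each factor being a used/available fraction."""
    u_in = used_inputs / total_inputs
    u_out = used_outputs / total_outputs
    return u_in * u_out


def prune(candidates, threshold):
    """Keep only candidates whose I/O utilization reaches the threshold.

    Each candidate is a dict with hypothetical keys:
    used_in, total_in, used_out, total_out.
    """
    kept = []
    for c in candidates:
        u = io_utilization(c["used_in"], c["total_in"], c["used_out"], c["total_out"])
        if u >= threshold:
            kept.append(c)
    return kept


candidates = [
    {"name": "cfg0", "used_in": 15, "total_in": 15, "used_out": 2, "total_out": 4},
    {"name": "cfg1", "used_in": 30, "total_in": 31, "used_out": 5, "total_out": 5},
]
survivors = prune(candidates, threshold=0.9)      # threshold chosen arbitrarily
```

Only the surviving configurations would then be synthesized, which is where the search cost is saved.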

Correlation Between Input/Output Utilization
[Figure: mul36x18; I/O utilization per GPC Config. Circuit. Series: Uin and Uout for MORC = 1, 2, and 3.]

I/O Utilization Generally Finds the Best Data Points per Benchmark
[Figure: mul36x18 design space, delay (ns) vs. area (m2), with points for MORC = 1, 2, and 3. Annotation: 4 points with maximum utilization for MORC = 2 and 3 respectively.]

Conclusion
• FPCT
  • Programmable compressor tree integrated into an FPGA for improved arithmetic performance
• FPCT design space exploration
  • Tune FPCT CSlice architecture to a set of benchmarks
  • Prune the design space with utilization
• Two Pareto-optimal design points found
  1. Best average delay, near the middle in terms of average area
  2. Six virtually indistinguishable points
    • 2nd to 7th best average delays, 1st to 6th best average area
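For reference, Pareto-optimal points in a delay/area plane can be picked out as in the sketch below. The function name and the sample points are illustrative only; the numbers are not taken from the results above.

```python
def pareto_front(points):
    """Return names of points not dominated in (delay, area); lower is better.

    points: list of (name, delay, area) tuples.
    A point is dominated if another point is no worse on both axes
    and strictly better on at least one.
    """
    front = []
    for name, d, a in points:
        dominated = any(
            d2 <= d and a2 <= a and (d2 < d or a2 < a)
            for _, d2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative values only.
sample = [("A", 1.0, 5.0), ("B", 2.0, 3.0), ("C", 2.5, 3.5), ("D", 3.0, 1.0)]
front = pareto_front(sample)
```

Here point C is dropped because B is better on both axes; A, B, and D each trade delay against area and therefore remain.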

References
Cevrero, A., et al. Architectural improvements for field programmable counter arrays: enabling efficient synthesis of compressor trees on FPGAs. FPGA, February 2008, pp. 181-190.
Kuon, I., and Rose, J. Measuring the gap between FPGAs and ASICs. IEEE TCAD, February 2007, pp. 203-215.
Parandeh-Afshar, H., Brisk, P., and Ienne, P. Efficient synthesis of compressor trees on FPGAs. ASPDAC, January 2008, pp. 138-143.
Parandeh-Afshar, H., Brisk, P., and Ienne, P. Improving synthesis of compressor trees on FPGAs via integer linear programming. DATE, April 2008, pp. 1256-1261.
Verma, A. K., Brisk, P., and Ienne, P. Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits. IEEE TCAD, October 2008, pp. 1761-1774.
