A Flexible DSP Block to Enhance FPGA Arithmetic Performance. Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulous, Philip Brisk, Yusuf Leblebici, Paolo Ienne. Affiliations, in author order: LAP EPFL; LSM, LAP EPFL; LSM, LAP EPFL; UCR; LSM EPFL; LAP EPFL.
A Flexible DSP Block to Enhance FPGA Arithmetic Performance • Hadi Parandeh-Afshar (LAP EPFL) • Alessandro Cevrero (LSM, LAP EPFL) • Panagiotis Athanasopoulous (LSM, LAP EPFL) • Philip Brisk (UCR) • Yusuf Leblebici (LSM EPFL) • Paolo Ienne (LAP EPFL) • École Polytechnique Fédérale de Lausanne (EPFL): {first_name.last_name@epfl.ch} • University of California Riverside (UCR): first_name@cs.ucr.edu
Motivation and contribution • New DSP block for high-performance FPGAs • Increased flexibility: bypassable PPG, programmable compressor tree • Enhance FPGA arithmetic performance
Motivation and contribution • Data-flow transformations automatically expose compressor trees [Verma et al., TCAD 08] • DSP blocks cannot accelerate multi-operand addition • Fused multiply-add operations cannot use current DSP blocks in a single cycle [Figure: arithmetic transformations of a data-flow graph, panels (a) and (b)]
Outline • Related work • Limitations • DSP Block Architecture • Experimental methodology • Results • Conclusions
FPGA commentary • Logic cells with dedicated addition circuitry and fast carry chains • Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09] • IP cores [Xilinx, Altera] • FP cores [Beauchamp et al., TVLSI 08] • DSP blocks [Altera Stratix III-IV] [Figure: eight inputs labeled 9 feeding a Σ block]
Field Programmable Compressor Tree (FPCT) • User-configurable multi-operand adder • Compressor tree + bypassable CPA [Cevrero et al., FPGA 08, TRETS 09] • 128 = 8 × 16 input bits • 48 = 8 × 6 output bits [Figure: chain of CSlices linked by carry-in and carry-out, with inter-slice connections labeled 15]
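The "compressor tree + CPA" structure above can be illustrated in software. This is a minimal sketch, assuming plain 3:2 compressors (carry-save adders) and non-negative operands; the actual FPCT is built from CSlices whose internal counters are not described in these slides, and the function names are hypothetical.

```python
def compress_3_2(x, y, z):
    """3:2 compressor applied bitwise: x + y + z == partial_sum + carry."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry


def compressor_tree(operands):
    """Reduce the operands to at most two values without propagating carries."""
    ops = list(operands)
    while len(ops) > 2:
        reduced = []
        full = len(ops) - len(ops) % 3  # operands that form complete groups of 3
        for k in range(0, full, 3):
            reduced.extend(compress_3_2(ops[k], ops[k + 1], ops[k + 2]))
        reduced.extend(ops[full:])  # leftovers pass through to the next level
        ops = reduced
    return ops


def multi_operand_add(operands):
    """Compressor tree followed by one carry-propagate addition (CPA)."""
    return sum(compressor_tree(operands))


# Eight 9-bit operands (illustrative values only).
operands = [0x1FF, 0x0A5, 0x133, 0x001, 0x1F0, 0x07C, 0x100, 0x0FF]
print(multi_operand_add(operands) == sum(operands))
```

Only the final addition propagates carries across the full word, which is why a compressor tree is attractive for multi-operand addition.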
FPCT limitations • PPG must be built in soft logic • Low input utilization for multipliers • Example: 9x9-bit signed multiplier [Baugh-Wooley]: a soft-logic 9x9-bit PPG (81 LUTs) drives 82 wires into the FPCT, which is 64% input utilization (82 of 128 inputs), for an 18-bit output [Figure: PPG output columns C0 to C6 feeding the FPCT]
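To make the role of the partial product generator (PPG) concrete, here is a minimal sketch of a signed 9x9 PPG. It assumes the standard textbook "modified Baugh-Wooley" formulation, which may differ in detail from the circuit in the slides (for instance, in how the constant correction bits are folded into the 82 wires). All function names are hypothetical.

```python
def baugh_wooley_ppg(a, b, n=9):
    """Return (weight, bit) pairs whose weighted sum, modulo 2**(2n),
    equals the two's-complement product of the n-bit signed values a and b."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    bits = []
    for i in range(n):
        for j in range(n):
            p = a_bits[i] & b_bits[j]
            # Terms that involve exactly one sign bit are inverted (NAND).
            if (i == n - 1) != (j == n - 1):
                p ^= 1
            bits.append((i + j, p))
    # Constant correction bits of the modified Baugh-Wooley scheme.
    bits.append((n, 1))
    bits.append((2 * n - 1, 1))
    return bits


def signed_multiply(a, b, n=9):
    """Sum the partial-product bits (the compressor tree's job) and
    interpret the 2n-bit result as a signed value."""
    total = sum(bit << weight for weight, bit in baugh_wooley_ppg(a, b, n))
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


print(signed_multiply(-256, 255) == -256 * 255)
```

In this formulation the 9x9 case yields 81 AND/NAND bits plus the constant bits; every one of them needs a compressor tree input, which is the source of the utilization figure quoted above.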
DSP block architecture [Figure: baseline FPCT (8 CSlices) with 128 input bits and 48 output bits]
DSP block architecture • Two 9x9 signed PPGs • One modified to support larger multipliers • Hard compression circuits ‘A’ and ‘B’ • Efficient synthesis of large multipliers [Figure: proposed DSP block with 128 input bits, PPG* and PPG, compression circuits A and B, and two ½-FPCTs (4 CSlices each)]
DSP block architecture [Figure: detail of Fixed Logic (A) and Fixed Logic (B), with labels C1 to C4]
DSP block architecture • Only 8% larger than the traditional FPCT in 90 nm CMOS (ARTISAN cell library with TSMC process)
Experimental methodology • Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06] • Define a preplaced soft IP core, F*, with the same area and I/O as our DSP block • Replace our DSP block with F* • Map the benchmark on Stratix II • Extract the F* delay • Estimate the proposed DSP block delay with an ASIC design flow (90 nm CMOS) • For each proposed DSP block in the circuit: subtract the delay of F*, then add the proposed DSP block delay [Figure: input pins and output pins around the IP blocks, replaced first by F* and then by the new DSP blocks]
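The final substitution step of the VEB flow is simple arithmetic. The sketch below assumes every proposed DSP block on the path has the same delay and that the placeholder F* delay is extracted once; the function name, parameters, and numbers are hypothetical and not taken from the slides.

```python
def estimate_path_delay_ns(delay_with_fstar_ns, fstar_delay_ns,
                           dsp_delay_ns, blocks_on_path=1):
    """For each DSP block on the path, subtract the F* placeholder delay
    and add the delay of the proposed DSP block (from the ASIC flow)."""
    return delay_with_fstar_ns + blocks_on_path * (dsp_delay_ns - fstar_delay_ns)


# Illustrative numbers only: a 10.0 ns path through two F* placeholders.
print(estimate_path_delay_ns(10.0, fstar_delay_ns=2.5, dsp_delay_ns=1.0,
                             blocks_on_path=2))
```

Routing and soft-logic delays measured on Stratix II stay in the estimate; only the placeholder's internal delay is swapped out.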
Results [Figure: critical path delay (ns) for Ternary, Stratix II DSP block, proposed DSP block, GPC [Parandeh-Afshar et al., ASPDAC 08], and FPCT w/ soft PPG]
Results [Figure: area normalized to the Stratix II DSP block area, for Stratix II DSP block, proposed DSP block, and FPCT w/ soft PPG]
Conclusion • New DSP block proposed • Accelerates multiplication and multi-operand addition • More flexibility • Competitive with the Stratix II DSP block • Intended to replace the compressor tree in existing DSP blocks • Only 8% area overhead with respect to the original FPCT