A Novel FPGA Logic Block for Improved Arithmetic Performance

Hadi Parandeh-Afshar, Philip Brisk, Paolo Ienne
16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008

FPGA vs. ASIC
Which technology has the advantage:
• Performance: ASIC
• Area utilization: ASIC
• Power consumption: ASIC
• Flexibility: FPGA
• Time-to-market: FPGA
• Performance gap between FPGAs and ASICs [Kuon and Rose, FPGA 2006 and TCAD 2007]
• Arithmetic circuits exacerbate the disparities
• Focus on compressor trees

Compressor Trees
• A circuit that sums k > 2 integer values
• Carry-save representation [Wallace 1966, Dadda 1967]
• Parallel multipliers
• Many video/signal processing circuits
• FIR filters
• H.264/AVC video coding
• 3G wireless base station channel cards
• Flowgraph transformations to expose compressor trees [Verma and Ienne, ICCAD 2004]
• Generally applicable to arithmetic circuits
• Merge disparate add and mul operations to form compressor trees
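As an illustration of the carry-save idea on this slide, the following is a minimal Python sketch, not the authors' implementation. It assumes non-negative integers and a simple greedy grouping of operands into threes; the function names `carry_save_3to2` and `compressor_tree_sum` are hypothetical. Each 3:2 step uses bitwise full adders with no carry propagation, and only the final two operands go through an ordinary (carry-propagating) addition.

```python
def carry_save_3to2(a, b, c):
    """Compress three operands into two (sum word, carry word).

    Every bit position acts as an independent full adder, so no carry
    ripples across positions. Invariant: a + b + c == s + carry.
    """
    s = a ^ b ^ c                                # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, one rank up
    return s, carry


def compressor_tree_sum(values):
    """Sum k non-negative integers with levels of 3:2 compression.

    Assumption: greedy grouping into threes; real compressor trees choose
    the grouping per bit column to minimize delay and area.
    """
    operands = list(values)
    while len(operands) > 2:
        next_level = []
        while len(operands) >= 3:
            a, b, c = operands.pop(), operands.pop(), operands.pop()
            next_level.extend(carry_save_3to2(a, b, c))
        next_level.extend(operands)  # zero, one, or two operands pass through
        operands = next_level
    # Final carry-propagate addition of the remaining (at most two) operands.
    return sum(operands)


total = compressor_tree_sum([3, 5, 7, 9, 11])
```

Here `total` equals the ordinary sum of the five inputs; the point of the structure is that all levels except the last are free of carry propagation.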

Circuit Transformation: ADPCM [Verma and Ienne, ICCAD 2004]
[Figure: dataflow graph of the ADPCM vpdiff computation (steps 0 to 3, built from shift, AND, compare, SEL, and add nodes on the delta bits), shown before and after the transformation that merges the additions into a compressor tree (∑)]

Compressor Tree Synthesis
• FPGA synthesis
• Stratix II/III carry chain
• LUTs (shared arithmetic mode)
• Ternary addition
• Map poorly onto LUTs
• Poor flexibility in mapping
• ASIC synthesis
• Ripple-carry addition
• Carry-save representation
• Ternary addition
• Full/half adder trees
• m:n counters
• m:n counters: count the number of input bits set to 1
• Generalized full/half adders
• Output is a value in the range [0, m]
• Drawbacks
• Routing delays
• Can't use carry chains
• Cited on this slide: [Wallace 1966], [Dadda 1967], [Stelling et al. TComp 1998], [Verma and Ienne, DATE 2007]
[Figure: trees of full adders (FA) and half adders (HA), an m:n counter with m inputs and n outputs, and FPGA implementations using LUTs versus carry-chain LUTs]
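The m:n counter described on this slide can be modeled behaviorally in a few lines. This is a sketch under stated assumptions: the function name `mn_counter` is hypothetical, the output is listed least-significant bit first, and n is taken as the smallest width that can hold the value m.

```python
def mn_counter(bits):
    """Behavioral model of an m:n counter.

    Takes m input bits of equal rank and returns the number of bits set
    to 1 as an n-bit binary value (least-significant bit first), where
    n is the smallest width that can represent the range [0, m].
    """
    m = len(bits)
    n = max(1, m.bit_length())  # equals ceil(log2(m + 1)) for m >= 1
    count = sum(bits)           # value in the range [0, m]
    return [(count >> i) & 1 for i in range(n)]


# A full adder is a 3:2 counter; a 6:3 counter handles six bits at once.
full_adder_out = mn_counter([1, 1, 0])
six_three_out = mn_counter([1, 1, 0, 1, 1, 1])
```

With three inputs the model reduces to a full adder (sum bit, then carry bit), which is why the slide calls m:n counters generalized full/half adders.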

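The Altera Stratix II/III ALM: Shared Arithmetic Mode
[Figure: ALM in shared arithmetic mode. For each of two ranks, r and r+1, one 3-LUT produces a sum signal (sum_r, sum_r+1) and another 3-LUT produces a carry signal (carry_r, carry_r+1); the results are routed to the ALM outputs.]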

Generalized Parallel Counters (GPCs)
• Extension to m:n counters
• Input bits can have different ranks
• i.e., (k_{n-1}, …, k_1, k_0; S), where k_i is the number of input bits of rank i (weight 2^i)
• Number of input bits: M = k_{n-1} + … + k_1 + k_0
• Number of output bits: S
• Example: (2, 3; 3) has two inputs of weight 2^1 and three of weight 2^0; output range [0, 7], so S = 3
• Example: (0, 4; 3) is a 4:3 counter
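The GPC notation can be checked with a small behavioral model. The sketch below is illustrative only: the function name `gpc` and the list-of-lists input layout (index = rank) are assumptions, and the output width S is derived as the smallest width that holds the maximum possible weighted sum.

```python
def gpc(inputs_by_rank):
    """Behavioral model of a generalized parallel counter.

    inputs_by_rank[r] is the list of input bits with rank r (weight 2**r),
    so a (2, 3; 3) GPC takes [[b0, b1, b2], [b3, b4]].
    Returns the weighted sum as S output bits, least-significant first.
    """
    total = sum(sum(bits) << rank for rank, bits in enumerate(inputs_by_rank))
    max_total = sum(len(bits) << rank for rank, bits in enumerate(inputs_by_rank))
    s = max(1, max_total.bit_length())  # smallest width covering [0, max_total]
    return [(total >> i) & 1 for i in range(s)]


# (2, 3; 3): three rank-0 bits and two rank-1 bits, all set to 1.
all_ones = gpc([[1, 1, 1], [1, 1]])
# Same shape with a mixed input pattern: 2 (rank 0) + 1 * 2 (rank 1) = 4.
mixed = gpc([[1, 0, 1], [1, 0]])
```

For the (2, 3; 3) shape the maximum weighted sum is 2·2 + 3·1 = 7, which is why three output bits suffice.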

Compressor Tree Synthesis on FPGAs via GPC Mapping
• Software synthesis heuristic/ILP [Parandeh-Afshar et al. ASPDAC 2008, DATE 2008]
• Faster than ternary adder trees or DSP blocks
• Stratix II/III and Xilinx Virtex-5 FPGAs
• M = 6 inputs; S = 3 or 4 outputs
• GPCs were mapped onto 6-LUTs
• Unable to exploit the carry chain, except for the final add
• Contribution: a new carry chain that we can use!

The 6:2 Compressor: an Alternative to the 6:3 Counter and 6-input GPC
• 6:3 counter: all inputs have rank 0; output ranks 2, 1, 0
• 6-input GPC: input ranks may vary; output ranks 2, 1, 0
• 6:2 compressor: all inputs have rank 0; carry-ins cin,0 and cin,1; carry-outs cout,0 and cout,1; output ranks 1, 0

Why are 6:2 compressors more effective than 6:3 counters?
• 6:2 compressor: steady state of 2 bits per column
• 6:3 counter: steady state of 3 bits per column
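The steady-state claim can be reproduced with a simple column-height count. The sketch below is a simplified model, not the authors' tool: it assumes every column starts with six bits, places one cell per column, and treats the 6:2 compressor as producing outputs only at ranks r and r+1 (its carry bits are assumed to stay on the dedicated carry chain and are not counted as column bits). The function name `column_heights_after_one_level` is hypothetical.

```python
def column_heights_after_one_level(heights, inputs=6, outputs=3):
    """Count bits per column after one level of identical cells.

    heights[c] is the number of bits in column c. Each cell consumes
    `inputs` bits from one column and emits one bit into each of the
    `outputs` columns starting at its own. Leftover bits that do not
    fill a cell stay in their column.
    """
    result = [0] * (len(heights) + outputs - 1)
    for col, h in enumerate(heights):
        cells, leftover = divmod(h, inputs)
        for k in range(outputs):
            result[col + k] += cells
        result[col] += leftover
    return result


# Eight columns of six bits each.
counter_heights = column_heights_after_one_level([6] * 8, outputs=3)
compressor_heights = column_heights_after_one_level([6] * 8, outputs=2)
```

Away from the edges, the 6:3 counters leave three bits in every column while the 6:2 compressors leave two, so under these assumptions the compressor output is already in carry-save form and needs only the final addition.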

6:2 Compressors Form a Carry Chain
[Figure: a row of six 6:2 compressors, each taking 6 input bits and producing 2 output bits, linked by carry chains]
• Each 6:2 compressor is a logic cell
• Carry chains between adjacent cells bypass local routing
• This is not an over-glorified ripple-carry structure

6:2 Compressors: Microarchitecture
[Figure: 6:2 compressor built from full adders (FA) and a half adder (HA), with rank-0 inputs, carry-ins cin,0 and cin,1, carry-outs cout,0 and cout,1, and sum outputs]
• No combinational path from carry-in to carry-out bits
• This is not ripple-carry

Similarities Between Shared Arithmetic Mode and the 6:2 Compressor
[Figure: the ALM in shared arithmetic mode (ALM inputs, FAs implemented in LUTs, FAs, ALM outputs) shown beside the 6:2 compressor (rank-0 inputs, cin,0 and cin,1, FAs and an HA, cout,0 and cout,1, sum outputs)]

Proposed Logic Cell: 2 Designs

Experimental Methodology
• Platform: VPR
• Modeled island-style FPGA
• Altera-like ALMs and LABs
• 4 ALMs per LAB to reduce complexity
• 4 mapping algorithms
• 3-ADD: ternary adder trees
• GPC: GPC mapping [Parandeh-Afshar et al. ASPDAC 2008]
• 6:2: mapping using 6:2 compressors only
• 6:2 + GPC: the best of both worlds

Experimental Results
• 6:2 + GPC is the best in all cases
• 3-ADD has the smallest area in all cases
• GPC has the largest area in all cases
• No uniform trends
• GPC does not use carry chains; the others do!

Conclusion
• Compressor trees are an important class of arithmetic circuits
• Previous work: GPC mapping outperforms 3-ADD
• Cannot use the carry chain
• Contribution: new carry chain
• Configures the Altera Stratix II/III ALM as a 6:2 compressor
• 1 HA, 2 FA, 2 muxes, plus wires
• Best results combine GPC mapping with 6:2 compressors
• Average speedup: 1.41x over 3-ADD
• Average increase in ALM usage: 1.19x over 3-ADD
