
A Novel FPGA Logic Block for Improved Arithmetic Performance

Hadi Parandeh-Afshar, Philip Brisk, Paolo Ienne. 16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008.


Presentation Transcript


  1. A Novel FPGA Logic Block for Improved Arithmetic Performance
     Hadi Parandeh-Afshar, Philip Brisk, Paolo Ienne
     16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008

  2. FPGA vs. ASIC
     • Comparison table (columns: ASIC, FPGA; one √ per row): Performance √, Area Utilization √, Power Consumption √, Flexibility √, Time-to-Market √
     • Performance gap between FPGAs and ASICs [Kuon and Rose, FPGA 2006 and TCAD 2007]
     • Arithmetic circuits exacerbate the disparities
     • Focus on compressor trees

  3. Compressor Trees
     • A circuit that sums k > 2 integer values
       • Carry-save representation [Wallace 1966, Dadda 1967]
     • Parallel multipliers
     • Many video/signal processing circuits
       • FIR filters
       • H.264/AVC video coding
       • 3G wireless base station channel cards
     • Flowgraph transformations to expose compressor trees [Verma and Ienne, ICCAD 2004]
       • Generally applicable to arithmetic circuits
       • Merge disparate add, mul operations to form compressor trees
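The carry-save idea behind a compressor tree can be sketched in software. The snippet below is a minimal illustration, not the authors' synthesis method: it assumes non-negative integer operands and uses a plain 3:2 (full-adder) reduction step; the function names `carry_save_step` and `compressor_tree_sum` are hypothetical.

```python
def carry_save_step(a, b, c):
    """3:2 compression: reduce three integers to a (sum, carry) pair with the same total."""
    partial_sum = a ^ b ^ c                     # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved up one rank
    return partial_sum, carry


def compressor_tree_sum(operands):
    """Sum k non-negative integers with carry-save steps and a single final addition."""
    values = list(operands)
    while len(values) > 2:
        a, b, c = values.pop(), values.pop(), values.pop()
        values.extend(carry_save_step(a, b, c))  # three values become two
    return sum(values)  # the only carry-propagate addition


print(compressor_tree_sum([3, 5, 7, 9, 11]))
```

Each step removes one operand without propagating carries along the word, so the slow carry-propagate addition happens only once, at the end.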

  4. Circuit Transformation: ADPCM [Verma and Ienne, ICCAD 2004]
     [Figure: ADPCM dataflow graph computing vpdiff from delta and step using shift, mask, select (SEL), and add operations, shown before and after the transformation that merges the additions into a compressor tree (∑).]

  5. Compressor Tree Synthesis
     • ASIC Synthesis
       • Ripple-carry addition
       • Carry-save representation
       • Ternary addition
       • Full/Half Adder Trees [Wallace 1966] [Dadda 1967]
       • m:n counters [Stelling et al. TComp 1998] [Verma and Ienne, DATE 2007]
     • m:n counters
       • Count number of input bits set to 1
       • Generalized Full/Half Adders
       • Output is a value in the range [0, m]
     • FPGA Synthesis
       • Stratix II/III carry chain
       • LUTs (shared arithmetic mode)
       • Ternary addition
       • Map poorly onto LUTs
       • Poor flexibility in mapping
     • Drawbacks
       • Routing delays
       • Can't use carry-chains
     [Figure labels: HA/FA adder tree; m:n counter; LUTs and carry-chain.]

  6. The Altera Stratix II/III ALM: Shared Arithmetic Mode
     [Figure: at rank r and rank r+1, a pair of 3-LUTs produces sum_r and carry_r (respectively sum_{r+1} and carry_{r+1}); the results go to the ALM outputs.]

  7. Generalized Parallel Counters (GPCs)
     • Extension to m:n counters
     • Input bits can have different ranks
       • i.e., (k_{n-1}, …, k_1, k_0; S), where the k_i input bits of rank i have weight 2^i
     • Number of input bits: M = k_{n-1} + … + k_1 + k_0
     • Number of output bits: S
     • Examples: (2, 3; 3) has output range [0, 7], so S = 3; (0, 4; 3) is a 4:3 counter
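The GPC notation can be checked with a few lines of code. This is a behavioral sketch only: `gpc_shape` and `gpc_eval` are hypothetical helper names, and deriving S as the smallest bit width that holds the maximum output is an assumption made here for illustration (on the slide, S is simply part of the GPC's description).

```python
def gpc_shape(counts):
    """Return (M, max_value, S) for a GPC written (k_{n-1}, ..., k_1, k_0; S)."""
    n = len(counts)
    m = sum(counts)  # M: total number of input bits
    # Largest output value: every input bit is 1, each weighted by 2**rank.
    max_value = sum(k << (n - 1 - i) for i, k in enumerate(counts))
    s = max_value.bit_length()  # assumed: fewest output bits covering [0, max_value]
    return m, max_value, s


def gpc_eval(bits_by_rank):
    """Evaluate a GPC; bits_by_rank lists the input bits per rank, highest rank first."""
    n = len(bits_by_rank)
    return sum(sum(bits) << (n - 1 - i) for i, bits in enumerate(bits_by_rank))


print(gpc_shape((2, 3)))              # the (2, 3; 3) GPC
print(gpc_shape((0, 4)))              # the (0, 4; 3) GPC, i.e. a 4:3 counter
print(gpc_eval([[1, 1], [1, 0, 1]]))  # two rank-1 bits and three rank-0 bits
```

For (2, 3; 3) the maximum is 2·2 + 3 = 7, which needs three output bits; for (0, 4; 3) the maximum is 4, which also needs three.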

  8. Compressor Tree Synthesis on FPGAs via GPC Mapping
     • Software synthesis heuristic/ILP [Parandeh-Afshar et al. ASPDAC 2008, DATE 2008]
     • Faster than ternary adder trees or DSP blocks
       • Stratix II/III and Xilinx Virtex-5 FPGAs
     • M = 6 inputs; S = 3, 4 outputs
     • GPCs were mapped onto 6-LUTs
     • Unable to exploit the carry chain, except for final add
     • Contribution: A new carry chain that we can use!

  9. The 6:2 Compressor: an Alternative to the 6:3 Counter and 6-input GPC
     • 6-input GPC: input ranks may vary; output ranks 2, 1, 0
     • 6:3 counter: all inputs have rank 0; output ranks 2, 1, 0
     • 6:2 compressor: all inputs have rank 0; output ranks 1, 0; carry-ins c_in,0 and c_in,1; carry-outs c_out,0 and c_out,1
     [Figure: block diagrams of the three cells; the carry signals of the 6:2 compressor are annotated with ranks 0, 1, and 2.]

  10. Why are 6:2 compressors more effective than 6:3 counters?
      • 6:2 compressor: steady state of 2 bits per column
      • 6:3 counter: steady state of 3 bits per column
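The steady-state figures can be reproduced with a simple column-height count. The sketch below assumes one cell per column, each fed six rank-0 bits of its own column, and counts only the bits that go on to the next level of the tree; the carries of the 6:2 compressor are assumed to stay on the carry chain between neighbouring cells and are therefore not counted. The name `column_heights` is hypothetical.

```python
def column_heights(num_columns, output_ranks):
    """Count the output bits landing in each column when every column has one cell.

    output_ranks gives the ranks, relative to the cell's own column, of the
    outputs that feed the next level of the tree.
    """
    heights = [0] * (num_columns + max(output_ranks))
    for col in range(num_columns):
        for rank in output_ranks:
            heights[col + rank] += 1  # an output of relative rank r lands r columns to the left
    return heights


counter_63 = column_heights(6, (0, 1, 2))  # 6:3 counter: outputs of rank 0, 1, 2
compressor_62 = column_heights(6, (0, 1))  # 6:2 compressor: sum outputs of rank 0, 1
print(counter_63)
print(compressor_62)
```

Away from the edges, every column receives three bits from 6:3 counters but only two from 6:2 compressors, and two bits per column is exactly what a final two-operand adder accepts.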

  11. 6:2 Compressors Form a Carry Chain
      • Each 6:2 compressor is a logic cell
      • Carry chains between adjacent cells bypass local routing
      • This is not an over-glorified ripple-carry structure
      [Figure: a row of six 6:2 compressors, each taking 6 input bits and producing 2 output bits.]

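  12. 6:2 Compressors: Microarchitecture
      • No combinational path from carry-in to carry-out bits
      • This is not ripple-carry
      [Figure: the rank-0 inputs and carry-ins c_in,1 and c_in,0 feed a network of four full adders (FA) and one half adder (HA), which produces the sum outputs and carry-outs c_out,1 and c_out,0.]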

  13. Similarities Between Shared Arithmetic Mode and the 6:2 Compressor
      [Figure: the ALM in shared arithmetic mode (ALM inputs, FAs implemented in LUTs, further FAs, ALM outputs) shown beside the 6:2 compressor (rank-0 inputs, c_in,1 and c_in,0, FAs and one HA, sum outputs, c_out,1 and c_out,0).]

  14. Proposed Logic Cell: 2 Designs

  15. Experimental Methodology
      • Platform: VPR
        • Modeled island-style FPGA
        • Altera-like ALMs and LABs
        • 4 ALMs per LAB to reduce complexity
      • 4 Mapping Algorithms
        • 3-ADD: Ternary adder trees
        • GPC: GPC mapping [Parandeh-Afshar et al. ASPDAC 2008]
        • 6:2: Mapping using 6:2 compressors only
        • 6:2 + GPC: The best of both worlds

  16. Experimental Results
      • 6:2 + GPC is the best in all cases
      • 3-ADD has the smallest area in all cases
      • GPC has the largest area in all cases
      • No uniform trends
      • GPC does not use carry chains; the others do!

  17. Conclusion
      • Compressor trees are an important class of arithmetic circuits
      • Previous work: GPC mapping outperforms 3-ADD
        • Cannot use carry-chain
      • Contribution: New carry chain
        • Configures the Altera Stratix II/III ALM as a 6:2 compressor
        • 1 HA, 2 FA, 2 muxes, plus wires
      • Best results combine GPC mapping with 6:2 compressors
        • Average speedup: 1.41x over 3-ADD
        • Average increase in ALM usage: 1.19x over 3-ADD
