Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees

Hadi P. Afshar, Philip Brisk, Paolo Ienne

Presentation Transcript


  1. Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees
     Hadi P. Afshar, Philip Brisk, Paolo Ienne

  2. Multi-input Additions are Fundamental
     • DSP and Multimedia Applications
       • FIR filters, Motion Estimation,…
     • Parallel Multipliers
     • Flow Graph Transformation
     [Figure: FIR filter, with delay elements (D) feeding a multi-input sum (Σ)]

  3. Flow Graph Transformation
     [Figure: ADPCM computation of vpdiff, BEFORE and AFTER the transformation. BEFORE: a chain of steps 0 to 3 that test bits of delta (&), shift step (>>) and add (+). AFTER: the additions are merged into a single compressor tree (∑) that produces vpdiff.]
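The figure itself is not recoverable from the transcript, so the following is only a minimal sketch of the idea. It assumes the `vpdiff` update has the shape used in the widely circulated IMA ADPCM reference code (a base term plus three terms gated by bits of `delta`); the function names are hypothetical. It shows that a chain of conditional additions computes the same value as one multi-operand sum, which is the form a compressor tree can implement.

```python
def vpdiff_sequential(delta, step):
    """Chain of conditional additions (the 'before' form)."""
    vpdiff = step >> 3
    if delta & 4:
        vpdiff += step
    if delta & 2:
        vpdiff += step >> 1
    if delta & 1:
        vpdiff += step >> 2
    return vpdiff


def vpdiff_multi_operand(delta, step):
    """The same value written as a single four-operand sum (the 'after' form)."""
    operands = [
        step >> 3,
        step if delta & 4 else 0,          # operand gated by bit 2 of delta
        (step >> 1) if delta & 2 else 0,   # operand gated by bit 1 of delta
        (step >> 2) if delta & 1 else 0,   # operand gated by bit 0 of delta
    ]
    return sum(operands)
```

In the multi-operand form no addition depends on the result of another, so all four operands can enter one compressor tree at once.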

  4. Compressor vs. Adder Tree
     [Figure: a compressor tree (layers of CSAs followed by one CPA) next to an adder tree (CPAs only)]
     • Compressors are better than adder trees in VLSI
     • But adder trees are better than compressors on FPGAs! Compressor trees there suffer from:
       • Slow intra-LUT routing
       • Poor LUT utilization
       • Low logic density
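As a reminder of what distinguishes the two structures, here is a small sketch of a carry-save adder (CSA) and a compressor tree built from it. The function names are illustrative and operands are assumed to be non-negative integers. A CSA reduces three operands to two without propagating carries; only the final two-operand addition needs a carry-propagate adder (CPA).

```python
def carry_save_add(x, y, z):
    """3:2 compression of whole words: returns (sum_word, carry_word)."""
    sum_word = x ^ y ^ z                                 # bitwise sum, no carry propagation
    carry_word = ((x & y) | (x & z) | (y & z)) << 1      # majority bits, shifted to the next weight
    return sum_word, carry_word


def compressor_tree_sum(operands):
    """Reduce operands with CSAs until two remain, then do one final addition."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)                                      # stands in for the final CPA
```

An adder tree would instead perform a full carry-propagate addition at every node.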

  5. But Compressor Trees can be faster and smaller if Properly Designed

  6. Better Compressors on FPGA
     • Generalized Parallel Counter (GPC) is the basic block
       • More logic density
       • Fewer logic levels
       • Less pressure on the routing
     [Figure: a compressor tree built from GPCs, followed by a final CPA]

  7. Overview
     • Arithmetic Concepts
     • Hybrid Design Approach
       • Bottom-up
       • Top-down
     • Experiments
     • Conclusion

  8. Parallel Counters
     • Parallel Counter
       • Counts the number of input bits set to 1
       • Output is a binary value
       • An m:n counter has m inputs and $n = \lceil \log_2(m+1) \rceil$ outputs
       • 3:2 − Full Adder
       • 2:2 − Half Adder
     • Generalized Parallel Counter (GPC)
       • Input bits can have different bit positions
       • E.g. (3, 3; 4) GPC
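The definitions above can be sketched in a few lines. The helper names are hypothetical, and columns are listed least-significant first, which is the reverse of the slide's (…; n) notation. A GPC outputs the weighted sum of its input bits, and its output width is the number of bits needed for the largest possible sum.

```python
def gpc(columns):
    """columns[i] is the list of input bits of weight 2**i (LSB column first)."""
    return sum(sum(bits) << i for i, bits in enumerate(columns))


def gpc_output_width(counts):
    """counts[i] is the number of inputs of weight 2**i (LSB column first)."""
    max_value = sum(k << i for i, k in enumerate(counts))
    return max_value.bit_length()
```

For the (3, 3; 4) GPC, `gpc_output_width([3, 3])` covers a maximum sum of 3 + 2·3 = 9, which needs 4 output bits. An ordinary m:n counter is the single-column case, for example `gpc_output_width([3])` for the 3:2 full adder.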

  9. Compressor Trees on FPGAs
     • We propose GPCs as the basic blocks for compressor trees
     • Why?
       • GPCs map well onto FPGA logic cells
       • GPCs are flexible

  10. GPC Mapping Example
     [Figure: mapping example comparing a covering with 5 counters against a covering with 3 GPCs; GPC labels: (0,5;3), (3,4;4), (3,5;4)]

  11. Overview
     • Arithmetic Concepts
     • Hybrid Design Approach
       • Bottom-up
       • Top-down
     • Experiments
     • Conclusion

  12. Hybrid Design Approach
     [Figure: design flow. Top-down: Compressor Tree Specification → GPC-Mapped HDL Netlist → Place and Route → Result. Bottom-up: FPGA Architectural Characteristics → Atom-Level GPC HDL Library.]

  13. FPGA Logic Cell
     • Altera Stratix-II/III/V
     [Figure: a Logic Array Block (LAB) and one of its Adaptive Logic Modules (ALMs); the ALM has inputs 1 to 8, combinational logic, two adders (+) and two registers (Reg)]

  14. FPGA Logic Cell
     • ALM Configuration Modes
       • Normal
       • Extended
       • Arithmetic
       • Shared Arithmetic
     [Figure: ALMs drawn as 4-LUTs feeding adders (+)]

  15. Bottom-up Design
     • What if we have bigger GPCs, like a 7:3 GPC?
     • Can we exploit the carry chain and dedicated adders for building GPCs?
     [Figure: a 6:3 GPC with output functions F0, F1, F2; labels: LAB0, LAB1]

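  16. GPC Design Example
     [Figure: a (0, 6; 3) GPC with inputs a0 to a5 and outputs z0, z1, z2, mapped onto two ALMs (ALM0, ALM1). The LUTs compute S(a1,a2,a3), C(a1,a2,a3), S(a4,a5) and C(a4,a5); adders (+) combine these with a0 through intermediate signals s0, s1, c0, c1. A network of full and half adders (FA, HA, FA, FA) is shown alongside.]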
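One plausible reading of the FA/HA/FA/FA labels is sketched below; the wiring is an assumption, since the transcript does not preserve the connections. It decomposes the six inputs into S/C pairs as on the slide and checks that the three outputs encode the number of inputs set to 1.

```python
def full_adder(a, b, c):
    """Returns (sum, carry) of three bits."""
    total = a + b + c
    return total & 1, total >> 1


def half_adder(a, b):
    """Returns (sum, carry) of two bits."""
    return full_adder(a, b, 0)


def gpc_0_6_3(a0, a1, a2, a3, a4, a5):
    """(0, 6; 3) GPC: six weight-1 inputs, outputs (z0, z1, z2) with weights 1, 2, 4."""
    s0, c0 = full_adder(a1, a2, a3)   # S(a1,a2,a3), C(a1,a2,a3)
    s1, c1 = half_adder(a4, a5)       # S(a4,a5), C(a4,a5)
    z0, k = full_adder(a0, s0, s1)    # add the three weight-1 bits
    z1, z2 = full_adder(c0, c1, k)    # add the three weight-2 bits
    return z0, z1, z2
```

The last two full adders are the part that the slide places on the dedicated adders of the carry chain; the first two are plain LUT functions.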

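  17. GPC Placement
     • Logic separation between carry and sum
     • Zero value on the carry
     • Feeding the same bit a to both operand inputs of an adder cell gives $\{c_{out}, s\} = c_{in} + a + a = c_{in} + 2a$, so $c_{out} = a$ and $s = c_{in}$
     [Figure: adder cells (+) along a carry chain with GPC boundaries marked; a 0 enters the chain at a boundary]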
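The identity on this slide is easy to check exhaustively. The sketch below models one adder cell of a carry chain (the function names are illustrative) and confirms that, when both operand inputs carry the same bit `a`, the cell passes the incoming carry out on `s` and starts the next carry with `a`. Setting `a = 0` therefore gives the following GPC a zero carry-in.

```python
def chain_cell(a, b, cin):
    """One adder cell on the carry chain: returns (cout, s)."""
    total = a + b + cin
    return total >> 1, total & 1


def separation_holds():
    """Check cout == a and s == cin for every combination of a and cin."""
    return all(
        chain_cell(a, a, cin) == (a, cin)
        for a in (0, 1)
        for cin in (0, 1)
    )
```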

  18. [Figure: a row of LUTs feeding a chain of adders (+), with the cells grouped into adjacent GPCs labeled GPCi and GPCi+1]

  19. Top-down Heuristic

      Mapping_algorithm(Integer : M, Integer : W, Array of Integers : columns)
      {
          Build_GPC_library();
          repeat
          {
              // Cover every column that is taller than H
              while (col_indx < max_col_indx)
              {
                  if (columns[col_indx] > H)
                      Map_by_GPC();
                  else
                      col_indx++;
              }
              lsb_to_msb_covering();
              Connect_GPCs_IOs();
              Propagate_comb_delay();
              Generate_next_stage_dots();
          } until three rows of dots remain;
      }

      [Slide annotations beside the listing: Step1, Step2, Step3; (0, H; log2H)]
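The listing leaves its helper routines undefined, so the following is a deliberately simplified sketch rather than the authors' heuristic. It tracks only column heights of the dot diagram, uses single-column counters of up to `H` inputs, and covers any column taller than the target height of three rows (the slide's test is `> H`, followed by a separate covering pass). All names and the stopping rule per column are assumptions made for illustration.

```python
def next_stage(columns, H=6, target=3):
    """Cover tall columns with single-column counters; return the next stage's heights."""
    nxt = [0] * (len(columns) + H.bit_length())
    for col, height in enumerate(columns):       # process columns from LSB to MSB
        remaining = height
        if height > target:
            while remaining >= 3:
                k = min(H, remaining)            # inputs consumed by this counter
                for j in range(k.bit_length()):  # one output dot per result bit
                    nxt[col + j] += 1
                remaining -= k
        nxt[col] += remaining                    # uncovered dots pass through
    while nxt and nxt[-1] == 0:                  # drop empty high-order columns
        nxt.pop()
    return nxt


def compress(columns, H=6, target=3):
    """Repeat stages until no column is taller than `target` rows."""
    stages = [list(columns)]
    while max(stages[-1]) > target:
        stages.append(next_stage(stages[-1], H, target))
    return stages
```

For a single column of eight dots, `compress([8])` returns `[[8], [3, 1, 1]]`: one six-input counter produces three output dots and the two leftover dots pass through. Every stage that applies a counter strictly reduces the total number of dots, so the loop terminates.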

  20. Major Step of Heuristic
     [Figure: dot diagram annotated "Mapped to (0, H; log2H) GPCs", "Height < H" and "Process columns from LSB to MSB"]

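  21. Delay Balancing
     • Signal arrival times: $z_{1d} > z_{4d} > z_{6d}$
     • GPC input delays: $a_{0d} > a_{2d} > a_{5d}$
     • $CP_1 = z_{1d} + a_{0d}$ (the latest signal drives the slowest input)
     • $CP_2 = \max(z_{1d} + a_{5d},\ z_{4d} + a_{2d},\ z_{6d} + a_{0d})$ (the latest signal drives the fastest input)
     [Figure: signals z0 to z8 connected to GPC inputs a0, a2, a5]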
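The pairing behind $CP_2$ can be sketched as a simple sort: connect the latest-arriving signal to the fastest input, the next latest to the next fastest, and so on. This is an illustrative assumption about how the balancing could be done, not the authors' implementation, and the delay values in the usage note are made up.

```python
def balance(arrival, input_delay):
    """Pair latest-arriving signals with fastest inputs.

    arrival: signal name -> arrival time
    input_delay: input name -> delay from that input to the GPC output
    Returns (pairs, critical_path).
    """
    signals = sorted(arrival, key=arrival.get, reverse=True)   # latest first
    inputs = sorted(input_delay, key=input_delay.get)          # fastest first
    pairs = list(zip(signals, inputs))
    critical_path = max(arrival[s] + input_delay[i] for s, i in pairs)
    return pairs, critical_path
```

With hypothetical delays `{"z1": 3, "z4": 2, "z6": 1}` and `{"a0": 3, "a2": 2, "a5": 1}`, the balanced pairing has a critical path of 4, whereas connecting z1 to a0 would give 3 + 3 = 6.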

  22. Overview
     • Arithmetic Concepts
     • Hybrid Design Approach
       • Bottom-up
       • Top-down
     • Experiments
     • Conclusion

  23. Experiments
     • Bottom-up design
       • Atom-level design in Verilog Quartus Mapping (VQM) format
     • Top-down
       • Heuristic: C++
       • Output: Structural VHDL
     • Quartus-II Altera tool
     • Benchmarks
       • DCT, FIR, ME, G721
       • Multiplier
       • Horner Polynomial
       • Video Mixer

  24. Experiments
     • Mapping methods
       • Ternary
       • LUT Only
       • Arith1: Arithmetic mode, without delay balancing
       • Arith2: Arithmetic mode, with delay balancing

  25. Delay (ns)
     [Chart; annotated differences: -27%, +2%]

  26. Area (ALM)
     [Chart; annotated differences: +47%, +18%]

  27. Area (LAB)
     [Chart; annotated difference: -4.5%]

  28. Overview
     • Arithmetic Concepts
     • Hybrid Design Approach
       • Bottom-up
       • Top-down
     • Experiments
     • Conclusion

  29. Conclusion
     • Conventional wisdom has held that adder trees outperform compressor trees on FPGAs
       • Ternary adder trees were a major selling point
     • Conventional wisdom is wrong!
       • GPCs map nicely onto FPGA logic cells
         • Carry-chain
       • Compressor trees on FPGAs are faster than adder trees when built from GPCs
