Efficient Synthesis of Compressor Trees on FPGAs

Efficient Synthesis of Compressor Trees on FPGAs Hadi Parandeh-Afshar1,2 Philip Brisk2 Paolo Ienne2 1: University of Tehran, ECE Department 2: EPFL, School of Computer and Communication Sciences

Outline • State of the Art: FPGAs • Motivation • Generalized Parallel Counters • Mapping Heuristic • Experimental Results • Conclusion January 22, 2008

ASIC FPGA Performance Area Utilization Power Consumption Flexibility Time-to-Market FPGA vs. ASIC √ √ √ √ √ January 22, 2008

FPGA Arithmetic Features • Poor Performance for Arithmetic Operations Compared to ASIC • IP Cores • High Routing Costs • Limited Flexibility; 18-bit Adder/Multiplier • Full Adder Implemented in CLB Structure • Fast Carry-Chain (Xilinx and Altera) • Reduces Routing Delay • Cannot Use Compressor Trees to Add k>2 Values • Wallace/Dadda/3-Greedy January 22, 2008

Motivation: Compressor Trees • Partial product reduction in parallel multiplication • Wallace and Dadda in the 1960s • Multi-input addition occurs in many multimedia and signal processing • H.264/AVC Variable Block Size Motion Estimation • FIR Filters • 3G Wireless Base Station Channel Cards • Flow graph transformations expose opportunities to use compresor trees in high-level synthesis • [Verma and Ienne, ICCAD 04] January 22, 2008

step 3 delta 7 delta 4 delta 2 delta 1 >> 4 0 0 0 + step 1 0 & = = = SEL >> = step 0 step 1 step 2 step 3 2 + 0 & step 2 >> >> >> >> 0 0 0 SEL = >> SEL SEL SEL 1 & & & & 0 & + ∑ Compressor Tree SEL = + vpdiff vpdiff Flow Graph Transformation ADPCM January 22, 2008

Counters Counters You Know 2:2 – Half Adder 3:2 – Full Adder Count #of Input Bits Set to 1 Output # as a Binary Value (Carry-Save Adder) m:n counter m The correct building block for computing sums of k>2 numbers n Counters do not map well onto LUTs or carry chains n = log2(m+1) January 22, 2008

Generalized Parallel Counters (GPCs) • Sum bits having different ranks • m:n counter: all bits have rank 0, i.e.: 20 = 1 • Representation: • (Kn-2, Kn-1, …, K0; S) • Ki – number of input bits of rank i • S – number of output bits • (0, 4; 3) – typical 4:3 counter • (2, 3; 3) – maximum value: 2*21 + 4*20= 12 • Range [0, 12] requires S = 4 output bits • Examples using dot notation • (3, 3; 4) GPC • (5, 5; 4) GPC January 22, 2008

GPC Implementation • For ASICs • Basic gates, e.g. AND, XOR • Built from m:n counters, e.g., just like a compressor tree • FPGA Implementation • K-input GPC maps nicely onto K-LUTs • One logic level required • K = 6 for Xilinx Virtex-5 and Altera Stratix II and III • Three 6-LUTs for 6-input, 3-output GPC • Four 6-LUTs for 6-input, 4-output GPC January 22, 2008

Definitions • Primitive GPCs: • Satisfies given I/O Constraints • 12-primitive GPCs for 6 inputs, 3 outputs • Including (1, 3; 3), (2, 3; 3) • Covering GPCs • Functionality cannot be implemented by other GPCs, given the I/O constraints • e.g., (2, 3; 3) GPC can implement a (1, 3; 3) GPC • Set a rank-1 input bit to 0 January 22, 2008

Definitions • Unreasonable GPCs: • Single bit in rank-0 column • (3, 1; 3) GPC • rank-0 output bit = rank-0 input bit • No reduction in bits • (1, 2; 3) GPC • 3 input bits: • Output value in range [0, 4] • 3 output bits January 22, 2008

Definitions • Compression Ratio (CR): • # Input Bits / # Output Bits • (3, 3; 4) GPC • CR = 6/4 = 1.5 • (2, 3; 3) GPC • CR = 5/3 = 1.67 • Using GPCs with large CR tends to reduce the number of bits to sum at the next logic level • # logic levels = # LUTs on critical path in an FPGA January 22, 2008

0 rank Input: Columns of bits to sum • Example: 3-tap FIR filter • Each FIR filter is different, depending on constants used January 22, 2008

Attack the tallest column first (greedy approach) Virtex-5 and Stratix II & III support ternary addition Mapping Heuristic • map_algorithm(Integer : M, Integer : N, • Array of Integers : columns ) • { • step1: find_covering_GPCs( ); • step2: find_primitive_GPCs( ); • step3: order_primitive_GPCs( ); • Repeat { • step4: Repeat { • col_indx = find_highest_column( ); • find_next_GPC (col_indx); • remove_covered_dots( ); • } until all dots are covered or no reasonable GPC is found • step5: connect_GPCs_IOs( ); • step6: generate_next_stage_dots( ); • } until three rows of dots remains; • } • step7: generate_final_cpa( columns ) January 22, 2008

2 1 3 4 Map to ternary adder Example January 22, 2008

Experimental Methodology • Altera Stratix-II • 90nm CMOS Technology • Implementations of multi-input addition • ADD – Ternary adder tree • State of the art for FPGAs • 3GD – 3-greedy algorithm (3:2 and 2:2 counters) • [Stelling et al., TCOMP 98] • 2 and 3-input counters do not map well onto 6-LUTs! • GPCs – Heuristic described here January 22, 2008

Experimental Results (Delay) 27% on average GPC is faster than ADD January 22, 2008

Experimental results (Area) 5% increase in ALMs usage for GPC compared to ADD January 22, 2008

Are DSP/MAC Blocks Useful? • No! On average, delay using DSP/MAC blocks was more than 2x worse than 3GD January 22, 2008

Conclusion • Conventional wisdom has held that adder trees outperform compressor trees on FPGAs • Ternary adder trees were a major selling point of the Altera Stratix II architecture • This led to their inclusion in Xilinx Virtex-5 devices • Conventional wisdom is wrong! • GPCs map nicely onto LUTs • Compressor trees on FPGAs, are faster than adder trees when built from GPCs January 22, 2008

Efficient Synthesis of Compressor Trees on FPGAs