
Improving Synthesis of Compressor Trees on FPGAs via Integer Linear Programming • Hadi Parandeh-Afshar (1, 2), Philip Brisk (2), Paolo Ienne (2) • 1: University of Tehran, ECE Department • 2: EPFL, School of Computer and Communication Sciences

Outline • Motivation • Generalized Parallel Counters • ILP Formulation • Experimental Results • Conclusion

Motivation: Why is multi-input addition important? • Partial product reduction in parallel multiplication • Wallace and Dadda in the 1960s • Multi-input addition occurs in many multimedia and signal processing applications • H.264/AVC Variable Block Size Motion Estimation • FIR Filters • 3G Wireless Base Station Channel Cards • Flow graph transformations expose opportunities to use compressor trees in high-level synthesis [Verma and Ienne, ICCAD 2004]

Multi-Input Addition Implementation • ASIC • Compressor Trees + Final Adder • Counters are the basic blocks • Wallace/Dadda/3-Greedy • FPGA • Adder Trees • Full Adder Implemented in CLB Structure • Fast Carry-Chain (Xilinx and Altera) • Reduces Routing Delay • Compressor Trees have poor performance • Fast carry chains cannot be used • Counters are inflexible • GOAL: Better implementation of compressor trees on FPGAs
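
The "Compressor Trees + Final Adder" structure can be illustrated in software. The following is a minimal sketch, not the authors' implementation: it reduces columns of bits with (3; 2) counters in a simple Wallace-like greedy pass until every rank holds at most two bits, then sums what is left. The function names `full_adder`, `compress`, and `multi_add` are hypothetical, and the weighted sum at the end only stands in for a real carry-propagate adder.

```python
from collections import defaultdict

def full_adder(a, b, c):
    """(3; 2) counter: three bits of one rank -> (sum bit, carry bit)."""
    total = a + b + c
    return total & 1, total >> 1

def compress(columns):
    """Reduce bit columns with (3; 2) counters until each holds <= 2 bits.

    `columns` maps rank -> list of bits. Returns (reduced columns, levels).
    """
    levels = 0
    while any(len(bits) > 2 for bits in columns.values()):
        nxt = defaultdict(list)
        for rank, bits in columns.items():
            i = 0
            while len(bits) - i >= 3:
                s, c = full_adder(*bits[i:i + 3])
                nxt[rank].append(s)        # sum bit keeps the same rank
                nxt[rank + 1].append(c)    # carry bit moves up one rank
                i += 3
            nxt[rank].extend(bits[i:])     # leftover bits pass through
        columns = nxt
        levels += 1
    return columns, levels

def multi_add(operands, width):
    """Add several unsigned integers via a compressor tree plus final sum."""
    columns = defaultdict(list)
    for x in operands:
        for r in range(width):
            columns[r].append((x >> r) & 1)
    reduced, levels = compress(columns)
    # Stand-in for the final carry-propagate adder: weighted sum of the bits.
    total = sum(bit << rank for rank, bits in reduced.items() for bit in bits)
    return total, levels
```

The returned `levels` value is the number of counter levels this greedy pass used, which is the quantity the later ILP formulation minimizes when GPCs replace plain counters.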

Generalized Parallel Counters (GPCs) • Parallel Counter: Sum bits with the same rank • Generalized Parallel Counter: Sum bits having different ranks • Example: (3; 2) Counter, (3, 3; 4) GPC • GPCs are more flexible and reduce the number of logic levels • GPCs are more complex, but the additional complexity is absorbed in LUTs! • GPCs are perfect building blocks to create better compressors out of FPGA LUTs
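
A GPC can be modeled as a function that sums its input bits weighted by rank. The sketch below is illustrative only; the names `gpc_output_width` and `gpc_eval` are hypothetical, and lists are indexed by rank (index 0 is rank 0), which is an assumption about ordering rather than something the slides specify.

```python
def gpc_output_width(counts):
    """Number of output bits of a GPC with counts[r] inputs of rank r."""
    max_sum = sum(n << r for r, n in enumerate(counts))
    return max_sum.bit_length()

def gpc_eval(inputs_by_rank):
    """Evaluate a GPC. inputs_by_rank[r] is the list of bits of rank r.

    Returns the output bits, least significant first.
    """
    counts = [len(bits) for bits in inputs_by_rank]
    n_out = gpc_output_width(counts)
    value = sum(sum(bits) << r for r, bits in enumerate(inputs_by_rank))
    return [(value >> k) & 1 for k in range(n_out)]
```

For a (3, 3; 4) GPC the largest possible sum is 3·1 + 3·2 = 9, which needs four output bits, while a (3; 2) counter sums at most 3 and needs two.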

GPC Implementation • [Figure: a GPC with K inputs and N outputs, implemented with K-LUTs]
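
One way to read the figure is that each GPC output bit is a Boolean function of the K inputs and can therefore occupy one K-LUT. The sketch below enumerates those truth tables under that assumption (the GPC has at most K inputs, and one LUT is used per output bit). The function name `gpc_lut_tables` and the pin ordering (rank-0 pins first) are hypothetical choices made for illustration.

```python
def gpc_lut_tables(counts):
    """Truth tables for a GPC with counts[r] input bits of rank r.

    Returns one table per output bit; tables[out][pattern] is the value of
    output bit `out` when input pin p carries bit p of `pattern`.
    """
    # One weight per input pin, rank-0 pins first (assumed ordering).
    weights = [1 << r for r, n in enumerate(counts) for _ in range(n)]
    k = len(weights)
    n_out = sum(weights).bit_length()
    tables = [[0] * (1 << k) for _ in range(n_out)]
    for pattern in range(1 << k):
        value = sum(w for pin, w in enumerate(weights) if (pattern >> pin) & 1)
        for out in range(n_out):
            tables[out][pattern] = (value >> out) & 1
    return tables
```

For a (3, 3; 4) GPC this yields four tables of 64 entries each, that is, four 6-input functions.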

Goal • How to best select GPC types and connect them to build a compressor tree [Figure: input bits arranged in columns by rank, ranks 0 to 3]

ILP Formulation • Objective Function • Minimizing Levels of GPCs • GPC Representation in ILP [Figure: a GPC annotated with rank labels k_i = 0, 1 and k_j = 0, 1, 2]

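ILP Formulation • Variables • p_{m,i,k_i} ∈ {0, 1} – True if there is a connection between the m-th input bit and an input of rank k_i of GPC_i. [Figure: example network with input bits m0 to m3, GPC0, GPC1, GPC2, and output bits n0 to n3; labeled connections include p_{0,0,0}, p_{1,0,1}, p_{2,1,0}, e_{0,2,1,0}, e_{1,2,0,1}, q_{0,0,0}, q_{2,1,1}, q_{1,2,2}, and D_{3,3}]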

ILP Formulation • Variables • q_{i,k_i,m} ∈ {0, 1} – True if there is a connection between the k_i-th output of GPC_i and an output bit of rank m. [Figure: same example network; the q connections shown are q_{0,0,0}, q_{2,1,1}, and q_{1,2,2}]

ILP Formulation • Variables • e_{i,j,k_i,k_j} ∈ {0, 1} – True if there is a connection from the k_i-th output of GPC_i to an input of rank k_j of GPC_j. [Figure: same example network; the e connections shown are e_{0,2,1,0} and e_{1,2,0,1}]

ILP Formulation • Variables • D_{i,j} ∈ {0, 1} – True if there is a direct connection from the i-th input bit to an output bit of rank j. [Figure: same example network; the direct connection shown is D_{3,3}]

ILP Formulation • Connection rules • Circuit I/Os • Each circuit input should be connected to either a GPC or the final adder • Each output rank should be derived K times (K = 3, since the final adder is a ternary adder) • GPC I/Os • Satisfying the number of allowable I/Os, considering input ranks • Wires • Satisfying the rank constraints of the source and destination of each wire
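
The connection rules can be made concrete with a checker that validates one candidate assignment of the p, e, q, and D variables. This is a sketch under stated assumptions, not the ILP itself and not the authors' constraint set: the name `check_connections` is hypothetical, each GPC is given an absolute base rank as a modeling shortcut, the output rule is read as "at most K bits per rank", and each GPC output is assumed to drive at most one wire.

```python
from collections import Counter

def check_connections(input_rank, gpcs, p, e, q, D, K=3):
    """Return a list of violated rules (empty means the wiring is legal).

    input_rank[m]        absolute rank of circuit input bit m
    gpcs[i]              (base_rank, in_caps, n_out); in_caps[k] is the
                         number of inputs of relative rank k (assumed model)
    p = {(m, i, ki)}     input bit m -> rank-ki input of GPC i
    e = {(i, j, ki, kj)} output ki of GPC i -> rank-kj input of GPC j
    q = {(i, ki, r)}     output ki of GPC i -> output bit of rank r
    D = {(m, r)}         input bit m -> output bit of rank r (direct)
    """
    errors = []
    # Circuit inputs: each feeds exactly one GPC input or the final adder.
    uses = Counter(m for m, _, _ in p) + Counter(m for m, _ in D)
    for m in range(len(input_rank)):
        if uses[m] != 1:
            errors.append(f"input {m} used {uses[m]} times")
    # Circuit outputs: at most K bits per rank (assumed reading of the rule).
    per_rank = Counter(r for _, _, r in q) + Counter(r for _, r in D)
    for r, n in per_rank.items():
        if n > K:
            errors.append(f"output rank {r} has {n} bits")
    # GPC inputs: do not exceed the pins available at each relative rank.
    pins = Counter((i, ki) for _, i, ki in p)
    pins += Counter((j, kj) for _, j, _, kj in e)
    for (i, k), n in pins.items():
        caps = gpcs[i][1]
        if k >= len(caps) or n > caps[k]:
            errors.append(f"GPC {i} rank-{k} inputs overloaded")
    # GPC outputs: must exist and drive at most one wire (assumption).
    outs = Counter((i, ki) for i, _, ki, _ in e)
    outs += Counter((i, ki) for i, ki, _ in q)
    for (i, k), n in outs.items():
        if k >= gpcs[i][2] or n > 1:
            errors.append(f"GPC {i} output {k} misused")
    # Wires: source and destination must share the same absolute rank.
    for m, i, ki in p:
        if input_rank[m] != gpcs[i][0] + ki:
            errors.append(f"rank mismatch on p{(m, i, ki)}")
    for i, j, ki, kj in e:
        if gpcs[i][0] + ki != gpcs[j][0] + kj:
            errors.append(f"rank mismatch on e{(i, j, ki, kj)}")
    for i, ki, r in q:
        if gpcs[i][0] + ki != r:
            errors.append(f"rank mismatch on q{(i, ki, r)}")
    for m, r in D:
        if input_rank[m] != r:
            errors.append(f"rank mismatch on D{(m, r)}")
    return errors
```

As a usage example, three rank-0 inputs feeding a (3; 2) counter at base rank 0, plus one rank-1 input wired directly to output rank 1, passes every check: `check_connections([0, 0, 0, 1], [(0, [3], 2)], {(0, 0, 0), (1, 0, 0), (2, 0, 0)}, set(), {(0, 0, 0), (0, 1, 1)}, {(3, 1)})` returns an empty list.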

ILP Formulation • ILP Improvement • Using the [Parandeh-Afshar et al., ASP-DAC 2008] heuristic to estimate the maximum number of GPCs at each level • A GPC on level L can only connect to inputs of GPCs on levels L+1 and L+2
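
The level restriction shrinks the set of GPC-to-GPC pairs for which e variables are needed. The sketch below counts the allowed pairs for a given number of GPCs per level. The function name `allowed_gpc_pairs` is hypothetical, and the per-level counts in the usage note are made-up illustration values, not output of the cited heuristic.

```python
def allowed_gpc_pairs(gpcs_per_level, max_skip=2):
    """List (i, j) GPC index pairs that may be wired from i to j.

    gpcs_per_level[L] is the (estimated) number of GPCs on level L.
    A GPC on level L may feed GPCs on levels L+1 .. L+max_skip only.
    """
    # Assign consecutive indices to GPCs, level by level.
    level_of = [lvl for lvl, n in enumerate(gpcs_per_level) for _ in range(n)]
    return [
        (i, j)
        for i, li in enumerate(level_of)
        for j, lj in enumerate(level_of)
        if 1 <= lj - li <= max_skip
    ]
```

With `gpcs_per_level = [3, 2, 1, 1]` there are 7 GPCs and 42 ordered pairs in total, but only 14 of them satisfy the level rule, so far fewer e variables need to be created.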

Experimental Methodology • CPLEX ILP Solver • Altera Stratix-II • 90nm CMOS Technology • Implementations of multi-input addition • Adder Tree – Ternary adder tree • State of the art for FPGAs • Heuristic – Mapping heuristic described in [13] • ILP – ILP formulation described here

Experimental Results (Delay) • ILP on average is: • 32% faster than Adder Tree • 5% faster than the Heuristic

Experimental Results (Area) • ILP on average consumes: • 3% less resources than Adder Tree • 13% less resources than Heuristic

Conclusion • Conventional wisdom has held that adder trees outperform compressor trees on FPGAs • Ternary adder trees were a major selling point of the Altera Stratix II architecture • Conventional wisdom is wrong! • GPCs map nicely onto LUTs • Compressor trees on FPGAs are faster than adder trees when built from GPCs
