Reconfigurable Computing

Reconfigurable Computing - Options in Circuit Design
John Morris
Chung-Ang University
The University of Auckland
[Photo: ‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia]

Serial Circuits
• Space efficient
• Sloooow
  • One bit of result produced per cycle
• Sometimes this isn’t a problem
  • Highly parallel problems
  • Search
    • Many operations on the same data stream
    • e.g. search a text database for many keywords in parallel
[Figure: a text stream at data rate x MB/s fans out to parallel comparators =key0?, ..., =keyi?, ..., =keyn?]
• Serial processing needs: 8x Mbits/s - Easy!
• Effective performance may require comparison with 1000’s of keys
  • Space for key circuits is critical!
  • A small, compact bit-serial comparator is ideal!
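A bit-serial comparator of the kind mentioned above might look like the following minimal sketch. The entity name serial_eq, the start strobe and the assumption that the key arrives one bit per clock alongside the data (for example from a circulating shift register) are illustrative choices, not taken from the slides.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY serial_eq IS
  PORT( clk, start : IN  std_logic;    -- start = '1' with the first bit of each word (assumed protocol)
        d, k       : IN  std_logic;    -- data bit and key bit for the same bit position
        eq         : OUT std_logic );  -- '1' while every bit seen so far has matched
END ENTITY serial_eq;

ARCHITECTURE rtl OF serial_eq IS
  SIGNAL match : std_logic := '0';
BEGIN
  PROCESS( clk )
  BEGIN
    IF rising_edge( clk ) THEN
      IF start = '1' THEN
        match <= d XNOR k;               -- first bit: begin a new comparison
      ELSE
        match <= match AND (d XNOR k);   -- later bits: any mismatch clears the flag
      END IF;
    END IF;
  END PROCESS;
  eq <= match;
END ARCHITECTURE rtl;
```

Each comparator costs one flip-flop and a couple of gates, which is why thousands of them can share one text stream.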

Serial Circuits
• Bit serial adder
[Figure: full adder (FA) with inputs a, b and cin; the sum and cout outputs are held in a 2-bit register driven by clock]

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;

    ENTITY serial_add IS
      PORT( a, b, clk : IN std_logic;
            sum, cout : OUT std_logic );
    END ENTITY serial_add;

    ARCHITECTURE df OF serial_add IS
      SIGNAL cint : std_logic;   -- carry held over to the next bit
    BEGIN
      PROCESS( clk )
      BEGIN
        IF clk'EVENT AND clk = '1' THEN   -- rising clock edge
          sum  <= a XOR b XOR cint;
          cint <= (a AND b) OR (b AND cint) OR (a AND cint);
        END IF;
      END PROCESS;
      cout <= cint;
    END ARCHITECTURE df;

Note: The synthesizer will infer registers (flip-flops) for the signals assigned inside the clocked process! It will recognize the IF clk'EVENT … pattern!
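The adder above has no way to clear the stored carry, so in simulation cint starts undefined and a second pair of operands would inherit the carry of the first. One possible variant is sketched below; the clr input and the entity name serial_add_clr are hypothetical additions, not part of the original design.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY serial_add_clr IS
  PORT( a, b, clk, clr : IN  std_logic;   -- clr is a hypothetical synchronous clear
        sum, cout      : OUT std_logic );
END ENTITY serial_add_clr;

ARCHITECTURE df OF serial_add_clr IS
  SIGNAL cint : std_logic := '0';
BEGIN
  PROCESS( clk )
  BEGIN
    IF rising_edge( clk ) THEN
      IF clr = '1' THEN
        sum  <= '0';                      -- forget the previous operands
        cint <= '0';
      ELSE
        sum  <= a XOR b XOR cint;         -- operands arrive least significant bit first
        cint <= (a AND b) OR (b AND cint) OR (a AND cint);
      END IF;
    END IF;
  END PROCESS;
  cout <= cint;
END ARCHITECTURE df;
```

Asserting clr for one cycle before the least significant bits arrive gives each addition a carry-in of 0.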

Multipliers
• ‘Long’ multiplication

          x x x x      multiplicand
          x x x x      multiplier
          x x x x      partial products
        x x x x
      x x x x
    x x x x
    x x x x x x x      product

In binary, the partial products are trivial: if multiplier bit = 1, copy the multiplicand, else 0. Use an ‘and’ gate!

Multipliers
• ‘Long’ multiplication

             a3 a2 a1 a0   multiplicand
             b3 b2 b1 b0   multiplier
             x  x  x  x    b0
          x  x  x  x       b1
       x  x  x  x          b2
    x  x  x  x             b3
    x  x  x  x  x  x  x    product

In binary, the partial products are trivial: if multiplier bit = 1, copy the multiplicand, else 0. Use an ‘and’ gate!
[Figure: ‘and’ gates combine a3, a2, a1, a0 with b0 to form the first row of partial products]
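The ‘and’ gate rule can be written as a pair of GENERATE loops. This is a minimal sketch: the entity name partial_products, the generic n and the flat pp vector (row j, column k stored at index j*n + k) are assumptions made for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY partial_products IS
  GENERIC( n : positive := 4 );
  PORT( a, b : IN  std_logic_vector( n-1 DOWNTO 0 );
        pp   : OUT std_logic_vector( n*n-1 DOWNTO 0 ) );  -- row j occupies bits j*n+n-1 DOWNTO j*n
END ENTITY partial_products;

ARCHITECTURE df OF partial_products IS
BEGIN
  rows : FOR j IN 0 TO n-1 GENERATE      -- one row per multiplier bit b(j)
    cols : FOR k IN 0 TO n-1 GENERATE    -- one AND gate per multiplicand bit a(k)
      pp( j*n + k ) <= a( k ) AND b( j );
    END GENERATE cols;
  END GENERATE rows;
END ARCHITECTURE df;
```

Row j still has to be weighted by 2^j when the rows are summed; here that shift is left to whatever adder structure consumes pp.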

Multipliers
• We can add the partial products with FA blocks
[Figure: array of full adders. a3 a2 a1 a0 and an initial 0 enter at the top; rows of four FAs are labelled b0, b1, b2; product bits p0, p1 are marked at the outputs]

Parallel Array Adder
• We can build this adder in VHDL with two GENERATE loops

    rows : FOR j IN 0 TO n-1 GENERATE       -- For each row
      cols : FOR k IN 0 TO n-1 GENERATE     -- Generate a row
        pjk : full_adder PORT MAP( … );
      END GENERATE cols;
    END GENERATE rows;

This part is straightforward!
… but you need to fill in the PORT MAP using internal signals!

    -- VHDL needs named array types; row_t and grid_t are illustrative names
    TYPE row_t  IS ARRAY( 0 TO n-1 ) OF std_logic;
    TYPE grid_t IS ARRAY( 0 TO n-1 ) OF row_t;
    SIGNAL pa, pb, cout : grid_t;
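One possible way to fill in the PORT MAP is sketched below. It is not necessarily the wiring intended on the slide: it assumes each row is a ripple-carry adder, replaces the first row by plain AND gates, and uses its own signal names (acc, c, pp) and entity names (full_adder ports s and cout, array_mult).

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY full_adder IS
  PORT( a, b, cin : IN  std_logic;
        s, cout   : OUT std_logic );
END ENTITY full_adder;

ARCHITECTURE df OF full_adder IS
BEGIN
  s    <= a XOR b XOR cin;
  cout <= (a AND b) OR (b AND cin) OR (a AND cin);
END ARCHITECTURE df;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY array_mult IS
  GENERIC( n : positive := 4 );
  PORT( a, b : IN  std_logic_vector( n-1 DOWNTO 0 );
        p    : OUT std_logic_vector( 2*n-1 DOWNTO 0 ) );
END ENTITY array_mult;

ARCHITECTURE struct OF array_mult IS
  TYPE grid_t IS ARRAY( 0 TO n-1 ) OF std_logic_vector( n DOWNTO 0 );
  SIGNAL acc : grid_t;  -- acc(j): running sum after row j, bit n is that row's carry out
  SIGNAL c   : grid_t;  -- c(j)(k): carry into column k of row j
BEGIN
  -- Row 0 is just the first partial product, so no adders are needed
  acc( 0 )( n ) <= '0';
  c( 0 )        <= (OTHERS => '0');          -- unused, tied off
  row0 : FOR k IN 0 TO n-1 GENERATE
    acc( 0 )( k ) <= a( k ) AND b( 0 );
  END GENERATE row0;

  rows : FOR j IN 1 TO n-1 GENERATE          -- for each adder row
    c( j )( 0 ) <= '0';                      -- no carry into the row
    cols : FOR k IN 0 TO n-1 GENERATE        -- generate a row
      SIGNAL pp : std_logic;
    BEGIN
      pp <= a( k ) AND b( j );               -- partial product bit
      fa : ENTITY work.full_adder( df )
        PORT MAP( a    => acc( j-1 )( k+1 ), -- previous sum, shifted right one place
                  b    => pp,
                  cin  => c( j )( k ),
                  s    => acc( j )( k ),
                  cout => c( j )( k+1 ) );
    END GENERATE cols;
    acc( j )( n ) <= c( j )( n );            -- the row's carry out becomes its top bit
    p( j-1 )      <= acc( j-1 )( 0 );        -- one product bit falls out of each row
  END GENERATE rows;

  p( 2*n-1 DOWNTO n-1 ) <= acc( n-1 );       -- the last row supplies the upper product bits
END ARCHITECTURE struct;
```

Because every row ripples its own carries, this form needs no extra adder at the bottom, at the cost of a long critical path.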

Multipliers
• We can add the partial products with FA blocks
• Optimization 1: Replace this row of FAs
• Time? What’s the worst case propagation delay?
[Figure: the array of full adders (inputs a3 a2 a1 a0 and 0; FA rows labelled b0, b1, b2; product bits p0, p1), with one row of FAs marked for replacement]
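One way to estimate the worst case, under assumptions that are not stated on the slide: the first row is replaced by plain AND gates, each of the remaining n-1 rows is an n-bit ripple-carry adder, and a full adder has the same delay t_FA from any input to either output. The longest path then ripples along the whole first adder row and passes through two more adders in each later row:

```latex
t_{\text{worst}} \;\approx\; t_{\text{AND}} + \bigl[\, n + 2(n-2) \,\bigr]\, t_{FA}
              \;=\; t_{\text{AND}} + (3n-4)\, t_{FA}, \qquad n \ge 2
```

For n = 4 this gives 8 full adder delays. A different row structure, such as carry save, changes the count.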

Multipliers
• We can add the partial products with FA blocks
• Try to use a more efficient adder in each row?
  • Carry select adder
• A simpler scheme uses a ‘carry save’ adder, which pushes the carry outs down to the next row!
• Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above!
[Figure: the array of full adders (inputs a3 a2 a1 a0 and 0; FA rows labelled b0, b1, b2; product bits p0, p1)]
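A single carry save row can be sketched as below. The entity name csa and the port names x, y, z, s, c are illustrative. The row has no carry chain: every column is an independent full adder, so its delay does not grow with n.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY csa IS
  GENERIC( n : positive := 4 );
  PORT( x, y, z : IN  std_logic_vector( n-1 DOWNTO 0 );   -- three operands of equal weight
        s       : OUT std_logic_vector( n-1 DOWNTO 0 );   -- sum bits, same weight as the inputs
        c       : OUT std_logic_vector( n-1 DOWNTO 0 ) ); -- carry bits, one place more significant
END ENTITY csa;

ARCHITECTURE df OF csa IS
BEGIN
  s <= x XOR y XOR z;
  c <= (x AND y) OR (y AND z) OR (x AND z);
END ARCHITECTURE df;
```

The two outputs satisfy x + y + z = s + 2c, so the next row must take c shifted left by one place, and the final s and c still need an ordinary carry-propagating adder.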

Multipliers - Optimization
• Multiplier delay is still long! What can we do?
• Use the ‘dot’ form to describe a multiplier

                · · · · · ·      multiplicand
                · · · · · ·      multiplier
                · · · · · ·      partial products
              · · · · · ·
            · · · · · ·
          · · · · · ·
        · · · · · ·
      · · · · · ·
    · · · · · · · · · · · ·      product

Each · represents a bit of an operand, partial product or product

Multipliers - Pipelined
• Pipelining will
  • increase throughput (results produced per second)
  • but also increase total latency (time to produce full result)
• Insert registers to capture partial sums
[Figure: dot diagram of the multiplier with registers inserted between the rows of partial sums]
Benefits
• Simple
• Regular
• Register width can vary
  • Need to capture operands also!
• Usual pipeline advantages
Inserting a register at every stage may not produce a benefit!
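A pipelined multiplier along these lines might be sketched as follows. The sketch assumes one register stage per multiplier bit and full-width registers in every stage; the names pipe_mult, ar, br and sr are illustrative.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY pipe_mult IS
  GENERIC( n : positive := 4 );
  PORT( clk  : IN  std_logic;
        a, b : IN  unsigned( n-1 DOWNTO 0 );
        p    : OUT unsigned( 2*n-1 DOWNTO 0 ) );
END ENTITY pipe_mult;

ARCHITECTURE rtl OF pipe_mult IS
  TYPE op_t  IS ARRAY( 0 TO n ) OF unsigned( n-1 DOWNTO 0 );
  TYPE sum_t IS ARRAY( 0 TO n ) OF unsigned( 2*n-1 DOWNTO 0 );
  SIGNAL ar, br : op_t;    -- operands travelling down the pipeline
  SIGNAL sr     : sum_t;   -- partial sums captured at each stage
BEGIN
  ar( 0 ) <= a;
  br( 0 ) <= b;
  sr( 0 ) <= (OTHERS => '0');

  stages : FOR j IN 0 TO n-1 GENERATE
    PROCESS( clk )
    BEGIN
      IF rising_edge( clk ) THEN
        ar( j+1 ) <= ar( j );              -- the operands must be captured too
        br( j+1 ) <= br( j );
        IF br( j )( j ) = '1' THEN         -- add partial product j if multiplier bit j is set
          sr( j+1 ) <= sr( j ) + shift_left( resize( ar( j ), 2*n ), j );
        ELSE
          sr( j+1 ) <= sr( j );
        END IF;
      END IF;
    END PROCESS;
  END GENERATE stages;

  p <= sr( n );                            -- valid n clock edges after a and b were applied
END ARCHITECTURE rtl;
```

A new pair of operands can enter on every clock, but each result appears n cycles later. The sketch does not exploit varying register widths, and registering every stage is the extreme case that the slide warns may not pay off.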

Multipliers - Pipelined
• Multiplier arrays need space!
  • $O(n^2)$ full adders: a considerable amount of space!
• Sequential multipliers use $O(n)$ space but $O(n)$ cycles!
  • Each cycle adds one partial product, $(a \wedge b_j)\,2^j$, to a running sum
[Figure: dot diagram of a sequential multiplier built around a single adder]
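A sequential shift-and-add multiplier can be sketched as below. The start/done handshake and the names seq_mult, m, q, acc and count are assumptions for illustration; the slides do not specify an interface.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY seq_mult IS
  GENERIC( n : positive := 8 );
  PORT( clk, start : IN  std_logic;                    -- pulse start for one clock to load a and b
        a, b       : IN  unsigned( n-1 DOWNTO 0 );
        p          : OUT unsigned( 2*n-1 DOWNTO 0 );
        done       : OUT std_logic );                  -- '1' when idle or finished
END ENTITY seq_mult;

ARCHITECTURE rtl OF seq_mult IS
  SIGNAL m, acc : unsigned( 2*n-1 DOWNTO 0 ) := (OTHERS => '0');
  SIGNAL q      : unsigned( n-1 DOWNTO 0 )   := (OTHERS => '0');
  SIGNAL count  : natural RANGE 0 TO n := 0;
BEGIN
  PROCESS( clk )
  BEGIN
    IF rising_edge( clk ) THEN
      IF start = '1' THEN
        m     <= resize( a, 2*n );     -- multiplicand, shifted left once per cycle
        q     <= b;                    -- multiplier, shifted right once per cycle
        acc   <= (OTHERS => '0');
        count <= n;
      ELSIF count > 0 THEN
        IF q( 0 ) = '1' THEN
          acc <= acc + m;              -- add this cycle's partial product
        END IF;
        m     <= shift_left( m, 1 );
        q     <= shift_right( q, 1 );
        count <= count - 1;
      END IF;
    END IF;
  END PROCESS;

  p    <= acc;
  done <= '1' WHEN count = 0 ELSE '0';
END ARCHITECTURE rtl;
```

The hardware is one adder and a few registers, all of width proportional to n, and the product is ready n clock cycles after the start pulse.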

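Multipliers - Tree
• Re-examine a full adder
  • Input - 3 bits (of equal weight)
  • Output - 2 bits (weighted sum of inputs)
• Inputs a, b, c are equivalent
[Figure: FA with inputs a, b, c and outputs cout, s]
• Use the dot notation: three dots (a, b, c) in one column reduce to two dots, cout with weight $2^1$ and s with weight $2^0$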
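Written as arithmetic on the bit values, the full adder's 3-to-2 reduction is:

```latex
a + b + c \;=\; 2\,c_{out} + s, \qquad a, b, c, c_{out}, s \in \{0, 1\}
```

For example, a = b = 1 and c = 0 gives 2 on the left, so cout = 1 and s = 0. Swapping any of the inputs leaves the outputs unchanged, which is why the three inputs are equivalent.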

Multipliers - Tree
• Summing the partial products
[Figure: dot diagram of the operands, partial products and product]
All these partial product bits are available immediately! Each one is $a_j b_k$: it depends on the inputs only

Multipliers - Tree
• Summing the partial products
So combine them vertically!
[Figure: dot diagram in which the partial product dots are grouped vertically by column, with the first level results drawn below]

Multipliers - Tree
• Second level combines first level results in the same way
• This produces a Wallace Tree
  • Fast - $O(\log n)$ but irregular!
[Figure: dot diagram showing the first level results and the second level results; the third level results are left blank: fill in for yourself]
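The logarithmic depth can be seen by counting rows. As a sketch, assume every level groups the current rows in threes and passes each group through a row of full adders, so that three rows become two. With $r_i$ rows at level $i$:

```latex
r_{i+1} \;=\; 2\left\lfloor \frac{r_i}{3} \right\rfloor + (r_i \bmod 3),
\qquad \text{e.g. } 6 \to 4 \to 3 \to 2,
\qquad \text{levels} \;\approx\; \log_{3/2}\!\frac{n}{2}
```

Six rows of partial products therefore need three levels. The two rows that remain are summed by one ordinary carry-propagating adder.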

High-Radix Multiplication
• Avoiding carry propagation
  • Form logic equations for adding more than 3 bits
  • A full adder performs a 3 → 2 reduction in the number of bits
  • Other reductions are possible
    • In particular a 7 → 3 reduction seems attractive
• In an FPGA implementation, the effectiveness of this will depend on the logic block capabilities
  • An LB (logic block) with 7 inputs and 3 outputs can implement a 7 → 3 reduction trivially!
  • Requires a 128 x 3 bit LUT (look-up table): $2^7$ entries of 3 bits each
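A 7 → 3 reduction simply counts how many of its seven equal-weight inputs are 1. A behavioural sketch follows; the entity name counter_7_3 and its port names are illustrative, and how a synthesizer maps it onto LUTs depends on the target device.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY counter_7_3 IS
  PORT( x : IN  std_logic_vector( 6 DOWNTO 0 );    -- seven bits of equal weight
        y : OUT std_logic_vector( 2 DOWNTO 0 ) );  -- their sum, 0 to 7, as a 3-bit number
END ENTITY counter_7_3;

ARCHITECTURE behav OF counter_7_3 IS
BEGIN
  PROCESS( x )
    VARIABLE cnt : natural RANGE 0 TO 7;
  BEGIN
    cnt := 0;
    FOR i IN x'RANGE LOOP
      IF x( i ) = '1' THEN
        cnt := cnt + 1;                            -- count the ones
      END IF;
    END LOOP;
    y <= std_logic_vector( to_unsigned( cnt, 3 ) );
  END PROCESS;
END ARCHITECTURE behav;
```

The three output bits carry weights 4, 2 and 1 relative to the input column, so each goes to a different column of the next level.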
