120 likes | 222 Views
A Timing-Driven Hybrid-Compression Algorithm for Faster Sum-of-Products. Sabyasachi Das Synplicity Inc Sunil P. Khatri Texas A&M University. e. d. f. b. c. a. q = c * d. p = a * b. q. p. z = p + q + e + f. z. What is a Sum-of-Product?.
E N D
A Timing-Driven Hybrid-Compression Algorithm for Faster Sum-of-Products Sabyasachi Das Synplicity Inc Sunil P. Khatri Texas A&M University
e d f b c a q = c * d p = a * b q p z = p + q + e + f z What is a Sum-of-Product? • IC block that performs addition of multiple product and sum terms • Computationally-intensive • Wide usage in DSP, Graphics, Microprocessors
Multiplication {assign z = a * b} MAC {assign z = (a * b) + c} 2-operand Addition {assign z = a + b} Squarer {assign z = a * a} Adder-Tree {assign z = a + b + c + d} Generalized SOP {assign z = (a * b) + (c * d) + (e * f) + g + h + k} Examples of Sum-of-Product Blocks
Structure of Sum-of-Products Inputs • Sum-of-Products block consists of 3 parts (written in the order of data-flow) • Partial Product Generator (PPGen) • Partial Product Reduction Tree (PPRT) • Final Carry-Propagation Adder (CPA) Partial Product Generator (PPGen) Partial Product Reduction Tree (PPRT) Final Carry Propagation Adder (CPA) Output
Partial Product Reduction Tree • In Partial Product Reduction Tree, total number of elements in each bit gets reduced to upto two • Partial Product Reduction Tree (PPRT) consumes >50% delay of the SOP block • Hence the performance of PPRT is crucial to the performance of the SOP block
ai bi Ci+1 Si ai bi ci Ci+1 Si Two Reduction Counters in PPRT • (2:2) Counter • Reduces 2 inputs (ai and bi) to 2 outputs (Si and Ci+1) • (3:2) Counter • Reduces 3 inputs (ai, bi and ci) to 2 outputs (Si and Ci+1)
ai bi ci di Ci+2Ci+1 Si 4:3 Reduction Counters • (4:3) Counter • 4 inputs to 3 outputs • The functionality of the Ci+2 is a 4-input AND gate. • Faster reduction at ith column • Produces element to (i+2)th column at an earlier time • Has larger area than other two counters Key idea is to use (4:3) counter as much as possible in conjunction with the (3:2) and (2:2) counters
Explanation of our approach • Perform column-wise reduction (LSB to MSB) • For each column (or BitSlice/BitCluster) • Sort inputs based on arrival time • Is (2:2) reduction fast? If yes, instantiate that • Else is (3:2) reduction fast? If yes, instantiate that • Else instantiate (4:3) reduction • After each reduction, re-sort the signals and continue
An example of our approach P07 P06 P05 P04 P03 P02 P01 P00 P17 P16 P15 P14 P13 P12 P11 P10 P27 P26 P25 P24 P23 P22 P21 P20 P37 P36 P35 P34 P33 P32 P31 P30 P47 P46 P45 P44 P43 P42 P41 P40 P57 P56 P55 P54 P53 P52 P51 P50 C02 C01 S00 C11 S10 C03 S01 C13 C12 S02
Results On an average, our approach produces about 3.5% speed improvement with 4.3% area penalty
Summary • A 4:3 reduction counter is designed • Reduces elements in the given column at a faster pace • Produces an element to the (i+2)th column at an earlier time • 4:3 reduction counter is used extensively (in conjunction with the existing 3:2 and 2:2 counters) • A timing-driven algorithm selects the correct type of counter that needs to be instantiated • On an average, 3.5% improvement in speed with 4.3% area penalty.