IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005

Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)

Outline
• Introduction
• Related Work
• Polynomial transformation
• Common subexpression elimination
• Results
• Conclusions

Introduction
• Multiplications by constants are encountered in many application areas
  • DSP transforms in audio, video, and image processing (DFT, DCT, IDCT, etc.)
  • Filtering operations in communication (FIR, IIR filters)
  • Multiple Input Multiple Output (MIMO) systems
  • Polynomials in computer graphics

Introduction
• Multiplication is expensive in hardware
• Decompose constant multiplications into shifts and additions
  • 13*X = (1101)_2*X = X + X<<2 + X<<3
• Signed digits can reduce the number of additions/subtractions
  • Canonical Signed Digits (CSD) (Knuth'74)
  • (57)_10 = (0111001)_2 = (100-1001)_CSD
• Further reduction is possible by common subexpression elimination
  • Up to 50% reduction (R. Hartley, TCS'96)
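The CSD recoding on this slide can be sketched in a few lines. This is a minimal illustration and not code from the talk: the helper names `to_csd` and `shift_add` are hypothetical, and the recoding rule is the usual non-adjacent-form construction (digit +1 when n mod 4 is 1, digit -1 when n mod 4 is 3).

```python
def to_csd(n: int) -> list[int]:
    """CSD digits of a non-negative integer, most significant digit first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]


def shift_add(x: int, digits: list[int]) -> int:
    """Multiply x by the constant encoded in `digits` using only shifts and +/-."""
    acc = 0
    for pos, d in enumerate(reversed(digits)):  # pos is the shift amount
        if d > 0:
            acc += x << pos
        elif d < 0:
            acc -= x << pos
    return acc


csd57 = to_csd(57)
print(csd57)                # [1, 0, 0, -1, 0, 0, 1], i.e. 64 - 8 + 1
print(shift_add(3, csd57))  # 171, computed with two add/subtract operations
```

Plain binary 57 = (111001)_2 has four non-zero digits (three additions), while the CSD form has three non-zero digits (two additions/subtractions).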

Introduction
• Common subexpressions = common digit patterns
  • F1 = 7*X = (0111)_2*X = X + X<<1 + X<<2
    F2 = 13*X = (1101)_2*X = X + X<<2 + X<<3
    Cost: 4+, 4<<
  • Common pattern "0101" => D1 = X + X<<2
    F1 = D1 + X<<1
    F2 = D1 + X<<3
    Cost: 3+, 3<<
• Good for a single variable: FIR filters (transposed form)
• Multiple variables? (DFT, DCT, etc.)
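The operation counts on this slide can be checked with a small sketch. The function names `direct` and `shared` are hypothetical; the arithmetic is taken directly from the expressions for F1, F2 and D1.

```python
def direct(x: int) -> tuple[int, int]:
    """F1 and F2 computed independently: 4 additions, 4 shifts in total."""
    f1 = x + (x << 1) + (x << 2)   # 7*x
    f2 = x + (x << 2) + (x << 3)   # 13*x
    return f1, f2


def shared(x: int) -> tuple[int, int]:
    """F1 and F2 sharing D1: 3 additions, 3 shifts in total."""
    d1 = x + (x << 2)              # common pattern "0101" = 5*x
    f1 = d1 + (x << 1)             # 7*x
    f2 = d1 + (x << 3)             # 13*x
    return f1, f2


for x in range(-8, 9):
    assert direct(x) == shared(x) == (7 * x, 13 * x)
```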

Related Work
• Simple bipartite matching (Potkonjak et al., TCAD'95)
  • (10101) and (01101) => common pattern = "101"
  • (10010) and (010010) => cannot detect pattern "1001"
• Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
  • (10010) and (010010) => common pattern "1001"
• Exhaustive enumeration of all digit patterns (Pasko et al., TCAD'99)
  • (1011) => "0011", "1001", "1010", "0101", "1011"

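Related Work
• Extending techniques for multiple variables (Potkonjak et al., TCAD'95)
  \[ \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \]
  [Figure: all distinct S_ij X_j and C_ik D_k; outputs Y1, Y2, Y3]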

Related Work
• Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP'04)
  • Polynomial transformation of linear systems
  • Uses rectangle covering methods
• Cannot find subexpressions with reversed signs
  • e.g. (X1 – X2<<1) ≠ (X2<<1 – X1)
  • A common occurrence when signed digits are used
• Rectangle covering has exponential complexity
• Is there a method to overcome these limitations?

Related Work
• Algebraic methods in multi-level logic synthesis (MLLS)
  • Reduce the literal count in a set of Boolean expressions
  • Factoring, decomposition: established algebraic techniques
  • Typically used for thousands of variables and literals
• Apply these methods to optimize linear systems?
  D1 = X1 + X2<<2
  Y1 = D1 + D1<<3 + X1<<3
  Y2 = D1 + X2<<2

Linear systems and polynomial transformation
• View linear systems as a set of arithmetic expressions
  • Expressions consist of +, -, << operators
• Develop a methodology for extracting common subexpressions
• Polynomial formulation: C × X = Σ_i (±X × L^i), where L^i stands for a left shift by i
  (14)_10 × X = (1110)_2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL
  = (100-10)_CSD × X = XL^4 – XL

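Linear Systems and polynomial transformation
• H.264 integer transform
  \[ \begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} \]
• Decomposing constant multiplications
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0<<1 + X1 - X2 - X3<<1
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1<<1 + X2<<1 - X3
  Cost: 12+, 4<<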

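Linear Systems and polynomial transformation
• H.264 integer transform
  \[ \begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} \]
• Polynomial transformation
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0L + X1 - X2 - X3L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1L + X2L - X3
  Cost: 12+, 4<<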
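The polynomial transformation of the H.264 matrix can be reproduced with a short sketch. The representation of a term as a `(sign, variable index, exponent of L)` tuple and the helper names `to_terms` and `show` are assumptions made for illustration. The sketch expands each coefficient in plain binary rather than CSD, which gives the same result for the coefficients 1 and 2 used here.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def to_terms(row):
    """Expand one matrix row into (sign, variable index, exponent of L) terms."""
    terms = []
    for j, c in enumerate(row):
        sign = 1 if c > 0 else -1
        mag, shift = abs(c), 0
        while mag:
            if mag & 1:                      # one term per non-zero binary digit
                terms.append((sign, j, shift))
            mag >>= 1
            shift += 1
    return terms


def show(terms):
    """Render terms in the slide notation, e.g. X0L + X1 - X2 - X3L."""
    out = ""
    for k, (s, j, e) in enumerate(terms):
        if k == 0:
            op = "-" if s < 0 else ""
        else:
            op = " - " if s < 0 else " + "
        out += f"{op}X{j}" + ("" if e == 0 else "L" if e == 1 else f"L^{e}")
    return out


system = [to_terms(row) for row in H264]
for i, terms in enumerate(system):
    print(f"Y{i} = {show(terms)}")

adds = sum(len(t) - 1 for t in system)
shifts = sum(1 for t in system for _, _, e in t if e > 0)
print(adds, shifts)  # 12 4
```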

Fx algorithm
• Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD'92)
  • Popularly known as the Fast-Extract (Fx) algorithm
• Expression f = gh + r
  • g = (ab + c) => double-cube divisor
  • g = ab => single-cube divisor
• Fx algorithm for linear systems?

Two-term divisors
• Obtained from every pair of terms in each expression
  • Divide by the minimum exponent of L
  • e.g. F = X1 + X2L + X3L^3
  • {+X2L, +X3L^3}: divide by L => (X2 + X3L^2)
  • Divisors = (X1 + X2L), (X1 + X3L^3), (X2 + X3L^2)
• Two divisors intersect if they are the same up to a sign reversal and
  • The terms involved are distinct
  • (X1 – X2L) ∩ (X1 – X2L) ≠ φ
  • (X1 – X2L) ∩ (–X1 + X2L) ≠ φ (reversed signs allowed!)
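Divisor generation for the example F on this slide can be sketched as follows. The `(sign, variable, exponent)` term encoding and the names `canonical` and `two_term_divisors` are assumptions for illustration. Sign-reversed pairs are mapped to the same divisor by forcing the first term to be positive and returning the factored-out sign separately.

```python
from itertools import combinations

Term = tuple[int, str, int]  # (sign, variable, exponent of L)


def canonical(a: Term, b: Term):
    """Return (divisor, sign) for the pair of terms a, b.

    The pair is divided by the minimum exponent of L and, if needed, negated
    so that the first term is positive; `sign` records that negation.
    """
    m = min(a[2], b[2])
    a, b = sorted((a, b), key=lambda t: (t[1], t[2]))
    sign = a[0]
    divisor = ((1, a[1], a[2] - m), (sign * b[0], b[1], b[2] - m))
    return divisor, sign


def two_term_divisors(terms: list[Term]):
    """One divisor for every pair of terms in the expression."""
    return [canonical(a, b) for a, b in combinations(terms, 2)]


F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]  # F = X1 + X2L + X3L^3
for divisor, sign in two_term_divisors(F):
    print(divisor, sign)

# (X1 - X2L) and (-X1 + X2L) share one divisor and differ only in sign.
print(canonical((1, "X1", 0), (-1, "X2", 1)))
print(canonical((-1, "X1", 0), (1, "X2", 1)))
```

The three divisors printed for F correspond to (X1 + X2L), (X1 + X3L^3) and (X2 + X3L^2).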

Two-term divisors
• Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
• Many divisors have intersections; which one to choose?
  • Greedily select the divisor with the largest number of intersections
• Selecting a divisor changes the expressions
  • Perform concurrent decomposition of the expressions

Algorithm (Step 1)
• Creating the set of divisors {Divisors}
  {Divisors} = φ;
  for each expression Pi {
    {Dnew} = Divisors for Pi;
    {Divisors} = {Divisors} ∪ {Dnew};
    Update frequency statistics of {Divisors};
  }

Algorithm (Step 2)
• Common subexpression elimination
  {Divisors} = Set of all 2-term divisors;
  while (intersections present) {
    Find Best_Divisor in {Divisors};
    {T} = Set of terms involved in intersection;
    {D} = Set of divisors involving any term in {T};
    {Divisors} = {Divisors} – {D};
    Rewrite Expressions;
    {Dnew} = New Divisors involving new terms;
    {Divisors} = {Divisors} ∪ {Dnew};
  }
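The two steps can be sketched together in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: terms are `(sign, variable, exponent of L)` tuples, all function names are hypothetical, and the divisor set is rebuilt from scratch on every pass instead of being updated incrementally as in the pseudocode. Ties between equally frequent divisors are broken by encounter order, so the divisors may be numbered differently from the slides.

```python
from collections import defaultdict
from itertools import combinations


def canonical(a, b):
    """Normalize a pair of terms to (divisor, sign, shift)."""
    m = min(a[2], b[2])
    a, b = sorted((a, b), key=lambda t: (t[1], t[2]))
    s = a[0]
    return ((1, a[1], a[2] - m), (s * b[0], b[1], b[2] - m)), s, m


def best_divisor(exprs):
    """Greedy choice: the divisor with the most non-overlapping occurrences."""
    occurrences = defaultdict(list)
    for name, terms in exprs.items():
        for i, j in combinations(range(len(terms)), 2):
            div, s, m = canonical(terms[i], terms[j])
            occurrences[div].append((name, i, j, s, m))
    best, best_hits = None, []
    for div, hits in occurrences.items():
        used, chosen = set(), []
        for name, i, j, s, m in hits:
            if (name, i) not in used and (name, j) not in used:
                used.update({(name, i), (name, j)})
                chosen.append((name, i, j, s, m))
        if len(chosen) > len(best_hits):
            best, best_hits = div, chosen
    return best, best_hits


def eliminate(exprs):
    """Iteratively extract two-term divisors until none occurs twice."""
    exprs = {name: list(terms) for name, terms in exprs.items()}
    divisors = {}
    while True:
        div, hits = best_divisor(exprs)
        if len(hits) < 2:
            return divisors, exprs
        d = f"D{len(divisors)}"
        divisors[d] = list(div)
        per_expr = defaultdict(list)
        for name, i, j, s, m in hits:
            per_expr[name].append((i, j, s, m))
        for name, found in per_expr.items():
            dropped = {k for i, j, _, _ in found for k in (i, j)}
            kept = [t for k, t in enumerate(exprs[name]) if k not in dropped]
            # Each occurrence becomes a single term: sign * D * L^shift.
            exprs[name] = kept + [(s, d, m) for _, _, s, m in found]


def cost(divisors, exprs):
    """(additions/subtractions, shifts) of the rewritten system."""
    all_terms = list(divisors.values()) + list(exprs.values())
    adds = sum(len(t) - 1 for t in all_terms)
    shifts = sum(1 for t in all_terms for _, _, e in t if e > 0)
    return adds, shifts


h264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}
divisors, rewritten = eliminate(h264)
print(cost(divisors, rewritten))  # (8, 2)
```

On the H.264 integer transform this sketch extracts four divisors and reaches 8 additions/subtractions and 2 shifts, starting from 12 and 4.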

Algorithm complexity
• M×M constant matrix; N digits of precision
  [Figure: M×M matrix of N-digit constants; first row Y0 = (1111 1111 1011 1001), last row YM-1 = (1111 1110 0011 1010)]
  Y0 = X0 + X0L + ... + XM-1L^3 + XM-1
• O(MN) terms in each of the M expressions => O(M^2N^2) divisors
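The counts on this slide follow from a short calculation, sketched here under the worst-case assumption that every digit of every constant is non-zero. Reading the O(M^2N^2) figure as a per-expression count is an interpretation of the slide, not something it states explicitly.

```latex
\[
\text{terms per expression} \le M \cdot N = O(MN)
\]
\[
\text{two-term divisors per expression} \le \binom{MN}{2} = \frac{MN\,(MN-1)}{2} = O(M^2N^2)
\]
\[
\text{divisor instances over all } M \text{ expressions} \le M \cdot O(M^2N^2) = O(M^3N^2)
\]
```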

Algorithm (Step 1): complexity
• Creating the set of divisors {Divisors}
• Complexity annotations on the slide: O(M^2N^2) distinct divisors; O(M^2N^2); O(M^3N^2)

Algorithm (Step 2): complexity
• Common subexpression elimination
• Complexity annotations on the slide: O(M^2N^2); O(M^2N^2)

Algorithm
• H.264 example
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0L + X1 - X2 - X3L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1L + X2L - X3
  >> Select D0 = (X0 + X3)

Algorithm
• H.264 example
  Y0 = D0 + X1 + X2
  Y1 = X0L + X1 - X2 - X3L
  Y2 = D0 - X1 - X2
  Y3 = X0 - X1L + X2L - X3
  >> Select D1 = (X1 – X2)

Algorithm
• H.264 example
  Y0 = D0 + X1 + X2
  Y1 = X0L + D1 - X3L
  Y2 = D0 - X1 - X2
  Y3 = X0 - D1L - X3
  >> Select D2 = (X1 + X2)

Algorithm
• H.264 example
  Y0 = D0 + D2
  Y1 = X0L + D1 - X3L
  Y2 = D0 - D2
  Y3 = X0 - D1L - X3
  >> Select D3 = (X0 – X3)

Final Implementation
• Extracting 4 divisors
  D0 = X0 + X3    Y0 = D0 + D2
  D1 = X1 – X2    Y1 = D1 + D3L
  D2 = X1 + X2    Y2 = D0 – D2
  D3 = X0 – X3    Y3 = D3 – D1L
• Cost: 8+, 2<< (original: 12+, 4<<; rectangle covering: 10+, 3<<)

Experimental Setup
• Goal
  • Reduction in the number of additions/subtractions
  • Effect on area/latency after synthesis
  • Simulate designs to estimate power consumption
• Transforms: DCT, IDCT, DFT, DST, DHT
  • 8x8 constant matrices
  • 16 digits of precision (CSD representation)
• Compare with
  • Potkonjak (TCAD'95)
  • RESANDS (Nguyen et al., TVLSI 2000)
  • Rectangle covering (A. Hosangadi et al., ASAP'04)

Experimental Results
[Results table not recovered; run times shown on the slide: 0.81 s and 0.08 s]

Experimental results
• Synthesis results (minimum latency constraints)
[Chart not recovered; legend: (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE]

Experimental results
• Power consumption
[Chart not recovered; legend: (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE]

Conclusions
• A new technique for eliminating common subexpressions in linear systems
• Fewer operations than known methods
• Much faster than rectangle covering
• Combine with scheduling on given resources

Thank you
• Questions?
