420 likes | 530 Views
Energy Efficient Hardware Synthesis of Polynomial Expressions 18 th International Conference on VLSI Design. Anup Hosangadi Ryan Kastner ECE Department, UCSB. Farzan Fallah Advanced CAD Research Fujitsu Labs of America. Outline. Introduction Related Work Problem formulation
E N D
Energy Efficient Hardware Synthesis of Polynomial Expressions18th International Conference on VLSI Design Anup Hosangadi Ryan Kastner ECE Department, UCSB Farzan Fallah Advanced CAD Research Fujitsu Labs of America
Outline • Introduction • Related Work • Problem formulation • Algorithms for optimizing polynomials • Experimental results • Conclusions
Introduction • Embedded system applications need to compute polynomial expressions • Continuous functions can be approximated by Taylor Series • Adaptive (polynomial) filters • Polynomial interpolation/extrapolation in Computer Graphics • Encrpytion
Introduction • Commonly occuring computations implemented in hardware • More flexibility than processor architecture • NPAs (Hardware accelarators) in PICO project • Custom Instructions (Tensilica) • Upto 100 times improvement over processor implementation (Kastner et.al TODAES’02) • Develop techniques for reducing power consumption
Related Work (Behavioral transforms) • Power consumption depends on many factors • Reducing number of operations • Hardware: (Nguyen and Chatterjee TVLSI’00) • Software: (I.Hong et.al TODAES’99) • Voltage reduction after speedup transformations • Retiming, Pipelining, Algebraic restructuring (Chandrakasan et. al TCAD’95)
Related Work • Scheduling and resource allocation • Shutting down unused resources (Monteiro et. al. DAC 96) • Allocation of registers, functional units and interconnects (A.Raghunathan et. al ICCD’94) • Multiple Vdd scheduling • Assigning supply voltage to each operation in CDFG (M.Chang and M.Pedram TVLSI’97)
Related Work • Switching power is proportional to number of operations • Multiplications are expensive in Embedded systems • Average 40 times more power than addition at 5V (V.Krishna et. al, VLSI Design 1999) • Careful optimization of expressions is therefore necessary to save power
Reducing operations in polynomial expressions • No good tool for polynomials • Designers rely on hand optimized libraries • Conventional compiler techniques: CSE and Value numbering not suited for polynomials. • Horner form: most popular representation • anxn + a1xn-1 + ….an-1x + a0 = (…((anx + an-1)x + an-2)x + ..a1)x + a0 • Not good for multivariate polynomials • Only a single polynomial expression at a time
Comparison with Horner form • Quartic-spline polynomial (3-D graphics) P = zu4 + 4avu3 + 6bu2v2 + 4uv3w + qv4 • Horner form (from MapleTM) P = zu4 + (4au3 + (6bu2 + (4uw + qv)v)v)v (17 multiplications) • Proposed algebraic method: d1 = v2 ; d2 = d1*v P = u3(uz + ad2) + d1( qd1 + u(wd2 + 6bu) ) (11 multiplications)
Related Work (Polynomial Expressions • Expression Factorization (M.A. Breuer JACM’69) • Allows only one kind of operator at a time • Using Symbolic Algebra (M.A.Peymandoust, De Micheli) • Mapping polynomial datapaths to libraries (DAC’01) • Low power embedded software (DATE’02) • Results depend heavily on set of library elements eg. (a2 – b2) = (a+b)(a-b) iff (a+b) or (a-b) is a library element • Manipulates only a single expression at a time F1 = A+ B + C + D; F2 = A + P + D; => Extract (A + D)
Motivating Example • Consider set of expressions • Using CSE 16 multiplications and 4 additions/subtractions 12 multiplications and 4 additions/subtractions
Motivational Example • Using Horner transform • Using our algebraic technique 12 multiplications and 4 additions/subtractions 7 multiplications and 3 additions/subtractions
Introduction to algebraic technique for redundancy elimination • Algebraic techniques in multi-level logic synthesis (MLLS) • Decomposition, factoringreduce number of literals • Distill and Condense use Rectangle Covering methods • Polynomial Expressions (Our Technique) • Factoring, Single term common subexpressions reduces number of multiplications • Multiple term common subexpressions reduces number of additions and possibly multiplications • Key Differences (Generalization to handle higher orders) • Kernelling techniques • Finding single cube intersections
Introduction to our technique(Outline) • Find a subset of all possible subexpressions (kernel generation) • Transformation of Polynomial Expressions • Problem formulation • Extract multiple term common subexpressions and factors • Extract single term common factors
Introduction to our technique • Terminology • Literal: A variable or a constant eg. a,b,2,3.14 • Cube: Product of literals e.g. +3a2b, -2a3b2c • SOP: Sum of cubes e.g. +3a2b – 2a3b2c • Cube-free expression: No literal or cube can divide all the cubes of the expression • Kernel: A cube free sub-expression of an expression, e.g. 3 – 2abc • Co-Kernel: A cube that is used to divide an expression to get a kernel, e.g. a2b
Introduction to our Technique • Matrix Representation of Polynomial Expressions • F = x3y – xy2z is represented by • Each row represents a product term • Each column represents a variable/constant • Each element (i,j) represents power of variable j in term i
Generation of Kernels (example) • P1 = x3y + x2y2z {L} = {x,y,z} • Divide by x: Ft = P1/x = x2y + xy2z
Generation of Kernels (example) Ft = P1/x = x2y + xy2z • C = Biggest Cube dividing all cubes of Ft / C = C = 1 1 0 = xy
Generation of Kernels (example) • Obtain Kernel: F1 = Ft/C = (x2y + xy2z)/(xy) = ( x + yz) • Obtain Co-Kernel D1 = x*(xy) = x2y • No kernels within F1. Go back to P1 • P1 = x3y + x2y2z • Divide now by next variable y Ft = x3 + x2yz • C = x2 • But (x < y) ε C Stop Here, to avoid repeating same kernel Ft/C = (x + yz) • No more kernels extracted • Record kernel F1 = P1 with co-kernel ‘1’
Concept of kernels and co-kernels • Theorem: Two expressions f and g can have a multiple term common subexpression iff there are 2 kernels Kf and Kg having a multiple term intersection • Detection of multiple term common subexpressions by intersection of sets of kernels • Each co-kernel : kernel pair represents a possible factorization • e.g. x3y + x2y2z = [x2y](x + yz) • Set of kernels a subset of all possible subexpressions
All Kernels and Co Kernels Which kernels to choose?
Kernel Cube Matrix (KCM) • One row for each Kernel generated • One column for each distinct kernel cube • Each non-zero element represents a term x3y
Finding Kernel Intersections(Distill Algorithm) • Each kernel intersection or factor appears as a rectangle • Rectangle: Set of rows and columns such that all elements are ‘1’ • Value of a rectangle = Weighted sum of the energy savings of the different operations • Goal: Maximum valued rectangular covering of KCM • Greedy heuristic: covering by prime rectangles
Modeling value function of a rectangle • Formula for weighted sum of energy savings on selection of a rectangle R = # of rows ; C = # of columns M(Ri) = # of multiplications in row (co-kernel) i. M(Ci) = # of multiplications in column (kernel-cube) i m = ratio of average energy consumption of multiplication to addition in the target library Value =
4x + 4yz = 4d1 d1 = (x + yz) x3y + x2y2z = x2yd1 Saves 5 multiplications and 1 addition Value = 201 units (m = 40) Distill Algorithm
Distill Algorithm 4xy – x2y = xyd2 d2 = 4 – x Saves 2 multiplications Value = 80 Remove covered terms
Distill Algorithm • Distill algorithm exits after no more kernel intersections can be found P1 = x2yd1 d1 = x + yz P2 = 4d1 – xyz d2 = 4 - x P3 = xyd2 Can further optimize by finding single cube intersections
Finding single cube intersections (Condense algorithm) • Form Cube Literal Matrix (CLM) • One row for each cube • One column for each literal • Eg. 2 cubes F1 = a4b3c; and F2 = a2b4c2
Finding single cube intersections (Condense algorithm) • Each (single term) common subexpression appears as a rectangle. • Rectangle: Set of rows and columns where all elements are non-zero • Value of a rectangle is number of multiplications saved by selecting it • C = cube corresponding to the rectangle Value = Rows*( (ΣC[i] ) -1) • Maximum valued rectangular covering will give minimum number of multiplications • Use greedy iterative covering by prime rectangles
Cube Literal Matrix (Condense Algorithm) CLM for our example after Distill algorithm Save 2 multiplications by extracting xy C = xy
Condense Algorithm Extracting xy No more favorable cube intersections found
Final Implementation • Total 7 multiplications, 3 additions/subtractions • Savings of 5 multiplications, 1 addition/subtraction compared to CSE • Impossible to obtain such results using conventional techniques
Experimental setup • Polynomials used in Computer graphics and Signal Processing • 1.0 µ technology library, characterized for power consumption • Synthesized using Synopsys Design CompilerTM • Min Hardware constraints (1 adder + 1 multiplier) • Med Hardware constraints (Max 4 multipliers)
Experimental setup • Estimated power using Synopsys Power CompilerTM for random inputs, using RTL Simulator (VCSTM) • Compared energy consumption with CSE and Horner form • Compared energy after voltage scaling
Conclusions • Technique to reduce number of operations in polynomial expressions • Large savings in energy consumption observed over CSE and Horner methods • Need to consider scheduling and resource allocation to obtain further improvements
Conclusions • Thank you!! • Questions ???
Finding Kernel Intersections(Distill Algorithm) • Worst case scenario for Distill algorithm • Number of prime rectangles exponential in number of rows/columns • Heuristic methods to find best prime rectangle • In practice polynomial expressions are not so large