L8 : A Survey on Low Power Multiplication / Accumulation
Contents • Introduction • [1] Interlaced Accumulation Programming • [2] Operand Swapping • [3] Selective Coefficient Negation • [4] Coefficient Optimization • [5] Coefficient Reordering • Conclusion & Future Works
Power Distribution of a DSP • Hirotsugu [ISLPED ‘96] : For each test program • [Figure: bar chart of normalized power consumption (%), scale 10 to 40, with the variation due to data dependency marked on each bar. Blocks: Pin, Bus, Misc., Control, Memory, Clocking, Data Op., Address Generation, Peripheral]
Multiplication and Accumulation: MAC • Major operation in DSP • [ Modified Booth Encoding ] One of 0, X, -X, 2X, -2X based on each 2 bits of Y • MUL > (5 * ALU) • [Figure: MAC datapath with inputs X and Y, MULT (CSA, CPA), PR, ALU, ACC]
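The Booth selection above can be sketched in a few lines. This is the textbook radix-4 form, which is an assumption about the variant the slide means: each 2-bit group of Y is examined together with the top bit of the group below it, and yields one digit in {-2, -1, 0, 1, 2} that selects -2X, -X, 0, X or 2X. The function names are illustrative only.

```python
def to_signed(y: int, bits: int = 16) -> int:
    """Interpret the low `bits` bits of y as a two's-complement value."""
    y &= (1 << bits) - 1
    return y - (1 << bits) if y >> (bits - 1) else y


def booth_radix4_digits(y: int, bits: int = 16) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Digits are returned least significant first; digit d at position i
    contributes d * 4**i, so the digits sum back to the signed value of y.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    y &= (1 << bits) - 1
    digits = []
    prev = 0  # implicit bit below the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # selects -2X, -X, 0, X or 2X
        prev = b1
    return digits


if __name__ == "__main__":
    for y in (0x0001, 0x7FFF, 0x8000, 0x5555):
        d = booth_radix4_digits(y)
        # Sanity check: the recoded digits represent the same number.
        assert sum(di * 4**i for i, di in enumerate(d)) == to_signed(y)
        print(f"{y:#06x}: {d}")
```

Under this recoding 0x8000 produces a single nonzero digit, while 0x5555 produces a nonzero digit in every position.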
Power Consumption by a Multiplier • Power Consumption by Data Dependency • X : Energy per cycle (nJ) • Y : # of input transitions • Little Correlation • Average = 7nJ • [Figure: two scatter plots, 36-bit ALU and 16x16 MPY]
Power Consumption by a Multiplier • What is an important input in terms of power ? • [Figure: two plots of energy (nJ); panels: 0x8000 x (random) and (random) x 0x8000; annotated averages: 1nJ and 5nJ]
Power Consumption by a Multiplier • Booth encoding is a significant overhead. • [Figure: two plots of energy (nJ); panels: 0x5555 x (random) and (random) x 0x5555; annotated averages: 4nJ and 6nJ]
Interlaced Accumulation Programming (1/2) • Hirotsugu [ISLPED ‘96] • 3-tap FIR filter (n=3)
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k+1) = C0 * X(k+1) + C1 * X(k) + C2 * X(k-1)
Y(k+2) = C0 * X(k+2) + C1 * X(k+1) + C2 * X(k)
• [Figure: the three equations are shown twice, with the multiplications numbered in execution order. Order labels on the first copy: 2 3 1 / 5 6 4. Order labels on the second copy: 4 6 2 / 3 5 1]
Interlaced Accumulation Programming (2/2) • More than 40% of the power is saved by • Keeping one operand of the multiplier constant • X is kept : 7nJ -> 5 ~ 6nJ • Y is kept : 7nJ -> 1 ~ 3nJ • Halving the number of memory accesses • Traditional : two memory operands • Interlaced : one memory operand • ( data re-use through a temporary register )
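A minimal sketch of the two schedules, assuming two accumulators and one temporary register; this is one reading of the technique, not the schedule from the cited paper. The function names and the load counter are hypothetical and only illustrate the operand traffic: the interlaced loop holds each coefficient for two consecutive multiplications and reuses the previously fetched sample.

```python
def fir_traditional(c, x, k):
    """Compute Y(k), then Y(k+1), one output at a time.

    Every multiplication fetches both a coefficient and a sample,
    so each one costs two memory operands.
    """
    loads = 0
    outputs = []
    for out in (k, k + 1):
        acc = 0
        for i, ci in enumerate(c):
            acc += ci * x[out - i]
            loads += 2  # coefficient + sample
        outputs.append(acc)
    return outputs[0], outputs[1], loads


def fir_interlaced(c, x, k):
    """Compute Y(k) and Y(k+1) together with two accumulators.

    Each coefficient is fetched once and kept at the multiplier input
    for two multiplications; the older sample is kept in a temporary
    register and reused as the newer sample of the next tap.
    """
    loads = 1           # prime the temporary register with X(k+1)
    newer = x[k + 1]
    acc0 = acc1 = 0
    for i, ci in enumerate(c):
        older = x[k - i]
        loads += 2      # one coefficient + one new sample per tap
        acc1 += ci * newer  # term of Y(k+1): C_i * X(k+1-i)
        acc0 += ci * older  # term of Y(k):   C_i * X(k-i)
        newer = older       # data re-use via the temporary register
    return acc0, acc1, loads


if __name__ == "__main__":
    c = [3, -1, 2]
    x = [5, 7, -2, 4, 1]
    print(fir_traditional(c, x, 2))
    print(fir_interlaced(c, x, 2))
```

For a 3-tap filter the sketch counts 12 operand fetches for the traditional schedule and 7 for the interlaced one, i.e. close to one memory operand per multiplication instead of two.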
Operand Swapping (1/2) • Weight = how many additions are needed ? • Weight = 2 : 00111100 => Y = 00X000X0 by Booth Encoding • Annotation on the table : Low Weight, High Switching
Operands (A ; B)             Current (mW)       Saving
                             A*B      B*A
A = 7FFF, 0001 ; B = AAAA, AAAA     10.0     22.0      54%
A = 7FFF, 0001 ; B = 6666, AAAA     10.0     31.6      68%
A = 7FFF, 0001 ; B = AAAA, 0001     12.2     28.8      58%
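The weight and switching measures used on this slide can be sketched as below. The sketch assumes textbook radix-4 Booth recoding and takes "switching" to be the Hamming distance between two successive operand values; both helper names are illustrative and the slide does not specify either definition in this detail.

```python
def booth_weight(y: int, bits: int = 16) -> int:
    """Count the nonzero radix-4 Booth digits of a two's-complement operand.

    Each nonzero digit corresponds to one partial product that has to be
    added, so this is the number of additions the operand requires.
    """
    y &= (1 << bits) - 1
    weight = 0
    prev = 0  # implicit bit below the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        if -2 * b1 + b0 + prev != 0:
            weight += 1
        prev = b1
    return weight


def switching(a: int, b: int, bits: int = 16) -> int:
    """Number of bits that toggle when an operand changes from a to b."""
    mask = (1 << bits) - 1
    return bin((a ^ b) & mask).count("1")


if __name__ == "__main__":
    print(booth_weight(0b00111100, bits=8))          # the slide's example
    print(booth_weight(0x7FFF), booth_weight(0x0001))
    print(booth_weight(0xAAAA))
    print(switching(0x7FFF, 0x0001), switching(0xAAAA, 0xAAAA))
```

With these definitions the slide's example 00111100 has weight 2. In the first table row, A (7FFF then 0001) has weights 2 and 1 but toggles 14 bits, while B (AAAA then AAAA) has weight 8 and toggles none, which matches the "Low Weight, High Switching" annotation.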
Operand Swapping (2/2) • For filter operations, one operand is usually constant. => Operand swapping at compile-time. • LowS : Low switching • HighS : High switching • LowW : Low weight • HighW : High weight • One entry of the table is marked : Candidate for Operand Swapping
Current (mA)
Y :          LowW -> LowW       HighW -> HighW      LowW -> HighW
             LowS     HighS     LowS     HighS      HighS
X : LowS     4.0      9.5       11.9     21.2       19.2
X : HighS    7.7      13.0      21.6     31.2       27.5
Selective Coefficient Negation • To reduce toggling, • store Coeff[i] or -Coeff[i] in memory • According to the negation, • use ‘multiply and add’ (MAC+ instruction) : ACC = ACC + (X * Y) • use ‘multiply and sub’ (MAC- instruction) : ACC = ACC - (X * Y) • GSM Vocoder : 11% power reduction
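A minimal sketch of the idea, under stated assumptions: toggling is measured as the Hamming distance between successively stored 16-bit words, the sign choice is made greedily from the first coefficient onward, and the initial state is all zeros. The slide does not say how the negations are chosen, so the selection rule and the function names here are hypothetical.

```python
def hamming(a: int, b: int, bits: int = 16) -> int:
    """Bits that differ between two two's-complement words."""
    mask = (1 << bits) - 1
    return bin((a ^ b) & mask).count("1")


def select_negations(coeffs, bits: int = 16):
    """For each coefficient, store c or -c, whichever toggles fewer bits
    relative to the previously stored word (greedy, assumed rule).

    Returns a list of (stored_value, negated) pairs.
    """
    most_negative = -(1 << (bits - 1))
    stored = []
    prev = 0  # assumed initial state of the coefficient path
    for c in coeffs:
        # The most negative value has no representable negation.
        if c != most_negative and hamming(prev, -c, bits) < hamming(prev, c, bits):
            stored.append((-c, True))
            prev = -c
        else:
            stored.append((c, False))
            prev = c
    return stored


def mac(stored, samples):
    """Accumulate with MAC+ for plain entries and MAC- for negated ones."""
    acc = 0
    for (value, negated), x in zip(stored, samples):
        if negated:
            acc -= value * x  # MAC- : ACC = ACC - (X * Y)
        else:
            acc += value * x  # MAC+ : ACC = ACC + (X * Y)
    return acc


if __name__ == "__main__":
    coeffs = [3, -3, 5, -6]
    samples = [2, 7, -1, 4]
    stored = select_negations(coeffs)
    print(stored, mac(stored, samples))
```

The accumulated result is unchanged because each negated coefficient is paired with a subtracting MAC; only the bit patterns read from memory differ.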
Coefficient Optimization • Mahesh [TVLSI ‘98] • The design of the finite wordlength FIR filter • Given N coefficients and constraints, • Find a new set of coefficients such that the total Hamming distance between successive coefficients is minimized. • => using a coefficient perturbation & • an algorithm similar to simulated annealing • But, Hamming distance is not a good cost-function !!!
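The search described above can be sketched as follows. This is not the algorithm from the cited paper: the real constraints are on the filter response, while this sketch substitutes a simple bound `max_dev` on how far each coefficient may move from its original value. The annealing schedule, parameter names and defaults are all assumptions made for illustration, and the cost is the total Hamming distance that the slide itself calls a poor cost function.

```python
import math
import random


def total_hamming(coeffs, bits=16):
    """Total Hamming distance between successive coefficients."""
    mask = (1 << bits) - 1
    return sum(bin((a ^ b) & mask).count("1")
               for a, b in zip(coeffs, coeffs[1:]))


def perturb_coefficients(coeffs, max_dev=2, steps=2000, temp=2.0,
                         seed=0, bits=16):
    """Annealing-style search over small coefficient perturbations.

    Each move re-draws one coefficient within max_dev of its ORIGINAL
    value (stand-in constraint). Worse moves are accepted with a
    probability that shrinks as the temperature falls.
    """
    rng = random.Random(seed)
    original = list(coeffs)
    current = list(coeffs)
    cost = total_hamming(current, bits)
    best, best_cost = list(current), cost
    for step in range(steps):
        t = temp * (1 - step / steps) + 1e-9  # linear cooling
        i = rng.randrange(len(current))
        candidate = list(current)
        candidate[i] = original[i] + rng.randint(-max_dev, max_dev)
        new_cost = total_hamming(candidate, bits)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            current, cost = candidate, new_cost
            if cost < best_cost:
                best, best_cost = list(current), cost
    return best, best_cost


if __name__ == "__main__":
    coeffs = [7, 8, 15, 16, 31, 32]
    print(total_hamming(coeffs))
    print(perturb_coefficients(coeffs))
```

Replacing `total_hamming` with a more accurate multiplier power model would leave the search loop unchanged, which is the direction the last bullet points to.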
Coefficient Ordering • MAC operation : commutative, associative • Finding a good ordering • N! cases for an N-tap filter
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k) = C1 * X(k-1) + C0 * X(k) + C2 * X(k-2)
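For a small filter the N! orderings can simply be enumerated. The sketch below does that with a placeholder cost, the total Hamming distance between successively used coefficients; that cost is an assumption for illustration only, and the function names are hypothetical. Because the sum is commutative and associative, any tap order yields the same Y(k).

```python
from itertools import permutations


def total_hamming(values, bits=16):
    """Total Hamming distance between successive values."""
    mask = (1 << bits) - 1
    return sum(bin((a ^ b) & mask).count("1")
               for a, b in zip(values, values[1:]))


def best_tap_order(coeffs, bits=16):
    """Exhaustively search all N! tap orders for the lowest cost.

    Only practical for small N; returns a tuple of tap indices.
    """
    return min(permutations(range(len(coeffs))),
               key=lambda order: total_hamming([coeffs[i] for i in order], bits))


def fir_output(coeffs, x, k, order):
    """Y(k) accumulated in the given tap order."""
    return sum(coeffs[i] * x[k - i] for i in order)


if __name__ == "__main__":
    coeffs = [1, 14, 3, 12]
    x = [2, -1, 4, 3, 5]
    order = best_tap_order(coeffs)
    print(order, total_hamming([coeffs[i] for i in order]))
    # Reordering the taps does not change the filter output.
    assert fir_output(coeffs, x, 4, order) == fir_output(coeffs, x, 4, range(4))
```

For larger N the exhaustive search would have to be replaced by a heuristic, since the number of orderings grows factorially.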
Conclusion & Future Works • Power characteristics of a multiplier • Some techniques for low power MACs • Interlaced accumulation programming • Operand swapping • Selective coefficient negation • Coefficient optimization & ordering • Find an accurate power model for a multiplier • Cost function for coefficient optimization • & instruction-level power optimization • An implementation of a multiplier supporting • Selective ‘operand swapping’ & ‘negation’