A Pipelined LNS ALU
Mark G. Arnold, marnold@uwyo.edu
University of Wyoming & University of Manchester Institute of Science and Technology

Outline
• What is LNS and why bother?
• Three-step addition algorithm
• Prior research: reduce ROM for step 2
• Here: reduce latency for steps 1 and 3
• Novel interpolator to keep the same ROM

Logarithmic Number System (LNS)
• Convert to base-$b$ logarithms once: $x = \log_b(X)$
• Keep in LNS throughout all computation
• Convert back only when done: $X = b^x$
• Upper-case: values of real numbers
• Lower-case: logarithmic representations

Why bother with LNS?
1. Natural for some applications: speech (μ-law CODEC), HMMs ($e^x$), neural nets ($1/(1+e^{-x})$)
2. Over one hundred papers; see www.xlnsresearch.com
3. Lower power consumption
4. Can "tune" precision/range to the problem
5. Easy multiplication, division, square root

LNS Power Consumption
• LNS compresses information
• High-order bits of LNS change less frequently than those of the equivalent fixed-point value
• LNS can take fewer bits
• Can tune $b$ to the application:
  1. Only the precision needed by the application
  2. Less wasted dynamic range
• LNS takes less power (from Paliouras and Stouraitis)

Easy Multiplication and Division
[Figure: logarithmic (slide-rule) scales labeled X, Y, P and x, y, p]
• Given LNS representations $x = \log_b(X)$ and $y = \log_b(Y)$
• LNS multiplication algorithm (only 1 step): let $p = x + y$
• Justification: $\log_b(P) = \log_b(X \cdot Y) = \log_b(X) + \log_b(Y)$
• Hardware: fixed-point adder (or a slide rule!)
• Division, square, and square root are similar
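
The one-step multiplication above can be sketched in software. This is a minimal floating-point illustration, not the fixed-point hardware; the base `B = 2.0` and the helper names `to_lns`, `from_lns`, `lns_mul`, `lns_div` and `lns_sqrt` are assumptions made for the example.

```python
import math

B = 2.0  # assumed base b; the slides leave b tunable


def to_lns(X):
    """x = log_b(X) for a positive real X."""
    return math.log(X, B)


def from_lns(x):
    """X = b**x, the conversion back to a real value."""
    return B ** x


def lns_mul(x, y):
    """Multiplication of the reals is one addition of the logarithms."""
    return x + y


def lns_div(x, y):
    """Division is one subtraction."""
    return x - y


def lns_sqrt(x):
    """Square root halves the logarithm (a shift in fixed point)."""
    return x / 2


x, y = to_lns(6.0), to_lns(1.5)
product = from_lns(lns_mul(x, y))    # 6.0 * 1.5
quotient = from_lns(lns_div(x, y))   # 6.0 / 1.5
root = from_lns(lns_sqrt(to_lns(9.0)))
```

Only `to_lns` and `from_lns` touch real values; everything between them stays in the logarithmic domain.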

Addition is more difficult
LNS addition algorithm:
1. Let $z_1 = y_1 - x_1$
2. Compute $s_b(z_1)$, where $s_b(z_1) = \log_b(1 + b^{z_1})$
3. Let $y_2 = x_1 + s_b(z_1)$
Hardware required:
1. Fixed-point subtractor
2. ROM lookup and possibly interpolation
3. Fixed-point adder

Addition Justification
LNS addition algorithm:
1. Let $z_1 = y_1 - x_1$
2. Compute $s_b(z_1) = \log_b(1 + b^{z_1})$
3. Let $y_2 = x_1 + s_b(z_1)$
Justification:
1. $\log_b(Z_1) = \log_b(Y_1 / X_1) = \log_b(Y_1) - \log_b(X_1)$
2. $\log_b(1 + Z_1) = \log_b(1 + Y_1 / X_1)$
3. $\log_b(X_1 (1 + Z_1)) = \log_b(X_1 (1 + Y_1 / X_1)) = \log_b(X_1 + Y_1)$
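
The three steps can be checked numerically. The sketch below assumes base `B = 2.0` and evaluates $s_b$ directly with floating-point math where the hardware would use a ROM; `sb` and `lns_add` are names chosen for the example.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    """s_b(z) = log_b(1 + b**z); stands in for the ROM lookup of step 2."""
    return math.log(1.0 + B ** z, B)


def lns_add(x1, y1):
    z1 = y1 - x1     # step 1: fixed-point subtraction
    s = sb(z1)       # step 2: table lookup (computed exactly here)
    return x1 + s    # step 3: fixed-point addition


x1 = math.log(3.0, B)
y1 = math.log(5.0, B)
y2 = lns_add(x1, y1)   # should represent 3 + 5
total = B ** y2
```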

Prior research: improving step 2
• Don't tabulate for very negative $z_1$ ("essential zero"): $\lim_{z_1 \to -\infty} s_b(z_1) = 0$
• Cut the $s_b$ table in half (commutativity): swap $y_1$ and $x_1$ so that $z_1 < 0$, using $s_b(z_1) = \log_b((1 + b^{-z_1})\, b^{z_1}) = s_b(-z_1) + z_1$
• Interpolation: fixed-point multiply-accumulate; cuts the table address width roughly in half
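
The first two reductions can be illustrated as follows. The half-lsb cutoff used to define the essential zero is an assumption of this sketch (the slides only state the limit), as are the base `B = 2.0`, the precision `F = 8`, and the names `essential_zero` and `sb_neg_only`.

```python
import math

B = 2.0  # assumed base b
F = 8    # assumed number of fractional bits


def sb(z):
    return math.log(1.0 + B ** z, B)


def essential_zero(frac_bits):
    """Argument at which s_b(z) equals half an lsb (assumed cutoff).

    Below this point s_b(z) would round to zero, so no table entry is needed.
    """
    half_lsb = 2.0 ** -(frac_bits + 1)
    return math.log(B ** half_lsb - 1.0, B)


def sb_neg_only(z):
    """Evaluate s_b(z) using s_b only at non-positive arguments."""
    if z > 0:
        return sb(-z) + z   # s_b(z) = s_b(-z) + z
    return sb(z)


EZ = essential_zero(F)
```

With these assumptions `EZ` comes out at roughly -9.5, so only a short range of negative arguments needs to be tabulated at this precision.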

Memory Requirements
Precision   Tabulate     Interpolate,        Interpolate,
            whole ROM    ess. zero,          ess. zero,
                         +/- ROM             - only ROM (commute)
 8          $2^{16}$     $2^{3}$             $2^{2}$
10          $2^{18}$     $2^{4}$             $2^{3}$
12          $2^{20}$     $2^{5}$             $2^{4}$
14          $2^{22}$     $2^{6}$             $2^{5}$
16          $2^{24}$     $2^{7}$             $2^{6}$
18          $2^{27}$     $2^{8}$             $2^{7}$
"Tabulate whole" assumes a dynamic range of $10^{38}$.

LNS Addition without commutativity
[Figure: datapath. A subtractor forms $z = y - x$, the $s_b$ ROM (+/-) gives $s_b(z)$, and an adder forms $x + s_b(y - x)$.]
Hardware required (full ROM):
1. Fixed-point subtractor
2. $s_b$ ROM (+/-) and interpolator
3. Fixed-point adder
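
LNS Addition with commutativity
[Figure: datapath. Two subtractors form $x - y$ and $y - x$; the high bit of the result ($x - y < 0$ means $y > x$) steers one mux to select $z = -|x - y|$ and a second mux to select $\max(x, y)$; the $s_b$ ROM (- only) gives $s_b(-|x - y|)$, and an adder forms $\max(x, y) + s_b(-|x - y|)$.]
Hardware required (if ROM is cut in half):
1. 2 fixed-point subtractors and 2 muxes
2. $s_b$ ROM (- only) and interpolator
3. Fixed-point adder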

Justification
• Commutativity: $X + Y = Y + X$
• In LNS: $x + s_b(y - x) = y + s_b(x - y)$
• Because: $s_b(z) = s_b(-z) + z$
• Choosing $\max(x, y)$ reduces the width of $z$ by 1 bit
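
The identity behind the commutative form follows directly from the definition $s_b(z) = \log_b(1 + b^z)$; a short derivation, with $z = y - x$ in the last line:

```latex
\begin{align*}
s_b(-z) &= \log_b\left(1 + b^{-z}\right) \\
        &= \log_b\left(\frac{b^{z} + 1}{b^{z}}\right) \\
        &= \log_b\left(1 + b^{z}\right) - z = s_b(z) - z, \\
\text{so}\quad y + s_b(x - y) &= y + s_b(y - x) - (y - x) = x + s_b(y - x).
\end{align*}
```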

Interpolation
• Split $z = z_H + z_L$, with $z_L < \Delta$
• $\Delta$ = spacing between tabulated points
• $2^{-N} = \Delta$ = weight of the $z_H$ lsb
• $2^{-F}$ = weight of the $z$ and $z_L$ lsbs
• Address the ROMs with $z_H$: $s_b(z_H)$ is the intercept ROM, $c(z_H)$ is the slope ROM
• Approximate $s_b(z)$ as $s_b(z_H) + c(z_H) \cdot z_L$
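
A software model of the split and the interpolation, as a sketch only: the ROM contents are computed on the fly, the secant slope is used, and the parameters `B = 2.0`, `N = 4`, `F = 12` and the function names are assumptions of the example.

```python
import math

B = 2.0            # assumed base b
N = 4              # assumed: z_H lsb has weight 2**-N
F = 12             # assumed: z and z_L lsbs have weight 2**-F
DELTA = 2.0 ** -N  # spacing between tabulated points


def sb(z):
    return math.log(1.0 + B ** z, B)


def split(z_fixed):
    """Split integer z (in units of 2**-F) into real-valued z_H and z_L."""
    low_mask = (1 << (F - N)) - 1
    zl_int = z_fixed & low_mask   # low F-N bits, also correct for negative z
    zh_int = z_fixed - zl_int     # remaining high bits
    return zh_int * 2.0 ** -F, zl_int * 2.0 ** -F


def intercept(zH):
    """Stand-in for the intercept ROM s_b(z_H)."""
    return sb(zH)


def slope(zH):
    """Stand-in for the slope ROM c(z_H); secant slope over one interval."""
    return (sb(zH + DELTA) - sb(zH)) / DELTA


def sb_interp(z_fixed):
    zH, zL = split(z_fixed)
    return intercept(zH) + slope(zH) * zL


zH_example, zL_example = split(-5000)   # z = -5000 * 2**-12
```

For `z_fixed = -5000` the split gives $z_H = -1.25$ and $z_L = 120 \cdot 2^{-12}$, so $z_L$ stays non-negative and below $\Delta$ even though $z$ is negative.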
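
Fig. 1
[Figure: interpolator. $z_H$ addresses the intercept ROM $s_b(z_H)$ and the slope ROM $c(z_H)$; the slope is multiplied by $z_L$ and summed with the intercept in a CSA (redundant bus, time $t_i$); a CPA converts redundant to non-redundant to give $s_b(z)$ (non-redundant bus, time $t_f$).]
• Total time: $t_f + t_i$
• $t_f$ is the non-redundant add time
• $t_i$ is the time for the redundant summation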

Choice of interpolation slope
• Linear Taylor series (tangent line): $c(z_H) = s_b'(z_H)$, giving $-\log_2(\text{maxerr}) = 2N + 3$
• Linear Lagrange (secant line): $c(z_H) = (s_b(z_H + \Delta) - s_b(z_H)) / \Delta$, giving $-\log_2(\text{maxerr}) = 2N + 5$
• $0 < c(z_H) < 1$ in either case
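
The two slope choices can be compared by brute force. This sketch assumes base 2, the argument range $[-8, 0)$, `N = 4`, and a sampling density of 64 points per interval; `max_error`, `taylor_slope` and `lagrange_slope` are illustrative names.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def sb_prime(z):
    """Derivative of log_b(1 + b**z), which is b**z / (1 + b**z)."""
    t = B ** z
    return t / (1.0 + t)


def taylor_slope(zH, delta):
    return sb_prime(zH)                       # tangent line at z_H


def lagrange_slope(zH, delta):
    return (sb(zH + delta) - sb(zH)) / delta  # secant line across the interval


def max_error(n_bits, slope, z_min=-8.0, samples=64):
    """Largest |interpolated - exact| over sampled points in [z_min, 0)."""
    delta = 2.0 ** -n_bits
    worst = 0.0
    for i in range(int(round(-z_min / delta))):
        zH = z_min + i * delta
        c = slope(zH, delta)
        for k in range(samples):
            zL = delta * k / samples
            worst = max(worst, abs(sb(zH) + c * zL - sb(zH + zL)))
    return worst


N = 4
bits_taylor = -math.log2(max_error(N, taylor_slope))
bits_lagrange = -math.log2(max_error(N, lagrange_slope))
```

Under these assumptions the secant line gains about two bits over the tangent line, in line with the $2N + 3$ and $2N + 5$ figures.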

Conventional LNS summation
Positive reals: $X_0$, $X_1$, $X_2$ and $X_3$
LNS representations: $x_0$, $x_1$, $x_2$ and $x_3$
We want $y_4 = \log_b(X_0 + X_1 + X_2 + X_3)$:
$y_1 = x_0$
$z_1 = y_1 - x_1$, $y_2 = x_1 + s_b(z_1)$
$z_2 = y_2 - x_2$, $y_3 = x_2 + s_b(z_2)$
$z_3 = y_3 - x_3$, $y_4 = x_3 + s_b(z_3)$
Total time: $6t_f + 3t_i$
In general: $y_j = x_{j-1} + s_b(z_{j-1})$, $z_j = y_j - x_j$
Time: $(k-1)(2t_f + t_i)$, where $k$ is the number of values
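
The conventional recurrence can be written as a loop. This is a floating-point sketch with an assumed base `B = 2.0`; `lns_sum_conventional` is a name chosen for the example.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def lns_sum_conventional(xs):
    """Accumulate LNS values one at a time: subtract, look up, add."""
    y = xs[0]              # y_1 = x_0
    for x in xs[1:]:
        z = y - x          # z_j = y_j - x_j          (one t_f)
        y = x + sb(z)      # y_{j+1} = x_j + s_b(z_j) (t_i, then another t_f)
    return y


values = [1.0, 2.0, 4.0, 9.0]
xs = [math.log(X, B) for X in values]
y4 = lns_sum_conventional(xs)
```

Each pass through the loop needs the previous `y` before the subtraction can start, which is why the subtraction, the interpolation and the addition all sit on the critical path.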
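
Fig. 2
[Figure: conventional pipeline stage. A CPA forms $z_{j-1}$ from $y_{j-1}$ and $x_{j-1}$ ($t_f$); the intercept and slope $\cdot\, z_{j-1}$ are summed in a CSA (redundant, $t_i$); a final CPA (redundant to non-redundant) adds $x_{j-1}$ to give $y_j$ ($t_f$).]
• Clock period: $2t_f + t_i$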

Hardware required (Fig. 1, full ROM):
1. Fixed-point subtractor
2. ROM (+/-), CSA tree
3. Fixed-point adder
Hardware required (to cut ROM in half):
1. 2 fixed-point subtractors and 2 muxes
2. ROM (- only), CSA tree
3. Fixed-point adder

Novel improvement
• Assume the ROM has an extra address bit, so it takes both positive and negative $z$
• Rearrange steps 1 and 3 to reduce delay
• Novel interpolator for step 2 keeps the area the same as prior LNS ALUs

We want $y_4 = \log_b(X_0 + X_1 + X_2 + X_3)$:
$z_1 = x_0 - x_1$
$z_2 = x_1 - x_2 + s_b(z_1)$
$z_3 = x_2 - x_3 + s_b(z_2)$
$y_4 = x_3 + s_b(z_3)$
In general, let $z_0 = -\infty$ and $x_4 = 0$:
$v_j = x_{j-1} - x_j$
$z_j = v_j + s_b(z_{j-1})$
The $v_j$ are computed in parallel
Clock period: $t_f + t_i$
Time: $k t_f + (k-1) t_i$
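
The rearranged recurrence can be modeled the same way. In this sketch (assumed base `B = 2.0`, illustrative function name) the differences $v_j$ are all formed before the loop, since none of them depends on a running sum.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    if z == -math.inf:
        return 0.0         # s_b(-infinity) = log_b(1) = 0
    return math.log(1.0 + B ** z, B)


def lns_sum_rearranged(xs):
    """Sum k LNS values using z_j = v_j + s_b(z_{j-1})."""
    padded = list(xs) + [0.0]   # append x_k = 0
    # v_j = x_{j-1} - x_j depends only on the inputs, so all are independent
    vs = [padded[j - 1] - padded[j] for j in range(1, len(padded))]
    z = -math.inf               # z_0
    for v in vs:
        z = v + sb(z)           # one lookup and one addition per step
    return z                    # z_k equals y_k


values = [1.0, 2.0, 4.0, 9.0]
xs = [math.log(X, B) for X in values]
y4 = lns_sum_rearranged(xs)
```

Only the lookup and one addition remain inside the loop-carried dependency, which corresponds to the shorter clock period of $t_f + t_i$.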

Hardware required (full ROM):
1. Fixed-point subtractor ($v_j$)
2. ROM (+/-), CSA tree
3. Fixed-point adder ($z_j$)
But the ROM cannot be cut in half by commutativity

Positive Argument w/o Doubling ROM - I
• First, a little complement:
• $z = z_H + z_L$
• $-z = {\sim}z + 2^{-F} = ({\sim}z_H) + (({\sim}z_L) + 2^{-F})$ (definition of two's complement)
• ${\sim}z_H$ is the one's complement of the bits in $z_H$ only
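
Positive Argument w/o Doubling ROM - II
• $-z = {\sim}z + 2^{-F} = ({\sim}z_H) + (({\sim}z_L) + 2^{-F})$
• $({\sim}z_H)$ takes the role of $z_H$
• $(({\sim}z_L) + 2^{-F})$ takes the role of $z_L$
• $s_b(-z) = s_b({\sim}z_H) + c({\sim}z_H) \cdot (({\sim}z_L) + 2^{-F}) = s_b({\sim}z_H) + c({\sim}z_H) \cdot ({\sim}z_L) + 2^{-F} \cdot c({\sim}z_H)$
• Remember that $s_b(z) = s_b(-z) + z$:
• $s_b(z) = s_b({\sim}z_H) + c({\sim}z_H) \cdot ({\sim}z_L) + (2^{-F} \cdot c({\sim}z_H) + z)$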
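
The final formula can be exercised with an integer model of $z$. This is a sketch under stated assumptions, not the hardware of the paper: base 2, `N = 4`, `F = 12`, secant slopes, and ROM contents computed on the fly. The two `rom_*` functions assert that only negative addresses are ever used.

```python
import math

B = 2.0              # assumed base b
N, F = 4, 12         # assumed table and word resolutions
LOW_BITS = F - N
LSB = 2.0 ** -F
DELTA = 2.0 ** -N


def sb(z):
    return math.log(1.0 + B ** z, B)


def rom_intercept(zh_int):
    """Stand-in for the negative-only intercept ROM."""
    assert zh_int < 0, "table holds negative arguments only"
    return sb(zh_int * LSB)


def rom_slope(zh_int):
    """Stand-in for the negative-only slope ROM (secant slope)."""
    assert zh_int < 0, "table holds negative arguments only"
    zH = zh_int * LSB
    return (sb(zH + DELTA) - sb(zH)) / DELTA


def sb_positive(z_int):
    """Approximate s_b(z) for integer z >= 0 (units of 2**-F)."""
    nz = ~z_int                              # one's complement: -z minus one lsb
    nzl = nz & ((1 << LOW_BITS) - 1)         # ~z_L
    nzh = nz - nzl                           # ~z_H, always negative here
    c = rom_slope(nzh)
    usual = rom_intercept(nzh) + c * (nzl * LSB)   # ordinary interpolation on ~z
    novel = LSB * c + z_int * LSB                  # 2**-F * c(~z_H) + z
    return usual + novel


approx_at_one = sb_positive(1 << F)   # z = 1.0
```

The extra cost over ordinary interpolation is the `novel` term, which matches the extra CSA leaf listed in the hardware comparison.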
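
Fig. 3
[Figure: novel interpolator. XOR gates conditionally complement $z$; the intercept ROM ($s_b(z_H) + 2^{-F-1}$), the slope ROM output multiplied by the low part of $z$, the novel term $z + 2^{-F} \cdot c({\sim}z_H)$, and $v$ feed a CSA followed by a CPA that outputs $v + s_b(z)$. Bus widths labeled in the figure: $W$, $W+G+P$, $F+G$, $F+G+P$, $N+K$, $M$, $F-N$.]
Hardware required (cut ROM in half):
1. Fixed-point subtractor ($v_j$, not shown)
2. ROM (- only), CSA tree (extra leaf), ANDs and XORs for the novel term
3. Fixed-point adder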

Compare hardware requirements:
Novel:
1. Fixed-point subtractor ($v_j$)
2. ROM (- only), CSA tree (extra leaf), ANDs and XORs
3. Fixed-point adder
Conventional:
1. 2 fixed-point subtractors and 2 muxes
2. ROM (- only), CSA tree
3. Fixed-point adder
Overall: about the same
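
Layout of word and guard bits
Fig. 4
[Figure: bit layout of the word with fields "ignore", $z_H$, $z_L$ and "guard", and widths labeled $K$, $N$, $G$, $P$, $F$ and $W$.]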

Simulation
F = precision
G = guard bits for prior implementations
M = bits from $c(z_H)$ ROM input to multiplier
P = extra guard bits here for faithful rounding
 F    G    N    M    P    % next nearest
                          z < 0    z ≥ 0
 6    2    1    6    1    0.14     0.17
 8    2    2    7    1    0.14     0.17
10    2    3    8    2    0.15     0.17
12    2    4    9    4    0.12     0.12
14    2    5   10    4    0.13     0.13
16    2    6   11    6    0.13     0.12
18    2    7   12    6    0.12     0.12

Redundant LNS
• Subtraction problems: larger ROM, slower
• Redundant LNS avoids subtractions
• Factor positives and negatives: $X = X_1 - X_2 + X_3 - X_4 = (X_1 + X_3) - (X_2 + X_4) = X^+ - X^- = b^{x^+} - b^{x^-}$
• Defer all subtraction to a final fixed-point subtraction
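
The factoring into positive and negative parts can be sketched as two independent LNS accumulators. The `(sign, x)` term encoding, the base `B = 2.0`, and the names `redundant_accumulate` and `finalize` are assumptions of this example; the final subtraction is done in ordinary floating point in place of fixed point.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def lns_add(x, y):
    return x + sb(y - x)


def redundant_accumulate(terms):
    """terms: iterable of (sign, x), sign in {+1, -1}, x = log_b(|X|).

    Returns (x_plus, x_minus); either is None if no term had that sign.
    """
    x_plus = x_minus = None
    for sign, x in terms:
        if sign > 0:
            x_plus = x if x_plus is None else lns_add(x_plus, x)
        else:
            x_minus = x if x_minus is None else lns_add(x_minus, x)
    return x_plus, x_minus


def finalize(x_plus, x_minus):
    """Deferred subtraction X+ - X-, performed outside LNS."""
    pos = 0.0 if x_plus is None else B ** x_plus
    neg = 0.0 if x_minus is None else B ** x_minus
    return pos - neg


terms = [(+1, math.log(8.0, B)), (-1, math.log(3.0, B)),
         (+1, math.log(5.0, B)), (-1, math.log(2.0, B))]
x_plus, x_minus = redundant_accumulate(terms)
result = finalize(x_plus, x_minus)   # 8 - 3 + 5 - 2
```

Inside the accumulation only LNS additions occur, so the subtraction table is never needed.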

Comparison of prior F = 12 implementations (0.7)
Design      Bits      ROMs   Multipliers   Time (F=23)
Proposed    4,224     1      1             24 ns
Coleman     4,808     2      2             42 ns
Taylor      77,000    10     0             22 ns

Conclusion
• Rearrangement of LNS addition
  • Logarithmic-increment-multiply (LIM), $V \cdot (1 + Z)$, rather than logarithmic addition (LADD), $X + Y$
  • Reduces fixed-point addition time
• Needs a novel interpolator
  • Handles positive and negative arguments
  • Requires the same memory and area
  • Supports LIM and LADD
• Works with:
  • Signed LNS
  • Redundant LNS
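
module ul_incmul(s, z, v, zH, sb_zH, c_zH);
  input  [19:0] z;
  input  [18:0] v;
  output [18:0] s;
  output [7:0]  zH;      // address sent to the external ROMs
  input  [14:0] sb_zH;   // intercept ROM output
  input  [7:0]  c_zH;    // slope ROM output

  // Negative z (sign bit set) passes through; otherwise the low 19 bits
  // are one's-complemented.
  wire [18:0] zt = (z[19]) ? z : (19'h7ffff ^ z);
  wire [7:0]  zH = zt[15:8];
  wire [7:0]  zL = zt[7:0];
  wire [14:0] sb_zH;
  wire [7:0]  c_zH;

  // Slope times low part, then rescaled.
  wire [15:0] prod      = c_zH * zL;
  wire [12:0] prodscale = prod >> 3;

  // Extra term, used only when z is non-negative.
  wire [24:0] novel = (z[19]) ? 0 : {z, 1'b0, c_zH[7:3]};

  // Essential-zero test. Note: 19'hf4000 needs 20 bits, so tools
  // truncate the literal to 19'h74000.
  wire ez = zt < 19'hf4000;

  wire [24:0] sum = (ez ? 0 : (sb_zH << 4)) + (ez ? 0 : prodscale)
                  + novel + (v << 6);
  assign s = sum >> 6;
endmodule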