
A Pipelined LNS ALU. Mark G. Arnold, marnold@uwyo.edu, University of Wyoming & University of Manchester Institute of Science and Technology


Presentation Transcript


  1. A Pipelined LNS ALU
     Mark G. Arnold, marnold@uwyo.edu
     University of Wyoming & University of Manchester Institute of Science and Technology

  2. Outline
     • What is LNS and why bother?
     • Three-step addition algorithm
     • Prior research: reduce ROM for step 2
     • Here: reduce latency for steps 1 and 3
     • Novel interpolator to keep the same ROM

  3. Logarithmic Number System (LNS)
     • Convert to base-$b$ logarithms once: $x = \log_b(X)$
     • Keep in LNS throughout all computation
     • Convert back only when done: $X = b^x$
     • Upper-case: values of real numbers; lower-case: logarithmic representations

  4. Why bother with LNS?
     1. Natural for some applications: speech ($\mu$-law CODEC), HMMs ($e^x$), neural nets ($1/(1+e^{-x})$)
     2. Over one hundred papers; see www.xlnsresearch.com
     3. Lower power consumption
     4. Can "tune" precision/range to problem
     5. Easy multiplication, division, square root

  5. LNS Power Consumption
     • LNS compresses information
     • High-order bits of LNS change less frequently than equivalent fixed-point
     • LNS can take fewer bits
     • Can tune $b$ to the application:
       1. Only precision needed by the application
       2. Less wasted dynamic range
     • LNS takes less power (from Paliouras and Stouraitis)

  6. Easy Multiplication and Division
     • Given LNS representations $x = \log_b(X)$ and $y = \log_b(Y)$
     • LNS multiplication algorithm (only 1 step): let $p = x + y$
     • Justification: $\log_b(P) = \log_b(X \cdot Y) = \log_b(X) + \log_b(Y)$
     • Hardware: fixed-point adder (or a slide rule!)
     • Division, square, square root similar
     [Figure: slide-rule scales marking $X$, $Y$, and $P$ against $x$, $y$, and $p$.]
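The one-step multiplication above can be sketched in a few lines of Python. This is only an illustration in floating point, assuming base b = 2; the helper names (`to_lns`, `lns_mul`, `lns_div`, `lns_sqrt`) are hypothetical and do not come from the slides.

```python
import math

B = 2.0  # assumed base b; the slides leave b tunable


def to_lns(X):
    """x = log_b(X) for a positive real X."""
    return math.log(X, B)


def from_lns(x):
    """X = b**x."""
    return B ** x


def lns_mul(x, y):
    return x + y      # one fixed-point add in hardware


def lns_div(x, y):
    return x - y      # one fixed-point subtract


def lns_sqrt(x):
    return x / 2.0    # a one-bit right shift in fixed point


x, y = to_lns(6.0), to_lns(7.0)
product = from_lns(lns_mul(x, y))
quotient = from_lns(lns_div(x, y))
root = from_lns(lns_sqrt(to_lns(49.0)))
```

Only the conversions at the start and end touch real values; everything in between is an add, subtract, or shift on the logarithms.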

  7. Addition is more difficult
     LNS addition algorithm:
     1. Let $z_1 = y_1 - x_1$
     2. Compute $s_b(z_1)$, where $s_b(z_1) = \log_b(1 + b^{z_1})$
     3. Let $y_2 = x_1 + s_b(z_1)$
     Hardware required:
     1. Fixed-point subtractor
     2. ROM lookup and possibly interpolation
     3. Fixed-point adder


  10. Addition Justification
      LNS addition algorithm:
      1. Let $z_1 = y_1 - x_1$
      2. Compute $s_b(z_1) = \log_b(1 + b^{z_1})$
      3. Let $y_2 = x_1 + s_b(z_1)$
      Justification:
      1. $\log_b(Z_1) = \log_b(Y_1 / X_1) = \log_b(Y_1) - \log_b(X_1)$
      2. $\log_b(1 + Z_1) = \log_b(1 + Y_1 / X_1)$
      3. $\log_b(X_1 (1 + Z_1)) = \log_b(X_1 (1 + Y_1 / X_1)) = \log_b(X_1 + Y_1)$
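The three steps and their justification can be checked numerically. The sketch below assumes base b = 2 and evaluates $s_b$ directly in floating point where the hardware would use a ROM lookup; `sb` and `lns_add` are illustrative names.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    """s_b(z) = log_b(1 + b**z); a ROM lookup in the hardware."""
    return math.log(1.0 + B ** z, B)


def lns_add(x1, y1):
    z1 = y1 - x1        # step 1: fixed-point subtract
    s = sb(z1)          # step 2: table lookup / interpolation
    return x1 + s       # step 3: fixed-point add


x1 = math.log(3.0, B)
y1 = math.log(5.0, B)
y2 = lns_add(x1, y1)    # should represent 3 + 5
```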

  11. Prior research: improving step 2
      • Don't tabulate for big negative $z_1$ ("essential zero"): $\lim_{z_1 \to -\infty} s_b(z_1) = 0$
      • Cut the $s_b$ table in half (commutativity): swap $y_1$ and $x_1$ so that $z_1 < 0$, since $s_b(z_1) = \log_b((1 + b^{-z_1})\, b^{z_1}) = s_b(-z_1) + z_1$
      • Interpolation: fixed-point multiply-accumulate. Cuts table address roughly in half.
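A small numeric sketch of the first two ideas, assuming base b = 2. The half-lsb cutoff used for the essential zero is an assumption made for illustration (the slides do not state the exact threshold), and the function names are hypothetical.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def essential_zero(F):
    """z at which s_b(z) drops to half an lsb (2**-(F+1)).

    Assumption: below this point the table entry rounds to zero,
    so it need not be stored. Solves s_b(z) = 2**-(F+1) for z.
    """
    half_lsb = 2.0 ** -(F + 1)
    return math.log(B ** half_lsb - 1.0, B)


def sb_negative_only(z):
    """Evaluate s_b for any z using only non-positive arguments,
    via s_b(z) = s_b(-z) + z."""
    return sb(z) if z <= 0 else sb(-z) + z


ez = essential_zero(12)
```

With F = 12 the cutoff lands between -14 and -13, so the tabulated range grows only linearly with the precision.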

  12. Memory Requirements

      Precision   Tabulate   Interpolate,   Interpolate,
                  whole      ess. zero,     ess. zero,
                             +/- ROM        - only ROM (commute)
      ---------   --------   ------------   --------------------
       8          2^16        8 · 2^3        8 · 2^2
      10          2^18       10 · 2^4       10 · 2^3
      12          2^20       12 · 2^5       12 · 2^4
      14          2^22       14 · 2^6       14 · 2^5
      16          2^24       16 · 2^7       16 · 2^6
      18          2^27       18 · 2^8       18 · 2^7

      "Tabulate whole" assumes a dynamic range of 10^38.

  13. LNS Addition without commutativity
      [Figure: datapath computing $z = y - x$, looking up $s_b(z)$ in the +/- ROM, and forming $x + s_b(y - x)$.]
      Hardware required (full ROM):
      1. Fixed-point subtractor
      2. $s_b$ ROM (+/-) and interpolator
      3. Fixed-point adder

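  14. LNS Addition with commutativity
      [Figure: datapath computing both $x - y$ and $y - x$; the high (sign) bits steer two muxes that select $z = -|x - y|$ and $\max(x, y)$ (if $x - y < 0$ then $y > x$; if $y - x < 0$ then $x > y$); the - only ROM supplies $s_b(-|x - y|)$, which is added to $\max(x, y)$.]
      Hardware required (if ROM is cut in half):
      1. 2 fixed-point subtractors and 2 muxes
      2. $s_b$ ROM (- only) and interpolator
      3. Fixed-point adder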

  15. Justification
      • Commutativity: $X + Y = Y + X$
      • In LNS: $x + s_b(y - x) = y + s_b(x - y)$
      • Because: $s_b(-z) = s_b(z) - z$
      • Choosing $\max(x, y)$ reduces $z$ width by 1 bit
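The commutative form can be sketched as follows, assuming base b = 2 and floating point in place of the two subtractors and muxes; `lns_add_commute` is an illustrative name.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def lns_add_commute(x, y):
    """max(x, y) + s_b(-|x - y|): s_b only ever sees z <= 0."""
    z = -abs(x - y)          # hardware: pick x - y or y - x by sign bit
    assert z <= 0.0
    return max(x, y) + sb(z)


x = math.log(3.0, B)
y = math.log(5.0, B)
forward = lns_add_commute(x, y)
swapped = lns_add_commute(y, x)
```

Both operand orders give the same result, which is what lets the positive half of the table be dropped.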

  16. Interpolation
      • Split $z = z_H + z_L$, with $z_L < \Delta$
      • $\Delta$ = spacing between tabulated points
      • $2^{-N} = \Delta$ = weight of $z_H$ lsb
      • $2^{-F}$ = weight of $z$ and $z_L$ lsbs
      • Address ROMs with $z_H$: $s_b(z_H)$ is the intercept ROM, $c(z_H)$ is the slope ROM
      • Approximate $s_b(z)$ as $s_b(z_H) + c(z_H) \cdot z_L$

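  17. Fig. 1
      [Figure: interpolator. The slope $c(z_H)$ is multiplied by $z_L$ and summed with the intercept $s_b(z_H)$ in a carry-save adder (CSA, redundant); a carry-propagate adder (CPA) converts the redundant sum to the non-redundant $s_b(z)$. The legend distinguishes non-redundant and redundant buses.]
      • Total time $t_f + t_i$
      • $t_f$ is the non-redundant add time
      • $t_i$ is the time for the redundant summation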

  18. Choice of interpolation slope
      • Linear Taylor series (tangent line):
        $c(z_H) = s_b'(z_H)$
        $-\log_2(\mathrm{maxerr}) = 2N + 3$
      • Linear Lagrange (secant line):
        $c(z_H) = (s_b(z_H + \Delta) - s_b(z_H)) / \Delta$
        $-\log_2(\mathrm{maxerr}) = 2N + 5$
      • $0 < c(z_H) < 1$ in either case
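The two slope choices and their accuracy figures can be checked with a brute-force scan. This is a floating-point sketch assuming base b = 2, N = 4, and a scan over $-8 \le z < 0$; the scan range, sampling density, and function names are illustrative choices, not taken from the slides.

```python
import math

B = 2.0               # assumed base b
N = 4                 # assumed: z_H has N fractional bits
DELTA = 2.0 ** -N     # spacing between tabulated points


def sb(z):
    return math.log(1.0 + B ** z, B)


def slope_taylor(zh):
    """Tangent line: derivative of s_b at z_H."""
    return B ** zh / (1.0 + B ** zh)


def slope_lagrange(zh):
    """Secant line through two adjacent tabulated points."""
    return (sb(zh + DELTA) - sb(zh)) / DELTA


def interpolate(z, slope):
    zh = math.floor(z / DELTA) * DELTA   # high part addresses the ROMs
    zl = z - zh                          # 0 <= z_L < DELTA
    return sb(zh) + slope(zh) * zl


def accuracy_bits(slope, steps=64):
    """-log2 of the worst error seen over -8 <= z < 0."""
    worst = 0.0
    for i in range(-8 * 2 ** N * steps, 0):
        z = i * DELTA / steps
        worst = max(worst, abs(interpolate(z, slope) - sb(z)))
    return -math.log2(worst)


taylor_bits = accuracy_bits(slope_taylor)
lagrange_bits = accuracy_bits(slope_lagrange)
```

Under these assumptions the secant line buys about two extra bits for the same table, consistent with the 2N + 3 versus 2N + 5 figures on the slide.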


  23. Conventional LNS summation
      Positive reals: $X_0$, $X_1$, $X_2$, and $X_3$
      LNS representations: $x_0$, $x_1$, $x_2$, and $x_3$
      We want $y_4 = \log_b(X_0 + X_1 + X_2 + X_3)$:
      $y_1 = x_0$
      $z_1 = y_1 - x_1$, $y_2 = x_1 + s_b(z_1)$
      $z_2 = y_2 - x_2$, $y_3 = x_2 + s_b(z_2)$
      $z_3 = y_3 - x_3$, $y_4 = x_3 + s_b(z_3)$
      Total time: $6 t_f + 3 t_i$
      In general:
      $y_j = x_{j-1} + s_b(z_{j-1})$
      $z_j = y_j - x_j$
      Time: $(k-1)(2 t_f + t_i)$, where $k$ is the number of values.
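The conventional recurrence can be written as a short loop. The sketch assumes base b = 2 and floating point; the timing helper simply evaluates the formula on the slide, and all names are illustrative.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def lns_sum_conventional(xs):
    """Accumulate y through alternating subtract, lookup, add."""
    y = xs[0]                  # y_1 = x_0
    for x in xs[1:]:
        z = y - x              # z_j = y_j - x_j        (one t_f)
        y = x + sb(z)          # y_{j+1} = x_j + s_b(z_j) (t_i + t_f)
    return y


def conventional_time(k, tf, ti):
    """(k - 1)(2 t_f + t_i) for k values."""
    return (k - 1) * (2 * tf + ti)


xs = [math.log(X, B) for X in (1.0, 2.0, 3.0, 4.0)]
y4 = lns_sum_conventional(xs)
```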

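  24. Fig. 2
      [Figure: conventional datapath. A CPA forms $z_{j-1}$ from $y_{j-1}$ and $x_{j-1}$ ($t_f$); the slope product and intercept are summed in a redundant CSA ($t_i$); a second CPA converts redundant to non-redundant and produces $y_j$ ($t_f$).]
      • Clock period $2 t_f + t_i$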

  25. Hardware required (Fig. 1, full ROM):
      1. Fixed-point subtractor
      2. ROM (+/-), CSA tree
      3. Fixed-point adder
      Hardware required (to cut ROM in half):
      1. 2 fixed-point subtractors and 2 muxes
      2. ROM (- only), CSA tree
      3. Fixed-point adder

  26. Novel improvement
      • Assume the ROM has an extra address bit, so it takes both positive and negative $z$
      • Rearrange steps 1 and 3 to reduce delay
      • Novel interpolator for step 2 keeps area the same as prior LNS ALUs


  31. We want $y_4 = \log_b(X_0 + X_1 + X_2 + X_3)$:
      $z_1 = x_0 - x_1$
      $z_2 = x_1 - x_2 + s_b(z_1)$
      $z_3 = x_2 - x_3 + s_b(z_2)$
      $y_4 = x_3 + s_b(z_3)$
      In general, let $z_0 = -\infty$ and $x_4 = 0$:
      $v_j = x_{j-1} - x_j$
      $z_j = v_j + s_b(z_{j-1})$
      • $v_j$ computed in parallel
      • Clock period: $t_f + t_i$
      • Time: $k t_f + (k-1) t_i$
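The rearranged recurrence can be sketched the same way. It assumes base b = 2 and floating point, with $z_0 = -\infty$ and a final padding value of 0 as on the slide; the function names are illustrative.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    if z == -math.inf:
        return 0.0             # essential zero: s_b(-inf) = 0
    return math.log(1.0 + B ** z, B)


def lns_sum_rearranged(xs):
    """z_j = v_j + s_b(z_{j-1}) with v_j = x_{j-1} - x_j."""
    padded = list(xs) + [0.0]                   # x_k = 0
    vs = [padded[j - 1] - padded[j]             # all v_j are independent,
          for j in range(1, len(padded))]       # so they can be precomputed
    z = -math.inf                               # z_0
    for v in vs:
        z = v + sb(z)                           # lookup, then one add
    return z                                    # final z_k equals y_k


def rearranged_time(k, tf, ti):
    """k t_f + (k - 1) t_i for k values."""
    return k * tf + (k - 1) * ti


xs = [math.log(X, B) for X in (1.0, 2.0, 3.0, 4.0)]
y4 = lns_sum_rearranged(xs)
```

With unit delays and four values, the time formula gives 7 against 9 for the conventional ordering, because only one fixed-point add remains on the loop-carried path.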

  32. Hardware required (full ROM):
      1. Fixed-point subtractor ($v_j$)
      2. ROM (+/-), CSA tree
      3. Fixed-point adder ($z_j$)
      But the ROM cannot be cut in half by commutativity.

  33. Positive Argument w/o Doubling ROM - I
      • First, a little complement:
        $z = z_H + z_L$
        $-z = {\sim}z + 2^{-F} = ({\sim}z_H) + (({\sim}z_L) + 2^{-F})$
        (definition of two's complement)
      • ${\sim}z_H$ is the one's complement of the bits only in $z_H$


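  36. Positive Argument w/o Doubling ROM - II
      $-z = {\sim}z + 2^{-F} = ({\sim}z_H) + (({\sim}z_L) + 2^{-F})$
      • $({\sim}z_H)$ takes the role of $z_H$
      • $(({\sim}z_L) + 2^{-F})$ takes the role of $z_L$
      $s_b(-z) = s_b({\sim}z_H) + c({\sim}z_H) \cdot (({\sim}z_L) + 2^{-F})$
      $\phantom{s_b(-z)} = s_b({\sim}z_H) + c({\sim}z_H) \cdot ({\sim}z_L) + 2^{-F} \cdot c({\sim}z_H)$
      Remember that $s_b(z) = s_b(-z) + z$:
      $s_b(z) = s_b({\sim}z_H) + c({\sim}z_H) \cdot ({\sim}z_L) + (2^{-F} \cdot c({\sim}z_H) + z)$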
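The complement trick can be tried out with Python integers standing in for the fixed-point word. This is a sketch under stated assumptions: base b = 2, F = 12, N = 4, a Lagrange slope, and floating-point functions (`intercept_rom`, `slope_rom`) standing in for the ROMs. None of these names or widths come from the slides.

```python
import math

B = 2.0               # assumed base b
F = 12                # assumed fractional bits of z
N = 4                 # assumed fractional bits of z_H
LOW = F - N           # number of bits in z_L
DELTA = 2.0 ** -N


def sb(z):
    return math.log(1.0 + B ** z, B)


def intercept_rom(zh_int):
    assert zh_int < 0                      # negative addresses only
    return sb(zh_int * DELTA)


def slope_rom(zh_int):
    assert zh_int < 0
    zh = zh_int * DELTA
    return (sb(zh + DELTA) - sb(zh)) / DELTA


def sb_positive(z_int):
    """s_b(z) for z = z_int * 2**-F > 0, using only the negative table."""
    nz = ~z_int                            # one's complement: -z - 2**-F
    zh_int = nz >> LOW                     # ~z_H (arithmetic shift)
    zl_int = nz & ((1 << LOW) - 1)         # ~z_L as an unsigned field
    assert (zh_int << LOW) + zl_int + 1 == -z_int   # -z = ~z + 2**-F
    c = slope_rom(zh_int)
    z = z_int * 2.0 ** -F
    novel = 2.0 ** -F * c + z              # the extra term on the slide
    return intercept_rom(zh_int) + c * (zl_int * 2.0 ** -F) + novel


approx = sb_positive(12345)
exact = sb(12345 * 2.0 ** -F)
```

The product uses the complemented low bits directly, so the "+1" of the two's complement never has to ripple through an adder; it shows up only as the small constant term $2^{-F} \cdot c({\sim}z_H)$.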

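  37. Fig. 3
      [Figure: novel interpolator. XOR gates conditionally complement $z$; the slope ROM output ($M$ bits) feeds the multiplier; the CSA sums $v$, the intercept $s_b(z_H) + 2^{-F-1}$, the slope product, and the novel term $z + 2^{-F} \cdot c({\sim}z_H)$; the CPA outputs $v + s_b(z)$. Bus widths in the figure are labeled with $W$, $F$, $G$, $P$, $N$, $K$, and $M$.]
      Hardware required (cut ROM in half):
      1. Fixed-point subtractor ($v_j$, not shown)
      2. ROM (- only), CSA tree (extra leaf), ANDs and XORs for novel term
      3. Fixed-point adder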

  38. Compare hardware requirements
      Novel:
      1. Fixed-point subtractor ($v_j$)
      2. ROM (- only), CSA tree (extra leaf), ANDs and XORs
      3. Fixed-point adder
      Conventional:
      1. 2 fixed-point subtractors and 2 muxes
      2. ROM (- only), CSA tree
      3. Fixed-point adder
      Differences: 1 vs. 2 fixed-point subtractors, and an extra CSA leaf with ANDs and XORs vs. 2 muxes. Overall about the same.

  39. Layout of word and guard bits (Fig. 4)
      [Figure: word layout with fields labeled ignore, $z_H$, $z_L$, and guard, and widths labeled $K$, $N$, $G$, $P$, $F$, and $W$.]

  40. Simulation
      F = precision
      G = guard bits for prior implementations
      M = bits from $c(z_H)$ ROM input to multiplier
      P = extra guard bits here for faithful round

                                 % next nearest
       F    G    N    M    P     z < 0    z ≥ 0
      --   --   --   --   --     -----    -----
       6    2    1    6    1     0.14     0.17
       8    2    2    7    1     0.14     0.17
      10    2    3    8    2     0.15     0.17
      12    2    4    9    4     0.12     0.12
      14    2    5   10    4     0.13     0.13
      16    2    6   11    6     0.13     0.12
      18    2    7   12    6     0.12     0.12

  41. Redundant LNS
      • Subtraction problems: larger ROM, slower
      • Redundant LNS avoids subtractions
      • Factor positives and negatives:
        $X = X_1 - X_2 + X_3 - X_4 = (X_1 + X_3) - (X_2 + X_4) = X^+ - X^- = b^{x^+} - b^{x^-}$
      • Defer all subtraction to a final fixed-point sub
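The factoring of positives and negatives can be sketched as two independent LNS accumulators followed by one subtraction. The sketch assumes base b = 2 and floating point; the `(sign, x)` pair representation and the function names are assumptions made for illustration.

```python
import math

B = 2.0  # assumed base b


def sb(z):
    return math.log(1.0 + B ** z, B)


def lns_add(x, y):
    """LNS addition; -inf represents the real value 0."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    return max(x, y) + sb(-abs(x - y))


def redundant_sum(terms):
    """terms: (sign, x) pairs with sign = +1 or -1 and x = log_b|X|.

    Returns (x_plus, x_minus); only LNS additions are used.
    """
    x_plus = x_minus = -math.inf
    for sign, x in terms:
        if sign > 0:
            x_plus = lns_add(x_plus, x)
        else:
            x_minus = lns_add(x_minus, x)
    return x_plus, x_minus


terms = [(+1, math.log(7.0, B)), (-1, math.log(2.0, B)),
         (+1, math.log(5.0, B)), (-1, math.log(4.0, B))]
x_plus, x_minus = redundant_sum(terms)
result = B ** x_plus - B ** x_minus    # the single deferred subtraction
```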

  42. Comparison of prior F = 12 implementations 0.7

                  Bits      ROMs   Multipliers   Time (F=23)
      ---------   -------   ----   -----------   -----------
      Proposed     4,224     1      1            24 ns
      Coleman      4,808     2      2            42 ns
      Taylor      77,000    10      0            22 ns

  43. Conclusion
      • Rearrangement of LNS addition
        • Logarithmic-increment-multiply (LIM): $V \cdot (1 + Z)$
        • Rather than logarithmic addition (LADD): $X + Y$
        • Reduces fixed-point addition time
      • Need novel interpolator
        • Handles positive and negative arguments
        • Requires same memory, area
        • Supports LIM and LADD
      • Works with:
        • Signed LNS
        • Redundant LNS

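  44. module ul_incmul(s, z, v, zH, sb_zH, c_zH);
        input  [19:0] z;
        input  [18:0] v;
        output [18:0] s;
        output [7:0]  zH;      // address sent to the external ROMs
        input  [14:0] sb_zH;   // intercept ROM output
        input  [7:0]  c_zH;    // slope ROM output

        // Keep z when its sign bit is set; otherwise complement it (XOR with
        // all ones) so the ROMs only see negative arguments.
        wire [18:0] zt = (z[19]) ? z : (19'h7ffff ^ z);
        wire [7:0]  zH = zt[15:8];
        wire [7:0]  zL = zt[7:0];
        wire [14:0] sb_zH;
        wire [7:0]  c_zH;

        // Slope times low bits, scaled to line up with the intercept.
        wire [15:0] prod = c_zH * zL;
        wire [12:0] prodscale = prod >> 3;

        // Extra term, used only when z is not negative.
        wire [24:0] novel = (z[19]) ? 0 : {z, 1'b0, c_zH[7:3]};

        // Essential zero: drop the ROM terms below this threshold.
        wire ez = zt < 19'hf4000;

        wire [24:0] sum = (ez ? 0 : (sb_zH << 4)) + (ez ? 0 : prodscale)
                          + novel + (v << 6);
        assign s = sum >> 6;
      endmodule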
