
CSE 246: Computer Arithmetic Algorithms and Hardware Design

CSE 246: Computer Arithmetic Algorithms and Hardware Design, Fall 2006. Lecture 8: Division. Instructor: Prof. Chung-Kuan Cheng. Topics: Radix-4 SRT Division; Division by a Constant; Division by a Repeated Multiplication. Project Update: come in to speak briefly about the final project.


Presentation Transcript


  1. CSE 246: Computer Arithmetic Algorithms and Hardware Design Fall 2006 Lecture 8: Division Instructor: Prof. Chung-Kuan Cheng

  2. Topics: • Radix-4 SRT Division • Division by a Constant • Division by a Repeated Multiplication

  3. Project Update • Come in to speak briefly about the final project • Status Update • 2:30 – 3:00 p.m. • Tuesday or Thursday

  4. Radix-4 SRT Division • Recurrence: $4s_{j-1} = q_j d + s_j$, where $q_j \in [-2, 2]$ and $s_{j-1} \in [-hd, +hd]$ • $h \le 2/3$ • Therefore, $s_{j-1} \in [-2d/3, 2d/3]$ • And $4s_{j-1} \in [-8d/3, 8d/3]$ • Multiplying by 4 shifts $s$ to the left by 2 bits

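  5. Radix-4 SRT Division • [Figure: selection diagram plotting $4s_{j-1}$ (vertical axis, binary tick marks 0.0 to 11.0, boundary lines at $-2d/3$, $d/3$, $2d/3$, $4d/3$, $5d/3$, $8d/3$) against $d$ (horizontal axis, tick marks .1, .101, .110, .111, 1.00), with regions labeled $q_j = 0$, $q_j = 1$, and $q_j = 2$] • Anything above $8d/3$ goes against our assumption and is therefore the infeasible region • The overlap regions of $q_j$ denote a choice that still allows the recursion to continue. The gap defines the precision for carry-save addition.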

  6. Radix-4 SRT Division • The value of $q_j$ determines the range of $4s_{j-1}$ it governs: $[(q_j - 2/3)d, (q_j + 2/3)d]$ • For example, $q_j = 1$ • $1 + 2/3 = 5/3$ • $1 - 2/3 = 1/3$ • The range is $d/3$ to $5d/3$
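The recurrence on slides 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration, not the hardware scheme from the lecture: it picks each digit by exact comparison (nearest integer to $4s_{j-1}/d$, clamped to $[-2, 2]$) instead of a table lookup on a few truncated bits, and the function name `srt4_divide` is hypothetical.

```python
from fractions import Fraction

def srt4_divide(z, d, steps):
    """Radix-4 digit recurrence 4*s_prev = q*d + s_next with q in {-2..2}.

    Assumes |z| <= 2d/3 so the first shifted remainder is in range.
    Digit selection is exact (an assumption; hardware would use a table).
    """
    s = Fraction(z)
    d = Fraction(d)
    bound = Fraction(2, 3) * d
    assert abs(s) <= bound, "initial remainder must lie in [-2d/3, 2d/3]"
    quotient = Fraction(0)
    for j in range(1, steps + 1):
        shifted = 4 * s                       # shift left by 2 bits
        q = max(-2, min(2, round(shifted / d)))  # digit in [-2, 2]
        s = shifted - q * d                   # next partial remainder
        assert abs(s) <= bound                # invariant from slide 4
        quotient += Fraction(q, 4 ** j)       # digit weight 4^-j
    return quotient, s

z, d = Fraction(3, 10), Fraction(7, 10)
quotient, remainder = srt4_divide(z, d, 8)
print(float(quotient), float(z / d))
```

After `steps` iterations the identity $z = d \cdot Q + s / 4^{steps}$ holds exactly, so the quotient error is bounded by $(2/3) \cdot 4^{-steps}$.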

  7. Division by a Constant • Multiplication is O(log n) but division is linear, which is much slower • Try to convert division to multiplication • Property: given an odd number $d$, there exists $m$ such that $d \cdot m = 2^n - 1$ • Ex. • $d = 3$, $m = 5$: $3 \cdot 5 = 2^4 - 1$ • $d = 7$, $m = 9$: $7 \cdot 9 = 2^6 - 1$ • $d = 11$, $m = 93$: $11 \cdot 93 = 2^{10} - 1$
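A brute-force search is enough to find such a pair for small odd divisors. The sketch below is an illustration only; the helper name `find_multiplier` is hypothetical, and it returns the smallest valid $n$. The slide's examples for $d = 3$ and $d = 7$ use $n = 4$ and $n = 6$, which are multiples of the smallest $n$ (2 and 3); any multiple of the smallest $n$ also satisfies the property.

```python
def find_multiplier(d):
    """Return (m, n) with d * m == 2**n - 1, using the smallest n >= 1.

    Works for any positive odd d, because 2 is invertible modulo d.
    """
    if d < 1 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    n = 1
    while (2 ** n - 1) % d != 0:
        n += 1
    return (2 ** n - 1) // d, n

for d in (3, 7, 11):
    m, n = find_multiplier(d)
    print(d, m, n)
```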

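  8. Division by a Constant • $1/d = m/(2^n - 1)$ • $1/(1-r) = 1 + r + r^2 + r^3 + \dots = (1+r)(1+r^2)(1+r^4)(1+r^8)\cdots$ • With $r = 2^{-n}$: $\frac{m}{2^n - 1} = \frac{m}{2^n} \cdot \frac{1}{1 - 2^{-n}} = \frac{m}{2^n}(1+2^{-n})(1+2^{-2n})(1+2^{-4n})\cdots$ • Example • $z/7 = zm/(2^n - 1)$, $m = 9$, $n = 6$ • $\frac{z}{7} = \frac{9z}{2^6} \cdot \frac{1}{1 - 2^{-6}} = \frac{9z}{2^6}(1+2^{-6})(1+2^{-12})(1+2^{-24})\cdots$ • log(n/6) operations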
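The product form can be checked numerically. The following sketch uses exact rationals so the truncation error of the product is visible; the function name `divide_by_constant` and its `factors` parameter are assumptions made for illustration. In hardware each factor $(1 + 2^{-kn})$ would be a shift and an add.

```python
from fractions import Fraction

def divide_by_constant(z, m, n, factors):
    """Approximate z/d, where d*m == 2**n - 1, with a truncated product.

    Computes (z*m / 2**n) * (1 + r)(1 + r**2)(1 + r**4)... with r = 2**-n,
    keeping only the first `factors` terms of the product.
    """
    r = Fraction(1, 2 ** n)
    result = Fraction(z * m, 2 ** n)
    for _ in range(factors):
        result *= 1 + r   # one shift-and-add per factor
        r *= r            # r, r^2, r^4, ...
    return result

approx = divide_by_constant(21, 9, 6, 3)   # 21 / 7 with m = 9, n = 6
print(float(approx))
```

With three factors the product equals $(1 - 2^{-48})/(1 - 2^{-6})$, so the result is $z/7$ scaled by $(1 - 2^{-48})$: each extra factor doubles the number of correct bits.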

  9. Division by Reciprocation • Find $1/d$ by iteration • Newton-Raphson algorithm: $x_{i+1} = x_i - f(x_i)/f'(x_i)$ • Set $f(x) = 1/x - d$ (with $1/2 \le d < 1$). We have $f'(x) = -1/x^2$ • Thus $x_{i+1} = x_i(2 - x_i d)$ • Let $e_i = 1/d - x_i$. We have $e_{i+1} = 1/d - x_{i+1} = 1/d - x_i(2 - x_i d) = d(1/d - x_i)^2 = d e_i^2$ • The convergence rate is quadratic. • For $k$ iterations, it takes $2k$ multiplications

  10. Division by Reciprocation • $z/d = 3/0.7$ • $x_0 = 4(\sqrt{3} - 1) - 2d = 2.9282 - 2d = 1.5282$ • $e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286$ • $x_1 = x_0(2 - x_0 d) = 1.42164$ • $e_1 = 1/d - x_1 = 1/0.7 - 1.42164 = 0.0069314$ • $x_2 = x_1(2 - x_1 d) = 1.4285377$ • $e_2 = 1/d - x_2 = 1/0.7 - 1.4285377 = 0.0000337$ • $x_3 = x_2(2 - x_2 d) = 1.4285715$ • $e_3 = 1/d - x_3 = 1/0.7 - 1.4285715 = -0.000000(1)$ • The convergence rate is quadratic.
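The iteration is short enough to run directly. The sketch below uses ordinary floating point and the same starting value as the slide, $x_0 = 4(\sqrt{3} - 1) - 2d$; the function name `reciprocal_newton` is hypothetical, and printed digits may differ slightly from the hand-rounded values on the slide.

```python
import math

def reciprocal_newton(d, iterations):
    """Approximate 1/d for 1/2 <= d < 1 with x_{i+1} = x_i * (2 - x_i * d).

    Returns the list [x_0, x_1, ..., x_iterations].
    """
    x = 4 * (math.sqrt(3) - 1) - 2 * d   # linear initial estimate
    history = [x]
    for _ in range(iterations):
        x = x * (2 - x * d)              # two multiplications per step
        history.append(x)
    return history

d = 0.7
for i, x in enumerate(reciprocal_newton(d, 4)):
    print(i, x, 1 / d - x)               # error roughly squares each step

quotient = 3 * reciprocal_newton(d, 4)[-1]   # z/d = z * (1/d)
print(quotient)
```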

  11. Division by Recursive Multiplication • $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$ eq(a) • Let $1/2 \le d < 1$ • It takes $2k$ multiplications for eq(a) • We also need $k$ operations to find the $x_i$

  12. Division by a Repeated Multiplication • $q = z/d = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$ • Let $1/2 \le d < 1$ • Set $d_0 = d$, $x_k = 2 - d_k$ 1. $d_1 = d x_0 = d(2 - d) = 1 - (1 - d)^2$ 2. $d_{k+1} = d_k x_k = d_k(2 - d_k) = 1 - (1 - d_k)^2$ 3. $1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^{2^{k+1}}$: quadratic convergence • For k-bit operands, we need $2m - 1$ multiplications • and $m$ 2's complement operations • $m = \lceil \log_2 k \rceil$, with $\log_2 m$ extra bits for precision

  13. Division by a Repeated Multiplication • $q = z/d = 3/0.7 = (z/d)(x_0/x_0)(x_1/x_1)\cdots(x_{k-1}/x_{k-1})$ • $d_0 = d = 0.7$, $x_k = 2 - d_k$, $d_{k+1} = d_k x_k$ 1. $x_0 = 2 - d_0 = 1.3$, $d_1 = d_0 x_0 = 0.7 \times 1.3 = 0.91$ 2. $x_1 = 2 - d_1 = 1.09$, $d_2 = d_1 x_1 = 0.91 \times 1.09 = 0.9919$ 3. $x_2 = 2 - d_2 = 1.0081$, $d_3 = d_2 x_2 = 0.9919 \times 1.0081 = 0.9999343$
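The same steps can be written as a loop that scales the numerator and denominator together, so the numerator approaches $z/d$ as the denominator approaches 1. This is a minimal sketch with exact rationals; the function name `divide_by_repeated_multiplication` is hypothetical.

```python
from fractions import Fraction

def divide_by_repeated_multiplication(z, d, iterations):
    """Multiply z and d by x_k = 2 - d_k until d_k is close to 1.

    Assumes 1/2 <= d < 1. Returns (q_k, d_k); q_k approximates z/d.
    """
    q, dk = Fraction(z), Fraction(d)
    for _ in range(iterations):
        x = 2 - dk     # 2's complement of d_k in fixed-point hardware
        q *= x         # numerator update
        dk *= x        # denominator update: 1 - d_{k+1} = (1 - d_k)^2
    return q, dk

q, dk = divide_by_repeated_multiplication(3, Fraction(7, 10), 3)
print(float(dk))   # close to 1, as in step 3 of the slide
print(float(q))    # close to 3/0.7
```

Because both updates use the same factor, the ratio $q_k / d_k$ stays equal to $z/d$ at every step, and the remaining error is $(1 - d)^{2^k}$ relative to the true quotient.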

  14. Division Methods • Iteration • Memory • Arithmetic

  15. Division – Iteration effort • Pencil-and-paper method: ($A = QB + 2^{-n}R$ and $R < B$), 1 bit of partial quotient per iteration, $n$ iterations • $A = 0.1001$, $B = 0.1010$; $Q = A / B$ • $Q_i$: partial quotient; $R_i$: partial remainder; $R_{i+1} = R_i - B \times Q_i$ • [Figure: long-division tableau with rows $R_0 = A$, $R_1$, $R_2$, $R_3$, $R_4$ and partial quotients $Q_1 = 0.1$, $Q_2 = 0.01$, $Q_3 = 0.000$, $Q_4 = 0.0001$] • $Q = 0.1101$

  16. Division – Memory effort • A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration. • SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor. Table size: $2^{2m} \times m$ bits • The SRT method is limited by the memory wall.
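A quick calculation shows why the table size becomes a wall. The sketch below simply evaluates $2^{2m} \times m$ for a few values of $m$; the helper name `srt_table_bits` is hypothetical, and the result is read as a size in bits on the assumption that each of the $2^{2m}$ entries holds an m-bit partial quotient.

```python
def srt_table_bits(m):
    """Size of a table indexed by 2m address bits, each entry m bits wide."""
    return (2 ** (2 * m)) * m

for m in (2, 4, 8, 12, 16):
    bits = srt_table_bits(m)
    print(f"m = {m:2d}: {bits} bits ({bits / 8 / 1024:.1f} KiB)")
```

Each extra quotient bit per iteration multiplies the table by more than four, so the table quickly outgrows on-chip memory.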

  17. Division – Arithmetic effort • The partial quotient is calculated by arithmetic functions. • Prescaling: • Taylor expansion: • Series expansion:

  18. Division – Solution space • Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers. • [Figure: solution-space diagram with axes labeled Memory Effort, Iteration Effort, and Arithmetic Effort, locating SRT (bounded by the Memory Wall), Pencil-and-paper, Prescaling, Series Expansion, and Taylor Expansion, with annotations "Low latency", "Low area", and "Our target"]

  19. Division – PST algorithm • Utilizes the power of series expansion, but needs a good starting point. • Prescaling provides a scaled divisor close to 1. • A 0-order Taylor expansion iterates to reach the final quotient

  20. z1 = z  E0 =0.1101,1000,0010 d1 = d  E0 =0.1111,0001,0001 Q1 = z1 E1 =0.1110,0011 R1 = B1 – Q1 d1 =0.0000,0010,0101,1110,1101 Q2 = R1 E1 =0.1001,1111 R2 = R1 – Q2 d1 =0.0000,0001,1111,1011,0001 Q =0.1110,0011+ 0.0000,0010,0111,11= 0.1110,0101,0111,11 Division –PST algorithm B(m) =0.1100 E0 =1.0011 z =0.1011,0110 d =0.1100,1011 E1 = INV(d1(2m)) =1.0000,1110 E0 = Table (d(m))  1/d z1 = zE0; d1 = dE0 E1 = (2  d1)  INV(d1(2m)) Qi = Ri-1  E1 Ri= Ri-1 Qi  B1 Q = Q + Qi

  21. Division – FPGA Implementation • The PST algorithm is suitable for high-performance division unit design in FPGAs • 32-bit division with 5-cycle latency
