Binary FLP Number System • $X = (S, E, M)_2 = (-1)^{S} \cdot M \cdot \beta^{E}$, base $\beta = 2$ • S: Sign, 0 = (+) & 1 = (−)

Binary FLP Number Systems • Binary FLP Number System • $X = (S, E, M)_2 = (-1)^{S} \cdot M \cdot \beta^{E}$, base $\beta = 2$ • S: Sign, 0 = (+) & 1 = (−) • M: Mantissa or Significand, normalized so that $1 > M \ge 0.5$ or $2 > M \ge 1$ • E: Exponent or Characteristic; biased • Compared with fixed-point numbers: larger range & less precision

IEEE Standard 754 • Single-precision Format for FLP • 32-bit: 1 sign bit, e = 8 exponent bits, f = 23 mantissa bits • $F = (-1)^{s}\,(1.f)\,2^{E-127}$ if $254 \ge E \ge 1$ • $F = (-1)^{s}\,(0.f)\,2^{-126}$ if $E = 0$ • $E = 255$: $\pm\infty$ if $f = 0$; NaN if $f \ne 0$ • Ranges ($254 \ge E \ge 1$) • $(2 - 2^{-23})\,2^{254-127} \ge F^{+} \ge 1 \cdot 2^{1-127}$
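The single-precision rules on this slide can be checked with a small decoder. The following is a minimal sketch, not a reference implementation; the function names `decode_single` and `float_to_bits` are hypothetical, and the only library calls are Python's standard `struct.pack` and `struct.unpack`.

```python
import math
import struct

def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern with the single-precision rules from the slide."""
    s = (bits >> 31) & 0x1          # sign bit
    e = (bits >> 23) & 0xFF         # 8-bit biased exponent E
    f = bits & 0x7FFFFF             # 23-bit fraction f
    sign = -1.0 if s else 1.0
    if e == 255:                    # reserved exponent: infinity or NaN
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                      # denormalized (and zero): (0.f) * 2^-126
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e - 127)   # normalized: (1.f) * 2^(E-127)

def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(decode_single(float_to_bits(0.15625)))   # value survives the round trip
print(decode_single(0x7F7FFFFF))               # largest normalized value
print(decode_single(0x00000001))               # smallest denormalized value
```

The largest normalized and smallest denormalized patterns reproduce the range bounds quoted for $254 \ge E \ge 1$ and $E = 0$.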

IEEE Standard 754 (2) • Ranges ($E = 0$) • $(1 - 2^{-23})\,2^{-126} \ge F^{+} \ge 2^{-23} \cdot 2^{-126}$ • Denormalized Number • May be excluded in Arith. Unit • Hidden bit: $(1.f)$ for $E \ge 1$ • $\pm 0$: $E = 0$ & $f = 0$

Double-Precision Format • 64 bits: e = 11, f = 52 mantissa • $F = (-1)^{s}\,(1.f)\,2^{E-1023}$ if $2046 \ge E \ge 1$ • Reserve values of E = 0 & 2047 • Comparisons

General Format • $F = (-1)^{s}\, M\, \beta^{E-\text{bias}}$ • $1 > M \ge 1/\beta$ • $\beta = 2^{k}$ • Hidden bit used • $F = (-1)^{s}\,(0.1M)_2\,2^{E-\text{bias}}$ • $\beta = 2$ • Zero? $\Rightarrow E = M = 0$; Smallest number E = 1

Operations • $X = (-1)^{S_1} M_1 \beta^{E_1 - b} > Y = (-1)^{S_2} M_2 \beta^{E_2 - b}$ • ADD/SUB • $X \pm Y = \left((-1)^{S_1} M_1 \pm (-1)^{S_2} M_2\, \beta^{-(E_1 - E_2)}\right) \beta^{E_1 - b}$ • If $1 \le M_{new} < 2 \Rightarrow$ post-normalization • Steps for Add/Sub: • difference $d = |E_1 - E_2|$ • Shift the smaller one d base-$\beta$ digits to the right • Add & set $E_{new}$ = larger one • post-normalization & check OV/UV if necessary
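The Add/Sub steps listed above can be sketched for the simplest case, adding two positive operands. This is an illustration under stated assumptions: the mantissa is held as an integer of `digits` base-$\beta$ digits (value `m / beta**digits`), shifted-out digits are simply chopped, and the overflow/underflow check is omitted. The names `flp_add` and `flp_value` are hypothetical.

```python
def flp_add(e1, m1, e2, m2, beta=2, digits=4):
    """Add two positive normalized numbers (m / beta**digits) * beta**e."""
    if e1 < e2:                      # make operand 1 the one with the larger exponent
        e1, m1, e2, m2 = e2, m2, e1, m1
    d = e1 - e2                      # step 1: exponent difference d = |E1 - E2|
    m2_aligned = m2 // beta**d       # step 2: shift the smaller operand d digits right (chop)
    m_new = m1 + m2_aligned          # step 3: add the mantissas ...
    e_new = e1                       #         ... and keep the larger exponent
    if m_new >= beta**digits:        # step 4: mantissa reached 1 or more, post-normalize
        m_new //= beta               #         shift right one digit (chop)
        e_new += 1
    return e_new, m_new

def flp_value(e, m, beta=2, digits=4):
    """Real value represented by the pair (e, m)."""
    return m / beta**digits * beta**e

# 0.1100 * 2^0 + 0.1100 * 2^0: the mantissa sum overflows and is renormalized
print(flp_add(0, 12, 0, 12), flp_value(*flp_add(0, 12, 0, 12)))
```

Subtraction and mixed signs would need a left-shifting post-normalization step as well, which this sketch leaves out.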

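Operations(2) • MUL • $X \cdot Y = \left((-1)^{S_1} M_1 \cdot (-1)^{S_2} M_2\right) \beta^{(E_1 + E_2 - b) - b}$ • $E_{new} = E_1 + E_2 - b$ • If $1/\beta^{2} \le M_{new} < 1/\beta \Rightarrow$ post-normalization • DIV • Check Y = 0? If yes $\Rightarrow$ set NaN or $\pm\infty$ • $X / Y = \left((-1)^{S_1} M_1 / (-1)^{S_2} M_2\right) \beta^{(E_1 - E_2 + b) - b}$ • $E_{new} = E_1 - E_2 + b$ • If $1 \le M_{new} < \beta \Rightarrow$ post-normalization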
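The exponent bookkeeping for MUL and DIV can be tried out with exact rational mantissas. The sketch below assumes magnitudes only (signs are left out), mantissas in $[1/\beta, 1)$, and a bias of 8 chosen purely for illustration; `flp_mul`, `flp_div` and `flp_value` are hypothetical names, and returning `None` for a zero divisor is a stand-in for the NaN or infinity result.

```python
from fractions import Fraction

def flp_mul(e1, m1, e2, m2, beta=2, bias=8):
    """Multiply m1 * beta**(e1 - bias) by m2 * beta**(e2 - bias)."""
    m_new = m1 * m2                  # product lies in [1/beta**2, 1)
    e_new = e1 + e2 - bias           # biased exponent of the product
    if m_new < Fraction(1, beta):    # post-normalize: shift left one digit
        m_new *= beta
        e_new -= 1
    return e_new, m_new

def flp_div(e1, m1, e2, m2, beta=2, bias=8):
    """Divide m1 * beta**(e1 - bias) by m2 * beta**(e2 - bias)."""
    if m2 == 0:                      # Y = 0: stand-in for "set NaN or infinity"
        return None
    m_new = m1 / m2                  # quotient lies in (1/beta, beta)
    e_new = e1 - e2 + bias           # biased exponent of the quotient
    if m_new >= 1:                   # post-normalize: shift right one digit
        m_new /= beta
        e_new += 1
    return e_new, m_new

def flp_value(e, m, beta=2, bias=8):
    """Exact value of the pair (e, m) as a Fraction."""
    return m * Fraction(beta) ** (e - bias)

x = (9, Fraction(1, 2))              # 0.5  * 2^(9-8) = 1
y = (8, Fraction(3, 4))              # 0.75 * 2^(8-8) = 0.75
print(flp_value(*flp_mul(*x, *y)), flp_value(*flp_div(*x, *y)))
```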

Choice • Range • $\beta$ & Speed (alignment shift)

Choice(2) • Max. Relative Rep. Error (MRRE) $= 0.5\,(\text{ulp})\,\beta$ • $\max\left[(M(x) - x)/x\right] \le 0.5\,(\text{ulp})\,\beta^{E} / (M \beta^{E}) \le 0.5\,(\text{ulp})\,\beta$ • Ave. RRE (ARRE) $= (\text{ulp})(\beta - 1)/(4 \ln \beta)$
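To compare bases numerically, the two error measures can be evaluated directly. This is a small sketch that only transcribes the MRRE and ARRE expressions on this slide; the function names `mrre` and `arre` are hypothetical, and `ulp` is passed in as a plain number.

```python
import math

def mrre(beta: int, ulp: float) -> float:
    """Maximum relative representation error: 0.5 * ulp * beta."""
    return 0.5 * ulp * beta

def arre(beta: int, ulp: float) -> float:
    """Average relative representation error: ulp * (beta - 1) / (4 ln beta)."""
    return ulp * (beta - 1) / (4 * math.log(beta))

# Same ulp, different bases: both measures grow with beta
for beta in (2, 4, 16):
    print(beta, mrre(beta, 2.0**-24), arre(beta, 2.0**-24))
```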

Rounding • Trade-offs • Implementation Cost (machine) • Accuracy (Numerical) • Rounding • M(): Machine; x, y real • $M(x) \le M(y)$ if $x \le y$ • If $x \in M()$ then $M(x) = x$ • If $M(y) \le x \le M(y) + \text{ulp}$ then $M(x) = M(y)$ or $M(x) = M(y) + \text{ulp}$

Truncation (chopping) • Neglect the extra LSB digit(s) • M(x)=chop(x) • Error=(M(x)-x) • Ex. x=010.1 then M(x)=010

Round-to-the-nearest • Rounding in general • M(x) =chop(x+ulp/2) • Ex. x=010.1 then ulp=(1.0)&M(x)=011
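Truncation and the rule $M(x) = \text{chop}(x + \text{ulp}/2)$ can be tried on the example x = 010.1. The sketch below assumes non-negative x and uses exact fractions so no binary rounding of its own interferes; `chop` and `round_nearest` are hypothetical helper names.

```python
import math
from fractions import Fraction

def chop(x, ulp=Fraction(1)):
    """Truncate x >= 0 to a multiple of ulp by dropping the extra low-order digits."""
    return math.floor(x / ulp) * ulp

def round_nearest(x, ulp=Fraction(1)):
    """Round to nearest by adding half an ulp and then chopping."""
    return chop(x + ulp / 2, ulp)

x = Fraction(5, 2)                   # binary 010.1
print(chop(x), round_nearest(x))     # truncation keeps 010, rounding gives 011
```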

Average Error • Ave. Err $= \sum \text{Error} / 2^{d}$ • d: extra bits • Ave. Trunc. Err (fraction chopped) • $= -\sum_{i=0}^{2^{d}-1} i / 2^{2d} = -(2^{d} - 1)/2^{d+1}$ • Ave. rounding Err • $= 0.5 / 2^{d} = 1 / 2^{d+1}$
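The two averages can be confirmed by brute force over all $2^{d}$ patterns of the extra bits. This is a minimal sketch assuming the error is the signed quantity M(x) − x measured in ulp and that every pattern is equally likely; `average_errors` is a hypothetical name.

```python
import math
from fractions import Fraction

def average_errors(d: int):
    """Average signed error of chopping and of chop(x + 1/2) over d extra bits."""
    n = 2**d
    trunc_sum = Fraction(0)
    round_sum = Fraction(0)
    for i in range(n):
        x = Fraction(i, n)                                 # discarded part, in ulp
        trunc_sum += math.floor(x) - x                     # chopping error
        round_sum += math.floor(x + Fraction(1, 2)) - x    # round-to-nearest error
    return trunc_sum / n, round_sum / n

for d in (1, 2, 3):
    print(d, *average_errors(d))
```

Neither average is zero, which is the motivation for the round-to-nearest-even (odd) variant.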

Average Error (2) • Want Ave. Err = 0 $\Rightarrow$ Round-to-nearest-even (odd)

Jamming (von Neumann) Rounding • ROM Implementation • $M() = X(y_2 y_1 y_0 .) \leftarrow X(x_2 x_1 x_0 . x_{-1})\, x_{-2} x_{-3}$ • Input $= (x_2 x_1 x_0 . x_{-1})$ • Output $= (y_2 y_1 y_0 .)$ • $(y_2 y_1 y_0 .) = (x_2 x_1 x_0 .)$ if $x_{-1} = 0$ or $(x_2 x_1 x_0) = (111)$ • Otherwise $(y_2 y_1 y_0 .) = (x_2 x_1 x_0 .) + \text{ulp}$ • In General • Input bits = c bits (include d extra bits)

Jamming (von Neumann) Rounding(2) • c = 3, d = 1 • Ave. Err. $= 0.5\,(1/2)^{d} - 0.5\,(1/2)^{c} - 0.5\,(1/2)^{c}$ • $= 0.5\,(1/2)^{d} - 0.5\,(1/2)^{c-1}$ • 1st term = Ave. Err if $c \gg 1$
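The average error of the table-lookup rule can be checked by enumerating every input pattern. The sketch below assumes c counts all input bits including the d extra bits, that the output is incremented when the first extra bit is 1 unless the kept bits are all ones, and that all patterns are equally likely. `rom_round_average_error` is a hypothetical name, and the comparison with the closed form is only made for d = 1.

```python
from fractions import Fraction

def rom_round_average_error(c: int, d: int) -> Fraction:
    """Average signed error (in ulp) of the ROM rule over all c-bit inputs."""
    kept_bits = c - d
    total = Fraction(0)
    for n in range(2**c):
        kept, extra = divmod(n, 2**d)            # integer part and extra bits
        x = Fraction(n, 2**d)                    # exact input value in ulp
        first_extra_is_zero = extra < 2**(d - 1)
        all_ones = kept == 2**kept_bits - 1
        if first_extra_is_zero or all_ones:
            y = kept                             # output = kept bits unchanged
        else:
            y = kept + 1                         # output = kept bits + ulp
        total += y - x
    return total / 2**c

for c in (3, 4, 8):
    print(c, rom_round_average_error(c, 1))
```

As c grows the result approaches the plain round-to-nearest average, matching the remark that the first term dominates.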

Guard Digits • Find the smallest # of digits required • Ex1: m = 4 & no extra bit • $0.1000 \times 2^{1} - 0.1111 \times 2^{0} = 0.1 \times 2^{-3}$ • Missing information

Guard Digits(2) • Ex2: m = 3 & an extra bit (G) • $0.100 \times 2^{1} - 0.111 \times 2^{-1} = 0.1001 \times 2^{0} \to 0.101 \times 2^{0}$ • Rounding error!
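Both guard-digit examples can be replayed with a toy subtractor that keeps a chosen number of extra bits during alignment. This is a sketch under stated assumptions: operands are positive, the first has the larger exponent, bits shifted past the extra positions are chopped, and the result is returned as an exact value without a final rounding step. `flp_sub` is a hypothetical name.

```python
from fractions import Fraction

def flp_sub(e1, m1, e2, m2, m_bits, extra_bits):
    """Compute (m1/2^m) * 2^e1 - (m2/2^m) * 2^e2 with e1 >= e2 and limited alignment width."""
    d = e1 - e2
    a = m1 << extra_bits                     # larger operand, widened by the extra bits
    b = (m2 << extra_bits) >> d              # align the smaller operand, chop what falls off
    return Fraction(a - b, 2**(m_bits + extra_bits)) * Fraction(2)**e1

# Ex1: m = 4, 0.1000 * 2^1 - 0.1111 * 2^0 (exact result 1/16)
print(flp_sub(1, 0b1000, 0, 0b1111, 4, 0), flp_sub(1, 0b1000, 0, 0b1111, 4, 1))
# Ex2: m = 3, 0.100 * 2^1 - 0.111 * 2^-1 (exact result 9/16)
print(flp_sub(1, 0b100, -1, 0b111, 3, 1), flp_sub(1, 0b100, -1, 0b111, 3, 2))
```

With no extra bit the first example is off by a factor of two, and with a single extra bit the second example yields 0.101 instead of 0.1001, which is the rounding error the slide points out.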

Guard Digits(3) • Require two digits at least • Guard digit (G) • Round digit (R) for Round-to-the-nearest scheme • Require two digits and a sticky bit (S) • if Round-to-the-nearest-even (odd) scheme applied • S = logical OR of all shifted-out (lost) bit(s)

Guard Digits(4) • Ex2: m=3 & RGS • LSB=RS+RS’L
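The expression RS + RS'L can be read as the increment added at the LSB for round-to-nearest-even. That reading is an assumption, as is the bit naming used here: L is the last kept bit, R the first dropped bit, and S the OR of all remaining dropped bits. The sketch below (hypothetical name `round_nearest_even`) applies it to non-negative integers whose low `extra` bits are to be removed.

```python
from fractions import Fraction

def round_nearest_even(value_bits: int, extra: int) -> int:
    """Drop the low `extra` (>= 1) bits of a non-negative integer, rounding to nearest even."""
    kept = value_bits >> extra
    L = kept & 1                                             # LSB of the kept part
    R = (value_bits >> (extra - 1)) & 1                      # first dropped bit
    S = int((value_bits & ((1 << (extra - 1)) - 1)) != 0)    # sticky: OR of the rest
    increment = (R & S) | (R & (1 - S) & L)                  # RS + RS'L
    return kept + increment

print(round_nearest_even(0b1011, 2))   # 10.11 -> 11
print(round_nearest_even(0b1010, 2))   # 10.10 is a tie -> 10 (even)
print(round_nearest_even(0b1110, 2))   # 11.10 is a tie -> 100 (even)
```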
