Binary FLP Number Systems. A binary floating-point (FLP) number is written $X = (S, E, M)_2 = (-1)^S \cdot M \cdot \beta^{E}$ with $\beta = 2$. S: sign, 0 = (+) and 1 = (−). M: mantissa or significand, $1 > M \ge 0.5$ or $2 > M \ge 1$. E: exponent or characteristic; biased. Larger range and less precision than fixed-point numbers. IEEE Standard 754.
Binary FLP Number Systems • Binary FLP number system • $X = (S, E, M)_2 = (-1)^S \cdot M \cdot \beta^{E}$, with $\beta = 2$ • S: sign, 0 = (+) and 1 = (−) • M: mantissa or significand, • $1 > M \ge 0.5$ or • $2 > M \ge 1$ • E: exponent or characteristic; biased • Larger range and less precision than fixed-point numbers
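IEEE Standard 754 • Single-precision format for FLP • 32 bits: 1 sign bit, e = 8 exponent bits, f = 23 mantissa bits • $F = (-1)^s\,(1.f)\,2^{E-127}$ if $254 \ge E \ge 1$ • $F = (-1)^s\,(0.f)\,2^{-126}$ if $E = 0$ • $E = 255$: $\pm\infty$ if $f = 0$; NaN if $f \ne 0$ • Ranges ($254 \ge E \ge 1$) • $(2 - 2^{-23})\,2^{254-127} \ge F^{+} \ge 1 \cdot 2^{1-127}$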
IEEE Standard 754 (2) • Ranges ($E = 0$) • $(1 - 2^{-23})\,2^{-126} \ge F^{+} \ge 2^{-23} \cdot 2^{-126}$ • Denormalized number • May be excluded in the arithmetic unit • Hidden bit: $(1.f)$ for $E \ge 1$ • $\pm 0$: $E = 0$ and $f = 0$
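The single-precision rules above (normalized, denormalized, and the reserved E = 255 cases) can be checked with a small decoder. This is a minimal sketch, assuming Python's `struct` module to obtain the 32-bit pattern; the function name `decode_single` is hypothetical and not part of the slides.

```python
import struct

def decode_single(x):
    """Round x to IEEE 754 single precision, split it into (s, E, f),
    and rebuild the value from the formulas on the slide."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                 # sign bit
    E = (bits >> 23) & 0xFF        # 8-bit biased exponent
    f = bits & 0x7FFFFF            # 23-bit fraction
    sign = (-1.0) ** s
    if E == 255:                   # reserved: infinity or NaN
        value = float("nan") if f else sign * float("inf")
    elif E == 0:                   # denormalized: (0.f) * 2^-126
        value = sign * (f / 2**23) * 2.0 ** -126
    else:                          # normalized, hidden bit: (1.f) * 2^(E-127)
        value = sign * (1 + f / 2**23) * 2.0 ** (E - 127)
    return s, E, f, value

print(decode_single(1.0))
print(decode_single(-0.15625))
print(decode_single(2.0 ** -149))  # smallest positive denormalized number
```

The last call corresponds to the lower end of the E = 0 range, $2^{-23} \cdot 2^{-126}$.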
Double-Precision Format • 64 bits: e = 11 exponent bits, f = 52 mantissa bits • $F = (-1)^s\,(1.f)\,2^{E-1023}$ if $2046 \ge E \ge 1$ • Reserved values: $E = 0$ and $E = 2047$ • Comparisons
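As a quick check of the double-precision formula, the largest and smallest normalized values can be compared with what the Python runtime reports. This sketch assumes that Python floats are IEEE 754 doubles, which holds on mainstream platforms but is not guaranteed by the language.

```python
import sys

# E = 2046 with f = all ones: (2 - 2^-52) * 2^(2046 - 1023)
largest = (2 - 2.0 ** -52) * 2.0 ** (2046 - 1023)
# E = 1 with f = 0: 1.0 * 2^(1 - 1023)
smallest_normal = 1.0 * 2.0 ** (1 - 1023)

print(largest == sys.float_info.max)
print(smallest_normal == sys.float_info.min)
```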
General Format • $F = (-1)^s \cdot M \cdot \beta^{E-\mathrm{bias}}$ • $1 > M \ge 1/\beta$ • $\beta = 2^k$ • Hidden bit used • $F = (-1)^s\,(0.1M)_2\,2^{E-\mathrm{bias}}$ • $\beta = 2$ • Zero? $E = M = 0$; smallest number: $E = 1$
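Operations • $X = (-1)^{S_1} M_1 \beta^{E_1 - b} > Y = (-1)^{S_2} M_2 \beta^{E_2 - b}$ • ADD/SUB • $X \pm Y = \left((-1)^{S_1} M_1 \pm (-1)^{S_2} M_2\,\beta^{-(E_1 - E_2)}\right) \beta^{E_1 - b}$ • If $1 \le M_{new} < 2$: post-normalization • Steps for add/sub: • Difference $d = |E_1 - E_2|$ • Shift the smaller operand $d$ base-$\beta$ digits to the right • Add and set $E_{new}$ = the larger exponent • Post-normalize and check overflow/underflow (OV/UV) if necessary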
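The add/sub steps above can be sketched on a toy binary format. Everything here is an illustrative assumption: the tuple layout `(s, E, Mi)` with value $(-1)^s (Mi / 2^m)\,2^{E-b}$, the defaults `m=4` and `bias=8`, chopping of shifted-out bits, and the name `fl_add`. Overflow/underflow checks are left out.

```python
def fl_add(x, y, m=4, bias=8):
    """Add two toy FLP numbers (s, E, Mi); value = (-1)^s * (Mi / 2^m) * 2^(E - bias).
    The bias only matters when converting to a real value; it is unused here."""
    (s1, e1, m1), (s2, e2, m2) = x, y
    if e1 < e2:                          # make operand 1 the one with the larger exponent
        (s1, e1, m1), (s2, e2, m2) = (s2, e2, m2), (s1, e1, m1)
    d = e1 - e2                          # step 1: exponent difference
    m2 >>= d                             # step 2: alignment shift (lost bits are chopped)
    total = (-1) ** s1 * m1 + (-1) ** s2 * m2   # step 3: add the signed mantissas
    s, mag, e = (1 if total < 0 else 0), abs(total), e1
    if mag == 0:
        return (0, 0, 0)
    if mag >= 1 << m:                    # 1 <= M_new < 2: shift right one digit
        mag >>= 1
        e += 1
    while mag < 1 << (m - 1):            # M_new < 1/2 after cancellation: shift left
        mag <<= 1
        e -= 1
    return (s, e, mag)                   # OV/UV check on e omitted in this sketch

print(fl_add((0, 8, 0b1100), (0, 8, 0b1100)))   # 0.75 + 0.75
print(fl_add((0, 9, 0b1000), (1, 8, 0b1111)))   # 1.0 - 0.9375
```

In the second call the result is 0.125 rather than the exact 0.0625, because the bit shifted out during alignment is chopped.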
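Operations (2) • MUL • $X \cdot Y = \left((-1)^{S_1} M_1\right)\left((-1)^{S_2} M_2\right) \beta^{(E_1 + E_2 - b) - b}$ • $E_{new} = E_1 + E_2 - b$ • If $1/\beta^2 \le M_{new} < 1/\beta$: post-normalization • DIV • Check $Y = 0$? If yes, set NaN or $\pm\infty$ • $X / Y = \dfrac{(-1)^{S_1} M_1}{(-1)^{S_2} M_2}\,\beta^{(E_1 - E_2 + b) - b}$ • $E_{new} = E_1 - E_2 + b$ • If $1 \le M_{new} < \beta$: post-normalization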
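The multiplication rule can be sketched the same way. The toy format `(s, E, Mi)` with value $(-1)^s (Mi / 2^m)\,2^{E-b}$, the defaults `m=4` and `bias=8`, chopping of the product, and the name `fl_mul` are assumptions for illustration; exponent range checks are omitted.

```python
def fl_mul(x, y, m=4, bias=8):
    """Multiply two toy FLP numbers (s, E, Mi); value = (-1)^s * (Mi / 2^m) * 2^(E - bias)."""
    (s1, e1, m1), (s2, e2, m2) = x, y
    if m1 == 0 or m2 == 0:
        return (0, 0, 0)
    s = s1 ^ s2                      # sign of the product
    e = e1 + e2 - bias               # E_new = E1 + E2 - b
    p = m1 * m2                      # 2m-digit product, M_new = p / 2^(2m)
    if p < 1 << (2 * m - 1):         # 1/4 <= M_new < 1/2: post-normalize by one digit
        p <<= 1
        e -= 1
    return (s, e, p >> m)            # chop the product back to m digits

print(fl_mul((0, 9, 0b1000), (0, 9, 0b1000)))   # 1.0 * 1.0
print(fl_mul((0, 8, 0b1100), (1, 8, 0b1100)))   # 0.75 * (-0.75)
```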
Choice of $\beta$ • Range • Speed (alignment shift)
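Choice of $\beta$ (2) • Max. relative representation error (MRRE) $= 0.5\,(\mathrm{ulp})\,\beta$ • $\max\left[(M(x) - x)/x\right] \le \dfrac{0.5\,(\mathrm{ulp})\,\beta^{E}}{M \beta^{E}} \le 0.5\,(\mathrm{ulp})\,\beta$ • Average RRE (ARRE) $= (\mathrm{ulp})(\beta - 1)/(4 \ln \beta)$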
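To see how the two error measures move with the base, the formulas can be evaluated for a few values of $\beta$. This is a sketch under one stated assumption: every base uses the same 24-bit mantissa, so ulp $= 2^{-24}$ in each case. The function names are hypothetical.

```python
import math

def mrre(beta, ulp):
    """Maximum relative representation error: 0.5 * ulp * beta."""
    return 0.5 * ulp * beta

def arre(beta, ulp):
    """Average relative representation error: ulp * (beta - 1) / (4 ln beta)."""
    return ulp * (beta - 1) / (4 * math.log(beta))

ULP = 2.0 ** -24   # assumed: same 24-bit mantissa for every base
for beta in (2, 4, 16):
    print(beta, mrre(beta, ULP), arre(beta, ULP))
```

Under that assumption both measures grow with $\beta$, which is the precision side of the trade-off against range and alignment-shift speed.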
Rounding • Trade-offs • Implementation cost (machine) • Accuracy (numerical) • Rounding • $M(\cdot)$: machine representation; $x$, $y$ real • $M(x) \le M(y)$ if $x \le y$ • If $x \in M(\cdot)$ then $M(x) = x$ • If $M(y) \le x \le M(y) + \mathrm{ulp}$ then $M(x) = M(y)$ or $M(x) = M(y) + \mathrm{ulp}$
Truncation (chopping) • Neglect the extra LSB digit(s) • M(x)=chop(x) • Error=(M(x)-x) • Ex. x=010.1 then M(x)=010
Round-to-the-nearest • Rounding in general • $M(x) = \mathrm{chop}(x + \mathrm{ulp}/2)$ • Ex. $x = 010.1$, then ulp $= 1.0$ and $M(x) = 011$
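Both schemes can be tried on the slide's example $x = 010.1$ (2.5 in decimal). The sketch below uses exact rationals from `fractions` and handles non-negative $x$ only; the function names are hypothetical.

```python
from fractions import Fraction
import math

def chop(x, ulp=Fraction(1)):
    """Truncation for non-negative x: drop everything below one ulp."""
    return math.floor(x / ulp) * ulp

def round_nearest(x, ulp=Fraction(1)):
    """Round-to-the-nearest as on the slide: chop(x + ulp/2)."""
    return chop(x + ulp / 2, ulp)

x = Fraction(5, 2)            # binary 010.1
print(chop(x))                # binary 010
print(round_nearest(x))       # binary 011
```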
Average Error • Ave. Err $= \sum \mathrm{Error} / 2^d$ • d: number of extra bits • Ave. truncation error of the discarded fraction • $= \sum_{i=0}^{2^d - 1} i / 2^{2d} = (2^d - 1)/2^{d+1}$ • Ave. rounding error • $= 0.5 / 2^d = 1/2^{d+1}$
Average Error (2) • Want Ave. Err = 0 → Round-to-nearest-even (odd)
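The average-error figures can be confirmed by brute force: enumerate every pattern of one kept bit followed by $d$ extra bits and average Error $= M(x) - x$ in units of ulp. This is a sketch with hypothetical names; with this sign convention the truncation average comes out negative, with magnitude $(2^d - 1)/2^{d+1}$.

```python
from fractions import Fraction

def average_errors(d):
    """Average of M(x) - x (in ulp) over all patterns of one kept bit plus d extra bits."""
    n = 2 ** d
    half = Fraction(1, 2)
    sums = {"chop": Fraction(0), "nearest": Fraction(0), "nearest_even": Fraction(0)}
    for k in range(2 * n):
        x = Fraction(k, n)
        kept, frac = divmod(x, 1)          # integer part (kept LSB) and discarded fraction
        sums["chop"] += kept - x
        sums["nearest"] += kept + (1 if frac >= half else 0) - x
        up_even = frac > half or (frac == half and kept % 2 == 1)
        sums["nearest_even"] += kept + (1 if up_even else 0) - x
    return {name: total / (2 * n) for name, total in sums.items()}

print(average_errors(3))
```

For `d = 3` the averages are $-7/16$, $+1/16$ and $0$: only the round-to-nearest-even variant is unbiased.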
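Jamming (von Neumann) Rounding • ROM implementation • $M(\cdot)$: $X \cdots X\,(y_2 y_1 y_0 .) \leftarrow X \cdots X\,(x_2 x_1 x_0 . x_{-1})\,x_{-2} x_{-3}$ • Input $= (x_2 x_1 x_0 . x_{-1})$ • Output $= (y_2 y_1 y_0 .)$ • $(y_2 y_1 y_0 .) = (x_2 x_1 x_0 .)$ if $x_{-1} = 0$ or $(x_2 x_1 x_0) = (111)$ • Otherwise $(y_2 y_1 y_0 .) = (x_2 x_1 x_0 .) + \mathrm{ulp}$ • In general • Input bits = c bits (including d extra bits)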
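The lookup rule above can be written out as a 16-entry table. This is a minimal sketch: the address layout (three kept bits followed by one extra bit) follows the slide's example, while the dictionary representation and the name `build_rom` are assumptions.

```python
def build_rom():
    """Table for the rule on the slide: address = x2 x1 x0 . x-1, data = y2 y1 y0."""
    rom = {}
    for addr in range(16):
        kept, extra = addr >> 1, addr & 1
        if extra == 0 or kept == 0b111:
            rom[addr] = kept          # pass through; 111 is never incremented, so no carry out
        else:
            rom[addr] = kept + 1      # add one ulp
    return rom

rom = build_rom()
print(format(rom[0b0101], "03b"))     # input 010.1
print(format(rom[0b1111], "03b"))     # input 111.1
```

Because the all-ones case is passed through unchanged, the table never produces a carry into the higher-order bits.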
Jamming (von Neumann) Rounding (2) • c = 3, d = 1 • Ave. Err. $= 0.5\,(1/2)^d - 0.5\,(1/2)^c - 0.5\,(1/2)^c$ • $= 0.5\,(1/2)^d - 0.5\,(1/2)^{c-1}$ • 1st term = Ave. Err. if $c \gg 1$
Guard Digits • Find the smallest number of digits required • Ex1: m = 4 and no extra bit • $0.1000 \times 2^{1} - 0.1111 \times 2^{0} = 0.1 \times 2^{-3}$ • Missing information
Guard Digits (2) • Ex2: m = 3 and an extra bit (G) • $0.100 \times 2^{1} - 0.111 \times 2^{-1} = 0.1001 \times 2^{0} \rightarrow 0.101 \times 2^{0}$ • Rounding error!
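Both guard-digit examples can be replayed with a small aligned subtraction that keeps `g` extra bit positions. This is a sketch under stated assumptions: mantissas are passed as integers (so `0b1000` stands for 0.1000), the first operand has the larger exponent, the result is positive, bits shifted past the extra positions are chopped, and the name `sub_with_guard` is hypothetical.

```python
def sub_with_guard(m1, e1, m2, e2, m, g):
    """Compute 0.m1 * 2^e1 - 0.m2 * 2^e2 with m mantissa bits plus g extra bits.
    Returns (mantissa with m + g bits, exponent)."""
    d = e1 - e2
    a = m1 << g                    # extend both mantissas by g extra positions
    b = (m2 << g) >> d             # alignment shift; bits beyond the extras are lost
    diff, e = a - b, e1
    while diff and diff < 1 << (m + g - 1):   # post-normalize by shifting left
        diff <<= 1
        e -= 1
    return diff, e

# Ex1: 0.1000 * 2^1 - 0.1111 * 2^0, exact result 0.1 * 2^-3
print(sub_with_guard(0b1000, 1, 0b1111, 0, m=4, g=0))
print(sub_with_guard(0b1000, 1, 0b1111, 0, m=4, g=1))
# Ex2: 0.100 * 2^1 - 0.111 * 2^-1, exact result 0.1001 * 2^0
print(sub_with_guard(0b100, 1, 0b111, -1, m=3, g=1))
print(sub_with_guard(0b100, 1, 0b111, -1, m=3, g=2))
```

With no extra bit, Ex1 returns $0.1000 \times 2^{-2}$, twice the exact value; one extra bit fixes it. In Ex2 one extra bit yields $0.1010 \times 2^{0}$, and two extra bits are needed to recover $0.10010 \times 2^{0}$.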
Guard Digits (3) • Require at least two digits • Guard digit (G) • Round digit (R) for the round-to-the-nearest scheme • Require two digits and a sticky bit (S) • if the round-to-the-nearest-even (odd) scheme is applied • S = logical OR of all shifted-out (lost) bits
Guard Digits(4) • Ex2: m=3 & RGS • LSB=RS+RS’L
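One way to read the expression $RS + RS'L$ is as the increment added at the LSB under round-to-nearest-even, with L the current least significant mantissa bit. That reading is an assumption, as are the function names; the sketch below simply evaluates the expression.

```python
def round_increment(L, R, S):
    """Evaluate R*S + R*S'*L: 1 if one ulp should be added at the LSB."""
    return (R & S) | (R & (S ^ 1) & L)

def round_nearest_even(mantissa, R, S):
    """Apply the increment to an integer mantissa (a carry out would need renormalizing)."""
    return mantissa + round_increment(mantissa & 1, R, S)

for L in (0, 1):
    for R in (0, 1):
        for S in (0, 1):
            print(L, R, S, round_increment(L, R, S))
```

In the tie case (R = 1, S = 0) the increment equals L, so the result always ends in an even LSB.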