Floating Point (FLP) Representation
• A floating point value: f = m * r**e, where:
• m – mantissa (the fractional part)
• r – base or radix, usually r = 2
• e – exponent
Normalization
• Normalized value: 0.1011
• Unnormalized value: 0.001011
• Normalization: 0.001011 * 2**2 * 2**-2 = 0.1011 * 2**-2
• Value of a normalized mantissa: 0.5 <= m < 1
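The shift-and-compensate step above can be sketched in Python. This is a minimal illustration, not part of the slides: it assumes the value is held as an ordinary Python float and repeatedly halves or doubles the mantissa while adjusting the exponent, until 0.5 <= m < 1.

```python
# A sketch of normalization: shift the mantissa until 0.5 <= m < 1
# and adjust the exponent to compensate (value == m * 2**e throughout).
def normalize(value):
    """Return (m, e) with value == m * 2**e and 0.5 <= |m| < 1."""
    if value == 0:
        return 0.0, 0
    m, e = abs(value), 0
    while m >= 1.0:      # mantissa too large: shift right, bump exponent
        m /= 2.0
        e += 1
    while m < 0.5:       # leading zeros: shift left, lower exponent
        m *= 2.0
        e -= 1
    return (m if value > 0 else -m), e

# 0.001011 in binary is 0.171875 decimal; normalizing gives
# m = 0.6875 (binary 0.1011) and e = -2, matching the slide.
print(normalize(0.171875))   # (0.6875, -2)
```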
FLP Format
• sign | exponent | mantissa
• sign bit: 0 for +, 1 for -
• Biased exponent: assume the exponent has q bits, so
• -2**(q-1) <= e <= 2**(q-1) - 1; adding the bias
• 2**(q-1) to all sides gives:
• 0 <= eb <= 2**q - 1
• e – true exponent;
• eb – biased exponent
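The bias arithmetic is a one-line mapping in each direction. A small sketch, assuming a q-bit exponent field as defined above:

```python
# Biased-exponent conversion for a q-bit exponent field:
# eb = e + 2**(q-1), as on the slide.
def to_biased(e, q):
    bias = 2 ** (q - 1)
    assert -bias <= e <= bias - 1, "true exponent out of range"
    return e + bias

def from_biased(eb, q):
    return eb - 2 ** (q - 1)

# With q = 10 the bias is 512; a true exponent of -2 is stored as 510.
print(to_biased(-2, 10))    # 510
print(from_biased(510, 10)) # -2
```

Storing the exponent biased makes all stored exponents non-negative, so FLP values with the same sign can be compared with plain integer comparison of their bit patterns.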
Example
• f = -0.5078125 * 2**-2
• Assume a 32-bit format:
• sign – 1 bit, exponent – 10 bits (q = 10),
• mantissa – 21 bits
• q - 1 = 9, b = bias = 2**9 = 512,
• e = -2, eb = e + b = -2 + 512 = 510
• f representation:
• 1 0111111110 1000001 0…0
• since 0.5 = 0.1 (binary) and 0.0078125 = 2**-7
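The worked example can be checked mechanically. This is a sketch for the hypothetical 1/10/21 format used on this slide (not a standard layout): it biases the exponent and peels off mantissa bits one at a time by repeated doubling.

```python
# Encode f = m * 2**e in the slide's hypothetical 1/10/21 format:
# 1 sign bit, q-bit biased exponent, frac_bits mantissa bits.
def encode(m, e, q=10, frac_bits=21):
    sign = '1' if m < 0 else '0'
    eb = e + 2 ** (q - 1)            # bias = 512 for q = 10
    m = abs(m)
    bits = ''
    for _ in range(frac_bits):       # peel off one mantissa bit per step
        m *= 2
        bits += '1' if m >= 1 else '0'
        m -= int(m)
    return sign, format(eb, f'0{q}b'), bits

print(encode(-0.5078125, -2))
# ('1', '0111111110', '100000100000000000000')
```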
Range of representation
• In fixed point, the largest number representable in 32 bits is
• 2**31 - 1, approximately equal to 10**9.
• In the previous 32-bit format, the largest number representable is (1 - 2**-21) * 2**511,
• approximately equal to 10**153.
• The smallest (positive normalized) is 0.5 * 2**-512 = 2**-513.
• If a result falls above the largest value we have an overflow; if below the smallest, an underflow.
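These decimal magnitudes come from taking base-10 logarithms; a quick check of the slide's estimates for the hypothetical 1/10/21 format:

```python
# Estimate the decimal magnitude of the format's range:
# largest = (1 - 2**-21) * 2**511, smallest = 0.5 * 2**-512 = 2**-513.
import math

largest = (1 - 2**-21) * 2.0**511
smallest_exp10 = -513 * math.log10(2)   # log10 of 2**-513

print(math.log10(largest))   # about 153.8, i.e. roughly 10**153
print(smallest_exp10)        # about -154.4
```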
IEEE FLP Standard 754 (1985)
• Single precision: 32 bits
• Double precision: 64 bits
• Single precision:
• f = ±1.M * 2**(E' - 127), where:
• M – fractional, E' – biased exponent,
• bias = 127
• Format: sign – 1 bit, exponent – 8 bits,
• fractional – 23 bits
• True exponent E = E' - 127
• 0 < E' < 255 (the end values are reserved for special cases)
Normalized single precision
• Normalized: 1.xxxxxx
• The 1 before the binary point is not stored, but assumed to exist.
• Example: convert 5.25 to single precision representation.
• 5.25 = 101.01, which is not normalized.
• Normalized: 1.0101 * 2**2
• True exponent E = 2,
• biased exponent E' = E + 127 = 129, thus:
• 0 10000001 0101 0…0
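Since Python's standard `struct` module can reinterpret a float's bytes, the slide's result can be verified against the real IEEE 754 single-precision encoding:

```python
# Decode 5.25 as IEEE 754 single precision: pack to 4 bytes,
# unpack as an unsigned int, then slice the 32-bit pattern.
import struct

bits = format(struct.unpack('>I', struct.pack('>f', 5.25))[0], '032b')
sign, exponent, fraction = bits[0], bits[1:9], bits[9:]

print(sign)       # '0'
print(exponent)   # '10000001'  (129 = 2 + 127)
print(fraction)   # '01010000000000000000000'  (hidden 1 not stored)
```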
Double precision
• Value represented:
• ±1.M * 2**(E' - 1023)
• Format: sign – 1 bit, exponent – 11 bits,
• fractional – 52 bits
• Bias = 1023
• Maximal number representable in single precision: approximately 10**38
• In double precision: approximately 10**308
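Python's built-in `float` is an IEEE 754 double on all mainstream platforms, so these limits can be read off directly from `sys.float_info`:

```python
# Inspect the IEEE 754 double-precision limits of Python's float.
import sys

print(sys.float_info.max)       # about 1.8 * 10**308
print(sys.float_info.min)       # about 2.2 * 10**-308 (smallest normal)
print(sys.float_info.mant_dig)  # 53 = 52 stored bits + the hidden 1
```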
Precision
• Increasing the exponent field increases the range, but the fractional field shrinks, decreasing the precision.
• Suppose we want a precision of n decimal digits. How many bits x do we need in the fractional field?
• 2**x = 10**n; taking the decimal log of both sides: x log 2 = n; x = n / log 2 = n / 0.301.
• For n = 7 we need 7 / 0.301 = 23.3, i.e. 24 bits.
• This is achieved in the single precision standard, since M has 23 bits plus the leading 1, which is not stored but exists.
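The bits-per-digit conversion above is a one-liner; a small sketch rounding up to whole bits:

```python
# n decimal digits need about n / log10(2) fraction bits.
import math

def bits_needed(n_digits):
    return math.ceil(n_digits / math.log10(2))

print(bits_needed(7))    # 24 -> matches single precision (23 + hidden 1)
print(bits_needed(15))   # 50 -> fits inside double precision's 53
```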
Extended Precision: 80 bits
• Not fully specified by the IEEE standard (it is one realization of the standard's "double extended" format). Used primarily in Intel x87 processors.
• Exponent: 15 bits.
• Fractional: 64 bits.
• This is why FLP registers in Intel processors are 80 bits wide, not 64.
• Its precision is about 19 decimal digits (64 * 0.301 = 19.3).
FLP Computation
• Given two FLP values, with exponents aligned so that Xe <= Ye:
• X = Xm * 2**Xe; Y = Ym * 2**Ye
• X ± Y = (Xm * 2**(Xe - Ye) ± Ym) * 2**Ye
• X * Y = (Xm * Ym) * 2**(Xe + Ye)
• X / Y = (Xm / Ym) * 2**(Xe - Ye)
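The formulas above translate directly to code on (mantissa, exponent) pairs. A minimal sketch that performs only the exponent alignment, leaving out rounding and renormalization of the result:

```python
# FLP arithmetic on (mantissa, exponent) pairs per the slide's formulas.
def flp_add(xm, xe, ym, ye):
    if xe > ye:                       # ensure Xe <= Ye before aligning
        xm, xe, ym, ye = ym, ye, xm, xe
    return xm * 2.0 ** (xe - ye) + ym, ye   # shift X's mantissa, add

def flp_mul(xm, xe, ym, ye):
    return xm * ym, xe + ye           # multiply mantissas, add exponents

def flp_div(xm, xe, ym, ye):
    return xm / ym, xe - ye           # divide mantissas, subtract exponents

# 0.75*2**1 + 0.5*2**3 = 1.5 + 4 = 5.5 = 0.6875 * 2**3
print(flp_add(0.75, 1, 0.5, 3))   # (0.6875, 3)
```

In real hardware the sum's mantissa may leave the normalized range [0.5, 1), so a post-normalization step (as in the Normalization slide) and rounding would follow.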