Week 5 IEEE Floating Point: Revision Guide for Phase Test
Floating Point • 15900000000000000 could be represented as 159 * 10^14 (mantissa 159, exponent 14), 15.9 * 10^15 or 1.59 * 10^16 • A calculator might display 159 E14
Binary Fractions • The value of real binary numbers: 101.101 = 4 + 1 + 1/2 + 1/8 = 4 + 1 + 0.5 + 0.125 = 5.625 = 5 ⅝
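The place-value sum above can be checked with a short Python sketch. The function name `binary_to_decimal` is a hypothetical helper introduced here for illustration, and it assumes an unsigned binary string with at most one point.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate an unsigned binary string such as '101.101' as a decimal value."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Integer part: weights 1, 2, 4, ... read from right to left.
    for i, bit in enumerate(reversed(whole)):
        value += int(bit) * 2 ** i
    # Fractional part: weights 1/2, 1/4, 1/8, ... read from left to right.
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2 ** -i
    return value


print(binary_to_decimal("101.101"))  # 5.625
```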
IEEE Single Precision • The number will occupy 32 bits. • The first bit represents the sign of the number: 1 = negative, 0 = positive. • The next 8 bits specify the exponent, stored in biased 127 form. • The remaining 23 bits carry the mantissa, normalised to be between 1 and 2, i.e. 1 <= mantissa < 2
Basic Conversion • Converting a decimal number to a floating point number. • 1. Take the integer part of the number and generate the binary equivalent. • 2. Take the fractional part and generate a binary fraction • 3. Then place the two parts together and normalise.
IEEE – Example 1 • Convert 6.75 to 32 bit IEEE format. • 1. The mantissa: the integer part first. • 6 / 2 = 3 r 0 • 3 / 2 = 1 r 1 • 1 / 2 = 0 r 1 • Reading the remainders from last to first: 6 = 110₂ • 2. The fraction next. • .75 * 2 = 1.5 • .5 * 2 = 1.0 • Reading the integer digits from first to last: .75 = 0.11₂ • 3. Put the two parts together: 110.11 • Now normalise: 1.1011 * 2^2
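The conversion steps for 6.75 can be sketched in Python as follows. The names `to_binary` and `normalise` are hypothetical helpers, not part of any library, and the sketch assumes a positive, non-zero input.

```python
def to_binary(value: float, max_frac_bits: int = 23) -> str:
    """Convert a positive decimal number to a binary string such as '110.11'."""
    whole = int(value)
    frac = value - whole
    # Integer part: divide by 2 repeatedly; remainders are read last to first.
    int_bits = ""
    while whole > 0:
        whole, remainder = divmod(whole, 2)
        int_bits = str(remainder) + int_bits
    # Fractional part: multiply by 2 repeatedly; the integer digits are the bits.
    frac_bits = ""
    while frac > 0 and len(frac_bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit
    return (int_bits or "0") + "." + (frac_bits or "0")


def normalise(bits: str) -> tuple[str, int]:
    """Move the point to just after the leading 1; return (mantissa, exponent)."""
    whole, _, frac = bits.partition(".")
    digits = whole + frac
    first_one = digits.index("1")  # assumes the value is not zero
    exponent = len(whole) - 1 - first_one
    mantissa = "1." + (digits[first_one + 1:] or "0")
    return mantissa, exponent


print(to_binary(6.75))             # 110.11
print(normalise(to_binary(6.75)))  # ('1.1011', 2)
```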
IEEE Biased 127 Exponent • To generate a biased 127 exponent, take the value of the signed exponent and add 127. • Example: 2^16 becomes 2^(127+16) = 2^143, so the stored value for the exponent is 143 = 10001111₂ • So it is now simply an unsigned value.
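A minimal Python sketch of the biasing step, assuming the result only has to fit in the 8-bit exponent field. `biased_exponent` is a hypothetical helper name.

```python
BIAS = 127


def biased_exponent(exponent: int) -> str:
    """Return the 8-bit pattern stored for a signed exponent."""
    stored = exponent + BIAS
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return format(stored, "08b")


print(biased_exponent(16))  # 10001111
```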
Why Biased? • The smallest exponent is the pattern 00000000 • There is only one pattern for an exponent of zero: 01111111 • The highest exponent is the pattern 11111111 • To increase the exponent by one, simply add 1 to the present pattern.
Back to the example • Our original example revisited: 1.1011 * 2^2 • Exponent is 2 + 127 = 129, or 10000001 in binary. • NOTE: The normalised mantissa always has a ‘1’ before the point. Storing it would waste a bit, so it is implied but not actually stored: 1.1000 is stored as .1000 • 6.75 in 32 bit IEEE floating point representation: • 0 10000001 10110000000000000000000 • sign(1) exponent(8) mantissa(23)
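The bit pattern for 6.75 can be cross-checked with Python's standard `struct` module, which packs a value as a 32 bit IEEE single. `float_to_bits` is a hypothetical helper name; the spacing between the three fields is added only for readability.

```python
import struct


def float_to_bits(value: float) -> str:
    """Return the 32 bit single precision pattern as 'sign exponent mantissa'."""
    # Pack as a big-endian 32 bit float, then reinterpret the bytes as an integer.
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(as_int, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(float_to_bits(6.75))  # 0 10000001 10110000000000000000000
```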
Special cases • Zero, + infinity and - infinity. • Zero is a pattern that contains only ‘0’s: 00000000000000000000000000000000 • Positive infinity is the pattern 0 11111111 followed by 23 zero mantissa bits • Negative infinity is the pattern 1 11111111 followed by 23 zero mantissa bits
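These special patterns can be inspected with a small Python sketch using the standard `struct` module. `bits_to_float` is a hypothetical helper name, and the infinity patterns used here follow the IEEE 754 layout: an exponent field of all ones with a mantissa field of all zeros.

```python
import struct


def bits_to_float(bits: str) -> float:
    """Interpret a 32 character bit string (spaces ignored) as an IEEE single."""
    as_int = int(bits.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", as_int))
    return value


zero_mantissa = "0" * 23
print(bits_to_float("0 00000000 " + zero_mantissa))  # 0.0
print(bits_to_float("0 11111111 " + zero_mantissa))  # inf
print(bits_to_float("1 11111111 " + zero_mantissa))  # -inf
```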
Truncation and Rounding • After arithmetic operations on a floating point number, the result may have more mantissa bits than before. • Since we have fixed storage (23 places) for the mantissa, we need to limit these bits. • The simplest approach is to truncate the result prior to storage. • Example: 0.1101101 stored in 4 bits => 0.1101 (loss 0.0000101)
Rounding • If the lost part is more than ½ of the LSB then add 1 to the LSB • Example, in 4 bits: • 0.1101101 -> 0.1101 + 0.0001 = 0.1110 (rounded UP) • 0.1101011 -> 0.1101 (rounded DOWN) • NOTE: Rounding is always preferred to truncation, partly because it is intrinsically more accurate and partly because we end up with a FAIR error (the error can go either up or down).
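The two 4 bit examples can be reproduced with the following Python sketch. `truncate` and `round_nearest` are hypothetical helper names. As a simplifying assumption, the sketch rounds up whenever the first lost bit is 1, so an exact tie also rounds up; the IEEE 754 default mode instead rounds ties to the even neighbour.

```python
def truncate(bits: str, places: int) -> str:
    """Drop everything after the first `places` fractional bits."""
    whole, _, frac = bits.partition(".")
    return whole + "." + frac[:places].ljust(places, "0")


def round_nearest(bits: str, places: int) -> str:
    """Keep `places` fractional bits, adding 1 to the LSB if the first lost bit is 1."""
    whole, _, frac = bits.partition(".")
    kept = frac[:places].ljust(places, "0")
    lost = frac[places:]
    scaled = int(whole + kept, 2)  # the kept value counted in units of one LSB
    if lost.startswith("1"):
        scaled += 1
    int_part, frac_part = divmod(scaled, 2 ** places)
    return f"{int_part:b}.{frac_part:0{places}b}"


print(truncate("0.1101101", 4))       # 0.1101
print(round_nearest("0.1101101", 4))  # 0.1110
print(round_nearest("0.1101011", 4))  # 0.1101
```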
Other Considerations • Truncation always undervalues the result, which can lead to a systematic error. • Rounding has one major disadvantage: it requires up to two further arithmetic operations. • Note: when we use floating point, care has to be taken when comparing numbers, because we are generating binary fractions of a predefined length. There is always the chance of recurring fractions, like 1/3 in decimal (0.333333...).
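The comparison problem is easy to demonstrate in Python. Note that Python floats are IEEE double precision rather than the 32 bit format described here, but the same effect applies: 0.1 and 0.2 are recurring fractions in binary. `math.isclose` from the standard library is one common way to compare within a tolerance.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1, 0.2 and 0.3 are all stored approximately.
print(total == 0.3)              # False

# Comparing within a small relative tolerance succeeds.
print(math.isclose(total, 0.3))  # True
```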
From Floating Point Binary to Decimal • Example: 1 01111011 11100000100000000000000 • Sign = 1, therefore this is a negative number. • Exponent 01111011 = 64+32+16+8+2+1 = 123 • Subtract the 127 bias: 123 - 127 = -4 • Mantissa = 1.111000001 (with the implied leading 1 restored) • 1.111000001 * 2^-4 • = -0.0001111000001 • = -(1/16 + 1/32 + 1/64 + 1/128 + 1/8192) • = -0.1173095703125
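The decoding steps above can be sketched field by field in Python. `decode_single` is a hypothetical helper name, and the sketch assumes a normalised number (it does not handle the zero or infinity patterns).

```python
def decode_single(bits: str) -> float:
    """Decode a 32 bit pattern 'sign exponent mantissa' (spaces ignored)."""
    bits = bits.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    # Remove the bias of 127 to recover the signed exponent.
    exponent = int(bits[1:9], 2) - 127
    # Restore the implied leading 1 in front of the 23 stored fraction bits.
    mantissa = 1 + int(bits[9:], 2) / 2 ** 23
    return sign * mantissa * 2.0 ** exponent


print(decode_single("1 01111011 11100000100000000000000"))  # -0.1173095703125
```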
Floating Point Maths • Floating point addition and subtraction. • Make sure that the two numbers are of the same magnitude. Their Exponents have to be equal. • We then add or subtract the mantissas • Starting with the existing exponent re-normalise if needed.
Example • 1.1 * 2^3 + 1.1 * 2^2 • Select the smaller number and make the mantissa smaller by moving the point whilst increasing the exponent until the exponents match. • 1.1 * 2^2 = 0.11 * 2^3 • Add the mantissas. • Re-normalise.
Example • 1.1 * 2^3 = 001.1 * 2^3 • + 1.1 * 2^2 = 000.11 * 2^3 • Sum = 010.01 * 2^3 • Re-normalise: 010.01 * 2^3 = 1.001 * 2^4
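The align, add and re-normalise steps can be sketched in Python with each number held as a (mantissa, exponent) pair. `fp_add` is a hypothetical helper name; mantissas are ordinary Python floats and are assumed positive, so binary 1.1 is written as 1.5 and binary 1.001 comes back as 1.125.

```python
def fp_add(m1: float, e1: int, m2: float, e2: int) -> tuple[float, int]:
    """Add m1 * 2**e1 and m2 * 2**e2, returning a normalised (mantissa, exponent)."""
    # Make (m1, e1) the number with the larger exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Align: shrink the smaller number's mantissa until the exponents match.
    m2 = m2 / 2 ** (e1 - e2)
    total, exponent = m1 + m2, e1
    # Re-normalise so that 1 <= mantissa < 2.
    while total >= 2:
        total /= 2
        exponent += 1
    while 0 < total < 1:
        total *= 2
        exponent -= 1
    return total, exponent


print(fp_add(1.5, 3, 1.5, 2))  # (1.125, 4)
```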
FP math • Floating Point Multiplication • Assume two numbers a x 2^m and b x 2^n • Result: (a x 2^m) x (b x 2^n) = (a x b) x 2^(m+n) • Floating Point Division • Assume two numbers a x 2^m and b x 2^n • Result: (a x 2^m) / (b x 2^n) = (a / b) x 2^(m-n)
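Both rules can be sketched in Python on (mantissa, exponent) pairs. The names `normalise`, `fp_mul` and `fp_div` are hypothetical helpers; the re-normalisation step is an addition here, assuming the result should again satisfy 1 <= mantissa < 2, and positive mantissas are assumed.

```python
def normalise(mantissa: float, exponent: int) -> tuple[float, int]:
    """Shift a positive mantissa into the range 1 <= mantissa < 2."""
    while mantissa >= 2:
        mantissa /= 2
        exponent += 1
    while 0 < mantissa < 1:
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent


def fp_mul(a: float, m: int, b: float, n: int) -> tuple[float, int]:
    """(a * 2**m) * (b * 2**n): multiply the mantissas, add the exponents."""
    return normalise(a * b, m + n)


def fp_div(a: float, m: int, b: float, n: int) -> tuple[float, int]:
    """(a * 2**m) / (b * 2**n): divide the mantissas, subtract the exponents."""
    return normalise(a / b, m - n)


print(fp_mul(1.5, 3, 1.5, 2))  # (1.125, 6)
print(fp_div(1.5, 3, 1.5, 2))  # (1.0, 1)
```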