Chapter 3d: Floating-Point Numbers
Floating Point Numbers
• Floating point is used to represent “real” numbers
  • 1.23233, 0.0003002, 3323443898.3325358903
  • Real means “not imaginary”
• Computer floating-point numbers are a subset of real numbers
  • Limit on the largest/smallest number represented
    • Depends on number of bits used
  • Limit on the precision
    • 12345678901234567890 --> 12345678900000000000
• Floating-point numbers are approximate, while integers are an exact representation
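The precision limit can be observed directly. The sketch below is only an illustration: it assumes Python's built-in float (an IEEE double) and uses a hypothetical helper, `round_to_single`, that round-trips a value through the standard `struct` module to get single precision. The stored values differ from the decimal illustration on the slide because the hardware keeps binary digits, not decimal ones.

```python
import struct

def round_to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

exact = 12345678901234567890                    # Python integers are exact
as_double = int(float(exact))                   # 53-bit significand
as_single = int(round_to_single(float(exact)))  # 24-bit significand

print(exact)
print(as_double, "error:", as_double - exact)
print(as_single, "error:", as_single - exact)
```

Both floating-point results are close to the original integer but not equal to it, and the single-precision error is much larger.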
Scientific Notation
• $+34.383 \times 10^{2} = 3438.3$ (sign, significand, exponent)
• Normalized form: only one digit before the decimal point
  • $+3.4383 \times 10^{3} = 3438.3$
• Floating-point notation
  • +3.4383000E+03 = 3438.3
• An 8-digit significand can only represent 8 significant digits
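As a small illustration of the normalized E-notation, Python's format specifier `E` produces the same layout. The helper name `to_sci` is made up for this sketch.

```python
def to_sci(value: float, digits: int = 8) -> str:
    """Format value in normalized E-notation with `digits` significant digits."""
    decimals = digits - 1          # one digit sits before the decimal point
    return f"{value:+.{decimals}E}"

print(to_sci(34.383e2))      # +3.4383000E+03
print(to_sci(123456789.0))   # +1.2345679E+08 (the ninth digit is rounded away)
```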
Binary Floating Point Numbers
• $+101.1101_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4}$
• $= 4 + 0 + 1 + 1/2 + 1/4 + 0 + 1/16 = 5.8125$
• Normalized form: +1.011101 E+2 (that is, $1.011101_2 \times 2^{2}$), so that the binary point immediately follows the leading digit
• Note: the first digit is always non-zero, so in binary the first digit is always one
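The positional expansion above can be checked with a few lines of code. This is a minimal sketch; `binary_to_float` is a hypothetical helper that accepts an unsigned binary string with an optional binary point.

```python
def binary_to_float(bits: str) -> float:
    """Evaluate an unsigned binary string such as '101.1101'."""
    whole, _, frac = bits.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # Each fraction bit at position i contributes bit * 2**-i.
    for position, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -position
    return value

print(binary_to_float("101.1101"))          # 5.8125
print(binary_to_float("1.011101") * 2**2)   # 5.8125, the normalized form
```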
Converting Decimal Fractions to Binary
• Multiply by a power of 2, convert to binary, divide by the same power of 2
• Example: 13.387
1. Multiply by $2^{20}$: $13.387 \times 1048576 = 14037286.912$
2. If a fraction remains, multiply by a larger power of 2 or truncate it
3. Convert the integer portion to binary: $14037286_{10} = 110101100011000100100110_2$
4. Divide by $2^{20}$ (shift the radix point left 20 places): $1101.01100011000100100110_2$ (20 fraction bits)
• This works with any power of 2! Use larger powers to get more bits.
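The four steps can be sketched as follows. The function name `to_binary_fixed` is an assumption for illustration, and the sketch handles non-negative inputs with at least one fraction bit only.

```python
def to_binary_fixed(x: float, frac_bits: int) -> str:
    """Binary digits of a non-negative x with frac_bits bits after the point."""
    scaled = int(x * 2**frac_bits)                      # steps 1-2: scale, truncate
    digits = format(scaled, "b").zfill(frac_bits + 1)   # step 3: integer to binary
    return digits[:-frac_bits] + "." + digits[-frac_bits:]  # step 4: place the point

print(to_binary_fixed(13.387, 20))   # 1101.01100011000100100110
print(to_binary_fixed(5.75, 2))      # 101.11
```

Passing a larger `frac_bits` yields more fraction bits, as the slide notes.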
IEEE Floating Point Format
• Bit 31: Sign (0: positive, 1: negative)
• Bits 30-23: Exponent (8 bits), biased by 127
• Bits 22-0: Significand (23 bits); leading ‘1’ is implied, but not represented
• Number = $(-1)^{S} \times (1 + \text{Sig}) \times 2^{E-127}$
• Allows representation of numbers in the range of roughly $2^{-126}$ to $2^{+128}$ (about $10^{\pm 38}$)
• Since the significand always starts with ‘1’, we don’t have to represent it explicitly
• Significand is effectively 24 bits
• Zero is represented by Sign=Significand=Exp=0
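To make the field layout concrete, the sketch below applies the formula to a 32-bit pattern and compares the result with what the `struct` module decodes. The helper names `decode_single` and `hardware_value` are hypothetical, and the sketch covers normalized numbers only (not zero or other special encodings).

```python
import struct

def decode_single(word: int) -> float:
    """Apply (-1)^S * (1 + Sig) * 2^(E-127) to a 32-bit pattern."""
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    sig = (word & 0x7FFFFF) / 2**23       # 23 stored bits as a fraction
    return (-1) ** sign * (1 + sig) * 2.0 ** (exponent - 127)

def hardware_value(word: int) -> float:
    """Let struct interpret the same 32 bits as a big-endian float."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

for word in (0x3F800000, 0xC0000000, 0x40490FDB):
    print(hex(word), decode_single(word), hardware_value(word))
```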
IEEE Double Precision Format
• Bit 63: Sign
• Bits 62-52: Exponent (11 bits), bias: 1023
• Bits 51-32: Significand, upper 20 bits
• Bits 31-0: Significand, lower 32 bits
• Number = $(-1)^{S} \times (1 + \text{Sig}) \times 2^{E-1023}$
• Allows representation of numbers in the range of roughly $2^{-1022}$ to $2^{+1024}$ (about $10^{\pm 308}$)
• Larger significand means more precision
• Takes two 32-bit registers to hold one number
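A short sketch of the double-precision layout, assuming Python's float is an IEEE double (true on common platforms). The hypothetical helper `double_fields` also splits the 64 bits into the two 32-bit words that would occupy two registers.

```python
import struct

def double_fields(x: float):
    """Return (sign, exponent, significand, high_word, low_word) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF            # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)       # 52 stored bits
    high_word, low_word = bits >> 32, bits & 0xFFFFFFFF
    return sign, exponent, significand, high_word, low_word

sign, exponent, significand, high, low = double_fields(5.75)
print(sign, exponent - 1023, hex(significand), hex(high), hex(low))
```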
Conversion
Convert 5.75 to Single-Precision IEEE Floating Point
1. Convert $5.75_{10}$ to binary ---> $101.11_2$
2. Normalize ---> $1.0111 \times 2^{2}$ (significand 1.0111, exponent 2)
3. Sign = 0 (positive)
4. Add 127 (bias) to exponent: Exponent = $129_{10} = 10000001_2$
5. Express significand as 24 bits: Sig = 1.01110000000000000000000
6. Remove leading one from significand, leaving 23 bits: Sig = .01110000000000000000000
7. Put in proper bit fields: Number = 01000000101110000000000000000000 = 0x40B80000
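The conversion steps can be sketched in code. `encode_single` is a hypothetical helper that uses `math.frexp` for the normalization step and truncates extra significand bits. It handles nonzero values in the normalized single-precision range only.

```python
import math
import struct

def encode_single(x: float) -> int:
    """Build the 32-bit pattern for a nonzero, normalized-range value."""
    sign = 1 if x < 0 else 0                       # step 3
    mantissa, exp = math.frexp(abs(x))             # abs(x) = mantissa * 2**exp, 0.5 <= mantissa < 1
    significand, exponent = mantissa * 2, exp - 1  # step 2: 1.xxx * 2**exponent
    biased = exponent + 127                        # step 4
    fraction = int((significand - 1) * 2**23)      # steps 5-6: drop the leading one
    return (sign << 31) | (biased << 23) | fraction  # step 7

print(hex(encode_single(5.75)))                                  # 0x40b80000
print(hex(struct.unpack(">I", struct.pack(">f", 5.75))[0]))      # same pattern via struct
```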
Adding Floating Point Numbers
1.2232E+3 + 4.211E+5
1. Normalize to higher exponent
   a. Find the difference between exponents (= 2)
   b. Shift smaller number right by that amount: 1.2232E+3 == 0.012232E+5
2. Now that exponents are the same, add significands together
     4.211    E+5
   + 0.012232 E+5
   = 4.223232 E+5
• Note: If carry out of MSD, re-normalize
     5.0 E+2
   + 7.0 E+2
   = 12.0 E+2 = 1.2 E+3
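A sketch of the decimal procedure, using Python's `decimal` module so the digits match the hand calculation exactly. The function name `add_sci` and its (significand, exponent) argument layout are assumptions made for this example.

```python
from decimal import Decimal

def add_sci(sig_a: Decimal, exp_a: int, sig_b: Decimal, exp_b: int):
    """Add sig_a * 10**exp_a and sig_b * 10**exp_b; return (significand, exponent)."""
    if exp_a < exp_b:                       # make a the operand with the higher exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b = sig_b.scaleb(exp_b - exp_a)     # step 1: shift the smaller number right
    total, exp = sig_a + sig_b, exp_a       # step 2: add significands
    while abs(total) >= 10:                 # carry out of the MSD: re-normalize
        total, exp = total.scaleb(-1), exp + 1
    return total, exp

print(add_sci(Decimal("1.2232"), 3, Decimal("4.211"), 5))   # 4.223232 with exponent 5
print(add_sci(Decimal("5.0"), 2, Decimal("7.0"), 2))        # 1.20 with exponent 3
```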
Adding IEEE Floating Point Numbers
                   S  E   Sig.
  0x45B8CD8D -->   0  8B  38CD8D = $5913.694_{10}$
+ 0x46FC8672 -->   0  8D  7C8672 = $32323.22_{10}$
1. Check for Sign=Exp=Significand=0 --> If so, treat as a special case
2. Put the ‘1’ back in bit 23 of significands
   38CD8D =  011 1000 1100 1101 1000 1101 ---> 1011 1000 1100 1101 1000 1101 = B8CD8D
   7C8672 =  111 1100 1000 0110 0111 0010 ---> 1111 1100 1000 0110 0111 0010 = FC8672
  0 8B B8CD8D
+ 0 8D FC8672
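Steps 1 and 2 amount to masking out the three fields and setting bit 23. A minimal sketch, with `unpack_fields` as a hypothetical helper name:

```python
def unpack_fields(word: int):
    """Split a 32-bit pattern into (sign, exponent, 24-bit significand)."""
    if word & 0x7FFFFFFF == 0:
        return word >> 31, 0, 0               # step 1: zero is a special case
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    significand = (word & 0x7FFFFF) | (1 << 23)   # step 2: restore the hidden '1'
    return sign, exponent, significand

for word in (0x45B8CD8D, 0x46FC8672):
    s, e, sig = unpack_fields(word)
    print(s, format(e, "02X"), format(sig, "06X"))
```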
Adding IEEE Floating Point Numbers
  0 8B B8CD8D
+ 0 8D FC8672
3. Normalize to higher exponent:
   a. Find difference in exponents: 8D - 8B = 2
   b. Shift significand of number with smaller exponent right by the difference
      B8CD8D = 1011 1000 1100 1101 1000 1101
      right shift by 2 --> 0010 1110 0011 0011 0110 0011 = 2E3363
   c. Set lower-valued exponent to higher one
  0 8D 2E3363 (re-normalized form of 0 8B B8CD8D)
+ 0 8D FC8672
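The alignment in step 3 is a right shift by the exponent difference. The sketch below uses a hypothetical `align` helper. Bits shifted off the right end are simply discarded, as in the hand calculation.

```python
def align(exp_a: int, sig_a: int, exp_b: int, sig_b: int):
    """Return (common_exponent, sig_a, sig_b) after shifting the smaller operand right."""
    if exp_a < exp_b:
        return exp_b, sig_a >> (exp_b - exp_a), sig_b
    return exp_a, sig_a, sig_b >> (exp_a - exp_b)

exp, sig_a, sig_b = align(0x8B, 0xB8CD8D, 0x8D, 0xFC8672)
print(format(exp, "02X"), format(sig_a, "06X"), format(sig_b, "06X"))   # 8D 2E3363 FC8672
```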
Adding IEEE Floating Point Numbers
  0 8D 2E3363
+ 0 8D FC8672
4. Add significands (note: carry produced one too many bits):
       0010 1110 0011 0011 0110 0011
   +   1111 1100 1000 0110 0111 0010
   = 1 0010 1010 1011 1001 1101 0101 = 12AB9D5
5. Since bit 24 is ‘1’, we must re-normalize by shifting the significand right 1 and incrementing the exponent by one.
   1 0010 1010 1011 1001 1101 0101 SRL --> 1001 0101 0101 1100 1110 1010 = 955CEA (significand)
   exp: 8D --> 8E
6. Get rid of bit 23 in significand (for IEEE standard)
   1001 0101 0101 1100 1110 1010 --> 001 0101 0101 1100 1110 1010 = 155CEA
Result is: 0 8E 155CEA or 0x47155CEA = $38236.91_{10}$
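Putting the addition steps together gives the sketch below. `add_single_bits` is a hypothetical name. The sketch assumes two positive, normalized, nonzero operands and truncates instead of rounding, mirroring the hand calculation rather than a full IEEE adder.

```python
import struct

def add_single_bits(a: int, b: int) -> int:
    """Add two positive normalized single-precision bit patterns (truncating)."""
    exp_a, sig_a = (a >> 23) & 0xFF, (a & 0x7FFFFF) | 0x800000   # restore hidden bit
    exp_b, sig_b = (b >> 23) & 0xFF, (b & 0x7FFFFF) | 0x800000
    if exp_a < exp_b:                                            # a gets the higher exponent
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= exp_a - exp_b          # step 3: align to the higher exponent
    total, exp = sig_a + sig_b, exp_a    # step 4: add significands
    if total >> 24:                  # step 5: carry into bit 24, re-normalize
        total >>= 1
        exp += 1
    return (exp << 23) | (total & 0x7FFFFF)   # step 6: drop bit 23 and pack

result = add_single_bits(0x45B8CD8D, 0x46FC8672)
print(hex(result))                                             # 0x47155cea
print(struct.unpack(">f", result.to_bytes(4, "big"))[0])       # 38236.9140625
```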
Multiplying Floating Point Numbers
34.233 E+09 * 212.32 E+03
1. Add exponents: --> 9 + 3 = 12
2. Multiply significands: --> 34.233 * 212.32 = 7268.35056
   Note: Number of digits to right of decimal point in product = sum of the number of digits to right of decimal points in factors
3. Result is 7268.35056 E+12
4. Normalize: 7.26835056 E+15
5. Truncate extra digits... --> 7.26835 E+15
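The decimal multiplication steps can be sketched with the `decimal` module. The helper name `mul_sci` and the `digits` parameter (number of significant digits kept) are assumptions for this example.

```python
from decimal import Decimal, ROUND_DOWN

def mul_sci(sig_a: Decimal, exp_a: int, sig_b: Decimal, exp_b: int, digits: int = 6):
    """Multiply sig_a * 10**exp_a by sig_b * 10**exp_b; return (significand, exponent)."""
    exp = exp_a + exp_b                     # step 1: add exponents
    sig = sig_a * sig_b                     # step 2: multiply significands
    while abs(sig) >= 10:                   # step 4: normalize
        sig, exp = sig.scaleb(-1), exp + 1
    step = Decimal(1).scaleb(1 - digits)    # e.g. 0.00001 for 6 significant digits
    return sig.quantize(step, rounding=ROUND_DOWN), exp   # step 5: truncate

print(mul_sci(Decimal("34.233"), 9, Decimal("212.32"), 3))   # 7.26835 with exponent 15
```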
Multiplying IEEE Floating Point Numbers
  0 8B 38CD8D = $5913.694_{10}$
x 0 8D 7C8672 = $32323.22_{10}$
1. Check for zero.
2. Add exponents. (Note: both have the bias of 127 already. Only want to bias once, so subtract 127 (7F).)
   8B = 0C + 7F. 8D = 0E + 7F.
   Sum: (0C + 7F) + (0E + 7F) - 7F = (1A + 7F) = 99
3. Put ‘1’ back onto bit 23: 38CD8D --> B8CD8D, 7C8672 --> FC8672
4. Multiply significands:
   B8CD8D * FC8672 = 10.11 0110 0100 1011 0110 0100 1010 1111 0101 0110 1100 1010
   Multiplying two 24-bit numbers, each with 23 bits to the right of the binary point: the result has 46 bits to the right of the point
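The exponent sum and the 48-bit significand product are easy to reproduce with integer arithmetic. The variable names in this sketch are illustrative only.

```python
BIAS = 0x7F

exp_sum = 0x8B + 0x8D - BIAS            # both exponents are biased; remove one bias
sig_a = 0x38CD8D | (1 << 23)            # B8CD8D
sig_b = 0x7C8672 | (1 << 23)            # FC8672
product = sig_a * sig_b                 # up to 48 bits, 46 of them fraction bits

bits = format(product, "048b")
print(format(exp_sum, "02X"))           # 99
print(bits[:2] + "." + bits[2:])        # two bits left of the binary point, 46 right
```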
Multiplying IEEE Floating Point Numbers
  0 8B 38CD8D = $5913.694_{10}$
x 0 8D 7C8672 = $32323.22_{10}$
   10.11 0110 0100 1011 0110 0100 1010 1111 0101 0110 1100 1010
5. Re-normalize so one place to left of binary point.
   1.011 0110 0100 1011 0110 0100 1010 1111 0101 0110 1100 1010
   (Add one to exponent) --> 99 + 1 = 9A
6. Remove extra bits so only 24 bits remain (truncate)
   1.011 0110 0100 1011 0110 0100
7. Remove implied one (bit 23)
   011 0110 0100 1011 0110 0100
Result is: 0 9A 364B64 = 0x4D364B64, which decodes to $191149632_{10}$ (the decimal product 5913.694 x 32323.22 is 191149632.1747)
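The whole multiplication can be sketched as one function. `mul_single_bits` is a hypothetical name. The sketch assumes normalized, nonzero operands with no exponent overflow, and it truncates as the slides do.

```python
import struct

def mul_single_bits(a: int, b: int) -> int:
    """Multiply two normalized single-precision bit patterns (truncating)."""
    sign = (a >> 31) ^ (b >> 31)
    exp = ((a >> 23) & 0xFF) + ((b >> 23) & 0xFF) - 127   # keep a single bias
    sig_a = (a & 0x7FFFFF) | 0x800000                     # restore hidden bits
    sig_b = (b & 0x7FFFFF) | 0x800000
    product = sig_a * sig_b            # 46 fraction bits
    if product >> 47:                  # two bits left of the point: re-normalize
        product >>= 1
        exp += 1
    significand = product >> 23        # keep the top 24 bits (truncate)
    return (sign << 31) | (exp << 23) | (significand & 0x7FFFFF)

result = mul_single_bits(0x45B8CD8D, 0x46FC8672)
print(hex(result))                                          # 0x4d364b64
print(struct.unpack(">f", result.to_bytes(4, "big"))[0])    # 191149632.0
```

IEEE hardware normally rounds to nearest instead of truncating, so its last significand bit can differ from this result.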