Fixed Point & Floating Point • Bryan Duggan
Are computers good at manipulating numbers? • What is a number anyway?
What about these numbers? • ∞ • π • √2 • √−1 • 1/3 • 1/6 • 10/17, added up 17 times
A number? • A number is a mathematical object used in counting and measuring. A notational symbol which represents a number is called a numeral, but in common usage the word number is used for both the abstract object and the symbol, as well as for the word for the number • -Wikipedia
To be good computer scientists: • We should understand the symbols a computer uses to represent “concepts” like numbers • We should understand that they are only symbols • We should know the limitations of these symbols
Fixed point
• insert an implicit "binary point" between two bits
• bits to the left of the point have place value ≥ 1 (2^9 = 512 down to 2^0 = 1)
• bits to the right of the point have place value < 1 (2^-1 = 0.5 down to 2^-6 = 0.015625)
• Example (10 integer bits, 6 fraction bits): bits 0100001101.011110
  = 256 + 8 + 4 + 1 + 0.25 + 0.125 + 0.0625 + 0.03125 = 269.46875
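As a worked illustration of this layout, here is a minimal C sketch, with made-up names (FRAC_BITS, SCALE), that stores 269.46875 as a 16-bit integer scaled by 2^6 and decodes it again; the encoded pattern 0x435E is exactly the 0100001101.011110 bit string from the example above.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the 10.6 fixed-point format above: 16 bits total,
       10 integer bits, 6 fraction bits. A value is stored as
       round(x * 2^6) in an ordinary integer. */
    #define FRAC_BITS 6
    #define SCALE     (1 << FRAC_BITS)   /* 2^6 = 64 */

    int main(void) {
        double x = 269.46875;

        uint16_t raw = (uint16_t)(x * SCALE + 0.5);  /* encode: 269.46875 * 64 = 17246 */
        double back  = (double)raw / SCALE;          /* decode */

        printf("raw = 0x%04X, decoded = %f\n", (unsigned)raw, back);
        /* prints: raw = 0x435E, decoded = 269.468750
           0x435E = 0100 0011 0101 1110 = 0100001101.011110 */
        return 0;
    }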
Fixed point • Problems with fixed point representations • small range of numbers: smallest non-zero value is 0.015625, largest is 1023.984375 • fixed point means wasted bits: π is represented as 3.140625, and its 8 most significant bits are all 0
Converting to fixed point • Take the fractional part and multiply it by 2 • If the result is ≥ 1, the next bit is 1; otherwise the next bit is 0 • Start again with the remaining fractional part, until it reaches 0 (or you run out of bits)
Converting to fixed point • E.g. convert 0.75 to fixed point: • 0.75 × 2 = 1.5 → use 1 • 0.5 × 2 = 1.0 → use 1 • Answer: 0.75 in decimal = 0.11 in binary
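A small C sketch of the same multiply-by-2 procedure (the helper name frac_to_binary is made up for this example); it reproduces the 0.75 → 0.11 conversion above and the .011110 fraction from the fixed-point example earlier.

    #include <stdio.h>

    /* Print the binary digits of the fractional part of `frac`,
       up to `max_bits` places, using the multiply-by-2 method. */
    static void frac_to_binary(double frac, int max_bits) {
        printf("0.");
        for (int i = 0; i < max_bits && frac > 0.0; i++) {
            frac *= 2.0;
            if (frac >= 1.0) {        /* result >= 1: next bit is 1 */
                putchar('1');
                frac -= 1.0;
            } else {                  /* result < 1: next bit is 0 */
                putchar('0');
            }
        }
        putchar('\n');
    }

    int main(void) {
        frac_to_binary(0.75, 8);      /* prints 0.11    */
        frac_to_binary(0.46875, 8);   /* prints 0.01111 */
        return 0;
    }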
Scientific notation • Used to represent a wide range of numbers • Example: −6.341 × 10^-23 • sign: + or − (here −) • mantissa: fixed point, 1.000... to 9.999... (here 6.341) • radix (base): constant (here 10) • exponent: integer (here −23)
Scientific notation • Same idea works in binary • Example: −1.01101₂ × 2^1101₂ (the exponent 1101₂ = 13₁₀) • sign: + or − • mantissa: fixed point binary, 1.000... to 1.111... • radix (base): constant (2) • exponent: binary integer • This number equals −1.40625₁₀ × 2^13 = −11520₁₀
Scientific notation • Advantages • very wide range of representable numbers, limited only by the range of the exponent • similar precision for all values • no wasted bits • Disadvantages • some values still not exactly representable, e.g. π
Floating point • Binary representation using scientific notation • IEEE 754 standard • Layout: sign | exponent | mantissa • Example: 0 | 11011001 | 00100110000110101100101
Floating point • Sign • one bit (the sign bit) • signed magnitude representation • 0 → number is positive • 1 → number is negative • +1.5 = 0 01111111 10000000000000000000000 • −1.5 = 1 01111111 10000000000000000000000
Floating point • Exponent • could be represented using two's complement signed notation • instead represented using bias notation: the value stored in the exponent field is k more than the intended value • k is a constant (for C float, k = 127) • exponent 00000001 (1₁₀) means −126 • exponent 01111111 (127₁₀) means 0 • exponent 11111110 (254₁₀) means +127 • exponents 000...000 and 111...111 are reserved for special meanings • all bits 0: denormalized numbers and zero • all bits 1: infinity and not-a-number (indeterminate)
Floating point • Number is normalized: the exponent is chosen such that 1₁₀ ≤ mantissa < 2₁₀ • These are all valid representations of the number 0.75₁₀: 110.00000 × 2^-3, 11.000000 × 2^-2, 1.1000000 × 2^-1, 0.1100000 × 2^0, 0.0110000 × 2^1 • Choose 1.1000000 × 2^-1: its mantissa is in the range 1.000...₂ to 1.111...₂
Floating point • Mantissa • represented as a fixed-point value between 1.000...₂ and 1.111...₂ • the first bit (before the point) is always 1, so don't waste a bit storing it in the number • fixed precision: for C float, 23 bits available plus 1 implicit bit = 24 significant bits • the mantissa is sometimes called the significand
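To make the sign / exponent / mantissa split concrete, here is a hedged C sketch that copies a float's bit pattern into an integer and masks out the three fields; it assumes the platform's float is an IEEE 754 single, which is true on practically all current hardware.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = -1.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the bit pattern */

        uint32_t sign     = bits >> 31;          /* 1 bit                       */
        uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127       */
        uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 */

        printf("sign = %u, exponent = %u (unbiased %d), mantissa = 0x%06X\n",
               (unsigned)sign, (unsigned)exponent, (int)exponent - 127,
               (unsigned)mantissa);
        /* for -1.5f: sign = 1, exponent = 127 (unbiased 0), mantissa = 0x400000 */
        return 0;
    }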
To Convert fixed to floating point • Step 1: Convert to binary • Step 2: Normalise it • Step 3: Store the mantissa (drop the leading 1.) • Step 4: Add 127 to the exponent and store
Floating point • Example: −1.40625₁₀ × 2^13 = −1.01101₂ × 2^13 • sign = 1 because the number is negative • exponent = +13₁₀, so store 13 + 127 (excess) = 140₁₀ = 10001100₂ • mantissa = 1.40625₁₀ = 1.01101₂, so store 01101 (skip the leading 1) and pad on the right with 0s • Result: 1 10001100 01101000000000000000000
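Going the other way, this sketch (again assuming IEEE 754 single precision) packs the sign, biased exponent and mantissa fields worked out above into a 32-bit word and reinterprets it as a float; it should print −11520.0, confirming the hand conversion.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t sign     = 1;          /* negative                         */
        uint32_t exponent = 13 + 127;   /* 140 = 10001100 in binary         */
        uint32_t mantissa = 0x340000;   /* 01101 followed by eighteen zeros */

        uint32_t bits = (sign << 31) | (exponent << 23) | mantissa;

        float f;
        memcpy(&f, &bits, sizeof f);    /* reinterpret the bits as a float  */
        printf("bits = 0x%08X, value = %f\n", (unsigned)bits, f);
        /* prints: bits = 0xC6340000, value = -11520.000000 */
        return 0;
    }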
Floating point • C has floating point types of various sizes • 32 bits (float) • 1 bit sign, 8 bits exponent, 23 (24) bits mantissa, excess 127 • 64 bits (double) • 1 bit sign, 11 bits exponent, 52 (53) bits mantissa, excess 1023 • 80 bits (long double) • 1 bit sign, 15 bits exponent, 64 bits mantissa, excess 16383
Floating point • C does many calculations internally using doubles • most modern computers can operate on doubles as fast as on floats • it may be less efficient to use float variables, even though they are smaller • long double operations may be very slow, as they may have to be implemented in software • Using literal floating point values in C • they require a decimal point: 5 is of type int, use 5.0 if you want floating point • optional exponent: 5.0e-12 means 5.0 × 10^-12 (decimal)
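A tiny example of the literal rules just described; the printed sizes are typical for a desktop platform but are not fixed by the C standard (long double in particular is often padded to 12 or 16 bytes, or is simply a 64-bit double on some compilers).

    #include <stdio.h>

    int main(void) {
        /* typical sizes; the standard only guarantees minimum ranges */
        printf("float: %zu  double: %zu  long double: %zu bytes\n",
               sizeof(float), sizeof(double), sizeof(long double));

        printf("5 / 2   = %d\n", 5 / 2);     /* 5 is an int: integer division, gives 2 */
        printf("5.0 / 2 = %f\n", 5.0 / 2);   /* 5.0 is floating point: gives 2.500000  */
        printf("5.0e-12 = %e\n", 5.0e-12);   /* optional exponent: 5.0 x 10^-12        */
        return 0;
    }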
Limitations of floating point • Size of exponent is fixed • cannot represent very large numbers • for C float, the exponent is 8 bits, excess 127 • largest exponent is +127 (254₁₀ (11111110₂) − 127) • 11111111 is reserved for infinity (and not-a-number, NaN) • largest representable numbers • positive: 1.1111...₂ × 2^127 ≈ 3.403...₁₀ × 10^38 • negative: −1.1111...₂ × 2^127 ≈ −3.403...₁₀ × 10^38 • overflow occurs if numbers larger than this are produced: rounds up to ±infinity • solution: use a floating point format with a larger exponent • double (11 bits), long double (15 bits)
Limitations of floating point • Size of exponent is fixed • cannot represent very small numbers • for C float, the exponent is 8 bits, excess 127 • smallest exponent is −126 (1₁₀ (00000001₂) − 127) • 00000000 is reserved for zero (and denormalized numbers) • smallest representable (normalized) numbers • positive: 1.000...₂ × 2^-126 ≈ 1.175...₁₀ × 10^-38 • negative: −1.000...₂ × 2^-126 ≈ −1.175...₁₀ × 10^-38 • underflow occurs if numbers smaller than this are produced: rounds down to zero (or a denormalized number) • solution: use a floating point format with a larger exponent • double (11 bits), long double (15 bits)
Limitations of floating point • Size of mantissa is fixed • limited precision in representations • C float has 23 (24) bits of mantissa • the smallest possible change in a number is to toggle the LSB, place value 2^-23 ≈ 1.2 × 10^-7 • so a C float has (almost) 7 decimal digits of precision • solution: use a floating point format with a larger mantissa • double (53 bits), long double (64 bits)
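The limits described in the last three slides are available as named constants in <float.h>; this short sketch prints them and shows overflow rounding up to infinity (exact output depends on the platform, but on an IEEE 754 system the values match those above).

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        printf("FLT_MAX     = %e\n", FLT_MAX);      /* largest float, about 3.402823e+38             */
        printf("FLT_MIN     = %e\n", FLT_MIN);      /* smallest normalized float, about 1.175494e-38 */
        printf("FLT_EPSILON = %e\n", FLT_EPSILON);  /* 2^-23, about 1.192093e-07                     */
        printf("FLT_DIG     = %d\n", FLT_DIG);      /* guaranteed decimal digits, usually 6          */

        volatile float big = FLT_MAX;
        float over = big * 2.0f;                    /* overflow: rounds up to +infinity */
        printf("FLT_MAX * 2 = %e\n", over);
        return 0;
    }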
Limitations of floating point • Size of mantissa is fixed • some values cannot be represented exactly • e.g. 1/3₁₀ = 0.010101010101010101...₂ • the repeating binary fraction never ends, so it cannot fit in 24 (or 240, or 24000) bits • solution: none; the same problem occurs in decimal scientific notation • a higher precision floating point type improves accuracy • if an exact representation is needed, use rational numbers
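A quick check of the 1/3 example: both float and double store a rounded approximation, the double just rounds at a later bit (the printed digits assume IEEE 754 types).

    #include <stdio.h>

    int main(void) {
        float  f = 1.0f / 3.0f;
        double d = 1.0 / 3.0;

        printf("float  1/3 = %.17f\n", f);   /* about 0.33333334326744080 */
        printf("double 1/3 = %.17f\n", d);   /* about 0.33333333333333331 */
        return 0;
    }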
Limitations of floating point • Addition of floating point numbers is not associative: A + (B + C) ≠ (A + B) + C • happens when the numbers have significantly different magnitudes • for instance, with 3 significant digits: • A = 1.00 × 10^3 • B = 4.00 × 10^0 • C = 3.00 × 10^0
Limitations of floating point • A + (B + C):
  B + C: 4.00 × 10^0 + 3.00 × 10^0 = 7.00 × 10^0
  rewrite the sum so it can be added to A: 0.007 × 10^3
  A + (B + C): 1.00 × 10^3 + 0.007 × 10^3 = 1.007 × 10^3
  result after rounding to 3 significant digits: 1.01 × 10^3
Limitations of floating point • (A + B) + C:
  rewrite B so it can be added to A: 4.00 × 10^0 = 0.004 × 10^3
  A + B: 1.00 × 10^3 + 0.004 × 10^3 = 1.004 × 10^3
  result after rounding: 1.00 × 10^3, carried forward to the next addition
Limitations of floating point • (A + B) + C (continued):
  rewrite C so it can be added to the sum: 3.00 × 10^0 = 0.003 × 10^3
  carried forward from the previous slide: 1.00 × 10^3
  (A + B) + C: 1.00 × 10^3 + 0.003 × 10^3 = 1.003 × 10^3
  result after rounding: 1.00 × 10^3
  So (A + B) + C = 1.00 × 10^3, but A + (B + C) = 1.01 × 10^3
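The same effect can be reproduced with C floats (24-bit mantissa). The values below are my own, chosen so that each small addend is lost when added to the large one by itself but survives when the two small ones are added first; every intermediate result is stored in a float to force rounding at each step.

    #include <stdio.h>

    int main(void) {
        float a = 1.0e8f;   /* spacing between adjacent floats near 1e8 is 8 */
        float b = 4.0f;
        float c = 3.0f;

        float ab    = a + b;    /* 100000004 rounds back to 100000000 */
        float left  = ab + c;   /* 100000003 rounds back to 100000000 */

        float bc    = b + c;    /* 7, exact                           */
        float right = a + bc;   /* 100000007 rounds up to 100000008   */

        printf("(a + b) + c = %.1f\n", left);    /* 100000000.0 */
        printf("a + (b + c) = %.1f\n", right);   /* 100000008.0 */
        return 0;
    }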
For More Info • "Computer Organization and Architecture" – Stallings • Chapter 8 • "What Every Computer Scientist Should Know About Floating-Point Arithmetic" – Goldberg • http://cch.loria.fr/documentation/IEEE754/ACM/goldberg.pdf • "IEEE Standard 754 Floating Point Numbers" – Hollasch • http://research.microsoft.com/~hollasch/cgindex/coding/ieeefloat.html
• 01000010010101010110011000101010
• 11000000110010000101010001110100
• 01000010010000110111010110100011
• 01000000000101011111000110011101
• 01000010000110111000111011001100
• 11000010100110100001000111011001