Floating Point Arithmetic By Kimberly Pustai
Floating Point Arithmetic • Floating-point arithmetic is widely used because it has many practical advantages • It provides a familiar approximation to the real numbers, with useful properties like automatic scaling • It is widely available on different computers and is well supported by programming languages • Current workstations have highly optimized native floating-point arithmetic, sometimes faster than native integer arithmetic • Floating-point arithmetic is sufficiently widespread in scientific computing that programmers rarely consider other options
Floating Point Arithmetic • We will look at how one such machine carries out floating point arithmetic • The IEEE (Institute of Electrical and Electronics Engineers) has produced a standard for floating point arithmetic. This standard specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them
MIPS • For years computer programmers and computer architects have been coming up with ways to improve a computer's performance while at the same time lowering its cost • A number of measures have been produced in attempts to create a standard and easy-to-use measure of computer performance • One result has been that simple metrics have been heavily misused • One alternative to time as the metric is MIPS (million instructions per second)
For a given program, MIPS is simply MIPS = Instruction count / (Execution time x 10^6) • Since MIPS is an instruction execution rate, MIPS specifies performance inversely to execution time; faster machines have a higher MIPS rating • MIPS is easy to understand, and faster machines mean bigger MIPS
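As a worked illustration of the formula (the figures here are made up, not a real benchmark), a program that executes 200 million instructions in 2.5 seconds rates 80 MIPS:

#include <cstdio>

int main() {
    // Hypothetical figures, just to exercise the formula above
    double instruction_count = 200e6; // 200 million instructions
    double execution_time = 2.5;      // seconds

    // MIPS = instruction count / (execution time x 10^6)
    double mips = instruction_count / (execution_time * 1e6);
    std::printf("MIPS rating: %.1f\n", mips); // prints 80.0
    return 0;
}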
Single Precision • All computers limit the precision, the maximum number of significant digits, of a floating point number, although most machines use binary rather than decimal arithmetic So why did we talk about MIPS in the previous section? • The MIPS contains a floating-point coprocessor that operates on single precision, which is 32 bits (IEEE single precision) **Significant digit: those digits from the first nonzero digit on the left to the last nonzero digit on the right (plus any zero digits that are exact)
The sign bit represents the sign of the number: 1 represents negative and 0 represents positive • Exponent is the value of the 8-bit exponent field, including the sign of the exponent • Fraction is the 23-bit number in the fraction field
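A minimal C++ sketch of how these three fields can be pulled out of a float's bit pattern (the value -6.5 and the variable names are just illustrative):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float x = -6.5f;

    // Reinterpret the float's 32 bits as an unsigned integer
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    std::uint32_t sign = bits >> 31;              // 1 bit
    std::uint32_t exponent = (bits >> 23) & 0xFF; // 8 bits, biased by 127
    std::uint32_t fraction = bits & 0x7FFFFF;     // 23 bits

    // For -6.5: sign = 1, exponent = 129, fraction = 0x500000
    std::printf("sign=%u exponent=%u fraction=0x%06X\n", sign, exponent, fraction);
    return 0;
}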
Since the mantissa has a total of 24 bits (when you count the hidden bit) and is rounded, the magnitude of the relative error in a number is bounded by 2^{-24} = 5.96... x 10^{-8} • This means we get a bit more than 7 decimal digits of precision (the largest possible mantissa is M = 2^{24} = 16777216, which has 7+ digits of precision) *mantissa: the actual number being represented
The largest positive number that can be stored is 1.11111111111111111111111 (binary) x 2^{127} = 3.402823... x 10^{38} *Notice that 1.11111111111111111111111 (binary) = 2 - 2^{-23} • The smallest positive number is 1.00000000000000000000000 (binary) x 2^{-126} = 1.175494... x 10^{-38}
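These two limits can be checked against the constants that C++ exposes in <cfloat>; a quick sketch:

#include <cfloat>
#include <cstdio>

int main() {
    // Largest and smallest positive normalized single precision values
    std::printf("largest:  %e\n", FLT_MAX); // prints 3.402823e+38
    std::printf("smallest: %e\n", FLT_MIN); // prints 1.175494e-38
    return 0;
}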
In general, floating-point numbers are of the form (-1)^S x F x 2^E, where F is the value in the fraction field and E is the value in the exponent field
The value V represented by the word may be determined as follows:
• If E=255 and F is nonzero, then V=NaN ("Not a number")
• If E=255 and F is zero and S is 1, then V=-Infinity
• If E=255 and F is zero and S is 0, then V=Infinity
• If 0<E<255, then V=(-1)**S * 2**(E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point
• If E=0 and F is nonzero, then V=(-1)**S * 2**(-126) * (0.F). These are "unnormalized" (denormalized) values
• If E=0 and F is zero and S is 1, then V=-0
• If E=0 and F is zero and S is 0, then V=0
In particular,
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (smallest positive value)
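The decoding rules above translate almost line for line into code. A minimal C++ sketch (the function name decode is made up for illustration), checked here against two entries from the table:

#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode a 32-bit single precision pattern using the rules above
double decode(std::uint32_t bits) {
    int s = bits >> 31;
    int e = (bits >> 23) & 0xFF;
    std::uint32_t f = bits & 0x7FFFFF;
    double sign = s ? -1.0 : 1.0;
    double frac = f / 8388608.0; // F as a binary fraction: f / 2^23

    if (e == 255) return f ? NAN : sign * INFINITY;     // NaN or +/-Infinity
    if (e == 0)   return sign * std::ldexp(frac, -126); // unnormalized: 0.F
    return sign * std::ldexp(1.0 + frac, e - 127);      // normalized: 1.F
}

int main() {
    // 0 10000001 10100000000000000000000 from the table = 6.5
    std::printf("%g\n", decode(0x40D00000u));
    // 0 00000000 00000000000000000000001 = 2**(-149), about 1.4013e-45
    std::printf("%g\n", decode(0x00000001u));
    return 0;
}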
If we take the number -0.75 (base 10) and represent it in the IEEE 754 binary form using single precision, we get 1 01111110 10000000000000000000000 (sign = 1, exponent = 126, fraction = .100...0, since -0.75 = -1.1 (binary) x 2^{-1})
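Packed into a single 32-bit word, that pattern is 0xBF400000; a quick round-trip check in C++:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // 1 01111110 10000000000000000000000 packed into one 32-bit word
    std::uint32_t bits = 0xBF400000u;

    float x;
    std::memcpy(&x, &bits, sizeof x); // reinterpret the bit pattern as a float
    std::printf("%g\n", x);           // prints -0.75
    return 0;
}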
Overflow • By using different sizes of exponents and fractions, the MIPS computer arithmetic can have a very large range • We can have numbers as small as 2.0 x 10^-38 and numbers as large as 2.0 x 10^38
It is still possible, though, with such a large range of numbers, to have a number too large to represent. This causes overflow interrupts in floating point arithmetic (also found in integer arithmetic) *An overflow interrupt simply means that the exponent of the number, E, is too large to be represented in the exponent field
An overflow exception is signaled if the rounded result exceeds the largest finite number of the destination format. This trap is always enabled. If this trap occurs, an unpredictable value is stored in the destination register • For example in C++, the result of the calculation (9999 x 10^9) x (1000 x 10^9) = 9999000 x 10^18 = 9999 x 10^21 cannot be stored. For this particular example, we would need to set the result to 9999 x 10^9 (the maximum representable value in this case)
Most programming languages do not define what should happen in the case of overflow. Some systems print a run-time error message such as “FLOATING POINT OVERFLOW”. On other systems, you may get the largest number that can be represented
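On IEEE 754 hardware, the untrapped default is to deliver infinity on overflow, which is easy to observe from C++ (a small sketch):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    float big = FLT_MAX;           // largest finite single precision value
    float overflowed = big * 2.0f; // exceeds the representable range

    // With overflow traps disabled (the usual default), the result is +Infinity
    std::printf("%e * 2 = %e (isinf: %d)\n", big, overflowed, std::isinf(overflowed));
    return 0;
}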
Double Precision • The MIPS has a floating-point coprocessor that, besides operating on single precision floating point numbers, also operates on double precision floating point numbers (IEEE double precision) • The representation of a double precision floating point number takes two MIPS words, with the fields laid out as described below
Sign represents the sign of the number, the same as in single precision • Exponent is now the value of an 11-bit exponent field, including the sign of the exponent • Fraction represents the 52-bit number in the fraction field
For double precision, the floating point number is represented in the form (-1)^S x (1 + Fraction) x 2^E • The (1 + Fraction) represents the significand, which is actually 24 bits long in single precision (implied 1 and a 23-bit fraction) and 53 bits long in double precision (implied 1 and a 52-bit fraction) *The leading 1 is implicit, so the hardware does not have to store it
The value V represented by the word may be determined as follows:
• If E=2047 and F is nonzero, then V=NaN ("Not a number")
• If E=2047 and F is zero and S is 1, then V=-Infinity
• If E=2047 and F is zero and S is 0, then V=Infinity
• If 0<E<2047, then V=(-1)**S * 2**(E-1023) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point
• If E=0 and F is nonzero, then V=(-1)**S * 2**(-1022) * (0.F). These are "unnormalized" (denormalized) values
• If E=0 and F is zero and S is 1, then V=-0
• If E=0 and F is zero and S is 0, then V=0
Since the mantissa has a total of 53 bits (when you count the hidden bit) and is rounded, the magnitude of the relative error in a number is bounded by 2^{-53} = 1.11... x 10^{-16} • This means we get almost 16 decimal digits of precision (the largest possible mantissa is M = 2^{53} = 9.007... x 10^15, which has 15+ digits of precision) *mantissa: the actual number being represented
The largest positive number that can be stored is 1.11111....11111 (binary) x 2^{1023} = 1.797693... x 10^{308} *Notice that 1.11111....11111 (binary) = 2 - 2^{-52} • The smallest positive number is 1.00000...00000 (binary) x 2^{-1022} = 2.225074... x 10^{-308}
If we take the same number, -0.75 (base 10), that we represented in single precision, and represent it in the IEEE 754 binary form using double precision, we get 1 01111111110 1000...0 (sign = 1, an 11-bit exponent of 1022, and a 52-bit fraction of .100...0)
*Notice: When we go to double precision using the IEEE 754 floating-point standard, we gain more than a factor of 2 in the precision of the mantissa and we gain a huge factor in the size of numbers we can work with before encountering an overflow condition.
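Both of those gains are visible through std::numeric_limits; a short C++ sketch comparing the two formats:

#include <cstdio>
#include <limits>

int main() {
    using F = std::numeric_limits<float>;
    using D = std::numeric_limits<double>;

    // Significand size: 24 bits vs. 53 bits (more than a factor of 2)
    std::printf("significand bits: %d vs %d\n", F::digits, D::digits);

    // Guaranteed decimal digits: 6 vs. 15
    std::printf("decimal digits:   %d vs %d\n", F::digits10, D::digits10);

    // Largest finite value: ~3.4e38 vs. ~1.8e308
    std::printf("largest value:    %e vs %e\n", F::max(), D::max());
    return 0;
}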
Underflow • Just as programmers will want to know when they have calculated a number that is too large to be represented, they will want to know if the nonzero fraction they are calculating has become so small that it cannot be represented. This is what's known as underflow • An underflow exception occurs if the rounded result is smaller than the smallest finite number of the destination format. This trap can be disabled in the assembler. If this trap occurs, a true zero is always stored in the destination register
For example in C++, the result of the calculation (4210 x 10^-8) x (2000 x 10^-8) = 8420000 x 10^-16 = 8420 x 10^-13 could not be represented because an exponent of -13 is too small. The minimum exponent in this example format is -9
One way to reduce underflow is to use a representation with a larger exponent; this is what double precision provides • Overflow is a more serious problem than underflow, though, because there is no logical recourse when it occurs. The result simply can't be stored • The only way to fix the problem of overflow is to set the result to the maximum representable value in the specific case
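A minimal C++ sketch of both points: a single precision product underflows all the way to zero, while double precision, with its larger exponent range, keeps the value:

#include <cfloat>
#include <cstdio>

int main() {
    float tiny = FLT_MIN;              // smallest positive normalized float
    float underflowed = tiny * 1e-30f; // far below the single precision range

    // Gradual underflow passes through denormalized values, then reaches zero
    std::printf("float:  %e\n", underflowed); // prints 0.000000e+00

    double kept = (double)tiny * 1e-30; // double's larger exponent range copes
    std::printf("double: %e\n", kept);  // prints about 1.175494e-68
    return 0;
}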
Floating-Point Instructions in MIPS • MIPS supports the IEEE 754 single-precision and double-precision formats with these instructions:
floating-point addition, single (add.s) and double (add.d)
floating-point subtraction, single (sub.s) and double (sub.d)
floating-point multiplication, single (mul.s) and double (mul.d)
floating-point division, single (div.s) and double (div.d)
floating-point comparison, single (c.x.s) and double (c.x.d) // x can be equal (eq), not equal (neq), less than (lt), less than or equal (le), greater than (gt), or greater than or equal (ge)
floating-point branch, true (bc1t) and branch, false (bc1f)
The add instruction adds the contents of s.reg1 or d.reg to the contents of s.reg2 and puts the result in d.reg. When the sum of two operands is exactly zero, the sum has a positive sign for all rounding modes except round toward -infinity. For that rounding mode, the sum has a negative sign • The subtract instruction subtracts the contents of s.reg2 from the contents of s.reg1 or d.reg and puts the result in d.reg. When the difference of two operands is exactly zero, the difference has a positive sign for all rounding modes except round toward -infinity. For that rounding mode, the difference has a negative sign
The divide instruction computes the quotient of two values. These instructions divide the contents of s.reg1 or d.reg by the contents of s.reg2 and put the result in d.reg. If the divisor is zero, an error is signaled if the divide-by-zero exception is enabled • The multiply instruction multiplies the contents of s.reg1 or d.reg by the contents of s.reg2 and puts the result in d.reg
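When the divide-by-zero exception is disabled (the usual IEEE 754 default), the quotient is delivered as infinity rather than trapping. A small C++ sketch of that behavior on IEEE 754 hardware:

#include <cmath>
#include <cstdio>

int main() {
    float x = 1.0f, zero = 0.0f;

    // A nonzero value divided by zero quietly yields infinity
    float q = x / zero;
    std::printf("1.0 / 0.0 = %g (isinf: %d)\n", q, std::isinf(q));

    // 0/0 has no meaningful quotient, so it produces NaN instead
    float n = zero / zero;
    std::printf("0.0 / 0.0 = %g (isnan: %d)\n", n, std::isnan(n));
    return 0;
}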
There are floating-point registers designated in the MIPS system, such as $f0, $f1, $f2, etc.; these are used in both single precision and double precision
There are also load and store instructions for floating point registers: lwc1 (load) and swc1 (store). Values are moved into or out of these registers one word (32 bits) at a time by using these • The base registers for floating-point data transfers remain integer registers
MIPS floating-point operands: MIPS uses 32 floating-point registers in all, as explained previously
The MIPS code to load two single precision numbers from memory, add them, and then store the sum might look like this:
lwc1 $f4, x($sp)    # Load 32-bit F.P. number into F4
lwc1 $f6, y($sp)    # Load 32-bit F.P. number into F6
add.s $f2, $f4, $f6 # F2 = F4 + F6, single precision
swc1 $f2, z($sp)    # Store 32-bit F.P. number from F2
• A double precision register is really an even-odd pair of single precision registers, using the even register number as its name
Summary • Floating-point arithmetic has the added challenge of being an approximation of real numbers, and care needs to be taken to ensure that the computer number selected is the representation closest to the actual number • When we use integer arithmetic, our results are exact. Floating point arithmetic, however, is seldom exact. The MIPS machine helps us to have the most exact number in floating-point that we possibly can get without having to manipulate our result