Floating Point Arithmetic – Part I


Motivation
• Floating point representation and manipulation are a key aspect of computer design.
• FLOPS (Floating Point Operations Per Second) gives a rough performance estimate for computers that must perform precise mathematical operations.
• Floating point operations are inherently more complex than integer operations.
• In addition or subtraction, the exponents must be made equal before the operation.
• In multiplication the exponents are added, in division they are subtracted, and the result is then normalized.

• All floating point arithmetic can be performed by treating the individual parts of the representation as integers.
• The IEEE floating point standard (FPS, IEEE 754) is widely accepted and is the representation used in this lecture.
• A hardware implementation provides circuits that perform the floating point operations directly.
• A software implementation requires less hardware and uses program code to perform the operations.

Addition and Subtraction
A floating point number can be expressed as N, where N = (-1)^s × m × 2^e. Here s is the sign bit, m is the mantissa, and e is the exponent.
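The fields of this expression can be inspected directly. The sketch below assumes the 32-bit single-precision format with a biased-127 exponent that the examples in this lecture use; the helper names `decode_single` and `value_of` are illustrative only, and `value_of` handles normalized numbers only.

```python
import struct

def decode_single(x):
    """Split x into sign s, biased exponent E and 23-bit fraction F
    of its 32-bit single-precision encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exp, fraction

def value_of(sign, biased_exp, fraction):
    """Rebuild N = (-1)^s * m * 2^e for a normalized number (hidden bit = 1)."""
    m = 1 + fraction / 2**23          # mantissa with the hidden bit restored
    e = biased_exp - 127              # remove the bias
    return (-1) ** sign * m * 2.0 ** e

s, E, F = decode_single(134.0625)
print(s, format(E, "08b"), format(F, "023b"))
# prints: 0 10000110 00001100001000000000000
```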

To add two floating point numbers A and B, we must first align their radix points. Let A be the number with the smaller exponent.
• Aligning the radix points means shifting the fraction of the number with the smaller exponent.
• We increment A’s exponent until it equals B’s. At the same time, A’s mantissa, including the hidden bit, is shifted right by the same number of positions that the exponent was incremented.
• We then add the mantissas of A and B.

Example 12.1: 2.25 + 134.0625

A = 2.25          0 1000 0000 (1)001 0000 0000 0000 0000 0000
B = 134.0625      0 1000 0110 (1)000 0110 0001 0000 0000 0000

A aligned to B    0 1000 0110 (0)000 0010 0100 0000 0000 0000
B                 0 1000 0110 (1)000 0110 0001 0000 0000 0000
A + B             0 1000 0110 (1)000 1000 0101 0000 0000 0000

Note that the sum is already normalized.
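The alignment step of Example 12.1 can be sketched with integer operations on the fields. This is a minimal illustration, assuming positive, normalized single-precision operands and simple truncation of the bits shifted out (no guard bits or rounding); the helper names `fields` and `align` are hypothetical.

```python
import struct

def fields(x):
    """Return (sign, biased exponent, 23-bit fraction) of a single-precision value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def align(a, b):
    """Return (E, mant_a, mant_b): both 24-bit mantissas scaled to the larger exponent."""
    _, ea, fa = fields(a)
    _, eb, fb = fields(b)
    ma = fa | (1 << 23)               # restore the hidden bit
    mb = fb | (1 << 23)
    if ea < eb:
        ma >>= eb - ea                # shift the smaller operand right...
        ea = eb                       # ...and raise its exponent to match
    else:
        mb >>= ea - eb
    return ea, ma, mb

E, ma, mb = align(2.25, 134.0625)
total = ma + mb                       # integer add of the aligned mantissas
print(format(E, "08b"), format(ma, "024b"), format(total, "024b"))
# prints: 10000110 000000100100000000000000 100010000101000000000000
```

The leading bit of each 24-bit string is the hidden bit, so the printed values correspond to (0)000 0010 0100... for the aligned A and (1)000 1000 0101... for the sum.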

In general, when adding two positive mantissas, the resulting mantissa lies in the range 1 ≤ m < 4.
• If m < 2, it is already normalized. If m ≥ 2, it must be normalized.
• Only a single shift is required, since m cannot be as large as four.
• To normalize, add one to the exponent of the result and shift the mantissa right by 1 bit position.
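The normalization rule above can be sketched as follows. The mantissa is held as an integer scaled by 2^23, so m ≥ 2 corresponds to a value of at least 2^24. The function name `normalize_sum` is illustrative, and the sketch assumes the exponent does not overflow its 8-bit field.

```python
def normalize_sum(biased_exp, mant_sum):
    """Normalize the sum of two aligned positive 24-bit mantissas.

    mant_sum is an integer scaled by 2**23, so 1 <= m < 2 means the
    value fits in 24 bits with bit 23 (the hidden bit) set.
    """
    if mant_sum >= 1 << 24:           # m >= 2: carry out of the hidden bit
        mant_sum >>= 1                # shift right one position (truncating)
        biased_exp += 1               # and compensate in the exponent
    return biased_exp, mant_sum

# 255.0625 and 134.0625 both have E = 134; their 24-bit mantissas are
# 0xFF1000 and 0x861000.
E, m = normalize_sum(134, 0xFF1000 + 0x861000)
print(format(E, "08b"), format(m, "024b"))
# prints: 10000111 110000101001000000000000
```

Only one conditional shift is needed because the sum of two mantissas below 2 is always below 4.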

Example 12.2: 255.0625 + 134.0625

A = 255.0625      0 1000 0110  (1)111 1111 0001 0000 0000 0000
B = 134.0625      0 1000 0110  (1)000 0110 0001 0000 0000 0000
A + B             0 1000 0110 (11)000 0101 0010 0000 0000 0000   mantissa “overflow” (m ≥ 2)

To normalize, add 1 to the exponent and shift the mantissa 1 bit to the right. The answer is:

                  0 1000 0111  (1)100 0010 1001 0000 0000 0000   (= 389.125)

The exponents can be positive or negative.
• If both exponents are negative, the “smaller exponent” is the more negative one.
• In a biased-127 representation, the more negative exponent always has the smaller value of E. Note that E is unsigned.
• Negative mantissas can also be handled by the same algorithm.
• To add a negative mantissa, first convert the mantissa to 2’s complement, perform the addition, and then convert the result back to sign magnitude.

Example 12.3: 2.25 + (-134.0625)

A (aligned)       0 1000 0110 (0)000 0010 0100 0000 0000 0000
B                 1 1000 0110 (1)000 0110 0001 0000 0000 0000

Mantissas, sign extended to 32 bits:
A                                  0000 0000 0000 0010 0100 0000 0000 0000
B in 2’s complement                1111 1111 0111 1001 1111 0000 0000 0000
Sum                                1111 1111 0111 1100 0011 0000 0000 0000
Sum back in sign magnitude         1111 1111 1000 0011 1101 0000 0000 0000   (magnitude in the low 24 bits)

Answer            1 1000 0110 (1)000 0011 1101 0000 0000 0000
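The 2’s complement steps of Example 12.3 can be reproduced with 32-bit integer arithmetic. This is a sketch under the assumption that the mantissas are already aligned and held in 32-bit words; the names `to_twos` and `from_twos` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_twos(sign, mant):
    """Sign-magnitude 24-bit mantissa -> 32-bit 2's complement word."""
    return (-mant if sign else mant) & MASK32

def from_twos(word):
    """32-bit 2's complement word -> (sign, magnitude)."""
    if word & 0x80000000:             # top bit set: the value is negative
        return 1, (-word) & MASK32
    return 0, word

a = to_twos(0, 0x024000)              # 2.25, already aligned to E = 134
b = to_twos(1, 0x861000)              # -134.0625
total = (a + b) & MASK32              # plain integer add, wrapped to 32 bits
sign, mag = from_twos(total)
print(format(b, "032b"))
print(format(total, "032b"))
print(sign, format(mag, "024b"))
# last line prints: 1 100000111101000000000000
```

The final sign and 24-bit magnitude match the answer 1 1000 0110 (1)000 0011 1101 0000 0000 0000, which is -131.8125.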

Subtraction can be achieved by simply adding the additive inverse of a number.
• The exponents are aligned and the mantissas are converted to 2’s complement.
• The mantissas are then added.
• The result is normalized if necessary.

Example 12.4: 135.901 - 135.861

A = 135.901       0 1000 0110 (1)000 0111 1110 0110 1010 1000
-B = -135.861     1 1000 0110 (1)000 0111 1101 1100 0110 1010

Mantissas:
A                                  0000 0000 1000 0111 1110 0110 1010 1000
-B in 2’s complement               1111 1111 0111 1000 0010 0011 1001 0110

Unnormalized result:   0 1000 0110 (0)000 0000 0000 1010 0011 1110
Normalized result:     0 0111 1010 (1)010 0011 1110 0000 0000 0000

The mantissa is adjusted (shifted left) 12 positions, and 12 is subtracted from the exponent.

If the two numbers being subtracted are identical, the subtraction yields a mantissa of zero.
• No amount of shifting can move a one into the hidden bit position, so this condition must be detected explicitly and the result set to E = F = 0.
• If the exponents of the two numbers differ by more than the precision of the mantissa (24 bits), aligning the smaller number shifts its mantissa all the way to zero, so it contributes nothing to the result.
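Normalization after a subtraction, including the explicit zero check, might look like the sketch below. It assumes a non-negative magnitude that fits in 24 bits and an exponent that stays in the normalized range after the adjustment (no underflow handling); `renormalize` is an illustrative name.

```python
def renormalize(biased_exp, mant):
    """Shift a 24-bit magnitude left until the hidden bit (bit 23) is 1.

    Returns (E, F) for the packed result. A zero mantissa can never be
    normalized by shifting, so it is detected and encoded as E = F = 0.
    """
    if mant == 0:
        return 0, 0
    shift = 24 - mant.bit_length()    # number of leading zeros in 24 bits
    mant <<= shift
    biased_exp -= shift               # each left shift costs one in the exponent
    return biased_exp, mant & 0x7FFFFF  # drop the hidden bit to get F

# Mantissas of 135.901 and 135.861 from Example 12.4, both with E = 134.
E, F = renormalize(134, 0x87E6A8 - 0x87DC6A)
print(format(E, "08b"), format(F, "023b"))
# prints: 01111010 01000111110000000000000

print(renormalize(134, 0))            # identical operands: prints (0, 0)
print(0x861000 >> 25)                 # exponent gap above 24: prints 0
```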
