CMPT 128: Introduction to Computing Science for Engineering Students

Floating Point Representation, Data Types and Conversion

Floating point numbers
• A floating point number is represented by a sign, a mantissa and an exponent
• For example, 1.23456 * 10^12 (Figure: the same number written as 0.123456 * 10^13, with sign bit, mantissa = 0.123456 and exponent = 13)
• The mantissa is represented by N binary digits, so it can take 2^N-1 values. These 2^N-1 values are the representable values of the floating point representation. Other values are not representable and are approximated by the nearest representable value
• For a long double there are more representable values than for a double because more bits are allocated to the mantissa (for a long double, N is approximately 2X larger)
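
The effect of the mantissa size can be checked on a real compiler. The following is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the mantissa sizes it reports depend on the platform.

```cpp
#include <limits>

// Number of binary digits in the mantissa of each type on this platform.
// (For IEEE 754 types the count includes an implicit leading bit.)
constexpr int float_mantissa_bits()       { return std::numeric_limits<float>::digits; }
constexpr int double_mantissa_bits()      { return std::numeric_limits<double>::digits; }
constexpr int long_double_mantissa_bits() { return std::numeric_limits<long double>::digits; }

// 0.1 has no exact binary representation, so float and double each store
// their own nearest representable value, and those two values differ.
inline bool float_and_double_store_different_tenths() {
    float  f = 0.1f;
    double d = 0.1;
    return static_cast<double>(f) != d;
}
```

Printing the three bit counts with cout shows how many more mantissa digits the wider types have on your own machine.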

IEEE: an example
• One of the most common floating point representations is a standard from the IEEE (IEEE 754)

IEEE mantissa
• Consider float. A float has a 23 bit mantissa.
• Mantissa is 0.91234567
• Multiply by 2^23 to give 7653310 (rounded to the nearest integer)
• Convert to binary to give the mantissa 111 0100 1100 0111 1011 1110
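
The arithmetic on this slide can be reproduced in a few lines. This is a sketch of the slide's simplified scheme only; the helper names scaled_mantissa and mantissa_bits are hypothetical.

```cpp
#include <bitset>
#include <cmath>
#include <string>

// Scale a fraction in [0, 1) by 2^23 and round to the nearest integer.
long scaled_mantissa(double fraction) {
    return std::lround(fraction * 8388608.0);  // 8388608 = 2^23
}

// The scaled value written out as 23 binary digits.
std::string mantissa_bits(double fraction) {
    return std::bitset<23>(static_cast<unsigned long>(scaled_mantissa(fraction))).to_string();
}
```

Calling mantissa_bits(0.91234567) gives the same 23 digits as the slide, without the spaces.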

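IEEE mantissa < 0
• Consider float. A float has a 23 bit mantissa.
• Mantissa is -0.91234567
• Multiply by 2^23 to give -7653310
• Convert 7653310 to binary: 111 0100 1100 0111 1011 1110
• Take the two's complement: 000 1011 0011 1000 0100 0010
• Note: this is a simplified illustration. The IEEE standard itself records the sign in a separate sign bit and leaves the mantissa bits unchanged.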
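
The two's complement step (flip every bit, then add one, within 23 bits) can be sketched as follows. The function names are hypothetical and the code follows the slide's simplified scheme, not the actual IEEE layout.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Two's complement of a value within a 23 bit field:
// invert every bit, add one, and keep only the low 23 bits.
std::uint32_t twos_complement_23(std::uint32_t value) {
    const std::uint32_t mask = (1u << 23) - 1u;  // 23 one-bits
    return (~value + 1u) & mask;
}

// Write a value out as 23 binary digits.
std::string as_23_bits(std::uint32_t value) {
    return std::bitset<23>(value).to_string();
}
```

For 7653310 the result is 735298, which is 2^23 - 7653310, as expected for a two's complement in 23 bits.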

IEEE exponent
• Consider float with an 8 bit exponent
• Must be able to support both negative and positive exponents
• Do not use a sign bit for the exponent
• Take the value of the exponent and add 127
• A stored value of 0 represents -127; a stored value of 255 represents +128
• These two extremes (-127 and +128, stored as 0 and 255) are reserved for special uses
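
The offset of 127 can be observed by reading the exponent field of a real float. This sketch assumes a 32 bit IEEE 754 float (checked by the static_assert); the function names are hypothetical. Note that the standard scales the mantissa as 1.xxx rather than 0.xxx, so 8.0 is stored as 1.0 * 2^3.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes a 32 bit IEEE 754 float");

// Return the 8 bit exponent field exactly as stored (actual exponent + 127).
int stored_exponent(float x) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &x, sizeof bits);            // copy the raw bit pattern
    return static_cast<int>((bits >> 23) & 0xFFu);  // bits 23..30 hold the exponent
}

// Remove the offset of 127 to recover the actual power of two.
int actual_exponent(float x) {
    return stored_exponent(x) - 127;
}
```

The reserved stored values show up for special numbers: 0.0f has a stored exponent of 0, and infinity has a stored exponent of 255.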

Floating point conversions
• Consider, as an example, a system which uses
  • 10 bits for floats: 1 sign bit and 5 bits for the mantissa (the remaining bits hold the exponent and its sign)
  • 16 bits for doubles: 1 sign bit and 9 bits for the mantissa (the remaining bits hold the exponent and its sign)
  • 4 bits for short int
  • 8 bits for int
  • 16 bits for long int
• This is not the actual representation used in C++. A simplified version is used to illustrate the ideas

Floating point conversions
• Converting a float to double
    float   +10011 +010         (19 * 2^2)
    double  +000010011 +00010   (19 * 2^2)
• Converting from a float to a double is exact.
• All float mantissas can be exactly represented as double mantissas.
• A parallel argument applies to the exponents
• Such conversions are acceptable default conversions and are done automatically in C++
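
With real C++ types the same property can be checked by widening a float and narrowing it back. A minimal sketch, with a hypothetical function name:

```cpp
// Widen a float to a double and narrow it back again.
// Returns true when the round trip reproduces the original float,
// which is expected for every ordinary float because widening is exact.
bool survives_round_trip(float f) {
    double widened = f;                            // implicit and exact
    float  back    = static_cast<float>(widened);  // explicit narrowing
    return back == f;
}
```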

Floating point conversions
• Converting from a double to a float can cause loss of accuracy
    double  +001110011 +00010   (115 * 2^2 = 460)
            +000111001 +00011   (57 * 2^3 = 456)
            +000011100 +00100   (28 * 2^4 = 448)
    float   +11100 +100         (28 * 2^4 = 448)
• The double cannot be accurately represented as a float, resulting in a loss of accuracy as the number is "shifted"
• All doubles from 448.0 (11100 +100) up to, but not including, 464.0 (11101 +100) are represented by the single float 448.0
• If you assign an expression with a double value to a float variable, you reduce the accuracy of the value. You should request this conversion explicitly yourself.
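
The shifting shown above can be simulated for the simplified system. The ToyValue type and both functions are hypothetical, introduced only for this illustration. The sketch drops low bits by truncation, as the slide does; real hardware normally rounds instead.

```cpp
// A value in the simplified system: mantissa * 2^exponent.
struct ToyValue {
    unsigned mantissa;  // magnitude of the mantissa
    int      exponent;  // power of two (non-negative in this sketch)
};

// Shrink the mantissa to at most `bits` binary digits by dropping the
// least significant digit and raising the exponent, one step at a time.
ToyValue narrow_mantissa(ToyValue v, int bits) {
    while (v.mantissa >= (1u << bits)) {
        v.mantissa >>= 1;  // drop the lowest bit (truncation)
        v.exponent += 1;   // compensate by doubling the scale
    }
    return v;
}

// The number represented by a ToyValue.
unsigned long toy_value(ToyValue v) {
    return static_cast<unsigned long>(v.mantissa) << v.exponent;
}
```

Starting from mantissa 115 and exponent 2 (460), narrowing to 5 bits passes through 57 * 2^3 and ends at 28 * 2^4 = 448.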

Floating point conversions
• Converting from a double to a float can lead to a number not being representable
    double  +000010011 +11110
    float   +10011 +????
• The double cannot be represented because its exponent is larger than a float exponent can hold
• If you assign an expression with a double value to a float variable, you may not be able to represent the number as a float (unexpected results)
• Conversions that lose accuracy or may not be representable are not safe default conversions in C++: the compiler may warn about them, and you should write them explicitly
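
One way to guard against this case is to test the range before converting. This is a sketch with a hypothetical function name; it checks only the range, so a value that passes may still lose accuracy when converted.

```cpp
#include <cmath>
#include <limits>

// True when the magnitude of d is within the finite range of float.
// NaN fails the comparison and is reported as not fitting.
bool fits_in_float_range(double d) {
    return std::fabs(d) <= std::numeric_limits<float>::max();
}
```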

Floating point conversions
• To convert a floating point number to an integer
    float  +10101 +111      (21 * 2^7)
           +101010 +110     (42 * 2^6)
           +1010100 +101    (84 * 2^5)
           +10101000 +100   (168 * 2^4)
           …
    Need 4 more bits (binary digits): the value is too large to represent as an int
• This conversion may not be possible and may give unexpected results. It is not a safe default conversion in C++, so you should write it explicitly.
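
A checked conversion can make the failure case visible instead of producing an unexpected result. This sketch uses a hypothetical function name and assumes a 32 bit int, so that the int limits are exactly representable as doubles.

```cpp
#include <limits>

// Convert f to int only when the value fits; otherwise leave `out`
// untouched and return false. NaN fails both comparisons below.
bool float_to_int_checked(float f, int& out) {
    const double d  = f;  // widening is exact
    const double lo = static_cast<double>(std::numeric_limits<int>::min());
    const double hi = static_cast<double>(std::numeric_limits<int>::max());
    if (d >= lo && d <= hi) {
        out = static_cast<int>(f);  // discards the fractional part
        return true;
    }
    return false;
}
```

Note that the cast truncates toward zero, so -3.9f becomes -3, not -4.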

Floating point conversions
• To convert an integer to a floating point number
    int    +1010011         (83)
           +0101001 +001    (41 * 2^1 = 82)
           +0010100 +010    (20 * 2^2 = 80)
    float  +10100 +010      (20 * 2^2 = 80)
• The resulting floating point value may not be exactly equal to the integer value.
• Least significant digits have been dropped from the mantissa
• Values between 80 and 83 inclusive are all represented as a float with value 80.0
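
The same effect occurs with real types once an int needs more digits than the float mantissa provides. A sketch with a hypothetical function name, assuming a 32 bit int and an IEEE 754 float (24 significant binary digits):

```cpp
// Convert an int to float and report whether the value survived intact.
// The comparison is done in double, which holds both values exactly
// under the assumptions stated above.
bool int_survives_as_float(int n) {
    float f = static_cast<float>(n);
    return static_cast<double>(f) == static_cast<double>(n);
}
```

Under those assumptions 16777216 (2^24) survives, but 16777217 does not, because it needs 25 binary digits.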

Numerical Values of Operands
• When you evaluate an expression in C++ you may combine two values (operands) according to a binary operation.
• For example, the 2 operands A and B combine with the operation + to give A+B
• Both operands of a binary operation should have the same type (int, float, double, ...)

Numerical Values and Expressions
• In C++, if you use operands of two different types with the same binary operator, one of the operands will be converted to the type of the other operand.
• When evaluating expressions, such conversions are performed automatically according to a defined set of rules, the usual arithmetic conversions.
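
The type that a mixed expression ends up with can be checked at compile time. This is an illustrative sketch; the two function names are hypothetical.

```cpp
#include <type_traits>

// The int operand is converted to the floating point type of the other operand.
static_assert(std::is_same<decltype(1 + 2.0), double>::value,
              "int + double is evaluated as double");
static_assert(std::is_same<decltype(1 + 2.0f), float>::value,
              "int + float is evaluated as float");

// Integer division truncates; converting one operand changes the result.
int    integer_quotient() { return 7 / 2; }    // both operands are int
double mixed_quotient()   { return 7 / 2.0; }  // 7 is converted to 7.0 first
```

The first function returns 3 and the second returns 3.5, even though both divide 7 by 2.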

Conversions for assignment
• When executing an assignment statement (A=B) the value of expression B is placed in the location reserved for variable A in memory.
• In this case the value of expression B must be converted to the type of variable A before it is placed in variable A
• This can result in a loss of accuracy (for example converting a double to an int)
• When conversions are not safe default conversions, warnings and/or unexpected results may occur (for example converting an unsigned int to an int of the same size)

Example unexpected results
    int numRows = 0;
    unsigned int maxint = 4294967292;
    cout << numRows << endl;
    numRows = maxint;
    cout << numRows << endl;
• Will print 0 and then -4 (with a 32 bit int)

Why these results
• In binary, unsigned int 4294967292 is 11111111 11111111 11111111 11111100
• When interpreted as a signed integer
  • The leftmost bit is for the sign; 1 means -
  • Flip all the bits: 00000000 00000000 00000000 00000011
  • Add 1: 00000000 00000000 00000000 00000100, which is 4
• So the number printed is -4
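
The interpretation steps above can be written out directly. This sketch uses a hypothetical function name and returns a 64 bit result so that every 32 bit pattern has a representable answer.

```cpp
#include <cstdint>

// Interpret a 32 bit pattern as a two's complement signed number.
std::int64_t as_twos_complement(std::uint32_t pattern) {
    if ((pattern & 0x80000000u) == 0) {
        return pattern;                       // sign bit clear: non-negative
    }
    std::uint32_t magnitude = ~pattern + 1u;  // flip all bits, then add 1
    return -static_cast<std::int64_t>(magnitude);
}
```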

Conversions
• To avoid unexpected results when evaluating an arithmetic expression that does not include an assignment, you should understand how the usual arithmetic conversions are done.
• In some cases, to be sure you get the results you want, you will need to do the conversions explicitly yourself

Usual arithmetic conversions
• If either operand is a long double, the other is converted to a long double
• Otherwise, if either operand is a double, the other is converted to a double
• Otherwise, if either operand is a float, the other is converted to a float
• Characters are converted to int (or unsigned int)
• Short integers are converted to int
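
Each of these rules can be confirmed at compile time by asking for the type of a small expression. An illustrative sketch; sum_of_chars is a hypothetical name, and the promotion of char to int assumes that int is wider than char, as on common platforms.

```cpp
#include <type_traits>

// A floating point operand pulls the other operand up to its type.
static_assert(std::is_same<decltype(2.0L + 1.0), long double>::value,
              "long double + double is evaluated as long double");
static_assert(std::is_same<decltype(2.0 + 1.0f), double>::value,
              "double + float is evaluated as double");
static_assert(std::is_same<decltype(2.0f + 1), float>::value,
              "float + int is evaluated as float");

// Small integer types are converted to int before arithmetic.
static_assert(std::is_same<decltype('a' + 'b'), int>::value,
              "char + char is evaluated as int");
static_assert(std::is_same<decltype(short(1) + short(2)), int>::value,
              "short + short is evaluated as int");

// Because the addition is done in int, the sum does not wrap at 8 bits.
int sum_of_chars(char a, char b) { return a + b; }
```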

Usual arithmetic conversions 2
• Otherwise, if both operands are signed integer types, or both are unsigned integer types, then the shorter type is converted to the longer (int to long int, unsigned int to unsigned long int, ...)
• Otherwise, if the unsigned type is at least as long as the signed type, the signed operand is converted to the unsigned type
• Otherwise, if all values of the unsigned type can be represented in the signed type, the unsigned operand is converted to the signed type
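
The integer rules can be checked the same way. In this sketch wrapped_sum is a hypothetical name; it shows why mixing int and unsigned int of the same size can surprise.

```cpp
#include <type_traits>

// Same signedness, different length: the shorter operand becomes the longer.
static_assert(std::is_same<decltype(1 + 1L), long>::value,
              "int + long is evaluated as long");

// Same length, different signedness: the int operand becomes unsigned int.
static_assert(std::is_same<decltype(-1 + 1u), unsigned int>::value,
              "int + unsigned int is evaluated as unsigned int");

// -2 is converted to a very large unsigned value before the addition,
// so the result is the largest unsigned int rather than -1.
unsigned int wrapped_sum() {
    int          s = -2;
    unsigned int u = 1u;
    return s + u;
}
```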

Explicit conversion: cast operation
• In C++ you can explicitly convert the type of a variable or expression within a larger expression using a cast operator
• The value of the variable being cast is not changed
• The value used in the larger expression is converted to the requested type
• Sample expressions including casts
  • Integerone + static_cast<int>(Floatone+Floattwo)
  • static_cast<float>(Integerone) + Float1 + Float2
  • static_cast<double>(unsigned1) + double1 * double2
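
A short sketch of static_cast inside larger expressions. The function names are hypothetical; the second function has the same shape as the first sample expression on this slide.

```cpp
// Average of two ints. Casting the sum to double makes the usual
// arithmetic conversions carry out the division in double.
double average(int a, int b) {
    return static_cast<double>(a + b) / 2;
}

// The cast applies to the parenthesised sum, so the fractional part of
// the sum (not of each operand) is discarded.
int add_truncated(int whole, float x, float y) {
    return whole + static_cast<int>(x + y);
}
```

Without the cast, (a + b) / 2 would be an integer division and average(3, 4) would give 3 instead of 3.5.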

Explicit conversion: cast operation
• In C (and C++) you can explicitly convert the type of a variable or expression within a larger expression using a cast operator
• The value of the variable being cast is not changed
• The value used in the larger expression is converted to the requested type
• Sample expressions including casts
  • Integerone + (int)(Floatone+Floattwo)
  • (float)Integerone + Float1 + Float2
  • (double)unsigned1 + double1 * double2
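
With the C style cast the placement of parentheses matters, because the cast applies only to the operand immediately after it. A sketch with hypothetical function names:

```cpp
// The cast binds to x alone: x is truncated first, then y is added.
double cast_first_operand(double x, double y) {
    return (int)x + y;
}

// The cast applies to the whole parenthesised sum.
int cast_whole_sum(double x, double y) {
    return (int)(x + y);
}
```

For x = 1.5 and y = 1.75 the first function gives 2.75 and the second gives 3.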
