CMPT 128: Introduction to Computing Science for Engineering Students

aradia

Presentation Transcript


  1. CMPT 128: Introduction to Computing Science for Engineering Students • Floating Point Representation • Data Types and Conversion

  2. Floating point numbers • A floating point number is represented by a mantissa and an exponent. • For example, 1.23456 * 10^12 = 0.123456 * 10^13 is stored as a sign bit, the mantissa 0.123456, and the exponent 13. • The mantissa is represented by N binary digits, so it can take 2^N distinct values. Combined with the available exponents, these give the representable values of the floating point representation. Other values are not representable and are approximated by the nearest representable value. • A long double usually has more representable values than a double because more bits are allocated to the mantissa (on many systems N is up to about 2x larger for a long double; the exact sizes are implementation-defined).

  3. IEEE: an example • One of the most common floating point representations is the IEEE 754 standard.

  4. IEEE mantissa • Consider float. A float has a 23-bit mantissa. • Suppose the mantissa is 0.91234567. • Multiply by 2^23 and drop the fractional part to give 7653310. • Convert to binary to give the mantissa bits 111 0100 1100 0111 1011 1110.

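  5. IEEE mantissa < 0 • Consider float. A float has a 23-bit mantissa. • Suppose the mantissa is -0.91234567. • Multiply by 2^23 and drop the fractional part to give -7653310. • Convert 7653310 to binary: 111 0100 1100 0111 1011 1110. • Take the two's complement: 000 1011 0011 1000 0100 0010. • Note: this is a simplification. The actual IEEE format records the sign in a separate sign bit and stores the same mantissa bits for a negative value as for the positive one.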

  6. IEEE exponent • Consider float with an 8-bit exponent. • It must be able to support both negative and positive exponents. • Do not use a sign for the exponent. • Take the value of the exponent and add 127. • A stored value of 0 represents -127; a stored value of 255 represents +128. • The stored values 0 and 255 (exponents -127 and +128) are reserved for special uses.

  7. Floating point conversions • Consider, as an example, a system which uses • 10 bits for floats, including 1 sign bit and 5 bits for the mantissa (the remaining bits hold the exponent and its sign) • 16 bits for doubles, including 1 sign bit and 9 bits for the mantissa • 4 bits for short int • 8 bits for int • 16 bits for long int • This is not the actual representation used in C++. A simplified version is used to illustrate the ideas.

  8. Floating point conversions • Converting a float to a double: float + 10011 + 010 (19 * 2^2) becomes double + 000010011 + 00010 (19 * 2^2). • Converting from a float to a double is exact. • All float mantissas can be exactly represented as double mantissas. • A parallel argument applies to the exponents. • Such conversions are acceptable default conversions and are done automatically in C++.

  9. Floating point conversions • Converting from a double to a float can cause loss of accuracy: double + 001110011 + 00010 (115 * 2^2 = 460) is shifted to + 000111001 + 00011 (57 * 2^3 = 456), then to + 000011100 + 00100 (28 * 2^4 = 448), giving float + 11100 + 100 (448). • The double cannot be accurately represented as a float; accuracy is lost each time the mantissa is shifted right to fit into 5 bits. • All doubles from 448.0 (11100 +100) up to, but not including, 464.0 (11101 +100) are represented by the single float 448.0. • If you assign an expression with a double value to a float variable you reduce the accuracy of the value. C++ accepts such an assignment, usually with a compiler warning, so state your intent with an explicit conversion.

  10. Floating point conversions • Converting from a double to a float can lead to a number not being representable: double + 000010011 + 11110 (19 * 2^30) becomes float + 10011 + ???? • The double cannot be represented because the exponent is larger than the float exponent can hold. • If you assign an expression with a double value to a float variable, the number may not be representable as a float (unexpected results). • C++ does not check conversions that lose accuracy or may not be representable; at most the compiler issues a warning, so check the range and make such conversions explicit.

  11. Floating point conversions • To convert a floating point number to an integer: float + 10101 + 111 (21 * 2^7) is shifted to + 101010 + 110 (42 * 2^6), then + 1010100 + 101 (84 * 2^5), then + 10101000 + 100 (168 * 2^4) … 4 more bits (binary digits) are still needed, so the value is too large to represent as an int. • This conversion may not be possible. C++ does not check it for you and it may give unexpected results, so make the conversion explicit and check the range first.

  12. Floating point conversions • To convert an integer to a floating point number: int + 1010011 (83) is shifted to + 0101001 + 001 (41 * 2^1 = 82), then + 0010100 + 010 (20 * 2^2 = 80), giving float + 10100 + 010 (20 * 2^2 = 80). • This nearest floating point value may not be exactly equal to the integer value. • Least significant digits have been dropped from the mantissa. • Values between 80.0 and 83.0 inclusive are all represented as a float with value 80.0.

  13. Numerical Values of Operands • When you evaluate an expression in C++ you may combine two values (operands) according to a binary operation. • For example, the 2 operands A and B combine with the operation + to give A+B. • Both operands of a binary operation should have the same type (int, float, double, ...).

  14. Numerical Values and Expressions • In C++ if you use operands of two different types with the same binary operator, one of the operands will be converted to the type of the other operand. • When evaluating expressions, such conversions are performed automatically according to a defined set of rules, the usual arithmetic conversions.

  15. Conversions for assignment • When executing an assignment statement (A = B) the value of expression B is placed in the location reserved for variable A in memory. • The value of B must therefore be converted to the type of variable A before it is stored. • This can result in a loss of accuracy (for example, converting a double to an int). • Some conversions produce compiler warnings and/or unexpected results (for example, converting an unsigned int to an int of the same size).

  16. Example: unexpected results • int numRows = 0; unsigned int maxint = 4294967292; cout << numRows << endl; numRows = maxint; cout << numRows << endl; • Will print 0 and then -4 (on a system with a 32-bit int).

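  17. Why these results • In binary the unsigned int 4294967292 is 11111111 11111111 11111111 11111100. • When the same bits are interpreted as a signed integer (two's complement), the leftmost bit is the sign: 1 means negative. • To find the magnitude, flip all the bits to get 00000000 00000000 00000000 00000011, then add 1 to get 00000000 00000000 00000000 00000100 (4). • So the number printed is -4.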

  18. Conversions • To avoid unexpected results when evaluating an arithmetic expression that does not include an assignment, you should understand how the usual arithmetic conversions are done. • In some cases, to be sure you get the results you want, you will need to do the conversions explicitly yourself.

  19. Usual arithmetic conversions • If either operand is a long double the other is converted to a long double • Otherwise, if either operand is a double the other is converted to a double • Otherwise, if either operand is a float the other is converted to a float • Otherwise, characters are converted to int (or unsigned int) • and short integers are converted to int

  20. Usual arithmetic conversions 2 • Otherwise, if both operands are signed integer types or both are unsigned integer types, the shorter type is converted to the longer (int to long int, unsigned int to unsigned long int, ...) • Otherwise, if the unsigned type is at least as long as the signed type, the signed operand is converted to the unsigned type • Otherwise, if all values of the unsigned type can be represented in the signed type, the unsigned operand is converted to the signed type

  21. Explicit conversion: cast operation • In C++ you can explicitly convert the type of a variable or expression within a larger expression using a cast operator • The value of the variable being cast is not changed • The value used in the larger expression is converted to the requested type • Sample expressions including casts • Integerone + static_cast<int>(Floatone + Floattwo) • static_cast<float>(Integerone) + Float1 + Float2 • static_cast<double>(unsigned1) + double1 * double2

  22. Explicit conversion: cast operation • In C (and C++) you can explicitly convert the type of a variable or expression within a larger expression using a C-style cast operator • The value of the variable being cast is not changed • The value used in the larger expression is converted to the requested type • Sample expressions including casts • Integerone + (int)(Floatone + Floattwo) • (float)Integerone + Float1 + Float2 • (double)unsigned1 + double1 * double2
