Foundations of Computer Arithmetic

CSE 551 Computational Methods 2019/2020 Fall Chapter 2 Error Analysis and Computer Arithmetic

Outline Base Changes Introduction to Error Analysis Floating-Point Representations

References • W. Cheney, D. Kincaid, Numerical Mathematics and Computing, 6th ed. • Chapter 1 • Chapter 2 • Appendix B

Introduction • We begin with general number representation but move quickly to bases 2, 8, and 16, the bases primarily used in computer arithmetic. • The familiar decimal notation for numbers uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. • When we write a whole number such as 37294, the individual digits represent coefficients of powers of 10: • $37294 = 4 + 90 + 200 + 7000 + 30000 = 4 \times 10^0 + 9 \times 10^1 + 2 \times 10^2 + 7 \times 10^3 + 3 \times 10^4$

In general, a string of digits represents a number according to the formula $a_n a_{n-1} \ldots a_2 a_1 a_0 = a_0 \times 10^0 + a_1 \times 10^1 + \cdots + a_{n-1} \times 10^{n-1} + a_n \times 10^n$ • This takes care of only the positive whole numbers. • A number between 0 and 1 is represented by a string of digits to the right of a decimal point. For example, $0.7215 = 7 \times 10^{-1} + 2 \times 10^{-2} + 1 \times 10^{-3} + 5 \times 10^{-4}$

In general, we have the formula $0.b_1 b_2 b_3 \ldots = b_1 \times 10^{-1} + b_2 \times 10^{-2} + b_3 \times 10^{-3} + \cdots$ • There can be an infinite string of digits to the right of the decimal point; indeed, there must be an infinite string to represent some numbers. For example, we note that

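For a real number of the form $(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_{10} = \sum_{k=0}^{n} a_k 10^k + \sum_{k=1}^{\infty} b_k 10^{-k}$ • the integer part is the first summation • the fractional part is the second summation • A number represented in base β is signified by enclosing it in parentheses and adding a subscript β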

β Base Numbers • Other bases are used, especially in computers, e.g., • the binary system uses 2 as the base • the octal system uses 8 • the hexadecimal system uses 16 • In the octal representation of a number • the digits are 0, 1, 2, 3, 4, 5, 6, 7 • e.g., $(21467)_8 = 7 + 6 \times 8 + 4 \times 8^2 + 1 \times 8^3 + 2 \times 8^4 = 7 + 8(6 + 8(4 + 8(1 + 8(2)))) = 9015$ • A number between 0 and 1, expressed in octal, is represented with combinations of $8^{-1}$, $8^{-2}$, and so on: $(0.36207)_8 = 3 \times 8^{-1} + 6 \times 8^{-2} + 2 \times 8^{-3} + 0 \times 8^{-4} + 7 \times 8^{-5} = 8^{-5}(3 \times 8^4 + 6 \times 8^3 + 2 \times 8^2 + 7) = 8^{-5}(7 + 8^2(2 + 8(6 + 8(3)))) = 15495/32768 = 0.47286\,987\ldots$
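The nested form used in the octal examples above can be sketched in a few lines of Python. This is only an illustration: the function names `digits_to_int` and `frac_digits_to_fraction` are hypothetical, and exact rational arithmetic from the standard `fractions` module is assumed so that no rounding enters.

```python
from fractions import Fraction

def digits_to_int(digits, base):
    """Evaluate integer digits (most significant first) by nested multiplication."""
    value = 0
    for d in digits:
        value = value * base + d
    return value

def frac_digits_to_fraction(digits, base):
    """Evaluate fractional digits (first digit after the radix point first), exactly."""
    value = Fraction(0)
    for d in reversed(digits):
        value = (value + d) / base
    return value

print(digits_to_int([2, 1, 4, 6, 7], 8))            # 9015
print(frac_digits_to_fraction([3, 6, 2, 0, 7], 8))  # 15495/32768
```

Both loops perform one multiplication or division per digit, which is the reason the nested form is preferred over summing powers term by term.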

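If we use another base, say, β, then numbers represented in the β-system look like this: $(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_\beta = \sum_{k=0}^{n} a_k \beta^k + \sum_{k=1}^{\infty} b_k \beta^{-k}$ • The digits are 0, 1, . . . , β − 2, β − 1 • If β > 10, it is necessary to introduce symbols for 10, 11, . . . , β − 1 • The separator between the integer and fractional part is called the radix point • The term decimal point is reserved for base-10 numbers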

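Conversion of Integer Parts • We now formalize the process of converting a number from one base to another • It is convenient to consider separately the integer and fractional parts of a number • Consider a positive integer N in base γ: $N = (a_m a_{m-1} \ldots a_1 a_0)_\gamma = \sum_{k=0}^{m} a_k \gamma^k$ • To convert this to the number system with base β, write N in its nested form: $N = a_0 + \gamma(a_1 + \gamma(a_2 + \cdots + \gamma(a_{m-1} + \gamma(a_m)) \cdots))$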

Then replace each of the numbers on the right by its representation in base β • Next, carry out the calculations in β-arithmetic • Replacing the $a_k$'s and γ by equivalent base-β numbers requires a table showing how each of the numbers 0, 1, . . . , γ − 1 appears in the β-system • A base-β multiplication table may also be required.

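Example: convert the decimal number 3781 to binary form • using the decimal-binary equivalences and longhand multiplication in base 2: $(3781)_{10} = 1 + 10(8 + 10(7 + 10(3))) = (1)_2 + (1010)_2((1000)_2 + (1010)_2((111)_2 + (1010)_2(11)_2)) = (111\,011\,000\,101)_2$ • For hand calculations, another procedure is preferable: • Write down an equation containing the digits $c_0, c_1, \ldots, c_m$ that we seek: $N = (c_m c_{m-1} \ldots c_1 c_0)_\beta = c_0 + \beta(c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots))$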

If N is divided by β, then the remainder in this division is $c_0$, and the quotient is $c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots)$ • If this number is divided by β, the remainder is $c_1$, and so on • Thus, divide repeatedly by β, saving the remainders $c_0, c_1, \ldots, c_m$ and the quotients.

Example • Convert the decimal number 3781 to binary form using the division algorithm. • Solution: divide repeatedly by 2, saving the remainders

Here, the symbol ↓ is used to remind us that the digits $c_i$ are obtained beginning with the digit next to the binary point. Thus, we have • $(3781.)_{10} = (111\,011\,000\,101.)_2$ • and not the other way around: $(101\,000\,110\,111.)_2 = (2615)_{10}$
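The repeated-division procedure can be sketched as follows. The helper name `to_base` is hypothetical; it relies on Python's built-in `divmod` and simply reverses the saved remainders at the end, which is the step the example warns about.

```python
def to_base(n, base):
    """Digits of the non-negative integer n in the given base, most significant first."""
    if n == 0:
        return [0]
    remainders = []
    while n > 0:
        n, r = divmod(n, base)   # quotient replaces n, remainder is the next digit
        remainders.append(r)     # remainders arrive in the order c0, c1, ..., cm
    return remainders[::-1]      # reverse so that cm comes first

bits = "".join(str(d) for d in to_base(3781, 2))
print(bits)  # 111011000101
```

Forgetting the final reversal yields 101000110111, which is the binary form of 2615 rather than 3781.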

Example • Convert the number $N = (111\,011\,000\,101)_2$ to decimal form by nested multiplication. • Solution:

Another conversion problem exists in going from an integer in base γ to an integer in base β when using calculations in base γ • The unknown coefficients are determined by a process of successive division • This arithmetic is carried out in the γ-system • At the end, the numbers $c_k$ are in base γ • so a table of γ-β equivalents is needed

e.g., convert a binary integer into decimal form by repeated division by $(1\,010)_2$, which equals $(10)_{10}$ • carrying out the operations in binary • A table of binary-decimal equivalents is used at the final step • Binary division is easy only for computers • so we develop alternative procedures

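Conversion of Fractional Parts • Convert a fractional number such as $(0.372)_{10}$ to binary • A direct yet naive approach: $(0.372)_{10} = 3 \times 10^{-1} + 7 \times 10^{-2} + 2 \times 10^{-3} = \frac{1}{10}\left(3 + \frac{1}{10}\left(7 + \frac{1}{10}(2)\right)\right)$, with 10, 3, 7, and 2 replaced by their binary equivalents and the divisions carried out in binary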

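Dividing in binary arithmetic is not straightforward, so we look for easier ways of doing this conversion • Suppose that x is in the range 0 < x < 1 and that the digits $c_k$ in the representation $x = \sum_{k=1}^{\infty} c_k \beta^{-k} = (0.c_1 c_2 c_3 \ldots)_\beta$ are to be determined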

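Observe that $\beta x = (c_1 . c_2 c_3 c_4 \ldots)_\beta$, because it is necessary to shift the radix point only when multiplying by the base β • Thus, the unknown digit $c_1$ can be described as the integer part of βx, denoted by $I(\beta x)$ • The fractional part, $(0.c_2 c_3 c_4 \ldots)_\beta$, is denoted by $F(\beta x)$ • The process is repeated in the following pattern: $d_0 = x$; $d_1 = F(\beta d_0)$, $c_1 = I(\beta d_0)$; $d_2 = F(\beta d_1)$, $c_2 = I(\beta d_1)$; and so on • The arithmetic is carried out in the decimal system.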

Example • Use the preceding algorithm to convert the decimal number $x = (0.372)_{10}$ to binary form.

Repeatedly multiplying by 2 and removing the integer parts gives • $(0.372)_{10} = (0.010\,111\ldots)_2$
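The multiply-and-strip procedure can be sketched as below. The name `frac_to_base` is hypothetical, and the sketch assumes the input is given as a decimal string so that `fractions.Fraction` can hold it exactly; with binary floats the later digits could be contaminated by representation error.

```python
from fractions import Fraction

def frac_to_base(x, base, n_digits):
    """First n_digits digits of x (0 <= x < 1) to the right of the radix point."""
    x = Fraction(x)           # exact value of the decimal string
    digits = []
    for _ in range(n_digits):
        x *= base
        c = int(x)            # integer part I(base * x)
        digits.append(c)
        x -= c                # fractional part F(base * x)
    return digits

print(frac_to_base("0.372", 2, 6))       # [0, 1, 0, 1, 1, 1]
print(frac_to_base("0.35546875", 8, 3))  # [2, 6, 6]
```

The expansion of 0.372 does not terminate in base 2, so the caller has to choose how many digits to keep.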

Base Conversion 10↔8↔2 • Most computers use the binary system for their internal representation of numbers • The octal system (base 8) is useful in converting from the decimal system to the binary system and vice versa • With base 8, the positional values of the numbers are $8^0 = 1$, $8^1 = 8$, $8^2 = 64$, $8^3 = 512$, $8^4 = 4096$, ...

Example

When converting between decimal and binary form, it is convenient to use the octal representation as an intermediate step • Conversion between octal and decimal uses the division and multiplication procedures • Conversion between octal and binary is simple: group the binary digits in threes, starting at the binary point and proceeding in both directions: $(101\,101\,001.110\,010\,100)_2 = (551.624)_8$ • Conversion of an octal number to binary can be done in a similar manner but in reverse order: $(5362.74)_8 = (101\,011\,110\,010.111\,100)_2$

Example • What is $(2576.35546\,875)_{10}$ in octal and binary forms? • Solution: convert the decimal number first to octal and then to binary • For the integer part, repeatedly divide by 8: $2576. = (5020.)_8 = (101\,000\,010\,000.)_2$

For the fractional part, repeatedly multiply by 8: $0.35546\,875 = (0.266)_8 = (0.010\,110\,110)_2$ • The result: $(2576.35546\,875)_{10} = (101\,000\,010\,000.010\,110\,110)_2$

Base 16 • hexadecimal system (base 16) • A, B, C, D, E, and F represent 10, 11, 12, 13, 14, and 15, respectively • table of equivalences:

Conversion between binary numbers and hexadecimal numbers • regroup the binary digits into groups of four, starting at the binary point: $(010\,101\,110\,101\,101)_2 = (0010\;1011\;1010\;1101)_2 = (2BAD)_{16}$ • and $(111\,101\,011\,110\,010.110\,010\,011\,110)_2 = (0111\;1010\;1111\;0010.1100\;1001\;1110)_2 = (7AF2.C9E)_{16}$
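Regrouping works the same way for octal (groups of three) and hexadecimal (groups of four). The following is a minimal sketch with a hypothetical helper `regroup`; it assumes the input is a string of binary digits with optional spaces and at most one radix point.

```python
def regroup(binary, bits_per_digit):
    """Rewrite a binary string in base 2**bits_per_digit by grouping digits."""
    symbols = "0123456789ABCDEF"
    k = bits_per_digit
    int_part, _, frac_part = binary.replace(" ", "").partition(".")
    # Pad outward from the radix point: zeros on the left of the integer
    # part and on the right of the fractional part.
    int_part = int_part.zfill(-(-len(int_part) // k) * k)
    frac_part = frac_part.ljust(-(-len(frac_part) // k) * k, "0")

    def convert(bits):
        return "".join(symbols[int(bits[i:i + k], 2)] for i in range(0, len(bits), k))

    result = convert(int_part)
    if frac_part:
        result += "." + convert(frac_part)
    return result

print(regroup("111 101 011 110 010.110 010 011 110", 4))  # 7AF2.C9E
print(regroup("101 101 001.110 010 100", 3))              # 551.624
```

The padding direction matters: grouping must start at the radix point, so a leading partial group such as 111 becomes 0111, not 1110.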

More Examples • convert $(0.276)_8$, $(0.C8)_{16}$, and $(492)_{10}$ into different number systems

Significant Digits • digits beginning with the leftmost nonzero digit and ending with the rightmost correct digit, including final zeros that are exact.

Example • Solve for the variable y in this linear system of two equations in two variables: $0.1036\,x + 0.2122\,y = 0.7381$ and $0.2081\,x + 0.4247\,y = 0.9327$ • First, carry only three significant digits of precision in the calculations • Second, repeat with four significant digits throughout • Finally, use ten significant digits.

Solution • The first task is to round all numbers in the original problem to three digits and to round all the calculations, keeping only three significant digits • Take a multiple α of the first equation and subtract it from the second equation to eliminate the x-term in the second equation • The multiplier is α = 0.208/0.104 ≈ 2.00 • In the second equation, the new coefficient of the x-term is 0.208 − (2.00)(0.104) ≈ 0.208 − 0.208 = 0 • the new y-term coefficient is 0.425 − (2.00)(0.212) ≈ 0.425 − 0.424 = 0.001 • and the right-hand side is 0.933 − (2.00)(0.738) = 0.933 − 1.48 = −0.547 • Hence y = −0.547/(0.001) ≈ −547.

Now keep four significant digits: • the multiplier is α = 0.2081/0.1036 ≈ 2.009 • In the second equation, the new coefficient of the x-term is 0.2081 − (2.009)(0.1036) ≈ 0.2081 − 0.2081 = 0 • the new coefficient of the y-term is 0.4247 − (2.009)(0.2122) ≈ 0.4247 − 0.4263 = −0.001600 • and the new right-hand side is 0.9327 − (2.009)(0.7381) ≈ 0.9327 − 1.483 ≈ −0.5503 • Hence y = −0.5503/(−0.001600) ≈ 343.9 • The answer has changed from −547 to 343.9, which is a huge difference!

Finally, carry ten significant decimal digits • We find that even 343.9 is not accurate • We obtain y = 356.29071 99 • The lesson learned: • data thought to be accurate should be carried with full precision and not be rounded off prior to each of the calculations
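The three runs of this example can be reproduced with Python's standard `decimal` module, which rounds every operation to a chosen number of significant digits. This is a sketch under assumptions: the function name `solve_y` is hypothetical, and the module's default rounding mode (round-half-even) is taken to match the hand calculation, which it does for the three- and four-digit cases.

```python
from decimal import Decimal, localcontext

def solve_y(digits):
    """Eliminate x and solve for y, rounding every step to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        # Unary plus rounds each input to the working precision.
        a11, a12, b1 = (+Decimal(s) for s in ("0.1036", "0.2122", "0.7381"))
        a21, a22, b2 = (+Decimal(s) for s in ("0.2081", "0.4247", "0.9327"))
        alpha = a21 / a11                  # multiplier
        y_coeff = a22 - alpha * a12        # new coefficient of y
        rhs = b2 - alpha * b1              # new right-hand side
        return rhs / y_coeff

for digits in (3, 4, 10):
    print(digits, solve_y(digits))
```

With three and four digits the function returns −547 and 343.9, as in the hand calculation; with ten digits the result is close to 356.2907. The instability comes from the new y-coefficient, which is the difference of two nearly equal numbers.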

In most computers, the arithmetic operations are carried out in a double-length accumulator • which has twice the precision of the stored quantities • Even this may not avoid a loss of accuracy! • Sources of loss of accuracy: • roundoff errors • subtracting nearly equal numbers

Figure: geometric illustration of what can happen in solving two equations in two unknowns • The point of intersection of the two lines is the exact solution • The dotted lines show the degree of uncertainty from errors in the measurements or roundoff errors • Instead of a sharply defined point, there is a small trapezoidal area containing many possible solutions • If the two lines are nearly parallel, the area of possible solutions can increase dramatically! • This is the distinction between well-conditioned and ill-conditioned systems of linear equations

In 2D, well-conditioned and ill-conditioned linear systems

Errors: Absolute and Relative • Let α and β be two numbers, of which one is regarded as an approximation to the other • The error of β as an approximation to α is α − β • that is, the error is the exact value minus the approximate value • The absolute error of β as an approximation to α is |α − β| • The relative error of β as an approximation to α is |α − β|/|α| • In the absolute error, the roles of α and β are the same • In computing the relative error, they are not: the exact value α appears in the denominator • The relative error is undefined in the case α = 0.

The relative error is usually more meaningful than the absolute error • e.g., α1 = 1.333, β1 = 1.334 • α2 = 0.001, β2 = 0.002 • The absolute error of βi as an approximation to αi is the same in both cases: $10^{-3}$ • However, the relative errors are $(3/4) \times 10^{-3}$ and 1, respectively • The relative error clearly indicates that β1 is a good approximation to α1 but that β2 is a poor approximation to α2
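The two definitions translate directly into code. In this sketch the function names `absolute_error` and `relative_error` are hypothetical, and ordinary floating-point numbers are used, so the printed values are themselves only accurate to roundoff.

```python
def absolute_error(exact, approx):
    """|exact - approx|"""
    return abs(exact - approx)

def relative_error(exact, approx):
    """|exact - approx| / |exact|; undefined when exact is zero."""
    if exact == 0:
        raise ValueError("relative error is undefined for an exact value of 0")
    return abs(exact - approx) / abs(exact)

for exact, approx in [(1.333, 1.334), (0.001, 0.002)]:
    print(f"abs = {absolute_error(exact, approx):.3e}, "
          f"rel = {relative_error(exact, approx):.3e}")
```

Both pairs have an absolute error of about 1e-3, while the relative errors differ by three orders of magnitude.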

In summary • the exact value is also called the true value • A useful way to express the absolute error and relative error is to drop the absolute values: (relative error)(exact value) = exact value − approximate value, so that approximate value = (exact value)[1 − (relative error)] • In practice, the relative error is often taken relative to the approximate value rather than to the exact value • because the true value may not be known

Example • Consider x = 0.00347 rounded to $\hat{x}$ = 0.0035 and y = 30.158 rounded to $\hat{y}$ = 30.16 • What are the number of significant digits, absolute errors, and relative errors? • Interpret the results.

Solution • Case 1. $\hat{x} = 0.35 \times 10^{-2}$ has two significant digits • absolute error: $0.3 \times 10^{-4}$ • relative error: $0.865 \times 10^{-2}$ • Case 2. $\hat{y} = 0.3016 \times 10^{2}$ has four significant digits • absolute error: $0.2 \times 10^{-2}$ • relative error: $0.66 \times 10^{-4}$ • The relative error is a better indication of the number of significant digits than the absolute error

Accuracy and Precision • Accurate to n decimal places • can trust n digits to the right of the decimal place • Accurate to n significant digits • can trust a total of n digits as being meaningful beginning with the leftmost nonzero digit.

Consider a ruler graduated in millimeters used to measure lengths • The measurements will be accurate to one millimeter, or 0.001 m • that is, three decimal places written in meters • A measurement such as 12.345 m would be accurate to three decimal places • A measurement such as 12.34567 89 m would be meaningless, since the ruler produces only three decimal places • and it should be 12.345 m or 12.346 m • If the measurement 12.345 m has five dependable digits, then it is accurate to five significant figures • A measurement such as 0.076 m has only two significant figures.

using a calculator or computer in a laboratory experiment, one may get a false sense of having higher precision than is warranted by the data • e.g., • (1.2) + (3.45) = 4.65 • only two significant digits of accuracy • because the second digit in 1.2 may be • the effect of rounding 1.24 down or rounding 1.16 up to two significant figures • Then the left-hand side - as large as (1.249) + (3.454) = (4.703) • or as small as (1.16) + (3.449) = (4.609)

In Addition and Subtraction • In adding and subtracting numbers • the result is accurate only to the smallest number of significant digits • used in any step of the calculation • In the above example, the term 1.2 has two significant digits; • therefore, the final calculation has an uncertainty in the third digit

Rule of Thumb • In multiplication and division of numbers, the results may be even more misleading • e.g., computations on a calculator: (1.23)(4.5) = 5.535 and (1.23)/(4.5) = 0.27333 3333 • There appear to be four and nine significant digits in the results • but there are really only two • As a rule of thumb, keep as many significant digits in a sequence of calculations as there are in the least accurate number involved in the computations.
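One way to apply the rule of thumb in code is to round a displayed result to the number of significant digits of the least accurate operand. The helper `round_sig` below is hypothetical; it leans on Python's `g` format specifier, which rounds to a given number of significant digits.

```python
def round_sig(x, n):
    """Round x to n significant digits (via the 'g' format specifier)."""
    return float(f"{x:.{n}g}")

product = 1.23 * 4.5
quotient = 1.23 / 4.5
print(round_sig(product, 2))   # 5.5
print(round_sig(quotient, 2))  # 0.27
```

This only trims the reported value. Intermediate quantities are better kept at full working precision and rounded once at the end.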
