Floating Point Computation Jyun-Ming Chen
Contents • Sources of Computational Error • Computer Representation of (Floating-point) Numbers • Efficiency Issues
Sources of Computational Error • Converting a mathematical problem into a numerical problem introduces errors due to limited computation resources: • Round-off error (limited precision of representation) • Truncation error (limited time for computation) • Misc.: error in the original data; blunder (programming/data-input error); propagated error
Supplement: Error Classification (Hildebrand) • Gross error: caused by human or mechanical mistakes • Roundoff error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification • Truncation error: any error which is neither a gross error nor a roundoff error. Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps
Common Measures of Error • Definitions • Total error = round-off + truncation • Absolute error = |numerical - exact| • Relative error = absolute error / |exact| • If exact is zero, the relative error is undefined
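A small C sketch of these definitions (the values are my own illustration, not from the slides):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double exact     = 2.0 / 3.0;    /* known exact value               */
    double numerical = 0.666667;     /* rounded numerical approximation */

    double abs_err = fabs(numerical - exact);
    double rel_err = abs_err / fabs(exact);   /* undefined when exact == 0 */

    printf("absolute error = %g\n", abs_err);
    printf("relative error = %g\n", rel_err);
    return 0;
}
```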
Ex: Round-off Error • Representation consists of a finite number of digits • Implication: machine-representable numbers are discrete points on the real line (more later)
Watch out for printf!! • By default, “%f” prints only 6 digits after the decimal point.
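For example (a minimal C illustration):

```c
#include <stdio.h>

int main(void)
{
    double x = 1.0 / 3.0;
    printf("%f\n", x);     /* 0.333333 -- only 6 digits after the point */
    printf("%.15f\n", x);  /* request more digits explicitly with %.15f */
    return 0;
}
```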
Ex: Numerical Differentiation • Evaluating the first derivative of f(x) with the forward difference f'(x) ≈ (f(x+h) - f(x)) / h • The Taylor-series terms dropped by this approximation are the truncation error, which is O(h)
Numerical Differentiation (cont) • Select a problem with a known answer • So that we can evaluate the error!
Numerical Differentiation (cont) • Error analysis [figure: plot of total error vs. h; the (truncation) error shrinks as h shrinks] • What happened at h = 0.00001?!
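A sketch of the experiment, assuming the forward-difference formula above with f = sin (so the exact derivative is cos): as h shrinks, truncation error falls, until roundoff in f(x+h) - f(x) dominates and the total error grows again.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0, exact = cos(x);   /* f = sin, so f'(x) = cos(x) */
    for (double h = 1e-1; h >= 1e-12; h /= 10.0) {
        double approx = (sin(x + h) - sin(x)) / h;   /* forward difference */
        printf("h = %.0e   error = %.3e\n", h, fabs(approx - exact));
    }
    return 0;
}
```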
Ex: Polynomial Deflation • F(x) is a polynomial with 20 real roots • Use any method to numerically solve for a root, then deflate the polynomial to 19th degree • Solve for another root, and deflate again, and again, … • The accuracy of the roots obtained gets worse each time due to error propagation
Computer Representation of Floating-Point Numbers • Floating point vs. fixed point • Decimal-binary conversion • Standard: IEEE 754 (1985)
Floating vs. Fixed Point • Decimal, 6 digits (positive numbers) • Fixed point: 5 digits after the decimal point • 0.00001, … , 9.99999 • Floating point: 2 digits for the exponent (base 10); 4 digits for the mantissa (accuracy) • 0.001×10^-99, … , 9.999×10^99 • Comparison: • Fixed point: fixed accuracy; simple math for computation (sometimes used in graphics programs) • Floating point: trades accuracy for a larger range of representation
Decimal-Binary Conversion • Ex: 134 (base 10) = 10000110 (base 2) • Ex: 0.125 (base 10) = 0.001 (base 2) • Ex: 0.1 (base 10) = 0.000110011001100… (base 2); the binary expansion repeats forever, so 0.1 cannot be stored exactly
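A sketch of the repeated-multiplication method for the fractional part (multiply by 2; the integer part that pops out is the next bit):

```c
#include <stdio.h>

int main(void)
{
    double frac = 0.1;               /* fraction to convert */
    printf("0.");
    for (int i = 0; i < 20; i++) {
        frac *= 2.0;
        int bit = (int)frac;         /* integer part = next binary digit */
        printf("%d", bit);
        frac -= bit;
    }
    printf("...\n");                 /* 0.1 never terminates in binary */
    return 0;
}
```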
Floating Point Representation • A value is stored as ±f × b^e • Fraction (mantissa), f • Usually normalized so that 1/b ≤ f < 1 • Base, b • 2 for personal computers • 16 for mainframes • … • Exponent, e
How about Padding?
IEEE 754-1985 • Purpose: make floating-point systems portable • Defines: the number representation, how calculations are performed, exceptions, … • Single precision (32-bit) • Double precision (64-bit)
Number Representation • S: sign of mantissa • Range (roughly) • Single: 10^-38 to 10^38 • Double: 10^-307 to 10^307 • Precision (roughly) • Single: 7 significant decimal digits • Double: 15 significant decimal digits • Exercise: describe how these are obtained
Implication • When you write your program, make sure the results you print carry only meaningful significant digits.
Implicit One • The normalized mantissa always starts with a leading 1, which is not stored; this gains one extra bit of precision • Ex: -3.5 = -(11.1)₂ = -(1.11)₂ × 2^1, so only “11” goes into the mantissa field
Exponent Bias • Ex: in single precision, the exponent has 8 bits • 0000 0000 (0) to 1111 1111 (255) • An offset is added so that both positive and negative exponents can be represented • Effective exponent = biased exponent - bias • Bias value: 32-bit (127); 64-bit (1023) • Ex: 32-bit • 1000 0000 (128): effective exp. = 128 - 127 = 1
Examine Bits of FP Numbers • Explain how this program works
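The original program is not reproduced in this extract; a minimal sketch of such a bit examiner (my code, assuming 32-bit IEEE 754 floats) might look like this:

```c
#include <stdio.h>
#include <string.h>

/* Print the raw bits of a float as sign | exponent | mantissa.
   memcpy into an unsigned int avoids strict-aliasing problems. */
static void show_bits(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof u);
    printf("%12g : ", f);
    for (int i = 31; i >= 0; i--) {
        printf("%u", (u >> i) & 1u);
        if (i == 31 || i == 23)      /* separate the three fields */
            printf(" ");
    }
    printf("\n");
}

int main(void)
{
    show_bits(1.0f);
    show_bits(-3.5f);
    show_bits(0.1f);
    return 0;
}
```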
The “Examiner” • Use the previous program to • Observe how ME works • Test subnormal behaviors on your computer/compiler • Convince yourself why the subtraction of two nearly equal numbers produces lots of error • NaN: Not-a-Number!?
Design Philosophy of IEEE 754 • [s|e|m] • S first: whether the number is +/- can be tested easily • E before M: simplifies sorting • Exponent represented with a bias (not 2's complement) for ease of sorting • [biased rep] -1, 0, 1 = 126, 127, 128 • [2's compl.] -1, 0, 1 = 0xFF, 0x00, 0x01 — would need more complicated logic for sorting and increment/decrement
Exceptions • Overflow: • ±INF: when a number exceeds the range of representation • Underflow: • When numbers are too close to zero, they are treated as zeroes • Dwarf: • The smallest representable number in the FP system • Machine Epsilon (ME): • A number with computational significance (more later)
Extremities (more later) • E = (1…1): • M = (0…0): infinity • M not all zeros: NaN (Not a Number) • E = (0…0): • M = (0…0): clean zero • M not all zeros: dirty zero (see next page)
Not-a-Number • Produced by numerical exceptions: • Square root of a negative number • Invalid domain of trigonometric functions • … • Often causes the program to stop running
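A short demonstration (note that NaN compares unequal to everything, including itself):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double zero = 0.0;
    double a = sqrt(-1.0);    /* invalid operation -> NaN */
    double b = zero / zero;   /* 0/0 is also NaN          */

    printf("a = %f, b = %f\n", a, b);      /* prints nan            */
    printf("a == a ? %d\n", a == a);       /* 0: NaN != even itself */
    printf("isnan(a) ? %d\n", isnan(a));   /* 1                     */
    return 0;
}
```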
Extremities (32-bit) • Max: (1.111…1)₂ × 2^(254-127) = (10 - 0.000…1)₂ × 2^127 ≈ 2^128 • Min (w/o stepping into dirty-zero): (1.000…0)₂ × 2^(1-127) = 2^-126
Dirty-Zero (a.k.a. denormals) • No “Implicit One” • IEEE 754 did not specify compatibility for denormals • If you are not sure how to handle them, stay away from them; scale your problem properly • “Many problems can be solved by pretending as if they do not exist”
Dirty-Zero (cont) • Denormals fill the gap on the real line between 0 and 2^-126, down to the dwarf • Bit patterns:
• 2^-126: 00000000 10000000 00000000 00000000 (smallest normalized)
• 2^-127: 00000000 01000000 00000000 00000000
• 2^-128: 00000000 00100000 00000000 00000000
• 2^-129: 00000000 00010000 00000000 00000000
• (Dwarf: the smallest representable)
Dwarf (32-bit) • Bit pattern: 00000000 00000000 00000000 00000001 • Value: 2^-149
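These extremities can be confirmed from <float.h>, and the dwarf via nextafterf (a sketch, assuming the platform supports denormals):

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_MAX = %g\n", FLT_MAX);    /* ~3.40e38  = (2 - 2^-23) x 2^127 */
    printf("FLT_MIN = %g\n", FLT_MIN);    /* ~1.18e-38 = 2^-126 (normalized) */
    printf("dwarf   = %g\n", nextafterf(0.0f, 1.0f));  /* ~1.4e-45 = 2^-149 */
    return 0;
}
```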
Machine Epsilon (ME) • Definition • smallest non-zero number that makes a difference when added to 1.0 on your working platform • This is not the same as the dwarf
Computing ME (32-bit) • Start from 1 + eps and shrink eps until 1 + eps gets as close to 1.0 as the representation allows • ME: (00111111 10000000 00000000 00000001) - 1.0 = 2^-23 ≈ 1.19 × 10^-7
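The classic halving loop, in float to match the 32-bit discussion (the volatile guards against extra-precision registers):

```c
#include <stdio.h>

int main(void)
{
    float eps = 1.0f;
    volatile float test = 1.0f + 0.5f;
    while (test > 1.0f) {        /* is 1 + eps/2 still distinguishable from 1? */
        eps *= 0.5f;
        test = 1.0f + eps * 0.5f;
    }
    printf("machine epsilon (float) = %g\n", eps);   /* 2^-23 ~ 1.19e-7 */
    return 0;
}
```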
Significance of ME • Never terminate an iteration by testing whether two FP numbers are equal • Instead, test whether |x - y| < ME
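A convergence-test fragment in that spirit; the slide's absolute test suits values near 1.0, and a relative variant (shown here, my addition) is safer at other scales:

```c
#include <float.h>
#include <math.h>

/* Stop iterating when x and y agree to within a relative tolerance. */
int converged(double x, double y)
{
    return fabs(x - y) <= DBL_EPSILON * fmax(fabs(x), fabs(y));
}
```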
Numerical Scaling • Number density: there are as many IEEE 754 numbers in [1.0, 2.0] as there are in [256, 512] • Revisit: “roundoff” error • ME: a measure of the number density near 1.0 • Implication: scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense and roundoff error is smallest)
Scaling (cont) • Performing computation on denser portions of the real line minimizes roundoff error • But don't overdo it; switching to double precision is an easier way to gain precision • The densest part is near the subnormals, if density is defined as numbers per unit length
How Subtraction is Performed on Your PC • Steps: • Convert to base 2 • Equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit • Subtract the mantissas • Normalize
Subtraction of Nearly Equal Numbers • Base 10: 1.24446 - 1.24445 • In binary, the leading mantissa bits of the two numbers are identical and cancel, leaving only the trailing bits • Significant loss of accuracy (most bits are unreliable)
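The same example in code (single precision makes the damage obvious):

```c
#include <stdio.h>

int main(void)
{
    float x = 1.24446f, y = 1.24445f;
    /* The leading digits cancel; only the noisy trailing bits remain. */
    printf("float  x - y = %.10e\n", (double)(x - y));
    printf("double x - y = %.10e\n", 1.24446 - 1.24445);  /* much closer to 1e-5 */
    return 0;
}
```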
Theorem of Loss of Precision • Let x, y be normalized floating-point machine numbers with x > y > 0 • If 2^-p ≤ 1 - y/x ≤ 2^-q, then at most p and at least q significant binary bits are lost in the subtraction x - y • Interpretation: • “When two numbers are very close, their subtraction introduces a lot of numerical error.”
Implications • Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible • When you program, algebraically rewrite expressions so that nearly equal quantities are never subtracted (one classic rewrite is sketched below)
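The slide's concrete rewrites are not preserved in this extract; one classic example of the idea (my choice, not necessarily the original): for small x, sqrt(x*x + 1) - 1 cancels badly, but multiplying by the conjugate gives an algebraically identical form with no subtraction of nearly equal numbers.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1e-8;
    double naive  = sqrt(x * x + 1.0) - 1.0;              /* cancellation: 0   */
    double stable = (x * x) / (sqrt(x * x + 1.0) + 1.0);  /* rewritten: ~5e-17 */
    printf("naive  = %.17e\n", naive);
    printf("stable = %.17e\n", stable);
    return 0;
}
```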
Efficiency Issues • Horner Scheme • program examples
Horner Scheme • For polynomial evaluation: a_n x^n + … + a_1 x + a_0 = (…((a_n x + a_(n-1)) x + a_(n-2)) x + …) x + a_0 • Compare efficiency: n multiplications and n additions, versus roughly n^2/2 multiplications for naive term-by-term evaluation (see the sketch below)
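A sketch in C (coefficient array a[0..n], with a[i] the coefficient of x^i — my convention):

```c
#include <stdio.h>

/* Evaluate a[n]x^n + ... + a[1]x + a[0] with n multiplies and n adds. */
double horner(const double a[], int n, double x)
{
    double p = a[n];
    for (int i = n - 1; i >= 0; i--)
        p = p * x + a[i];
    return p;
}

int main(void)
{
    double a[] = { 1.0, -2.0, 0.0, 3.0 };       /* 3x^3 - 2x + 1 */
    printf("p(2) = %g\n", horner(a, 3, 2.0));   /* 3*8 - 4 + 1 = 21 */
    return 0;
}
```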
Issues of PI • 3.14 is often not accurate enough • 4.0*atan(1.0) is a good substitute
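For example:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 4.0 * atan(1.0);   /* atan(1) = pi/4, accurate to full double precision */
    printf("PI = %.17g\n", PI);
    return 0;
}
```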