Floating Point in computers

Floating Point in computers • Comply with standards: IEEE 754, ISO/IEC 559

Timeline • Introduction quite short • Binary review not so long • Integer Arithmetic 1/3 • Floating Point 1/3 • Floating Point Arithmetic 1/3 • Other issues extra short

Introduction • Who does computer arithmetic? • Intel’s spare money • How is it done in hardware? • How Integer relates to Floating point • Now, we go back to “computer structure”

Binary numbers • What is 1001011.00101 (binary)? • Place values of the 1 bits left of the point: 64, 8, 2, 1
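The question on this slide can be checked with a short sketch. The function name `binary_to_value` is made up for illustration; it simply sums the place values on both sides of the binary point.

```python
def binary_to_value(text: str) -> float:
    """Evaluate a binary string with an optional binary point."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Integer bits: weights 1, 2, 4, ... counted leftwards from the point.
    for power, bit in enumerate(reversed(whole)):
        value += int(bit) * 2 ** power
    # Fraction bits: weights 1/2, 1/4, ... counted rightwards from the point.
    for power, bit in enumerate(frac, start=1):
        value += int(bit) * 2 ** -power
    return value

# 64 + 8 + 2 + 1 for the integer part, 1/8 + 1/32 for the fraction.
print(binary_to_value("1001011.00101"))
```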

Signed Binary Integers • Sign-magnitude • 2’s complement • 1’s complement • biased

Sign-Magnitude • High order bit = Sign • 0101 = 5 • 1101 = -5 • Two zeros

2’s complement • Number + Negative = 2^n • 0101 = 5 • 1011 = -5 • Easy addition (drop carry) • Formula: -a_{n-1} × 2^{n-1} + a_{n-2} × 2^{n-2} + … + a_1 × 2^1 + a_0

1’s Complement • Negative = complement of every bit • 0101 = 5 • 1010 = -5 • Two zeros • Number + Negative = 2^n - 1

Biased • Binary = Number + Bias • Bias = 5: 1010 = 5 (5 + 5 = 10), 0000 = -5 ((-5) + 5 = 0) • Relative order remains
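The four signed encodings can be compared side by side with a small sketch. It assumes the 4-bit word and the bias of 5 used on the slides; all function names are invented for illustration.

```python
N = 4      # word width used on the slides
BIAS = 5   # bias used on the "Biased" slide

def to_bits(value: int) -> str:
    return format(value, f"0{N}b")

def sign_magnitude(x: int) -> str:
    sign = 1 if x < 0 else 0
    return to_bits((sign << (N - 1)) | abs(x))

def twos_complement(x: int) -> str:
    return to_bits(x % (1 << N))                        # negative x maps to 2^n + x

def ones_complement(x: int) -> str:
    return to_bits(x if x >= 0 else (1 << N) - 1 + x)   # 2^n - 1 - |x|

def biased(x: int) -> str:
    return to_bits(x + BIAS)                            # stored word = number + bias

def twos_complement_value(bits: str) -> int:
    """Evaluate -a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + ... + a_0."""
    n = len(bits)
    value = -int(bits[0]) * 2 ** (n - 1)
    for i, bit in enumerate(bits[1:], start=1):
        value += int(bit) * 2 ** (n - 1 - i)
    return value

for x in (5, -5):
    print(x, sign_magnitude(x), twos_complement(x), ones_complement(x), biased(x))
```

With a bias of 5 the value 5 is stored as 5 + 5 = 10, so the biased words sort in the same order as the numbers they represent.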

Integer Arithmetic

Adding (unsigned) Integers • Elementary school: 11001101 + 10000110 = 101010011 (205 + 134 = 339 in decimal) • Result has n+1 bits!

Adding Integers - hardware • Half Adder: inputs a, b; outputs s, Cout • Full Adder: inputs a, b, Cin; outputs s, Cout • 2 logical levels

Ripple carry Adder • [Figure: chain of n full adders with inputs a_0, b_0 … a_{n-1}, b_{n-1} and Cin, outputs s_0 … s_{n-1} and Cout] • Slow - 2n logical levels • Small constant (CMOS) • Other ways exist
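A bit-level sketch of the ripple carry adder may help. For brevity the full adder here is built from two half adders, which is an assumption of this sketch and not necessarily the two-level gate layout on the slide; the helper names are hypothetical.

```python
def half_adder(a: int, b: int):
    return a ^ b, a & b                      # (sum, carry out)

def full_adder(a: int, b: int, cin: int):
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2                        # (sum, carry out)

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists, least significant bit first."""
    carry, out = cin, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # the carry ripples to the next stage
        out.append(s)
    return out, carry

def to_bits(x: int, n: int):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

# The 8-bit addition from the earlier slide: 11001101 + 10000110.
s, cout = ripple_carry_add(to_bits(0b11001101, 8), to_bits(0b10000110, 8))
print(format((cout << 8) | from_bits(s), "09b"))
```

The carry out is the extra (n+1)-th result bit.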

Adding Signed Integers • In 2’s complement: • b + (-a) = b + (2^n - a) = 2^n + (b - a) • (-b) + (-a) = (2^n - b) + (2^n - a) = (2^n - (b + a)) + 2^n • hence - add as integers, discard carry out • Example: 0011 + 1100 = ?
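The example on this slide can be worked through with a minimal sketch, assuming a 4-bit word. `add_signed` and `to_signed` are illustrative names.

```python
N = 4
MASK = (1 << N) - 1

def add_signed(a_word: int, b_word: int) -> int:
    """Add two n-bit 2's complement words as unsigned integers, discarding the carry out."""
    return (a_word + b_word) & MASK

def to_signed(word: int) -> int:
    """Interpret an n-bit word as a 2's complement number."""
    return word - (1 << N) if word & (1 << (N - 1)) else word

result = add_signed(0b0011, 0b1100)          # 3 + (-4)
print(format(result, "04b"), to_signed(result))
```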

Subtracting Integers • Add the negation • Negating 2’s complement: -(11010100101011000110000) = ? • Invert every bit to the left of the lowest 1; keep that 1 and the trailing zeros: 001010110101001110 1 0000
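Two equivalent ways of negating a 2's complement word are sketched below: the arithmetic definition (invert and add 1) and the bit-pattern shortcut (keep the lowest 1 and the zeros to its right, invert the rest). Function names are hypothetical.

```python
def negate(bits: str) -> str:
    """Invert all bits and add 1, keeping the word width."""
    n = len(bits)
    value = int(bits, 2)
    return format((~value + 1) & ((1 << n) - 1), f"0{n}b")

def negate_by_rule(bits: str) -> str:
    """Keep the lowest 1 and everything to its right; invert the bits to its left."""
    i = bits.rfind("1")
    if i == -1:
        return bits                           # zero is its own negation
    flipped = "".join("1" if b == "0" else "0" for b in bits[:i])
    return flipped + bits[i:]

word = "11010100101011000110000"
print(negate(word))
print(negate_by_rule(word))
```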

Integer (unsigned) Multiplication • Elementary school: 1101 × 1001 • Partial products (each shifted one place further left): 1101, 0000, 0000, 1101 • Sum: 01110101 • Result is 2n bits!

Hardware Multiplier • [Figure: n-bit registers P, A and B, with a carry bit; P and A shift together] • P = 0 • loop: (i) if a_0 = 1, add B to P (ii) right-shift P & A
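The multiplier loop can be sketched in software as follows. This is an illustration under the assumption that the loop runs n times and that the carry bit is shifted into the top of P; the register names follow the slide, and the function name is made up.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers with registers P, A and B."""
    mask = (1 << n) - 1
    p = 0
    for _ in range(n):
        carry = 0
        if a & 1:                              # (i) a_0 = 1: add B to P
            p += b
            carry, p = p >> n, p & mask
        # (ii) right-shift the combined (carry, P, A) register by one bit
        a = (a >> 1) | ((p & 1) << (n - 1))
        p = (p >> 1) | (carry << (n - 1))
    return (p << n) | a                        # 2n-bit product: P high half, A low half

print(format(shift_add_multiply(0b1101, 0b1001, 4), "08b"))
```

After n iterations the original contents of A have been shifted out, and P together with A holds the 2n-bit product.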

Integer (unsigned) Division • Elementary school long division: 1101 / 11 • Result: 0100, Rem 1 • Dec: 13/3 = 4, Rem 1

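Hardware Divider • [Figure: (n+1)-bit register P, n-bit register A, (n+1)-bit register B with a leading 0; P and A shift together] • P = 0 • loop: (i) left-shift P & A (ii) Subtract B from P: positive: a_0 = 1; negative: a_0 = 0, restore P (add B)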

Example • 13 / 3 = 4 (remainder 1) • n = 4 • A = 1101, B = 00011, P = 00000

Start: P = 00000, A = 1101, B = 00011

End: P = 00001 (Remainder), A = 0100 (Quotient), B = 00011
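The divider loop and the 13 / 3 example can be traced with a short sketch of restoring division. It assumes n iterations and treats a zero difference as "positive"; the function name is hypothetical.

```python
def restoring_divide(a: int, b: int, n: int):
    """Divide unsigned n-bit a by b; return (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("divider must check for 0")
    mask = (1 << n) - 1
    p = 0
    for step in range(n):
        # (i) left-shift the combined (P, A) register by one bit
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & mask
        # (ii) subtract B from P
        p -= b
        if p >= 0:
            a |= 1                             # non-negative: quotient bit a_0 = 1
        else:
            p += b                             # negative: a_0 = 0, restore P
        print(f"step {step + 1}: P={p:05b} A={a:04b}")
    return a, p                                # A holds the quotient, P the remainder

print(restoring_divide(0b1101, 0b0011, 4))
```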

Division - remarks • Non-restoring Algorithm • Load P only if positive • Check for 0 • (Total) Result is 2n bits!

Integer arithmetic - remarks • Signed Multiply and Division • Algorithms exist • We will not use them • What to do with extra bits? • Faster methods

Floating Point

Non Integers - Other Methods • Fixed Point • example: # # # . # • Binary point shifted • Integer arithmetic (extra shifting) • Small number magnitude • Rational • a/b (a, b ∈ Z)

Floating Point • Exponent + Significand (= Mantissa) • x = s × 2^e • Example: s = 101, e = 011 (binary) • x = 101 × 2^011 (binary) = 5 × 2^3 = 40 = 101000 (binary)

Uniqueness • Denormal Numbers: 123.456 × 10^7, 0.123 × 10^4 • Normalized: #.### × 10^#, e.g. 1.123 × 10^4 • What about 0 ?

Floating Point Standard • Why Standardize? • Hardware accelerators • Software compatibility • Build Software Libraries • etc. • IEEE 754-1985, ISO/IEC 559 • Includes: Structure, Arithmetic results

Float Types • 4 Precision Types: • Single • Single extended • Double • Double extended

Single Precision • 32 bits: Sign (1) | Exponent (8) | Significand (23) • Exponent (e): Biased (+127) • Significand (f): Fixed fraction: 0.###… • Number: 1.f × 2^{e-127}

Single Precision - Example • 1 10000001 01000000000000000000000 • 10000001 = 129 → 129 - 127 = 2 • 01000… = 0.01000… → 1.01 (binary) = 1.25 • X = -1.25 × 2^2 • X = -5
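The decoding steps of this example can be sketched in two ways: by hand from the three fields, and by letting the standard `struct` module reinterpret the same 32 bits. The function names are invented for illustration, and the hand decoder covers normalized numbers only.

```python
import struct

def decode_single(bits: str) -> float:
    """Decode a normalized single-precision bit string as sign, 1.f and 2^(e-127)."""
    bits = bits.replace(" ", "")
    sign = int(bits[0])
    e = int(bits[1:9], 2)                      # biased exponent
    f = int(bits[9:], 2) / 2 ** 23             # fraction 0.f
    return (-1) ** sign * (1 + f) * 2.0 ** (e - 127)

def decode_with_struct(bits: str) -> float:
    """Reinterpret the same 32 bits as a big-endian IEEE 754 single."""
    raw = int(bits.replace(" ", ""), 2).to_bytes(4, "big")
    return struct.unpack(">f", raw)[0]

example = "1 10000001 01000000000000000000000"
print(decode_single(example), decode_with_struct(example))
```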

Single Precision - Range • Emax = 127 (e = 254) • Emin = -126 (e = 1) • Why |Emin| < |Emax|? • 1/2^Emin does not overflow • Why Biased notation? • What about 0 and 255 ?

Floating Point Precision

Examples • We shall use base 10 sometimes: • f will have 3 digits • Emax will be 98 • Emin will be -97 • Ex: 5.34 × 10^70

NaN • Not a Number • Result of illegal computation: • Any computation involving a NaN • e = Emax + 1 & f ≠ 0 • # 11111111 ####################### • Many NaN’s (different f’s)

NaN’s in use • Zero finder outside domain • f(x) = sqrt(x) - 1 • Works since all computations yield NaN • No exception caused !
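NaN propagation can be illustrated with a rough sketch. Python's `math.sqrt` raises `ValueError` for negative input instead of returning NaN, so the helper `ieee_sqrt` below is a hypothetical stand-in that mimics the IEEE 754 behaviour; `single_fields` is likewise an illustrative name.

```python
import math
import struct

def ieee_sqrt(x: float) -> float:
    """Hypothetical helper: return NaN outside the domain instead of raising."""
    return math.sqrt(x) if x >= 0 else math.nan

def f(x: float) -> float:
    return ieee_sqrt(x) - 1.0                  # the function from the slide

def single_fields(x: float):
    """Return (sign, exponent field, fraction field) of x stored as a single."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word >> 31, (word >> 23) & 0xFF, word & 0x7FFFFF

y = f(-4.0)                                    # probe outside the domain
print(y, math.isnan(y))                        # NaN propagates through "- 1.0"
print(y == y, y < 0.0, y > 0.0)                # every comparison with NaN is False
print(single_fields(math.nan))                 # exponent field all ones, fraction nonzero
```

A zero finder that probes a point outside the domain therefore receives a NaN and can continue without handling an exception.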

Zero’s • 0 00000000 00000000000000000000000 ? • this is NOT 1.0 × 2^Emin • 1 00000000 00000000000000000000000 ? • 0 is signed! ±0 both exist! • What is the difference?

Signed 0’s • +0 = -0 BUT: • Multiply/Divide keep sign rules • Motivation: • Using inf correctly (described later) • log(x): log(0) = -inf, log(negative) = NaN, log(x) if x → (-0) ?
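The behaviour of the two zeros can be observed with a minimal sketch. Note that Python raises `ZeroDivisionError` for division by either zero and `ValueError` for `math.log(0)`, unlike the IEEE 754 defaults, so those cases are left out here.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

print(pos_zero == neg_zero)                    # the two zeros compare equal
print(math.copysign(1.0, pos_zero))            # but the sign bit can be read back
print(math.copysign(1.0, neg_zero))

# Multiply and divide follow the usual sign rules, even when the result is zero.
print(neg_zero * 5.0, pos_zero * -5.0, neg_zero * -5.0)
print(5.0 / -math.inf)
```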

± inf • More logic: • e = Emax + 1 & f =0 • # 11111111 00000000000000000000000

Inf usage • Example (if tan^{-1} is defined properly)

More on 0’os and inf’s • General Rule for 0/inf arithmetic: • Take appropriate limit: • 1/(1/x) where x=0 or inf • Why not Max # instead?

Zero’s and inf’s - yet again • x/(x^2 + 1) is bad! Why? • 1/(x + x^{-1}) is better • Do we need to check for x = 0? • Using 2 zero’s and inf’s saves some special case checks.
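The difference between the two formulas shows up for large x, where x^2 overflows to inf. The sketch below uses double precision and made-up function names. Python raises `ZeroDivisionError` at x = 0 instead of returning inf, so that case is not exercised.

```python
import math

def bad(x: float) -> float:
    return x / (x * x + 1.0)                   # x * x overflows to inf for large x

def better(x: float) -> float:
    return 1.0 / (x + 1.0 / x)                 # no intermediate overflow

x = 1e200
print(bad(x))                                  # x / inf collapses to 0
print(better(x))                               # close to the true value 1/x
print(better(math.inf))                        # 1 / (inf + 0): the limit as x grows
print(math.atan(math.inf), math.pi / 2)        # tan^-1 defined properly at inf
```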

Denormalized numbers • Example: • x = 1.23 × 10^-98, y = 1.11 × 10^-98 • x - y = 1.20 × 10^-99, which underflows to 0 • so: x - y = 0 but: x ≠ y • think of: if (x ≠ y) then z = 1/(x - y) • Solution: • use denormalized numbers!

Denormal Numbers • Smallest normal: 1.0 × 2^Emin • Below, use denormal: 0.f × 2^Emin • e = Emin - 1 & f ≠ 0 • # 00000000 ####################### • Gradual underflow: 1.23 × 10^-4 → (/10) → 0.12 × 10^-4 → (/10) → 0.01 × 10^-4 → (/10) → 0

Denormal Numbers • Back to our Example: • x = 1.23 × 10^-98, y = 1.11 × 10^-98 • x - y = 0.12 × 10^-98 • and this is not 0 !
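The same effect can be reproduced in binary double precision, where `sys.float_info.min` plays the role of the smallest normal number 1.0 × 2^Emin. This is a sketch with the slide's factors 1.23 and 1.11 reused on a different scale.

```python
import sys

smallest_normal = sys.float_info.min           # smallest normalized double

x = 1.23 * smallest_normal
y = 1.11 * smallest_normal
diff = x - y                                   # falls below the smallest normal

print(x != y, diff != 0.0)                     # x differs from y and x - y is not 0
print(diff < smallest_normal)                  # the difference is a denormal

# Gradual underflow: repeated halving walks through the denormals before reaching 0.
tiny, steps = smallest_normal, 0
while tiny > 0.0:
    tiny /= 2.0
    steps += 1
print(steps)
```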

Flush to 0 vs. Gradual Underflow • [Figure: two number lines, one per scheme, with ticks at 0, 2^-4, 2^-3, 2^-2, 2^-1]

Special Values - Summary • Exponent | Fraction | Represents • Emin - 1 | f = 0 | ±0 • Emin - 1 | f ≠ 0 | 0.f × 2^Emin • Emin ≤ e ≤ Emax | ---- | 1.f × 2^e • Emax + 1 | f = 0 | ±inf • Emax + 1 | f ≠ 0 | NaN
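The summary can be turned into a small classifier for single-precision words, following the field encodings given on the earlier slides (exponent field 0 for Emin - 1, 255 for Emax + 1). The function name is hypothetical.

```python
def classify_single(word: int) -> str:
    """Classify a 32-bit single-precision word by its exponent and fraction fields."""
    e = (word >> 23) & 0xFF                    # 8-bit biased exponent field
    f = word & 0x7FFFFF                        # 23-bit fraction field
    if e == 0:                                 # encodes Emin - 1
        return "zero" if f == 0 else "denormal"
    if e == 0xFF:                              # encodes Emax + 1
        return "inf" if f == 0 else "nan"
    return "normal"

for word in (0x00000000, 0x80000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{word:08X}", classify_single(word))
```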

Rounding • Why is rounding needed? • Infinitely many numbers → finite representation • Integers only overflow • Almost all operations need rounding • IEEE - specifies algorithms for arithmetic

Numbers need rounding • Out of range: • x > 2 × 2^Emax, x < 1 × 2^Emin • Between 2 floats: • 0.1 (decimal) = 0.00011001100… (binary) = 1.1001100… × 2^-4 • → 1.1001 × 2^-4 after rounding to a short significand
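Both cases can be observed directly in double and single precision. The sketch uses only the Python standard library; the variable names are illustrative.

```python
import struct
import sys
from fractions import Fraction

x = 0.1
print(x.hex())                                 # hexadecimal significand 1.999... times 2^-4
error = Fraction(x) - Fraction(1, 10)          # exact rounding error of the stored double
print(error != 0)

# Rounding the same value to single precision gives a different neighbour.
single = struct.unpack(">f", struct.pack(">f", x))[0]
print(single == x)

# Out of range: doubling the largest double rounds to inf.
overflow = sys.float_info.max * 2.0
print(overflow)
```

The hex digits 1.999… repeat the binary pattern 1.10011001…, matching the expansion on the slide.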
