An energy-efficient combined floating point and integer ALU for reconfigurable multi-core architectures. A literature study by Tom Bruintjes. 27/09/10. 01/10/10.
Assignment
• Design or modify a Floating Point Unit so that it can also be used as an Integer Unit, and determine its cost in terms of area and energy efficiency.
• Requirements
  - Floating Point addition and multiplication & Integer addition and multiplication
  - Pipeline should be shallow (preferably no more than 2 stages)
  - Low area costs
  - Low power consumption
Motivation
• Multicore architecture
  - MPSoC
  - Tile Processor
• Heterogeneous but no Floating Point
  - Too expensive (area, energy)
  - Fixed Point alternative
  - Software emulation
• Example: Tilera TILE-Gx100 (100 cores but no floating point)
Motivation (2)
• What if we did add an FPU?
  - High performance FP ops
  - A lot of hardware needed
  - Complex datapath → high latency (low frequency) → deep pipeline
  - A lot of area wasted if the FPU is idle
Motivation (3)
• Idea: add an FP core and make it compatible with Integer operation, so that Integer ops can be offloaded to the FP core when it is idle.
• The shared core should be deployable in an embedded system (MPSoC), hence the low area and power consumption requirements.
• Few pipeline stages to keep the compiler manageable.
Floating Point - History
• Need for FP recognized early
• The first FPU: Konrad Zuse's Z1 (1938)
  - 22-bit floating-point numbers
  - Storage of 64 numbers
  - Sliding metal parts memory
Floating Point – History (2)
• In the beginning floating-point hardware was typically an optional feature
  - "Scientific computers"
  - Extremely expensive
• Then FP became available in the form of ("math") co-processors
  - Intel x87 (486 vs )
  - Weitek
• Mid 90s: most GPPs are equipped with FP units
• Current situation: FP also in small processors
Why Floating Point
• Unsigned/Signed: (…,-2,-1),0,1,2,3,… [0000,0001,0010,0011]
  - What about rational numbers or very large/small numbers?
• Fixed Point: 0.11, 1.22, 2.33,… [00.11, 01.10, 10.11]
• Limited range and precision
  - Solution: Floating Point (scientific) notation
  - 1.220 x 10^5 (12.20 x 10^4 or 122.0 x 10^3, hence floating point)
Floating Point representation/terminology
• Floating Point representation
  - Sign S
  - Significand M (not Mantissa!)
  - Exponent E (biased)
  - Base (implicit)
• Binary representation: [1 | 00001111 | 10101010101010101010101]
• Example: 6.02 * 10^23 (significand 6.02, base (radix) 10, exponent 23)
Binary Floating Point storage (issues)
• Normalization
  - Prevent redundancy: 0.122 * 10^5 vs 1.22 * 10^4
  - Normalization means that the first digit is never a zero
  - For binary numbers this means the MSB is always 1 → "hidden bit"
• Single, Double or Quad precision
  - 32 bits: single (23-bit significand & 8-bit exponent)
  - 64 bits: double (52-bit significand & 11-bit exponent)
• Base is implicit
  - 2, 10 or 16 are common
• Special cases? (NaN, 0, ∞)
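The storage layout above can be inspected directly. The following is a minimal sketch, assuming the IEEE single-precision layout (1 sign bit, 8 exponent bits, 23 stored significand bits); the helper name `float32_fields` is hypothetical and not part of the original slides.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into its three stored fields."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased
    fraction = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

# -6.5 = -1.101b * 2^2: only the bits after the leading 1 are stored.
sign, exponent, fraction = float32_fields(-6.5)
print(sign, exponent, format(fraction, "023b"))
```

Because the leading 1 of a normalized significand is implied, the 23 stored bits give 24 bits of precision.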
The road to getting standardized
• Many ways to represent an FP number
  - Significand (sign-magnitude, two's complement, BCD)
  - Exponent (biased, two's complement)
  - Special numbers
• Unorganized start
  - Every company used its own format
  - IBM, DEC, Cray
• Highly incompatible
  - 2 * 1.0 on machine A gives a different result than on machine B
  - Situation even worse for exceptions (e.g., underflow and overflow)
IBM System/360 & Cray-1
• IBM highlights
  - Sign magnitude & biased exponent
  - Base-16 numeral system (more efficient/less accurate)
• Cray-1 highlights
  - Sign magnitude & biased exponent
  - Very high precision (64-bit single precision)
IEEE-754
• Standardized FP since 1985 (updated in 2008)
  - Arithmetic formats: binary and decimal Floating Point data (+ special cases)
  - Operations: arithmetic and other operations applied to arithmetic formats
  - Rounding algorithms: rounding routines for arithmetic and conversion
  - Exception handling: exceptional conditions
• Format (binary or decimal)
  - Sign-magnitude significand & biased exponent
  - Base-2 or base-10
  - N = (-1)^S * (1.M) * 2^(E-127) (binary single precision)
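The format equation can be evaluated exactly with rational arithmetic. This is a small sketch for normalized single-precision numbers only; the function name `ieee_single_value` is hypothetical, and special cases (zero, denormals, NaN, infinity) are deliberately left out.

```python
from fractions import Fraction

def ieee_single_value(s, e, m):
    """Evaluate N = (-1)^S * (1.M) * 2^(E-127) for a normalized number.

    s: sign bit, e: biased 8-bit exponent, m: 23-bit stored fraction field.
    """
    significand = 1 + Fraction(m, 2**23)       # prepend the hidden bit: 1.M
    scale = Fraction(2) ** (e - 127)           # remove the exponent bias
    return float((-1) ** s * significand * scale)

print(ieee_single_value(1, 129, 0x500000))     # fields of -6.5
```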
IEEE-754 (2)
• Operations
  - Minimum set: Add, Sub, Mul, Div, Rem, Round to Int, Compare
  - Recommended set: Log, …
• Rounding modes
  - Round to nearest, ties to even
  - Round up
  - Round to zero
  - Round down
• Exceptions
  - Invalid operation
  - Division by zero
  - Overflow
  - Underflow
Rounding
• A result almost never has an exact FP representation: with five fraction bits, [1.11110]*2^5 (62d) and [1.11111]*2^5 (63d) are adjacent values
• Rounding is required
• IEEE-754 rounding modes:
  - Round to nearest (ties to even)
  - Round to zero
  - Round up
  - Round down
• Rounding (to nearest) algorithm based on 3 LSBs (guard bits): 0-- (down) | 100 (even) | 1-- other than 100 (up)
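The round-to-nearest rule on the three extra LSBs can be sketched on integer significands as follows. The function name and the choice to pass the significand with its three guard bits attached are assumptions made for illustration.

```python
def round_nearest_even(sig_with_guard_bits):
    """Drop the 3 low guard bits using round-to-nearest, ties-to-even."""
    kept = sig_with_guard_bits >> 3        # bits that remain in the result
    guard = sig_with_guard_bits & 0b111    # the 3 bits being discarded
    if guard > 0b100:                      # more than half: round up
        kept += 1
    elif guard == 0b100 and (kept & 1):    # exact tie: round to the even value
        kept += 1
    # guard < 0b100 (pattern 0--): round down, nothing to do.
    # Note: rounding up an all-ones significand carries out; a real unit
    # would then renormalize (shift right, increment the exponent).
    return kept

print(round_nearest_even(0b1011_100))  # tie, kept value is odd, goes up to 12
```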
Floating Point arithmetic
• More complex than Integer
• Lots of shifting of results and overhead due to exceptional cases
• Addition: 2.01 * 10^12 + 1.33 * 10^11
  1. Check for zeros.
  2. Align significands so exponents match (guard bits): right shift!
  3. Add/Subtract significands.
  4. Normalize and round the result.
Floating Point addition
  1. Check for zeros.
  2. Align significands so exponents match.
  3. Add/Subtract significands.
  4. Normalize and round the result.
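The four addition steps can be traced in a toy model. This sketch makes several assumptions that are not in the slides: operands are non-negative, already normalized, and given as (significand, exponent) pairs with value sig * 2**exp, where sig is a 24-bit integer including the hidden bit. Overflow, underflow and special values are ignored.

```python
P = 24  # significand width in bits, hidden bit included (single precision)

def fp_add(a, b):
    """Add two non-negative toy floats (sig, exp), value = sig * 2**exp."""
    (sig_a, exp_a), (sig_b, exp_b) = a, b
    # 1. Check for zeros.
    if sig_a == 0:
        return b
    if sig_b == 0:
        return a
    # 2. Align: keep the larger exponent, right-shift the other significand.
    if exp_a < exp_b:
        (sig_a, exp_a), (sig_b, exp_b) = (sig_b, exp_b), (sig_a, exp_a)
    shift = exp_a - exp_b
    big = sig_a << 3                     # append 3 guard bits
    small = sig_b << 3
    sticky = int((small & ((1 << shift) - 1)) != 0)  # any 1 shifted out?
    small = (small >> shift) | sticky
    # 3. Add significands.
    total = big + small
    exp = exp_a
    # 4. Normalize (at most one right shift for addition) and round.
    if total >> (P + 3):
        total = (total >> 1) | (total & 1)
        exp += 1
    sig, guard = total >> 3, total & 0b111
    if guard > 0b100 or (guard == 0b100 and sig & 1):
        sig += 1
        if sig >> P:                     # rounding carried out: renormalize
            sig >>= 1
            exp += 1
    return sig, exp

# 1.5 + 0.25 = 1.75
sig, exp = fp_add((0xC00000, -23), (0x800000, -25))
print(sig * 2.0 ** exp)
```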
Floating Point Arithmetic (2)
• Multiplication
  1. Check for zeros.
  2. Multiply significands.
  3. Add exponents (correct for the double bias by subtracting it once).
  4. Normalize and round the result.
• Division
  1. Check for zeros.
  2. Divide significands.
  3. Subtract exponents (the biases cancel, so add the bias back once).
  4. Normalize and round the result.
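The multiplication steps, including the bias correction, can be sketched in the same toy style. Assumptions for illustration: operands are (sign, biased exponent, significand) triples with a normalized 24-bit significand, the bias is 127, and overflow, underflow and special values are not handled.

```python
BIAS = 127
P = 24  # significand width in bits, hidden bit included

def fp_mul(a, b):
    """Multiply two toy floats given as (sign, biased_exp, sig)."""
    (sign_a, exp_a, sig_a), (sign_b, exp_b, sig_b) = a, b
    sign = sign_a ^ sign_b
    # 1. Check for zeros.
    if sig_a == 0 or sig_b == 0:
        return sign, 0, 0
    # 2. Multiply significands: a 2P-bit product.
    product = sig_a * sig_b
    # 3. Add exponents; both carry the bias, so subtract it once.
    exp = exp_a + exp_b - BIAS
    # 4. Normalize: [1,2) * [1,2) lies in [1,4), so shift by P-1 or P bits.
    if product >> (2 * P - 1):
        shift = P
        exp += 1
    else:
        shift = P - 1
    sig = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if remainder > half or (remainder == half and sig & 1):
        sig += 1
        if sig >> P:                 # rounding carried out: renormalize
            sig >>= 1
            exp += 1
    return sign, exp, sig

# 1.5 * 2.5 = 3.75 = 1.875 * 2^1
print(fp_mul((0, 127, 0xC00000), (0, 128, 0xA00000)))
```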
Floating Point Architecture
• Architecture is a combination of HW, SW, Format, Exceptions, …
• Focus on hardware (datapath) of a Floating Point Unit
  - Multiplier
  - Adder/Subtracter
  - (Divider)
  - Shifters
  - Comparators
  - Leading Zero Detection
  - Incrementers
• How are components connected, what techniques are used and how does that influence the efficiency of the FPU?
  - Latency (parallelism)
  - Throughput (ILP, pipeline stages)
  - Area & Power (clock gating)
Highlighted Architectures
• UltraSparc T2
• Itanium
• Cell
UltraSparc T2
• UltraSparc T2 was released in 2007 by Sun
• Features
  - Multicore (since 2008 SMP capable) microprocessor
  - Eight cores with 8 threads each = 64 concurrent threads
  - Up to 1.6GHz
  - Two Integer ALUs per core
  - One FPU per core
  - "Open" design
• Applications
  - Only servers produced by Sun
UltraSparc T2 Floating Point
• Eight cores, each with an FPU
  - Single and Double precision IEEE
• Conventional FPU design
  - Dedicated datapath for each instruction
• UltraSparc characteristics
  - Shared pipeline for addition/multiplication: 6 stages, 1 instruction per cycle
  - Combinatorial division datapath
  - Area and power efficient: clock gating, reduced switching
Itanium
• Intel and HP combined efforts to revolutionize computer architecture in '98
  - Complete overhaul of the legacy x86 architecture based on instruction level parallelism
  - RISC replaced by VLIW
  - Large registers
• First Itanium appeared in 2001, the latest model (Tukwila) is from February 2010
• Tukwila features
  - 2-4 cores per CPU
  - Up to 1.73GHz
  - Four Integer ALUs per core
  - Two FPUs per core
Itanium
• Very powerful, very big
  - Two full IEEE double precision FP units
  - Leader in SPECfp
  - Single and double precision + custom formats
• Architecture
  - Unfortunately (too) many details are undisclosed
  - So why look at Itanium at all? Because what has been disclosed is interesting: Fused Multiply-Add
Fused Multiply Add
• FMA architecture: fused multiply and add instructions
  - (A*C)+B in one instruction vs separate A*C and A+B
• FMA advantages
  - Atomic MAC operations (~double performance)
  - Only one rounding error
• Expensive?
  - Multiplication: Wallace tree of CSAs
  - Partial product addition: 3:2 CSA
  - Full adder for conversion from carry-save format
  - Leading Zero Detection/Anticipation
  - Shifters for alignment and post-normalization
  - No: end-around-carry principle
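The "only one rounding error" advantage can be demonstrated numerically. The sketch below emulates a fused operation by computing (A*C)+B exactly with rational arithmetic and rounding once to double precision; `fma_exact` is a hypothetical helper standing in for hardware, not an actual FMA instruction.

```python
from fractions import Fraction

def fma_exact(a, c, b):
    """(a*c)+b computed exactly, then rounded once to double precision."""
    return float(Fraction(a) * Fraction(c) + Fraction(b))

a = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
b = -1.0

# Separate multiply and add: a*c = 1 - 2^-60 is rounded to 1.0 first,
# so the small term is lost before the addition happens.
unfused = a * c + b
# Fused: the exact product takes part in the addition, one rounding at the end.
fused = fma_exact(a, c, b)

print(unfused, fused)
```

With these operands the separate multiply-then-add returns 0.0, while the single-rounding result is -2^-60.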
End-around carry multiplication
• Carry-save adder vs Full adder
• CSA chain
• CSA tree
• Add one more CSA before conversion
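A carry-save adder reduces three operands to a sum word and a carry word without propagating carries; only the final conversion needs a carry-propagate adder. The sketch below models a CSA chain over the partial products of an unsigned multiplication and folds the addend in with one more CSA before conversion. Function names and the 8-bit default width are assumptions, and the end-around-carry handling itself is not modelled.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (sum, carry) with x + y + z == sum + carry."""
    partial_sum = x ^ y ^ z                       # bitwise sum, no carries
    carry = ((x & y) | (x & z) | (y & z)) << 1    # carries, moved one place up
    return partial_sum, carry

def csa_multiply_add(a, b, addend=0, width=8):
    """Compute a*b + addend for unsigned a, b using a chain of CSAs."""
    acc_sum, acc_carry = 0, 0
    for i in range(width):
        partial_product = (a << i) if (b >> i) & 1 else 0
        acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, partial_product)
    # One more CSA folds in the addend before the conversion step.
    acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, addend)
    # Conversion from carry-save format: a single carry-propagate addition.
    return acc_sum + acc_carry

print(csa_multiply_add(13, 11, 7))
```

A tree arrangement applies the same compressor to several partial products in parallel, which shortens the critical path compared to the chain shown here.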
Fused Multiply-Add (2)
• FP ops based on Fused Multiply-Add architecture
  - FMA: fma.[pc].[sf] f1 = f3, f4, f2 computes f1 = (f3 * f4) + f2
  - ADD: fadd.[pc].[sf] f1 = f3, f2 is an FMA whose multiplier operand is a register hardwired to +1.0
  - MUL: fmul.[pc].[sf] f1 = f3, f4 is an FMA whose addend is a register hardwired to +0.0
  - Not as efficient as dedicated add and multiply instructions
• Division and Square Root
  - Division and Square Root can be implemented in software
  - Lookup table for initial estimate (1/a and 1/√a)
  - Newton-Raphson approximation (1 approximation and 13 FMA instructions on the Itanium)
  - Intel FPU bug! ($475,000,000)
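Software division on top of FMA can be sketched with a Newton-Raphson reciprocal. This is an illustration under stated assumptions, not the Itanium sequence: the `fma` helper is a plain Python stand-in that rounds twice, the initial estimate is a linear approximation instead of a lookup table, and only positive finite inputs are handled.

```python
import math

def fma(a, b, c):
    """Stand-in for a hardware FMA; plain Python rounds twice here."""
    return a * b + c

def reciprocal(a, iterations=4):
    """Approximate 1/a for positive finite a using FMA-shaped steps."""
    m, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - 32.0 / 17.0 * m     # linear initial estimate of 1/m
    for _ in range(iterations):
        err = fma(-m, x, 1.0)             # err = 1 - m*x
        x = fma(x, err, x)                # x = x + x*err, error is squared
    return math.ldexp(x, -e)              # undo the scaling: 1/a = (1/m) * 2**-e

def divide(n, d):
    """n / d as a multiplication by the approximated reciprocal."""
    return n * reciprocal(d)

print(reciprocal(3.0), divide(1.0, 8.0))
```

Each iteration roughly doubles the number of correct bits, which is why a modest initial estimate followed by a handful of FMA operations is enough.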
Cell
• Combined efforts from Sony, Toshiba and IBM
  - Sony: Architecture & Applications
  - IBM: SOI process technology
  - Toshiba: Manufacturing
  - Development started 2000, 400 people, $400M
  - First Cell in 2006
• Applications
  - Playstation 3
  - Blu-ray
  - HDTVs
  - High performance computing
• Features
  - 9 cores (PPC and SPE) for Integer and FP
  - 3.2GHz
  - All SIMD instructions
Cell (2)
• 1 PPC and 8 SPEs
  - PPC for compatibility
  - SPEs for performance
• 1 FPU per SPE
  - 4 single precision cores per FPU
  - 1 double precision core per FPU
• Why separate?
  - Performance requirements for SP Float too high for a double precision unit
Single Precision FP in the Cell
• Single precision
  - Full FMA unit
  - Similar approach to Itanium
  - DIV/SQRT/Convert/… in software
• Aggressive optimization
  - Denormal numbers forced to zero
  - NaN/∞ treated as normal numbers
  - Only round to zero
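Forcing denormal numbers to zero can be illustrated on single-precision bit patterns. This sketch covers only that one optimization and assumes the standard IEEE single-precision layout; the function name is hypothetical and nothing here reflects the actual Cell implementation.

```python
import struct

def flush_denormal_to_zero(x):
    """Return x as a single-precision value, with denormals replaced by zero."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:   # denormal: no hidden bit
        bits &= 0x80000000                # keep only the sign bit
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(flush_denormal_to_zero(1e-40), flush_denormal_to_zero(1.5))
```

Dropping denormal support removes the need for a datapath that handles significands without a hidden leading 1.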
Shared Integer/FP ALUs
• Have FPUs been used for Integer operations in the past?
  - Yes, in fact the UltraSparc T2 and Cell already do so
  - Cell: converts Integers into a format that can be processed by the SP FPU
  - UltraSparc: maps Integer multiplication, addition and division directly onto the respective FP hardware, however not the full MAC capabilities…
• Issues
  - Overhead due to FP specific hardware
  - Priorities
  - Starvation
Approach
• Design FPU
  - Implement a single precision core and drop most of what makes FP so expensive, much like the Cell processor
  - Widen the design to make it compatible with 32-bit Integer operands
• Add integer capability
  - Add switches and control in the design to support Integer operands
  - …without affecting FP performance
• Optimization
  - Optimize the design for efficiency
  - Area/Power
• Measure Performance, Area and Power Consumption
  - 65 or 90nm
Approach – Floating Point Unit
• Formatting
  - Close to IEEE format (not a GPP, but don't make it too obscure, e.g. Itanium)
    - Sign magnitude
    - Biased exponent
    - Base-2
  - Single Precision (double is excessive)
  - Initially ignore special cases
• Architecture
  - Fused Multiply-Add unit only + compares
  - À la Cell: Shifter, Tree Multiplier, CSA, Full adder
  - Initially three pipeline stages: 1) Align/Multiply 2) Add/Prepare normalization 3) Post-normalize
  - Reduce to two stages if possible
Approach – Floating Point Unit (2)
• IEEE-754 compatibility
  - Format (not all the special cases)
  - Arithmetic (next slide)
  - Rounding modes
    - Round to zero
    - Round to nearest
    - Round up
    - Round down
  - Exceptions and special cases
    - Denormalized numbers
    - NaN, Infinity (to be determined)
    - Exceptions (underflow, overflow, etc.)
Approach – Floating Point Unit (3)
• FP Arithmetic
  - Multiplication → Fused Multiply-Add
  - Addition → Fused Multiply-Add
  - Division → Software
  - Square Root → Software
  - Conversion → Software
  - Compare
Approach – Integer Unit
• 32-bit signed Integer ALU
  - Preferably two's complement (most common representation)
  - Single precision maps nicely to 2x32bit registers
• Arithmetic mapping
  - Addition → Full adder
  - Multiplication → Wallace Tree
  - MAC
  - Shift → Aligner
• Reconfiguring
  - Initially no bypassing (drain pipeline before reconfiguring)
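One way to picture the arithmetic mapping is to express integer addition and multiplication as special cases of a single multiply-accumulate path, in the same way FP add and multiply reuse an FMA unit. This is a behavioural sketch only; keeping just the low 32 bits of the result, and all function names, are assumptions rather than part of the proposal.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def int_mac(a, b, c):
    """(a * b) + c on 32-bit two's complement operands, low 32 bits kept."""
    return to_signed32(a * b + c)

def int_add(a, b):
    return int_mac(a, 1, b)      # multiplier operand fixed to 1

def int_mul(a, b):
    return int_mac(a, b, 0)      # addend fixed to 0

print(int_add(2**31 - 1, 1), int_mul(-3, 7), int_mac(6, 7, -2))
```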
Proposed architecture
• 32-bit input registers
  - FP: 32-bit significand & 32-bit exponent
  - Integer: 32-bit signed
• 3-stage pipeline
  - Stage 1: Aligner for FP or barrel shifter; 32x32 multiplier
  - Stage 2: Full adder and Leading Zero Detection
  - Stage 3: Normalization and rounding
• 2-stage pipeline?
  - Merge stages 2 and 3
Testing/Benchmarking
• After functional testing, implementation in 65 or 90nm
• Measure area and power usage
  - Benchmark to be determined
Questions
Whatever the question, lead is the answer.