An energy-efficient combined floating point and integer ALU for reconfigurable multi-core architectures. A literature study by Tom Bruintjes. 27/09/10.


Presentation Transcript


  1. An energy-efficient combined floating point and integer ALU for reconfigurable multi-core architectures
  A literature study by Tom Bruintjes, 27/09/10

  2. Assignment
  • Design or modify a Floating Point Unit so that it can also be used as an Integer Unit, and determine its cost in terms of area and energy efficiency.
  • Requirements
    - Floating Point addition and multiplication & Integer addition and multiplication
    - Shallow pipeline (preferably no more than 2 stages)
    - Low area cost
    - Low power consumption

  3. Motivation
  • Multi-core architectures
    - MPSoC
    - Tile Processor
  • Heterogeneous, but no Floating Point
    - Too expensive (area, energy)
    - Fixed Point alternative
    - Software emulation
  Tilera TILE-Gx100 (100 cores but no floating point)

  4. Motivation (2)
  • What if we did add an FPU?
    - High-performance FP ops
    - A lot of hardware needed
    - Complex datapath → high latency (low frequency) → deep pipeline
    - A lot of area wasted while FP is idle

  5. Motivation (3)
  • Idea: add an FP core and make it compatible with Integer operation, so that Integer ops can be offloaded to the FP core when it is idle.
  • The shared core should be deployable in an embedded system (MPSoC), hence the low area and power consumption requirements.
  • Few pipeline stages to keep the compiler manageable.

  6. Floating Point - History
  • Need for FP recognized early
  • The first FPU: Konrad Zuse's Z1 (1938)
    - 22-bit floating-point numbers
    - storage for 64 numbers
    - memory made of sliding metal parts

  7. Floating Point – History (2)
  • In the beginning, floating-point hardware was typically an optional feature
    - "scientific computers"
    - extremely expensive
  • Then FP became available in the form of ("math") co-processors
    - Intel x87 (486 vs …)
    - Weitek
  • Mid 90's: most GPPs are equipped with FP units
  • Current situation: FP also in small processors

  8. Why Floating Point
  • Unsigned/Signed: …, -2, -1, 0, 1, 2, 3, … [0000, 0001, 0010, 0011]
    - what about rational numbers or very large/small numbers?
  • Fixed Point: 0.11, 1.22, 2.33, … [00.11, 01.10, 10.11]
  • Limited range and precision
    - Solution: Floating Point (scientific) notation
    - 1.220 x 10^5 (12.20 x 10^4 or 122.0 x 10^3, hence "floating" point)
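The range limitation of fixed point can be made concrete with a short sketch (illustrative only, not from the slides; the Q4.4 format with 4 integer and 4 fraction bits is an assumption for the example):

```python
def to_q44(x: float) -> float:
    """Quantize x to Q4.4 fixed point: 4 integer bits, 4 fraction bits."""
    step = 1 / 16                      # smallest representable increment
    q = round(x / step) * step
    # saturate to the representable range [-8.0, 7.9375]
    return max(-8.0, min(7.9375, q))

print(to_q44(2.33))    # 2.3125 -- nearest multiple of 1/16
print(to_q44(100.0))   # 7.9375 -- out of range, saturates
print(1.22e5)          # floating point handles the magnitude easily
```

The fixed-point value is limited in both precision (multiples of 1/16) and range (below 8), which is exactly the motivation the slide gives for a floating-point representation.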

  9. Floating Point representation/terminology
  • Floating Point representation (e.g., 6.02 x 10^23)
    - Sign S
    - Significand M (not Mantissa!)
    - Exponent E (biased)
    - Base (radix, implicit)
  • Binary representation: [1 | 00001111 | 10101010101010101010101] (sign | exponent | significand)

  10. Binary Floating Point storage (issues)
  • Normalization
    - Prevent redundancy: 0.122 x 10^5 vs 1.22 x 10^4
    - Normalization means that the first digit is never a zero
    - For binary numbers this means the MSB is always 1 → "hidden bit"
  • Single, Double or Quad precision
    - 32 bits: single (23-bit significand & 8-bit exponent)
    - 64 bits: double (52-bit significand & 11-bit exponent)
  • Base is implicit
    - 2, 10 or 16 are common
  • Special cases? (NaN, 0, ∞)
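The single-precision field layout, including the hidden bit, can be inspected directly (a minimal Python sketch using the standard library, not part of the slides):

```python
import struct

def decode_single(x: float):
    """Unpack an IEEE-754 single into (sign, biased exponent, fraction)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    fraction = bits & 0x7FFFFF          # 23 stored significand bits
    return sign, exponent, fraction

# 1.0 is stored as +1.0 x 2^0: exponent field = bias (127), fraction = 0;
# the leading 1 of the significand is the implicit "hidden bit" and is
# not stored at all.
print(decode_single(1.0))   # (0, 127, 0)
```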

  11. The road to getting standardized
  • Many ways to represent an FP number
    - Significand (sign-magnitude, two's complement, BCD)
    - Exponent (biased, two's complement)
    - Special numbers
  • Unorganized start
    - Every company used its own format
    - IBM, DEC, Cray
  • Highly incompatible
    - 2 * 1.0 on machine A gives a different result than on machine B
    - Situation even worse for exceptions (e.g., underflow and overflow)

  12. IBM System/360 & Cray-1
  • IBM highlights
    - Sign-magnitude & biased exponent
    - Base-16 numeral system (more efficient/less accurate)
  • Cray-1 highlights
    - Sign-magnitude & biased exponent
    - Very high precision (64-bit single precision)

  13. IEEE-754
  • Standardized FP since 1985 (updated in 2008)
    - Arithmetic formats: binary and decimal Floating Point data (+ special cases)
    - Operations: arithmetic and other operations applied to arithmetic formats
    - Rounding algorithms: rounding routines for arithmetic and conversion
    - Exception handling: exceptional conditions
  • Format (binary or decimal)
    - Sign-magnitude significand & biased exponent
    - base-2 or base-10
    - N = (-1)^S * (1.M) * 2^(E-127)
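The format equation N = (-1)^S * (1.M) * 2^(E-127) can be checked numerically (an illustrative sketch, not from the slides; fields are taken as raw integers):

```python
def single_value(s: int, e: int, m: int) -> float:
    """Evaluate N = (-1)^S * 1.M * 2^(E-127) for a normalized single."""
    significand = 1 + m / 2**23        # hidden bit + 23 fraction bits
    return (-1)**s * significand * 2.0**(e - 127)

print(single_value(0, 127, 0))          # 1.0   (+1.0 x 2^0)
print(single_value(1, 128, 1 << 22))    # -3.0  (1.1 binary x 2^1, negated)
```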

  14. IEEE-754 (2)
  • Operations
    - Minimum set: Add, Sub, Mul, Div, Rem, Rnd to Int, Comp
    - Recommended set: Log, …
  • Rounding modes
    - Round to nearest, ties to even
    - Round up
    - Round to zero
    - Round down
  • Exceptions
    - Invalid operation
    - Division by zero
    - Overflow
    - Underflow

  15. Rounding
  • Almost never an exact FP representation: [1.11110] x 2^5 (62d), [1.11111] x 2^5 (63d)
  • Rounding is required
  • IEEE-754 rounding modes:
    - Round to nearest (ties to even)
    - Round to zero
    - Round up
    - Round down
  • Rounding (to nearest) algorithm based on 3 LSBs (guard bits): 0-- (down) | 100 (to even) | 1-- (up)

  16. Floating Point arithmetic
  • More complex than Integer
  • Lots of shifting of results and overhead due to exceptional cases
  • Addition example: 2.01 x 10^12 + 1.33 x 10^11
    1. Check for zeros.
    2. Align significands so exponents match (guard bits): right-shift!
    3. Add/Subtract significands.
    4. Normalize and round the result.

  17. Floating Point addition
    1. Check for zeros.
    2. Align significands so exponents match.
    3. Add/Subtract significands.
    4. Normalize and round the result.
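The four addition steps can be sketched on a simplified representation (value = m * 2^e with integer m, positive operands only; rounding and guard bits are omitted, so bits shifted out are simply truncated — an illustration, not the thesis datapath):

```python
def fp_add(m1: int, e1: int, m2: int, e2: int):
    """Add two positive FP numbers given as (significand, exponent) pairs."""
    if m1 == 0:                        # step 1: check for zeros
        return m2, e2
    if m2 == 0:
        return m1, e1
    if e1 < e2:                        # step 2: align significands --
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)                   # right-shift the smaller operand
    m = m1 + m2                        # step 3: add significands
    e = e1
    while m and m & 1 == 0:            # step 4: normalize the result
        m >>= 1
        e += 1
    return m, e

print(fp_add(3, 0, 1, 0))   # (1, 2): 3 + 1 = 4 = 1 * 2^2
```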

  18. Floating Point Arithmetic (2)
  • Multiplication
    1. Check for zeros.
    2. Multiply significands.
    3. Add exponents (correct for double bias).
    4. Normalize & round the result.
  • Division
    1. Check for zeros.
    2. Divide significands.
    3. Subtract exponents (correct for double bias).
    4. Normalize & round the result.
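The "double bias" correction in multiplication falls out naturally when the steps are written down (illustrative sketch with biased exponents, value = m * 2^(e-BIAS); normalization and rounding omitted):

```python
BIAS = 127

def fp_mul(m1: int, e1: int, m2: int, e2: int):
    """Multiply two FP numbers given as (significand, biased exponent)."""
    if m1 == 0 or m2 == 0:             # step 1: check for zeros
        return 0, 0
    m = m1 * m2                        # step 2: multiply significands
    # step 3: e1 + e2 = (E1+BIAS) + (E2+BIAS) carries the bias twice,
    # so subtract it once to get the correctly biased result exponent
    e = e1 + e2 - BIAS
    return m, e

# (2, 128) is 2*2^1 = 4 and (3, 127) is 3*2^0 = 3; product 12 = 6*2^1:
print(fp_mul(2, 128, 3, 127))   # (6, 128)
```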

  19. Floating Point Architecture
  • Architecture is a combination of HW, SW, format, exceptions, …
  • Focus on hardware (datapath) of a Floating Point Unit
    - Multiplier
    - Adder/Subtracter
    - (Divider)
    - Shifters
    - Comparators
    - Leading Zero Detection
    - Incrementers
  • How are components connected, what techniques are used, and how does that influence the efficiency of the FPU?
    - Latency (parallelism)
    - Throughput (ILP, pipeline stages)
    - Area & Power (clock gating)

  20. Highlighted Architectures
  • UltraSparc T2
  • Itanium
  • Cell

  21. UltraSparc T2
  • UltraSparc T2 was released in 2007 by Sun
  • Features
    - Multi-core (since 2008 SMP-capable) microprocessor
    - Eight cores, 8 threads per core = 64 concurrent threads
    - Up to 1.6GHz
    - Two Integer ALUs per core
    - One FPU per core
    - "Open" design
  • Applications
    - Only servers produced by Sun

  22. UltraSparc T2 Floating Point
  • Eight cores, each with an FPU
    - Single and double precision IEEE
  • Conventional FPU design
    - Dedicated datapath for each instruction
  • UltraSparc characteristics
    - Shared 6-stage pipeline for addition/multiplication, 1 instruction per cycle
    - Combinatorial division datapath
    - Area and power efficient: clock gating reduces switching

  23. Itanium
  • Intel and HP combined efforts to revolutionize computer architecture in '98
    - Complete overhaul of the legacy x86 architecture based on instruction-level parallelism
    - RISC replaced by VLIW
    - Large register files
  • First Itanium appeared in 2001; the latest model (Tukwila) is from February 2010
  • Tukwila features
    - 2-4 cores per CPU
    - Up to 1.73GHz
    - Four Integer ALUs per core
    - Two FPUs per core

  24. Itanium (2)
  • Very powerful, very big
    - Two full IEEE double-precision FP units
    - Leader in SPECfp
    - Single and double precision + custom formats
  • Architecture
    - Unfortunately, (too) many details are undisclosed
    - So why look at Itanium at all? Because what has been disclosed is interesting: Fused Multiply-Add

  25. Fused Multiply-Add
  • FMA architecture fuses multiply and add instructions: (A*C)+B vs A*C and A+B
  • FMA advantages
    - Atomic MAC operations (~double performance)
    - Only one rounding error
  • Expensive?
    - Multiplication: Wallace tree of CSAs
    - Partial product addition: 3:2 CSA
    - Full adder for conversion from carry-save format
    - Leading Zero Detection/Anticipation
    - Shifters for alignment and post-normalization
    No: end-around-carry principle

  26. End-around carry multiplication
  • Carry-save adder vs full adder
  • CSA chain
  • CSA tree
  • Add one more CSA before conversion
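The 3:2 carry-save compression that the Wallace tree is built from can be sketched in a few lines (illustrative Python, not the thesis design): three operands are reduced to a sum word and a carry word without any carry propagation, and only the final full adder ripples carries.

```python
def csa(a: int, b: int, c: int):
    """3:2 carry-save adder: compress three operands into a sum word and a
    carry word such that s + cy == a + b + c, with no carry ripple."""
    s = a ^ b ^ c                        # bitwise sum (no carries)
    cy = (a & b) | (a & c) | (b & c)     # bitwise majority = carry
    return s, cy << 1                    # each carry enters one bit higher

s, cy = csa(13, 9, 7)
print(s + cy)   # 29 -- the final full adder converts carry-save to binary
```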

  27. Fused Multiply-Add (2)
  • FP ops based on the Fused Multiply-Add architecture
    - FMA: fma.[pc].[sf] f1 = f3, f4, f2 → f1 = (f3 * f4) + f2
    - ADD: fadd.[pc].[sf] f1 = f3, f2 → multiplier hardwired to +1.0
    - MUL: fmul.[pc].[sf] f1 = f3, f4 → addend hardwired to +0.0
    - Not as efficient as dedicated add and multiply instructions
  • Division and Square Root
    - Division and Square Root can be implemented in software
    - Lookup table for initial estimate (1/a and 1/√a)
    - Newton-Raphson approximation (1 initial estimate and 13 FMA instructions on the Itanium)
    - Intel FPU bug! ($475,000,000)

  28. Cell
  • Combined effort of Sony, Toshiba and IBM
    - Sony: architecture & applications
    - IBM: SOI process technology
    - Toshiba: manufacturing
    - Development started in 2000; 400 people, $400M
    - First Cell in 2006
  • Applications
    - Playstation 3
    - Blu-ray
    - HDTVs
    - High-performance computing
  • Features
    - 9 cores (PPC and SPEs) for Integer and FP
    - 3.2GHz
    - All SIMD instructions

  29. Cell (2)
  • 1 PPC and 8 SPEs
    - PPC for compatibility
    - SPEs for performance
  • 1 FPU per SPE
    - 4 single-precision cores per FPU
    - 1 double-precision core per FPU
  • Why separate?
    - Performance requirements for single-precision FP are too high for a double-precision unit

  30. Single Precision FP in the Cell
  • Single precision
    - Full FMA unit
    - Similar approach as the Itanium
    - DIV/SQRT/Convert/… in software
  • Aggressive optimization
    - Denormal numbers forced to zero
    - NaN/∞ treated as normal numbers
    - Only round to zero

  31. Shared Integer/FP ALUs
  • Have FPUs been used for Integer operations in the past?
    - Yes, in fact the UltraSparc T2 and Cell already do so
    - Cell: converts Integers into a format that can be processed by the SP FPU
    - UltraSparc: maps Integer multiplication, addition and division directly onto the respective FP hardware, however without the full MAC capabilities…
  • Issues
    - Overhead due to FP-specific hardware
    - Priorities
    - Starvation

  32. Approach
  • Design FPU
    - Implement a single-precision core and drop most of what makes FP so expensive … much like the Cell processor
    - Widen the design to make it compatible with 32-bit Integer operands
  • Add Integer capability
    - Add switches and control in the design to support Integer operands
    - … without affecting FP performance
  • Optimization
    - Optimize the design for efficiency (area/power)
  • Measure performance, area and power consumption
    - 65 or 90nm

  33. Approach – Floating Point Unit
  • Formatting
    - Close to IEEE format (not GPP, but don't make it too obscure, i.e. Itanium)
    - Sign-magnitude, biased exponent, base-2
    - Single precision (double is excessive)
    - Initially ignore special cases
  • Architecture
    - Fused Multiply-Add unit only + compares
    - À la Cell: shifter, tree multiplier, CSA, full adder
    - Initially three pipeline stages: 1) Align/Multiply 2) Add/Prepare normalization 3) Post-normalize
    - Reduce to two stages if possible

  34. Approach – Floating Point Unit (2)
  • IEEE-754 compatibility
    - Format (not all the special cases)
    - Arithmetic (next slide)
    - Rounding modes: round to zero, round to nearest, round up, round down
    - Exceptions and special cases: denormalized numbers, NaN/Infinity (to be determined), exceptions (underflow, overflow, etc.)

  35. Approach – Floating Point Unit (3)
  • FP Arithmetic
    - Multiplication → Fused Multiply-Add
    - Addition → Fused Multiply-Add
    - Division → Software
    - Square Root → Software
    - Conversion → Software
    - Compare

  36. Approach – Integer Unit
  • 32-bit signed Integer ALU
    - Preferably two's complement (most common representation)
    - Single precision maps nicely to 2x32-bit registers
  • Arithmetic mapping
    - Addition → full adder
    - Multiplication → Wallace tree
    - MAC
    - Shift → aligner
  • Reconfiguring
    - Initially no bypassing (drain the pipeline before reconfiguring)
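The integer-mode behavior the wide datapath must reproduce is ordinary 32-bit two's-complement wraparound, which can be sketched as masking plus sign reinterpretation (illustrative Python, not the thesis design):

```python
MASK = 0xFFFFFFFF

def int32(x: int) -> int:
    """Interpret x modulo 2^32 as a signed 32-bit two's-complement value."""
    x &= MASK                              # keep the low 32 bits
    return x - (1 << 32) if x & 0x80000000 else x

# Addition and multiplication can reuse the wide adder/multiplier; only
# the final masking and sign interpretation differ from the FP case.
print(int32((2**31 - 1) + 1))   # -2147483648 -- signed overflow wraps
print(int32(7 * -3))            # -21 -- products that fit are unchanged
```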

  37. Proposed architecture
  • 32-bit input registers
    - FP: 32-bit significand & 32-bit exponent
    - Integer: 32-bit signed
  • 3-stage pipeline
    - Stage 1: aligner for FP or barrel shifter, 32x32 multiplier
    - Stage 2: full adder and Leading Zero Detection
    - Stage 3: normalization and rounding
  • 2-stage pipeline?
    - Merge stages 2 and 3

  38. Testing/Benchmarking
  • After functional testing, implementation in 65 or 90nm
  • Measure area and power usage
    - Benchmark to be determined

  39. Questions
  Whatever the question, lead is the answer.
