P6 Binary Floating-Point Unit
Son Dao-Trong, Michael Kroener (IBM Boeblingen)
Eric M. Schwarz (IBM Poughkeepsie)
Martin Schmookler (IBM Austin)
Multi-GHz Floating Point Unit for Power6
Power6 processor, announced in May 2007:
• 64-bit architecture
• Dual-core, 790 million transistors, running at 4.7 GHz
• Power and area efficiency are key enablers of the multi-core design
Binary FPU on Power6:
• Two FPUs per core
• 7-cycle pipeline
• Dependent data:
  • Reused after 6 cycles on one FPU
  • Reused after 7 cycles on both FPUs
• Key features:
  • Aggressive cycle time
  • Commonality with hexadecimal
  • Reuse of unfinished results
Agenda
Overview:
• Challenges
Dataflow
Key design features:
• Shifter
• Normalizer
• Forward unfinished results
• Divide/Square root
Miscellaneous:
• Timing efforts
• Stall forwarding
• Clock gating
Summary
Challenges
Performance: cycle time
• 7-cycle FMA @ 13 fo4 = 91 fo4 total
• No execution stalling due to operand data (denorm, massive cancellation case…)
• Result can be reused after 6 cycles (same as in Power5)
• Commonality with hexadecimal dataflow (56-bit operands) -> wider dataflow, aligner, multiplier
• Support all features of Power5 (except out-of-order)
  • Dual FPUs, multi-threading, …
• Compared to Power5:
  • Power5: 6-cycle pipeline (@ 23 fo4)
  • No stall for massive cancellation
  • Clock frequency up by about 70% (cycle time 23 fo4 -> 13 fo4)
  • Very aggressive goal
Additional features:
• Support decimal floating-point interface
  • Share same FPR
• Implement fixed-point multiplies and divides
  • 64x64 multiplier
• Improve hardware checking
  • Dataflow residue checked
  • FPR/data busses parity checked
• Pipelined clock gating for low power
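The latency figures on this slide follow from simple arithmetic. The short check below is only an illustration; the function name is hypothetical, and the frequency ratio assumes that one fo4 delay is the same in both designs, which the slides do not state.

```python
def pipeline_latency_fo4(stages: int, fo4_per_cycle: int) -> int:
    """Total logic depth of a pipeline, expressed in fo4 delays."""
    return stages * fo4_per_cycle


# Figures taken from the slide.
power6_latency = pipeline_latency_fo4(7, 13)  # 7-cycle FMA at 13 fo4 per cycle
power5_latency = pipeline_latency_fo4(6, 23)  # 6-cycle pipeline at 23 fo4 per cycle

# Assumption: equal fo4 delay in both designs, so the clock-rate ratio
# is simply the inverse ratio of the cycle times.
frequency_ratio = 23 / 13
```

Under that assumption the Power6 pipeline has one more stage but a noticeably smaller total logic depth than the Power5 pipeline.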
Dataflow
Conservative design:
• Standard FMA dataflow
• No use of aggressive circuit techniques
• 7-stage pipeline
• 64x64 multiplier array (supports integer ops):
  • 33 partial product terms
  • Rounding correction term
• Commonality with hexadecimal:
  • 176-bit aligner
  • 60-bit incrementer
  • 120-bit adder/LZA
• No-stall execution:
  • Normalization of 176 bits
  • Overflow/underflow handling
  • Denormal handling
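The slide gives 33 partial product terms for the 64x64 array but does not name the recoding scheme. One scheme that yields exactly 33 terms for a 64-bit unsigned multiplier is radix-4 Booth recoding, so the sketch below assumes it; the function names are hypothetical and the code models arithmetic only, not the hardware array.

```python
def booth_radix4_digits(multiplier: int, width: int = 64) -> list[int]:
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2..2}.

    An unsigned operand needs width // 2 + 1 digits, because the top
    digit absorbs the zero extension above the most significant bit.
    """
    assert 0 <= multiplier < (1 << width)
    padded = multiplier << 1  # implicit 0 below bit 0
    digits = []
    for i in range(width // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111
        b_low = triplet & 1           # multiplier bit 2i - 1
        b_mid = (triplet >> 1) & 1    # multiplier bit 2i
        b_high = (triplet >> 2) & 1   # multiplier bit 2i + 1
        digits.append(b_low + b_mid - 2 * b_high)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int = 64) -> int:
    """Sum the partial products selected by the Booth digits."""
    digits = booth_radix4_digits(multiplier, width)
    return sum((d * multiplicand) << (2 * i) for i, d in enumerate(digits))
```

Each digit selects 0, ±1 or ±2 times the multiplicand, so every partial product is a shift or a negation of the multiplicand rather than a true multiple.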
Dataflow…
Feedback paths:
• Unfinished result after 6 cycles:
  • Only internal to one FPU
  • Needs an additional shift-left-by-one correction step (imprecise LZA)
  • Feedback to all three operands A/B/C
  • Result to A/C is not rounded
  • Rounded result is bypassed to Breg2
• Rounded result after 7 cycles:
  • Reused by the other FPU
NaN parallel path:
• Bypasses unchanged operand A/C
• Used for special results:
  • A + B where B is zero
  • AxC + B where A or C is a NaN
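The shift-left-by-one correction mentioned for the imprecise LZA can be sketched as follows. This assumes the common LZA property that the estimate is either exact or one too small and never too large; the 120-bit width is the adder/LZA width from the slides, and the function name is hypothetical.

```python
WIDTH = 120  # adder/LZA width quoted on the dataflow slide


def normalize_with_imprecise_lza(value: int, lza_estimate: int, width: int = WIDTH):
    """Normalize `value` using an estimate that may be one bit too small.

    Returns (normalized_value, total_shift). A zero value stays zero.
    """
    mask = (1 << width) - 1
    shifted = (value << lza_estimate) & mask
    msb_set = (shifted >> (width - 1)) & 1
    if shifted != 0 and not msb_set:
        # Estimate was one short: apply the shift-left-one correction.
        shifted = (shifted << 1) & mask
        return shifted, lza_estimate + 1
    return shifted, lza_estimate
```

In this sketch the correction needs only a look at the top bit after the main shift, which is why it can be treated as a separate, late step.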
Shifter
• Addend alignment starts farther left to simulate a left shift.
• Add leading zero count for addend (lzdb)
• Use shift amount calculation to detect special corner cases
[Figure: addend and product positions. Traditional alignment: shift amount (SA) <= 0. Alignment with shift-left ability: SA <= 0 and SA = 68.]
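The idea of starting the alignment farther left so that a right-only shifter can also produce left shifts can be illustrated with plain integers. In this sketch the constant 68 is taken from the "SA = 68" label in the figure, and reading it as the amount of left headroom is an assumption; sticky-bit collection and the real field widths are not modelled.

```python
LEFT_HEADROOM = 68  # assumption: SA = 68 corresponds to the traditional position


def align_addend(addend: int, shift_amount: int) -> int:
    """Right-shift-only aligner with the addend pre-positioned to the left.

    The fixed pre-shift is free in hardware (wiring only). Relative to the
    traditional position, shift amounts below LEFT_HEADROOM then act as
    left shifts and larger amounts act as right shifts.
    """
    sa = max(shift_amount, 0)  # SA <= 0 saturates at the leftmost position
    return (addend << LEFT_HEADROOM) >> sa
```

For example, `align_addend(x, 60)` returns `x << 8`, a left shift by eight positions obtained with a shifter that only ever moves data to the right.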
Shifter…
• Redefine the case where the addend is much greater than the product:
  • Addend much greater than product: SA + lzdb < 68
  • Result is the addend with the product only as sticky; the addend is bypassed to the incrementer
[Figure: addend (with lzdb leading zeros) and product; product, aligned addend and bypassed addend.]
Shifter…
• Left shift of denormalized addend:
  • When the underflow trap is enabled, the denormal result must be normalized
  • If the addend is greater than the product, it must be normalized, with bits of the product eventually shifted in.
  • A shift-left alignment of the addend is needed.
  • Addend greater than product and needs shift-left alignment: SA + lzdb > 68
  • The addend is now correctly aligned with the product (shifted left to the product). The result can be normalized and bits of the product will be shifted in for full precision.
[Figure: addend (with lzdb leading zeros) and product; product and aligned addend.]
Normalizer
• 176-bit normalization without recycling
• Recomplement done first while buffering multiplexer control signals
• First coarse normalize shift of 60 calculated during shift amount calculation:
  • Effective add: shift60 = 1 if SA + lzdb - 68 > 59
  • Effective sub: shift60 = 1 if SA + lzdb - 68 > 60
• Imprecise LZA used
  • Timing and area reasons
  • A precise LZA for the adder would also imply an extra precise LZA for the incrementer
• Imprecise LZA for the incrementer is simple:
  • Effective add: LZA = SA + lzdb - 68 - 1
  • Effective sub: LZA = SA + lzdb - 68
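The coarse-shift and incrementer LZA formulas on this slide translate directly into code. The sketch below is a plain transcription; the function and parameter names are hypothetical, and nothing beyond the four formulas is modelled.

```python
ALIGN_OFFSET = 68  # constant appearing in every formula on the slide


def coarse_shift60(sa: int, lzdb: int, effective_sub: bool) -> bool:
    """First coarse normalize shift of 60, decided from SA and lzdb alone."""
    threshold = 60 if effective_sub else 59
    return (sa + lzdb - ALIGN_OFFSET) > threshold


def incrementer_lza(sa: int, lzdb: int, effective_sub: bool) -> int:
    """Imprecise leading-zero anticipation for the incrementer part."""
    base = sa + lzdb - ALIGN_OFFSET
    return base if effective_sub else base - 1
```

Both values depend only on the shift amount and the addend's leading zero count, so they are available early, well before the adder result exists.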
Normalizer…
• Generating denormal results:
  • Denormal addend is greater than product: use the LZA of the incrementer with lzdb = 0 (this includes the case where the addend is zero, same exponent as denormal)
    • Effective add: LZA = SA - 68 - 1
    • Effective sub: LZA = SA - 68 (need to check if product is much smaller than Nmin)
  • Product is greater than addend:
    • OR in an LZA string which decodes the intermediate exponent to limit normalization
[Figure: product and aligned addend; adder LZA string ORed with the decoded LZA string derived from the intermediate result exponent.]
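One way to read "OR in an LZA string which decodes the intermediate exponent" is that a one-hot string marks the largest shift the exponent allows, so that the leading-one detector can never ask for more. The sketch below shows that reading; it is an interpretation of the slide, and the names and the 120-bit width are assumptions.

```python
WIDTH = 120  # assumed string width (adder/LZA width from the dataflow slide)


def leading_zeros(value: int, width: int = WIDTH) -> int:
    """Leading zero count of `value` in a field of `width` bits."""
    return width - value.bit_length()


def limited_normalize_shift(lza_string: int, max_shift: int, width: int = WIDTH) -> int:
    """Shift amount after ORing in a one-hot string decoded from the exponent.

    `max_shift` stands for the number of positions the intermediate exponent
    can still absorb before the result would go below the minimum exponent.
    """
    assert 0 <= max_shift < width
    decoded = 1 << (width - 1 - max_shift)  # one-hot, counted from the left
    return leading_zeros(lza_string | decoded, width)
```

The OR yields the minimum of the two shift amounts without a comparator, which is the attraction of doing the limit on the strings rather than on binary counts.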
Feedback unfinished results
• Goal: feed back the result after 6 cycles
  • Aggressive cycle time target doesn’t allow that
  • Result calculation cannot be completely finished
• Move computation steps to the next cycle (second cycle after operand read):
  • Shift-left-one correction after imprecise normalization
    • No exponent correction due to last normalize correction
  • Rounding
    • No exponent correction due to rounder carry out
Feedback unfinished results…
• Impact on dataflow, feedback to addend:
  • Addend path needs one more bit to the right
  • Addend path needs one more bit to the left
  • For the shift amount calculation, the addend exponent before shift left and rounding can be used; this helps timing
  • Correctly rounded fraction is fed back to Breg2 in the next cycle, before alignment.
[Figure: product and aligned addend. Annotations: "covered by hex requests"; "addend not shifted left by one bit"; "aligned addend is all ones, rounded up: 10000000000000".]
Feedback unfinished results…
• Impact on dataflow, feedback to multiplicand (multiplier):
  • Multiplicand (multiplier) needs one more bit to the right
  • Exponent doesn’t need to be corrected for shift amount calculation
  • No possibility to use the correctly rounded-up result for the multiplier array
  • Additional rounding correction term needed
[Figure: multiplicand x (multiplier + 1) = product + multiplicand; the extra multiplicand is the rounding correction term.]
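The rounding correction term follows from distributing the pending increment over the product: (C + 1) x A = C x A + A. A minimal integer sketch, with hypothetical names and fractions treated as integers in units of the last place:

```python
def fma_with_unrounded_operand(unrounded: int, round_up: bool,
                               other: int, addend: int) -> int:
    """Multiply-add where one multiplier input is fed back before rounding.

    The multiplier array sees the unrounded value. If rounding would have
    incremented it, the missing contribution is exactly one copy of the
    other operand, added as an extra term.
    """
    correction = other if round_up else 0
    return unrounded * other + correction + addend
```

The correction is just another row for the adder tree, so no increment has to complete before the multiplication starts.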
Feedback unfinished results…
• Impact on dataflow, feedback to both multiplicand and multiplier:
  • This corresponds to squaring
  • Special handling is only needed for the round-up correction
  • (A + 1) x (A + 1) = A^2 + 2A + 1
  • Rounding correction term = 2A + 1, so only one term is needed
  • 2A = A shifted left by one bit; the +1 is a '1' forced into the vacant bit
[Figure: (multiplier + 1) x (multiplier + 1) = product + rounding correction term.]
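The single correction term for squaring can be checked with a short integer sketch. The function name is hypothetical; the shift-and-force-one construction is the one described on the slide.

```python
def square_with_unrounded_operand(unrounded: int, round_up: bool) -> int:
    """Square a fed-back value whose round-up increment is still pending."""
    # 2A + 1: shift A left by one and force a 1 into the vacated low bit.
    correction = ((unrounded << 1) | 1) if round_up else 0
    return unrounded * unrounded + correction
```

Because the low bit of `A << 1` is always zero, the OR never collides with an existing bit, so 2A + 1 costs no adder.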
Feedback unfinished results…
• Impact on dataflow, feedback to both multiplicand and multiplier:
  • Special case: the feedback result is all ones and needs a round-up correction
  • (1111 + 1) x (1111 + 1) = 1 0000 0000
  • The product has one bit to the left of the multiplier array -> this is a single case and can be detected. A one is forced into bit 59, to the left of the product.
  • -> The multiplier array correctly delivers an all-zeros product.
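The all-ones case can be reproduced with the same correction term as in the general squaring case. The sketch below splits the result into the bits that fit a 2n-bit product window and the single bit beyond it; mapping that window onto the array and the extra bit onto "bit 59" is an assumption, and the function name is hypothetical.

```python
def square_all_ones(width: int):
    """Square an all-ones value of `width` bits with a pending round-up.

    Returns (low_bits, overflow_bit): the part that fits in a 2 * width
    bit product window and the single bit to the left of that window.
    """
    a = (1 << width) - 1
    full = a * a + ((a << 1) | 1)  # A^2 + 2A + 1
    low_bits = full & ((1 << (2 * width)) - 1)
    overflow_bit = full >> (2 * width)
    return low_bits, overflow_bit
```

For `width = 4` this gives the slide's example: an all-zeros window and one extra bit, i.e. 1 0000 0000.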
Divide and Square root
• Uses an iteration algorithm similar to Goldschmidt's
  • Divide table with 14-bit precision (needed for estimate instructions)
  • Takes advantage of multiply-add operations
  • No need to complement on the feedback path to the multiplier
  • Makes use of the wider dataflow for more precision, thus eliminating the last compensation step (Newton-Raphson)
• Implements fixed-point divide
  • 32- and 64-bit integer operands
  • Same processing as for floating-point divide
  • Convert integer to floating-point before start
• Uses empty pipeline slots for pipelineable instructions
  • Communicate available slots to the instruction scheduler
  • Execute independent instructions or instructions from the other thread
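The slides say only that the algorithm is similar to Goldschmidt's, so the sketch below shows the textbook Goldschmidt iteration rather than the Power6 variant. The 14-bit table is imitated by rounding a reciprocal to 14 fractional bits, which is an assumption about how such a table behaves; Python floats stand in for the hardware's wider dataflow.

```python
def goldschmidt_divide(n: float, d: float, iterations: int = 3,
                       table_bits: int = 14) -> float:
    """Approximate n / d for d in [1, 2) with textbook Goldschmidt iteration."""
    assert 1.0 <= d < 2.0
    # Stand-in for the lookup table: reciprocal rounded to table_bits bits.
    scale = 1 << table_bits
    estimate = round(scale / d) / scale
    q = n * estimate   # converges towards n / d
    r = d * estimate   # converges towards 1
    for _ in range(iterations):
        f = 2.0 - r    # correction factor
        q *= f
        r *= f
    return q
```

Numerator and denominator are scaled by the same factor each step, and the error in `r` roughly squares per iteration, so a 14-bit start reaches double precision range after two steps.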
Miscellaneous
• Timing efforts:
  • Lots of tuning and balancing of logic and wires
  • Many placement iterations needed
• Stalling of dependent instructions
  • For rare corner cases (invalid exception, Nmax, Infinity, …)
  • No forwarding possible
• Checking
  • FPR and data busses are parity checked
  • Dataflow is residue checked for multiply-adds
• Clock gating
  • FPU is in sleep mode when not used
  • Clock gating also in pipeline mode (fine-grain clock gating)
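Residue checking of a multiply-add relies on the fact that residues can be combined the same way as the operands. The slides do not give the modulus, so the sketch below assumes modulo 3, a common choice; the names are hypothetical and the operands are plain integers.

```python
MODULUS = 3  # assumption: the slides do not state the residue modulus


def residue(value: int) -> int:
    return value % MODULUS


def multiply_add_residue_ok(a: int, c: int, b: int, result: int) -> bool:
    """Check result == a * c + b using residues only."""
    expected = (residue(a) * residue(c) + residue(b)) % MODULUS
    return residue(result) == expected


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of one bits."""
    return bin(word).count("1") & 1
```

With modulus 3 every single-bit flip in the result is caught, since no power of two is divisible by 3; some multi-bit errors can still escape.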
Summary
• Binary Floating-Point Unit for Power6
  • 7-cycle FMA at 13 fo4 cycle time
  • Pushes a traditional design to high frequency and low power
  • Lots of fine-tuning work to achieve timing
  • Logic spread across functional blocks to get optimal timing
  • Makes use of the wider dataflow to feed back unfinished results
  • 310 mW at 1.1 V and 4 GHz
  • 2.5 mm² in 65 nm SOI technology