faster functional modules lessons taught from FP-ADDERs

faster functional moduleslessons taught from FP-ADDERs Guy Even Electrical Engineering Dept. Tel-Aviv Univ. Silicon Value Seminar (April 29, 2002)

outline • FP-Adder: an example of a complicated module • brief overview • focus on two sub-blocks • Counting leading zeros – priority encoders various design methods: • divide & conquer • parallel prefix computation • redundant addition • Adders: • fast adders • compound adders

Background Faster clock rates require faster modules. • Example: Floating-Point Adders • early designs: 50-60 logic levels. • 15-20 gate levels per cycle  3-4 cycles! • new designs: 25 gate levels  2 cycles. How? better algorithms and faster sub-blocks…

Floating-Point Add • Algorithm: Why 50-60 logic levels? • Sub-Modules: List of sub-blocks. floating-point number: S-sign, E-exponent, F-mantissa • floating-point addition: • input: (Sa,Ea,Fa) & (Sb,Eb,Fb) • output: (S,E,F) such that:

Normalize sum (shift left) SWAP operands Add/Sub mantissas Align mantissa of smaller operand (shift right) Convert sum to sign & mag abs(negative sum) rounding decision INC according to rounding decision Compute sticky-bit (OR of bits shifted outside) RESULT FP-Add: naïve algorithm Pre-process: Add/Sub: Round:

focus: normalization shift Problem: • LZ= number of leading zeros • shift left by LZ positions unary example: X[1:4]=0010 A[1:4]=0011 • Use a priority encoder! • two types of priority encoders: • unary • binary binary example: X[1:4]=0010 Y[2:0]= 010

Unary PENC • input: X[1:n] • Output: A[1:n] • functionality: Simpler: Implementation: what is the best design? Delay =  (log n) & Cost = (n).

X[1:n/2] X[1+n/2:n] U- PENC(n/2) OR-tree(n/2) U-PENC(n/2) n/2 1 n/2 OR(n/2) A[1:n/2] A[1+n/2:n] Unary PENC –divide & conquer share OR-tree slight reduction of cost linear fan-out logarithmic delay delay: O(log n) is optimal O(log n) even if fan-out considered cost: O(n log n) not optimal

Unary PENC - improve Parallel Prefix Computation (PPC) [FL,BK]!

delay = O(log n) cost = O(n) X[1] X[2] X[3] X[4] X[n-1] X[n] OR OR OR U-PENC (n/2) OR OR A[2] A[4] A[n] Unary PENC = PPC(OR) A[1] A[3] A[n-1]

X[1] X[2] X[3] X[4] X[n-1] X[n] OR OR OR U-PENC (n/2) OR OR A[1] A[2] A[3] A[4] A[n-1] A[n] PPC - properties Fan-out: Logarithmic fan-out can be decreased to constant (cost still O(n)). Layout: O(n log n) area. Same design as “Brent-Kung” adder. Applicable for every associative operator.

Binary PENC • input: X[1:n] (n=2^k) • Output: Y[k:0] • functionality: Relation to Unary PENC: Implementation: what is the best design? Delay =  (log n) & Cost = (n).

Binary PENC –simple & optimal X[1:n] PPC (OR) A[1:n] delay(diff(n)) = constant delay(encoder(n)) = O(log n) diff(n) encoder(n) cost(diff(n)) = O(n) cost(encoder(n)) = O(n) Y[k:0]

Binary PENC –with adder tree X[1:n] PPC (OR) A[1:n] ADD-tree(n) • problem: • adder(k) in tree  O(log k) delay per adder • total delay is O(log n log k). Y[k:0]

a3 c3 a2 c2 a1 c1 a0 c0 b3 b2 b1 b0 x3 x2 x1 x0 y3 y2 y1 y0 Redundant addition add columns in parallel using Full-Adders Partial compression or (3:2)-addition: delay is constant! Tree structure enables (n:2)-addition with O(log n) delay. (n:2)-addition used in fast multipliers

X[1:n] PPC (OR) FA-tree(n) 2:1-Adder Y[k:0] Binary PENC –O(log n) delay A[1:n] 2[k:0] Tree of Full-Adders: delay of each full-adder is constant depth is O(log n) output is carry-save number

Binary PENC –divide & conquer 1 2 n/2 n/2+1 n XL XR XL=00…0  YL[k-1]=1

X[1:n/2] X[1+n/2:n] k Bin-PENC(n/2) Bin-PENC(n/2) 1 k k YL YR k YL[k-1] delay=constant cost=O(log n) delay=constant cost=O(log n) MUX(k) +2(k-1) (Half Adder) Y[k:0] Binary PENC –divide & conquer fan-out=k incurs O(log log n) delay initial analysis: delay = O(log n) cost = O(n) bottom line: delay = O(log n log logn ) cost = O(n)

PENC – quick summary

PENC - further issues back to FP-Adder: can we estimate LZ before subtracting? must pre-process to avoid “catastrophic cancellation”! method: partial compression (signed half-adders).

k a b Compound Adder a+ba+b+1 rounding decision MUX RESULT focus – adder Avoid INC after rounding decision by pre-computing increment.

Define: PPC Adder [FL,BK] • computes carry bits C[n:1] • sum bits satisfy: S[i]=XOR(A[i],B[i],C[i]). • computation of carry bits C[n:1]. example: A[3:0]=0100 B[3:0]=1110 [3:0]=pgpk C[4:1]=1100 claim:

how to compute the event: define an operator  : {k,p,g} {k,p,g} {k,p,g} as follows: g  x = g p  x = p k  x = k claim: PPC adder (cont.) compute [i] using PPC with -gates! claim: is associative. definition: [i] = [i] …  [0].

a+b+1 = (a+0.5)+(b+0.5) Therefore, Compound Adder [T] how to compute a+b & a+b+1? • use 2 separate adders • understand PPC adder recall: [i] = [i] … [0]. Now, for a+b+1: ’[i] = [i] … [0]  g.

Conclusion • faster modules require clever designs • starting point: gate count (for delay & cost) • Must take fan-out & layout into account • lots of methods: • divide & conquer • parallel prefix computation • redundant arithmetic

faster functional modules lessons taught from FP-ADDERs

faster functional modules lessons taught from FP-ADDERs

Presentation Transcript

Jackson Adders

Adders

Lessons From

Lessons from

Fast Adders

Adders

Adders

Adders

Lessons from

Adders

Adders

Adders

Adders

Adders

Adders

Parallel Adders

10 Important Life Lessons Connectium Taught Us.

SAP Technical and Functional modules

5 important lessons coworking taught us-converted