480 likes | 493 Views
Recent Developments in Theory and Implementation of Parallel Prefix Adders. Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University. Motivation. Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI large fan-out points
E N D
Recent Developments inTheory and Implementationof Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University
Motivation • Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI • large fan-out points • wide wiring channels • Recent insights: can remove both and do... • absolute difference • late increment • media processing
Structure of Presentation • Parallel Prefix Adder theory • Kogge-Stone, Ladner-Fisher • New log-depth prefix trees • Knowles’ “family of adders” • New applications of prefix adders • late operations, media adder
A (0: w -1) B (0: w -1) Bit propagate and generate cells p (0: w -1) g (0: w -1) Prefix carry tree c (1: w ) Sum cells (XOR gates) s (0: w ) Prefix adder structure
Prefix Equations - 1 • g(i) = a(i) b(i) “carry generate” • p(i) = a(i) b(i) “carry propagate” • k(i) = {a(i) b(i)} “carry kill” • g(i), p(i), & k(i) are mutually exclusive • Use any two: g(i) & k(i) = NAND & NOR • p(i) needed as well: s(i) = p(i) c(i)
Prefix Equations - 2 • Generate and Not Kill signals are com-bined to form “Group Signals” GxzKxz interpretation 0 0 c(x+1) = 0 0 1 c(x+1) = c(z) 1 0 Don’t care 1 1 c(x+1) = 1
Prefix Equations - Interpretation • Group signals yield carry signals: • Tree outputs: c(i+1) = Gi0 • Tree inputs: Gii =g(i) ; Kii = k(i)
Prefix Equations - characteristics • Associative • sub-terms may be pre-computed in parallel
Prefix equations - characteristics • Idempotent • sub-terms may be “overlapped” g (2), k(2) g (1), k(1) g (0), k(0) g (2), k(2) g (1), k(1) g (0), k(0) 0 1 1 GK GK GK 1 2 2 0 0 GK GK 2 2 c (3) c (2) c (1) c (3) c (2) c (1)
4-bit Ladner-Fisher prefix tree • 1 sub-term pre-computed • Logarithmic depth • Fan-out = 2 in 2nd row (laterally)
8-bit Ladner-Fisher prefix tree • Log depth; lateral fan-out = 4 in 3rd row • No exploitation of idempotency
16-bit Ladner-Fisher prefix tree • Log depth with large fan-out in final row
4-bit Kogge-Stone prefix graph • Fan-out = 1 (laterally) • 1 extra cell • parallel wires in 2nd row
8-bit Kogge-Stone prefix graph • More cells & wiring than Ladner-Fisher
16-bit Kogge-Stone prefix graph • Low fan-out but wider wiring channels • No exploitation of idempotency
Black cells and grey cells • Carries, c(i) = Gi-10; Ki-10 terms not needed • G-only cells called and coloured “grey”
The story so far… • Parallel prefix adders available in VLSI • Log-depth adders possible: • high fan-outs {1,2,4,8…} & low cell count • low fan-outs {1,1,1,1…} & high cell count • Problematic in VLSI (buffering, area) • Idempotency of ‘’ operator not exploited
Log-depth prefix trees • In VLSI: • L-F trees require too much buffering delay • K-S trees require too much area (wire flux) • Fan-outs characterised as: • {1,2,4,8…} Ladner-Fisher • {1,1,1,1…} Kogge-Stone
Knowles’ insight • Use other fan-out schemes • 5 possible 8-bit log-depth prefix trees: • {1,1,1} 17 cells Kogge-Stone • {1,1,2} 17 cells uses idempotency • {1,1,4} 14 cells no idempotency • {1,2,2} 14 cells no idempotency • {1,2,4} 12 cells Ladner-Fisher
Knowles’ 8-bit prefix trees • All trees are log-depth
Tree construction rules • Levels are labelled 0,1,2... • Fan-out at jth level, 2k , satisfies 2k 2j • Fan-out at jth level fan-out at j+1th level • Lateral wire length at jth level is 2j
Knowles’ 16-bit trees - I • {1,1,1,1} 49 cells {1,1,1,8} 42 cells • {1,1,1,2} 49 cells {1,2,2,2} 42 cells • {1,1,1,4} 49 cells {1,1,4,4} 40 cells • {1,1,2,2} 49 cells {1,1,4,8} 36 cells • {1,1,2,4} 49 cells {1,2,2,8} 36 cells • {1,1,2,8} 42 cells {1,2,4,4} 36 cells • {1,2,2,4} 42 cells {1,2,4,8} 32 cells
Knowles’ 16-bit trees - II • {1,1,1,1} {1,1,1,8} • {1,1,1,2} Idempotent {1,2,2,2} • {1,1,1,4} Idempotent {1,1,4,4} • {1,1,2,2} Idempotent {1,1,4,8} • {1,1,2,4} Idempotent {1,2,2,8} • {1,1,2,8} Idempotent {1,2,4,4} • {1,2,2,4} Idempotent {1,2,4,8}
Knowles’ 16-bit trees - III • {1,1,1,1} {1,1,1,8} R • {1,1,1,2} I {1,2,2,2} R • {1,1,1,4} I {1,1,4,4} R • {1,1,2,2} I {1,1,4,8} R • {1,1,2,4} I {1,2,2,8} R • {1,1,2,8} R, I {1,2,4,4} R • {1,2,2,4} R, I {1,2,4,8} R
Quick way of spotting R, I • Define span(l) as distance from start of wire to first cell in lth level • span(l) = 2lfanout(l) 1 • tree characteristics • R if span(j) span(k) for j < k • I if span(i) + span(j) = span(k) for i < j < k
Examples of R & I spotting fanout(l) span(l) characteristic • [1,1,1,1] [1,2,4,8] neither R nor I • [1,1,2,2] [1,2,3,7] I only • [1,2,2,2] [1,1,3,7] R only • [1,2,2,4] [1,1,3,5] R & I • Are R & I adders “best”?
VLSI design of prefix adders • Adders laid out as rectangular array of prefix cells (and gaps) • Assume cells measure 10m 4m • 2 cells per significance 20m / bit • Key design parameters: • buffering (area & delay) • wiring channels (area)
16-bit adder example • Assumptions • Maximum fan-out without buffering: • 3 cells + 80m wire (4 cell widths) • Maximum fan-out with buffering: • 9 cells + 240m wire (12 cell widths) • Employ {1,2,2,4} architecture
40 38 36 34 32 30 28 26 24 12 12.5 13 13.5 14 Area vs Time for 32-bit adders Area K-S {1,1,1,1,1} {1,1,2,2,2} {1,2,2,4,4} [1,1,3,5,13] L-F {1,2,4,8,16} Delay
32-bit prefix tree adders • Exploitable trade-off between adder’s delay and area • Kogge-Stone adder 16% faster than Ladner-Fisher but 66% larger • {1,2,2,4,4} adder 8% faster than Ladner-Fisher but only 3% larger • buffering also trades off speed for area
Other addition operations • Late increment • Mod 2w-1 addition for Reed-Solomon coding • floating-point rounding • Late complement • absolute difference for video motion estimation • sign-magnitude addition • Typically use 2 adders and a MUX
Increments in prefix trees • Row of prefix cells = ‘late +1’ operation • Ladner-Fisher comprises many late +1’s • 1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit
Late increment tree • Adder returns A+B if inc = 0 • Adder returns A+B+1 if inc = 1 inc
Late increment logic • “Late Carry” lc(i) set high if: • c(i) = 1 or • inc = 1 and a(n),b(n) 0,0 n: 0 n < i 0 c(i) = G i -1 lc(i) inc 0 Ø K s(i) i -1 p(i)
Late complement theory • In 2’s-complement, N = -(N+1) • A + B = A-B - 1 * late increment then yields A - B • (A + B) = -(A-B - 1+1) = B-A • Absolute difference readily available
Øc(w) 0 Ø K i -1 s(i) p(i) c(i) Absolute difference logic • If c(w) = 0, result negative • if c(w) = 0, invert all the bits • else always perform late increment with Ki-10
Summary of “late” ops • Available on all prefix adders • Extra delay: 1 gate’s delay + buffering • Extra hardware: w black cells • This technique used in floating-point units • late increment for rounding • late complement for true subtraction
Media (“packed”) arithmetic • Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations • Examples: • 32-bit adder 4 8-bit adders • 32-bit multiplier 2 16-bit multipliers
Partitioning an adder • Criteria: • support carries propagating within sub-adders • prevent carries propagating between sub-adders • Solutions: • put AND gates on carry chains slower adder • put dummy 0’s on operand bits larger adder • Use prefix adder!!
Packed prefix adder - 1 • Force k(n) = 0 at partition points • prevents carries propagating across bit n • exploits don’t care condition (g,k) = (1,0) • Implementation • change k(n) gate to (2,1) OR-AND gate • delay-neutral modification
Packed prefix adder - 2 • Force c(n) = Gn-10 = 0 at partition points • prevents c(n) s(n) errors • Implementation • insert AND gates (off critical path) or • change Gn-10 gate to ({2,1},1) complex gate • BUT need Gn-10 signal for sub-adder overflows
Force k(n) = 0 Force c(n) = 0 Packed prefix adder - 3 • Sub-adder carries complete early • Extraneous cells automatically do nothing
Last Slide • Recent developments in prefix adders: • new “family” of log-depth trees • late operations • packed arithmetic for media processing • Future possibilities: • systematic exploitation of idempotency • trees with reduced buffering • combine packed arithmetic/late ops