Efficient Datapath Functional Units for Arithmetic Operations

Lecture 18: Datapath Functional Units

Outline
• Comparators
• Shifters
• Multi-input Adders
• Multipliers

Comparators
• 0’s detector: A = 00…000
• 1’s detector: A = 11…111
• Equality comparator: A = B
• Magnitude comparator: A < B

1’s & 0’s Detectors
• 1’s detector: N-input AND gate
• 0’s detector: NOTs + 1’s detector (N-input NOR)
• An asymmetric design favors the late-arriving inputs: the delay from the latest-arriving bit, A7, is a single gate.

1’s & 0’s Detectors
• Fast pseudo-nMOS or dynamic NOR detector
• Works well for words up to about 16 bits
• For larger words, the gates can be split into 8–16-bit chunks to reduce the parasitic delay

Equality Comparator
• Check whether each pair of bits is equal (XNOR, also known as the equality gate)
• Apply a 1’s detector to the bitwise equality signals

Magnitude Comparator
• Compute B - A and look at the sign
• B - A = B + ~A + 1
• If there is a carry-out, A <= B; otherwise, A > B
• A zero detector on the result indicates that the numbers are equal, so A < B when there is a carry-out and the result is nonzero
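The comparator slides above can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: operands are unsigned, `n` is the word width, and the function names (`ones_detect`, `zeros_detect`, `equal`, `compare`) are illustrative, not taken from the lecture.

```python
def ones_detect(a: int, n: int) -> bool:
    """1's detector: behaves like an N-input AND over the bits of a."""
    mask = (1 << n) - 1
    return (a & mask) == mask


def zeros_detect(a: int, n: int) -> bool:
    """0's detector: invert every bit, then reuse the 1's detector."""
    mask = (1 << n) - 1
    return ones_detect(~a & mask, n)


def equal(a: int, b: int, n: int) -> bool:
    """Equality comparator: bitwise XNOR followed by a 1's detector."""
    mask = (1 << n) - 1
    xnor = ~(a ^ b) & mask
    return ones_detect(xnor, n)


def compare(a: int, b: int, n: int) -> tuple[bool, bool]:
    """Magnitude comparator for unsigned n-bit words.

    Computes B - A as B + ~A + 1 and returns (a_le_b, a_eq_b):
    the carry-out signals A <= B, the zero detector signals A == B.
    """
    mask = (1 << n) - 1
    total = (b & mask) + (~a & mask) + 1
    carry_out = (total >> n) & 1
    diff = total & mask
    return bool(carry_out), zeros_detect(diff, n)


if __name__ == "__main__":
    a_le_b, a_eq_b = compare(3, 5, 4)
    print("A < B:", a_le_b and not a_eq_b)
```

A strict A < B is then the carry-out ANDed with the inverted zero-detect output, as in the last line of the usage example.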

Counters
• An N-bit binary counter sequences through 2^N outputs in binary order.
• Features of counters:
  • Resettable: counter value is reset to 0 when RESET is asserted
  • Loadable: counter value is loaded with an N-bit value when LOAD is asserted
  • Enabled: counter counts only on clock cycles when EN is asserted
  • Reversible: counter increments or decrements based on the UP/DOWN input
  • Terminal Count: TC output is asserted when the counter overflows
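The counter features listed above can be modeled behaviorally as follows. This is a sketch, not the lecture's circuit: the class name `Counter`, the priority order (reset over load over enable), and the choice to report TC on the clock edge where the count wraps are all assumptions made for illustration.

```python
class Counter:
    """Behavioral model of an N-bit synchronous counter (illustrative only)."""

    def __init__(self, n: int) -> None:
        self.mask = (1 << n) - 1  # 2^N states: 0 .. 2^N - 1
        self.q = 0

    def step(self, reset: bool = False, load: bool = False, d: int = 0,
             en: bool = True, up: bool = True) -> bool:
        """Advance one clock cycle and return the terminal-count flag.

        Assumed priority: reset, then load, then enable.
        TC is True only on a cycle where the count wraps around.
        """
        tc = False
        if reset:
            self.q = 0
        elif load:
            self.q = d & self.mask
        elif en:
            nxt = self.q + (1 if up else -1)
            tc = nxt < 0 or nxt > self.mask
            self.q = nxt & self.mask
        return tc


if __name__ == "__main__":
    c = Counter(2)
    for _ in range(5):
        tc = c.step()
        print(c.q, tc)
```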

Binary Counter
• Asynchronous ripple-carry counter
• It is composed of N registers.
• The falling transition of each register clocks the subsequent register.
• Asynchronous circuits introduce a whole assortment of problems.
• Delay is long: a clock edge must ripple through every stage, so the delay from clk to Q3 is long.

Binary Counter
• Synchronous up/down counter
• Resettable
• Cycle time is limited by the ripple-carry delay, together with the reset, load, and enable logic

Fast Binary Counter
• The speed of the previous counter is limited by the adder.
• This limit can be overcome by dividing the counter into two or more segments.
• A 32-bit counter could be constructed from:
  • a 4-bit prescaler counter
  • a 28-bit counter
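A minimal functional sketch of the segmented counter, assuming the 28-bit segment increments only on the cycle where the 4-bit prescaler wraps. The function name `step` and its parameters are hypothetical. Python cannot show the timing benefit itself (the upper segment has several cycles to settle); the sketch only checks that the two segments together count correctly.

```python
def step(low: int, high: int, low_bits: int = 4, high_bits: int = 28) -> tuple[int, int]:
    """Advance a two-segment counter by one clock cycle.

    low  : prescaler value (low_bits wide), increments every cycle
    high : upper segment (high_bits wide), increments only when the
           prescaler wraps from all ones back to zero
    """
    low_mask = (1 << low_bits) - 1
    high_mask = (1 << high_bits) - 1
    wrap = low == low_mask
    low = (low + 1) & low_mask
    if wrap:
        high = (high + 1) & high_mask
    return low, high


if __name__ == "__main__":
    low, high = 0, 0
    for _ in range(1000):
        low, high = step(low, high)
    # Concatenate the segments to read the full 32-bit count.
    print((high << 4) | low)
```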

Shifters
• Logical Shift:
  • Shifts number left or right and fills with 0’s
  • 1011 LSR1 = 0101, 1011 LSL1 = 0110
• Arithmetic Shift:
  • Shifts number left or right; a right shift sign-extends
  • 1011 ASR1 = 1101, 1011 ASL1 = 0110
• Rotate:
  • Shifts number left or right and fills with lost bits
  • 1011 ROR1 = 1101, 1011 ROL1 = 0111
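The three shift families can be checked against the 4-bit examples above with a short Python model. The function names and the explicit width parameter `n` are assumptions of this sketch; shift amounts are assumed to satisfy 0 <= k < n.

```python
def lsr(a: int, k: int, n: int) -> int:
    """Logical shift right: vacated high bits are filled with 0."""
    mask = (1 << n) - 1
    return (a & mask) >> k


def lsl(a: int, k: int, n: int) -> int:
    """Logical (and arithmetic) shift left: vacated low bits are filled with 0."""
    mask = (1 << n) - 1
    return (a << k) & mask


def asr(a: int, k: int, n: int) -> int:
    """Arithmetic shift right: vacated high bits copy the sign bit."""
    mask = (1 << n) - 1
    a &= mask
    shifted = a >> k
    if (a >> (n - 1)) & 1:
        shifted |= (mask << (n - k)) & mask  # k ones in the top positions
    return shifted


def ror(a: int, k: int, n: int) -> int:
    """Rotate right: bits shifted out on the right re-enter on the left."""
    mask = (1 << n) - 1
    a &= mask
    return ((a >> k) | (a << (n - k))) & mask


def rol(a: int, k: int, n: int) -> int:
    """Rotate left: bits shifted out on the left re-enter on the right."""
    mask = (1 << n) - 1
    a &= mask
    return ((a << k) | (a >> (n - k))) & mask


if __name__ == "__main__":
    for f in (lsr, lsl, asr, ror, rol):
        print(f.__name__, format(f(0b1011, 1, 4), "04b"))
```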

Funnel Shifter
• A funnel shifter can do all six types of shifts
• Creates a (2N - 1)-bit input word Z from A and/or the kill values
• Selects an N-bit field Y from the (2N - 1)-bit input
• Shift by k bits (0 ≤ k < N)
• Logically involves N N:1 multiplexers

Funnel Source Generator
[Figure: source generator configurations for Rotate Right, Logical Right, Arithmetic Right, Rotate Left, and Logical/Arithmetic Left]
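The source-generator table itself is not preserved in the transcript, so the sketch below uses one common arrangement as an assumption: for right shifts, A sits in the low N bits of Z and the field offset is k; for left shifts, A sits in the high N bits of Z and the offset is N - 1 - k. The function name `funnel_shift` and the string shift-type codes are hypothetical.

```python
def funnel_shift(a: int, k: int, n: int, kind: str) -> int:
    """Build the (2N-1)-bit word Z, then select an N-bit field Y from it.

    kind is one of "lsr", "asr", "ror", "lsl", "asl", "rol"; 0 <= k < n.
    The Z layouts are an assumed arrangement, not taken from the slides.
    """
    mask = (1 << n) - 1
    a &= mask
    low = a & (mask >> 1)   # A[N-2:0]
    high = a >> 1           # A[N-1:1]
    sign = (a >> (n - 1)) & 1

    if kind in ("lsr", "asr", "ror"):
        if kind == "lsr":
            upper = 0                          # fill with zeros
        elif kind == "asr":
            upper = (mask >> 1) if sign else 0  # fill with the sign bit
        else:
            upper = low                        # fill with the bits rotated out
        z = (upper << n) | a
        offset = k
    elif kind in ("lsl", "asl", "rol"):
        lower = high if kind == "rol" else 0
        z = (a << (n - 1)) | lower
        offset = n - 1 - k
    else:
        raise ValueError(f"unknown shift type: {kind}")

    return (z >> offset) & mask  # the N-bit field Y


if __name__ == "__main__":
    for kind in ("lsr", "asr", "ror", "lsl", "asl", "rol"):
        print(kind, format(funnel_shift(0b1011, 1, 4, kind), "04b"))
```

In hardware the final line corresponds to the N multiplexers that pick the field; only the Z construction changes with the shift type.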

Array Funnel Shifter
• N N-input multiplexers
• Use 1-of-N hot select signals for shift amount
• nMOS pass-transistor design (beware of Vt drops)
• Example drawn for N = 4

Array Funnel Shifter
• Source generator consists of a (2N - 1)-bit 5:1 multiplexer controlled by the shift type and direction.
[Figure: source generator multiplexer; one output group is labeled Z[2N-2:N]]

Multi-input Adders
• Suppose we want to add k N-bit words
• Ex: 0001 + 0111 + 1101 + 0010 = 10111
• Straightforward solution: k-1 N-bit CPAs
• Large and slow

Carry Save Addition
• A full adder sums 3 inputs and produces 2 outputs
• Carry output has twice the weight of the sum output (X + Y + Z = S + 2C)
• N full adders in parallel are called a carry-save adder (CSA)
• Produces N sums and N carry outs

CSA Application
• Use k-2 stages of CSAs
• Keep result in carry-save redundant form
• Final CPA computes actual result
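A small Python sketch of carry-save addition, using the four-word example from the multi-input adder slide. The function names `csa` and `add_words` are illustrative. Python integers are unbounded, so the sketch ignores word-width and overflow handling that real hardware would need.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """One carry-save adder stage: a row of independent full adders.

    Returns (s, c) with x + y + z == s + 2*c; no carry propagates
    between bit positions inside this stage.
    """
    s = x ^ y ^ z                       # per-bit sum
    c = (x & y) | (x & z) | (y & z)     # per-bit carry (majority)
    return s, c


def add_words(words: list[int]) -> int:
    """Add k >= 2 words with k-2 CSA stages followed by one CPA."""
    s, c_shifted = words[0], words[1]
    for w in words[2:]:
        s, c = csa(s, c_shifted, w)
        c_shifted = c << 1              # carry has twice the weight of sum
    return s + c_shifted                # final carry-propagate addition


if __name__ == "__main__":
    total = add_words([0b0001, 0b0111, 0b1101, 0b0010])
    print(format(total, "b"))
```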

Multiplication
• Multiplication is less common than addition.
• A multiplier is essential for:
  • Microprocessors
  • Digital signal processors
  • Graphics engines

Multiplication
• M × N-bit multiplication:
  • Produce N M-bit partial products (PPs)
  • Sum these to produce the (M+N)-bit product

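General Form
• Multiplicand: $Y = (y_{M-1}, y_{M-2}, \ldots, y_1, y_0)$
• Multiplier: $X = (x_{N-1}, x_{N-2}, \ldots, x_1, x_0)$
• Product: $P = \left(\sum_{j=0}^{M-1} y_j 2^j\right)\left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} x_i y_j 2^{i+j}$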
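The general form can be illustrated by generating the N partial products explicitly and summing them. This is a sketch for unsigned operands; the function names `partial_products` and `multiply` are assumptions made for the example.

```python
def partial_products(y: int, x: int, n: int) -> list[int]:
    """Return the N partial products of multiplicand y and N-bit multiplier x.

    Row i is (x_i AND y), shifted left by i to carry the weight 2^i.
    """
    return [(y if (x >> i) & 1 else 0) << i for i in range(n)]


def multiply(y: int, x: int, m: int, n: int) -> int:
    """Multiply an M-bit y by an N-bit x by summing the partial products."""
    y &= (1 << m) - 1
    x &= (1 << n) - 1
    product = sum(partial_products(y, x, n))
    assert product < (1 << (m + n))  # the product fits in M+N bits
    return product


if __name__ == "__main__":
    for row in partial_products(0b1100, 0b0101, 4):
        print(format(row, "08b"))
    print(multiply(0b1100, 0b0101, 4, 4))
```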

Dot Diagram
• Large multiplications are illustrated using dot diagrams.
• Each dot represents a bit
[Figure: dot diagram with dimensions M and N]

Multiplier design
• Depends on:
  • Latency
  • Throughput
  • Energy
  • Area
  • Design complexity
• Simple approach:
  • Use an (M + 1)-bit carry-propagate adder (CPA) to add the first two partial products
  • Then use another CPA to add the third partial product to the running sum, and so forth
  • Requires N - 1 CPAs and is slow
• Better approach: using arrays

The Array Multiplier
[Figure: 4 × 4 array multiplier with inputs X0 to X3 and Y0 to Y3, rows of half-adder (HA) and full-adder (FA) cells, and product outputs P0 to P7]

The MxN Array Multiplier: Critical Path
• Many paths of almost identical length can be identified.
[Figure: array of HA and FA cells with Critical Path 1 and Critical Path 2 highlighted and delay labels ta, tc, ts; the delay expression for path 2 is not preserved]

Carry-Save Multiplier

Rectangular Array
• Squash the array to fit a rectangular floorplan (the parallelogram shape wastes space).

Improvement
• The first row of CSAs adds the first partial product to a pair of 0s (Sin = 0 and Cin = 0).
  • This leads to a regular structure but is inefficient.
• The first row of CSAs can instead be used to add the first three partial products together.
  • This reduces the number of rows by two and reduces the adder propagation delay.
• Another improvement: replace the bottom row with a faster CPA such as a lookahead or tree adder.
• The critical path of an array multiplier involves N - 2 CSAs and a CPA.

Two’s Complement Array Multiplication
• The MSB of a two’s complement number has a negative weight.
• Two of the PPs have negative weight and should be subtracted.
• Subtraction is handled by taking the two’s complement of the terms to be subtracted, i.e., inverting the bits and adding one.
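The negative-weight argument above can be checked numerically. The sketch below assumes both operands are n-bit two's complement bit patterns and splits each into its MSB and its lower n-1 bits; the names `signed_value` and `signed_multiply` are illustrative. It models the arithmetic only, not the Baugh-Wooley array layout.

```python
def signed_value(bits: int, n: int) -> int:
    """Interpret an n-bit pattern as two's complement (MSB has weight -2^(n-1))."""
    half = 1 << (n - 1)
    return (bits & (half - 1)) - (bits & half)


def signed_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit two's complement patterns; return the 2n-bit pattern."""
    msb = n - 1
    low_mask = (1 << msb) - 1
    xl, yl = x & low_mask, y & low_mask        # lower n-1 bits, positive weight
    xm, ym = (x >> msb) & 1, (y >> msb) & 1    # MSBs, negative weight

    # Positive terms: lower bits times lower bits, plus MSB times MSB.
    pos = xl * yl + ((xm & ym) << (2 * msb))
    # The two negative-weight partial products (an MSB times the other's lower bits).
    neg = (xm * yl + ym * xl) << msb

    # Subtract by adding the two's complement: invert the bits and add one.
    mask2 = (1 << (2 * n)) - 1
    return (pos + (~neg & mask2) + 1) & mask2


if __name__ == "__main__":
    p = signed_multiply(0b1101, 0b0101, 4)     # -3 * 5
    print(format(p, "08b"), signed_value(p, 8))
```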

Baugh-Wooley multiplier algorithm
• The upper parallelogram: multiplication of all but the MSBs of the inputs
• Next row: product of the MSBs (x5y5)
• Next two pairs of rows: the inversions of the terms to be subtracted
  • Each term has implicit leading and trailing zeros, which are inverted to leading and trailing ones.
  • Extra ones must be added in the least significant column when taking the two’s complement.

Modified Baugh-Wooley multiplier
• Reduces the number of partial products by:
  • precomputing the sums of the constant ones
  • pushing some of the terms upward into extra columns

Modified Baugh-Wooley multiplier squashed into a rectangle
• Almost identical to the unsigned multiplier
• AND gates are replaced by NAND gates in the hatched cells.
• 1s are added in place of 0s at two of the unused inputs.
• Signed and unsigned arrays are so similar that a single array can be used for both.
  • XOR gates are used to conditionally invert some of the terms depending on the mode.

Fewer Partial Products
• Array multiplier requires N partial products
• If we looked at groups of r bits, we could form N/r partial products.
  • Faster and smaller?
  • Called radix-2^r encoding
• Ex: r = 2: look at pairs of bits
• Form partial products of:
  • 00 → 0
  • 01 → Y
  • 10 → 2Y (simple shift)
  • 11 → 3Y (hard multiple, requiring a slow carry-propagate addition of Y + 2Y)
• The first three are easy, but 3Y requires an adder.

Booth Encoding
• Observe that:
  • 3Y = 4Y - Y
  • 2Y = 4Y - 2Y
• 4Y in a radix-4 multiplier array is equivalent to Y in the next row

Booth Encoding
• Instead of 3Y, try -Y, then increment the next partial product to add 4Y
• Similarly, for 2Y, try -2Y + 4Y in the next partial product
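The recoding described above can be sketched as the usual radix-4 Booth digit rule, where each digit is x[2i-1] + x[2i] - 2·x[2i+1] and therefore lies in {-2, -1, 0, 1, 2}. This sketch assumes an unsigned n-bit multiplier that is zero-extended at the top; the function names are hypothetical.

```python
def booth_radix4_digits(x: int, n: int) -> list[int]:
    """Recode an unsigned n-bit multiplier into radix-4 Booth digits.

    Each digit looks at bits (x[2i+1], x[2i]) plus the top bit of the
    previous pair. A pair worth 2 or 3 becomes -2 or -1 here, and the
    missing 4 is passed upward as +1 in the next digit.
    """
    x &= (1 << n) - 1
    digits = []
    prev = 0                              # x[-1] is defined as 0
    for i in range(0, n + 1, 2):          # extra zero bits cover the last carry
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(y: int, x: int, n: int) -> int:
    """Sum the Booth partial products d_i * Y * 4^i."""
    return sum(d * y * (4 ** i) for i, d in enumerate(booth_radix4_digits(x, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(0b0111, 4))
    print(booth_multiply(13, 0b0111, 4))
```

Every partial product is now 0, ±Y, or ±2Y, so none requires the carry-propagate addition that 3Y would.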

Example

Booth Hardware
• Booth encoder generates control lines for each PP
• Booth selectors choose PP bits
