396-ps 32-bit Han-Carlson ALU in 180nm TSMC process

Liang-Kai Wang, VLSI CAD Lab, University of Wisconsin, Madison

Outline • Review of Adders • The Idea of Han-Carlson Adder • The Implementation of Han-Carlson Adder • Simulation Result • Discussion • Comparison between Ling’s and H-C Adder • Future work • Reference

Review of Adders • Carry Ripple Adder

Review of Adders(cont.) • Carry Skip Adder

Review of Adders(cont.) • Carry-Select Adder

Review of Adders(cont.) • Carry-Save Adder

Review of Adders(cont.) • Carry Lookahead Adder

Review of Adders(cont.) • Ling Adder • Observation: (shown as a figure on the original slide)

Review of Adders(cont.) • Hybrid (Parallel) Prefix Adder • Brent-Kung Adder • Kogge-Stone • Han-Carlson Adder

Review of Adders(cont.) • Brent-Kung Adder • Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ (number of adder cells) • Time: $2\log_2 k - 2$ (in terms of adder levels)
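The cost recurrence above can be checked numerically. The following is a minimal sketch, assuming k is a power of two and a base case of C(1) = 0 (the base case is not stated on the slide; it is chosen here because it makes the closed form hold). The function names are illustrative only.

```python
def bk_cost_recurrence(k: int) -> int:
    """Cell count from C(k) = C(k/2) + k - 1, with C(1) = 0 (assumed base case)."""
    if k == 1:
        return 0
    return bk_cost_recurrence(k // 2) + k - 1


def bk_cost_closed_form(k: int) -> int:
    """Closed form 2k - 2 - log2(k), valid for k a power of two."""
    log2_k = k.bit_length() - 1  # exact log2 for powers of two
    return 2 * k - 2 - log2_k


def bk_levels(k: int) -> int:
    """Adder levels 2*log2(k) - 2, as quoted on the slide."""
    return 2 * (k.bit_length() - 1) - 2


# The recurrence and the closed form agree for every power of two tried here.
for k in (2, 4, 8, 16, 32, 64):
    assert bk_cost_recurrence(k) == bk_cost_closed_form(k)

print(bk_cost_closed_form(32), bk_levels(32))  # 57 cells and 8 levels for k = 32
```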

Review of Adders(cont.) • Kogge-Stone Adder • Cost: $k\log_2 k - (k - 1)$ • Time: $\log_2 k$
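The Kogge-Stone cost can be reproduced by counting cells level by level. This is a sketch under the usual textbook assumption that the level with span d places one merge cell at every bit position i >= d; the helper name is hypothetical.

```python
def kogge_stone_cells_and_levels(k: int) -> tuple[int, int]:
    """Count merge cells and levels of a k-bit Kogge-Stone prefix network."""
    cells = 0
    levels = 0
    dist = 1  # span combined at the current level: 1, 2, 4, ...
    while dist < k:
        cells += k - dist  # bits dist..k-1 each get one merge cell
        levels += 1
        dist *= 2
    return cells, levels


cells, levels = kogge_stone_cells_and_levels(32)
# Matches k*log2(k) - (k - 1) = 32*5 - 31 and log2(k) = 5.
assert cells == 32 * 5 - 31
assert levels == 5
print(cells, levels)  # 129 5
```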

The idea of Han-Carlson Adder • Han-Carlson Adder • B-K adder: small area, but slow • K-S adder: large area, but fast • Going from B-K to K-S: • Speed: $2\log_2 n - 2 \rightarrow \log_2 n$ (1/2 reduction) • Cost: $2k - 2 - \log_2 k \rightarrow k\log_2 k - k + 1$ ($\log_2 k / 2$ increase) • The Han-Carlson adder is the result of this area-time tradeoff

The idea of Han-Carlson Adder (cont.) • Han-Carlson Adder • Cost: $O\left(\tfrac{k}{2}\log_2 k\right)$ • Time: $O(\log_2 k + 1)$

Review of Adders(cont.) • Optimized Brent-Kung Adder • Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ • Time: $\log_2 k$ (in terms of adder levels)

The idea of Han-Carlson Adder (cont.)

The idea of Han-Carlson Adder (cont.) • Improved: domino circuit with the odd stages dynamic and the even stages static. • The Generate, Propagate, and Partial Sum bits are produced in the first stage. • Single-rail circuit, with double-rail logic only in the last stage to perform the XOR function: Sum = Partial_Sum XOR CarryIn
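A bit-level behavioral model may help make the structure concrete. The sketch below follows the textbook Han-Carlson arrangement (one merge level on odd bits, a Kogge-Stone network over the odd bits, one final level for the even bits) and the last-stage relation Sum = Partial_Sum XOR CarryIn. It is not the author's circuit: the function names are hypothetical, and folding the carry-in into bit 0's generate signal is an assumption.

```python
import random


def merge(hi, lo):
    """Carry-merge operator on (generate, propagate) pairs."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def han_carlson_add(a: int, b: int, cin: int = 0, n: int = 32):
    """Return (sum, carry_out, merge_levels) for an n-bit add; n is a power of two."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate = partial sum
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # generate
    g[0] |= p[0] & cin  # assumption: carry-in is merged at the LSB
    G, P = g[:], p[:]

    # Level 1: each odd bit merges with its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = merge((g[i], p[i]), (g[i - 1], p[i - 1]))
    levels = 1

    # Kogge-Stone levels over the odd bits only (spans 2, 4, ..., n/2).
    dist = 2
    while dist < n:
        prev_g, prev_p = G[:], P[:]
        for i in range(dist + 1, n, 2):
            G[i], P[i] = merge((prev_g[i], prev_p[i]),
                               (prev_g[i - dist], prev_p[i - dist]))
        levels += 1
        dist *= 2

    # Final level: each even bit merges with the odd bit below it.
    for i in range(2, n, 2):
        G[i], P[i] = merge((g[i], p[i]), (G[i - 1], P[i - 1]))
    levels += 1

    # Sum = Partial_Sum XOR CarryIn, where the carry into bit i is G[i - 1].
    carry_in = [cin] + G[:-1]
    total = sum((p[i] ^ carry_in[i]) << i for i in range(n))
    return total, G[n - 1], levels


# Compare against Python's integer arithmetic on random 32-bit operands.
rng = random.Random(0)
for _ in range(1000):
    x, y, c = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(1)
    s, cout, lv = han_carlson_add(x, y, c)
    assert s == (x + y + c) & 0xFFFFFFFF and cout == (x + y + c) >> 32
    assert lv == 6  # log2(32) + 1 merge levels
```

The level count of 6 for n = 32 agrees with the $\log_2 k + 1$ time figure quoted for the Han-Carlson adder.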

The implementation of Han-Carlson Adder • Schematic design in Composer, simulation in Spectre; both are part of the Cadence design kit • The simulation results are from the schematic (pre-layout) • The best speed is achieved by using the fast mode in the technology file rather than by tuning the bulk voltage • The clock is generated by a ring oscillator with five inverters in the loop. • Cadence tutorials for both tools, and for setting up the environment, are provided here.

The implementation of Han-Carlson Adder(cont.) • Clock generation: • Ring oscillator: five inverters followed by many buffers (figure labels: output, trigger, NMOS)

The implementation of Han-Carlson Adder(cont.) • Clock distribution (figure: PG gen., stages S0 to S4, Sum gen. with Sum and Sum# outputs, Latch, Correct; clock signals Ø1, Ø2, stclk2, stclk3)

The implementation of Han-Carlson Adder(cont.) • The whole view (figure: inputs A, B, and Carry In; M1, M2; PG gen.; carry-merge stages CM0 to CM4 on the path for the P and G bits; a separate path for the Psum bit; Sum gen. producing Sum, Sum#, and Correct; annotations: single-rail circuit, foot transistor added, double rail inside)

The implementation of Han-Carlson Adder(cont.) • ALU PG/Partial Sum Circuit

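The implementation of Han-Carlson Adder (cont.) • Dynamic and Static Carry Merge Stage (figure: schematics of the even stage and the odd stage; labels, in slide order: "i = 0, 2, …, 30", "Even Stage", "i = 1, 3, …, 31, or the carry at that bit is already available", "Odd Stage")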

The implementation of Han-Carlson Adder (cont.) • Dynamic and Static Carry Merge Stage (cont.): • The carry-in must be merged at the LSB in order to perform subtraction. • The generate and propagate bits of the MSB are passed to the last stage to produce the carry-out of the ALU (used for the check bit).
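Why the carry-in matters for subtraction can be shown with a short behavioral sketch. It assumes the usual two's-complement scheme, A - B = A + ~B + 1, with the "+1" supplied through the carry-in; the slides do not spell out the inversion of B, and the function name is hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def alu_add_sub(a: int, b: int, subtract: bool) -> tuple[int, int]:
    """Return (result, carry_out) of a 32-bit add, or subtract via A + ~B + 1."""
    b_eff = (~b & MASK) if subtract else (b & MASK)  # invert B for subtraction
    cin = 1 if subtract else 0                       # the +1 enters as carry-in
    total = (a & MASK) + b_eff + cin
    return total & MASK, total >> 32


# 0 - (-2) = 2: B = -2 is 0xFFFFFFFE, its inverse is 1, and the carry-in adds 1.
print(alu_add_sub(0, -2 & MASK, subtract=True))  # (2, 0)
```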

The implementation of Han-Carlson Adder (cont.) • Even/Odd-bits CSG Sum Generation • Complementary signal generator (CSG) logic

The implementation of Han-Carlson Adder (cont.) • Even/Odd-bits CSG Sum Generation • Use a latch to increase noise tolerance (figure labels: Carry_bar, Carry)

Simulation Result • Apply the worst-case pattern to test this design: • A=0, B=-2, Carry-In=1 gives the worst-case delay. • Why? From the structure of the circuit, the worst-case path is 3N-2P-2N-2P-2N-2P-3N (for the Propagate bit)

Simulation Result (cont.) • 0th stage: Carry-In=1 • 1st stage: g=0, p=0, Psum=0 (P/G/Psum, 3N) • 2nd stage: g#=1, p#=1 (Static, 2P) • 3rd stage: g=0, p=0 (Dynamic, 2N) • 4th stage: g#=1, p#=1 (Static, 2P) • 5th stage: g=0, p=0 (Dynamic, 2N) • 6th stage: g#=1, p#=1 (Static, 2P) • 7th stage: Cin31=0 (Dynamic, 3N) • The result should be 2, with Correct = 1

Simulation Result (cont.)

Simulation Result (cont.) • The result window

Simulation Result (cont.) • Test whether the error flag is correct. • 1st test pattern: $A = -2^{31}$, $B = -1$. The answer is $2^{31} - 1$ (1'b0 followed by 31'b1), which is the wrong answer, so the Correct bit should equal 0. (tests the lower bound) • Also check that the clock period is about 396.23 ps

Simulation Result (cont.) • 2nd test pattern: $A = 2^{31} - 1$, $B = 2$. The answer is $-2^{31} + 1$ (1'b1, then 30'b0, then 1'b1; the wrong answer), so the Correct bit should equal 0. (tests the upper bound)
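The two overflow test patterns can be reproduced in software. The sketch below models only the arithmetic, not the circuit: it wraps the sum to 32 bits, reinterprets it as a signed value, and sets a flag that is 1 when no signed overflow occurred, mirroring the Correct bit. The function name is hypothetical.

```python
MASK = 0xFFFFFFFF
SIGN = 0x80000000


def signed_add32(a: int, b: int) -> tuple[int, int]:
    """Return (wrapped signed result, correct flag) for a 32-bit signed add."""
    raw = (a + b) & MASK                          # keep the low 32 bits
    result = raw - (1 << 32) if raw & SIGN else raw  # reinterpret as signed
    correct = 1 if result == a + b else 0         # 0 means signed overflow
    return result, correct


# Lower bound: -2^31 + (-1) wraps to 2^31 - 1, so Correct = 0.
assert signed_add32(-2**31, -1) == (2**31 - 1, 0)
# Upper bound: (2^31 - 1) + 2 wraps to -2^31 + 1, so Correct = 0.
assert signed_add32(2**31 - 1, 2) == (-2**31 + 1, 0)
# A sum that fits in 32 bits keeps Correct = 1.
assert signed_add32(1, 1) == (2, 1)
```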

Discussion: P/G/Psum Block (figure: P circuit, G circuit, and Psum circuit, with Psum = A xor B; the author's version is labeled "Mine")

Discussion (cont.) • What might be the problem? • Longer path to ground • During precharge, both the propagate and generate bits are "1" • What do we need to consider? If p=0 and g=0, this circuit may perform well. • However, what if g goes from 1 to 0 while p=1?

Discussion (Cont.)

Discussion (cont.) • If the longest path is cut, then… (figure: the author's circuit, labeled "Mine")

Discussion (Cont.) • (figure: the author's circuit, labeled "Mine")

Comparison between H-C adder and Ling Adder • Ling Adder: • For an n-bit Ling adder combining groups of r • Critical path: • $\log_r n - 1$ levels • An r-to-1 reduction results in $\log_r n$ levels • The "-1" comes from using the CLA expression rather than Ling's expression for the last group, which saves one stage. • The worst-case delay remains the second path from the last block • For each block, there are r+1 transistors connected in series. • A carry-select block generates the sum bits; only 2 additional gate delays are needed.

Comparison between H-C adder and Ling Adder(cont.) • $T_d = (\log_r n - 1)(r + 1) + 2$ • E.g. r=3, n=32: rounding $\log_3 32$ up to 4 gives $T_d = (4 - 1)(3 + 1) + 2 = 14$ (figure labels: Lookahead Network, Group Generation, CLA expression, Carry-Select structure (MUX))

Comparison between H-C adder and Ling Adder(cont.) • H-C Adder: • P, G generation = 3 • Carry merge in each stage (dynamic or static) = 2 • CSG Sum = 5 • $T_d = 2\log_2 n + 3 + 5$ (3 for P, G generation; 5 for CSG Sum) • E.g. n=32, $T_d = 18$
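The two delay estimates can be evaluated side by side. This is a sketch of the formulas as quoted on the slides, counting series transistors; rounding $\log_r n$ up to an integer is an assumption (it is what makes the r=3, n=32 example come out to 14), and the function names are hypothetical.

```python
def ceil_log(base: int, n: int) -> int:
    """Smallest integer L with base**L >= n, computed without floating point."""
    levels, reach = 0, 1
    while reach < n:
        reach *= base
        levels += 1
    return levels


def ling_delay(r: int, n: int) -> int:
    """Td = (log_r(n) - 1)(r + 1) + 2, with log_r(n) rounded up (assumption)."""
    return (ceil_log(r, n) - 1) * (r + 1) + 2


def han_carlson_delay(n: int) -> int:
    """Td = 2*log2(n) + 3 (P, G generation) + 5 (CSG sum)."""
    return 2 * ceil_log(2, n) + 3 + 5


print(ling_delay(3, 32), han_carlson_delay(32))  # 14 18
```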

Comparison between H-C adder and Ling Adder(cont.) • What are the pros and cons? • Ling Adder: • Advantage: shorter worst-case path → might be faster in theory. • Disadvantages: • Irregular layout → wasted area • Many complex gates imply charge-sharing problems. • Many inputs per stage lead to long wires → delay problem at high frequency • Carry-select logic increases the area.

Comparison between H-C adder and Ling Adder(cont.) • Han-Carlson Adder: • Disadvantage: longer path to the output • Advantages: • Regular layout for each stage • Fewer inputs per path ease the interconnect problem • Simpler gates mean fewer charge-sharing problems

Future Work • Power Reduction by inserting sleep transistors • Speed improvement by inserting discharge transistors in the intermediate stack nodes of the dynamic stages during precharge phase. • Area Reduction in layout • SOI model test • Self-Resetting to minimize the clock period

Reference • A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction Scheduler Loop, ISSCC 2002 • Sub-500-ps 64-b ALUs in 0.18-um SOI/Bulk CMOS: Design and Scaling Trends, JSSC, Nov, 2001 • Fast Area-Efficient VLSI Adders, Proc. 8th Symp. Computer Arithmetic, Sept. 1987

Reference (cont.) • Computer Arithmetic, Algorithms and Hardware Design. Behrooz Parhami, Oxford University Press. • Advanced Computer Arithmetic Design. Michael J. Flynn, et al. John Wiley & Sons, INC. • 5 GHz 32b Integer-Execution Core in 130nm Dual-Vt CMOS, ISSCC 2002 • Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, JSSC Aug. 1999
