
396-ps 32-bit Han-Carlson ALU in 180nm TSMC process

396-ps 32-bit Han-Carlson ALU in 180nm TSMC process. Liang-Kai Wang, VLSI CAD Lab, University of Wisconsin, Madison. Outline: Review of Adders; The Idea of Han-Carlson Adder; The Implementation of Han-Carlson Adder; Simulation Result; Discussion; Comparison between Ling’s and H-C Adder.


Presentation Transcript


  1. 396-ps 32-bit Han-Carlson ALU in 180nm TSMC process Liang-Kai Wang VLSI CAD Lab University of Wisconsin, Madison

  2. Outline • Review of Adders • The Idea of Han-Carlson Adder • The Implementation of Han-Carlson Adder • Simulation Result • Discussion • Comparison between Ling’s and H-C Adder • Future work • Reference

  3. Review of Adders • Carry Ripple Adder

  4. Review of Adders(cont.) • Carry Skip Adder

  5. Review of Adders(cont.) • Carry-Select Adder

  6. Review of Adders(cont.) • Carry-Save Adder

  7. Review of Adders(cont.) • Carry Lookahead Adder

  8. Review of Adders(cont.) • Ling Adder • Observation:

  9. Review of Adders(cont.) • Hybrid (Parallel) Prefix Adder • Brent-Kung Adder • Kogge-Stone • Han-Carlson Adder

  10. Review of Adders(cont.) • Brent-Kung Adder • Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ (number of adder cells) • Time: $2\log_2 k - 2$ (in terms of adder levels)

  11. Review of Adders(cont.) • Kogge-Stone Adder • Cost: $k\log_2 k - (k - 1)$ • Time: $\log_2 k$
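
To make the two sets of formulas concrete, the following sketch evaluates them for a 32-bit adder and checks the Brent-Kung closed form against its recurrence. The function names are illustrative, and the base case C(1) = 0 is an assumption that is not stated on the slides.

```python
def log2_int(k):
    """Exact log2 for a power of two."""
    assert k > 0 and k & (k - 1) == 0, "k must be a power of two"
    return k.bit_length() - 1

def brent_kung(k):
    """(cells, levels) from the Brent-Kung formulas on slide 10."""
    lg = log2_int(k)
    return 2 * k - 2 - lg, 2 * lg - 2

def brent_kung_recurrence(k):
    """C(k) = C(k/2) + k - 1, assuming C(1) = 0 (base case not on the slide)."""
    return 0 if k == 1 else brent_kung_recurrence(k // 2) + k - 1

def kogge_stone(k):
    """(cells, levels) from the Kogge-Stone formulas on slide 11."""
    lg = log2_int(k)
    return k * lg - (k - 1), lg

for name, fn in (("Brent-Kung", brent_kung), ("Kogge-Stone", kogge_stone)):
    cells, levels = fn(32)
    print(f"{name:12s} cells={cells:3d} levels={levels}")
```

For k = 32 this gives 57 cells in 8 levels for Brent-Kung and 129 cells in 5 levels for Kogge-Stone.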

  12. The idea of Han-Carlson Adder • Han-Carlson Adder • B-K adder: small area, but slow • K-S adder: large area, but fast • Speed (B-K → K-S): $2\log_2 n - 2 \to \log_2 n$ (about a 1/2 reduction) • Cost (B-K → K-S): $2k - 2 - \log_2 k \to k\log_2 k - k + 1$ (roughly a factor of $\log_2 k / 2$ increase) • The Han-Carlson adder results from this area-time tradeoff

  13. The idea of Han-Carlson Adder (cont.) • Han-Carlson Adder • Cost: $O\left(\frac{k}{2}\log_2 k\right)$ • Time: $O(\log_2 k + 1)$
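
The structure behind these figures can be sketched as a bit-level Python model: one Brent-Kung-style level on the odd bits, Kogge-Stone levels on the odd bits only, and a final level that fills in the even bits. This is a functional sketch, not the circuit from the slides. The name `han_carlson_add`, the way the carry-in is folded into bit 0, and counting only carry-merge cells are assumptions made for illustration.

```python
import random

def han_carlson_add(a, b, cin=0, n=32):
    """Model of an n-bit Han-Carlson prefix adder (n a power of two).

    Returns (sum mod 2**n, carry_out, prefix_levels, merge_cells)."""
    mask = (1 << n) - 1
    a, b = a & mask, b & mask
    p = [((a ^ b) >> i) & 1 for i in range(n)]   # propagate, also the partial sum
    g = [((a & b) >> i) & 1 for i in range(n)]   # generate
    g[0] |= p[0] & cin                           # assumption: carry-in merged at the LSB
    G, P = g[:], p[:]
    levels = cells = 0

    def level(pairs):
        """Apply (G,P)[i] <- (G,P)[i] o (G,P)[j] for all pairs at once."""
        nonlocal levels, cells
        updates = [(i, G[i] | (P[i] & G[j]), P[i] & P[j]) for i, j in pairs]
        for i, gi, pi in updates:
            G[i], P[i] = gi, pi
        levels += 1
        cells += len(updates)

    level((i, i - 1) for i in range(1, n, 2))        # odd bits absorb their even neighbour
    d = 2
    while d < n:                                     # Kogge-Stone levels on odd bits only
        level((i, i - d) for i in range(1, n, 2) if i >= d)
        d *= 2
    level((i, i - 1) for i in range(2, n, 2))        # even bits take the carry from bit i-1

    carry = [cin] + G[:-1]                           # carry into each bit position
    total = sum((p[i] ^ carry[i]) << i for i in range(n))
    return total, G[n - 1], levels, cells

random.seed(0)
x, y = random.getrandbits(32), random.getrandbits(32)
print(han_carlson_add(x, y) [0] == (x + y) & 0xFFFFFFFF)
```

For n = 32 the model uses 6 prefix levels and 80 carry-merge cells, which matches $\log_2 k + 1$ and $\frac{k}{2}\log_2 k$ from the slide.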

  14. Review of Adders(cont.) • Optimized Brent-Kung Adder • Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ • Time: $\log_2 k$ (in terms of adder levels)

  15. The idea of Han-Carlson Adder (cont.)

  16. The idea of Han-Carlson Adder (cont.) • Improvement: domino circuit with dynamic odd stages and static even stages. • The Generate, Propagate, and Partial Sum bits are produced in the first stage. • Single-rail circuit, with double-rail logic only in the last stage to perform the XOR function: Sum = Partial_Sum XOR CarryIn

  17. The implementation of Han-Carlson Adder • Schematic design in Composer, simulation in Spectre; both are part of the Cadence design kit • The simulation results are from the schematic (pre-layout) • The best speed is achieved by using the fast mode in the technology file instead of tuning the bulk voltage • The clock is generated by a ring oscillator with five inverters in the loop. • Cadence tutorials for both tools, and for setting up the environment, are provided here.

  18. The implementation of Han-Carlson Adder (cont.) • Clock generation: • Ring oscillator: five inverters followed by many buffers • [Figure labels: output, trigger NMOS]
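
As a rough sanity check on the clock generator, an ideal ring oscillator with N identical inverters has period T = 2·N·t_d. The sketch below inverts that relation for the five-inverter loop and the 396.23 ps period reported in the simulation slides. It assumes identical stages and that the output buffers do not load the loop, neither of which the slides confirm. The function names are illustrative.

```python
def ring_oscillator_period(n_stages, stage_delay_ps):
    """Ideal period of a ring oscillator: each edge travels the loop twice per cycle."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverting stages")
    return 2 * n_stages * stage_delay_ps

def implied_stage_delay(period_ps, n_stages):
    """Per-inverter delay implied by a measured period (assumes identical stages)."""
    return period_ps / (2 * n_stages)

t_d = implied_stage_delay(396.23, 5)
print(f"implied delay per inverter: {t_d:.3f} ps")
```

Under these assumptions each inverter in the loop contributes a little under 40 ps.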

  19. The implementation of Han-Carlson Adder (cont.) • Clock distribution • [Figure labels: PG gen., S0, S1, S2, S3, S4, stclk2, stclk3, Sum gen., Sum, Sum#, Latch, Ø1, Ø2, Correct]

  20. The implementation of Han-Carlson Adder (cont.) • The whole view • [Figure labels: A, B, Carry In, path for P and G bits, M1, M2, PG gen., CM0, CM1, CM2, CM3, CM4, path for Psum bit, Sum gen., Sum, Sum#, Correct; annotations: single-rail circuit, foot transistor added, double rail inside]

  21. The implementation of Han-Carlson Adder (cont.) • ALU PG/Partial Sum Circuit.

  22. The implementation of Han-Carlson Adder (cont.) • Dynamic and Static Carry Merge Stage: i = 0, 2, …, 30 • Even Stage: i = 1, 3, …, 31, or the carry at that bit is already available. • Odd Stage:

  23. The implementation of Han-Carlson Adder (cont.) • Dynamic and Static Carry Merge Stage (cont.): • The Carry-In must be merged at the LSB in order to support subtraction. • The generate and propagate bits of the MSB are passed to the last stage to produce the carry_out of the ALU (for the check bit).

  24. The implementation of Han-Carlson Adder (cont.) • Even/Odd-bits CSG Sum Generation • Complementary signal generator (CSG) logic

  25. The implementation of Han-Carlson Adder (cont.) • Even/Odd-bits CSG Sum Generation • Use a latch to increase noise tolerance • [Figure labels: Carry_bar, Carry]

  26. Simulation Result • Apply the worst-case pattern to test this design: • A=0, B=-2, Carry-In=1 gives the worst-case delay. • Why? From the structure of the circuit, the worst-case path is 3N-2P-2N-2P-2N-2P-3N (for the propagate bit)

  27. Simulation Result (cont.) • 0th stage: Carry-In=1 • 1st stage: g=0, p=0, Psum=0 (P/G/Psum, 3N) • 2nd stage: g#=1, p#=1 (Static, 2P) • 3rd stage: g=0, p=0 (Dynamic, 2N) • 4th stage: g#=1, p#=1 (Static, 2P) • 5th stage: g=0, p=0 (Dynamic, 2N) • 6th stage: g#=1, p#=1 (Static, 2P) • 7th stage: Cin31=0 (Dynamic, 3N) • The result should be 2, with Correct = 1
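
One reading that is consistent with the stated result of 2 is that Carry-In = 1 selects subtraction, so the ALU computes A - B as A + ~B + 1. The slides do not say this explicitly, so treat it as an assumption. The sketch below only reproduces the arithmetic; it says nothing about the delay, and `alu_sub` and `to_signed` are illustrative names.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath

def to_signed(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x

def alu_sub(a, b):
    """A - B computed as A + ~B + 1 (assumed subtraction mode, carry-in = 1)."""
    inverted_b = ~b & MASK           # B inverted before entering the adder
    return to_signed((a & MASK) + inverted_b + 1)

print(alu_sub(0, -2))
```

With A = 0 and B = -2, the inverted operand is 1 and the adder returns 2, matching the expected result on the slide.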

  28. Simulation Result (cont.)

  29. Simulation Result (cont.) • The result window

  30. Simulation Result (cont.) • Test whether the error flag is correct. • 1st test pattern: A = $-2^{31}$, B = $-1$. The adder output is $2^{31}-1$ (1’b0 followed by 31’b1), which is the wrong answer, so the Correct bit should equal 0 (tests the lower bound). • Also check that the clock period is about 396.23 ps

  31. Simulation Result (cont.)

  32. Simulation Result (cont.) • 2nd test pattern: A = $2^{31}-1$, B = 2. The adder output is $-2^{31}+1$ (1’b1, then 30’b0, then 1’b1), which is the wrong answer, so the Correct bit should equal 0 (tests the upper bound).
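
The two overflow patterns can be reproduced with a small 32-bit two's-complement model. The slides do not show how the Correct bit is derived in hardware, so this sketch simply compares the wrapped result with the exact sum; `add32` and `to_signed` are illustrative names.

```python
MASK = (1 << 32) - 1

def to_signed(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK
    return x - (1 << 32) if x >> 31 else x

def add32(a, b, cin=0):
    """Return (wrapped signed result, correct flag); correct is 0 on signed overflow."""
    exact = a + b + cin
    wrapped = to_signed(exact)
    return wrapped, int(wrapped == exact)

for a, b in ((-2**31, -1), (2**31 - 1, 2)):
    result, correct = add32(a, b)
    print(f"A={a} B={b} -> {format(result & MASK, '032b')} ({result}), Correct={correct}")
```

The first pattern wraps to $2^{31}-1$ and the second to $-2^{31}+1$, both with Correct = 0, as the slides expect.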

  33. Simulation Result (cont.)

  34. Discussion: P/G/Psum Block • [Figure labels: P circuit, G circuit, Psum circuit, Psum = A xor B, Mine]

  35. Discussion (cont.) • What might be the problem? • Longer path to ground • During precharge, both the propagate and generate bits are “1” • What do we need to consider? If p=0 and g=0, this circuit may perform well. • However, what if g goes from 1 to 0 while p=1?

  36. Discussion (Cont.)

  37. Discussion (cont.) • If the longest path is cut, then… • [Figure label: Mine]

  38. Discussion (Cont.) • Mine

  39. Comparison between H-C adder and Ling Adder • Ling Adder: • For an n-bit Ling adder combining r groups • Critical path: $\log_r n - 1$ levels • An $r \to 1$ reduction results in $\log_r n$ levels • The “$-1$” comes from using the CLA expression rather than Ling’s expression for the last group, which saves one stage • The worst-case delay remains the second path from the last block • Each block has $r+1$ serially connected transistors • A carry-select block generates the Sum bits, adding only 2 gate delays

  40. Comparison between H-C adder and Ling Adder (cont.) • $T_d = (\log_r n - 1)(r + 1) + 2$ • E.g. r=3, n=32: $T_d = 14$ (rounding $\log_3 32$ up to 4 levels) • [Figure labels: Lookahead Network, Group Generation, CLA expression, Carry-Select structure (MUX)]

  41. Comparison between H-C adder and Ling Adder (cont.) • H-C Adder: • P, G generation = 3 • Carry merge in each stage (dynamic or static) = 2 • CSG Sum = 5 • $T_d = 2\log_2 n + 3 + 5$ (carry merge + P, G generation + CSG Sum) • E.g. n=32: $T_d = 18$
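
The two delay estimates can be evaluated side by side with a few lines of Python. Rounding the logarithm up to a whole number of levels is an assumption; it is what reproduces the value 14 quoted for r = 3, n = 32. The function names are illustrative.

```python
def ceil_log(n, r):
    """Smallest integer L with r**L >= n (integer arithmetic, no floating-point log)."""
    levels, span = 0, 1
    while span < n:
        span *= r
        levels += 1
    return levels

def ling_delay(n, r):
    """Td = (log_r n - 1)(r + 1) + 2, with log_r n rounded up (assumption)."""
    return (ceil_log(n, r) - 1) * (r + 1) + 2

def han_carlson_delay(n):
    """Td = 2*log2 n (carry merge) + 3 (P, G generation) + 5 (CSG Sum)."""
    return 2 * ceil_log(n, 2) + 3 + 5

print("Ling, r=3, n=32:  ", ling_delay(32, 3))
print("Han-Carlson, n=32:", han_carlson_delay(32))
```

This reproduces the slide values of 14 for the Ling adder and 18 for the Han-Carlson adder, both measured in serially connected transistors along the critical path.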

  42. Comparison between H-C adder and Ling Adder (cont.) • What are the pros and cons? • Ling Adder: • Advantage: shorter worst-case path → might be faster in theory. • Disadvantages: • Irregular layout → wasted area • Many complex gates, which invites charge-sharing problems. • Many inputs per stage lead to long wires → delay problems at high frequency • Carry-select logic increases the area.

  43. Comparison between H-C adder and Ling Adder (cont.) • Han-Carlson Adder: • Disadvantage: longer path to the output • Advantages: • Regular layout for each stage • Fewer inputs per path, which eases the interconnect problem • Simpler gates mean fewer charge-sharing problems

  44. Future Work • Power reduction by inserting sleep transistors • Speed improvement by inserting discharge transistors at the intermediate stack nodes of the dynamic stages during the precharge phase • Area reduction in layout • SOI model test • Self-resetting to minimize the clock period

  45. Reference • A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction Scheduler Loop, ISSCC 2002 • Sub-500-ps 64-b ALUs in 0.18-um SOI/Bulk CMOS: Design and Scaling Trends, JSSC, Nov, 2001 • Fast Area-Efficient VLSI Adders, Proc. 8th Symp. Computer Arithmetic, Sept. 1987

  46. Reference (cont.) • Computer Arithmetic, Algorithms and Hardware Design. Behrooz Parhami, Oxford University Press. • Advanced Computer Arithmetic Design. Michael J. Flynn, et al. John Wiley & Sons, INC. • 5 GHz 32b Integer-Execution Core in 130nm Dual-Vt CMOS, ISSCC 2002 • Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, JSSC Aug. 1999
