This paper discusses the back-end design flow and implementation of a high-performance asynchronous ASIC built from single-track full-buffer (STFB) standard cells.
High Performance Asynchronous ASIC Back-End Design Flow Using Single-Track Full-Buffer Standard Cells
Marcos Ferretti, Recep O. Ozdag, Peter A. Beerel
Department of Electrical Engineering Systems, University of Southern California
Key to High-Speed Async Design
• Completion detection demands 2-D pipelining
[Figure: bundled-data pipeline (control logic, latches, datapath) next to a 2-D pipeline (pipeline stages linked by async. channels)]
USC Asynchronous CAD/VLSI Group
Asynchronous Channels
[Figure: sender/receiver diagrams of the channel types below]
• Bundled-data channel: Req/Ack control channel, single-rail data stable between latches
• 1-of-N channel: 1-of-N data with a separate acknowledge
• 1-of-N single-track channel: 1-of-N data and acknowledge
• GasP bundled-data channel
GasP (Sutherland et al.’01)
• fw = 4, t = 6 (includes latch setup time and delay)
• Bundled-data pipeline using single-track control
[Figure: GasP control circuit (self-resetting NAND, staticizer) sending a pulse to the data latches of the datapath]
Precharge Half-Buffer (Lines’98)
• fw = 2, t = 14+
• 2-D pipeline using 1-of-N delay-insensitive channels and QDI cells
[Figure: Precharge Half-Buffer template and schematic for each output rail: NMOS transistor stack, LCD, RCD, C-element, Pc and Eval controls]
Single-Track Asynchronous Pulsed Logic (Nyström’01)
• fw = 2, t = 10
• STAPL uses pulse generators to control driver activation timing
[Figure: STAPL template and schematic for dual-rail output: NMOS transistor stack, RCD, pulse generators, reset]
Single-Track Full-Buffer (Ferretti’02)
• fw = 2, t = 6
• Small and fast
[Figure: STFB block diagram, schematic for dual-rail output (NMOS transistor stack, SCD, RCD, reset) and timing diagram of L, S, A, B, R]
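The fw and t figures quoted on the four template slides can be put side by side with a few lines of Python. This is only a sketch: it assumes fw is the forward latency and t the cycle time, both counted in gate transitions, and it treats the "14+" of the Precharge Half-Buffer as 14. The dictionary and function names are hypothetical, and transition counts alone say nothing about transistor sizing or wire load.

```python
# Transition counts as quoted on the template slides: (fw, t).
# Assumption: fw = forward latency, t = cycle time, both in gate
# transitions; "14+" for the Precharge Half-Buffer is taken as 14.
TEMPLATES = {
    "GasP": (4, 6),
    "PCHB (QDI)": (2, 14),
    "STAPL": (2, 10),
    "STFB": (2, 6),
}


def relative_cycle(name, reference="STFB"):
    """Return the cycle-time ratio of `name` to the reference template."""
    return TEMPLATES[name][1] / TEMPLATES[reference][1]


for name, (fw, t) in TEMPLATES.items():
    ratio = relative_cycle(name)
    print(f"{name:11s} fw={fw} t={t} cycle vs STFB = {ratio:.2f}x")
```

Under these assumptions STFB matches the GasP cycle time while keeping the two-transition forward latency of the QDI and STAPL templates.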
STFB: Tradeoff Speed for Robustness
• Features of STFB
• 3x faster than QDI and about half the size
• Smaller and faster than STAPL
• Smaller forward latency and fewer timing assumptions than GasP
[Figure: performance versus robustness chart placing GasP (Sutherland - Sun), STFB (Ferretti - USC), STAPL (Nyström - Caltech) and QDI (Lines - Caltech)]
Motivation and Goals
• Develop a methodology to design STFB-based asynchronous circuits using conventional CAD tools
• Create an STFB standard cell library
• Make the library publicly available
• Design and fabricate a demonstration test chip
• Evaluate the results
Ultimate Goal: Full-custom Performance with ASIC Design Times
Outline
• STFB standard-cell design
• Backend design flow
• Demonstration test chip
• Conclusions
STFB Standard-Cell Design
• Transistor sizing
• STFB channels are point to point (no forked wires)
• One size per cell in the library is adequate
STFB Standard-Cell Design
• Transistor sizing
• 2x min. size N-stack strength
• 1:4-5 drive ratio
• Up to 1 mm long wire
[Figure: sized STFB stage schematics (NMOS transistor stack of width Wn, SCD, RCD; labeled widths include 2.8, 5 and 10; 2x and 8x annotations) driving a wire of ≤ 1 mm. TSMC 0.25 µm, widths in µm and all lengths 0.24 µm]
STFB Standard-Cell Design
• Balanced response SCD/RCD
• SCD: balanced NAND (2x)
• RCD: balanced NOR (1x)
• Data-independent timing assumptions
[Figure: transistor-level schematics of the balanced NAND and NOR with widths of 1.2, 1.4 and 2.8. TSMC 0.25 µm, widths in µm and all lengths 0.24 µm]
STFB Standard-Cell Design
• STFB_POUT sub-cell
• Yields less load on B and faster S reset
[Figure: STFB_POUT sub-cell schematic and layout, with callouts "fast S reset", "fights charge-sharing" and "fights leakage current" (staticizer). TSMC 0.25 µm, widths in µm and all lengths 0.24 µm]
STFB Standard-Cell Design
• Reset transistors
• Initial idea: 3-input NAND
• 1-of-2 cell: 2-input NAND + inverter
• 1-of-3 cell: two 2-input NAND
• 2-input NAND → less load on S
[Figure: reset transistors, reset inverter and NAND layout (from STFB_XOR2 cell). TSMC 0.25 µm, widths in µm and all lengths 0.24 µm]
STFB Standard-Cell Design
• Direct-path current analysis
• Average direct-path current is similar to inverter
[Figure: voltage and direct-path current (Idp) waveforms versus time for an inverter (Vin, Vout, transistors M1 and M2, one peak Ipeak) and for an STFB stage (nodes A and Sx, peaks Ipeak1 and Ipeak2), with the VDD - Vtp and Vtn thresholds marked]
Standard-Cell Library Development (Ozdag’04)
• Template specifications, cell specifications
• Symbol, Schematic and Functional (Virtuoso, Emacs)
• Simulation (Verilog, Hspice)
• Layout (Virtuoso), following the standard cell specifications
• LVS/DRC (Dracula/Diva)
• Cell Abstract (Envisia)
• Asynchronous Cell Library views: Symbol, Schematic, Functional, Layout, Abstract
Same tools and flow as synchronous
Asynchronous ASIC Design Flow (Ozdag’04)
• Design specifications
• Schematic (Virtuoso)
• Simulation (Verilog, Nanosim)
• Place & Route (Silicon Ensemble)
• Chip Assembly (Virtuoso)
• LVS/DRC (Dracula/Diva)
• Chip Fabrication
• Asynchronous Cell Library views used: Symbol, Schematic, Functional, Abstract, Layout
Same tools and flow as synchronous
Cell Layout Example: STFB2_XOR2
• Each cell comprises an entire STFB pipeline stage
[Figure: STFB2_XOR2 schematic and layout with inputs A0, A1, B0, B1, two STFB_POUT sub-cells driving R0 and R1, and the SCD, RCD and reset circuitry]
Prefix Adder (Goldovsky’99)
• STFB2_FORK (fork stage)
• STFB2_BUFFER (buffer stage)
• STFB2_XOR2 (2-input xor stage)
• STFB3_AB_KPG and STFB3_AB_KPG2
• STFB3_KPG2_KPG and STFB3_KPG2_KPG2
• STFB3_KPGC_C and STFB3_KPGC_C2
[Figure: 8-bit prefix adder with inputs a7…a0, b7…b0 and c-1, outputs c7 and s7…s0; dimension annotations 3 + ⌈log2 n⌉ and 2*n + 1]
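The cell names on this slide suggest that each bit position is encoded as kill, propagate or generate (KPG) and that carries are resolved by a parallel-prefix tree. The Python model below is a minimal behavioral sketch of that idea, not the authors' circuit: it uses a Kogge-Stone style scan, and every function name in it is hypothetical. It does not reproduce the pipeline-stage structure or the depth annotated on the slide.

```python
def kpg(a, b):
    """Encode one bit position as kill ("K"), propagate ("P") or generate ("G")."""
    if a and b:
        return "G"
    if a or b:
        return "P"
    return "K"


def combine(high, low):
    """Prefix operator: a propagate passes the lower-order state through."""
    return low if high == "P" else high


def prefix_add(a_bits, b_bits, carry_in=0):
    """Add two little-endian bit lists; return (sum_bits, carry_out)."""
    n = len(a_bits)
    # Slot 0 holds the carry-in as a K/G symbol, slots 1..n the bit positions.
    state = ["G" if carry_in else "K"] + [kpg(a, b) for a, b in zip(a_bits, b_bits)]
    dist = 1
    while dist < len(state):  # ceil(log2(n + 1)) scan levels
        state = [
            combine(state[i], state[i - dist]) if i >= dist else state[i]
            for i in range(len(state))
        ]
        dist *= 2
    # After the scan, slot i is K or G: the carry entering bit i.
    carries = [1 if s == "G" else 0 for s in state]
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries)]
    return sums, carries[n]


def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


s, cout = prefix_add(to_bits(200, 8), to_bits(100, 8), carry_in=1)
print(from_bits(s), cout)  # 200 + 100 + 1 = 301 = 256 + 45, so this prints "45 1"
```

Seeding slot 0 with the carry-in means no separate carry-merge step is needed in the model; a hardware implementation may well handle the carry-in differently.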
64-bit Adder Block
• Silicon Ensemble P&R
• Steps: floor plan, plan power, pins and cell placement, filler cell, routing
• M4 and M5 power grid
• 129 rows
• Input pins on the left (A64, B64 and C)
• Output pins on the right (S64 and C)
• 70% area utilization
[Figure: flow excerpt, Schematic (Virtuoso) followed by Place & Route (Silicon Ensemble)]
Input Generator Block
• Flexible and fast input generation
[Figure: 8x8 single-rail to single-track converter (8x8 data d0…d7, 4 address a0…a3, 12x STFB2_SRST) feeding 4 levels of STFB2_SPLIT; operand A from 64 9-stage rings, operand B from 64 9-stage rings, carry in (Cin) from one 9-stage ring]
Output Sampler Block
• Flexible and fast output sampler
[Figure: the 65-bit result (64-bit sum + Cout) passes through three sampling stages labeled 1:10, 1:100 and 1:1000, each with 65x STFB2_SPLIT, 65x STFB2_BUCKET and a 30-stage ring. Ring patterns shown: 0010000000, 0000100000, 0000000100 and 1000000000. Sampled indices shown: 1,10,…; 1,100,…; 1,1000,… and 3,13,…; 43,143,…; 843,1843,…]
Simulation Results: Loading
• Nanosim
• Sampler: 10x4x4 = 160
[Figure: Nanosim waveforms of the loading phase, labeled Carry in, 3x B64, 3x A64 and Go!]
Simulation Results: Running
• Nanosim
• 112.9 ns / 160 = 0.706 ns
• 1 / 0.706 ns = 1.4 GHz
[Figure: Nanosim waveforms of the running phase, labeled Go!, Carry out and Sum, with a 112.9 ns interval marked]
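The throughput arithmetic on this slide is easy to reproduce. The sketch below assumes that 160 is the number of results produced during the 112.9 ns interval; the function name is hypothetical.

```python
def throughput(window_ns, count):
    """Return (average cycle time in ns, throughput in GHz) for a window."""
    cycle_ns = window_ns / count
    return cycle_ns, 1.0 / cycle_ns  # 1 / ns is GHz


cycle_ns, ghz = throughput(112.9, 160)
print(f"{cycle_ns:.3f} ns per result, {ghz:.2f} GHz")
```

The result is about 0.706 ns per result, or roughly 1.4 GHz, in line with the figures quoted on the slide.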
Simulation Results
Demonstration chip
• Top layout, TSMC 0.25 µm, MOSIS Mar/22/04
• Chip: 5483 µm x 3733 µm, 20.5 mm², 132 pins
• Contents: STFB 64-bit Adder and QDI Sequential Decoder (Session VI, 10:30am, Thu, Apr/22)
• STFB 64-bit Adder: 1963 µm x 1700 µm, 3.3 mm², 257k transistors, 2.9 A @ 1.4 GHz
• INPUTGEN129BY9 (801 µm): 1.36 mm², 105k transistors, 1.3 A @ 1.4 GHz
• ADDER64 (663 µm): 1.13 mm², 89k transistors, 1.3 A @ 1.4 GHz
• SAMPLER65BY1000 (499 µm): 0.85 mm², 62k transistors, 0.3 A @ 1.4 GHz
• Effort: ~6 man-months for the library, ~6 man-months for the design
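The dimensions and areas on this slide can be cross-checked against each other. The sketch below reads the layout dimensions as micrometres and assumes that the three blocks share the 1700 dimension and that the 801, 663 and 499 figures belong to the blocks in the order listed. Both pairings are inferred from the slide, and all names in the code are hypothetical.

```python
# name: (width_um, quoted_area_mm2, transistors_k, current_a)
BLOCKS = {
    "INPUTGEN129BY9": (801, 1.36, 105, 1.3),
    "ADDER64": (663, 1.13, 89, 1.3),
    "SAMPLER65BY1000": (499, 0.85, 62, 0.3),
}
HEIGHT_UM = 1700  # assumed to be shared by the three blocks


def area_mm2(width_um, height_um):
    """Rectangle area in mm^2 from dimensions in micrometres."""
    return width_um * height_um / 1e6


for name, (width, quoted, _, _) in BLOCKS.items():
    print(f"{name}: computed {area_mm2(width, HEIGHT_UM):.2f} mm2, quoted {quoted} mm2")

total_width = sum(width for width, *_ in BLOCKS.values())
total_current = sum(current for *_, current in BLOCKS.values())
print(f"adder total: {total_width} um wide, {area_mm2(total_width, HEIGHT_UM):.2f} mm2, "
      f"{total_current:.1f} A")
print(f"chip: {area_mm2(5483, 3733):.1f} mm2")
```

Under these assumptions the computed block areas match the quoted 1.36, 1.13 and 0.85 mm², the widths sum to 1963, and the chip area comes to about 20.5 mm².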
Summary and Conclusions
• Performance: STFB 2-D pipelining yields ultra-high performance
• Design time: back-end flow achieves ASIC design time
• Availability: cell library has been made freely available
• Future work: characterize and extend library; static timing analysis and sign-off
Efharisto! (Thank you!)
STFB Standard-Cell Design
• Dynamic worst-case direct-path current analysis (STFB buffer pipeline at 2 GHz)
• Non-overlap drive = less direct-path current than an inverter
[Figure: chain of four STFB buffer stages (L, R, A, Sx, RCD) connected by 1 mm wires. TSMC 0.25 µm, widths in µm and all lengths 0.24 µm]
Input Generator Block
• 9-stage ring
• STFB2_FORK (fork stage)
• STFB2_BUFFER (buffer stage)
• STFB2_XOR2 (2-input xor stage)
• STFB2_BITGEN (bit generator, BG)
• STFB2_MERGENC (non-conditional merge stage)
[Figure: 9-stage ring with in, out and go ports, holding a pattern of 0 and 1 tokens; example output sequence 1,0,0,1,0,0…]
Et²
• Comparison STFB vs. WCHB
• STFB buffer is ~3x more efficient than WCHB buffer
Demonstration chip
• Top layout, TSMC 0.25 µm, MOSIS Mar/22/04
• STFB 64-bit Adder: 1963 µm x 1700 µm, 3.3 mm², 257k transistors, 2.9 A @ 1.4 GHz
• INPUTGEN129BY9 (801 µm): 1.36 mm², 105k transistors, 1.3 A @ 1.4 GHz
• ADDER64 (663 µm): 1.13 mm², 89k transistors, 1.3 A @ 1.4 GHz
• SAMPLER65BY1000 (499 µm): 0.85 mm², 62k transistors, 0.3 A @ 1.4 GHz
• Pins: 12 In/Out, 8 Input and 3 pad supply pins, plus two groups of 7 Vdd and 7 Gnd pins
• Total: 51 pins
Test chip design
• Top chip layout, TSMC 0.25 µm, MOSIS Mar/22/04
• Chip: 5483 µm x 3733 µm, 20.5 mm², 132 pins
• Contents: STFB 64-bit Adder and QDI Sequential Decoder (Session VI, 10:30am, Thu)