
Kaushik Roy Purdue University

Design in the Nano-meter Regime: From Devices to System Architecture


Presentation Transcript


  1. Design in the Nano-meter Regime: From Devices to System Architecture. Kaushik Roy, Purdue University

  2. Challenges ahead … in the Si nanometer regime

  3. Exponential Increase in Leakage. [Figure: MOSFET cross-section (gate, source, drain, n+ regions, bulk) showing gate leakage, subthreshold leakage, and junction leakage.] [Figure: technology timeline from 1970 to 2020, feature sizes 5 µm, 1 µm, 100 nm, 10 nm: silicon microelectronics, silicon nanoelectronics, non-silicon technology.] [Plot: leakage power (% of total, 0% to 50%) vs. technology (1.5, 0.7, 0.35, 0.18, 0.09, 0.05 µm); annotation: "Must stop at 50%". Source: A. Grove, IEDM 2002.]

  4. Technology Trend (timeline: 2003, 2009, 2020). Single-gate devices: Bulk-CMOS, PD/SOI, FD/SOI. Multi-gate devices: DGMOS, FinFET, Trigate. Nano devices: carbon nanotubes, III-V devices, nanowires, spintronics. Design methods to exploit the advantages of technology innovations.

  5. Variation in Process Parameters. [Plot: number of dopant atoms (10 to 10000) vs. technology node (1000, 500, 250, 130, 65, 32 nm). Source: Intel.] [Figure: Device 1 vs. Device 2, channel length; delay and leakage spread. Source: Intel.] Inter- and intra-die variations; random dopant fluctuation. Device parameters are no longer deterministic.

  6. Reliability. [Plots: failure probability vs. time across technology generations; lifetime degradation.] Defects; temporal degradation of performance (NBTI).

  7. Device-Aware Circuit/Architecture is Essential: the right type of device with the right circuit and architecture

  8. Low-Power and High-Performance VLSI Research (Professor Kaushik Roy, ECE, Purdue University). Research areas: Low Power VLSI Signal Processing; Reliability, Noise & Power Del.; Nano Circuits & Arch.; Failure Analysis & Yield; Performance/Power Aware Computing; Process Variation. Topics: Wireless Communications (low power; coding / modulation). High Speed Arithmetic (sharing multiplier for vector scaling). NBTI (analysis; design for rel.). Power Delivery. Self-Healing / Self-Calibration. Low Complexity (differential / redundant coeff.; distributed multiplication; filter / image compression). Process-Tolerant Design (logic: sizing, body bias; memory). Carbon Nano-tubes (circuits; architecture). Leakage Control (subthreshold, gate, jn. BTBT, GIDL, ..; transistor stacking; multiple Vt; dynamic Vt). Wavelet based Idd Analysis (Idd testing; mixed signal). Device/Circuit Design. Active Leakage Reduction. Digital Sub-threshold Logic (ultra low power; self adjusting Vt). Device Modeling & Circuits (bulk, SOI). Caches (reconfigurable cache; gated gnd, clocking; dynamic Vdd). Optimal SOI Devices (DG-SOI; 3D-SOI). Electro-thermal Design. Device/Circuit/Arch co-design. Low Leakage Memory (dynamic Vt; DRG cache).

  9. Memories: Leakage Reduction & Process Compensation

  10. Device-aware Circuit/Microarch: Cache. Device options: Nominal Vt, Ground-plane SOI, Bulk Ultra-high Vt, FinFET. Circuit Design Issues: leakage (sub-threshold, gate, junction, BTBT); stability (read noise margin, writability, soft errors); delay (decoder, wordline, bitline, MUX, sense-amp, driver); transition between active and standby modes; variations (process, Vdd, temperature). Microarch Design Issues: array aspect ratio (# cells WL/BL); sub-array structure and selection strategy; active-standby transition frequency, delay, energy. How do you co-design?

  11. Source-biased (Supply Gated) Cache, Bulk Nominal Vt. Device options: Nominal Vt, Ground-plane SOI, Bulk Ultra-high Vt, FinFET. [Figure: SRAM array with a hot cache line, column I/O, and periodic sleep generation (TSLEEP); self-decay sleep control circuit (VDECAY delay, VREF, VSB, VBIAS GEN, CLK, SLEEP); VGND holding circuit (VSB, SLEEP, VGND).] SB-SRAM Circuit Design Issues: data retention (VGND should be strapped); noise issue; process variation tracking sleep control. SB-SRAM Microarch Design Issues: use locality of reference in cache to reduce transition energy; optimum memory sub-array size selection; sleep time Tsleep selection. Co-design approach leads to higher payoffs and more opportunities.

  12. Basic Idea: Supply Gating. [Figure: two stacked off transistors, M1 above M2, gates at '0', intermediate node voltage VM > 0.] Single off transistor: Vgs = 0, Vbs = 0, Vds = Vdd. For M1: Vgs = -VM < 0, Vbs = -VM < 0, Vds = Vdd - VM < Vdd. For M2: Vgs = 0, Vbs = 0, Vds = VM < Vdd. Effects: negative Vgs; negative Vbs (more body effect); reduced Vds (less DIBL). A 2-T stack has lower subthreshold leakage.
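
The stacking argument on this slide can be checked numerically. The following is a minimal sketch, not the author's model: it uses a textbook-style exponential subthreshold current with a linearized body effect and DIBL term, and every parameter value (`i0`, `vth0`, `n`, `gamma`, `eta`, the 1 V supply) is an illustrative assumption. It bisects on the intermediate node voltage VM until M1 and M2 carry the same current, then compares the stack with a single off transistor.

```python
import math

V_THERMAL = 0.0259  # kT/q near room temperature, in volts

def i_sub(vgs, vds, vsb, i0=1e-7, vth0=0.3, n=1.5, gamma=0.2, eta=0.1):
    """Simplified subthreshold current (A). All defaults are illustrative assumptions."""
    # Linearized body effect raises Vth with Vsb (= -Vbs); DIBL lowers it with Vds.
    vth = vth0 + gamma * vsb - eta * vds
    return (i0 * math.exp((vgs - vth) / (n * V_THERMAL))
            * (1.0 - math.exp(-vds / V_THERMAL)))

def stack_leakage(vdd=1.0, iterations=100):
    """Bisect on the intermediate node VM until M1 and M2 carry equal current."""
    lo, hi = 0.0, vdd
    vm, i_m2 = 0.0, 0.0
    for _ in range(iterations):
        vm = 0.5 * (lo + hi)
        i_m1 = i_sub(vgs=-vm, vds=vdd - vm, vsb=vm)  # upper device, source at VM
        i_m2 = i_sub(vgs=0.0, vds=vm, vsb=0.0)       # lower device, source at ground
        if i_m2 < i_m1:
            lo = vm   # more current flows in than out, so VM must be higher
        else:
            hi = vm
    return vm, i_m2

vdd = 1.0
i_single = i_sub(vgs=0.0, vds=vdd, vsb=0.0)   # one off transistor, no stack
vm, i_stack = stack_leakage(vdd)
print(f"VM = {vm * 1e3:.1f} mV, single = {i_single:.3e} A, stack = {i_stack:.3e} A")
```

With these assumed parameters VM settles at a small positive voltage, and the stack current is lower than the single-transistor current for the three reasons listed on the slide.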

  13. Source-Biasing: Retaining Data During Inactive Mode. [Figure: SRAM cells on a virtual ground (VGND) line with a SLEEP transistor and a VGND holding circuit referenced to VSB.] The sleep transistor cuts off VGND from ground during sleep mode. VGND is strapped using different circuit schemes.

  14. 16K-Byte SRAM Organization. [Figure: block diagram with predecoder (address inputs A<7:5>, A<4:2>, A<1:0>), x64 decoder/driver, WL<3:0>, 512-cell bitlines, BLOCK_SEL, MP1, distributed sleep transistors (SL, SLEEP) on VGND, VSB, self-decay circuit, column I/O, precharge.] Active leakage reduction SRAM; distributed sleep transistors; SRAM block turned on ahead of time; self-decay circuit for low dynamic power overhead.

  15. 2x16K-Byte SRAM Testchip. Technology: 180nm 6-metal CMOS. Chip size: 3.3 x 2.9 mm². Supply voltage: 1.8V. Threshold voltage: NMOS 0.53V, PMOS -0.53V. Read access cycle: 984MHz @ 1.8V, RT. Active current: 0.14mW/MHz @ 1.8V. Standby current: 7.27μA (16KB array). Kim, Roy, ISSCC’05.

  16. Measured Leakage Reduction. [Bar chart: leakage (A), 0 to 8e-6, at 1.8V, 45 C, split into junction leakage, bitline leakage, and cell leakage; Conventional vs. This work; 94.2% reduction marked.] 94.2% total leakage reduction at VGND=0.9V. Raising VGND also reduces gate tunneling leakage.

  17. Forward-biased Cache, Bulk Ultra-High Vt. Device options: Nominal Vt, Ground-plane SOI, Bulk Ultra-high Vt, FinFET. [Figure: SRAM cell with WL, VDD, BL, BLB, PWELL, GND.] Device: strong halo, low ISUB, FBB to ↑ ION. FB-SRAM Circuit Design Issues: zero body bias in standby to reduce leakage; FBB in active mode to improve speed; early sub-array selection to hide body-bias transition latency. FB-SRAM Microarch Design Issues: use MSB of memory address for early selection of memory sub-array; use locality of reference in cache to reduce transition energy. Co-design approach gives large leakage savings.

  18. 32x32 Forward Body-Biased Sub-array. [Figure: 32x32 sub-array with wordlines WL0 to WL31, sub-array select (SUBSL), transistors M1, M2, M3, MP, MA, MN, a 0.4V power supply, and VPWELL.]

  19. Comparison. [Figure: Conventional (VT=270mV), SBSRAM (VT=270mV), and FBSRAM (VT=350mV); VSL and VPWELL waveforms in active and standby modes, with levels VDD, 0.5V, 0.2V, 0V.] SBSRAM (DRG) has been proven with Si measurements. Dynamic VDD and RBB SRAM have fundamental design issues. MEDICI: gate/BTBT leakage is also modeled.

  20. 32KB Cache Total Leakage Reduction. [Bar chart: power consumption (W), 0 to 0.25, for Conventional, SBSRAM, and FBSRAM; bar labels 230mW, 83mW, 84mW; components: dynamic power overhead, leakage power (selected subarray), leakage power (unselected subarrays); 64% total leakage reduction marked.] SBSRAM and FBSRAM are designed to give iso-leakage savings. 64% total leakage reduction including overhead.
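
The 64% figure can be cross-checked against the bar labels. This is a small arithmetic sketch that assumes the 230mW label belongs to the conventional cache and that the 83mW and 84mW labels belong to the two low-leakage designs; the transcript does not state that mapping explicitly.

```python
def reduction_percent(baseline_mw, reduced_mw):
    """Percentage reduction of reduced_mw relative to baseline_mw."""
    return 100.0 * (baseline_mw - reduced_mw) / baseline_mw

CONVENTIONAL_MW = 230.0  # assumed to be the conventional-cache bar

for label_mw in (83.0, 84.0):
    pct = reduction_percent(CONVENTIONAL_MW, label_mw)
    print(f"{CONVENTIONAL_MW:.0f} mW -> {label_mw:.0f} mW: {pct:.1f}% reduction")
```

Under that assumption the 83mW bar corresponds to about 63.9% and the 84mW bar to about 63.5%, which is consistent with the rounded 64% quoted on the slide.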

  21. Process Variations & Process-Tolerance

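  22. Robust Design: Process Variations in On-chip SRAMs. [Figure: 6T SRAM cell with WL, bitlines BL and BR, pull-ups PL and PR, access transistors AXL and AXR, pull-downs NL and NR, storing '1' and '0'; High-Vt and Low-Vt cases.] Parametric failures: read, write, access, hold. Simulation example of a 64KB cache. Parametric failures can degrade SRAM yield.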

  23. Inter-die Variation & Memory Failure. [Plot: cell failure probability and memory failure probability vs. inter-die Vt shift [V], from LVT through nominal Vt to HVT; BPTM 70nm devices. Region A: high RF/HF. Region B: low failures. Region C: high AF/WF.] Memory failure probabilities are high when the inter-die shift in process is high.

  24. Self-Repairing SRAM Array. [Plot: failure regions A, B, C across LVT, nominal Vt, and HVT.] Region A (LVT corner): read & hold failures dominate; reduce RF & HF. Region C (HVT corner): access & write failures dominate; reduce AF & WF. Reduce the dominant failures at different inter-die corners to increase the width of the low-failure region.

  25. Self-Repair using Leakage Monitoring. [Figure: on-chip leakage monitor (current monitor circuit with VDD bypass switch) on the SRAM array; its output VOUT is compared against VREF1 and VREF2 by comparators; the comparator outputs and a calibrate signal drive the body-bias generator, which applies body bias to the SRAM array.] [Plot: VOUT for LVT, nominal Vt, and HVT arrays relative to VREF1 and VREF2; BPTM 70nm.] Entire array leakage is monitored to detect the inter-die corner, and the proper body bias is applied.
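
The decision made by the comparators can be sketched as a three-way classification. This is a hedged illustration only: it assumes the monitor output rises with array leakage and that VREF1 is the lower reference, neither of which the transcript states, and the function name `select_body_bias` and the example voltages are hypothetical.

```python
def select_body_bias(v_out, v_ref1, v_ref2):
    """Pick a body-bias mode from the leakage-monitor output.

    Assumptions (not from the slides): v_out increases with array leakage,
    and v_ref1 < v_ref2 bracket the nominal-Vt range.
    """
    if v_ref1 >= v_ref2:
        raise ValueError("expected v_ref1 < v_ref2")
    if v_out > v_ref2:
        return "RBB"  # high leakage: low-Vt corner, raise Vt with reverse body bias
    if v_out < v_ref1:
        return "FBB"  # low leakage: high-Vt corner, lower Vt with forward body bias
    return "ZBB"      # nominal corner: zero body bias

# Hypothetical monitor readings against hypothetical references of 0.4 V and 0.6 V.
for reading in (0.30, 0.50, 0.70):
    print(reading, select_body_bias(reading, v_ref1=0.4, v_ref2=0.6))
```

The mapping follows the corner analysis on the previous slides: the low-Vt corner is leaky and dominated by read and hold failures, so its threshold is raised, while the high-Vt corner is dominated by access and write failures, so its threshold is lowered.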

  26. Test-Chip of Self-Repairing SRAM. [Die photo labels: 64 KB array, 16 KB block, isolated cell, LVT array, VCO, BB gen, sensor + ref. gen.] Technology: IBM 0.13 µm, dual-Vt, triple-well. Contents: 128KB SRAM, leakage sensor, reference & body-bias gen. Number of transistors: ~7 million. Die size: 16 mm².

  27. Yield Enhancement using Self-Repair. [Plots: memory failure probability vs. inter-die Vt shift [mV], BPTM 70nm, for a 256KB SRAM with no body bias and for a 256KB self-repairing SRAM (regions ZBB, RBB, FBB).] Self-repairing SRAM using body bias can significantly improve design yield.

  28. Self-repair: Architecture Level

  29. Fault-Tolerant Cache Architecture. [Figure: cache with faulty blocks marked.] BIST detects the faulty blocks. Config Storage stores the fault information. The idea is to resize the cache to avoid faulty blocks during regular operation.

  30. Effective Yield of 64K Cache. [Plot curves: proposed arch. yield, conv. yield, yield without any redundancy; optimum r = 3.] ECC + redundancy yield ~ 77%. Proposed architecture + redundancy yield ~ 94%.

  31. Fault Tolerant Capability. [Plot annotations: more chips saved compared to ECC; ECC fails to save any chips.] The proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells. It saves more chips than ECC for a given NFaulty-Cells.

  32. CPU Performance Loss. Miss rate increases due to downsizing of the cache. Average CPU performance loss over all SPEC 2000 benchmarks for a cache with 890 faulty cells is ~ 2%.

  33. Logic: Active Leakage Reduction. Dual-Vt; Transistor Stacking

  34. Leakage Reduction: Supply Gating for Logic. [Figure: logic block with input and output, connected to VDD through a VDD-gating control and to GND through a GND-gating control.] Pros: 5-20X leakage reduction; scalable; design ease. Cons: delay/area overhead; floated output; can be applied to idle sections only. How to use supply gating dynamically in active mode?

  35. Dynamic Supply Gating (DSG): An Example. 3-to-8 row decoder (pre-decoder, post-decoder). How to do it for random logic?

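  36. Dynamic Supply Gating for General Circuits. Shannon’s expansion: f = xi·CF1 + xi'·CF2, where CF1 = f(xi = 1) and CF2 = f(xi = 0); xi is referred to as the control variable. [Figure: f built from cofactor blocks CF1 (f1) and CF2 (f2) on the inputs, selected by a MUX on xi; a second expansion on xj gives CF11 and CF12, selected by xixj and xixj'.] Control variable selection is important.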
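
Shannon's expansion, f = xi·f(xi = 1) + xi'·f(xi = 0), is easy to verify exhaustively on a small function. The sketch below is illustrative: `example`, `cofactor`, and `shannon` are hypothetical names, and the three-input function is made up. The MUX in `shannon` evaluates only the cofactor selected by the control variable, which is the property dynamic supply gating relies on to keep the other cofactor block idle.

```python
from itertools import product

def cofactor(f, i, value):
    """Return f with input i fixed to value (a cofactor of f)."""
    def fixed(*xs):
        xs = list(xs)
        xs[i] = value
        return f(*xs)
    return fixed

def shannon(f, i):
    """Rebuild f as a MUX between its two cofactors on control variable i."""
    cf1 = cofactor(f, i, 1)  # active when x_i = 1
    cf2 = cofactor(f, i, 0)  # active when x_i = 0
    def expanded(*xs):
        # Only the selected cofactor is evaluated; the other could be gated.
        return cf1(*xs) if xs[i] else cf2(*xs)
    return expanded

def example(a, b, c):
    """Made-up 3-input Boolean function on 0/1 integers."""
    return (a & b) | ((1 - a) & c)

for control in range(3):
    rebuilt = shannon(example, control)
    assert all(rebuilt(*bits) == example(*bits)
               for bits in product((0, 1), repeat=3))
print("Shannon expansion matches the original for every control variable")
```

The expansion is valid for any choice of control variable; the choice only changes how large and how critical each cofactor block is, which is why the slide stresses control variable selection.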

  37. Simulation Results: MCNC Benchmarks, 70nm Process, Vdd=1V, Temp=100°C

  38. Logic: Process Variation & Tolerance. Transistor Sizing for Yield (Statistical Design); Transistor Sizing for Efficient Speed-Binning; Shadow Latches (Razor); Pipeline Balancing/Imbalancing; Vdd Scaling & Critical Path Isolation (ICCAD’06)

  39. Design Considerations for Low Power and Robust Circuit. [Figure: number of paths vs. path delay against clock period Tc (CLK) for Design A (VDD=1V) and Design B (VDD<1V), with path groups S1, S2, S3.] Proposed approach: few predictable critical paths; low activation probability of critical paths; slack between critical and non-critical paths under variations. Critical paths: predictable and restricted to a logic section having low activation probability. [Figure: original circuit from inputs to primary outputs (PO), and its partitioned form with blocks f1 to f4 selected by a decoder and combined by an OR network at the PO.]

  40. Critical Path Isolation By Control Variable Selection. [Figure: example functions f1 and f2 over inputs X1 to X9, expanded into cofactors f1(CF1), f2(CF1) and f1(CF2), f2(CF2) for candidate control variables x4 and x1.] ai: literal count of xi. bi: literal count of xi’.

  41. Further Isolation by Hierarchical Partitioning and Sizing. [Figure: original circuit (inputs to PO) decomposed into cofactor blocks feeding a MUX network at the PO: CF10 (Xi = 1), CF20 (Xi = 0), CF32 ((Xi, Xj) = (1,1)), CF42 ((Xi, Xj) = (1,0)), CF53 ((Xi, Xj, Xk) = (1,0,1)), CF63 ((Xi, Xj, Xk) = (1,0,0)); levels labeled LEVEL1 (50%), LEVEL2 (25%), LEVEL3 (12.5%).] Stopping conditions: area, delay constraints. Advantages of Shannon decomposition: critical paths can be isolated; activation of errors can be predicted ahead of time; activation probability of critical paths can be reduced.
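
The 50%, 25%, and 12.5% labels match what one would expect if each control variable is independently 0 or 1 with equal probability, so that a block guarded by k control variables is selected with probability 2^-k. That input model is an assumption made here for illustration; the slide does not state it. The sketch counts matching input combinations exactly.

```python
from fractions import Fraction
from itertools import product

def activation_probability(condition, n_controls=3):
    """Fraction of equiprobable control-variable assignments that select a block.

    condition maps a control-variable index to the value it must take,
    e.g. {0: 1, 1: 0} for (Xi, Xj) = (1, 0). Uniform independent inputs
    are an assumption of this sketch.
    """
    hits = sum(
        1
        for bits in product((0, 1), repeat=n_controls)
        if all(bits[i] == v for i, v in condition.items())
    )
    return Fraction(hits, 2 ** n_controls)

print(float(activation_probability({0: 1})))              # one control variable
print(float(activation_probability({0: 1, 1: 0})))        # two control variables
print(float(activation_probability({0: 1, 1: 0, 2: 1})))  # three control variables
```

Each additional level of partitioning halves the chance that a given block is exercised, which is how the decomposition lowers the activation probability of the isolated critical paths.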

  42. Simulation Results for Pipeline-based Design. [Figure: pipeline from inputs to outputs with CLK freeze; D1, D2, D3 are decoding logic (80ps, 85ps, 70ps); stages labeled cht, mux, ..., cm150a.] [Plots: % improvement in power (0 to 100) at input switching probability 0.2 and 0.5, and VDD [V] (0 to 1), for benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2, count.] Avg performance penalty = 5.9% for switching activity = 0.5. Avg power saving = 60%, avg area penalty = 18%.

  43. Ultra Low Power: Subthreshold Leakage for computation?? References: Soeleman, Kim, Roy, ISLPED 2000/2001, TVLSI 2001, TVLSI 2003; Raychowdhury, Kim, Roy, ISLPED 04, TED 2004/2005, TVLSI 2005…

  44. Subthreshold Operation. [Plot: IDS (A/µm), 1e-9 to 1e-3 on a log scale, vs. VGS (0 to 0.8 V); region of operation below Vth (Vdd < Vth); IDS ∝ exp(VGS - VTH) and not (VGS - VTH).] [Plot: CGATE (fF/µm), 0.5 to 1, vs. VGS (0 to 0.8 V); in the region of operation below Vth, CGATE < COX.]
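
The exponential dependence of IDS on VGS below threshold can be made concrete with the usual textbook form, in which the exponent is normalized by n·kT/q. This is a minimal sketch under stated assumptions: the values of `vth`, `i0`, and the slope factor `n` are made up, and the drain-voltage term is dropped, which is reasonable only when VDS is several times kT/q.

```python
import math

V_THERMAL = 0.0259  # kT/q near room temperature, in volts

def ids_subthreshold(vgs, vth=0.4, i0=1e-6, n=1.5):
    """Subthreshold drain current (A/um); vth, i0, and n are illustrative."""
    return i0 * math.exp((vgs - vth) / (n * V_THERMAL))

def swing_mv_per_decade(n=1.5):
    """Gate-voltage change (mV) that changes the current by a factor of 10."""
    return 1e3 * n * V_THERMAL * math.log(10)

s_volts = swing_mv_per_decade() / 1e3
ratio = ids_subthreshold(0.2 + s_volts) / ids_subthreshold(0.2)
print(f"swing = {swing_mv_per_decade():.1f} mV/decade, current ratio = {ratio:.2f}")
```

On a log-scale IDS axis this exponential appears as a straight line below Vth, which is the region of operation marked on the slide.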

  45. Super-threshold → Sub-threshold. Device optimization; circuit/architecture optimization. Applications: wireless, medical. Dev/Cir/Arch optimization is necessary. [Figure: design goal on a power vs. throughput plot, with a power ceiling.]

  46. Is scaling necessary? Device for sub-threshold operation??

  47. Scaling & Subthreshold Operation. [Plot: average power (x 10^-7 J) at iso-performance (3.4ns) vs. technology node (250, 180, 130, 90 nm); supply-voltage labels 280mV, 200mV, 500mV, 420mV.] Reduced L => reduced capacitance. Scaling is essential even for subthreshold operation.

  48. Are standard transistors good for subthreshold operation too?

  49. Doping Profile: Std. vs. Proposed. [Figure panels: Proposed Device, Standard Device.]

  50. Proposed Device vs. Std. Device @ iso-performance (3.4ns). [Plot: average power (x 10^-7 J) vs. technology node (250, 180, 130, 90 nm); supply-voltage labels 200mV, 280mV, 500mV, 420mV, 180mV; 48% marked.] Raychowdhury, Paul, Roy; IEEE TED, Feb’05, ISLPED’04.
