Directions in Low-Power CAD
Dennis Sylvester, University of Michigan
dennis@eecs.umich.edu
http://vlsida.eecs.umich.edu
With acknowledgements to: Prof. David Blaauw, Dr. Sarvesh Kulkarni, Saumil Shah, Kavi Chopra
Topics
• A new dual-Vth assignment formulation
• Dual-Vdd power distribution
• Approaches to parametric yield optimization: statistical leakage + delay
Motivation
• We require high-performance yet low-power circuits
• Leakage power contributes significantly to total power
• An all-high-Vth implementation is too slow
• An all-low-Vth implementation is too leaky
• Dual-Vth processes are popular
Problem Definition
• Minimize: total circuit power
• Subject to: circuit delay constraint, sizing constraints
• Optimization variables: gate sizes, gate threshold voltages
[Figure legend: switching, subthreshold leakage. Source: S. Narendra et al., ICCAD ’03]
Gate Sizing + Vth Assignment Problem: Prior Work
• Traditionally a discrete problem
• Previous approaches
  • Separate sizing and Vth assignment
  • Mixed-integer non-linear programming
  • Sensitivity-based methods (DUET, etc.)
• Continuous formulation [Chen, ASP-DAC ’05]
  • Very reliant on the discretization heuristic
Proposed Approach – Self-snapping formulation
• Continuous formulation: a large variety of algorithms and powerful non-linear optimizers can be used
• The solution has almost all gates assigned to one of the two available threshold voltages
• A small fraction of gates end up with intermediate Vth values and can be handled heuristically
• The discretization algorithm has negligible power impact and can be very simple
Proposed Approach – Mixed-Vth Gates
• Consider each gate to be a parallel combination of high-Vth and low-Vth gates
• RC delay model [figure labels: HVt, LVt, mixed gate]
• Linear power model [figure labels: HVt gate, LVt gate]
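As a rough illustration of the mixed-Vth gate idea, the sketch below treats a gate as a high-Vth device of width `w_high` in parallel with a low-Vth device of width `w_low`. Only the parallel-combination structure and the linear power model come from the slide; the numeric constants and the function names are hypothetical placeholders, not values from the talk.

```python
# Hypothetical per-unit-width device parameters (illustrative only).
G_HIGH = 1.0   # drive conductance per unit width, high-Vth (slower)
G_LOW = 1.5    # drive conductance per unit width, low-Vth (faster)
P_HIGH = 1.0   # power per unit width, high-Vth
P_LOW = 5.0    # power per unit width, low-Vth (leakier)


def mixed_gate_delay(w_high, w_low, c_load):
    """RC delay of a mixed gate: parallel devices add their conductances."""
    conductance = G_HIGH * w_high + G_LOW * w_low
    if conductance <= 0:
        raise ValueError("gate needs a positive total width")
    return c_load / conductance


def mixed_gate_power(w_high, w_low):
    """Linear power model: each device contributes in proportion to its width."""
    return P_HIGH * w_high + P_LOW * w_low
```

Setting `w_low = 0` or `w_high = 0` recovers a pure high-Vth or pure low-Vth gate, which is what a snapped solution looks like in this representation.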
Complete Dual-Vth Problem Formulation
• Similar to the single-Vth gate sizing problem, with simple gate delays replaced by high-Vth/low-Vth parallel combinations
• Minimize: total circuit power
• Subject to: the circuit delay constraint and sizing constraints
Proof of Discretized Solution
• Conceptually separate the optimization process into two distinct phases:
  • D-Phase: fix the delays of all gates
  • W-Phase: find the minimum-power sizing solution that satisfies the chosen D vector
• Hypothetical separation for the proof; not implemented in the actual optimization procedure
W-Phase
• A proof of a discrete optimal solution under an arbitrary D vector is sufficient
• W-Phase formulation
  • Minimize: total power
  • Subject to: every gate meeting the delay fixed by the chosen D vector
W-Phase
• Linear programming problem
• n basic variables, n non-basic variables (two width variables per gate, n constraints)
• Therefore, at most n non-zero variables
• Every gate is snapped to either high Vth or low Vth
• Adding upper and lower bounds on total size leads to some non-snapped gates
  • Their number is extremely small; a simple heuristic achieves good results
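The snapping argument can be checked on a toy W-Phase instance. The sketch below is an assumption-laden illustration, not the optimizer used in the talk: it builds a hypothetical two-gate chain with fixed delays, writes each delay constraint as a linear equation in the high-Vth and low-Vth widths, and enumerates the basic solutions of the resulting linear program with exact arithmetic. All parameter values and helper names are invented for the example, and input capacitance is taken as 1 per unit width.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination; None if singular."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def best_basic_solution(constraints, rhs, cost):
    """Enumerate basic solutions of {A x = b, x >= 0}; return (cost, x) of the cheapest."""
    n_rows, n_vars = len(constraints), len(cost)
    best = None
    for basis in combinations(range(n_vars), n_rows):
        sub = [[row[j] for j in basis] for row in constraints]
        values = solve_square(sub, rhs)
        if values is None or any(v < 0 for v in values):
            continue  # singular basis or infeasible (negative width)
        x = [Fraction(0)] * n_vars
        for j, v in zip(basis, values):
            x[j] = v
        total = sum(c * xi for c, xi in zip(cost, x))
        if best is None or total < best[0]:
            best = (total, x)
    return best


# Two-gate chain: gate 1 drives gate 2, gate 2 drives a fixed load.
# Variable order: [h1, l1, h2, l2] = high-/low-Vth widths of gates 1 and 2.
G_H, G_L = Fraction(1), Fraction(3, 2)   # conductance per unit width (hypothetical)
P_H, P_L = Fraction(1), Fraction(5)      # power per unit width (hypothetical)
D1 = D2 = Fraction(2)                    # delays fixed by the D-Phase
C_WIRE, C_LOAD = Fraction(1), Fraction(4)

# A fixed delay D = C / G rearranges to D * G - C_fanout = C_fixed, linear in widths.
A = [
    [D1 * G_H, D1 * G_L, -1, -1],   # gate 1 is loaded by gate 2's total width
    [0, 0, D2 * G_H, D2 * G_L],     # gate 2 is loaded by C_LOAD
]
b = [C_WIRE, C_LOAD]
cost = [P_H, P_L, P_H, P_L]

power, widths = best_basic_solution(A, b, cost)
```

With n = 2 constraints and 2n = 4 variables, every basic solution has at most two non-zero widths, and each gate needs some width to meet its delay, so each gate ends up with exactly one non-zero entry. Vertex enumeration is exponential and is used here only because the instance is tiny.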
Practical Constraint – Fixed-Width Input Drivers
• Sequential elements drive the combinational circuit
• The delay of these elements is affected by primary input widths
• Modeled as fixed-width drivers
Extension of Discretization Analysis
• m+n constraints in the optimization problem
• n+m basic variables, n-m non-basic variables
• Therefore, at most n+m positive variables
• Total number of non-snapped gates is bounded by the number of inputs
  • Once again small in number; can be handled heuristically
• In practice, the number of non-snapped gates is found to be much less than the number of inputs
Discretization Heuristics
• Iterative snapping
  • Round gates to the closer Vth and re-optimize until a fully snapped solution is achieved
• Single-pass Vth assignment
  • Fix all gates to the closer Vth and re-optimize only for gate sizes
• The second heuristic is faster, with negligible power impact
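The rounding step of the single-pass heuristic might look like the sketch below. It assumes a mixed gate is described by a `(w_high, w_low)` width pair, that "closer Vth" means the device carrying the larger share of the drive, and that the snapped width should preserve the gate's total conductance as a starting point. These rules, the default conductance values, and the function names are assumptions for illustration; the subsequent re-optimization of gate sizes is not shown.

```python
def snap_gate(w_high, w_low, g_high=1.0, g_low=1.5):
    """Snap a mixed gate to one Vth, preserving its total drive conductance.

    Returns (vth, width), where vth is "high" or "low".
    """
    drive_high = g_high * w_high
    drive_low = g_low * w_low
    total = drive_high + drive_low
    if drive_high >= drive_low:
        return "high", total / g_high
    return "low", total / g_low


def single_pass_assignment(gates):
    """Apply snap_gate to every (w_high, w_low) pair in one pass."""
    return [snap_gate(w_high, w_low) for w_high, w_low in gates]
```

Already-snapped gates pass through unchanged, so in a solution where almost all gates are snapped, only the few intermediate gates are actually modified.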
Results
• Snapping properties of some circuits
  • The number of non-snapped gates is very small
  • Dominated by gates at the upper and lower size bounds
• The approach is easily extendable to multi-Vth and multi-Lgate
Results
• Power and runtime comparisons between the proposed approach and a sensitivity-based algorithm at 2% timing backoff (results shown for larger circuits only)
• Average: 31% leakage reduction vs. previous approaches
Topics
• A new dual-Vth assignment formulation
• Dual-Vdd power distribution
• Approaches to parametric yield optimization: statistical leakage + delay
Multiple supply design
• Relies on applying a lower supply (VDDL) to gates along non-critical paths, thus reducing power while meeting timing
• A flexible, fine-grained VDD assignment scheme promises the best power reduction
  • Gate-level Extended Clustered Voltage Scaling
• However, physical design and power delivery are complicated
[Figure: need for level conversion between flip-flops (FF); labels: VDDH, VDDL, VDDL swing, DC current, IN]
Implications of using multiple supplies
Coupled issues:
• Circuits: level shifting
• Algorithms: VDD assignment
• Physical design: VDD granularity (fine-grained, islanding)
• Power delivery: distribution, generation
[Figure labels: IN, OUT, critical, non-critical, CVS, ECVS]
Power delivery for dual-VDD circuits
• Power grid integrity is vital for circuit performance
• Dual-VDD circuits require two supply voltages for operation
• Fine-grained dual-VDD can place VDDL/VDDH gates arbitrarily on the die
• Implications at the board, package and die level
  • Fixed resources need to be split between VDDL and VDDH
  • However, the load on each supply is lower than on the original single supply
• The power supply current demanded by a dual-VDD circuit is significantly lower than that of the corresponding single-VDD circuit, allowing robust power delivery within available resources (decap, C4, wiring)
Reduced current load on VDDL/VDDH
• Gate-level comparison
  • Avg. 54% (33%) for VDDL = 0.8V (0.6V)
• Circuit-level comparison
  • Avg. 49% (51%) and 28% (14%) for VDDH and VDDL for 0.8V (0.6V)
[Figure labels: VDD, ECVS]
Package level results
• Two VRMs on board to supply VDDL and VDDH
• Ground path can be shared by VDDL and VDDH
• Decoupling capacitance divided in the ratio of current loads
• Similar power supply noise with the same resources as the single-VDD case (decoupling capacitance, C4s)
[Figure: lumped R/L/C model of the VDDH and VDDL delivery paths, with motherboard (Lmb, Rmb), socket (Lskt, Rskt), package (Lpkg, Rpkg), decoupling (hf, pkg_cap, blk) and die (Rdie, Cdie) elements feeding the loads I(VDDH) and I(VDDL)]
Intel, “Intel Pentium 4 processor in the 432 pin/Intel 850 Chipset Platform,” 2002.
Dual-VDD physical design alternatives
• Segregated placement constrains the placer, leading to higher core area and wirelength
[Figure: row layouts for single-VDD, dual-VDD segregated, and dual-VDD fine-grained (VDDH + VDDL rows) placement; rails: VDDH, VDDL, GND]
C. Yeh, et al., “Layout techniques supporting the use of dual supply voltages for cell-based designs,” Proc. DAC, 1999.
M. Igarashi, et al., “A low-power design method using multiple supply voltages,” Proc. ISLPED, 1997.
Dual-VDD power grid alternatives
• Routing the power supply rails
• Dual-VDD dual-GND requires two separate grounds off-chip and complicates timing analysis and the design of the board itself
• Multi-rail standard cells can be used to realize the dual-VDD grids, which allows the placer to operate with no constraints
[Figure: dual-VDD standard cell topologies. Single-VDD: VDD, GND. Dual-VDD shared-GND (3-rail cell): VDDH, VDDL, GND (shared). Dual-VDD dual-GND (4-rail cell): VDDH, GNDH, VDDL, GNDL]
Dual-VDD on-chip power grid design
• Guidelines for designing the dual-VDD grid:
  • Scale wires with respect to the single-VDD grid, considering how the current demand has scaled
  • VDDL gates are more sensitive to grid noise; this is important since ground is shared
    • 120mV of noise is 10% for a 1.2V gate, but 20% for a 0.6V gate
  • Placement of VDDL and VDDH gates: assign more wiring resources to the VDDL grid in areas where there is more demand for VDDL current
  • Consider effects that arise at the board and package level, such as shared C4s
    • Fewer C4s lead to higher effective package R, L
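The noise-sensitivity point is simple arithmetic, sketched here for concreteness. The function names are illustrative, and the fixed 10% budget in the second helper is an assumption used only to show the consequence for a VDDL gate.

```python
def noise_fraction(noise_volts, supply_volts):
    """Grid noise expressed as a fraction of the gate's supply voltage."""
    return noise_volts / supply_volts


def allowed_noise(budget_fraction, supply_volts):
    """Absolute noise (volts) that keeps a gate within a relative noise budget."""
    return budget_fraction * supply_volts
```

`noise_fraction(0.120, 1.2)` is about 0.10 and `noise_fraction(0.120, 0.6)` is about 0.20, matching the slide. Under the same assumed 10% budget, a 0.6V gate tolerates only about 60mV, which is one reason the VDDL grid deserves extra wiring where VDDL demand is high.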
Proposed technique D-Place
• Partition the chip floorplan
• Obtain effective α and β as follows
  • Let α = I(VDDH)/I(VDD) and β = I(VDDL)/I(VDD)
• Scale wires as follows
[Flowchart: original single-VDD design → obtain dual-VDD design → obtain current consumption of single/dual-VDD designs (SPICE) → break down die into “local” and “regional” areas → calculate local, regional, global and effective α and β for each wire segment → size each wire segment in each local area using effective α, β and simulate grid → measure voltage droop/bounce and wire congestion. Inputs: single-VDD lib file, dual-VDD lib file, placement database (Cadence). Grids: VDDH, VDDL, GND]
Peak voltage drop comparisons
• D-Place grids are better than single-VDD grids in AVG cases
  • Inferior by < 2.6% (≈15mV) in some MAX cases
• 0.6V VDDL is as robust as 0.8V
  • 0.6V also provides higher power savings
• The proposed approach is better by 2-7% (AVG) and 7-12% (MAX) compared to prior approaches
[Plots: VDDL = 0.6V and VDDL = 0.8V]
Voltage variation across die
• Voltage drop contours
• Wiring congestion is similar for dual-VDD vs. single-VDD grids
• Lower current demands can lead to smaller amounts of decoupling cap and lower leakage (or use the same decap for better performance)
Dual-VDD grid is no less robust than the single-VDD grid
Topics
• A new dual-Vth assignment formulation
• Dual-Vdd power distribution
• Approaches to parametric yield optimization: statistical leakage + delay
Introduction
• Process parameter space: Vth, Leff
  • Variation sources: optical proximity effects, chemical mechanical polishing
• Chip performance space: delay, power
  • Low leakage, poor timing: timing yield loss
  • Good timing, high leakage: power yield loss
• This work: optimize the timing and power yield using gate sizing
Problem Description
• Nonlinear continuous optimization
• Objective: maximize timing and power yield
  • Yield: a utility function defined w.r.t. the JPDF of leakage and timing
• Decision variables: gate sizes
• Efficient implementation requires
  • Computing yield as a function of the decision variables (gate sizes)
  • Fast and accurate gradient computation
[Figure labels: Tconst, Pconst]
Power and Timing Yield Analysis (see DAC05 for more detail)
• Timing analysis [Sapatnekar03, Chandu05]: delay d characterized by (μd, σd)
• Power analysis: log(leakage) l characterized by (μl, σl)
• Correlation (1 parameter): ρ
• Delay and power bivariate JPDF: (μd, σd, μl, σl, ρ)
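To make the yield definition concrete, the sketch below estimates the probability that a die meets both a delay limit and a leakage limit, assuming delay and log(leakage) follow a correlated bivariate normal distribution with the five parameters listed above. Monte Carlo sampling is used here only for clarity and is not the analytical method of the talk; the function name, sample count and seed are illustrative choices.

```python
import math
import random


def estimate_yield(mu_d, sigma_d, mu_l, sigma_l, rho, t_const, p_const,
                   samples=20000, seed=0):
    """Monte Carlo estimate of P(delay <= t_const and leakage <= p_const).

    Delay and log(leakage) are modeled as a correlated bivariate normal pair.
    """
    rng = random.Random(seed)
    log_p_const = math.log(p_const)
    passing = 0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        delay = mu_d + sigma_d * z1
        # Build a second normal variable with correlation rho to the first.
        log_leak = mu_l + sigma_l * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        if delay <= t_const and log_leak <= log_p_const:
            passing += 1
    return passing / samples
```

With both limits placed at the means, a strongly negative `rho` (fast dies are leaky dies) gives a much lower yield than a positive one, because few samples are simultaneously fast and low-leakage.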
Cut Set SSTA: Intuition
• Consider the timing graph
• Using forward SSTA (arrival times, AT) and reverse SSTA (required arrival times, RT), cut-set SSTA gives exactly the same sensitivities as the naïve approach that recomputes yield over all nodes, most of which are unchanged
[Figure: timing graph with nodes 1-10 in which gate 7 is sized up; traditional incremental timing is contrasted with a cut set separating the unperturbed left and right subgraphs; labels: cut edge time (CT), max cut edge time]
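A deterministic analogue may help with the intuition. In the sketch below, which uses plain longest-path timing rather than statistical timing, the circuit delay is recovered from any cut by taking the maximum over cut edges of arrival time at the tail, plus edge delay, plus the longest remaining time from the head to the sink. The graph, delays and function names are invented for illustration; the statistical version would replace `max` and `+` with their statistical counterparts, and the sensitivity computation itself is not shown.

```python
def longest_from_source(order, edges):
    """Arrival time (AT) of each node; `order` must be a topological order."""
    at = {node: 0.0 for node in order}
    for u in order:
        for (a, b), d in edges.items():
            if a == u:
                at[b] = max(at[b], at[u] + d)
    return at


def longest_to_sink(order, edges):
    """Longest remaining time from each node to the sink (reverse traversal)."""
    ts = {node: 0.0 for node in order}
    for v in reversed(order):
        for (a, b), d in edges.items():
            if b == v:
                ts[a] = max(ts[a], d + ts[v])
    return ts


def cut_set_delay(left_nodes, edges, at, ts):
    """Circuit delay computed only from the edges that cross the cut."""
    return max(at[u] + d + ts[v] for (u, v), d in edges.items()
               if u in left_nodes and v not in left_nodes)


# Small example timing graph: node 1 is the source, node 6 the sink.
ORDER = [1, 2, 3, 4, 5, 6]
EDGES = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 2.0, (3, 4): 1.0,
         (3, 5): 3.0, (4, 6): 2.0, (5, 6): 1.0}

AT = longest_from_source(ORDER, EDGES)
TS = longest_to_sink(ORDER, EDGES)
```

If a gate on the right of a cut is resized, the arrival times on the left are untouched, so only quantities at the cut and beyond need to be re-evaluated. That locality is the appeal of working with a cut set instead of re-propagating through the whole graph.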
Statistical Yield Optimization Results
• Initial yield is ~0-2% due to the inverse correlation between timing and leakage
• Gate sizing alone provides good improvements
• Combined with Lgate biasing, provides outstanding results
Chopra, et al., ICCAD05
Another approach to statistical optimization
• General statistical optimization
• The method relies on efficient deterministic formulations and variation-space sampling to drive statistical optimization
• Applicable to many mainstream VLSI design problems: gate sizing, Vth assignment, Leff biasing, as well as potential new levers
Statistically Optimized Body Bias Clustering for Post-Silicon Tuning
• Concept: speed up critical gates using forward body bias (FBB) and slow down non-critical gates using reverse body bias (RBB) to meet timing and power constraints
• Traditional view: centralized body bias generator controlling different die regions
  • Ineffective for compensating intra-die variations
  • Highly suboptimal power
[Figure labels: BB controller, critical, non-critical]
Coarse Body Bias Assignment
• Simplified assignment minimizing routing overheads
• Biasing dictated by placement instead of gate criticality
• Disregards the complex dependence of gate criticality on:
  • Circuit topology
  • Correlations in process variations
• Effective in tightening delay but leads to high power
Important to cluster gates to leverage ABB effectively
[Figure labels: one bias for all gates, critical, delay, power, correlated]
Proposed New Optimization Framework
• Generate sample scenarios
• Solve BB assignment for each scenario: deterministically optimize each scenario (i.e., tune each gate for each die scenario)
• Generate PDFs of optimal actions (gate BB-PDF)
• Clustering
• Post-silicon tuning
[Figure: a seven-gate example circuit under scenarios ‘1’, ‘2’ and ‘x’, with per-gate channel lengths Leff_i.k and correlations ρi,j]
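The flow above (sample, optimize each sample deterministically, collect per-gate distributions, cluster) can be sketched end to end. Everything specific below is a stand-in: `optimal_bias` is a made-up rule replacing the real per-scenario optimizer, the variation model is a simple shared-plus-independent Gaussian, and clustering is done by sorting gates on their mean bias. Only the overall structure follows the slide.

```python
import random
import statistics


def sample_scenarios(n_gates, n_scenarios, sigma=0.05, seed=0):
    """Draw per-gate Leff deviations (fraction of nominal) for each die scenario."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n_scenarios):
        die_shift = rng.gauss(0.0, sigma)   # component shared by all gates on a die
        scenarios.append([die_shift + rng.gauss(0.0, sigma) for _ in range(n_gates)])
    return scenarios


def optimal_bias(leff_dev, criticality):
    """Hypothetical stand-in for the deterministic per-scenario optimizer.

    Positive means forward bias (speed up); negative means reverse bias.
    """
    return criticality * leff_dev - (1.0 - criticality) * 0.05


def bias_distributions(scenarios, criticalities):
    """Collect, per gate, the optimal bias chosen in every scenario."""
    per_gate = [[] for _ in criticalities]
    for scenario in scenarios:
        for g, (dev, crit) in enumerate(zip(scenario, criticalities)):
            per_gate[g].append(optimal_bias(dev, crit))
    return per_gate


def cluster_gates(per_gate, n_clusters):
    """Group gates with similar mean bias: sort by mean, cut into equal groups."""
    means = [statistics.mean(biases) for biases in per_gate]
    ranked = sorted(range(len(means)), key=lambda g: means[g])
    size = -(-len(ranked) // n_clusters)   # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

For example, with `criticalities = [0.9, 0.1, 0.8, 0.2]`, calling `cluster_gates(bias_distributions(sample_scenarios(4, 500), criticalities), 2)` groups the two non-critical gates together and the two critical gates together, regardless of where they sit on the die.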
Results vs. Traditional Dual-Vth
• Delay
  • 3-9X tighter σ
• Leakage power
  • Dual-Vth vs. 2-4 ABB clusters
  • Avg. 28-38% (51-59%) lower μ (95th percentile)
• Area
  • Capo generates contiguous regions of similarly clustered cells while minimally displacing cells
  • 5-8% increase in wirelength and area
A few conclusions
• Parametric yield is a critical design objective going forward
  • Requires accurate estimation and fast optimization approaches to this key metric
  • Envision all tools in 4-6 years being yield-driven, rather than driven by timing or power alone
• Lots of room for improvement in many ‘well-studied’ CAD problems today
  • Recent examples: dual-Vth+sizing, placement (Cong, et al.)