Containing the Nanometer ‘Pandora Box’: Design Techniques for Variation Aware Low Power Systems. Kaushik Roy, Georgios Karakonstantis, Abhijit Chatterjee. Era of Miniaturization. Increasing demand for small multifunctional mobile platforms consisting of heterogeneous components.
Era of Miniaturization
• Increasing demand for small multifunctional mobile platforms consisting of heterogeneous components
[Figure labels: General Processor on Chip; RF; Mixed Signal; Memory on Chip; Application Specific Processors; Camera, Video; WLAN]
Scaling Challenges: Power
• Large power dissipation
• Large power density
• Increasing battery gap
• Rapidly increasing processor power consumption
• Slowly increasing battery capacity
Need for Efficient Use of Available Energy
[Figure: dynamic and leakage power (W, axis 0 to 120) of a 15 mm² die across technology nodes 250n, 180n, 130n, 90n, 65n, 45n. Source: Intel]
Scaling Challenges: Process Variations
• Random dopant fluctuation
• Variation in channel length (Device 1 vs. Device 2, Leff1 < Leff2)
• Inter and Intra-die Variations
• Leakage Spread and Delay Variation (Intel)
Device parameters are no longer deterministic
The Nanometer ‘Pandora Box’
[Figure labels: Exponential Power Growth; High Performance; Process Variations; Soft Errors; Short Channel Effects; Reduced Yield/Quality]
Technology scaling beyond 90nm unlocks the Nanometer ‘Pandora Box’
Addressing the Issues (Logic)
• Low Power: Reduce supply voltage (Voltage Over-Scaling, VOS), at the risk of delay failures
• Robustness (delay variations): Increase supply voltage, at the cost of increased power
• Low Power and Robustness have contradictory design requirements
• Common Symptoms: Delay Failures
• Medicine? Can we address them jointly?
[Figure: path delay and power versus Vdd around Vddnom, shown against the clock period Tc, for the low power case and for robustness under delay variations]
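The contradiction on this slide can be sketched with two textbook first-order models: dynamic power scaling with Vdd² at a fixed clock, and path delay following the alpha-power law. The nominal supply, threshold voltage, alpha exponent, and 10% clock margin below are illustrative assumptions, not values from the slides.

```python
# First-order sketch of the power/robustness tension.
# All constants are assumed for illustration only.
VDD_NOM = 1.0   # nominal supply in volts (assumption)
VTH = 0.3       # threshold voltage in volts (assumption)
ALPHA = 1.3     # alpha-power-law exponent (assumption)


def relative_dynamic_power(vdd: float) -> float:
    """Dynamic power relative to nominal, P ~ C * Vdd^2 * f at a fixed clock."""
    return (vdd / VDD_NOM) ** 2


def relative_delay(vdd: float) -> float:
    """Path delay relative to nominal, using delay ~ Vdd / (Vdd - Vth)^alpha."""
    def raw(v: float) -> float:
        return v / (v - VTH) ** ALPHA
    return raw(vdd) / raw(VDD_NOM)


def has_delay_failure(vdd: float, clock_margin: float = 1.1) -> bool:
    """A path fails when its delay exceeds the clock period Tc.

    Tc is assumed to be the nominal path delay plus a 10% margin.
    """
    return relative_delay(vdd) > clock_margin


if __name__ == "__main__":
    for vdd in (1.1, 1.0, 0.8, 0.7):
        print(f"Vdd={vdd:.1f} V  power={relative_dynamic_power(vdd):.2f}  "
              f"delay={relative_delay(vdd):.2f}  fails={has_delay_failure(vdd)}")
```

Lowering Vdd in this model cuts power quadratically but pushes the path delay past Tc, while raising Vdd restores timing slack at a power cost.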
Addressing the Issues
• Logic (General Purpose, Application Specific)
  • Algorithmic Noise Tolerance
  • Significance Driven Approach
  • Error Detection & Correction (RAZOR, Intel)
  • Variable Latency Units (CRISTA, Telescopic)
• Memory (stability failures)
  • Circuit-Architecture Co-Design (Redundancy, ECC, ABB,…)
• Mixed Signal (loss of performance due to guardbands)
[Figure labels: design layers System, Algorithm, Architecture, Circuit, Device]
Cross layer design is necessary for highly optimized systems
Outline
• Scaling Challenges
  • Variations, Power
• Logic
  • General Purpose Processors
    • Error Detection and Correction
    • Prediction Based techniques
  • Application Specific Systems
    • Significance Driven Approach
    • Algorithmic Noise Tolerance
    • System Level Techniques
• Robust Low Power Memory
  • Circuit – Bit Cell Level
  • Architecture Level
• Mixed Signal
• Conclusion
Error Detection and Correction
[Figure: RAZOR flip-flop with main flip-flop (D1, Q1, clk), 0/1 input multiplexer, shadow latch (clk_del), and comparator producing Error_L and Error]
• Tune Vdd by monitoring the error rate during circuit operation
• Eliminates the need for voltage/clock margins
• Shadow latch samples the signal on a delayed clock (clk_del)
• A comparator and a metastability detector identify and validate any timing error
• 64% energy savings with 3% performance and energy overhead
• RAZORII (JSSC ‘09) circumvents the tight timing constraints of RAZOR flip-flops through micro-architecture techniques such as replay
• Intel provides an overview of dynamic variation aware and low power techniques at the micro-architecture level (JSSC ‘09)
* Dan Ernst, et al., “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, 2004.
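The detection and tuning idea above can be sketched behaviourally. This is a simplified model under stated assumptions, not the RAZOR circuit itself: the metastability detector is not modelled, and the function names, target error rate, voltage step, and voltage bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    q: int        # value forwarded after any correction
    error: bool   # True when main flip-flop and shadow latch disagreed


def razor_ff(old_value: int, new_value: int, arrival: float,
             t_clk: float, t_delay: float) -> RazorSample:
    """Behavioural model of one Razor-style flip-flop for a single cycle.

    The main flip-flop samples at t_clk; the shadow latch samples at
    t_clk + t_delay. Data arriving after the shadow sample point is
    outside what this sketch models, so it is rejected.
    """
    if arrival > t_clk + t_delay:
        raise ValueError("arrival beyond the shadow latch window")
    # A late arrival leaves the stale value in the main flip-flop.
    main = new_value if arrival <= t_clk else old_value
    shadow = new_value  # the shadow latch sees the settled value
    return RazorSample(q=shadow, error=(main != shadow))


def tune_vdd(vdd: float, error_rate: float, target: float = 0.01,
             step: float = 0.01, vdd_min: float = 0.6,
             vdd_max: float = 1.2) -> float:
    """Raise Vdd when the observed error rate exceeds the target, else lower it."""
    vdd = vdd + step if error_rate > target else vdd - step
    return min(vdd_max, max(vdd_min, vdd))
```

A controller built this way keeps the supply just above the point where errors become frequent, instead of reserving a fixed worst-case voltage margin.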
Prediction based (CRISTA)
[Figure: histograms of number of paths versus path delay against the delay target, at nominal and scaled conditions. Conventional system: complete failure. CRISTA based design: failures are predictable and rare by design, with slack S]
Meet the delay target while considering variation and VOS induced delay errors
CRISTA – Generic Logic
• Shannon expansion and gate sizing
• Critical path (CP) activated when x1’x2x3=1 with prob.=12.5%
• Long paths are activated rarely and are evaluated in two cycles
• 40% less power with 9% area overhead for a two stage pipelined ALU
• Applied to circuits (variable latency units) as well as at the micro-architectural level (Trifecta, TVLSI ‘10)
* S. Ghosh, et al., “CRISTA: A New Paradigm for Low-Power, Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation,” TCAD, 2007.
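The 12.5% figure follows from counting input patterns: one of the eight combinations of x1, x2, x3 satisfies x1’x2x3 = 1. A short check, assuming the three inputs are independent and equally likely to be 0 or 1 (the function names are illustrative):

```python
from itertools import product


def critical_path_active(x1: int, x2: int, x3: int) -> bool:
    """The critical path is sensitized only when x1' * x2 * x3 = 1."""
    return (not x1) and bool(x2) and bool(x3)


def activation_probability() -> float:
    """Fraction of the 8 equally likely input patterns that activate the path."""
    patterns = list(product((0, 1), repeat=3))
    active = sum(critical_path_active(*p) for p in patterns)
    return active / len(patterns)


def average_cycles(p_long: float, long_cycles: int = 2,
                   short_cycles: int = 1) -> float:
    """Expected cycles per operation when long paths take two cycles."""
    return p_long * long_cycles + (1 - p_long) * short_cycles
```

Under the same assumption, giving the rare long paths two cycles costs an average of 1.125 cycles per operation.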
CRISTA: Variable Latency Adder Case
[Figure: 4-bit ripple-carry adder built from full adders (FA) with inputs a0..a3, b0..b3, Cin, carries Co,0..Co,3, sums S0..S3, and a stretch-clock signal; timing diagram at VDD=1V with clock period Tc, cases A to D, long and short latency classes, slacks S1 and S2, and a delay failure marker]
• Delay of circuit depends upon input data and carry propagation
• Example: A (a0..a3) = 1111 and B (b0..b3) = 0001
• Classify operations into long and short latency
• Inputs susceptible to failure are given 2 cycles
• Utilize slack S1 and S2 for tackling VOS and variation induced delay errors
• The number of inputs monitored by the predictor trades off overhead against penalty
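A minimal sketch of the data dependence described above, for a 4-bit ripple-carry adder. It reads B = 0001 as the value 1 (b0 = 1), which is an assumption about the slide's bit ordering. The long-latency threshold of three stages and the single-bit predictor are also assumptions chosen for illustration, not the predictor from the CRISTA work.

```python
LONG_THRESHOLD = 3  # chains this long are assumed to miss the scaled-Vdd clock


def longest_carry_chain(a: int, b: int, width: int = 4, cin: int = 0) -> int:
    """Longest run of consecutive adder stages that one carry travels through."""
    carry, run, longest = cin, 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        generate, propagate = ai & bi, ai ^ bi
        if generate:
            run = 1            # a new carry is created at this stage
        elif propagate and carry:
            run += 1           # the incoming carry ripples through this stage
        else:
            run = 0            # the carry is killed, or there is none
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


def predict_long(a: int, b: int) -> bool:
    """Cheap predictor that monitors only bit 2 (a2 XOR b2)."""
    return bool(((a >> 2) ^ (b >> 2)) & 1)


def cycles_for(a: int, b: int) -> int:
    """Operations flagged by the predictor are given two cycles."""
    return 2 if predict_long(a, b) else 1
```

With Cin = 0, every 4-bit chain of three or more stages passes through bit 2 in propagate mode, so this predictor never misses a long operation. It does flag half of all input pairs, which is the penalty side of the trade-off; monitoring more inputs would reduce false alarms at a higher hardware overhead.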
Application Specific Systems
Obtain best quality possible, in presence of delay failures under VOS or parametric variations
[Figure: System Under Parametric Variations and Nominal Vdd (Nominal Corner vs. Slow Corner); System Under Voltage Over-Scaling and Nominal Corner (Nominal Vdd vs. Scaled Vdd)]
Significance Driven Approach
Not all computations contribute equally to output quality
• Significant Computations vs. Less Significant Computations
• Algorithm: Adjust Complexity for minimum Quality degradation under delay errors
• Architecture: Minimize sharing; Tight timing; Maximize Sharing for reducing any area overhead
• Slack to tackle Delay Errors due to Vdd Scaling/Process Variation
• Ensure Correct Operation Under Delay Errors
Energy Efficient & Robust DSP Blocks
Application to DCT
[Figure: two-pass 2D DCT. Input block (x0..x7) → 1D-DCT → intermediate DCT coefficients (w0..w7) → Transpose Memory (y0..y7) → 1D-DCT → final DCT coefficients (z0..z7). Significance decreases from low frequency to high frequency; coefficients are labelled Significant or Not-So Significant, and computations Crucial, Less Crucial 1, or Less Crucial 2]
* G. Karakonstantis, et al., “Process-Variation Resilient & Voltage Scalable DCT Architecture for Robust Low-Power Computing,” TVLSI, 2010.
Scalable DCT Architecture
[Figure: outputs z0..z7 grouped into Crucial, Less Crucial 1, and Less Crucial 2 computations with delays Dc, Dlc1, Dlc2, shown against the clock target and a failing clock (FAIL)]
• Under delay failures only less-crucial computations are affected
• Low Power (55% savings) with Graceful Quality degradation (33dB to 23dB)
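Why losing high-frequency outputs degrades quality gracefully can be seen on a single 8-point DCT. The sketch below uses a textbook orthonormal DCT-II and a made-up smooth pixel row; it illustrates the significance argument only and is not the scalable architecture from the slides.

```python
import math

N = 8
ROW = [100, 102, 105, 109, 114, 120, 127, 135]  # hypothetical smooth pixel row


def _scale(k: int) -> float:
    """Orthonormal scaling factor for coefficient k."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct_1d(x):
    """Orthonormal 8-point DCT-II."""
    return [_scale(k) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for n in range(N))
            for k in range(N)]


def idct_1d(c):
    """Inverse of dct_1d (DCT-III with matching scaling)."""
    return [sum(_scale(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N))
            for n in range(N)]


def mse_when_dropping(x, dropped):
    """Mean squared error after zeroing the coefficient indices in `dropped`."""
    coeffs = dct_1d(x)
    kept = [0.0 if k in dropped else c for k, c in enumerate(coeffs)]
    y = idct_1d(kept)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / N


if __name__ == "__main__":
    print("drop high frequencies:", mse_when_dropping(ROW, {4, 5, 6, 7}))
    print("drop low frequencies: ", mse_when_dropping(ROW, {0, 1, 2, 3}))
```

For smooth data almost all of the energy sits in the low-frequency coefficients, so an architecture that lets only the high-frequency computations fail loses far less quality than one that fails indiscriminately.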
Algorithmic Noise Tolerance
[Figure labels: Input; MAIN Block (8-bit); Reduced Precision Replica (6-bit); signals Yp and Ya; Compare (>Th); Error Control; Yout]
• Redundancy Based
• Addition of a reduced precision replica of the main block
• Error Control
• Estimates and Corrects Potential Errors
• Threshold determined at design time
• Use of linear arithmetic units
• Applied to the design of various DSP blocks (FFT, FIR, Viterbi) leading to 20-50% power savings with graceful quality loss
* B. Shim, et al., “Reliable Low-Power Digital Signal Processing via Reduced Precision Redundancy,” TVLSI, 2004.
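The error-control decision can be sketched as a threshold comparison between the main block and the replica. The names `y_actual` and `y_replica`, the truncation scheme, and the example threshold are assumptions made for illustration; they are not taken from the slide's signal labels.

```python
def reduced_precision(x: int, main_bits: int = 8, replica_bits: int = 6) -> int:
    """Truncate an unsigned operand to replica precision while keeping its scale."""
    shift = main_bits - replica_bits
    return (x >> shift) << shift


def ant_output(y_actual: int, y_replica: int, threshold: int) -> int:
    """Trust the main block unless it differs from the replica by more than Th."""
    if abs(y_actual - y_replica) > threshold:
        return y_replica   # likely timing error: fall back to the estimate
    return y_actual        # small difference: only replica quantization noise


if __name__ == "__main__":
    a, b = 123, 77
    replica = reduced_precision(a) + reduced_precision(b)  # 6-bit estimate
    th = 6  # two operands, each truncated by at most 3 (assumed design-time bound)
    correct = a + b
    faulty = correct ^ 0x80  # hypothetical timing error hitting a high-order bit
    print(ant_output(correct, replica, th), ant_output(faulty, replica, th))
```

Because delay errors under voltage over-scaling tend to produce large-magnitude deviations, a threshold sized to the replica's quantization error separates them from normal operation.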
SDA vs ANT
SDA
• Challenge: Identification of significant computations based on application specific characteristics
• Small area and power overhead
• Low power overhead (3%) at nominal voltage
• ~25% more power savings in DCT
• Finer granularity => More flexible and able to adapt to conditions
ANT
• Challenge: Design of reduced overhead estimators based on application specific characteristics
• Need extra hardware for approximate computation
• Extra hardware translates to less power savings even at scaled voltages
Algorithm/architecture co-design can lead to energy efficient architectures with minimum area overhead utilizing the inherent error resiliency of ASIC systems
Approximate Computation
• Probabilistic Computing*
• Recognition, Mining and Synthesis (RMS) applications have inherently statistical behavior and are based on iterations
• No single ‘golden’ result
• Subsequent iterations may compensate for errors in previous stages
• Scalable Effort**
• SVM machine with more than 40% power savings under acceptable error rates
• Stochastic Processors
* K. V. Palem, et al., “Sustaining Moore’s law in embedded computing through probabilistic and approximate design: retrospects and prospects,” CASES, 2009.
** V. K. Chippa, et al., “Scalable Effort Hardware Design: Exploiting Algorithmic Resilience for Energy Efficiency,” DAC, 2010.
Error Resilient System Architecture
• Execute control-intensive operations (error free, significant) in a Reliable Core
• Execute data operations (errors can be tolerated, less-significant) in Less Reliable Cores
• Applied to RMS applications and LDPC decoding
• 90% accuracy is maintained even under 2×10^-4 errors/cycle/core
* L. Leem, et al., “ERSA: Error-Resilient System Architecture For Probabilistic Applications,” DATE, 2010.
Variation Aware Power Management
• Exploit the variable workload of applications over time to adjust the voltage in various power domains on-chip, while considering variations
Memory Failures
• Random dopant fluctuation (RDF) and line edge roughness (LER) impact memory more than logic
• Memory stability is quantified in terms of read, write and access failure probability
• Read Failures (PRF)
• Negative Read SNM, Flip of data
• Write Failures (PWF)
• Access Failures
• Cell Failure probability modeled as the union of all failure probabilities
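The union in the last bullet has a simple closed form if the three failure mechanisms are treated as independent. That independence is an assumption of this sketch; in a real cell the mechanisms share the same device variations and are correlated. The function name and the example probabilities are illustrative.

```python
def cell_failure_probability(p_read: float, p_write: float,
                             p_access: float) -> float:
    """P(read or write or access failure), assuming independent mechanisms."""
    return 1.0 - (1.0 - p_read) * (1.0 - p_write) * (1.0 - p_access)


if __name__ == "__main__":
    # Hypothetical per-mechanism probabilities, for illustration only.
    print(cell_failure_probability(1e-6, 2e-6, 5e-7))
```

For small probabilities the result is close to the plain sum of the three terms, which is also an upper bound on the union regardless of correlation.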
Robust Memory Design: Circuit
• New bit cells that isolate read from write
• 8T bit-cells have better read stability at the cost of 30% area overhead
• 10T bit-cells have better stability at lower Vdd
• Schmitt Trigger improves both read and write
[Figure: Read SNM (6T vs. 8T) (L. Chang et al., VLSI Symp. ’03)]
• Circuit Level Solutions for 6T based memories
• Up-size bit-cells, ABB
• Read and write requirements conflict
Robust Memory Design: Architecture
• Circuit/Architecture Co-design
• Hybrid memory with preferential storage for video applications combining 8T and 6T cells
• 46% power savings (@10MHz) with 11% area overhead
• Access time requirements met at 600mV
• Architecture Level
• Addition of redundant rows and columns
• Error Correction Codes (ECC)
* J. Chang, et al., “A voltage-scalable & process variation resilient hybrid SRAM architecture for MPEG-4 video processors,” DAC, 2009.
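The benefit of redundant columns can be estimated with a binomial model. This is a sketch under assumed conditions: cells fail independently with the same probability, a column is faulty if any of its cells fails, spare columns fail like regular ones, and the array works as long as the number of faulty columns does not exceed the number of spares. The array dimensions and cell failure probability are hypothetical.

```python
from math import comb


def column_failure_probability(p_cell: float, rows: int) -> float:
    """A column is faulty if at least one of its cells fails."""
    return 1.0 - (1.0 - p_cell) ** rows


def memory_failure_probability(p_cell: float, rows: int, cols: int,
                               redundant_cols: int) -> float:
    """Probability that more columns are faulty than there are spares."""
    p_col = column_failure_probability(p_cell, rows)
    total = cols + redundant_cols
    p_ok = sum(comb(total, k) * p_col ** k * (1.0 - p_col) ** (total - k)
               for k in range(redundant_cols + 1))
    return 1.0 - p_ok


if __name__ == "__main__":
    for spares in (0, 2, 8):
        print(spares, memory_failure_probability(1e-5, 256, 256, spares))
```

The same structure shows the limit of redundancy: once the cell failure probability rises (for example at a scaled Vdd), the expected number of faulty columns can exceed any practical number of spares.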
Mixed-Signal: Adaptive Wireless Systems
[Figure labels: Channel/Signal Quality; Health; Adaptation Control: Hardware/Software; System: RF Front End, Processor]

Tuning Methodology
[Figure labels: Analysis and control engine (ACE)]

BIST for Multiple Specs
NOTE: Model building across process and tuning knobs!!
[Figure labels: Production Phase]

Process Tuning Approach
[Figure labels: Test Stimulus; RF to low-frequency conversion; Test Response; Diagnosis]
Two different techniques for tuning

Test Stimulus and Test Response
• Test stimulus (Multi-tone)
• Transmitter Output (RF signal: 2.4GHz)
• Sensor at Tx output
• Test response sampled by on-board ADC (low frequency envelope)
[Figure labels: Multiple Instances; Process Perturbations]

Adaptive LNA
TSMC 0.18um CMOS Design

Experimental Results: Large parameter analysis
[Figure labels: Large parameter instances; Small parameter instances]

Experimental Results: Large parameter tuning
[Figure labels: Non Tunable]

Experimental Results: BIST
AT-BIST Results Across Process and Tuning Knobs for Tx

Experimental Results: Nominal Specs
One-Instance (P1)
• 207 possible knob combinations (P1) for yield recovery
• Power conscious knob combination (P1): 0.5724W
• Converged Knob combination (P1): 0.5724W
Mixed-Signal
• Concurrent Tuning of Multiple Specifications
• Estimation of multiple specs using MARS (multivariate adaptive regression splines)
• Fast convergence: 3 to 5 iterations for 4 knobs
• Power Optimum convergence for large and small parameter deviations
• 30% Improvement in yield
• Ongoing Work
• Hardware Analysis
• Receiver Analysis
• Concurrent tuning of transmitter and receiver
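Power-conscious knob selection can be sketched as a constrained search: among the knob settings whose predicted specifications all meet their limits, keep the one with the lowest predicted power. The exhaustive search, the function names, and the toy gain and power models below are assumptions for illustration. They stand in for the regression-based spec estimators and the iterative tuning described in the slides, which are not reproduced here.

```python
from itertools import product


def pick_power_optimal_knobs(knob_levels, predict_specs, predict_power,
                             spec_limits):
    """Return the lowest-power knob setting that meets every spec, or None."""
    best, best_power = None, float("inf")
    for knobs in product(*knob_levels):
        specs = predict_specs(knobs)
        meets_all = all(specs[name] >= limit
                        for name, limit in spec_limits.items())
        if meets_all:
            power = predict_power(knobs)
            if power < best_power:
                best, best_power = knobs, power
    return best


# Toy stand-in models (hypothetical, not measured data).
def toy_specs(knobs):
    bias, supply = knobs
    return {"gain_db": 10 + 2 * bias + supply}


def toy_power(knobs):
    bias, supply = knobs
    return 0.3 * bias + 0.1 * supply


if __name__ == "__main__":
    levels = [(0, 1, 2), (0, 1, 2)]
    print(pick_power_optimal_knobs(levels, toy_specs, toy_power,
                                   {"gain_db": 13}))
```

An exhaustive search is only practical for a few knobs with a handful of levels each; it is used here to keep the selection criterion easy to read.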
Conclusion
• Cross Layer Design Techniques that facilitate
  • Voltage Over-Scaling and
  • Tolerance to Parametric Variations
• Application to the design of energy efficient Logic (Digital and Mixed Signal) and Memory Blocks
• Combination of the presented techniques can allow the design of low power and robust systems in the nano-scale as well as in the post-silicon era