Micro transductors ’08
Low Power VLSI Design 2
Dr.-Ing. Frank Sill, Department of Electrical Engineering, Federal University of Minas Gerais, Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil
franksill@ufmg.br
http://www.cpdee.ufmg.br/~frank/
Agenda
• Recap
• Power reduction on
  • Gate level
  • Architecture level
  • Algorithm level
  • System level
Micro transductors ‘08, Low Power 2
Recap: Problems of Power Dissipation
• Continuously increasing performance demands
• Increasing power dissipation of technical devices
• Today: power dissipation is a main problem
• High power dissipation leads to:
  • Reduced time of operation
  • Higher weight (batteries)
  • Reduced mobility
  • High cooling effort
  • Increasing operational costs
  • Reduced reliability
Recap: Consumption in CMOS
Water analogy:
• Voltage (volt, V) ↔ water pressure (bar)
• Current (ampere, A) ↔ water quantity per second (liter/s)
• Energy ↔ amount of water
[Figure: load capacitance CL; signal levels 1 and 0]
Energy consumption is proportional to the capacitive load!
Recap: Energy and Power
• Power is the height of the curve
• Energy is the area under the curve
• Energy = Power * time for calculation = Power * Delay
[Figure: two plots of power (watts) over time, each comparing Approach 1 and Approach 2]
Recap: Power Equations in CMOS
$P = \alpha f C_L V_{DD}^2 + V_{DD} I_{peak} (P_{01} + P_{10}) + V_{DD} I_{leak}$
• First term: dynamic power (≈ 40 to 70 % today and decreasing relatively)
• Second term: short-circuit power (≈ 10 % today and decreasing absolutely)
• Third term: leakage power (≈ 20 to 50 % today and increasing)
Recap: Levels of Optimization
[Figure: levels of optimization, after Massoud Pedram]
Recap: Logic Restructuring
• Logic restructuring: changing the topology of a logic network to reduce transitions
• For an AND gate: $P_{01} = P_0 \cdot P_1 = (1 - P_A P_B) \cdot P_A P_B$
• Chain implementation has a lower overall switching activity than tree implementation for random inputs
• BUT: ignores glitching effects
[Figure: four-input AND, all input probabilities 0.5. Chain: W = A·B with (1 - 0.25) * 0.25 = 3/16 = 0.188, X = W·C with 7/64 = 0.109, F = X·D with 15/256. Tree: Y = A·B with 3/16, Z = C·D with 3/16, F = Y·Z with 15/256]
Source: Timmermann, 2007
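The chain-versus-tree comparison can be checked numerically. The following is a minimal sketch, assuming independent inputs with signal probability 0.5 and ignoring glitching (as the slide does); the helper names `and_gate`, `chain_activity` and `tree_activity` are illustrative and not part of the lecture.

```python
from fractions import Fraction


def and_gate(p_a, p_b):
    """Return (output signal probability, 0->1 transition probability) of a 2-input AND."""
    p_out = p_a * p_b
    return p_out, (1 - p_out) * p_out


def chain_activity(probs):
    """Sum of gate activities when the inputs are ANDed one after another."""
    p, total = probs[0], Fraction(0)
    for p_in in probs[1:]:
        p, activity = and_gate(p, p_in)
        total += activity
    return total


def tree_activity(probs):
    """Sum of gate activities for a balanced tree (input count must be a power of two)."""
    level, total = list(probs), Fraction(0)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            p, activity = and_gate(level[i], level[i + 1])
            total += activity
            next_level.append(p)
        level = next_level
    return total


inputs = [Fraction(1, 2)] * 4
chain_total = chain_activity(inputs)  # 3/16 + 7/64 + 15/256
tree_total = tree_activity(inputs)    # 3/16 + 3/16 + 15/256
print(chain_total, tree_total)
```

Under these assumptions the chain sums to 91/256 and the tree to 111/256, which matches the statement that the chain has the lower overall activity for random inputs.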
Recap: Input Ordering
• Beneficial: postponing the introduction of signals with a high transition rate (signals with signal probability close to 0.5)
• For an AND gate: $P_{01} = (1 - P_A P_B) \cdot P_A P_B$
[Figure: F = A·B·C built from two AND gates, with signal probabilities A = 0.5, B = 0.2, C = 0.1. Ordering 1: X = A·B, activity (1 - 0.5 x 0.2) * (0.5 x 0.2) = 0.09. Ordering 2: X = B·C, activity (1 - 0.2 x 0.1) * (0.2 x 0.1) = 0.0196]
Source: Irwin, 2000
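The ordering choice can be explored by evaluating the activity of the internal node X for every possible input pair. This is a small sketch using the signal probabilities from the slide (A = 0.5, B = 0.2, C = 0.1) and assuming independent inputs; the function names are illustrative only.

```python
from itertools import combinations


def and_activity(p_a, p_b):
    """0->1 transition probability at the output of a 2-input AND gate."""
    p_out = p_a * p_b
    return (1 - p_out) * p_out


signal_prob = {"A": 0.5, "B": 0.2, "C": 0.1}


def first_gate_activity(pair):
    """Activity of internal node X when `pair` feeds the first AND gate."""
    x, y = pair
    return and_activity(signal_prob[x], signal_prob[y])


pairs = list(combinations(signal_prob, 2))
best_pair = min(pairs, key=first_gate_activity)

for pair in pairs:
    print(pair, round(first_gate_activity(pair), 4))
print("lowest activity at X:", best_pair)
```

The activity at the final output F is the same for every ordering, because the output probability is always the product of all three input probabilities, so only the internal node distinguishes the orderings.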
Recap: Glitching
[Figure: unit-delay circuit with inputs A, B, C, internal node X and output Z; waveforms of X and Z for the input change ABC = 101 → 000]
Source: Irwin, 2000
Design Layer: Gate Level
• Basic elements:
  • Logic gates
  • Sequential elements (flip-flops, latches)
• Behavior of elements is described in libraries
Dynamic Power and Device Size
• Device sizing (= changing gate width)
  • Affects input capacitance Cin
  • Affects load capacitance Cload
  • Affects dynamic power consumption Pdyn
• Optimal fanout factor f for Pdyn is smaller than for performance (especially for large loads)
  • e.g., for Cload = 20, Cin = 1:
    • fcircuit = 20
    • fopt_energy = 3.53
    • fopt_performance = 4.47
• For low power: avoid oversizing (f too big) beyond the optimum
[Figure: normalized energy versus fanout f, with curves for fcircuit = 1, 2, 5, 10, 20]
Source: Nikolic, UCB
VDD versus Delay and Power
• Delay (td) and dynamic power consumption (Pdyn) are functions of VDD
[Figure: Pdyn and td plotted against VDD]
Multiple VDD
• Main ideas:
  • Use of different supply voltages within the same design
  • High VDD for critical parts (high performance needed)
  • Low VDD for non-critical parts (only low performance demands)
• At design phase:
  • Determine critical path(s) (see the following slides)
  • High VDD for gates on those paths
  • Lower VDD on the other gates (in non-critical paths)
  • For low VDD: prefer gates that drive large capacitances (yields the largest energy benefits)
• Usually two different VDD (but more are possible)
Multiple VDD cont’d
• Level converters:
  • Necessary when a module at the lower supply drives a gate at the higher supply (step-up)
  • If a gate supplied with VDDL drives a gate supplied with VDDH, then the PMOS never turns off
  • Possible implementation:
    • Cross-coupled PMOS transistors
    • NMOS transistors operate on reduced supply
  • No need for level converters for a step-down change in voltage
• Reducing the overhead:
  • Conversions at register boundaries
  • Embedding of the level conversion inside the flip-flop
[Figure: level converter with supplies VDDH and VDDL, input Vin and output Vout]
Data Paths
• Data propagate through different data paths between registers (flip-flops, FF)
• Paths mostly differ in propagation delay times
• Frequency of the clock signal (CLK) depends on the path with the longest delay ⇒ critical path
[Figure: registers connected by several paths, with one path highlighted]
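Data Paths: Slack
[Figure: timeline for a gate G1 that feeds a gate G2, with signals A, B, C, Y. Marked events: all inputs of G1 arrived; G1 ready with evaluation (after the delay of G1); all inputs of G2 arrived. The slack for G1 is the time between G1 finishing its evaluation and the arrival of all inputs of G2]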
Multiple VDD in Data Paths
• Minimum energy consumption when all logic paths are critical (same delay)
• Possible algorithm: clustered voltage scaling
  • Each path starts with VDDH and switches to VDDL (blue gates) when slack is available
  • Level conversion in flip-flops at the end of paths
[Figure: data paths with gates marked as connected with VDDH or connected with VDDL]
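The idea of starting each path on VDDH and switching to VDDL once slack is available can be sketched for a single path. This is a deliberately simplified greedy illustration, not the clustered voltage-scaling algorithm as published: it looks at one path in isolation, and the gate delays and clock period are hypothetical numbers chosen for the example.

```python
def assign_supplies(delays_high, delays_low, clock_period):
    """Greedy single-path sketch: walk from the path end towards its start and
    move gates to VDDL while the path still meets the clock period."""
    supplies = ["VDDH"] * len(delays_high)
    path_delay = sum(delays_high)
    for i in reversed(range(len(delays_high))):
        extra = delays_low[i] - delays_high[i]
        if path_delay + extra > clock_period:
            break  # slack used up: all earlier gates stay on VDDH
        supplies[i] = "VDDL"
        path_delay += extra
    return supplies, path_delay


# Hypothetical gate delays in ns (VDDL gates are assumed to be 50 % slower)
delays_high = [1.0, 1.0, 1.0, 1.0]
delays_low = [1.5, 1.5, 1.5, 1.5]

supplies, delay = assign_supplies(delays_high, delays_low, clock_period=5.0)
print(supplies, delay)
```

Stopping at the first gate that no longer fits keeps all VDDL gates contiguous at the end of the path, so no VDDL gate drives a VDDH gate and the only level conversion is in the flip-flop at the path end.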
Design Layer: Architecture Level
• Also known as register transfer level (RTL)
• Base elements:
  • Register structures
  • Arithmetic logic units (ALU)
  • Memory elements
• Only behavior is described (no inner structure)
Clock Gating
• Most popular method for power reduction of clock signals and functional units
• Gate off the clock to idle functional units
• Logic for generating the disable signal is necessary
  • Higher complexity of control logic
  • Higher power consumption
  • Timing is critical to avoid clock glitches at the OR gate output
  • Additional gate delay on the clock signal
[Figure: register (Reg) in front of a functional unit; the register clock is the OR of clock and disable]
Source: Irwin, 2000
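A behavioral model makes the effect of the OR-based gating visible. The sketch below is an idealized illustration: the clock is sampled twice per cycle, the disable signal only changes at cycle boundaries (so no glitches occur), and the idle pattern is hypothetical.

```python
def count_register_clock_edges(clock, disable):
    """Count rising edges seen by a register whose clock input is `clock OR disable`."""
    gated = [c | d for c, d in zip(clock, disable)]
    return sum(1 for prev, cur in zip(gated, gated[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, sampled at two points per cycle (low phase, high phase).
clock = [0, 1] * 8

# Hypothetical activity: the functional unit is idle during cycles 2..6.
disable = [0, 0] * 2 + [1, 1] * 5 + [0, 0]

ungated_edges = count_register_clock_edges(clock, [0] * len(clock))
gated_edges = count_register_clock_edges(clock, disable)
print(ungated_edges, gated_edges)
```

With this pattern the register sees 3 rising edges instead of 8. In a real circuit the disable signal must not change while the clock is low, otherwise the OR gate output produces a spurious edge, which is the timing issue named on the slide.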
Clock Gating cont’d
• Clock gating in a low-power flip-flop
[Figure: flip-flop with data input D, output Q and clock CLK]
Source: Agarwal, 2007
Clock Gating cont’d
• Clock gating based on the state of finite-state machines (FSM)
[Figure: combinational logic with primary inputs (PI) and primary outputs (PO), flip-flops, clock activation logic, latch and CLK]
Source: L. Benini and G. De Micheli, Dynamic Power Management, Boston: Springer, 1998.
Clock Gating: Example
MPEG4 decoder:
• 90% of flip-flops clock-gated
• 70% power reduction by clock gating
  • Without clock gating: 30.6 mW
  • With clock gating: 8.5 mW
[Figure: bar chart of power in mW and chip layout with blocks DEU, VDE, MIF, DSP/HIF and 896 Kb SRAM]
Source: M. Ohashi, Matsushita, 2002
Recap: VDD versus Delay and Power
• Dynamic power can be traded for delay
[Figure: Pdyn and td plotted against VDD]
A Reference Datapath
[Figure: input register, combinational logic (Cref) and output register, clocked by CLK]
• Supply voltage = Vref
• Total capacitance switched per cycle = Cref
• Clock frequency = fclk
• Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f_{clk}$
Source: Agarwal, 2007
Parallel Architecture
• Supply voltage: VN ≤ Vref
• N = degree of parallelism
• Each copy processes every Nth input and operates at reduced voltage
[Figure: the input feeds N copies of the combinational logic, each with its own register clocked at fclk/N; an N-to-1 multiplexer followed by a register clocked at fclk produces the output; a multiphase clock generator derives the clocks and the mux control from CK]
Source: Agarwal, 2007
Pipelined Architecture
• Reduces the propagation time of a block by a factor of N ⇒ voltage can be reduced at constant clock frequency
• Constant throughput
• Functionality:
[Figure: a block of area A between two registers is split into register-separated stages of area A/N each, all clocked by CLK]
Parallel Architecture: Example
• Reference datapath (for example)
  • Critical path delay: Tadder + Tcomparator (= 25 ns) ⇒ fref = 40 MHz
  • Total capacitance being switched = Cref
  • VDD = Vref = 5 V
  • Power for reference datapath: $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
[Figure: datapath with inputs A and B]
Source: Irwin, 2000
Parallel Architecture: Example cont’d
• The clock rate can be reduced by half with the same throughput ⇒ $f_{par} = f_{ref}/2$
• $V_{par} = V_{ref}/1.7$, $C_{par} = 2.15\,C_{ref}$
• $P_{par} = (2.15\,C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36\,P_{ref}$
• Area = 1476 x 1219 µm²
Source: Irwin, 2000
Pipelined Architecture: Example
• $f_{pipe} = f_{ref}$, $C_{pipe} = 1.1\,C_{ref}$, $V_{pipe} = V_{ref}/1.7$
• Voltage can be dropped while maintaining the original throughput
• $P_{pipe} = C_{pipe} V_{pipe}^2 f_{pipe} = (1.1\,C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.37\,P_{ref}$
Source: Irwin, 2000
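Both the parallel and the pipelined example follow from P = C·V²·f with every quantity expressed relative to the reference datapath. A minimal sketch of that arithmetic, with `relative_power` as an illustrative helper name:

```python
def relative_power(cap_factor, voltage_divisor, freq_factor):
    """P / Pref for P = C * V^2 * f, with C, V and f given relative to the reference."""
    return cap_factor * (1.0 / voltage_divisor) ** 2 * freq_factor


# Parallel: 2.15x capacitance, Vref / 1.7, half the clock frequency
parallel = relative_power(cap_factor=2.15, voltage_divisor=1.7, freq_factor=0.5)

# Pipelined: 1.1x capacitance, Vref / 1.7, same clock frequency
pipelined = relative_power(cap_factor=1.1, voltage_divisor=1.7, freq_factor=1.0)

print(f"parallel:  {parallel:.3f} Pref")
print(f"pipelined: {pipelined:.3f} Pref")
```

Evaluated with the divisor 1.7 exactly, this gives about 0.372 and 0.381, slightly above the 0.36 and 0.37 quoted on the slides; the small difference presumably comes from rounding of the voltage ratio in the source.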
Approximate Trend
Source: G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
Guarded Evaluation
• Reduction of switching activity by adding latches at inputs
• Latch preserves the previous value of the inputs to suppress activity
• Could also use AND gates to mask inputs to zero (forced zero)
[Figure: multiplier with inputs A, B and output C selected by a condition signal, shown without and with input latches controlled by the condition]
Precomputation
• Identify logical conditions on some inputs under which the output does not depend on the remaining inputs
• Since those remaining inputs don’t affect the output, disable their transitions
• Trade area for energy
[Figure: precomputed inputs are stored in register R1 and gated inputs in register R2; both feed the combinational logic f(X) that produces the outputs; the precomputation logic g(X) drives the load-disable input of R2]
Source: Irwin, 2000
Precomputation: Design Issues
• Design steps
  1. Selection of precomputation architecture
  2. Determination of precomputed and gated inputs (register R1 should be much smaller than R2)
  3. Search for a good implementation of g(X)
  4. Evaluation of potential energy savings based on input statistics (if the savings are not sufficient, go back to step 2 or 3 and try again)
• Also works for multiple-output functions, where g(X) is the product of gj(X) over all j
Source: Irwin, 2000
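Precomputation: Example
• Binary comparator
• Can achieve up to 75% power reduction with 3% area overhead and 1 to 5 additional gate delays in the worst-case path
[Figure: n-bit binary value comparator computing A > B. The most significant bits An and Bn are stored in register R1; the remaining bits An-1 down to A1 and Bn-1 down to B1 are stored in register R2, whose load-disable input is controlled by logic comparing An and Bn (labelled An = Bn)]
Source: Irwin, 2000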
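The comparator example can be modeled behaviorally: when the most significant bits differ they already decide A > B, so the register holding the lower bits does not need to load new values. The sketch below is an illustration under that reading of the figure; the function name, the dictionary standing in for register R2 and the uniformly random inputs are assumptions made for the example.

```python
import random


def precomputed_greater(a, b, n, r2):
    """Return (a > b, r2_loaded) for n-bit unsigned a and b.

    `r2` is a dict standing in for register R2 (the lower n-1 bits of both
    operands). It is only updated when the most significant bits are equal.
    """
    msb_a, msb_b = a >> (n - 1), b >> (n - 1)
    if msb_a != msb_b:
        # The MSBs decide the result; R2 keeps its old contents (load disabled).
        return msb_a > msb_b, False
    mask = (1 << (n - 1)) - 1
    r2["a"], r2["b"] = a & mask, b & mask
    return r2["a"] > r2["b"], True


random.seed(0)
n, trials = 8, 10_000
r2 = {"a": 0, "b": 0}
loads = 0
all_correct = True
for _ in range(trials):
    a, b = random.randrange(1 << n), random.randrange(1 << n)
    result, loaded = precomputed_greater(a, b, n, r2)
    all_correct &= result == (a > b)
    loads += loaded

load_fraction = loads / trials
print(all_correct, load_fraction)
```

For uniformly random operands the MSBs are equal in about half of the cases, so R2 is loaded roughly every second cycle; the achievable savings in a real design depend on the input statistics, as the design steps point out.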
Adder Design
• Various algorithms exist to implement an integer adder
  • Ripple, select, skip (x2), look-ahead, conditional-sum
• Each has its own characteristics of timing and power consumption
[Figure: block diagrams of full-adder (FA) arrangements for ripple carry, carry select, variable/fixed width carry skip and carry look-ahead adders]
Source: Mendelson, Intel
Adder Design
• Adders differ in energy and delay
• Different adders for different applications
• Also true for other units (multiplier, counter, …)

Adder type                    Energy (pJ)   Delay (ns)
Ripple Carry                  117           54.27
Constant Width Carry Skip     109           28.38
Variable Width Carry Skip     126           21.84
Carry Lookahead               171           17.13
Carry Select                  216           19.56
Conditional Sum               304           20.05

Source: Callaway, Swartzlander, “Estimating the power consumption of CMOS adders”, Proceedings of the 11th Symposium on Computer Arithmetic, 1993.
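One way to compare the adders in a single number is the energy-delay product. This metric is not used on the slide; it is added here only as an illustration. The sketch reads the first value of each row as energy in pJ and the second as delay in ns, which is an assumption about the flattened column headers.

```python
# (energy in pJ, delay in ns), column assignment assumed as described above
adders = {
    "Ripple Carry": (117, 54.27),
    "Constant Width Carry Skip": (109, 28.38),
    "Variable Width Carry Skip": (126, 21.84),
    "Carry Lookahead": (171, 17.13),
    "Carry Select": (216, 19.56),
    "Conditional Sum": (304, 20.05),
}

# Energy-delay product in pJ * ns; lower is better
edp = {name: energy * delay for name, (energy, delay) in adders.items()}
ranking = sorted(edp, key=edp.get)

for name in ranking:
    print(f"{name:28s} {edp[name]:9.2f}")
```

Under this metric the variable width carry skip adder comes out best and the ripple carry adder worst, even though the ripple carry adder has one of the lowest energies, which illustrates why different adders suit different applications.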
Bus Power
• Buses are a significant source of power dissipation
  • 50% of dynamic power for interconnect switching (Magen, SLIP 04)
  • MIT Raw processor’s on-chip network consumes 36% of total chip power (Wang et al. 2003)
• Caused by:
  • High switching activities
  • Large capacitive loading
[Figure: bus drivers with inputs Ain, Bin, Cin, Din; shared bus; bus receivers with outputs Wout, Xout, Yout, Zout]
Source: Irwin, 2000
Bus Power Reduction
• For an n-bit bus: $P_{bus} = n \cdot \alpha f_{clk} C_{load} V_{DD}^2$
• Alternative bus structures
  • Segmented buses (lower Cload)
  • Charge recovery buses
  • Bus multiplexing (lower fclk possible)
• Minimizing bus traffic (n)
  • Code compression
  • Instruction loop buffers
• Minimization of bit switching activity (α) by data encoding
• Minimize voltage swing ($V_{DD}^2$) using differential signaling
Source: Irwin, 2000
Reducing Shared Resources
• Shared resources incur switching overhead
• Local bus structures reduce overhead
[Figure: global bus architecture versus local bus architecture]
Source: Irwin, 2000
Reducing Shared Resources cont’d
• Bus segmentation
  • Another way to reduce shared buses
  • Control of bus segments by controller blocks (B)
[Figure: shared bus versus segmented bus with controller blocks B]
Source: Evgeny Bolotin, Jan 2004
Design Layer: Algorithm Level
• Base elements:
  • Functions
  • Procedures
  • Processes
  • Control structures
• Description of design behavior
Coding Styles
• Use processor-specific instruction style:
  • Variable types
  • Function call style
  • Conditionalized instructions (for ARM)
• Follow general guidelines for software coding
  • Use table look-up instead of conditionals
  • Make local copies of global variables so that they can be assigned to registers
  • Avoid multiple memory look-ups with pointer chains
Source-code Transformations
• Minimize power-consuming activity:
  • Computation: A*B+A*C ⇒ A*(B+C)
  • Communication: for (c = 1..N) { receive(A); B = c*A } ⇒ receive(A); for (c = 1..N) B = c*A
  • Storage: for (c = 1..N) B[c] = A[c]*D[c]; for (c = 1..N) F[c] = B[c]-1 ⇒ for (c = 1..N) F[c] = A[c]*D[c]-1
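The three transformations can be written out as small before/after pairs. The sketch below only shows that each pair computes the same result with fewer operations, transfers or stored values; it does not measure energy. The `Channel` class is a hypothetical stand-in for the communication primitive, and hoisting `receive` is only valid if the received value does not change between iterations.

```python
class Channel:
    """Hypothetical communication endpoint that counts how often it is read."""

    def __init__(self, value):
        self.value, self.calls = value, 0

    def receive(self):
        self.calls += 1
        return self.value


# Computation: A*B + A*C  =>  A*(B + C)
def distributed(a, b, c):
    return a * b + a * c  # two multiplications, one addition


def factored(a, b, c):
    return a * (b + c)    # one multiplication, one addition


# Communication: receive inside the loop  =>  receive once before the loop
def receive_in_loop(channel, n):
    return [c * channel.receive() for c in range(1, n + 1)]


def receive_hoisted(channel, n):
    a = channel.receive()
    return [c * a for c in range(1, n + 1)]


# Storage: two loops with an intermediate array B  =>  one fused loop
def two_loops(A, D):
    B = [a * d for a, d in zip(A, D)]  # intermediate array is written and re-read
    return [b - 1 for b in B]


def fused_loop(A, D):
    return [a * d - 1 for a, d in zip(A, D)]  # no intermediate array


ch_before, ch_after = Channel(7), Channel(7)
same_result = receive_in_loop(ch_before, 4) == receive_hoisted(ch_after, 4)
print(same_result, ch_before.calls, ch_after.calls)
```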
Datapath Energy Consumption
• Algorithms can differ in power dissipation
[Figure: stacked bar chart of switched capacitance (nF, axis from 0 to 14000) for bubble.c, heap.c and quick.c, broken down into register file, pipeline registers, functional unit and others]
Source: Irwin, 2000
Adaptive Dynamic Voltage Scaling (DVS)
• Slow down the processor to fill idle time
• More delay ⇒ lower operational voltage
• Runtime scheduler determines processor speed and selects the appropriate voltage
• Transition delay for frequency changes: ~150 µs
• Potential to realize 10x energy savings
[Figure: activity timelines at 3.3 V (active and idle phases) and at 2.4 V (active phases only)]
Adaptive DVS: Example
• Task with 100 ms deadline, requires 50 ms CPU time at full speed
  • Normal system gives 50 ms computation, 50 ms idle/stopped time
  • Half speed/voltage system gives 100 ms computation, 0 ms idle
  • Same number of CPU cycles, but: $E = C (V_{DD}/2)^2 = E_{ref}/4$
• Dynamic voltage scaling adapts the voltage to the workload
[Figure: speed over time for tasks T1 and T2, at full speed with idle time and at reduced speed without idle time: same work, lower energy]
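The example generalizes to any task and deadline. The following sketch uses the same simplifications as the slide: the supply voltage scales linearly with processor speed, the number of cycles is fixed, and the energy per task scales with VDD squared. The function name and the normalization to `e_ref = 1.0` are illustrative.

```python
def dvs_energy(cpu_time_full_speed_ms, deadline_ms, e_ref=1.0):
    """Return (relative speed, energy) when the task is stretched to its deadline.

    Assumes voltage proportional to speed and energy proportional to VDD^2
    for a fixed number of CPU cycles.
    """
    speed = cpu_time_full_speed_ms / deadline_ms
    if speed > 1.0:
        raise ValueError("deadline cannot be met even at full speed")
    return speed, e_ref * speed ** 2


speed, energy = dvs_energy(cpu_time_full_speed_ms=50, deadline_ms=100)
print(speed, energy)
```

For the task above this yields half speed and a quarter of the reference energy. In practice the voltage cannot be lowered arbitrarily and does not scale exactly linearly with frequency, so real savings are smaller than this idealized estimate.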
Design Layer: System Level
• Basic elements:
  • Complex modules
  • Processors
  • Calculation and control units
  • Sensors
[Figure: system with blocks ALU, MEM, MEM and MP3]
Dynamic Power Management
• Systems are:
  • Designed to deliver peak performance, but …
  • Not in need of peak performance most of the time
• Components are idle sometimes
• Dynamic power management (DPM):
  • Puts idle components into low-power, non-operational states
• Power manager:
  • Observes and controls the system
  • Power consumption of the power manager is negligible
Processor Sleep Modes
• Software power control (power management)
  • DOZE: most units stopped except on-chip cache memory (cache coherency)
  • NAP: cache also turned off, PLL still on, time-out or external interrupt to resume
  • SLEEP: PLL off, external interrupt to resume
• A deeper sleep mode requires more latency to resume
• A deeper sleep mode consumes less power
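The trade-off between resume latency and power suggests a simple selection policy: pick the deepest mode whose resume latency is still acceptable. The sketch below illustrates that policy only; the relative power and latency numbers are hypothetical placeholders, not data for any real processor.

```python
# (name, relative power, resume latency in microseconds); all values hypothetical
SLEEP_MODES = [
    ("DOZE", 0.30, 1),
    ("NAP", 0.10, 10),
    ("SLEEP", 0.01, 100),
]


def choose_mode(max_resume_latency_us):
    """Return the lowest-power mode that can resume within the allowed latency."""
    allowed = [mode for mode in SLEEP_MODES if mode[2] <= max_resume_latency_us]
    if not allowed:
        return None  # no sleep mode is fast enough; stay active
    return min(allowed, key=lambda mode: mode[1])[0]


print(choose_mode(50))
print(choose_mode(1000))
```

A power manager as described on the previous slide would additionally have to predict how long a component stays idle before deciding whether entering a mode pays off.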