Advanced Microarchitecture Lecture 15: Power
Basic Power Review
• Power = Voltage × Current
  • Voltage is usually a constant (we’ll talk about voltage scaling later)
  • Current varies
    • Depends on the block (cache vs. ALU vs. decoder …)
    • Depends on the application (int vs. FP vs. multimedia)
    • Depends on the program phase
• Another form:
  • $i = C\,dv/dt \;\Rightarrow\; v\,i\,dt = C\,v\,dv \;\Rightarrow\; E = \tfrac{1}{2} C V^2$ (energy per capacitor)
  • Power = Energy of each capacitor × avg. number of times (dis)charged / time to (dis)charge
  • $P = \sum_{b \in \text{All Blocks}} \tfrac{1}{2} C_b V^2 a_b / t_c = \tfrac{1}{2} V^2 f \sum_b C_b a_b = \tfrac{1}{2} a C V^2 f$, with $f = 1/t_c$
  • C = total capacitance, a = average activity factor

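As a rough illustration of the per-block sum, the sketch below evaluates ½·V²·f·Σ C_b·a_b and checks that it matches the lumped form ½·a·C·V²·f. The `Block` class, the block names, and all capacitance and activity values are made up for the example; they are not figures from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    capacitance_f: float  # switched capacitance C_b in farads (hypothetical value)
    activity: float       # activity factor a_b, between 0 and 1 (hypothetical value)

def dynamic_power(blocks, vdd, freq_hz):
    """P = 1/2 * V^2 * f * sum over blocks of (C_b * a_b)."""
    return 0.5 * vdd ** 2 * freq_hz * sum(b.capacitance_f * b.activity for b in blocks)

def total_capacitance(blocks):
    """C = sum of all block capacitances."""
    return sum(b.capacitance_f for b in blocks)

def average_activity(blocks):
    """Capacitance-weighted average activity factor a."""
    return sum(b.capacitance_f * b.activity for b in blocks) / total_capacitance(blocks)

blocks = [
    Block("cache", 2.0e-9, 0.10),
    Block("alu", 0.5e-9, 0.40),
    Block("decoder", 0.3e-9, 0.25),
]
VDD, FREQ = 1.2, 2.0e9  # volts, hertz (assumed operating point)

p_sum = dynamic_power(blocks, VDD, FREQ)
p_lumped = 0.5 * average_activity(blocks) * total_capacitance(blocks) * VDD ** 2 * FREQ
print(f"per-block sum: {p_sum:.3f} W, lumped form: {p_lumped:.3f} W")
```

The two forms agree only because `a` is defined as the capacitance-weighted average; a plain arithmetic mean of the per-block activity factors would not give the same result.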
Static Power
• We talked about this in Lecture 1
• Two types of static power
  • Leakage through the channel (sub-threshold conductance)
  • Leakage through the gate/oxide (tunneling)
• $P_{static} = P_{sub} + P_{oxide}$
• $P_{total} = P_{dynamic} + P_{static} = \tfrac{1}{2} a C V^2 f + K_1 W e^{-V_T/(n V_\theta)} \left(1 - e^{-V/V_\theta}\right) + K_2 W (V/T_{ox})^2 e^{-\alpha T_{ox}/V}$

Trading Power for Performance
• $P = \tfrac{1}{2} a C V^2 f$ and $f \propto V$, so $P \propto V^3$
• To a first order, $\text{Perf} \propto f$, so $\text{Perf} \propto V$ while $P \propto V^3$
• For a linear decrease in voltage (and performance) we get a cubic decrease in (dynamic) power consumption
• Rule of thumb: for small $\Delta V / \Delta f$, 1% performance for every 3% power
[Figure: power vs. voltage]
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

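A quick numerical check of the rule of thumb, assuming the first-order model from the slide (performance ∝ f ∝ V, dynamic power ∝ V³). The function names are illustrative only.

```python
def relative_performance(v_scale):
    """Performance relative to nominal when V (and therefore f) is scaled by v_scale."""
    return v_scale

def relative_power(v_scale):
    """Dynamic power relative to nominal: P ~ V^2 * f, and f ~ V, so P ~ V^3."""
    return v_scale ** 3

for pct in (1, 5, 10, 20):
    s = 1.0 - pct / 100.0
    perf_loss = 100.0 * (1.0 - relative_performance(s))
    power_saved = 100.0 * (1.0 - relative_power(s))
    print(f"V and f down {pct:2d}%: perf -{perf_loss:.1f}%, power -{power_saved:.1f}%")
```

For a 1% reduction the power saving is close to 3%; for larger steps the ratio drops below 3:1, which is why the rule is stated for small changes only.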
Limits of Trading
• Voltage scaling can take the supply voltage down only so far
  • $V_{dd} - V_T > V_{\text{Noise Margin}}$
  • $V_{dd}$ cannot be scaled below $V_T + V_{\text{Noise Margin}}$
  • Noise can cause a transistor to accidentally switch!
• Below this, we can only use frequency scaling (decrease f, but keep V constant), which provides only a linear power reduction ($\tfrac{1}{2} a C V^2 f$)
[Figure: power vs. voltage/frequency, following $P \propto V^3$ down to the voltage floor; Vdd, VT, and Gnd levels with noise]

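The two regimes can be sketched as a piecewise function: voltage tracks frequency until it hits the floor V_T + V_NoiseMargin, after which only frequency keeps dropping. The nominal voltage and floor used here are assumed values, and the linear V-f relationship is the same first-order approximation as before.

```python
def scaled_power(perf, v_nom, v_min):
    """Relative dynamic power at relative performance `perf` (0 < perf <= 1).

    Assumes f is proportional to perf, and that V is lowered in proportion
    to f until it reaches the floor v_min = V_T + V_noise_margin.
    """
    v = max(v_nom * perf, v_min)      # voltage cannot go below the floor
    return (v / v_nom) ** 2 * perf    # P ~ V^2 * f, normalized to nominal

V_NOM, V_MIN = 1.2, 0.8  # volts (hypothetical)
for perf in (1.0, 0.8, 0.6, 0.5, 0.4):
    print(f"perf {perf:.1f}: power {scaled_power(perf, V_NOM, V_MIN):.3f} "
          f"(pure cubic would be {perf ** 3:.3f})")
```

Once the floor is reached (below about 0.67 of nominal performance with these numbers), power falls only linearly with performance, so each further step saves less than the cubic curve would suggest.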
DVFS
• Dynamic Voltage/Frequency Scaling
• Someone tracks performance demands, idleness, etc.
  • “Someone” is typically the OS with hardware support
  • … but you could have a hardware-only approach
  • Under thermal emergencies, the HW takes over regardless of what voltage/frequency the OS asks for
• Goal: consume minimum power necessary while still meeting performance demands
• Can also do just DVS or DFS

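A minimal sketch of the decision the "someone" makes, under stated assumptions: the operating-point table, the `headroom` parameter, and the policy of picking the slowest point that covers demand are all hypothetical and do not describe any particular OS governor.

```python
# Hypothetical operating points as (frequency in MHz, voltage in V), slowest first.
OPERATING_POINTS = [(600, 0.80), (1000, 0.95), (1600, 1.10), (2000, 1.20)]

def pick_operating_point(utilization, current_mhz, thermal_emergency=False, headroom=0.8):
    """Choose the lowest-power point that still meets the observed demand.

    utilization: fraction of the last window the CPU was busy (0..1)
    headroom: target utilization at the new frequency (assumed policy knob)
    """
    if thermal_emergency:
        # Hardware override: drop to the lowest point regardless of demand.
        return OPERATING_POINTS[0]
    demand_mhz = utilization * current_mhz / headroom
    for point in OPERATING_POINTS:
        if point[0] >= demand_mhz:
            return point
    return OPERATING_POINTS[-1]  # demand exceeds every point: run flat out

print(pick_operating_point(0.30, 2000))        # light load
print(pick_operating_point(0.95, 2000))        # heavy load
print(pick_operating_point(0.95, 2000, True))  # thermal emergency
```

Restricting the table to points that share one voltage would turn the same loop into DFS only.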
Clock Gating
• CMOS logic is also called “static” logic:
  • If the inputs don’t change, neither do the outputs (or any other intermediate nodes)
• Therefore, to reduce dynamic power in CMOS circuits, don’t let the inputs change if you don’t need to!
• Clock gate this block? The latch doesn’t grab a new value, so its output (the block’s input) doesn’t change
[Figure: a latch feeding a CMOS block, shown ungated (a new input value each cycle, power dissipated) and clock-gated (input held)]

Example: ALU
• Without gating: all units (+, shift, logic, comp, ×) consume power, but only one output is useful
• With gating: based on the opcode, the clock-gating logic clock-gates all but the one required unit
• Note, this logic consumes its own power
[Figure: ALU with parallel +, shift, logic, comp, and × units driven by the opcode, without and with clock-gating logic; one result is selected]

Logic Timing
• To properly clock-gate, you must know in the previous cycle that you’re going to gate (otherwise it’ll be too late, as the clock edge will have already arrived)
[Figure: Payload RAM (Opcode, ValueL, ValueE) feeding the clock-gating logic and the +, logic, and comp units]

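The one-cycle-ahead requirement can be illustrated with a toy cycle-level model: the enables used in cycle N are computed and latched in cycle N-1 from the opcode that is about to execute. The unit names and the opcode stream are invented for the example, and the power consumed by the gating logic itself is not modeled.

```python
UNITS = ("add", "shift", "logic", "comp", "mul")

def clock_enables(next_opcode):
    """Gate decision made in cycle N-1 for the opcode that executes in cycle N."""
    return {unit: unit == next_opcode for unit in UNITS}

def clocked_unit_cycles(opcodes, gated=True):
    """Count (unit, cycle) pairs that receive a clock edge."""
    total = 0
    # Enables for the first cycle were computed in the cycle before it.
    latched = clock_enables(opcodes[0]) if opcodes else {}
    for cycle, _opcode in enumerate(opcodes):
        active = latched if gated else {unit: True for unit in UNITS}
        total += sum(active.values())
        # While this opcode executes, read the next one and latch its enables.
        upcoming = opcodes[cycle + 1] if cycle + 1 < len(opcodes) else None
        latched = clock_enables(upcoming)
    return total

stream = ["add", "add", "shift", "mul", "comp"]
print("ungated:", clocked_unit_cycles(stream, gated=False))
print("gated:  ", clocked_unit_cycles(stream, gated=True))
```

If the opcode were only known in the same cycle it executes, `latched` could not be prepared in time and the unit would have to be clocked unconditionally.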
Logic Timing
• Not all blocks can be easily gated
  • may be difficult to know ahead of time whether gating should be applied
    • likely true for critical path circuits: e.g., gating select logic is probably difficult since bidders are not known until the last moment
  • computation of gating condition may be complex
    • value-based (is input zero?)
    • multi-value based (are all inputs zero?)
    • multi-condition based (are all RS entries not bidding?)

Clock Gating Dynamic Logic
• CMOS logic toggles only when input changes
• Dynamic logic may consume power regardless
  • If A (or B) equals 1 and does not change, then the sequence is: precharge X to 1, evaluate discharges X to 0, precharge X to 1, evaluate …
• Gating inputs is not enough; need to ensure CLK is disabled.
[Figure: CMOS NOR gate and N-Domino NOR gate, each with node X]
Pictures from http://6004.csail.mit.edu/6.371/handouts/L11.pdf

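The difference can be seen by counting transitions on node X for a constant input. This is a behavioral toy model of the two NOR gates, not a circuit simulation; one loop iteration stands for one clock cycle (precharge phase followed by evaluate phase).

```python
def static_nor_transitions(inputs):
    """Transitions on the output of a static CMOS NOR over a sequence of (A, B) pairs."""
    transitions, prev = 0, None
    for a, b in inputs:
        out = int(not (a or b))
        if prev is not None and out != prev:
            transitions += 1
        prev = out
    return transitions

def domino_nor_transitions(inputs, clock_enabled=True):
    """Transitions on the precharged node X of an N-domino NOR."""
    if not clock_enabled:
        return 0  # no precharge/evaluate phases, so X never moves
    x, transitions = 1, 0  # assume X starts precharged
    for a, b in inputs:
        if x != 1:          # precharge phase pulls X back to 1
            x, transitions = 1, transitions + 1
        if a or b:          # evaluate phase discharges X
            x, transitions = 0, transitions + 1
    return transitions

held = [(1, 0)] * 4  # A stuck at 1 for four cycles
print("static NOR:", static_nor_transitions(held))
print("domino NOR:", domino_nor_transitions(held))
print("domino NOR, clock gated:", domino_nor_transitions(held, clock_enabled=False))
```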
Clock Gating is for Dynamic Power
• Even if gates are not toggling, they continue to leak
[Figure: two inverters between Vdd and Gnd, one for each input value (1 and 0); labels mark gate leakage and subthreshold leakage through the On and Off transistors]

Reducing Leakage: Stacking
• Intermediate node of the stack has V > 0 (resistive divider: V, R, V/2, R, 0)
• $V_B = 0$ and $V_S \approx V/2$, so $V_{SB} > 0$; higher $V_{SB}$ increases $V_T$
• Higher threshold voltage decreases leakage current
• Higher resistance increases gate latency
[Figure: channel leakage through a single off transistor vs. through a higher-resistance stack]

Body Bias Effect
• Larger $V_{SB}$ means less channel leakage
[Figure: two transistors with source voltage $V_S$ and body voltage $V_B$; the one with the larger $V_{SB}$ has less channel leakage]
WARNING: This is a GROSSLY simplified explanation!!! If you’re interested in low-power circuits and microarchitecture, you should go read up on some real semiconductor/electronics literature.

Dual VT Devices
• Manufacture two types of transistors:
  • Low VT gates: fast, high leakage
  • High VT gates: slow, low leakage (typically 10x less)
• Designer chooses what kind to use
• Pro:
  • less area than stacking (one high-VT gate = one low-VT gate in area, stacking requires multiple gates)
• Con:
  • Manufacturing process needs to provide two device types

Use Only Where Appropriate
• Stacking and higher VT both slow down the gates
• Analyze circuits and…
  • apply one or both techniques to gates not on the critical path
  • apply to longest path if timing permits (i.e., this circuit is not a frequency limiter)
[Figure: circuit with the critical path gates marked; stack or use high-VT gates on the other gates]

Standby Input Vectors
• The amount of leakage depends on the clock-gated inputs to the gate
[Figure: the same gate under four input combinations, with each transistor marked On or Off; the cases are labeled "2 off transistors in parallel", "1 off transistor in leakage path", "1 off transistor in leakage path", and "2 off transistors in leakage path"]

Standby Input Vectors
• When clock-gating a block
  • disable latch clock (as usual)
  • load leakage-minimizing input vector (stored elsewhere)
    • Can cause spurious transitions that consume more dynamic power
• How to determine best input vector for n-input gate?
[Figure: clock-gated latches driving a standby vector of 1s into the block]

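One simple (if brute-force) answer to the closing question is to evaluate a leakage model for every input vector and keep the minimum. The sketch below does this for an n-input NAND with a deliberately crude model: the single-transistor leakage `I_OFF` and the per-transistor `STACK_FACTOR` are assumed numbers chosen only to reproduce the ordering of the four cases on the previous slide.

```python
from itertools import product

I_OFF = 1.0          # leakage of one off transistor, arbitrary units (assumed)
STACK_FACTOR = 10.0  # assumed reduction for each extra off transistor in series

def nand_leakage(inputs):
    """Crude leakage model of an n-input NAND for a given input vector."""
    off_nmos = sum(1 for bit in inputs if bit == 0)
    if off_nmos == 0:
        # Output is 0: every PMOS is off, and they leak in parallel.
        return len(inputs) * I_OFF
    # Output is 1: leakage flows through the series NMOS stack.
    return I_OFF / STACK_FACTOR ** (off_nmos - 1)

def best_standby_vector(n_inputs, leakage_fn):
    """Exhaustively search all 2^n input vectors for the lowest leakage."""
    return min(product((0, 1), repeat=n_inputs), key=leakage_fn)

for vec in product((0, 1), repeat=2):
    print(vec, nand_leakage(vec))
print("best 2-input vector:", best_standby_vector(2, nand_leakage))
```

Exhaustive search grows as 2^n, so it is only practical for single gates or small blocks; larger blocks would need a heuristic search, which this sketch does not attempt.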
Variant: Embedded Dual-VT
• Instead of at the gate-level, choose high-VT vs. low-VT at the transistor-level
• Can be used if some transitions are more important than others
  • “more important” can be speed or power
• Combine with setting input sleep vectors
  • make the off transistors high-VT if possible to further reduce leakage
[Figure: gate built from a mix of high-VT devices and low-VT devices]

Power Gating
• If you turn off the power, then the gates can’t leak
• Gating transistor also called “sleep” transistor
• This gating transistor is a beast… it needs to be big enough to supply the necessary current when not gated, and it also needs to be low leakage (high VT gate)
[Figure: gate powered from a Virtual Vdd rail; the gating transistor between Vdd and Virtual Vdd is shown On (normal operation) and Off (gated)]

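Power Gating
• Gating only Vdd (Virtual Vdd): after gating, residual charge in system will continue to leak
• Gating both rails (Virtual Vdd and Virtual Gnd): both paths cut off now
[Figure: block between Vdd and Gnd with sleep transistors on the Virtual Vdd rail only, and on both the Virtual Vdd and Virtual Gnd rails]
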
Turn-On/Turn-Off Latency
• Sleep transistors are slow high VT devices
• Depending on size of block covered by sleep transistor, virtual Vdd/Gnd may have a lot of capacitance to charge/discharge
  • Moderate R, large C: large RC (slow)
• Wakeup delay can cause significant performance penalties when units are unavailable
[Figure: Virt. Vdd vs. time, charged from Vdd through R into C; ALU asleep, ADD inst ready to execute, delay to wake up ALU, ADD exec]

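To get a feel for the RC delay, the sketch below treats the sleep transistor as a fixed resistance charging the virtual Vdd capacitance from 0 V, and reports how long it takes to reach a chosen fraction of Vdd. The R, C, clock frequency, and 95% threshold are all assumed values, and a real sleep transistor is not a linear resistor.

```python
import math

def wakeup_time(r_ohm, c_farad, target_fraction=0.95):
    """Time for an RC node to charge from 0 to target_fraction of Vdd.

    v(t) = Vdd * (1 - exp(-t / RC))  =>  t = -RC * ln(1 - target_fraction)
    """
    return -r_ohm * c_farad * math.log(1.0 - target_fraction)

def wakeup_cycles(r_ohm, c_farad, freq_hz, target_fraction=0.95):
    """Wakeup delay rounded up to whole clock cycles."""
    return math.ceil(wakeup_time(r_ohm, c_farad, target_fraction) * freq_hz)

R, C, F = 1.0, 10e-9, 2.0e9  # ohms, farads, hertz (hypothetical)
print(f"wakeup time: {wakeup_time(R, C) * 1e9:.1f} ns")
print(f"wakeup delay: {wakeup_cycles(R, C, F)} cycles")
```

With these numbers an instruction that finds its unit asleep waits tens of cycles, which is the performance penalty the slide warns about.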
Turn-On/Turn-Off Latency
• In some situations, can know early enough ahead
  • FP inst decoded! Immediately send wakeup to FPU
  • Hopefully by the time the fadd makes it to the OOO core, gets scheduled, and makes it to the FPU, the turn-on has completed
[Figure: crude pipeline with fetch, decode, …, and fadd exec in the FPU]

Turn-On/Turn-Off Latency
• In some cases it’s much harder
  • pipeline full/stalled (maybe due to D$ miss to main memory)
  • power-off front-end units (fetch, decode, etc.)
  • miss serviced, back-end starts moving again; front-end starts wake up
  • back-end gets starved because front-end wakeup is too slow and can’t refill the pipeline
• But it’s hard to start the power-on early because we don’t know when the memory request will be fulfilled (and whether that will cause the back-end to drain)

Turn-On/Turn-Off Power
• (Dis)Charging Virtual Vdd/Gnd consumes quite a bit of energy/power ($P = \tfrac{1}{2} a C V^2 f$)
• Worst-case: charge up as soon as you’re done discharging
  • Go to sleep! … Done discharging, now wakeup!
  • We just wasted $2 \times \tfrac{1}{2} \times C_{\text{Virt Vdd}} \times V_{dd}^2$ Joules to discharge and then recharge the virtual Vdd
  • And we spent zero cycles fully asleep, so we didn’t save any/much leakage power
[Figure: Virt. Vdd vs. time]

Turn-On/Turn-Off Power
• Must stay asleep for some time, just to break even!
  • Too little sleep… ends up costing more energy than doing nothing (extra energy spent)
  • Sleep interval > break-even length: energy reduction
[Figure: energy consumed vs. time; energy consumed from leakage (no sleeping) compared with energy to discharge Virtual Vdd/Gnd, zero energy consumed while sleeping, and energy to recharge Virtual Vdd/Gnd; the minimum sleep-time for energy break-even is marked]

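A back-of-the-envelope version of the break-even calculation, assuming the transition overhead is the 2 × ½·C·V² discharge-plus-recharge energy from the previous slide and that leakage is a constant power that drops to zero while asleep. The capacitance, voltage, and leakage power are hypothetical.

```python
def transition_energy(c_virt, vdd):
    """Energy to discharge and then recharge the virtual rail: 2 * (1/2 * C * V^2)."""
    return c_virt * vdd ** 2

def break_even_time(c_virt, vdd, p_leak):
    """Minimum sleep time for the leakage saved to equal the transition energy."""
    return transition_energy(c_virt, vdd) / p_leak

def net_energy_saved(sleep_time, c_virt, vdd, p_leak):
    """Positive when sleeping saved energy, negative when it cost energy."""
    return p_leak * sleep_time - transition_energy(c_virt, vdd)

C_VIRT, VDD, P_LEAK = 10e-9, 1.0, 0.5  # farads, volts, watts (hypothetical)
t_be = break_even_time(C_VIRT, VDD, P_LEAK)
print(f"break-even sleep time: {t_be * 1e9:.0f} ns")
for t in (10e-9, 20e-9, 40e-9):
    saved = net_energy_saved(t, C_VIRT, VDD, P_LEAK)
    print(f"sleep {t * 1e9:.0f} ns: net {saved * 1e9:+.1f} nJ")
```

This ignores the leakage that continues while the rail is still discharging, so the real break-even point is somewhat later than this estimate.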
Turn-On/Turn-Off Noise
• Instantly turning on the sleep transistor to recharge virtual Vdd causes very large current spike (di/dt)
  • Water tank analogy: Flush! draws $I_{john}$ from the tank, leaving $I_{shower} - I_{john}$ for the shower (pressure drop)
• Solution: progressive turn-on; recharge virtual Vdd slowly, which limits $I_{john}$ (i.e., $I_{recharge}$) to keep pressure drop (supply noise) under control
• Slowing down recharge increases performance penalty when recharge is late
[Figure: water tank feeding $I_{shower}$ and $I_{john}$; current for recharging virtual Vdd]

Example: Intel Core (not Core 2)
• OS power management (OSPM)
  • algorithm monitors CPU load over some window of time
  • computes target performance point, requests it from the CPU
  • CPU is expected to modify operating voltage/frequency to match OSPM’s request
• OS can choose different power saving states (C0 – Cn)
  • C0: active state (no power saving)
  • Ci: higher i means more power savings, but longer recovery time
[Figure: Relative Power Consumption, with a "Voltage and frequency scaling" region and a "Frequency scaling only" region]
http://download.intel.com/technology/itj/2006/volume10issue02/vol10_art03.pdf

Example: Core Idle States
• C0: Active
• C1 (processor-centric measures)
  • instruction execution halted, clocks are gated
• C2: CPU does not access bus w/o chipset’s consent
  • allows bus to be put in low-power mode
• C3: CPU disables PLLs (clock generators)
• C4: CPU lowers voltage to minimum level while still being able to retain state (e.g., cache contents)
• DC4: “Deep” C4 (next slide)

Example: Core Sleep State
• Upon entering C4, flush L2 cache to main memory
• Don’t do it all at once!
  • If C4 period is short, then you waste more power due to flushing
  • Can have performance impact on wakeup since cache will be cold
• Flush only part of the L2 (1/8 to 1/2) by ways
  • once a complete way has been flushed, power gate it with sleep transistors (discussed under Power Gating)
  • Do this upon each entry into C4 state
• When L2 has shrunk to 0 bytes, enter DC4
  • Greatly reduce voltage since there’s no state to retain
  • No need to wake up cache for snoops
    • Chipset directs snoop traffic directly to memory
  • Typically expand cache to minimum of two ways on exit from DC4

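The incremental shrink can be modeled as a small state update per C4 entry. This is a simplified sketch under assumptions not stated on the slide: the flushed fraction is taken relative to the total number of ways, at least one way is flushed per entry, and the cache is not re-expanded between consecutive C4 entries. The function names are illustrative.

```python
def enter_c4(active_ways, total_ways, fraction=0.25):
    """One C4 entry: flush and power-gate a fraction of the L2 ways.

    Returns (ways still powered, resulting state), where the state becomes
    "DC4" once no ways remain powered.
    """
    ways_to_flush = max(1, int(total_ways * fraction))
    remaining = max(0, active_ways - ways_to_flush)
    return remaining, ("DC4" if remaining == 0 else "C4")

def exit_dc4(min_ways=2):
    """On exit from DC4 the cache is expanded to a minimum of two ways."""
    return min_ways

def entries_until_dc4(total_ways, fraction):
    """Number of consecutive C4 entries needed before DC4 is reached."""
    ways, state, entries = total_ways, "C0", 0
    while state != "DC4":
        ways, state = enter_c4(ways, total_ways, fraction)
        entries += 1
    return entries

for frac in (0.125, 0.25, 0.5):
    print(f"flush {frac:.3f} of an 8-way L2 per entry: "
          f"{entries_until_dc4(8, frac)} entries to reach DC4")
```

A smaller fraction spreads the flush cost over more entries, so a short C4 period wastes less energy, at the price of taking longer to reach the deeper DC4 state.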
Example: Core Duo
• Many shared resources
  • PLL, power supply, L2 cache
  • Can’t (easily) run cores at different clock speeds with a single PLL
  • Can’t run cores at different voltages with a single power supply
  • Can’t turn off L2 cache just because one core is idle
• External interface complications
  • OS sees two separate CPUs
    • one C-state per core
    • OS can request C-states on a per-core basis
  • Platform views the whole processor as a single entity for power-management (for C2 state and higher)
    • Platform sees only a single C-state (the lower of the two)

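The "lower of the two" rule amounts to taking a minimum over the per-core requests. In this sketch C-states are encoded as plain integers (0 for C0 through 4 for C4), which is an assumption made for illustration; DC4 would need its own encoding.

```python
def platform_c_state(core_c_states):
    """Platform-visible C-state: the shallowest state requested by any core."""
    return min(core_c_states)

# Core 0 asks for C4 while core 1 is only halted in C1: shared resources stay at C1.
print(platform_c_state([4, 1]))
# Both cores idle deeply: the platform can follow them down.
print(platform_c_state([4, 3]))
```

A single busy core therefore keeps the shared PLL, power supply, and L2 in the active configuration no matter how deep a state the other core requests.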
Turbo-Mode
• “Intel Dynamic Acceleration Technology”
• If one core is in deep-sleep, it’s not consuming much power
• Idea: use DVFS in reverse to increase voltage/frequency
• Deliver more performance when running a single program and not worried about battery life (plugged in to wall)
[Figure: per-core power against the power limit, and relative performance, for three panels: both cores in C0; core 0 in C0 with core 1 in DC4; core 0 in C0 with core 1 in DC4]

Variable VT Devices
• Provide a way to explicitly bias $V_B$
  • Setting $V_{BBN} < 0$ makes $V_{SB} > 0$ for this NFET
  • Since $V_{BBN} < 0$, also called “reverse biasing”
• Earlier body-bias effect from stacked transistors was due to higher source voltage ($V_B = 0$, $V_S \approx V/2$; higher $V_{SB}$ increases $V_T$)
• Pros:
  • significant standby leakage reduction
  • memory elements retain state
  • no transistor sizing/partitioning required
  • dynamically tunable VT at runtime
• Cons:
  • requires expensive triple-well fabrication process
  • body-biasing effect decreases with technology scaling
Kao et al., Embedded Tutorial: Subthreshold Leakage Modeling and Reduction Techniques, ICCAD 2002

Body-Biased Cache
• Super-high VT for caches (very slow)
  • Very-high VT devices: very low leakage, slow access speed
• Use selective forward-body biasing during access to read/write at a reasonable speed
  • $V_{SB} < 0$: $V_T$ decreases, transistors are faster (but consume more power)
• A few cache lines go into high leakage mode, but only very briefly (during access). The rest of the time, the cache consumes very little leakage power.
[Figure: per-line $V_{BBN}$ values; during "Access" the accessed line is at $V_{fwd-bias}$ and the others at 0; at "Access Completed" all lines are back at 0]

GALS
• Different blocks have different performance needs
  • and this varies in time
• Idea: clock different blocks at different speeds
  • Apply voltage/frequency scaling to blocks/groups-of-blocks
  • e.g., FP units can be slowed down (or maybe even completely turned off) for integer applications
• Block consumes less power when it doesn’t have to operate in max-performance mode
• GALS = Globally Asynchronous, Locally Synchronous

GALS Example
[Figure: Baseline Processor vs. GALS Processor]
http://www.ece.cmu.edu/~dianam/conferences/isca02.pdf

GALS Issues
• How to communicate between clock domains?
• Voltage issues: FIFO between domains must “speak” both voltages
  • Vdd1 = 1.5V, Vdd2 = 0.75V; “0” is 0V in both domains
  • A “1” from the 0.75V domain arrives as 0.75V in the 1.5V domain: is it 0 or 1?
• Timing issues: Asynchronous FIFO Design [Chelcea and Nowick]
  • Producer can clear empty, but it gets cleared on clk2
  • Consumer clears the full signal, but it occurs on clk1