ELEC 5270/6270 Spring 2015
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal, James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr15/course.html
ELEC5270/6270 Spr 15, Lecture 8

SIA Roadmap for Processors (1999)
• Untrue predictions.
• Source: http://www.semichips.org

Power Reduction in Processors
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

Performance Criteria
• Throughput – computations per unit time.
• Performance is inverse of time – increasing CPU time indicates lower performance.
• Power – computations per watt.
• Energy efficiency – performance/joule.

SPEC CPU2006 Benchmarks
• Standard Performance Evaluation Corporation (SPEC)
• http://www.spec.org
• Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
• Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
• It takes about 12 days to run all benchmarks on the reference system.
• CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  • Peak metric – each program is individually optimized (aggressive compilation).
  • Base metric – common optimization for all programs.

SPEC CINT2006 Results
• http://www.spec.org/cpu2006/results/cint2006.html
• Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Performance metric 36.6 base, 39.4 peak
• Dell Inc., PowerEdge M905
  • CPU: AMD Opteron 8381 HE, 2.50 GHz
  • Number of chips 4, cores 16, threads/core 1
  • Performance metric 15.8 base, 19.1 peak

SPEC CFP2006 Results
• http://www.spec.org/cpu2006/results/cfp2006.html
• Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Performance metric 42.5 base, 45.8 peak
• Dell Inc., PowerEdge M905
  • CPU: AMD Opteron 8381 HE, 2.50 GHz
  • Number of chips 4, cores 16, threads/core 1
  • Performance metric 17.4 base, 21.5 peak

Other Benchmarks
• LINPACK is a numerically intensive floating-point linear system (Ax = b) program used for benchmarking supercomputers.
• SPECpower_ssj2008 measures power and performance of a computer system.
  • The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  • http://www.spec.org/benchmarks.html#power

Second Quarter 2010 SPECpower_ssj2008 Results
• http://www.spec.org/power_ssj2008/results/res2010q2/
• Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  • CPU: AMD Opteron 6174, 2.2 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Total memory 16 GB
  • ssj operations @ 100%: 888,819
  • Average power @ 100%: 271 W
  • Average power @ active idle: 101 W
  • Overall ssj operations per watt: 2,355
• May 19, 2010: Dell Inc., PowerEdge R610
  • CPU: Intel Xeon X5670, 2.93 GHz
  • Number of chips 2, cores 12, threads/core 2
  • Total memory 12 GB
  • ssj operations @ 100%: 914,076
  • Average power @ 100%: 244 W
  • Average power @ active idle: 62.3 W
  • Overall ssj operations per watt: 2,938

Energy SPEC Benchmarks
• Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:
  $\text{Energy efficiency} = \dfrac{1/(\text{Execution time})}{\text{Average power}}$
• D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

Energy Efficiency
• Efficiency averaged on n benchmark programs:
  $\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$
  where $\text{Efficiency}_i$ is the efficiency for program i.
• Relative efficiency:
  $\text{Relative efficiency} = \dfrac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$
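
The efficiency definitions above can be sketched in a few lines of Python. This is a minimal illustration, not SPEC's tooling: the function names and the (execution time, average power) pairs are invented for this example and are not measured results.

```python
import math

def energy_efficiency(execution_time_s, average_power_w):
    """Efficiency of one benchmark program: (1 / execution time) / average power."""
    return (1.0 / execution_time_s) / average_power_w

def mean_efficiency(efficiencies):
    """Geometric mean of the per-program efficiencies."""
    n = len(efficiencies)
    return math.exp(sum(math.log(e) for e in efficiencies) / n)

def relative_efficiency(machine_effs, reference_effs):
    """Mean efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_effs) / mean_efficiency(reference_effs)

# Hypothetical (execution time in s, average power in W) pairs, illustration only.
machine_runs = [(10.0, 50.0), (20.0, 40.0), (5.0, 80.0)]
reference_runs = [(20.0, 100.0), (40.0, 80.0), (10.0, 160.0)]

machine_effs = [energy_efficiency(t, p) for t, p in machine_runs]
reference_effs = [energy_efficiency(t, p) for t, p in reference_runs]
print(relative_efficiency(machine_effs, reference_effs))
```

In this made-up data set every program runs in half the time at half the power of the reference, so the relative efficiency comes out at four.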

SPEC2000 Relative Energy Efficiency
[Figure: relative energy efficiency for three operating modes: always maximum clock, laptop adaptive clock, and minimum power at minimum clock.]

Voltage Scaling
• Dynamic: Reduce voltage and frequency during idle or low activity periods.
• Static: Clustered voltage scaling
  • Logic on non-critical paths is given a lower voltage.
  • 47% power reduction with 10% area increase reported.
  • M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization
• Throughput = Operations / second
[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]

Examples of Processes
• Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
• Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
• Idle: no computation, no expected output.

Effects of Voltage Reduction
• Voltage reduction increases delay, decreases throughput:
  • Slow reduction in throughput at first
  • Rapid reduction in throughput for VDD ≤ Vth
  • Time per operation (TPO) increases
• Voltage reduction continues to reduce power consumption:
  • Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)
[Figure: EPO, power, and TPO on a vertical scale of 0.0 to 1.0, plotted against VDD / Vth from 1 to 5.]

Dynamic Voltage and Clock
[Figure from T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessors, Springer, 2002, pp. 35-36.]

Example: Find Minimum Energy Mode
• Processor data (rated operation):
  • 2 GHz clock
  • 1.5 volt supply voltage
  • 0.5 volt threshold voltage
• Power consumption:
  • 50 watts dynamic power
  • 50 watts static power
• Maximum clock frequency for V volt supply (alpha-power law with α = 1): $f \propto (V - V_{TH})/V$

Alpha-Power Law Model
• Variation of delay with supply voltage:
  $\text{delay} \propto \dfrac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}$
  $V_{TH}$ = threshold voltage
  α = 1 for short-channel devices, ≈ 2 for long-channel devices
• T. Sakurai and A. R. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE Journal of Solid-State Circuits, Vol. 26, pp. 122–131, Feb. 1991.
• T. Sakurai and A. R. Newton, “A simple MOSFET model for circuit analysis,” IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp. 887–894, Apr. 1991.
• T. Sakurai, “High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited),” Proc. IEEE ISCAS, pp. 1487–1490, Chicago, May 1993.
• T. Sakurai, “Alpha-Power Law MOS Model,” IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4–5, Oct. 2004.

Example Cont.
• Dynamic power:
  $P_d = C V^2 f = C\,(1.5)^2 \times 2 \times 10^9 = 50\ \text{W}$
  $C = 11.11\ \text{nF}$, capacitance switching/cycle
  $P_d = 11.11\, V^2 f$ (in watts, with f in GHz)
• Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$ (in nJ)

Example Cont.
• Clock frequency:
  $f = k\,(V - V_{TH})/V = k\,(1.5 - 0.5)/1.5 = 2\ \text{GHz}$
  $k = 3\ \text{GHz}$, a proportionality constant
  $f = 3\,(V - 0.5)/V\ \text{GHz}$

Example Cont.
• Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50\ \text{W}$
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22\, V^2$
• Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3\,(V - 0.5)] = 7.41\, V^3 / (V - 0.5)$ (in nJ)

Example Cont.
• Total energy per cycle:
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$
• To minimize E, set $\partial E / \partial V = 0$, which simplifies to
  $5V^2 - 4.5V + 0.75 = 0$
• Solutions of quadratic equation: V = 0.679 volt, 0.221 volt
• Discard second solution, which is lower than the threshold voltage of 0.5 volt.
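
The worked example can be checked numerically. The sketch below recomputes the constants from the rated data (2 GHz, 1.5 V, 0.5 V threshold, 50 W dynamic, 50 W static) and solves the stationarity condition in its general form, 2C(V − VTH)² + (k'/k)·V·(2V − 3VTH) = 0. The variable and function names are chosen for this illustration only.

```python
import math

V_RATED, V_TH = 1.5, 0.5       # volts
F_RATED = 2.0                  # GHz
P_DYN, P_STATIC = 50.0, 50.0   # watts at rated operation

C_EFF = P_DYN / (V_RATED**2 * F_RATED)       # nF, about 11.11
K_F = F_RATED * V_RATED / (V_RATED - V_TH)   # GHz, equals 3
K_LEAK = P_STATIC / V_RATED**2               # mho, about 22.22

def frequency_ghz(v):
    """Maximum clock frequency at supply voltage v (alpha-power law, alpha = 1)."""
    return K_F * (v - V_TH) / v

def energy_per_cycle_nj(v):
    """Dynamic plus static energy per cycle in nJ."""
    dynamic = C_EFF * v**2
    static = K_LEAK * v**2 / frequency_ghz(v)
    return dynamic + static

def minimum_energy_voltage():
    """Root of dE/dV = 0 that lies above the threshold voltage."""
    r = K_LEAK / K_F
    a = 2 * C_EFF + 2 * r
    b = -(4 * C_EFF * V_TH + 3 * r * V_TH)
    c = 2 * C_EFF * V_TH**2
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    return max(root for root in roots if root > V_TH)

v_min = minimum_energy_voltage()
print(f"V = {v_min:.3f} V, f = {frequency_ghz(v_min):.3f} GHz, "
      f"E = {energy_per_cycle_nj(v_min):.2f} nJ per cycle")
print(f"Rated: E = {energy_per_cycle_nj(V_RATED):.2f} nJ per cycle")
```

Dividing the general coefficients by k'/k gives the quadratic on the slide, so the root above threshold is the same 0.679 V. The minimum-energy mode trades clock rate for energy: the clock drops well below 2 GHz while energy per cycle falls from the rated 50 nJ.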

Example: Result

Cycle Efficiency
• Cycle efficiency is a rating similar to the maximum clock frequency rating.
• Analogy:
  • Cycle efficiency is similar to miles per gallon (mpg)
  • Maximum clock frequency is similar to miles per hour (mph)
• Reference: A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.

Performance in Time
• Performance is measured with respect to a program.
• Performance = 1/(Execution time)
• D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Performance in Energy (Efficiency)
• Efficiency is measured with respect to a program.
• Efficiency = [1/(Execution time)] / (Average power)
• D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Two Performances
• Time performance
• Energy performance
• D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.

Time Performance
• Speed of a processor is measured in cycles per second or clock frequency (f).
• Clock period (1/f) is the time per cycle.
• Execution time of a program using C clock cycles = C/f
• Time performance = 1/(execution time) = f/C

Energy Performance
• Energy efficiency of a processor may be measured in cycles per joule or cycle efficiency (η).
• 1/η is energy per cycle (EPC).
• Energy dissipated by a program using C clock cycles = C × EPC = C/η
• Energy performance = η/C

Characterizing Device Technology Speed and Efficiency
• Consider 90nm CMOS technology.
• Use predictive technology model (PTM).
• Example circuit: Eight-bit ripple carry adder.
• Nominal voltage = 1.2 volts.
• Simulation for varying operating conditions (VDD = 100mV through 1.2V) using Spice:
  • With random vectors for energy per cycle (EPC = 1/η).
  • With critical path vectors for clock period (1/f).
• Reference: W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Early Design Exploration,” IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816–2823, 2006.

Energy per Cycle of 8-Bit Adder
• K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Cycle Time of 8-Bit Adder
• K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.

Pentium M Processor
• Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
• VDD = 1.2V
• Maximum clock rate = 1.8GHz
• Critical path delay, td = 1/(1.8GHz) = 555.56ps
• Power consumption = 120W
• EPC = 120W/(1.8GHz) = 66.67nJ

Cycle Efficiency and Frequency

Example
• For a program that executes in 1.8 billion clock cycles.
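
As a rough illustration of the two ratings, the sketch below evaluates a 1.8 billion cycle program at the single operating point quoted in the Pentium M data (1.8 GHz, 120 W). The function names are made up for this example, and only the rated point is covered, since this transcript carries no data for other voltages.

```python
F_HZ = 1.8e9      # maximum clock rate from the published data
POWER_W = 120.0   # power consumption at that clock rate

def cycle_efficiency(power_w, f_hz):
    """Cycle efficiency eta in cycles per joule: eta = f / P."""
    return f_hz / power_w

def execution_time_s(cycles, f_hz):
    """Execution time of a program using the given number of clock cycles: C / f."""
    return cycles / f_hz

def program_energy_j(cycles, eta):
    """Energy dissipated by the program: C / eta = C x EPC."""
    return cycles / eta

cycles = 1.8e9
eta = cycle_efficiency(POWER_W, F_HZ)
print(f"eta = {eta:.3g} cycles/J, EPC = {1e9 / eta:.2f} nJ")
print(f"time = {execution_time_s(cycles, F_HZ):.2f} s, "
      f"energy = {program_energy_j(cycles, eta):.1f} J")
```

At this operating point the program takes one second and dissipates 120 J, which matches EPC = 66.67 nJ multiplied by 1.8 billion cycles.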

Cycle Efficiency
• New energy performance rating: cycle efficiency η; unit is cycles per joule.
• Clock frequency f in cycles per second is a similar rating for time performance.
• Similarity to other popular ratings:
  • η → mpg
  • f → mph
• The two ratings allow effective time and energy management of an electronic system.

Problem of Process Variation in Nanometer Technologies
[Figure: number of chips versus threshold voltage (lower Vth, nominal Vth, higher Vth). Labels: clock specification, power specification, nominal voltage, higher voltage operation, lower voltage operation, yield loss due to high leakage, yield loss due to slow speed.]
• From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

Clock Distribution
[Figure: H-tree distributing the clock to the flip-flops.]
• Fanout, λ = 4
• Tree depth, $s = \log_{\lambda} N$
• No. of flip-flops = N

Clock Network Power
$P_{clk} = C_L V_{DD}^2 f + C_L V_{DD}^2 f/\lambda + C_L V_{DD}^2 f/\lambda^2 + \dots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \dfrac{1}{\lambda^n}$
where
  $C_L$ = total load capacitance of N flip-flops (a flip-flop is assumed similar to a clock buffer)
  λ = constant fanout at each stage in distribution network
• Clock consumes about 40% of total processor power, because
  • Clock is always active
  • Makes two transitions per cycle (α = 2)
• Clock gating is useful; inhibit clock to unused blocks

Upper Bound on Clock Power
$P_{clk} = C_L V_{DD}^2 f + C_L V_{DD}^2 f/\lambda + C_L V_{DD}^2 f/\lambda^2 + \dots$
$\quad \le C_L V_{DD}^2 f \sum_{n=0}^{\infty} \dfrac{1}{\lambda^n}$
$\quad = C_L V_{DD}^2 f \cdot \dfrac{1}{1 - 1/\lambda}$
$\quad = C_L V_{DD}^2 f \cdot \dfrac{\lambda}{\lambda - 1}$
$\quad = 1.333\, C_L V_{DD}^2 f$, because λ = 4
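
A short numerical check of the series and its bound is sketched below. The helper names and the flip-flop count of 4096 are chosen only for illustration; results are expressed as multiples of CL·VDD²·f.

```python
def clock_tree_stages(num_flip_flops, fanout):
    """Tree depth s = log_fanout(N), rounded up to a whole number of stages."""
    stages, capacity = 0, 1
    while capacity < num_flip_flops:
        capacity *= fanout
        stages += 1
    return stages

def clock_power_factor(stages, fanout):
    """Finite sum of 1/fanout^n for n = 0 .. stages-1."""
    return sum(1.0 / fanout**n for n in range(stages))

def clock_power_bound(fanout):
    """Limit of the geometric series: fanout / (fanout - 1)."""
    return fanout / (fanout - 1)

fanout = 4
stages = clock_tree_stages(4096, fanout)   # hypothetical N = 4096 flip-flops
print(stages, clock_power_factor(stages, fanout), clock_power_bound(fanout))
```

With a fanout of 4 the series converges quickly, so even a shallow tree sits close to the 1.333 bound; the buffers add roughly one third on top of the flip-flop load.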

Properties of H-Tree
• Balanced clock skew.
• Small delay and power consumption.
• Requires fine-tuning for complex layout.

Clock Power and Delay
• Unit size buffer or inverter delay = d
• Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
• Total power consumption of clock network:

Clock Network Examples
• D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Architecture Level: Pipeline Gating
• A pipeline processor uses speculative execution.
• Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  • If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
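
The gating rule stated above can be sketched as a small controller. This is a toy model of the slide's rule only, not the mechanism of the cited paper: the class name `FetchGate`, the way M is accumulated, and the reset of M when gating starts are all assumptions made for this illustration.

```python
class FetchGate:
    """Suspend instruction fetch for k cycles once the count M exceeds N."""

    def __init__(self, threshold_n, stall_cycles_k):
        if stall_cycles_k < 1:
            raise ValueError("k must be at least 1")
        self.n = threshold_n
        self.k = stall_cycles_k
        self.m = 0            # running count M of incorrect predictions
        self.stall_left = 0   # gated cycles still to be served

    def record_branch(self, mispredicted):
        """Update M when a branch outcome becomes known."""
        if mispredicted:
            self.m += 1

    def fetch_allowed(self):
        """Call once per cycle; returns False while fetch is gated."""
        if self.stall_left > 0:
            self.stall_left -= 1
            return False
        if self.m > self.n:
            self.m = 0                    # assumption: M resets when gating starts
            self.stall_left = self.k - 1  # the current cycle is the first of k
            return False
        return True

gate = FetchGate(threshold_n=2, stall_cycles_k=3)
for _ in range(3):
    gate.record_branch(mispredicted=True)
print([gate.fetch_allowed() for _ in range(5)])
```

After three recorded mispredictions with N = 2, fetch is held off for k = 3 cycles and then resumes. In a real design, N and k would be tuned so that the energy saved on wrong-path fetches outweighs the cycles lost.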

Slack Scheduling
• Application: Superscalar, out-of-order execution:
  • An instruction is executed as soon as the required data and resources become available.
  • A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:
  add r0, r1, r2; (A)
  sub r3, r4, r5; (B)
  and r9, r1, r9; (C)
  or r5, r9, r10; (D)
  xor r2, r5, r11; (E)
• J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
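
One way to see which of the five instructions can be delayed is to compute the scheduling slack from their data dependences. The sketch below rests on several assumptions that the slide does not state: the first operand is taken as the destination register, only read-after-write dependences are considered (register renaming is assumed to remove the others), every instruction takes one cycle, and execution units are unlimited. All names are illustrative.

```python
PROGRAM = {
    "A": "add r0, r1, r2",
    "B": "sub r3, r4, r5",
    "C": "and r9, r1, r9",
    "D": "or r5, r9, r10",
    "E": "xor r2, r5, r11",
}

def parse(text):
    """Split an instruction into (destination, sources); destination-first is assumed."""
    _, operands = text.split(None, 1)
    regs = [reg.strip() for reg in operands.split(",")]
    return regs[0], regs[1:]

def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    deps = {name: set() for name in program}
    last_writer = {}
    for name, text in program.items():   # dicts keep program order
        dest, sources = parse(text)
        for reg in sources:
            if reg in last_writer:
                deps[name].add(last_writer[reg])
        last_writer[dest] = name
    return deps

def slack(program):
    """Slack = latest start cycle minus earliest start cycle, unit latency."""
    deps = raw_dependencies(program)
    earliest = {}
    for name in program:
        earliest[name] = 1 + max((earliest[d] for d in deps[name]), default=0)
    last_cycle = max(earliest.values())
    latest = {name: last_cycle for name in program}
    for name in reversed(list(program)):
        for d in deps[name]:
            latest[d] = min(latest[d], latest[name] - 1)
    return {name: latest[name] - earliest[name] for name in program}

print(raw_dependencies(PROGRAM))
print(slack(PROGRAM))
```

Under these assumptions C, D and E form a dependence chain with no slack, while A and B have no dependents and could be sent to slower, lower-power execution units. A different operand convention would change the dependence graph, so the result should be read as an illustration of the method rather than of the original paper's example.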

Slack Scheduling Example

Slack Scheduling
[Figure: block diagram with re-order buffer, scheduling logic, low-power execution units, and slack bit.]