Power Management for Chip-level Multiprocessing Processors

Kai Ma

Background • Two ways to get better performance: 1. Scale frequency (faster) 2. Replicate on-chip resources (parallel) • Chip multiprocessing (CMP) vs. simultaneous multithreading (SMT)

SMT vs CMP

Other justifications for CMP • Memory wall, ILP wall, power wall • On-chip cache coherency circuitry can run at a higher clock rate • Better signal integrity (signals travel shorter distances) • Future: many cores (many specialized cores)

Power management for CMP • Reduce operating costs for energy and cooling • Prolong battery life for portable and embedded systems • Reduce cooling requirements • Meet scalable performance targets • Control heat dissipation and hotspots

Outline
1. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget
Canturk Isci*, Alper Buyuktosunoglu*, Chen-Yong Cher*, Pradip Bose* and Margaret Martonosi
*IBM T.J. Watson Research Center, Yorktown Heights; Department of Electrical Engineering, Princeton University
2. Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors
Radu Teodorescu and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign

Outline • An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget 1. Contribution 2. Global Power Management 3. Global Power Management Policies: core modes, power and performance matrix 4. Experimental Result and Evaluation 5. Conclusion 6. Critique

Contribution • Introduce global power management for CMPs • Develop a static analysis tool for power management • Evaluate different policies for CMP power management

Global Power Management • Monitor the power of each core and set each core's operating mode

Global Power Management Policies • Priority: slow down the cores running low-priority tasks • PullHiPushLo: speed up the low-power cores and slow down the high-power cores • MaxBIPS: predict the power and BIPS of each power-mode combination and choose the combination with the highest BIPS that fits the power budget
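The MaxBIPS idea can be sketched as an exhaustive search over per-core mode combinations. This is a minimal illustration, not the authors' implementation: the function name `max_bips`, the matrix layout, and all numbers are hypothetical, and it assumes predicted power and BIPS are already available per core and per mode.

```python
from itertools import product

def max_bips(power, bips, budget):
    """Pick the per-core mode combination with the highest total BIPS.

    power[c][m] and bips[c][m] are the predicted power (W) and BIPS of
    core c in mode m. Returns (modes, total_bips), or (None, 0.0) when
    no combination fits within the budget.
    """
    n_modes = len(power[0])
    best_modes, best_bips = None, 0.0
    # Enumerate every assignment of one mode to each core.
    for modes in product(range(n_modes), repeat=len(power)):
        total_power = sum(power[c][m] for c, m in enumerate(modes))
        total_bips = sum(bips[c][m] for c, m in enumerate(modes))
        if total_power <= budget and total_bips > best_bips:
            best_modes, best_bips = modes, total_bips
    return best_modes, best_bips

# Hypothetical 2-core, 3-mode matrices (mode 0 is the fastest mode).
POWER = [[20.0, 17.0, 12.0], [10.0, 8.5, 6.0]]
BIPS = [[2.0, 1.9, 1.7], [1.0, 0.95, 0.85]]

modes, total = max_bips(POWER, BIPS, budget=27.0)
print(modes, total)
```

With these made-up numbers, running both cores in the fastest mode exceeds the 27 W budget, so the search slows down one core and keeps the other at full speed.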

Core Power Modes • Underlying mechanism: DVFS • Mode-switch overhead: on the order of microseconds • Performance degradation: measured as elapsed execution time for each benchmark
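Why DVFS modes trade a little speed for a lot of power can be seen from the standard dynamic-power relation P ≈ C·V²·f. The sketch below assumes voltage and frequency are scaled by the same factor and ignores static power. The scaling factors are hypothetical and are not taken from the slides.

```python
def dvfs_scaling(scale):
    """Relative frequency and dynamic power when V and f are both scaled by `scale`."""
    rel_freq = scale
    rel_power = scale ** 2 * scale  # P_dyn ~ C * V^2 * f, so power scales cubically
    return rel_freq, rel_power

# Hypothetical scaling factors for three modes.
for s in (1.00, 0.95, 0.85):
    freq, power = dvfs_scaling(s)
    print(f"scale={s:.2f}  relative freq={freq:.2f}  relative dynamic power={power:.3f}")
```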

Power and BIPS Matrices

Experimental Methodology • SPEC CPU2000 benchmarks • A trace-based CMP analysis tool is combined with IBM’s Turandot simulator • Mode switch (500ns) and statistics collection (50ns) • During a mode switch, no instructions execute but power is still consumed

Static vs Dynamic

Policy and Budget Curve

Power Saving

Power Management Result

Trends under CMP Scaling • The gap between MaxBIPS and the oracle narrows as the number of cores increases • Increasing the number of cores has a smaller impact on MaxBIPS • CMP scaling favors static per-core management over chip-wide DVFS

Conclusion • Global management is preferred • Dynamic management is preferred • MaxBIPS is efficient

Critique • MaxBIPS: prediction cost grows superlinearly with the number of modes and cores • Power and performance estimation matrices: transition penalty • Temperature is not considered
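The first critique point can be made concrete with a count. Assuming the prediction step evaluates every per-core mode combination (an assumption about the search, with a hypothetical three modes per core), the number of candidates is modes raised to the number of cores:

```python
def mode_combinations(n_modes, n_cores):
    """Number of per-core mode assignments an exhaustive search must evaluate."""
    return n_modes ** n_cores

# Hypothetical: 3 modes per core, growing core counts.
for cores in (2, 4, 8, 16):
    print(f"{cores:2d} cores -> {mode_combinations(3, cores)} combinations")
```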

Outline Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors 1. Background 2. Contribution 3. Algorithm 4. System Implementation 5. Evaluation 6. Conclusion 7. Critique

Background • For CMPs, within-die process variation affects both static power consumption and maximum frequency

Contribution • Propose variation-aware algorithms for application scheduling • Complement these algorithms with variation-aware DVFS

CMP Configuration • High level frequency and DVFS policy

Algorithms

Linear Programming • A technique for optimizing a linear objective function subject to linear equality and inequality constraints • Canonical form: maximize c^T x subject to Ax ≤ b and x ≥ 0 • c and b are known vectors, A is a known matrix, and x is the vector of variables

Power Mode Selection: LinOpt • TP: average throughput • N: number of cores • i: core index, from 1 to N • a(i): constant that depends on the thread and core • v(i): core voltage • b(i) and c(i): constants introduced to approximate the power-voltage relation • Objective function: • Constraints:
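A minimal sketch of this kind of voltage selection, under stated assumptions: throughput is taken as the sum of a(i)·v(i), per-core power as b(i)·v(i) + c(i), the only coupling constraint is a total power target, and each voltage lies in [v_min, v_max]. These forms are assumptions made for illustration, as are the function name and the numbers. An LP with one budget constraint and box bounds can be solved exactly by a greedy continuous-knapsack pass, which is used here in place of a general LP solver.

```python
def linopt_greedy(a, b, c, v_min, v_max, p_target):
    """Maximize sum(a[i]*v[i]) subject to
    sum(b[i]*v[i] + c[i]) <= p_target and v_min <= v[i] <= v_max.

    Assumes every b[i] > 0 and a[i] >= 0 (hypothetical model constants).
    """
    n = len(a)
    v = [v_min] * n
    # Power left over after running every core at the minimum voltage.
    budget = p_target - sum(b[i] * v_min + c[i] for i in range(n))
    if budget < 0:
        raise ValueError("power target is infeasible even at minimum voltage")
    # Raise voltage first where throughput gained per watt is highest.
    for i in sorted(range(n), key=lambda i: a[i] / b[i], reverse=True):
        step = min(v_max - v_min, budget / b[i])
        v[i] += step
        budget -= step * b[i]
    return v

# Hypothetical constants for three cores.
voltages = linopt_greedy(a=[3.0, 2.0, 1.0], b=[10.0, 10.0, 10.0],
                         c=[-5.0, -5.0, -5.0], v_min=0.6, v_max=1.0,
                         p_target=9.0)
print(voltages)
```

In this example the core with the largest a(i) is raised to the maximum voltage, the next core receives the remaining budget, and the last core stays at the minimum.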

Power Mode Selection: SAnn • Use simulated annealing to solve the power mode selection problem • SAnn searches the space of possible core-voltage combinations • Compared to LinOpt: more accurate but more costly
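A simulated-annealing search over discrete voltage levels might look like the sketch below. It is an illustration only: the linear throughput and power models, the penalty weight, the cooling schedule, and all names and numbers are hypothetical rather than the authors' settings.

```python
import math
import random

def sann(a, b, c, levels, p_target, steps=2000, t0=1.0, seed=0):
    """Search per-core voltage levels for high throughput under a power target.

    Assumed models: throughput = sum(a[i]*v[i]), power = sum(b[i]*v[i] + c[i]).
    Returns the best feasible list of voltages found.
    """
    rng = random.Random(seed)
    n = len(a)

    def throughput(idx):
        return sum(a[i] * levels[k] for i, k in enumerate(idx))

    def power(idx):
        return sum(b[i] * levels[k] + c[i] for i, k in enumerate(idx))

    def score(idx):
        # Penalize power above the target so infeasible states are unattractive.
        return throughput(idx) - 10.0 * max(0.0, power(idx) - p_target)

    current = [0] * n  # start with every core at the lowest level
    if power(current) > p_target:
        raise ValueError("power target is infeasible even at the lowest level")
    best = list(current)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling schedule
        cand = list(current)
        i = rng.randrange(n)
        # Move one randomly chosen core up or down by one level.
        cand[i] = min(len(levels) - 1, max(0, cand[i] + rng.choice((-1, 1))))
        delta = score(cand) - score(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = cand
            if power(current) <= p_target and throughput(current) > throughput(best):
                best = list(current)
    return [levels[k] for k in best]

LEVELS = [0.6, 0.7, 0.8, 0.9, 1.0]  # hypothetical voltage levels
print(sann([3.0, 2.0, 1.0], [10.0] * 3, [-5.0] * 3, LEVELS, p_target=9.0))
```

Unlike a linear program, this search needs no linear approximation of the models, which is one way to read the "more accurate but more costly" comparison.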

System Implementation • The algorithm runs on a core or on a power management unit • At each OS scheduling interval, the OS assigns threads to cores using VarF&AppIPC • Every 10ms, the LinOpt algorithm runs and sets each core to its selected power mode
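The thread-mapping step can be sketched as follows, assuming (as the name suggests) that VarF&AppIPC ranks cores by frequency and threads by IPC, and pairs the highest-IPC threads with the highest-frequency cores. The function name and the numbers are hypothetical.

```python
def map_threads(core_freqs_ghz, thread_ipcs):
    """Pair the highest-IPC threads with the highest-frequency cores.

    Returns a dict {thread_index: core_index}. Cores left over stay idle.
    """
    if len(thread_ipcs) > len(core_freqs_ghz):
        raise ValueError("more threads than cores")
    cores = sorted(range(len(core_freqs_ghz)),
                   key=lambda c: core_freqs_ghz[c], reverse=True)
    threads = sorted(range(len(thread_ipcs)),
                     key=lambda t: thread_ipcs[t], reverse=True)
    return dict(zip(threads, cores))

# Hypothetical 4-core die with per-core maximum frequencies and two threads.
print(map_threads([3.0, 3.6, 3.2, 2.8], [0.5, 1.4]))
```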

Profiling for Implementation

Evaluation Methodology • Variation: VARIUS model • Power: SESC + Wattch + HotLeakage • Temperature: HotSpot • Critical path model: 1. Calculation path delay: multiplier-like unit 2. Memory: SRAM 3. Interconnect: CACTI 4. Gate delay: alpha-power law

Workload • SPEC • Run different applications on different cores • 12 billion instructions

Metrics • Total power • Average frequency of active cores • Throughput • Energy-delay-squared product (ED²; accounts for both time-to-solution and energy consumption) • Weighted throughput: each application’s IPC normalized to that application’s IPC at reference conditions
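Two of these metrics reduce to one-line formulas. The sketch below computes ED² as energy times delay squared, and weighted throughput as the sum of per-application normalized IPCs. Summing over applications is an assumption here; the slide only gives the per-application normalization. The function names are hypothetical.

```python
def ed2(energy_joules, delay_seconds):
    """Energy-delay-squared product: lower is better."""
    return energy_joules * delay_seconds ** 2

def weighted_throughput(ipcs, reference_ipcs):
    """Sum of each application's IPC normalized to its reference IPC."""
    return sum(ipc / ref for ipc, ref in zip(ipcs, reference_ipcs))

# Hypothetical measurements for a two-application workload.
print(ed2(energy_joules=2.0, delay_seconds=3.0))
print(weighted_throughput(ipcs=[1.0, 0.5], reference_ipcs=[2.0, 1.0]))
```

Squaring the delay weights time-to-solution more heavily than energy, so a configuration that saves power by running much slower scores worse.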

Evaluation • Power and frequency variation on one die

Uniform Frequency & No DVFS • As the number of threads increases, fewer unused cores remain to choose from for thread mapping

Non-Uniform Frequency & No DVFS • Different cores run at different frequencies; policies that select less-used cores may end up with lower-frequency ones

NoUniFreq+DVFS • Throughput: VarF&AppIPC+LinOpt is effective • Power: throughput gains are high when power targets are low

LinOpt Granularity • The deviation between consumed power and the power target decreases as the interval between LinOpt runs increases

Conclusion • Within-die variation substantially impacts static power consumption and maximum frequency • Variation-aware algorithms are proposed and analyzed; LinOpt is efficient

Critique • How can thread mapping and power mode selection be decoupled? • Static and dynamic power consumption should be discussed separately • Thread mapping takes place only once; thread migration should be considered

Comparison
