Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors

Euijin Kwon, Jae Young Jang, Jae W. Lee, Nam Sung Kim

Single-chip heterogeneous processors
• Compared to systems based on discrete components:
  • Lower communication overhead
  • Lower power consumption
  • Lower cost (less silicon)
  • Emerging application friendly (sequential + parallel processing)
• Examples: Samsung’s Exynos, Intel’s Sandy Bridge, AMD’s Llano (sources: AMD, Intel, and Samsung)

Challenges
• SCHP’s performance: limited by power budget
  • Total chip power budget
  • CPU/GPU power budget
• Multiprogrammed workload
• Workload-aware power allocation
  • Considering characteristics and metrics
How can we optimize overall performance within a limited power budget?

Outline
• Motivation
• Target platform: SCHP + MW
• Workload-aware power allocation
  • Characteristics of programs
  • Evaluation metrics
• Methodology
  • Power configuration
  • Benchmark programs
• Evaluation
• Algorithm
• Conclusion

Target platform: SCHP + MW
• 4-core CPU + 16-SM GPU
• Multiple V/F domains → DVFS
• 2 programs running
• Hardware resources evenly divided
[Figure: multiprogrammed workload on the target chip. Labels: Program 1 with GPU0, CPU Core0, CPU Core1; Program 2 with GPU1, CPU Core2, CPU Core3; memory controllers (MCs). V/F domains: CPU (per-core), GPU0, GPU1, MCs.]

Workload-aware power allocation
• Characteristics of programs
  • Non-uniform performance sensitivities
• Evaluation metrics
  • Throughput vs. energy efficiency
[Figure: normalized throughput vs. power allocation (using the same HW); annotation: allocating more power to mri-q]

Methodology: shared power budget
• Can change the power budget for CPU 1, CPU 2, GPU 1, and GPU 2
• Total chip power budget = 100 W
• CPU power budget = 80 W
• GPU power budget = 64 W
• Baseline configuration
  • Evenly divided (25 W for each CPU/GPU group)
[Figure: a power configuration (budgets for CPU 1, CPU 2, GPU 1, GPU 2) as input, with throughput and energy efficiency as output. Values shown in the figure: 22.4, 34.2, 46.4, 24.8, 16.8, 31.2, 41.6, 62.8, 11.2, 17.4; their mapping to groups is not recoverable from the transcript.]
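The budget figures on this slide can be read as a set of constraints on any candidate power configuration. The sketch below checks those constraints. It assumes that the CPU and GPU budgets cap the combined power of the two CPU groups and the two GPU groups respectively; the slide does not state this explicitly. `PowerConfig` and `is_feasible` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

TOTAL_BUDGET_W = 100.0  # total chip power budget (from the slide)
CPU_BUDGET_W = 80.0     # CPU power budget (from the slide)
GPU_BUDGET_W = 64.0     # GPU power budget (from the slide)


@dataclass(frozen=True)
class PowerConfig:
    """Hypothetical container: watts assigned to each CPU/GPU group."""
    cpu1: float
    cpu2: float
    gpu1: float
    gpu2: float

    def is_feasible(self) -> bool:
        # Assumption: the CPU (GPU) budget limits the sum of both CPU
        # (GPU) groups, and the chip budget limits the sum of all four.
        cpu = self.cpu1 + self.cpu2
        gpu = self.gpu1 + self.gpu2
        non_negative = min(self.cpu1, self.cpu2, self.gpu1, self.gpu2) >= 0
        return (non_negative
                and cpu <= CPU_BUDGET_W
                and gpu <= GPU_BUDGET_W
                and cpu + gpu <= TOTAL_BUDGET_W)


# Baseline from the slide: 25 W for each CPU/GPU group.
BASELINE = PowerConfig(25.0, 25.0, 25.0, 25.0)
```

Under this reading the CPU and GPU caps add up to more than 100 W, so the total chip budget is the constraint that forces a trade-off between the groups.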

Methodology: benchmark programs
• Used 6 benchmark programs
• Divided into 3 groups depending on characteristics

Evaluation: case study 1 (compute- vs. memory-bound)
• Allocating more power to the compute-bound program
• Optimal points vary depending on metrics
[Figure annotations: 19% throughput improvement; 32% energy efficiency improvement]

Evaluation: case study 2 (memory- vs. memory-bound)
• Equally allocated power
• Again, the optimal point depends on
  • Evaluation metric
  • Workload characteristics (compute- or memory-bound)
[Figure annotations: 10% throughput improvement; 32% energy efficiency improvement]

Evaluation: variation of optimal configuration
• Depending on programs’ characteristics and evaluation metrics

Evaluation: performance improvement from optimal power allocation
• Achieved significant improvement
  • 12% for throughput
  • 18% for energy efficiency

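Algorithm for throughput maximization
• calculate(slope): obtain sp1 and sp2, the slopes of the normalized-throughput curves of programs 1 and 2 with respect to power allocation
• If abs(sp1 - sp2) < threshold: alloc(equally)
• Else if sp1 > sp2: alloc(p1_more)
• Else: alloc(p2_more)
• wait(regular_time), then repeat
[Figure: normalized throughput vs. power allocation]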
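One iteration of the throughput flowchart can be sketched as follows. The three outcomes (`equally`, `p1_more`, `p2_more`) and the threshold test come from the slide. The finite-difference slope estimate, the fixed step `step_w`, and the function names are assumptions made for illustration; the slide does not say how the slope is measured or how much power is moved per decision.

```python
def slope(prev, curr):
    """Finite-difference slope between two (power_w, throughput) samples.

    Assumption: the slide's calculate(slope) is approximated from the two
    most recent measurements of one program.
    """
    (p0, t0), (p1, t1) = prev, curr
    if p1 == p0:
        return 0.0  # no power change, so no slope information
    return (t1 - t0) / (p1 - p0)


def choose_allocation(sp1, sp2, threshold):
    """Decision step of the flowchart: compare the two slopes."""
    if abs(sp1 - sp2) < threshold:
        return "equally"
    return "p1_more" if sp1 > sp2 else "p2_more"


def apply_allocation(decision, p1, p2, step_w=1.0):
    """Return the new (p1, p2) split; the sum p1 + p2 stays unchanged.

    Assumption: "more" means shifting a fixed step of step_w watts from
    the other program, never driving that program below 0 W.
    """
    total = p1 + p2
    if decision == "equally":
        return total / 2, total / 2
    if decision == "p1_more":
        shift = min(step_w, p2)
        return p1 + shift, p2 - shift
    shift = min(step_w, p1)
    return p1 - shift, p2 + shift
```

In a run-time controller these functions would be called once per `regular_time` interval, with fresh throughput samples feeding `slope` before each decision.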

Algorithm for energy efficiency maximization
• Gradient search from the minimum power allocation
• final = min_power
• MAX = max( EE(final), EE(final, p1++), EE(final, p2++) )
• If EE(final) == MAX: exit
• Else if EE(final, p1++) > EE(final, p2++): final = (final, p1++)
• Else: final = (final, p2++)
• Repeat from the MAX computation
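The gradient search on this slide can be sketched as a greedy loop that starts at the minimum allocation and raises one program's power at a time. The loop structure follows the slide; the step size `step_w`, the per-program upper bounds, the total budget check, and the callable `ee` (which stands in for a measurement of energy efficiency at a given allocation) are assumptions made for illustration.

```python
import math


def maximize_energy_efficiency(ee, min_power, max_power,
                               total_budget_w, step_w=1.0):
    """Greedy search for an allocation (p1, p2) that locally maximizes ee.

    ee:             callable (p1, p2) -> energy efficiency (hypothetical
                    stand-in for a run-time measurement)
    min_power:      starting allocation (p1, p2), the slide's min_power
    max_power:      assumed per-program upper bounds (p1_max, p2_max)
    total_budget_w: assumed cap on p1 + p2
    """
    p1, p2 = min_power
    while True:
        ee_now = ee(p1, p2)
        # A neighbor that would violate a bound is treated as never better.
        can_grow = p1 + p2 + step_w <= total_budget_w
        ee_p1 = (ee(p1 + step_w, p2)
                 if can_grow and p1 + step_w <= max_power[0] else -math.inf)
        ee_p2 = (ee(p1, p2 + step_w)
                 if can_grow and p2 + step_w <= max_power[1] else -math.inf)

        # EE(final) == MAX: neither neighbor improves on the current point.
        if ee_now >= max(ee_p1, ee_p2):
            return p1, p2
        if ee_p1 > ee_p2:
            p1 += step_w   # final = (final, p1++)
        else:
            p2 += step_w   # final = (final, p2++)
```

Because every iteration either exits or raises total power by `step_w`, the loop terminates once a bound is reached. Like any greedy search, it finds a local optimum, which matches the slides' claim of a (near-)optimal result rather than a guaranteed global one.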

Conclusion
• We propose a solution for optimal power allocation
  • Workload-aware power allocation
  • By using program characteristics and evaluation metrics
• Significant performance improvement achieved
  • 12% for throughput
  • 18% for energy efficiency
• Run-time algorithms effectively find (near-)optimal power allocation

Backup slides

Simulator
• Integrated CPU + GPU simulator
  • H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT, 2012.
  • http://cpu-gpu-sim.ece.wisc.edu/
  • gem5 + GPGPU-Sim
• Adaptive power allocation for multiprogrammed workload
  • Per-core V/F domains for CPU
  • 2 V/F domains for GPU
