Optimal Power Allocation for Multiprogrammed Workloads on Single-chip Heterogeneous Processors • Euijin Kwon (1,2), Jae Young Jang (2), Jae W. Lee (2), Nam Sung Kim (2,3) [affiliations 1, 2, 3: not preserved]
Single-chip heterogeneous processors • Compared to systems based on discrete components: • Lower communication overhead • Lower power consumption • Lower cost (less silicon) • Friendly to emerging applications (sequential + parallel processing) • Examples: Samsung's Exynos, Intel's Sandy Bridge, AMD's Llano (image sources: AMD, Intel, and Samsung)
Challenges • Performance of a single-chip heterogeneous processor (SCHP) is limited by its power budget • Total chip power budget • CPU/GPU power budget • Multiprogrammed workload • Workload-aware power allocation • Considering program characteristics and evaluation metrics • How can we optimize overall performance within a limited power budget?
Outline • Motivation • Target platform: SCHP + MW • Workload-aware power allocation • Characteristics of programs • Evaluation Metrics • Methodology • Power configuration • Benchmark programs • Evaluation • Algorithm • Conclusion
Target platform: SCHP + MW (multiprogrammed workload) • 4-core CPU + 16-SM GPU • Multiple V/F domains for DVFS • 2 programs running • Hardware resources evenly divided [Figure: multiprogrammed workload (Program 1, Program 2) on CPU Core0 to Core3, GPU0, and GPU1, with V/F domains for the CPU (per-core), GPU0, GPU1, and the memory controllers (MCs)]
Workload-aware power allocation • Characteristics of programs • Non-uniform performance sensitivities • Evaluation metrics • Throughput vs. energy efficiency [Figure: normalized throughput vs. power allocation (using the same HW); annotation: allocating more power to mri-q]
Methodology: shared power budget • The power budget can be changed for each of CPU 1, CPU 2, GPU 1, and GPU 2 • Total chip power budget = 100 W • CPU power budget = 80 W • GPU power budget = 64 W • Baseline configuration • Evenly divided (25 W for each CPU/GPU group) [Figure: a power configuration (CPU 1, CPU 2, GPU 1, GPU 2) as input, with throughput and energy efficiency as outputs; the example wattages shown in the figure did not survive extraction]
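The budget constraints on this slide can be written down as a small feasibility check. The following is only a sketch: the class and field names are hypothetical, and it assumes that the CPU and GPU budgets each cap the combined power of the two groups of that type, which the slide does not state explicitly.

```python
from dataclasses import dataclass

TOTAL_BUDGET_W = 100.0  # total chip power budget (from the slide)
CPU_BUDGET_W = 80.0     # CPU power budget (from the slide)
GPU_BUDGET_W = 64.0     # GPU power budget (from the slide)


@dataclass(frozen=True)
class PowerConfig:
    """Power in watts given to each CPU/GPU group (hypothetical field names)."""
    cpu1: float
    cpu2: float
    gpu1: float
    gpu2: float

    def is_feasible(self) -> bool:
        # Assumption: each per-type budget caps the sum of both groups of that type.
        cpu = self.cpu1 + self.cpu2
        gpu = self.gpu1 + self.gpu2
        non_negative = min(self.cpu1, self.cpu2, self.gpu1, self.gpu2) >= 0.0
        return (non_negative
                and cpu <= CPU_BUDGET_W
                and gpu <= GPU_BUDGET_W
                and cpu + gpu <= TOTAL_BUDGET_W)


# Baseline from the slide: 25 W for each CPU/GPU group.
BASELINE = PowerConfig(cpu1=25.0, cpu2=25.0, gpu1=25.0, gpu2=25.0)
```

Under these assumptions the baseline is feasible, while a configuration such as 45 W per CPU group would exceed the 80 W CPU budget even if the chip total stays at 100 W.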
Methodology: benchmark programs • Used 6 benchmark programs. • Divided into 3 groups depending on characteristics
Evaluation: case study 1 (compute- vs. memory-bound) • Allocating more power to the compute-bound program • Optimal points vary depending on the metric [Figure annotations: 19% throughput improvement; 32% energy efficiency improvement]
Evaluation: case study 2 (memory- vs. memory-bound) • Equally allocated power • Again, the optimal point depends on: • Evaluation metric • Workload characteristics (compute- or memory-bound) [Figure annotations: 10% throughput improvement; 32% energy efficiency improvement]
Evaluation: variation of optimal configuration • Depending on programs’ characteristics and evaluation metrics
Evaluation: performance improvement from optimal power allocation • Achieved significant improvement • 12% for throughput • 18% for energy efficiency
Algorithm for throughput maximization • calculate(slope) for each program, giving sp1 and sp2 • If abs(sp1 - sp2) < threshold: alloc(equally) • Else if sp1 > sp2: alloc(p1_more) • Else: alloc(p2_more) • wait(regular_time) between iterations [Figure: normalized throughput vs. power allocation]
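The decision step of the flowchart above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `step_w` shift, and the reading of `alloc(equally)` as an even split of the current total are all hypothetical, and the slide does not say how the slopes sp1 and sp2 are measured.

```python
from typing import Tuple


def choose_allocation(sp1: float, sp2: float, threshold: float) -> str:
    """Mirror the flowchart: compare the two programs' throughput slopes."""
    if abs(sp1 - sp2) < threshold:
        return "equally"
    return "p1_more" if sp1 > sp2 else "p2_more"


def next_allocation(current: Tuple[float, float], decision: str,
                    step_w: float) -> Tuple[float, float]:
    """Apply a decision to the (program 1, program 2) power split in watts.

    Assumption: "more" means moving step_w watts from one program to the
    other, and "equally" means splitting the current total evenly.
    """
    p1, p2 = current
    if decision == "equally":
        half = (p1 + p2) / 2.0
        return (half, half)
    if decision == "p1_more":
        return (p1 + step_w, p2 - step_w)
    return (p1 - step_w, p2 + step_w)
```

A run-time controller would call these two functions once per interval, with the slopes re-measured after each `wait(regular_time)`.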
Algorithm for energy efficiency maximization • Gradient search from the minimum power allocation • final = min_power • MAX = max( EE(final), EE(final, p1++), EE(final, p2++) ) • If EE(final) == MAX: exit • Else if EE(final, p1++) > EE(final, p2++): final = (final, p1++) • Else: final = (final, p2++) • Repeat from the MAX step
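The gradient search above amounts to greedy hill climbing over the two programs' power levels. The sketch below assumes a caller-supplied `ee` function that returns the energy efficiency of a (program 1, program 2) power split; that function, the `step_w` increment, and the `max_total_w` cap are hypothetical names added for illustration and do not come from the slides.

```python
from typing import Callable, Tuple

Alloc = Tuple[float, float]  # (power for program 1, power for program 2), in watts


def maximize_ee(ee: Callable[[Alloc], float], min_power: Alloc,
                step_w: float, max_total_w: float) -> Alloc:
    """Greedy search starting at the minimum power allocation.

    step_w must be positive so that the total power grows on every move
    and the loop terminates once the budget is reached.
    """
    def score(alloc: Alloc) -> float:
        # Assumption: candidates that exceed the budget are never chosen.
        return ee(alloc) if sum(alloc) <= max_total_w else float("-inf")

    final = min_power
    while True:
        p1_up = (final[0] + step_w, final[1])  # (final, p1++)
        p2_up = (final[0], final[1] + step_w)  # (final, p2++)
        ee_final, ee_p1, ee_p2 = ee(final), score(p1_up), score(p2_up)
        if ee_final == max(ee_final, ee_p1, ee_p2):
            return final  # no neighbor improves EE: exit
        final = p1_up if ee_p1 > ee_p2 else p2_up
```

Like any greedy search, this stops at the first allocation that no single-step increase improves, so it finds a local optimum of whatever `ee` function it is given.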
Conclusion • We propose a solution for optimal power allocation • Workload-aware power allocation • By using program characteristics and evaluation metrics • Significant performance improvement achieved • 12% for throughput • 18% for energy efficiency • Run-time algorithms effectively find (near-)optimal power allocation
Simulator • Integrated CPU + GPU simulator • H. Wang, V. Sathish, R. Singh, M. Schulte and N. Kim, "Workload and Power Budget Partitioning for Single-Chip Heterogeneous Processors," in PACT, 2012. • http://cpu-gpu-sim.ece.wisc.edu/ • gem5 + GPGPU-Sim • Adaptive power allocation for multiprogrammed workload • Per-core V/F domains for CPU • 2 V/F domains for GPU