Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster Vincent W. Freeh, David K. Lowenthal, Feng Pan, and Nandani Kappiah Presented by: Huaxia Xia CSAG, CSE of UCSD
Introduction • Power-aware Computing • HPC Uses Large-scale Systems and Has High Power Consumption • Two extremes: • Performance-at-all-costs • Lower performance but more energy-efficient • This paper aims to save energy with little performance penalty
Related Work • Server/Desktop Systems • Minimize the number of servers needed to handle the load, and put the rest into a low-energy state (standby or power-off) • Set node voltage independently • Disk: • Modulate disk speed dynamically • Improve cache policies • Aggregate disk accesses into bursts of requests • Mobile Systems • Energy-aware OS • Voltage-scalable CPU • Disk spin-down • Memory • Network
Assumptions • HPC Applications • Performance is the Primary Concern • Highly Regular and Predictable • CPU has Multiple “Gears” • Variable Frequency • Variable Voltage • CPU is a Major Power Consumer • Energy consumption of disks/memory/network is not considered
Methodology: Profile-Directed • Get a Program Trace • Divide the Program into Blocks • Merge the Blocks into Phases • Heuristically Search for the Best Gear for Each Phase
Divide Code into “Blocks” • Rule 1: Any MPI operation demarcates a block boundary. • Rule 2: If the memory pressure changes abruptly, a block boundary occurs at that change. • Use operations per miss (OPM) as the measure of memory pressure (see the sketch below)
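The slide gives no code for Rule 2, so here is a minimal C sketch of how OPM could be computed from per-interval hardware-counter samples (e.g., via PAPI or perf) and used to flag an abrupt change. The counter values and the REL_THRESHOLD knob are illustrative assumptions, not values from the paper.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed tuning knob, not a value from the paper. */
#define REL_THRESHOLD 0.5

/* OPM = retired operations per L2 cache miss in one sampling interval. */
static double opm(uint64_t ops, uint64_t misses)
{
    return misses ? (double)ops / (double)misses : (double)ops;
}

/* Rule 2: a block boundary occurs where OPM changes abruptly. */
static int is_block_boundary(double opm_prev, double opm_cur)
{
    return fabs(opm_cur - opm_prev) > REL_THRESHOLD * opm_prev;
}

int main(void)
{
    /* Toy intervals of (ops, L2 misses); the jump marks a boundary. */
    uint64_t ops[]    = {1000000, 1050000, 1020000, 400000};
    uint64_t misses[] = {   2000,    2100,    2050,  40000};
    double prev = opm(ops[0], misses[0]);
    for (int i = 1; i < 4; i++) {
        double cur = opm(ops[i], misses[i]);
        if (is_block_boundary(prev, cur))
            printf("block boundary before interval %d\n", i);
        prev = cur;
    }
    return 0;
}
```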
Merge “Blocks” into “Phases” • Two adjacent blocks are merged into a phase if their OPM values fall within the same threshold (a sketch follows) • [Figure: OPM trace of LU, Class C]
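A minimal sketch of the merge step, assuming a relative tolerance BAND stands in for “within the same threshold”; the block_t layout and the threshold value are illustrative, not from the paper.

```c
#include <stdio.h>

#define BAND 0.25   /* assumed relative-OPM tolerance */

typedef struct { double opm; } block_t;

/* Two OPM values are in the same band if they differ by less than
 * BAND relative to the larger one. */
static int same_band(double a, double b)
{
    double hi = a > b ? a : b;
    double lo = a < b ? a : b;
    return (hi - lo) <= BAND * hi;
}

/* Assign each block a phase id; adjacent blocks with similar OPM
 * share a phase. Returns the number of phases found. */
static int merge_phases(const block_t *blk, int n, int phase_of[])
{
    int phases = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 && same_band(blk[i - 1].opm, blk[i].opm))
            phase_of[i] = phase_of[i - 1];  /* extend current phase */
        else
            phase_of[i] = phases++;         /* start a new phase */
    }
    return phases;
}

int main(void)
{
    block_t blk[] = {{500}, {480}, {10}, {12}, {490}};
    int phase_of[5];
    int n = merge_phases(blk, 5, phase_of);
    printf("%d phases:", n);
    for (int i = 0; i < 5; i++) printf(" %d", phase_of[i]);
    printf("\n");   /* expected: 3 phases: 0 0 1 1 2 */
    return 0;
}
```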
Data Collection • Use MPI-jack • Intercepts any MPI call transparently • Can execute arbitrary code before/after an intercepted call • Insert pseudo MPI calls at non-MPI phase boundaries • Collect time, operation counts, and L2 misses • Question: mutual dependence? Trace data ↔ block boundaries (boundaries are derived from the trace, yet the trace is segmented by boundaries) • A sketch of call interception follows
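MPI-jack itself is not shown on the slide; the sketch below illustrates the same transparent-interception idea using the MPI standard's PMPI profiling interface: the wrapper shadows MPI_Barrier, samples trace data around it, and forwards to the real implementation. The sample() helper is a hypothetical stand-in for reading time, operation counts, and L2 misses.

```c
/* Compile into the application (or a preload library) with mpicc. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for sampling hardware counters; only the
 * wall-clock time is shown here. */
static void sample(const char *tag)
{
    printf("[trace] %-22s t=%.6f\n", tag, MPI_Wtime());
}

/* The wrapper shadows MPI_Barrier; real work is delegated to
 * PMPI_Barrier, so the application is intercepted transparently. */
int MPI_Barrier(MPI_Comm comm)
{
    sample("before MPI_Barrier");
    int rc = PMPI_Barrier(comm);
    sample("after MPI_Barrier");
    return rc;
}
```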
Solution Search (1) • Metric: Energy-Time Tradeoff • Normalized energy and time • Total system energy • The slope of the normalized energy-time curve: a larger negative number indicates a near-vertical slope, i.e., a significant energy saving for little extra time (see the sketch below) • Question: How can energy consumption be measured accurately?
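A sketch of the slope metric under an assumed normalization: energy and time are expressed as fractions of the full-speed baseline, and the slope is the ratio of their fractional changes. The example reproduces the IS numbers reported later (16% energy saving for 1% extra time gives a slope of -16).

```c
#include <stdio.h>

/* Slope between a candidate gear schedule and the full-speed
 * baseline in normalized (time, energy) space. The normalization
 * and the zero-dt special case are assumptions, not quoted from
 * the paper. */
static double tradeoff_slope(double e_base, double t_base,
                             double e_cand, double t_cand)
{
    double de = e_cand / e_base - 1.0;  /* fractional energy change */
    double dt = t_cand / t_base - 1.0;  /* fractional time change   */
    if (dt == 0.0)                      /* energy saved for free */
        return de < 0.0 ? -1e9 : 0.0;
    return de / dt;
}

int main(void)
{
    /* 16% less energy for 1% more time -> slope -16: near vertical. */
    printf("slope = %.1f\n", tradeoff_slope(100.0, 100.0, 84.0, 101.0));
    return 0;
}
```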
Solution Search (2) • Phase Prioritization • Sort the phases in order of OPM (low → high) • Question: Why is sorting necessary? • “Novel” Heuristic Search • Find a locally optimal gear for each phase, one phase at a time (sketch below) • Running time is at most n × g benchmark executions (n phases, g gears)
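A self-contained sketch of the greedy n × g search under stated assumptions: run_and_measure() is a stub standing in for re-running the benchmark with a per-phase gear assignment, and acceptable() is an assumed threshold test on the slope; neither is from the paper.

```c
#include <stdio.h>

#define NPHASES 4
#define NGEARS  3   /* gear 0 = fastest */

/* Stub: in the real system this re-runs the benchmark and returns
 * the measured energy-time slope. Faked here for self-containment. */
static double run_and_measure(const int gear[NPHASES])
{
    double s = 0.0;
    for (int p = 0; p < NPHASES; p++) s -= gear[p];  /* fake slope */
    return s;
}

/* Assumed acceptance test; the paper's criterion is not on the slide. */
static int acceptable(double slope) { return slope > -5.0; }

/* Greedy per-phase search in ascending-OPM order: slow each phase
 * down until the tradeoff stops being acceptable, then back off.
 * At most NGEARS runs per phase -> <= n x g runs in total. */
static void search(int gear[NPHASES], const int by_opm[NPHASES])
{
    for (int i = 0; i < NPHASES; i++) {
        int p = by_opm[i];
        for (int g = 1; g < NGEARS; g++) {
            gear[p] = g;
            if (!acceptable(run_and_measure(gear))) {
                gear[p] = g - 1;   /* last acceptable gear */
                break;
            }
        }
    }
}

int main(void)
{
    int gear[NPHASES] = {0, 0, 0, 0};
    int by_opm[NPHASES] = {2, 0, 1, 3};  /* lowest-OPM phase first */
    search(gear, by_opm);
    for (int p = 0; p < NPHASES; p++)
        printf("phase %d -> gear %d\n", p, gear[p]);
    return 0;
}
```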
Experiments • Cluster of 10 AMD Athlon-64 nodes • Frequency-scalable: 800-2000 MHz • Voltage-scalable: 0.9-1.5 V • 1 GB main memory • 128 KB L1 cache, 512 KB L2 cache • 100 Mb/s network • CPU Consumes 45-55% of Overall System Energy • Benchmarks: NAS Parallel Benchmarks (NPB) • A sketch of gear switching on Linux follows
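The slides do not show how gears are actually switched on the Athlon-64 nodes; as one concrete possibility, the standard Linux cpufreq sysfs interface (with the userspace governor) can set a per-CPU frequency, as in this sketch.

```c
#include <stdio.h>

/* Write a target frequency (in kHz) to the cpufreq userspace-governor
 * knob for one CPU. Requires the userspace governor and root access;
 * this is an illustration, not the paper's mechanism. */
static int set_gear_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
             cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                 /* wrong governor or no permission */
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

int main(void)
{
    /* Drop CPU 0 to the slowest gear on this cluster (800 MHz). */
    return set_gear_khz(0, 800000) == 0 ? 0 : 1;
}
```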
Results: Multiple Gear Benefit • IS: 16% energy saving with 1% extra time • BT: 10% energy saving with 5% extra time • MG: 11% energy saving with 4% extra time
Results: Single Gear Benefit • CG: 8% energy saving with 3% extra time • SP: 15% energy saving with 7% extra time • Note: the order of phases matters!
Conclusions and Future Work • A profile-directed method achieves a good energy-time tradeoff for HPC applications • Future work: • Enhance profile-directed techniques • Consider inter-node bottlenecks • Automate the entire process
Discussion • How important is power consumption to HPC? • Is a 10% energy saving worth 5% extra time? • Is the profile-directed method practical? • Effective for applications that run repeatedly • How much of the process can be automated? • Is OPM (operations per miss) a good metric for finding phases? • Key purpose: to identify CPU utilization • Other options: instructions per second, CPU usage • Is OPM a good metric for sorting phases?