Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs Min Yeol Lim Computer Science Department Sep. 8, 2006
Growing energy demand • Energy efficiency is a big concern • Increased power density of microprocessors • Cooling cost of heat dissipation • Power and performance tradeoff • Dynamic voltage and frequency scaling (DVFS) • Supported by newer microprocessors • Cubic drop in power consumption • Power ∝ frequency × voltage² • CPU is the major power consumer: 35–50% of total power
Power-performance tradeoff • Cost vs. benefit • Power vs. performance • Increasing execution time vs. decreasing power usage • CPU scaling is meaningful only if benefit > cost [Figure: power over time at two p-states; energy E = P1 · T1 at the top p-state and E = P2 · T2 at the reduced p-state; the power drop (P1 → P2) is the benefit, the longer execution time (T1 → T2) is the cost]
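The break-even condition on the slide above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the power/time figures in the test are made up for illustration:

```python
def scaling_pays_off(p1, t1, p2, t2):
    """CPU scaling is worthwhile only if the energy at the reduced
    p-state (P2 * T2) is below the energy at the top p-state (P1 * T1)."""
    return p2 * t2 < p1 * t1
```

For a communication-bound region, power roughly halves while runtime barely grows, so the product shrinks; for a CPU-bound region, runtime stretches enough to erase the power savings.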
Power-performance tradeoff (cont'd) • Cost > Benefit • NPB EP benchmark • CPU-bound application • CPU is on critical path • Benefit > Cost • NPB CG benchmark • Memory-bound application • CPU is NOT on critical path [Figure: time and energy of EP and CG at CPU frequencies from 0.8 to 2.0 GHz]
Motivation 1 • Cost/benefit is code-specific • Applications have different code regions • Most MPI communication is not CPU-critical • P-state transition per code region • High voltage and frequency in CPU-intensive regions • Low voltage and frequency in MPI communication regions
Time and energy performance of MPI calls • MPI_Send • MPI_Alltoall
Motivation 2 • Most MPI calls are too short • Scaling overhead from a p-state change per call • Up to 700 microseconds per p-state transition • Merge adjacent calls into regions • Small intervals between MPI calls • P-state transition occurs once per region [Figure: distributions of MPI call length (ms) and inter-call interval (ms), as fractions of calls and of intervals]
Reducible regions [Figure: timeline of MPI calls A–J across the user and MPI library layers, grouped into reducible regions R1–R3]
Reducible regions (cont'd) • Thresholds in time • close-enough (τ): maximum time distance between adjacent calls • long-enough (λ): minimum region execution time [Figure: calls A–J on the user/MPI-library timeline; adjacent calls with gap δ < τ are merged, and a merged region qualifies when its length δ > λ]
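The two thresholds can be sketched as a simple grouping pass. This is an illustrative reconstruction, assuming each MPI call is given as a (start, end) pair in milliseconds; the function name is hypothetical:

```python
def find_reducible_regions(calls, tau, lam):
    """Merge adjacent MPI calls whose gap is below tau (close-enough),
    then keep only merged regions longer than lam (long-enough)."""
    regions = []
    cur_start, cur_end = calls[0]
    for start, end in calls[1:]:
        if start - cur_end < tau:      # close enough: extend current region
            cur_end = end
        else:                          # gap too large: close current region
            regions.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    regions.append((cur_start, cur_end))
    # long-enough filter: only regions worth a p-state transition survive
    return [(s, e) for s, e in regions if e - s > lam]
```

With τ = 10 ms, two calls 45 ms apart land in separate regions, while calls 1 ms apart merge into one.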
How to learn regions • Region-finding algorithms • by-call • Reduce only inside MPI code: τ = 0, λ = 0 • Effective only if a single MPI call is long enough • simple • Adaptive 1-bit prediction that looks up the call's last behavior • 2 flags: begin and end • composite • Saves the pattern of MPI calls in each region • Memorizes the begin/end MPI calls and the number of calls
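The "simple" predictor can be sketched as two 1-bit flags per call site: a call that began (or ended) a region last time is predicted to do so again. All names below are illustrative, not from the original system:

```python
class SimplePredictor:
    """1-bit history per MPI call site: predict that a call which
    began (or ended) a region last time will do so again."""
    def __init__(self):
        self.begin = set()   # call-site ids that began a region last time
        self.end = set()     # call-site ids that ended a region last time

    def predict(self, site):
        # (begins_region, ends_region) predicted from last behavior
        return site in self.begin, site in self.end

    def update(self, site, began, ended):
        # record what actually happened, for the next visit to this site
        (self.begin.add if began else self.begin.discard)(site)
        (self.end.add if ended else self.end.discard)(site)
```

A first visit predicts nothing (the false-negative case on the next slide); once a site's behavior is recorded, the prediction follows it until the behavior changes.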
P-state transition errors • False positive (FP) • P-state is reduced in a region where the top p-state must be used • e.g. a region that terminates earlier than expected • False negative (FN) • Top p-state is used in a reducible region • e.g. a region seen for the first time
P-state transition errors (cont'd) [Figure: p-state traces (top vs. reduced) over program execution for the optimal transition, the simple algorithm, and the composite algorithm on the repeated region sequence A B A B A B, with FN errors marked]
P-state transition errors (cont'd) [Figure: the same comparison on the region sequence A A A A A A, with FP and FN errors marked]
Selecting the proper p-state • automatic algorithm • Uses the composite algorithm to find regions • Uses hardware performance counters • Evaluates CPU dependency in reducible regions • A metric of CPU load: micro-operations per microsecond (OPS) • A p-state mapping table maps OPS to a p-state
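The mapping-table lookup can be sketched as follows. The threshold and p-state values here are made up for illustration; only the shape of the mechanism (OPS reading in, p-state index out) follows the slide:

```python
# Hypothetical table mapping measured CPU load (micro-ops / microsecond)
# to a p-state index (0 = top frequency, larger = slower), ordered by
# descending OPS threshold.
OPS_TO_PSTATE = [
    (400.0, 0),  # heavy CPU load: stay at the top p-state
    (200.0, 2),  # moderate load: scale down a little
    (0.0,   5),  # light, communication-bound load: scale down aggressively
]

def select_pstate(ops):
    """Pick the first table entry whose OPS threshold the reading meets."""
    for threshold, pstate in OPS_TO_PSTATE:
        if ops >= threshold:
            return pstate
    return OPS_TO_PSTATE[-1][1]
```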
Implementation • Uses PMPI • MPI profiling interface • Transparently intercepts pre and post hooks of any MPI call • MPI call unique identifier • Uses a hash of all program counters in the call history • Program counters gathered via assembly code inserted in C
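The call-site identifier can be sketched as a hash over the chain of return addresses, so the same MPI call reached through different call paths gets a distinct id. Here the stack walk is simulated with a list of program-counter values; in the real system these come from inline assembly in C:

```python
import hashlib

def call_site_id(program_counters):
    """Hash the whole call chain: identical MPI calls reached through
    different call paths yield different identifiers."""
    h = hashlib.sha1()
    for pc in program_counters:
        h.update(pc.to_bytes(8, "little"))
    return h.hexdigest()[:16]
```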
Results • System environment • 8 or 9 nodes, each an AMD Athlon-64 system • 7 p-states supported: 2000–800 MHz • Benchmarks • NPB MPI benchmark suite • Class C • 8 applications • ASCI Purple benchmark suite • Aztec • Thresholds (τ, λ) set to 10 ms
Benchmark analysis • Used composite for region information
Taxonomy • The profile-based approach has neither FN nor FP errors
Comparison of p-state transition errors • Breakdown of execution time [Figure: error breakdown for the simple and composite algorithms]
τ evaluation • SP benchmark
τ evaluation (cont'd) [Figure: results for the MG, CG, BT, and LU benchmarks]
Conclusion • Contributions • Design and implement an adaptive p-state transition system in MPI communication phases • Identify reducible regions on the fly • Determine proper p-state dynamically • Provide transparency to users • Future work • Evaluate the performance with other applications • Experiments on the OPT cluster
State transition diagram • Simple [Figure: two states, OUT and IN; OUT → IN when begin == 1, IN → OUT when end == 1 or the next call is not "close enough"; otherwise each state loops to itself]
State transition diagram (cont'd) • Composite [Figure: three states, OUT, IN, and REC; a region operation begins OUT → REC to record a new pattern, "close enough" calls keep recording, and a call that is not "close enough" ends the region and returns to OUT; a learned reducible region is entered (IN) and stays entered while calls remain "close enough" and match the saved pattern, leaving on a pattern mismatch or a gap that is not "close enough"]
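The three-state logic can be sketched as a small state machine. This is an illustrative simplification: the saved "pattern" is reduced to the region's begin call-site id, whereas the real composite algorithm also memorizes the end call and the number of calls:

```python
# Illustrative three-state sketch of the composite region learner:
#   OUT -> no region active
#   REC -> recording a new region's call pattern
#   IN  -> replaying a known region (runs at the reduced p-state)
class CompositeRegions:
    def __init__(self):
        self.state = "OUT"
        self.patterns = set()   # begin call-site ids of learned regions
        self.current = []       # call sites recorded in the open region

    def on_call(self, site, close_enough):
        if self.state == "OUT":
            if site in self.patterns:
                self.state = "IN"                # known region begins
            else:
                self.state, self.current = "REC", [site]
        elif not close_enough:                   # gap too large: region ends
            if self.state == "REC" and len(self.current) > 1:
                self.patterns.add(self.current[0])   # pattern learned
            self.state, self.current = "OUT", []
            self.on_call(site, True)             # re-dispatch the new call
        elif self.state == "REC":
            self.current.append(site)            # keep recording the pattern
        return self.state
```

On its first appearance a region is only recorded (a false negative, per the error taxonomy above); on later appearances it is entered immediately.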
Benchmark analysis • Region information from composite with τ = 10 ms