
Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs


Presentation Transcript


  1. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs Min Yeol Lim Computer Science Department Sep. 8, 2006

  2. Growing energy demand • Energy efficiency is a big concern • Increased power density of microprocessors • Cooling cost for heat dissipation • Power and performance tradeoff • Dynamic voltage and frequency scaling (DVFS) • Supported by newer microprocessors • Cubic drop in power consumption • Power ∝ frequency × voltage² • CPU is the major power consumer: 35–50% of total power
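A minimal sketch of why DVFS gives a roughly cubic power reduction, assuming dynamic power dominates and supply voltage is scaled roughly in proportion to frequency (this is the standard CMOS power model, not additional material from the slides):

```latex
% Dynamic CMOS power: C = switched capacitance, f = clock frequency, V = supply voltage
P_{\mathrm{dyn}} \approx C \, f \, V^{2}
% If voltage is scaled roughly in proportion to frequency (V \propto f), then
P_{\mathrm{dyn}} \propto f \cdot f^{2} = f^{3}
% so lowering frequency (and voltage with it) cuts power much faster than it slows the clock.
```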

  3. Power-performance tradeoff • Cost vs. Benefit • Power vs. performance • Increasing execution time vs. decreasing power usage • CPU scaling is meaningful only if benefit > cost • [Figure: power-over-time plot comparing E = P1 × T1 at the top p-state with E = P2 × T2 at a reduced p-state; the power drop is the benefit, the longer execution time is the cost]
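To make the tradeoff concrete, here is an illustrative comparison of the two energies E = P × T from the figure; the wattage and runtime numbers are hypothetical, not measurements from the slides:

```c
/* Illustrative sketch: compare energy at two operating points using E = P * T.
 * Scaling pays off only when the power saved outweighs the extra execution time. */
#include <stdio.h>

int main(void) {
    double P1 = 80.0, T1 = 100.0;   /* hypothetical: top p-state (watts, seconds)   */
    double P2 = 45.0, T2 = 115.0;   /* hypothetical: reduced p-state                */

    double E1 = P1 * T1;            /* energy at the top p-state                    */
    double E2 = P2 * T2;            /* energy at the reduced p-state                */

    printf("E1 = %.0f J, E2 = %.0f J -> scaling %s\n",
           E1, E2, (E2 < E1) ? "saves energy" : "costs energy");
    return 0;
}
```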

  4. Power-performance tradeoff (cont’) • Cost > Benefit • NPB EP benchmark • CPU-bound application • CPU is on critical path • Benefit > Cost • NPB CG benchmark • Memory-bound application • CPU is NOT on critical path • [Figure: results at CPU frequencies of 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0 GHz]

  5. Motivation 1 • Cost/benefit is code-specific • Applications have different code regions • Most MPI communication is not CPU-critical • P-state transition in each code region • High voltage and frequency in CPU-intensive regions • Low voltage and frequency in MPI communication regions

  6. Time and energy performance of MPI calls • MPI_Send • MPI_Alltoall

  7. Motivation 2 • Most MPI calls are too short • Scaling overhead from a p-state change per call • Up to 700 microseconds per p-state transition • Group adjacent calls into regions • Small intervals between adjacent MPI calls • One p-state transition per region • [Figures: distributions of MPI call lengths (fraction of calls vs. call length in ms) and of inter-call intervals (fraction of intervals vs. interval in ms)]

  8. Reducible regions • [Figure: timeline of user code and MPI library calls A–J, with adjacent calls grouped into reducible regions R1, R2, and R3]

  9. Reducible regions (cont’) • Thresholds in time • close-enough (τ): time distance between adjacent calls • long-enough (λ): region execution time • [Figure: same timeline; adjacent MPI calls are merged when the gap δ between them is < τ, and a merged region is reducible when its length exceeds λ]
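A minimal sketch of the two threshold tests, assuming microsecond timestamps; the type and function names (region_t, close_enough, long_enough) are illustrative, not the authors' code:

```c
/* Sketch of the close-enough / long-enough tests.  Adjacent MPI calls separated
 * by less than TAU are merged into one region; the finished region is only worth
 * scaling if it runs longer than LAMBDA. */
#include <stdbool.h>

#define TAU_US     10000.0   /* close-enough threshold (10 ms, as in the slides) */
#define LAMBDA_US  10000.0   /* long-enough threshold  (10 ms, as in the slides) */

typedef struct {
    double start_us;   /* time the first MPI call in the region began     */
    double end_us;     /* time the last MPI call in the region returned   */
} region_t;

/* Does the next MPI call (starting at call_start_us) extend the current region? */
static bool close_enough(const region_t *r, double call_start_us) {
    return (call_start_us - r->end_us) < TAU_US;
}

/* Is the finished region long enough to amortize the ~700 us p-state switch? */
static bool long_enough(const region_t *r) {
    return (r->end_us - r->start_us) > LAMBDA_US;
}
```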

  10. How to learn regions • Region-finding algorithms • by-call • Reduce only in MPI code: τ = 0, λ = 0 • Effective only if a single MPI call is long enough • simple • Adaptive 1-bit prediction by looking up each call site's last behavior • Two flags: begin and end • composite • Save the pattern of MPI calls in each region • Memorize the begin/end MPI calls and the number of calls
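As a rough sketch of the simple algorithm's 1-bit prediction, one could keep a begin flag and an end flag per MPI call site and replay the last observed behavior; the table layout and names below are assumptions, not the paper's implementation:

```c
/* Illustrative sketch of the "simple" predictor: for each MPI call site,
 * remember whether it began and/or ended a reducible region last time,
 * and predict the same behavior on the next visit. */
typedef struct {
    unsigned begin : 1;     /* last time, a region began at this call site */
    unsigned end   : 1;     /* last time, a region ended at this call site */
} site_pred_t;

#define MAX_SITES 4096
static site_pred_t table[MAX_SITES];

static site_pred_t *lookup(unsigned long site_id) {
    return &table[site_id % MAX_SITES];   /* simple direct-mapped table */
}

/* On entry to an MPI call: if this site began a region before,
 * drop to the reduced p-state now. */
static int predict_begin(unsigned long site_id) {
    return lookup(site_id)->begin;
}

/* After the region outcome is known, update the 1-bit history. */
static void train(unsigned long site_id, int began, int ended) {
    site_pred_t *p = lookup(site_id);
    p->begin = began ? 1 : 0;
    p->end   = ended ? 1 : 0;
}
```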

  11. P-state transition errors • False positive (FP) • P-state is reduced in a region where the top p-state should be used • e.g., a region terminates earlier than expected • False negative (FN) • Top p-state is used in a reducible region • e.g., a region seen for the first time

  12. P-state transition errors (cont’) • [Figure: timeline of program execution (user code and MPI library) with two reducible regions A and B, comparing the optimal p-state transitions against those made by the simple and composite algorithms; both incur false negatives (FN) by keeping the top p-state inside reducible regions]

  13. P-state transition errors (cont’) • [Figure: timeline of repeated executions of the same region A, comparing the optimal transitions with the simple and composite algorithms; the algorithms incur false positives (FP) and false negatives (FN) around regions that end earlier than predicted or appear for the first time]

  14. Selecting the proper p-state • Automatic algorithm • Use the composite algorithm to find regions • Use hardware performance counters • Evaluate CPU dependency in reducible regions • A metric of CPU load: micro-operations per microsecond (OPS) • Specify a p-state mapping table
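A minimal sketch of mapping the OPS metric to a p-state; the thresholds, p-state numbers, and counter source below are hypothetical placeholders, not values from the slides:

```c
/* Illustrative sketch (not the authors' code): pick a p-state from the measured
 * CPU load of a reducible region.  OPS = retired micro-operations per microsecond;
 * the thresholds and p-state numbers are made up. */
typedef struct {
    double min_ops;   /* choose this row if OPS >= min_ops                   */
    int    pstate;    /* 0 = top p-state (fastest), 6 = lowest (slowest)     */
} ops_map_t;

static const ops_map_t ops_map[] = {
    { 1500.0, 0 },    /* CPU-bound region: keep the top p-state              */
    {  800.0, 3 },    /* moderate CPU load: a middle p-state                 */
    {    0.0, 6 },    /* communication/memory bound: lowest p-state          */
};

static int pick_pstate(unsigned long long uops_retired, double elapsed_us) {
    double ops = (double)uops_retired / elapsed_us;   /* micro-ops per microsecond */
    for (unsigned i = 0; i < sizeof(ops_map) / sizeof(ops_map[0]); i++)
        if (ops >= ops_map[i].min_ops)
            return ops_map[i].pstate;
    return 6;                                         /* unreachable with min_ops = 0 */
}
```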

  15. Implementation • Use PMPI • MPI profiling interface • Intercept any MPI call transparently with pre and post hooks • Unique identifier per MPI call • Hash of all program counters in the call history • Obtained by inserting assembly code in C
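As a sketch of the PMPI-based interception: the profiling interface lets a wrapper be linked in place of each MPI call and forward to the real implementation through its PMPI_ name, so the application needs no changes. The region_enter/region_leave helpers below are illustrative placeholders for the pre and post hooks:

```c
#include <mpi.h>

/* Placeholder hooks: in the real system these would run the region predictor
 * and perform the p-state transitions. */
static void region_enter(void) { /* pre hook: maybe lower the p-state       */ }
static void region_leave(void) { /* post hook: maybe restore the top p-state */ }

/* Wrapper linked in place of MPI_Send; it forwards to the MPI library via PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    region_enter();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    region_leave();
    return rc;
}
```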

  16. Results • System environment • 8 or 9 nodes with AMD Athlon-64 processors • 7 p-states supported: 2000–800 MHz • Benchmarks • NPB MPI benchmark suite • Class C • 8 applications • ASCI Purple benchmark suite • Aztec • Thresholds (τ, λ) set to 10 ms

  17. Benchmark analysis • Used composite for region information

  18. Taxonomy • The profile-based approach has neither FN nor FP errors

  19. Overall Energy Delay Product (EDP)
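For reference, the combined metric reported here is the energy-delay product, which penalizes both added energy and added runtime (lower is better); this is the standard definition, not additional data from the slides:

```latex
% Energy-Delay Product for a run with average power P, energy E, and execution time T
\mathrm{EDP} = E \times T = P \times T^{2}
```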

  20. Comparison of p-state transition errors • Breakdown of execution time for the simple and composite algorithms

  21. τ evaluation • SP benchmark

  22. τ evaluation (cont’) • [Figures: results for the MG, CG, BT, and LU benchmarks]

  23. Conclusion • Contributions • Design and implement an adaptive p-state transition system in MPI communication phases • Identify reducible regions on the fly • Determine proper p-state dynamically • Provide transparency to users • Future work • Evaluate the performance with other applications • Experiments on the OPT cluster

  24. State transition diagram • Simple • [Diagram: two states, OUT and IN; OUT → IN when begin == 1, else stay in OUT; IN → OUT when end == 1 or the next call is not “close enough”, else stay in IN]
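A minimal sketch of that two-state machine in C; the transition conditions are inferred from the diagram labels and the simple algorithm's begin/end flags, so treat them as an approximation rather than the authors' exact logic:

```c
/* Sketch of the two-state machine behind the "simple" algorithm
 * (state names OUT/IN follow the slide; transitions are inferred). */
typedef enum { OUT, IN } simple_state_t;

static simple_state_t step(simple_state_t s, int begin_flag, int end_flag,
                           int close_enough) {
    switch (s) {
    case OUT:
        /* Enter a reducible region when the predictor's begin flag is set. */
        return begin_flag ? IN : OUT;
    case IN:
        /* Leave the region when the end flag is set or the next MPI call
         * is not close enough to extend it. */
        return (end_flag || !close_enough) ? OUT : IN;
    }
    return OUT;
}
```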

  25. State transition diagram (cont’) • Composite • [Diagram: three states, OUT, REC, and IN; the pattern of MPI calls in a region is recorded in REC when a reducible region first begins, a known region is entered (IN) while its calls stay “close enough”, and control returns to OUT on a pattern mismatch, a call that is not “close enough”, or the end of the region]

  26. Performance

  27. Benchmark analysis • Region information from composite with τ = 10 ms
