
Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami (Kyushu University)

Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core and Frequency Scaling


Presentation Transcript


  1. Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core and Frequency Scaling • Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami • Kyushu University

  2. Many-core Processors • Multi-core processors are currently mainstream • Core counts per chip increase as process technology shrinks • The many-core processor era is coming • Tens to hundreds of cores on a chip • Multi-threaded programs are executed for high performance • Figure: TILERA "TILE-Gx100" block diagram, http://www.tilera.com/products/processors/TILE-Gx_Family

  3. Challenge of Many-core • Demand for low power consumption • Ex: Large-scale data centers • Reduce peak power consumption by power capping • Programs need to be executed efficiently under a power consumption constraint

  4. Two Knobs to Determine Performance • CPU frequency and the number of cores • Characteristics of multi-threaded programs differ among and within programs • Sensitivity to CPU frequency • Parallelism • The proper configuration must be chosen according to the kind of program and its behavior

  5. Experimental Environment • 32-core AMD four-socket system • Diagram labels: CPU0 to CPU3; cores C0 to C3, each with an L2; shared L3; memory controller • Conventional execution and power constraint: the power consumed when all 32 cores run at 0.8 GHz

  6. Characteristics among Programs • Figure panels: blackscholes, dedup, x264

  7. Characteristics within a Program • Figure: five panels, each with core counts 4, 8, 12, 16, and 32 on the axis; an arrow marks the "better" direction • IPS: Instructions Per Second

  8. Our Goal • Maximize the performance of parallel programs on many-core processors under a power consumption constraint • Characteristics vary among and within programs • Sensitivity to CPU frequency • Scalability with core count • Dynamically choose the optimal trade-off point between core count and CPU frequency

  9. Overview of DCFS (Dynamic Core and Frequency Scaling) • Optimize core counts and CPU frequency dynamically according to the characteristics of the program • High parallelism • Parallel processing with the maximum available core count • Medium/low parallelism • Restrict the number of active cores • Reallocate the power budget to increase CPU frequency • Figure panels: blackscholes, dedup

  10. DCFS Algorithm • Two phases • In the Training phase • Change the configuration of core counts and CPU frequency periodically • Measure IPS during execution with each configuration • Estimate the optimal configuration using the measured IPS • In the Execution phase • Execute with the optimal configuration • Detect behavior changes of the executed program • Figure: execution-time timeline showing the Training phase and the Execution phases
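
The two phases above can be sketched as a small control loop. This is a minimal sketch, not the authors' implementation: `train` and `run_interval` are hypothetical callbacks, the reference IPS is assumed to be the IPS of the configuration chosen in training, and the 10% default threshold is the value given on the backup slide about implementation details.

```python
from typing import Callable, List, Tuple

Config = Tuple[int, float]  # (active cores, CPU frequency in GHz)


def dcfs_loop(train: Callable[[], Tuple[Config, float]],
              run_interval: Callable[[Config], float],
              intervals: int,
              threshold: float = 0.10) -> List[Config]:
    """Return the configuration used in each execution interval.

    train()           -> (best configuration, its measured IPS); hypothetical.
    run_interval(cfg) -> IPS measured while executing with cfg; hypothetical.
    """
    history: List[Config] = []
    config, ref_ips = train()                 # initial Training phase
    for _ in range(intervals):
        ips = run_interval(config)            # Execution phase, one interval
        history.append(config)
        # Behavior change: IPS moved by more than `threshold` of the reference.
        if abs(ips - ref_ips) > threshold * ref_ips:
            config, ref_ips = train()         # go back to the Training phase
    return history
```

Retraining only on a detected change keeps the training overhead out of stable phases, which matches the motivation given for the behavior-change detection.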

  11. How to find the best configuration • Find the best core count for each CPU frequency • Decrement the core count until IPS declines • Select the configuration with the highest IPS • Figure panel: x264
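
The search described above can be sketched as follows. This is a sketch under assumptions: `measure_ips` is a hypothetical callback standing in for a short training run, `max_cores_at_freq` is an assumed table of the largest core count the power constraint allows at each frequency, and the decrement `step` is a free parameter because the slide does not state it.

```python
from typing import Callable, Dict, Optional, Tuple


def find_best_config(max_cores_at_freq: Dict[float, int],
                     measure_ips: Callable[[int, float], float],
                     step: int = 1) -> Optional[Tuple[int, float, float]]:
    """Return (cores, freq, ips) with the highest measured IPS, or None."""
    best: Optional[Tuple[int, float, float]] = None
    for freq, max_cores in max_cores_at_freq.items():
        # Start from the largest core count allowed at this frequency.
        cores = max_cores
        ips = measure_ips(cores, freq)
        # Decrement the core count until IPS declines.
        while cores - step >= 1:
            next_ips = measure_ips(cores - step, freq)
            if next_ips < ips:
                break
            cores, ips = cores - step, next_ips
        # Keep the best configuration seen over all frequencies.
        if best is None or ips > best[2]:
            best = (cores, freq, ips)
    return best
```

Stopping at the first decline avoids measuring every core count, at the cost of possibly missing a better point beyond a local dip.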

  12. Evaluation Result • DCFS-3, DCFS-10: • Our proposed technique without detection of behavior changes • Execute with the configuration estimated in the Training phase for a fixed 3 or 10 seconds • DCFS-WD: • Our proposed technique with detection of behavior changes • Figure groups: middle/low parallelism, high parallelism

  13. Evaluation Result • Almost no performance improvement for high-parallelism programs • Execution with all cores maximizes performance • Performance degradation due to the overhead of the Training phase • Figure groups: middle/low parallelism, high parallelism

  14. Evaluation Result • Almost no performance improvement despite middle/low parallelism • Two most memory-bound programs in PARSEC* • Increasing CPU frequency yields only a small performance improvement • Figure groups: middle/low parallelism, high parallelism * Bienia, C. et al., "PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors", IISWC 2008.

  15. Evaluation Result • Performance improvement for middle/low-parallelism programs • 35% improvement for dedup • 20% improvement on average for four programs • 6% improvement on average for all programs • Figure groups: middle/low parallelism, high parallelism

  16. Conclusions • Challenge of many-core processors • Maximizing performance under a power constraint • Proposed technique: DCFS • Optimize core counts and CPU frequency dynamically • Detect behavior changes of the executed program • Evaluation • Max 35% performance improvement • 6% performance improvement on average for ten benchmarks • No performance improvement for high-parallelism and memory-bound programs

  17. Future Work • Improve the algorithm of our technique to find the best configuration and to detect behavior changes • Evaluate under different power consumption constraints • Evaluate on different platforms

  18. Thank you for your attention. I would appreciate it if you could ask your questions slowly.

  19. blackscholes bodytrack canneal dedup ferret freqmine

  20. streamcluster swaptions vips x264

  21. Backup Slides

  22. Experimental Environment • 32-core AMD four-socket system • Diagram labels: CPU0 to CPU3; cores C0 to C3, each with an L2; shared L3; memory controller

  23. Power Constraint Assumption • Conventional execution • Power consumption constraint: • The power consumed when all cores run at the minimum available CPU frequency • The maximum CPU frequency for each core count is decided under this constraint
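
A minimal sketch of how a maximum frequency per core count could be picked under this constraint. It assumes dynamic power proportional to N · f · V² with identical cores, so the activity and capacitance factors cancel. The function names and the P-state table in the usage note are hypothetical and are not taken from the slides.

```python
from typing import List, Optional, Tuple

PState = Tuple[float, float]  # (frequency in GHz, supply voltage in V)


def relative_power(cores: int, freq_ghz: float, volts: float) -> float:
    """Dynamic power up to a constant factor: N * f * V^2 (assumed model)."""
    return cores * freq_ghz * volts ** 2


def max_pstate(cores: int, total_cores: int,
               pstates: List[PState]) -> Optional[PState]:
    """Highest-frequency P-state that keeps `cores` within the power budget.

    The budget is the power of all `total_cores` at the lowest P-state.
    """
    f_min, v_min = min(pstates)
    budget = relative_power(total_cores, f_min, v_min)
    feasible = [p for p in pstates if relative_power(cores, *p) <= budget]
    return max(feasible) if feasible else None
```

With a hypothetical table such as `[(0.8, 0.9), (1.2, 1.0), (1.6, 1.1), (2.0, 1.2)]` and 32 cores in total, halving the active cores frees enough budget to step up the frequency, which is the reallocation DCFS relies on.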

  24. How to Determine Max CPU Frequency • The power consumption constraint • The power consumption when N cores run • For each core count, choose the maximum CPU frequency and supply voltage that satisfy the inequality between these two quantities • Terms of the inequality: the switching activity of the circuit, the total number of cores, the capacitance per core, the minimum operating frequency, and the minimum supply voltage
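
The inequality itself is not reproduced in this transcript. A plausible form, assuming the standard CMOS dynamic-power model and using symbol names chosen here for illustration (α for switching activity, C for capacitance per core, N_total for the total number of cores, f_min and V_min for the minimum frequency and voltage, and f_N, V_N for the values chosen when N cores are active), would be:

```latex
\underbrace{N \,\alpha\, C\, f_N\, V_N^{2}}_{\text{power with } N \text{ active cores}}
\;\le\;
\underbrace{N_{\mathrm{total}} \,\alpha\, C\, f_{\min}\, V_{\min}^{2}}_{\text{power consumption constraint}}
```

Under this assumed model α and C cancel, so the condition reduces to N f_N V_N² ≤ N_total f_min V_min².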

  25. Implementation of DCFS • Training phase • Change the configuration periodically • Execute with each configuration for a short period ("Training period") • Measure IPS as the indicator of performance • Compare the measured IPS values to estimate the optimal configuration • Execution phase • Execute with the optimal configuration • Measure IPS periodically to detect phase changes of the program • No static analysis or modification of programs is needed

  26. Detailed Implementation of DCFS • Periodic reading of performance counters • Use Linux "perf-tools" • Thread allocation to the specified cores • Use the standard Linux API "sched_setaffinity(2)" • Training period: 30 ms • Measure IPS every 1 second to detect phase changes • A phase change is detected when IPS increases or decreases by more than 10%
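
The phase-change check above can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, and the choice of the reference IPS (here, whatever value the caller recorded earlier) is an assumption the slide does not specify.

```python
def ips_from_counters(instr_prev: int, instr_now: int, seconds: float) -> float:
    """Instructions per second from two readings of an instruction counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (instr_now - instr_prev) / seconds


def phase_changed(reference_ips: float, current_ips: float,
                  threshold: float = 0.10) -> bool:
    """True when IPS rose or fell by more than `threshold` of the reference."""
    if reference_ips <= 0:
        raise ValueError("reference_ips must be positive")
    return abs(current_ips - reference_ips) / reference_ips > threshold
```

With the 1-second sampling interval from the slide, `ips_from_counters` would be called once per second and its result passed to `phase_changed`.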

  27. The Way to Change Core Counts • Use "Thread Packing*" • Change core counts while the number of threads is constant • No need to modify source code ⇒ Easy implementation • Figure: eight threads running on eight cores, and the same eight threads packed onto two cores while the other six cores are idle * Cochran, R. et al., "Pack & Cap: Adaptive DVFS and Thread Packing Under Power Caps", MICRO, 2011
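
A minimal sketch of thread packing on Linux, using Python's `os.sched_setaffinity` wrapper around the `sched_setaffinity(2)` call named in the implementation slide. The helper names are hypothetical, and choosing the lowest-numbered CPUs is an assumption, not the authors' placement policy.

```python
import os
from typing import Iterable, Set


def packed_cpu_set(active_cores: int, all_cpus: Iterable[int]) -> Set[int]:
    """Pick the `active_cores` lowest-numbered CPUs from `all_cpus`."""
    cpus = sorted(set(all_cpus))
    if not 1 <= active_cores <= len(cpus):
        raise ValueError("active_cores must be between 1 and the number of CPUs")
    return set(cpus[:active_cores])


def pack_threads(pid: int, cpus: Set[int]) -> None:
    """Restrict every thread of process `pid` to `cpus` (Linux only).

    The thread count is unchanged; only the set of cores they may run on
    shrinks or grows. Threads created later are not covered by this call.
    """
    for tid in os.listdir(f"/proc/{pid}/task"):
        os.sched_setaffinity(int(tid), cpus)
```

For example, `pack_threads(os.getpid(), packed_cpu_set(2, range(8)))` would confine the current process's threads to CPUs 0 and 1, assuming both are available to the process.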

  28. Benchmarks • 10 benchmarks from PARSEC 2.1* • Input set size: native *Bienia, C. et al, “The PARSEC benchmark suite: Characterization and architectural implications”, PACT, 2008

  29. Analysis of canneal & streamcluster • Two most memory-bound programs in PARSEC* • Increasing CPU frequency yields only a small performance improvement • Figure panels: canneal, streamcluster * Bienia, C. et al., "PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors", IISWC 2008.

  30. Analysis of dedup • DCS@0.8GHz: Control only core counts dynamically • 4% overhead of the Training phase • DCFS achieves high performance by scaling both core counts and CPU frequency • Figure value labels: 30, 30, 31, 32, 32, 32, 32, 32, 32, 31, 30, 4, 8, 6, 6, 8, 8, 8, 8

  31. Experimental Environment (Xeon)

  32. Maximum CPU Frequency and Supply Voltage for Each Core Count (Xeon)

  33. Evaluation Result (Xeon) • Performance degradation for all programs except swaptions • These programs have great or moderate scalability • Execution with all 12 cores maximizes performance ⇒ Performance degradation is due to the overhead of the Training phase • For swaptions: high performance only when executed with a power-of-two core count ⇒ Execution with eight cores maximizes performance

  34. Analysis of ferret • Performance improvement by increasing core counts • Execution with all cores maximizes performance • Performance degradation due to the overhead of the Training phase

  35. 64-core AMD four-socket system • Diagram labels: CPU0 to CPU3; cores C0, C1, ..., C7, each with an L2; shared L3 cache; memory controller

  36. Figure: thread packing • Eight threads running on eight cores, and the same eight threads packed onto two cores while the other six cores are idle
