Dynamic Workload Characterization for Power Efficient Scheduling on CMP Systems

Gaurav Dhiman¹, Vasileios Kontorinis¹, Dean Tullsen¹, Tajana Rosing¹, Eric Saxe², Jonathan Chew². ¹UC San Diego, ²Oracle Corp. ISLPED 2010, Austin.


Presentation Transcript


  2. Introduction • Chip multiprocessors (multicore architectures) are pervasive in modern systems • Cores share resources asymmetrically, in a hierarchy: memory bandwidth, last-level cache, pipeline • Threads scheduled across these cores share these resources, so their resource requirements and relative placement affect overall performance and power efficiency

  3. Motivation • Modern operating systems capture the resource-sharing asymmetries • However, load balancing is based on thread count alone and does not consider each thread's resource usage or resource requirements

  4. Motivation • The difference between the best and worst schedule is as high as 70% on a ‘balanced’ system • Which threads share the last-level cache makes a big difference • High contention degrades performance and power efficiency • (Figure: example workload of gzip, art, art, art)

  5. Motivation • The default scheduler exhibits high variance, caused by ping-ponging between the best and worst schedules and by frequent preemptions from high-priority ‘transient threads’ • This is not an OS-specific problem: the root cause is the lack of information available to the OS scheduler

  6. Contributions • Highlight the inability of modern operating systems to extract full power efficiency from parallel architectures, due to the lack of resource-utilization knowledge accessible to the scheduler • Identify the thread characteristics that affect resource-sharing efficiency, and metrics to capture them • Uncover ‘transient threads’, short-running kernel threads that impede stable scheduling, and provide a solution • Extend the scheduler to incorporate this logic into the load-balancing fabric • Implement a prototype “Workload Characteristics Aware (WCA) Scheduler”

  7. Transient Threads • High-priority, short-running threads that run on the order of microseconds (µs), e.g. java, fsflush, nscd • They have little impact on the run times of long-running threads • Yet they mislead the OS load balancer (figure: ‘Artificial Load’ on an ‘Almost idle’ CPU)

  8. Transient Threads • Identification: transient threads spend most of their time blocked rather than running; maintain the ratio of running to blocked time in the thread data structure, and flag a thread as transient if the ratio is below 1% • Resolution: load balance only non-transient threads
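The identification and resolution rules above can be modeled in a few lines. This is a minimal user-space sketch in Python, not the OpenSolaris implementation. `ThreadStats`, `balancing_load` and the microsecond counters are hypothetical names. The ratio is assumed to be running time divided by blocked time, because under that reading a thread that is mostly blocked yields a small value.

```python
from dataclasses import dataclass

# Threshold from the slide: flag a thread as transient if its ratio is below 1%.
TRANSIENT_RATIO = 0.01


@dataclass
class ThreadStats:
    """Hypothetical per-thread accounting (the real fields would live in the kernel thread structure)."""
    name: str
    run_time_us: float = 0.0      # accumulated time spent running
    blocked_time_us: float = 0.0  # accumulated time spent blocked

    def account(self, ran_us: float, blocked_us: float) -> None:
        """Add one interval of running and blocked time."""
        self.run_time_us += ran_us
        self.blocked_time_us += blocked_us

    @property
    def is_transient(self) -> bool:
        # Assumption: the ratio is run time divided by blocked time.
        if self.blocked_time_us == 0:
            return False  # a thread that never blocks is not transient
        return self.run_time_us / self.blocked_time_us < TRANSIENT_RATIO


def balancing_load(run_queue: list) -> int:
    """Load seen by the balancer: only non-transient threads are counted."""
    return sum(1 for t in run_queue if not t.is_transient)


# Example: a daemon that runs ~50 us per second vs. a CPU-bound benchmark.
fsflush = ThreadStats("fsflush")
fsflush.account(ran_us=50, blocked_us=999_950)
art = ThreadStats("art")
art.account(ran_us=1_000_000, blocked_us=0)

print(fsflush.is_transient, art.is_transient)  # True False
print(balancing_load([fsflush, art]))          # 1
```

A CPU running art plus a woken fsflush therefore still counts as carrying one thread of load. As a result, the balancer has no reason to migrate art away.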

  9. Cache Sensitivity • Two requirements: identify cache-sensitive threads, and rework the load balancer to balance them • Identification metrics: LLCRPI (last-level cache references per instruction): cache weight = 2, highest degree of sensitivity; IPC: cache weight = 1, medium degree of sensitivity; non-sensitive: cache weight = 0, no sensitivity • Maintain each thread's cache weight dynamically
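One way to read the cache-weight metrics is as a per-interval classifier over hardware counter samples. In this sketch the function name and both thresholds are hypothetical, since the slides give no values. A thread is assumed to be checked for high LLCRPI first and for high IPC second.

```python
# Hypothetical thresholds: the slides do not give the values used.
LLCRPI_THRESHOLD = 0.01  # last-level cache references per instruction
IPC_THRESHOLD = 1.0      # instructions per cycle


def cache_weight(llc_refs: int, instructions: int, cycles: int) -> int:
    """Return a cache weight of 2, 1 or 0 from one sampling interval of counters."""
    if instructions == 0 or cycles == 0:
        return 0  # no activity in this interval: treat as non-sensitive
    llcrpi = llc_refs / instructions
    ipc = instructions / cycles
    if llcrpi >= LLCRPI_THRESHOLD:
        return 2  # heavy last-level cache user: highest sensitivity
    if ipc >= IPC_THRESHOLD:
        return 1  # high IPC: medium sensitivity
    return 0      # neither: not cache sensitive


print(cache_weight(2_000_000, 100_000_000, 200_000_000))  # 2
print(cache_weight(100_000, 100_000_000, 50_000_000))     # 1
print(cache_weight(100_000, 100_000_000, 200_000_000))    # 0
```

Recomputing the weight every sampling interval is one simple way to keep it "dynamic". The slides do not say how often the weight is updated or whether it is smoothed.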

  10. Cache Sensitivity • Enhance the load representation structure • Balance the number of threads, cache weight (CW), and IPC
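To illustrate balancing thread count, cache weight (CW) and IPC together, the following is a greedy placement over cache domains. It is a simplified sketch under stated assumptions, not the WCA scheduler's algorithm:

* `Thread`, `CacheDomain` and `place` are hypothetical.
* The load tuple is compared lexicographically.
* The IPC values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    cache_weight: int  # 0, 1 or 2, as on the cache-sensitivity slide
    ipc: float


@dataclass
class CacheDomain:
    """Hypothetical model of a group of cores sharing one last-level cache."""
    name: str
    threads: list = field(default_factory=list)

    def load(self) -> tuple:
        # Compared lexicographically: thread count first, then total cache
        # weight, then total IPC (this ordering is an assumption).
        return (len(self.threads),
                sum(t.cache_weight for t in self.threads),
                sum(t.ipc for t in self.threads))


def place(threads, domains):
    """Greedy: most cache-sensitive threads first, each onto the least-loaded domain."""
    for t in sorted(threads, key=lambda t: (t.cache_weight, t.ipc), reverse=True):
        min(domains, key=CacheDomain.load).threads.append(t)
    return domains


workload = [Thread("art1", 2, 0.4), Thread("art2", 2, 0.4), Thread("gzip", 0, 1.5)]
domains = place(workload, [CacheDomain("llc0"), CacheDomain("llc1")])
for d in domains:
    print(d.name, [t.name for t in d.threads])
# llc0 ['art1', 'gzip']
# llc1 ['art2']
```

A balancer that only counts threads could put both art threads on one cache, since that split is also two threads plus one. Here the two cache-heavy threads land on different caches. A real balancer would also re-evaluate placement periodically as cache weights change.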

  11. Methodology • Implemented the system on OpenSolaris: transient-thread characterization, cache-sensitivity characterization, and the load-balancing algorithm • Tested on an Intel Xeon E5430-based machine • Workloads are 3-thread combinations of 12 SPEC 2K benchmarks, which present the toughest case for the scheduler • Results are compared against the default scheduler using average weighted Perf/Watt (captures both system-level power consumption and performance), the number of thread migrations in the system, and execution-time stability

  12. Overall Results • ~14% average improvement in Perf/Watt over the default scheduler

  13. Power Efficiency • Significant speedup at roughly the same power budget • Better opportunities for idle power savings

  14. Stability Analysis • 91% reduction in migration rate of threads • Stable and predictable schedules and run-times • 89% reduction in execution time std deviation
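The stability figures are relative reductions. The following worked example uses hypothetical run times, not data from the study. It computes the reduction in execution-time standard deviation. The same helper applies to migration rates.

```python
import statistics


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction, e.g. 0.89 for an 89% drop."""
    return (before - after) / before


# Hypothetical run times (seconds) of one benchmark over repeated runs.
default_runs = [100.0, 130.0, 95.0, 140.0]
wca_runs = [101.0, 103.0, 100.0, 102.0]

sd_reduction = relative_reduction(statistics.stdev(default_runs),
                                  statistics.stdev(wca_runs))
print(f"{sd_reduction:.0%} reduction in run-time std deviation")
```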

  15. Conclusions • Identify limitations of modern operating systems in extracting full power efficiency from modern CMP architectures • Highlight the thread characteristics that affect cache-sharing efficiency, and metrics to capture them • Identify ‘transient threads’ as an impediment to stable scheduling • Extend the scheduler to incorporate cache and transient-thread management into the load-balancing fabric • The prototype scheduler implementation improves Perf/Watt by up to 30%
