
Unobtrusive power proportionality for Torque: Design and Implementation



  1. Unobtrusive power proportionality for Torque: Design and Implementation Arka Bhattacharya Acknowledgements: Jeff Anderson Lee Andrew Krioukov Albert Goto

  2. Introduction • What is power proportionality? • The performance-power ratio at every performance level is equivalent to that at the maximum performance level • Servers consume a high percentage of their max power even when idle • Hence, approximating power proportionality means switching off idle servers (a worked example follows)
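
A hedged back-of-the-envelope illustration of the last two bullets; the wattages below are assumptions for a typical server, not measurements from the PSI cluster:

```python
# Why idle servers break power proportionality (illustrative numbers only).
peak_w = 250.0   # assumed draw at full load
idle_w = 150.0   # assumed draw at 0% load

idle_fraction = idle_w / peak_w
# A perfectly power-proportional server would draw ~0 W when idle, so the
# closest practical approximation is to power idle servers off entirely.
print(f"An idle server still draws {idle_fraction:.0%} of its peak power")
```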

  3. NapSAC (Krioukov et al., CPS 2011) • A power-proportional web cluster built around a computational “spinning reserve” • [Architecture figure: requests enter through an IPS and flow to load distribution, scheduling, and power management; plot of Wikipedia request rate vs. power]

  4. The need for power proportionality of IT equipment in Soda Hall • Soda Hall power: 450-500 kW • Cluster room power: 120-130 kW (~25%) • Total HVAC for cluster rooms: 75-85 kW (~15%)

  5. PSI Cluster • PSI Cluster: 20-25 kW (~5% of Soda) • Cluster room power: 120-130 kW (~25% of Soda) • Total HVAC for the PSI Cluster room: 20-25 kW (~5% of Soda) • Total HVAC for cluster rooms: 75-85 kW (~15% of Soda)

  6. The PSI Cluster • The PSI Cluster consumes ~20-25 kW of power irrespective of workload, and contains about 110 servers • Recently, server faults have reduced the size of the cluster to 78 servers (the faulty servers are mostly powered on all the time) • Used mainly by NLP, Vision, AI and ML graduate students • It is an HPC cluster running Torque

  7. PSI Cluster

  8. Possible energy savings • Can save ~50% of the energy

  9. Current state:

  10. Result: 10 kW • We save 49% of the energy

  11. What is Torque? • Terascale Open-source Resource and QUEue manager • Built upon the original Portable Batch System (PBS) project • Resource manager: manages availability of, and requests for, compute node resources • Widely used by academic institutions throughout the world for batch processing

  12. Maui Scheduler • Job scheduler • Implements and manages: • Scheduling policies • Dynamic priorities • Reservations • Fairshare

  13. Sample Job Flow • A script is submitted to TORQUE specifying required resources • Maui periodically retrieves from TORQUE the list of potential jobs, available node resources, etc. • When resources become available, Maui tells TORQUE to execute certain jobs on particular nodes • TORQUE dispatches jobs to the PBS MOMs (Machine Oriented Mini-servers) running on the compute nodes; pbs_mom is the process that starts the job script • Job status changes are reported back to Maui and the information is updated (a sketch of the submission step follows)
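
As a concrete (hypothetical) illustration of the first step, the sketch below submits a job script to TORQUE from Python and polls its state; the resource requests and script body are made up, and only the standard qsub and qstat commands are assumed:

```python
# Sketch: submit a job script to TORQUE via qsub and check its state.
import subprocess

JOB_SCRIPT = """#!/bin/bash
#PBS -N example_job
#PBS -l nodes=1:ppn=2,walltime=01:00:00
cd $PBS_O_WORKDIR
./run_experiment.sh
"""

# qsub reads the job script from stdin and prints the new job's id.
job_id = subprocess.run(
    ["qsub"], input=JOB_SCRIPT, capture_output=True, text=True, check=True
).stdout.strip()

# qstat -f reports the job's full status (state Q = queued, R = running).
print(subprocess.run(["qstat", "-f", job_id],
                     capture_output=True, text=True).stdout)
```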

  14. Why are we building power-proportional Torque? • To shed load in Soda Hall • To investigate why production clusters don’t implement power proportionality • To integrate power proportionality into software used in many clusters throughout the world

  15. Desired properties of an unobtrusive power proportionality feature • Avoid modifications to the Torque source code • Only use existing Torque interfaces • Make the feature completely transparent to end users • Maintain system responsiveness • Centralized • No dependence on resource manager/scheduler version

  16. Analysis of the PSI cluster • Logs: active and idle queue log; job placement statistics • Logs exist for 68 days in Feb-April 2011 • Logs were recorded once every minute • Logs contain information on ~169k jobs, ~40 users

  17. Types of servers in the PSI cluster • [Table of server classes omitted] • Each server class is further divided according to various features • Not all servers listed above are switched on all the time

  18. CDF of server idle duration [figure] • TAKEAWAY 1: Most idle periods are small

  19. Contribution of server idle periods to the total [figure] • TAKEAWAY 2: To save energy, tackle the large idle periods

  20. CDF of job durations [figure: jobs annotated as interactive vs. batch around the 50-500 s range] • TAKEAWAY 3: Most jobs are long, hence a slight increase in queuing time won’t hurt

  21. Summary of takeaways • Small server idle times, though numerous, contribute very little to the total server idle time • The power proportionality algorithm need not be aggressive in switching off servers • Waking a server takes ~5 min; compared to the running time of a job, this is negligible

  22. Loiter Time vs Energy Savings

  23. Design of unobtrusive Power Proportionality for Torque

  24. Using Torque interfaces • What useful state information does Torque/Maui maintain? • The state (active/offline/down) of each server, and the jobs running on it (obtained through the “pbsnodes” command) • A list of running and queued jobs (obtained through the “qstat” command) • Job constraints and scheduling details of each job (obtained through the “checkjob” command) • A parsing sketch follows
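
A minimal sketch (not the authors' script) of consuming the first of these interfaces: shell out to pbsnodes and scrape each node's state. The record layout assumed here is the standard pbsnodes output, with the hostname unindented and attributes indented below it:

```python
# Sketch: read per-node state from TORQUE through the pbsnodes interface.
import subprocess

def node_states():
    """Return {hostname: state} for every compute node TORQUE knows about."""
    out = subprocess.run(["pbsnodes", "-a"],
                         capture_output=True, text=True, check=True).stdout
    states, host = {}, None
    for line in out.splitlines():
        if line and not line.startswith(" "):   # unindented: a new node record
            host = line.strip()
        elif "state =" in line and host:        # indented, e.g. "  state = free"
            states[host] = line.split("=", 1)[1].strip()
    return states
```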

  25. First implementation: a state machine for each server • States: Active, Offline, Waking, Down, Problematic • Active → Offline: server_idle_time > LOITER_TIME • Offline → Active: an idle job can be scheduled on the server • Offline → Down: server_offline_time > OFFLINE_LOITER_TIME and no job has been scheduled on the server • Down → Waking: an idle job exists • Waking → Active: the server has woken up • Waking → Problematic: the server is not waking
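
Encoded as code, the machine above might look like the sketch below. The state names and the two loiter thresholds come from these slides, while the function shape and observation flags are assumptions about how one polling cycle could be written:

```python
# Sketch of one polling step of the per-server state machine.
from enum import Enum, auto

class State(Enum):
    ACTIVE = auto()       # powered on and available to TORQUE
    OFFLINE = auto()      # marked offline in TORQUE, still powered on
    DOWN = auto()         # powered off
    WAKING = auto()       # wake signal sent, waiting for boot
    PROBLEMATIC = auto()  # failed to wake; needs manual attention

LOITER_TIME = 7 * 60          # idle seconds before a server is offlined
OFFLINE_LOITER_TIME = 3 * 60  # offline seconds before it is powered down

def next_state(state, idle_s, offline_s, job_scheduled, idle_job_exists,
               woke_up, wake_failed):
    """Map the current state plus this cycle's observations to the next state."""
    if state is State.ACTIVE and idle_s > LOITER_TIME:
        return State.OFFLINE                 # stop scheduling onto the server
    if state is State.OFFLINE:
        if job_scheduled:                    # TORQUE placed a job on it anyway
            return State.ACTIVE
        if offline_s > OFFLINE_LOITER_TIME:
            return State.DOWN                # power it off (e.g. halt via ssh)
    if state is State.DOWN and idle_job_exists:
        return State.WAKING                  # send a wake signal
    if state is State.WAKING:
        if woke_up:
            return State.ACTIVE
        if wake_failed:
            return State.PROBLEMATIC
    return state
```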

  26. Does not work! • Each job is submitted to a specific queue • Must ensure the right server wakes up

  27. Next implementation: state machine for each server • Same transitions as before, with one change • Down → Waking: an idle job exists and the server belongs to the desired queue

  28. Still did not work! • Each job has specific constraints which Torque takes into account while scheduling • Job constraints can be obtained through the “checkjob” command

  29. Next implementation: state machine for each server • Same transitions as before, with the wake condition tightened further • Down → Waking: an idle job exists, the server belongs to the desired queue, and the server satisfies the job constraints

  30. Scheduling problem: job submission characteristics • Users tend to submit multiple jobs at a time (often >20) • Torque has its own fairness mechanisms, which won’t schedule all the jobs even if there are free servers • To accurately predict which jobs Torque will schedule, without switching on extra servers, we would have to emulate the Torque scheduling logic! • That would tie the power proportionality feature to a specific Torque policy • Solution: switch on only a few servers at a time to check whether Torque schedules the idle jobs (see the sketch below)
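
A hedged sketch of that solution; the satisfies() helper, standing in for the queue and constraint checks from the previous slides, is hypothetical:

```python
# Sketch: wake at most max_wake powered-off candidates per cycle and let
# TORQUE itself reveal, on the next poll, whether it schedules onto them.
def servers_to_wake(idle_jobs, down_servers, max_wake=5):
    """Pick at most max_wake powered-off servers that could fit an idle job."""
    chosen = []
    for server in down_servers:
        if any(server.satisfies(job) for job in idle_jobs):
            chosen.append(server)
        if len(chosen) == max_wake:          # cap simultaneous wake-ups
            break
    return chosen
```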

  31. Next implementation: state machine for each server • As before, plus a cap on simultaneous wake-ups • Down → Waking: an idle job exists, the server belongs to the desired queue, the server satisfies the job constraints, and only a few servers are switched on at a time

  32. Maintain responsiveness/headroom • The debug cycle usually involves users running short jobs and validating the output • If no server satisfying the job constraints is switched on, a user might have to wait a long time just to verify that a job runs • If a job throws errors, the user might have to wait an entire server power cycle to run the modified job • Solution: group servers according to features, and in each group keep a limited number of servers as spinning reserve at all times (see the sketch below)
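
An illustrative encoding of the headroom rule; the attribute names (feature_group, is_active, is_idle, is_down) are assumptions:

```python
# Sketch: keep HEADROOM_PER_GROUP idle, powered-on servers in every
# feature group so short debug jobs can start without a full power cycle.
from collections import defaultdict

HEADROOM_PER_GROUP = 3

def headroom_wakeups(servers):
    """Return powered-off servers to wake so each group keeps its reserve."""
    groups = defaultdict(list)
    for s in servers:
        groups[s.feature_group].append(s)
    to_wake = []
    for members in groups.values():
        idle_awake = sum(1 for s in members if s.is_active and s.is_idle)
        deficit = HEADROOM_PER_GROUP - idle_awake
        if deficit > 0:
            asleep = [s for s in members if s.is_down]
            to_wake.extend(asleep[:deficit])
    return to_wake
```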

  33. Final implementation: state machine for each server • Down → Waking: an idle job exists, the server belongs to the desired queue, and the server satisfies the job constraints, switching on only MAX_SERVERS at a time; a server is also woken to maintain headroom • Active → Offline: server_idle_time > LOITER_TIME • Offline → Active: an idle job can be scheduled on the server • Offline → Down: server_offline_time > OFFLINE_LOITER_TIME, no job has been scheduled on the server, and switching the server off does not leave the group without headroom • Waking → Active: the server has woken up • Waking → Problematic: the server is not waking

  34. But the servers don’t wake up!!! • Each server has to bootstrap a list of services, such as network file systems, work directories, the portmapper, etc. • Often these bootstraps fail, and the servers are left in an undesired state (e.g. with no home directories mounted to write user output to!) • Solution: have a health-check script on each server; check for proper configuration of the needed services, and make the server available for scheduling only if the health check succeeds (an example follows)
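
A hedged example of such a health check; the PSI cluster's actual list of checked services is not given, so the mounts and daemons below (NFS home, scratch directory, portmapper) are guesses at the kind of thing tested:

```python
#!/usr/bin/env python
# Sketch: node health check; a non-zero exit keeps the server out of
# scheduling until all of its bootstrapped services look sane.
import os
import subprocess
import sys

def healthy():
    checks = [
        os.path.ismount("/home"),             # NFS home directories mounted
        os.path.isdir("/scratch"),            # work directory present
        subprocess.run(["pgrep", "rpcbind"],  # portmapper running
                       capture_output=True).returncode == 0,
    ]
    return all(checks)

if __name__ == "__main__":
    sys.exit(0 if healthy() else 1)
```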

  35. Power-proportional Torque at a glance • Completely transparent to users • Did not modify the Torque source code • A ~1000-line Python script which runs only on the Torque master server • Halts servers through ssh • Wakes servers through wake-on-LAN (see the sketch below) • Separates scheduling policy from mechanism: it allows Torque to dictate the scheduling policy
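
For reference, waking a server over wake-on-LAN needs nothing beyond a standard UDP magic packet: 6 bytes of 0xFF followed by the target MAC repeated 16 times. A minimal sketch, with a placeholder MAC address:

```python
# Sketch: broadcast a wake-on-LAN magic packet to power a server back on.
import socket

def wake_on_lan(mac="00:11:22:33:44:55"):   # placeholder MAC address
    payload = bytes.fromhex(mac.replace(":", ""))
    packet = b"\xff" * 6 + payload * 16     # standard magic-packet format
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, ("255.255.255.255", 9))
```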

  36. Deployment • Deployed on 57 of the 78 active nodes in the PSI cluster; total number of cores = 150 • Servers were classified into 5 groups based on features • HEADROOM_PER_GROUP = 3 • MAX_SERVERS_TO_WAKE_AT_A_TIME = 5 • LOITER_TIME = 7 minutes • OFFLINE_LOITER_TIME = 3 minutes
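
The same parameters, written as the kind of configuration block the script could carry (variable names are assumptions; the values are the ones reported above):

```python
# Deployment parameters as configuration constants.
SERVER_GROUPS = 5                   # feature-based classes
HEADROOM_PER_GROUP = 3              # idle servers kept awake per group
MAX_SERVERS_TO_WAKE_AT_A_TIME = 5
LOITER_TIME = 7 * 60                # seconds idle before offlining
OFFLINE_LOITER_TIME = 3 * 60        # seconds offline before halting
```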

  37. Average Statistics • Deployed since last week • ~800 jobs analyzed • Avg utilization of cluster = 40% • % Energy saved = 49%

  38. Results:

  39. HVAC power savings

  40. Number of servers powered on at a time: Headroom

  41. Expected vs Actual savings

  42. Submission vs Execution profile

  43. CDF of job queue time as a percentage of job length

  44. Conclusions – what we achieved • Power proportionality is easy to achieve for Torque without changing any source code at all • The script could be run on any standard Torque cluster to save energy • Switching servers back on in a consistent state is the single biggest roadblock to deploying the script • We saved a maximum of ~17 kW of power in Soda Hall (~3%), and this was with only half the PSI cluster!
