Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs Ian C. Smith
Overview • Quick description of the University of Liverpool Condor Pool • Power saving at Liverpool • A home-grown approach to dealing with power-saving PCs • Power management using Condor 7.4.X • Implementing Condor power management • Results • Future directions
University of Liverpool Condor Pool • Contains around 300 machines running the University’s Managed Windows Service (XP, soon Windows 7). • Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine. • Single combined submit host / central manager running on a Sun V445 SMP server. • Currently running Condor 7.0.2 on execute hosts (moving to 7.2.x soon). • Policy: during office hours, run jobs only after at least 5 minutes of inactivity and with a low load average; outside office hours, run jobs at any time. • Jobs are killed rather than suspended.
Power saving at Liverpool • We have around 2 000 centrally managed PCs across campus, which were left powered up overnight, at weekends and during vacations. • The original power-saving policy was to “power off” machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity. • The policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh), leading to an estimated saving of approx. £125 000 p.a. • Makes extensive use of the PowerMAN system from Data Synergy, comprising: • a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform • the Management Reporting Platform, a central server from which usage stats can be retrieved and viewed via a web browser
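The hours and energy figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope check: the function names are illustrative, and the per-PC power draw it derives is implied by the quoted ranges rather than stated in the slides.

```python
# Back-of-envelope check of the savings figures quoted on the slide.
# Function names are hypothetical; only the input numbers come from the slide.

def implied_power_watts(hours_saved, energy_mwh):
    """Average draw per idle PC implied by an hours/energy pair."""
    return energy_mwh * 1_000_000 / hours_saved  # MWh -> Wh, then Wh / h = W


def annual_energy_mwh(weekly_mwh, weeks_per_year=52):
    """Scale a weekly energy saving to a full year."""
    return weekly_mwh * weeks_per_year


low_watts = implied_power_watts(200_000, 20)
high_watts = implied_power_watts(250_000, 25)
annual_range = (annual_energy_mwh(20), annual_energy_mwh(25))
```

Both ends of the range imply the same average draw per idle PC, which suggests the energy figure was obtained from the hours figure using a single per-machine power estimate.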
A home-grown approach to power management • Two main problems to deal with: • how to ensure Condor jobs are not evicted by hibernating PCs • how to wake up dormant PCs to run Condor jobs on demand • PowerMAN service prevents job eviction: • PowerMAN can be given a list of “protected programs”; the machine remains active while any of them is running • include the condor_starter process as a protected program (it is only present while a Condor job is running) • Wake-on-LAN (“WoL”) is used to bring hibernating machines back to full power: • NICs must remain powered up during hibernation • NICs must be capable of waking machines on receipt of a “magic packet” • the network must be able to route “magic packets” (not a problem for us, but YMMV)
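For readers unfamiliar with the “magic packet”, here is a minimal sketch of how one is built and broadcast. The packet layout (six 0xFF bytes followed by sixteen repetitions of the target MAC address) is the standard Wake-on-LAN format; the function names, the broadcast address and the UDP port 9 are assumptions for illustration, not details taken from the Liverpool setup.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the MAC address repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether a broadcast like this reaches a machine on another subnet depends on the routers in between, which is the routing caveat noted on the slide.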
Adapting Condor for use with power-saving PCs • A cron job on the submit host periodically examines the state of the queue (condor_status -schedd) and the pool (condor_status) • if there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up • find the number of powered-up machines in each “teaching centre” (classroom) • estimate the number of hibernating machines in each teaching centre from the total number of machines in each • sort centres from highest number of available machines to lowest • wake up centres in turn until enough machines have been woken to meet the demand (or all centres have been woken up) • MAC addresses of machines are stored in files sorted by teaching centre (needed for Wake-on-LAN)
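The selection steps above can be sketched as follows. This is an illustration rather than the actual cron script: the function and parameter names are hypothetical, and “available machines” is interpreted here as the estimated number of hibernating machines in each centre.

```python
def centres_to_wake(idle_jobs, unclaimed, totals, powered_up):
    """Pick teaching centres to wake until the estimated shortfall is covered.

    totals:      {centre: total number of machines in the centre}
    powered_up:  {centre: number of machines currently seen in the pool}
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already, nothing to wake

    # Estimate hibernating machines per centre as total minus powered up.
    hibernating = {
        centre: max(total - powered_up.get(centre, 0), 0)
        for centre, total in totals.items()
    }

    chosen, woken = [], 0
    # Visit centres from most to fewest estimated hibernating machines.
    for centre in sorted(hibernating, key=hibernating.get, reverse=True):
        if woken >= shortfall:
            break
        if hibernating[centre] == 0:
            continue
        chosen.append(centre)
        woken += hibernating[centre]
    return chosen
```

Because whole centres are woken at a time, the number of machines woken can overshoot the demand, and the sort order means the largest centres are always tried first.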
Problems with the home-grown approach • Assumes that any job can run on any machine: • users cannot choose particular teaching centres or machines in their job Requirements • ideally, the pool needs to be homogeneous • errors in the Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again) • the cron job includes a “sanity check” for this • Can only estimate the number of hibernating machines in each centre • The same machines get woken up first
Power management in Condor 7.4.X • Condor daemons can now place an execute host in a low-power state according to a given policy • An execute host signals to the Condor central manager that it is about to enter a low-power state • Central manager records persistent offline ClassAds for hibernating machines • Negotiator can perform matchmaking with offline ClassAds • Matches are passed to condor_rooster • condor_rooster pipes information to condor_power, which wakes up machines using WoL
Implementing Condor power management • Still use PowerMAN rather than Condor to power down inactive PCs • Need a way of advertising available offline machines to the condor_collector • If we know which machines are currently active (A) and which machines make up the pool in total (P), then the offline machines form the subset O = P – A • A cron job periodically advertises the offline machines and updates the timestamps (ClockMin / ClockDay) • Finding P (the total set of machines which are out there) turns out to be a very difficult problem
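The relation O = P – A is a plain set difference. A minimal sketch, assuming hostnames are compared as exact strings (the function name and the example hostnames are hypothetical):

```python
def offline_machines(pool, active):
    """Return O = P - A: machines in the pool that are not currently active."""
    return sorted(set(pool) - set(active))


pool = ["pc01", "pc02", "pc03", "pc04"]   # P: every machine believed to exist
active = ["pc02", "pc04"]                 # A: machines currently in the pool
offline = offline_machines(pool, active)  # O: candidates for offline ClassAds
```

The result is only as good as P: a machine wrongly listed in P will be advertised as offline even though it can never be woken.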
How do we determine which machines are available to Condor? • Try waking them up! • Wake up all machines in each teaching centre once a week using WoL • After the wake-up call, wait a few minutes and test each machine in turn with: condor_status -direct <hostname> • Sanity check similar to UNIX ping • Record which machines respond and publish ClassAds for them
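The weekly test can be sketched as a loop over hostnames. The command is the one quoted on the slide; treating “exit status 0 with non-empty output” as a response is an assumption, as are the function names and the timeout.

```python
import subprocess


def probe_command(hostname):
    """Command line used to query one machine directly (as on the slide)."""
    return ["condor_status", "-direct", hostname]


def responds(hostname, timeout=30, run=subprocess.run):
    """Return True if the machine answers the direct query (assumed criterion)."""
    try:
        result = run(probe_command(hostname), capture_output=True,
                     text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # command missing, or the machine never answered
    return result.returncode == 0 and bool(result.stdout.strip())


def responding_machines(hostnames, check=responds):
    """Filter a list of hostnames down to those that pass the check."""
    return [host for host in hostnames if check(host)]
```

The `run` and `check` parameters exist only so the logic can be exercised without a Condor installation.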
Unforeseen problems • Not all woken-up machines begin to run jobs • number of wake-ups is limited by our “roll-your-own” version of condor_power • condor_rooster originally attempted to wake up all offline machines which matched job requirements • Included another limit in our condor_power script (number of wake-ups must be < number of idle jobs) • Condor 7.4.3 should fix this; 7.5.3 adds the ROOSTER_MAX_UNHIBERNATE configuration option • Wanted to wake up machines in random order so that the same machines are not used repeatedly • Found that condor_negotiator ignored Rank values • Used the condor_power script to implement this (“shuffles the deck”) • Should be fixed in 7.5.3 using the ROOSTER_UNHIBERNATE_RANK config option
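The two workarounds above (capping wake-ups and “shuffling the deck”) can be sketched together. This is not the Liverpool condor_power replacement itself; the function name is hypothetical, and the cap is taken literally from the slide as strictly fewer wake-ups than idle jobs.

```python
import random


def select_wakeups(candidates, idle_jobs, rng=None):
    """Choose which matched offline machines to wake.

    candidates: machines that condor_rooster asked to wake
    idle_jobs:  number of idle jobs currently in the queue
    """
    if rng is None:
        rng = random.Random()
    deck = list(candidates)
    rng.shuffle(deck)               # random order, so no machine is favoured
    limit = max(idle_jobs - 1, 0)   # wake-ups must be < number of idle jobs
    return deck[:limit]
```

Passing a seeded `random.Random` makes the choice reproducible for testing; in production the default unseeded generator gives a different order on each run.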
Unforeseen problems / cont’d • Condor continued to wake up machines after jobs had been removed (or had completed) • Use Unhibernate = CurrentTime - MachineLastMatchTime < 300, not Unhibernate =!= Undefined • Difficult to distinguish Unclaimed offline machines from online ones in condor_status: • Also difficult to distinguish in Condor View graphs • to see all offline machines: • $ condor_status -constraint Offline==True • to see all powered-up machines: • $ condor_status -constraint Offline=!=True
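The recency test in the Unhibernate expression can be illustrated in Python. This only mirrors the logic of the expression quoted on the slide; the function name is hypothetical, and treating a machine that has never been matched as “do not wake” is an assumption about how an undefined MachineLastMatchTime should be handled.

```python
def should_unhibernate(current_time, machine_last_match_time, window_seconds=300):
    """Wake a machine only if it was matched to a job within the last window."""
    if machine_last_match_time is None:
        return False  # never matched (assumed equivalent to "not True")
    return current_time - machine_last_match_time < window_seconds
```

With this form, a match made ten minutes ago no longer triggers a wake-up, so machines stop being woken once the jobs that matched them have left the queue.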
Future Directions • Condor power management will allow us to expand the pool to include even low-spec machines • If machines are not needed or are unsuitable, they need not be woken up • Rank can be used so that newer (more energy-efficient) machines are used first • We would like a more accurate way of determining which machines are available. One possible method: • Record the amount of time since each machine last appeared in the pool and/or ran a job • Confidence in waking a PC can be described by a monotonically decreasing function of this time • May still need to wake machines for testing occasionally • Encourage users to incorporate their own checkpointing code to reduce “badput” and energy wastage (see Liverpool Condor website for details).
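One possible shape for the confidence function is exponential decay. The slide only requires the function to be monotonically decreasing; the exponential form, the function name and the one-week half-life below are all assumptions chosen for illustration.

```python
def wake_confidence(hours_since_seen, half_life_hours=168.0):
    """Confidence (0..1) that a machine will wake, given hours since last seen.

    Exponential decay: confidence halves every `half_life_hours`
    (168 hours, i.e. one week, is an arbitrary illustrative value).
    """
    if hours_since_seen < 0:
        raise ValueError("hours_since_seen must be non-negative")
    return 0.5 ** (hours_since_seen / half_life_hours)
```

A machine seen an hour ago scores close to 1, while one unseen for a month scores below 0.1 and could be scheduled for one of the occasional test wake-ups instead.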
Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk