Experiences with running MATLAB jobs on a power-saving Condor Pool. Ian C. Smith. University of Liverpool Condor Pool. Contains around 300 machines running the University’s Managed Windows (XP) Service.
Experiences with running MATLAB jobs on a power-saving Condor Pool Ian C. Smith
University of Liverpool Condor Pool • Contains around 300 machines running the University’s Managed Windows (XP) Service. • Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots per machine. • Software updates via a weekly re-imaging process. • Single combined submit host / central manager running on Sun V440 SMP server. • Restricted access to submit host for registered Condor users. • Currently running Condor 7.0.2 (moving to 7.2.x soon). • Policy is to run jobs only after at least 10 minutes of inactivity and with a low load average during office hours, and at any time outside office hours.
MATLAB advantages • Originally developed for linear algebra algorithm development but now contains many built-in functions geared to different disciplines, divided into toolboxes. • Intuitive interactive environment allows rapid code development. • Simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing). • Allows users to create their own functions stored as M-files. • “Standalone” applications can be built from M-files: • can run on platforms without MATLAB installed • do not need a licence to be able to run • can include all toolbox functions • APIs available for FORTRAN and C codes (“MEX files”).
MATLAB disadvantages • Even standalone applications can run slower than equivalent C or FORTRAN implementations. • Standalone applications aren’t quite what they may seem: • more than just an .exe – several files need to be packaged and deployed • need access to MATLAB run-time libraries usually via MATLAB Component Runtime (150 MB self-extracting .exe) • luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive) • Run-time errors can be difficult to trace when MATLAB jobs are run under Condor: • need to run under Condor on local PC • configure with USE_VISIBLE_DESKTOP=True to see pop-up messages • Jobs submitted in a UNIX environment but code developed under Windows.
Minor MATLAB irritations • Output files occasionally go missing: • specify all required files using transfer_output_files • identify problem jobs with condor_q -held • resubmit with condor_release -all • Jobs sometimes run “forever”: • use condor_vacate to move job to another machine • less of a problem during term time as jobs usually get evicted by logins • Difficult to reproduce these problems: • happen quite rarely (< 1 in ~1000 jobs) • many jobs based on stochastic methods
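As an illustration of listing every required output file explicitly, the sketch below generates a Condor submit description for a cluster of standalone MATLAB jobs. It is not the submit file used at Liverpool: the executable name, the input and output file names and the job count are all hypothetical, and only standard submit commands (universe, executable, arguments, should_transfer_files, when_to_transfer_output, transfer_input_files, transfer_output_files, queue) are used.

```python
def make_submit_description(executable, input_files, output_files, n_jobs):
    """Build the text of a Condor submit description file.

    All file names passed in are illustrative; $(Process) is the Condor
    macro that expands to the job's index within the cluster.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "arguments = $(Process)",
        # Ask Condor to move files to and from the execute host.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"transfer_input_files = {', '.join(input_files)}",
        # Name every result file explicitly rather than relying on defaults.
        f"transfer_output_files = {', '.join(output_files)}",
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = jobs.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical standalone MATLAB application and data files.
    print(make_submit_description(
        executable="model.exe",
        input_files=["params.mat"],
        output_files=["result_$(Process).mat"],
        n_jobs=100,
    ))
```

Jobs whose listed output files cannot be found are the ones that would then show up in the held state, which is what makes the condor_q and condor_release steps above practical.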
MATLAB Research Applications • Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science). • Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science). • Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics). • Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences). • Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).
Power-saving at Liverpool • Have around 2 000 centrally managed PCs across campus which were previously left powered up overnight, at weekends and during vacations. • Original power-saving policy was to power off machines after 30 minutes of inactivity; they are now hibernated after 10 minutes of inactivity. • Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a. • Makes extensive use of PowerMAN system from Data Synergy comprising: • service which forces machines into a low-power state and reports machine activity to Management Reporting Platform • Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
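The savings figures quoted above can be cross-checked with a little arithmetic. The sketch below derives the average power draw per idle PC and the electricity price that the slide's numbers imply. Neither value is stated in the original; both are back-calculated, and the 52-week year is an assumption.

```python
# Figures taken from the slide.
HOURS_SAVED_PER_WEEK = (200_000, 250_000)
ENERGY_SAVED_MWH_PER_WEEK = (20, 25)
ANNUAL_SAVING_GBP = 125_000

# Assumption: savings accrue in every week of the year.
WEEKS_PER_YEAR = 52


def implied_power_watts(hours, mwh):
    """Average power per idle machine implied by hours and energy saved."""
    return mwh * 1_000_000 / hours  # MWh -> Wh, then divide by hours


def implied_price_pence_per_kwh(annual_saving_gbp, mwh_per_week):
    """Electricity price implied by the annual saving and weekly energy."""
    kwh_per_year = mwh_per_week * 1000 * WEEKS_PER_YEAR
    return annual_saving_gbp * 100 / kwh_per_year


if __name__ == "__main__":
    for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH_PER_WEEK):
        watts = implied_power_watts(hours, mwh)
        price = implied_price_pence_per_kwh(ANNUAL_SAVING_GBP, mwh)
        print(f"{hours} h/week, {mwh} MWh/week: "
              f"{watts:.0f} W per PC, {price:.1f} p/kWh")
```

Both ends of the range imply about 100 W per idle PC, and an electricity price of roughly 9.6 to 12 pence per kWh.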
Adapting Condor for use with power-saving PCs • Two main problems: • how to ensure Condor jobs are not evicted by hibernating/powered-off PCs • how to wake up dormant PCs to run Condor jobs on-demand • Originally used Microsoft system service to power-down PCs after 30 min inactivity: • runs .bat file which checks if a user is logged in and shuts machine down if not • doesn’t detect owner of Condor job as a logged-in user • need to check for presence of condor_exe.bat • PowerMAN service now prevents job eviction: • can provide PowerMAN with a list of “protected programs” • ensures that system remains active if a protected program is running • include condor_starter process as a protected program (only present while a Condor job is running).
Adapting Condor for use with power-saving PCs • Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power: • NICs must remain powered up during hibernation/power-off • NICs must be capable of waking machines on receipt of a “magic packet” • network must be able to route “magic packets” • cron job runs on the submit host and examines the state of the queue (condor_q) and pool (condor_status): • if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines • find number of powered-up machines in each “teaching centre” (classroom) • estimate the number of hibernating machines in each teaching centre from total number of machines in each • sort centres from highest number of available machines to lowest • wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up) • MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
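The wake-up steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual cron script. It assumes that "available machines" means the estimated number of hibernating machines in a centre, and it treats one idle job as needing one machine, ignoring the two job slots per machine. The function names, the UDP port and the broadcast address are hypothetical choices. The magic packet layout itself (six 0xFF bytes followed by the MAC address repeated 16 times) is the standard Wake-on-LAN format.

```python
import socket


def build_magic_packet(mac):
    """Return a Wake-on-LAN magic packet for a MAC such as 00:11:22:33:44:55."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(digits) * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


def centres_to_wake(idle_jobs, unclaimed, centres):
    """Pick teaching centres to wake.

    centres maps a centre name to (total_machines, powered_up_machines).
    Returns centre names in the order they should be woken.
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already
    # Estimate hibernating machines as total minus those seen powered up.
    hibernating = {name: max(total - up, 0)
                   for name, (total, up) in centres.items()}
    chosen = []
    # Wake whole centres, largest estimated count first, until demand is met.
    for name in sorted(hibernating, key=hibernating.get, reverse=True):
        if shortfall <= 0:
            break
        if hibernating[name] == 0:
            continue
        chosen.append(name)
        shortfall -= hibernating[name]
    return chosen
```

Because whole centres are woken at once and the hibernating counts are only estimates, a scheme like this can wake more machines than are strictly needed.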
Automatic wake up issues • Assumes that any job can run on any machine: • users cannot choose particular teaching centres or machines in their job Requirements • ideally, pool needs to be homogeneous • errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate) • cron job now includes a “sanity check” for this • Large clusters of jobs can cause the Condor scheduler to become overloaded: • condor_q times out so cron job cannot determine queue state • only a transient problem: load eventually drops off and condor_q responds again • Can only estimate number of hibernating machines in each centre • May wake up more machines than needed
Automatic wake up in action – Condor pool machine statistics
Recent and Future Developments • Recently moved to a policy of hibernating machines after 10 minutes of inactivity • submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation • move execute hosts from Owner to Unclaimed state after just 5 minutes idle • update activity timer every 1 minute (default is 5 minutes) • increase frequency of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60 • around 25% of machines still hibernate after first wakeup • see a ramp up in machines running Condor jobs over about an hour • little impact on Condor users • energy wastage offset by savings with user logouts
Recent and Future Developments • Migrating to Condor 7.2 shortly • Has some interesting power-management features • Automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool • Can retain records of ClassAds of machines in low-power state • could be useful in matchmaking jobs to powered-down machines • matchmaking logic already in Condor • nice if Condor could use this to provide a list of machines to wake up on demand • ... and wake them up with condor_wakeup? • would like to ensure that powered-down machines are still out there (not broken, permanently turned off, not listening etc.) • also useful to see powered-off machines represented in condor_status output • Couple of extra “wishes” • allow jobs to claim all slots on a machine (useful if they have large memory requirements) • provide a “logged-in user” machine ClassAd attribute
Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk