Harvesting unused clock cycles with Condor. Ian C. Smith*. *Advanced Research Computing The University of Liverpool. Overview. what is Condor ? High Performance versus High Throughput Computing Condor fundamentals setting up and running a Condor Pool
Harvesting unused clock cycles with Condor Ian C. Smith* *Advanced Research Computing The University of Liverpool
Overview • what is Condor ? • High Performance versus High Throughput Computing • Condor fundamentals • setting up and running a Condor Pool • The University of Liverpool Condor Pool • example applications
What is Condor ? • a specialized system for delivering High Throughput Computing • a harvester of unused computing resources • developed by Computer Science Dept at University of Wisconsin in late '80s • free and (now) open source software • widely used in academia and increasingly in industry • available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS
HPC vs HTC (1) • High Performance Computing (HPC) • delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important) • can also provide lots of memory, large amounts of fast (parallel) storage • fairly exotic hardware, may need plenty of TLC • large capital outlay on hardware • need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources) • users run relatively small numbers of parallel jobs • essential for certain time-critical applications
HPC vs HTC (2) • High Throughput Computing (HTC) • allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important) • users more concerned with running large numbers of jobs over a long time span than a few short burst computations • makes use of existing commodity hardware (e.g. desktop PCs) • small capital outlay on hardware possible • limited memory and storage available generally • mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)
Types of Condor application • large numbers of independent calculations typically (“pleasantly parallel”) • data parallel applications – split large datasets into smaller parts and analyse independently • biological sequence analysis • processing of census data • optimisation problems • microprocessor design and testing • applications based on Monte Carlo methods • radiotherapy treatment analysis • epidemiological studies
A “typical” Condor pool [Diagram, built up over five slides: a central manager, a submit host, a submit/execute host and two groups of execute hosts. Successive slides label the traffic between them in turn: ClassAds, Match Info, Jobs, Output.]
ClassAds and Matchmaking • ClassAds are a fundamental part of Condor • similar to classified advertisements in a paper • “Job Ads” represent jobs to Condor (similar to “wanted” ads) • “Machine Ads” represent compute resources in a Condor Pool (similar to “for sale” ads) • Condor central manager matches Machine Ads to Job Ads and hence machines to jobs • Job Ads are created using submit description files
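The matchmaking idea above can be sketched in a few lines of Python. This is only a toy illustration: the ads are plain dictionaries, the machine names and attribute values are made up, and the real ClassAd language and Condor's negotiation cycle are considerably richer (for instance, machines can also place requirements on jobs).

```python
# Toy matchmaker: ads are plain dicts and requirements are predicates.
# Illustration only; this is not Condor's ClassAd language or negotiator.

# "For sale" ads: hypothetical machines advertising their attributes.
machine_ads = [
    {"Name": "pc01", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 2048},
    {"Name": "pc02", "OpSys": "LINUX", "Arch": "X86_64", "Memory": 8192},
    {"Name": "pc03", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 1024},
]

# "Wanted" ad: a job stating what it needs and what it prefers.
job_ad = {
    "Cmd": "example.exe",
    "Requirements": lambda m: m["OpSys"] == "WINNT51" and m["Arch"] == "INTEL",
    "Rank": lambda m: m["Memory"],  # higher is better
}


def match(job, machines):
    """Return the machines that satisfy the job, best-ranked first."""
    candidates = [m for m in machines if job["Requirements"](m)]
    return sorted(candidates, key=job["Rank"], reverse=True)


matches = match(job_ad, machine_ads)
print([m["Name"] for m in matches])
```

Here the Linux machine is filtered out by the requirements, and the two Windows XP machines are ordered by the rank expression.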
Simple submit description file

    # simple submit description file
    # (anything following a # is a comment and is ignored by Condor)
    # this would be used for Windows XP based execute hosts
    universe = vanilla
    executable = example.exe          # what to run
    output = stdout.out$(PROCESS)     # job's standard output
    log = mylog.log$(PROCESS)         # log job's activities
    transfer_input_files = common.txt, myinput$(PROCESS).txt   # input files needed
    requirements = ( Arch=="Intel" ) && ( OpSys=="WINNT51" )   # what machines to run on
    queue 2                           # number of jobs to queue
Requirements and Rank • Requirements expression determines where (and when) a job will run • Rank is used to express a preference among the machines that satisfy the requirements, e.g.

    Requirements = ( OpSys == "WINNT51" ) && \      # Windows XP OS wanted
                   ( Arch == "Intel" ) && \         # Intel/compatible processor
                   ( Memory >= 2000 ) && \          # want at least 2 GB memory and
                   ( Disk >= 33554432 ) && \        # at least 32 GB of free disk
                   ( HAS_MATLAB == TRUE ) && \      # must have MATLAB installed
                   ( ( ClockMin > 1020 ) || \       # only run jobs after 5 pm OR ...
                     ( ClockDay == 6 ) || ( ClockDay == 0 ) )   # at weekends
    Rank = Kflops   # run on machines with best floating point performance first
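The numeric constants in the expression above are easier to trust once they are worked through. The sketch below assumes that ClockMin is minutes since midnight, that ClockDay counts from 0 for Sunday, that Disk is expressed in KB, and that the weekend clause is meant to test ClockDay for Saturday (6) or Sunday (0). The helper names are hypothetical.

```python
from datetime import datetime

# 5 pm expressed as minutes since midnight, and 32 GB expressed in KB.
FIVE_PM_CLOCKMIN = 17 * 60            # 1020
DISK_32GB_IN_KB = 32 * 1024 * 1024    # 33554432


def clock_attrs(dt):
    """Map a datetime to (ClockMin, ClockDay), assuming 0 = Sunday."""
    clock_min = dt.hour * 60 + dt.minute
    clock_day = (dt.weekday() + 1) % 7  # Python uses 0 = Monday
    return clock_min, clock_day


def time_window_ok(dt):
    """True after 5 pm on any day, or at any time at the weekend."""
    clock_min, clock_day = clock_attrs(dt)
    return clock_min > FIVE_PM_CLOCKMIN or clock_day in (0, 6)


print(FIVE_PM_CLOCKMIN, DISK_32GB_IN_KB)
print(time_window_ok(datetime(2010, 3, 1, 10, 0)))   # a Monday morning
print(time_window_ok(datetime(2010, 3, 6, 10, 0)))   # a Saturday morning
```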
Job submission and monitoring

    [einstein@submit ~]$ condor_submit example.sub
    Submitting job(s).
    2 job(s) submitted to cluster 100.
    [einstein@submit ~]$ condor_q
    -- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
     ID      OWNER        SUBMITTED     RUN_TIME    ST PRI SIZE CMD
       1.0   sagan        7/22 14:19  172+21:28:36  R  0   22.0 checkprogress.cron
       2.0   heisenberg   1/13 13:59    0+00:00:00  I  0    0.0 env
       3.0   hawking      1/15 19:18    0+04:29:33  R  0    0.0 script.sh
       4.0   hawking      1/15 19:33    0+00:00:00  R  0    0.0 script.sh
       5.0   hawking      1/15 19:33    0+00:00:00  H  0    0.0 script.sh
       6.0   hawking      1/15 19:34    0+00:00:00  R  0    0.0 script.sh
     ...
      96.0   bohr         4/5  13:46    0+00:00:00  I  0    0.0 c2b_dops.sh
      97.0   bohr         4/5  13:46    0+00:00:00  I  0    0.0 c2b_dops.sh
      98.0   bohr         4/5  13:52    0+00:00:00  I  0    0.0 c2b_dopc.sh
      99.0   bohr         4/5  13:52    0+00:00:00  I  0    0.0 c2b_dopc.sh
     100.0   einstein     4/5  13:55    0+00:00:00  I  0    0.0 cosmos
    557 jobs; 402 idle, 145 running, 1 held
    [einstein@submit ~]$
Condor policies • Condor supports a wide range of policies for when to start jobs e.g. • run jobs only outside office hours • run jobs only if load average on host is small and there has been no recent activity • run jobs at any time on one core (at low priority) • run jobs only submitted by certain users • also a wide choice of what to do when a job is about to be interrupted e.g. • suspend the job for a limited time then let it resume • checkpoint the job and migrate it to another machine • kill off the job immediately
UNIX or Windows execute hosts ? (1) • UNIX • Condor’s natural environment • not widely installed on desktop machines (but depends on institution...) • supports the Condor “standard universe” containing many useful features • checkpointing allows jobs to be migrated from one machine to another without loss of useful work • Remote Procedure Calls give transparent access to files on submit host • streaming of standard output (stdout) from jobs to submit host • network filesystems work well, making installation and configuration much easier • leverages large amount of scientific and engineering codes which have been developed under UNIX
UNIX or Windows execute hosts ? (2) • Windows • world’s most widely installed OS – rich source of execute hosts • many commercial 3rd party applications run on Windows • using shared (network) filesystems can be difficult under Condor • only supports the “vanilla” Condor universe • no checkpointing – evicted jobs may waste a lot of cycles • all input and output files need to be transferred to/from execute host • output streaming not supported • may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier) • Windows support from the U-W Condor Team tends to lag behind UNIX
Setting up a Condor pool • best to start off small and build up pool slowly • need to understand Condor fundamentals: • role of Condor processes and how they interact • life-cycle of jobs • ClassAds and Matchmaking • avoid firewalls if possible (may be easier said than done ...) • talk to central IT services (particularly network and PC teams) • submit hosts may need to be fairly high spec if large numbers of jobs are to be run - ideally want • multi-core/processor machine (quad core at least) • plenty of memory (say 8 GB or more) • large fast access filestore (e.g. 1 TB RAID)
Where to go for help • Read The Fine Manual ! • log files contain a lot of useful information • take a look at the presentations, tutorials and “how-to recipes” on the Condor website: (www.cs.wisc.edu/condor) • search the condor-users mail list archive: (lists.cs.wisc.edu/archive/condor-users) • subscribe to the condor-users mail list • join the Campus Grids SIG: (wikis.nesc.ac.uk/escinet/Campus_Grids) • commercial support is also available (e.g. Cycle Computing)
University of Liverpool Condor Pool • contains around 400 machines running the University’s Managed Windows Service (currently XP but moving to Windows 7 soon) • most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine • single submission point for Condor jobs provided by Sun Solaris V445 SMP server • policy during office hours is to run jobs only after at least 5 minutes of inactivity and with a low load average; outside office hours jobs can run at any time • job will be killed off if running when a user logs in to a PC • web interface for specific applications • support for running large numbers of MATLAB jobs
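The start and kill rules on this slide can be written down as two small predicates. The sketch below is a paraphrase in Python rather than the pool's actual Condor configuration: the function and parameter names are hypothetical, and the load threshold of 0.3 is an assumed stand-in for "low load average", which the slide does not quantify.

```python
def may_start_job(in_office_hours, idle_minutes, load_avg,
                  min_idle_minutes=5, max_load=0.3):
    """Decide whether a PC may start a Condor job.

    max_load is an assumed value; the slide only says "low load average".
    """
    if not in_office_hours:
        return True  # outside office hours jobs may run at any time
    return idle_minutes >= min_idle_minutes and load_avg < max_load


def must_kill_job(user_logged_in):
    """A running job is killed as soon as a user logs in to the PC."""
    return user_logged_in


print(may_start_job(in_office_hours=True, idle_minutes=2, load_avg=0.1))
print(may_start_job(in_office_hours=True, idle_minutes=10, load_avg=0.1))
print(may_start_job(in_office_hours=False, idle_minutes=0, load_avg=1.5))
```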
Condor service caveats • only suitable for DOS-based applications running in batch mode • no communication between processes possible (“pleasantly parallel” applications only) • statically linked executables work best (although can cope with DLLs) • all files needed by application must be present on local disk (cannot access network drives) • shorter jobs more likely to run to completion (10-20 min seems to work best) • very long running jobs can be accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)
Running MATLAB jobs under Condor • many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C) • need to create standalone application from M-file(s) using MATLAB compiler • standalone application can run without a MATLAB license • run-time libraries still need to be accessible to MATLAB jobs • nearly all toolbox functions available to standalone applications • simple (but powerful) file I/O makes checkpointing easier • see Liverpool Condor website for more information
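User-level checkpointing through simple file I/O can be sketched as follows. The example is in Python rather than MATLAB, and the file name, the state layout and the toy computation (a running sum of squares) are all assumptions made for illustration. The pattern is what matters: load any saved state at start-up, save it every so often, and write the file atomically so that an eviction in mid-write cannot corrupt it.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical checkpoint file name


def load_state():
    """Resume from the checkpoint file if one exists, else start afresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state):
    """Write to a temporary file, then rename it over the old checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)


def run(n_steps, checkpoint_every=100, stop_after=None):
    """Sum the squares of 0..n_steps-1, checkpointing periodically.

    stop_after simulates an eviction after that many steps in this run;
    the function then returns None and unsaved work is lost.
    """
    state = load_state()
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None
        state["total"] += state["next_step"] ** 2
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)
    return state["total"]
```

Calling `run(1000, stop_after=250)` and then `run(1000)` mimics a job that is evicted and later restarted: the second call picks up from step 200, the last saved checkpoint, rather than from the beginning.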
Power-saving and Green IT at Liverpool • we have around 2 000 centrally managed classroom PCs across campus which used to be left powered up overnight, at weekends and during vacations • original power-saving policy was to power off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity • policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a. • 3rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running • Condor’s own power management features allow machines to be woken up automatically according to demand
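The energy figures on this slide can be cross-checked with a little arithmetic. The per-PC power draw and the electricity price computed below are implied by the slide's numbers rather than stated on it, so treat them as derived estimates; the 52-week year is also an assumption.

```python
# Figures taken from the slide (lower and upper ends of each range).
hours_saved_per_week = (200_000, 250_000)   # PC-hours of idle time avoided
energy_saved_mwh_per_week = (20, 25)        # quoted energy equivalent
annual_saving_gbp = 125_000

# Implied average power draw of an idle PC, in watts.
implied_watts = [
    mwh * 1_000_000 / hours
    for mwh, hours in zip(energy_saved_mwh_per_week, hours_saved_per_week)
]

# Implied electricity price in pounds per kWh, assuming 52 weeks a year.
implied_price_per_kwh = [
    annual_saving_gbp / (mwh * 1000 * 52)
    for mwh in energy_saved_mwh_per_week
]

print(implied_watts)
print([round(p, 3) for p in implied_price_per_kwh])
```

Both ends of the range give the same implied draw of about 100 W per idle PC, and the implied price works out at roughly 10 to 12 pence per kWh.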
Condor-G and Grid Computing • Condor-G is an extension to Condor allowing job submission to remote resources using Globus • provides familiar Condor-like interface to users hiding the underlying middleware complexity • we have used Condor-G to give users grid access to a variety of HPC resources: • local HPC clusters (UL-Grid) • NW-Grid resources at Daresbury Lab, Lancaster and Manchester • National Grid Service facilities • Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine) • Web portal removes the need for command line use completely
Radiotherapy example • 3D model of normal tissue was developed in which complications are generated when ‘irradiated’ [1] • aim is to provide insight into connection between dose-distribution characteristics, different organ architectures and complication rates beyond that of analytical methods • code written in MATLAB and compiled into standalone executable • set of 800 simulations took ~ 36 hours to run on Condor pool • would require 4-5 months of computing time on a single PC • several dozen sets of simulations have since been completed [1]Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose–volume analyses. Phys. Med. Biol. 55 (2010) 2121–2136.
Personalised Medicine example • project is a Genome-Wide Association Study • aims to identify genetic predictors of response to anti-epileptic drugs • try to identify regions of the human genome that differ between individuals (referred to as SNPs) • 800 patients genotyped at 500 000 SNPs along the entire genome • test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects) • very large data-parallel problem – ideal for Condor • divide datasets into small partitions so that individual jobs run for 15-30 minutes • batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute time on Condor but ~ 5 weeks on a single PC
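The data-parallel split described above amounts to cutting each chromosome's SNP list into contiguous slices, one per job. The sketch below assumes 100 jobs per chromosome (2 600 jobs over 26 chromosomes) and uses a made-up SNP count for a single chromosome; the `partition` helper is hypothetical and is not taken from the project's code.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) slices
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


jobs_per_chromosome = 2600 // 26          # 100 jobs for each chromosome
snps_on_chromosome = 19_231               # hypothetical count, about 500 000 / 26
slices = partition(snps_on_chromosome, jobs_per_chromosome)
print(len(slices), slices[0], slices[-1])

# Rough speed-up implied by the slide: ~5 weeks on one PC vs ~5 hours on Condor.
speedup = (5 * 7 * 24) / 5
print(speedup)
```

Each job would then be given one (start, stop) pair, for example through per-job input files named with `$(PROCESS)`.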
Epidemiology example • researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2] • Monte Carlo type method - highly parallel • original code written in MATLAB and compiled into standalone application • individual simulations take only 10-15 minutes to run – ideal for Condor • require ~ 10 000 - 20 000 simulations per scenario • would have needed several years of compute time on single machine, on Condor needed a few weeks [2] Sharkey, K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28
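A Monte Carlo study maps onto Condor naturally because every simulation is independent: each job only needs its own random seed, and the results are combined afterwards. The toy sketch below is not the epidemiological model from [2]; the event probability, the trial counts and the use of the job number as seed are assumptions chosen purely to show the pattern.

```python
import random


def run_job(process_id, n_trials=1000, p_event=0.1):
    """One independent job: estimate an event frequency by random sampling.

    Seeding with the job number keeps each job reproducible and gives
    every job a different random stream.
    """
    rng = random.Random(process_id)
    events = sum(1 for _ in range(n_trials) if rng.random() < p_event)
    return events / n_trials


def combine(results):
    """Aggregate the per-job estimates once all jobs have finished."""
    return sum(results) / len(results)


# Stand-in for 20 Condor jobs; on a pool each call would run on its own host.
results = [run_job(pid) for pid in range(20)]
estimate = combine(results)
print(estimate)
```

On a real pool the job number could be passed to the program on its command line, for example with `arguments = $(PROCESS)` in the submit description file.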
Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk