Ian C. Smith*

Harvesting unused clock cycles with Condor Ian C. Smith* *Advanced Research Computing The University of Liverpool

Overview • what is Condor ? • High Performance versus High Throughput Computing • Condor fundamentals • setting up and running a Condor Pool • The University of Liverpool Condor Pool • example applications

What is Condor ? • a specialised system for delivering High Throughput Computing • a harvester of unused computing resources • developed by Computer Science Dept at University of Wisconsin in late ‘80s • free and (now) open source software • widely used in academia and increasingly in industry • available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS

HPC vs HTC (1) • High Performance Computing (HPC) • delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important) • can also provide lots of memory, large amounts of fast (parallel) storage • fairly exotic hardware, may need plenty of TLC • large capital outlay on hardware • need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources) • users run relatively small numbers of parallel jobs • essential for certain time-critical applications

HPC vs HTC (2) • High Throughput Computing (HTC) • allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important) • users more concerned with running large numbers of jobs over a long time span than a few short burst computations • makes use of existing commodity hardware (e.g. desktop PCs) • small capital outlay on hardware possible • limited memory and storage available generally • mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)

Types of Condor application • typically large numbers of independent calculations (“pleasantly parallel”) • data parallel applications – split large datasets into smaller parts and analyse independently • biological sequence analysis • processing of census data • optimisation problems • microprocessor design and testing • applications based on Monte Carlo methods • radiotherapy treatment analysis • epidemiological studies

A “typical” Condor pool [diagram: a central manager, a submit host, a combined submit/execute host and two groups of execute hosts; successive frames are labelled ClassAds, Match Info, Jobs and Output]

ClassAds and Matchmaking • ClassAds are a fundamental part of Condor • similar to classified advertisements in a paper • “Job Ads” represent jobs to Condor (similar to “wanted” ads) • “Machine Ads” represent compute resources in a Condor Pool (similar to “for sale” ads) • Condor central manager matches Machine Ads to Job Ads and hence machines to jobs • Job Ads are created using submit description files
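The matching idea above can be sketched as a toy program. This is only an illustration, not Condor's matchmaking algorithm: the machine names and attribute values are invented, ads are plain Python dicts, and only the job's side of the match is checked (real Condor also evaluates the machine's own requirements and preferences).

```python
# Toy matchmaker: ads are plain dicts, and a job's Requirements and Rank
# are ordinary Python functions of a machine ad. Illustrative only.

# Hypothetical machine ads ("for sale" ads); all values are made up.
MACHINE_ADS = [
    {"Name": "pc-a", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 2048, "Kflops": 900_000},
    {"Name": "pc-b", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 1024, "Kflops": 1_200_000},
    {"Name": "pc-c", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 4096, "Kflops": 1_500_000},
    {"Name": "pc-d", "OpSys": "WINNT51", "Arch": "INTEL", "Memory": 2048, "Kflops": 1_100_000},
]

# Hypothetical job ad ("wanted" ad): what the job needs and what it prefers.
JOB_AD = {
    "Cmd": "example.exe",
    "Requirements": lambda m: m["OpSys"] == "WINNT51" and m["Memory"] >= 2000,
    "Rank": lambda m: m["Kflops"],
}


def match(job_ad, machine_ads):
    """Return the machines that satisfy the job's requirements, best rank first."""
    eligible = [m for m in machine_ads if job_ad["Requirements"](m)]
    return sorted(eligible, key=job_ad["Rank"], reverse=True)


if __name__ == "__main__":
    for machine in match(JOB_AD, MACHINE_ADS):
        print(machine["Name"])
```

Here pc-b is rejected for too little memory and pc-c for the wrong operating system; the two remaining machines are ordered by floating point performance.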

Simple submit description file
    # simple submit description file
    # (anything following a # is a comment and is ignored by Condor)
    # this would be used for Windows XP based execute hosts
    universe = vanilla
    executable = example.exe                                   # what to run
    output = stdout.out$(PROCESS)                              # job's standard output
    log = mylog.log$(PROCESS)                                  # log job's activities
    transfer_input_files = common.txt, myinput$(PROCESS).txt   # input files needed
    requirements = ( Arch=="Intel" ) && ( OpSys=="WINNT51" )   # what machines to run on
    queue 2                                                    # number of jobs to queue
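To make the role of $(PROCESS) concrete, the short sketch below lists the per-job file names that the submit file above implies. It only mimics the substitution in Python; the helper names expand and files_per_job are invented for this illustration and are not part of Condor. Process numbers start at 0, so "queue 2" gives jobs 0 and 1.

```python
def expand(template, process):
    """Substitute a job's process number for $(PROCESS) in a submit-file value."""
    return template.replace("$(PROCESS)", str(process))


def files_per_job(n_jobs):
    """List the per-job file names implied by the submit file for `queue n_jobs`."""
    templates = {
        "output": "stdout.out$(PROCESS)",
        "log": "mylog.log$(PROCESS)",
        "input": "myinput$(PROCESS).txt",
    }
    return [
        {key: expand(value, process) for key, value in templates.items()}
        for process in range(n_jobs)  # process numbers start at 0
    ]


if __name__ == "__main__":
    for process, names in enumerate(files_per_job(2)):
        print(process, names)
```

Every job also receives common.txt, which contains no $(PROCESS) and so is the same file for all jobs.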

Requirements and Rank
• Requirements expression determines where (and when) a job will run e.g.
    Requirements = ( OpSys == "WINNT51" ) && \      # Windows XP OS wanted
                   ( Arch == "Intel" ) && \         # Intel/compatible processor
                   ( Memory >= 2000 ) && \          # want at least 2 GB memory and
                   ( Disk >= 33554432 ) && \        # at least 32 GB of free disk
                   ( HAS_MATLAB == TRUE ) && \      # must have MATLAB installed
                   ( ( ClockMin > 1020 ) || \       # only run jobs after 5 pm OR ...
                     ( ClockDay == 6 ) || ( ClockDay == 0 ) )   # at weekends
• Rank is used to express a preference
    Rank = Kflops      # run on machines with best floating point performance first
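The numeric constants in a requirements expression like this are easier to check with a little arithmetic. The sketch below assumes the usual meaning of the machine attributes: ClockMin is minutes since midnight, ClockDay runs from 0 (Sunday) to 6 (Saturday), and Disk is in KiB. The function names are invented for this illustration; it mimics only the time-of-day part of the expression in Python and is not how Condor evaluates ClassAds.

```python
from datetime import datetime


def clock_attrs(when):
    """Return (ClockMin, ClockDay) for a datetime.

    ClockMin is minutes since midnight; ClockDay is 0 (Sunday) .. 6 (Saturday).
    """
    clock_min = when.hour * 60 + when.minute
    clock_day = (when.weekday() + 1) % 7  # Python uses Monday=0 .. Sunday=6
    return clock_min, clock_day


def time_window_ok(when):
    """Mimic: (ClockMin > 1020) || (ClockDay == 6) || (ClockDay == 0)."""
    clock_min, clock_day = clock_attrs(when)
    return clock_min > 1020 or clock_day == 6 or clock_day == 0


# Unit checks for the constants used in the expression.
FIVE_PM_IN_MINUTES = 17 * 60          # 1020
DISK_32_GIB_IN_KIB = 32 * 1024 ** 2   # 33554432

if __name__ == "__main__":
    print(FIVE_PM_IN_MINUTES, DISK_32_GIB_IN_KIB)
    print(time_window_ok(datetime(2010, 6, 7, 16, 0)))   # a Monday afternoon
    print(time_window_ok(datetime(2010, 6, 5, 10, 0)))   # a Saturday morning
```

Note that the comparison is strict, so a job would not be eligible at exactly 17:00, only from 17:01 onwards.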

Job submission and monitoring
    [einstein@submit ~]$ condor_submit example.sub
    Submitting job(s).
    2 job(s) submitted to cluster 100.
    [einstein@submit ~]$ condor_q
    -- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
     ID      OWNER        SUBMITTED     RUN_TIME     ST PRI SIZE CMD
       1.0   sagan        7/22 14:19  172+21:28:36   R  0   22.0 checkprogress.cron
       2.0   heisenberg   1/13 13:59    0+00:00:00   I  0    0.0 env
       3.0   hawking      1/15 19:18    0+04:29:33   R  0    0.0 script.sh
       4.0   hawking      1/15 19:33    0+00:00:00   R  0    0.0 script.sh
       5.0   hawking      1/15 19:33    0+00:00:00   H  0    0.0 script.sh
       6.0   hawking      1/15 19:34    0+00:00:00   R  0    0.0 script.sh
     ...
      96.0   bohr         4/5  13:46    0+00:00:00   I  0    0.0 c2b_dops.sh
      97.0   bohr         4/5  13:46    0+00:00:00   I  0    0.0 c2b_dops.sh
      98.0   bohr         4/5  13:52    0+00:00:00   I  0    0.0 c2b_dopc.sh
      99.0   bohr         4/5  13:52    0+00:00:00   I  0    0.0 c2b_dopc.sh
     100.0   einstein     4/5  13:55    0+00:00:00   I  0    0.0 cosmos
    557 jobs; 402 idle, 145 running, 1 held
    [einstein@submit ~]$

Condor policies • Condor supports a wide range of policies for when to start jobs e.g. • run jobs only outside office hours • run jobs only if load average on host is small and there has been no recent activity • run jobs at any time on one core (at low priority) • run jobs only submitted by certain users • also a wide choice of what to do when a job is about to be interrupted e.g. • suspend the job for a limited time then let it resume • checkpoint the job and migrate it to another machine • kill off the job immediately

UNIX or Windows execute hosts ? (1) • UNIX • Condor’s natural environment • not widely installed on desktop machines (but depends on institution...) • supports the Condor “standard universe” containing many useful features • checkpointing allows jobs to be migrated from one machine to another without loss of useful work • Remote Procedure Calls give transparent access to files on submit host • streaming of standard output (stdout) from jobs to submit host • network filesystems work well, making installation and configuration much easier • leverages large amount of scientific and engineering codes which have been developed under UNIX

UNIX or Windows execute hosts ? (2) • Windows • world’s most widely installed OS – rich source of execute hosts • many commercial 3rd party applications run on Windows • using shared (network) filesystems can be difficult under Condor • only supports the “vanilla” Condor universe • no checkpointing – evicted jobs may waste a lot of cycles • all input and output files need to be transferred to/from execute host • output streaming not supported • may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier) • Windows support from the U-W Condor Team tends to lag behind UNIX

Setting up a Condor pool • best to start off small and build up pool slowly • need to understand Condor fundamentals: • role of Condor processes and how they interact • life-cycle of jobs • ClassAds and Matchmaking • avoid firewalls if possible (may be easier said than done ...) • talk to central IT services (particularly network and PC teams) • submit hosts may need to be fairly high spec if large numbers of jobs are to be run - ideally want • multi-core/processor machine (quad core at least) • plenty of memory (say 8 GB or more) • large fast access filestore (e.g. 1 TB RAID)

Where to go for help • Read The Fine Manual ! • log files contain a lot of useful information • take a look at the presentations, tutorials and “how-to recipes” on the Condor website: (www.cs.wisc.edu/condor) • search the condor-users mail list archive: (lists.cs.wisc.edu/archive/condor-users) • subscribe to the condor-users mail list • join the Campus Grids SIG: (wikis.nesc.ac.uk/escinet/Campus_Grids) • commercial support is also available (e.g. Cycle Computing)

University of Liverpool Condor Pool • contains around 400 machines running the University’s Managed Windows Service (currently XP but moving to Windows 7 soon) • most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine • single submission point for Condor jobs provided by Sun Solaris V445 SMP server • policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours • job will be killed off if running when a user logs in to a PC • web interface for specific applications • support for running large numbers of MATLAB jobs
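The start policy described on this slide can be written down as a small predicate. This is a sketch in Python, not the pool's actual Condor configuration: the function name may_start_job is invented, the 5 minute idle threshold comes from the slide, and the load threshold of 0.3 is a purely hypothetical stand-in for "low load average".

```python
def may_start_job(idle_minutes, load_avg, in_office_hours,
                  min_idle_minutes=5, max_load=0.3):
    """Decide whether a machine may start a job under the policy sketched above.

    Outside office hours jobs may start at any time; during office hours the
    machine must have been idle long enough and be lightly loaded.
    max_load=0.3 is a hypothetical value, not taken from the slide.
    """
    if not in_office_hours:
        return True
    return idle_minutes >= min_idle_minutes and load_avg <= max_load


if __name__ == "__main__":
    print(may_start_job(idle_minutes=2, load_avg=0.1, in_office_hours=True))
    print(may_start_job(idle_minutes=10, load_avg=0.1, in_office_hours=True))
    print(may_start_job(idle_minutes=0, load_avg=0.9, in_office_hours=False))
```

The sketch covers only when a job may start; the rule that a running job is killed when a user logs in is a separate part of the policy.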

Condor service caveats • only suitable for DOS-based applications running in batch mode • no communication between processes possible (“pleasantly parallel” applications only) • statically linked executables work best (although can cope with DLLs) • all files needed by application must be present on local disk (cannot access network drives) • shorter jobs more likely to run to completion (10-20 min seems to work best) • very long running jobs can be accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)
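User level check-pointing means the application itself saves its progress at intervals so that a killed job can pick up where it left off. The sketch below shows the general idea in Python with a trivial running sum standing in for the real calculation; the function name, the JSON file format and the stop_after parameter (used only to simulate eviction) are all assumptions made for this illustration. How the checkpoint file is returned to the submit host and resubmitted is not covered here.

```python
import json
import os


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress so an interrupted run can resume.

    stop_after simulates eviction: the function returns None after that many
    steps in this invocation, leaving the checkpoint file behind.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last saved step

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated eviction
        state["total"] += state["next_step"]  # stand-in for one unit of real work
        state["next_step"] += 1
        steps_done += 1

        # Write to a temporary file first, then rename, so that a kill in the
        # middle of writing never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

In a real job the checkpoint would be written every few minutes of work rather than after every step, trading a little repeated work for much less file I/O.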

Running MATLAB jobs under Condor • many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C) • need to create standalone application from M-file(s) using MATLAB compiler • standalone application can run without a MATLAB license • run-time libraries still need to be accessible to MATLAB jobs • nearly all toolbox functions available to standalone applications • simple (but powerful) file I/O makes checkpointing easier • see Liverpool Condor website for more information

Power-saving and Green IT at Liverpool • we have around 2 000 centrally managed classroom PCs across campus which were powered up overnight, at weekends and during vacations • original power-saving policy was to power off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity • policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a. • 3rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running • Condor’s own power management features allow machines to be woken up automatically according to demand
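The figures on this slide can be cross-checked with a little arithmetic. The sketch below works out what the quoted numbers imply; the derived values (power draw per idle PC, cost per kWh) are back-calculated here and are not stated in the presentation, and a 52 week year is assumed.

```python
# Figures quoted on the slide (low and high ends of each range).
HOURS_SAVED_PER_WEEK = (200_000, 250_000)
ENERGY_SAVED_MWH_PER_WEEK = (20, 25)
ANNUAL_SAVING_GBP = 125_000


def implied_power_watts(machine_hours, energy_mwh):
    """Average power per machine implied by an energy total and machine-hours."""
    return energy_mwh * 1_000_000 / machine_hours  # MWh -> Wh, then Wh / h = W


def implied_price_gbp_per_kwh(energy_mwh_per_week, annual_saving_gbp, weeks=52):
    """Electricity price implied by the annual saving (assumes a 52 week year)."""
    annual_kwh = energy_mwh_per_week * 1_000 * weeks
    return annual_saving_gbp / annual_kwh


if __name__ == "__main__":
    for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH_PER_WEEK):
        watts = implied_power_watts(hours, mwh)
        price = implied_price_gbp_per_kwh(mwh, ANNUAL_SAVING_GBP)
        print(f"{hours} h/week, {mwh} MWh/week: {watts:.0f} W per PC, "
              f"about {price:.2f} GBP per kWh")
```

Both ends of the range correspond to an idle PC drawing about 100 W, and the annual saving is consistent with an electricity price of roughly 10-12 pence per kWh.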

Condor-G and Grid Computing • Condor-G is an extension to Condor allowing job submission to remote resources using Globus • provides familiar Condor-like interface to users hiding the underlying middleware complexity • we have used Condor-G to give users grid access to a variety of HPC resources: • local HPC clusters (UL-Grid) • NW-Grid resources at Daresbury Lab, Lancaster and Manchester • National Grid Service facilities • Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine) • Web portal removes the need for command line use completely

Radiotherapy example • 3D model of normal tissue was developed in which complications are generated when ‘irradiated’ [1] • aim is to provide insight into connection between dose-distribution characteristics, different organ architectures and complication rates beyond that of analytical methods • code written in MATLAB and compiled into standalone executable • set of 800 simulations took ~ 36 hours to run on Condor pool • would require 4-5 months of computing time on a single PC • several dozen sets of simulations have since been completed [1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose–volume analyses. Phys. Med. Biol. 55 (2010) 2121–2136.

Personalised Medicine example • project is a Genome-Wide Association Study • aims to identify genetic predictors of response to anti-epileptic drugs • try to identify regions of the human genome that differ between individuals (referred to as SNPs) • 800 patients genotyped at 500 000 SNPs along the entire genome • test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects) • very large data-parallel problem – ideal for Condor • divide datasets into small partitions so that individual jobs run for 15-30 minutes • batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute time on Condor but ~ 5 weeks on a single PC
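Dividing a dataset into near-equal partitions, one per job, can be sketched as follows. The slide's figures imply 100 jobs per chromosome (2 600 jobs over 26 chromosomes); the number of SNPs on any one chromosome is not given, so the 20 000 used below is purely hypothetical, as is the function name partition.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) slices.

    Slice sizes differ by at most one item; stop is exclusive.
    """
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)  # first `extra` slices get one more
        bounds.append((start, stop))
        start = stop
    return bounds


JOBS_PER_CHROMOSOME = 2_600 // 26  # 100, from the figures on the slide

if __name__ == "__main__":
    # Hypothetical chromosome with 20 000 SNPs, split across 100 jobs.
    slices = partition(20_000, JOBS_PER_CHROMOSOME)
    print(len(slices), slices[0], slices[-1])
```

Each (start, stop) pair would become the input range for one Condor job, for example written to that job's myinput file; the partition size is then tuned until individual jobs run for the desired 15-30 minutes.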

Epidemiology example • researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2] • Monte Carlo type method - highly parallel • original code written in MATLAB and compiled into standalone application • individual simulations take only 10-15 minutes to run – ideal for Condor • require ~ 10 000 - 20 000 simulations per scenario • would have needed several years of compute time on single machine, on Condor needed a few weeks [2] Sharkey, K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28

Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk
