Introduction to research computing using Condor
Ian C. Smith, Advanced Research Computing, University of Liverpool
Overview
• what is Condor and what can it be used for?
• typical Condor pool operation
• University of Liverpool Condor Pool
• support for MATLAB and R applications
• some research computing examples
• quick introduction to UNIX with a walk-through example
What is Condor?
• a specialized system for delivering High Throughput Computing
• a harvester of unused computing resources
• developed by the Computer Science Dept at the University of Wisconsin in the late 1980s
• free and (now) open source software
• widely used in academia and increasingly in industry
• available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS
Types of Condor application
• typically: large numbers of independent calculations (“pleasantly parallel”)
• data parallel applications: split large datasets into smaller parts and process them in parallel
• biological sequence analysis (e.g. BLAST)
• processing of field trial data
• optimisation problems
• microprocessor design and testing
• applications based on Monte Carlo methods
• radiotherapy treatment analysis
• epidemiological studies
A “typical” Condor pool
[Diagram, built up over four slides: a desktop PC, a Condor server and a set of execute hosts]
1. login and upload input data (desktop PC to Condor server)
2. jobs (Condor server to execute hosts)
3. results (execute hosts to Condor server)
4. download results (Condor server to desktop PC)
University of Liverpool Condor Pool
• contains around 700 classroom PCs running the CSD Managed Windows 7 Service (mostly 64 bit from next year)
• most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per PC (total of 1400 job slots)
• single job submission point for Condor jobs provided by a powerful UNIX server
• jobs continue to run while classroom PCs are unused but ...
• if load (or memory use) becomes significant, the job will be killed and usually any results will be lost (the job will start again from scratch)
• tools provided for running large numbers of MATLAB and R jobs
Condor caveats
• only suitable for non-interactive applications
• no communication between jobs possible
• all files needed by the application must be present on local disk
• shorter jobs are more likely to run to completion (10-20 min seems to work best)
• long running jobs can be run if a save/restore mechanism (checkpointing) is built into them
• tricky to begin with but usually worth the initial effort
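The save/restore (checkpointing) idea in the caveats above can be sketched in a few lines. This is a minimal illustration in Python, not the mechanism used on the Liverpool pool: the function name `run_with_checkpoint`, the JSON state file and the `stop_after` argument (which simulates an eviction) are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_checkpoint(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    If a checkpoint file exists, the run resumes from it instead of
    starting from scratch. `stop_after` is a hypothetical knob that
    simulates the job being evicted after that many steps.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # restore saved progress

    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated eviction: no result yet
        state["total"] += state["next_step"]  # the "real work"
        state["next_step"] += 1
        done_this_run += 1
        # Write to a temporary file, then rename, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        run_with_checkpoint(10, path, stop_after=4)  # interrupted run
        print(run_with_checkpoint(10, path))         # resumed run
```

In a real job the checkpoint would be written far less often than every step, and the file would also have to be returned to the submit server when the job is evicted; how that is arranged depends on the pool configuration.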
Running MATLAB jobs under Condor
• need to create a standalone application from M-file(s) using the MATLAB compiler
• standalone application can run without a MATLAB license
• run-time libraries still need to be accessible to MATLAB jobs
• nearly all toolbox functions are available to standalone applications
• simple (but powerful) file input/output makes checkpointing easier
• tools available to simplify job submission: see the Liverpool Condor website for more information
Running R jobs under Condor
• limited support at present
• R is installed on-the-fly as part of the job
• currently only R version 2.6.2 is available, with standard packages
• tools available to simplify job submission
• checkpointing may be possible for long running jobs
Personalised Medicine example
• project is a Genome-Wide Association Study
• aims to identify genetic predictors of response to anti-epileptic drugs
• try to identify regions of the human genome that differ between individuals (referred to as SNPs)
• 800 patients genotyped at 500 000 SNPs along the entire genome
• test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)
• very large data-parallel problem using R: ideal for Condor
• divide datasets into small partitions so that individual jobs run for 15-30 minutes
• batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC
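To make the "divide datasets into small partitions" step concrete, here is a hedged Python sketch that splits a range of item indices into nearly equal contiguous chunks, one per job, and works out the speed-up implied by the quoted timings. The even split is an assumption for illustration only; the slides do not say how the SNPs were actually divided between jobs, and `partition` is a hypothetical helper.

```python
def partition(n_items, n_jobs):
    """Split indices 0..n_items-1 into n_jobs contiguous half-open ranges.

    Range sizes differ by at most one item, so jobs take similar time
    when every item costs roughly the same to process.
    """
    base, extra = divmod(n_items, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # first `extra` jobs get one more
        ranges.append((start, start + size))
        start += size
    return ranges


# Figures quoted on the slide: 500 000 SNPs, 2 600 jobs.
ranges = partition(500_000, 2_600)

# Speed-up implied by "~ 5 hours on Condor but ~ 5 weeks on a single PC".
single_pc_hours = 5 * 7 * 24        # 5 weeks expressed in hours
speedup = single_pc_hours / 5       # roughly 168 times faster
```

With these numbers each job handles 192 or 193 SNPs. Tuning the partition size is what keeps individual jobs in the 15-30 minute range mentioned above.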
Radiotherapy example
• large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods
• tried running the simulation on 56 cores of a high performance computing cluster but no progress after 5 weeks
• divided the problem into 250, then 5 000 and eventually 50 000 Condor jobs
• required ~ 2 600 days of CPU time (equivalent to ~ 3.5 years on a dual core PC)
• Condor simulation completed in less than one week
• average run time was ~ 70 min
• only ~ 10 % of compute time wasted due to evictions
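The figures on this slide can be cross-checked with some simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative, and the "average slots" value assumes the run took a full seven days, which the slide gives only as an upper bound.

```python
CPU_DAYS = 2600          # total CPU time quoted on the slide
JOBS = 50_000            # final number of Condor jobs
AVG_RUN_MINUTES = 70     # average job run time
WALLCLOCK_DAYS = 7       # "less than one week", taken as an upper bound

# Time to do the same work on one dual core PC, in years.
dual_core_years = CPU_DAYS / 2 / 365.25

# CPU time accounted for by successful runs alone, in days.
useful_cpu_days = JOBS * AVG_RUN_MINUTES / 60 / 24

# Average number of job slots that must be busy to finish within a week.
avg_busy_slots = CPU_DAYS / WALLCLOCK_DAYS
```

This gives about 3.6 years on a dual core PC, about 2 430 CPU days of successful runs (consistent with ~ 2 600 days once evicted work is included) and an average of roughly 370 busy job slots, well within the pool's 1 400.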
Condor service prerequisites
• will need a Sun UNIX service account (contact CSD: helpdesk@liv.ac.uk) and a Condor account (http://www.liv.ac.uk/csd/registration/eScienceform.pdf)
• to log in to the Condor server:
  • on MWS use PuTTY: Install University Applications | Internet | PuTTy 0.60
  • Mac/Linux: open a terminal window and use ssh
  • off campus: use Apps Anywhere (PuTTY is in the Utilities group)
• to upload/download files to/from the Condor server:
  • on MWS use CoreFTPLite: Install University Applications | Internet | CoreFTP LE2.1
  • Mac/Linux: open a terminal window and use sftp/scp
  • off campus: need to use the virtual private network (VPN), then FTP
Condor server directory tree
/ (or ‘root’)
    /bin
    /condor_data            ‘home’ directories for Condor
        /condor_data/smithic
        /condor_data/jim
    /home                   login ‘home’ directories
        /home/smithic
        /home/jim
        /home/fred
    /sbin
    /tmp
    /usr
MATLAB Condor example
calculate the sum of p matrix-matrix products:
• each product calculation is independent and can be performed in parallel
• MATLAB M-file (product.m):

    function product
    load input.mat;
    C=A*B;
    save( 'output.mat', 'C' );
    quit;
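The structure of this problem, independent products followed by one final sum, can be sketched without MATLAB. The Python version below uses plain lists so it is self-contained; it is an illustration of the decomposition, not the code run on the pool, and the helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matadd(x, y):
    """Add two matrices of the same shape element by element."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def sum_of_products(pairs):
    """Return the sum of A*B over all (A, B) pairs.

    Each product depends only on its own pair, so on Condor every
    product could be a separate job; only the final sum needs all
    of the results together.
    """
    products = [matmul(a, b) for a, b in pairs]  # independent work units
    total = products[0]
    for product in products[1:]:                 # cheap combine step
        total = matadd(total, product)
    return total


pairs = [
    ([[1, 2], [3, 4]], [[1, 0], [0, 1]]),
    ([[0, 1], [1, 0]], [[1, 2], [3, 4]]),
]
result = sum_of_products(pairs)
```

In the walk-through, each job plays the role of one `matmul` call: it reads one input file holding A and B and writes one output file holding C.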
Job submission example

    [smithic@ulgp5 multiple]$ cd /condor_data/smithic    # change directory
    [smithic@ulgp5 smithic]$ tar xf /opt1/condor/examples/handson.tar    # get examples
    [smithic@ulgp5 smithic]$ cd matlab    # now in /condor_data/smithic/matlab
    [smithic@ulgp5 matlab]$ ls    # list files
    input0.mat  input2.mat  input4.mat  product
    input1.mat  input3.mat  product.m
    [smithic@ulgp5 matlab]$ matlab_build product.m    # create standalone executable
    Submitting job(s).
    1 job(s) submitted to cluster 503.
    [smithic@ulgp5 matlab]$ condor_q    # get Condor queue status
    -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     503.0   smithic   6/7  15:19  0+00:00:10 R  0   0.0  runscript.bat wrap
    1 jobs; 0 idle, 1 running, 0 held
    [smithic@ulgp5 matlab]$ condor_q    # job has finished when gone from queue
    -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
    0 jobs; 0 idle, 0 running, 0 held
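Checking by eye whether the queue is empty gets tedious with many jobs. As a rough sketch, the summary line printed by `condor_q` in the session above can be parsed in Python. The regular expression matches the exact wording shown on the slide; other Condor versions print a different summary, so treat the pattern as an assumption, and `parse_summary` and `all_finished` as hypothetical helpers.

```python
import re

# Matches the summary line as it appears on the slide, e.g.
# "1 jobs; 0 idle, 1 running, 0 held"
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(condor_q_output):
    """Extract the job counts from the text printed by condor_q."""
    match = SUMMARY.search(condor_q_output)
    if match is None:
        raise ValueError("no summary line found in condor_q output")
    total, idle, running, held = (int(n) for n in match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


def all_finished(condor_q_output):
    """True when no jobs are left in the queue."""
    return parse_summary(condor_q_output)["total"] == 0


running_output = """-- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 503.0   smithic   6/7  15:19  0+00:00:10 R  0   0.0  runscript.bat wrap
1 jobs; 0 idle, 1 running, 0 held"""

counts = parse_summary(running_output)
```

A script could capture the command's output (for example with `subprocess.run`) and call `all_finished` in a loop with a sleep between checks.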
Job submission example

    [smithic@ulgp5 matlab]$ ls
    input0.mat  input2.mat  input4.mat  product.bat  product.exe.manifest  product.sub
    input1.mat  input3.mat  product     product.exe  product.m
    [smithic@ulgp5 matlab]$ cat product    # display file contents
    executable=product.exe
    indexed_input_files=input.mat
    indexed_output_files=output.mat
    total_jobs=5
    [smithic@ulgp5 matlab]$ matlab_submit product    # submit multiple MATLAB jobs
    Submitting job(s).....
    5 job(s) submitted to cluster 511.
    [smithic@ulgp5 matlab]$ condor_q    # get status of jobs
    -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     511.0   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
     511.1   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
     511.2   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
     511.3   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
     511.4   smithic   6/7  16:01  0+00:00:02 R  0   0.0  product.bat produc
    5 jobs; 0 idle, 5 running, 0 held
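The job description file above names `input.mat` and `output.mat`, while the directory holds `input0.mat` to `input4.mat`. The Python sketch below shows one way that indexed naming could be expanded into per-job file names. It is inferred from the file listings on the slides and is not the actual implementation of `matlab_submit`; `parse_job_file`, `indexed_name` and `expand_jobs` are hypothetical helpers.

```python
import os


def parse_job_file(text):
    """Parse simple key=value lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def indexed_name(filename, index):
    """Insert the job index before the extension: input.mat -> input3.mat."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}{index}{ext}"


def expand_jobs(settings):
    """List the input and output file name for each job index."""
    total = int(settings["total_jobs"])
    return [
        {
            "input": indexed_name(settings["indexed_input_files"], i),
            "output": indexed_name(settings["indexed_output_files"], i),
        }
        for i in range(total)
    ]


job_file = """executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5"""

jobs = expand_jobs(parse_job_file(job_file))
```

Under this reading, job 0 works on `input0.mat` and produces `output0.mat`, and so on up to job 4, which matches the output files listed once the jobs have finished.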
Job submission example

    [smithic@ulgp5 matlab]$ condor_q    # some jobs completed, one still running
    -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
     511.0   smithic   6/7  16:01  0+00:00:25 R  0   0.0  product.bat produc
    1 jobs; 0 idle, 1 running, 0 held
    [smithic@ulgp5 matlab]$ condor_q    # all jobs complete
    -- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:42003>
     ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
    0 jobs; 0 idle, 0 running, 0 held
    [smithic@ulgp5 matlab]$ ls    # check output files
    input0.mat  input3.mat   output1.mat  output4.mat  product.exe           product.sub
    input1.mat  input4.mat   output2.mat  product      product.exe.manifest
    input2.mat  output0.mat  output3.mat  product.bat  product.m
    [smithic@ulgp5 matlab]$ zip output.zip output*.mat    # bundle output files
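The final `zip` step bundles the results into a single file so that only one download is needed. For readers who prefer to script this, the Python sketch below does the same job with the standard `zipfile` module. The `bundle_outputs` function and its default arguments are illustrative choices that mirror the command above.

```python
import glob
import os
import zipfile


def bundle_outputs(directory, pattern="output*.mat", archive="output.zip"):
    """Zip every file in `directory` matching `pattern`.

    Returns the names stored in the archive, in sorted order.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    archive_path = os.path.join(directory, archive)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store just the file name, not the full directory path.
            zf.write(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in paths]
```

Calling `bundle_outputs("/condor_data/smithic/matlab")` would leave `output.zip` alongside the result files, ready to be downloaded with sftp or scp.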
Summary
• Condor can speed up processing by running large numbers of jobs in parallel
• shorter jobs work best but Condor can deal with jobs of arbitrary length
• user-written codes are easiest to run (MATLAB, R, C/C++, FORTRAN etc.)
• commercial 3rd party software may work
• needs to run on a standard MWS PC without user interaction
• all Condor jobs are submitted via a central UNIX server
Further Information
• Condor: http://www.liv.ac.uk/e-science/condor (i.c.smith@liverpool.ac.uk)
• other research computing services: http://www.liv.ac.uk/csd/research/ (arc-support@liverpool.ac.uk)