Condor. Tugba Taskaya-Temizel 6 March 2006. What is Condor Technology?.
Condor Tugba Taskaya-Temizel 6 March 2006
What is Condor Technology?
Condor is a high-throughput distributed batch computing system that provides facilities such as job management, scheduling policies, a priority scheme, and resource monitoring and management (Thain, et al. 2005). It offers the following features:
• ClassAds: A framework for matching resources with job descriptions.
• Job Checkpoint and Migration: For some applications, a job can be resumed from its last saved state using a checkpoint file. This provides fault tolerance: if a machine fails, the job can be safely transferred to another machine.
• Remote System Calls: Condor supports I/O-oriented jobs (processes, executables) that read input files and write output files. With this mechanism the files are made available to the job on the remote machine automatically, so you do not need to transfer them manually or rely on a shared file system.
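The matchmaking idea behind ClassAds can be illustrated with a small Python sketch. This is a toy model, not Condor's ClassAd language or API: the machine names, the attribute values, and the functions `job_requirements` and `matches` are hypothetical. Real Condor matchmaking is also two-way (machines can state requirements on jobs), which this sketch leaves out.

```python
# Toy illustration of ClassAd-style matchmaking (NOT Condor's real ClassAd language).
# A machine ad is modelled as a dict of attributes; a job carries a requirements predicate.

# Hypothetical machine ads; names and values are made up for this sketch.
machines = [
    {"Name": "machine-a", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 251},
    {"Name": "machine-b", "OpSys": "WINNT", "Arch": "INTEL", "Memory": 512},
    {"Name": "machine-c", "OpSys": "LINUX", "Arch": "INTEL", "Memory": 16},
]


def job_requirements(ad):
    """Hypothetical job requirement: a Linux/Intel machine with Memory >= 20."""
    return ad["OpSys"] == "LINUX" and ad["Arch"] == "INTEL" and ad["Memory"] >= 20


def matches(requirements, machine_ads):
    """Return the names of the machines whose attributes satisfy the requirements."""
    return [ad["Name"] for ad in machine_ads if requirements(ad)]


matched = matches(job_requirements, machines)
print(matched)
```

Only the first machine satisfies all three conditions; the second runs a different operating system and the third has too little memory.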
Condor in our Department
• Our departmental Condor pool has 30 machines: 19 run Linux (concorde01-concorde06, tornado01-13) and 11 run Windows NT. The pool has 107 CPUs in total. To connect to one of the Condor machines, type:
    telnet concorde03.mcs.surrey.ac.uk
• To inspect the Condor pool, run:
    condor_status
• The output will be:
    Name          OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
    vm1@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+00:30:56
    vm2@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+21:45:42
    vm3@concorde0 LINUX  INTEL  Claimed    Busy      0.320   251  0+00:14:40
    vm4@concorde0 LINUX  INTEL  Claimed    Busy      0.650   251  0+09:48:20
    vm1@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:13
    vm2@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:05
• To see the available machines, run:
    condor_status -available
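To get a quick overview of the pool, the condor_status listing can be summarised by state. The Python sketch below assumes the eight-column layout shown on this slide and works on a pasted copy of the output; the function name `count_states` is hypothetical, and other Condor versions may print different columns.

```python
from collections import Counter

# Sample text copied from the condor_status output shown on the slide.
SAMPLE = """\
Name          OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
vm1@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+00:30:56
vm2@concorde0 LINUX  INTEL  Owner      Idle      1.000   251  0+21:45:42
vm3@concorde0 LINUX  INTEL  Claimed    Busy      0.320   251  0+00:14:40
vm4@concorde0 LINUX  INTEL  Claimed    Busy      0.650   251  0+09:48:20
vm1@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:13
vm2@concorde0 LINUX  INTEL  Unclaimed  Idle      0.000   251  0+02:50:05
"""


def count_states(status_text):
    """Count virtual machines per State, assuming the 8-column layout above."""
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        # Data rows have 8 columns and a name of the form vmN@host;
        # the header row has no '@' in its first column and is skipped.
        if len(fields) == 8 and "@" in fields[0]:
            counts[fields[3]] += 1
    return counts


counts = count_states(SAMPLE)
print(dict(counts))
```

In this sample, the machines in the Unclaimed state are the ones a new job could be matched to.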
How to Run a Job in Condor
• Jobs in the Condor environment run in the background, so they cannot accept input from the user while they are running.
• Choose a universe that suits the type of your application.
• A universe is an execution environment in Condor. Condor provides several universes, such as 'Standard', 'Vanilla', 'PVM', 'MPI', 'Globus', 'Java' and 'Scheduler'. The universe is specified in the ClassAd (submit description) file.
How to Run a Job in Condor: Standard Universe
• The standard universe provides a checkpoint mechanism that saves the last state of the job. This is useful when long-running jobs have to migrate to another machine.
• Create a directory such as $HOME/gt3/samples/condor.
How to Run a Job in Condor: Standard Universe
• Create a file called counter.c containing the following lines:
    #include <stdio.h>
    #include <stdlib.h>   /* atoi */

    int main(int argc, char *argv[])
    {
        int i;
        if (argc < 3) {
            fprintf(stderr, "usage: %s start stop\n", argv[0]);
            return 1;
        }
        /* Print every integer from start (inclusive) to stop (exclusive). */
        for (i = atoi(argv[1]); i < atoi(argv[2]); i++) {
            printf("%d \n", i);
        }
        return 0;
    }
• Compile the file and link it with Condor:
    condor_compile cc counter.c -o counter
How to Run a Job in Condor: Standard Universe
• Once the program is linked, create a ClassAd (submit description) file to run the job. Create a file with the extension 'cmd', such as 'standardunitest.cmd', containing the following lines:
    Executable = counter
    Arguments = 1 30
    Output = counterc1.out
    Log = counterc1.log
    Queue 1
    Arguments = 30 60
    Output = counterc2.out
    Queue 1
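The two jobs above cover adjacent half-open ranges, since the counter prints from its first argument up to, but not including, its second. For more jobs, the same pattern can be generated instead of typed. The Python sketch below is one way to do this; the helpers `split_range` and `make_submit` and the file-name prefix are hypothetical, and only the submit commands already used on this slide are emitted.

```python
def split_range(start, stop, n_jobs):
    """Split [start, stop) into n_jobs contiguous half-open sub-ranges."""
    step, extra = divmod(stop - start, n_jobs)
    bounds = []
    lo = start
    for k in range(n_jobs):
        # The first 'extra' jobs get one additional value each.
        hi = lo + step + (1 if k < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def make_submit(executable, ranges, prefix="counterc"):
    """Build submit-file text with one Arguments/Output/Queue group per range."""
    lines = [f"Executable = {executable}", f"Log = {prefix}.log"]
    for k, (lo, hi) in enumerate(ranges, start=1):
        lines += [f"Arguments = {lo} {hi}", f"Output = {prefix}{k}.out", "Queue 1"]
    return "\n".join(lines) + "\n"


ranges = split_range(1, 60, 2)
submit_text = make_submit("counter", ranges)
print(submit_text)
```

The generated text could be saved to a '.cmd' file and submitted in the usual way. Because the sub-ranges are half-open and contiguous, no number is printed twice and none is skipped.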
How to Run a Job in Condor: Standard Universe
• Once the program is linked, create a ClassAd (submit description) file to run the job. Create a file with the extension 'cmd', such as 'standardunitest.cmd'. To run the jobs only on particular machines, list them in a Requirements expression (the trailing backslashes continue the expression on the next line):
    Executable = counter
    Requirements = (Name == "vm1@concorde01.mcs.surrey.ac.uk" || Name == "vm2@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde01.mcs.surrey.ac.uk" || Name == "vm4@concorde01.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde02.mcs.surrey.ac.uk" || Name == "vm2@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde02.mcs.surrey.ac.uk" || Name == "vm4@concorde02.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde03.mcs.surrey.ac.uk" || Name == "vm2@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde03.mcs.surrey.ac.uk" || Name == "vm4@concorde03.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde04.mcs.surrey.ac.uk" || Name == "vm2@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde04.mcs.surrey.ac.uk" || Name == "vm4@concorde04.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde05.mcs.surrey.ac.uk" || Name == "vm2@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde05.mcs.surrey.ac.uk" || Name == "vm4@concorde05.mcs.surrey.ac.uk" || \
                    Name == "vm1@concorde06.mcs.surrey.ac.uk" || Name == "vm2@concorde06.mcs.surrey.ac.uk" || \
                    Name == "vm3@concorde06.mcs.surrey.ac.uk" || Name == "vm4@concorde06.mcs.surrey.ac.uk")
    Arguments = 1 30
    Output = counterc1.out
    Log = counterc1.log
    Queue 1
    Arguments = 30 60
    Output = counterc2.out
    Queue 1
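A long Requirements expression like the one above is easy to mistype, so it may be convenient to generate it. The following Python sketch builds the same expression for concorde01 to concorde06 with four virtual machines each, as a single line; the function `name_requirements` and its parameters are hypothetical names introduced for this illustration.

```python
def name_requirements(hosts, slots_per_host=4, domain="mcs.surrey.ac.uk"):
    """Build a single-line 'Requirements = (...)' expression listing every vmN@host."""
    names = [
        f"vm{slot}@{host}.{domain}"
        for host in hosts
        for slot in range(1, slots_per_host + 1)
    ]
    clauses = [f'Name == "{name}"' for name in names]
    return "Requirements = (" + " || ".join(clauses) + ")"


# concorde01 .. concorde06, as listed on the slide.
hosts = [f"concorde{n:02d}" for n in range(1, 7)]
requirements_line = name_requirements(hosts)
print(requirements_line)
```

The resulting line can be pasted into the '.cmd' file in place of the hand-written expression.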
How to Run a Job in Condor: Standard Universe
• To submit the job to the Condor pool, run the following command:
    condor_submit standardunitest.cmd
• The output will be:
    Submitting job(s)..
    Logging submit event(s)..
    2 job(s) submitted to cluster 92.
• To inspect your job, run:
    condor_q
• This displays your jobs. At first, expect to see them idle:
    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED   RUN_TIME    ST PRI SIZE CMD
     92.0   css1tt  3/3 14:52   0+00:00:00  I  0   3.4  counter 1 30
     92.1   css1tt  3/3 14:52   0+00:00:00  I  0   3.4  counter 30 60
    2 jobs; 2 idle, 0 running, 0 held
• After a couple of minutes, running 'condor_q' again should show an empty queue, which means the jobs have finished:
    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED   RUN_TIME    ST PRI SIZE CMD
    0 jobs; 0 idle, 0 running, 0 held
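Checking whether the queue has drained comes down to reading the summary line at the bottom of the condor_q output. A minimal Python sketch of that check follows; it assumes the summary format shown on this slide ("N jobs; N idle, N running, N held"), which may differ in other Condor versions, and the function names are hypothetical.

```python
import re

# Matches the summary line in the format shown on the slide.
SUMMARY_RE = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(condor_q_output):
    """Return the job counts from the condor_q summary line, or None if absent."""
    match = SUMMARY_RE.search(condor_q_output)
    if match is None:
        return None
    total, idle, running, held = map(int, match.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


def all_done(summary):
    """True when the summary reports an empty queue."""
    return summary is not None and summary["jobs"] == 0


before = parse_summary("92.1 css1tt 3/3 14:52 0+00:00:00 I 0 3.4 counter 30 60\n"
                       "2 jobs; 2 idle, 0 running, 0 held\n")
after = parse_summary("0 jobs; 0 idle, 0 running, 0 held\n")
print(before, all_done(before))
print(after, all_done(after))
```

A nonzero held count would be a reason to look at the job's log file rather than to keep waiting.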
How to Run a Job in Condor: Java Universe
• A Java program can be run on any machine with a JVM. Unlike in the standard universe, jobs cannot be suspended and moved to another machine. However, if a failure occurs, the jobs can be restarted on another machine.
• Create a Java file called 'Counter.java' containing the following lines:
    import java.lang.*;

    public class Counter {
        public static void main(String[] args) {
            int startt = Integer.parseInt(args[0]);
            int stopp = Integer.parseInt(args[1]);
            // Print every integer from startt (inclusive) to stopp (exclusive).
            for (int i = startt; i < stopp; i++) {
                System.out.println(i);
            }
        }
    }
• Then, compile the program:
    javac Counter.java
How to Run a Job in Condor: Java Universe
• Next, create a submit description file. Recall that the file extension should be 'cmd', as in 'javaunitest.cmd'. Add the following lines to the file:
    universe = java
    executable = Counter.class
    log = counter.log
    arguments = Counter 1 30
    output = counter1.output
    error = counter1.error
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
    arguments = Counter 30 60
    output = counter2.output
    error = counter2.error
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
How to Run a Job in Condor: Java Universe
• To submit the jobs to Condor, run:
    condor_submit javaunitest.cmd
• To inspect their status, type:
    condor_q
• The output of the command will look like:
    -- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
     ID     OWNER   SUBMITTED   RUN_TIME    ST PRI SIZE CMD
     98.0   css1tt  3/3 15:58   0+00:00:00  I  0   0.0  java Counter 1 30
     98.1   css1tt  3/3 15:58   0+00:00:00  I  0   0.0  java Counter 30 60
    2 jobs; 2 idle, 0 running, 0 held
• If you notice a problem with a job, remove it from the Condor pool, using the job ID shown in the first column:
    condor_rm ID
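The ID passed to condor_rm is the value in the first column of the condor_q listing, for example 98.0. As a small illustration, the Python sketch below picks out the IDs belonging to one owner from a pasted listing and prints the matching condor_rm commands without running them. The function `job_ids` is a hypothetical helper, and the parsing assumes the column layout shown on this slide.

```python
# Sample text copied from the condor_q output shown on the slide.
SAMPLE_Q = """\
-- Submitter: concorde03.mcs.surrey.ac.uk : <131.227.74.149:32773> : concorde03.mcs.surrey.ac.uk
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 98.0 css1tt 3/3 15:58 0+00:00:00 I 0 0.0 java Counter 1 30
 98.1 css1tt 3/3 15:58 0+00:00:00 I 0 0.0 java Counter 30 60
2 jobs; 2 idle, 0 running, 0 held
"""


def job_ids(condor_q_output, owner):
    """Return the job IDs (first column) of the rows whose OWNER column matches."""
    ids = []
    for line in condor_q_output.splitlines():
        fields = line.split()
        # A job row starts with a numeric ID such as 98.0, followed by the owner.
        if (len(fields) >= 2 and fields[1] == owner
                and fields[0].replace(".", "", 1).isdigit()):
            ids.append(fields[0])
    return ids


ids = job_ids(SAMPLE_Q, "css1tt")
commands = [f"condor_rm {job_id}" for job_id in ids]
for command in commands:
    print(command)
```

Printing the commands first, rather than executing them, leaves a chance to check that only the intended jobs are listed.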
How to Run a Job in Condor: Vanilla Universe
• Some applications, such as shell scripts, cannot be run in the standard or Java universe. Shell scripts can be used to call external applications such as Matlab.
• Create a file called 'count.m' containing the following lines:
    function count(startt, stopp)
    for i=startt:stopp-1
        i
    end
• To run the Matlab program, we need to start the Matlab application and then call our function. A script can do both. Create a file called 'runmatlab.sh':
    #!/bin/sh
    echo "Number of arguments: $#"
    matlab -r "addpath /user/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/; count($1,$2);quit;"
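To see what the script hands to Matlab for a given pair of arguments, the substitution of $1 and $2 can be mimicked in Python. This is only an illustration of the string that results; the directory below is a placeholder rather than the path from the slide, and `matlab_command` is a hypothetical helper that does not start Matlab.

```python
import shlex


def matlab_command(code_dir, start, stop):
    """Build the argument list that runmatlab.sh would pass for the given start/stop."""
    # Same expression as in the script, with $1 and $2 filled in.
    expr = f"addpath {code_dir}; count({start},{stop});quit;"
    return ["matlab", "-r", expr]


# Placeholder directory for illustration only.
argv = matlab_command("/path/to/condortutorial/", 1, 30)
print(shlex.join(argv))
```

With the arguments 1 and 30, the expression passed after -r calls count(1,30), so the job prints the numbers 1 to 29.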
How to Run a Job in Condor: Vanilla Universe
• As a final step, prepare the description file. Create a file with the extension '.cmd', such as 'matlabtest.cmd':
    Universe = vanilla
    executable = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial/runmatlab.sh
    Initialdir = /a/filer2/home/filer2/csckmst/css1tt/gt3/samples/csm23_2006/Tutorials/condortutorial
    Requirements = Memory>=20 && Arch == "INTEL" && OpSys == "LINUX"
    Getenv = True
    Log = matlabpro.log
    # main matlab file to execute
    Arguments = 1 30
    Output = matlab1.out
    Error = matlab1.err
    transfer_input_files = count.m
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Queue 1
    # main matlab file to execute
    Arguments = 30 60
    Output = matlab2.out
    Error = matlab2.err
    transfer_input_files = count.m
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Queue 1
How to Run a Job in Condor: Vanilla Universe
• To submit the job, run:
    condor_submit matlabtest.cmd
• To see the output of your job, run:
    more matlab1.out
    more matlab2.out
EXERCISE: Submit the Matlab and Java counter programs together using the same description file. Both programs should be specified in the vanilla universe.