Bigben Pittsburgh Supercomputing Center J. Ray Scott scott@psc.edu
Outline • Running A Job • Scheduling Policies • Batch Access • Interactive Access • Packing Jobs • Monitoring And Killing Jobs • Programming Tools
Workshop Scheduling
• For the workshop, users should submit jobs to the "training" queue, either on the command line:
    qsub -q training
  or in their job scripts:
    #PBS -q training
• At the end of a user job, PBS extracts the relevant lines from the system console logs (the console output from each cpu) for the user's job and places them in a file in the user's $HOME named job_<jid>_console.log, where <jid> is the PBS job id. Viewing this file after the run can provide some details about a job failure.
Scheduling Policies
The Portable Batch System (PBS) controls all access to bigben's compute processors, for both batch and interactive jobs. PBS on bigben currently has two queues, "batch" and "debug". Interactive and batch jobs compete in these queues for scheduling, and the queues are controlled through two different modes during a 24-hour day.
• The "batch" queue is the default queue, so it does not need to be explicitly named in a job submission. It is active during both Day Mode and Night Mode.
• The "debug" queue must be explicitly named in a job script:
    #PBS -q debug
  Jobs in this queue are limited to 32 cpus and 15 minutes of wall-clock time.
PBS specifications are discussed below.
Day Mode
• During the day, defined to be 8am-8pm, 64 cpus are reserved for debugging jobs (jobs run from the "debug" queue).
• Jobs submitted to the "debug" queue may request no more than 32 cpus and 15 minutes of wall-clock time.
• Jobs submitted to the "batch" (default) queue may be any size up to the limit of the machine, but only jobs of 1024 cpus or less will be scheduled to start during Day Mode. "batch" jobs are limited to 6 wall-clock hours in duration.
• Jobs in the "debug" and "batch" queues are ordered FIFO, adjusted to keep any one user from dominating usage and to ensure fair turnaround.
• Jobs started during Day Mode must finish by 8pm, at which time the machine is rebooted.
Night Mode
• During the night, defined to be 8pm-8am (starting after the machine reboot), jobs of 2048 cpus or less are allowed to run and are limited to 6 wall-clock hours in duration.
• Jobs are ordered largest to smallest, adjusted to keep any one user from dominating usage.
• Jobs in the "debug" queue are not allowed to run during Night Mode.
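The limits above can be checked before submitting. The following is a minimal sketch in Bourne shell, not part of PBS: the function name fits_policy and its argument order are hypothetical, and only the per-job cpu and wall-clock limits quoted above are encoded.

```shell
# fits_policy QUEUE MODE CPUS MINUTES
# Hypothetical helper: exit status 0 if a request is within the per-job
# limits quoted in the text, 1 if it is not, 2 for an unknown queue or mode.
fits_policy() {
    queue=$1
    mode=$2
    cpus=$3
    minutes=$4
    case "$queue:$mode" in
        debug:day)   max_cpus=32;   max_minutes=15  ;;
        debug:night) return 1 ;;   # debug jobs do not run in Night Mode
        batch:day)   max_cpus=1024; max_minutes=360 ;;
        batch:night) max_cpus=2048; max_minutes=360 ;;
        *) echo "unknown queue or mode: $queue $mode" >&2; return 2 ;;
    esac
    [ "$cpus" -le "$max_cpus" ] && [ "$minutes" -le "$max_minutes" ]
}
```

For example, fits_policy batch day 2048 60 fails while fits_policy batch night 2048 60 succeeds. The sketch does not model the 8pm reboot, the 64 cpus reserved for debugging, or the fair-usage ordering.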
Scheduling Queues
• Debug: 32 cpus, 15 min (Day Mode only)
• Batch: 1024 cpus, 6 hrs (Day Mode)
• Batch: 2048 cpus, 6 hrs (Night Mode)
Batch Access
• You use the qsub command to submit a job script to PBS. A PBS job script consists of PBS directives, comments and executable commands. The last line of your job script must end with a newline character.
• A sample job script is
    #!/bin/csh
    #PBS -l size=4
    #PBS -l walltime=5:00
    #PBS -j oe
    set echo
    # move to my /scratch directory
    cd /scratch/myscratchdir
    # run my executable
    pbsyod ./hellompi
• The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used.
• The next three lines are PBS directives.
Batch Access (cont’d)
• #PBS -l size=4
  The first directive requests 4 processors.
• #PBS -l walltime=5:00
  The second directive requests 5 minutes of wallclock time. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
• #PBS -j oe
  The final PBS directive combines your .o and .e output into one file, in this case your .o file. This will make your program easier to debug.
• The remaining lines in the script are comments or command lines.
• set echo
  This command causes your batch output to display each command next to its corresponding output. This will make your program easier to debug. If you are using the Bourne shell or one of its descendants, use 'set -x' instead of 'set echo'.
• Comment lines
  The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must begin in column one of your script file. The remaining lines in the sample script are executable commands.
• pbsyod
  The pbsyod command is used to launch your executable on your compute processors. Only programs executed with pbsyod are executed on your compute processors. All other commands are executed on the front end processor. Thus, you must use pbsyod to run your executable or it will run on the front end, where it will probably not work. If it does work, it will degrade system performance.
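Because a walltime such as 5:00 means minutes and seconds rather than hours and minutes, it can help to convert a specification to seconds before submitting. The following is a minimal Bourne shell sketch; the function name walltime_seconds is hypothetical and it is not a PBS command. It assumes the fields are plain digits separated by colons.

```shell
# walltime_seconds SPEC
# Hypothetical helper: prints the number of seconds in a walltime
# specification written as SS, MM:SS or HH:MM:SS.
walltime_seconds() {
    spec=$1
    total=0
    old_ifs=$IFS
    IFS=:
    set -- $spec          # split the specification on colons
    IFS=$old_ifs
    for field in "$@"; do
        # Drop leading zeroes so a field such as "08" is not read as octal
        while [ "${#field}" -gt 1 ] && [ "${field#0}" != "$field" ]; do
            field=${field#0}
        done
        total=$(( total * 60 + field ))
    done
    echo "$total"
}
```

With this sketch, walltime_seconds 5:00 prints 300 (five minutes), while walltime_seconds 5:00:00 prints 18000 (five hours).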
Batch Access (cont’d)
• Within your batch script the variable PBS_O_WORKDIR is set to the directory from which you issued your qsub command. The variable PBS_O_SIZE is set to the number of processors you requested.
• After you create your script you must make it executable with the chmod command.
    chmod 755 myscript.job
• Then you can submit it to PBS with the qsub command.
    qsub myscript.job
• Your batch output (your .o and .e files) is returned to the directory from which you issued the qsub command after your job finishes.
• You can also specify PBS directives as command-line options to qsub. Thus, you could omit the PBS directives in the sample script above and submit the script with
    qsub -l size=4 -l walltime=5:00 -j oe myscript.job
• Command-line options override PBS directives included in your script.
• The -M and -m options can be used to have the system send you email when your job undergoes specified state transitions.
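The create, chmod and submit steps can be sketched as a small Bourne shell helper. This is an illustration only: the function name write_job_script is hypothetical, the generated script uses #!/bin/sh with 'set -x' in place of the csh sample's 'set echo', and it changes to PBS_O_WORKDIR rather than a scratch directory. Nothing is submitted; qsub is left to the user.

```shell
# write_job_script FILE SIZE WALLTIME EXECUTABLE
# Hypothetical helper: writes a Bourne shell job script to FILE and
# makes it executable, as required before qsub.
write_job_script() {
    file=$1
    size=$2
    walltime=$3
    exe=$4
    # The here-document lines start in column one, as PBS directives must.
    cat > "$file" <<EOF
#!/bin/sh
#PBS -l size=$size
#PBS -l walltime=$walltime
#PBS -j oe
set -x
# move to the directory from which qsub was issued
cd "\$PBS_O_WORKDIR"
# run the executable on the compute processors
pbsyod $exe
EOF
    chmod 755 "$file"
}
```

A typical use would be write_job_script myscript.job 4 5:00 ./hellompi followed by qsub myscript.job.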
Interactive Access
• A form of interactive access is available by using the -I option to qsub. For example, the command
    qsub -I -l walltime=10:00 -l size=2
  requests interactive access to 2 processors for 10 minutes.
• The system will respond with a message similar to
    qsub: waiting for job 54.bigben.psc.edu to start
• Your qsub request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
• When your job starts you will receive the message
    qsub: job 54.bigben.psc.edu ready
  and then your shell prompt. You can use the -M and -m options to qsub to have the system send you email when your job has started.
• At this point any commands you enter will be run as if you had entered them in a batch script. To run on the compute processors allocated to your interactive job you must use the pbsyod command.
• Stdin, stdout, and stderr are all connected to your terminal, although you will need to use input redirection for your MPI code to read stdin.
• When you are finished with your interactive session type ^D. The system will respond
    qsub: job 54.bigben.psc.edu completed
• When you use qsub -I you are charged for the total time that you hold your processors and your memory, whether you are computing or not. Thus, as soon as you are done running executables you should type ^D.
Packing Jobs
• You can pack several pbsyod commands into a single job and have each of them run on a distinct set of processors. This will allow you to increase the number of total processors your job asks for, which will become important once the scheduler is changed to favor large jobs.
• For example, the job
    #!/bin/csh
    #PBS -l size=12
    #PBS -l walltime=30:00
    #PBS -j oe
    set echo
    cd /scratch/myscratchdir
    pbsyod -size 4 -base 0 ./mympi < mpi1.in
    pbsyod -size 4 -base 4 ./mympi < mpi2.in
    pbsyod -size 4 -base 8 ./mympi < mpi3.in
  will launch three executions, each on a distinct set of 4 processors.
• The -size option to pbsyod indicates how many processors a pbsyod is to use. The default is to use all of your compute processors.
• The -base option indicates on which processor a pbsyod should begin executing, with your first processor having a base of 0. Thus, the first pbsyod above will begin executing on your first processor and use 4 processors, the second will run on the next 4 processors starting with your fifth processor, and the third will run on your final 4 processors.
• If you do not use the -base option, all of your executions will run on top of each other on the same set of processors.
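The -base arithmetic can be sketched as follows: each run starts where the previous one ended, so the base advances by the -size value. This Bourne shell helper is hypothetical (the name packed_pbsyod is not a bigben command) and only prints the pbsyod lines for pasting into a job script; it does not run them.

```shell
# packed_pbsyod SIZE EXECUTABLE INPUT...
# Hypothetical helper: prints one pbsyod line per input file, giving each
# run its own block of SIZE processors by advancing -base by SIZE.
packed_pbsyod() {
    size=$1
    exe=$2
    shift 2
    base=0
    for input in "$@"; do
        echo "pbsyod -size $size -base $base $exe < $input"
        base=$(( base + size ))
    done
}
```

Calling packed_pbsyod 4 ./mympi mpi1.in mpi2.in mpi3.in prints the three pbsyod lines of the example job, with bases 0, 4 and 8. The job's size request then has to cover SIZE times the number of runs, 12 in this case.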
Monitoring and Killing Jobs
• The qstat -a command is used to display the status of the PBS queue. It includes running and queued jobs. For each job in the queue it shows the amount of walltime and number of processors requested. This information can be useful in predicting when your job might run. The -f option to qstat provides you with more extensive status information for a single job.
• The shownids command, located in /usr/local/bin, shows you the status of all the compute processors on bigben. A nid is a node id or processor. The output of shownids shows the number of processors in certain types of states:
  Enabled processors are all processors available to PBS for scheduling.
  Allocated processors are those enabled processors that are currently running jobs.
  Free processors are those enabled processors that are currently free.
  You can use the output from shownids and qstat -a to determine when your jobs might start.
• The qdel command is used to kill queued and running jobs.
    qdel 54
• The argument to qdel is the jobid of the job you want to kill. If you cannot kill a job that you want to kill, send email to remarks@psc.edu.
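The qsub messages shown earlier name the job as 54.bigben.psc.edu, while the qdel example uses 54. Assuming the short form is simply the part before the first dot, a small Bourne shell helper can derive it; the function name short_jobid is hypothetical.

```shell
# short_jobid FULL_ID
# Hypothetical helper: prints the part of a job id before the first dot,
# on the assumption that this is the number qdel is given in the example.
short_jobid() {
    echo "${1%%.*}"
}
```

For example, qdel "$(short_jobid 54.bigben.psc.edu)" would run qdel 54.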