An Overview of the Portable Batch System

Gabriel Mateescu
National Research Council Canada, IMSB
gabriel.mateescu@nrc.ca
www.sao.nrc.ca/~gabriel/presentations/sgi_pbs
Outline
• PBS highlights
• PBS components
• Resources managed by PBS
• Choosing a PBS scheduler
• Installation and configuration of PBS
• PBS scripts and commands
• Adding preemptive job scheduling to PBS
PBS Highlights
• Developed by Veridian / MRJ
• Robust, portable, effective, extensible batch job queuing and resource management system
• Supports different schedulers
• Supports heterogeneous clusters
• OpenPBS - open source version
• PBS Pro - commercial version
Recent Versions of PBS
• PBS 2.2, November 1999:
  • both the FIFO and SGI schedulers have bugs in enforcing resource limits
  • poor support for stopping and resuming jobs
• OpenPBS 2.3, September 2000:
  • better FIFO scheduler: resource limits enforced, backfilling added
• PBS Pro 5.0, September 2000:
  • claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
• PBS manages jobs, CPUs, memory, hosts, and queues
• PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter
• Resources describe attributes of jobs, queues, and hosts
• The scheduler chooses the jobs that fit within the queue and cluster resources
Main Components of PBS
• Three daemons:
  • pbs_server - the server
  • pbs_sched - the scheduler
  • pbs_mom - the job executor and resource monitor
• The server accepts commands and communicates with the other daemons:
  • qsub - submit a job
  • qstat - view queue and job status
  • qalter - change a job's attributes
  • qdel - delete a job
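A minimal job lifecycle using these commands (the job ID and host name below are illustrative):

  % qsub job.pbs
  13.node0.bar.com
  % qstat 13
  % qdel 13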
[Figure: Batch Queuing - jobs pass through Queue A and Queue B onto an SGI Origin system under job-exclusive scheduling; each node provides CPUs and memory]
Resource Examples
• ncpus    - number of CPUs per job
• mem      - resident memory per job
• pmem     - per-process memory
• vmem     - virtual memory per job
• cput     - CPU time per job
• walltime - real time per job
• file     - file size per job
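These resources are requested with the -l option of qsub or with #PBS -l directives in a job script; for example (the values are illustrative):

  % qsub -l ncpus=4,mem=512mb,walltime=02:00:00 job.pbs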
Resource limits
• resources_max - per-job limit for a resource; determines whether a job fits in a queue
• resources_default - default amount of a resource assigned to a job
• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
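These limits are set as queue and server attributes with qmgr; a minimal sketch (the queue name and values are illustrative):

  s q hpc resources_max.ncpus = 4
  s q hpc resources_default.mem = 256mb
  s server resources_available.ncpus = 8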
Choosing a Scheduler (1)
• FIFO scheduler:
  • First-fit placement: enqueues a job in the first queue whose limits the job can satisfy, even if that queue currently lacks free resources while another queue could run the job immediately
  • Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem
  • Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
• Algorithms in the FIFO scheduler (selected in its configuration file; see the sketch below):
  • FIFO - sort jobs by queuing time, running the earliest job first
  • Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order
  • Fair share - sort and schedule jobs based on past usage of the machine by the job owners
  • Round-robin - pick a job from each queue in turn
  • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
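An illustrative fragment of the FIFO scheduler's sched_config file selecting among these algorithms (option names follow the sample configuration shipped with OpenPBS 2.3; check them against the file in your installation):

  round_robin: false ALL
  strict_fifo: false ALL
  fair_share: false ALL
  help_starving_jobs: true ALL
  sort_by: shortest_job_first ALL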
Choosing a Scheduler (3)
• The FIFO scheduler supports round-robin load balancing as of version 2.3
• The FIFO scheduler:
  • decouples a job's requirement on the number of CPUs from its requirement on the amount of memory
  • uses simple first-fit placement, which may force the user to specify an execution queue explicitly when the job could fit in more than one queue
Choosing a Scheduler (4)
• SGI scheduler:
  • supports FIFO, fair share, and backfilling, and attempts to avoid job starvation
  • supports both per-job and per-queue limits on the number of CPUs and on memory
  • the per-server limit is the number of node cards
  • makes a best effort in choosing a queue in which to run a job; a job for which there are not enough resources to run is kept in the submit queue
  • ties the number of CPUs allocated to the memory allocated per job
Resource allocation
• The SGI scheduler allocates nodes: node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]
• The number of nodes N for a job is the smallest N such that
  [ ncpus, mem ] <= [ N * PE_PER_NODE, N * MB_PER_NODE ]
  where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem
• The job attributes Resource_List.{ncpus, mem} are set to
  Resource_List.ncpus = N * PE_PER_NODE
  Resource_List.mem = N * MB_PER_NODE
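For example, with PE_PER_NODE = 2 and MB_PER_NODE = 512 MB, a job requesting ncpus=3 and mem=600mb needs N = max(ceil(3/2), ceil(600/512)) = 2 nodes, so the scheduler sets Resource_List.ncpus = 4 and Resource_List.mem = 1024mb.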
Queue and Server Limits
• FIFO scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • as of version 2.3, resources_max also defines per-queue limits
  • per-server resource limits are enforced with the resources_available attributes
Queue and Server Limits
• SGI scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • resources_max also defines per-queue limits
  • the per-server limit is given by the number of Origin node cards; unlike the FIFO scheduler, resources_available limits are not enforced
Job enqueuing (1)
• The scheduler places each job in some queue
• This involves several tests for resources
• Which queue a job is enqueued into depends on:
  • which limits are tested
  • first-fit versus best-fit placement
• A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue; for example, for the resource ncpus:
  Resource_List.ncpus <= resources_max.ncpus
Job enqueuing (2)
• A job fits in a queue if the resources already assigned to the queue plus the requested resources do not exceed the queue's resource maximums; for example, for ncpus:
  resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus
• A job fits in the system if the sum over all queues of the assigned resources plus the requested resources does not exceed the available resources; for example, for ncpus:
  Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
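For example, if a queue has resources_max.ncpus = 8 and resources_assigned.ncpus = 6, a job with Resource_List.ncpus = 2 still fits in the queue (6 + 2 <= 8), while a job requesting 4 CPUs does not.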
First fit versus best fit
• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
  • if the job does not actually fit, it waits for the requested resources in the execution queue
• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
• If queues are defined with monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
• However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
• Per-queue and per-server limits on the number of running jobs:
  • max_running
  • max_user_run, max_group_run - maximum number of running jobs per user or group
• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per-queue basis
• The SGI scheduler also enforces MAX_JOBS from the scheduler config file, a substitute for max_running (see the example below)
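These limits are set with qmgr; for example (the queue name and values are illustrative):

  s q hpc max_running = 10
  s q hpc max_user_run = 2
  s server max_running = 32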
SGI Origin Install (1)
• Source files are under OpenPBS_v2_3/src
• Consider the SGI scheduler
• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware:

  #define MB_PER_NODE ((size_t) 512*1024*1024)
  #define PE_PER_NODE 2

• PE_PER_NODE may be set to 1 to allocate half-nodes, if MB_PER_NODE is set accordingly
SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (line 198): without parentheses around the mask, == binds more tightly than &, so the test is always false and schd_evaluate_system() is bypassed

  for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
      if ((qptr->queue->flags & QFLAGS_FULL) == 0) {  /* parenthesize the mask */
          if (!schd_evaluate_system(...)) {
              /* DONT_START_JOB (0), so don't change allfull */
              continue;
          }
          /* ... */
      }
  }
SGI Origin Install (3)
• Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

  for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
      if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
          if (!schd_evaluate_system(...)) {
              /* DONT_START_JOB (0), so don't change allfull */
              continue;
          }
          /* ... */
      }
  }

  /* if allfull is set, do not attempt to schedule */
  for (qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next) {
      /* ... */
  }
SGI Origin Install (4)
• Fix of a logical bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test for equality between the job's state and 'R':

  user_running(...) {
      for (job = queue->jobs; job != NULL; job = job->next) {
          if ((job->state == 'R') && (!strcmp(job->owner, user)))
              jobs_running++;
          /* ... */
      }
  }
SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array: with SGI_ZOMBIE_WRONG defined, the return statement below is compiled out

  #define SGI_ZOMBIE_WRONG 1

  int mom_over_limit( ... ) {
      /* ... */
  #if !defined(SGI_ZOMBIE_WRONG)
      return (TRUE);  /* never compiled, so the ncpus limit is not enforced */
  #endif
      /* ... */
  }
SGI Origin Install (6)
• Script to run the configure command:

  #!/bin/csh -f
  set PBS_HOME=/usr/local/pbs
  set PBS_SERVER_HOME=/usr/spool/pbs

  # Select the SGI or the FIFO scheduler
  set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
  #set SCHED="--set-sched-code=fifo --enable-nodemask"

  $HOME/PBS/OpenPBS_v2_3/configure \
      --prefix=$PBS_HOME \
      --set-server-home=$PBS_SERVER_HOME \
      --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
      --set-sched=cc $SCHED --enable-array --enable-debug
SGI Origin Install (7)
• Build and install, then inspect the server home:

  # cd /usr/local/pbs
  # ./makePBS        <- the configure script from the previous slide
  # make
  # make install
  # cd /usr/spool/pbs

• The scheduler's private directory sched_priv contains the files config and decay_usage
Configuring for SGI scheduler
• Queue types:
  • one submit queue
  • one or several execution queues
• Per-server limit on the number of running jobs
• Load control
• Fair share scheduling:
  • past usage of the machine is used in ranking the jobs
  • decayed past usage per user is kept in sched_priv/decay_usage
• Scheduler restart action
• PBS manager tool: qmgr
Queue definition
• File sched_priv/config:

  SUBMIT_QUEUE            submit
  BATCH_QUEUES            hpc,back
  MAX_JOBS                256
  ENFORCE_PRIME_TIME      False
  ENFORCE_DEDICATED_TIME  False
  SORT_BY_PAST_USAGE      True
  DECAY_FACTOR            0.75
  SCHED_ACCT_DIR          /usr/spool/pbs/server_priv/accounting
  SCHED_RESTART_ACTION    RESUBMIT
Load Control
• Load control for the SGI scheduler, in sched_priv/config:

  TARGET_LOAD_PCT       90%
  TARGET_LOAD_VARIANCE  -15%,+10%

• Load control for the FIFO scheduler, in mom_priv/config:

  $max_load 2.0
  $ideal_load 1.0
PBS for SGI scheduler
• Qmgr commands to define the submit queue:

  s server managers = bob@n0.bar.com
  create queue submit
  s q submit queue_type = Execution
  s q submit resources_max.ncpus = 4
  s q submit resources_max.mem = 1gb
  s q submit resources_default.mem = 256mb
  s q submit resources_default.ncpus = 1
  s q submit resources_default.nice = 15
  s q submit enabled = True
  s q submit started = True
PBS for SGI scheduler
• Qmgr commands to define an execution queue:

  create queue hpc
  s q hpc queue_type = Execution
  s q hpc resources_max.ncpus = 2
  s q hpc resources_max.mem = 512mb
  s q hpc resources_default.mem = 256mb
  s q hpc resources_default.ncpus = 1
  s q hpc acl_groups = marley
  s q hpc acl_group_enable = True
  s q hpc enabled = True
  s q hpc started = True
PBS for SGI scheduler
• Server attributes:

  set server default_queue = submit
  s server acl_hosts = *.bar.com
  s server acl_host_enable = True
  s server scheduling = True
  s server query_other_jobs = True
PBS for FIFO scheduler
• The FIFO scheduler uses the file sched_config instead of config, and queues are not defined there
• The submit queue is a Route queue:

  s q submit queue_type = Route
  s q submit route_destinations = hpc
  s q submit route_destinations += back

• Server attributes:

  s server resources_available.mem = 1gb
  s server resources_available.ncpus = 4
PBS Job Scripts
• Job scripts contain PBS directives and shell commands:

  #PBS -l ncpus=2              # request 2 CPUs
  #PBS -l walltime=12:20:00    # request 12h20m of real time
  #PBS -m ae                   # mail when the job aborts or ends
  #PBS -c c=30                 # checkpoint every 30 minutes
  cd ${PBS_O_WORKDIR}          # move to the directory qsub was run from
  mpirun -np 2 foo.x
Basic PBS commands
• Jobs are submitted with qsub:
  % qsub [-q hpc] foo.pbs
  13.node0.bar.com
• Job status is queried with qstat [-f|-a], which reports job owner, name, queue, status, session ID, number of CPUs, and walltime:
  % qstat -a 13
• Job attributes are altered with qalter:
  % qalter -l walltime=20:00:00 13
Job Submission and Tracking
• Find jobs in status R (running) or submitted by user bob:
  % qselect -s R
  % qselect -u bob
• Query queue status to find whether the queue is enabled/started and the number of jobs in the queue:
  % qstat [-f | -a] -Q
• Delete a job:
  % qdel 13
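The output of qselect can be fed to other commands; for example, to delete all of bob's jobs:

  % qdel `qselect -u bob`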
Job Environment and I/O
• The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
• The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory; override this with #PBS -o | -e pathname
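For example (the pathnames are illustrative):

  #PBS -o /home/bob/myjob.out
  #PBS -e /home/bob/myjob.err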
Tips
• Trace the history of a job with tracejob, which gives a time-stamped sequence of events affecting the job:
  % tracejob 13
• Use cron jobs to clean up daemon work files under mom_logs, sched_logs, and server_logs:

  # crontab -e
  9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
  9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
  9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
[Figure: Sample PBS Front-End - node0 is the submission server (qsub, qdel, ...) and node1 the execution server; the daemons pbs_server, pbs_sched, and pbs_mom serve the cluster]
PBS for clusters
• File staging copies files (other than stdout/stderr) between a submission-only host and the server:

  #PBS -W stagein=/tmp/bar@n1:/home/bar/job1
  #PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1

• PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede job start, which helps hide latency
Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes:

  n0 np=2 gaussian
  n1 np=2 irix

• n0:$PBS_SERVER_HOME/mom_priv/config:

  $clienthost n1
  $ideal_load 1.5
  $max_load 2.0

• n1:$PBS_SERVER_HOME/mom_priv/config:

  $ideal_load 1.5
  $max_load 2.0
Setting up a PBS Cluster
• Qmgr commands:

  s server managers = bob@n0.bar.com
  create queue hpc
  s q hpc queue_type = Execution
  s q hpc Priority = 100
  s q hpc resources_max.ncpus = 2
  s q hpc resources_max.nodect = 1
  s q hpc acl_groups = marley
  s q hpc acl_group_enable = True
Setting up a PBS Cluster
• Server attributes:

  set server default_node = n0
  set server default_queue = hpc
  s server acl_hosts = *.bar.com
  s server acl_host_enable = True
  s s resources_default.nodect = 1
  s s resources_default.nodes = 1
  s s resources_default.neednodes = 1
  set server max_user_run = 2
PBS features
• The job submitter can request a number of nodes with certain properties
• For example:
  • request a node with the property gaussian:
    #PBS -l nodes=gaussian
  • request two nodes with the property irix:
    #PBS -l nodes=2:irix
PBS Security Features
• All files used by PBS are owned by root and can be written only by root
• The configuration files sched_priv/config and mom_priv/config are readable only by root
• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
• The pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
• The server accepts commands from selected hosts and users
Why preemptive scheduling?
• Resource reservation (CPU, memory) is needed to achieve high job throughput
• Static resource reservation may lead to low machine utilization and high job waiting times, and hence slow job turn-around
• An approach is needed that achieves both high job throughput and rapid job turn-around
[Figure: Static Reservation Pitfall (1) - a parallel computer or cluster statically partitioned between a Physics group and a Biotech group; nodes (CPU + memory) on either side of the partition boundary receive the groups' job requests]
Static Reservation Pitfall (2)
• The Physics group's Job 1 is assigned 3 nodes and dispatched
• The Biotech group's Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
• However, the system as a whole has enough resources for Job 3
Proposed Approach (1)
• Leverage the features of the Portable Batch System (PBS)
• Extend PBS with preemptive job scheduling
• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted; these are the dedicated queues
• Define one queue for jobs that may be preempted: the background queue (a sketch of this queue layout follows)
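A minimal qmgr sketch of such a layout (queue names and limits are illustrative; the preemption logic itself must be added to the scheduler and is not expressed by these attributes):

  create queue phys
  s q phys queue_type = Execution
  s q phys resources_max.ncpus = 3
  create queue back
  s q back queue_type = Execution
  s q back Priority = 0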