An Overview of the Portable Batch System

Gabriel Mateescu
National Research Council Canada, IMSB
gabriel.mateescu@nrc.ca
www.sao.nrc.ca/~gabriel/presentations/sgi_pbs
Outline
• PBS highlights
• PBS components
• Resources managed by PBS
• Choosing a PBS scheduler
• Installation and configuration of PBS
• PBS scripts and commands
• Adding preemptive job scheduling to PBS
PBS Highlights
• Developed by Veridian / MRJ
• Robust, portable, effective, extensible batch job queuing and resource management system
• Supports different schedulers
• Supports heterogeneous clusters
• OpenPBS - open source version
• PBS Pro - commercial version
Recent Versions of PBS
• PBS 2.2, November 1999:
  • both the FIFO and SGI schedulers have bugs in enforcing resource limits
  • poor support for stopping and resuming jobs
• OpenPBS 2.3, September 2000:
  • better FIFO scheduler: resource limits enforced, backfilling added
• PBS Pro 5.0, September 2000:
  • claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS
• PBS manages jobs, CPUs, memory, hosts, and queues
• PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter
• Resources describe attributes of jobs, queues, and hosts
• The scheduler chooses the jobs that fit within the queue and cluster resources
Main Components of PBS
• Three daemons:
  • pbs_server - the server
  • pbs_sched - the scheduler
  • pbs_mom - the job executor and resource monitor
• The server accepts commands and communicates with the other daemons:
  • qsub - submit a job
  • qstat - view queue and job status
  • qalter - change a job's attributes
  • qdel - delete a job
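A minimal job lifecycle using these commands (the job ID and host name below are illustrative):

  % qsub job.pbs
  13.node0.bar.com
  % qstat 13
  % qdel 13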
[Figure: Batch Queuing - jobs pass through Queue A and Queue B onto an SGI Origin system under job-exclusive scheduling; each node provides CPUs and memory]
Resource Examples
• ncpus    - number of CPUs per job
• mem      - resident memory per job
• pmem     - per-process memory
• vmem     - virtual memory per job
• cput     - CPU time per job
• walltime - real time per job
• file     - file size per job
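These resources are requested with the -l option of qsub or with #PBS -l directives in a job script; for example (the values are illustrative):

  % qsub -l ncpus=4,mem=512mb,walltime=02:00:00 job.pbs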
Resource limits
• resources_max - per-job limit for a resource; determines whether a job fits in a queue
• resources_default - default amount of a resource assigned to a job
• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
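These limits are set as queue and server attributes with qmgr; a minimal sketch (the queue name and values are illustrative):

  s q hpc resources_max.ncpus = 4
  s q hpc resources_default.mem = 256mb
  s server resources_available.ncpus = 8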
Choosing a Scheduler (1)
• FIFO scheduler:
  • First-fit placement: enqueues a job in the first queue whose limits the job can satisfy, even if that queue currently lacks free resources while another queue could run the job immediately
  • Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem
  • Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2)
• Algorithms in the FIFO scheduler (selected in its configuration file; see the sketch below):
  • FIFO - sort jobs by queuing time, running the earliest job first
  • Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order
  • Fair share - sort and schedule jobs based on past usage of the machine by the job owners
  • Round-robin - pick a job from each queue in turn
  • By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
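An illustrative fragment of the FIFO scheduler's sched_config file selecting among these algorithms (option names follow the sample configuration shipped with OpenPBS 2.3; check them against the file in your installation):

  round_robin: false ALL
  strict_fifo: false ALL
  fair_share: false ALL
  help_starving_jobs: true ALL
  sort_by: shortest_job_first ALL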
Choosing a Scheduler (3)
• The FIFO scheduler supports round-robin load balancing as of version 2.3
• The FIFO scheduler:
  • decouples a job's requirement on the number of CPUs from its requirement on the amount of memory
  • uses simple first-fit placement, which may force the user to specify an execution queue explicitly when the job could fit in more than one queue
Choosing a Scheduler (4)
• SGI scheduler:
  • supports FIFO, fair share, and backfilling, and attempts to avoid job starvation
  • supports both per-job and per-queue limits on the number of CPUs and on memory
  • the per-server limit is the number of node cards
  • makes a best effort in choosing a queue in which to run a job; a job for which there are not enough resources to run is kept in the submit queue
  • ties the number of CPUs allocated to the memory allocated per job
Resource allocation
• The SGI scheduler allocates nodes: node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]
• The number of nodes N for a job is the smallest N such that
  [ ncpus, mem ] <= [ N * PE_PER_NODE, N * MB_PER_NODE ]
  where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem
• The job attributes Resource_List.{ncpus, mem} are set to
  Resource_List.ncpus = N * PE_PER_NODE
  Resource_List.mem = N * MB_PER_NODE
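For example, with PE_PER_NODE = 2 and MB_PER_NODE = 512 MB, a job requesting ncpus=3 and mem=600mb needs N = max(ceil(3/2), ceil(600/512)) = 2 nodes, so the scheduler sets Resource_List.ncpus = 4 and Resource_List.mem = 1024mb.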
Queue and Server Limits
• FIFO scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • as of version 2.3, resources_max also defines per-queue limits
  • per-server resource limits are enforced with the resources_available attributes
Queue and Server Limits
• SGI scheduler:
  • per-job limits (ncpus, mem) are defined by the resources_max queue attributes
  • resources_max also defines per-queue limits
  • the per-server limit is given by the number of Origin node cards; unlike the FIFO scheduler, resources_available limits are not enforced
Job enqueuing (1)
• The scheduler places each job in some queue
• This involves several tests for resources
• Which queue a job is enqueued into depends on:
  • which limits are tested
  • first-fit versus best-fit placement
• A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue; for example, for the resource ncpus:
  Resource_List.ncpus <= resources_max.ncpus
Job enqueuing (2)
• A job fits in a queue if the resources already assigned to the queue plus the requested resources do not exceed the queue's resource maximums; for example, for ncpus:
  resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus
• A job fits in the system if the sum over all queues of the assigned resources plus the requested resources does not exceed the available resources; for example, for ncpus:
  Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
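For example, if a queue has resources_max.ncpus = 8 and resources_assigned.ncpus = 6, a job with Resource_List.ncpus = 2 still fits in the queue (6 + 2 <= 8), while a job requesting 4 CPUs does not.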
First fit versus best fit
• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue
  • if the job does not actually fit, it waits for the requested resources in the execution queue
• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue
• If queues are defined with monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty
• However, if a job can fit in several queues, the SGI scheduler will find a better schedule
Limits on the number of running jobs
• Per-queue and per-server limits on the number of running jobs:
  • max_running
  • max_user_run, max_group_run - maximum number of running jobs per user or group
• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per-queue basis
• The SGI scheduler also enforces MAX_JOBS from the scheduler config file, a substitute for max_running (see the example below)
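These limits are set with qmgr; for example (the queue name and values are illustrative):

  s q hpc max_running = 10
  s q hpc max_user_run = 2
  s server max_running = 32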
SGI Origin Install (1)
• Source files are under OpenPBS_v2_3/src
• Consider the SGI scheduler
• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware:

  #define MB_PER_NODE ((size_t) 512*1024*1024)
  #define PE_PER_NODE 2

• PE_PER_NODE may be set to 1 to allocate half-nodes, if MB_PER_NODE is set accordingly
SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (line 198): without parentheses around the mask, == binds more tightly than &, so the test is always false and schd_evaluate_system() is bypassed

  for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
      if ((qptr->queue->flags & QFLAGS_FULL) == 0) {  /* parenthesize the mask */
          if (!schd_evaluate_system(...)) {
              /* DONT_START_JOB (0), so don't change allfull */
              continue;
          }
          /* ... */
      }
  }
SGI Origin Install (3)
• Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

  for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
      if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
          if (!schd_evaluate_system(...)) {
              /* DONT_START_JOB (0), so don't change allfull */
              continue;
          }
          /* ... */
      }
  }

  /* if allfull is set, do not attempt to schedule */
  for (qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next) {
      /* ... */
  }
SGI Origin Install (4)
• Fix of a logical bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test for equality between the job's state and 'R':

  user_running(...) {
      for (job = queue->jobs; job != NULL; job = job->next) {
          if ((job->state == 'R') && (!strcmp(job->owner, user)))
              jobs_running++;
          /* ... */
      }
  }
SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array: with SGI_ZOMBIE_WRONG defined, the return statement below is compiled out

  #define SGI_ZOMBIE_WRONG 1

  int mom_over_limit( ... ) {
      /* ... */
  #if !defined(SGI_ZOMBIE_WRONG)
      return (TRUE);  /* never compiled, so the ncpus limit is not enforced */
  #endif
      /* ... */
  }
SGI Origin Install (6)
• Script to run the configure command:

  #!/bin/csh -f
  set PBS_HOME=/usr/local/pbs
  set PBS_SERVER_HOME=/usr/spool/pbs

  # Select the SGI or the FIFO scheduler
  set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
  #set SCHED="--set-sched-code=fifo --enable-nodemask"

  $HOME/PBS/OpenPBS_v2_3/configure \
      --prefix=$PBS_HOME \
      --set-server-home=$PBS_SERVER_HOME \
      --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
      --set-sched=cc $SCHED --enable-array --enable-debug
SGI Origin Install (7)
• Build and install, then inspect the server home:

  # cd /usr/local/pbs
  # ./makePBS        <- the configure script from the previous slide
  # make
  # make install
  # cd /usr/spool/pbs

• The scheduler's private directory sched_priv contains the files config and decay_usage
Configuring for SGI scheduler
• Queue types:
  • one submit queue
  • one or several execution queues
• Per-server limit on the number of running jobs
• Load control
• Fair share scheduling:
  • past usage of the machine is used in ranking the jobs
  • decayed past usage per user is kept in sched_priv/decay_usage
• Scheduler restart action
• PBS manager tool: qmgr
Queue definition
• File sched_priv/config:

  SUBMIT_QUEUE            submit
  BATCH_QUEUES            hpc,back
  MAX_JOBS                256
  ENFORCE_PRIME_TIME      False
  ENFORCE_DEDICATED_TIME  False
  SORT_BY_PAST_USAGE      True
  DECAY_FACTOR            0.75
  SCHED_ACCT_DIR          /usr/spool/pbs/server_priv/accounting
  SCHED_RESTART_ACTION    RESUBMIT
Load Control
• Load control for the SGI scheduler, in sched_priv/config:

  TARGET_LOAD_PCT       90%
  TARGET_LOAD_VARIANCE  -15%,+10%

• Load control for the FIFO scheduler, in mom_priv/config:

  $max_load 2.0
  $ideal_load 1.0
PBS for SGI scheduler
• Qmgr commands to define the submit queue:

  s server managers = bob@n0.bar.com
  create queue submit
  s q submit queue_type = Execution
  s q submit resources_max.ncpus = 4
  s q submit resources_max.mem = 1gb
  s q submit resources_default.mem = 256mb
  s q submit resources_default.ncpus = 1
  s q submit resources_default.nice = 15
  s q submit enabled = True
  s q submit started = True
PBS for SGI scheduler
• Qmgr commands to define an execution queue:

  create queue hpc
  s q hpc queue_type = Execution
  s q hpc resources_max.ncpus = 2
  s q hpc resources_max.mem = 512mb
  s q hpc resources_default.mem = 256mb
  s q hpc resources_default.ncpus = 1
  s q hpc acl_groups = marley
  s q hpc acl_group_enable = True
  s q hpc enabled = True
  s q hpc started = True
PBS for SGI scheduler
• Server attributes:

  set server default_queue = submit
  s server acl_hosts = *.bar.com
  s server acl_host_enable = True
  s server scheduling = True
  s server query_other_jobs = True
PBS for FIFO scheduler
• The FIFO scheduler uses the file sched_config instead of config, and queues are not defined there
• The submit queue is a Route queue:

  s q submit queue_type = Route
  s q submit route_destinations = hpc
  s q submit route_destinations += back

• Server attributes:

  s server resources_available.mem = 1gb
  s server resources_available.ncpus = 4
PBS Job Scripts
• Job scripts contain PBS directives and shell commands:

  #PBS -l ncpus=2              # request 2 CPUs
  #PBS -l walltime=12:20:00    # request 12h20m of real time
  #PBS -m ae                   # mail when the job aborts or ends
  #PBS -c c=30                 # checkpoint every 30 minutes
  cd ${PBS_O_WORKDIR}          # move to the directory qsub was run from
  mpirun -np 2 foo.x
Basic PBS commands
• Jobs are submitted with qsub:
  % qsub [-q hpc] foo.pbs
  13.node0.bar.com
• Job status is queried with qstat [-f|-a], which reports job owner, name, queue, status, session ID, number of CPUs, and walltime:
  % qstat -a 13
• Job attributes are altered with qalter:
  % qalter -l walltime=20:00:00 13
Job Submission and Tracking
• Find jobs in status R (running) or submitted by user bob:
  % qselect -s R
  % qselect -u bob
• Query queue status to find whether the queue is enabled/started and the number of jobs in the queue:
  % qstat [-f | -a] -Q
• Delete a job:
  % qdel 13
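The output of qselect can be fed to other commands; for example, to delete all of bob's jobs:

  % qdel `qselect -u bob`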
Job Environment and I/O
• The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script
• The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory; override this with #PBS -o | -e pathname
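For example (the pathnames are illustrative):

  #PBS -o /home/bob/myjob.out
  #PBS -e /home/bob/myjob.err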
Tips
• Trace the history of a job with tracejob, which gives a time-stamped sequence of events affecting the job:
  % tracejob 13
• Use cron jobs to clean up daemon work files under mom_logs, sched_logs, and server_logs:

  # crontab -e
  9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
  9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
  9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
[Figure: Sample PBS Front-End - node0 is the submission server (qsub, qdel, ...) and node1 the execution server; the daemons pbs_server, pbs_sched, and pbs_mom serve the cluster]
PBS for clusters
• File staging copies files (other than stdout/stderr) between a submission-only host and the server:

  #PBS -W stagein=/tmp/bar@n1:/home/bar/job1
  #PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1

• PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede job start, which helps hide latency
Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes:

  n0 np=2 gaussian
  n1 np=2 irix

• n0:$PBS_SERVER_HOME/mom_priv/config:

  $clienthost n1
  $ideal_load 1.5
  $max_load 2.0

• n1:$PBS_SERVER_HOME/mom_priv/config:

  $ideal_load 1.5
  $max_load 2.0
Setting up a PBS Cluster
• Qmgr commands:

  s server managers = bob@n0.bar.com
  create queue hpc
  s q hpc queue_type = Execution
  s q hpc Priority = 100
  s q hpc resources_max.ncpus = 2
  s q hpc resources_max.nodect = 1
  s q hpc acl_groups = marley
  s q hpc acl_group_enable = True
Setting up a PBS Cluster
• Server attributes:

  set server default_node = n0
  set server default_queue = hpc
  s server acl_hosts = *.bar.com
  s server acl_host_enable = True
  s s resources_default.nodect = 1
  s s resources_default.nodes = 1
  s s resources_default.neednodes = 1
  set server max_user_run = 2
PBS features
• The job submitter can request a number of nodes with certain properties
• For example:
  • request a node with the property gaussian:
    #PBS -l nodes=gaussian
  • request two nodes with the property irix:
    #PBS -l nodes=2:irix
PBS Security Features
• All files used by PBS are owned by root and can be written only by root
• The configuration files sched_priv/config and mom_priv/config are readable only by root
• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root
• The pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config
• The server accepts commands from selected hosts and users
Why preemptive scheduling?
• Resource reservation (CPU, memory) is needed to achieve high job throughput
• Static resource reservation may lead to low machine utilization and high job waiting times, and hence slow job turn-around
• An approach is needed that achieves both high job throughput and rapid job turn-around
[Figure: Static Reservation Pitfall (1) - a parallel computer or cluster statically partitioned between a Physics group and a Biotech group; nodes (CPU + memory) on either side of the partition boundary receive the groups' job requests]
Static Reservation Pitfall (2)
• The Physics group's Job 1 is assigned 3 nodes and dispatched
• The Biotech group's Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group
• However, the system as a whole has enough resources for Job 3
Proposed Approach (1)
• Leverage the features of the Portable Batch System (PBS)
• Extend PBS with preemptive job scheduling
• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted; these are the dedicated queues
• Define one queue for jobs that may be preempted: the background queue (a sketch of this queue layout follows)
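A minimal qmgr sketch of such a layout (queue names and limits are illustrative; the preemption logic itself must be added to the scheduler and is not expressed by these attributes):

  create queue phys
  s q phys queue_type = Execution
  s q phys resources_max.ncpus = 3
  create queue back
  s q back queue_type = Execution
  s q back Priority = 0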