Learn how to run and manage jobs on the Blue Gene system using LoadLeveler, including defining partitions, finding resources, and interacting with the Blue Gene API.
LoadLeveler Blue Gene Support
Enci Zhong, LoadLeveler Development
Interaction with Blue Gene
[Diagram] Jobs are submitted to LoadLeveler. LoadLeveler runs each job through Blue Gene mpirun, and it finds resources for jobs, defines partitions, and gets resource and job data through the Blue Gene Bridge API.
LoadLeveler Daemons
[Diagram] Front End Node: Master (LoadL_master), Schedd (LoadL_schedd), and Startd/Starter (LoadL_startd/LoadL_starter), which launch jobs through Blue Gene mpirun. Service Node: Master (LoadL_master) and Central Manager (LoadL_negotiator), which reaches Blue Gene through the Blue Gene Bridge API.
LoadLeveler Configuration
[Diagram] Both the Front End Node and the Service Node have /etc/LoadL.cfg and their own LoadL_config.local; the LoadL_config and LoadL_admin files are shared between them.
LoadL_config

SCHEDULER_TYPE = BACKFILL
NEGOTIATOR_CYCLE_DELAY = 10
VM_IMAGE_ALGORITHM = FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY
BG_ENABLED = true
BG_CACHE_PARTITIONS = true
BG_MIN_PARTITION_SIZE = 32
CM_CHECK_USERID = false
BG_ALLOW_LL_JOBS_ONLY = false
LoadL_admin

<mySN>: type = machine
  central_manager = true
<myFEN>: type = machine
  central_manager = false
  schedd_host = true   # Allow jobs to be submitted from the SN

small: type = class
  include_bg = R00-M0
row1: type = class
  include_bg = R1
medium: type = class
  exclude_bg = R0 R1
LoadL_config.local

Service Node:
START_DAEMONS = TRUE
SCHEDD_RUNS_HERE = FALSE
STARTD_RUNS_HERE = FALSE

Front End Node:
START_DAEMONS = TRUE
SCHEDD_RUNS_HERE = TRUE
STARTD_RUNS_HERE = TRUE
MAX_STARTERS = 60
CLASS = small(10) row1(20) medium(30) large(10)

Note: mpirun runs on the FEN and uses very little resource, so many mpirun processes can share the same FEN.
Before Starting LoadLeveler on Blue Gene/P
• Standalone mpirun must work
• Add userid loadl to the bgpadmin group
• /usr/lib64/libdb2.so must exist
• In the login profile of userid loadl, add:
export BRIDGE_CONFIG_FILE=/bgsys/drivers/ppcfloor/bin/bridge.config
export DB_PROPERTY=/bgsys/drivers/ppcfloor/bin/db.properties.tpl
The two files (or their local copies) must be readable by userid loadl
• Note: LoadLeveler needs to be restarted after Blue Gene driver or database updates
Starting LoadLeveler
• llctl start on both the FEN and SN
• llstatus: look for "Blue Gene is present"
• llstatus -b:
Name Base Partitions c-nodes  InQ Run
BGP  4x4x2           32x32x16 0   0
• llstatus -B all: show all base partitions
• llstatus -P <partition_name>
• llstatus -b -l: show more BG resources
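llctl can also address machines individually or all at once; a minimal sketch, assuming the <mySN> and <myFEN> stanzas from LoadL_admin above (llctl -g applies the keyword to every machine in the administration file):

> llctl -h <mySN> start
> llctl -h <myFEN> start
# or start every machine listed in LoadL_admin in one step:
> llctl -g start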
LoadLeveler Job Command File

# @ job_name = myjob
# @ comment = "BG Job by Size"
# @ error = $(home)/output/$(job_name).$(jobid).err
# @ output = $(home)/output/$(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:20:00
# @ notification = error
# @ notify_user = $(user)@us.ibm.com
# @ job_type = bluegene
# @ bg_size = 32
# @ queue
/usr/bin/mpirun -exe /bgtest/hello.rts -verbose 1
Blue Gene Job Keywords
• Mutually exclusive (one must be specified):
bg_size: number of compute nodes
bg_shape: number of BPs in the x, y, z directions, e.g. 1x2x4
bg_partition: specify a predefined partition
• Optional:
bg_connection: MESH, TORUS, or PREFER_TORUS
bg_rotate: True or False
bg_requirements: c-node memory
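For example, a job command file can request resources by shape rather than size; a minimal sketch modeled on the bg_size example above (the shape, wiring, and executable path are illustrative). Here bg_shape = 1x2x2 asks for four BPs (1 in x, 2 in y, 2 in z), TORUS wiring is required, and rotation of the shape is disabled:

# @ job_name = myshapejob
# @ job_type = bluegene
# @ bg_shape = 1x2x2
# @ bg_connection = TORUS
# @ bg_rotate = False
# @ wall_clock_limit = 00:30:00
# @ queue
/usr/bin/mpirun -exe /bgtest/hello.rts -verbose 1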
Submit a Job
• llsubmit <my_job_command_file>
• llq
• llq -b: show Blue Gene specific info
• llq -s <step_id>: show why the job step remains idle
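A typical session, submitting the job command file shown earlier, might look like the sketch below; the file name and job ID are illustrative, and the exact confirmation wording can vary by release:

> llsubmit myjob.cmd
llsubmit: The job "bgpdd1sys1.rchland.ibm.com.10" has been submitted.
> llq -b                  # check the Blue Gene partition and size
> llq -s bgpdd1sys1.10.0  # if the step stays idle, see why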
Partition Size and I/O Nodes
• 4 I/O nodes per BP: partition size >= 128
• 8 I/O nodes per BP: partition size >= 64/128
• 16 I/O nodes per BP: partition size >= 32
• 32 I/O nodes per BP: partition size >= 16/32
• Only Blue Gene/P allows partition sizes 16, 64 and 256, hence the smaller value in each pair above; e.g. with 8 I/O nodes per BP, the minimum partition is 64 on Blue Gene/P and 128 on Blue Gene/L
• A LoadLeveler-defined partition cannot be smaller than BG_MIN_PARTITION_SIZE
• One rack has two Base Partitions (BPs)
Mixed I/O Node Ratios
• One rack has 16 I/O nodes per BP
• The other racks have 4 I/O nodes per BP
• A job asking for 32 compute nodes can only run on the rack with 16 I/O nodes per BP
• A job asking for 128 compute nodes can run on any rack
• With BG_MIN_PARTITION_SIZE=16, the actual minimum partition size is 32
• With BG_MIN_PARTITION_SIZE=128, the actual minimum is 128
Unconnected I/O Nodes
• Each BP has 16 I/O nodes (IONs)
• One rack has all 16 IONs per BP connected
• The other racks have only 4 of them connected
• Must set:
max_psets_per_bp=4 in the db.properties file
BG_MIN_PARTITION_SIZE=128
• Dynamically created partitions then use only 4 IONs per BP
• Predefined partitions (created through mmcs_db_console or the navigator) can use more IONs and be smaller
Advance Reservation
• In LoadL_admin, add:
loadl: type = user
  max_reservations = 10
• llmkres -t 14:00 -d 300 -c 1024
• llmkres -t 12/18 08:00 -d 60 -f my_jcf
• In LoadL_config, you can add MAX_RESERVATIONS = 20 (default 10)
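Once created, reservations can be listed with the llqres query command; a minimal sketch (the -l long-listing option follows the same convention as llq -l and llstatus -l):

> llqres      # one line per existing reservation
> llqres -l   # full details for each reservation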
Advance Reservation
• Reserve for maintenance
• Reserve for special workload
• Allow other users or groups to use the reservation
• Allow a reservation to be automatically cancelled if no more jobs can run in it
• Allow extra resources to be shared once all special jobs for the reservation have started
Advance Reservation
• More resources are needed for TORUS than for MESH
• A reservation made through bg_partition reserves exactly the same resources as the predefined partition
• A reservation made through bg_size or bg_shape can reserve more resources, allowing smaller jobs to run inside the reservation
Fair Share Scheduling
• Share resources "fairly" according to resource entitlement and usage
• In LoadL_config, specify:
FAIR_SHARE_TOTAL_SHARES = 1000
FAIR_SHARE_INTERVAL = 720
• llfs shows shares allocated and used
• llfs -s saves, llfs -r <file> restores, and llfs -r resets the fair share data
Fair Share Scheduling
• It's all about job priority!
• A SYSPRIO expression must be specified to enable Fair Share Scheduling (very flexible)
• NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL must be positive
• In LoadL_admin, specify fair_shares values for some or all users/groups
• All users can run jobs even if fair_shares=0
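A sketch of what a fair-share SYSPRIO expression in LoadL_config might look like; the variable names UserTotalShares and UserUsedShares are taken from the fair-share variables documented for SYSPRIO, but the exact expression is site policy and should be checked against your release's manual:

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 120
# Favor users who have consumed the smallest portion of their shares;
# QDate keeps first-come-first-served order as a tie breaker.
SYSPRIO : 100000 * (UserTotalShares - UserUsedShares) - QDate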
A Mixed LoadLeveler Cluster
• A Blue Gene system can be in the same cluster as other AIX or Linux machines
• The Central Manager must run on the service node of the Blue Gene system
• Only one Blue Gene system can be in a LoadLeveler cluster
• Job classes can be used to separate Blue Gene FENs from the Linux and AIX machines (see the sketch below)
• End users submit all jobs the same way
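A sketch of separating machine types with job classes, using the CLASS keyword from the LoadL_config.local examples earlier (the Linux and AIX class names are illustrative):

# LoadL_config.local on the Blue Gene FENs
CLASS = small(10) row1(20) medium(30) large(10)

# LoadL_config.local on the Linux machines
CLASS = linux_batch(8)

# LoadL_config.local on the AIX machines
CLASS = aix_batch(8)

A job then lands on the right machine type simply through its # @ class statement.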
Multicluster Support
In LoadL_admin, add the Multicluster definitions:

################################
#   MULTICLUSTER DEFINITIONS   #
################################
BGL: type = cluster
  outbound_hosts = bglfen3
  inbound_hosts = bglfen3
  local = true
BGP1: type = cluster
  outbound_hosts = dd1sys1fen1
  inbound_hosts = dd1sys1fen1
BGP2: type = cluster
  outbound_hosts = dd2sys1fen2
  inbound_hosts = dd2sys1fen2

Three separate clusters form a Multicluster environment.
Multicluster Support
• From one cluster, a user can submit jobs to any other cluster:
llsubmit -X BGP1 my_job_command_file
• From one cluster, a user can query jobs in any other cluster:
llq -X BGP2
Runtime Environment
• Blue Gene job information is available to Prologs and Epilogs
• In LoadL_config, add:
JOB_PROLOG = /bgtest/bg_job_prolog.sh

#!/bin/ksh
name=`basename $0 .sh`
echo "$LOADL_BG_PARTITION $LOADL_BG_SIZE $LOADL_BG_CONNECTION $LOADL_BG_BPS $LOADL_BG_IONODES `date` $LOADL_STEP_OWNER $LOADL_STEP_ID $LOADL_STEP_CLASS" > /tmp/$name.$LOADL_STEP_ID.log

• cat /tmp/bg_job_prolog.bgpdd1sys1.rchland.ibm.com.2.0.log
LL07111910011602 512 MESH R20-M1 N00-J00,N04-J00,N08-J00,N12-J00 Mon Nov 19 10:01:16 CST 2007 ezhong bgpdd1sys1.rchland.ibm.com.2.0 high
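An epilog can be hooked in the same way with the matching JOB_EPILOG keyword; a minimal sketch that appends the completion time to the log the prolog above created (the prolog's script name is hard-coded here because $0 now resolves to the epilog):

JOB_EPILOG = /bgtest/bg_job_epilog.sh

#!/bin/ksh
# Append the completion time to the prolog's log for this step
echo "completed `date`" >> /tmp/bg_job_prolog.$LOADL_STEP_ID.log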
Blue Gene Job Info from llq

# llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bgpdd1sys1.9.0           ezhong     11/21 10:29 R  50  high         bgpdd1sys1

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -b
Id                       Owner      Submitted   LL BG PT Partition        Size
________________________ __________ ___________ __ __ __ ________________ ______
bgpdd1sys1.9.0           ezhong     11/21 10:29 R  FR    LL07112110294409 512

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

# llq -f %id %BB %BS %PT %BG %dd %st
Step Id                  Partition        Size   PT BG Disp. Date  ST
------------------------ ---------------- ------ -- -- ----------- --
bgpdd1sys1.9.0           LL07112110294409 512       FR 11/21 10:29 R

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted
Blue Gene Job Info from llq

# llq -l
=============== Job Step bgpdd1sys1.rchland.ibm.com.9.0 ===============
...
Step Type: Blue Gene
Size Requested: 512
Size Allocated: 512
Shape Requested:
Shape Allocated: 1x1x1
Wiring Requested: MESH
Wiring Allocated: MESH
Rotate: True
Blue Gene Status:
Blue Gene Job Id:
Partition Requested:
Partition Allocated: LL07112110294409
BG Partition State: FREE
BG Requirements:
...
Multiple Top Dogs
• Resources are reserved for the highest priority jobs (top dogs) during a dispatching cycle, and other jobs are backfilled around them
• In LoadL_config, set MAX_TOP_DOGS = <number>
• In LoadL_admin, set max_top_dogs = <number>
• The default number of top dogs is 1
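A sketch combining the two settings (the class name is illustrative): up to 4 top dogs cluster-wide, but at most 2 of them from the large class:

# LoadL_config
MAX_TOP_DOGS = 4

# LoadL_admin
large: type = class
  max_top_dogs = 2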
Top Dog Query
• A sample Data Access API program:
/opt/ibmll/LoadL/full/samples/lldata_access/topdog.c

> make
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -c -o topdog.o topdog.c
/usr/bin/g++ -m64 -g -I. -I/opt/ibmll/LoadL/full/include -o topdog topdog.o -m64 -L. -L/usr/lib64 -lllapi -lpthread -ldl

> ./topdog
Step                           Owner      q_sysprio  Estimated Start Time
------------------------------ ---------- ---------- ------------------------
bgpsys6.rchland.ibm.com.56.0   ezhong     50000      Thu Jun 21 17:50:32 2007
bgpsys6.rchland.ibm.com.56.1   ezhong     50000      Thu Jun 21 18:00:19 2007
bgpsys6.rchland.ibm.com.55.2   ezhong     50000      Thu Jun 21 17:50:19 2007
bgpsys6.rchland.ibm.com.55.3   ezhong     50000      Thu Jun 21 17:50:32 2007
===== The top dogs were considered for scheduling at Thu Jun 21 17:40:43 2007
More About Job Priority
• q_sysprio in the llq -l output is used by the LoadLeveler Central Manager for scheduling
• Set in LoadL_config:
SYSPRIO_THRESHOLD_TO_IGNORE_STEP = integer
Jobs with a lower q_sysprio won't be scheduled to run
• llmodify -s <q_sysprio> <step_id> (admin-only command option) assigns a fixed priority that won't be changed by priority recalculation
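For example, an administrator could ignore all steps with negative priority while pinning one step above the threshold; a sketch, with the threshold, priority value, and step ID illustrative:

# LoadL_config: steps whose q_sysprio falls below 0 stay idle
SYSPRIO_THRESHOLD_TO_IGNORE_STEP = 0

# Fix this step's priority; recalculation will not change it
> llmodify -s 100000 bgpdd1sys1.9.0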
LoadLeveler Download Sites
• For the initial download (including the license information):
https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE
https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGP-BLUEGENEP
These pages are password protected.
• For the updates:
http://www14.software.ibm.com/webapp/set2/sas/f/lodleveler/home.html
This page is open to everyone.
• For LoadLeveler documentation:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.infocenter.doc/library.html
Installing LoadLeveler for Blue Gene/P
• Filesets needed:
IBMJava2-142-ppc64-JRE-1.4.2-5.0.ppc64.rpm
LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
LoadL-full-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
• From the directory with the filesets:
rpm -ihv LoadL-full-license-SLES10-PPC64-3.4.2.1-0.ppc64.rpm
/opt/ibmll/LoadL/sbin/install_ll -y -d .
Installing LoadLeveler for Blue Gene/L • Please see Chapter 10 of the IBM Redbook: “IBM System Blue Gene Solution: Configuring and Maintaining Your Environment” http://www.redbooks.ibm.com/abstracts/sg247352.html