Master Control Program
Subha Sivagnanam, SDSC
Master Control Program
• Provides automatic resource selection for running a single parallel job on HPC resources
• MCP uses directives in batch submission scripts to submit to the queues of multiple resources. E.g.:
    #MCP submit_host <head node for the remote cluster>
    #MCP username <local username on the remote cluster>
    #MCP scratch_dir <scratch directory on the remote cluster>
• As soon as the job starts to run on one of the resources, MCP removes the job from the queues of all other resources.
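As a rough illustration of how such directives could be read, the sketch below collects `#MCP key value` lines from a job script into a dictionary. It is not MCP's own parser: the function name `parse_mcp_directives` and the parsing rules (whitespace-separated key, rest of line as value) are assumptions made for this example.

```python
def parse_mcp_directives(script_text):
    """Collect '#MCP key value' lines from a job script into a dict.

    Hypothetical helper for illustration; MCP's real parser may differ.
    """
    directives = {}
    for line in script_text.splitlines():
        parts = line.strip().split(None, 2)  # ['#MCP', key, value]
        if len(parts) == 3 and parts[0] == "#MCP":
            directives[parts[1]] = parts[2].strip()
    return directives


example_script = """#!/bin/ksh
#MCP submit_host tg-login.ncsa.teragrid.org
#MCP username your_username
#MCP scratch_dir /home/ncsa/your_username/info/mcp/test/mcp
#PBS -l walltime=00:05:00,nodes=4:ppn=2:compute
"""

directives = parse_mcp_directives(example_script)
for key, value in directives.items():
    print(key, "=", value)
```

Lines that start with `#PBS` are left for the batch system and are ignored here, which is why only the three MCP entries are collected.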
Assumptions
• The user compiles the application on each desired machine
• Input is staged on the remote clusters
• Submission is initiated from only one machine
• MCP can be initiated by
    • using mcp.py, with manually created job scripts
    • using fullauto.py, which generates job scripts automatically from desired attributes
MCP flow
• A grid credential needs to be established (grid-proxy-init or myproxy-get-delegation)
• Write a job script for each resource
Example: NCSA job script
    #!/bin/ksh
    #MCP qtype pbs
    #MCP submit_host tg-login.ncsa.teragrid.org
    #MCP username your_username
    #MCP scratch_dir /home/ncsa/your_username/info/mcp/test/mcp
    #PBS -l walltime=00:05:00,nodes=4:ppn=2:compute
    #PBS -d /home/ncsa/your_username/info/mcp/test/run
    NPROCS=`wc -l < $PBS_NODEFILE`
    /usr/local/mpich/mpich-gm-1.2.5..10-intel-r2/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS /home/ncsa/your_username/testprog/ring26 -t 10 -n 2 -l 10 -i 0.03125
    #/bin/sleep 900
• The user passes the job scripts to MCP as input:
    ./mcp.py [--debug] <submit_script1> <submit_script2>
• MCP submits jobs to all clusters and monitors all clusters for job start
• Once one job starts, MCP cancels all other jobs
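The submit-everywhere, keep-the-first-to-start behaviour can be sketched with a small simulation. Nothing here talks to a real scheduler: `FakeQueue`, `run_on_first_available`, and the queue names are hypothetical stand-ins invented for this example, and a real tool would also need timeouts and error handling.

```python
import time


class FakeQueue:
    """Hypothetical stand-in for a remote batch queue (not part of MCP)."""

    def __init__(self, name, polls_until_start):
        self.name = name
        self.polls_until_start = polls_until_start
        self.state = "idle"

    def submit(self, script):
        self.state = "queued"

    def has_started(self):
        # Simulate queue wait: the job starts after a fixed number of polls.
        if self.state == "queued":
            self.polls_until_start -= 1
            if self.polls_until_start <= 0:
                self.state = "running"
        return self.state == "running"

    def cancel(self):
        if self.state == "queued":
            self.state = "cancelled"


def run_on_first_available(queues, scripts, poll_interval=0.0):
    """Submit one script per queue, then cancel all but the first to start."""
    for queue, script in zip(queues, scripts):
        queue.submit(script)
    while True:
        for queue in queues:
            if queue.has_started():
                for other in queues:
                    if other is not queue:
                        other.cancel()
                return queue.name
        time.sleep(poll_interval)


queues = [FakeQueue("cluster_a", 3), FakeQueue("cluster_b", 2)]
winner = run_on_first_available(queues, ["a.job", "b.job"])
print(winner, [q.state for q in queues])
```

In this run the second queue starts first, so the job on the first queue is cancelled while it is still waiting.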
Fullauto Flow
• The user runs grid-proxy-init or myproxy-get-delegation to establish a grid credential.
• autojob.py is created with personalized settings. E.g.:
    match_attributes = {
        'CPU_MODEL'     : ['==', 'ia64'],
        'CPU_MEMORY_GB' : ['>=', 2],
        'CPU_MHZ'       : ['>=', 1300],
        'CPU_SMP'       : ['>=', 2],
        'NODECOUNT'     : ['>=', 128],
    }
    machine_dict_list = [
        {
            'HOSTNAME' : 'tg-login.ncsa.teragrid.org',
            'substitutes_dict' : {
                'arguments' : ['-t', '100', '-n', '10', '-l', '4000', '-i', '0.03125', '-c', '0', '-s', '0'],
                'wallclock_seconds' : '300',
                '__MCP_SHELL__' : '/bin/ksh',
                '__MCP_PARALLEL_RUN__' : '/usr/local/mpich/mpich-gm-1.2.6..14b-intel-r2/bin/mpirun',
                '__MCP_SERIAL_RUN__' : '#',
                '__MCP_NODES__' : '4',
                '__MCP_CPUS_PER_NODE__' : '2',
                '__MCP_USERNAME__' : 'your_username',
                '__MCP_SCRATCH_DIR__' : '/home/ncsa/your_username/info/mcp/test/mcpdata',
                '__MCP_JOB_DIR__' : '/home/ncsa/your_username/info/mcp/test/run',
                '__MCP_EXECUTABLE__' : '/home/ncsa/your_username/testprog/ring26',
            },
        },
    ]
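The `__MCP_*__` keys in `substitutes_dict` look like placeholders to be substituted into a job-script template. The slides do not show the template or the substitution code, so the sketch below is only a guess at the idea: `fill_template` and the template text are hypothetical, and plain string replacement is an assumed mechanism.

```python
def fill_template(template, substitutes):
    """Replace each __MCP_*__ placeholder with its string value.

    Hypothetical helper; non-placeholder keys such as 'arguments' are skipped.
    """
    script = template
    for key, value in substitutes.items():
        if key.startswith("__MCP_") and isinstance(value, str):
            script = script.replace(key, value)
    return script


# Hypothetical template, invented for this example.
template = """#!__MCP_SHELL__
#MCP username __MCP_USERNAME__
#MCP scratch_dir __MCP_SCRATCH_DIR__
#PBS -l nodes=__MCP_NODES__:ppn=__MCP_CPUS_PER_NODE__
"""

substitutes = {
    'wallclock_seconds': '300',
    '__MCP_SHELL__': '/bin/ksh',
    '__MCP_NODES__': '4',
    '__MCP_CPUS_PER_NODE__': '2',
    '__MCP_USERNAME__': 'your_username',
    '__MCP_SCRATCH_DIR__': '/home/ncsa/your_username/info/mcp/test/mcpdata',
}

job_script = fill_template(template, substitutes)
print(job_script)
```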
• The user runs fullauto.py with autojob.py as the input:
    fullauto.py --autojobfile=<autojob file>
• Fullauto finds clusters from the allowable list of resources (automachine.py) and creates job scripts for each selected cluster.
• Fullauto uses MCP to run the scripts.
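One plausible way to select clusters is to test each machine's attributes against the `[operator, value]` pairs in `match_attributes`. The sketch below shows that idea only: `machine_matches`, the supported operator set, the host names, and the attribute values are all hypothetical, and the real contents of automachine.py are not shown in the slides.

```python
import operator

# Comparison operators assumed for this sketch.
OPS = {
    '==': operator.eq, '!=': operator.ne,
    '>=': operator.ge, '<=': operator.le,
    '>': operator.gt, '<': operator.lt,
}


def machine_matches(machine_attrs, match_attributes):
    """Return True if the machine satisfies every [op, value] requirement."""
    for name, (op, wanted) in match_attributes.items():
        if name not in machine_attrs:
            return False  # a missing attribute counts as a non-match
        if not OPS[op](machine_attrs[name], wanted):
            return False
    return True


match_attributes = {
    'CPU_MODEL': ['==', 'ia64'],
    'CPU_MEMORY_GB': ['>=', 2],
    'CPU_MHZ': ['>=', 1300],
    'CPU_SMP': ['>=', 2],
    'NODECOUNT': ['>=', 128],
}

# Hypothetical machine descriptions, invented for this example.
machines = {
    'cluster-a.example.org': {'CPU_MODEL': 'ia64', 'CPU_MEMORY_GB': 4,
                              'CPU_MHZ': 1500, 'CPU_SMP': 2, 'NODECOUNT': 256},
    'cluster-b.example.org': {'CPU_MODEL': 'ia64', 'CPU_MEMORY_GB': 2,
                              'CPU_MHZ': 1300, 'CPU_SMP': 2, 'NODECOUNT': 64},
}

selected = [host for host, attrs in machines.items()
            if machine_matches(attrs, match_attributes)]
print(selected)
```

Only the first hypothetical machine is selected, because the second has fewer than 128 nodes.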
Resources available
• fullauto.py --attributes, or from automachine.py