Task Farming on HPCx David Henty HPCx Applications Support d.henty@epcc.ed.ac.uk
What is Task Farming? • Many independent programs (tasks) running at once • each task can be serial or parallel • “independent” means they don’t communicate directly • Common approach for using spare cycles in a loosely-connected cluster • how does it relate to HPCx and Capability Computing? • Often needed for pre- or post-processing • Tasks may contribute to a single, larger calculation • parameter searches or optimisation • enhanced statistical sampling • ensemble modelling
Classical Task Farm • A single parallel code (e.g. written in MPI) • one process is designated as the controller • the rest are workers • [Diagram: the controller reads the input, hands tasks to Workers 1–4 and collects their output]
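A minimal sketch, in C, of the classical controller/worker pattern shown above; do_task and NTASKS are illustrative placeholders rather than anything from HPCx. Rank 0 hands out task numbers on demand, and a negative task number tells a worker to stop.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS 100                       /* illustrative number of independent tasks */

    static double do_task(int task) {        /* placeholder for the real serial work */
        return (double) task * task;
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                     /* controller: hand out tasks on demand */
            int next = 0, active = size - 1, task;
            double result;
            MPI_Status status;
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                task = (next < NTASKS) ? next++ : -1;   /* -1 = no work left */
                if (task < 0) active--;
                MPI_Send(&task, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        } else {                             /* worker: request work until told to stop */
            int task;
            double result = 0.0;             /* first send is just a "ready" signal */
            for (;;) {
                MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task < 0) break;
                result = do_task(task);
            }
        }
        MPI_Finalize();
        return 0;
    }

Because workers only ask for work when they are idle, the farm stays load-balanced provided there are many more tasks than workers.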
Characteristics • Pros • load-balanced for sufficiently many tasks • can use all of HPCx (using MPI) • Cons • must write a new parallel code • potential waste of a CPU if the controller is not kept busy • each task must be serial, i.e. use a single CPU • Approach • find an existing task farm harness on the WWW
Shared Counter • Tasks are numbered 1, 2, ..., maxTask • shared counter requires no CPU time • [Diagram: Workers 1–5 repeatedly fetch the next task number from a shared Counter and each write their own output]
Characteristics • Pros • load-balanced • don’t have to designate a special controller • Cons • very much a shared-memory model • easy to scale up to a frame (32 CPUs) with OpenMP, as sketched below • harder to scale to all of HPCx • need to write a new parallel program
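Within a single frame, the shared counter is easy to express with OpenMP. A minimal sketch in C, where run_task and NTASKS are illustrative placeholders: every thread repeatedly grabs the next task number from a shared counter, so no thread has to act as a controller.

    #include <stdio.h>
    #include <omp.h>

    #define NTASKS 150                       /* illustrative: more tasks than threads */

    static void run_task(int task) {         /* placeholder for the real serial work */
        printf("thread %d running task %d\n", omp_get_thread_num(), task);
    }

    int main(void) {
        int counter = 0;                     /* shared counter: next task to hand out */

        #pragma omp parallel shared(counter)
        {
            int task;
            for (;;) {
                #pragma omp critical         /* make the fetch-and-increment atomic */
                {
                    task = counter;
                    counter++;
                }
                if (task >= NTASKS) break;   /* no work left */
                run_task(task);
            }
        }
        return 0;
    }

This only works where all the workers share memory, which is exactly why the approach scales easily to one 32-CPU frame but not across all of HPCx.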
Task Farming Existing Code • Imagine you have a pre-compiled executable • and you simply want to run P copies on P processors • common in parameter searching or ensemble studies • can be done via poe but is non-portable • Possible to launch a simple MPI harness • each process does nothing but run the executable • easy to do via "system(commandstring)" • Have written a general-purpose harness • called taskfarm • see /usr/local/packages/bin/
Controlling the Task Farm • Need to allow the tasks to do different things • each task is assigned a unique MPI rank: 0, 1, 2, ..., P-2, P-1 • I have hijacked the C "%d" printf syntax • taskfarm "echo hello from task %d" • the command string is run as-is on each processor • except with %d replaced by the MPI rank • On 3 CPUs:
hello from task 0
hello from task 1
hello from task 2
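A minimal sketch, in C, of how such a harness can work; this is an illustration, not the source of the HPCx taskfarm utility, and it handles at most four occurrences of %d. Each MPI process substitutes its own rank into the command string and passes the result to system().

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXCMD 1024

    int main(int argc, char **argv) {
        int rank;
        char template[MAXCMD] = "", command[MAXCMD];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (argc < 2) {
            if (rank == 0) fprintf(stderr, "usage: harness \"command with %%d\"\n");
            MPI_Finalize();
            return 1;
        }

        /* Rebuild the command string from the arguments ... */
        for (int i = 1; i < argc; i++) {
            strncat(template, argv[i], MAXCMD - strlen(template) - 1);
            if (i < argc - 1) strncat(template, " ", MAXCMD - strlen(template) - 1);
        }

        /* ... then let printf-style formatting replace each %d with the rank
           (the "hijacked" C %d syntax; unused extra arguments are ignored). */
        snprintf(command, MAXCMD, template, rank, rank, rank, rank);

        int rc = system(command);            /* run the existing executable as-is */
        printf("harness: return code on process %d is %d\n", rank, rc);

        MPI_Finalize();
        return 0;
    }

Each copy of the executable then sees rank-specific file names, exactly as in the input.%d / output.%d examples that follow.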
Verbose Mode
taskfarm -v "echo hello from task %d"
taskfarm: called with 5 arguments: echo hello from task %d
taskfarm: process 0 executing "echo hello from task 0"
taskfarm: process 1 executing "echo hello from task 1"
taskfarm: process 2 executing "echo hello from task 2"
hello from task 0
hello from task 1
hello from task 2
taskfarm: return code on process 0 is 0
taskfarm: return code on process 1 is 0
taskfarm: return code on process 2 is 0
• Could also report where each task is running • i.e. the name of the HPCx frame
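Reporting where a task runs could be done with MPI_Get_processor_name; a small sketch in C (an illustration, not the actual taskfarm source):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, namelen;
        char nodename[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(nodename, &namelen);   /* e.g. the HPCx frame name */

        printf("taskfarm: process %d running on %s\n", rank, nodename);

        MPI_Finalize();
        return 0;
    }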
Use in Practice • Need tasks to use different input and output files
taskfarm "cd rundir%d; serialjob < input > output.log"
• or
taskfarm "serialjob < input.%d > output.%d.log"
• Pros • no new coding, and taskfarm is also relatively portable • Cons • no load balancing: a single task per CPU per run • Extensions • do more tasks than CPUs, aiming for load balance? • a dedicated controller makes this potentially messy
Implement a Shared Counter in MPI-2 • Could be accessed as a library function (one possible implementation is sketched below):
   task = gettask()
   do while (task .ge. 0)
      call serialjob(task)
      task = gettask()
   end do
• or via an extended harness:
taskfarm -n 150 "serialjob < input.%d > output.%d.log"
• Would run serial jobs on all available processors until all 150 had been completed • potential for load-balancing with more tasks than processors • work in progress!
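One way a gettask() library function might be built is on top of an MPI remote-memory-access window holding the counter. The sketch below is in C and, for brevity, uses MPI-3's MPI_Fetch_and_op rather than the more elaborate lock/get/accumulate scheme needed in pure MPI-2; NTASKS, counter_create and gettask are illustrative names, not the HPCx implementation.

    #include <mpi.h>

    #define NTASKS 150                       /* illustrative total number of tasks */

    static MPI_Win counter_win;
    static int    *counter;                  /* counter storage lives on rank 0 */

    /* Create the shared counter; call once on every process after MPI_Init. */
    void counter_create(MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Aint size = (rank == 0) ? sizeof(int) : 0;
        MPI_Win_allocate(size, sizeof(int), MPI_INFO_NULL, comm,
                         &counter, &counter_win);
        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, counter_win);
            *counter = 0;                    /* initialise inside an access epoch */
            MPI_Win_unlock(0, counter_win);
        }
        MPI_Barrier(comm);                   /* everyone waits for initialisation */
    }

    /* Atomically fetch-and-increment the counter on rank 0;
       returns the next task number, or -1 when all tasks are done. */
    int gettask(void) {
        int task, one = 1;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, counter_win);
        MPI_Fetch_and_op(&one, &task, MPI_INT, 0, 0, MPI_SUM, counter_win);
        MPI_Win_unlock(0, counter_win);
        return (task < NTASKS) ? task : -1;
    }

Because the counter update is a one-sided atomic operation, no process has to be dedicated to serving it, which is the attraction of the shared-counter design.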
Multiple Parallel MPI Jobs • What is the issue on HPCx? • poe picks up the number of MPI processes directly from the LoadLeveler script • can only have a single global MPI job running at once • Cannot do
mpirun mpijob -nproc 32 &
mpirun mpijob -nproc 32 &
mpirun mpijob -nproc 32 &
mpirun mpijob -nproc 32 &
wait
• unlike on many other systems such as Sun, T3E, Altix, ...
Using taskfarm • taskfarm is a harness implemented in MPI • cannot use it to run MPI jobs • but can run jobs parallelised with some other method, e.g. threads • To run 4 copies of a 32-way OpenMP job:
export OMP_NUM_THREADS=32
taskfarm "openmpjob < input.%d > output.%d.log"
• Controlling the OpenMP parallelism • how to ensure that each OpenMP job runs on a separate frame? • need to select 4 MPI tasks but place only one on each node:
#@ cpus=4
#@ tasks_per_node=1
Real Example: MOLPRO • An ab-initio quantum chemistry package • parallelised using the Global Array (GA) Tools library • on HPCx, the normal version of GA Tools uses LAPI • LAPI requires poe: same problems for taskfarm as with MPI • But ... • there is an alternative implementation of GA Tools • it uses the TCGMSG message-passing library ... • which is implemented using Unix sockets, not MPI • Not efficient over the switch • but probably fine within a node, i.e. up to 32 processors
Running MOLPRO as a Parallel Taskfarm • TCGMSG parallelism is specified on the command line • to run 6 MOLPRO jobs each using 16 CPUs • i.e. 2 jobs per frame on a total of 3 frames:
#@ cpus=6
#@ tasks_per_node=2
taskfarm "molpro -n 16 < input.%d.com > output.%d.out"
• Completely analogous to taskfarming OpenMP jobs • MOLPRO can now be used to solve many different problems simultaneously • which may not individually scale very well
Multiple Parallel MPI Jobs • So far we have seen ways of running the following (where "simple" means no load balancing) • a general serial task farm, requiring a new parallel code • a simple serial task farm of existing program(s) • the potential for a general serial task farm of existing program(s) • simple parallel (non-MPI) task farms with existing program(s) • What about task farming parallel MPI jobs? • e.g. four 64-way MPI jobs in a 256-CPU partition • requires some changes to the source code • but potentially not very much
Communicator Splitting • (Almost) every MPI routine takes a communicator • usually MPI_COMM_WORLD, but it can be a subset of processes
Original code:
call MPI_Init(ierr)
comm = MPI_COMM_WORLD
call MPI_Comm_size(comm, ...)
call MPI_Comm_rank(comm, ...)
if (rank .eq. 0) &
   write(*,*) 'Hello world'
! now do the work ...
call MPI_Finalize(ierr)
Task-farmed code:
call MPI_Init(ierr)
bigcomm = MPI_COMM_WORLD
comm = split(bigcomm, 4)
call MPI_Comm_size(comm, ...)
call MPI_Comm_rank(comm, ...)
if (rank .eq. 0) &
   write(*,*) 'Hello world'
! now do the work ...
call MPI_Finalize(ierr)
Issues • Each group of 64 processors lives in its own world • each has ranks 0–63 and its own master, rank = 0 • must never directly reference MPI_COMM_WORLD • Need to allow for different input and output files • use different directories for minimum code changes • can arrange for each parallel task to run in a different directory using clever scripts • How to split the communicator appropriately? • can be done by hand with MPI_Comm_split (see the sketch below) • the MPH library gives users some help • If you’re interested, submit a query!
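A minimal sketch, in C, of doing the split by hand with MPI_Comm_split; the split() call in the earlier Fortran fragment is pseudocode, and grouping consecutive world ranks is just one reasonable choice of colour.

    #include <mpi.h>
    #include <stdio.h>

    /* Split MPI_COMM_WORLD into ngroups equal sub-communicators of consecutive
       ranks, e.g. ngroups = 4 on 256 CPUs gives four 64-way communicators,
       each with its own ranks 0-63 and its own rank-0 master. */
    MPI_Comm split_world(int ngroups) {
        int rank, size;
        MPI_Comm comm;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int colour = rank / (size / ngroups);    /* which sub-job this process joins */
        MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &comm);
        return comm;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Comm comm = split_world(4);          /* e.g. four independent sub-jobs */

        int rank;
        MPI_Comm_rank(comm, &rank);              /* rank within this sub-job only */
        if (rank == 0) printf("Hello world\n");  /* printed once per sub-job */

        /* ... now do the work, always using comm, never MPI_COMM_WORLD ... */

        MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }

Assuming the number of processes divides evenly by the number of groups, each sub-job behaves like an independent MPI program, which is why the only source changes needed are to stop referencing MPI_COMM_WORLD directly.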
Summary • Like any parallel computer, HPCx can run parallel task farm programs written by hand • However, the usual requests are: • multiple runs of an existing serial program • multiple runs of an existing parallel program • These can both be done with the taskfarm harness • limitations on the tasks’ parallelism (must be non-MPI) • currently no load-balancing • Task farming MPI code requires source changes • but these can be quite straightforward in many cases • e.g. ensemble modelling with the Unified Model