Master/Worker and Condor. Barcelona, 2006.
Agenda • Extended user’s tutorial • Advanced Uses of Condor: Java programs, DAGMan, Stork, MW, Grid Computing • Case studies, and a discussion of your application’s needs
Why Master Worker? • MW addresses a weakness in Condor: Short jobs • Excellent for dynamic, parallel workflows
A Workflow Problem A problem requires that we do A 60,000 times, and we do B 100,000 times • A takes 1 second • B takes 3 seconds Computation time for the problem is (60000 x 1) + (100000 x 3) = 360,000 seconds or 100 hours
Condor Runs the Workflow Assume that the overhead Condor adds to running each instance of A or B is 20 seconds (this estimate is far too low; the real overhead is larger). Time for Condor to do the problem is (60000 x 21) + (100000 x 23) = 3,560,000 seconds, or about 989 hours
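The arithmetic on the two slides above can be checked with a short sketch. The function names are illustrative only; the counts, run times, and the 20-second overhead are the slide's own figures.

```cpp
#include <cassert>

// Seconds needed to run `count` instances that each take `seconds_each`
// of real work plus `overhead_each` seconds of scheduling overhead.
long long total_seconds(long long count, long long seconds_each,
                        long long overhead_each) {
    return count * (seconds_each + overhead_each);
}

// Pure computation: 60,000 runs of A (1 s) and 100,000 runs of B (3 s).
long long bare_compute_seconds() {
    return total_seconds(60000, 1, 0) + total_seconds(100000, 3, 0);
}

// The same work with one Condor job per run and 20 s of overhead per job.
long long one_condor_job_per_run_seconds() {
    return total_seconds(60000, 1, 20) + total_seconds(100000, 3, 20);
}
```

Dividing the two results by 3600 gives the slide's 100 hours and roughly 989 hours.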
An Often Considered Solution • Bundle several As or Bs into a single Condor job • Must address further issues: partial failures, load balancing, dynamic creation of work [Figure: three As bundled into one Condor job]
Basics of MW The master gives tasks to the workers.
Workers and Tasks Each worker serially takes on tasks, as assigned by the master [Figure: one worker facing a queue of tasks labeled "feed me", "bathe me", "change diaper"]
Relating MW to Condor • There is 1 master • The master determines the number of workers • Each worker is a Condor job • Each worker receives tasks serially • Many workers do tasks at the same time (in parallel) • Workers communicate only with the master
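The points above (tasks handed out serially per worker, many workers busy at once) can be illustrated with a small simulation. This is a toy model, not MW code: the function name `makespan` and the "give the next task to whichever worker frees up first" policy are assumptions made for the sketch.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Simulates a master handing tasks, in order, to whichever worker becomes
// free first. Each worker runs one task at a time. Returns the time at
// which the last task finishes.
int makespan(const std::vector<int> &task_seconds, int num_workers) {
    assert(num_workers > 0);
    // Min-heap holding the time at which each worker is next free.
    std::priority_queue<int, std::vector<int>, std::greater<int>> free_at;
    for (int w = 0; w < num_workers; ++w) free_at.push(0);

    int finish = 0;
    for (int seconds : task_seconds) {
        int start = free_at.top();  // earliest available worker
        free_at.pop();
        int end = start + seconds;  // that worker is busy until `end`
        free_at.push(end);
        if (end > finish) finish = end;
    }
    return finish;
}
```

With three B-like tasks of 3 seconds and three A-like tasks of 1 second, three workers finish at time 4, while a single worker needs 12.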
Solution: Lightweight Tasks Multiplexed on top of Jobs The analogy: Process is to Thread as Condor Job is to an MW Task A Condor job may take minutes to create and dispatch; an MW Task dispatch takes milliseconds
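To see why multiplexing tasks onto jobs matters, the earlier workflow can be re-costed. This is a rough sketch under stated assumptions: the 20-second job overhead is the figure assumed earlier in the talk, while the 10 ms per-task dispatch cost and the worker count are hypothetical values chosen only to stand in for "milliseconds". The totals are machine time summed over all workers, not wall-clock time.

```cpp
#include <cassert>

// All times are in milliseconds so the arithmetic stays exact.
const long long kTasks = 60000LL + 100000LL;       // runs of A plus runs of B
const long long kComputeMs = (60000LL * 1 + 100000LL * 3) * 1000;
const long long kJobOverheadMs = 20000;            // 20 s per Condor job (talk's assumption)
const long long kTaskDispatchMs = 10;              // hypothetical MW task dispatch cost

// One Condor job per run: every run pays the full job overhead.
long long one_job_per_run_ms() {
    return kComputeMs + kTasks * kJobOverheadMs;
}

// MW style: each worker pays the job overhead once, and every task
// pays only the small dispatch cost.
long long mw_ms(long long workers) {
    return kComputeMs + workers * kJobOverheadMs + kTasks * kTaskDispatchMs;
}
```

Under these assumptions, 100 workers cost about 363,600 seconds in total, close to the 360,000 seconds of pure computation and far below the 3,560,000 seconds of the job-per-run approach.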
MW is • C++ Framework • A way to re-use Condor worker jobs • Each worker may run many tasks • Results in a very parallel application
MW is not • MPI (Message Passing Interface) • General parallel programming scheme
MW in action [Figure: on the submit machine, the master exe holds the tasks (T) and hands them to the workers; the workers are started with condor_submit]
You Must Write 3 Classes, the Subclasses of... • MWDriver • MWTask • MWWorker [Figure: the classes are built into a master exe and a worker exe]
An MWTask • Subclass MWTask • Data members for inputs • Data member for results • Serialization of inputs and results • Distinct instances on each side
The Four Task Methods • void MyTask::pack_work(void); • void MyTask::unpack_work(void); • void MyTask::pack_results(void); • void MyTask::unpack_results(void); • Also constructors and destructors!
RMC • Resource Management and Communication • An abstraction to set up communication, to specify resource requirements, etc. • RMC->pack(int *array, int length); • RMC->unpack(int *array, int length);
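The four task methods and the RMC pack/unpack calls fit together roughly as follows. This is a minimal sketch, not the MW library: `MockRMC` is a stand-in in-memory queue, `SumTask` is a hypothetical task that does not derive from the real MWTask, and `round_trip_sum` plays both the master and the worker in one process.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for MW's RMC object: a first-in, first-out buffer of ints.
class MockRMC {
public:
    void pack(const int *array, int length) {
        for (int i = 0; i < length; ++i) buffer_.push_back(array[i]);
    }
    void unpack(int *array, int length) {
        for (int i = 0; i < length; ++i) array[i] = buffer_.at(read_pos_++);
    }
private:
    std::vector<int> buffer_;
    std::size_t read_pos_ = 0;
};

static MockRMC channel;
static MockRMC *RMC = &channel;  // named RMC to mirror the slide's usage

// Hypothetical task: inputs are four ints, the result is their sum.
class SumTask {
public:
    static constexpr int kSize = 4;
    int input[kSize] = {0, 0, 0, 0};  // data members for inputs
    int result = 0;                   // data member for the result

    void pack_work()      { RMC->pack(input, kSize); }    // master side
    void unpack_work()    { RMC->unpack(input, kSize); }  // worker side
    void pack_results()   { RMC->pack(&result, 1); }      // worker side
    void unpack_results() { RMC->unpack(&result, 1); }    // master side
};

// Sends the inputs "to the worker", computes there, and sends the result back.
int round_trip_sum(int a, int b, int c, int d) {
    SumTask master_copy;              // instance living in the master
    master_copy.input[0] = a; master_copy.input[1] = b;
    master_copy.input[2] = c; master_copy.input[3] = d;
    master_copy.pack_work();

    SumTask worker_copy;              // distinct instance in the worker
    worker_copy.unpack_work();
    for (int i = 0; i < SumTask::kSize; ++i) worker_copy.result += worker_copy.input[i];
    worker_copy.pack_results();

    master_copy.unpack_results();
    return master_copy.result;
}
```

The two `SumTask` objects reflect the "distinct instances on each side" point: only the packed data crosses between master and worker.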
MWWorker • Just one method: executeTask(MWTask *t) • Also constructor and destructor!
MWDriver (the master) • get_userinfo(int argc, char **argv) • RMC->add_executable(char *exe, char *requirements); • setup_initial_tasks(int num_tasks, MWTask ***init_tasks) • act_on_completed_task(MWTask *t) • RMC->add_task(MWTask *t) • Also constructor and destructor
MWTask ***init_tasks [Figure: init_tasks is a pointer to the array; the array is an array of pointers to tasks; each pointer in it refers to a task]
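The three levels of indirection can be sketched as below. The parameter list follows the slide; `Task` is a hypothetical stand-in for an MWTask subclass, and the choice to have the caller free the tasks is an assumption of this sketch, not a statement about who owns them in MW.

```cpp
#include <cassert>

// Hypothetical stand-in for an MWTask subclass.
struct Task {
    int id;
    explicit Task(int i) : id(i) {}
};

// Fills *init_tasks with a freshly allocated array of num_tasks task pointers.
void setup_initial_tasks(int num_tasks, Task ***init_tasks) {
    *init_tasks = new Task *[num_tasks];  // the array of pointers
    for (int i = 0; i < num_tasks; ++i) {
        (*init_tasks)[i] = new Task(i);   // each slot points to one task
    }
}

// Caller side: pass the address of a Task** so the callee can set it.
int sum_of_ids_and_free(int num_tasks) {
    Task **tasks = nullptr;
    setup_initial_tasks(num_tasks, &tasks);
    int sum = 0;
    for (int i = 0; i < num_tasks; ++i) {
        sum += tasks[i]->id;
        delete tasks[i];
    }
    delete[] tasks;
    return sum;
}
```

The extra `*` exists only so the callee can hand a newly allocated array back through its argument.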
Putting it all together: examples/new_skel • ./new_app MY_PROJECT: a Perl script that creates appropriately named files containing skeleton code • Use configure --help for options • make
Running an application • Just launch the appropriate master • Use condor_q to see it in action
Real MW Applications • MWFATCOP (Chen, Ferris, Linderoth) A branch and cut code for linear integer programming • MWMINLP (Goux, Leyffer, Nocedal) A branch and bound code for nonlinear integer programming • MWQPBB (Linderoth) A (simplicial) branch and bound code for solving quadratically constrained quadratic programs • MWAND (Linderoth, Shen) A nested decomposition based solver for multistage stochastic linear programming • MWATR (Linderoth, Shapiro, Wright) A trust-region-enhanced cutting plane code for linear stochastic programming and statistical verification of solution quality. • MWQAP (Anstreicher, Brixius, Goux, Linderoth) A branch and bound code for solving the quadratic assignment problem
Other resources • http://www.cs.wisc.edu/condor/mw • Online manual • MW-users mailing list
Advice for Large Runs • Use Personal Condor • Flock, glidein, schedd-on-side, hobblein • Use checkpoints! • Set worker_increment high
Debugging with Independent Mode • Special RMComm for debugging • Single process, can run under gdb
MW Philosophy • Reuse either code or concept • Key idea: Late binding
User-level Checkpoints • MWTask::write_chkpt_info(FILE *) • MWTask::read_chkpt_info(FILE *) • MWDriver::read_master_state(FILE *) • MWDriver::write_master_state(FILE *)
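A user-level checkpoint pair might look like the following. This is a sketch under assumptions: `CounterTask` and its fields are hypothetical, it does not derive from the real MWTask, and the return types and the plain-text file format are choices made here because the slide does not show them.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical task state worth saving: how far we got and the running total.
class CounterTask {
public:
    int next_index = 0;
    long partial_sum = 0;

    // Writes the task state as one line of text.
    void write_chkpt_info(std::FILE *f) const {
        std::fprintf(f, "%d %ld\n", next_index, partial_sum);
    }

    // Restores the state; returns false if the data could not be parsed.
    bool read_chkpt_info(std::FILE *f) {
        return std::fscanf(f, "%d %ld", &next_index, &partial_sum) == 2;
    }
};

// Saves a task to a temporary file, reloads it into a fresh object,
// and reports whether the state survived.
bool checkpoint_round_trip(int index, long sum) {
    std::FILE *f = std::tmpfile();
    if (!f) return false;

    CounterTask before;
    before.next_index = index;
    before.partial_sum = sum;
    before.write_chkpt_info(f);

    std::rewind(f);  // read back from the start of the file
    CounterTask after;
    bool ok = after.read_chkpt_info(f);
    std::fclose(f);
    return ok && after.next_index == index && after.partial_sum == sum;
}
```

The read and write methods must agree on field order, which is the main thing to keep in sync when a task gains new data members.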
Example codes with MW • Matmul • Blackbox • Knapsack
More on MW • http://www.cs.wisc.edu/condor/mw • Version 0.2 is the latest • It is more stable than the version number suggests! • Mailing list available for discussion • Active development by the Condor team