Condor DAGMan: Managing Job Dependencies with Condor
Condor DAGMan
• What is DAGMan?
• What is it good for?
• How does it work?
• What’s next?
DAGMan
• Directed Acyclic Graph Manager
• DAGMan lets you specify the dependencies between your Condor jobs, so that it can manage them automatically for you.
• (e.g., “Don’t run job B until job A has completed successfully.”)
Typical Scenarios
• Jobs whose output needs to be summarized or post-processed once they complete.
• Jobs that need data to be generated or pre-processed before they can use it.
• Jobs that require data to be staged to/from remote repositories before they start or after they finish.
What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a “node” in the DAG.
• Each node can have any number of “parents” or “children” (or neither), as long as there are no loops!
[Figure: a DAG with nodes Job A, Job B, Job C, and Job D]
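The node, parent, and child vocabulary above can be sketched in a few lines of Python. This is only an illustration, not DAGMan's internal representation: the dictionary layout and the helper names `parents_of` and `has_loop` are assumptions made for this example, and the edges are those of the diamond.dag file used later in the slides.

```python
# Hypothetical representation: each node maps to the list of its children.
# The edges mirror the diamond.dag example (A -> B, C; B, C -> D).
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def parents_of(dag, node):
    """Return every node that lists `node` as one of its children."""
    return sorted(p for p, children in dag.items() if node in children)


def has_loop(dag):
    """Depth-first search: reaching a node that is still on the current path means a loop."""
    on_path, finished = set(), set()

    def visit(node):
        if node in finished:
            return False
        if node in on_path:
            return True
        on_path.add(node)
        found = any(visit(child) for child in dag[node])
        on_path.discard(node)
        finished.add(node)
        return found

    return any(visit(node) for node in dag)


print(parents_of(diamond, "D"))
print(has_loop(diamond))
```

A graph such as `{"A": ["B"], "B": ["A"]}` makes `has_loop` return `True`, which is exactly the situation a DAG rules out.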
An Example DAG
• Jobs whose output needs to be summarized or post-processed once they complete:
[Figure: example DAG with nodes Job A, Job B, Job C, and Job D]
Another Example DAG
• Jobs that need data to be generated or pre-processed before they can use it:
[Figure: example DAG with nodes Job A, Job B, Job C, and Job D]
Defining a DAG
• A DAG is defined by a .dag file, listing all its nodes and any dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Figure: the DAG with nodes Job A, Job B, Job C, and Job D]
Defining a DAG (cont’d)
• Each node in the DAG will run a Condor job, specified by a Condor submit file:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Figure: the DAG with nodes Job A, Job B, Job C, and Job D]
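To make the file layout concrete, here is a minimal Python sketch that reads the diamond.dag text into a job table and a set of parent/child edges. It is not DAGMan's parser: the function name `parse_dag` is hypothetical, and it handles only the `Job` and `Parent ... Child ...` lines that appear in these slides, ignoring everything else.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, edges): jobs maps node -> submit file, edges is a set of (parent, child)."""
    jobs, edges = {}, set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()  # the slides write both "Parent" and "PARENT"
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "parent":
            split_at = [w.lower() for w in words].index("child")
            for parent in words[1:split_at]:
                for child in words[split_at + 1:]:
                    edges.add((parent, child))
        # Any other line type is ignored in this sketch.
    return jobs, edges


jobs, edges = parse_dag(DIAMOND_DAG)
print(jobs)
print(sorted(edges))
```

Each `Parent ... Child ...` line expands to every parent/child pair it names, so `Parent B C Child D` yields the two edges (B, D) and (C, D).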
Submitting a DAG
• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon and begin running your jobs:
    % condor_submit_dag diamond.dag
• The DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
Running a DAG
• DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, the .dag file (nodes A, B, C, D), and the Condor job queue]
Running a DAG (cont’d)
• DAGMan holds and submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan and the Condor job queue, with nodes A, B, C, D]
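The idea of submitting jobs "at the appropriate times" can be sketched as follows: a node becomes eligible only once all of its parents are done. This is a simplified model, not DAGMan's scheduler. It advances in lock-step waves rather than reacting to individual job completions, and the name `submission_waves` is made up for this example.

```python
def submission_waves(parents):
    """Group nodes into waves; a node joins a wave once all of its parents are done."""
    done, waves = set(), []
    remaining = set(parents)
    while remaining:
        ready = sorted(n for n in remaining if set(parents[n]) <= done)
        if not ready:
            raise ValueError("no node is ready: the graph has a loop")
        waves.append(ready)
        done.update(ready)  # simplification: the whole wave finishes together
        remaining.difference_update(ready)
    return waves


# Each node maps to its list of parents (the diamond.dag example).
diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
waves = submission_waves(diamond_parents)
print(waves)
```

For the diamond, job A goes first, B and C become eligible together once A is done, and D waits for both of them.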
Running a DAG (cont’d)
• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAGMan, the Condor job queue, and the rescue file, with nodes A, B, D and a failed node marked X]
Recovering a DAG
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan, the Condor job queue, and the rescue file, with nodes A, B, C, D]
Recovering a DAG (cont’d)
• Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan and the Condor job queue, with nodes A, B, C, D]
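The failure-and-recovery cycle can be illustrated with a small simulation. It is a sketch under stated assumptions: the slides do not show the rescue file's format, so a plain set of completed node names stands in for it here, and `run_dag` with its `fails` and `already_done` parameters is a hypothetical helper, not part of Condor.

```python
def run_dag(parents, fails, already_done=frozenset()):
    """Run every node whose parents are all done until no more progress is possible.

    `fails` names the nodes that fail during this run. Returns (done, not_done).
    """
    done, failed = set(already_done), set()
    progressed = True
    while progressed:
        progressed = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if set(parents[node]) <= done:
                (failed if node in fails else done).add(node)
                progressed = True
    return done, set(parents) - done


parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: job C fails, so D can never start. The `rescue` set stands in
# for the rescue file (an assumption; the real format is not shown here).
rescue, left_over = run_dag(parents, fails={"C"})

# Second run: restore the saved state and re-run only what is left.
final, remaining = run_dag(parents, fails=set(), already_done=rescue)
print(sorted(rescue), sorted(left_over), sorted(final))
```

In the first run A and B complete while C and D are left over. The second run starts from the saved state, so A and B are not repeated.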
Finishing a DAG
• Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan and the Condor job queue, with nodes A, B, C, D]
Additional Features
• DAGMan provides some other handy features for job management:
• Nodes can have PRE & POST scripts
• Job submission can be “throttled”
PRE & POST Scripts
• Each node can have a PRE or POST script, executed as part of the node:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D
    Script PRE B stage-in.sh
    Script POST B stage-out.sh
[Figure: the DAG with PRE and POST scripts attached to Job B]
PRE & POST Scripts (cont’d)
• Useful for staging a job’s data from remote repositories, and/or putting it back afterwards.
• Example:
    • PRE: Globus FTP the data from afar
    • Run the job
    • POST: Globus FTP the data back
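The ordering of the three steps inside a node can be sketched as below. The function `run_node` and its callable arguments are hypothetical stand-ins for the scripts and the Condor job. The handling of a failed PRE script (skipping the job and the POST script) is an assumption of this sketch, since the slides say only that the scripts execute as part of the node.

```python
def run_node(name, job, pre=None, post=None):
    """Run PRE, then the job, then POST; return the labels of the steps that ran."""
    ran = []
    if pre is not None:
        ran.append(f"PRE {name}")
        if not pre():
            # Assumption: a failed PRE script skips the job and the POST script.
            return ran
    ran.append(f"JOB {name}")
    job()
    if post is not None:
        ran.append(f"POST {name}")
        post()
    return ran


# Stand-ins for stage-in.sh, the Condor job, and stage-out.sh.
steps = run_node("B", job=lambda: True, pre=lambda: True, post=lambda: True)
print(steps)
```

A node without scripts, such as A in the example file, simply runs its job on its own.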
Submit Throttling
• DAGMan can limit the maximum number of jobs it will submit to Condor at once:
    condor_submit_dag -maxjobs N
• Useful for managing resource limitations (e.g., storage).
• Example: 1000 jobs, each of which requires 1 GB of disk space, and you have 100 GB of disk.
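The arithmetic behind the disk example, and the effect of the limit, can be worked through in a short sketch. Both helper names are hypothetical, and the simulation is a toy model (one queued job finishes per step) rather than a description of how DAGMan tracks its queue.

```python
from collections import deque


def max_jobs_for_disk(disk_total_gb, disk_per_job_gb):
    """Largest number of jobs whose combined disk use fits in the available space."""
    return disk_total_gb // disk_per_job_gb


def simulate_throttle(total_jobs, maxjobs):
    """Submit jobs while fewer than `maxjobs` are queued; return the peak queue length.

    Assumes maxjobs >= 1.
    """
    waiting = deque(range(total_jobs))
    queued = deque()
    peak = 0
    while waiting or queued:
        while waiting and len(queued) < maxjobs:
            queued.append(waiting.popleft())
        peak = max(peak, len(queued))
        queued.popleft()  # toy model: the oldest queued job finishes
    return peak


limit = max_jobs_for_disk(100, 1)      # 100 GB of disk, 1 GB per job
peak = simulate_throttle(1000, limit)  # 1000 jobs under that limit
print(limit, peak)
```

With 100 GB of disk and 1 GB per job, a limit of 100 keeps the 1000-job workload within the available space, so N = 100 is the value to pass to `-maxjobs` in this example.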
Summary
• DAGMan:
    • manages dependencies, holding and running jobs only at the appropriate times
    • monitors job progress
    • is fault-tolerant
    • is recoverable in case of job failure
    • provides some additional features to Condor
    • currently can run only on Unix itself, but its jobs can run anywhere
Future Work
• More sophisticated management of remote data transfer & staging to maximize CPU throughput.
    • Keep the pipeline full! That is, intelligently manage disk & network so that remote data is always ready wherever a CPU becomes available.
    • Possible integration with Kangaroo, etc.
• Better integration with Condor tools
    • condor_q, etc. displaying DAG information
Conclusion
• Interested in seeing more?
• Come to the DAGMan demo
    • Wednesday 9am - noon
    • Room 3393, Computer Sciences (1210 W. Dayton St.)
• Email me:
    • <pfc@cs.wisc.edu>
• Try it:
    • http://www.cs.wisc.edu/condor