Dynamic DAGMan with ClassAds
Himani Apte
Outline • DAGMan workflow management • Motivation for dynamic DAGMan • ClassAds • Putting together: DAGMan + ClassAds • Looking ahead
DAGMan • Directed Acyclic Graph Manager • Meta-scheduler for Condor • DAG: set of jobs with dependencies • Manages submission of DAG jobs • Enforces execution order • DAGMan itself is a Condor job!
Example DAG
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Parent A Child B C
    Parent B C Child D
    Script PRE A input.sh
    Script POST D output.sh
[Diagram: nodes A, B, C, D; A is the parent of B and C, which are both parents of D]
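As a rough illustration of how the Parent/Child lines in the example determine execution order, the Python sketch below parses the Job and Parent/Child lines and derives one valid order. It is not DAGMan's parser: it ignores Script lines and every other keyword, and the names `parse_dag` and `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# The Job and Parent/Child lines of the example DAG.
DAG_TEXT = """\
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, parents).

    jobs maps node name -> submit file; parents maps node name -> set of
    parent node names. Only Job and Parent/Child lines are understood.
    """
    jobs = {}
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            # Everything before "Child" is a parent, everything after a child.
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


def execution_order(parents):
    """Return one order in which every node comes after all of its parents."""
    return list(TopologicalSorter(parents).static_order())


jobs, parents = parse_dag(DAG_TEXT)
order = execution_order(parents)
print(order)  # A comes first and D last; B and C may appear in either order
```

B and C have no dependency on each other, so a scheduler is free to run them concurrently; the sketch only shows that any valid order starts with A and ends with D.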
Simplified state diagram of a DAG node
[Diagram: node states Waiting, Pre-running, Submitted, Post-running, Done, Failed]
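The slide names the node states but the transitions between them are not recoverable from the text. The Python sketch below shows one plausible way to encode such a state machine. The transition table is an assumption made for illustration only, and `NodeState`, `ALLOWED` and `advance` are hypothetical names.

```python
from enum import Enum


class NodeState(Enum):
    WAITING = "Waiting"
    PRE_RUNNING = "Pre-running"
    SUBMITTED = "Submitted"
    POST_RUNNING = "Post-running"
    DONE = "Done"
    FAILED = "Failed"


# ASSUMED transitions (not taken from the slide): PRE and POST scripts are
# optional, and a failure can occur in any active state.
ALLOWED = {
    NodeState.WAITING: {NodeState.PRE_RUNNING, NodeState.SUBMITTED},
    NodeState.PRE_RUNNING: {NodeState.SUBMITTED, NodeState.FAILED},
    NodeState.SUBMITTED: {NodeState.POST_RUNNING, NodeState.DONE, NodeState.FAILED},
    NodeState.POST_RUNNING: {NodeState.DONE, NodeState.FAILED},
    NodeState.DONE: set(),
    NodeState.FAILED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = NodeState.WAITING
for nxt in (NodeState.PRE_RUNNING, NodeState.SUBMITTED,
            NodeState.POST_RUNNING, NodeState.DONE):
    state = advance(state, nxt)
print(state.value)
```

Keeping the table as data makes it easy to adjust once the real diagram is at hand.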
DAGMan: important properties • Monitors job state using Condor logs • Simple and clean recovery model • Rescue DAG: saves state at failure • Restart: reconstruct internal state • Scripts allow “lazy” planning • Throttling parameters
Motivation for dynamic DAGMan • DAG: complete execution order • Flexibility to make run-time decisions • Which subset of DAG nodes should execute? • When should node X execute? • Conditional DAGs • Associate a condition with DAG edges • Simplest condition: successful completion of parent nodes
Conditional DAG: examples
• Example 1: parent A with condition A.x == true; Yes leads to B, No leads to C
• Example 2: parents P1 and P2 with condition P1.x OR P2.x; child C
Motivation for dynamic DAGMan • Scripts can be leveraged for lazy planning • For simple conditions • E.g. exit value of job • Modify DAG structure • E.g. convert branch-not-taken to no-op/empty • We want a generic solution • Supported by “Dynamic DAGMan”
ClassAds • Classified advertisements • Used extensively in Condor • Define jobs, machines, resources • Define conditions, triggers, requirements • Maintain internal state
ClassAds • List of attribute-value pairs • Simple value types: integers, strings • Complex types: lists, expressions, ClassAds • Matchmaking framework • Tests match between two ClassAds • Using “Requirements” expression • Great fit for Dynamic DAGMan
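To make the matchmaking idea concrete, here is a minimal Python sketch that models a ClassAd as a dictionary of attribute-value pairs and a Requirements expression as a callable over the ad itself (`my`) and the candidate ad (`other`). This is not the ClassAd language or any Condor API. The attribute values, the treatment of a missing Requirements expression as always satisfied, and the function names are assumptions for illustration.

```python
def requirements_hold(ad, other):
    """Evaluate ad's Requirements against the other ad."""
    requirements = ad.get("Requirements")
    if requirements is None:
        # Assumption of this sketch: no Requirements means "always satisfied".
        return True
    return bool(requirements(ad, other))


def match(ad_a, ad_b):
    """Two ads match only if each one's Requirements accepts the other."""
    return requirements_hold(ad_a, ad_b) and requirements_hold(ad_b, ad_a)


# Illustrative ads; the values are made up for this example.
job_ad = {
    "Owner": "alice",
    "ImageSize": 512,
    "Requirements": lambda my, other: other.get("Memory", 0) >= my["ImageSize"],
}
big_machine_ad = {
    "Name": "node01",
    "Memory": 1024,
    "Requirements": lambda my, other: other.get("Owner") != "mallory",
}
small_machine_ad = {"Name": "node02", "Memory": 256}

print(match(job_ad, big_machine_ad))
print(match(job_ad, small_machine_ad))
```

The first pair matches because both Requirements expressions evaluate to true; the second does not because the job's memory requirement is not met.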
Putting together: DAGMan + ClassAds • Dynamic DAGMan research project • Work in progress • Not yet available in Condor • DAG nodes have associated ClassAds • Basic node attributes • Job identifier, name, type • Status (Waiting, Submitted, Done, etc.)
Dynamic DAGMan: attributes • Execution characteristics of job • Exit value • Wall-clock time • CPU utilization (local and remote) • Network statistics (bytes sent / received) • Information about files transferred (for vanilla universe) • Attributes maintained by Condor for a job
Dynamic DAGMan: conditions • Requirements expression • Defines trigger condition for the node • Arbitrarily complex expression • Defined on the attributes of parent nodes • Use matchmaking to determine if a node can be submitted
Dynamic DAG: example
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Parent A Child B \
        COND [ ( other.job == A && other.x == true ) ]
    Parent A Child C \
        COND [ ( other.job == A && other.x == false ) ]
[Diagram: A is followed by the condition x == true; Yes leads to B, No leads to C]
Dynamic DAGMan: example
    Job P1 P1.condor
    Job P2 P2.condor
    Job C C.condor
    Parent P1 P2 Child C \
        COND [ (other.job == P1 && other.x == true) ||
               (other.job == P2 && other.x == true) ]
[Diagram: P1 and P2 feed the condition P1.x OR P2.x, which leads to C]
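The two COND examples can be sketched in Python as predicates over a parent's ClassAd, with the parent ad playing the role of `other`. The evaluation rule used here (a child is triggered when its condition holds for the ad of at least one finished parent) is an assumption of this sketch, since the slides do not spell out the exact semantics. The function names and the dictionary representation are hypothetical.

```python
def cond_b(other):
    # COND [ ( other.job == A && other.x == true ) ]
    return other["job"] == "A" and other["x"] is True


def cond_c_branch(other):
    # COND [ ( other.job == A && other.x == false ) ]
    return other["job"] == "A" and other["x"] is False


def cond_c_join(other):
    # COND [ (other.job == P1 && other.x == true) ||
    #        (other.job == P2 && other.x == true) ]
    return (other["job"] == "P1" and other["x"] is True) or (
        other["job"] == "P2" and other["x"] is True
    )


def triggered(condition, finished_parent_ads):
    """ASSUMED rule: trigger if the condition matches any finished parent ad."""
    return any(condition(ad) for ad in finished_parent_ads)


# First example: A finished with x == true, so B runs and C does not.
a_ad = {"job": "A", "x": True}
print(triggered(cond_b, [a_ad]), triggered(cond_c_branch, [a_ad]))

# Second example: only P2 set x, which is enough for C to run.
p_ads = [{"job": "P1", "x": False}, {"job": "P2", "x": True}]
print(triggered(cond_c_join, p_ads))
```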
Dynamic DAGMan • Recovery model is still the same • Rescue DAG: saves node state at failure • ClassAd attribute-values can be re-generated from Condor logs • Flexibility to make run-time decisions • Which subset of nodes in the DAG should be executed? • When should node X be executed?
Looking ahead • DAG with only implicit edges • Parent-child relations embedded in ClassAds • Nodes specify • Trigger condition • Preference for child nodes to run • On-the-fly dependency formation based on previous node execution • DAGMan collaborates with Quill • Getting attributes from persistent storage
Looking ahead • Allow a job to modify/add its attributes • Determine what happens after the job exits • Global state control • Throttling expression/parameters • Global DAG ClassAd • Statistics on running, successful and failed jobs • E.g. if (#failed jobs > N) run cleanup node
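The global DAG ClassAd idea can be sketched as a small set of aggregate attributes recomputed from node status, with a trigger such as "#failed jobs > N" evaluated against it. The attribute names (`Running`, `Successful`, `Failed`), the mapping of the Submitted status to "running", and the function names below are assumptions made for illustration, not part of any existing DAGMan feature.

```python
from collections import Counter


def dag_statistics(node_status):
    """Build a DAG-level ad (a plain dict here) from per-node status strings."""
    counts = Counter(node_status.values())
    return {
        "Running": counts["Submitted"],  # assumption: Submitted counts as running
        "Successful": counts["Done"],
        "Failed": counts["Failed"],
    }


def should_run_cleanup(dag_ad, max_failed):
    """Evaluate the example trigger: #failed jobs > N."""
    return dag_ad["Failed"] > max_failed


node_status = {
    "A": "Done",
    "B": "Failed",
    "C": "Failed",
    "D": "Submitted",
    "E": "Waiting",
}
dag_ad = dag_statistics(node_status)
print(dag_ad)
print(should_run_cleanup(dag_ad, max_failed=1))
```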
Thank you • We would like to hear your suggestions!