Condor-G: A Case in Distributed Job Delegation
Job Delegation • Transfer of responsibility to schedule and execute a job • Multiple delegations can form a chain
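A delegation chain can be pictured as an ordered record of which system currently holds responsibility for a job. The sketch below is only illustrative: the `Job` class, the `delegate` helper, and the order of the hops are assumptions made for this example, not Condor-G code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    job_id: str
    # Systems that have held responsibility for the job, oldest first.
    path: List[str] = field(default_factory=list)


def delegate(job: Job, to_system: str) -> Job:
    """Record that responsibility for the job moved to to_system."""
    job.path.append(to_system)
    return job


job = Job("job-1")
# Hypothetical chain; each delegation adds one link.
for hop in ["Condor-G", "Globus GRAM", "Batch System Front-end", "Execute Machine"]:
    delegate(job, hop)

chain = " -> ".join(job.path)
print(chain)
```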
Job Delegation in Condor-G Today • Diagram: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine
Expanding the Model • What can we do with new forms of job delegation? • Some ideas • Mirroring • Load-balancing • Glide-in schedd • Multi-hop grid scheduling
Mirroring • What it does • Jobs mirrored on two Condor-Gs • If primary Condor-G crashes, secondary one starts running jobs • On recovery, primary Condor-G gets job status from secondary one • Removes Condor-G submit point as single point of failure
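The mirroring behavior above can be sketched as a toy model: jobs are recorded on both submit points, the secondary runs them only while the primary is down, and the primary copies job status back on recovery. The class names, the status strings, and the `alive` flag are assumptions for illustration; real failure detection is not modeled.

```python
class CondorG:
    """Toy stand-in for a Condor-G submit point (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.jobs = {}  # job_id -> status string


class MirroredPair:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def submit(self, job_id):
        # Mirror the job on both Condor-Gs.
        for node in (self.primary, self.secondary):
            node.jobs[job_id] = "idle"

    def active(self):
        # The secondary takes over only while the primary is down.
        return self.primary if self.primary.alive else self.secondary

    def run_pending(self):
        node = self.active()
        for job_id in list(node.jobs):
            if node.jobs[job_id] == "idle":
                node.jobs[job_id] = "running"
        return node.name

    def recover_primary(self):
        # On recovery, the primary gets job status from the secondary.
        self.primary.alive = True
        self.primary.jobs.update(self.secondary.jobs)


primary = CondorG("condor-g-1")
secondary = CondorG("condor-g-2")
pair = MirroredPair(primary, secondary)
pair.submit("job-1")
primary.alive = False            # simulate a crash of the primary
ran_on = pair.run_pending()      # the secondary starts the job
pair.recover_primary()
```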
Mirroring Example • Diagram: Matchmaker, Condor-G 1, Condor-G 2, Execute Machine
Load-Balancing • What it does • Front-end Condor-G distributes all jobs among several back-end Condor-Gs • Front-end Condor-G keeps updated job status • Improves scalability • Maintains single submit point for users
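A minimal sketch of the front-end role, assuming a round-robin distribution policy (the slides do not say which policy is used). The `FrontEnd` class, its method names, and the back-end names are hypothetical.

```python
from itertools import cycle


class FrontEnd:
    """Toy front-end that hands jobs to back-ends in round-robin order."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)
        self.assignment = {}  # job_id -> back-end name
        self.status = {}      # job_id -> last status reported

    def submit(self, job_id):
        # Users see a single submit point; the back-end is chosen here.
        backend = next(self._next_backend)
        self.assignment[job_id] = backend
        self.status[job_id] = "idle"
        return backend

    def report(self, job_id, status):
        # Back-ends push updates so the front-end keeps current job status.
        self.status[job_id] = status


front = FrontEnd(["back-end-1", "back-end-2", "back-end-3"])
placed = [front.submit(f"job-{i}") for i in range(4)]
front.report("job-0", "running")
```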
Load-Balancing Example • Diagram: Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3
Glide-In Schedd • What it does • Drop a Condor-G onto the front-end machine of a cluster • Delegate jobs to the cluster through the glide-in schedd • Apply cluster-specific policies to jobs
Glide-In Schedd Example • Diagram: Condor-G, Glide-In Schedd, Batch System
Multi-Hop Grid Scheduling • Match a job to a Virtual Organization (VO), then to a resource within that VO • Easier to schedule jobs across multiple VOs and grids
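The two matching steps can be sketched as two small functions, one per hop. Everything here is assumed for illustration: the VO and cluster names, the `allowed_vos` and `memory_mb` job attributes, and the best-fit memory rule used in the second hop.

```python
from typing import Dict, Optional, Tuple

# Hypothetical VOs, each mapping resource name -> memory in MB.
VO_RESOURCES: Dict[str, Dict[str, int]] = {
    "vo-physics": {"cluster-a": 4096, "cluster-b": 16384},
    "vo-biology": {"cluster-c": 8192},
}


def match_vo(job: dict, vos: Dict[str, Dict[str, int]]) -> Optional[str]:
    """First hop: pick the first VO the job may use."""
    for vo in job["allowed_vos"]:
        if vo in vos:
            return vo
    return None


def match_resource(job: dict, resources: Dict[str, int]) -> Optional[str]:
    """Second hop: pick the smallest resource that still fits the job."""
    fitting = [(mem, name) for name, mem in resources.items()
               if mem >= job["memory_mb"]]
    return min(fitting)[1] if fitting else None


def schedule(job: dict, vos: Dict[str, Dict[str, int]]) -> Optional[Tuple[str, str]]:
    vo = match_vo(job, vos)
    if vo is None:
        return None
    resource = match_resource(job, vos[vo])
    return (vo, resource) if resource is not None else None


placement = schedule({"allowed_vos": ["vo-physics"], "memory_mb": 8000}, VO_RESOURCES)
```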
Multi-Hop Grid Scheduling Example • Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler
Endless Possibilities • These new models can be combined with each other or with other new models • Resulting system can be arbitrarily sophisticated
Job Delegation Challenges • New complexity introduces new issues and exacerbates existing ones • A few… • Transparency • Representation • Scheduling Control • Active Job Control • Revocation • Error Handling and Debugging
Transparency • Full information about job should be available to user • Information from full delegation path • No manual tracing across multiple machines • Users need to know what’s happening with their jobs
Representation • Job state is a vector • How best to show this to user • Summary • Current delegation endpoint • Job state at endpoint • Full information available if desired • Series of nested ClassAds?
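One way to picture the nested-ad idea: each hop's view of the job contains the next hop's view, and the summary is read from the innermost one. Plain Python dicts stand in for ClassAds here, and the attribute names (`System`, `JobStatus`, `Delegated`) are invented for this sketch rather than taken from Condor.

```python
# Each level is one hop's view of the job; "Delegated" holds the next hop's view.
job_ad = {
    "System": "Condor-G front-end",
    "JobStatus": "delegated",
    "Delegated": {
        "System": "Condor-G back-end 2",
        "JobStatus": "delegated",
        "Delegated": {
            "System": "Batch System",
            "JobStatus": "running",
        },
    },
}


def summarize(ad: dict) -> dict:
    """Walk to the end of the delegation path and report the endpoint state."""
    hops = 1
    while "Delegated" in ad:
        ad = ad["Delegated"]
        hops += 1
    return {"Endpoint": ad["System"], "JobStatus": ad["JobStatus"], "Hops": hops}


summary = summarize(job_ad)
```

The full nested structure stays available for users who want every hop, while `summarize` gives the short view.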
Scheduling Control • Avoid loops in delegation path • Give user control of scheduling • Allow limiting of delegation path length? • Allow user to specify part or all of delegation path
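Loop avoidance and a path-length limit can both be checked from the delegation path recorded so far. This is a sketch under assumptions: `can_delegate` and `max_hops` are hypothetical names, and the path is taken to list every system that has held the job, starting with the submit point.

```python
from typing import List, Optional


def can_delegate(path: List[str], target: str,
                 max_hops: Optional[int] = None) -> bool:
    """Decide whether the job may be delegated to target.

    path     -- systems that have held the job so far, submit point first
    max_hops -- optional limit on the total number of delegations
    """
    if target in path:
        return False  # delegating here would create a loop
    # A path of n systems has already used n - 1 delegations.
    if max_hops is not None and len(path) > max_hops:
        return False
    return True


path = ["Condor-G A", "Condor-G B"]
loop_allowed = can_delegate(path, "Condor-G A")
next_allowed = can_delegate(path, "Condor-G C", max_hops=2)
too_long = can_delegate(path, "Condor-G C", max_hops=1)
```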
Active Job Control • User may request certain actions • hold, suspend, vacate, checkpoint • Actions cannot be completed synchronously for user • Must forward along delegation path • User checks completion later
Active Job Control (cont) • Endpoint systems may not support actions • If possible, execute them at furthest point that does support them • Allow user to apply action in middle of delegation path
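Choosing where to execute an action can be sketched as a search from the endpoint back toward the submit point. The capability sets below are made up for the example, and the asynchronous forwarding itself is not modeled.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical delegation path: (system name, actions it supports),
# ordered from the submit point to the endpoint.
PATH: List[Tuple[str, Set[str]]] = [
    ("Condor-G", {"hold", "suspend", "vacate", "checkpoint"}),
    ("Globus GRAM", {"hold", "vacate"}),
    ("Batch System", {"vacate"}),
]


def furthest_supporting(path: List[Tuple[str, Set[str]]],
                        action: str) -> Optional[str]:
    """Return the system furthest along the path that supports the action."""
    for name, supported in reversed(path):
        if action in supported:
            return name
    return None  # no system on the path can perform the action


hold_at = furthest_supporting(PATH, "hold")
vacate_at = furthest_supporting(PATH, "vacate")
checkpoint_at = furthest_supporting(PATH, "checkpoint")
```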
Revocation • Leases • Lease must be renewed periodically for delegation to remain valid • Allows revocation during long-term failures • What are good values for lease lifetime and update interval?
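A lease reduces to an expiry time that the delegator keeps pushing forward. The sketch below passes time in explicitly so it stays deterministic; the `Lease` class and the 600-second lifetime are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    lifetime: float          # seconds a renewal keeps the delegation valid
    expires_at: float = 0.0  # absolute expiry time; 0.0 means never granted

    def renew(self, now: float) -> None:
        # Called periodically by the delegator while it is healthy.
        self.expires_at = now + self.lifetime

    def is_valid(self, now: float) -> bool:
        # Once renewals stop, the delegation lapses on its own.
        return now < self.expires_at


lease = Lease(lifetime=600)
lease.renew(now=0)
valid_mid = lease.is_valid(now=300)
lease.renew(now=300)                 # update interval shorter than lifetime
valid_late = lease.is_valid(now=800)
valid_expired = lease.is_valid(now=900)
```

A shorter lifetime makes revocation faster after a failure but requires more frequent renewals, which is the trade-off behind the question on the slide.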
Error Handling and Debugging • Many more places for things to go horribly wrong • Need clear, simple error semantics • Logs, logs, logs • Have them everywhere
Current Status • Done: Mirroring • In Progress: Condor-G -> Condor-G delegation (user must specify hops) • In Progress: Glide-in schedd (set up by hand)
Thank You! • Questions?