Condor-G: A Case in Distributed Job Delegation

lonna

Presentation Transcript


  1. Condor-G: A Case in Distributed Job Delegation

  2. Job Delegation • Transfer of responsibility to schedule and execute a job • Multiple delegations can form a chain

  3. Job Delegation in Condor-G Today [Diagram labels: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine]

  4. Expanding the Model • What can we do with new forms of job delegation? • Some ideas • Mirroring • Load-balancing • Glide-in schedd • Multi-hop grid scheduling

  5. Mirroring • What it does • Jobs mirrored on two Condor-Gs • If primary Condor-G crashes, secondary one starts running jobs • On recovery, primary Condor-G gets job status from secondary one • Removes Condor-G submit point as single point of failure

  6. Mirroring Example [Diagram labels: Matchmaker, Condor-G 1, Condor-G 2, Execute Machine]

  7. Mirroring Example (continued) [Diagram labels: Matchmaker, Condor-G 1, Condor-G 2, Execute Machine]
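
The mirroring behavior from slide 5 can be sketched as a small simulation. This is only an illustration under assumptions: the `SubmitPoint` class, the host names, and the status strings are hypothetical and are not Condor-G's actual interfaces or job states.

```python
from dataclasses import dataclass, field


@dataclass
class SubmitPoint:
    """Hypothetical stand-in for one Condor-G submit point."""
    name: str
    alive: bool = True
    jobs: dict = field(default_factory=dict)  # job id -> status string


def submit(primary, secondary, job_id):
    # The job is mirrored: both submit points record it.
    primary.jobs[job_id] = "idle"
    secondary.jobs[job_id] = "idle"


def active_manager(primary, secondary):
    # The secondary manages jobs only while the primary is down.
    return primary if primary.alive else secondary


def recover(primary, secondary):
    # On recovery, the primary takes job status from the secondary.
    primary.alive = True
    primary.jobs.update(secondary.jobs)


primary = SubmitPoint("condor-g-1")
secondary = SubmitPoint("condor-g-2")
submit(primary, secondary, "job-1")

primary.alive = False                       # primary crashes
manager = active_manager(primary, secondary)
manager.jobs["job-1"] = "running"           # secondary starts running the job

recover(primary, secondary)                 # primary returns and resynchronizes
```

After `recover`, the primary again acts as the manager and already knows that `job-1` is running, so the submit point is no longer a single point of failure.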

  8. Load-Balancing • What it does • Front-end Condor-G distributes all jobs among several back-end Condor-Gs • Front-end Condor-G keeps updated job status • Improves scalability • Maintains single submit point for users

  9. Load-Balancing Example [Diagram labels: Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3]
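
A minimal sketch of the load-balancing model, assuming a least-loaded distribution policy. The slides do not say which policy the front-end uses, and the `FrontEnd` class and back-end names here are hypothetical.

```python
class FrontEnd:
    """Hypothetical front-end that spreads jobs over several back-ends."""

    def __init__(self, backend_names):
        self.assigned = {name: [] for name in backend_names}
        self.status = {}  # job id -> (back-end name, state)

    def submit(self, job_id):
        # Assumed policy: send the job to the back-end holding the fewest jobs.
        target = min(self.assigned, key=lambda name: len(self.assigned[name]))
        self.assigned[target].append(job_id)
        self.status[job_id] = (target, "idle")
        return target

    def report(self, job_id, state):
        # A back-end reports a state change; the front-end keeps status current.
        backend, _ = self.status[job_id]
        self.status[job_id] = (backend, state)


front = FrontEnd(["backend-1", "backend-2", "backend-3"])
targets = [front.submit(f"job-{i}") for i in range(6)]
front.report("job-0", "running")
```

Users interact only with `front`, which preserves the single submit point while the job load is shared among the back-ends.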

  10. Glide-In Schedd • What it does • Drop a Condor-G onto the front-end machine of a cluster • Delegate jobs to the cluster through the glide-in schedd • Apply cluster-specific policies to jobs

  11. Glide-In Schedd Example [Diagram labels: Condor-G, Glide-In Schedd, Batch System]

  12. Multi-Hop Grid Scheduling • Match a job to a Virtual Organization (VO), then to a resource within that VO • Easier to schedule jobs across multiple VOs and grids

  13. Multi-Hop Grid Scheduling Example [Diagram labels: Experiment Resource Broker, VO Resource Broker, Experiment Condor-G, VO Condor-G, Globus GRAM, Batch Scheduler]
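
The two-stage matching from slide 12 can be illustrated with a toy example: first pick a VO, then pick a resource inside it. The VO table, the job fields, and the first-fit rule are all assumptions made for the sketch; real matchmaking would use far richer requirements.

```python
# Hypothetical VO catalogue: member experiments and per-resource CPU counts.
VOS = {
    "vo-a": {"members": {"experiment-x"},
             "resources": {"cluster-1": 4, "cluster-2": 16}},
    "vo-b": {"members": {"experiment-y"},
             "resources": {"cluster-3": 8}},
}


def match_vo(job):
    # First hop: find a VO that serves the job's experiment.
    for name, vo in VOS.items():
        if job["experiment"] in vo["members"]:
            return name
    return None


def match_resource(job, vo_name):
    # Second hop: first resource in that VO with enough CPUs (assumed rule).
    for resource, cpus in VOS[vo_name]["resources"].items():
        if cpus >= job["cpus"]:
            return resource
    return None


def schedule(job):
    vo_name = match_vo(job)
    if vo_name is None:
        return None
    resource = match_resource(job, vo_name)
    return None if resource is None else (vo_name, resource)


placement = schedule({"experiment": "experiment-x", "cpus": 8})
```

Each hop needs only local knowledge: the first broker knows about VOs, and the VO-level broker knows about its own resources.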

  14. Endless Possibilities • These new models can be combined with each other or with other new models • Resulting system can be arbitrarily sophisticated

  15. Job Delegation Challenges • New complexity introduces new issues and exacerbates existing ones • A few… • Transparency • Representation • Scheduling Control • Active Job Control • Revocation • Error Handling and Debugging

  16. Transparency • Full information about job should be available to user • Information from full delegation path • No manual tracing across multiple machines • Users need to know what’s happening with their jobs

  17. Representation • Job state is a vector • How best to show this to user • Summary • Current delegation endpoint • Job state at endpoint • Full information available if desired • Series of nested ClassAds?
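
One way to picture the "series of nested ClassAds" idea is with nested records, one per hop. The sketch below uses plain Python dictionaries as a stand-in; it is not ClassAd syntax, and the attribute names (`host`, `state`, `delegated_to`) are hypothetical.

```python
# Each record describes the job at one hop and nests the next hop's record.
job = {
    "host": "condor-g-front",
    "state": "delegated",
    "delegated_to": {
        "host": "condor-g-back",
        "state": "delegated",
        "delegated_to": {
            "host": "batch-frontend",
            "state": "running",
            "delegated_to": None,
        },
    },
}


def state_vector(ad):
    # Full information: one (host, state) pair per hop in the delegation path.
    vector = []
    while ad is not None:
        vector.append((ad["host"], ad["state"]))
        ad = ad["delegated_to"]
    return vector


def summary(ad):
    # Summary view: current delegation endpoint and the job state there.
    return state_vector(ad)[-1]


full = state_vector(job)
short = summary(job)
```

The summary is what a user would see by default, and the full vector remains available on request.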

  18. Scheduling Control • Avoid loops in delegation path • Give user control of scheduling • Allow limiting of delegation path length? • Allow user to specify part or all of delegation path
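
Loop avoidance and a path-length limit can both be checked before each delegation if the job carries its delegation path with it. The following is a sketch under that assumption; `may_delegate` and its parameters are hypothetical names.

```python
def may_delegate(path, next_hop, max_delegations=None):
    """Decide whether a job may be delegated to next_hop.

    path lists the hosts that have held the job so far, starting with
    the original submit point, so len(path) - 1 delegations have occurred.
    """
    if next_hop in path:
        return False  # delegating here would create a loop
    if max_delegations is not None and len(path) > max_delegations:
        return False  # one more delegation would exceed the user's limit
    return True


loop = may_delegate(["a", "b"], "a")
within_limit = may_delegate(["a", "b"], "c", max_delegations=2)
over_limit = may_delegate(["a", "b", "c"], "d", max_delegations=2)
```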

  19. Active Job Control • User may request certain actions • hold, suspend, vacate, checkpoint • Actions cannot be completed synchronously for user • Must forward along delegation path • User checks completion later

  20. Active Job Control (cont) • Endpoint systems may not support actions • If possible, execute them at furthest point that does support them • Allow user to apply action in middle of delegation path
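
The rule "execute them at the furthest point that does support them" amounts to a search from the end of the delegation path backward. A minimal sketch, assuming each hop advertises the set of actions it supports; the hop names and capability sets below are invented for illustration.

```python
def furthest_supporting_hop(path, action):
    # Walk from the endpoint back toward the user and stop at the first
    # hop that can carry out the requested action.
    for hop in reversed(path):
        if action in hop["supports"]:
            return hop["name"]
    return None


# Hypothetical delegation path, ordered from the user toward the endpoint.
path = [
    {"name": "condor-g", "supports": {"hold", "suspend", "vacate", "checkpoint"}},
    {"name": "gram", "supports": {"hold", "suspend"}},
    {"name": "batch-system", "supports": {"hold"}},
]

hold_at = furthest_supporting_hop(path, "hold")
suspend_at = furthest_supporting_hop(path, "suspend")
checkpoint_at = furthest_supporting_hop(path, "checkpoint")
```

Because the request has to travel along the path before it takes effect, the caller would still need to check for completion later, as slide 19 notes.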

  21. Revocation • Leases • Lease must be renewed periodically for delegation to remain valid • Allows revocation during long-term failures • What are good values for lease lifetime and update interval?
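
A lease can be sketched as an expiry time that the delegating side must keep pushing forward. The `Lease` class and the numbers below are illustrative assumptions only; the slide leaves good values for lifetime and update interval as an open question.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease attached to one delegation (times in seconds)."""
    lifetime: float
    expires_at: float

    def renew(self, now):
        # Each renewal extends validity by one full lifetime from now.
        self.expires_at = now + self.lifetime

    def is_valid(self, now):
        # Once the lease lapses, the delegation is treated as revoked.
        return now < self.expires_at


lease = Lease(lifetime=600, expires_at=600)  # granted at t=0
lease.renew(now=300)                         # renewed on schedule at t=300
valid_before_expiry = lease.is_valid(now=800)
valid_after_failure = lease.is_valid(now=1000)  # no renewal arrived after t=300
```

If the delegator fails for longer than the lease lifetime, the lease lapses without any message having to reach the far end, which is what makes revocation possible during long-term failures.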

  22. Error Handling and Debugging • Many more places for things to go horribly wrong • Need clear, simple error semantics • Logs, logs, logs • Have them everywhere

  23. Current Status • Done • Mirroring • In Progress • Condor-G -> Condor-G delegation • User must specify hops • Glide-in schedd • Set up by hand

  24. Thank You! • Questions?
