OGF 19 Condor Software Forum Routing Jobs to the Grid


Presentation Transcript


  1. OGF 19 Condor Software Forum: Routing Jobs to the Grid

  2. What’s a Job Router? • Specialized scheduler operating on the schedd’s jobs • a.k.a. Schedd On The Side [Diagram labels: Schedd, job queue, Job 1, Job 2, Job 3, Job 4, Job 5, …, Job Router, Job 4*]

  3. Adapted Quill Technology • Using Quill library to mirror job queue in memory • Efficient - just “tails” the log • Independent - mirror without clogging schedd command queue • Modifying the job queue is another matter - must interact with schedd
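The "tail the log" idea on this slide can be sketched as follows. This is a minimal illustration, not the Quill library or Condor's actual job queue log: the record format (`NEW`, `SET`, `DEL`) and the names `QueueMirror` and `append` are assumptions made up for the example. It shows only the read side: the mirror replays appended records from its last offset and never sends a command to the queue owner.

```python
import io


class QueueMirror:
    """In-memory copy of a job queue, rebuilt by replaying an append-only log."""

    def __init__(self, log):
        self.log = log      # seekable text stream that the queue owner appends to
        self.offset = 0     # position just past the last complete record applied
        self.jobs = {}      # job id -> {attribute name: value}

    def poll(self):
        """Apply records appended since the last poll; return how many were read."""
        self.log.seek(self.offset)
        count = 0
        while True:
            line = self.log.readline()
            if not line.endswith("\n"):
                break       # end of log, or a record still being written
            self.offset = self.log.tell()
            self._apply(line.split())
            count += 1
        return count

    def _apply(self, fields):
        # Hypothetical record types: NEW <id>, SET <id> <attr> <value...>, DEL <id>
        if len(fields) < 2:
            return
        op, job_id = fields[0], fields[1]
        if op == "NEW":
            self.jobs[job_id] = {}
        elif op == "SET" and len(fields) >= 4:
            self.jobs.setdefault(job_id, {})[fields[2]] = " ".join(fields[3:])
        elif op == "DEL":
            self.jobs.pop(job_id, None)


def append(log, text):
    """Stand-in for the queue owner writing to its log."""
    log.seek(0, io.SEEK_END)
    log.write(text)


log = io.StringIO()
mirror = QueueMirror(log)
append(log, "NEW 4\nSET 4 Universe vanilla\n")
mirror.poll()
append(log, "SET 4 Own")        # partial record: skipped until the newline arrives
mirror.poll()
append(log, "er alice\n")
mirror.poll()
print(mirror.jobs)
```

Changing a job, as the last bullet notes, would still have to go through the schedd; the mirror here is read-only by design.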

  4. Usage Case Routing: Vanilla -> Grid

  5. Condor Farm Story • Now that this is working, how can I use my collaborator’s resources too? [Diagram labels: Application, condor_submit, job queue, Schedd, Negotiator, Startd, Resources, and a set of jobs each labeled “Random Seed”]

  6. Option #1: Merge Farms • Combine machines with collaborator into one Condor resource pool. • Everything works just like it did before. • Excellent option for small to medium clusters. • Requires bidirectional connectivity to all startds, or equivalent via GCB. • Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)

  7. Option #1b: submit to multiple pools • condor_submit -remote … • Works • Ok for small scale • Have to manually partition jobs

  8. Option #2: Flocking Together • full featured (std universe etc.) • automatic matchmaking • easy to configure • requires bidirectional connectivity • both sites must run Condor [Diagram labels: Schedd, two Negotiators, Local Startds, Remote Startds, “Random Seed” jobs]

  9. Option #3: Grid Universe • easier to live with private networks • may use non-Condor resources • restricted Condor feature set (e.g. no std universe over grid) • must pre-allocate jobs between vanilla and grid universe [Diagram labels: Schedd, Negotiator, Startds (vanilla), Gatekeeper, site X, “Random Seed” jobs]

  10. Option #4: Routing Jobs • dynamic allocation of jobs between vanilla and grid universes. • not every job is appropriate for transformation into a grid job. [Diagram labels: Schedd, Schedd On The Side, Negotiator, Local Startds (vanilla), Gatekeeper, site X, site Y, site Z, “Random Seed” jobs]

  11. Example Routing Table
      [ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs"; MaxJobs = 500; MaxIdle = 50; set_GlobusRSL = "(…)" ]
      [ GridResource = "condor schedd.site2 collector.site2"; MaxJobs = 700; MaxIdle = 100; Requirements = other.ImageSize < 500 ]
      …
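One way to read the routing table above is as an ordered list of candidate routes, each with limits and an optional requirement on the job. The sketch below makes several assumptions that the slide does not state: that `MaxJobs` caps the total number of jobs sent along a route, that `MaxIdle` caps how many of those may be idle at once, and that the first eligible entry wins. The `pick_route` function and the `usage` bookkeeping are hypothetical names, and `Requirements` is modeled as a plain Python callable rather than a ClassAd expression.

```python
def pick_route(job, routes, usage):
    """Return the GridResource of the first route that can accept `job`, or None.

    routes: list of dicts mirroring the routing table entries.
    usage:  GridResource -> {"jobs": total routed, "idle": routed and still idle}.
    """
    for route in routes:
        used = usage.get(route["GridResource"], {"jobs": 0, "idle": 0})
        if used["jobs"] >= route["MaxJobs"]:
            continue                      # route is full (assumed meaning of MaxJobs)
        if used["idle"] >= route["MaxIdle"]:
            continue                      # enough idle jobs already waiting there
        requirements = route.get("Requirements")
        if requirements is not None and not requirements(job):
            continue                      # job does not satisfy this route
        return route["GridResource"]
    return None                           # leave the job where it is


routes = [
    {"GridResource": "gt2 gatekeeper.site1/jobmanager-pbs",
     "MaxJobs": 500, "MaxIdle": 50},
    {"GridResource": "condor schedd.site2 collector.site2",
     "MaxJobs": 700, "MaxIdle": 100,
     "Requirements": lambda job: job["ImageSize"] < 500},
]

# site1 already has its 50 idle jobs, so only site2 is a candidate.
usage = {"gt2 gatekeeper.site1/jobmanager-pbs": {"jobs": 120, "idle": 50}}
print(pick_route({"ImageSize": 400}, routes, usage))
print(pick_route({"ImageSize": 600}, routes, usage))
```

Returning `None` corresponds to the job simply staying in the vanilla queue, which fits the point that not every job is appropriate for routing.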

  12. What About I/O? • Jobs must be sandboxable (i.e. specifying input/output via transfer-files mechanism). • Routing of standard universe is not supported. • Must have enough storage space at site for input/output files!

  13. What Types of Grids? • Routing table may contain any combination of grid types supported by Condor’s grid universe. • Example: Condor-C • for two Condor sites, schedd-to-schedd submission requires no additional software • however, still not as trivial to use as flocking [Diagram labels: Schedd, Schedd On The Side, Negotiator, Schedd at site X, “Random Seed” jobs]

  14. Source Routing • Routing the old-fashioned way:
      universe = Grid
      GridResource = condor site1 …
      remote_universe = Grid
      remote_GridResource = condor site2 …
      remote_remote_universe = Grid
      remote_remote_GridResource = pbs
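The pattern in the source-routing example is that each further hop repeats the same two attributes with one more `remote_` prefix. A small sketch of that pattern follows; `source_route` is a hypothetical helper written for illustration, not a Condor tool, and the hop strings are copied from the slide with their elisions left as they are.

```python
def source_route(hops):
    """Build submit-description lines for a chain of grid hops.

    hops: GridResource values in the order the job should travel.
    Hop number k (starting at 0) gets k copies of the "remote_" prefix.
    """
    lines = []
    for depth, resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {resource}")
    return lines


for line in source_route(["condor site1 …", "condor site2 …", "pbs"]):
    print(line)
```

The whole path is fixed by the submitter at submit time, which is the contrast with the routing table approach, where the choice of destination is made after the job is already in the queue.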

  15. Routing At the Site • navigate internal firewalls • provide custom routes for special users • improve scalability • However, keep in mind I/O requirements etc. [Diagram labels: Gatekeeper, Schedd, Schedd On The Side, X, X2, X3]

  16. Multicast in Future? • Currently: route one job to one site • Multicast: route one job to many sites • Thin out all but first to germinate • … or all but first to yield fruit.
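The two thinning policies on this slide can be illustrated with a toy model. Since the slide describes a possible future feature, everything here is an assumption: the event names `"start"` and `"finish"`, the `thin_out` function, and the idea that events arrive as a time-ordered list. "First to germinate" is modeled as the first copy to start running, "first to yield fruit" as the first copy to finish.

```python
def thin_out(copies, events, keep_on="start"):
    """Pick the surviving copy of a multicast job.

    copies:  set of site names that received a copy of the job.
    events:  time-ordered list of (site, event), event is "start" or "finish".
    keep_on: "start"  keeps the first copy to begin running,
             "finish" keeps the first copy to complete.
    Returns (winning site, sorted list of sites whose copies are removed),
    or (None, []) if no copy has reached the chosen event yet.
    """
    for site, event in events:
        if site in copies and event == keep_on:
            return site, sorted(copies - {site})
    return None, []


events = [("Y", "start"), ("X", "start"), ("X", "finish"), ("Y", "finish")]
print(thin_out({"X", "Y", "Z"}, events, keep_on="start"))
print(thin_out({"X", "Y", "Z"}, events, keep_on="finish"))
```

In this example the two policies pick different winners: site Y starts first, but site X finishes first, so thinning on completion keeps more copies alive for longer.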

  17. Future Glidein Factory • true late binding of jobs to resources • may run on top of non-Condor sites • supports full feature-set of Condor (e.g. standard universe) • requires GCB for private networks [Diagram labels: Schedd, Schedd On The Side, Negotiator, Startds, home, Gatekeeper, site X, glidein jobs, “Random Seed” jobs]

  18. Glideing in the Factory • schedd-to-schedd: hierarchical strategy for scalability and reliability; better match for private networks • schedd-to-gatekeeper: may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”. [Diagram labels: Schedd, Schedd On The Side, glidein factory, site X, “Random Seed” jobs]

  19. Pluggable Router • Beyond simple ClassAd transforms • Plugins would fire when a job matches an entry in the routing table • Don’t yet understand semantics • There is work to do!

  20. Thanks • Interested? Let us know. • We are currently using job routing for specific users at UW. • Future development will focus on more use-cases. • Jaime Frey, jfrey@cs.wisc.edu
