Pilot Factory addresses the large computational overhead of Condor-G pilot submission. It is an application that efficiently submits and manages pilot jobs on remote machines, bypassing GRAM. It uses the schedd glidein approach, with pilots that pull and categorize jobs, to streamline submission for Condor-G and other environments.
Pilot Factory using Schedd Glidein Barnett Chiu BNL 10.04.07
Problem to Solve (1)
• Pilot
  • Probes the resource (http, environment, interpreter, other executables, etc.)
  • Pulls jobs from a remote server (e.g. Panda server)
• Matchmaking
  • Group jobs into categories, e.g. production jobs, analysis jobs (CHARMM …), test jobs …
  • Other criteria: number of CPUs, RAM, etc.
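The matchmaking step above can be sketched in a few lines of Python. This is only an illustration: the `Job` record, its field names, and the helper functions are hypothetical and do not reflect the Panda server's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: int
    category: str      # e.g. "production", "analysis", "test"
    min_cpus: int
    min_ram_mb: int


def group_by_category(jobs):
    """Group jobs into per-category lists, preserving input order."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job.category].append(job)
    return dict(groups)


def matches(job, cpus, ram_mb):
    """Return True when a probed resource satisfies the job's requirements."""
    return cpus >= job.min_cpus and ram_mb >= job.min_ram_mb


def pick_job(jobs, category, cpus, ram_mb):
    """Return the first job of the category that fits the resource, or None."""
    for job in group_by_category(jobs).get(category, []):
        if matches(job, cpus, ram_mb):
            return job
    return None


# Made-up example data, not real workload figures.
JOBS = [
    Job(1, "production", min_cpus=4, min_ram_mb=8192),
    Job(2, "analysis", min_cpus=1, min_ram_mb=2048),
    Job(3, "production", min_cpus=1, min_ram_mb=1024),
]
```

A pilot that has probed a 2-CPU, 4 GB node and asks for production work would call `pick_job(JOBS, "production", cpus=2, ram_mb=4096)`, which skips job 1 (too large) and returns job 3.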
Problem to Solve (2)
• Current approach to pilot submission
  • Local pool: Vanilla
  • Remote pool: Condor-G
• A large number of user jobs (production + analysis) means a large number of Condor-G pilot jobs, which in turn means computational overhead on the gatekeepers (e.g. high memory consumption)
Solution
• Is there any way to bypass GRAM when submitting jobs to remote machines?
• Local submission, but how?
• We need something that continuously submits local pilot jobs on the gatekeeper
• Solution: Pilot Factory
Pilot Factory Overview
• Pilot Factory is an application that combines the following ideas:
  • schedd glidein
  • pilot submission program (or pilot generator)
• What is a glidein?
  • A mini Condor pool on a remote machine
  • A complete Condor pool has at least 5 components: master, startd, schedd, collector, negotiator
  • Glidein: {master, startd}, {master, schedd}, etc.
  • Properly configured Condor daemons submitted as a batch job
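To make the "subset of daemons" idea concrete, the sketch below maps each glidein type to the daemons it runs and renders the corresponding `DAEMON_LIST` configuration line. `DAEMON_LIST` is a standard Condor configuration macro; the dictionary and helper function are illustrative assumptions, not part of the glidein tooling.

```python
# Daemons of a complete Condor pool, as listed on the slide.
FULL_POOL = ["MASTER", "STARTD", "SCHEDD", "COLLECTOR", "NEGOTIATOR"]

# Glidein types named on the slide; each is a subset of FULL_POOL.
GLIDEIN_TYPES = {
    "startd": ["MASTER", "STARTD"],
    "schedd": ["MASTER", "SCHEDD"],
}


def daemon_list_line(glidein_type):
    """Render the DAEMON_LIST config line for a glidein type (illustrative)."""
    daemons = GLIDEIN_TYPES[glidein_type]
    return "DAEMON_LIST = " + ", ".join(daemons)
```

For a schedd glidein, `daemon_list_line("schedd")` yields `DAEMON_LIST = MASTER, SCHEDD`.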
Glidein (1)
• Two major steps
  • Condor-G #1: installation
    • glidein setup script
    • Condor configuration file
    • glidein startup script
    • download Condor binaries (http, gsiftp, etc.)
  • Condor-G #2: execution
    • exec glidein startup script
    • condor_master
Glidein (2)
[Diagram: submit host, central manager (collector), tarball server, and execute hosts running glideins of different types ({master, startd}, {master, schedd …}). Labels: ~/Condor_glidein, startup script, glidein config, glidein types.]
Schedd Glidein
• Logic based on the startd glidein (two Condor-G jobs to set up)
• Usage: by running a glidein schedd on the gatekeeper, the schedd serves as a gateway between the submit host and grid sites
• Mini Condor pool with schedd functionality:
  • Submit host
  • Maintains a persistent queue of jobs
  • Communicates with the native batch system and forwards user jobs
    • Condor, PBS, LSF, etc.
  • Job queues are manipulated through the following commands:
    • condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio
• Security features (GSI)
  • Who is authorized to set up a Pilot Factory?
Schedd Glidein Example (1)
• Command: // schedd glidein #1
  condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk01.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
• Command: // schedd glidein #2
  condor_glidein -count 1 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork gridgk02.racf.bnl.gov/jobmanager-fork -type schedd -forcesetup
• Command: // schedd glidein #3, #4, #5
  condor_glidein -count 3 -arch 6.8.1-i686-pc-Linux-2.4 -setup_jobmanager=jobmanager-fork nostos.cs.wisc.edu/jobmanager-fork -type schedd -forcesetup
Use fork since we want the schedd to be on the gatekeeper!
Schedd Glidein Example (2)
Command: condor_status -schedd

Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs

agrd0926@gridgk01.ra gridgk01.r                0             0             0
agrd0926@gridgk02.ra gridgk02.r                0             0             0
pleiades@gridui01.us gridui01.u                0             0             0
pleiades@ribera.cs.w ribera.cs.                0             0             0
pleiades@ron.cs.wisc ron.cs.wis                0             0             0
pleiades@vail.cs.wis vail.cs.wi                0             0             0

                                TotalRunningJobs TotalIdleJobs TotalHeldJobs

                          Total                0             0             0
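A monitoring script may want these counts in structured form. The parser below is a minimal sketch that assumes the whitespace-separated five-column layout shown in the listing above; the sample text is abbreviated from that listing, and the function name is illustrative.

```python
SAMPLE = """\
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs

agrd0926@gridgk01.ra gridgk01.r                0             0             0
pleiades@ron.cs.wisc ron.cs.wis                0             0             0

                                TotalRunningJobs TotalIdleJobs TotalHeldJobs

                          Total                0             0             0
"""


def parse_schedd_status(text):
    """Return one dict per schedd row of `condor_status -schedd` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Schedd rows have five fields and a user@host style name; this
        # skips the header, blank lines, and the totals block.
        if len(fields) == 5 and "@" in fields[0]:
            name, machine, running, idle, held = fields
            rows.append({
                "name": name,
                "machine": machine,
                "running": int(running),
                "idle": int(idle),
                "held": int(held),
            })
    return rows
```

Calling `parse_schedd_status(SAMPLE)` returns two rows, one per schedd line in the sample.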
Pilot Submission Program (Generator)
• Communicates with a DB server that maintains information about pilot jobs
  • E.g. pilot_type, pilot_queue
• Pulls the desired pilot script from an external server
• Periodically submits pilot jobs (with the pilot script as executable)
  • condor_submit
  • qsub? No, not necessary, since …
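The periodic-submission part of the generator can be sketched as a small loop around `condor_submit`. This is an assumption-laden illustration, not the actual generator: the DB lookup and script download are left out, the `--type`/`--queue` pilot arguments and file names are hypothetical, and the `submit` and `sleep` parameters exist only so the loop can be exercised without a Condor installation.

```python
import subprocess
import tempfile
import time


def build_submit_description(pilot_script, pilot_type, pilot_queue,
                             universe="vanilla"):
    """Return submit-description text with the pilot script as executable.

    The --type/--queue arguments are hypothetical; a real pilot script
    defines its own command line.
    """
    lines = [
        f"universe = {universe}",
        f"executable = {pilot_script}",
        f"arguments = --type {pilot_type} --queue {pilot_queue}",
        "output = pilot.$(Cluster).out",
        "error = pilot.$(Cluster).err",
        "log = pilot.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


def condor_submit(description):
    """Write the description to a temporary file and run condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as fh:
        fh.write(description)
        path = fh.name
    subprocess.run(["condor_submit", path], check=True)


def run_generator(pilot_script, pilot_type, pilot_queue, cycles, interval_s,
                  submit=condor_submit, sleep=time.sleep):
    """Submit one pilot per cycle, pausing interval_s seconds between cycles."""
    submitted = 0
    for cycle in range(cycles):
        submit(build_submit_description(pilot_script, pilot_type, pilot_queue))
        submitted += 1
        if cycle < cycles - 1:
            sleep(interval_s)
    return submitted
```

A dry run that collects the descriptions instead of submitting them looks like `run_generator("pilot.py", "production", "bnl", 3, 60, submit=print, sleep=lambda s: None)`.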
Build Pilot Factory with Glidein
[Diagram labels: Grid Resource; master, schedd; LSF, PBS, schedd; JobManager ~ Pilot generator]
• The schedd glidein is installed and executed on the gatekeeper
• The user submits a Condor-C job with the pilot generator as the executable
• The generator runs on the gatekeeper as a local universe job supervised by the glidein schedd
• The generator submits pilots
  • Types and frequency are adjustable by users
  • Depending on the native batch system, pilots can be submitted as grid universe jobs
  • Along with GAHP and related binaries, the schedd is able to communicate with different batch systems
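The two submissions described above might look roughly as follows when written as Condor submit descriptions. This sketch is an assumption, not the configuration used in the talk: it relies on the Condor-C convention of `universe = grid` with `grid_resource = condor <schedd> <collector>`, on `+remote_jobuniverse = 12` selecting the local universe on the remote schedd, and on `pbs` and `lsf` being accepted grid resource types. Host names and file names are placeholders.

```python
def generator_submit_description(schedd_name, collector_host, generator):
    """Submit description for sending the pilot generator through Condor-C.

    Assumes the remote job should run in the local universe (number 12)
    under the glidein schedd.
    """
    lines = [
        "universe = grid",
        f"grid_resource = condor {schedd_name} {collector_host}",
        "+remote_jobuniverse = 12",
        f"executable = {generator}",
        "output = generator.out",
        "error = generator.err",
        "log = generator.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Native batch systems handled in this sketch (assumption).
SUPPORTED_BATCH_SYSTEMS = ("pbs", "lsf")


def pilot_grid_resource(batch_system):
    """Return the grid_resource line for pilots sent to a native batch system."""
    name = batch_system.lower()
    if name not in SUPPORTED_BATCH_SYSTEMS:
        raise ValueError(f"unsupported batch system: {batch_system}")
    return f"grid_resource = {name}"
```

For example, `generator_submit_description("schedd@gatekeeper.example.org", "collector.example.org", "pilot_generator.py")` produces the description the user would hand to `condor_submit` on the submit host, while the generator itself would add `pilot_grid_resource("pbs")` to the grid universe pilots it submits on a PBS site.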
Pilot Factory Connected to Collector
[Diagram labels: Submit Node (Collector, Master, Negotiator, Schedd); Glidein request; Gatekeeper with {Globus, Condor|PBS|…}; master, schedd ~ Pilot Factory; Submit Pilots; Cluster Worker Nodes]
Future Work
• Integrating the pilot with the Condor startd to implement a startd-based pilot
  • The startd-based pilot retrieves the payload of a user job in the same way as the generic pilot, but it also inherits the functionality of the Condor startd
  • The original intention was to run PFs with the startd-pilots on worker nodes (too greedy, unacceptable?)
  • Using the Condor startd makes it easier to integrate with gLexec
• Transform Generic PF (GPF) to Startd PF (SPF)