Learn how the PANDA/Bamboo workflow system automates job submission, dynamic task assignment, and job status updates, using cron triggers and HTTP(S) requests to manage resources and distribute work to worker nodes.
PANDA/Bamboo Workflow
Tadashi Maeno (BNL)
System Overview
[Architecture diagram: Bamboo pulls jobs from ProdDB and submits them to the Panda server over https; the Panda server talks to DQ2 and LRC/LFC and sends logs to the logger; Autopilot submits pilots via condor-g to sites A and B, where pilots on Worker Nodes pull jobs over https; end-users submit jobs directly to the Panda server.]
Bamboo
• Get jobs from prodDB and submit them to Panda
• Update job status in prodDB
• Assign tasks to clouds dynamically
• Kill TOBEABORTED jobs
A cron triggers the above procedures every 10 min.
[Diagram: cron triggers Bamboo over https; Bamboo reads prodDB via cx_Oracle and talks to the Panda server (Apache + gridsite) via https.]
Bamboo (Job Submission)
Sequence, triggered by cron (a sketch follows):
1. cron triggers Bamboo (https request).
2. Bamboo asks Panda for the number of jobs in each cloud (http request/response).
3. If running*2 > waiting, Bamboo gets jobs from prodDB.
4. Bamboo submits the jobs to Panda (http request); Panda inserts them into PandaDB and replies (http response).
5. Bamboo updates the jobs in prodDB.
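A minimal sketch of step 3's throttle in Python, assuming hypothetical endpoint and helper names (the real Bamboo talks to prodDB via cx_Oracle and to the Panda server via HTTPS):

```python
# Hypothetical sketch of Bamboo's job-submission cycle; not the real API.
import requests

PANDA_URL = "https://pandaserver.example.org"  # placeholder endpoint

def fetch_jobs_from_proddb(cloud):
    """Placeholder for a cx_Oracle SELECT against prodDB."""
    return []

def mark_submitted_in_proddb(jobs):
    """Placeholder for a cx_Oracle UPDATE against prodDB."""

def submit_cycle():
    # Hypothetical endpoint returning per-cloud job counts.
    stats = requests.get(PANDA_URL + "/getJobStatistics").json()
    for cloud, counts in stats.items():
        # Slide's rule: fetch and submit more jobs only while
        # running*2 > waiting holds for that cloud.
        if counts["running"] * 2 > counts["waiting"]:
            jobs = fetch_jobs_from_proddb(cloud)
            requests.post(PANDA_URL + "/submitJobs", json=jobs)
            mark_submitted_in_proddb(jobs)
```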
Bamboo (Update Job Status)
Sequence, triggered by cron (a sketch follows):
1. cron triggers Bamboo (http request).
2. Bamboo asks Panda for jobs whose status changed recently (http request); Panda reads them from PandaDB and returns them (http response).
3. Bamboo updates the jobs in prodDB.
4. Bamboo sends an ACK (http request); Panda marks the jobs in PandaDB and replies (http response).
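A hedged sketch of this sync-and-ACK handshake, with invented endpoint names; the ACK lets Panda mark jobs as reported, so anything unACKed would presumably be re-sent on a later cycle:

```python
# Hypothetical sketch of the status-sync handshake (endpoint names invented).
import requests

PANDA_URL = "https://pandaserver.example.org"  # placeholder

def update_proddb(job):
    """Placeholder for a cx_Oracle UPDATE against prodDB."""

def sync_status():
    # Jobs whose Panda status changed since the last cycle.
    changed = requests.get(PANDA_URL + "/getChangedJobs").json()
    done = []
    for job in changed:
        update_proddb(job)          # write the new status back to prodDB
        done.append(job["PandaID"])
    # ACK so Panda marks these jobs as reported; unACKed jobs
    # reappear next cycle, making the sync at-least-once.
    requests.post(PANDA_URL + "/ackJobs", json=done)
```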
Bamboo (Dynamic Task Assignment)
Sequence, triggered by cron:
1. cron triggers Bamboo (http request).
2. Bamboo gets jobs with tier=NULL from prodDB.
3. Bamboo asks Panda whether a cloud has already been set for each job.
4. If yes, Panda returns the cloud (http response) and Bamboo sets the tier in prodDB.
5. If no, Panda scans LRC/LFC to count the input files available in each T1, runs task assignment, and replies (http response).
Dynamic Task Assignment
• Bamboo sends requests to Panda
• Panda assigns tasks to clouds asynchronously, using
  • the number of input files available in each T1 (by scanning LRC/LFC)
  • the number of active worker nodes in each cloud (from PandaDB)
  • the available disk space in each T1 (from the schedconfig DB)
• Clouds are propagated from Panda back to Bamboo
• Bamboo sets Tier (= cloud) in prodDB
A sketch of one possible ranking follows.
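The slide names the three inputs but not how they are combined; the ranking below is an assumption for illustration (types and thresholds invented):

```python
# Hypothetical cloud-ranking heuristic for task assignment; the weighting
# is invented -- the slide only names the three inputs.
from dataclasses import dataclass

@dataclass
class CloudInfo:
    input_files_at_t1: int    # counted by scanning LRC/LFC
    active_worker_nodes: int  # from PandaDB
    free_disk_tb: float       # from the schedconfig DB

def choose_cloud(clouds: dict, min_disk_tb: float = 1.0) -> str:
    # Drop clouds whose T1 lacks space, then prefer data locality,
    # breaking ties by available CPU capacity.
    usable = {n: c for n, c in clouds.items() if c.free_disk_tb >= min_disk_tb}
    return max(usable, key=lambda n: (usable[n].input_files_at_t1,
                                      usable[n].active_worker_nodes))

# Example: FR already holds most input files at its T1, so it wins.
clouds = {"US": CloudInfo(120, 800, 50.0), "FR": CloudInfo(200, 300, 20.0)}
print(choose_cloud(clouds))  # -> "FR"
```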
Panda server
• Central queue for all kinds of jobs
• Assign jobs to sites (brokerage)
• Set up input/output datasets
  • Create them when jobs are submitted
  • Add files to output datasets when jobs are finished
• Dispatch jobs
[Diagram: clients and pilots talk to the Panda server (Apache + gridsite, backed by PandaDB) over https; the server also talks to DQ2, LRC/LFC, and the logger.]
Transition of Job Status in Panda
• submit: the submitter sends the job to Panda → defined
• defined → assigned: brokerage assigns the job to a site; Panda creates the dispatch dataset and makes a DQ2 subscription for it
• assigned → activated: DQ2 callback arrives once the dispatched input files reach the site
• activated → running: a pilot retrieves the job via getJob
• running → holding: the job finishes on the worker node
• holding → transferring: Panda adds the output files to the destination datasets
• transferring → finished/failed: DQ2 callback arrives when the outputs reach their destination
A transition-table sketch follows.
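The same life cycle as a minimal transition table; event names are paraphrased from the slides, and failure paths other than the final one are omitted for brevity:

```python
# Job life cycle from the slides encoded as a transition table.
TRANSITIONS = {
    ("defined",      "brokerage_assigned"): "assigned",
    ("defined",      "no_input_at_T1"):     "waiting",      # see Brokerage
    ("assigned",     "dq2_callback"):       "activated",    # inputs dispatched
    ("activated",    "pilot_getJob"):       "running",
    ("running",      "job_end"):            "holding",
    ("holding",      "outputs_added_T2"):   "transferring", # T2 jobs
    ("holding",      "outputs_added_T1"):   "finished",     # T1: no transfer step
    ("transferring", "dq2_callback"):       "finished",
}

def next_status(status: str, event: str) -> str:
    return TRANSITIONS[(status, event)]

assert next_status("activated", "pilot_getJob") == "running"
```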
Brokerage
• Assign jobs to sites
  • For T1: defined → activated (no input data transfer)
  • For T2: defined → assigned
  • If input files don't exist at T1: defined → waiting
• How it works: weights are calculated using
  • the number of WNs that ran jobs in the last 3 hours
  • the number of WNs that requested jobs in the last 3 hours
  • the number of jobs per WN
  • the number of assigned/activated jobs
  • the number of input files already available at the site
  • the available ATLAS releases
• Jobs tend to go to the site which
  • has a large weight, derived from the number of active WNs, the number of jobs per WN, and the number of waiting jobs
  • caches many files
  • has the ATLAS release the jobs use installed
• Brokerage is skipped if the site is pre-defined
  • e.g., reprocessing jobs → T1 sites
A hedged sketch of the weight follows.
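The slide lists the weight's inputs but the exact combination was not preserved; the formula below is an assumption, chosen so that more capacity raises the weight and a larger backlog lowers it:

```python
# Hedged sketch of the brokerage weight; the exact formula is assumed.
def site_weight(active_wns: int, jobs_per_wn: float, waiting_jobs: int,
                cached_input_files: int) -> float:
    # Throughput proxy: active worker nodes times jobs each can take,
    # discounted by the backlog of jobs already waiting at the site.
    base = active_wns * jobs_per_wn / max(waiting_jobs, 1)
    # Mild bonus for sites that already cache many input files.
    return base * (1 + 0.1 * cached_input_files)

def pick_site(sites: dict, release: str) -> str:
    # Sites lacking the ATLAS release the job needs are excluded outright.
    eligible = {n: s for n, s in sites.items() if release in s["releases"]}
    return max(eligible, key=lambda n: site_weight(
        eligible[n]["active_wns"], eligible[n]["jobs_per_wn"],
        eligible[n]["waiting_jobs"], eligible[n]["cached_files"]))
```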
Dispatch and Destination Datasets (1/2)
• Temporary datasets created when jobs are submitted
  • Typically one dispatch/destination dataset per 20 jobs or 20 files
• Dispatch datasets
  • _disXYZ
  • Dispatch input files to T2
  • Get frozen when they are created
  • DQ2 or PandaMover transfers the files and then sends callbacks to activate the jobs
    • assigned → activated
A chunking sketch follows.
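A minimal sketch of the per-20-jobs chunking rule; the exact dataset naming beyond the _dis prefix is invented:

```python
# Sketch of the "one dispatch dataset per 20 jobs" rule; the naming
# scheme beyond "_dis" is illustrative, not the real convention.
import uuid

def make_dispatch_datasets(jobs: list, chunk_size: int = 20) -> dict:
    datasets = {}
    for i in range(0, len(jobs), chunk_size):
        name = "panda._dis%s" % uuid.uuid4().hex[:8]  # invented name
        datasets[name] = jobs[i:i + chunk_size]
    return datasets

# 50 jobs -> 3 dispatch datasets of 20 + 20 + 10 jobs:
print({k: len(v) for k, v in make_dispatch_datasets(list(range(50))).items()})
```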
Dispatch and Destination Datasets (2/2)
• Destination datasets
  • _subXYZ
  • Transfer output files to T1
  • Empty at the beginning; files are added when jobs finish
    • For T1: holding → finished/failed
    • For T2: holding → transferring
  • A subscription is made when the first file is added
  • Get frozen when all jobs contributing to the dataset are finished/failed
  • DQ2 transfers the files and sends callbacks
    • transferring → finished/failed
• Panda also scans LFC/LRC for transferring jobs every day and changes the job status if all output files are available at T1, so callbacks are not mandatory
Pilot and Autopilot (1/3)
• Autopilot is a scheduler that submits pilots to sites via condor-g/glidein
• Pilots are scheduled to the site batch system and pull jobs from the Panda server as soon as CPUs become available
• Pilot submission and job submission are different: the job is the payload for the pilot
[Diagram: Autopilot → Gatekeeper → pilot on the batch system; pilot ← job ← Panda server]
Pilot and Autopilot (2/3)
• How the pilot works (see the sketch below)
  • Sends several parameters to the Panda server for job matching (HTTP request)
    • CPU speed
    • Available memory size on the WN
    • List of available ATLAS releases at the site
  • Retrieves an 'activated' job (HTTP response to the above request)
    • activated → running
  • Runs the job immediately, since all input files should already be available at the site
  • Sends a heartbeat every 30 min
    • Each heartbeat is a single HTTPS session; there is no permanent connection between the pilot and the Panda server
    • If the pilot dies silently, Panda sets the job status to 'holding' 6 hours later
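A hedged sketch of the getJob call and the heartbeat loop; endpoint and field names are invented, and only the matching parameters and the 30-minute cadence come from the slide:

```python
# Hypothetical sketch of pilot job matching and heartbeats.
import time
import requests

PANDA_URL = "https://pandaserver.example.org"  # placeholder

def get_job():
    params = {
        "cpuSpeed": 2500,             # MHz, measured on the WN
        "memory": 4096,               # MB available on the WN
        "releases": "14.2.0,14.2.1",  # ATLAS releases installed at the site
    }
    # One HTTPS request; the matched 'activated' job comes back in the response.
    return requests.get(PANDA_URL + "/getJob", params=params).json()

def job_is_running(panda_id: int) -> bool:
    """Placeholder for checking the local payload process."""
    return False

def heartbeat_loop(panda_id: int):
    while job_is_running(panda_id):
        # Each heartbeat is its own HTTPS session; no persistent connection.
        requests.post(PANDA_URL + "/heartbeat", data={"PandaID": panda_id})
        time.sleep(30 * 60)  # one heartbeat every 30 min
```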
Pilot and Autopilot (3/3)
• Sends jobStatus='finished'/'failed' at the end of the job
  • Copies output files to the SE
  • Registers the files in LRC/LFC
  • running → holding
• The pilot itself doesn't access DQ2
  • The Panda server adds the output files to the DQ2 datasets
  • holding → transferring/finished/failed
• Sends jobStatus='holding' if it cannot copy output files to the SE or cannot register them in LRC/LFC
  • running → holding
  • The pilot then terminates immediately
  • Another pilot will try to recover the job if the site supports job recovery
A sketch of this end-of-job reporting follows.
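A hedged sketch of the end-of-job reporting; function and endpoint names are invented, while the fall-back to 'holding' on stage-out failure follows the slide:

```python
# Hypothetical sketch of the pilot's final status report.
import requests

PANDA_URL = "https://pandaserver.example.org"  # placeholder

def copy_outputs_to_se():
    """Placeholder: copy output files to the storage element."""

def register_in_lfc():
    """Placeholder: register output files in LRC/LFC."""

def finish_job(panda_id: int, payload_ok: bool):
    try:
        copy_outputs_to_se()
        register_in_lfc()
        status = "finished" if payload_ok else "failed"
    except IOError:
        # Stage-out or registration failed: report 'holding' and exit;
        # a later pilot may recover the job if the site supports it.
        status = "holding"
    requests.post(PANDA_URL + "/updateJob",
                  data={"PandaID": panda_id, "jobStatus": status})
```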
Kill Job
• The kill command is propagated to the pilot as the http response to a heartbeat
• May take ~30 min at most to kill the job (the heartbeat interval)
Sequence (a sketch follows):
1. A client sends a kill request for the job to Panda.
2. Regular heartbeats from the pilot are answered with a plain ACK.
3. After the kill request, Panda answers the next heartbeat with ACK plus 'tobekilled'.
4. The pilot kills itself and the job ends up 'failed'.
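A hedged sketch of kill propagation through the heartbeat response; the 'tobekilled' command is from the slide, the response shape is invented:

```python
# Hypothetical sketch: heartbeat doubles as the kill channel.
import sys
import requests

PANDA_URL = "https://pandaserver.example.org"  # placeholder

def send_heartbeat(panda_id: int):
    resp = requests.post(PANDA_URL + "/heartbeat",
                         data={"PandaID": panda_id}).json()
    if resp.get("command") == "tobekilled":
        # Server asked us to die: report failure and terminate the pilot.
        requests.post(PANDA_URL + "/updateJob",
                      data={"PandaID": panda_id, "jobStatus": "failed"})
        sys.exit(1)
```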
Timeouts
• assigned
  • reassigned 12 hours later
  • killed 7 days after submission
• waiting
  • assignment retried 1 day later
  • killed 3 days after submission
• activated
  • reassigned 2 days later
• running
  • changed to holding if no heartbeat for 6 hours
• holding
  • killed 3 days later
• transferring
  • killed 2 weeks later
These rules are encoded as a table in the sketch below.
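The same rules as a table a periodic watchdog could scan; the helper function is illustrative:

```python
# Timeout rules from the slide as a lookup table.
from datetime import timedelta

TIMEOUTS = {
    # status: [(age threshold, action), ...]
    "assigned":     [(timedelta(hours=12), "reassign"),
                     (timedelta(days=7),   "kill")],   # 7 days after submission
    "waiting":      [(timedelta(days=1),   "retry assignment"),
                     (timedelta(days=3),   "kill")],   # 3 days after submission
    "activated":    [(timedelta(days=2),   "reassign")],
    "running":      [(timedelta(hours=6),  "set holding")],  # no heartbeat
    "holding":      [(timedelta(days=3),   "kill")],
    "transferring": [(timedelta(weeks=2),  "kill")],
}

def actions_due(status: str, age: timedelta) -> list:
    # Age is measured from submission for the "kill" rows on assigned/waiting,
    # otherwise from entering the status.
    return [action for limit, action in TIMEOUTS.get(status, []) if age >= limit]

print(actions_due("assigned", timedelta(hours=13)))  # -> ['reassign']
```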