HTCondor at the RAL Tier-1 Andrew Lahiff
Overview • Current status at RAL • Multi-core jobs • Interesting features we’re using • Some interesting features & commands • Future plans
Background • Computing resources • 784 worker nodes, over 14K cores • Generally have 40-60K jobs submitted per day • Torque / Maui had been used for many years • Many issues • Severity & number of problems increased as size of farm increased • Doesn’t like dynamic resources • A problem if we want to extend batch system into the cloud • In 2012 decided it was time to start investigating moving to a new batch system
Choosing a new batch system • Considered, tested & eventually rejected the following • LSF, Univa Grid Engine* • Requirement: avoid commercial products unless absolutely necessary • Open source Grid Engines • Competing products, not clear which has a long-term future • Communities appear less active than HTCondor & SLURM • Existing Tier-1s running Grid Engine are using the commercial version • Torque 4 / Maui • Maui problematic • Torque 4 seems less scalable than alternatives (but better than Torque 2) • SLURM • Carried out extensive testing & comparison with HTCondor • Found that for our use case it was very fragile & easy to break, and we were unable to get it working reliably above 6000 running jobs * Only tested open source Grid Engine, not Univa Grid Engine
Choosing a new batch system • HTCondor chosen as replacement for Torque/Maui • Has the features we require • Seems very stable • Easily able to run 16,000 simultaneous jobs • Didn't do any tuning - it "just worked" • Have since tested > 30,000 running jobs • More customizable than the other batch systems we considered
The story so far • History of HTCondor at RAL • Jun 2013: started testing with real ATLAS & CMS jobs • Sep 2013: 50% pledged resources moved to HTCondor • Nov 2013: fully migrated to HTCondor • Experience • Very stable operation • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores • Job start rate much higher than Torque / Maui even when throttled • Very good support
Architecture • CEs: condor_schedd • Central manager: condor_collector, condor_negotiator • Worker nodes: condor_startd
Our setup • 2 central managers • condor_master • condor_collector • condor_HAD (responsible for high availability) • condor_replication (responsible for high availability) • condor_negotiator (only running on at most 1 machine at a time) • 8 submission hosts (4 ARC CE, 2 CREAM CE, 2 UI) • condor_master • condor_schedd • Lots of worker nodes • condor_master • condor_startd • Monitoring box (runs 8.1.x, which contains ganglia integration) • condor_master • condor_gangliad
Computing elements • ARC experience so far • Have run over 9.4 million jobs so far across our ARC CEs • Generally ignore them & they "just work" • VO status • ATLAS & CMS • Fine from the beginning • LHCb • Added the ability for DIRAC to submit to ARC • ALICE • Not yet able to submit to ARC; they have said they will work on this • Non-LHC VOs • Some use DIRAC, which can now submit to ARC • Some use EMI WMS, which can submit to ARC
Computing elements • ARC 4.1.0 released recently • Will be in UMD very soon • Has just passed through staged-rollout • Contains all of our fixes to HTCondor backend scripts • Plans • When VOs start using RFC proxies we could enable the web service interface • Doesn’t affect ATLAS/CMS • VOs using NorduGrid client commands (e.g. LHCb) can get job status information more quickly
Computing elements • Alternative: HTCondor-CE • Special configuration of HTCondor, not a brand new service • Some sites in the US are starting to use this • Note: contains no BDII (!) • [Diagram: ATLAS & CMS pilot factories submit via HTCondor-G to the schedd of the site's HTCondor-CE; the job router routes jobs to the schedd of the site batch system, whose central manager(s) (collector(s), negotiator) match them to startds on the worker nodes]
Getting multi-core jobs to work • Job submission • Haven’t set up dedicated queues • VO has to request how many cores they want in their JDL • Fine for ATLAS & CMS, not sure yet for LHCb/DIRAC… • Could set up additional queues if necessary • Did 5 things on the HTCondor side to support multi-core jobs…
Getting multi-core jobs to work • Worker nodes configured to use partitionable slots • WN resources divided up as necessary amongst jobs • We had this configuration anyway • Set up multi-core accounting groups & associated quotas • Configured so that multi-core jobs automatically get assigned to the appropriate groups • Specified group quotas (fairshares) for the multi-core groups • Adjusted the order in which the negotiator considers groups • Consider multi-core groups before single-core groups • A block of 8 free cores is "expensive" to obtain, so try not to lose it too quickly
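A minimal configuration sketch covering these pieces (group names and quota values are illustrative, not our production settings; the mapping of jobs to accounting groups happens on the CEs and is not shown here):

# Worker nodes: a single partitionable slot covering the whole machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True

# Central manager: multi-core accounting groups with their own quotas (fairshares)
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.1
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.1

# Consider the multi-core groups before the single-core groups
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                  ifThenElse(regexp("_multicore$", AccountingGroup), 1.0, 2.0))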
Getting multi-core jobs to work • Setup condor_defrag daemon • Finds WNs to drain, triggers draining & cancels draining as required • Pick WNs to drain based on how many cores they have that can be freed up • E.g. getting 8 free cores by draining a full 32 core WN is generally faster than draining a full 8 core WN
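An illustrative condor_defrag configuration along these lines (the values are examples, not our exact tuning):

# Run the defrag daemon on the central manager
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# How often to evaluate, and how aggressively to drain
DEFRAG_INTERVAL = 600
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 6
DEFRAG_MAX_WHOLE_MACHINES = 20

# Prefer to drain the machines that will free up cores with the least badput
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

# Treat a machine as "whole" once 8 cores are free, and stop draining it then
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)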
Getting multi-core jobs to work • Improvement to condor_defrag daemon • Demand for multi-core jobs not known by condor_defrag • Setup simple cron which adjusts number of concurrent draining WNs based on demand • If many idle multi-core jobs but few running, drain aggressively • Otherwise very little draining
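A hypothetical sketch of such a cron (the thresholds, the defrag host name and the use of condor_config_val -rset are assumptions, not our exact script):

#!/bin/bash
# Count idle and running multi-core jobs across all schedds
idle=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 1' -autoformat ClusterId | wc -l)
running=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 2' -autoformat ClusterId | wc -l)

# Many idle but few running multi-core jobs: drain aggressively; otherwise barely at all
if [ "$idle" -gt 100 ] && [ "$running" -lt 200 ]; then
    max=30
else
    max=1
fi

# Push the new limit to the defrag daemon's host and tell it to re-read its config
# (requires runtime configuration to be enabled and CONFIG-level security access;
#  condor-cm.example.com is a placeholder)
condor_config_val -name condor-cm.example.com -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $max"
condor_reconfig condor-cm.example.com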
Results • [Plot: running & idle multi-core jobs; gaps in submission by ATLAS result in loss of multi-core slots] • [Plot: number of WNs running multi-core jobs & number of draining WNs] • Significantly reduced CPU wastage due to the cron • Aggressive draining: 3% waste • Less-aggressive draining: < 1% waste
Multi-core jobs • Current status • Haven’t made any changes over the past few months • Now only CMS running multi-core jobs • Waiting for ATLAS to start up again • Will be interesting to see what happens when multiple VOs run multi-core jobs • May look at making improvements if necessary • Details about our configuration here: https://www.gridpp.ac.uk/wiki/RAL_HTCondor_Multicore_Jobs_Configuration
Startd cron • Worker node health-check script • Script run on the WN at regular intervals by HTCondor • Can place custom information into the WN ClassAd. In our case: • NODE_IS_HEALTHY (should the WN start jobs or not) • NODE_STATUS (list of any problems)

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTHCHECK
STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE = /usr/local/bin/healthcheck_wn_condor
STARTD_CRON_WN_HEALTHCHECK_PERIOD = 10m
STARTD_CRON_WN_HEALTHCHECK_MODE = periodic
STARTD_CRON_WN_HEALTHCHECK_RECONFIG = false
STARTD_CRON_WN_HEALTHCHECK_KILL = true

## When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True)
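A minimal sketch of what a health-check script of this kind can look like (the CVMFS test shown is illustrative; our real script checks more than this):

#!/bin/bash
problems=""

# Check that the CVMFS repositories we care about respond
for repo in atlas.cern.ch cms.cern.ch lhcb.cern.ch; do
    if ! ls /cvmfs/$repo > /dev/null 2>&1; then
        problems="$problems Problem: CVMFS for $repo;"
    fi
done

# Publish the result as attributes in the startd ClassAd
if [ -z "$problems" ]; then
    echo "NODE_IS_HEALTHY = True"
    echo "NODE_STATUS = \"All_OK\""
else
    echo "NODE_IS_HEALTHY = False"
    echo "NODE_STATUS = \"$problems\""
fi
echo "-"   # end-of-ad marker for startd cron output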
Startd cron • Current checks • CVMFS • Filesystem problems (e.g. read-only) • Swap usage • Plans • May add more checks, e.g. CPU load • More thorough tests run only when HTCondor first starts up? • E.g. checks for grid middleware, checks for other essential software & configuration, … • Want to be sure that jobs will never be started unless the WN is correctly set up • Will be more important for dynamic virtual WNs
Startd cron • Easily list any WNs with problems, e.g.

# condor_status -constraint 'PartitionableSlot == True && NODE_STATUS != "All_OK"' -autoformat Machine NODE_STATUS
lcg1209.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1248.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1309.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1340.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

• gmetric script using the HTCondor Python API for making Ganglia plots
PID namespaces • We have USE_PID_NAMESPACES=True on WNs • Jobs can't see any system processes or processes associated with other jobs on the WN • Example stdout of a job running "ps -e":

  PID TTY          TIME CMD
    1 ?        00:00:00 condor_exec.exe
    3 ?        00:00:00 ps
MOUNT_UNDER_SCRATCH • Each job sees a different /tmp, /var/tmp • Uses bind mounts to directories inside job scratch area • No more junk left behind in /tmp • Jobs can’t fill /tmp & cause problems • Jobs can’t see what other jobs have written into /tmp • For glexec jobs to work, need a special lcmaps plugin enabled • lcmaps-plugins-mount-under-scratch • Minor tweak to lcmaps.db • We have tested this but it’s not yet rolled out
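The corresponding configuration is a single knob; a sketch with the two obvious directories:

# Bind-mount these directories into each job's scratch area
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp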
CPU affinity • Can set ASSIGN_CPU_AFFINITY=True on the WN • Jobs locked to specific cores • Problem: CPU affinity doesn't work when PID namespaces are also used
Control groups • Cgroups: mechanism for managing a set of processes • We’re starting with the most basic option: CGROUP_MEMORY_LIMIT_POLICY=none • No memory limits applied • Cgroups used for • Process tracking • Memory accounting • CPU usage assigned proportionally to the number of CPUs in the slot • Currently configured on 2 WNs for initial testing with production jobs
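A sketch of this initial cgroup configuration (the cgroup name shown is the usual default, included for clarity):

# Track job processes in cgroups, but don't enforce memory limits yet
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none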
Upgrades • Central managers, CEs: • Update the RPMs • condor_master will notice the binaries have changed & restart daemons as required • Worker nodes • Make sure WN config contains MASTER_NEW_BINARY_RESTART=PEACEFUL • Update the RPMs • condor_master will notice the binaries have changed, drain running jobs, then restart daemons as required
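A worker-node upgrade then looks roughly like this (assuming the stock HTCondor RPMs from the yum repository):

# Confirm the worker node will drain rather than kill jobs on restart
condor_config_val MASTER_NEW_BINARY_RESTART     # should print PEACEFUL

# Update the RPMs; condor_master notices the new binaries, drains running
# jobs, then restarts the daemons itself, so no manual service restart is needed
yum update -y condor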
Dealing with held jobs • Held jobs have failed in some way & remain in the queue waiting for user intervention • E.g. input file(s) missing from the CE • Can configure HTCondor to deal with them automatically • Try re-running held jobs once only, after waiting 30 minutes:

SYSTEM_PERIODIC_RELEASE = ((CurrentTime - EnteredCurrentStatus > 30 * 60) && (JobRunCount < 2))

• Remove held jobs after 24 hours:

SYSTEM_PERIODIC_REMOVE = ((CurrentTime - EnteredCurrentStatus > 24 * 60 * 60) && (JobStatus == 5))
condor_who • See what jobs are running on a worker node

[root@lcg0975 ~]# condor_who
OWNER                     CLIENT                   SLOT JOB       RUNTIME    PID   PROGRAM
patls053@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk 1_5  2730014.0 0+03:30:12 30184 /pool/condor/dir_30180/condor_exec.exe
patls053@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk 1_7  3189534.0 0+03:38:19 21266 /pool/condor/dir_21262/condor_exec.exe
tatls011@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk 1_6  2729613.0 0+04:40:21 6942  /pool/condor/dir_6938/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk  arc-ce02.gridpp.rl.ac.uk 1_4  2977866.0 0+08:42:13 26669 /pool/condor/dir_26665/condor_exec.exe
tatls001@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk 1_1  3186150.0 0+10:57:05 12401 /pool/condor/dir_12342/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk 1_3  3174829.0 0+16:03:57 26418 /pool/condor/dir_26331/condor_exec.exe
pcms054@gridpp.rl.ac.uk   arc-ce01.gridpp.rl.ac.uk 1_2  3149655.0 1+23:55:51 31281 /pool/condor/dir_31268/condor_exec.exe
condor_q -analyze (1) • Why isn't a job running?

-bash-4.1$ condor_q -analyze 16244
-- Submitter: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:45033> : lcgui03.gridpp.rl.ac.uk
User priority for alahiff@gridpp.rl.ac.uk is not available, attempting to analyze without it.
---
16244.000: Run analysis summary. Of 13180 machines,
  13180 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
WARNING: Be advised: No resources matched request's constraints
condor_q -analyze (2)

The Requirements expression for your job is:
( ( Ceph is true ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )

Suggestions:
    Condition                        Machines Matched   Suggestion
    ---------                        ----------------   ----------
1   ( TARGET.Cpus >= 16 )            0                  MODIFY TO 8
2   ( ( target.Ceph is true ) )      139
3   ( TARGET.Memory >= 1 )           13150
4   ( TARGET.Arch == "X86_64" )      13180
5   ( TARGET.OpSys == "LINUX" )      13180
6   ( TARGET.Disk >= 1 )             13180
7   ( TARGET.HasFileTransfer )       13180
condor_ssh_to_job • ssh into a job from a CE

[root@arc-ce01 ~]# condor_ssh_to_job 3147487.0
Welcome to slot1@lcg1554.gridpp.rl.ac.uk!
Your condor job is running with pid(s) 2402.
[pcms054@lcg1554 dir_2393]$ hostname
lcg1554.gridpp.rl.ac.uk
[pcms054@lcg1554 dir_2393]$ ls
condor_exec.exe
_condor_stderr
_condor_stderr.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
_condor_stdout
_condor_stdout.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
glide_adXvcQ
glidein_startup.sh
job.MCJMDmk8W6jnCIXDjqiBL5XqABFKDmABFKDmb5JKDmEBFKDmQ8FM6m.proxy
condor_fetchlog • Retrieve log files from daemons on other machines

[root@condor01 ~]# condor_fetchlog arc-ce04.gridpp.rl.ac.uk SCHEDD
05/20/14 12:39:09 (pid:2388) ******************************************************
05/20/14 12:39:09 (pid:2388) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/20/14 12:39:09 (pid:2388) ** /usr/sbin/condor_schedd
05/20/14 12:39:09 (pid:2388) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/20/14 12:39:09 (pid:2388) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/20/14 12:39:09 (pid:2388) ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/20/14 12:39:09 (pid:2388) ** $CondorPlatform: x86_RedHat6 $
05/20/14 12:39:09 (pid:2388) ** PID = 2388
05/20/14 12:39:09 (pid:2388) ** Log last touched time unavailable (No such file or directory)
05/20/14 12:39:09 (pid:2388) ******************************************************
...
condor_gather_info • Gathers information about a job • Including log files from schedd, startd [root@arc-ce01 ~]# condor_gather_info --jobid 3288142.0 cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog.old cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StartLog cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog.lcg1043.gridpp.rl.ac.uk cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/condor-profile.txt cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/ cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_q cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_userlog_lines cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad_analysis cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/SchedLog.old cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StarterLog.slot1
Job ClassAds • Lots of useful information in job ClassAds • Including the email address from the proxy • Easy to contact users of problematic jobs

# condor_q 3279852.0 -autoformat x509UserProxyEmail
andrew.lahiff@stfc.ac.uk
condor_chirp • Jobs can put custom information into job ClassAds • Example: lcmaps-plugin-condor-update • Puts information into job ClassAd about glexec payload user & DN • Can then use condor_q to see this information
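To illustrate the mechanism (the attribute name below is hypothetical, not the one set by lcmaps-plugin-condor-update):

# Inside the running job: add a custom attribute to this job's ClassAd
condor_chirp set_job_attr MyPayloadDN '"/DC=example/CN=Some Payload User"'

# On the CE / schedd: read the attribute back
condor_q 3279852.0 -autoformat MyPayloadDN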
Job router • The job router daemon transforms jobs from one type to another according to configurable policies • E.g. submit jobs to a different batch system or a CE • Example: sending excess jobs to the GridPP Cloud using glideinWMS

JOB_ROUTER_DEFAULTS = \
  [ \
    MaxIdleJobs = 10; \
    MaxJobs = 50; \
  ]

JOB_ROUTER_ENTRIES = \
  [ \
    Requirements = true; \
    GridResource = "condor lcggwms02.gridpp.rl.ac.uk lcggwms02.gridpp.rl.ac.uk:9618"; \
    name = "GridPP_Cloud"; \
  ]
Job router • Example: initially have 5 idle jobs

-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Job router • Routed copies of the jobs soon appear • The original job mirrors the status of the routed copy

-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2250.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2251.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2252.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2253.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2254.0   alahiff  6/2  20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended
Job router • Can check that the new jobs have been sent to a remote resource

[root@lcgvm21 ~]# condor_q -grid
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    STATUS  GRID->MANAGER               HOST   GRID_JOB_ID
2250.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        20.0
2251.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        21.0
2252.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        17.0
2253.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        18.0
2254.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl        19.0

(The GRID_JOB_ID column shows the job ids on the remote resource, i.e. the glideinWMS HTCondor pool)
Future plans • Setting up ARC CE & some WNs using Ceph as a shared storage system (CephFS) • ATLAS testing with arcControlTower • Pulls jobs from PanDA, pushes jobs to ARC CEs • Input files pre-staged & cached on Ceph by ARC • Currently in progress…
Future plans • Test power management features • HTCondor can power down idle WNs and wake them as required • Batch system expanding into the cloud • Make use of idle private cloud resources • We have tested that condor_rooster can be used to dynamically provision VMs as they are needed
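A sketch of the condor_rooster side of this (values are illustrative; the wake-up command could be replaced by a cloud-provisioning script):

# Run the rooster daemon alongside the collector / negotiator
DAEMON_LIST = $(DAEMON_LIST) ROOSTER

# How often to look for offline machines worth waking / provisioning
ROOSTER_INTERVAL = 300

# Which offline machine ads are candidates
ROOSTER_UNHIBERNATE = Offline =?= True

# What to do to bring one back: this wakes a hibernating WN, but it could
# equally be a script that instantiates a VM in the private cloud
ROOSTER_WAKEUP_CMD = "$(BIN)/condor_power -d -i"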
Overview • Batch system monitoring • Mimic • Jobview • Ganglia • Elasticsearch
Mimic • Overview of state of worker nodes
Jobview http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html