This presentation describes the migration from CREAM CE to ARC CE at RAL, including the reasons for choosing ARC CE and the steps involved, along with the modifications made to the information system and job management needed to run ARC CE efficiently.
Moving from CREAM CE to ARC CE
Andrew Lahiff
andrew.lahiff@stfc.ac.uk
The short version
• Install ARC CE
• Test ARC CE
• Move ARC CE(s) into production
• Drain CREAM CE(s)
• Switch off CREAM CE(s)
Migration at RAL
• In 2013 we combined
  • migration from Torque to HTCondor
  • migration from CREAM CE to ARC CE
• Initial reasons for the choice of ARC CE
  • we didn't like CREAM
  • HTCondor-CE was still very new, even in OSG
  • had heard good things about ARC
    • Glasgow & Imperial College in the UK had already tried it
  • looked much simpler than CREAM
    • YAIM not required
  • ATLAS use it a lot
Migration at RAL
• Initially had CREAM CEs + Torque
  [diagram: CREAM CEs (Torque), Torque server / Maui, worker nodes (Torque), APEL, glite-CLUSTER]
Migration at RAL
• Added HTCondor pool + ARC & CREAM CEs
  [diagram: existing Torque side (CREAM CEs, Torque server, worker nodes, APEL, glite-CLUSTER) alongside ARC CEs and CREAM CEs running condor_schedd, HA HTCondor central managers, and worker nodes running condor_startd]
Migration at RAL
• Torque batch system decommissioned
  [diagram: APEL, glite-CLUSTER, ARC CEs and CREAM CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
Migration at RAL
• CREAM CEs & APEL publisher decommissioned - once all LHC VOs & non-LHC VOs could submit to ARC
  [diagram: glite-CLUSTER, ARC CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
Migration at RAL
• glite-CLUSTER decommissioned
  [diagram: ARC CEs (condor_schedd), HA HTCondor central managers, worker nodes (condor_startd)]
ARC CEs at RAL
• 4 ARC CEs - each is a VM with
  • 4 CPUs
  • 32 GB memory
    • most memory usage comes from the condor_shadow processes (sketch below)
      • we use 32-bit HTCondor rpms; will move to static shadows soon
    • see slapd using up to ~1 GB
    • we wanted to have lots of headroom! (we were new to both ARC and HTCondor)
• Using multiple ARC CEs for redundancy & scalability
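Not from the slides: a quick way to see how much memory the condor_shadow processes are actually using on a CE, which is what drives the sizing above. A minimal sketch assuming standard Linux procps:

    # total resident memory used by all condor_shadow processes on this CE
    ps -C condor_shadow -o rss= | awk '{ total += $1 } END { printf "%.0f MB across %d shadows\n", total/1024, NR }'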
ARC CEs at RAL
• Example from today - 5.5K running jobs on a single CE
Usage since Oct 2013
• Generally have 2-3K running jobs per CE
  [plot: running jobs per ARC CE over time, with one monitoring glitch annotated]
glite-WMS support
• Some non-LHC VOs still use glite-WMS
  • getting less & less important
• In order for the WMS job wrapper to work with ARC CEs, need an empty file /usr/etc/globus-user-env.sh on all worker nodes (sketch below)
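A minimal sketch of the fix, assuming you roll it out with your configuration management system rather than by hand:

    # create the empty file the glite-WMS job wrapper expects (on every worker node)
    mkdir -p /usr/etc
    touch /usr/etc/globus-user-env.sh
    chmod 644 /usr/etc/globus-user-env.sh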
Software tags
• Software tags (almost) no longer needed due to CVMFS
  • some non-LHC VOs may need them however
  • again, probably getting less & less important
• ARC
  • runtime environments appear in the BDII in the same way as software tags
  • unless you have a shared filesystem (worker nodes, CEs), no way for VOs to update tags themselves
  • our configuration management system manages the runtime environments
    • mostly just empty files (sketch below)
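A sketch of what "mostly just empty files" means in practice; the directory must match whatever the runtimedir option in arc.conf points at, and the tag name below is only an example:

    # publish a software tag as an (empty) ARC runtime environment
    RTE_DIR=/etc/arc/runtime      # assumption: matches runtimedir in arc.conf
    mkdir -p "$RTE_DIR/APPS"
    touch "$RTE_DIR/APPS/EXAMPLE-TAG-1.0"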
Information system
• Max CPU & wall time not published correctly
  • only a problem for the HTCondor backend
  • no way for ARC to determine this from HTCondor
  • could try to extract it from SYSTEM_PERIODIC_REMOVE? (sketch below)
  • but what if a site enforces the limits on the worker nodes instead, e.g. via WANT_HOLD?
• We modified /usr/share/arc/glue-generator.pl
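A sketch of what "extract from SYSTEM_PERIODIC_REMOVE" could look like, assuming the site expresses its limit as a comparison against RemoteWallClockTime (which is site-specific, hence the question mark above):

    # show the expression the schedd uses to remove over-limit jobs
    condor_config_val SYSTEM_PERIODIC_REMOVE
    # pull out a wall-clock limit in seconds, if one appears as "RemoteWallClockTime > <N>"
    condor_config_val SYSTEM_PERIODIC_REMOVE | grep -oP 'RemoteWallClockTime\s*>\s*\K[0-9]+'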
Information system - VO views
• ARC reports the same number of running & idle jobs for all VOs
• We modified /usr/share/arc/glue-generator.pl
  • a cron job running every 10 mins queries HTCondor & creates files listing the numbers of jobs by VO (sketch below)
  • glue-generator.pl modified to read these files
• Some VOs still need this information (incl. LHC VOs)
  • hopefully the need for this will slowly go away
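A sketch of the kind of query such a cron job can run (assuming jobs carry the x509UserProxyVOName attribute, as proxy-based submission gives them; the output file paths are just illustrative):

    # running (JobStatus==2) and idle (JobStatus==1) job counts per VO, for glue-generator.pl to read
    condor_q -allusers -constraint 'JobStatus == 2' -af x509UserProxyVOName | sort | uniq -c > /var/local/vo-running-jobs
    condor_q -allusers -constraint 'JobStatus == 1' -af x509UserProxyVOName | sort | uniq -c > /var/local/vo-idle-jobs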
Information system - VO shares
• VO shares not published
• Added some lines into /usr/share/arc/glue-generator.pl:
  GlueCECapability: Share=cms:20
  GlueCECapability: Share=lhcb:27
  GlueCECapability: Share=atlas:49
  GlueCECapability: Share=alice:2
  GlueCECapability: Share=other:2
• Not sure why this information is needed anyway
LHCb
• DIRAC can't specify runtime environments
  • we use an auth plugin to specify a default runtime environment
  • we put all essential things in here (grid-related env variables etc.)
• The default runtime environment needs to set
  • NORDUGRID_ARC_QUEUE=<queue name>
  (sketch below)
https://github.com/alahiff/ral-arc-ce-rte/blob/master/GLITE
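A minimal sketch of the shape such a default runtime environment script takes (the real one used at RAL is the GLITE RTE linked above; the stage numbers follow ARC's convention of 0 = on the CE while the job is prepared, 1 = on the worker node before the job runs, and the queue name is a placeholder):

    # default RTE sketch: sourced by ARC with the stage number as the first argument
    case "$1" in
      0)
        # settings A-REX needs while preparing the job go here
        ;;
      1)
        export NORDUGRID_ARC_QUEUE=grid    # placeholder queue name
        # ...plus the grid-related environment variables the VOs expect
        ;;
    esac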
Multi-core jobs
• In order for stdout/err to be available to the VO, need to set RUNTIME_ENABLE_MULTICORE_SCRATCH=1 in a runtime environment
• In ours we have (amongst other things):
  if [ "x$1" = "x0" ]; then
      export RUNTIME_ENABLE_MULTICORE_SCRATCH=1
  fi
https://github.com/alahiff/ral-arc-ce-rte/blob/master/GLITE
Auth plugins
• Can configure an external executable to run every time a job is about to switch to a different state
  • ACCEPTED, PREPARING, SUBMIT, FINISHING, FINISHED, DELETED
• Very useful! Our standard uses:
  • setting a default runtime environment for all jobs
  • scaling CPU & wall time for completed jobs
  • occasionally for debugging
    • keep all stdout/err files for completed jobs for a particular VO
  (sketch below)
https://github.com/alahiff/ral-arc-ce-plugins
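A sketch of the shape an auth plugin can take (the real ones are in the repository linked above; the assumptions here are that the authplugin line in arc.conf passes the job ID as the first argument and that a zero exit code lets the state change go ahead):

    #!/bin/bash
    # called by A-REX when a job is about to enter the configured state
    JOBID="$1"
    logger -t arc-authplugin "job $JOBID entering FINISHED"
    # ...per-job work here, e.g. scale recorded CPU/wall time or copy stdout/err aside...
    exit 0    # a non-zero exit would block the state change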
User mapping
• Argus for mapping to local pool accounts (via lcmaps)
• In /etc/arc.conf:
  [gridftpd]
  ...
  unixmap="* lcmaps liblcmaps.so /usr/lib64 /usr/etc/lcmaps/lcmaps.db arc"
  unixmap="banned:banned all"
  ...
• Set up Argus policies to allow all supported VOs to submit jobs
Monitoring - alerts
• ARC Nagios tests
  • Check proc a-rex
  • Check proc gridftp
  • Check proc nordugrid-arc-bdii
  • Check proc nordugrid-arc-slapd
  • Check ARC APEL consistency
    • check that an SSM message was sent successfully to APEL < 24 hours ago
  • Check HTCondor-ARC consistency
    • check that HTCondor & ARC agree on the number of running + idle jobs
  (sketch of a simple process check below)
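A sketch of what one of the simple process checks can look like as a Nagios plugin (the process name is passed as an argument; exit codes follow the usual Nagios convention):

    #!/bin/bash
    # usage: check_proc <process-name>, e.g. check_proc gridftpd
    PROC="$1"
    if pgrep -x "$PROC" > /dev/null; then
        echo "OK: $PROC is running"
        exit 0
    else
        echo "CRITICAL: $PROC is not running"
        exit 2
    fi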
Monitoring - alerts
• HTCondor Nagios tests
  • Check HTCondor CE Schedd
    • check that the schedd ClassAd is available (sketch below)
    • we found that a check for condor_master is not enough, e.g. if you have a corrupt HTCondor config file
  • Check job submission HTCondor
    • check that Nagios can successfully submit a job to HTCondor
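A sketch of the schedd ClassAd check, simply verifying that condor_status can retrieve the schedd ad for this host (hostname handling is site-specific):

    #!/bin/bash
    # CRITICAL if the collector has no schedd ClassAd for this CE
    if condor_status -schedd "$(hostname -f)" -af Name | grep -q .; then
        echo "OK: schedd ClassAd present"
        exit 0
    else
        echo "CRITICAL: no schedd ClassAd for $(hostname -f)"
        exit 2
    fi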
Monitoring - Ganglia
• Ganglia metrics
  • standard host metrics
• Gangliarc: http://wiki.nordugrid.org/wiki/Gangliarc
  • ARC-specific metrics
• condor_gangliad
  • HTCondor-specific metrics
Monitoring - InfluxDB
• 1-min time resolution
• ARC CE metrics
  • job states, time since last A-REX heartbeat
• HTCondor metrics include:
  • shadow exit codes
  • numbers of jobs run more than once (sketch below)
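A sketch of where the "jobs run more than once" number can come from; NumJobStarts is a standard job ClassAd attribute, and any time-window handling is left out:

    # completed jobs that started more than once (candidates for wasted wall time)
    condor_history -constraint 'NumJobStarts > 1' -af ClusterId ProcId NumJobStarts | wc -l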
Problems we've had
• APEL central message broker hardwired in config
  • when the hostname of the message broker changed once, APEL publishing stopped
  • we now have a Nagios check for APEL publishing
• ARC-HTCondor running+idle job consistency
  • before scan-condor-job was optimized, we had ~2 incidents in the past couple of years where ARC lost track of jobs
  • best to use an ARC version > 5.0.0