A Year of HTCondor at the RAL Tier-1 Ian Collier, Andrew Lahiff STFC Rutherford Appleton Laboratory HEPiX Spring 2014 Workshop
Outline • Overview of HTCondor at RAL • Computing elements • Multi-core jobs • Monitoring
Introduction • RAL is a Tier-1 for all 4 LHC experiments • In terms of Tier-1 computing requirements, RAL provides • 2% ALICE • 13% ATLAS • 8% CMS • 32% LHCb • Also support ~12 non-LHC experiments, including non-HEP • Computing resources • 784 worker nodes, over 14K cores • Generally have 40-60K jobs submitted per day • Torque / Maui had been used for many years • Many issues • Severity & number of problems increased as size of farm increased • In 2012 decided it was time to start investigating moving to a new batch system
Choosing a new batch system • Considered, tested & eventually rejected the following • LSF, Univa Grid Engine* • Requirement: avoid commercial products unless absolutely necessary • Open source Grid Engines • Competing products, not clear which has a long-term future • Communities appear less active than HTCondor & SLURM • Existing Tier-1s running Grid Engine use the commercial version • Torque 4 / Maui • Maui problematic • Torque 4 seems less scalable than alternatives (but better than Torque 2) • SLURM • Carried out extensive testing & comparison with HTCondor • Found that for our use case it was • Very fragile, easy to break • Unable to get it working reliably above 6000 running jobs * Only tested open source Grid Engine, not Univa Grid Engine
Choosing a new batch system • HTCondor chosen as replacement for Torque/Maui • Has the features we require • Seems very stable • Easily able to run 16,000 simultaneous jobs • Didn’t do any tuning – it “just worked” • Have since tested > 30,000 running jobs • Is more customizable than all other batch systems
Migration to HTCondor • Strategy • Start with a small test pool • Gain experience & slowly move resources from Torque / Maui • Migration timeline • 2012 Aug: Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM) • 2013 Jun: Began testing HTCondor with ATLAS & CMS, using ~1000 cores from old WNs beyond MoU commitments • 2013 Aug: Choice of HTCondor approved by management • 2013 Sep: HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor • 2013 Nov: Migrated remaining resources to HTCondor
Experience so far • Experience • Very stable operation • Generally just ignore the batch system & everything works fine • Staff don’t need to spend all their time fire-fighting problems • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores • Job start rate much higher than Torque / Maui even when throttled • Farm utilization much better • Very good support
Problems • A few issues found, but fixed quickly by the developers • Found that job submission hung when one of an HA pair of central managers was down • Fixed & released in 8.0.2 • Found a problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS • Fixed & released in 8.0.5 • Experienced jobs dying 2 hours after a network break between CEs and WNs • Fixed & released in 8.1.4
Computing elements • All job submission to RAL is via the Grid • No local users • Currently have 5 CEs • 2 CREAM CEs • 3 ARC CEs • CREAM doesn't currently support HTCondor • We developed the missing functionality ourselves • Will feed this back so that it can be included in an official release • ARC is better • But didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, … • We wrote lots of patches, all included in the recent 4.1.0 release • Will make it easier for more European sites to move to HTCondor
Computing elements • ARC CE experience • Have run almost 9 million jobs so far across our 3 ARC CEs • Generally ignore them and they “just work” • VOs • ATLAS & CMS fine from the beginning • LHCb added the ability to submit to ARC CEs to DIRAC • Seem ready to move entirely to ARC • ALICE not yet able to submit to ARC • They have said they will work on this • Non-LHC VOs • Some use DIRAC, which can now submit to ARC • Others use the EMI WMS, which can submit to ARC • CREAM CE status • Plan to phase out CREAM CEs this year
HTCondor & ARC in the UK • Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC • RAL T2: HTCondor + ARC (in production) • Bristol: HTCondor + ARC (in production) • Oxford: HTCondor + ARC (small pool in production, migration in progress) • Durham: SLURM + ARC • Glasgow: testing HTCondor + ARC • Liverpool: testing HTCondor • 7 more sites considering moving to HTCondor or SLURM • Configuration management: community effort • The Tier-2s using HTCondor and ARC have been sharing Puppet modules
Multi-core jobs • Current situation • ATLAS have been running multi-core jobs at RAL since November • CMS started submitting multi-core jobs in early May • Interest so far only for multi-core jobs, not whole-node jobs • Only 8-core jobs • Our aims • Fully dynamic • No manual partitioning of resources • Number of running multi-core jobs determined by fairshares
Getting multi-core jobs to work • Job submission • Haven't set up dedicated multi-core queues • The VO requests the number of cores it wants in its JDL, e.g. (count=8) • Worker nodes configured to use partitionable slots • Resources of each WN (CPU, memory, …) divided up as necessary amongst jobs • Set up multi-core groups & associated fairshares • HTCondor configured to assign multi-core jobs to the appropriate groups • Adjusted the order in which the negotiator considers groups • Consider multi-core groups before single-core groups • 8 free cores are “expensive” to obtain, so try not to lose them to single-core jobs too quickly • (A configuration sketch follows this slide)
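A minimal configuration sketch of the above; the group names, quota values and sort expression are illustrative, not our exact production settings:

# On each worker node: a single partitionable slot owning all of the machine's resources
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True

# On the central manager: accounting groups with dynamic quotas (fairshares)
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.10
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.05

# The negotiator considers groups in order of increasing GROUP_SORT_EXPR,
# so give the multi-core groups a lower value than the single-core groups
GROUP_SORT_EXPR = ifThenElse(regexp("_multicore$", AccountingGroup), 1.0, 2.0)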
Getting multi-core jobs to work • If lots of single-core jobs are idle & running, how does a multi-core job start? • By default it probably won't • condor_defrag daemon • Finds WNs to drain, triggers draining & cancels draining as required • Configuration changes from the defaults: • Drain 8 cores only, not whole WNs • Pick WNs to drain based on how many cores can be freed up • E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN • Demand for multi-core jobs is not known by condor_defrag • Set up a simple cron to adjust the number of concurrently draining WNs based on demand • If many multi-core jobs are idle but few are running, drain aggressively • Otherwise very little draining • (A configuration sketch follows this slide)
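A sketch of the corresponding condor_defrag configuration; the numbers are illustrative, and the cron mentioned above adjusts DEFRAG_MAX_CONCURRENT_DRAINING according to multi-core demand:

# Count a WN as "whole" once 8 cores are free, and stop draining it at that
# point instead of emptying the entire machine
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
DEFRAG_CANCEL_REQUIREMENTS = $(DEFRAG_WHOLE_MACHINE_EXPR)

# Only consider partitionable WNs for draining
DEFRAG_REQUIREMENTS = PartitionableSlot

# Prefer WNs with more cores: freeing 8 CPUs on a full 32-core WN is usually
# quicker than emptying a full 8-core WN (a deliberately simple rank)
DEFRAG_RANK = TotalCpus

# Baseline draining rates; the cron raises DEFRAG_MAX_CONCURRENT_DRAINING when
# many multi-core jobs are idle but few are running, and lowers it otherwise
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 6
DEFRAG_MAX_WHOLE_MACHINES = 20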
Results • Effect of changing the way WNs to drain are selected • No change in the number of concurrently draining machines • Rate of increase in the number of running multi-core jobs much higher • [Plot: running multi-core jobs]
Results • Recent ATLAS activity • [Plot: running & idle multi-core jobs – gaps in ATLAS submission result in the loss of multi-core slots] • [Plot: number of WNs running multi-core jobs & number of draining WNs] • Significantly reduced CPU wastage due to the cron • Aggressive draining: 3% waste • Less-aggressive draining: < 1% waste
Worker node health check • Startd cron (a configuration sketch follows this slide) • Checks for problems on worker nodes • Disk full or read-only • CVMFS • Swap • … • Prevents jobs from starting in the event of problems • If there is a problem with ATLAS CVMFS, then only ATLAS jobs are prevented from starting • Information about problems made available in machine ClassAds • Can easily identify WNs with problems, e.g.
# condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
lcg0980.gridpp.rl.ac.uk Problem: CVMFS for alice.cern.ch
lcg0981.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch
lcg1069.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1070.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1197.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1675.gridpp.rl.ac.uk Problem: Swap in use, less than 25% free
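A sketch of how such a check can be plugged in via startd cron; the script path, the NODE_STATUS_CMS_OK attribute and the START logic are illustrative (the health-check script simply prints ClassAd attributes to stdout, which the startd merges into the machine ad):

# Run the health-check script periodically on each WN
STARTD_CRON_JOBLIST = WNHEALTH
STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn
STARTD_CRON_WNHEALTH_PERIOD = 10m

# Example attributes the script might publish:
#   NODE_STATUS = "Problem: CVMFS for cms.cern.ch"
#   NODE_STATUS_CMS_OK = False
# The START expression can then refuse only the affected VO's jobs
START = $(START) && (NODE_STATUS_CMS_OK =!= False || TARGET.x509UserProxyVOName =!= "cms")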
Worker node health check • Can also put this data into Ganglia • RAL tests new CVMFS releases • Therefore it's important for us to detect increases in CVMFS problems • Generally have only small numbers of WNs with issues: [Ganglia plot] • Example: a user's “problematic” jobs affected CVMFS on many WNs: [Ganglia plot]
Jobs monitoring • The CASTOR team at RAL have been testing Elasticsearch • Why not try using it with HTCondor? • The ELK stack • Logstash: parses log files • Elasticsearch: search & analyze data in real time • Kibana: data visualization • Hardware setup • Test cluster of 13 servers (old disk servers & worker nodes) • But 3 servers could handle 16 GB of CASTOR logs per day • Adding HTCondor • Wrote a config file for Logstash to enable history files to be parsed • Added Logstash to the machines running schedds • [Data flow: HTCondor history files → Logstash → Elasticsearch → Kibana]
Jobs monitoring • Can see full job ClassAds
Jobs monitoring • Custom plots • E.g. completed jobs by schedd
Jobs monitoring • Custom dashboards
Jobs monitoring • Benefits • Easy to set up • Took less than a day to set up the initial cluster • Seems able to handle the load from HTCondor • For us (so far): < 1 GB, < 100K documents per day • Arbitrary queries • Seem faster than using native HTCondor commands (condor_history) • Horizontal scaling • Need more capacity? Just add more nodes
Summary • Due to scalability problems with Torque/Maui, migrated to HTCondor last year • We are happy with the choice we made based on our requirements • Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future • Multi-core jobs working well • Looking forward to ATLAS and CMS running multi-core jobs at the same time
Future plans • HTCondor • Phase in cgroups on the WNs (see the configuration sketch after this slide) • Integration with the private cloud • When the production cloud is ready, want to be able to expand the batch system into the cloud • Using condor_rooster for provisioning resources • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205 • Monitoring • Move Elasticsearch into production • Try sending all HTCondor & ARC CE log files to Elasticsearch • E.g. could easily find information about a particular job from any log file
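For the cgroups item above, a minimal sketch of the WN configuration involved (the memory policy chosen here is illustrative):

# Track and limit each job's processes via a cgroup under the "htcondor" hierarchy
BASE_CGROUP = htcondor
# "soft" lets jobs use spare memory but reclaims it under memory pressure;
# "hard" would enforce the requested memory as a strict limit
CGROUP_MEMORY_LIMIT_POLICY = soft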
Future plans • Ceph • Have set up a 1.8 PB test Ceph storage system • Accessible from some WNs using CephFS • Setting up an ARC CE with a shared filesystem (Ceph) • ATLAS testing with arcControlTower • Pulls jobs from PanDA, pushes jobs to ARC CEs • Unlike the normal pilot concept, jobs can have more precise resource requirements specified • Input files pre-staged & cached by ARC on Ceph