
A Year of HTCondor at the RAL Tier-1


Presentation Transcript


  1. A Year of HTCondor at the RAL Tier-1 Ian Collier, Andrew Lahiff STFC Rutherford Appleton Laboratory HEPiX Spring 2014 Workshop

  2. Outline • Overview of HTCondor at RAL • Computing elements • Multi-core jobs • Monitoring

  3. Introduction • RAL is a Tier-1 for all 4 LHC experiments • In terms of Tier-1 computing requirements, RAL provides • 2% ALICE • 13% ATLAS • 8% CMS • 32% LHCb • Also support ~12 non-LHC experiments, including non-HEP • Computing resources • 784 worker nodes, over 14K cores • Generally have 40-60K jobs submitted per day • Torque / Maui had been used for many years • Many issues • Severity & number of problems increased as size of farm increased • In 2012 decided it was time to start investigating moving to a new batch system

  4. Choosing a new batch system • Considered, tested & eventually rejected the following • LSF, Univa Grid Engine* • Requirement: avoid commercial products unless absolutely necessary • Open source Grid Engines • Competing products, not clear which has a long-term future • Communities appear less active than HTCondor & SLURM • Existing Tier-1s running Grid Engine use the commercial version • Torque 4 / Maui • Maui problematic • Torque 4 seems less scalable than alternatives (but better than Torque 2) • SLURM • Carried out extensive testing & comparison with HTCondor • Found that for our use case it was • Very fragile, easy to break • Unable to get it working reliably above 6,000 running jobs * Only tested open source Grid Engine, not Univa Grid Engine

  5. Choosing a new batch system • HTCondor chosen as the replacement for Torque / Maui • Has the features we require • Seems very stable • Easily able to run 16,000 simultaneous jobs • Didn’t do any tuning – it “just worked” • Have since tested > 30,000 running jobs • More customizable than the other batch systems we evaluated

  6. Migration to HTCondor • Strategy • Start with a small test pool • Gain experience & slowly move resources from Torque / Maui • Migration timeline:
     2012 Aug  Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
     2013 Jun  Began testing HTCondor with ATLAS & CMS, using ~1000 cores from old WNs beyond MoU commitments
     2013 Aug  Choice of HTCondor approved by management
     2013 Sep  HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor
     2013 Nov  Migrated remaining resources to HTCondor

  7. Experience so far • Experience • Very stable operation • Generally just ignore the batch system & everything works fine • Staff don’t need to spend all their time fire-fighting problems • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores • Job start rate much higher than Torque / Maui even when throttled • Farm utilization much better • Very good support

  8. Problems • A few issues found, but fixed quickly by the developers • Found job submission hung when one of an HA pair of central managers was down • Fixed & released in 8.0.2 • Found a problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS • Fixed & released in 8.0.5 • Experienced jobs dying 2 hours after a network break between CEs and WNs • Fixed & released in 8.1.4

  9. Computing elements • All job submission to RAL is via the Grid • No local users • Currently have 5 CEs • 2 CREAM CEs • 3 ARC CEs • CREAM doesn’t currently support HTCondor • We developed the missing functionality ourselves • Will feed this back so that it can be included in an official release • ARC better • But didn’t originally handle partitionable slots, passing CPU/memory requirements to HTCondor, … • We wrote lots of patches, all included in the recent 4.1.0 release • Will make it easier for more European sites to move to HTCondor

  10. Computing elements • ARC CE experience • Have run almost 9 million jobs so far across our 3 ARC CEs • Generally ignore them and they “just work” • VOs • ATLAS & CMS fine from the beginning • LHCb added the ability to submit to ARC CEs to DIRAC • Seem to be ready to move entirely to ARC • ALICE not yet able to submit to ARC • They have said they will work on this • Non-LHC VOs • Some use DIRAC, which can now submit to ARC • Others use the EMI WMS, which can submit to ARC • CREAM CE status • Plan to phase out the CREAM CEs this year

  11. HTCondor & ARC in the UK • Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC • RAL T2 HTCondor + ARC (in production) • Bristol HTCondor + ARC (in production) • Oxford HTCondor + ARC (small pool in production, migration in progress) • Durham SLURM + ARC • Glasgow Testing HTCondor + ARC • Liverpool Testing HTCondor • 7 more sites considering moving to HTCondor or SLURM • Configuration management: community effort • The Tier-2s using HTCondor and ARC have been sharing Puppet modules

  12. Multi-core jobs • Current situation • ATLAS have been running multi-core jobs at RAL since November • CMS started submitting multi-core jobs in early May • Interest so far only for multi-core jobs, not whole-node jobs • Only 8-core jobs • Our aims • Fully dynamic • No manual partitioning of resources • Number of running multi-core jobs determined by fairshares

  13. Getting multi-core jobs to work • Job submission • Haven’t set up dedicated multi-core queues • The VO requests how many cores it wants in its JDL, e.g. (count=8) • Worker nodes configured to use partitionable slots • Resources of each WN (CPU, memory, …) divided up as necessary amongst jobs • Set up multi-core groups & associated fairshares • HTCondor configured to assign multi-core jobs to the appropriate groups • Adjusted the order in which the negotiator considers groups • Consider multi-core groups before single-core groups • 8 free cores are “expensive” to obtain, so try not to lose them to single-core jobs too quickly (see the configuration sketch below)
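As a rough sketch (not our exact production configuration), the pieces above map onto HTCondor configuration along these lines; the group names, quota values and the sort expression are placeholders:

     # Worker nodes: one partitionable slot owning all of the machine's resources,
     # so CPUs/memory are carved off per job as needed
     NUM_SLOTS = 1
     NUM_SLOTS_TYPE_1 = 1
     SLOT_TYPE_1 = cpus=100%,mem=100%,auto
     SLOT_TYPE_1_PARTITIONABLE = TRUE

     # Central manager: accounting groups with fairshares for multi-core work
     # (names and quotas are examples only)
     GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
     GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.2
     GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.1

     # Consider multi-core groups before single-core groups in each negotiation
     # cycle (lower sort value = considered earlier); illustrative expression
     GROUP_SORT_EXPR = ifThenElse(regexp("multicore", AccountingGroup), 0, 1)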

  14. Getting multi-core jobs to work • If lots of single-core jobs are idle & running, how does a multi-core job start? • By default it probably won’t • condor_defrag daemon • Finds WNs to drain, triggers draining & cancels draining as required • Configuration changes from the defaults: • Drain 8 cores only, not whole WNs • Pick WNs to drain based on how many cores they have that can be freed up • E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN • Demand for multi-core jobs not known by condor_defrag • Set up a simple cron to adjust the number of concurrently draining WNs based on demand • If many multi-core jobs are idle but few are running, drain aggressively • Otherwise very little draining (see the sketch below)
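A minimal condor_defrag sketch corresponding to the description above; the numbers are illustrative rather than our production values. The cron mentioned above could, for example, raise or lower DEFRAG_MAX_CONCURRENT_DRAINING at runtime (condor_config_val -rset followed by a reconfig of the defrag daemon) depending on the ratio of idle to running multi-core jobs.

     # Run the defrag daemon on the central manager
     DAEMON_LIST = $(DAEMON_LIST) DEFRAG

     # Stop draining a WN once 8 cores are free, rather than emptying it completely
     DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8

     # Drain the WNs where 8 free cores can be obtained most quickly,
     # e.g. preferring higher core-count machines (illustrative rank)
     DEFRAG_RANK = TotalCpus

     # How much draining is allowed at once; the cron adjusts the first knob
     DEFRAG_MAX_CONCURRENT_DRAINING = 10
     DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
     DEFRAG_INTERVAL = 600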

  15. Results • Effect of changing the way WNs to drain are selected • No change in the number of concurrently draining machines • Rate of increase in the number of running multi-core jobs much higher [Plot: running multi-core jobs]

  16. Results • Recent ATLAS activity [Plots: running & idle multi-core jobs; number of WNs running multi-core jobs & draining WNs] • Gaps in submission by ATLAS result in loss of multi-core slots • CPU wastage significantly reduced by the cron • Aggressive draining: 3% waste • Less-aggressive draining: < 1% waste

  17. Worker node health check • Startd cron • Checks for problems on worker nodes • Disk full or read-only • CVMFS • Swap • … • Prevents jobs from starting in the event of problems • If there’s a problem with ATLAS CVMFS, only ATLAS jobs are prevented from starting • Information about problems made available in machine ClassAds • Can easily identify WNs with problems, e.g.
     # condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
     lcg0980.gridpp.rl.ac.uk  Problem: CVMFS for alice.cern.ch
     lcg0981.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch
     lcg1069.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
     lcg1070.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
     lcg1197.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
     lcg1675.gridpp.rl.ac.uk  Problem: Swap in use, less than 25% free
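A minimal sketch of the startd cron mechanism described above; the script path, attribute names and START logic are placeholders, not our production health check.

     # Run a health-check script on each WN every few minutes
     STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
     STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/bin/wn-healthcheck
     STARTD_CRON_WNHEALTH_PERIOD = 10m

     # The script prints ClassAd attributes on stdout, for example
     #   NODE_STATUS = "Problem: CVMFS for cms.cern.ch"
     #   NODE_IS_HEALTHY = False
     # and the startd publishes them in the machine ad.

     # Refuse to start new jobs when the node is flagged unhealthy;
     # per-VO handling (e.g. blocking only ATLAS jobs on an ATLAS CVMFS problem)
     # would test per-VO status attributes against the job's VO instead
     START = $(START) && (NODE_IS_HEALTHY =!= False)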

  18. Worker node health check • Can also put this data into ganglia • RAL tests new CVMFS releases • Therefore it’s important for us to detect increases in CVMFS problems • Generally have only small numbers of WNs with issues • Example: a user’s “problematic” jobs affected CVMFS on many WNs

  19. Jobs monitoring • The CASTOR team at RAL have been testing Elasticsearch • Why not try using it with HTCondor? • The Elasticsearch (ELK) stack • Logstash: parses log files • Elasticsearch: search & analyze data in real-time • Kibana: data visualization • Hardware setup • Test cluster of 13 servers (old disk servers & worker nodes) • But 3 servers could handle 16 GB of CASTOR logs per day • Adding HTCondor • Wrote a config file for Logstash to parse the history files • Added Logstash to the machines running schedds • Data flow: HTCondor history files → Logstash → Elasticsearch → Kibana
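The Logstash configuration itself is not shown in the slides; the snippet below is only a rough sketch of the idea, assuming file-based HTCondor history (blocks of "Attr = value" lines, each record ending with a "***" banner line). Paths, hosts and index names are placeholders.

     input {
       file {
         path => "/var/lib/condor/spool/history*"
         # Glue the lines of one job record into a single event;
         # each record is terminated by a line starting with "***"
         codec => multiline {
           pattern => "^\*\*\*"
           negate => true
           what => "next"
         }
       }
     }
     filter {
       # Turn the "Attr = value" lines into fields of the event
       kv {
         field_split => "\n"
         value_split => "="
       }
     }
     output {
       elasticsearch {
         hosts => ["localhost:9200"]
         index => "htcondor-history-%{+YYYY.MM.dd}"
       }
     }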

  20. Jobs monitoring • Can see full job ClassAds

  21. Jobs monitoring • Custom plots • E.g. completed jobs by schedd

  22. Jobs monitoring • Custom dashboards

  23. Jobs monitoring • Benefits • Easy to set up • Took less than a day to set up the initial cluster • Seems to be able to handle the load from HTCondor • For us (so far): < 1 GB, < 100K documents per day • Arbitrary queries • Seem faster than using native HTCondor commands (condor_history) • Horizontal scaling • Need more capacity? Just add more nodes
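As an example of the kind of ad-hoc query this enables (index and field names depend on how the history records were parsed, so treat this as illustrative):

     # Ten most recently completed jobs for one user, via the Elasticsearch search API
     curl -s 'http://localhost:9200/htcondor-history-*/_search?pretty' -d '{
       "query": { "match": { "Owner": "atlasprd" } },
       "sort":  [ { "CompletionDate": { "order": "desc" } } ],
       "size":  10
     }'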

  24. Summary • Due to scalability problems with Torque/Maui, migrated to HTCondor last year • We are happy with the choice we made based on our requirements • Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future • Multi-core jobs working well • Looking forward to ATLAS and CMS running multi-core jobs at the same time

  25. Future plans • HTCondor • Phase in cgroups on the WNs • Integration with private cloud • When the production cloud is ready, want to be able to expand the batch system into the cloud • Using condor_rooster for provisioning resources • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205 • Monitoring • Move Elasticsearch into production • Try sending all HTCondor & ARC CE log files to Elasticsearch • E.g. could easily find information about a particular job from any log file
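For the cgroups item above, the WN-side change is mostly startd configuration; a sketch, with the memory policy choice purely illustrative:

     # Track & contain each job in its own cgroup under this base group
     # (the cgroup hierarchy must exist on the WN)
     BASE_CGROUP = htcondor

     # "soft" lets jobs exceed their requested memory while the WN has free RAM;
     # "hard" would enforce the request strictly
     CGROUP_MEMORY_LIMIT_POLICY = soft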

  26. Future plans • Ceph • Have set up a 1.8 PB test Ceph storage system • Accessible from some WNs using CephFS • Setting up an ARC CE with a shared filesystem (Ceph) • ATLAS testing with arcControlTower • Pulls jobs from PanDA, pushes jobs to ARC CEs • Unlike the normal pilot concept, jobs can have more precise resource requirements specified • Input files pre-staged & cached by ARC on Ceph

  27. Thank you!
