
Monitoring HTCondor

This document describes HTCondor monitoring at STFC: an overview of the system, job status, and the health of the pool's components, covering monitoring methods, command-line utilities, and more detailed views.

Presentation Transcript


  1. Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014

  2. Introduction • Two aspects of monitoring • General overview of the system • How many running/idle jobs? By user/VO? By schedd? • How full is the farm? • How many draining worker nodes? • More detailed views • What are individual jobs doing? • What’s happening on individual worker nodes? • Health of the different components of the HTCondor pool • ...in addition to Nagios

  3. Introduction • Methods • Command line utilities • Ganglia • Third-party applications (which run command-line tools or use python API)

  4. Command line • Three useful commands • condor_status • Overview of the pool (including jobs, machines) • Information about specific worker nodes • condor_q • Information about jobs in the queue • condor_history • Information about completed jobs
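     For example, information about a specific worker node can be queried directly with condor_status (the hostname and slot name below are purely illustrative):
     -bash-4.1$ condor_status lcg1211.gridpp.rl.ac.uk                  # slots on one worker node
     -bash-4.1$ condor_status -long slot1@lcg1211.gridpp.rl.ac.uk      # full ClassAd of one slot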

  5. Overview of jobs
     -bash-4.1$ condor_status -collector
     Name                         Machine            RunningJobs  IdleJobs  HostsTotal
     RAL-LCG2@condor01.gridpp.rl. condor01.gridpp.rl       10608      8355       11347
     RAL-LCG2@condor02.gridpp.rl. condor02.gridpp.rl       10616      8364       11360

  6. Overview of machines
     -bash-4.1$ condor_status -total
                   Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
     X86_64/LINUX  11183     95    10441        592        0           0         0
            Total  11183     95    10441        592        0           0         0

  7. Jobs by schedd
     -bash-4.1$ condor_status -schedd
     Name                 Machine     TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
     arc-ce01.gridpp.rl.a arc-ce01.g              2388           1990             13
     arc-ce02.gridpp.rl.a arc-ce02.g              2011           1995             31
     arc-ce03.gridpp.rl.a arc-ce03.g              4272           1994              9
     arc-ce04.gridpp.rl.a arc-ce04.g              1424           2385             12
     arc-ce05.gridpp.rl.a arc-ce05.g                 1              0              6
     cream-ce01.gridpp.rl cream-ce01               266              0              0
     cream-ce02.gridpp.rl cream-ce02               247              0              0
     lcg0955.gridpp.rl.ac lcg0955.gr                 0              0              0
     lcgui03.gridpp.rl.ac lcgui03.gr                 3              0              0
     lcgui04.gridpp.rl.ac lcgui04.gr                 0              0              0
     lcgvm21.gridpp.rl.ac lcgvm21.gr                 0              0              0
                                      TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
                              Total              10612           8364             71

  8. Jobs by user, schedd
     -bash-4.1$ condor_status -submitters
     Name                         Machine            RunningJobs  IdleJobs  HeldJobs
     group_ALICE.alice.alice043@g arc-ce01.gridpp.rl           0         0         0
     group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl         540         0         1
     group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl         142         0         0
     group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl          82         5         0
     group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl           1         0         0
     group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl         214       390         0
     group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl          68       100         0
     group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl          78       476         4
     group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl          12       910         0
     group_CMS.prodcms_multicore. arc-ce01.gridpp.rl          47       102         0
     group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl           0         0         0
     group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl         992         0         2
     group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl           0         0         0
     …

  9. …Jobs by user
                           RunningJobs  IdleJobs  HeldJobs
     group_ALICE.alice.al            0         0         0
     group_ALICE.alice.al         3500       368         5
     group_ALICE.alice_pi            0         0         0
     group_ATLAS.atlas.at            0         0         0
     group_ATLAS.atlas.at            0         0         0
     group_ATLAS.atlas_pi          414        12        10
     group_ATLAS.atlas_pi            0         0         2
     group_ATLAS.prodatls          354        36        11
     group_CMS.cms.cmssgm            1         0         0
     group_CMS.cms_pilot.          371      2223         0
     group_CMS.cms_pilot.            0         0         1
     group_CMS.cms_pilot.           68       200         0
     group_CMS.prodcms.pc          188      1905        10
     group_CMS.prodcms.pc          312      3410         0
     group_CMS.prodcms_mu           47       102         0
     …

  10. condor_q
      [root@arc-ce01 ~]# condor_q
      -- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk
       ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
      794717.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794718.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794719.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794720.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794721.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794722.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794723.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794725.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      794726.0   pcms054   12/3 12:07   0+00:00:00 I  0   0.0   (gridjob )
      …
      3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended

  11. Multi-core jobs
      -bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'
      -- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
       ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
      832677.0   pcms004   12/5 14:33   0+00:15:07 R  0   2.0   (gridjob )
      832717.0   pcms004   12/5 14:37   0+00:12:02 R  0   0.0   (gridjob )
      832718.0   pcms004   12/5 14:37   0+00:00:00 I  0   0.0   (gridjob )
      832719.0   pcms004   12/5 14:37   0+00:00:00 I  0   0.0   (gridjob )
      832893.0   pcms004   12/5 14:47   0+00:00:00 I  0   0.0   (gridjob )
      832894.0   pcms004   12/5 14:47   0+00:00:00 I  0   0.0   (gridjob )
      …

  12. Multi-core jobs
      -bash-4.1$ condor_q -global -pr queue_mc.cpf
      -- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
       ID        OWNER     SUBMITTED    RUN_TIME   ST SIZE  CMD        CORES
      832677.0   pcms004   12/5 14:33   0+00:00:00 R  2.0   (gridjob)      8
      832717.0   pcms004   12/5 14:37   0+00:00:00 R  0.0   (gridjob)      8
      832718.0   pcms004   12/5 14:37   0+00:00:00 I  0.0   (gridjob)      8
      832719.0   pcms004   12/5 14:37   0+00:00:00 I  0.0   (gridjob)      8
      832893.0   pcms004   12/5 14:47   0+00:00:00 I  0.0   (gridjob)      8
      832894.0   pcms004   12/5 14:47   0+00:00:00 I  0.0   (gridjob)      8
      …
      • Custom print format: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats
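      Without a custom print-format file, a rough equivalent can be obtained with autoformat, assuming a condor_q version that supports it (the columns chosen here are purely an illustration):
      -bash-4.1$ condor_q -global -constraint 'RequestCpus > 1' -af:h ClusterId Owner JobStatus RequestCpus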

  13. Jobs with specific DN
      -bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'
      -- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>
       ID        OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE    CMD
      678275.0   tatls015  12/2 17:57   2+06:07:15 R  0   2441.4  (arc_pilot )
      681762.0   tatls015  12/3 03:13   1+21:12:31 R  0   2197.3  (arc_pilot )
      705153.0   tatls015  12/4 07:36   0+16:49:12 R  0   2197.3  (arc_pilot )
      705807.0   tatls015  12/4 08:16   0+16:09:27 R  0   2197.3  (arc_pilot )
      705808.0   tatls015  12/4 08:16   0+16:09:27 R  0   2197.3  (arc_pilot )
      706612.0   tatls015  12/4 09:16   0+15:09:37 R  0   2197.3  (arc_pilot )
      706614.0   tatls015  12/4 09:16   0+15:09:26 R  0   2197.3  (arc_pilot )
      …
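      A related one-liner (not from the slides, shown only as an illustration) counts queued jobs per proxy DN across all schedds:
      -bash-4.1$ condor_q -global -af x509userproxysubject | sort | uniq -c | sort -rn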

  14. Jobs killed
      • Jobs which were removed
      [root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'
       ID        OWNER     SUBMITTED    RUN_TIME   ST COMPLETED  CMD
      823881.0   alicesgm  12/5 01:01   1+06:13:22 X  ???        /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
      831849.0   tlhcb005  12/5 13:19   0+18:52:26 X  ???        /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
      832753.0   tlhcb005  12/5 14:38   0+17:07:07 X  ???        /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
      819636.0   alicesgm  12/4 19:27   1+12:13:56 X  ???        /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
      825511.0   alicesgm  12/5 03:03   0+18:52:10 X  ???        /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
      823799.0   alicesgm  12/5 00:56   1+05:58:15 X  ???        /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
      820001.0   alicesgm  12/4 19:48   1+06:43:22 X  ???        /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
      833589.0   alicesgm  12/5 16:01   0+14:06:34 X  ???        /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
      778644.0   tlhcb005  12/2 05:56   4+00:00:10 X  ???        /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
      …

  15. Jobs killed
      • Jobs removed for exceeding memory limit
      [root@arc-ce01 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory
      823953 alicesgm 3500000 3000
      824438 alicesgm 3250000 3000
      820045 alicesgm 3500000 3000
      823881 alicesgm 3250000 3000
      …
      [root@arc-ce04 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c
          515 alice
            5 cms
           70 lhcb
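      A limit like this might be enforced with a periodic remove expression on the schedds, matching the constraint used above (a sketch only, not necessarily the configuration used at RAL):
      # remove jobs whose resident set size (KB) exceeds the requested memory (MB)
      SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 1024 * RequestMemory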

  16. condor_who
      • What jobs are currently running on a worker node?
      [root@lcg1211 ~]# condor_who
      OWNER                     CLIENT                    SLOT  JOB        RUNTIME     PID    PROGRAM
      tatls015@gridpp.rl.ac.uk  arc-ce02.gridpp.rl.ac.uk  1_2   654753.0   0+00:01:54  15743  /usr/libexec/condor/co
      tatls015@gridpp.rl.ac.uk  arc-ce02.gridpp.rl.ac.uk  1_5   654076.0   0+00:56:50  21916  /usr/libexec/condor/co
      pcms004@gridpp.rl.ac.uk   arc-ce04.gridpp.rl.ac.uk  1_10  1337818.0  0+02:51:34  31893  /usr/libexec/condor/co
      pcms004@gridpp.rl.ac.uk   arc-ce04.gridpp.rl.ac.uk  1_7   1337776.0  0+03:06:51  32295  /usr/libexec/condor/co
      tlhcb005@gridpp.rl.ac.uk  arc-ce02.gridpp.rl.ac.uk  1_1   651508.0   0+05:02:45  17556  /usr/libexec/condor/co
      alicesgm@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk  1_4   737874.0   0+05:44:24  5032   /usr/libexec/condor/co
      tlhcb005@gridpp.rl.ac.uk  arc-ce04.gridpp.rl.ac.uk  1_6   1336938.0  0+08:42:18  26911  /usr/libexec/condor/co
      tlhcb005@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk  1_8   826808.0   1+02:50:16  3485   /usr/libexec/condor/co
      tlhcb005@gridpp.rl.ac.uk  arc-ce03.gridpp.rl.ac.uk  1_3   722597.0   1+08:44:28  22966  /usr/libexec/condor/co

  17. Startd history • If STARTD_HISTORY is defined on your WNs
      [root@lcg1658 ~]# condor_history
       ID        OWNER     SUBMITTED    RUN_TIME   ST COMPLETED   CMD
      841989.0   tatls015  12/6 07:58   0+00:02:39 C  12/6 08:01  /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi
      841950.0   tatls015  12/6 07:56   0+00:02:40 C  12/6 07:59  /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi
      841889.0   tatls015  12/6 07:53   0+00:02:33 C  12/6 07:56  /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi
      841847.0   tatls015  12/6 07:50   0+00:02:35 C  12/6 07:54  /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi
      841816.0   tatls015  12/6 07:48   0+00:02:36 C  12/6 07:51  /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi
      841791.0   tatls015  12/6 07:45   0+00:02:33 C  12/6 07:48  /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi
      716804.0   alicesgm  12/4 18:28   1+13:15:07 C  12/6 07:44  /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI
      …
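      A minimal worker-node configuration sketch enabling this (the file location is just an example):
      # keep a per-startd history file of completed jobs on the worker node itself
      STARTD_HISTORY = $(LOG)/startd_history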

  18. Ganglia • condor_gangliad • Runs on a single host (can be any host) • Gathers daemon ClassAds from the collector • Publishes metrics to ganglia with host spoofing • At RAL we have the following on one host:
      GANGLIAD_VERBOSITY = 2
      GANGLIAD_PER_EXECUTE_NODE_METRICS = False
      GANGLIAD = $(LIBEXEC)/condor_gangliad
      GANGLIA_CONFIG = /etc/gmond.conf
      GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
      GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
      DAEMON_LIST = MASTER, GANGLIAD
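      Additional metrics can be defined in files under GANGLIAD_METRICS_CONFIG_DIR; the sketch below shows the approximate shape of such a definition (field values here are illustrative only, and the exact syntax should be checked against the condor_gangliad documentation):
      [
        Name = "TotalRunningJobs";
        Desc = "Running jobs reported by each schedd";
        Units = "jobs";
        TargetType = "Scheduler";
      ]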

  19. Ganglia • Small subset from schedd

  20. Ganglia • Small subset from central manager

  21. Easy to make custom plots

  22. Total running, idle, held jobs

  23. Running jobs by schedd

  24. Negotiator health • Negotiation cycle duration • Number of AutoClusters
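      These quantities come from the negotiator's own ClassAd and can also be inspected from the command line, for example (attribute names vary by version; this is shown only as an illustration):
      -bash-4.1$ condor_status -negotiator -long | grep LastNegotiationCycle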

  25. Draining & multi-core slots

  26. (Some) Third party tools

  27. Job overview • Condor Job Overview Monitor http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html

  28. Mimic • Internal RAL application

  29. htcondor-sysview

  30. htcondor-sysview • Hover mouse over a core to get job information

  31. Nagios • Most (all?) sites probably use Nagios or an alternative • At RAL • Process checks for condor_master on all nodes • Central managers • Check for at least 1 collector • Check for the negotiator • Checks for worker nodes • Number of startd ClassAds needs to be above a threshold • Number of non-broken worker nodes above a threshold • CEs • Check for schedd • Job submission test
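      A minimal sketch of what the startd ClassAd count check could look like (purely illustrative; the threshold and plugin conventions are assumptions, not RAL's actual probe):
      #!/bin/bash
      # Nagios-style check: CRITICAL if the number of startd (slot) ClassAds
      # in the pool falls below a threshold; the threshold value is an example only
      THRESHOLD=10000
      count=$(condor_status -startd -af Machine 2>/dev/null | wc -l)
      if [ "$count" -lt "$THRESHOLD" ]; then
          echo "CRITICAL: only $count startd ClassAds in the pool"
          exit 2
      fi
      echo "OK: $count startd ClassAds in the pool"
      exit 0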
