Efficient Monitoring and Troubleshooting Strategies for Resource Management

Monitoring and Troubleshooting Resources
Davide Salomoni, NIKHEF
Presented at the NE ROC Meeting
Amsterdam - October 28, 2004

Agenda (or, Thinking out Loud)
• Monitoring, monitoring, monitoring
  • But what, and how? And why?
  • Can we achieve some consistency?
• How is the software (middleware and bottomware) doing?
  • Do we care about the topware, by the way?
• How do we interact with
  • Users
  • Other centers
  • Other regions
  • The Rest
NE ROC Meeting, 28/10/2004

Why is Monitoring/Troubleshooting Complex?
• Well, in our context, at least
• “I have attached a picture of the LCG-2 job submission chain, showing how many things have to be in good shape for one's job to run OK...” (Maarten Litmaath to LCG-ROLLOUT, 8/10/2004)
• The picture on the side is simplified: a “successful user experience” also involves:
  • Farm setup
  • Network, Firewalls
  • Configurations
  • Dependencies

Farm Monitoring
• Both NIKHEF and SARA use ganglia
  • With some extensions, e.g. a ganglia/pbs interface
  • Available from ftp://ftp.sara.nl/pub/outgoing/
• Several stats are available for both admin and user consumption; for example:
  • The usual ganglia pages
  • Job details

Use of Resources
• Batch system/scheduler: torque/maui
  • Way better than OpenPBS
  • Custom RPMs
    • Basically to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – a feature present in PBSpro and other batch systems
• Extensive use of maui’s fairshare mechanism to set targets for users (both local and grid), groups, classes, etc.
  • Flexible, and complex; and there are some annoyances
    • For example, how one specifies the max # of CPUs in the farm (static configuration)
  • Do not forget MAXJOBQUEUED or your system may get unhappy
• Packages and configuration examples are available at http://www.dutchgrid.nl/Admin/Nikhef/

Use of Resources (2)
• Check fairshare usage:

Aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive
For all users that are in groups XXXX.

Date          CPU time    WallTime  GHzHours  #jobs
2004-09-04    00:00:00    00:00:10      0.00      6
2004-09-06    49:38:00    49:41:26    127.61     10
2004-09-07   155:32:36   159:15:56    388.77      9
2004-09-08   559:31:19   579:12:23   1336.88     14
2004-09-09   523:15:21   524:14:17   1202.94     25
2004-09-10  1609:29:32  1617:20:42   3685.88     89
2004-09-11   319:18:39   331:14:29    662.48     13
2004-09-12    96:58:59    97:24:11    194.81      2
2004-09-13   131:43:08   133:06:45    266.23      6
2004-09-14   214:41:10   215:44:00    431.47     11
2004-09-15    59:56:58    65:24:52    130.83      5
2004-09-16    38:50:30    39:06:36     78.22      3
2004-09-17   432:55:49   452:22:26    938.97      6
2004-09-18    95:35:22    96:00:23    192.01      1
2004-09-19    95:26:31    96:00:17    192.01      1
2004-09-20    10:09:34    10:17:38     20.59     22
2004-09-21    49:06:40    49:45:10     99.51      3
2004-09-22    88:14:41    88:37:06    177.24      2
2004-09-23   184:45:49   214:44:09    429.47      3
Summed      4715:10:38  4819:32:56  10555.91    231
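
A report like the one above can be cross-checked by re-adding the per-day rows. The following is a minimal Python sketch, not the script that produced the report: the function names (parse_hms, format_hms, summarize) and the tuple layout are assumptions made for illustration. It only sums the published columns and derives a CPU/wallclock ratio; it does not try to recompute GHzHours, since the slide does not say how that column is derived.

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Convert seconds back to the 'H:MM:SS' notation used in the report."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def summarize(rows):
    """Sum rows of (date, cpu_time, wall_time, ghz_hours, jobs).

    The tuple layout mirrors the report columns; it is an assumption
    of this sketch, not a format defined by the batch system.
    """
    cpu = sum(parse_hms(row[1]) for row in rows)
    wall = sum(parse_hms(row[2]) for row in rows)
    return {
        "cpu": format_hms(cpu),
        "wall": format_hms(wall),
        "ghz_hours": round(sum(row[3] for row in rows), 2),
        "jobs": sum(row[4] for row in rows),
        # Fraction of wallclock time actually spent on the CPU.
        "cpu_efficiency": cpu / wall if wall else 0.0,
    }


# Three rows copied from the report, used here as sample input.
SAMPLE_ROWS = [
    ("2004-09-18", "95:35:22", "96:00:23", 192.01, 1),
    ("2004-09-19", "95:26:31", "96:00:17", 192.01, 1),
    ("2004-09-22", "88:14:41", "88:37:06", 177.24, 2),
]
```

Calling summarize(SAMPLE_ROWS) returns the totals for those three days; feeding it all rows of the report gives values that can be compared with the "Summed" line.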

Babysitting the (local) Grid
• A number of home-built scripts try to keep the system under control
  • Indeed, an often-heard lament in the LCG world is that, regardless of the quality of the middleware, most problems occur because of site misconfigurations/problems.
• Check for unusually short wallclock times repeated in short succession on the same node(s) – often an indication of a black hole
• Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system
• Periodically remove old (stale) state files, lest they be taken into account by the job manager (noticeable burden on the pbs server in that case)
• Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions
• Standard node monitoring, e.g. CPU temperature, disk space
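
The black-hole check listed above (many very short jobs finishing in quick succession on the same node) can be sketched as follows. This is an illustration in Python, not one of the home-built scripts mentioned on the slide: the function name find_black_holes, the input tuple layout, and the thresholds (60 s wallclock, 5 jobs, 10-minute window) are all assumptions and would need tuning for a real farm. In practice the input would be extracted from the pbs accounting files.

```python
from collections import defaultdict


def find_black_holes(jobs, max_walltime=60, min_jobs=5, window=600):
    """Return nodes that look like black holes.

    jobs: iterable of (node, end_time_epoch, walltime_seconds) tuples
          (a hypothetical layout chosen for this sketch).
    A node is flagged when at least `min_jobs` jobs, each shorter than
    `max_walltime` seconds, ended within any `window`-second interval.
    """
    short_job_ends = defaultdict(list)
    for node, end_time, walltime in jobs:
        if walltime < max_walltime:
            short_job_ends[node].append(end_time)

    suspects = []
    for node, ends in short_job_ends.items():
        ends.sort()
        start = 0  # left edge of the sliding time window
        for index, end_time in enumerate(ends):
            # Shrink the window until it spans at most `window` seconds.
            while end_time - ends[start] > window:
                start += 1
            if index - start + 1 >= min_jobs:
                suspects.append(node)
                break
    return sorted(suspects)
```

A node returned by this function is only a candidate: a burst of legitimately short jobs looks the same, so a site may prefer to alert an administrator rather than take the node out of the batch system automatically.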

Job Submission, the Grid Way
• The job manager/grid monitor agents spawn several (potentially long-running) processes. These can be, depending on various factors, per RB, per job, per user.
• These processes in the end all issue qstat calls, i.e. query the pbs server
  • These calls gather detailed job info for every job owned by a given user – even jobs that for some reason died a long time ago, but left some traces on the system (in the form of GRAM state files)
• With high job submission rates (e.g. DC), and a high number of nodes in the farm, this can lead to 25+ qstat calls/second and 100% CPU on decent hardware (2 x Xeon 2.8 GHz)
  • In this case, “job submission” really means “the server submits itself” = dies, and brings the CE to a halt
• And if you run e.g. GridICE, you will have even more qstat queries

PBS Caching
• While waiting for somebody to fix this madness, we now have a qstat/qsub/pbsnodes caching mechanism in place
• CPU load is much more reasonable
  • At least this is not the bottleneck anymore
  • With a farm of our size, and apparently also with bigger farms (e.g. CNAF)
  • But there are many other players in the chain, so scalability may be at risk anyway
• The caching wrappers are available at http://www.dutchgrid.nl/Admin/Nikhef/
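
The idea behind such a cache can be illustrated with a short Python sketch. It is not the NIKHEF wrapper (that code is at the URL above and is not reproduced here); the function name cached_output, the cache directory, and the 30-second lifetime are assumptions. The sketch applies to read-only queries such as qstat or pbsnodes: identical commands issued within the lifetime are answered from a file instead of hitting the pbs server again.

```python
import hashlib
import os
import subprocess
import tempfile
import time


def cached_output(command, ttl=30, cache_dir=None,
                  runner=subprocess.check_output):
    """Return the output of `command`, reusing a recent cached copy.

    command:   argument list, e.g. ["qstat", "-f"]
    ttl:       cache lifetime in seconds (assumed value, tune per site)
    cache_dir: where cached outputs are stored (hypothetical default)
    runner:    callable that executes the command and returns bytes;
               replaceable so the logic can be tested without PBS
    """
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), "pbs-cache")
    os.makedirs(cache_dir, exist_ok=True)

    # One cache file per distinct command line.
    key = hashlib.sha256("\0".join(command).encode()).hexdigest()
    path = os.path.join(cache_dir, key)

    try:
        if time.time() - os.path.getmtime(path) < ttl:
            with open(path, "rb") as handle:
                return handle.read()
    except OSError:
        pass  # no cached copy yet, or it vanished: fall through and run

    output = runner(command)

    # Write to a temporary file and rename it, so concurrent readers
    # never see a partially written cache entry.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(output)
    os.replace(tmp_path, path)
    return output
```

With 25+ identical queries per second and a lifetime of a few seconds, almost all calls would be served from the cache. The price is that callers may see job states that are up to ttl seconds old, which is why the lifetime should stay short.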

On Being Monitored
• We certainly want to be on this map
• But for a while we had to ask to be removed
• There seem to be far too many testing scripts
• Two main problems:
  • Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area.
  • GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT)
    • “we can [submit] fewer jobs to you [… but this] is only a temporary measure” (DK). Need to implement a different strategy.
• Many things are apparently in the works regarding the future of GOC monitoring (John Gordon, GDB 13/10/2004)

Grid Support, pre-EGEE
• There are too many points of contact here and there, and they often seem not very correlated
  • LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/
  • Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
    • Useful, but it often says that a site failed a test, and the next time the site is OK. Furthermore, no actions seem to be taken anymore when a site fails.
  • GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php
    • Most sites do not seem to be tracked though, and I am not sure that the published numbers reflect reality (they don’t for NIKHEF, for example)
• FAQs abound (but are too uncorrelated):
  • GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/
  • GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory
  • GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA)
• LCG-ROLLOUT seems a good troubleshooting forum
• GGUS
• We (NL) also have our own FAQ and support pages: http://www.dutchgrid.nl/
• See Ron’s presentation for more user support details

Questions (1)
• How can we share expertise for e.g. monitoring, or batch systems, or system/farm configurations? (w/ and w/o firewalls)
  • But is everybody using the same tools? Probably not, not even within the same region. Need inventory
• Many answers can be found (for LCG) in the LCG-ROLLOUT archives
  • About 4000 messages in the last 10 months – can we consolidate?
  • Another interesting (and probably less used) problem database is Savannah
• There are things that will not be answered, or fixed (e.g. because of dependencies, people that left, politics, etc): what do we do in that case?
• Accounting: how is this currently done across the region?

Questions (2)
• How does one keep current with e.g. pbs and maui? Good-will? Or should we explicitly suggest (at least within a region) to upgrade?
  • A potential problem is to identify dependencies when we talk about upgrades to non-middleware software (or even to middleware software, actually – GridKA had a good example)
  • Eventually, an issue for SLA
• Applications: do we care about what they do to our nodes?
  • In theory, no. In practice, we may want to think about this
  • But this is “Grid”™, so it may well involve extra-region considerations
• Future of the GOC? Need to make better use of the monitoring infrastructure soon
  • Given the complexity of some problems, we can’t rely on reactive monitoring only – it just doesn’t work well in a complex 24x7 environment
