Monitoring and Troubleshooting Resources
Davide Salomoni, NIKHEF
Presented at the NE ROC Meeting, Amsterdam - October 28, 2004
Agenda (or, Thinking out Loud)
• Monitoring, monitoring, monitoring
• But what, and how? And why?
• Can we achieve some consistency?
• How is the software (middleware and bottomware) doing?
• Do we care about the topware, by the way?
• How do we interact with
  • Users
  • Other centers
  • Other regions
  • The Rest
NE ROC Meeting, 28/10/2004
Why is Monitoring/Troubleshooting Complex?
• Well, in our context, at least
• “I have attached a picture of the LCG-2 job submission chain, showing how many things have to be in good shape for one's job to run OK...” (Maarten Litmaath to LCG-ROLLOUT, 8/10/2004)
• The picture on the side is simplified: a “successful user experience” also involves:
  • Farm setup
  • Network, Firewalls
  • Configurations
  • Dependencies
Farm Monitoring
• Both NIKHEF and SARA use ganglia
  • With some extensions, e.g. a ganglia/pbs interface
  • Available from ftp://ftp.sara.nl/pub/outgoing/
• Several stats are available for both admin and user consumption; for example:
  • The usual ganglia pages
  • Job details
Use of Resources
• Batch system/scheduler: torque/maui
  • Way better than OpenPBS
• Custom RPMs
  • Basically to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion), a feature present in PBSpro and other batch systems
• Extensive use of maui’s fairshare mechanism to set targets for users (both local and grid), groups, classes, etc.
  • Flexible, and complex; and there are some annoyances
  • For example, how one specifies the max # of CPUs in the farm (static configuration)
  • Do not forget MAXJOBQUEUED or your system may get unhappy
• Packages and configuration examples are available at http://www.dutchgrid.nl/Admin/Nikhef/
Use of Resources (2)
• Check fairshare usage:

Aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive
For all users that are in groups XXXX.

Date          CPU time     WallTime   GHzHours  #jobs
2004-09-04    00:00:00     00:00:10       0.00      6
2004-09-06    49:38:00     49:41:26     127.61     10
2004-09-07   155:32:36    159:15:56     388.77      9
2004-09-08   559:31:19    579:12:23    1336.88     14
2004-09-09   523:15:21    524:14:17    1202.94     25
2004-09-10  1609:29:32   1617:20:42    3685.88     89
2004-09-11   319:18:39    331:14:29     662.48     13
2004-09-12    96:58:59     97:24:11     194.81      2
2004-09-13   131:43:08    133:06:45     266.23      6
2004-09-14   214:41:10    215:44:00     431.47     11
2004-09-15    59:56:58     65:24:52     130.83      5
2004-09-16    38:50:30     39:06:36      78.22      3
2004-09-17   432:55:49    452:22:26     938.97      6
2004-09-18    95:35:22     96:00:23     192.01      1
2004-09-19    95:26:31     96:00:17     192.01      1
2004-09-20    10:09:34     10:17:38      20.59     22
2004-09-21    49:06:40     49:45:10      99.51      3
2004-09-22    88:14:41     88:37:06     177.24      2
2004-09-23   184:45:49    214:44:09     429.47      3
Summed      4715:10:38   4819:32:56   10555.91    231
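The slides do not say how the GHzHours column is derived. Judging from the figures alone, it looks like wall-clock hours multiplied by the CPU clock speed of the node: the rows for 2004-09-12, 2004-09-18 and 2004-09-19 all match a factor of 2.0 GHz, while other rows imply a higher average, presumably a mix of node speeds. The sketch below reproduces that arithmetic under this assumption; the function names are hypothetical and are not taken from the NIKHEF accounting scripts.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'H:MM:SS' duration (hours may exceed 24) to fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def ghz_hours(walltime: str, clock_ghz: float) -> float:
    """Assumed definition: wall-clock hours scaled by the node clock speed in GHz."""
    return round(hms_to_hours(walltime) * clock_ghz, 2)


# Rows from the table above that are consistent with 2.0 GHz nodes
# (the 2.0 factor is inferred from the numbers, not stated in the slides).
print(ghz_hours("97:24:11", 2.0))  # table value for 2004-09-12: 194.81
print(ghz_hours("96:00:23", 2.0))  # table value for 2004-09-18: 192.01
```

Normalising by clock speed in this way lets usage on slower and faster nodes be compared against the same fairshare target.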
Babysitting the (local) Grid
• A number of home-built scripts try to keep the system under control
  • Indeed, an often-heard lament in the LCG world is that, regardless of the quality of the middleware, most problems occur because of site misconfigurations/problems.
• Check for unusually short wallclock times repeated in short succession on the same node(s): often an indication of a black hole
• Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system
• Periodically remove old (stale) state files, lest they be taken into account by the job manager (a noticeable burden on the pbs server in that case)
• Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions
• Standard node monitoring, e.g. CPU temperature, disk space
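The black-hole check described above can be sketched roughly as follows. This is not the NIKHEF script: the input format (tuples of node name, job end time and wallclock seconds, as one might extract from the pbs accounting files) and all thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict


def find_black_holes(jobs, max_wall_s=30, window_s=600, min_jobs=5):
    """Return nodes that finished at least `min_jobs` very short jobs within `window_s`.

    jobs: iterable of (node, end_time_epoch_s, walltime_s) tuples (hypothetical format).
    """
    short_ends = defaultdict(list)
    for node, end_time, wall in jobs:
        if wall <= max_wall_s:  # only suspiciously short jobs count
            short_ends[node].append(end_time)

    suspects = set()
    for node, ends in short_ends.items():
        ends.sort()
        start = 0
        # Sliding window over the end times of the short jobs on this node.
        for i, t in enumerate(ends):
            while t - ends[start] > window_s:
                start += 1
            if i - start + 1 >= min_jobs:
                suspects.add(node)
                break
    return sorted(suspects)


# Example: node01 eats five 5-second jobs in under two minutes.
jobs = [("node01", 1000 + 20 * i, 5) for i in range(5)]
jobs += [("node02", 1000 + 20 * i, 3600) for i in range(5)]
print(find_black_holes(jobs))  # ['node01']
```

A node reported here would then be a candidate for being taken out of the batch system until someone has looked at it.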
Job Submission, the Grid Way
• The job manager/grid monitor agents spawn several (potentially long-running) processes. Depending on various factors, these can be per RB, per job, or per user.
• In the end, these processes all issue qstat calls, i.e. they query the pbs server
• These calls gather detailed job info for every job owned by a given user, even jobs that for some reason died a long time ago but left some traces on the system (in the form of GRAM state files)
• With high job submission rates (e.g. DC) and a high number of nodes in the farm, this can lead to 25+ qstat calls/second and 100% CPU on decent hardware (2 x XEON 2.8 GHz)
• In this case, “job submission” really means “the server submits itself”: it dies, and brings the CE to a halt
• And if you run e.g. GridICE, you will have even more qstat queries
PBS Caching
• While waiting for somebody to fix this madness, we now have a qstat/qsub/pbsnodes caching mechanism in place
• CPU load is much more reasonable
  • At least this is not the bottleneck anymore
  • With a farm of our size, and apparently also with bigger farms (e.g. CNAF)
• But there are many other players in the chain, so scalability may be at risk anyway
• The caching wrappers are available at http://www.dutchgrid.nl/Admin/Nikhef/
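The general idea of such a cache can be sketched as below. This is not the code of the wrappers published at the URL above: it is a minimal illustration, assuming a shared cache directory and a fixed time-to-live, in which repeated identical queries within the TTL are answered from a file instead of hitting the pbs server. The names `cached_output` and `run_command`, and the 30-second default, are hypothetical. Caching qsub (a command with side effects) would need a different approach and is not covered here.

```python
import hashlib
import os
import subprocess
import tempfile
import time


def run_command(argv):
    """Run the real command (e.g. qstat) and return its standard output."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def cached_output(argv, cache_dir, ttl_s=30.0, runner=run_command):
    """Return the output of `argv`, reusing a cached copy younger than `ttl_s` seconds."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256("\0".join(argv).encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    try:
        if time.time() - os.path.getmtime(path) < ttl_s:
            with open(path) as fh:
                return fh.read()  # fresh enough: no call to the server
    except FileNotFoundError:
        pass  # no cache entry yet
    output = runner(argv)
    # Write to a temporary file and rename, so readers never see a partial file.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w") as fh:
        fh.write(output)
    os.replace(tmp, path)
    return output
```

Because the cache lives on disk rather than in memory, it is shared by all the short-lived processes spawned by the job manager, which is where the load described on the previous slide comes from. The price is that callers may see job information that is up to `ttl_s` seconds old.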
On Being Monitored
• We certainly want to be on this map
• But for a while we had to ask to be removed
• There seem to be far too many testing scripts
• Two main problems:
  • Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area.
  • GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT)
  • “we can [submit] fewer jobs to you [… but this] is only a temporary measure” (DK). Need to implement a different strategy.
• Many things are apparently in the works regarding the future of GOC monitoring (John Gordon, GDB 13/10/2004)
Grid Support, pre-EGEE
• There are too many points of contact here and there, and they often seem poorly correlated
• LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/
• Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
  • Useful, but it often says that a site failed a test, and the next time the site is OK. Furthermore, no actions seem to be taken anymore when a site fails.
• GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php
  • Most sites do not seem to be tracked though, and I am not sure that the published numbers reflect reality (they don’t for NIKHEF, for example)
• FAQs abound (but are too uncorrelated):
  • GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/
  • GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory
  • GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA)
• LCG-ROLLOUT seems a good troubleshooting forum
• GGUS
• We (NL) also have our own FAQ and support pages: http://www.dutchgrid.nl/
• See Ron’s presentation for more user support details
Questions (1)
• How can we share expertise for e.g. monitoring, or batch systems, or system/farm configurations (w/ and w/o firewalls)?
  • But is everybody using the same tools? Probably not, not even within the same region. We need an inventory
• Many answers can be found (for LCG) in the LCG-ROLLOUT archives
  • About 4000 messages in the last 10 months: can we consolidate?
• Another interesting (and probably less used) problem database is Savannah
• There are things that will not be answered, or fixed (e.g. because of dependencies, people who left, politics, etc.): what do we do in that case?
• Accounting: how is this currently done across the region?
Questions (2)
• How does one keep current with e.g. pbs and maui? Good-will? Or should we explicitly suggest (at least within a region) to upgrade?
  • A potential problem is identifying dependencies when we talk about upgrades to non-middleware software (or even to middleware, actually; GridKA had a good example)
  • Eventually, an issue for SLA
• Applications: do we care about what they do to our nodes?
  • In theory, no. In practice, we may want to think about this
  • But this is “Grid”™, so it may well involve extra-region considerations
• Future of the GOC? We need to make better use of the monitoring infrastructure soon
  • Given the complexity of some problems, we can’t rely on reactive monitoring only: it just doesn’t work well in a complex 24x7 environment