LCG-2 Operational Experience in Amsterdam Davide Salomoni NIKHEF GDB, 13 October 2004 – Amsterdam
Talk Outline • The Dutch Tier-1 Center • Resource usage, monitoring • User/grid support
NL Tier-1 • NIKHEF – 140 WNs (280 CPUs) • 4+ TB of disk space • The farm is fairly heterogeneous, having been built up over time • 3 people • SARA – 30 WNs (60 CPUs) • Homogeneous farm (dual-Xeon 3.06 GHz) in Almere • TERAS (SGI Origin 3800 with a total of 1024 CPUs) as SE, with automatic migration to tape (capacity 1.2 PB) and a disk front-end cache of (currently) 500 GB • Second tape library to be installed soon with 250 TB – can grow up to 60 PB • 5 people
Farm Monitoring • Both NIKHEF and SARA use ganglia • With some extensions, e.g. a ganglia/pbs interface (illustrative sketch below) • Several stats are available for both administrators and users, for example: • The usual Ganglia pages • Job details, per-experiment farm usage
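The ganglia/pbs interface itself is not shown in this talk; as a rough illustration of the idea, a small wrapper can poll the batch system and publish per-state job counts through ganglia's gmetric tool. The following Python sketch is only an assumption of how such glue might look, not the actual NIKHEF code (the metric names and qstat parsing are made up):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not the actual NIKHEF ganglia/pbs interface.
# Count running and queued torque jobs and publish them to ganglia via gmetric.
import subprocess

def job_counts():
    """Parse 'qstat' output and count jobs by state (R = running, Q = queued)."""
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    running = queued = 0
    for line in out.splitlines():
        fields = line.split()
        # Data rows start with a numeric job id; the 5th column is the state.
        if len(fields) >= 6 and fields[0][0].isdigit():
            if fields[4] == "R":
                running += 1
            elif fields[4] == "Q":
                queued += 1
    return running, queued

def publish(name, value):
    """Push one metric into ganglia using the gmetric command-line tool."""
    subprocess.run(["gmetric", "--name", name, "--value", str(value),
                    "--type", "uint32"])

if __name__ == "__main__":
    running, queued = job_counts()
    publish("pbs_jobs_running", running)   # hypothetical metric names
    publish("pbs_jobs_queued", queued)
```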
Use of Resources • torque/maui as batch system/scheduler • Way better than OpenPBS • Custom RPMs • Mainly to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – implements a feature present in PBSpro and other batch systems. • See http://www.dutchgrid.nl/install/edg-testbed-stuff/torque/ and http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/ • Extensive use of maui’s fairshare mechanism to set targets for (grid and local) users, groups, classes, etc. Sample config at http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/ (illustrative snippet below) • Flexible, but complex
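As an indication of what such a fairshare setup involves, here is a minimal, purely illustrative maui.cfg stanza (group names, targets and window parameters are examples, not the actual configuration published at the URL above):

```
# Illustrative maui.cfg fairshare stanza -- example values only
FSPOLICY          DEDICATEDPS     # charge fairshare on dedicated processor-seconds
FSDEPTH           7               # keep seven fairshare windows
FSINTERVAL        24:00:00        # one-day windows
FSDECAY           0.80            # older windows count progressively less

# Per-group usage targets (percent of the farm) -- made-up numbers
GROUPCFG[atlas]   FSTARGET=30
GROUPCFG[lhcb]    FSTARGET=20
GROUPCFG[dzero]   FSTARGET=10
USERCFG[DEFAULT]  FSTARGET=2
```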
Use of Resources (2) • Check fairshare usage:

Aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive, for all users that are in groups XXXX.

Date         CPU time      WallTime      GHzHours   #jobs
2004-09-04   00:00:00      00:00:10          0.00       6
2004-09-06   49:38:00      49:41:26        127.61      10
2004-09-07   155:32:36     159:15:56       388.77       9
2004-09-08   559:31:19     579:12:23      1336.88      14
2004-09-09   523:15:21     524:14:17      1202.94      25
2004-09-10   1609:29:32    1617:20:42     3685.88      89
2004-09-11   319:18:39     331:14:29       662.48      13
2004-09-12   96:58:59      97:24:11        194.81       2
2004-09-13   131:43:08     133:06:45       266.23       6
2004-09-14   214:41:10     215:44:00       431.47      11
2004-09-15   59:56:58      65:24:52        130.83       5
2004-09-16   38:50:30      39:06:36         78.22       3
2004-09-17   432:55:49     452:22:26       938.97       6
2004-09-18   95:35:22      96:00:23        192.01       1
2004-09-19   95:26:31      96:00:17        192.01       1
2004-09-20   10:09:34      10:17:38         20.59      22
2004-09-21   49:06:40      49:45:10         99.51       3
2004-09-22   88:14:41      88:37:06        177.24       2
2004-09-23   184:45:49     214:44:09       429.47       3
Summed       4715:10:38    4819:32:56    10555.91     231
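The GHzHours column appears to be wall-clock time scaled by the clock speed of the node(s) a job ran on (an inference from the numbers above, e.g. 96:00:23 of walltime at 2.0 GHz gives 192.01 GHz-hours); a trivial sketch of that conversion, assuming the per-node clock speed is known:

```python
def ghz_hours(walltime_hhmmss, node_ghz):
    """Convert an HH:MM:SS walltime and a node clock speed into GHz-hours.
    Assumption: this is how the accounting normalises the heterogeneous farm."""
    h, m, s = (int(x) for x in walltime_hhmmss.split(":"))
    return (h + m / 60.0 + s / 3600.0) * node_ghz

# Example from the 2004-09-18 row: one job, 96:00:23 on a 2.0 GHz node
print(round(ghz_hours("96:00:23", 2.0), 2))   # -> 192.01
```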
Babysitting the (local) Grid • A number of home-built scripts try to keep the system under control (a sketch of some of these checks follows below) • Check for unusually short wallclock times repeated in quick succession on the same node(s) – often an indication of a black hole • Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system • Periodically remove old (stale) state files, lest they be taken into account by the job manager (a noticeable burden on the pbs server in that case) • Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions • Cache various pbs utilities to work around many (sometimes 30/sec) unnecessary queries coming from the job manager(s)
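A rough sketch of two of these checks (port probing plus a naive black-hole heuristic); the thresholds, node names and offlining policy are illustrative assumptions, not the actual NIKHEF scripts:

```python
#!/usr/bin/env python
# Illustrative sketch only -- thresholds and policy are assumptions,
# not the actual NIKHEF babysitting scripts.
import socket
import subprocess

EXPECTED_PORTS = {22: "ssh", 2049: "nfs", 8649: "ganglia"}

def ports_ok(node, timeout=3):
    """Return False if any of the expected services is unreachable on the node."""
    for port in EXPECTED_PORTS:
        try:
            with socket.create_connection((node, port), timeout=timeout):
                pass
        except OSError:
            return False
    return True

def looks_like_black_hole(recent_walltimes_sec, min_jobs=5, threshold_sec=60):
    """Flag a node whose last few jobs all finished suspiciously quickly."""
    last = recent_walltimes_sec[-min_jobs:]
    return len(last) == min_jobs and all(t < threshold_sec for t in last)

def offline_node(node, reason):
    """Take a suspect node out of the torque batch system."""
    subprocess.run(["pbsnodes", "-o", "-N", reason, node])

if __name__ == "__main__":
    for node in ["node001", "node002"]:       # hypothetical worker node names
        if not ports_ok(node):
            offline_node(node, "standard service ports not open")
```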
On Being Monitored • We certainly want to be on this map • But there seem to be too many testing scripts • Two main problems: • Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area. • GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT) • “we can [submit] fewer jobs to you [… but this] is only a temporary measure” (DK). We probably need to work out a different strategy.
Grid Software • It is not always clear when new releases will come out, or with which upgrade policies • Configuration management is sometimes getting worse: • Due to schema limitations, we now need one queue per experiment. But this has to be done manually with some care (or your GRIS risks losing some info – see ce-static.ldif and ceinfo-wrapper.sh); illustrative qmgr commands below • The LCG [software, infrastructure] is not used by the LHC experiments only, although this seems to be assumed here and there (proliferation of environment variables [e.g. WN_DEFAULT_SE_exp], some default files assume you want to support certain VOs [e.g. SIXT]). Nothing dramatic, of course. But for a Tier-X center supporting VOs other than those of the canonical 4 experiments, this can lead to time/effort inefficiencies.
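On the torque side, the per-experiment queues themselves are simple to add; the manual care goes into keeping the GRIS (ce-static.ldif / ceinfo-wrapper.sh) consistent. Illustrative qmgr commands for a hypothetical per-VO queue (queue name, group and limits are examples only):

```
# Illustrative only -- queue name, group and limits are examples
qmgr -c "create queue atlas queue_type=execution"
qmgr -c "set queue atlas acl_group_enable=true"
qmgr -c "set queue atlas acl_groups=atlas"
qmgr -c "set queue atlas resources_max.walltime=36:00:00"
qmgr -c "set queue atlas enabled=true"
qmgr -c "set queue atlas started=true"
```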
Grid Support • There are too many points of contact here and there, and they often seem poorly coordinated • LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/ • Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi • Useful, but it often reports that a site failed a test, only for the site to be OK the next time. Furthermore, no action seems to be taken anymore when a site fails. • GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php • Most sites do not seem to be tracked though, and I am not sure that the published numbers reflect reality (they don’t for NIKHEF, for example) • FAQs abound (but are too uncorrelated): • GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/ • GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory • GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA) • LCG-ROLLOUT seems a good troubleshooting forum • GGUS
Grid Support (2) • We have, like many sites, the usual set of web pages to support users. See e.g. http://www.dutchgrid.nl • Yes, we also have our own FAQ list • Ticketing system, telephone, email contact, etc. • Grid tutorials are quite popular (one is running today) • Common problems: • Users: access to the UI, how to run jobs, how to access resources (e.g. store/retrieve data) • Support for specific packages (not every VO uses lcg-ManageSoftware/lcg-ManageVO), suggestions on how best to use (or not to use) grid resources • The mandatory hardware problems • Batch system fun • Firewall considerations (examples: number of streams in edg-rm, SITE_GLOBUS_TCP_RANGE for WNs) • Supported VOs include alice, cms, esr, ncf, lhcb, atlas, dteam, dzero, pvier, astron, astrop, tutor, vle, asci, nadc, magic[, biomed] • Some scripts have been developed to ease the addition of new VOs (illustrative sketch below)
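As an illustration of the kind of automation meant by the last bullet (the real scripts are not published in this talk; the account naming scheme, uid range and VO name below are made-up examples):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not the actual NIKHEF VO-addition scripts.
# Create the unix group and pool accounts a new VO typically needs on an
# LCG-2 CE/WN; queue, grid-mapfile and information-system changes are
# still separate steps.
import subprocess

def add_vo(vo, n_accounts=50, first_uid=40000):
    """Create group 'vo' and pool accounts vo001..voNNN (example layout)."""
    subprocess.run(["groupadd", vo])
    for i in range(1, n_accounts + 1):
        account = "%s%03d" % (vo, i)
        subprocess.run(["useradd", "-m", "-g", vo,
                        "-u", str(first_uid + i), account])

if __name__ == "__main__":
    add_vo("magic")   # example: one of the VOs listed above
```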
What Now? • EGEE, of course • Or not? • Quattor • As Jeff says, the speed at which we react to installation/change requests is proportional to E^(-n), where E is the effort required and n is some number > 1 [n being higher if any perl is involved] • Improve proactive monitoring and batch system efficiency; develop experience with, and tune, high-speed [real] data transfer
TERAS Storage Element • SGI Origin 3800 • 32x R14k MIPS CPUs at 500 MHz, 1 GB/processor • TERAS interactive node and Grid SE • Mass Storage environment • SGI TP9100, 14 TB RAID5, FC-based • CXFS SAN shared file system • Home file systems • Grid SE file system = Robust Data Challenge FS • max 400 MB/s • Batch scratch file systems • DMF/TMF Hierarchical Storage Management • Transparent data migration to tape • Home file systems • Grid SE file system