LCG-2 Operational Experience in Amsterdam Davide Salomoni NIKHEF GDB, 13 October 2004 – Amsterdam
Talk Outline • The Dutch Tier-1 Center • Resource usage, monitoring • User/grid support
NL Tier-1 • NIKHEF – 140 WNs (280 CPUs) • 4+ TB of disk space • The farm is fairly heterogeneous, having been built up over time • 3 people • SARA – 30 WNs (60 CPUs) • Homogeneous farm (dual-Xeon 3.06 GHz) in Almere • TERAS (SGI Origin 3800 with a total of 1024 CPUs) as SE, with automatic migration to tape (capacity 1.2 PB) and a disk front-end cache of (currently) 500 GB • Second tape library to be installed soon with 250 TB – can grow up to 60 PB • 5 people
Farm Monitoring • Both NIKHEF and SARA use ganglia • With some extensions, e.g. a ganglia/pbs interface (illustrative sketch below) • Several stats are available for both administrators and users, for example: • The usual Ganglia pages • Job details, per-experiment farm usage
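The ganglia/pbs interface itself is not shown in this talk; as a rough illustration of the idea, a small wrapper can poll the batch system and publish per-state job counts through ganglia's gmetric tool. The following Python sketch is only an assumption of how such glue might look, not the actual NIKHEF code (the metric names and qstat parsing are made up):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not the actual NIKHEF ganglia/pbs interface.
# Count running and queued torque jobs and publish them to ganglia via gmetric.
import subprocess

def job_counts():
    """Parse 'qstat' output and count jobs by state (R = running, Q = queued)."""
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    running = queued = 0
    for line in out.splitlines():
        fields = line.split()
        # Data rows start with a numeric job id; the 5th column is the state.
        if len(fields) >= 6 and fields[0][0].isdigit():
            if fields[4] == "R":
                running += 1
            elif fields[4] == "Q":
                queued += 1
    return running, queued

def publish(name, value):
    """Push one metric into ganglia using the gmetric command-line tool."""
    subprocess.run(["gmetric", "--name", name, "--value", str(value),
                    "--type", "uint32"])

if __name__ == "__main__":
    running, queued = job_counts()
    publish("pbs_jobs_running", running)   # hypothetical metric names
    publish("pbs_jobs_queued", queued)
```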
Use of Resources • torque/maui as batch system/scheduler • Way better than OpenPBS • Custom RPMs • Mainly to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – implements a feature present in PBSpro and other batch systems. • See http://www.dutchgrid.nl/install/edg-testbed-stuff/torque/ and http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/ • Extensive use of maui’s fairshare mechanism to set targets for (grid and local) users, groups, classes, etc. Sample config at http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/ (illustrative snippet below) • Flexible, but complex
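As an indication of what such a fairshare setup involves, here is a minimal, purely illustrative maui.cfg stanza (group names, targets and window parameters are examples, not the actual configuration published at the URL above):

```
# Illustrative maui.cfg fairshare stanza -- example values only
FSPOLICY          DEDICATEDPS     # charge fairshare on dedicated processor-seconds
FSDEPTH           7               # keep seven fairshare windows
FSINTERVAL        24:00:00        # one-day windows
FSDECAY           0.80            # older windows count progressively less

# Per-group usage targets (percent of the farm) -- made-up numbers
GROUPCFG[atlas]   FSTARGET=30
GROUPCFG[lhcb]    FSTARGET=20
GROUPCFG[dzero]   FSTARGET=10
USERCFG[DEFAULT]  FSTARGET=2
```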
Use of Resources (2) • Check fairshare usage:

Aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive, for all users that are in groups XXXX.

Date         CPU time      WallTime      GHzHours   #jobs
2004-09-04   00:00:00      00:00:10          0.00       6
2004-09-06   49:38:00      49:41:26        127.61      10
2004-09-07   155:32:36     159:15:56       388.77       9
2004-09-08   559:31:19     579:12:23      1336.88      14
2004-09-09   523:15:21     524:14:17      1202.94      25
2004-09-10   1609:29:32    1617:20:42     3685.88      89
2004-09-11   319:18:39     331:14:29       662.48      13
2004-09-12   96:58:59      97:24:11        194.81       2
2004-09-13   131:43:08     133:06:45       266.23       6
2004-09-14   214:41:10     215:44:00       431.47      11
2004-09-15   59:56:58      65:24:52        130.83       5
2004-09-16   38:50:30      39:06:36         78.22       3
2004-09-17   432:55:49     452:22:26       938.97       6
2004-09-18   95:35:22      96:00:23        192.01       1
2004-09-19   95:26:31      96:00:17        192.01       1
2004-09-20   10:09:34      10:17:38         20.59      22
2004-09-21   49:06:40      49:45:10         99.51       3
2004-09-22   88:14:41      88:37:06        177.24       2
2004-09-23   184:45:49     214:44:09       429.47       3
Summed       4715:10:38    4819:32:56    10555.91     231
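The GHzHours column appears to be wall-clock time scaled by the clock speed of the node(s) a job ran on (an inference from the numbers above, e.g. 96:00:23 of walltime at 2.0 GHz gives 192.01 GHz-hours); a trivial sketch of that conversion, assuming the per-node clock speed is known:

```python
def ghz_hours(walltime_hhmmss, node_ghz):
    """Convert an HH:MM:SS walltime and a node clock speed into GHz-hours.
    Assumption: this is how the accounting normalises the heterogeneous farm."""
    h, m, s = (int(x) for x in walltime_hhmmss.split(":"))
    return (h + m / 60.0 + s / 3600.0) * node_ghz

# Example from the 2004-09-18 row: one job, 96:00:23 on a 2.0 GHz node
print(round(ghz_hours("96:00:23", 2.0), 2))   # -> 192.01
```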
Babysitting the (local) Grid • A number of home-built scripts try to keep the system under control (a sketch of some of these checks follows below) • Check for unusually short wallclock times repeated in quick succession on the same node(s) – often an indication of a black hole • Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system • Periodically remove old (stale) state files, lest they be taken into account by the job manager (a noticeable burden on the pbs server in that case) • Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions • Cache various pbs utilities to work around many (sometimes 30/sec) unnecessary queries coming from the job manager(s)
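A rough sketch of two of these checks (port probing plus a naive black-hole heuristic); the thresholds, node names and offlining policy are illustrative assumptions, not the actual NIKHEF scripts:

```python
#!/usr/bin/env python
# Illustrative sketch only -- thresholds and policy are assumptions,
# not the actual NIKHEF babysitting scripts.
import socket
import subprocess

EXPECTED_PORTS = {22: "ssh", 2049: "nfs", 8649: "ganglia"}

def ports_ok(node, timeout=3):
    """Return False if any of the expected services is unreachable on the node."""
    for port in EXPECTED_PORTS:
        try:
            with socket.create_connection((node, port), timeout=timeout):
                pass
        except OSError:
            return False
    return True

def looks_like_black_hole(recent_walltimes_sec, min_jobs=5, threshold_sec=60):
    """Flag a node whose last few jobs all finished suspiciously quickly."""
    last = recent_walltimes_sec[-min_jobs:]
    return len(last) == min_jobs and all(t < threshold_sec for t in last)

def offline_node(node, reason):
    """Take a suspect node out of the torque batch system."""
    subprocess.run(["pbsnodes", "-o", "-N", reason, node])

if __name__ == "__main__":
    for node in ["node001", "node002"]:       # hypothetical worker node names
        if not ports_ok(node):
            offline_node(node, "standard service ports not open")
```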
On Being Monitored • We certainly want to be on this map • But there seem to be too many testing scripts • Two main problems: • Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area. • GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT) • “we can [submit] fewer jobs to you [… but this] is only a temporary measure” (DK). We probably need to work out a different strategy.
Grid Software • It is not always clear when new releases will come out, or with which upgrade policies • Configuration management is sometimes getting worse: • Due to schema limitations, we now need one queue per experiment. But this has to be done manually with some care (or your GRIS risks losing some info – see ce-static.ldif and ceinfo-wrapper.sh); illustrative qmgr commands below • The LCG [software, infrastructure] is not used by the LHC experiments only, although this seems to be assumed here and there (proliferation of environment variables [e.g. WN_DEFAULT_SE_exp], some default files assume you want to support certain VOs [e.g. SIXT]). Nothing dramatic, of course. But for a Tier-X center supporting VOs other than those of the canonical 4 experiments, this can lead to time/effort inefficiencies.
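On the torque side, the per-experiment queues themselves are simple to add; the manual care goes into keeping the GRIS (ce-static.ldif / ceinfo-wrapper.sh) consistent. Illustrative qmgr commands for a hypothetical per-VO queue (queue name, group and limits are examples only):

```
# Illustrative only -- queue name, group and limits are examples
qmgr -c "create queue atlas queue_type=execution"
qmgr -c "set queue atlas acl_group_enable=true"
qmgr -c "set queue atlas acl_groups=atlas"
qmgr -c "set queue atlas resources_max.walltime=36:00:00"
qmgr -c "set queue atlas enabled=true"
qmgr -c "set queue atlas started=true"
```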
Grid Support • There are too many points of contact here and there, and they often seem poorly coordinated • LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/ • Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi • Useful, but it often reports that a site failed a test, only for the site to be OK the next time. Furthermore, no action seems to be taken anymore when a site fails. • GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php • Most sites do not seem to be tracked though, and I am not sure that the published numbers reflect reality (they don’t for NIKHEF, for example) • FAQs abound (but are too uncorrelated): • GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/ • GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory • GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA) • LCG-ROLLOUT seems a good troubleshooting forum • GGUS
Grid Support (2) • We have, like many sites, the usual set of web pages to support users. See e.g. http://www.dutchgrid.nl • Yes, we also have our own FAQ list • Ticketing system, telephone, email contact, etc. • Grid tutorials are quite popular (one is running today) • Common problems: • Users: access to the UI, how to run jobs, how to access resources (e.g. store/retrieve data) • Support for specific packages (not every VO uses lcg-ManageSoftware/lcg-ManageVO), suggestions on how best to use (or not to use) grid resources • The mandatory hardware problems • Batch system fun • Firewall considerations (examples: number of streams in edg-rm, SITE_GLOBUS_TCP_RANGE for WNs) • Supported VOs include alice, cms, esr, ncf, lhcb, atlas, dteam, dzero, pvier, astron, astrop, tutor, vle, asci, nadc, magic[, biomed] • Some scripts have been developed to ease the addition of new VOs (illustrative sketch below)
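As an illustration of the kind of automation meant by the last bullet (the real scripts are not published in this talk; the account naming scheme, uid range and VO name below are made-up examples):

```python
#!/usr/bin/env python
# Illustrative sketch only -- not the actual NIKHEF VO-addition scripts.
# Create the unix group and pool accounts a new VO typically needs on an
# LCG-2 CE/WN; queue, grid-mapfile and information-system changes are
# still separate steps.
import subprocess

def add_vo(vo, n_accounts=50, first_uid=40000):
    """Create group 'vo' and pool accounts vo001..voNNN (example layout)."""
    subprocess.run(["groupadd", vo])
    for i in range(1, n_accounts + 1):
        account = "%s%03d" % (vo, i)
        subprocess.run(["useradd", "-m", "-g", vo,
                        "-u", str(first_uid + i), account])

if __name__ == "__main__":
    add_vo("magic")   # example: one of the VOs listed above
```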
What Now? • EGEE, of course • Or not? • Quattor • As Jeff says, the speed at which we react to installation/change requests is proportional to E^(-n), where E is the effort required and n is some number > 1 [n being higher if any perl is involved] • Improve proactive monitoring and batch system efficiency; develop experience with, and tune, high-speed [real] data transfer
TERAS Storage Element • SGI Origin 3800 • 32x R14k MIPS CPUs at 500 MHz, 1 GB/processor • TERAS interactive node and Grid SE • Mass Storage environment • SGI TP9100, 14 TB RAID5, FC-based • CXFS SAN shared file system • Home file systems • Grid SE file system = Robust Data Challenge FS • max 400 MB/s • Batch scratch file systems • DMF/TMF Hierarchical Storage Management • Transparent data migration to tape • Home file systems • Grid SE file system