Tier1 Site Report
HEPSysMan @ RAL, 19-20 June 2008
Martin Bly
Overview
• New Building
• Site issues
• Tier1
New Computing Building
• New computing building being constructed opposite the new reception building at RAL
• November 2007 – looked like a sparse Meccano construction – just girders
• Now has walls, a roof, windows, skylight
  • Shell is almost complete
• External ‘beautification’ starting
• Internal fitting of the machine room level yet to start
• Completion due late 2008
• Migration planning starting
  • Target: to move most of the Tier1 hardware Jan-Mar 2009
Portable Device Encryption
• Big concern in the UK over data loss by ‘government’, like everywhere else
  • Mostly careless custodianship rather than ‘enemy action’
  • Many stolen/lost laptops, CDs/DVDs going missing in transit…
• Government has mandated that all public service organisations must ensure that all portable devices taken off their sites have the data storage encrypted by an approved tool
• This means: all laptops and other portable devices (PDAs, phones) which have access to ‘data’ on the RAL network must have encryption before they leave site
• ‘Data’ means any data that can identify or be associated with any individual – thus Outlook caches, email lists, synchronised file caches of ‘corporate’ data of any sort
• Many staff have rationalised what they keep on their laptop/PDA
  • Why do you need it? If you don’t need it, don’t keep it!
• Using Pointsec from Check Point Software Technologies Ltd
  • Will do Windows XP and some versions of Linux
  • …but not Macs, or dual-boot Windows/Linux systems (yet!)
• Painful but necessary
• Don’t put the data at risk…
Tier1: Grid Only
• Non-Grid access to the Tier-1 has ended. Only special cases now have access to:
  • UIs
  • Direct job submission (qsub)
• Until end of May 2008:
  • IDs were maintained (disabled)
  • Home directories were maintained online
  • Mail forwarding was maintained
• After end of May 2008:
  • IDs will be deleted
  • Home directories will be backed up
  • Mail spool will be backed up
  • Mail forwarding will stop
• AFS service continues for BaBar (and just in case for LCG)
CASTOR
• CASTOR production version is v2.1.6-12 hot-fix 2
• Recently much more stable and reliable
• Good support from developers at CERN – working well
• Some problems appear at RAL that don’t show up in testing at CERN because we use features not exercised at CERN – speedy investigation and fixing
• Considerable effort with CMS on tuning disk server and tape migration performance
• Recent work with developers on migration strategies has improved performance considerably
• Migration to v2.1.7-7 imminent
dCache Closure
• dCache service closure was announced for the end of May 2008
• Migration of data is proceeding
• Some work to do to provide a generic CASTOR instance for small VOs
• Likely the closure deadline will extend by some months
Hardware: New Capacity Storage
• 182 x 9TB 16-bay 3U servers: 1638TB data capacity
• Two lots based on the same Supermicro chassis with different disk OEM (WD, Seagate) and CPU (AMD, Intel)
• Dual RAID controllers – data and system disks separate (a status-check sketch follows below):
  • 3Ware 9650SX-16ML, 14 x 750GB data drives
  • 3Ware 9650SX-4, 2 x 250GB or 400GB system drives
• Twin CPUs (quad-core Intel, dual-core AMD), 8GB RAM, dual 1Gb NIC
• Intel set being deployed
  • Used in CCRC08
• AMD set: some issues with the forcedeth network driver (SL4.4) under high sustained load
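The data arrays sit behind 3Ware controllers, so routine health checking comes down to polling controller state. Below is a minimal sketch in Python of such a check, assuming the vendor's tw_cli command-line tool is installed and on the PATH; the controller IDs and the exact column layout of tw_cli's output are assumptions, so treat the parsing as illustrative rather than definitive.

#!/usr/bin/env python
# Minimal sketch: poll the 3ware RAID controllers with tw_cli and report any
# unit that is not 'OK'.  Controller IDs and output layout are assumptions.

import subprocess
import sys

CONTROLLERS = ["/c0", "/c1"]   # e.g. data and system controllers on these servers


def unit_states(controller):
    """Yield (unit, status) pairs parsed from 'tw_cli /cX show'."""
    out = subprocess.check_output(["tw_cli", controller, "show"],
                                  universal_newlines=True)
    for line in out.splitlines():
        fields = line.split()
        # Unit lines start with 'u0', 'u1', ...; the status column normally
        # holds OK, DEGRADED, REBUILDING, etc.
        if fields and fields[0].startswith("u") and len(fields) >= 3:
            yield fields[0], fields[2]


def check():
    problems = []
    for ctl in CONTROLLERS:
        for unit, status in unit_states(ctl):
            if status != "OK":
                problems.append("%s/%s is %s" % (ctl, unit, status))
    if problems:
        print("; ".join(problems))   # feed to Nagios/email as required
        sys.exit(2)                  # Nagios CRITICAL convention
    print("all RAID units OK")
    sys.exit(0)


if __name__ == "__main__":
    check()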
Backplane Failures (Supermicro)
• 3 servers “burned out” their backplanes
  • 2 of these set off VESDA
  • 1 called out the fire brigade!
• Safety risk assessment: urgent rectification needed
• Good response from supplier/manufacturer
  • PCB fault in a “bad batch”
• Replacement complete
Hardware: CPU
• 2007: production capacity ~1500 KSI2K on 600 systems
• Late 2007: upgraded about 50% of capacity to 2GB/core
• FY07/08 procurement (~3000 KSI2K – but YMMV)
  • Streamline
    • 57 x 1U servers (114 systems, 3 racks), each system:
      • dual Intel E5410 (2.33GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
  • Clustervision
    • 56 x 1U servers (112 systems, 4 racks), each system:
      • dual Intel E5440 (2.83GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
• Configuration based on a 15kW per rack maximum, from supplied ‘full-load’ power consumption data. Required to meet power supply and cooling restrictions (2 x 32A supplies) – see the sketch below.
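For context, the 15kW per rack figure is roughly what two 32A feeds deliver at the UK nominal 230V. A minimal sketch of that arithmetic follows; the per-chassis full-load draw is a purely hypothetical placeholder, not the vendor-supplied figure.

# Minimal sketch of the per-rack power arithmetic behind the 15kW limit.
FEEDS_PER_RACK = 2        # 2 x 32A supplies per rack
FEED_CURRENT_A = 32
MAINS_VOLTAGE_V = 230     # UK nominal single-phase voltage

rack_budget_w = FEEDS_PER_RACK * FEED_CURRENT_A * MAINS_VOLTAGE_V
print("rack budget: %.2f kW" % (rack_budget_w / 1000.0))   # ~14.72 kW, quoted as 15kW

chassis_full_load_w = 700  # hypothetical full-load draw of one 'twin' 1U chassis
print("max chassis per rack: %d" % (rack_budget_w // chassis_full_load_w))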
Hardware: non-Capacity
• Servers for Grid services (CEs, WMSs, FTS etc.)
  • 11 ‘twin’ systems, same as batch workers but with two disks
• Low-capacity storage
  • 6 x 2U servers, 8GB RAM, dual-chip dual-core AMD CPUs, 2 x 250GB HDD (RAID1 system), 4 x 750GB HDD (RAID5 data), 3Ware controller
  • For AFS and Home filesystems, installation repositories…
• Xen
  • 4 ‘monster’ systems for virtualisation
  • 2 x dual-core AMD 2222 CPUs, 32GB RAM, 4 x 750GB HDDs on a HW RAID controller
  • For the PPS service and Tier1 testing
• Oracle Databases
  • 5 x servers (redundant PSUs, HW RAID disks) and a 7TB data array (HW RAID)
  • To provide additional RAC nodes for 3D services, LFC/FTS backend, Atlas TAG etc.
FY08/09 Procurements
• Capacity procurements for 2008/9 started
  • PQQs issued to OJEU, responses due mid July
  • Evaluation and issue of technical documents for the limited second stage expected by early August
  • Second stage evaluation September/October
  • Delivery …
• Looking for ~1800TB usable storage and around the same compute capacity as last year
• Additional non-capacity hardware required to replace aging re-tasked batch workers
Hardware: Network
[Network diagram: Tier1 core built around a Force10 C300 8-slot router (64 x 10Gb) and a stack of 4 x Nortel 5530, with groups of CPUs + disks and ADS caches hanging off Nortel 5510 + 5530 stacks (2 to 6 x 5510 + 5530 per group) and Oracle systems attached separately; 10Gb/s links to Router A, the firewall (with a 10Gb/s bypass), the OPN router (10Gb/s to CERN) and the site access router (10Gb/s to SJ5); RAL Tier 2 at N x 1Gb/s and a 1Gb/s test-network link to Lancaster.]
Services
• New compute capacity enabled for CCRC08
  • Exposed a weakness under load in the single-CE configuration
  • Deployed three extra CEs as previously planned
• Moved LFC backend to single-node Oracle
  • To move to RAC with the FTS backend shortly
• Maui issues with buffer sizes, caused by a large increase in the number of jobs running
  • Monitoring task was killing Maui at 8-hour intervals
  • Rebuild with larger buffer sizes cured the problem
Monitoring / On-Call
• Cacti – network traffic and power
• Ganglia – performance
• Nagios – alerts
• 24x7 callout now operational
  • Using Nagios to signal the existing pager system to initiate callouts (a sketch of such a handler follows below)
  • Working well
  • But still learning
• Blogging
  • UK T2s and the Tier1 have blogs: http://planet.gridpp.ac.uk/
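As an illustration of how a Nagios alert can be turned into a pager callout, here is a hedged sketch of a notification handler. This is not the Tier1's actual integration: the pager gateway URL and its parameters are hypothetical, and it assumes Nagios is run with enable_environment_macros=1 so the standard NAGIOS_* variables are available to the handler.

#!/usr/bin/env python
# Illustrative sketch of a Nagios notification handler that forwards an alert
# to a pager gateway.  Gateway URL and parameters are hypothetical.

import os
from urllib.parse import urlencode
from urllib.request import urlopen

PAGER_GATEWAY = "http://pager.example.org/send"   # hypothetical endpoint


def main():
    env = os.environ.get
    message = "%s/%s is %s: %s" % (
        env("NAGIOS_HOSTNAME", "unknown-host"),
        env("NAGIOS_SERVICEDESC", "host"),
        env("NAGIOS_SERVICESTATE", env("NAGIOS_HOSTSTATE", "UNKNOWN")),
        env("NAGIOS_SERVICEOUTPUT", ""))
    data = urlencode({"callsign": "tier1-oncall",
                      "text": message[:160]}).encode()   # keep it pager-sized
    urlopen(PAGER_GATEWAY, data)                          # POST the page request


if __name__ == "__main__":
    main()

In a Nagios configuration such a script would typically be wired in via a define command stanza and attached to the on-call contact's notification commands; the details of the RAL setup are not described in these slides.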
Power Failure: Thursday 7th Feb ~12:15
• Work on building power supplies since December
  • Down to 1 transformer (of 2) for extended periods (weeks) – increased risk of disaster
  • Single transformer running at (close to) maximum operating load
• No problems until the work finished and the casing was being closed
  • Control line crushed and power supply tripped!
• First power interruption for over 3 years
• Restart (effort > 200 FTE hours)
  • Most Global/National/Tier-1 core systems up by Thursday evening
  • Most of the CASTOR/dCache/NFS data services and part of batch up by Friday
  • Remaining batch on Saturday/Sunday
  • Still problems to iron out in CASTOR on Monday/Tuesday
• Lessons
  • Communication was prompt and sufficient but ad hoc
  • Broadcast unavailable as RAL runs the GOCDB (now fixed by caching)
  • Careful restart of disk servers was slow and labour intensive (but worked); it will not scale (a staged-restart sketch follows below)
• http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/
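One way to make such a restart scale is to automate a staged power-on. The sketch below is purely illustrative and is not the procedure used at RAL: the host list file, batch size, delay, credentials, and the assumption that each disk server's BMC is reachable with ipmitool are all hypothetical.

#!/usr/bin/env python
# Hedged sketch of a staged disk-server power-on: bring servers back in small
# batches with a pause between waves so fsck/journal recovery and service
# start-up can be watched.  All parameters below are assumptions.

import subprocess
import time

BATCH_SIZE = 10          # servers per wave
BATCH_DELAY_S = 900      # 15 minutes between waves


def power_on(bmc_host):
    # ipmitool talks to the server's management controller (BMC)
    subprocess.call(["ipmitool", "-I", "lanplus", "-H", bmc_host,
                     "-U", "admin", "-P", "changeme",   # placeholder credentials
                     "chassis", "power", "on"])


def main():
    with open("disk_servers.txt") as f:   # hypothetical list, one BMC hostname per line
        hosts = [line.strip() for line in f if line.strip()]
    for i in range(0, len(hosts), BATCH_SIZE):
        batch = hosts[i:i + BATCH_SIZE]
        for host in batch:
            power_on(host)
        print("powered on: %s" % ", ".join(batch))
        time.sleep(BATCH_DELAY_S)   # give fsck/journal recovery time before the next wave


if __name__ == "__main__":
    main()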
Power Glitch: Tuesday 6th May ~07:03
• County-wide power interruption
  • At RAL, lost power to ISIS, Atlas, lasers etc
  • Single phase (B)
• Knocked off some systems, caused reboots of others
• Blew several fuses in the upper machine room
• Recovery quite quick
• No opportunity for a controlled restart
  • Most of the systems automatically restarted and had gone through fsck or journal recovery before T1/CASTOR staff arrived