1. Tier1 Site Report – HEPSysMan @ RAL, 19-20 June 2008 – Martin Bly

2. Overview
• New Building
• Site issues
• Tier1

3. New Computing Building
• New computing building being constructed opposite the new reception building at RAL
• November 2007 – looked like a sparse Meccano construction – just girders
• Now has walls, a roof, windows and a skylight – the shell is almost complete
• External ‘beautification’ starting
• Internal fitting of the machine room level yet to start
• Completion due late 2008
• Migration planning starting
• Target: move most of the Tier1 hardware Jan–Mar 2009

4. [Image-only slide]

5. Portable Device Encryption
• Big concern in the UK over data loss by ‘government’, as everywhere else
• Mostly careless custodianship rather than ‘enemy action’
• Many stolen/lost laptops, CDs/DVDs going missing in transit…
• Government has mandated that all public service organisations must ensure that any portable device taken off site has its data storage encrypted by an approved tool
• This means all laptops and other portable devices (PDAs, phones) which have access to ‘data’ on the RAL network must be encrypted before they leave site
• ‘Data’ means anything that can identify or be associated with an individual – thus Outlook caches, email lists, synchronised file caches of ‘corporate’ data of any sort
• Many staff have rationalised what they keep on their laptop/PDA – why do you need it? If you don’t need it, don’t keep it! (see the sketch below)
• Using Pointsec from Check Point Software Technologies Ltd
• Will do Windows XP and some versions of Linux…
• …but not Macs, or dual-boot Windows/Linux systems (yet!)
• Painful but necessary – don’t put the data at risk…
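A purely illustrative sketch, not part of the Pointsec deployment: one way a user might audit a home directory for the kinds of cached files (‘Outlook caches, email lists, synchronised file caches’) the policy is concerned with, before deciding what to delete or encrypt. The file patterns and script are assumptions, not an official RAL tool:

    #!/usr/bin/env python
    # Hypothetical audit helper: list files that often hold personal or
    # 'corporate' data. The patterns below are assumptions for illustration.
    import fnmatch
    import os

    SUSPECT_PATTERNS = ["*.pst", "*.ost", "*.mbox", "*.eml", "*.csv", "*.xls", "*.doc"]

    def find_suspect_files(root):
        # Walk the tree and yield (path, size in MB) for anything matching a pattern.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if any(fnmatch.fnmatch(name.lower(), pat) for pat in SUSPECT_PATTERNS):
                    path = os.path.join(dirpath, name)
                    yield path, os.path.getsize(path) / 1e6

    if __name__ == "__main__":
        for path, size_mb in find_suspect_files(os.path.expanduser("~")):
            print("%8.1f MB  %s" % (size_mb, path))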

6. Tier1: Grid Only
• Non-Grid access to the Tier1 has ended. Only special cases now have access to:
  • UIs
  • Direct job submission (qsub)
• Until end of May 2008:
  • IDs were maintained (disabled)
  • Home directories were maintained online
  • Mail forwarding was maintained
• After end of May 2008:
  • IDs will be deleted
  • Home directories will be backed up
  • Mail spool will be backed up
  • Mail forwarding will stop
• AFS service continues for BaBar (and just in case for LCG)

7. CASTOR
• CASTOR: production version is v2.1.6-12 hot-fix 2
• Recently much more stable and reliable
• Good support from the developers at CERN – working well
• Some problems appear at RAL that don’t show up in testing at CERN, because we use features not exercised at CERN – speedy investigation and fixing
• Considerable effort with CMS on tuning disk server and tape migration performance
• Recent work with the developers on migration strategies has improved performance considerably
• Migration to v2.1.7-7 imminent

8. dCache closure
• dCache service closure was announced for the end of May 2008
• Migration of data is proceeding
• Some work still to do to provide a generic CASTOR instance for small VOs
• Likely the closure deadline will slip by some months

9. Hardware: New Capacity Storage
• 182 x 9TB 16-bay 3U servers: 1638TB data capacity (see the arithmetic sketch below)
• Two Lots based on the same Supermicro chassis with different disk OEM (WD, Seagate) and CPU (AMD, Intel)
• Dual RAID controllers – data and system disks kept separate:
  • 3Ware 9650SX-16ML, 14 x 750GB data drives
  • 3Ware 9650SX-4, 2 x 250GB or 400GB system drives
• Twin CPUs (quad-core Intel, dual-core AMD), 8GB RAM, dual 1Gb NICs
• Intel set being deployed – used in CCRC08
• AMD set: some issues with the forcedeth network driver (SL4.4) under high sustained load
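A quick back-of-the-envelope check of the quoted capacity. The double-parity (RAID-6-style) layout assumed here is not stated on the slide; it is simply the layout that makes 14 x 750GB drives come out at 9TB usable per server:

    # Hedged capacity check for the new storage generation.
    # The double-parity layout is an assumption, not stated on the slide.
    servers = 182
    data_drives = 14
    drive_tb = 0.75                      # 750 GB drives
    parity_drives = 2                    # assumed double parity
    usable_per_server = (data_drives - parity_drives) * drive_tb
    print(usable_per_server)             # 9.0 TB per server, as quoted
    print(usable_per_server * servers)   # 1638.0 TB total, matching the slide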

10. Backplane Failures (Supermicro)
• 3 servers “burnt out” their backplanes
• 2 of which set off the VESDA
• 1 called out the fire brigade!
• Safety risk assessment: urgent rectification needed
• Good response from supplier/manufacturer
• PCB fault in a “bad batch”
• Replacement complete

11. Hardware: CPU
• 2007: production capacity ~1500 KSI2K on 600 systems
• Late 2007: upgraded about 50% of capacity to 2GB/core
• FY07/08 procurement (~3000 KSI2K – but YMMV):
  • Streamline: 57 x 1U ‘twin’ servers (114 systems, 3 racks), each system:
    • dual Intel E5410 (2.33GHz) quad-core CPUs
    • 2GB/core, 1 x 500GB HDD
  • Clustervision: 56 x 1U ‘twin’ servers (112 systems, 4 racks), each system:
    • dual Intel E5440 (2.83GHz) quad-core CPUs
    • 2GB/core, 1 x 500GB HDD
• Configuration based on a 15kW per rack maximum, from the suppliers’ ‘full-load’ power consumption data – required to meet power supply and cooling restrictions (2 x 32A supplies); a rough check is sketched below
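A rough sanity check of the 15kW figure. The 230V mains voltage and the reading of the 2 x 32A supplies as per-rack feeds are assumptions; the slide quotes only the current ratings:

    # Hedged check of the per-rack power budget used for the CPU procurement.
    volts = 230.0        # assumed UK single-phase supply voltage
    amps_per_feed = 32.0
    feeds = 2            # "2 x 32A supplies", assumed to be per rack
    available_kw = volts * amps_per_feed * feeds / 1000.0
    print(available_kw)  # ~14.7 kW, consistent with the 15 kW/rack planning figure

    # Implied full-load budget per system (Streamline lot: 114 systems in 3 racks)
    systems_per_rack = 114 // 3
    print(15000.0 / systems_per_rack)   # ~395 W per twin-half system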

12. Hardware: non-Capacity
• Servers for Grid services (CEs, WMSs, FTS, etc.)
  • 11 ‘twin’ systems, same as the batch workers but with two disks
• Low capacity storage
  • 6 x 2U servers, 8GB RAM, dual dual-core AMD CPUs, 2 x 250GB HDD (RAID1 system), 4 x 750GB HDD (RAID5 data), 3Ware controller
  • For AFS and home filesystems, installation repositories…
• Xen
  • 4 ‘monster’ systems for virtualisation
  • 2 x dual-core AMD 2222 CPUs, 32GB RAM, 4 x 750GB HDDs on a HW RAID controller
  • For the PPS service and Tier1 testing
• Oracle databases
  • 5 servers (redundant PSUs, HW RAID disks) and a 7TB data array (HW RAID)
  • To provide additional RAC nodes for 3D services, the LFC/FTS backend, ATLAS TAG etc.

13. FY08/09 Procurements
• Capacity procurements for 2008/9 have started
• PQQs issued to OJEU, responses due mid-July
• Evaluation and issue of technical documents for the limited second stage expected by early August
• Second stage evaluation September/October
• Delivery …
• Looking for ~1800TB usable storage and around the same compute capacity as last year
• Additional non-capacity hardware required to replace ageing, re-tasked batch workers

14. Hardware: Network
• [Network diagram: Tier1 CPU and disk farms and ADS caches connected via stacks of Nortel 5510/5530 switches (10Gb/s uplinks) to a Force10 C300 8-slot router (64 x 10Gb); 10Gb/s firewall bypass to the OPN router and on to CERN at 10Gb/s; 10Gb/s via the site access router to SJ5; 1Gb/s test link to Lancaster; N x 1Gb/s to the RAL Tier 2; Oracle systems also attached]

15. Services
• New compute capacity enabled for CCRC08
  • Exposed a weakness under load in the single-CE configuration
  • Deployed three extra CEs, as previously planned
• Moved the LFC backend to single-node Oracle
  • To move to RAC with the FTS backend shortly
• Maui issues with buffer sizes, caused by the large increase in the number of jobs running
  • A monitoring task was killing Maui at 8-hour intervals
  • A rebuild with larger buffer sizes cured the problem

16. Monitoring / On-Call
• Cacti – network traffic and power
• Ganglia – performance
• Nagios – alerts
• 24x7 callout now operational
  • Using Nagios to signal the existing pager system to initiate callouts (a sketch of this kind of hook is below)
  • Working well, but still learning
• Blogging
  • The UK Tier-2s and the Tier1 have blogs: http://planet.gridpp.ac.uk/
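A minimal sketch (Python 2, contemporary with the report) of the kind of notification handler Nagios can be configured to call so that alerts reach an existing pager system. The gateway URL, message format and command-line arguments are assumptions for illustration, not the actual RAL setup:

    #!/usr/bin/env python
    # Hypothetical Nagios notification handler: forwards an alert to an
    # external pager gateway. The endpoint and fields are invented.
    import sys
    import urllib

    PAGER_GATEWAY = "http://pager.example.org/send"   # hypothetical endpoint

    def page(host, service, state, info):
        # Build a short message and POST it to the pager gateway.
        message = "%s/%s is %s: %s" % (host, service, state, info)
        params = urllib.urlencode({"msg": message[:160]})   # keep it pager-sized
        urllib.urlopen(PAGER_GATEWAY, params).read()

    if __name__ == "__main__":
        # A Nagios notification command would pass the macros as arguments, e.g.
        #   notify_pager.py "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$"
        page(*sys.argv[1:5])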

17. Power Failure: Thursday 7th Feb ~12:15
• Work on the building power supplies since December
  • Down to 1 transformer (of 2) for extended periods (weeks) – increased risk of disaster
  • Single transformer running at (close to) maximum operating load
  • No problems until the work was finished and the casing was being closed up:
  • a control line was crushed and the power supply tripped!
• First power interruption for over 3 years
• Restart (effort > 200 FTE-hours)
  • Most global/national/Tier-1 core systems up by Thursday evening
  • Most of the CASTOR/dCache/NFS data services and part of batch up by Friday
  • Remaining batch on Saturday/Sunday
  • Still problems to iron out in CASTOR on Monday/Tuesday
• Lessons
  • Communication was prompt and sufficient, but ad hoc
  • Broadcast unavailable as RAL runs the GOCDB (now fixed by caching)
  • Careful restart of the disk servers was slow and labour-intensive (but worked) – will not scale
• http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/

18. Power Glitch: Tuesday 6th May ~07:03
• County-wide power interruption
• At RAL, lost power to ISIS, Atlas, lasers etc.
• Single phase (B)
• Knocked some systems off and caused reboots of others
• Blew several fuses in the upper machine room
• Recovery quite quick
• No opportunity for a controlled restart:
  • most of the systems automatically restarted and had gone through fsck or journal recovery before Tier1/CASTOR staff arrived
