RAL Site Report
Martin Bly, HEPiX Fall 2009, LBL, Berkeley CA
Overview
• New Building
• Tier1 move
• Hardware
• Networking
• Developments
New Building + Tier1 move
• New building handed over in April
• Half the department moved into R89 at the start of May
• Tier1 staff and the rest of the department moved in on 6 June
• Tier1 2008 procurements delivered direct to the new building
  • Including the new SL8500 tape silo (commissioned then moth-balled)
  • New hardware entered testing as soon as practicable
• Non-Tier1 kit, including the HPC clusters, moved starting early June
• Tier1 moved 22 June – 6 July
  • Complete success, to schedule
  • 4 contractor firms, all Tier1 staff
  • 43 racks, a C300 switch and 1 tape silo
  • Shortest practical service downtimes
Building issues and developments
• Building generally working well, but teething troubles are usual in new buildings…
  • Two air-con failures
    • Machine room air temperature reached >40 °C in 30 minutes
  • Moisture where it shouldn't be
• The original building plan included a Combined Heat and Power (CHP) unit, so only enough chilled water capacity was installed to cover the period until the CHP was in service
  • Plan changed to remove the CHP => shortfall in chilled water capacity
  • Two extra 750kW chillers ordered for installation early in 2010
  • Will provide the planned cooling until 2012/13
• Timely – planning now underway for the first water-cooled racks (for non-Tier1 HPC facilities)
Recent New Hardware
• CPU
  • ~3000 kSI2K (~1850 cores) in Supermicro 'twin' systems
  • E5420/San Clemente & L5420/Seaburg: 2GB/core, 500GB HDD
  • Now running SL5/x86_64 in production
• Disk
  • ~2PB in 4U 24-bay chassis: 22 data disks in RAID6, 2 system disks in RAID1 (capacity arithmetic sketched below)
  • Two vendors:
    • 50 with a single Areca controller and 1TB WD data drives (deployed)
    • 60 with dual LSI/3ware/AMCC controllers and 1TB Seagate data drives
• Second SL8500 silo, 10K slots, 10PB (1TB tapes)
  • Delivered to the new machine room – pass-through to the existing robot
  • Tier1 use – GridPP tape drives have been transferred
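A minimal sketch of the capacity arithmetic implied by the figures above (22 data disks in RAID6 per 24-bay chassis, 1TB drives, two tranches of 50 and 60 servers); decimal TB is assumed for simplicity.

```python
# Rough capacity arithmetic for the 4U 24-bay storage generation described above.
# Figures are taken from the slide; decimal TB is assumed for simplicity.

def raid6_usable_tb(data_disks: int, disk_tb: float) -> float:
    """RAID6 keeps (n - 2) disks' worth of data; two disks' worth holds parity."""
    return (data_disks - 2) * disk_tb

PER_SERVER_TB = raid6_usable_tb(data_disks=22, disk_tb=1.0)   # 20 TB usable per server
SERVERS = 50 + 60                                             # Areca + LSI/3ware/AMCC batches

if __name__ == "__main__":
    total_pb = SERVERS * PER_SERVER_TB / 1000
    print(f"{PER_SERVER_TB:.0f} TB usable per server, ~{total_pb:.1f} PB across {SERVERS} servers")
```

This reproduces the ~2PB total quoted on the slide (20TB per server across 110 servers).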
Recent / Next Hardware
• 'Services' nodes
  • 10 'twins' (20 systems), twin disks
  • 3 Dell PE 2950 III servers and 4 EMC AX4-5 array units for the Oracle RACs
  • Extra SAN hardware for resilience
• Procurements running
  • ~15000 HEP-SPEC06 for batch, 3GB RAM and 100GB disk per core
    => 24GB RAM and a 1TB drive for an 8-core system (see the sketch below)
  • ~3PB disk storage in two lots of two tranches, January and April
  • Additional tape drives: 9 x T10KB, initially for CMS
    • Total 18 x T10KA and 9 x T10KB for PP use
• To come
  • More services nodes
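The per-node figures follow directly from the per-core procurement spec; a small worked example, assuming the 8-core node quoted on the slide:

```python
# Per-node spec implied by the per-core batch procurement requirement above
# (3 GB RAM and 100 GB disk per core), for the 8-core node quoted on the slide.

CORES_PER_NODE = 8
RAM_PER_CORE_GB = 3
DISK_PER_CORE_GB = 100

ram_gb = CORES_PER_NODE * RAM_PER_CORE_GB      # 24 GB RAM per node
disk_gb = CORES_PER_NODE * DISK_PER_CORE_GB    # 800 GB -> next standard drive size is 1 TB

print(f"{ram_gb} GB RAM, {disk_gb} GB disk (=> a 1 TB drive per node)")
```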
Disk Storage
• ~350 servers
  • RAID6 on PCI-e SATA controllers, 1Gb/s NIC
  • SL4 32bit with ext3
  • Capacity ~4.2PB in 6TB, 8TB, 10TB and 20TB servers
• Mostly deployed for the Castor service
  • Three partitions per server
• Some NFS (legacy data) and xrootd (BaBar)
  • Single/multiple partitions as required
• Array verification using controller tools (scheduling sketched below)
  • 20% of the capacity in any Castor service class verified each week
  • Run Tuesday to Thursday on the servers that have gone longest since their last verify
  • Fewer double throws, decrease in overall throw rates
• Also using CERN's fsprobe to look for silent data corruption
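An illustrative sketch (not RAL's actual tooling) of the verification scheduling policy described above: each week, pick the servers in a service class that have gone longest since their last array verify, up to roughly 20% of the class's capacity. The Server record and the kick_off_verify() call are hypothetical placeholders; the real verifies are run with the vendor controller tools.

```python
# Hypothetical scheduler for the "20% of capacity per week, oldest verify first" policy.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Server:
    name: str
    capacity_tb: float
    last_verified: datetime

def weekly_verify_batch(servers: list[Server], fraction: float = 0.20) -> list[Server]:
    """Select the least-recently-verified servers up to `fraction` of total capacity."""
    budget = fraction * sum(s.capacity_tb for s in servers)
    batch, used = [], 0.0
    for s in sorted(servers, key=lambda s: s.last_verified):   # oldest verify first
        if used + s.capacity_tb > budget:
            break
        batch.append(s)
        used += s.capacity_tb
    return batch

# Usage sketch: run Tuesday-Thursday, then trigger the controller's own verify
# on each selected server (placeholder call).
# for s in weekly_verify_batch(all_servers):
#     kick_off_verify(s.name)
```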
Hardware Issues I
• Problem during acceptance testing of part of the 2008 storage procurement
  • 22 x 1TB SATA drives on a PCI-e RAID controller
  • Drive timeouts, arrays inaccessible
• Working with the supplier to resolve the issue
  • Supplier is working hard on our behalf
  • Regular phone conferences
  • Engaged with the HDD and controller OEMs
• Appears to be two separate issues
  • HDD
  • Controller
• Possible that resolution of both issues is in sight
Hardware Issues II – Oracle databases
• New resilient hardware configuration for the Oracle database SAN using EMC AX4 array sets
  • Used in 'mirror' pairs at the Oracle ASM level
• Operated well for Castor pre-move and for non-Castor post-move, but increasing instances of controller dropout on the Castor kit
  • Eventual crash of one Castor array, followed some time later by the second array
  • Non-Castor array pair also unstable; eventually both crashed together
• Data loss from the Castor databases as a side effect of the arrays crashing at different times and therefore being out of sync. No unique files 'lost'.
• Investigations continuing to find the cause – possibly electrical
Networking
• Force10 C300 in use as core switch since Autumn 08
  • Up to 64 x 10GbE at wire speed (32 ports fitted)
• Not implementing routing on the C300
  • Turns out the C300 doesn't support policy-based routing…
  • …but policy-based routing is on the roadmap for the C300 software – next year sometime
• Investigating possibilities for added resilience with an additional C300
• Doubled up the link to the OPN gateway to alleviate the bottleneck caused by routing UK Tier2 traffic around the site firewall (concept sketched below)
• Working on doubling links to the edge stacks
• Procuring a fallback link for the OPN to CERN using 4 x 1GbE for added resilience
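A conceptual illustration of the policy-based routing idea mentioned above: choose the next hop from the traffic's source network rather than only its destination, e.g. steering trusted Tier2/OPN traffic around the site firewall path. All prefixes and gateway names below are made-up placeholders, not RAL's real addressing, and this is not the C300 configuration syntax.

```python
# Policy-based (source-based) routing decision, illustrated in plain Python.

from ipaddress import ip_address, ip_network

# (source prefix, next hop) pairs, checked in order; last entry is the default.
POLICY_TABLE = [
    (ip_network("192.0.2.0/24"),    "opn-gateway"),     # e.g. a trusted Tier2 range
    (ip_network("198.51.100.0/24"), "opn-gateway"),
    (ip_network("0.0.0.0/0"),       "site-firewall"),   # everything else
]

def next_hop(src: str) -> str:
    addr = ip_address(src)
    for prefix, hop in POLICY_TABLE:
        if addr in prefix:
            return hop
    return "site-firewall"

print(next_hop("192.0.2.17"))    # -> opn-gateway (bypasses the firewall path)
print(next_hop("203.0.113.5"))   # -> site-firewall (default route)
```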
Developments I - Batch Services
• Production service:
  • SL5.2/64bit with residual SL4.7/32bit (2%)
  • ~4000 cores, ~32000 HEP-SPEC06
  • Opteron 270, Woodcrest E5130, Harpertown E5410, E5420, L5420 and E5440
  • All with 2GB RAM/core
  • Torque/Maui on an SL5/64bit host with a 64bit Torque server
    • Deployed with Quattor in September
  • Running 50% over-commit on RAM to improve occupancy (see the sketch after this list)
• Previous service:
  • 32bit Torque/Maui server (SL3) and 32bit CPU workers all retired
  • Hosts used for testing etc.
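A back-of-the-envelope view of why the 50% RAM over-commit improves occupancy: workers have 2GB/core physically, but advertising 1.5x that lets jobs which request up to 3GB still fill every core. This is just the arithmetic, assuming an example 8-core worker and a 3GB job request, not the actual Torque/Maui configuration.

```python
# RAM over-commit arithmetic for an example 8-core worker (assumed figures).

CORES = 8                    # example worker node
PHYS_RAM_GB = CORES * 2      # 2 GB/core physical, as on the slide
OVERCOMMIT = 1.5             # 50% over-commit
JOB_REQUEST_GB = 3           # example per-job memory request

advertised_gb = PHYS_RAM_GB * OVERCOMMIT
slots_without = PHYS_RAM_GB // JOB_REQUEST_GB               # 5 of 8 cores usable
slots_with = min(CORES, advertised_gb // JOB_REQUEST_GB)    # all 8 cores usable

print(f"advertised RAM: {advertised_gb:.0f} GB")
print(f"occupancy without over-commit: {slots_without}/{CORES} cores")
print(f"occupancy with over-commit:    {int(slots_with)}/{CORES} cores")
```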
Developments II - Dashboard
• A new dashboard to provide an operational overview of services and the Tier1 'state' for operations staff, VOs, …
• Constantly evolving
  • Components can be added/updated/removed
  • Pulls data from lots of sources
• Present components
  • SAM tests
    • Latest test results for critical services
    • Locally cached for 10 minutes to reduce load (caching sketched below)
  • Downtimes
  • Notices
    • Latest information on Tier1 operations
    • Only Tier1 staff can post
  • Ganglia plots of key components from the Tier1 farm
• Available at http://www.gridpp.rl.ac.uk/status
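A minimal sketch of the "cache SAM results for 10 minutes" idea above: a fetcher that only hits the upstream source when the cached copy is older than the TTL. The URL is a placeholder, not the real SAM endpoint, and the dashboard's own implementation will differ.

```python
# Simple time-based cache in front of an HTTP fetch, with the 10-minute TTL
# quoted on the slide.

import time
import urllib.request

CACHE_TTL_S = 10 * 60            # 10 minutes
_cache: dict[str, tuple[float, bytes]] = {}

def fetch_cached(url: str) -> bytes:
    """Return the page body, reusing a cached copy if it is fresh enough."""
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL_S:
        return hit[1]                              # fresh enough: no upstream load
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
    _cache[url] = (now, body)
    return body

# Usage sketch (placeholder URL):
# results = fetch_cached("https://example.org/sam/critical-tests")
```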
Developments III - Quattor
• Fabric management using Quattor
  • Will replace the existing hand-crafted PXE/kickstart and payload scripting
• Successful trial of Quattor using virtual systems
• Production deployment of SL5/x86_64 WNs and Torque/Maui for the 64bit batch service in mid September
• Now have additional node types under Quattor management
  • Working on disk servers for Castor
• See Ian Collier's talk on our Quattor experiences: http://indico.cern.ch/contributionDisplay.py?contribId=52&sessionId=21&confId=61917
Towards data taking
• Lots of work in the last 12 months to make services more resilient
  • Taking advantage of the LHC delays
• Freeze on service updates
  • No 'fiddling' with services
  • Increased stability, reduced downtimes
  • Non-intrusive changes only
• But some things, such as security updates, still need to be done
  • Need to manage these to avoid service downtime