Tier1 Report
HEPSysMan @ Cambridge, 23rd October 2006
Martin Bly
Overview
• Tier-1
• Hardware changes
• Services
RAL Tier-1
• RAL hosts the UK WLCG Tier-1
• Funded via the GridPP2 project from PPARC
• Supports WLCG and UK Particle Physics users and collaborators
• VOs:
  • LHC: ATLAS, CMS, LHCb, ALICE (plus dteam, ops)
  • BaBar, CDF, D0, H1, ZEUS
  • bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
• Other experiments:
  • MICE, SNO, UKQCD
• Theory users
• …
Staff / Finance
• Bid to PPARC for the 'GridPP3' project
  • For the exploitation phase of the LHC
  • September 2007 to March 2011
  • Increase in staff and hardware resources
  • Result expected early 2007
• Tier-1 is recruiting
  • 2 x systems administrators, 1 x hardware technician
  • 1 x grid deployment
  • Replacement for Steve Traylen to head the grid deployment and user support group
• CCLRC internal reorganisation
  • Business Units
  • The Tier-1 service is run by the E-Science department, now part of the Facilities Business Unit (FBU)
New building
• Funding approved for a new computer centre building
• 3 floors: computer rooms on the ground floor, offices above
• 240 m² low power density room
  • Tape robots, disk servers etc.
  • Minimum heat density 1.0 kW/m², rising to 1.6 kW/m² by 2012
• 490 m² high power density room
  • Servers, CPU farms, HPC clusters
  • Minimum heat density 1.8 kW/m², rising to 2.8 kW/m² by 2012
• UPS computer room
  • 8 racks + 3 telecoms racks
  • UPS system to provide continuous power of 400 A / 92 kVA three-phase for equipment, plus power to air conditioning (total approx. 800 A / 184 kVA)
• Overall
  • Space for 300 racks (plus robots and telecoms)
  • Power: 2700 kVA initially, maximum 5000 kVA by 2012 (including air conditioning)
  • UPS capacity to meet an estimated 1000 A / 250 kVA for 15-20 minutes, allowing specific hardware a clean shutdown or to ride out short breaks
  • Shared with HPC and other CCLRC computing facilities
• Planned to be ready by summer 2008
Hardware changes
• FY05/06 capacity procurement, March 06
  • 52 x 1U twin dual-core AMD Opteron 270 units
    • Tyan 2882 motherboard
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • 208 job slots, 200 kSI2K
    • Commissioned May 06, running well
  • 21 x 5U 24-bay disk servers
    • 168TB (210TB) data capacity (see the cross-check sketch after this slide)
    • Areca 1170 PCI-X 24-port controller
    • 22 x 400GB (500GB) SATA data drives, RAID 6
    • 2 x 250GB SATA system drives, RAID 1
    • 4GB RAM, dual 1Gb NIC
    • Commissioning delayed (more on this below)
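As a rough cross-check of the quoted disk capacity, assuming the RAID 6 arrays span all 22 data drives with two drives' worth of parity (an assumption on my part), the usable space works out as follows, in decimal terabytes:

    # Hedged cross-check of the March 06 disk procurement capacity.
    # Assumes RAID 6 over 22 data drives with two drives' worth of parity;
    # drive sizes in decimal GB, results in decimal TB.

    def usable_tb(servers, data_drives, parity_drives, drive_gb):
        """Usable capacity in TB for a set of identical RAID arrays."""
        return servers * (data_drives - parity_drives) * drive_gb / 1000.0

    print(usable_tb(21, 22, 2, 400))  # 168.0 TB, matching the quoted figure
    print(usable_tb(21, 22, 2, 500))  # 210.0 TB with the swapped-in 500GB drives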
Hardware changes (2)
• FY06/07 capacity procurements
  • 47 x 3U 16-bay disk servers: 282TB data capacity
    • 3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller
    • 14 x 500GB SATA data drives, RAID 5
    • 2 x 250GB SATA system drives, RAID 1
    • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
    • Delivery expected October 06
  • 64 x 1U twin dual-core Intel Woodcrest 5130 units (550 kSI2K)
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • Delivery expected November 06
• Upcoming in FY06/07:
  • Further 210TB of disk capacity expected December 06 (same spec as above)
  • High-availability systems with UPS
    • Redundant PSUs, hot-swap paired HDDs etc.
  • AFS replacement
  • Enhancement to Oracle services (disk arrays or RAC servers)
Hardware changes (3)
• SL8500 tape robot
  • Expanded from 6,000 to 10,000 slots
  • 10 drives shared between all users of the service
  • Additional 3 x T10K tape drives for PP
  • More when the CASTOR service is working
• STK Powderhorn
  • Decommissioned and removed
Storage commissioning
• Problems with the March 06 procurement:
  • WD4000YR drives on the Areca 1170, RAID 6
  • Many instances of multiple drive dropouts
  • Unwarranted drive dropouts, followed by re-integration of the same drive
• Drive electronics (ASIC) on the 4000YR (400GB) units changed with no change of model designation
  • We got the updated units
• Firmware updates to the Areca cards did not solve the issues
• WD5000YS (500GB) units swapped in by WD
  • Fixes most issues, but…
  • Status data and logs from the drives show several additional problems
  • Testing under high load to gather statistics (see the sketch below)
• Production further delayed
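A minimal sketch of the sort of drive-status gathering described above, assuming smartctl is available on the servers; the device names and the attributes pulled out are illustrative, not the actual test harness used at RAL:

    # Illustration only: poll SMART attributes for a set of SATA drives with smartctl.
    # Device list and attribute selection are assumptions, not the RAL test setup.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]   # hypothetical device names
    ATTRIBUTES = ("Reallocated_Sector_Ct", "UDMA_CRC_Error_Count")

    def smart_lines(device):
        """Return the smartctl attribute lines of interest for one drive."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        return [line for line in out.splitlines()
                if any(attr in line for attr in ATTRIBUTES)]

    for dev in DEVICES:
        for line in smart_lines(dev):
            print(dev, line.strip())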
Air-con issues
• Setup
  • 13 x 80 kW units in the lower machine room; several paired units work together
• Several 'hot' days (for the UK) in July
  • Sunday: dumped ~70 jobs
    • Alarm system failed to notify operators
    • Pre-emptive automatic shutdown not triggered
    • Ambient air temperature reached >35°C, machine exhaust temperature >50°C!
    • HPC services not so lucky
  • Mid week 1: problems over two days
    • Attempts to cut load by suspending batch services to protect data services
    • Forced to dump 270 jobs
  • Mid week 2: two hot days predicted
    • Pre-emptive shutdown of batch services in the lower machine room (a sketch of such a check follows this slide)
    • No jobs lost, data services remained available
• Problem
  • High ambient air temperature tripped the high-pressure cut-outs in the refrigerant gas circuits
  • Cascade failure as individual air-con units work harder
  • Loss of control of machine room temperature
• Solutions
  • Sprinklers under the units
    • Successful, but banned due to Health and Safety concerns
  • Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature
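For illustration only, a minimal sketch of the kind of temperature-triggered pre-emptive action described above; the sensor source, the 35°C threshold and the queue name are assumptions rather than the actual Tier-1 setup, though the qmgr command is standard Torque:

    # Illustrative sketch: stop new batch work when the room temperature climbs too high.
    # Sensor file, threshold and queue name are assumptions, not RAL's configuration.
    import subprocess

    TEMP_THRESHOLD_C = 35.0                      # hypothetical trip point
    SENSOR_FILE = "/var/run/machine_room_temp"   # hypothetical sensor output, in Celsius

    def read_room_temperature():
        with open(SENSOR_FILE) as f:
            return float(f.read().strip())

    def suspend_batch_queue(queue="prod"):
        # Torque's qmgr can stop a queue from starting new jobs; running jobs continue.
        subprocess.run(["qmgr", "-c", "set queue %s started = False" % queue], check=True)

    if __name__ == "__main__":
        if read_room_temperature() > TEMP_THRESHOLD_C:
            suspend_batch_queue()
            print("Batch queue suspended to shed heat load; data services left running.")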
Operating systems
• Grid services, batch workers, service machines
  • SL3/4, mainly 3.0.3, 3.0.5 and 4.2, all ix86
  • SL4 before Xmas
  • Considering x86_64
• Disk storage
  • SL4 migration in progress
• Tape systems
  • AIX: caches
  • Solaris: controller
  • SL3/4: CASTOR systems, newer caches
• Oracle systems
  • RHEL 3/4
• Batch system
  • Torque/MAUI
  • Fair-shares, allocation set by the User Board (an illustration of fair-share weighting follows this slide)
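As an aside, MAUI-style fair-share typically weights each group's historical usage with a per-window decay factor and compares it against a target share of the farm. The sketch below is a generic illustration of that idea; the weights, windows and usage figures are made up and this is not the actual RAL MAUI configuration or User Board allocation:

    # Generic illustration of decayed fair-share usage (MAUI-style), not RAL's config.
    # usage_windows[0] is the most recent accounting window; older windows count less.

    def effective_usage(usage_windows, decay=0.8):
        """Sum per-window usage, discounting older windows by 'decay' per step."""
        return sum(u * decay ** i for i, u in enumerate(usage_windows))

    def priority_adjustment(usage_windows, target_share, total_usage, weight=100.0):
        """Positive if a VO is under its target share, negative if over."""
        share_used = effective_usage(usage_windows) / total_usage
        return weight * (target_share - share_used)

    # Hypothetical numbers: a VO targeted at 30% of the farm that has been running over its share.
    recent = [400.0, 350.0, 300.0]   # CPU-hours per window, newest first
    print(priority_adjustment(recent, target_share=0.30, total_usage=2500.0))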
Databases
• 3D project
  • Participating since the early days
  • Single Oracle server for testing
    • Successful
• Production service
  • 2 x Oracle RAC clusters
  • Two servers per RAC
    • Redundant PSUs, hot-swap RAID 1 system drives
  • Single SATA/FC data array
    • Some transfer rate issues
  • UPS to come
Storage Resource Management
• dCache
  • Performance issues
    • LAN performance very good
    • WAN performance and tuning problems
  • Stability issues
    • Now better: increased the number of open file descriptors and the number of logins allowed (see the sketch after this slide)
• ADS
  • In-house system, many years old
  • Will remain for some legacy services
• CASTOR2
  • Replaces both the dCache disk and tape SRMs for the major data services
  • Replaces Tier-1 access to the existing ADS services
  • Pre-production service for CMS
  • LSF for transfer scheduling
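The open-file-descriptor change above is the kind of limit a busy storage service runs into. As a hedged illustration (the 65536 figure is arbitrary and this is not the dCache mechanism itself), a process can inspect and raise its own soft limit up to the hard limit like this:

    # Illustration only: raise this process's open-file limit towards the hard limit.
    # The 65536 figure is an arbitrary example, not the value used for dCache at RAL.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("current limits: soft=%s, hard=%s" % (soft, hard))

    wanted = 65536
    new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    print("soft limit now %s" % resource.getrlimit(resource.RLIMIT_NOFILE)[0])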
Monitoring
• Nagios
  • Production service implemented (a minimal check plugin is sketched after this slide)
  • 3 servers (1 master + 2 slaves)
  • Almost all systems covered (600+)
  • Replacing SURE
  • Call-out facilities to be added
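Nagios checks are simply external commands that follow its exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The sketch below is a generic example of such a plugin checking free space on a filesystem; the path and thresholds are examples, not one of the checks actually deployed on the Tier-1:

    #!/usr/bin/env python
    # Generic Nagios-style check plugin: free space on a filesystem.
    # Thresholds and the monitored path are examples, not the Tier-1 configuration.
    import os
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_free_percent(path="/", warn=20.0, crit=10.0):
        try:
            st = os.statvfs(path)
        except OSError as err:
            print("DISK UNKNOWN - %s" % err)
            return UNKNOWN
        free_pct = 100.0 * st.f_bavail / st.f_blocks
        if free_pct < crit:
            print("DISK CRITICAL - %.1f%% free on %s" % (free_pct, path))
            return CRITICAL
        if free_pct < warn:
            print("DISK WARNING - %.1f%% free on %s" % (free_pct, path))
            return WARNING
        print("DISK OK - %.1f%% free on %s" % (free_pct, path))
        return OK

    if __name__ == "__main__":
        sys.exit(check_free_percent())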
Networking
• All systems have 1Gb/s connections
  • Except the oldest fraction of the batch farm
• 10Gb/s links almost everywhere
  • 10Gb/s backbone within the Tier-1
    • Complete November 06
    • Nortel 5530/5510 stacks
  • 10Gb/s link to the RAL site backbone
    • 10Gb/s backbone links at RAL expected end November 06
  • 10Gb/s link to the RAL Tier-2
  • 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
    • Expected in production by end of November 06
• Firewall still an issue
  • Planned bypass for Tier-1 data traffic as part of the RAL<->SJ5 and RAL backbone connectivity developments
• 10Gb/s OPN link to CERN active
  • Since September 06
  • Using a pre-production SJ5 circuit
  • Production status at SJ5 handover
Security
• Notified of an intrusion at Imperial College London
• Searched logs (an illustrative log search is sketched below)
  • Unauthorised use of an account from the suspect source
  • Evidence of harvesting of password maps
  • No attempt to conceal activity
  • Unauthorised access to other sites
  • No evidence of root compromise
  • Notified the sites concerned
• Incident widespread
• Passwords changed
• All inactive accounts disabled
• Cleanup
  • Changed NIS to use a shadow password map
  • Reinstalled all interactive systems
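For illustration, a minimal sketch of the kind of log search mentioned above, scanning an SSH auth log for accepted logins from a suspect address; the log path, the suspect host and the matching pattern are assumptions, not the actual incident-response procedure used:

    # Illustration only: scan an auth log for accepted SSH logins from a suspect host.
    # Log location, suspect address and pattern are assumptions for the example.
    import re

    AUTH_LOG = "/var/log/secure"   # typical location on RHEL/SL systems
    SUSPECT = "192.0.2.10"         # placeholder address (documentation range)

    accepted = re.compile(r"Accepted \S+ for (\S+) from (\S+)")

    with open(AUTH_LOG) as log:
        for line in log:
            match = accepted.search(line)
            if match and match.group(2) == SUSPECT:
                print("suspect login as user %s: %s" % (match.group(1), line.strip()))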