
Tier1 Report


Presentation Transcript


  1. Tier1 Report
     HEPSysMan @ Cambridge, 23rd October 2006
     Martin Bly

  2. Overview
     • Tier-1
     • Hardware changes
     • Services

  3. RAL Tier-1
     • RAL hosts the UK WLCG Tier-1
     • Funded via the GridPP2 project from PPARC
     • Supports WLCG and UK Particle Physics users and collaborators
     • VOs:
       • LHC: Atlas, CMS, LHCb, Alice, (dteam, ops)
       • BaBar, CDF, D0, H1, Zeus
       • bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
     • Other experiments: Mice, SNO, UKQCD
     • Theory users
     • …

  4. Staff / Finance
     • Bid to PPARC for the ‘GridPP3’ project
       • For the exploitation phase of the LHC
       • September 2007 to March 2011
       • Increase in staff and hardware resources
       • Result expected early 2007
     • Tier-1 is recruiting
       • 2 x systems admins, 1 x hardware technician, 1 x grid deployment
       • Replacement for Steve Traylen to head the grid deployment and user support group
     • CCLRC internal reorganisation into Business Units
       • The Tier-1 service is run by the E-Science department, now part of the Facilities Business Unit (FBU)

  5. New building
     • Funding approved for a new computer centre building
     • 3 floors: computer rooms on the ground floor, offices above
     • 240m2 low power density room
       • Tape robots, disk servers etc
       • Minimum heat density 1.0kW/m2, rising to 1.6kW/m2 by 2012
     • 490m2 high power density room
       • Servers, CPU farms, HPC clusters
       • Minimum heat density 1.8kW/m2, rising to 2.8kW/m2 by 2012 (rough load totals sketched below)
     • UPS computer room
       • 8 racks + 3 telecoms racks
       • UPS system to provide continuous power of 400A/92kVA three-phase for equipment, plus power to air conditioning (total approx 800A/184kVA)
     • Overall
       • Space for 300 racks (+ robots, telecoms)
       • Power: 2700kVA initially, max 5000kVA by 2012 (inc air-con)
       • UPS capacity to meet an estimated 1000A/250kVA for 15-20 minutes for specific hardware, for clean shutdown / surviving short breaks
     • Shared with HPC and other CCLRC computing facilities
     • Planned to be ready by summer 2008
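As a back-of-the-envelope check of what those heat densities imply, multiplying each room's floor area by its quoted density gives the total cooling load per room; the short calculation below uses only the figures on the slide.

    # Rough heat-load totals implied by the quoted floor areas and heat densities.
    rooms = {
        "low density (240 m2)":  (240, 1.0, 1.6),   # area m2, kW/m2 minimum, kW/m2 by 2012
        "high density (490 m2)": (490, 1.8, 2.8),
    }
    for name, (area, now, by2012) in rooms.items():
        print(f"{name}: {area * now:.0f} kW initially, {area * by2012:.0f} kW by 2012")
    # low density (240 m2): 240 kW initially, 384 kW by 2012
    # high density (490 m2): 882 kW initially, 1372 kW by 2012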

  6. Hardware changes
     • FY05/06 capacity procurement, March 06
     • 52 x 1U twin dual-core AMD 270 units
       • Tyan 2882 motherboard
       • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
       • 208 job slots, 200kSI2K
       • Commissioned May 06, running well
     • 21 x 5U 24-bay disk servers
       • 168TB (210TB) data capacity (see the capacity arithmetic below)
       • Areca 1170 PCI-X 24-port controller
       • 22 x 400GB (500GB) SATA data drives, RAID 6
       • 2 x 250GB SATA system drives, RAID 1
       • 4GB RAM, dual 1Gb NIC
       • Commissioning delayed (more…)
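The quoted capacities follow from the RAID 6 layout: each 24-bay server has 22 data drives, two of which are consumed by parity, leaving 20 drives of usable space per server. A quick check (decimal TB assumed):

    # Usable capacity for 21 servers with 22-drive RAID 6 arrays (2 drives' worth of parity).
    servers, data_drives, parity_drives = 21, 22, 2
    for drive_gb in (400, 500):
        usable_tb = servers * (data_drives - parity_drives) * drive_gb / 1000
        print(f"{drive_gb}GB drives: {usable_tb:.0f}TB")   # 400GB -> 168TB, 500GB -> 210TB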

  7. Hardware changes (2)
     • FY06/07 capacity procurements
       • 47 x 3U 16-bay disk servers: 282TB data capacity
         • 3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller
         • 14 x 500GB SATA data drives, RAID 5
         • 2 x 250GB SATA system drives, RAID 1
         • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
         • Delivery expected October 06
       • 64 x 1U twin dual-core Intel Woodcrest 5130 units (550kSI2K)
         • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
         • Delivery expected November 06
     • Upcoming in FY06/07:
       • Further 210TB disk capacity expected December 06, same spec as above
       • High Availability systems with UPS: redundant PSUs, hot-swap paired HDDs etc
       • AFS replacement
       • Enhancement to Oracle services (disk arrays or RAC servers)

  8. Hardware changes (3)
     • SL8500 tape robot
       • Expanded from 6,000 to 10,000 slots
       • 10 drives shared between all users of the service
       • Additional 3 x T10K tape drives for PP
       • More when the CASTOR service is working
     • STK Powderhorn
       • Decommissioned and removed

  9. Storage commissioning
     • Problems with the March 06 procurement: WD4000YR drives on Areca 1170 controllers, RAID 6
       • Many instances of multiple drive dropouts
       • Unwarranted drive dropouts, followed by re-integration of the same drive
       • Drive electronics (ASIC) on the 4000YR (400GB) units changed with no change of model designation; we got the updated units
       • Firmware updates to the Areca cards did not solve the issues
     • WD5000YS (500GB) units swapped in by WD
       • Fixes most issues, but status data and logs from the drives show several additional problems
     • Testing under high load to gather statistics (see the logging sketch below)
     • Production further delayed
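One way to gather those per-drive statistics during load testing is to poll SMART attributes regularly and log them, so dropouts can be correlated with error counters afterwards. The sketch below is only an illustration: the device names are assumptions, and drives sitting behind a hardware RAID controller may need the vendor's own tools rather than plain smartctl.

    #!/usr/bin/env python
    # Periodically log selected SMART attributes so drive dropouts can be correlated with errors.
    import subprocess, time, datetime

    DEVICES = ["/dev/sda", "/dev/sdb"]          # assumed device names
    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")

    def smart_attrs(dev):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        attrs = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in WATCH:
                attrs[fields[1]] = fields[9]    # raw value is the last column
        return attrs

    while True:
        stamp = datetime.datetime.now().isoformat()
        for dev in DEVICES:
            print(stamp, dev, smart_attrs(dev), flush=True)
        time.sleep(600)                         # sample every 10 minutes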

  10. Air-con issues
     • Setup: 13 x 80kW units in the lower machine room; several paired units work together
     • Several ‘hot’ days (for the UK) in July
       • Sunday: dumped ~70 jobs
         • Alarm system failed to notify operators
         • Pre-emptive automatic shutdown not triggered (a sketch of such a trigger follows this slide)
         • Ambient air temperature reached >35C, machine exhaust temperature >50C!
         • HPC services not so lucky
       • Mid week 1: problems over two days
         • Attempts to cut load by suspending batch services to protect data services
         • Forced to dump 270 jobs
       • Mid week 2: 2 hot days predicted
         • Pre-emptive shutdown of batch services in the lower machine room
         • No jobs lost, data services remained available
     • Problem
       • High ambient air temperature tripped the high-pressure cut-outs in the refrigerant gas circuits
       • Cascade failure as individual air-con units work harder
       • Loss of control of machine room temperature
     • Solutions
       • Sprinklers under units: successful, but banned due to Health and Safety concerns
       • Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature
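A pre-emptive trigger of the kind that failed to fire is conceptually simple: read the ambient temperature, and suspend batch work before the room approaches the trip point. The sketch below is an assumption-laden illustration, not the Tier-1's actual mechanism: the temperature source, the thresholds and the suspend action are all placeholders (the Torque qmgr call shown is just one way to stop new jobs starting).

    #!/usr/bin/env python
    # Sketch of a pre-emptive shutdown trigger: suspend batch work when the room gets too hot.
    import subprocess, time

    WARN_C, TRIP_C = 28.0, 32.0     # assumed thresholds, well below the >35C reached in July

    def room_temperature():
        # Assumed interface: the environment monitor writes the current ambient temperature
        # (degrees C) to a plain text file; replace with whatever probe the site actually has.
        with open("/var/run/room_temperature") as f:
            return float(f.read().strip())

    def suspend_batch():
        # Example action: stop the batch server scheduling new jobs (Torque/MAUI example,
        # assumes operator rights); running jobs are left to finish or be dumped separately.
        subprocess.run(["qmgr", "-c", "set server scheduling = false"])

    while True:
        t = room_temperature()
        if t >= TRIP_C:
            suspend_batch()
        elif t >= WARN_C:
            print(f"warning: ambient temperature {t:.1f}C")
        time.sleep(60)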

  11. Operating systems
     • Grid services, batch workers, service machines
       • SL3, mainly 3.0.3, 3.0.5, 4.2, all ix86
       • SL4 before Xmas; considering x86_64
     • Disk storage
       • SL4 migration in progress
     • Tape systems
       • AIX: caches
       • Solaris: controller
       • SL3/4: CASTOR systems, newer caches
     • Oracle systems
       • RHEL3/4
     • Batch system
       • Torque/MAUI
       • Fair-shares, allocation by the User Board (a toy fair-share illustration follows this slide)
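MAUI's real fair-share behaviour is driven by its own configuration; purely to illustrate the idea (a VO whose recent usage exceeds its agreed allocation loses priority, one that under-uses gains it), here is a toy calculation with made-up numbers:

    # Toy fair-share illustration (not MAUI's actual algorithm or configuration).
    allocations = {"atlas": 0.40, "cms": 0.30, "lhcb": 0.20, "babar": 0.10}   # made-up targets
    recent_use  = {"atlas": 0.55, "cms": 0.20, "lhcb": 0.15, "babar": 0.10}   # made-up usage

    for vo, target in allocations.items():
        delta = target - recent_use[vo]     # positive: under-used, deserves a priority boost
        print(f"{vo:6s} target {target:.0%} used {recent_use[vo]:.0%} adjustment {delta:+.0%}")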

  12. Databases
     • 3D project
       • Participating since early days
       • Single Oracle server for testing: successful
     • Production service
       • 2 x Oracle RAC clusters, two servers per RAC
       • Redundant PSUs, hot-swap RAID1 system drives
       • Single SATA/FC data array
       • Some transfer rate issues
       • UPS to come

  13. Storage Resource Management
     • dCache
       • Performance issues: LAN performance very good; WAN performance and tuning problems
       • Stability issues, now better after increasing the number of open file descriptors and the number of logins allowed (see the sketch below)
     • ADS
       • In-house system, many years old
       • Will remain for some legacy services
     • CASTOR2
       • To replace both dCache disk and tape SRMs for the major data services
       • To replace T1 access to existing ADS services
       • Pre-production service for CMS
       • LSF for transfer scheduling
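The dCache limits themselves live in its configuration, but the underlying knob is the per-process open file descriptor ceiling. A minimal sketch of inspecting and raising that limit (the target value is an assumption; raising the hard limit needs root):

    # Inspect and, where permitted, raise the per-process open file descriptor limit,
    # the OS-level limit behind "increased number of open file descriptors".
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"current limits: soft={soft} hard={hard}")

    wanted = 4096                               # assumed target value
    if soft < wanted:
        resource.setrlimit(resource.RLIMIT_NOFILE, (min(wanted, hard), hard))
        print("soft limit now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])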

  14. Monitoring
     • Nagios
       • Production service implemented (a minimal check plugin is sketched below)
       • 3 servers (1 master + 2 slaves)
       • Almost all systems covered (600+)
       • Replacing SURE
       • Call-out facilities to be added
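Nagios checks are small external programs that follow the plugin convention: exit 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, with a one-line status message on stdout. A minimal custom check might look like the sketch below (the path and thresholds are arbitrary examples):

    #!/usr/bin/env python
    # Minimal Nagios-style check plugin: the exit code tells Nagios the state,
    # the single line of output is what appears in the interface / call-out message.
    import os, sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
    PATH = "/var/spool/pbs"                     # arbitrary example filesystem to watch

    try:
        st = os.statvfs(PATH)
        free_pct = 100.0 * st.f_bavail / st.f_blocks
    except OSError as err:
        print(f"UNKNOWN: cannot stat {PATH}: {err}")
        sys.exit(UNKNOWN)

    if free_pct < 5:
        print(f"CRITICAL: {PATH} {free_pct:.1f}% free")
        sys.exit(CRITICAL)
    if free_pct < 15:
        print(f"WARNING: {PATH} {free_pct:.1f}% free")
        sys.exit(WARNING)
    print(f"OK: {PATH} {free_pct:.1f}% free")
    sys.exit(OK)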

  15. Networking
     • All systems have 1Gb/s connections, except the oldest fraction of the batch farm
     • 10Gb/s links almost everywhere
     • 10Gb/s backbone within the Tier-1
       • Complete November 06
       • Nortel 5530/5510 stacks
     • 10Gb/s link to the RAL site backbone
       • 10Gb/s backbone links at RAL expected end November 06
     • 10Gb/s link to RAL Tier-2
     • 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
       • Expected in production by end of November 06
       • Firewall still an issue: a bypass for Tier-1 data traffic is planned as part of the RAL<->SJ5 and RAL backbone connectivity developments
     • 10Gb/s OPN link to CERN active since September 06
       • Using a pre-production SJ5 circuit
       • Production status at SJ5 handover

  16. Security
     • Notified of an intrusion at Imperial College London
     • Searched logs (a log-scan sketch follows this slide)
       • Unauthorised use of an account from a suspect source
       • Evidence of harvesting password maps
       • No attempt to conceal activity
       • Unauthorised access to other sites
       • No evidence of root compromise
     • Notified the sites concerned; incident widespread
     • Passwords changed; all inactive accounts disabled
     • Cleanup
       • Changed NIS to use a shadow password map
       • Reinstalled all interactive systems
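The log search itself is routine: walk the SSH authentication logs and pull out accepted logins from the suspect source for closer inspection. A minimal sketch (the log path is the usual RHEL/SL location, and the address shown is a documentation-range placeholder, not the real one from the incident):

    #!/usr/bin/env python
    # Scan SSH auth logs for accepted logins from a suspect address (placeholder values).
    import re

    LOG = "/var/log/secure"                     # typical location on RHEL/SL; adjust per host
    SUSPECT = "192.0.2.1"                       # documentation-range placeholder address
    ACCEPT = re.compile(r"sshd\[\d+\]: Accepted \S+ for (\S+) from (\S+)")

    with open(LOG) as f:
        for line in f:
            m = ACCEPT.search(line)
            if m and m.group(2) == SUSPECT:
                print(line.rstrip())            # candidate unauthorised session to investigate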

  17. Questions?
