
RAL Site Report

HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October. Martin Bly, STFC-RAL.



  1. RAL Site Report
     HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October
     Martin Bly, STFC-RAL

  2. Overview
     • General
     • Hardware
     • Storage
     • Networking
     • …
     RAL Site Report - HEPiX Spring 2011

  3. General
     • New CEO for STFC
       • John Womersley takes over from Keith Mason on 1st November
       • To 31st March 2015
     • Staffing @ Tier1
       • 5 staff posts open due to staff moving
       • Replacements agreed despite restrictions
       • Recruitment underway
     • Power
       • ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
       • Isolated to the join between two bus segments (bus-coupler)
       • Loose bolt in bus bar identified and tightened up: fixed

  4. Hardware changes
     • Summary of previous report:
       • 13 x Dell R610 tape servers (10GbE) for T10KC drives
       • 14 x T10KC tape drives
       • Arista 7124S 24-port 10GbE switch + twinax copper interconnects
       • 5 x Avaya 5650 switches + various 10/100/1000 switches
     • New since May
       • Various Dell R510s as small data servers for the Facilities Data Service, which provides interfaces into Castor for RAL site facilities and others
       • 68 x 40TB 4U servers ordered for capacity storage, from two suppliers
         • 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
         • Note that disks may be hard to get :-(
       • 15,000 HEP-SPEC tender completed evaluation, result just announced
     • To come
       • 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays, ...
     • Gone: 22 x 10TB servers (2005 generation)
     • To go: 86 x 6TB servers (2006 generation)
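
The 2.66PB total quoted for the capacity order is consistent with 68 servers of 40TB each if a petabyte is counted as 1024TB. That binary convention is an assumption here; the slide does not state it. A minimal check, which also totals the capacity being retired:

```python
# Capacity arithmetic for the storage figures on this slide.
# Assumption: the slide's "2.66PB" uses 1 PB = 1024 TB (the slide does not say so).

TB_PER_PB_BINARY = 1024
TB_PER_PB_DECIMAL = 1000


def total_tb(servers: int, tb_per_server: int) -> int:
    """Raw capacity of a batch of identical servers, in TB."""
    return servers * tb_per_server


new_tb = total_tb(68, 40)    # capacity storage on order
gone_tb = total_tb(22, 10)   # 2005 generation, already retired
to_go_tb = total_tb(86, 6)   # 2006 generation, due for retirement

new_pb_binary = new_tb / TB_PER_PB_BINARY
new_pb_decimal = new_tb / TB_PER_PB_DECIMAL

if __name__ == "__main__":
    print(f"New: {new_tb} TB = {new_pb_binary:.2f} PB (binary) "
          f"or {new_pb_decimal:.2f} PB (decimal)")
    print(f"Gone: {gone_tb} TB, to go: {to_go_tb} TB")
```

Under the decimal convention the same order would come to 2.72PB, so the quoted figure only matches the binary reading.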

  5. Storage Issues
     • Issue with some 3ware controllers throwing out perfectly healthy WD drives
       • Due to firmware not recognising and handling a failure mode on newer WD drives of the same model
       • Firmware update has fixed this, rollout completed
     • Issue with Adaptec controllers and StorageManager software
       • SM reports many SMART errors when drives are healthy
       • Reports unhealthy ones too
       • Firmware update has fixed this, rolling out shortly
     • Problem with T10KC drives
       • Early production batch issue
       • Firmware fix
       • No recurrence
     • Production storage now using most recent sets of hardware, with older (smaller capacity) hardware as ‘spinning reserve’

  6. Castor Status
     • Castor manages disk and tape storage
       • 18 million files (at Oct 2011)
     • Recent news:
       • Moved to T10KC tape media in production in September (ATLAS, LHCb)
       • New (non-Tier1) production instance for Diamond synchrotron
         • Part of a new complete Facilities Data Service which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data
     • Coming up (Jan-Mar):
       • Move to new database hardware and a more resilient architecture (using DataGuard) over next 6 months
       • Major upgrade of CASTOR with a new optimized scheduler and new tape functionality, better for small files
       • New service ‘head nodes’ in test: Dell R410 and Transtec

  7. Networking
     • WAN
       • UK NREN JANET now has a 100Gb/s backbone
       • Funding for the next upgrade of the NREN, SuperJANET6, has recently been approved
     • Site
       • Sporadic packet loss in site core networking (few %)
         • Still present to a very small degree: intermittent problems with access to LFC dropping for remote users (T2s). May be load related.
       • Asymmetric data transfer rates in/out of Tier1
         • Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network (LAN & WAN performance)
         • Have modified FTS settings with some success
         • Looking at Tier1-UK Tier2 transfers
     • LAN
       • Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
       • Three subnets in use for Tier1
       • Lots of packet discards into stacks, investigating...
     • Developments
       • Looking to provide large bandwidth in Tier1 core with a ‘mesh-type’ arrangement linked at multiple 40Gb/s, with storage connectivity at 10Gb/s
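
TCP/IP tuning and packet loss both appear among the suspects for the transfer-rate problems above. As general background rather than anything taken from the RAL investigation, the sketch below computes the bandwidth-delay product (the window a single TCP stream needs to fill a path) and the widely cited Mathis et al. approximation for loss-limited throughput. The round-trip time, segment size and loss rates are illustrative assumptions, not measurements from the site.

```python
import math

# Illustrative numbers only: the 10 Gb/s link speed matches the 10GbE hardware
# mentioned in the report, but the RTT, MSS and loss rates are assumptions.

MATHIS_C = math.sqrt(3 / 2)  # constant from the Mathis et al. approximation


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on one TCP stream's throughput under random loss."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))


if __name__ == "__main__":
    rtt = 0.010  # 10 ms round trip, hypothetical
    print(f"BDP at 10 Gb/s: {bdp_bytes(10e9, rtt) / 1e6:.1f} MB")
    for loss in (0.01, 0.0001):
        rate = mathis_throughput_bps(1460, rtt, loss)
        print(f"loss {loss:.2%}: about {rate / 1e6:.0f} Mb/s per stream")
```

With these made-up inputs, a loss rate of order 1% caps a single stream far below link speed, which is one reason loss in the core and window tuning are both plausible contributors.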

  8. Databases
     • Small but significant Oracle installation
       • Castor, 3D, LFC, FTS
     • Castor database server hardware to be replaced
       • Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
       • New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend arrays
       • Different ASM architecture: single volumes rather than paired
       • Dataguard from Production RAC to Standby RAC for resilience
       • Standby RACs in different building
       • Backups off the Standby set
     • LFC/FTS
       • Standby set to be added to the existing setup, Dataguard and backup as per Castor, single volume data, ASM volume architecture changes
     • 3D
       • ASM volume architecture changes

  9. Virtualisation
     • Evaluated MS Hyper-V as services virtualisation platform
     • Beginning to roll out local-storage virtualisation for services that don’t need fast failover
     • Struggled for a long time with iSCSI storage arrays (and poor support)
     • New iSCSI arrays ordered
       • To support fast failover etc.
     • Cloud project
       • Department initiative looking at cloud use
       • Talk by Ian Collier

  10. Projects
      • Quattor
        • Batch and Storage systems under Quattor management
          • ~6200 cores, 700+ systems (batch), 500+ systems (storage)
        • Significant time saving
        • Significant rollout on Grid services node types
      • CernVM-FS
        • Major deployment at RAL to cope with software distribution issues
        • More news in talk by Ian Collier later this week

  11. Questions?
