RAL Site Report
HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October
Martin Bly, STFC-RAL
Overview
• General
• Hardware
• Storage
• Networking
• …
RAL Site Report - HEPiX Spring 2011
General
• New CEO for STFC
  • John Womersley takes over from Keith Mason on 1st November
  • To 31st March 2015
• Staffing @ Tier1
  • 5 staff posts open due to staff moving
  • Replacements agreed despite restrictions
  • Recruitments underway
• Power
  • ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
  • Isolated to the join between two bus segments (bus-coupler)
  • Loose bolt in bus bar identified and tightened up – fixed
Hardware changes
• Summary of previous report:
  • 13 x Dell R610 tape servers (10GbE) for T10KC drives
  • 14 x T10KC tape drives
  • Arista 7124S 24-port 10GbE switch + twinax copper interconnects
  • 5 x Avaya 5650 switches + various 10/100/1000 switches
• New since May
  • Various Dell R510s as small data servers for the Facilities Data Service, which provides interfaces into Castor for RAL site facilities and others
  • 68 x 40TB 4U servers ordered for capacity storage – two suppliers
    • 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
    • Note that disks may be hard to get
  • 15,000 HEP-SPEC tender completed evaluation, result just announced
• To come
  • 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays, ...
• Gone: 22 x 10TB servers – 2005 generation
• To go: 86 x 6TB servers – 2006 generation
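As a quick check of the capacity figures on this slide, the arithmetic can be sketched in a few lines of Python. The variable and function names are illustrative only. The assumption that the quoted 2.66PB uses 1 PB = 1024 TB is an inference from the numbers, not something the slide states: 68 x 40TB is 2720TB, which is 2.72PB in decimal units and about 2.66PB in binary units.

```python
# Sketch: reproduce the storage capacity figures quoted on the slide.
# Assumption (not stated on the slide): the 2.66PB total uses 1 PB = 1024 TB.
TB_PER_PB_BINARY = 1024
TB_PER_PB_DECIMAL = 1000


def total_tb(servers: int, tb_per_server: int) -> int:
    """Total raw capacity in TB for a batch of identical servers."""
    return servers * tb_per_server


new_tb = total_tb(68, 40)    # new capacity order
gone_tb = total_tb(22, 10)   # 2005 generation, already retired
to_go_tb = total_tb(86, 6)   # 2006 generation, to be retired

new_pb_binary = new_tb / TB_PER_PB_BINARY
new_pb_decimal = new_tb / TB_PER_PB_DECIMAL

print(f"New order: {new_tb} TB = {new_pb_binary:.2f} PB (1024 TB/PB) "
      f"or {new_pb_decimal:.2f} PB (1000 TB/PB)")
print(f"Retired or retiring: {gone_tb + to_go_tb} TB")
```

On these figures the new order is a little under four times the combined capacity of the two generations being retired.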
Storage Issues
• Issue with some 3ware controllers ejecting perfectly healthy WD drives
  • Due to firmware not recognising and handling a failure mode on newer WD drives of the same model
  • Firmware update has fixed this, rollout completed
• Issue with Adaptec controllers and StorageManager software
  • SM reports many SMART errors when drives are healthy
    • Reports unhealthy ones too
  • Firmware update has fixed this, rolling out shortly
• Problem with T10KC drives
  • Early production batch issue
  • Firmware fix
  • No recurrence
• Production storage now using the most recent sets of hardware, with older (smaller capacity) hardware held as ‘spinning reserve’
Castor Status
• Castor manages disk and tape storage
  • 18 million files (at Oct 2011)
• Recent news:
  • Moved to T10KC tape media in production in September (Atlas, LHCb)
  • New (non-Tier1) production instance for Diamond synchrotron
    • Part of a new complete Facilities Data Service which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data
• Coming up (Jan-Mar):
  • Move to new database hardware and a more resilient architecture (using DataGuard) over next 6 months
  • Major upgrade of CASTOR with a new optimized scheduler and new tape functionality – better for small files
  • New service ‘head nodes’ in test: Dell R410 and Transtec
Networking
• WAN
  • UK NREN JANET now has a 100Gb/s backbone
  • Funding for the next upgrade of the NREN, SuperJANET6, has recently been approved
• Site
  • Sporadic packet loss in site core networking (few %)
    • Still present to a very small degree – intermittent problems with access to LFC dropping for remote users (T2s). May be load related.
  • Asymmetric data transfer rates in/out of Tier1
    • Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network performance (LAN & WAN)
    • Have modified FTS settings with some success
    • Looking at Tier1-UK Tier2 transfers
• LAN
  • Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
  • Three subnets in use for Tier1
  • Lots of packet discards into stacks, investigating...
• Developments
  • Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement linked at multiple 40Gb/s with storage connectivity at 10Gb/s
Databases
• Small but significant Oracle installation
  • Castor, 3D, LFC, FTS
• Castor database server hardware to be replaced
  • Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
  • New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend arrays
  • Different ASM architecture – single volumes rather than paired
  • Dataguard from Production RAC to Standby RAC for resilience
    • Standby RACs in different building
    • Backups off the Standby set
• LFC/FTS
  • Standby set to be added to the existing setup, Dataguard and backup as per Castor, single volume data, ASM volume architecture changes
• 3D
  • ASM volume architecture changes
Virtualisation
• Evaluated MS Hyper-V as services virtualisation platform
  • Beginning to roll out local-storage virtualisation for services that don’t need fast failover
  • Struggled for a long time with iSCSI storage arrays (and poor support)
  • New iSCSI arrays ordered
    • To support fast failover etc
• Cloud project
  • Department initiative looking at cloud use
  • Talk by Ian Collier
Projects
• Quattor
  • Batch and Storage systems under Quattor management
    • ~6200 cores, 700+ systems (batch), 500+ systems (storage)
    • Significant time saving
  • Significant rollout on Grid services node types
• CernVM-FS
  • Major deployment at RAL to cope with software distribution issues
  • More news in talk by Ian Collier later this week
Questions?