RAL Tier1/A Site Report

RAL Tier1/A Site Report
Martin Bly
HEPiX – Brookhaven National Laboratory, 18-20 October 2004

Overview
• Introduction
• Hardware
• Software
• Security

RAL Tier1/A
• RAL is the Tier 1 centre in the UK
• Supports all VOs, with priority given to ATLAS, CMS and LHCb
• LCG core site
• Tier A centre for the Babar collaboration
• Support for other experiments:
  • D0, H1, SNO, UKQCD, MINOS, Zeus, Theory, …
• Various test environments for grid projects

Pre-Grid Upgrade
[Figure: chart with date labels 1 July 2000 and 1 October 2000]

Post-GRID Upgrade
[Figure: GRID load, 21-28 July]
• Full again in 8 hours!

LCG in Production
• Since June, the Tier1 LCG service has evolved into a full-scale production facility
• Sort of sneaked up on us! Gradual change from a test/development environment to full-scale production
• Availability and reliability of the LCG service are now a high priority for RAL staff
• Now the largest single CPU resource at RAL

GRID Production

Hardware
• Main Farms: 884 CPUs, approx 880kSI2K
  • 312 CPUs x P3 @ 1.4GHz
  • 160 CPUs x P4/Xeon @ 2.66GHz, HT off
  • 512 CPUs x P4/Xeon @ 2.8GHz, HT off
• Disk: approx 226TB
  • 52 x 800GB R5 IDE/SCSI arrays
  • 22 x 2TB R5 IDE/SCSI arrays
  • 40 x 4TB R5 EonStor SATA/SCSI arrays
• Tape:
  • 6000 slot Powderhorn Silo, 200GB/tape, 8 drives
• Misc:
  • SUN disk servers, AIX (AFS cell)
  • 140 CPUs x P3 @ 1GHz

Hardware Issues
• CPU and disks delivered June 16
• CPU units:
  • 6 in 256 failed under testing (memory, motherboard)
  • Installed into production after ~4 weeks
• Disk systems:
  • Riser cards failing; looks to be a problem with the batch
  • Issues with EonStor firmware; fixes from vendor
  • Going into production about now

Enhancements
• FY 2004/05 CPU/disk procurement starting shortly
  • Expect lower volume of CPU and disk
  • CPU technology: Xeon/Opteron
  • Disk technology: SATA/SCSI, SATA/FC, …
• Sun systems services and data migrating to SL3
  • mail, NIS -> SL3
  • data -> RH7.3, SL3
  • Due Xmas ’04
• AFS cell migration to SL3/OpenAFS
• Investigating SANs, iSCSI, SAS

Environment
• Farms dispersed over three machine rooms
• Extra temporary air conditioning capacity for summer
  • Actually survived with it mostly idle!
• New air conditioning for lower machine room (A5L), independent from main building air-con system: 5 units, 400kW; arrives November
• Extra power distribution (but not new power)
• All new rack kit to be located in A5L, shared with other high availability services (HPC etc.)
• Issues:
  • New Nocona chips use more power and create more heat
  • Rack weight on raised floors: latest kit is around 8 tonnes
  • Air con unit weight + power

Network
• Site link: 2.5Gb/s to TVN
• Site backbone @ 1Gb/s
• Tier1/A backbone @ 1Gb/s on Summit 7i and 3Com switches
• Latest purchases have single or dual 1Gb/s NIC
• All batch workers connected @ 100Mb/s to 3Com fan-out switches with 1Gb/s uplink
• Disk servers connected @ 1Gb/s to backbone switches
• Upgrades
  • All new hardware to have 1Gb/s NIC
  • Upgrade CPU rack network switches where necessary to 1Gb/s fan-out
  • New backbone switches:
    • Stackable units with 40Gb/s interlink and, where possible, a 10Gb/s upgrade path to the site router
• Joining UKLight network
  • 10Gb/s
  • Fewer hops to HEP sites
  • Multiple Gb/s links to Tier1/A

Software
• Transition to SL3
• Farms:
  • Scientific Linux 3 (Fermi)
    • Babar batch, prototype frontend
  • RedHat 7.n
    • 7.3: LCG batch, Tier1 batch, frontend systems
    • 7.2: Babar frontend systems
• Servers:
  • SL3
    • Systems services (mail, NIS, loggers, scheduler)
  • RedHat 7.2/7.3
    • Disk servers (custom kernels)
  • Fedora Core
    • Consoles, personal desktops
  • Solaris 2.6, 8, 9
    • SUN systems
  • AIX
    • AFS cell

Software Issues
• SL3
  • Easy to install with PXE/Kickstart
  • Migration of the Babar community from the RH 7.3 batch service went smoothly once Babar had validated the installation for batch work
  • Batch system uses Torque/Maui versions from LCG rebuilt for SL3, with some local patches to config parameters (more jobs, more classes). Stable.
• RedHat 7.n
  • Security a big concern (!)
  • Speed of patching
  • Custom kernels a problem
• Enterprise (RHEL, SL)
  • Disk I/O performance (both read and write) not as good as can be achieved with RH 7.n (9). (SL, 2.4.21-15.0.n)
  • Need to test the more recent kernels
  • NFS, LVM and Megaraid controllers don’t mix!

Projects
• Quattor
  • Ongoing preparation for implementation
• Infrastructure data challenge
  • Joining effort to test high speed / high availability / high bandwidth data transfers to simulate LCG requirements
• RSS news service
• dCache
  • Disk pool manager combined with SRM
  • Software is complex to configure
  • Multiple layers: difficult to drill down to find exactly why a problem has occurred; somewhat sensitive to hardware/system configurations
  • Working test deployment
    • 1 head node, 2 pool nodes
  • Next steps:
    • Create a multi-terabyte instance for CMS in LCG

Security
• Firewall at RAL is default Deny inbound
  • Keeps many but not all badguys™ out
  • Specific hosts have inbound Permit for specific ports
  • Sets of rules for LCG components (CE, SE, RB etc.) or services (AFS)
• Outbound: generally open, port 80 via cache
• X11 port was open but not to Tier1/A (closed 1997!)
  • Now closed site-wide as of 8th Oct
• The badguys™ still get in…
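The default-deny inbound policy on this slide can be illustrated with a small sketch. This is not the RAL firewall configuration: the host names, ports and the `PermitRule` and `inbound_allowed` names are all hypothetical, chosen only to show that a packet passes when an explicit permit matches and is dropped otherwise.

```python
from typing import NamedTuple


class PermitRule(NamedTuple):
    host: str   # destination host inside the site (hypothetical name)
    port: int   # destination port
    proto: str  # "tcp" or "udp"


# Hypothetical permit list: hosts and ports are illustrative assumptions,
# not the actual RAL rule set.
PERMIT_INBOUND = {
    PermitRule("ce.example.org", 2119, "tcp"),
    PermitRule("se.example.org", 2811, "tcp"),
    PermitRule("afs.example.org", 7000, "udp"),
}


def inbound_allowed(host: str, port: int, proto: str) -> bool:
    """Default deny: allow only when an explicit permit rule matches exactly."""
    return PermitRule(host, port, proto.lower()) in PERMIT_INBOUND
```

Because the permit list is a set of exact (host, port, protocol) triples, anything not listed is refused without needing a separate deny rule.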

Recent Incident (1)
• Keyboard logger installed at remote site A exposes the password of an account at remote site B
• Access to exposed@siteB
• Scans the account's known_hosts for possible targets
• exposed@siteB has ssh keys unprotected by a pass-phrase
• Unchallenged access to any account@host listed in known_hosts on which the corresponding public key is installed
• !”£$%^&*#@;¬?>|

Recent Incident (2)
• Aug 26 at 23:05 BST: Badguy™ uses the unprotected key of the compromised account at remote site B to enter two systems at RAL, both RedHat 7.2
• Downloads custom IRC bot based on Energy Mech
  • Contains a klogd binary which is the IRC bot
• Possibly tries for privilege escalation
• Installs IRC bot (klogd), attempting to usurp the system klogd or possibly other rogue klogds. Fails to kill system klogd.
• Two klogd processes now running: the system one owned by root and the badguy™ version owned by the compromised user
• At some later time the directory containing the bot code (/tmp/.mc) is deleted
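The tell-tale on this slide, a second klogd owned by an ordinary user, suggests a simple process-table check. The following is a minimal sketch rather than the procedure used at RAL: the `ROOT_ONLY` list and the function names are assumptions, and the input is taken to be the text produced by `ps -eo pid=,user=,comm=`. A bot that picks a less conspicuous name would not be caught this way.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    user: str
    comm: str


# Daemons assumed to run only as root; this list is an illustrative assumption.
ROOT_ONLY = {"klogd", "syslogd"}


def parse_ps(output: str) -> List[Proc]:
    """Parse lines of 'PID USER COMMAND' (e.g. from `ps -eo pid=,user=,comm=`)."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        procs.append(Proc(int(parts[0]), parts[1], parts[2].strip()))
    return procs


def suspicious(procs: List[Proc]) -> List[Proc]:
    """Return processes that use a root-only daemon name but are not owned by root."""
    return [p for p in procs if p.comm in ROOT_ONLY and p.user != "root"]
```

Given a process table containing one klogd owned by root and one owned by a user account, `suspicious` returns only the latter.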

Recent Incident (3)
• Oct 7, am: legitimate remote IRC server admins, who monitor for suspicious activity, tell us that a system has been exhibiting suspicious activity. Systems removed from network and forensic investigation begins
• Dump of bot/klogd process shows 4800+ hosts listed; it appears the system was part of an IRC network
• Badguy™ bot/klogd listens on ports tcp:8181 and udp:34058
• Contacts IRC servers at 4 addresses (port 6667), as "XzIbIt"
• Firewall logs show a relatively small amount of traffic from the affected host
• No trace of root exploits
• Second host was a user frontend system: no evidence of any IRC activity or root compromise

Lessons
• Unprotected ssh keys are bad news
  • If a key is unprotected on your system, then all keys owned by that user elsewhere are likely unprotected too
  • Use ssh-agent or similar
  • There are still .netrc files in use for production userids
• Communication
  • Lack of news from upstream sites a disappointment
  • If we had been told of the exploit at the remote site and the time frames involved, we would have found the IRC bot within hours
• Protect infrastructure from user accessible hosts
  • Firewalling
• Staff time: 2-3 staff weeks
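The first lesson can be turned into an audit: check whether a private key file is stored without a pass-phrase. The sketch below is an illustration under stated assumptions, not a tool from the talk. It relies on two conventions: legacy PEM keys (the format in use in 2004) carry a `Proc-Type: 4,ENCRYPTED` header when encrypted, and the newer OpenSSH key format records its cipher name immediately after the `openssh-key-v1` magic string, with `none` meaning unencrypted. The function name `is_unprotected` is hypothetical.

```python
import base64
import struct

OPENSSH_MAGIC = b"openssh-key-v1\x00"


def is_unprotected(private_key_text: str) -> bool:
    """Return True if the private key text appears to have no pass-phrase."""
    lines = [ln.strip() for ln in private_key_text.strip().splitlines()]
    if len(lines) < 2 or not lines[0].startswith("-----BEGIN"):
        raise ValueError("not a PEM-style private key")

    if "OPENSSH PRIVATE KEY" in lines[0]:
        # Newer OpenSSH format: magic, then a length-prefixed cipher name.
        blob = base64.b64decode("".join(lines[1:-1]))
        if not blob.startswith(OPENSSH_MAGIC):
            raise ValueError("unrecognised OpenSSH key blob")
        start = len(OPENSSH_MAGIC)
        (name_len,) = struct.unpack(">I", blob[start:start + 4])
        cipher = blob[start + 4:start + 4 + name_len]
        return cipher == b"none"

    if "ENCRYPTED PRIVATE KEY" in lines[0]:
        return False  # PKCS#8 encrypted container

    # Legacy PEM: encrypted keys announce themselves in a Proc-Type header.
    return not any(
        ln.startswith("Proc-Type:") and "ENCRYPTED" in ln for ln in lines[1:]
    )
```

Run over the private key files in users' `.ssh` directories, a check of this kind would flag keys like the one abused in the incident; it only inspects how the key is stored and says nothing about pass-phrase strength.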
