RAL Tier1/A Site Report HEPiX-HEPNT Vancouver, October 2003
Contents • GRID Stuff – clusters and interfaces • Hardware and utilisation • Software and utilities
EDG Status • EDG 2.0.x deployed on the production test-bed since early September. Provides: • EDG RGMA info catalogue • RLS for lhcb, biom, eo, wpsix, tutor and Babar • EDG 2.1 deployed on the dev test-bed. VOMS integration work underway. May prove useful for small GRIDPP experiments (e.g. NA48, MICE and MINOS) • EDG 2.0 gatekeeper provides a gateway into the main CSF production farm. Provides access for some Babar and ATLAS work. Being prepared for forthcoming D0 production via SAMGrid • Along with IN2P3, CSFUI provides the main UI for EDG (see the job-submission sketch below) • Many WP3 and WP5 mini test-beds • Further GRID integration into the production farm via LCG – not EDG
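To make the UI workflow concrete, here is a minimal, hedged sketch of submitting a test job from an EDG UI node such as CSFUI. The JDL attributes and the edg-job-submit / edg-job-status commands are standard EDG WMS usage; the VO name, file names and payload are illustrative assumptions, not RAL specifics.

```python
#!/usr/bin/env python
# Hedged sketch: submit a trivial job from an EDG UI node (e.g. CSFUI).
# The JDL attributes and edg-job-* commands are standard EDG WMS usage;
# the VO name, paths and job payload here are illustrative assumptions.
import subprocess
import tempfile

JDL = """
Executable    = "/bin/hostname";
StdOutput     = "hostname.out";
StdError      = "hostname.err";
OutputSandbox = {"hostname.out", "hostname.err"};
"""

def submit():
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(JDL)
        jdl_path = f.name
    # --vo selects the virtual organisation; the value is an assumption.
    out = subprocess.run(["edg-job-submit", "--vo", "lhcb", jdl_path],
                         capture_output=True, text=True, check=True).stdout
    # On success edg-job-submit prints the job identifier, an https:// URL.
    return next(tok for tok in out.split() if tok.startswith("https://"))

if __name__ == "__main__":
    job_id = submit()
    subprocess.run(["edg-job-status", job_id])
```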
LCG Integration • LCG-0 mini test-bed deployed in March • LCG-1 test-bed deployed in July • LCG-1 upgraded to LCG1-1_0_1 in August/September. Consists of: • Lcgwest regional GIIS • RB, CE, SE, UI, BDII, PROXY, 5*WN • WN = 2*1GHz/1GB RAM, SE = 540GB • Soon need to make important decisions about how much hardware to deploy into LCG – driven by what the Experiment Board wants. • Issues: • Installation and configuration are still difficult for non-experts. • Documentation is still thin in many places. • Support is often very helpful, but answers are not always forthcoming for some problems. • Not everything works – all of the time. • Beginning to discuss internally how to interoperate with the production farm.
SRB Service for CMS • SDSC Storage Resource Broker • SRB MCAT for the whole of CMS production. Consists of enterprise-class ORACLE servers and a "thin" MCAT ORACLE client. • SRB interface into the Datastore • SRB-enabled disk server to handle data imports • SRB clients on disk servers for data moving (see the sketch below) • Needed some work to deploy • Very good support from the developers at SDSC • ADS interface integrated into the main SRB source • Considerable learning experience for the Datastore team (and CMS)!
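As an illustration of the data-moving step, a minimal sketch using the standard SDSC Scommand clients as deployed on a disk server. Sinit, Sput, Sls and Sexit are real Scommands; the collection path and file name are assumptions.

```python
#!/usr/bin/env python
# Hedged sketch: moving a file into SRB from a disk server using the
# standard SDSC Scommand clients. Sinit/Sput/Sls/Sexit are real Scommands;
# the collection path and file name below are illustrative assumptions.
import subprocess

def srb_put(local_file, collection):
    subprocess.run(["Sinit"], check=True)            # start an SRB session
    try:
        # Copy the local file into the target SRB collection.
        subprocess.run(["Sput", local_file, collection], check=True)
        # List the collection to confirm the file arrived.
        subprocess.run(["Sls", collection], check=True)
    finally:
        subprocess.run(["Sexit"], check=True)        # close the session

if __name__ == "__main__":
    srb_put("run1234.dat", "/cms.production/imports")  # assumed names
```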
P4 Xeon Experiences • Disappointing performance with gcc • Hoped a 2.66GHz P4 would deliver ~1.5x the throughput of a 1.4GHz P3 • Actually see 1.2 - 1.3x • Can obtain more by exploiting hyper-threading, but Linux CPU scheduling causes difficulties (ping-pong effects; see the pinning sketch below) • Performance is better with the Intel compiler • Efforts to run the O(1) scheduler unsuccessful • CPU accounting now depends on the number of jobs running. • Beginning to look closely at Opteron solutions.
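One workaround worth sketching (an assumption, not something we have deployed): pin each batch job to a fixed CPU so the scheduler cannot bounce it between hyper-threaded siblings. os.sched_setaffinity is the standard Linux affinity interface; the job commands and the CPU numbering are assumptions to be checked against /proc/cpuinfo.

```python
#!/usr/bin/env python
# Hedged sketch: pin batch jobs to fixed CPUs so the Linux scheduler
# cannot bounce them between hyper-threaded siblings ("ping-pong" effect).
# os.sched_setaffinity is the standard Linux affinity call; the job
# commands and the one-job-per-physical-core policy are assumptions.
import os
import subprocess

def run_pinned(cmd, cpu):
    pid = os.fork()
    if pid == 0:                          # child: pin, then exec the job
        os.sched_setaffinity(0, {cpu})    # restrict this process to one CPU
        os.execvp(cmd[0], cmd)
    return pid

if __name__ == "__main__":
    # On a dual P4 Xeon with hyper-threading enabled, which CPU numbers map
    # to distinct physical cores varies; check /proc/cpuinfo (assumption).
    for cpu, job in enumerate([["./job_a.sh"], ["./job_b.sh"]]):
        run_pinned(job, cpu)
    while True:                           # reap children until none remain
        try:
            os.wait()
        except ChildProcessError:
            break
```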
Datastore Upgrade • STK 9310 robot, 6000 slots • IBM 3590 drives being phased out (10GB, 10MB/s) • STK 9940B drives in production (200GB, 30MB/s) • 4 IBM 610+ servers, each with two FC connections and Gbit networking on PCI-X • 9940 drives FC-connected via two switches for redundancy (see the layout below) • SCSI RAID-5 disk with hot spare for 1.2TB of cache space
[Layout diagram: STK 9310 "Powder Horn" robot feeding eight 9940B tape drives, fibre-channel attached through two redundant switches (Switch_1, Switch_2) to four RS6000 servers (fsc0/fsc1 adapters, rmt tape devices), each server fronted by 1.2TB of disk cache and connected to the Gbit network.]
Operating Systems • Redhat 6.2 closed end of August (Babar build-box) • Redhat 7.2 • Babar 7.2 service migrated to Redhat 7.3 during October. • Residual 'bulk' batch service closing soon. • Three front-ends for Babar. • Redhat 7.3 • Now the main workhorse for LHC experiment and Babar batch work. • 'Bulk' service opening soon. • Three front-ends. • LCG-1 • Need to start looking at what to do next (Fedora, Debian, RH-ES/AS, …)! • Need to deploy Redhat Advanced Server
Next Procurement • Based on the experiments' expected demand profiles (as best they can estimate them). • Exact numbers still being finalised, but approximately: • 250 dual-processor CPU nodes • 70TB of available disk • 100TB of tape
New Helpdesk • Needed to deploy a new helpdesk (previously Remedy). Wanted: • Web based • Free, open source • Multiple queues and personalities • Looked at Bugzilla, OTRS and RequestTracker. • Finally selected RequestTracker. • http://helpdesk.gridpp.rl.ac.uk/ • Available to other Tier 2 sites and other GRIDPP projects if needed.
YUMIT: RPM Monitoring • Hundreds of nodes on the farm: need to make sure RPMs are up to date. • Wanted a light-weight solution until full fabric-management tools are deployed. • Package written by Steve Traylen: • Yum installed on all systems • Nightly comparison with the YUM database, uploaded to a MySQL server (see the sketch below) • Simple web-based display utility in Perl
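A hedged sketch of what such a nightly check can look like. YUMIT itself is Steve Traylen's package with a Perl display; this Python rendering, the database host, schema and credentials are illustrative assumptions. The yum exit codes (0 = up to date, 100 = updates available) are standard yum behaviour.

```python
#!/usr/bin/env python
# Hedged sketch of a YUMIT-style nightly check: list pending RPM updates
# on this node and upload them to a central MySQL table. The yum exit
# codes are standard; the database details below are assumptions.
import socket
import subprocess
import pymysql  # assumed client library; YUMIT itself used Perl

def pending_updates():
    proc = subprocess.run(["yum", "-q", "check-update"],
                          capture_output=True, text=True)
    if proc.returncode != 100:      # 100 means updates are available
        return []
    updates = []
    for line in proc.stdout.splitlines():
        fields = line.split()
        if len(fields) == 3:        # name.arch  version  repository
            updates.append((fields[0], fields[1]))
    return updates

def upload(updates):
    node = socket.gethostname()
    # Host, user, password and schema are all illustrative assumptions.
    conn = pymysql.connect(host="yumit-db", user="yumit",
                           password="secret", database="yumit")
    with conn.cursor() as cur:
        cur.execute("DELETE FROM pending WHERE node=%s", (node,))
        cur.executemany(
            "INSERT INTO pending (node, package, version) VALUES (%s, %s, %s)",
            [(node, pkg, ver) for pkg, ver in updates])
    conn.commit()

if __name__ == "__main__":
    upload(pending_updates())
```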
Exception Monitoring: Nagios • Already have an exception-handling system (CERN's SURE coupled with the commercial Automate). • Looking at alternatives – no firm plans yet, but currently evaluating NAGIOS: http://www.nagios.org/ (a sample check-plugin sketch follows)
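For flavour, a minimal sketch of a Nagios check plugin. The exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) is the standard Nagios plugin API; the monitored partition and the thresholds are assumptions.

```python
#!/usr/bin/env python
# Hedged sketch of a Nagios check plugin. The exit-code convention
# (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is the standard Nagios plugin
# API; the /scratch partition and thresholds are assumptions.
import os
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(path="/scratch", warn=0.80, crit=0.95):
    try:
        st = os.statvfs(path)
    except OSError as err:
        print("DISK UNKNOWN - %s" % err)
        return UNKNOWN
    used = 1.0 - float(st.f_bavail) / st.f_blocks
    if used >= crit:
        print("DISK CRITICAL - %s is %.0f%% full" % (path, used * 100))
        return CRITICAL
    if used >= warn:
        print("DISK WARNING - %s is %.0f%% full" % (path, used * 100))
        return WARNING
    print("DISK OK - %s is %.0f%% full" % (path, used * 100))
    return OK

if __name__ == "__main__":
    sys.exit(check_disk())
```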
Summary: Outstanding Issues • Many new developments and new services deployed this year. • We have to run many distinct services, for example FERMI Linux, RH 7.2/7.3, EDG test-beds, LCG, CMS DC03 and SRB. • Waiting to hear when the experiments want LCG in volume. • Pentium 4 performance is disappointing (especially with gcc). • Redhat's changing support policy is a major concern.