Tier-1 – Final preparations for data
Andrew Sansum, 9th September 2009
Themes (last 9 months)
• Improve planning
• Recruitment
• Re-engineer production and operations processes
• Enhance resilience
• Test it works (STEP09)
• Move to R89
• "Test" new Disaster Management System
• Final preparations for data taking
The Plan
[Timeline chart, April–November 2009, in three phases – prepare for STEP, prepare for R89, prepare for data taking – with an update freeze before STEP, an update freeze before data taking and contingency at the end. Milestones: SRM + nameserver, SL5 upgrade, CASTOR upgrade, STEP, LFC/FTS/3D, R89 migration, test of disaster management system, new hardware, CASTOR hardware resilience.]
Recruitment complete
• Recruitment has been tough (but a good team is in place now)
• Initially blocked by the STFC recruitment freeze
• Later, posts were hard to fill
[Chart: recruitment progress over time, annotated with the STFC freeze]
Meeting Experiment Needs
• VO survey carried out in April, based on a series of qualitative and quantitative questions
• Very helpful and considered feedback from most of the significant VOs
• Generally very positive. Key findings:
• Communication between the Tier-1 and the VOs is generally working well
• The production team have made a big difference
• Meeting the commitments/expectations of the LHC VOs
• VOs not always clear on Tier-1 priorities (since addressed through a liaison meeting)
• Non-LHC VOs in particular commented that, although support was good, the Tier-1 did not always deliver services on the agreed timescales (unfortunately intentional, reflecting priorities – expectations management?)
• Documentation poor (still need to work on this)
Production Team/Production Ops
• Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla):
• Handle operational exceptions (Nagios alerts/pager callouts)
• Track tickets
• Monitor routine metrics, loads and network rates
• Ensure operational status is communicated to the VOs
• Represent the Tier-1 at WLCG daily operations
• Oversee downtime planning and agree the near-term downtime plan
• Oversee progression of Service Incident reports
• (Re-)engineer operational processes
• Night-time/weekend team of 5 staff on call at any time (2-hour response):
• Primary on-call (triage and fix easy faults)
• Secondary on-call: CASTOR, Grid, Fabric, Database
Callout rate
• Big improvement during 2009 – recent deterioration owing to development activity and major incidents
Process Improvement
• The service is complex
• Frequent routine interventions, e.g.:
• Adding disk servers to a service class
• Taking disk servers offline
• Mistakes occur if they are not engineered out
• Work in progress, but critical if we are to meet high expectations
CASTOR (I)
• Process of gradual improvement: tracking down causes of individual transfer failures and improving processes (e.g. disk server intervention status)
• Applied ORACLE patch to fix the "Big ID" bug
• Series of CASTOR minor version upgrades to 2.1.7-27, predominantly bug fixes, including a workaround to prevent the ORACLE crosstalk bug from recurring
• Reconfiguration of the internal LSF scheduler to improve stability and scalability (moved from NFS to HTTP)
• Tuning changes
• ORACLE migration to new hardware (two EMC RAID arrays), providing additional resilience, improved performance and better maintenance
• SRM upgrades to version 2.7.15
CASTOR: Downtime (2008-2009)
[Chart: CASTOR downtime over 2008-2009, with the 2.1.7 upgrade and the R89 move marked]
CASTOR (III): Plans
• September:
• Nameserver upgrade to 2.1.8
• SRM upgrade to version 2.8
• CIP upgrade to version 2 (in progress)
• 2009Q4:
• Optimising the ORACLE database
• Additional resilience
• Disaster recovery testing
STEP09: Operations Overview
• Generally very smooth operation:
• Most service systems relatively unloaded, with plenty of spare capacity
• Calm atmosphere
• Daytime "production team" monitored the service
• Only one callout
• Most of the team even took two days out off site for a department meeting!
• Very good liaison with the VOs and a good idea of what was going on
• In regular informal contact with UK representatives
• Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all handled satisfactorily and fixed; did not visibly impact the experiments
• Robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR); caught up quickly
• Very useful exercise – learned a lot, and very reassuring
• More at: http://www.gridpp.rl.ac.uk/blog/category/step09/
STEP09: Batch Service
• Farm typically running > 2000 jobs; by 9th June at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%)
• Problem 1: ATLAS job submission exceeded 32K files on the CE
• See the hole on 9th June – we thought ATLAS had paused, so it took time to spot
• Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts
• Needed more ATLAS jobs in the queue faster; manually capped ALICE. Fixed by 9th June – see the decrease in (red) ALICE work (an illustrative scheduler-config sketch follows below)
• Problem 3: Occupancy initially poor (around 90%). Short on memory (2GB/core, but ATLAS jobs needed 3GB vmem). Gradually increased the MAUI memory over-commit to 50%; occupancy rose to ~98%
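To illustrate the kind of scheduler settings involved in the fair-share and ALICE-cap fix above, here is a minimal, hypothetical maui.cfg fragment. The group names, fairshare percentages and cap value are assumptions for illustration only, not the actual RAL Tier-1 configuration, and the memory over-commit of Problem 3 was a separate Torque/Maui node setting not shown here.

```
# Hypothetical maui.cfg fragment - illustrative sketch, not the RAL config

# Fairshare accounting: window size, depth and weight in job priority
FSPOLICY        DEDICATEDPS     # account usage as dedicated processor-seconds
FSDEPTH         14              # remember 14 fairshare windows
FSINTERVAL      24:00:00        # each window covers one day
FSWEIGHT        100             # weight of the fairshare term in priority

# Per-VO fairshare targets (percent of the farm), matching the shares quoted above
GROUPCFG[atlas] FSTARGET=42
GROUPCFG[cms]   FSTARGET=18
GROUPCFG[lhcb]  FSTARGET=20
GROUPCFG[alice] FSTARGET=3  MAXJOB=150   # temporary hard cap on running ALICE jobs
```

The MAXJOB cap is the blunt instrument referred to as "manually cap ALICE": it limits running jobs for that group regardless of fairshare, and can be removed once the fairshare targets are being honoured again.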
STEP09: Network
• Batch farm drawing approximately 3Gb/s from CASTOR during reprocessing; peaked at 30Gb/s for CMS reprocessing without lazy download
• Total OPN traffic: inbound 3.5Gb/s, outbound 1Gb/s
• RAL -> Tier-2 outbound rate averaged 1.5Gb/s, but with 6Gb/s spikes!
STEP09: Tape
• Tape system worked well. Sustained 4Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We ran a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared).
• Typical average rate of 35MB/s per drive (1-day average)
• Lower than we would like (looking for nearer 45MB/s) – see the worked check below
• On the CMS instance, a modified write policy gave > 60MB/s, but reads are more challenging to optimise
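As a quick sanity check on those figures, the short sketch below converts the sustained aggregate rate into a per-drive figure using only the numbers quoted above; the slightly higher peak value compared with the 35MB/s 1-day average is consistent with mount and repositioning overheads between transfers.

```python
# Rough consistency check of the tape throughput figures quoted above.

aggregate_gbit_s = 4.0   # sustained aggregate rate at peak load (Gb/s)
drives = 13              # ATLAS + CMS drives in use at the time

aggregate_mbyte_s = aggregate_gbit_s * 1000 / 8    # 4 Gb/s = 500 MB/s
per_drive = aggregate_mbyte_s / drives             # ~38 MB/s per drive at peak

print("%.0f MB/s per drive at peak" % per_drive)   # close to the 35 MB/s daily
                                                   # average, below the ~45 MB/s goal
```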
R89: Migration
• Migration planning started early 2008 (building started early 2006)
• Detailed equipment documentation, together with a requirements document, was sent to vendors during September 2008
• Workshop hosted during November; vendors committed to 3 racks (each) per day (we believe 5-6 was feasible)
• Orders placed at the end of November to move 77 racks of equipment (and the robot) to an agreed schedule (Tier-1 = 43 racks)
• Started 22nd July and ended 6th August
• Completed to schedule
R89 Migration
• 43 racks moved
[Timeline chart of the move, late July to 6 August: drain CEs, FTS, WMS and critical services; move CASTOR core + disk; batch workers start; disk complete; batch workers complete; CASTOR restarting; full restart]
Disasters: Swine Flu
• First test of the new disaster management system
• Easy to handle – trivial to generate a contingency plan based on an existing template
• Situation regularly assessed; Tier-1 response initially ran ahead of RAL site planning
• Reached level 2 in the DMS, with assessment meetings every 2 weeks; work mainly on remote working and a communication strategy
• Now downgraded to level 1 until a significant rise in case frequency
• Expect to dust it off again before Christmas
Disasters: Air-conditioning (I)
• Two cooling failures in 3 days:
• Monday (daytime): both chiller systems shut down, restarted quickly
• Tuesday: one chiller shut down and failed over to the second chiller
• Wednesday night: both chillers shut down and could not be restarted
• After the third event we decided not to restart the Tier-1
[Chart: machine room temperatures (15-45°C scale) during the incidents – cold aisle, hot aisle, shutdown, chiller restart, room reaching equilibrium]
Disasters: Air-conditioning (II)
• Initial post-mortem started after the first (daytime) event:
• Thermal monitoring, callout and automated shutdown in R89 not fully implemented/working correctly
• Urgent remedial work underway
• Second, night-time incident raised further concerns:
• Tier-1 called out and rapidly escalated
• But automated shutdown was still in test mode – forced to do a manual shutdown
• Operations thermal callout failed to work as required
• Site security did not escalate the BMS alarm (not an expected alarm)
• Escalation to building services was very slow (owing to R89 still being under warranty/acceptance)
• Chillers could not be restarted
• No explanation of the cause of the outage
• Concluded we would not restart the Tier-1 until the issues were resolved
Disasters: Air-conditioning (III)
• Critical services continued to run:
• Separate, redundant cooling system in the UPS room
• Tape robotics and CASTOR core OK too (low-temperature room)
• By Friday:
• Tier-1 response at disaster level 3 (meeting held with VOs and PMB)
• Building services believed that cooling was stable and the fault could not recur
• All necessary automation, callout and escalation processes in place
• Nevertheless the Tier-1 team was not prepared to run hardware unattended over the weekend
• On Monday:
• Full service restart
• Plan to baby-sit the service during Monday/Tuesday evening
• Forensics and post-mortem continued
Disasters: Air-conditioning (IV)
• Monday 10th incident believed to have been caused by a planned reboot of the Building Management System (BMS):
• Caused the pumps to stop
• Low pressure caused the chiller valves to close
• BMS returned but the system was deadlocked
• Tuesday 11th – single chiller trip followed by failover; logs do not allow diagnosis
• Wednesday 12th – BMS detected over-pressure in the cooling system and triggered a shutdown:
• Probably a true over-pressure (1.9 bar)
• The 1.7 bar threshold was considered too low; now raised to 2.5 bar, and it only triggers a callout
• System tested to 6 bar
• Investigations continue
Disasters: Water Leak
• Water found dripping on the tape robot!
• An "I don't believe this is happening" moment
• Should not be able to happen, as there are no planned water supplies above the machine room
• "Fortunately" the Tier-1 was already shut down, so the robot was turned off too
• STK engineer investigated and concluded that the damage is mainly superficial splash damage: drive heads not contaminated, tapes (60 splashed) probably OK
• Indications that it had been occurring occasionally for several weeks
Disasters: Water Leak
• Cause: condensation from the 1st-floor cooling system
• An incorrect damper setting (air intake) led to excess condensation
• Condensation collected in a "drip tray" and was pumped away
• The tray was too small and the pump inadequate
• Water overflowed the tray and tracked along the floor to a hole
• Remedy:
• Place an umbrella over the robot
• Chillers switched off – 1st floor inspected daily!
• Planning underway to re-engineer the drip trays, pumps, alarms, etc.
• Monitor the tape error rate
Procurements
• Disk, CPU and robotics procurements delayed from their January/February delivery dates
• New SL8500 tape robot entirely for GridPP, 2PB of disk – 24 drive units (50% Areca/WD, 50% 3Ware/Seagate), plus CPU capacity
• Eventually delivered in May, but entangled in the R89 migration
• New robot in production in July
• CPU completed acceptance testing and is being deployed into SL5
• One lot of disk (1PB) ready for deployment
• Second lot failed acceptance (many drive ejects)
• Positive aspects of the acceptance failure:
• The two-lot risk-avoidance strategy worked
• The vendor's 1-week load test failed to find the fault
• Our 28-day acceptance test caught the fault before the kit reached production
LFC, FTS and 3D
• Now complete:
• Upgraded the back-end RAID arrays and Oracle servers
• Replaced elderly RAID arrays with a pair of new EMC RAID arrays
• Better support (we hope)
• Better performance
• Moved to ORACLE RAC for LFC/FTS (increased resilience)
• Separated the ATLAS LFC from the general LFC
• Upgraded the 3D servers and moved them to the new RAID arrays
• Work commenced on testing replication of the LFC for disaster contingency
Quattor – Story so Far
• Began work in earnest in June 2009
• Set up a Quattor Working Group (QWG) instance to manage deployment and configuration of new hardware
• Leverages the strong QWG support for gLite
• Have the SL5 torque/maui server under Quattor control
• Are (as of today) deploying 220+ new WNs in the SL5 batch service (an illustrative profile sketch follows below)
• Significant work to get up and running – a new way of working
• Have uncovered and helped fix a number of bugs and issues in the process
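For readers unfamiliar with Quattor, the fragment below is a minimal, purely illustrative sketch of the sort of Pan-language object template that describes a worker node; the template and include names are invented for this sketch and do not reflect the actual QWG template hierarchy in use at RAL.

```
# Hypothetical Pan object template for a worker node - illustrative only.
object template wn0001;

# Pull in a shared machine-type template that defines the gLite WN configuration
# (the name is invented for this sketch).
include { 'machine-types/grid/worker-node' };

# Node-specific settings are simple path assignments in the profile tree.
'/system/network/hostname' = 'wn0001';
'/system/network/domainname' = 'example.ac.uk';
```

The compiled profile is then fetched by the node's configuration agent, which drives the installed software and service configuration to match it; this is what "under Quattor control" means in the bullets above.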
Quattor – Next Steps
• As we move existing WNs to SL5 (we need 75% of our capacity in SL5), we will quattorise them
• Move CEs and other grid service nodes to Quattor
• Gradually migrate non-grid services to Quattor control
• Aquilon:
• Database back end to Quattor developed by Morgan Stanley
• Improves scalability and manageability (MS are managing >15,000 nodes)
• Will first deploy at RAL
• Then plan to make Aquilon usable by other grid sites as well
Dashboard
• Available at http://www.gridpp.rl.ac.uk/status
• Constantly evolving – components can be added/updated/removed
• Present components:
• SAM tests – latest test results for the critical services, locally cached for 10 minutes to reduce load (a hedged caching sketch follows below)
• Downtimes – ongoing and upcoming downtimes pulled from the GOCDB; red for OUTAGE, yellow for AT_RISK
• Notices – latest information on Tier-1 operations; only Tier-1 staff can post
• Ganglia plots of key components from the Tier-1 farm
• Feedback welcome
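As a rough illustration of the 10-minute caching mentioned above, here is a minimal Python sketch; the URL and the in-memory cache are hypothetical stand-ins, not the real SAM endpoint or the dashboard's actual implementation.

```python
# Minimal sketch of time-based caching of SAM test results - illustrative only.
# The URL below is a hypothetical placeholder, not the real SAM endpoint.
import time
import urllib.request

CACHE_TTL = 600  # seconds: serve cached results for 10 minutes to reduce load
_cache = {"fetched_at": 0.0, "data": None}

def get_sam_results(url="https://sam.example.org/tier1/latest"):
    """Return the latest SAM results, refetching at most once per CACHE_TTL."""
    now = time.time()
    if _cache["data"] is None or now - _cache["fetched_at"] > CACHE_TTL:
        with urllib.request.urlopen(url) as response:
            _cache["data"] = response.read()
        _cache["fetched_at"] = now
    return _cache["data"]
```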
SL5 Migration (I)
• Next week – 14th-18th September!
• LHC only (for now) – but all VOs are affected
• New batch service – lcgbatch01:
• Quattorised torque/maui server
• Quattorised worker nodes
• New LCG-CEs (6-8) for the LHC VOs – the old LHC CEs (3-5) are being retired, other CEs reconfigured
• Same queue configuration
• Use a submit filter script on the CEs to add an SLX property requirement as required (see the sketch below)
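To illustrate the submit-filter approach, below is a minimal, hypothetical sketch of a Torque submit filter that injects a node-property requirement (here "sl5") into jobs that do not already request node properties. The property name and the filtering logic are assumptions for illustration, not the actual RAL filter; Torque runs whatever executable is named by SUBMITFILTER in torque.cfg, passing the job script on stdin and expecting the (possibly modified) script on stdout.

```python
#!/usr/bin/env python
# Hypothetical Torque submit filter sketch - not the actual RAL filter.
# The job script arrives on stdin; the modified script must go to stdout,
# and a non-zero exit code aborts the submission.
import sys

REQUIRED_PROPERTY = "sl5"  # assumed node property marking SL5 worker nodes
DIRECTIVE = "#PBS -l nodes=1:%s\n" % REQUIRED_PROPERTY

lines = sys.stdin.readlines()

# Leave jobs alone if they already request node properties explicitly.
needs_property = not any(
    line.startswith("#PBS") and "nodes=" in line for line in lines
)

if lines and lines[0].startswith("#!"):
    # Keep the shebang first, then inject the directive before the script body.
    sys.stdout.write(lines[0])
    if needs_property:
        sys.stdout.write(DIRECTIVE)
    sys.stdout.writelines(lines[1:])
else:
    if needs_property:
        sys.stdout.write(DIRECTIVE)
    sys.stdout.writelines(lines)
```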
SL5 Migration (II)
• CPU08 going straight into SL5 now (~1800 job slots)
• All 64-bit-capable existing WNs will eventually be reinstalled
• Non-LHC VOs will get a new CE for migration after the dust settles
• No plan to retire SL4 WNs completely yet
October Freeze
• No planned upgrades beyond September, except possibly a network upgrade
• Recognise that some change will have to take place
• Need to put in place a lightweight change-control process
• Allow changes where the benefit outweighs the risk
• Expect increased stability as downtimes reduce
• Apply pressure once more to reduce low-grade failures
Conclusion
• Recent staff additions have had a huge impact on the quality of service we operate
• The Tier-1 development plan for 2009 is nearly complete
• Positive feedback from STEP09 that the service meets requirements
• Still a few major items (like SL5) to get through (fingers crossed)
• Probably still some R89 surprises in the pipeline
• Looking forward to the start of data taking