Summary of service incidents and change assessments since the beginning of March. Details on a restored LHCb condition data schema and ATLAS SRM upgrade assessment. Follow-ups and post-change assessments included.
WLCG Service Report Maria.Girone@cern.ch Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 23rd March 2010
Introduction • Period covered: since the beginning of March • Only one Service Incident Report • Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down (18 hours). • An important “Change Assessment” for upgrade of ATLAS SRM to 2.9 • Site availability plots show good service availability throughout this period • Alarm tests: generally OK but… • Update from WLCG Tier1 Service Coordination meeting(s)
Service Incidents & Change Assessments • Full Change Assessment here • Summary from assessment (IT-DSS) & ATLAS +IT-ES viewpoints follows • Details in next slides
Streams SIR • Description • The main schema containing conditions data of the LHCb experiment (LHCB_CONDDB@LHCBR) had to be restored to a point in time in the past as a result of a logical corruption that occurred on Tuesday morning. The problem was reported to the PhyDB team around 4 pm on Wednesday. The restore completed successfully at around 6 pm, but unfortunately it disrupted Streams replication to LHCb online and five Tier1 sites (GridKa, IN2P3, PIC, RAL and SARA); only replication to CNAF kept working. • Impact • Conditions data of the LHCb experiment was effectively unavailable: • at LHCb online, GridKa and SARA between 4 pm and 10 pm on Wednesday 3rd March • at IN2P3, PIC and RAL between 4 pm on 3rd March and 10 am on Thursday 4th March
Time line of the incident • 4 pm, Wednesday 3rd March - request to restore the central Conditions data schema submitted to the PhyDB service. • 6 pm, 3rd March - restore successfully completed • 6 pm - 8 pm, 3rd March - several Streams apply processes aborted; investigation started • 8 pm - 10 pm, 3rd March - replication to LHCb online, GridKa and SARA fixed • 9 pm - 10 am, 4th March - replication to IN2P3, PIC and RAL re-instantiated • Analysis • The problem was traced to the default settings of the apply processes at the affected sites. In the past, apply rules had to be defined at the destinations running two apply processes, due to an Oracle bug. By default, these rules ignore all tagged LCRs (changes). Unfortunately, a Data Pump import session sets a Streams tag, so the data import operation (part of the schema restore procedure) was filtered out by the Streams apply processes (a query sketch for spotting this setting follows below). • Follow up • The configuration of the apply processes is being reviewed and will be made uniform (the apply rules will be removed at all destination sites).
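The following is a minimal sketch, not the PhyDB team's actual procedure, of how one could list the Streams apply rules at a destination database and flag those that discard tagged LCRs - the setting that caused the Data Pump import to be filtered out. It assumes the cx_Oracle Python module and a monitoring account with SELECT access to DBA_STREAMS_RULES; the connection details are placeholders.

import cx_Oracle

QUERY = """
    SELECT streams_name, rule_owner, rule_name, include_tagged_lcr
      FROM dba_streams_rules
     WHERE streams_type = 'APPLY'
"""

def find_tag_filtering_apply_rules(dsn, user, password):
    """Return apply rules that discard tagged LCRs (INCLUDE_TAGGED_LCR = 'NO')."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        # A rule with INCLUDE_TAGGED_LCR = 'NO' silently drops tagged changes,
        # which is what filtered out the Data Pump import in this incident.
        return [row for row in cursor if row[3] == 'NO']
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder connection details for one (hypothetical) Tier1 destination.
    for rule in find_tag_filtering_apply_rules("t1-db.example.org/LHCBR", "strm_monitor", "secret"):
        print("apply rule discards tagged LCRs:", rule)

Running such a check at every destination would show at a glance which sites would be made uniform by the planned removal of the apply rules.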
ATLAS SRM 2.9 Upgrade • A full Change Assessment was prepared by IT-DSS and discussed with ATLAS • This follows on from discussions at the January GDB, prompted by ATLAS' experience with interventions on CASTOR and related services in 2009 • Highlights: an upgrade schedule was agreed, with the possibility to downgrade if the change did not work satisfactorily; the downgrade procedure was tested prior to giving the green light to upgrade the production instance • Details from the change assessment are in “backup slides”
CASTOR SRM 2.9 upgrade • IT/ES contributed to the upgrade: • Testing new functionality in Pre-Production and validating it against the ATLAS stack • Organizing the stress testing right after the upgrade (a toy sketch of the idea follows below) • Monitoring performance/availability in the subsequent days • From the IT/ES perspective, the procedure is a good compromise between an aggressive and a conservative approach • It allows new functionality to be fully validated, thanks to the Pre-Production SRM endpoint pointing to the CASTOR production instance • It allows scale testing • Not at the same scale as CCRC08 or STEP09 • But, on the other hand, in more realistic conditions (recent problems were caused by chaotic activity rather than by high transfer rates) • The possibility of a quick rollback provides a safety margin
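As a toy illustration only, not the stress-testing machinery ATLAS actually used, the sketch below drives many concurrent SRM listings at a pre-production endpoint and counts failures. The endpoint URL and path are invented placeholders; it assumes the lcg-ls client from the LCG data-management utilities is installed and that a valid VOMS proxy is available.

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-production SRM endpoint and path - placeholders only.
SURL = ("srm://srm-atlas-pps.example.cern.ch:8443/srm/managerv2"
        "?SFN=/castor/cern.ch/grid/atlas/preprod/stress/test.dat")

def probe(i):
    """Issue one SRM listing against the endpoint; True if it succeeded in time."""
    try:
        result = subprocess.run(["lcg-ls", "-b", "-D", "srmv2", SURL],
                                capture_output=True, timeout=600)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    # 500 requests from 50 concurrent workers: a toy stand-in for "scale testing".
    with ThreadPoolExecutor(max_workers=50) as pool:
        outcomes = list(pool.map(probe, range(500)))
    print("requests: %d  failures: %d" % (len(outcomes), outcomes.count(False)))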
SRM 2.9 – Post Change Assessment • After the upgrade and testing by ATLAS, the change assessment was updated with the experience gained • Details of what happened during the upgrade and during the testing period • Not everything went according to plan – one of the main motivations for producing and updating such assessments! • Globally successful: • Tue 16th: agreement that ATLAS remains on SRM 2.9 as their production version.
[Site availability plots, week of 1 March, for ATLAS, ALICE, CMS and LHCb; the numbered annotations (0.1–0.3, 1.1–1.2, 3.1, 4.1–4.3) are explained on the next slide]
Analysis of the availability plots – 1 March • COMMON TO ALL EXPERIMENTS • 0.1 IN2P3: Planned outage for maintenance of batch and mass storage • 0.2 TAIWAN: Scheduled downtime Wednesday morning. Most services recovered quickly except for the LFC and FTS, which hit an Oracle block error; a 2-hour unscheduled downtime was declared for this and the two services recovered at 14:15 • 0.3 IN2P3: SAM SRM test failures, which disappeared after ~4 hours (problems with the BDII; known performance problem of the SL5 BDII) • ATLAS • 1.1 RAL: SRM overload (tests hitting 10-minute timeouts). Two ATLASDATADISK servers out with independent problems • 1.2 NIKHEF: Problem with one disk server (seems to be due to an Infiniband driver; kernel timeout values need to be increased) • ALICE • Nothing to report • CMS • 3.1 RAL: Temporary test failures due to deployment of the new version of the File Catalog • LHCb • 4.1 GRIDKA: SQLite problems due to the usual nfslock mechanism getting stuck; the NFS server was restarted • 4.2 CNAF: Problems with the local batch system, under investigation • 4.3 NIKHEF: The failing critical File Access SAM test has been understood by the core application developers: some libraries (libgsitunnel) for SLC5 platforms were not properly deployed in the AA
[Site availability plots, week of 8 March, for ATLAS, ALICE, CMS and LHCb; the numbered annotations (0.1, 1.1–1.4, 3.1–3.3, 4.1–4.3) are explained on the next slide]
Analysis of the availability plots – 8 March • COMMON TO ALL EXPERIMENTS • 0.1 TAIWAN: Unscheduled power cut • ATLAS • 1.1 INFN: Enabled checksums for INFN-T1 in FTS. Problems were observed, so checksums were switched off • 1.2 NIKHEF: Unscheduled downtime (FTS is down); unable to start the transfer agents • 1.3 RAL: Disk server out of action, part of ATLAS MC DISK. SRM overload (tests hitting 10-minute timeouts) • 1.4 NDGF: The LFC's host certificate had expired; now fixed. The LFC daemon is giving a core dump, under investigation • ALICE • Nothing to report • CMS • 3.1 CNAF: CE SAM test failures - LSF master dying • 3.2 IN2P3: SRM test failure (authentication problems) • 3.3 CERN: Temporary SAM test failure (timeout) • LHCb • 4.1 IN2P3: Temporary test failures (software missing) • 4.2 NIKHEF: Temporary test failures due to migrating to and testing new test code • 4.3 PIC: Application-related issues on the certification system during the weekend were accidentally published in SAM. Experts contacted; fixed
[Site availability plots, week of 15 March, for ATLAS, ALICE, CMS and LHCb; the numbered annotations (0.1–0.2, 1.1–1.5, 2.1, 3.1–3.2, 4.1–4.7) are explained on the next slide]
Analysis of the availability plots – 15 March • COMMON TO ALL EXPERIMENTS • 0.1 IN2P3: FTS upgrade to 2.2.3 • 0.2 NIKHEF: Two of the disk servers have problems accessing their file systems • ATLAS • 1.1 NDGF: LFC's certificate expired. Cloud set offline in Panda and blacklisted in DDM • 1.2 NIKHEF: FTS down until the morning of 15 March; downgraded to FTS 2.1 and put back in DDM • 1.3 NDGF: Installed new certificate on the server, but occasional crashes of the LFC daemon • 1.4 SARA: Some SRM problems observed at SARA, quickly fixed • 1.5 NDGF: SAM test failure (timeout when executing test SRMv2-ATLAS-lcg-cp after 600 seconds) • ALICE • 2.1 KIT: Proxy registration not working; the user is not allowed to register his proxy within the VOBOX • CMS • 3.1 KIT: SAM test failure (connection errors to SRM) • 3.2 RAL: SAM test failure (timeout when executing test CE-cms-analysis after 1800 seconds) • LHCb • 4.1 CNAF: Authentication issues on SRM and gridftp after the StoRM upgrade • 4.2 PIC: SAM test failure (DaVinci installation in progress) • 4.3 RAL: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER) • 4.4 GRIDKA: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER; some authentication problem) • 4.5 CNAF: SAM test failure (missing software) • 4.6 CERN: SAM test failure (SRMv2-lhcb-DiracUnitTestUSER) • 4.7 NIKHEF: Problem with a disk server
Alarm Ticket Tests (March 10) • Initiated by the GGUS developers towards the Tier1s, as part of the service verification procedure, in 3 slices: • Asia/Pacific right after the release, • European sites early afternoon (~12:00 UTC), • US sites and Canada late afternoon (~18:00 UTC). • The alarm tests in general went well except that… • For BNL and FNAL the alarm notifications were sent correctly, but the ticket creation in the OSG FP system failed due to a missing submitter name in the tickets. • The alarm tickets for BNL and FNAL had been submitted from home without using a certificate (which had been assumed) • “Improved” – retested & OK… • One genuine alarm in the last 4 weeks – drill-down follows
GGUS Alarms • Of the 20 alarm tickets in the last month all are tests except for one - https://gus.fzk.de/ws/ticket_info.php?ticket=56152 • TAIWAN LFC not working. Detailed description: “Dear TAIWAN site admin, we found that you come out from your downtime https://goc.gridops.org/downtime/list?id=71605446 but your LFC is not working:
[lxplus224] /afs/cern.ch/user/d/digirola > lfc-ping -h lfc.grid.sinica.edu.tw
send2nsd: NS000 - name server not available on lfc.grid.sinica.edu.tw
nsping: Name server not active
Can you please urgently check? To contact the Expert On Call: +41 76 487 5907” • Submitted 2010-03-03 10:57 UTC by Alessandro Di Girolamo • Solved 2010-03-03 19:52, but with a new ticket for possibly missing files…
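As an aside, the lfc-ping check quoted in the ticket could be wrapped into a Nagios-style probe along the following lines. This is only a sketch: it assumes the LFC client tools are on the PATH, and the exit-code convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) is the usual Nagios one, not anything mandated by GGUS or the site.

import subprocess
import sys

def check_lfc(host, timeout=60):
    """Run lfc-ping against the given LFC host and map the result to a Nagios status."""
    try:
        result = subprocess.run(["lfc-ping", "-h", host],
                                capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return 3, "UNKNOWN: lfc-ping could not be run (%s)" % exc
    if result.returncode == 0:
        return 0, "OK: LFC name server on %s is alive" % host
    detail = (result.stderr or result.stdout).strip() or "name server not active"
    return 2, "CRITICAL: %s" % detail

if __name__ == "__main__":
    # Host taken from the alarm ticket quoted above.
    status, message = check_lfc("lfc.grid.sinica.edu.tw")
    print(message)
    sys.exit(status)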
WLCG T1SCM Summary • 4 meetings held so far this year (5th is this week…) • Have managed to stick to agenda times – minutes available within a few days (max) of the meeting • “Standing agenda” (next) + topical issues: • FTS 2.2.3 status and rollout; • CMS initiative on handling prolonged site downtimes; • Review of Alarm Handling & Problem Escalation; • Oracle 11g client issues; • Good attendance from experiments, service providers at CERN and Tier1s – attendance list will be added
WLCG T1SCM Standing Agenda • Data Management & Other Tier1 Service Issues • Includes update on “baseline versions”, outstanding problems, release update etc. • Conditions Data Access & related services • Experiment Database Service issues • AOB • A full report on the meetings so far will be given to tomorrow’s GDB
Summary • Stable service during the period of this report • Streamlined daily operations meetings & MB reporting • Successful use of a “Change Assessment” for a significant upgrade of the ATLAS SRM – the positive feedback is encouraging • The WLCG T1SCM is addressing a wide range of important service-related problems on a timescale of 1-2 weeks+ (longer in the case of prolonged site downtime strategies) • More details on the T1SCM in tomorrow's report to the GDB • Critical Services support at Tier0 reviewed at the T1SCM • Propose an Alarm Test for all such services to check the end-to-end flow
ATLAS SRM 2.9 • Description • SRM-2.9 fixes issues seen several times by ATLAS, but has had major parts rewritten. The deployment should be tested, but ATLAS cannot drive sufficient traffic via the PPS instance to gain enough confidence in the new code. As discussed on 2010-02-11, one way to test would be to update the SRM-ATLAS production instance for a defined period (during the day) and then roll back (unless ATLAS decides to stay on the new version). • The change date is 9 Mar 2010, with a current risk assessment of Medium (potentially high impact, but the rollback has been tested).
Testing Performed • standard functional test suite on the development testbed for 2.9-1 and 2.9-2 [FIXME: pointer to test scope] • standard stress test on the development testbed for 2.9-1 [FIXME: pointer to test scope] • SRM-PPS has been running 2.9-1 since 2010-02-12 (but receives little load) and 2.9-2 since 2010-03-05; a short functional test passed (a minimal smoke-test sketch follows below) • SRM-2.9-1 to 2.8-6 downgrade procedure (config/RPM/database) has been validated on SRM-PPS • SRM-2.9-2 to 2.8-6 downgrade procedure (config/RPM/database) has been validated on SRM-PPS • SRM-2.9-1 DB upgrade script applied successfully (modulo DB jobs) to a DB snapshot
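For illustration only, not the actual CASTOR/SRM functional test suite: a minimal copy-in/copy-out smoke test in the spirit of the “short functional test” above might look like the sketch below. It assumes the lcg-cp client and a valid VOMS proxy; the endpoint and path are invented placeholders, and clean-up of the remote test file is deliberately left out.

import filecmp
import os
import subprocess
import tempfile

# Hypothetical pre-production SRM endpoint and path - placeholders only.
SURL = ("srm://srm-atlas-pps.example.cern.ch:8443/srm/managerv2"
        "?SFN=/castor/cern.ch/grid/atlas/preprod/functional/smoke.dat")

def run(cmd):
    """Run a command, raising if it fails or takes more than 10 minutes."""
    subprocess.run(cmd, check=True, timeout=600)

def smoke_test():
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "in.dat")
    back = os.path.join(workdir, "out.dat")
    with open(src, "wb") as f:
        f.write(os.urandom(1024 * 1024))  # 1 MB of random test data
    run(["lcg-cp", "-b", "-D", "srmv2", "file://" + src, SURL])   # copy in
    run(["lcg-cp", "-b", "-D", "srmv2", SURL, "file://" + back])  # copy back out
    assert filecmp.cmp(src, back, shallow=False), "round-trip data mismatch"
    print("functional smoke test passed")

if __name__ == "__main__":
    smoke_test()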
Extended downtime during upgrade • Is the upgrade transparent or is downtime required? DOWNTIME (both on upgrade and on downgrade); outstanding SRM transactions will be failed • Are there major schema changes? What is the timing of the downtime? Yes, major schema changes. Expected downtime: 30 min each • What is the impact if the change execution overruns, e.g. with a limited change window before production has to restart? No ATLAS data import/export; the impact depends on the exact date. • Risk if the upgrade is not performed • What problems have been encountered with the current version which are fixed in the new one (tickets/incidents)? • major change: removal of synchronous stager callbacks, which are responsible for PostMortem12Dec09 and IncidentsSrmSlsRed24Jan2010 (as well as several other periods of unavailability) • feature: bug#60503 allows ATLAS to determine whether a file is accessible; this should avoid scheduling some FTS transfers that would otherwise fail (i.e. a lower error rate). • Divergence between current tested/supported version and installed versions? • No divergence yet - SRM-2.9 isn't yet rolled out widely
Change Completion Report • Did the change go as planned? What were the problems? • update: took longer than expected (the full 30-minute slot; expected: 5 min) • mismatch between "production" DB privileges and those on PPS (and on the "snapshot"): the DB upgrade script failed with "insufficient privileges". Fixed by Nilo (also on the other instances). • "scheduled" SW update: introduced other/unrelated RPM changes into the test. Servers were rebooted to apply a new kernel (within the update window) • during the test: • the SLS probe showed "lcg_cp: Invalid argument" at 9:40 (NAGIOS dteam/ops tests at 9:30 were OK) - understood, the upgrade procedure had not been fully followed. • Peak of activity observed at 9:47:30 (mail from Stephane "we start the test" at 9:35), resulting in ~700 requests being rejected because of thread exhaustion • High DB row lock contention observed, which cleared up by itself - due to user activity (looping on bringOnline for a few files, plus an SRM-2.9 inefficiency addressed via a hotfix) • This led to a number of extra requests being rejected because of thread exhaustion • result from the initial tests: ChangesCASTORATLASSRMAtlas29TestParamChange, applied on Thu morning. • The downgrade was postponed after the meeting Tue 15:30 (ATLAS ELOG) as the new version had been running smoothly. • The downgrade was reviewed Thu 15:30 → keep 2.9 until after the weekend; consider 2.9 to be "production" if no major issue arises until Tuesday 16 March. • Tue 16th: agreement that ATLAS remains on SRM 2.9 as their production version.