Explore lessons from the past, key achievements, and unresolved issues of the Worldwide LHC Computing Grid service. Delve into performance tests, Murphy's Law, and strategies for successful commissioning. Discover what to expect in 2008.
The Worldwide LHC Computing Grid – WLCG Service Schedule
The WLCG Service has been operational (as a service) since late 2005. A few critical components still need to be added or upgraded, but we have no time left prior to the Full Dress Rehearsals in the summer…
Talk Outline • Lessons from the past – flashback to CHEP 2K • Expecting the un-expected • Targets set at the Sep ’06 LHCC Comprehensive review • What we have achieved so far • What we have failed to achieve • The key outstanding issues • What can we realistically achieve in the very few remaining weeks (until the Dress Rehearsals…) • Outlook for 2008...
“Conventional wisdom” - 2000 • “Either you have been there or you have not” • Translation: you need to test everything both separately and together under full production conditions before you can be sure that you are really ready. For the expected. • There are still significant things that have not been tested by a single VO, let alone by all VOs together • CMS CSA06 preparations: careful preparation and testing of all components over several months - basically everything broke first time (but was then fixed)… This is a technique that has been proven (repeatedly) to work… • This is simply “Murphy’s law for the Grid”… • How did we manage to forget this so quickly?
The 1st Law Of (Grid) Computing • Murphy's law (also known as Finagle's law or Sod's law) is a popular adage in Western culture which broadly states that things will go wrong in any given situation: "If there's more than one way to do a job, and one of those ways will result in disaster, then somebody will do it that way." It is most commonly formulated as "Anything that can go wrong will go wrong." In American culture the law was named after Major Edward A. Murphy, Jr., a development engineer working for a brief time on rocket sled experiments done by the United States Air Force in 1949. • … the law first received public attention during a press conference at which Stapp was asked how it was that nobody had been severely injured during the rocket sled runs [testing human tolerance for g-forces during rapid deceleration]. Stapp replied that it was because they took Murphy's Law under consideration. • "Expect the unexpected" – Bandits (Bruce Willis)
LHC Commissioning (CHEP ’06) – expected to be characterised by: • Poorly understood detectors, calibration, software, triggers etc. • Most likely no AOD or TAG from the first pass – but ESD will be larger? • Significant increases in data rates and event sizes – particularly for ATLAS – which pose major problems • 800MB/s (TDR); 1GB/s (Megatable January); 1.9GB/s (April 23 estimate) • AOD has exploded from 50KB/event to 250-500KB/event • The pressure will be on to produce some results as soon as possible! • There will not be sufficient resources at CERN to handle the load • We need a fully functional distributed system, aka Grid • There are many Use Cases we have not yet clearly identified • Nor indeed tested – this remains to be done in the coming 9 months!
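To make the scale of those numbers concrete, a back-of-the-envelope calculation. This is illustrative only: the rates above are evolving estimates, and the 10^9-event sample size below is a hypothetical choice, not an experiment figure.

```python
# Illustrative arithmetic for the export rates and AOD sizes quoted above.
SECONDS_PER_DAY = 86_400

RATES_GB_PER_S = {"TDR": 0.8, "Megatable (Jan)": 1.0, "April 23 estimate": 1.9}

for label, rate in RATES_GB_PER_S.items():
    daily_tb = rate * SECONDS_PER_DAY / 1_000  # GB/s -> TB/day (decimal units)
    print(f"{label:>20}: {rate:4.1f} GB/s  ->  ~{daily_tb:6.0f} TB/day exported from the Tier-0")

# Effect of the AOD event-size growth on a fixed-size (hypothetical) sample.
events = 1_000_000_000
for kb_per_event in (50, 250, 500):
    print(f"AOD at {kb_per_event:3d} KB/event -> ~{events * kb_per_event / 1e9:5.0f} TB for 10^9 events")
```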
WLCG Commissioning Schedule • Still an ambitious programme ahead • Timely testing of the full data chain from DAQ to Tier-2 was a major item from the last Comprehensive Review • DAQ → Tier-0 still largely untested (WLCG Comprehensive Review – Service Challenge 3/4 Status – Jamie.Shiers@cern.ch)
WLCG Commissioning Schedule (2007–2008) • Introduce residual services: File Transfer Service for T1-T2 traffic; Distributed Database Synchronisation; Storage Resource Manager v2.2; VOMS roles in site scheduling; SL4 / 64-bit for experiments • Continued testing of computing models, basic services • Testing the full data flow DAQ → Tier-0 → Tier-1 → Tier-2 • Building up end-user analysis support • Dress Rehearsals – exercising the computing systems, ramping up job rates, data management performance, … • Commissioning the service for the 2007 run – increase performance, reliability, capacity to target levels, monitoring tools, 24 x 7 operation, … • 01 Jul 07 – service commissioned – full 2007 capacity, performance • Detectors live, ready for first LHC collisions (see later…) • 1st April was the target to have the required services in place to prepare for the Dress Rehearsals! This was to allow 3 months for experiment integration & testing prior to the DRs. Unsurprisingly, we did not fully meet the – even revised – April 1 targets. In addition, a few ‘surprises’ have emerged…
What have we (not) achieved? • Of the list of residual services, what precisely have we managed to deploy? • How well have we managed with respect to the main Q1 milestones? • What are the key problems remaining?
WLCG Service - Status • LFC with bulk methods (1.6.3) • Has been tested by ATLAS and deployed at all concerned sites • DPM / LFC 1.6.4 (secondary groups) in certification, but will require coordination of deployment • FTS 2.0 • Currently in pilot phase, being tested by the experiments. This will take all of May, including at least one bug-fix release. Push out to sites (Tier1s) during June. • Distributed DB services for ATLAS & LHCb • Still some delays in production deployment / usage. See caveats from SCx (hidden) and also below. • Procedures / testing (SAM) for VO services [incl. Frontier, Squid etc.] • Monitoring / logging / reporting / dashboards • VOMS roles in job priorities • SRM 2.2 available for experiment testing: • Significantly delayed. CASTOR2 problems have changed priorities. • WN & UI for SLC4 (32-bit mode) • gLite WMS • IMHO, it is to be expected that service changes are significantly slower at this stage of the project than during early deployment. We should not forget it!
FTS 2.0: what (March 20 Management Board) • Based on feedback from the Amsterdam workshop • Certificate delegation • Security issue: no more MyProxy passphrase in the job • Improved monitoring capabilities • This is critical for the reliability of the ‘overall transfer service’ • Alpha SRM 2.2 support • Better database model • Improved performance and scalability • Better administration tools • Make it easier to run the service • Placeholders for future functionality • To minimise the service impact of future upgrades • Target: CNAF to have upgraded by early June
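As an illustration of how a VO framework might drive FTS transfers from a script, here is a minimal sketch. It assumes the gLite transfer CLI (glite-transfer-submit / glite-transfer-status) is installed and a valid, delegated proxy exists; the endpoint URL is a placeholder and the exact option names and job-state strings may differ between releases.

```python
"""Minimal sketch of driving an FTS transfer job via the gLite CLI (illustrative)."""
import subprocess
import time

# Placeholder endpoint - substitute the real FTS 2.0 web-service URL for your Tier.
FTS_ENDPOINT = "https://fts.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"

def submit(src_surl, dst_surl):
    """Submit a single source -> destination copy and return the FTS job id."""
    out = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, src_surl, dst_surl])
    return out.decode().strip()

def wait_for(job_id, poll_s=60):
    """Poll the job until FTS reports a terminal state (state names indicative only)."""
    terminal = {"Done", "Finished", "FinishedDirty", "Failed", "Canceled"}
    while True:
        state = subprocess.check_output(
            ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id]).decode().strip()
        if state in terminal:
            return state
        time.sleep(poll_s)
```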
DB Service Concerns • Have the known database services – distributed or otherwise – been fully tested under production conditions? • e.g. conditions DB: one look-up (per sub-detector?) per job? Re-use of connections? Caching of information? ‘Flooding’ of services when many jobs start together? Are the services at Tier0 and the Tier1s ready for this? (See the sketch below.) • Additional DB services – only the conditions DB is currently in the planning. I heard some additional ‘candidates’ on Monday (not for the first time). What is the schedule for testing / deciding? Resource plan? Don’t forget the lead times for acquiring h/w and setting up services… Overall resource limits… • See also Sasha’s slides – “Expect the unexpected”
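For concreteness, a minimal sketch of the client-side behaviour those questions are probing: one connection re-used for the whole job, caching of repeated look-ups, and staggered start-up so a large batch of jobs does not flood the service. The class, query text and jitter window are illustrative and are not taken from any experiment framework.

```python
import random
import time
from functools import lru_cache

class ConditionsClient:
    """Re-use one DB connection per job and cache repeated conditions look-ups."""

    def __init__(self, connect):
        # One connection opened once for the whole job, not one per look-up.
        self._conn = connect()

    @lru_cache(maxsize=1024)
    def payload(self, subdetector, run):
        # One cached query per (sub-detector, run) instead of one per event or algorithm.
        # `query` is whatever the caller-supplied connection object provides (illustrative).
        return self._conn.query(
            "SELECT payload FROM conditions WHERE det = :det AND run = :run",
            det=subdetector, run=run)

def staggered_start(max_jitter_s=300):
    """Sleep a random amount before the first DB contact, so that 1000 jobs
    released simultaneously do not all hit the conditions service in the same second."""
    time.sleep(random.uniform(0, max_jitter_s))
```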
WLCG Service Progress • There clearly has been progress in many areas of the service since the end of SC4 – with more to come • However, we should be mindful of the slope – the rate at which change has actually been absorbed – both during the SC period and now, and not dream of significant (and hence unlikely) changes • A relatively straightforward roll-out, such as LFC 1.6.3, took quite a few weeks at all stages (analysis, design, implementation, testing, certification, pre-production, production roll-out…) • Moreover, the ramp-up to LHC running will increase the pressure for stability, not further change – beyond what is absolutely mandatory
Q1 2007 – Tier0 / Tier1s • Demonstrate Tier0-Tier1 data export at 65% of full nominal rates per site using experiment-driven transfers • Mixture of disk / tape endpoints as defined by experiment computing models, i.e. 40% tape for ATLAS; transfers driven by experiments • Period of at least one week; daily VO-averages may vary (~normal) • Demonstrate Tier0-Tier1 data export at 50% of full nominal rates (as above) in conjunction with T1-T1 / T1-T2 transfers • Inter-Tier transfer targets taken from ATLAS DDM tests / CSA06 targets • Demonstrate Tier0-Tier1 data export at 35% of full nominal rates (as above) in conjunction with T1-T1 / T1-T2 transfers and Grid production at Tier1s • Each file transferred is read at least once by a Grid job • Some explicit targets for WMS at each Tier1 need to be derived from above • Provide SRM v2.2 endpoint(s) that implement(s) all methods defined in SRM v2.2 MoU, all critical methods pass tests • See attached list; Levels of success: threshold, pass, success, (cum laude)
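To see what these fractions mean in practice, a small illustrative calculation. The nominal rates below are placeholders, not the official Megatable figures; substitute the agreed per-site, per-VO values.

```python
# Per-site export targets implied by the Q1 milestone fractions (illustrative).
NOMINAL_MB_PER_S = {"Tier1-A": 200.0, "Tier1-B": 100.0, "Tier1-C": 50.0}  # placeholders

MILESTONE_FRACTIONS = {
    "experiment-driven export": 0.65,
    "export + T1-T1 / T1-T2": 0.50,
    "export + inter-Tier + Grid production": 0.35,
}

for site, nominal in NOMINAL_MB_PER_S.items():
    for milestone, frac in MILESTONE_FRACTIONS.items():
        target = nominal * frac
        print(f"{site}: {milestone:40s} -> {target:6.1f} MB/s "
              f"(~{target * 86_400 / 1e6:5.1f} TB/day)")
```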
CMS CSA07 Export Targets – the table (per-site rates read off from GridView plots) is not reproduced here. The CMS goal is successful transfers on 95% of the challenge days ➨ target rate met on 50% of the challenge days.
What’s (still) left to show? • We clearly have to show that we can at least handle ATLAS data export in parallel with the other VOs • The current ATLAS event sizes and the corresponding rates / storage requirements need to be further understood • As we have learnt(?), running the various steps of production & analysis in parallel – as well as supporting multiple VOs concurrently – will inevitably lead to more bottlenecks and other problems • The fact that we did not fully exercise the whole chain during SC3 / SC4 means ‘late surprises’… • For 2007, we have little choice but to stick within the boundaries of what already works… • And progressively fix things on a longer time scale… e.g. (well) in time for the physics run of 2008…
Q2 2007 – Tier0 / Tier1s • As Q1, but using SRM v2.2 services at Tier0 and Tier1, gLite 3.x-based services and SL(C)4 as appropriate • It now looks clear that the above will not be fully ready for Q2 – perhaps not even for the pilot run! • Provide services required for Q3 dress rehearsals • Basically, what we had at end of SC4 + Distributed Database Services; LFC bulk methods; FTS 2.0 • Work also ongoing on VO-specific services, SAM tests, experiment dashboards and Joint Operations issues Have to revise this milestone – SRM v2.2 schedule and major issues with storage systems surely have priority?
Summary by Experiment [table not reproduced here; slide taken from “Experiment Top 5 Issues; CASTOR Status & Plans”]
Integration of DM Components • We agree that this is a very complex and important issue that must be addressed with high priority • It is necessary that experiment, site and network service experts are involved in the debugging exercise, as all of these are intimately involved – a complete end-to-end solution must be demonstrated with adequate performance • We propose a ‘back-to-basics’ approach, separating the main problems, which can be investigated (initially) independently and in parallel (& then ramping up…) • We should not forget the role of the host lab in the WLCG model – we must be able to distribute data at the required rates for all VOs and to all sites • We still have not managed this – and we’ve been trying for a long time! • MoU: Distribution of an agreed share of the raw data (+ESD) to each Tier1 Centre, in-line with data acquisition
Possible Initial Goals • Stable transfers of (each of) the two main (by rate) VOs to an agreed set of common sites • Each VO can act as a ‘control’ for the other – any significant differences between them should be understood • file-size distribution, # of active transfers / # of streams, other key parameters etc. • Concurrent multi-VO transfers – first raised as a concern by ATLAS – also need to be demonstrated (WLCG milestones…) • The goal should be to obtain stable transfers with daily/weekly averages by site and by VO at an agreed fraction of the nominal rates (e.g. initially 25%, then 50%, etc.), with a daily and weekly analysis of any fluctuations / other problems – see the sketch below • Then ramp up in complexity: WLCG milestones, FDR preparations • Once stable transfers have been demonstrated, add complexity in a controlled fashion until full end-to-end testing has been achieved • In parallel, the ‘heartbeat monitoring’ proposed by both ATLAS and CMS should be better defined, with the goal of reaching a common agreement that can be rapidly deployed, initially across the T0 and T1s • This is clearly needed both in the short-medium term and in the long run, in order to provide a stable and reliable service
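A minimal sketch of the kind of daily check implied here, for one (site, VO) channel. The ramp-up fraction and the fluctuation tolerance used as defaults are illustrative, not agreed WLCG values.

```python
from statistics import mean

def daily_report(samples_mb_s, nominal_mb_s, fraction=0.25, tolerance=0.2):
    """Summarise one day of transfer-rate samples for a (site, VO) channel.

    `fraction` is the agreed ramp-up fraction of the nominal rate (25%, then 50%, ...);
    `tolerance` flags samples more than 20% below the daily mean as fluctuations.
    """
    avg = mean(samples_mb_s)
    target = fraction * nominal_mb_s
    dips = [s for s in samples_mb_s if s < (1 - tolerance) * avg]
    return {
        "daily_average_mb_s": round(avg, 1),
        "target_mb_s": round(target, 1),
        "target_met": avg >= target,
        "samples_well_below_average": len(dips),
    }

# Example: hourly averages for one channel against a 100 MB/s nominal rate.
print(daily_report([30, 28, 0, 27, 31, 29, 26, 30], nominal_mb_s=100))
```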
WLCG Service Interventions – Status, Procedures & Directions (Jamie Shiers, CERN, March 22 2007)
WLCG Interventions • Scheduled service interventions shall normally be performed outside of the announced period of operation of the LHC accelerator. • In the event of mandatory interventions during the operation period of the accelerator – such as a non-critical security patch[1] – an announcement will be made using the Communication Interface for Central (CIC) operations portal and the period of scheduled downtime entered in the Grid Operations Centre (GOC) database (GOCDB). • Such an announcement shall be made at least one working day in advance for interventions of up to 4 hours. • Interventions resulting in significant service interruption or degradation longer than 4 hours and up to 12 hours shall be announced at the Weekly Operations meeting prior to the intervention, with a reminder sent via the CIC portal as above. • Interventions exceeding 12 hours must be announced at least one week in advance, following the procedure above. • A further announcement shall be made once normal service has been resumed. • [deleted] • Intervention planning should also anticipate any interruptions to jobs running in the site batch queues. If appropriate the queues should be drained and the queues closed for further job submission. CERN uses GMOD & SMOD to ensure announcements are made correctly. (CIC portal and CERN IT status board respectively.)
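The announcement rules above amount to a simple mapping from expected downtime to the notice required. The sketch below encodes that logic for illustration; it is not an official tool, and sites should follow the procedure text itself.

```python
from datetime import timedelta

def required_announcement(downtime: timedelta) -> str:
    """Map an intervention's expected downtime to the announcement it needs,
    following the rules listed above."""
    if downtime <= timedelta(hours=4):
        return "CIC broadcast + GOCDB entry, at least one working day in advance"
    if downtime <= timedelta(hours=12):
        return "Announce at the Weekly Operations meeting before the intervention, plus CIC reminder"
    return "Announce at least one week in advance, following the same procedure"

# Examples
print(required_announcement(timedelta(hours=2)))    # short intervention
print(required_announcement(timedelta(hours=8)))    # significant degradation
print(required_announcement(timedelta(hours=36)))   # major intervention
```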
Recommendations & Actions • The use of the EGEE broadcast tool for announcing both scheduled and unscheduled interruptions has greatly improved, and improvements in the tool to clarify broadcast targets are underway. Sites are requested to ensure that the nature and scope of the event are clear from both the subject and the text of the announcement (and are not, for example, left to be deduced from the e-mail address of the sender), e.g. “Tape robot maintenance at CERN 10.30–16.00 Thursday 13 July – tape access interrupted” • All times should be given in UTC! (or local + UTC)
Unscheduled Service Interventions – these guidelines should also be used when a scheduled intervention runs into problems.
GMOD Role • The main function of the GMOD is to ensure that problems reported for CERN-IT-GD managed machines are properly followed up and solved. • Also performs all EGEE broadcasts for CERN services • Wiki page: https://twiki.cern.ch/twiki/bin/view/LCG/GmodRoleDescription • MAIL: it-dep-gd-gmod@cern.ch • PHONE: • primary: 164111 (+41764874111) • backup: 164222 (+41764874222)
WLCG Service Meetings • CERN has a daily (weekday) operations meeting at 09:00 – dial-in access for {sites, experiments} is possible / welcomed • A tradition of at least 30 years – it used to be per mainframe, merged since early LEP • +41227676000, access code 017 5012 (leader 017 5011) • Regular LCG Service Coordination Meeting • Tier0 services, their Grid-wide impact + roll-out of fixes / features • LCG Experiment Coordination Meeting – dial-in access • Medium-term planning, e.g. FDR preparations & requirements • +41227676000, access code 016 6222 (leader 016 6111) • Checks the experiments’ resource and production plans • Discusses issues such as ‘VO shares’ on FTS channels • Medium-term interventions, their impact & scheduling • Prepares input to the WLCG-EGEE-OSG operations meeting
Logging Service Operations • In order to help track / resolve problems, these should be logged in the ‘site operations log’ • Examples include adding / rebooting nodes, tape movements between robots, restarting daemons, deploying new versions of m/w or configuration files etc. • Any interventions by on-call teams / service managers • At CERN, these logs are reviewed daily at 09:00 • Cross-correlation between problems & operations is key to rapid problem identification & resolution • And the lack of logging has in the past led to problems that have been extremely costly in time & effort • For cross-site issues – i.e. Grid issues - UTC is essential
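A minimal sketch of what such a log entry could look like, timestamped in UTC. The file format, field names and example values are illustrative; real sites will have their own logbook tools.

```python
import json
from datetime import datetime, timezone

def log_operation(logfile, site, action, service, details=""):
    """Append one entry to the site operations log, timestamped in UTC."""
    entry = {
        "utc_time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "site": site,
        "service": service,
        "action": action,
        "details": details,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Hypothetical example entry.
log_operation("ops-log.jsonl", "CERN-PROD", "restart", "transfer agent",
              "daemon restarted after hung transfers")
```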
(WLCG) Site Contact Details – wlcg-tier1-contacts@cern.ch includes all except GGUS
Site Offline Procedure • If a site goes completely off-line (e.g. major power or network failure) they should contact their Regional Operations Centre (ROC) by phone and ask them to make the broadcast. • If the site is also the ROC, then the ROC should phone one of the other ROCs and ask them to make the broadcast. • We already have a backup grid operator-on-duty team each week, so if the primary one goes off-line, then they call the backup who takes over. This covers WLCG Tier0, Tier1 and Tier2 sites (as well as all EGEE sites)
[Priority matrix quadrants: important & urgent • important, not urgent • urgent, not important • not important, not urgent]
Expecting the un-expected • The Expected: • When services / servers don’t respond or return an invalid status / message; • When users use a new client against an old server; • When the air-conditioning / power fails (again & again & again); • When 1000 batch jobs start up simultaneously and clobber the system; • A disruptive and urgent security incident… (again, we’ve forgotten…) • The Un-expected: • When disks fail and you have to recover from backup – and the tapes have been overwritten; • When a ‘transparent’ intervention results in long-term service instability and (significantly) degraded performance; • When a service engineer puts a Coke into a machine to ‘warm it up’… • The Truly Un-expected: • When a fishing trawler cuts a trans-Atlantic network cable; • When a Tsunami does the equivalent in Asia Pacific; • When Oracle returns you someone else’s data… • When mozzarella is declared a weapon of mass destruction… • All of these (and more) have happened!
Summary • The basic programme for this year is: • Q1 / Q2: prepare for the experiments’ Dress Rehearsals • Q3: execute the Dress Rehearsals (several iterations) • Q4: engineering run of the LHC – and indeed of WLCG! • 2008 will probably be: • Q1: analyse the results of the pilot run • Q2: a further round of Dress Rehearsals • Q3: data taking • Q4: (re-)processing and analysis • (WLCG workshop around here…)
WLCG Workshop in BC INFN? User(s)?
Conclusions • Main outstanding concern is still: meeting production & analysis requirements of ALL VOs simultaneously • Adding / upgrading services in a non-disruptive manner • Remaining calm, efficient and service oriented – particularly in the face of crises! (A definition of service) • This is much more effective than ‘crisis-mode’ – and also much more reassuring for the users • For those of you who have seen accelerator start-ups before – welcome back! • To those who have not – good luck!
Transparent Interventions - Definition • We have reached agreement with the LCG VOs that the combination of hardware / middleware / experiment-ware should be resilient to service “glitches” • A glitch is defined as a short interruption of (one component of) the service that can be hidden – at least to batch – behind some retry mechanism(s) • How long is a glitch? • All central CERN services are covered for power ‘glitches’ of up to 10 minutes • Some are also covered for longer by diesel UPS, but any non-trivial service seen by the users is only covered for 10 minutes • Can we implement the services so that ~all interventions are ‘transparent’? • YES – with some provisos … to be continued • EGI Preparation Meeting, Munich, March 19 2007 - Jamie.Shiers@cern.ch
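On the client side, “hidden behind some retry mechanism” typically means something like the following sketch: retry with exponential back-off and jitter for up to roughly the 10-minute glitch window quoted above. The numbers are illustrative defaults, not agreed service parameters.

```python
import random
import time

def call_with_retry(operation, max_wait_s=600, base_s=5):
    """Retry `operation` with exponential back-off and jitter for up to ~10 minutes,
    the longest 'glitch' the slide above says services are covered for."""
    deadline = time.time() + max_wait_s
    delay = base_s
    while True:
        try:
            return operation()
        except Exception:
            if time.time() + delay > deadline:
                raise  # the glitch lasted longer than the agreed window
            # Jitter avoids thousands of batch jobs retrying in lock-step.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 120)
```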