This event introduces the CCRC’08 for LHC experiments, highlighting data challenges, computing models, lessons learned, and future readiness. It focuses on the complexity of LHC computing, shared team efforts, and the need for efficient communication and collaboration. Participants discuss computing infrastructure readiness, challenges faced, and the importance of preparing in advance for smooth operations once the LHC starts. The proposed scope includes test data transfers, processing simulations, and performance evaluations, drawing comparisons with past computing experiences at CERN. The event emphasizes the significance of coordination between centers and experiments to address potential flaws and bottlenecks for effective data handling.
CCRC’08 Planning & Requirements • Jamie Shiers • LHC OPN, 10th March 2008
Agenda • Common Computing Readiness Challenge (CCRC’08) – What is it? Who does it concern? Why? • Brief reminder of Computing Models of LHC experiments – what has changed • Status & Outlook • Lessons Learned • Conclusions
Background • For many years, the LHC experiments have been preparing for data taking • On the Computing side, this has meant a series of “Data Challenges” designed to verify their computing models and offline software / production chains • To a large extent, these challenges have been independent of each other, whereas in reality, we (almost all sites) have to support (almost all experiments) simultaneously • Are there some bottlenecks or unforeseen couplings between the experiments and / or the services? • There certainly is at the level of support personnel!
LHC: One Ring to Bind them… [Figure: the LHC ring, 27 km long, 100 m underground, and its experiments: ALICE (heavy ions, pp), ATLAS and CMS + TOTEM (general purpose: pp, heavy ions), LHCb (pp, B-physics, CP violation). Credit: G. Dissertori]
LHC Computing is Complicated! • Despite high-level diagrams (next), the Computing TDRs and other very valuable documents, it is very hard to maintain a complete view of all of the processes that form part of even one experiment’s production chain • Both detailed views of the individual services, together with the high-level “WLCG” view are required… • It is ~impossible (for an individual) to focus on both… • Need to work together as a team, sharing the necessary information, aggregating as required etc. • The needed information must be logged & accessible! • (Service interventions, changes etc.) • This is critical when offering a smooth service with affordable manpower
CCRC’08 – Motivation and Goals (WLCG Workshop: Common VO Challenge) • What if: • LHC is operating and experiments take data? • All experiments want to use the computing infrastructure simultaneously? • The data rates and volumes to be handled at the Tier0, Tier1 and Tier2 centers are the sum of ALICE, ATLAS, CMS and LHCb, as specified in the experiments’ computing models? • Each experiment has done data challenges, computing challenges, tests, dress rehearsals, … on a schedule defined by the experiment • This will stop: we will no longer be the master of our schedule once LHC starts to operate • We need to prepare for this … together … • A combined challenge by all experiments should be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to data taking in 2008 • This should be done well in advance of the start of data taking in order to identify flaws and bottlenecks and allow time to fix them • We must do this challenge as the WLCG collaboration: Centers and Experiments
CCRC’08 – Proposed Scope (CMS) (WLCG Workshop: Common VO Challenge) • Test data transfers at 2008 scale: experiment site to CERN mass storage; CERN to Tier1 centers; Tier1 to Tier1 centers; Tier1 to Tier2 centers; Tier2 to Tier2 centers • Test storage-to-storage transfers at 2008 scale: required functionality; required performance • Test data access at Tier0, Tier1 at 2008 scale: CPU loads should be simulated in case this impacts data distribution and access • Tests should be run concurrently • CMS proposes to use artificial data, which can be deleted after the Challenge
A Comparison with LEP… • In January 1989, we were expecting e+e- collisions in the summer of that year… • The “MUSCLE” report was 1 year old and “Computing at CERN in the 1990s” was yet to be published (July 1989) • It took quite some time for the offline environment (CERNLIB+experiment s/w) to reach maturity • Some key components had not even been designed! • Major changes in the computing environment were about to strike! • We had just migrated to CERNVM – the Web was around the corner, as was distributed computing (SHIFT) • (Not to mention OO & early LHC computing!)
CCRC’08 Preparations… • Monthly Face-to-Face meetings held since time of “kick-off” during WLCG Collaboration workshop in BC • Fortnightly con-calls with A-P sites started in January 2008 • Weekly planning con-calls suspended during February: restart? • Daily “operations” meetings @ 15:00 started mid-January • Quite successful in defining scope of challenge, required services, setup & configuration at sites… • Communication – including the tools we have – remains a difficult problem… but… • Feedback from sites regarding the information they require, plus “adoption” of common way of presenting information (modelled on LHCb) all help • We are arguably (much) better prepared than for any previous challenge • There are clearly some lessons for the future – both the May CCRC’08 challenge as well as longer term
Pros & Cons – Managed Services • Predictable service level and interventions; fewer interventions, lower stress level and more productivity, good match of expectations with reality, steady and measurable improvements in service quality, more time to work on the physics, more and better science, … • Stress, anger, frustration, burn-out, numerous unpredictable interventions, including additional corrective interventions, unpredictable service level, loss of service, less time to work on physics, less and worse science, loss and / or corruption of data, … Design, Implementation, Deployment & Operation
Middle- / Storage-ware Versions • The baseline versions that are required at each site were defined iteratively – particularly during December and January • The collaboration between and work of the teams involved was highly focused and responsive • Some bugs took longer to fix than might be expected • Some old bugs re-appeared • Some fixes did not make it in time for kick-off • Let alone pre-challenge “week of stability” • Some remaining (hot) issues with storage • Very few new issues discovered! (Load related) • On occasion, lack of clarity on motivation and timescales for proposed versions • These are all issues that can be fixed relatively easily – goals for May preparation…
Weekly Operations Review • Based on 3 agreed metrics: • Experiments' scaling factors for functional blocks exercised • Experiments' critical services lists • MoU targets • Need to follow up with experiments on “check-lists” for “critical services” – as well as additional tests
CCRC’08 Production Data Taking & Processing • CCRC’08 leads directly into production data taking & processing • Some ‘rough edges’ are likely to remain for most of this year (at least…) • An annual “pre-data-taking” exercise – again with February and May (earlier?) phases – may well make sense (CCRC’09) • Demonstrate that we are ready for this year’s data taking with any revised services in place and debugged… Possibly the most important: • Objectives need to be SMART: Specific, Measurable, Achievable, Realistic & Time-bounded • We are still commissioning the LHC Computing Systems but need to be no later than – and preferably ahead of – the LHC!
WLCG Services – In a Nutshell… • Summary slide on WLCG Service Reliability shown to OB/MB/GDB during December 2007 • On-call service established beginning February 2008 for CASTOR/FTS/LFC (not yet backend DBs) • Grid/operator alarm mailing lists exist – need to be reviewed & procedures documented / broadcast
Critical Service Follow-up • Targets (not commitments) proposed for Tier0 services • Similar targets requested for Tier1s/Tier2s • Experience from first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable) • The MoU lists targets for responding to problems (12 hours for T1s) • Tier1s: 95% of problems resolved <1 working day ? • Tier2s: 90% of problems resolved < 1 working day ? • Post-mortem triggered when targets not met!
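The proposed resolution targets can be checked mechanically against a log of problem-resolution times. The following is a minimal sketch, not an agreed WLCG procedure: the 8-hour working day, the function names and the sample data are all assumptions made for illustration; only the 95% / 90% figures and the "post-mortem when targets not met" rule come from the slide.

```python
# Assumption: the slides do not define the length of a working day.
WORKING_DAY_HOURS = 8.0

# Proposed targets (not commitments) from the slide:
# fraction of problems resolved in under one working day.
TARGETS = {"Tier1": 0.95, "Tier2": 0.90}


def fraction_resolved_in_time(resolution_hours, limit_hours=WORKING_DAY_HOURS):
    """Fraction of problems whose resolution time (in working hours) is under the limit."""
    if not resolution_hours:
        return 1.0  # no problems reported: the target is trivially met
    on_time = sum(1 for hours in resolution_hours if hours < limit_hours)
    return on_time / len(resolution_hours)


def needs_post_mortem(tier, resolution_hours):
    """True when the measured fraction falls below the proposed target for this tier."""
    return fraction_resolved_in_time(resolution_hours) < TARGETS[tier]


if __name__ == "__main__":
    # Hypothetical week at a Tier1: three problems fixed quickly, one took 30 working hours.
    week = [2.0, 3.5, 30.0, 1.0]
    print(fraction_resolved_in_time(week), needs_post_mortem("Tier1", week))
```

Measuring in working hours rather than wall-clock time is a design choice of this sketch; it keeps the metric aligned with the "working day" wording of the targets.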
Tier0 – Tier1 Data Export • We need to sustain 2008-scale exports for at least ATLAS & CMS for at least two weeks • The short experience that we have is not enough to conclude that this is a solved problem • The overall system still appears to be too fragile – sensitive to ‘rogue users’ (what does this mean?) and / or DB de-tuning • (Further) improvements in reporting, problem tracking & post-mortems needed to streamline this area • We need to ensure that this is done to all Tier1 sites at the required rates and that the right fraction of data is written to tape • Once we are confident that this can be done reproducibly, we need to mix-in further production activities • If we have not achieved this by end-February, what next? • Continue running in March & April – need to demonstrate exports at required rates for weeks at a time – reproducibly! • Re-adjust targets to something achievable? • e.g. reduce from assumed 55% LHC efficiency to 35%?
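To make the "re-adjust targets" option concrete, here is a small sketch of how an export-rate target would change if the assumed LHC efficiency were lowered from 55% to 35%. It assumes the target rate scales linearly with efficiency, which the slides do not state, and the 1100 MB/s nominal rate is a purely illustrative number, not a figure from the experiments' computing models.

```python
SECONDS_PER_DAY = 86_400


def rescale_rate(nominal_rate_mb_s, assumed_eff=0.55, reduced_eff=0.35):
    """Scale a target rate linearly with the assumed LHC efficiency (an assumption)."""
    return nominal_rate_mb_s * reduced_eff / assumed_eff


def volume_tb(rate_mb_s, days):
    """Data volume in TB (10^6 MB) exported at a sustained rate over a number of days."""
    return rate_mb_s * SECONDS_PER_DAY * days / 1e6


if __name__ == "__main__":
    nominal = 1100.0  # hypothetical aggregate Tier0 export rate in MB/s
    reduced = rescale_rate(nominal)
    print(f"reduced target: {reduced:.0f} MB/s")
    print(f"two-week volume at reduced target: {volume_tb(reduced, 14):.2f} TB")
```

Under these assumptions the reduced target is 35/55, or roughly 64%, of the original rate.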
Recommendations • To improve communications with Tier2s and the DB community, 2 new mailing lists have been set up, as well as regular con-calls with Asia-Pacific sites (time zones…) • Follow-up on the lists of “Critical Services” must continue, implementing not only the appropriate monitoring, but also ensuring that the WLCG “standards” are followed for Design, Implementation, Deployment and Operation • Clarify reporting and problem escalation lines (e.g. operator call-out triggered by named experts, …) and introduce (light-weight) post-mortems when MoU targets not met • We must continue to improve on open & transparent reporting, as well as further automations in monitoring, logging & accounting • We should foresee “data taking readiness” challenges in future years – probably with a similar schedule to this year – to ensure that full chain (new resources, new versions of experiment + AA s/w, middleware, storage-ware) is ready
And Record Openly Any Problems… • The intervention is now complete and tier1 and tier2 services are operational again except for enabling of internal scripts. • Two problems encountered. • A typo crept in somewhere, dteam became deam in the configuration. Must have happened a while ago and was a reconfiguration problem waiting to happen. • fts103 when rebooted for the kernel upgrade (as were the rest) decided it wanted to reinstall itself instead and failed since not a planned install. Again an accident waiting to happen. • Something to check for next time. • Consequently the tiertwo service is running in degraded mode with only one webservice box. If you had to choose a box for this error to occur on it would be this one. • Should be running in non-degraded mode sometime later this afternoon. • People are actively using the elog-books – even though we will have to review overlap with other tools, cross-posting etc.
What Has Changed? (wrt 2005…) • View of Computing Models was clearly too simplistic • These have evolved with experience – and will probably continue to evolve during first data taking… • Various activities at the pit / Tier0 “smooth out” peaks & troughs from accelerator cycle and desynchronize the experiments from each other • E.g. merging of small files, first-pass processing, … • Currently assuming 50Ks / 24h accelerator operation • Even though accelerator operations assumes 35%... • Bulk (pp) data is driven by ATLAS & CMS – their models still differ significantly • ATLAS has ~twice the number of Tier1s and keeps ~2(.8) copies of RAW and derived data across these. Tier1 sites are “paired” so that output from re-processing must be sent between Tier1s. They also use the “FTS” to deliver calibration data (KB/s) to some sites • CMS does not make “hard” associations between Tier2s and Tier1s – for reliability (only), a Tier2 may fetch (or store?) data from any accessible Tier1 • All experiments have understood – and demonstrated – “catch-up” – buffers required “everywhere” to protect against “long weekend” effects
Preparations for May and beyond… • Aim to agree on baseline versions for May during April’s F2F meetings • Based on versions as close to production as possible at that time (and not (pre-)pre-certification!) • Aim for stability from April 21st at least! • The start of the collaboration workshop… • This gives very little time for fixes! • Beyond May we need to be working in continuous full production mode! • March & April will also be active preparation & continued testing – preferably at full-scale! • CCRC’08 “post-mortem” workshop: June 12-13
Service Summary – No Clangers! • From a service point of view, things are running reasonably smoothly and progressing (reasonably) well • There are issues that need to be followed up (e.g. post-mortems in case of “MoU-scale” problems, problem tracking in general…) but these are both relatively few and reasonably well understood • But we need to hit all aspects of the service as hard as is required for 2008 production to ensure that it can handle the load! • And resolve any problems that this reveals…
Scope & Timeline • We will not achieve sustained exports from ATLAS+CMS(+others) at nominal 2008 rates for 2 weeks by end February 2008 • There are also aspects of individual experiments’ work-plans that will not fit into Feb 4-29 slot • Need to continue thru March, April & beyond • After all, the WLCG Computing Service is in full production mode & this is its purpose! • Need to get away from mind-set of “challenge” then “relax” – it’s full production, all the time!
LHC Outlook • LHC upgrade – “Super LHC” – now likely to be phased: • Replace the final focus (inner triplets) aiming at β*=0.25 m. • during shutdown 2013 • Improve the injector chain in steps: • first a new proton linac (twice the energy, better performance) • Replace the booster by a Low Power Superconducting Proton Linac • LPSPL, 4 GeV • Replace PS by PS2 (up to 50 GeV) • The latter could be operational in 2017 • Further, more futuristic steps could be a superconducting SPS (up to 1000 GeV) or even doubling the LHC energy.
Summary • “It went better than we expected but not as well as we hoped.” • Sounds a little like Bilbo Baggins in “A Long-Expected Party”: • “I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve.” • But we agreed to measure our progress against quantitative metrics: • Specific, Measurable, Achievable, Realistic, Timely
Well, How Did We Do? • Remember that prior to CCRC’08 we: • Were not confident that we were / would be able to support all aspects of all experiments simultaneously • Had discussed possible fall-backs if this were not demonstrated • The only conceivable “fall-back” was de-scoping… • Now we are reasonably confident of the former • Do we need to retain the latter as an option? • Despite being rather late with a number of components (not desirable), things settled down reasonably well • Given the much higher “bar” for May, need to be well prepared!
CCRC’08 Summary • The preparations for this challenge have proceeded (largely) smoothly – we have both learnt and advanced a lot simply through these combined efforts • As a focusing activity, CCRC’08 has already been very useful • We will learn a lot about our overall readiness for 2008 data taking • We are also learning a lot about how to run smooth production services in a more sustainable manner than previous challenges • It is still very manpower intensive and schedules remain extremely tight: full 2008 readiness still to be shown! • More reliable – as well as automated – reporting needed • Maximize the usage of up-coming F2Fs (March, April) as well as WLCG Collaboration workshop to fully profit from these exercises • June on: continuous production mode (all experiments, all sites), including tracking / fixing problems as they occur
Handling Problems… • Need to clarify current procedures for handling problems – some mismatch of expectations with reality • e.g. no GGUS TPMs on weekends / holidays / nights… • c.f. problem submitted with max. priority at 18:34 on Friday… • Use of on-call services & expert call out as appropriate • {alice-,atlas-}grid-alarm; {cms-,lhcb-}operator-alarm; • Contacts are needed on all sides – sites, services & experiments • e.g. who do we call in case of problems? • Complete & open reporting in case of problems is essential! • Only this way can we learn and improve! • It should not require Columbo to figure out what happened… • Trigger post-mortems when MoU targets not met • This should be a light-weight operation that clarifies what happened and identifies what needs to be improved for the future • Once again, the problem is at least partly about communication!
FTS “corrupted proxies” issue • The proxy is only delegated if required • The condition is lifetime < 4 hours. • The delegation is performed by the glite-transfer-submit CLI. The first submit client that sees that the proxy needs to be redelegated is the one that does it - the proxy then stays on the server for ~8 hours or so • Default lifetime is 12 hours. • We found a race condition in the delegation - if two clients (as is likely) detect at the same time that the proxy needs to be renewed, they both try to do it and this can result in the delegation requests being mixed up - so that what finally ends up in the DB is the certificate from one request and the key from the other. • We don’t detect this and the proxy remains invalid for the next ~8 hours. • The real fix requires a server side update (ongoing). • The quick fix. There are two options: … [ being deployed ]
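The race described above can be illustrated with a toy model. This is a sketch under stated assumptions, not FTS code: `StoredProxy`, `delegate_steps`, `run` and `delegate_atomically` are hypothetical names, delegation is reduced to two separate writes (key, then certificate), and the interleaving is scripted by hand rather than produced by real threads. Only the 4-hour renewal condition and the 12-hour default lifetime come from the slide; the atomic variant merely shows the general idea of a fix and is not the actual server-side update, whose details the slide does not give.

```python
from dataclasses import dataclass
from typing import Optional

RENEW_THRESHOLD_HOURS = 4    # from the slide: delegate only if lifetime < 4 hours
DEFAULT_LIFETIME_HOURS = 12  # from the slide: default proxy lifetime


def needs_delegation(remaining_hours: float) -> bool:
    """A client re-delegates only when the stored proxy is close to expiry."""
    return remaining_hours < RENEW_THRESHOLD_HOURS


@dataclass
class StoredProxy:
    """Toy stand-in for the server-side DB row: records which request wrote each half."""
    key_owner: Optional[str] = None
    cert_owner: Optional[str] = None

    def is_valid(self) -> bool:
        # The proxy is usable only if key and certificate come from the same request.
        return self.key_owner is not None and self.key_owner == self.cert_owner


def delegate_steps(store: StoredProxy, client: str):
    """Non-atomic delegation: two separate writes, pausing after each one."""
    store.key_owner = client
    yield
    store.cert_owner = client
    yield


def run(schedule):
    """Advance each client's delegation by one step, in the order given."""
    store = StoredProxy()
    in_flight = {}
    for client in schedule:
        if client not in in_flight:
            in_flight[client] = delegate_steps(store, client)
        next(in_flight[client])
    return store


def delegate_atomically(store: StoredProxy, client: str) -> None:
    """Idea of a fix: write key and certificate as one indivisible update."""
    store.key_owner, store.cert_owner = client, client


if __name__ == "__main__":
    print(run(["A", "A", "B", "B"]).is_valid())  # one client after the other
    print(run(["A", "B", "B", "A"]).is_valid())  # interleaved: key from B, cert from A
```

In the interleaved schedule the last key written belongs to client B and the last certificate to client A, which is the mixed-up state the slide describes; with a default lifetime of 12 hours and renewal only below 4 hours, nothing revisits the row for about 8 hours.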
ATLAS CCRC’08 Problems 14-18 Feb • There seem to have been 4 unrelated problems causing full or partial interruption to the Tier0 to Tier1 exports of ATLAS. • On Thursday 14th evening the Castor CMS instance developed a problem which built up an excessive load on the server hosting the srm.cern.ch request pool. This is the SRM v1 request spool node shared between all endpoints. By 03:00 the server was at 100% cpu load. It recovered at 06:00 and processed requests till 08:10 when it stopped processing requests until 10:50. There were 2 service outages totalling 4:40 hours. S.Campana entered in the CCRC08 elog the complete failure of ATLAS exports at 10:17, in the second failure time window, and also reported the overnight failures as being from 03:30 to 05:30. This was replied to by J.Eldik at 16:50 as a 'site fixed' notification with the above explanation asking SC for confirmation from their Atlas monitoring. This was confirmed by SC in the elog at 18:30. During the early morning of 15th the operator log received several high load alarms for the server followed by a 'no contact' at 06:30. This led to a standard ticket being opened. The server is on contract type D with importance 60. It was followed by a sysadmin at 08:30 who was able to connect via the serial console but not receive a prompt and lemon monitoring showed the high load. They requested advice on whether to reboot or not to the castor.support workflow. This was replied to at 11:16 with the diagnosis of a problem of the monitoring because of a pile-up of rfiod processes. • SRM v1.1 deployment at CERN coupled the experiments – this is not the case for SRM v2.2!
ATLAS problems cont • Another srm problem was observed by S.Campana around 18:30 on Friday. • He observed connection timed out errors from srm.cern.ch for some files. He made an entry in the elog, submitted a ggus ticket and sent an email to castor.support hence generating a remedy ticket. ggus tickets are not followed at the weekend nor are castor.support tickets which are handled by the weekly service manager on duty during working hours. The elog is not part of the standard operations workflow. A reply to the castor ticket was made at 10:30 on Monday 18th asking if the problem was still being seen. At this time SC replied he was unable to tell as a new problem, the failure of delegated credentials to FTS, had started. An elog entry that this problem was 'site fixed' was made at 16:50 on the 18th with the information that there was a problem on a disk server (hardware) which made several thousand files unavailable till Saturday. Apparently the server failure did not trigger its removal from Castor as it should have. This was done by hand on Saturday evening by one of the team doing regular checks. The files would then have been restaged from tape. • The ggus ticket also arrived at CERN on Monday. (to be followed)
ATLAS problems – end. • There was a castoratlas interruption at 23:00 on Saturday 16 Feb. This triggered an SMS to a castor support member (not the piquet) who restored the service by midnight. There is an elog entry made at 16:52 on Monday. At the time there was no operator log alarm as the repair pre-empted this. • For several days there have been frequent failures of FTS transfers due to corrupt delegated proxies. This has been seen at CERN and several Tier1s. It is thought to be a bug that came in with a recent gLite release. This stopped ATLAS transfers on the Monday morning. The workaround is to delete the delegated proxy and its database entry. The next transfer will recreate them. This is being automated at CERN by a cron job that looks for such corrupted proxies. It is not yet clear how much this affected ATLAS during the weekend. The lemon monitoring shows that ATLAS stopped, or reduced, the load generator about midday on Sunday.
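The workaround (find corrupted delegated proxies, delete them, let the next transfer recreate them) could look roughly like the sketch below. It is not the cron job deployed at CERN: the in-memory dictionary stands in for the FTS database, the delegation IDs and fingerprints are invented, and comparing a public-key fingerprint taken from the certificate with one derived from the stored private key is an assumed way of detecting the mismatch.

```python
def find_corrupted(proxies):
    """Return delegation IDs whose certificate and private key do not belong together.

    `proxies` maps a delegation ID to a pair
    (fingerprint of the public key in the certificate,
     fingerprint of the public key derived from the stored private key).
    """
    return [
        dlg_id
        for dlg_id, (cert_fp, key_fp) in proxies.items()
        if cert_fp != key_fp
    ]


def cleanup(proxies):
    """Delete corrupted entries in place; the next transfer would re-delegate them."""
    removed = find_corrupted(proxies)
    for dlg_id in removed:
        del proxies[dlg_id]
    return removed


if __name__ == "__main__":
    # Hypothetical snapshot of the delegation table.
    table = {
        "dlg-atlas-01": ("fp-aaa", "fp-aaa"),  # consistent pair
        "dlg-atlas-02": ("fp-bbb", "fp-ccc"),  # certificate and key from different requests
    }
    print(cleanup(table))
    print(sorted(table))
```

Run periodically, a check of this kind bounds the time a corrupted proxy can block transfers to the polling interval instead of the roughly 8 hours until the proxy would otherwise be renewed.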
Some (Informal) Observations (HRR) • The CCRC'08 elog is for internal information and problem solving but does not replace, and is not part of, existing operational procedures. • Outside of normal working hours ggus and CERN remedy tickets are not looked at. Currently the procedure for ATLAS to raise critical operations issues themselves is to send an email to the list atlas-grid-alarm. This is seen by the 24 hour operator who may escalate to the sysadmin piquet who can in turn escalate to the FIO piquet. Users who can submit to this list are K.Bos, S.Campana, M.Branco and A.Nairz. It would be good for IT operations to know what to expect from ATLAS operations when something changes. This may be already in the dashboard pages. • (Formal follow-up to come…)
Monitoring, Logging & Reporting • Need to follow-up on: • Accurate & meaningful presentation of status of experiments’ productions wrt stated goals • “Critical Services” – need input from the experiments on “check-lists” for these services, as well as additional tests • MoU targets – what can we realistically measure & achieve? • The various views that are required need to be taken into account • e.g. sites, depending on VOs supported, overall service coordination, production managers, project management & oversight • March / April F2Fs plus collaboration workshop, review during June CCRC’08 “post-mortem”
Supporting the Experiments • Need to focus our activities so that we support the experiments in as efficient & systematic manner as possible • Where should we focus this effort to have maximum effect? • What “best practices” and opportunities for “cross fertilization” can we find? • The bottom line: it is in everybody’s interest that the services run as smoothly and reliably as possible and that the experiments maximize the scientific potential of the LHC and their detectors… • Steady, systematic improvements with clear monitoring, logging & reporting against “SMART” metrics seems to be the best approach to achieving these goals
Draft List of SRM v2.2 Issues Priorities to be discussed & agreed: • Protecting spaces from (mis-)usage by generic users • Concerns dCache, CASTOR • Tokens for PrepareToGet/BringOnline/srmCopy (input) • Concerns dCache, DPM, StoRM • Implementations fully VOMS-aware • Concerns dCache, CASTOR • Correct implementation of GetSpaceMetaData • Concerns dCache, CASTOR • Correct size to be returned at least for T1D1 • Selecting tape sets • Concerns dCache, CASTOR, StoRM • by means of tokens, directory paths, ?? Feedback by Friday 22nd February!