EGEE is a project funded by the European Union under contract IST-2003-508833

“LCG Operation During the Data Challenges” Markus Schulz, IT-GD, CERNmarkus.schulz@cern.ch“Discussion on Operation Models” EGEE is a project funded by the European Union under contract IST-2003-508833

Outline • Building LCG-2 • Data Challenges (very brief) • Problems (not so brief) • Operating LCG • how it was planned • how it happened to be done • how it felt • What’s next? • I will skip many slides to leave room for discussions Comment /Shout in REALTIME!!!!! HEPIX 2004 BNL CERN IT-GD 22 October 2004 2

History • December 2003 LCG-2 • Full set of functionality for DCs, first MSS integration • Deployed in January to 8 core sites (less sites less trouble) • DCs started in February -> testing in production • Large sites integrate resources into LCG (MSS and farms) • Introduced a pre-production service for the experiments • Alternative packaging (tool based and generic installation guides) • Mai 2004 -> now monthly incremental releases • Not all releases are distributed to external sites • Improved services, functionality, stability and packing step by step • Timely response to experiences from the data challenges HEPIX 2004 BNL CERN IT-GD 22 October 2004 3

LCG-2 Status 22 10 2004 new interested sites should look here: release Cyprus Total: 82 Sites ~9400 CPUs ~6.5 PByte HEPIX 2004 BNL CERN IT-GD 22 October 2004 4

Integrating Sites • Sites contact GD Group or Regional Center • Sites go to therelease page • Sites decide on manual or tool based installation (LCFGng) • documentation for both available • WN and UI from next release on tar-ball based release • almost trivial install of WNs and UIs • Sites provide security and contact information • Sites install and use provided tests for debugging • support from regional centers or CERN • CERN GD certifies site and adds it to the monitoring and information system • sites are daily re-certified and problems traced in SAVANNAH • Large sites have integrated their local batch systems in LCG-2 • Adding new sites is now quite smooth • problem is keeping large number of sites correctly configured worked 80+ times failed 3-5 times HEPIX 2004 BNL CERN IT-GD 22 October 2004 5

Data Challenges • Large scale production effort of the LHC experiments • test and validate the computing models • produce needed simulated data • test experiments production frame works and software • test the provided grid middleware • test the services provided by LCG-2 • All experiments used LCG-2 for part of their production HEPIX 2004 BNL CERN IT-GD 22 October 2004 6

Phase I • 120k Pb+Pb events produced in 56k jobs • 1.3 million files (26TByte) in Castor@CERN • Total CPU: 285 MSI-2k hours (2.8 GHz PC working 35 years) • ~25% produced on LCG-2 • Phase II (underway) • 1 million jobs, 10 TB produced, 200TB transferred ,500 MSI2k hours CPU • ~15% on LCG-2 Data Challenges • Phase I • 7.7 Million events fully simulated (Geant 4) in 95.000 jobs • 22 TByte • Total CPU: 972 MSI-2k hours • >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid) HEPIX 2004 BNL CERN IT-GD 22 October 2004 7

~30 M events produced • 25Hz reached • (only once for a full day) Data Challenges HEPIX 2004 BNL CERN IT-GD 22 October 2004 8

Phase I • 186 M events 61 TByte • Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites) • Up to 5600 concurrent running jobs in LCG-2 Data Challenges 3-5 106/day LCG restarted LCG paused LCG in action 1.8 106/day DIRAC alone HEPIX 2004 BNL CERN IT-GD 22 October 2004 9

Problems during the data challenges • All experiments encountered on LCG-2 similar problems • LCG sites suffering from configuration and operational problems • not adequate resources on some sites (hardware, human..) • this is now the main source of failures • Load balancing between different sites is problematic • jobs can be “attracted” to sites that have no adequate resources • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS • Identification and location of problems in LCG-2 is difficult • distributed environment, access to many logfiles needed….. • status of monitoring tools • Handling thousands of jobs is time consuming and tedious • Support for bulk operation is not adequate • Performance and scalability of services • storage (access and number of files) • job submission • information system • file catalogues • Services suffered from hardware problems (no fail over services) HEPIX 2004 BNL CERN IT-GD 22 October 2004 10 DC summary

Outstanding Middleware Issues • Collection: Outstanding Middleware Issues • Important: 1st systematic confrontation of required functionalities with capabilities of the existing middleware • Some can be patched, worked around, • Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE) • Middleware is now not perfect but quite stable • Much has been improved during DC’s • A lot of effort still going into improvements and fixes • Big hole is missing space management on SE’s • especially for Tier 2 sites HEPIX 2004 BNL CERN IT-GD 22 October 2004 11

Operational issues (selection) • Slow response from sites • Upgrades, response to problems, etc. • Problems reported daily – some problems last for weeks • Lack of staff available to fix problems • Vacation period, other high priority tasks • Various mis-configurations (see next slide) • Lack of configuration management – problems that are fixed reappear • Lack of fabric management (mostly smaller sites) • scratch space, single nodes drain queues, incomplete upgrades, …. • Lack of understanding • Admins reformat disks of SE … • Provided documentation often not read (carefully) • new activity started to develop “hierarchical” adaptive documentation • simpler way to install middleware on farm nodes (even remotely in user space) • Firewall issues – • often less than optimal coordination between grid admins and firewall maintainers • PBS problems • Scalability, robustness (switching to torque helps) HEPIX 2004 BNL CERN IT-GD 22 October 2004 12

Site (mis) - configurations • Site mis-configuration was responsible for most of the problems that occurred during the experiments Data Challenges. Here is a non-complete list of problems: • – The variable VO <VO> SW DIR points to a non existent area on WNs. • – The ESM is not allowed to write in the area dedicated to the software installation • – Only one certificate allowed to be mapped to the ESM local account • – Wrong information published in the information system (Glue Object Classes not linked) • – Queue time limits published in minutes instead of seconds and not normalized • – /etc/ld.so.conf not properly configured. Shared libraries not found. • – Machines not synchronized in time • – Grid-mapfiles not properly built • – Pool accounts not created but the rest of the tools configured with pool accounts • – Firewall issues • – CA files not properly installed • – NFS problems for home directories or ESM areas • – Services configured to use the wrong/no Information Index (BDII) • – Wrong user profiles • – Default user shell environment too big • Only partly related to middleware complexity integrated all common small problems into 1 BIG PROBLEM HEPIX 2004 BNL CERN IT-GD 22 October 2004 13

Running Services • Multiple instances of core services for each of the experiments • separates problems, avoids interference between experiments • improves availability • allows experiments to maintain individual configuration (information system) • addresses scalability to some degree • Monitoring tools for services currently not adequate • tools under development to implement control system • Access to storage via load balanced interfaces • CASTOR • dCache • Services that carry “state” are problematic to restart on new nodes • needed after hardware problems, or security problems • “State Transition” between partial usage and full usage of resources • required change in queue configuration (faire share, individual queues/VO) • next release will come with description for fair share configuration (smaller sites) HEPIX 2004 BNL CERN IT-GD 22 October 2004 14 DC summary

Support during the DCs • User (Experiment) Support: • GD at CERN worked very close with the experiments production managers • Informal exchange (e-mail, meetings, phone) • “No Secrets” approach, GD people on experiments mail lists and vice versa • ensured fast response • tracking of problems tedious, but both sites have been patient • clear learning curve on BOTH sites • LCG GGUS (grid user support) at FZK became operational after start of the DCs • due to the importance of the DCs the experiments switch slowly to the new service • Very good end user documentation by GD-EIS • Dedicated testbed for experiments with next LCG-2 release • rapid feedback, influenced what made it into the next release • Installation (Site) Support: • GD prepared releases and supported sites (certification, re-certification) • Regional centres supported their local sites (some more, some less) • Community style help via mailing list (high traffic!!) • FAQ lists for trouble shooting and configuration issues: TaipeiRAL HEPIX 2004 BNL CERN IT-GD 22 October 2004 15

Support during the DCs • Operations Service: • RAL (UK) is leading sub-project on developing operations services • Initial prototype http://www.grid-support.ac.uk/GOC/ • Basic monitoring tools • Mail lists for problem resolution • Working on defining policies for operation, responsibilities (draft document) • Working on grid wide accounting • Monitoring: • GridICE (development of DataTag Nagios-based tools) • GridPP job submission monitoring • Information system monitoring and consitency check http://goc.grid.sinica.edu.tw/gstat/ • CERN GD daily re-certification of sites (including history) • escalation procedure under development • tracing of site specific problems via problem tracking tool • tests core services and configuration HEPIX 2004 BNL CERN IT-GD 22 October 2004 16

Screen Shots HEPIX 2004 BNL CERN IT-GD 22 October 2004 17

Screen Shots HEPIX 2004 BNL CERN IT-GD 22 October 2004 18

VO A VO B VO C P-Site-1 P-Site-2 S-Site-1 S-Site-2 S-Site-1 S-Site-2 Problem HandlingPLAN Monitoring/Followup Triage: VO / GRID GOC GGUS (Remedy) GD CERN Escalation HEPIX 2004 BNL CERN IT-GD 22 October 2004 19

P-Site-1 S-Site-1 S-Site-2 Problem HandlingOperation (most cases) Community VO A Rollout Mailing List VO B GGUS VO C Triage S-Site-2 GOC GD CERN S-Site-1 Monitoring Certification Follow-Up FAQs Monitoring FAQs S-Site-3 HEPIX 2004 BNL CERN IT-GD 22 October 2004 20

Problem Tracking • GGUS: REMEDY • Middleware problems: SAVANNAH LCG-OPERATION • Re-certification: SAVANNAH LCG-SITES • Many (MOST) problems only tracked by e-mail • Much confusion on where to put problems • Training needed to get reasonable 1st level user support • canned answers • experts need to focus on more complex tasks • Unification of FAQs (RAL, Taipei, Italy, …) HEPIX 2004 BNL CERN IT-GD 22 October 2004 21

EGEE Impact on Operations • The available effort for operations from EGEE is now ramping up: • LCG GOC (RAL)  EGEE CICs and ROCs, + Taipei • Hierarchical support structure • Regional Operations Centres (ROC) • One per region (9) • Front-line support for deployment, installation, users • Core Infrastructure Centres (CIC) • Four (+ Russia next year) • Evolve from GOC – monitoring, troubleshooting, operational “control” • “24x7” in a 8x5 world ???? • Also providing VO-specific and general services • EGEE NA3 organizes training for users and site admins • “NOW” at HEPiX • Address common issues, experiences • “Operations and Fabric Workshop” • CERN 1-3 Nov HEPIX 2004 BNL CERN IT-GD 22 October 2004 22

PART II • Operation models • How much can be delegated to whom? • autonomy/ availability • What are the consequences? • cost for 24/7 with 8x5 staff • One/multiple models for all sites/regions? • One model for site integration, update, user support, security, operation? • latency, efficiency, distribution of workload ….. • One size fits all? • Next slides are meant to stimulate discussions not give answers HEPIX 2004 BNL CERN IT-GD 22 October 2004 23

CICs and ROCs and Operations • Core Infrastructure Centers (CICs) • run services like RBs, Information Indices, VO/VOMS, Catalogues • are the distributed Grid Operation Center (GOC) • and more…. • Regional Operation Centers (ROCs) • coordinate activities in their region • give support to regional RCs • coordinate setup/upgrades • and more.. • Resource Centers (RC) • computing and storage • Operation Management Center (OMC) • coordination HEPIX 2004 BNL CERN IT-GD 22 October 2004 24

Model I Strict Hierarchy • CICs locates a problem with a RC or CIC in a region • triggered by monitoring/ user alert • CIC enters the problem into the problem tracking tool and assigns it to a ROC • ROC receives a notification and works on solving the problem • region decides locally what the ROC can to do on the RCs. • This can include restarting services etc. • The main emphasis is that the region decides on the depth of the interaction. • ===> different regions, different procedures • CICs NEVER contact a site • .====> ROCs need to be staffed all the time • ROC does it is fully responsible for ALL the sites in the region HEPIX 2004 BNL CERN IT-GD 22 October 2004 25

Model I Strict Hierarchy • Pro: • Best model to transfer knowledge to the ROCs • all information flows through them • Different regions can have their own policies • this can reflect different administrative relation of sites in a region. • Clear responsibility • until it is discovered it is the CICs fault then it is always the ROCs fault • Cons: • High latency • even for trivial operations we have to pass through the ROCs • ROCs have to be staffed (reachable) all the time.$$$$ • Regions will develop their own tools • parallel strands, less quality • Excluded for handling security HEPIX 2004 BNL CERN IT-GD 22 October 2004 26

Model IIDirect Com. Local Contr. • ROCs are active in: • the follow-up of problems that take longer to handle • setup of sites • CICs are active in: • handling problems that can be solved by simple interactions • communicated directly between CICs and RCs • ROCs are informed on all interactions between CICs and RCs • all problems are entered into the problem tracking tool. • restarting of services, etc. are handled by the RCs HEPIX 2004 BNL CERN IT-GD 22 October 2004 27

Model IIDirect Com. Local Contr. • Pros: • Resources are not lost for trivial reasons • Principe of local control is maintained • ROCs are in the loop, • but weak ROCs can't create too severe delays • No complex tools for communication management needed • mail + IRC sufficient • Cons: • RCs need to be reachable at all times • not realistic, and very expensive €€€€€€€€€€ • CICs have to be aware of the level of maturity of O(100) RCs • ROCs have to monitor what is going on to learn the trade • Language problems between the CICs and sysadmins • Unclear responsibility • "This was reported" / "Why didn't the CICs fix it them self" HEPIX 2004 BNL CERN IT-GD 22 October 2004 28

Model IIIDirect Com. Direct Contr. • Like Model II with some modifications • CICs have access to the services on the RCs • can, if the RC is not staffed, manage some of the services • site publishes at any time • whether the local support is reachable or not • what actions are permitted by the CICs. • all interactions are logged and reported to RC and ROC • Some tools that allow very controlled (limited) access like this are under development (GSI enabled remote SUDO) • Variation with ROCs only interaction (IIIa) HEPIX 2004 BNL CERN IT-GD 22 October 2004 29

Model IIIDirect Com. Direct Contr. • Pros: • Resources are not lost for trivial reasons • ROCs are in the loop, • but weak ROCs can't create too severe delays • One set of tools for remote operation • some uniformity ---> chance for better quality • Site decides at any time on balance between local/remote operation • RCs can be run for (short) time unattended • Cons: • Set of tools for secure limited remote operation respecting the sites policies has to be put in place • ROCs have to monitor what is going to learn the trade • Unclear responsibility • "This was reported" / "Why didn't the CICs fix it them self" HEPIX 2004 BNL CERN IT-GD 22 October 2004 30

Sample UseCases • User reports jobs failing on one site • User reports jobs failing on some/all sites • Monitoring shows site dropping in and out of the IS • An acute security incident • Upgrading to a new version • Post mortem after the security incidents • ……. • Good preparation for the Operations Workshop HEPIX 2004 BNL CERN IT-GD 22 October 2004 31

Summary • LCG-2 services have been supporting the data challenges • Many middleware problems have been found – many addressed • Middleware itself is reasonably stable • Biggest outstanding issues are related to providing and maintaining stable operations • Future middleware has to take this into account: • Must be more manageable, trivial to configure and install • Management and monitoring must be built into services from the start on • Outcome of the workshop in November is crucial for EGEE operation HEPIX 2004 BNL CERN IT-GD 22 October 2004 32

EGEE is a project funded by the European Union under contract IST-2003-508833