Operating the LCG and EGEE Production Grid for HEP

Ian Bird
IT Department, CERN
LCG Deployment Area Manager & EGEE Operations Manager

CHEP'04, 28 September 2004

EGEE is a project funded by the European Union under contract IST-2003-508833
LCG Operations in 2004

• Goal: deploy and operate a prototype LHC computing environment
• Scope:
  • Integrate a set of middleware and coordinate and support its deployment to the regional centres
  • Provide operational services to enable running as a production-quality service
  • Provide assistance to the experiments in integrating their software and deploying it in LCG; provide direct user support
• Deployment goals for LCG-2:
  • Production service for the Data Challenges in 2004
  • Experience in close collaboration between the Regional Centres
  • Learn how to maintain and operate a global grid
  • Focus on building a production-quality service
  • Understand how LCG can be integrated into the sites' physics computing services
  • Set up the EGEE project and migrate the existing structure towards the EGEE structure
  • By design, the LCG and EGEE services and operations teams are the same
LCG – from certification to production

Some history:
• March 2003 – LCG-0: existing middleware, waiting for the EDG-2 release
• September 2003 – LCG-1:
  • 3 months late -> reduced functionality
  • extensive certification process -> improved stability (RB, information system)
  • integrated 32 sites, ~300 CPUs; first use for production
• December 2003 – LCG-2:
  • full set of functionality for the DCs, first MSS integration
  • deployed in January to 8 core sites; DCs started in February -> testing in production
  • large sites integrate resources into LCG (MSS and farms)
  • introduced a pre-production service for the experiments
  • alternative packaging (tool-based and generic installation guides)
• May 2004 -> now: monthly incremental releases
  • not all releases are distributed to external sites
  • improved services, functionality, stability and packaging step by step
  • timely response to experiences from the data challenges

• The formal certification process has been invaluable
• The process to stabilise existing middleware and put it into production is expensive: testbeds, people, time
• Now have monthly incremental middleware releases; not all are deployed
• Expanding now with a pre-production service
LCG-2 Software

• LCG-2 core packages:
  • VDT (Globus 2, Condor)
  • EDG WP1 (Resource Broker, job submission tools)
  • EDG WP2 (replica management tools) + lcg tools
    • one central RMC and LRC for each VO, located at CERN, Oracle backend
  • Several bits from other WPs (config objects, info providers, packaging, ...)
  • GLUE 1.1 (information schema) + a few essential LCG extensions
  • MDS-based information system with significant LCG enhancements (replacements, simplifications – see poster; a query sketch follows below)
  • Mechanism for application (experiment) software distribution
• Almost all components have gone through some re-engineering:
  • robustness, scalability, efficiency, adaptation to local fabrics
• The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
• Data management is still far from perfect
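As an illustration of how a GLUE-schema information system is typically consumed, the sketch below queries a BDII-style LDAP endpoint for computing elements and their free CPUs using the ldap3 library. The hostname is hypothetical; the port, base DN and attribute names are the conventional LCG/BDII ones and are assumptions here, not taken from this talk.

```python
# Minimal sketch: query a GLUE-schema information system (BDII) over LDAP.
# The hostname is hypothetical; port 2170 and the base DN are the usual
# BDII conventions, assumed here rather than quoted from the talk.
from ldap3 import Server, Connection, ALL

BDII_HOST = "bdii.example.org"          # hypothetical endpoint
BASE_DN = "mds-vo-name=local,o=grid"    # conventional BDII base DN

server = Server(BDII_HOST, port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind

# Ask for all computing elements and a couple of state attributes.
conn.search(
    BASE_DN,
    "(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs, entry.GlueCEStateWaitingJobs)
```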
LCG-2/EGEE-0 Status (24-09-2004)

[Map of participating sites]
Total: 78 sites, ~9000 CPUs, 6.5 PByte
Experiences in deployment

• LCG now covers many sites (>70) – both large and small
  • Large sites – existing infrastructures – need to add grid interfaces etc.
  • Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
  • Satisfying both simultaneously is hard – it requires very flexible packaging, installation, and configuration tools and procedures
  • A lot of effort had to be invested in this area
• There are many problems – but in the end we are quite successful
  • The system is reasonably stable
  • The system is used in production
  • The system is reasonably easy to install – now at ~80 sites
  • We now have a basis on which to incrementally build essential functionality, and from which to measure improvements
• This infrastructure now also forms the EGEE production service
Operations services for LCG – 2004

• Deployment and operational support
  • Hierarchical model
    • CERN acts as 1st-level support for the Tier 1 centres
    • Tier 1 centres provide 1st-level support for their associated Tier 2s
    • "Tier 1 sites" are used here in the sense of "primary sites"
  • Grid Operations Centres (GOC)
    • provide operational monitoring, troubleshooting, coordination of incident response, etc.
    • RAL (UK) led a sub-project to prototype a GOC
  • Operations support from the CERN team, the GOC, and Taipei, with many individual contributions on the mailing list
• User support
  • Central model
    • FZK provides the user support portal
    • problem tracking system, web-based and available to all LCG participants
  • Experiments provide triage of problems
  • CERN team provides in-depth support and support for integrating experiment software with the grid middleware
Experiences during the data challenges
Data Challenges

• Large-scale production efforts of the LHC experiments to:
  • test and validate the computing models
  • produce needed simulated data
  • test the experiments' production frameworks and software
  • test the provided grid middleware
  • test the services provided by LCG-2
• All experiments used LCG-2 for all or part of their productions
Data Challenges – ALICE

• Phase I
  • 120k Pb+Pb events produced in 56k jobs
  • 1.3 million files (26 TByte) in Castor at CERN
  • Total CPU: 285 MSI2k hours (a 2.8 GHz PC working 35 years – see the conversion below)
  • ~25% produced on LCG-2
• Phase II (underway)
  • 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours CPU
  • ~15% on LCG-2
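The "35 years" figure is just a unit conversion; the sketch below reproduces it, assuming a 2.8 GHz PC of that era is rated at roughly 0.93 kSI2k (the rating is an assumption, not a number given in the talk).

```latex
% Worked conversion, assuming a 2.8 GHz PC is rated ~0.93 kSI2k (assumed)
\[
\frac{285\ \mathrm{MSI2k\,h}}{0.93\ \mathrm{kSI2k}}
= \frac{285\,000\ \mathrm{kSI2k\,h}}{0.93\ \mathrm{kSI2k}}
\approx 3.1\times 10^{5}\ \mathrm{h}
\approx 35\ \mathrm{years}
\]
```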
Data Challenges – ATLAS

• Phase I
  • 7.7 million events fully simulated (Geant 4) in 95,000 jobs
  • 22 TByte
  • Total CPU: 972 MSI2k hours
  • >40% produced on LCG-2 (used LCG-2, Grid3, NorduGrid)
Data Challenges – CMS

• ~30 M events produced
  • 25 Hz reached (but only once for a full day)
  • exercised RLS, Castor, control systems, T1 storage, ...
  • not a CPU challenge, but a full-chain demonstration
• Pre-challenge production in 2003/04
  • 70 M Monte Carlo events (30 M with Geant 4) produced
  • classic and grid (CMS/LCG-0, LCG-1, Grid3) productions
Data Challenges – LHCb

• Phase I
  • 186 M events, 61 TByte
  • Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
  • Up to 5600 concurrently running jobs in LCG-2
  • This is 5-6 times what was possible at CERN alone

[Plot of daily production rate: 3-5 x 10^6 events/day with LCG in action (with pauses and restarts), compared with 1.8 x 10^6 events/day with DIRAC alone]
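For scale, the Phase I numbers imply an average simulation cost of roughly a minute of CPU per event; this back-of-the-envelope figure is derived here from the quoted totals, not stated in the talk.

```latex
% Derived, not quoted: average CPU cost per event in LHCb Phase I
\[
\frac{424\ \mathrm{CPU\,years} \times 3.15\times 10^{7}\ \mathrm{s/year}}
     {1.86\times 10^{8}\ \mathrm{events}}
\approx 72\ \mathrm{CPU\,s\ per\ event}
\]
```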
Data challenges – summary

• Probably the first time such a set of large-scale grid productions has been done
  • Significant efforts invested on all sides – very fruitful collaborations
• Unfortunately, the DCs were the first time the LCG-2 system had been used
  • Adaptations were essential – adapting experiment software to the middleware and vice versa – as limitations and capabilities were exposed
  • Many problems were recognised and addressed during the challenges
  • A systematic confrontation of the functional problems with the experiment requirements has recently been made (GAG)
• The middleware is actually quite stable now
  • But job efficiency is not high – for many reasons (see below)
• We have started to see some basic underlying issues:
  • of implementation (lack of error handling, scalability, etc.)
  • of underlying models (workload management)
  • perhaps also of fabric services – batch systems?
• But the single largest issue is the lack of stable operations
Problems during the data challenges

• Common functional issues seen by all experiments:
  • Sites suffering from configuration and operational problems
    • inadequate resources at some sites (hardware, human, ...)
    • this is now the main source of failures
  • Load balancing between different sites is problematic
    • jobs can be "attracted" to sites that do not have adequate resources
    • modern batch systems are too complex and dynamic to summarise their behaviour in a few values in the information system
  • Identification of problems in LCG-2 is difficult
    • distributed environment, access to many log files needed, ...
    • current state of the monitoring tools
  • Handling thousands of jobs is time-consuming and tedious
    • support for bulk operations is not adequate
  • Performance and scalability of services
    • storage (access and number of files)
    • job submission
    • information system
    • file catalogues
  • Services suffered from hardware problems
Configuration and stability problems

• This is the largest source of problems
• Many are "well-known" fabric problems:
  • batch systems that cause "black holes"
  • NFS problems
  • clock skew at a site (see the probe sketch after this list)
  • software not installed or configured correctly
  • lack of configuration management – fixed problems reappear
  • firewall issues – often less than optimal coordination between grid admins and firewall maintainers
• Others are due to lack of experience
  • Many grid sites have not run such services before and do not have procedures, tools, diagnostics
  • Not limited to small sites
• Lack of support
  • Maintaining stable operation is still labour-intensive – it requires adequate operations staff trained in grid management
  • Slow response – problems are reported daily, but may last for weeks
  • No vacations ... experiments expect 24x365 stable operation
• The grid successfully "integrates" these problems from 80 sites
• Building a stable operation is the highest priority – this is what EGEE is funded to do
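Clock skew is one of the fabric problems that is trivial to detect automatically; below is a minimal sketch of the kind of probe a site sanity check could run. The NTP server name and the 5-second threshold are illustrative assumptions, not part of the LCG tooling.

```python
# Minimal sketch of a clock-skew probe for a site sanity check.
# It asks a (hypothetical) NTP server for the time and compares it with the
# local clock; server name and threshold are illustrative assumptions.
import socket
import struct
import time

NTP_SERVER = "ntp.example.org"   # hypothetical NTP server
NTP_EPOCH_OFFSET = 2208988800    # seconds between 1900-01-01 and 1970-01-01
MAX_SKEW_SECONDS = 5.0           # illustrative threshold

def clock_skew(server: str) -> float:
    """Return local_clock - server_clock in seconds (rough, single sample)."""
    packet = b"\x1b" + 47 * b"\0"              # NTPv3 client request
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
    # Transmit timestamp (seconds field) sits at bytes 40-43 of the reply.
    transmit_ts = struct.unpack("!I", data[40:44])[0] - NTP_EPOCH_OFFSET
    return time.time() - transmit_ts

if __name__ == "__main__":
    skew = clock_skew(NTP_SERVER)
    status = "OK" if abs(skew) <= MAX_SKEW_SECONDS else "WARNING: clock skew"
    print(f"{status}: local clock differs from {NTP_SERVER} by {skew:+.1f} s")
```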
EGEE and Evolving the Operations Model
EGEE

[Diagram: Applications – Grid infrastructure – Géant network]

• Goal
  • Create a Europe-wide production-quality grid infrastructure on top of present regional grid programmes
  • despite its name, the project has a worldwide scope
  • multi-science project
• Scale
  • 70 leading institutes in 27 countries
  • ~300 FTEs
  • aim: 20,000 CPUs
  • initially a 2-year project
• Activities
  • 48% service activities (operation, support)
  • 24% middleware re-engineering
  • 28% management, training, dissemination, international cooperation
• Builds on:
  • LCG, to establish a grid operations service
    • single team for deployment and operations
  • experience gained from running services for the LHC experiments
• HEP experiments are the pilot application for EGEE, together with biomedical
LCG and EGEE Operations

• EGEE is funded to operate and support a research grid infrastructure in Europe
• The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
  • LCG includes the US and Asia-Pacific; EGEE includes other sciences
  • A substantial part of the infrastructure is common to both
  • The LCG Deployment Manager is the EGEE Operations Manager
• The CERN team (Operations Management Centre) provides coordination, management, and 2nd-level support
• Support activities are expanded with the provision of:
  • Core Infrastructure Centres (CIC) (4)
  • Regional Operations Centres (ROC) (9)
  • The ROCs are coordinated by Italy, outside of CERN (which has no ROC)
Operations: LCG -> EGEE in Europe

• User support:
  • becomes hierarchical, through the Regional Operations Centres (ROC)
  • ROCs act as front-line support for user and operations issues
  • provide local knowledge and adaptations
• Coordination:
  • at CERN (Operations Management Centre) and at the CICs for HEP-LHC
• Operational support:
  • the LCG GOC is the model for the EGEE CICs
  • the CICs replace the European GOC at RAL
  • they also run essential infrastructure services
  • provide support for other (non-LHC) applications
  • provide 2nd-level support to the ROCs
The Regional Operations Centres

• The ROC organisation is the focus of EGEE operations activities:
  • coordinate and support deployment
  • coordinate and support operations
  • coordinate Resource Centre management
  • negotiate application access to resources within the region
  • coordinate planning and reporting within the region
  • negotiate and monitor SLAs within the region
• Teams:
  • deployment team
  • 24-hour support team (answers user and Resource Centre problems)
  • operations training at the RCs
  • organise tutorials for users
• The ROC is the first point of contact for all:
  • new sites joining the grid, and support for them
  • new users and user support
Core Infrastructure Centres

• "Grid Operations Centres" – behaving as a single organisation
• Operate infrastructure services, e.g.:
  • VO services: VO servers, VO registration service
  • RBs, UIs, information services
  • RLS and other database services
  • ensure recovery procedures and fail-over (between CICs)
• Act as Grid Operations Centres
  • monitoring, proactive troubleshooting
  • performance monitoring
  • control sites' participation in the production service
  • use work done at RAL for the LCG GOC as a starting point
• Support to ROCs for operational problems
• Operational configuration management and change control
• Accounting and resource usage/availability monitoring
• Take responsibility for operational "control" (tbd) – rotates through the 4 CICs
Future activities
Future activities

• All experiments expect to have significant ongoing productions for the foreseeable future
  • Some will also have their next data challenges a year from now
• LCG will run a series of "service challenges"
  • complementary to the data challenges and ongoing productions
  • demonstrate essential service-level issues (e.g. Tier 0 - Tier 1 reliable data transfer)
• Essential that we are able to build a manageable production service
  • based on the existing infrastructure
  • with reasonable improvements
• In parallel, build a "pre-production" service where:
  • new middleware (gLite, ...) can be demonstrated and validated before being deployed in production
  • we understand the migration strategy to 2nd-generation middleware
  • the existing production service is used as the baseline comparison
• It takes a long time to make new software production quality
  • Must be careful not to step backwards – even though what we have is far from perfect
What next? Service challenges

• Proposed to be used in addition to the ongoing data challenges and production use:
  • the goal is to ensure that baseline services can be demonstrated
  • demonstrate the resolution of the problems mentioned earlier
  • demonstrate that operational and emergency procedures are in place
• 4 areas proposed:
  • Reliable data transfer
    • demonstrate this fundamental service for Tier 0 -> Tier 1 by end 2004
  • Job flooding/exerciser (a minimal sketch follows this list)
    • understand the limitations and baseline performance of the system
    • may be achieved by the ongoing real productions
  • Incident response
    • ensure the procedures are in place and work – before real life tests them
  • Interoperability
    • how can we bring together the different grid infrastructures?
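As an illustration of what a "job flooding/exerciser" challenge might involve in practice, here is a minimal sketch that submits a burst of trivial jobs through the standard LCG-2 UI command edg-job-submit and records the returned job identifiers. The JDL, job count, pacing and output parsing are illustrative assumptions, not a prescribed procedure.

```python
# Minimal sketch of a job-flooding exerciser, assuming the standard LCG-2
# "edg-job-submit" CLI is available on the UI. The JDL, the job count and
# the pacing below are illustrative assumptions.
import subprocess
import time

JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

N_JOBS = 100            # illustrative burst size
PAUSE_SECONDS = 2       # illustrative pacing between submissions

def main() -> None:
    with open("flood.jdl", "w") as f:
        f.write(JDL)

    job_ids = []
    for i in range(N_JOBS):
        result = subprocess.run(
            ["edg-job-submit", "flood.jdl"],
            capture_output=True, text=True,
        )
        # On success the broker prints the job identifier as an https:// URL.
        ids = [line for line in result.stdout.splitlines() if line.startswith("https://")]
        if result.returncode == 0 and ids:
            job_ids.append(ids[0])
        else:
            print(f"submission {i} failed:\n{result.stderr}")
        time.sleep(PAUSE_SECONDS)

    print(f"submitted {len(job_ids)} of {N_JOBS} jobs")
    with open("job_ids.txt", "w") as f:
        f.write("\n".join(job_ids) + "\n")

if __name__ == "__main__":
    main()
```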
Issues

• Operational management
  • How much control can/should be assumed by an operations centre?
    • small sites with little support – can GOCs restart services?
  • More intelligence in the services to recognise problems
  • Strong organisation to take operational responsibility
    • ensure that problems are addressed, traced, reported
    • need site management to take responsibility
  • Ensure that the operational security group is in place with good communications
  • Simplify service configurations – to avoid mistakes
• Weight of VOs
  • EGEE has many VOs (most still national in scope)
  • Deploying a VO is very heavyweight – it must become much simpler
Summary

• The data challenges have been running for 8 months
  • A major effort and collaboration between all involved
• Distributed operations in such a large system are hard
  • They require significant effort – EGEE will help here
• Many lessons have been learned
  • Essential that the 2nd-generation middleware takes account of all these issues
  • Not just functionality, but manageability, scalability, accountability, robustness – operational requirements are important requirements for users too ...
• Now moving to a phase of sustained, continuous operation
  • while building a parallel service to validate the next-generation middleware
• We have come a long way in the last few months
• There is still much to be done
Related papers

• Distributed Computing Services track
  • Evolution of Data Management in LCG-2 (278); Jean-Philippe Baud
• Distributed Computing Systems and Experiences track
  • Deploying and Operating LCG-2 (389); Markus Schulz
  • Many other papers on experience in using LCG-2
• Poster Session 2
  • Several papers on LCG-2 covering all aspects: certification, usage, information systems, integration/certification