“Deploying and operating the LHC Computing Grid 2 during Data Challenges”
Markus Schulz, IT-GD, CERN (markus.schulz@cern.ch) and the CERN IT-GD group
CHEP’04, Interlaken, Switzerland, 27 September – 1 October 2004
EGEE is a project funded by the European Union under contract IST-2003-508833.
Outline
• LCG overview
• Short history of LCG-2
• Data Challenges
• Operating LCG
• Problems
• Summary
The LCG Project (and what it isn’t)
• Mission: to prepare, deploy, and operate the computing environment for the experiments to analyze the data from the LHC detectors
• Two phases:
  • Phase 1 (2002–2005): build a prototype based on existing grid middleware; deploy and run a production service; produce the Technical Design Report for the final system
  • Phase 2 (2006–2008): build and commission the initial LHC computing environment
• LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required)
• LCG-2 is the first production service for EGEE
LCG Grid Deployment Area
• Scope:
  • Integrate a set of middleware, and coordinate and support its deployment to the regional centres
  • Provide operational services to enable running as a production-quality service
  • Assist the experiments in integrating their software and deploying it in LCG; provide direct user support
• Deployment goals for LCG-2:
  • Provide a production service for the Data Challenges in 2004, focused on batch production work
  • Gain experience in close collaboration between the regional centres, and learn how to maintain and operate a global grid
  • Focus on building a production-quality service: robustness, fault tolerance, predictability, and supportability
  • Understand how LCG can be integrated into the sites’ existing computing services
LCG Deployment Area
• A core team at CERN: the Grid Deployment (GD) group (~30 people)
• Collaboration of the regional centres through the Grid Deployment Board (GDB):
  • Partners take responsibility for specific tasks (e.g. GOCs, GUS)
  • Focused task forces as needed
  • Collaborative joint projects via the JTB, grid projects, etc.
• CERN deployment group:
  • Core preparation, (re)certification, deployment, and support activities
  • Integration, packaging, debugging, and development of missing tools
  • Deployment coordination and support, security and VO management
  • Experiment integration and support
• GDB: country representatives for the regional centres
  • Addresses policy and operational issues that require general agreement
  • Brokered agreements on security and on what is deployed…
LCG-2 Software
• LCG-2 core packages:
  • VDT (Globus 2, Condor)
  • EDG WP1 (Resource Broker, job submission tools)
  • EDG WP2 (replica management tools) plus LCG tools; one central RMC and LRC for each VO, located at CERN, with an Oracle backend
  • Several pieces from other WPs (configuration objects, information providers, packaging, …)
  • GLUE 1.1 information schema, plus a few essential LCG extensions
  • MDS-based information system with significant LCG enhancements (replacements, simplifications; see poster); a query sketch follows below
  • Mechanism for application (experiment) software distribution
• Almost all components have gone through some re-engineering for robustness, scalability, efficiency, and adaptation to local fabrics
• The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture)
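A minimal sketch of how a client might query the GLUE information published by the information system, assuming the python-ldap bindings, the conventional LCG-2 BDII port (2170) and base DN, and a hypothetical host name; the attribute names follow the GLUE 1.1 LDAP rendering.

```python
# Sketch: list free CPUs per computing element from a BDII.
# Assumptions: python-ldap is installed; bdii.example.org is a placeholder
# (LCG-2 BDIIs conventionally listened on port 2170 under the base DN
# "mds-vo-name=local,o=grid").
import ldap

con = ldap.initialize("ldap://bdii.example.org:2170")
# The information system allows anonymous reads, so no bind is needed.
entries = con.search_s(
    "mds-vo-name=local,o=grid",      # conventional BDII base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",          # GLUE 1.1 computing-element entries
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for dn, attrs in entries:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0]
    free = attrs.get("GlueCEStateFreeCPUs", [b"?"])[0]
    waiting = attrs.get("GlueCEStateWaitingJobs", [b"?"])[0]
    print(ce, "free:", free, "waiting:", waiting)
```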
History
• January 2003: GDB agreed to take VDT and EDG components
• March 2003: LCG-0
  • existing middleware, while waiting for the EDG 2 release
• September 2003: LCG-1
  • 3 months late -> reduced functionality
  • extensive certification process -> improved stability (RB, information system)
  • integrated 32 sites, ~300 CPUs; first use for production
• December 2003: LCG-2
  • full set of functionality for the DCs, first MSS integration
  • deployed in January to 8 core sites
  • DCs started in February -> testing in production
  • large sites integrated their resources into LCG (MSS and farms)
  • introduced a pre-production service for the experiments
  • alternative packaging (tool-based and generic installation guides)
• May 2004 -> now: monthly incremental releases
  • not all releases are distributed to external sites
  • improved services, functionality, stability, and packaging step by step
  • timely response to experiences from the Data Challenges
LCG-2 Status (22 09 2004)
• [Slide shows a world map of LCG-2 sites; newly interested sites (e.g. Cyprus) should look at the release page]
• Total: 78 sites, ~9400 CPUs, ~6.5 PByte
Integrating Sites
• Sites contact the GD group or their regional centre
• Sites go to the release page
• Sites decide on manual or tool-based installation (documentation is available for both)
  • from the next release on, WN and UI will also be available as a tar-ball based release
• Sites provide security and contact information
• Sites install and use the provided tests for debugging, with support from the regional centres or CERN (a minimal test-job sketch follows below)
• CERN GD certifies the site and adds it to the monitoring and information system
  • sites are re-certified daily and problems are traced in Savannah
• Large sites have integrated their local batch systems into LCG-2
• Adding new sites is now quite smooth; the problem is keeping a large number of sites correctly configured
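One way a new site (or its supporters) might exercise the workload chain end to end is to submit a trivial job through the Resource Broker with the EDG WP1 user-interface tools. A hedged sketch; the CE identifier and the VO name are illustrative placeholders.

```python
# Sketch: submit a trivial test job via the EDG UI commands.
# Assumes edg-job-submit is installed and a valid proxy exists
# (grid-proxy-init); the VO and CE below are placeholders.
import subprocess

JDL = '''\
Executable    = "/bin/echo";
Arguments     = "hello from the grid";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
Requirements  = other.GlueCEUniqueID == "ce.example.org:2119/jobmanager-pbs-short";
'''

with open("hello.jdl", "w") as f:
    f.write(JDL)

# On success, edg-job-submit prints an https://... job identifier,
# which can then be fed to edg-job-status and edg-job-get-output.
subprocess.call(["edg-job-submit", "--vo", "dteam", "hello.jdl"])
```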
Data Challenges
• Large-scale production efforts of the LHC experiments, to:
  • test and validate the computing models
  • produce needed simulated data
  • test the experiments’ production frameworks and software
  • test the provided grid middleware
  • test the services provided by LCG-2
• All experiments used LCG-2 for part of their production
Data Challenges (ALICE)
• Phase I:
  • 120k Pb+Pb events produced in 56k jobs
  • 1.3 million files (26 TByte) in Castor at CERN
  • total CPU: 285 MSI2k hours (one 2.8 GHz PC working for 35 years)
  • ~25% produced on LCG-2
• Phase II (underway):
  • 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours of CPU
  • ~15% on LCG-2
Data Challenges (ATLAS)
• Phase I:
  • 7.7 million events fully simulated (Geant4) in 95,000 jobs
  • 22 TByte
  • total CPU: 972 MSI2k hours
  • >40% produced on LCG-2 (used LCG-2, Grid3, and NorduGrid)
Data Challenges (CMS)
• ~30 M events produced
• 25 Hz reached (but only once for a full day)
Data Challenges (LHCb)
• Phase I:
  • 186 M events, 61 TByte
  • total CPU: 424 CPU years (43 LCG-2 sites and 20 DIRAC sites)
  • up to 5600 concurrently running jobs in LCG-2
• [Plot of the daily production rate: ~1.8 × 10^6 events/day with DIRAC alone, rising to 3–5 × 10^6 events/day with LCG in action; annotations mark where LCG paused and was restarted]
Problems during the Data Challenges (DC summary)
• All experiments encountered similar problems on LCG-2:
• LCG sites suffering from configuration and operational problems
  • inadequate resources at some sites (hardware, human, …); this is now the main source of failures
  • there is a discrepancy between the failure rate on LCG-2 and on the certification and testing (C&T) testbed
• Load balancing between different sites is problematic
  • jobs can be “attracted” to sites that have no adequate resources
  • modern batch systems are too complex and dynamic to summarize their behavior in a few values in the information system
• Identification and location of problems in LCG-2 is difficult
  • distributed environment; access to many logfiles is needed…
  • monitoring tools are still limited
• Handling thousands of jobs is time-consuming and tedious
  • support for bulk operations is not adequate (a sketch of the user-side workaround follows below)
• Performance and scalability of services:
  • storage (access and number of files)
  • job submission
  • information system
  • file catalogues
• Services suffered from hardware problems (no failover services)
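The lack of bulk operations meant the experiments ended up writing wrapper scripts of roughly this shape: one edg-job-submit call per job, then a polling loop with one edg-job-status call per job. A hedged sketch; the output parsing and the VO name are illustrative.

```python
# Sketch: user-side "bulk" submission, one broker round-trip per job.
# This per-job overhead is exactly why handling thousands of jobs was
# tedious. The parsing of the tool output is illustrative.
import subprocess, time

def submit(jdl_file):
    out = subprocess.Popen(["edg-job-submit", "--vo", "dteam", jdl_file],
                           stdout=subprocess.PIPE).communicate()[0]
    for line in out.decode(errors="replace").splitlines():
        if line.strip().startswith("https://"):
            return line.strip()          # the job identifier
    return None

job_ids = [j for j in (submit("hello.jdl") for _ in range(100)) if j]

# Poll until all jobs reach a terminal state: again, one call per job.
while job_ids:
    time.sleep(300)
    for jid in list(job_ids):
        out = subprocess.Popen(["edg-job-status", jid],
                               stdout=subprocess.PIPE).communicate()[0]
        if b"Done" in out or b"Aborted" in out:
            job_ids.remove(jid)
```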
Running Services (DC summary)
• Multiple instances of the core services for each of the experiments
  • separates problems and avoids interference between experiments
  • improves availability
  • allows experiments to maintain individual configurations (information system)
  • addresses scalability to some degree
• Monitoring tools for services are currently not adequate
  • tools are under development to implement a control system
• Access to storage via load-balanced interfaces: CASTOR and dCache
• Services that carry “state” are problematic to restart on new nodes
  • needed after hardware or security problems
• The “state transition” between partial and full usage of resources required changes in queue configuration (fair share, individual queues per VO)
  • the next release will come with a description of fair-share configuration for smaller sites (an illustrative fragment follows below)
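For the fair-share point above: at sites running PBS/Torque with the Maui scheduler, per-VO fair share can replace dedicated per-VO queues. A sketch of what such a fragment might look like; the Maui parameter names are real, but the shares, window settings, and mapping of VOs to groups are placeholders a site would tune.

```python
# Sketch: emit a Maui fair-share fragment granting each VO a target
# share of the farm. Values are illustrative, not recommendations.
MAUI_FAIRSHARE = """\
FSPOLICY   DEDICATEDPS    # account usage in dedicated processor-seconds
FSDEPTH    24             # remember 24 fair-share windows
FSINTERVAL 24:00:00       # each window lasts one day
FSDECAY    0.80           # older windows weigh less

GROUPCFG[atlas] FSTARGET=35
GROUPCFG[cms]   FSTARGET=35
GROUPCFG[lhcb]  FSTARGET=20
GROUPCFG[dteam] FSTARGET=10
"""

print(MAUI_FAIRSHARE)   # a site admin would merge this into maui.cfg
```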
Support during the DCs
• User (experiment) support:
  • GD at CERN worked very closely with the experiments’ production managers
  • informal exchange (e-mail, meetings, phone)
  • “no secrets” approach: GD people are on the experiments’ mailing lists and vice versa
    • ensured fast response
    • tracking of problems was tedious, but both sides have been patient
    • clear learning curve on BOTH sides
  • LCG GGUS (grid user support) at FZK became operational after the start of the DCs
    • due to the importance of the DCs, the experiments are switching to the new service only slowly
  • very good end-user documentation by GD-EIS
  • dedicated testbed for the experiments with the next LCG-2 release
    • rapid feedback influenced what made it into the next release
• Installation (site) support:
  • GD prepared releases and supported sites (certification, re-certification)
  • regional centres supported their local sites (some more, some less)
  • community-style help via the mailing list (high traffic!)
  • FAQ lists for troubleshooting and configuration issues (Taipei, RAL)
• Callout boxes on the slide: “You made DCs almost enjoyable” and “Dear Experiments DC Staff, [LCG-ROLLOUT] friends, Site Admins and GD Group: thank you very much for the supportive attitude, patience and energy during the data challenges!”
Support during the DCs (continued)
• Operations service:
  • RAL (UK) is leading the sub-project on developing operations services
  • initial prototype: http://www.grid-support.ac.uk/GOC/
  • basic monitoring tools
  • mailing lists for problem resolution
  • working on defining policies for operation and responsibilities (draft document)
  • working on grid-wide accounting
• Monitoring:
  • GridICE (a development of the DataTag Nagios-based tools)
  • GridPP job submission monitoring
  • information system monitoring and consistency checks: http://goc.grid.sinica.edu.tw/gstat/
  • CERN GD daily re-certification of sites (including history)
    • escalation procedure under development
    • tracing of site-specific problems via the problem tracking tool
    • tests core services and configuration
Operational issues (a selection)
• Slow response from sites
  • upgrades, response to problems, etc.
  • problems are reported daily; some problems last for weeks
• Lack of staff available to fix problems
  • vacation periods, other high-priority tasks
• Various misconfigurations (see next slide)
• Lack of configuration management: problems that are fixed reappear
• Lack of fabric management (mostly at smaller sites)
  • scratch space, single nodes draining queues, incomplete upgrades, …
• Lack of understanding
  • admins reformat the disks of an SE …
• Firewall issues
  • often less than optimal coordination between grid admins and firewall maintainers
• PBS problems
  • scalability, robustness (switching to Torque helps)
• Provided documentation is often not read (carefully)
  • a new activity has started to develop “hierarchical”, adaptive documentation (see G2G poster)
• A simpler way to install middleware on farm nodes (even remotely, in user space) will be included in the October release
Site (mis)configurations
• Site misconfiguration was responsible for most of the problems that occurred during the experiments’ Data Challenges. An incomplete list of problems (a sanity-check sketch follows below):
  – The variable VO_<VO>_SW_DIR points to a non-existent area on the WNs
  – The ESM is not allowed to write to the area dedicated to software installation
  – Only one certificate is allowed to be mapped to the ESM local account
  – Wrong information published in the information system (GLUE object classes not linked)
  – Queue time limits published in minutes instead of seconds, and not normalized
  – /etc/ld.so.conf not properly configured, so shared libraries are not found
  – Machines not synchronized in time
  – Grid-mapfiles not properly built
  – Pool accounts not created, but the rest of the tools configured for pool accounts
  – Firewall issues
  – CA files not properly installed
  – NFS problems for home directories or ESM areas
  – Services configured to use the wrong Information Index (BDII), or none at all
  – Wrong user profiles
  – Default user shell environment too big
• These problems are partly related to the complexity of the middleware
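Several items on the list are detectable with a trivial local check run on the node in question. A sanity-check sketch using only the Python standard library; the paths follow common LCG-2 conventions, and the VO name is an illustrative placeholder.

```python
# Sketch: worker-node sanity check for a few of the misconfigurations
# listed above. Paths follow common LCG-2 conventions.
import os

def check(label, ok):
    print("%-50s %s" % (label, "OK" if ok else "FAIL"))

vo = "atlas"                                   # placeholder VO
var = "VO_%s_SW_DIR" % vo.upper()
sw_dir = os.environ.get(var, "")

# VO software area must exist on the WN (and be writable by the ESM).
check("%s points to an existing directory" % var,
      sw_dir != "" and os.path.isdir(sw_dir))

# CA certificates installed where the grid tools expect them.
ca_dir = "/etc/grid-security/certificates"
check("CA files present in %s" % ca_dir,
      os.path.isdir(ca_dir) and len(os.listdir(ca_dir)) > 0)

# grid-mapfile built and non-empty.
gm = "/etc/grid-security/grid-mapfile"
check("grid-mapfile exists and is non-empty",
      os.path.isfile(gm) and os.path.getsize(gm) > 0)

# Clock synchronization: here only a configuration check; a real test
# would query an NTP server for the actual offset.
check("/etc/ntp.conf present", os.path.isfile("/etc/ntp.conf"))
```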
Outstanding Middleware Issues
• Collected in the document “Outstanding Middleware Issues”
• Important: this is the first systematic confrontation of the required functionality with the capabilities of the existing middleware
  • some issues can be patched or worked around
  • those related to fundamental problems with the underlying models and architectures have to be fed in as essential requirements to future developments (EGEE)
• The middleware is now not perfect, but quite stable
  • much has been improved during the DCs
  • a lot of effort is still going into improvements and fixes
• The big hole is the missing space management on SEs, especially for Tier-2 sites
EGEE Impact on Operations
• The available effort for operations from EGEE is now ramping up: the LCG GOC (RAL) grows into the EGEE CICs and ROCs, plus Taipei
• Hierarchical support structure:
  • Regional Operations Centres (ROC)
    • one per region (9)
    • front-line support for deployment, installation, and users
  • Core Infrastructure Centres (CIC)
    • four (plus Russia next year)
    • evolve from the GOC: monitoring, troubleshooting, operational “control”
    • “24x7” service in an 8x5 world????
    • also providing VO-specific and general services
• EGEE NA3 organizes training for users and site admins
• “Grid operations day” at HEPiX in October, to address common issues and experiences
• “Operations and Fabric Workshop”, CERN, 1–3 November
Summary
• The LCG-2 services have been supporting the Data Challenges
• Many middleware problems have been found; many have been addressed
• The middleware itself is reasonably stable
• The biggest outstanding issues are related to providing and maintaining stable operations
  • this has to be addressed in large part by management buy-in, providing sufficient and appropriate effort at each site
• Future middleware has to take this into account:
  • it must be more manageable, and trivial to configure and install
  • it must be easy to deploy in a failsafe mode
  • it must be easy to deploy in a way that allows building scalable, load-balanced services
  • management and monitoring must be built into the services from the start
Screen Shots
[Two slides of screenshots; the images are not preserved in this transcript]