EU DataGRID testbed management and support at CERN Speaker: Emanuele Leonardi (EDG Testbed Manager – WP6) Emanuele.Leonardi@cern.ch http://presentation.address
Talk Outline • Foreword • Introduction to EDG Services • EDG Testbeds at CERN • EDG Testbed Operation Activities • Installation and configuration • Service administration • Resource management • Conclusions. Authors: Emanuele Leonardi, Markus Schulz (CERN)
Foreword • This talk is NOT about how to run the EDG grid middleware in a production environment: the EDG software is still quite far from being a viable, solid production system, and whatever we have learned up to now could (and hopefully will) change completely in the near future • I will only describe the (mis)adventures that happened while trying (and often not succeeding) to run the EDG testbeds during the first two years of the project • I will also point out some of the deployment-related problems we encountered, which will (hopefully) be addressed by future versions of the software on the way to LCG-1
EDG Services (1) • Authentication • Grid Security Infrastructure (GSI) based on PKI (openSSL) • Proxy renewal service • Authorization • Global EDG user directory + per-VO user directory (LDAP) • Very coarse-grained (see the sketch below) • Resource Access • GLOBUS gatekeeper with EDG extensions • Interface to standard batch systems (PBS, LSF) • GSI-enabled FTP server • Data replication and access • Job sandbox transportation
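In practice the coarse-grained authorization amounted to mapping every certificate subject (DN) found in a VO's directory onto a shared local account via a grid-mapfile, generated from the per-VO LDAP directories mentioned above. A minimal sketch of that idea; the DNs and local account names are invented for illustration:

```python
# Sketch: build a grid-mapfile from per-VO membership lists.
# The DNs and local account names are invented for illustration; the
# '"DN" local_account' line format is the standard grid-mapfile one.

vo_members = {
    "cms":   ["/O=Grid/O=CERN/OU=cern.ch/CN=Some User"],
    "atlas": ["/O=Grid/O=CERN/OU=cern.ch/CN=Another User"],
}

def write_gridmap(path):
    lines = []
    for vo, dns in vo_members.items():
        for dn in dns:
            # every member of a VO is mapped to the same shared account:
            # this is the "very coarse-grained" authorization of the slide
            lines.append(f'"{dn}" {vo}001')
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_gridmap("grid-mapfile.test")   # write locally for inspection
```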
EDG Services (2) • Installation • LCFG(ng) • Storage Management • File Replication service (GDMP) • At CERN interfaced to CASTOR (MSS) • Replica Catalog (LDAP) • One RC per VO • Information Services • Hierarchical GRID Information Service (GIS) structure (LDAP) • Central Metacomputing Directory Service (MDS)
EDG Services (3) • Resource Management • Resource Broker • Interfaced to the GIS • Jobmanager and Jobsubmission • Based on CondorG • Logging and Bookkeeping • Monitoring • None (being deployed in EDG 2) • Accounting • None (foreseen for EDG 2)
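To show how the Resource Broker was driven from the user side, here is a sketch of submitting a trivial job from a User Interface node. The JDL attributes and the command name shown are indicative only; consult the documentation of the installed release for the exact syntax.

```python
# Sketch: submit a trivial job through the Resource Broker from a UI node.
# The JDL attributes and the command name are indicative only.
import subprocess
import textwrap

jdl = textwrap.dedent("""\
    Executable    = "/bin/hostname";
    StdOutput     = "hostname.out";
    StdError      = "hostname.err";
    OutputSandbox = {"hostname.out", "hostname.err"};
""")

with open("hello.jdl", "w") as f:
    f.write(jdl)

# The RB matches the job against the Information Service and hands it to
# CondorG, which submits it to the gatekeeper of the chosen CE.
subprocess.run(["edg-job-submit", "hello.jdl"], check=False)
```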
EDG Services (4) • Services are interdependent • Services are composite and heterogeneous • Based on lower-level services (e.g. CondorG) • Many different database flavors in use (MySQL, Postgres, …) • Services are mapped to logical machines (see the sketch below) • Each physical node runs one or more services • e.g. a Computing Element (CE) runs the GLOBUS gatekeeper, an FTP server, a batch submission system, … • Services impose constraints on the testbed configuration • Shared filesystems are needed within the batch system to provide a common /home area and a homogeneous security configuration • Some services need special resources on the node (extra RAM)
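A small model of this mapping makes the constraints easier to check before deployment; the node names, service lists, and RAM figures below are invented for illustration, not the actual CERN layout:

```python
# Sketch: model the service-to-node mapping and check one simple
# constraint (extra RAM for memory-hungry services). All names and
# numbers are illustrative.

NODES = {
    "ce01": {"services": ["gatekeeper", "gsiftp", "pbs-server"], "ram_mb": 512},
    "se01": {"services": ["gsiftp", "gdmp"], "ram_mb": 512},
    "rb01": {"services": ["resource-broker", "logging-bookkeeping"], "ram_mb": 1024},
    "wn01": {"services": ["pbs-mom"], "ram_mb": 256},
}

# services observed to need extra memory on the hosting node (assumed values)
RAM_HUNGRY_MB = {"resource-broker": 1024}

def check_ram(nodes):
    for name, cfg in nodes.items():
        for svc in cfg["services"]:
            need = RAM_HUNGRY_MB.get(svc, 0)
            if cfg["ram_mb"] < need:
                print(f"{name}: {svc} needs >= {need} MB RAM, node has {cfg['ram_mb']} MB")

check_ram(NODES)
```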
EDG Testbeds at CERN • Production (Application) Testbed: 40 nodes, EDG v.1.4.7 • Few updates (security fixes) but frequent restarts of services (daily) • Data production tests by application groups (LHC experiments, etc.) • Demonstrations and tutorials (every few weeks) • Number of nodes varied greatly (user requests, stress tests, availability) • Development Testbed: 9 nodes, EDG v.1.4.7 • In the past used to integrate and test new releases • Many changes/day, very unstable, service restarts, traceability problems • Now used to test small changes before installation on the Production TB • Integration Testbed: 18 nodes • EDG porting to RH7.3 + GLOBUS 2.2.4, then EDG 2.0 integration • Many minor testbeds (developers, unit testing, service integration)
EDG Testbeds at CERN: Infrastructure • 5 NFS servers with 2.5 TByte mirrored disk • User directories • Shared /home area on the batch system • Storage managed by EDG and visible from the batch system • NIS server to manage users (not only CERN users) • LCFG servers for installation and configuration • Certification Authority • To provide CERN users with X509 user certificates • To provide CERN with host and service certificates • Hierarchical system (Registration Authorities) mapped to experiments • Linux RH 6.2 (now moving to RH 7.3)
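A recurring chore with host and service certificates is catching the ones about to expire before a service silently stops authenticating. A minimal sketch of such a check, assuming the certificates are readable under a hypothetical shared path; the openssl invocation is the standard one:

```python
# Sketch: warn about host/service certificates expiring within 30 days.
# The certificate directory is hypothetical; "openssl x509 -enddate -noout"
# is the standard way to read a certificate's expiry date.
import glob
import subprocess
from datetime import datetime, timedelta

WARN_WITHIN = timedelta(days=30)

for cert in glob.glob("/etc/grid-security/hostcerts/*.pem"):
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert],
        capture_output=True, text=True).stdout.strip()
    if "=" not in out:          # openssl failed or file unreadable: skip
        continue
    # output looks like: notAfter=Jun 30 12:00:00 2003 GMT
    expiry = datetime.strptime(out.split("=", 1)[1], "%b %d %H:%M:%S %Y %Z")
    if expiry - datetime.utcnow() < WARN_WITHIN:
        print(f"{cert} expires on {expiry:%Y-%m-%d}")
```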
Some History: before v.1.1.2 • The Continuous Release Period (AKA the Dark Ages) • No procedures • The CERN testbeds have seen all versions • Trial and error (on the CERN testbeds first) • Services very unreliable • Many debugging sessions with developers • Resulted in a version that could convince the reviewers in March of last year
Release Procedures • New RPMs are delivered to the Integration Team • Configuration changed in CVS • Installed on the Dev TB at CERN (highest rate of changes): basic tests • Core sites install the version on their Dev TBs: distributed tests • Software is deployed on the Application TB: final (large-scale) tests • Applications start using the Application Testbed • Other sites install, get certified by the ITeam, and then join • Over time this process has evolved into a quite strict release procedure • Application software is installed on UI/WN on demand and outside the release process
More History • Successes: matchmaking/job management, basic data management • Known problems: high-rate submissions, long FTP transfers • Known problems: GASS cache coherency, race conditions in the gatekeeper, unstable MDS • ATLAS phase 1 start; CMS stress test (Nov. 30 – Dec. 20) • Successes: improved MDS stability, FTP transfers OK • Known problems: interactions with the RC • Intense use by applications! (CMS, ATLAS, LHCb, ALICE) • Limitations: resource exhaustion, size of logical collections • Security fix (sendmail) • Security fix (file)
Operations: Node Installation (1) • Basic installation tool: LCFG (Local ConFiGuration system) • by the University of Edinburgh (http://www.lcfg.org) • LCFG is most effective if: • Not too many different machine types/setups • LCFG objects are provided to configure all services • Machine configurations are compatible with LCFG constraints • e.g. only the 4 primary partitions are supported • Main drawbacks of LCFG, and how we work around them: • No verification of the installation/update process: home-made tools do the checking, plus a lot of manual work (see the sketch below) • Not well suited for rapidly changing nodes (developers): install the node with LCFG, switch LCFG off, let the developers play with the node and come up with a new setup, reinstall the node with LCFG, redo ad lib. • Wants to have total control of the machine (not suitable for installing EDG on an already running system) • Does not handle user accounts (password changes): LCFG handles root and system accounts, NIS the standard users
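The simplest of those home-made checks just compares what a node was supposed to get with what is actually installed. A minimal sketch of that idea, assuming an expected package list has been extracted from the node's installation profile into a file (the file name is hypothetical; rpm -qa is the standard query):

```python
# Sketch: compare the RPMs actually installed on a node with the list the
# installation profile says it should have. "expected-rpms.txt" is a
# hypothetical file; "rpm -qa" is the standard RPM query.
import subprocess

def installed_packages():
    out = subprocess.run(
        ["rpm", "-qa", "--qf", r"%{NAME}-%{VERSION}-%{RELEASE}\n"],
        capture_output=True, text=True).stdout
    return set(out.split())

def expected_packages(path="expected-rpms.txt"):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

want, have = expected_packages(), installed_packages()
for missing in sorted(want - have):
    print("MISSING:", missing)
for extra in sorted(have - want):
    print("EXTRA:  ", extra)
```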
Operations: Node Installation (2) • PXE-based initiation of the installation process • Previously a floppy was needed to start the installation • With PXE the whole process goes through the network • Serial-line-controlled reset of nodes • The reset button is connected to a relay system controlled from a server via a serial line (ref. Andras Horvath’s talk) • Serial line console monitoring • All serial lines are connected to a central server via a multi-port serial card (ref. Andras Horvath’s talk) • Visits to the Computer Center drastically reduced
EDG Middleware Management • EDG is an R&D project (but was advertised for production) • Many services are fragile (daily restarts) • Very complex fault patterns (every release creates new ones) • The “right” way to do things had (and for some services still has) to be discovered • The site model used was not suited for production • Missing management of storage, scratch space, log files (if any) • Several services do not scale above the “proof of principle” level • Max 512 jobs per RB, max ~1000 files in the RC • Some components are resource hungry (see the watchdog sketch below) • Memory leaks, file leaks, port leaks, i-node leaks, … • The route from a working binary to a deployable RPM is not always reliable • An autobuild system is now in place
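Most of those leaks were kept under control the pedestrian way: a periodic watchdog that looks at a service's memory and open-file usage and restarts it when a threshold is crossed. A minimal sketch of such a watchdog; the process name, limits, and restart command are illustrative, not the real EDG configuration:

```python
# Sketch: restart a leaking service when its resident memory or open-file
# count grows beyond a threshold. Process name, limits, and the restart
# command are illustrative.
import os
import subprocess

SERVICE    = "globus-gatekeeper"   # hypothetical process name to watch
MAX_RSS_MB = 512
MAX_FDS    = 1024

def pids_of(name):
    out = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True).stdout
    return [int(p) for p in out.split()]

def rss_mb(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) // 1024   # kB -> MB
    return 0

def open_fds(pid):
    return len(os.listdir(f"/proc/{pid}/fd"))

for pid in pids_of(SERVICE):
    if rss_mb(pid) > MAX_RSS_MB or open_fds(pid) > MAX_FDS:
        subprocess.run(["service", SERVICE, "restart"])   # illustrative restart
        break
```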
gass_cache • A disk area (gass_cache) has to be shared between the gatekeeper node (CE) and all the worker nodes (scaling problem here?) • Each job creates a large number (>>100) of tiny files in this area • If the job ends in an unclean way, these files are not deleted • There is no easy way to tell which file belongs to whom, and no GLOBUS/EDG tool to handle this case • Usage of i-nodes is huge: at least once per week the whole batch system has to be stopped and the gass_cache area cleaned (~2 hours, given the number of i-nodes; see the sketch below) • Random fault pattern: the system stops working for apparently totally uncorrelated reasons, the shared area appears empty • All the jobs running at the time are lost
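The weekly clean-up itself is nothing more sophisticated than removing stale entries under the shared cache while the batch system is drained. A minimal sketch, with a hypothetical cache path and age cutoff:

```python
# Sketch: remove gass_cache entries older than a cutoff, to be run only
# while the batch system is drained. Path and age are illustrative.
import os
import time

CACHE_DIR = "/shared/gass_cache"     # hypothetical shared area
MAX_AGE_S = 7 * 24 * 3600            # one week

now = time.time()
removed = 0
for root, dirs, files in os.walk(CACHE_DIR, topdown=False):
    for name in files:
        path = os.path.join(root, name)
        try:
            if now - os.stat(path).st_mtime > MAX_AGE_S:
                os.unlink(path)
                removed += 1
        except OSError:
            pass                      # file vanished or unreadable: skip
    # drop directories that are now empty to reclaim i-nodes
    for name in dirs:
        try:
            os.rmdir(os.path.join(root, name))
        except OSError:
            pass
print(f"removed {removed} files")
```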
Storage management (1) • “No clear concept in Grid community how to deal with data access and storage” – P.Kunszt, WP2 manager • Available: • GSI-enabled ftp server (SE) • replication-on-demand service (GDMP) • toy Replica Catalogue (max O(1000) files managed coherently) • a few user commands to copy and register files • very basic interface to tape storage (CASTOR, HPSS) • Unavailable: • a clear idea of how the whole system should work • any disk space management and monitoring tool
Storage Management (2) • Constraint from the Replica Catalog: • PFN = physical filename, LFN = logical filename • PFN = <Local storage area>/LFN • Consequences: • the whole storage area must consist of a single big partition, OR all SEs must have exactly the same disk partition structure • different partitioning of the storage area on different SEs can make file replication impossible • e.g. an LFN might be located on a partition with free space on one SE and on a full partition on another SE • Needs A LOT OF PLANNING if only small partitions are available (see the sketch below) • at CERN we use disk servers exporting 100 GB partitions
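The constraint is easy to state in code: the PFN is fully determined by prepending the SE's local storage root to the LFN, so the target partition is not negotiable at replication time. A minimal sketch of a pre-flight check, with hypothetical SE names and storage roots:

```python
# Sketch: derive the PFN from the LFN exactly as the RC constraint dictates
# (PFN = <local storage area>/LFN) and check whether the partition holding
# that path has room for a replica. SE names and roots are hypothetical.
import os

STORAGE_ROOT = {
    "se01.cern.ch": "/storage/part03",
    "se02.cern.ch": "/storage/part07",
}

def pfn(se, lfn):
    return os.path.join(STORAGE_ROOT[se], lfn)

def can_hold(se, size_bytes):
    root = STORAGE_ROOT[se]
    if not os.path.isdir(root):       # hypothetical path not present here
        return False
    st = os.statvfs(root)
    return st.f_bavail * st.f_frsize > size_bytes

# The same LFN may fit on one SE and not on another, purely because of how
# each SE's local storage area happens to be partitioned.
lfn = "cms/run42/hits_001.dat"
for se in STORAGE_ROOT:
    print(se, pfn(se, lfn), "ok" if can_hold(se, 2 * 10**9) else "full")
```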
Resource Broker woes • The Resource Broker is the central intelligence of the GRID • It interacts with most of the other services in a complex way • Any problem on the RB actually stops the whole system • A lot of effort was put into fixing problems, but it is still the most sensitive spot in the GRID • Almost every day one of the RBs must be stopped and all its databases cleaned • The problem is a database corruption caused by a non-thread-safe library; fixes are already available and will be deployed with EDG 2 • All jobs being managed by that RB at the time are lost
EDG Developer’s view of computing • From the previous slides and (oh, too many!) other examples, we inferred the following “EDG Developer’s view of computing” • Service node characteristics: CPU = SPECInt, RAM = MB • Disk storage characteristics: a single disk partition, sized in GB and transparently growable • Network connectivity: bandwidth with 0 RTT to anywhere in the world, up 24/7 • These “approximations” are reasonable in a proof-of-principle system, but when moving to the real world these expectations should be revisited
Conclusions • EDG testbeds have been in operation for almost two years • They were a source of continuous and fundamental feedback to the developers (like: “this doesn’t work”, “this doesn’t scale”) • LHC experiments and other project partners were able to taste the flavor (perfume? stink?) of a realistic GRID environment • The EDG testbed was too often advertised as a working production system. This was a source of misunderstandings and had bad consequences for several people: • users were frustrated by missing or non-working functionality • testbed managers were over-stressed by upset users asking for a working system 24/7, without being given the means to fulfill this request • Thanks to the EU and to our national funding agencies for their support of this work