EU DataGRID testbed management and support at CERN Speaker: Emanuele Leonardi (EDG Testbed Manager – WP6) Emanuele.Leonardi@cern.ch http://presentation.address
Talk Outline • Foreword • Introduction to EDG Services • EDG Testbeds at CERN • EDG Testbed Operation Activities • Installation and configuration • Service administration • Resource management • Conclusions • Authors: Emanuele Leonardi, Markus Schulz (CERN)
Foreword • This talk is NOT about how to run the EDG grid middleware in a production environment: EDG software is still quite far from being a viable and solid production system and whatever we learned up to now could (and hopefully will) change completely in the near future • I will only describe the (mis)adventures which happened while trying (and often not succeeding) to run the EDG testbeds during the first two years of the project • I will also point out some of the deployment-related problems we encountered and which will (hopefully) be addressed by the future versions of the software on the way to LCG-1
EDG Services (1) • Authentication • Grid Security Infrastructure (GSI) based on PKI (OpenSSL) • Proxy renewal service • Authorization • Global EDG user directory + per-VO user directory (LDAP) • Very coarse-grained • Resource Access • GLOBUS gatekeeper with EDG extensions • Interface to standard batch systems (PBS, LSF) • GSI-enabled FTP server • Data replication and access • Job sandbox transportation
EDG Services (2) • Installation • LCFG(ng) • Storage Management • File Replication service (GDMP) • At CERN interfaced to CASTOR (MSS) • Replica Catalog (LDAP) • One RC per VO • Information Services • Hierarchical GRID Information Service (GIS) structure (LDAP) • Central Metacomputing Directory Service (MDS)
EDG Services (3) • Resource Management • Resource Broker • Interfaced to the GIS • Jobmanager and Jobsubmission • Based on CondorG • Logging and Bookkeeping • Monitoring • None (being deployed in EDG 2) • Accounting • None (foreseen for EDG 2)
EDG Services (4) • Services are interdependent • Services are composite and heterogeneous • Based on lower-level services (e.g. CondorG) • Many different database flavors in use (MySQL, Postgres, …) • Services are mapped to logical machines • Each physical node runs one or more services • e.g. a Computing Element (CE) runs the GLOBUS gatekeeper, an FTP server, a batch submission system, … • Services impose constraints on the testbed configuration • Shared filesystems are needed within the batch system to have a common /home area and to create a homogeneous security configuration • Some services need special resources on the node (extra RAM)
EDG Testbeds at CERN • Production (Application) Testbed: 40 Nodes, EDG v.1.4.7 • Few updates (security fixes) but frequent restarts of services (daily) • Data production tests by application groups (LHC experiments, etc.) • Demonstrations and Tutorials (every few weeks) • Number of nodes varied greatly (user requests, stress tests, availability) • Development Testbed: 9 Nodes, EDG v.1.4.7 • In the past used to integrate and test new releases • Many changes per day, very unstable, service restarts, traceability problems • Now used to test small changes before installation on the Production TB • Integration Testbed: 18 Nodes • EDG porting to RH7.3+GLOBUS 2.2.4, then EDG 2.0 integration • Many minor Testbeds (developers, unit testing, service integration)
EDG Testbeds at CERN: Infrastructure • 5 NFS servers with 2.5 TByte mirrored disk • User directories • Shared /home area on the batch system • Storage managed by EDG and visible from the batch system • NIS server to manage users (not only CERN users) • LCFG servers for installation and configuration • Certification Authority • To provide CERN users with X509 user certificates • To provide CERN with host and service certificates • Hierarchical system (Registration Authorities) mapped to experiments • Linux RH 6.2 (now moving to RH 7.3)
Some History: before v.1.1.2 • The Continuous Release Period (AKA the Dark Ages) • No procedures • CERN testbeds have seen all versions • Trial and error (on the CERN testbeds first) • Services very unreliable • Many debugging sessions with developers • Resulted in a version that could convince the reviewers in March last year
Release Procedures • New RPMs are delivered to the Integration Team • Configuration changed in CVS • Installed on Dev TB at CERN (highest rate of changes): basic tests • Core sites install version on their Dev TBs: distributed tests • Software is deployed on Application TB: final (large scale) tests • Applications start using the Application Testbed • Other sites install, get certified by ITeam, and then join • Over time this process has evolved into a quite strict release procedure • Application software is installed on UI/WN on demand and outside the release process
More History (timeline figure) • Successes: Matchmaking/Job Mgt., Basic Data Mgt.; Known Problems: High Rate Submissions, Long FTP Transfers • Known Problems: GASS Cache Coherency, Race Conditions in Gatekeeper, Unstable MDS • ATLAS phase 1 start • CMS stress test, Nov. 30 - Dec. 20 • Successes: Improved MDS Stability, FTP Transfers OK; Known Problems: Interactions with RC • Intense Use by Applications! Limitations: Resource Exhaustion, Size of Logical Collections • CMS, ATLAS, LHCB, ALICE • Security fix (sendmail) • Security fix (file)
Operations: Node Installation (1) • Basic installation tool: LCFG (Local ConFiGuration System) • by University of Edinburgh (http://www.lcfg.org) • LCFG is most effective if: • There are not too many different machine types/setups • LCFG-objects are provided to configure all services • Machine configurations are compatible with LCFG constraints • e.g. only the 4 primary partitions are supported • Main drawbacks of LCFG • No verification of the installation/update process • Not well suited for rapidly changing nodes (developers) • Wants total control over the machine (not suitable for installing EDG on an already running system) • Does not handle user accounts (password changes)
Operations: Node Installation (1), continued • Drawback: no verification of the installation/update process • In practice: home-made tools to do the checking; a lot of manual work
Operations: Node Installation (1), continued • Drawback: not well suited for rapidly changing nodes (developers) • In practice: install the node with LCFG • switch off LCFG • developers play with the node and come up with a new setup • reinstall the node with LCFG • redo ad lib.
Operations: Node Installation (1), continued • Drawback: does not handle user accounts (password changes) • In practice: use LCFG to handle root and system accounts; use NIS for standard users
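The "home-made tools to do the checking" are not described in the talk. As a rough illustration of what such a check might look like, the sketch below compares the package set a node is supposed to have against what is actually installed. Everything here is an assumption made for illustration: the function names, the "name version" text format (which could be produced on the node with `rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n'`), and the idea that the expected list is available in the same format.

```python
def parse_listing(text):
    """Parse lines of the form 'name version' into a {name: version} dict.

    Assumed input format, e.g. the output of
    rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\\n'
    Blank lines are ignored.
    """
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = line.split(None, 1)
        packages[name] = version
    return packages


def diff_packages(expected, installed):
    """Report how the installed packages deviate from the expected ones."""
    missing = sorted(n for n in expected if n not in installed)
    extra = sorted(n for n in installed if n not in expected)
    mismatched = sorted(
        (n, expected[n], installed[n])
        for n in expected
        if n in installed and installed[n] != expected[n]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

An empty report (all three lists empty) would mean the node matches its intended package set; anything else points at an installation or update step that did not complete as configured.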
Operations: Node Installation (2) • PXE-based initiation of installation process • Previously a floppy was needed to start the installation process • With PXE the whole process goes through the network • Serial line controlled reset of nodes • The reset button is connected to a relay system controlled from a server via a serial line (ref. Andras Horvath’s talk) • Serial line console monitoring • All serial lines are connected to a central server via a multi-port serial card (ref. Andras Horvath’s talk) • Visits to the Computer Center drastically reduced
EDG Middleware Management • EDG is an R&D project (but was advertised for production) • Many services are fragile (daily restarts) • Very complex fault patterns (every release creates new ones) • The “right” way to do things had (and for some services still has) to be discovered • The site model used was not suited for production • Missing management of storage, scratch space, log files (if any) • Several services do not scale above the “proof of principle” level • Max 512 jobs per RB, max ~1000 files in the RC • Some components are resource hungry • Memory leaks, file leaks, port leaks, i-node leaks, … • Route from working binary to deployable RPM not always reliable • An autobuild system is now in place
gass_cache • A disk area (gass_cache) has to be shared between the gatekeeper node (CE) and all the worker nodes (scaling problem here?) • Each job creates a large number (>>100) of tiny files in this area • If the job ends uncleanly, these files are not deleted • No easy way to tell which file belongs to whom, no GLOBUS/EDG tool to handle this case • Usage of i-nodes is huge: at least once per week the whole batch system has to be stopped and the gass_cache area cleaned (~2 hours given the number of i-nodes) • Random fault pattern: the system stops working for apparently totally uncorrelated reasons, shared area appears empty • All the jobs running at the time are lost
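The talk gives no tool for watching the i-node consumption described above. A minimal monitoring sketch, assuming a POSIX system and using only the Python standard library, might look like the following. The function names are hypothetical, and the age-based heuristic is an assumption rather than anything EDG or GLOBUS provided: since the slide notes there is no easy way to tell which file belongs to which job, file age is only a rough proxy and could flag files of long-running jobs.

```python
import os
import time


def inode_usage(path):
    """Return the fraction of i-nodes in use on the filesystem holding `path`,
    or None if the filesystem does not report i-node counts."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return (st.f_files - st.f_ffree) / st.f_files


def stale_files(root, max_age_seconds, now=None):
    """Yield files under `root` not modified for more than `max_age_seconds`.

    Age is only a heuristic for 'left behind by an unclean job exit'.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(full).st_mtime
            except FileNotFoundError:
                continue  # file removed while we were scanning
            if now - mtime > max_age_seconds:
                yield full
```

A site could, for instance, raise an alarm when `inode_usage` on the shared area crosses a chosen threshold and review the `stale_files` listing before the area fills up, instead of discovering the problem when the batch system stops.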
Storage management (1) • “No clear concept in Grid community how to deal with data access and storage” – P.Kunszt, WP2 manager • Available: • GSI-enabled ftp server (SE) • replication-on-demand service (GDMP) • toy Replica Catalogue (max O(1000) files managed coherently) • a few user commands to copy and register files • very basic interface to tape storage (CASTOR, HPSS) • Unavailable: • a clear idea of how the whole system should work • any disk space management and monitoring tool
Storage Management (2) • Constraint from the Replica Catalog: • PFN = physical filename, LFN = logical filename • PFN = <Local storage area>/LFN • Consequences: • the whole storage area must consist of a single big partition OR all SEs must have exactly the same disk partition structure • different partitioning of the storage area on different SEs can make file replication impossible • e.g. an LFN might be located on a partition with free space on one SE and on a full partition on another SE • Needs A LOT OF PLANNING if only small partitions are available • at CERN we use disk servers exporting 100GB partitions
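The consequence of the PFN = <Local storage area>/LFN rule can be made concrete with a small model. The sketch below is illustrative only: the function names, the mount points, the LFN and the free-space figures are all invented, and a storage element is reduced to a mapping from mount point to free bytes.

```python
import posixpath


def pfn_for(storage_root, lfn):
    """Apply the Replica Catalog constraint: PFN = <local storage area>/LFN."""
    return posixpath.join(storage_root, lfn.lstrip("/"))


def partition_for(path, free_bytes_by_mount):
    """Return the mount point holding `path` (longest matching prefix), or None."""
    best = None
    for mount in free_bytes_by_mount:
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            if best is None or len(mount) > len(best):
                best = mount
    return best


def can_replicate(lfn, size, storage_root, free_bytes_by_mount):
    """True if the partition that must hold this LFN has room for `size` bytes."""
    mount = partition_for(pfn_for(storage_root, lfn), free_bytes_by_mount)
    return mount is not None and free_bytes_by_mount[mount] >= size


GB = 10**9
# Two hypothetical SEs with the same storage root but different free space.
se_a = {"/storage/cms": 60 * GB, "/storage/atlas": 5 * GB}
se_b = {"/storage/cms": 1 * GB, "/storage/atlas": 90 * GB}
lfn = "cms/run42/events.dat"

print(can_replicate(lfn, 10 * GB, "/storage", se_a))  # True
print(can_replicate(lfn, 10 * GB, "/storage", se_b))  # False
```

In this toy setup the second SE has 90 GB free in total, yet the replication is refused because the LFN pins the file to the one partition that is nearly full. This is the planning problem the slide describes for sites that only have small partitions.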
Resource Broker woes • The Resource Broker is the central intelligence of the GRID • It interacts with most of the other services in a complex way • Any problem on the RB actually stops the whole system • A lot of effort was put into fixing problems but it is still the most sensitive spot in the GRID • Almost every day one of the RBs must be stopped and all databases cleaned • The problem is related to database corruption caused by a non-thread-safe library. Fixes are already available and will be deployed with EDG 2. • All jobs being managed by the RB at the time are lost
EDG Developer’s view of computing • From the previous slides and (oh, too many!) other examples, we inferred the following “EDG Developer’s view of computing” • Service node characteristics: CPU = ∞ SPECInt, RAM = ∞ MB • Disk storage characteristics: single disk partition of ∞ GB (transparently growable) • Network connectivity: ∞ bandwidth with 0 RTT to anywhere in the world, up 24/7 • These “approximations” are reasonable in a proof-of-principle system, but when moving to the real world these expectations should be revisited
Conclusions • EDG testbeds have been in operation for almost two years • They were a source of continuous and fundamental feedback to the developers (like: “this doesn’t work”, “this doesn’t scale”) • LHC experiments and other project partners were able to taste the flavor (perfume? stink?) of a realistic GRID environment • The EDG testbed was too often advertised as a working production system. This was a source of misunderstandings and had bad consequences for several people: • users were frustrated by missing or non-working functionality • testbed managers were over-stressed by upset users asking for a working system 24/7 and were not given the means to fulfill this request • Thanks to the EU and to our national funding agencies for their support of this work