Debugging and hardening grid middleware in the real world
David Smith
LCG Certification & Testing section
GridPP Collaboration meeting – 3 June 2004
Overview
• The Certification & Testing section
• Only a small part of the section's activity is debugging or end-to-end performance analysis and problem identification
  • But in this presentation I concentrate on this type of work
• Scope is limited to the current or recent LCG software
• Will discuss some of the details of technical problems encountered in the LCG software
• A selection of issues will be shown
  • many of which are resolved, some of which are still open
• It is essential that debugging and work to address problems is shared with developers and other people outside the group
  • However, a significant amount is done within the group, and I will concentrate on that work in this presentation
David.Smith@cern.ch
Topics
• The software components
• Will mostly discuss problems related to job management in the lowest level of the software: Globus & Condor-G
  • We use Globus 2.4, but many of the issues are also relevant to the Globus 3.x release
• Also some identification of problems in higher-level services, e.g. WMS and Data Management
• Globus problems:
  • Leaks, both memory and file descriptor
  • Logic problems – usually state machine problems or network callback functions
  • Limitations in functionality we want, either through the available implementation or the service design
• Condor-G: used by the WMS software for job submission and tracking
  • Good response from the Condor team for specific Condor problems
  • Shared work to address some problems in the area of Condor and Globus interactions
Globus
• Basically Globus is used for:
  • Secure connections
  • Job submission via GRAM
  • Some data access, e.g. the GridFTP server and client and GASS transfer
• Many 'small' things to address while trying to build a robust, large-scale service:
  • Memory and file descriptor leaks. Long-lived services are particularly sensitive, e.g. the GAHP server on the broker machine
• Memory leaks addressed in several Globus areas:
  • globus_gass_server_ez
  • globus_gass_transfer_http
  • globus_gram_client
  • globus_io_common
  • globus_gsi_proxy
• File descriptor leak in import_cred (part of GSSAPI)
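The descriptor leaks above follow a common pattern: a handler acquires two descriptors (e.g. both ends of an imported credential channel) but releases only one on every code path. The sketch below is a hedged illustration in Python, not Globus code; `FdTracker` and the handler names are invented for the example, and real pipes stand in for GSSAPI credentials.

```python
import os

class FdTracker:
    """Counts descriptors opened through this tracker that were never closed."""
    def __init__(self):
        self.open_fds = set()

    def open_pipe(self):
        r, w = os.pipe()
        self.open_fds.update((r, w))
        return r, w

    def close(self, fd):
        os.close(fd)
        self.open_fds.discard(fd)

def leaky_handler(tracker):
    # Buggy: acquires two descriptors but releases only one,
    # analogous to the import_cred leak described above.
    r, w = tracker.open_pipe()
    tracker.close(w)      # only one side is closed ...
    return r              # ... the read side leaks on every call

def fixed_handler(tracker):
    r, w = tracker.open_pipe()
    try:
        tracker.close(w)
    finally:
        tracker.close(r)  # always release both descriptors

leaky = FdTracker()
for _ in range(100):
    leaky_handler(leaky)
leaked = len(leaky.open_fds)       # grows by one per request

fixed = FdTracker()
for _ in range(100):
    fixed_handler(fixed)
remaining = len(fixed.open_fds)    # stays at zero
```

In a long-lived service such as the GAHP server, the leaky variant eventually exhausts the per-process descriptor limit, which is why these leaks matter far more there than in short-lived clients.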
Globus II
• Other problems
  • TCP port bind and listen race in globus_io_tcp
  • Set the gatekeeper port reusable to allow gatekeeper daemon restart
• Fixes in state machines that were giving rise to various problems:
  • Memory + descriptor leaks (sometimes left behind after closed connections)
  • Associated problems with 'constantly running' services: a descriptor at EOF is repeatedly selected but the callback never closes it
  • In the jobmanager – e.g. job managers terminated during stage-out could restart in the wrong state
• A few other problems, e.g. mutex & shutdown handling that could occasionally hang services such as the jobmanager
• Most of these problems were identified with large job 'storms' on the CT testbed
Globus III
• GSSAPI module activation
  • GSSAPI has some problems with module activation
  • Quite a severe problem for us – we needed to work around it
  • Currently still with Globus for resolution within the toolkit
• Design and implementation: load on the gatekeeper host
  • Generally, connections are handled by the gatekeeper process
  • Job-specific tasks are handled by the jobmanager process, one process for each job
  • A significant problem for large-scale use
  • To address this, Condor-G provides a facility to kill the jobmanager for a given job while it is pending or (perhaps) while running
  • A very useful feature: without it, it is difficult to have more than ~100 jobs per compute element
Globus IV
• Some work was required to harden the jobmanager stopping facility
• Work was needed on the service started on the gatekeeper to monitor a group of jobs (grid_monitor)
• Optimisation of the condor_gridmanager logic
  • to avoid restarting jobmanagers in error situations
• Protection against some collisions in the transfer of the job state list between the gatekeeper and the Condor-G submission machine (e.g. the broker)
• Optimisation of parameters controlling the communication between the gridmanager and the GAHP server
• Jobmanager scripts
  • Need a shared filesystem between batch workers and the compute element
  • The standard interface queries the batch system frequently: at least one query per grid job in the queue per poll interval
• To address the above issues, alternative jobmanagers were written
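The polling cost above is the key scaling issue: one batch-system query per grid job per poll interval. The usual remedy, and the one the batch status cache on the following slides embodies, is to make a single bulk query and answer all per-job lookups from the cached snapshot until it expires. A hedged sketch (class name, TTL and the `fake_qstat` stand-in are all illustrative, not the LCG implementation):

```python
import time

class BatchStatusCache:
    """Cache one bulk batch-system query (think a single 'qstat' call)
    and answer per-job status lookups from the snapshot until the
    TTL expires, instead of querying once per job per poll."""
    def __init__(self, bulk_query, ttl=30.0):
        self.bulk_query = bulk_query   # callable returning {job_id: state}
        self.ttl = ttl
        self._snapshot = {}
        self._stamp = -float("inf")    # force a refresh on first use

    def status(self, job_id):
        now = time.monotonic()
        if now - self._stamp > self.ttl:
            self._snapshot = self.bulk_query()  # one query for all jobs
            self._stamp = now
        return self._snapshot.get(job_id, "UNKNOWN")

calls = {"n": 0}
def fake_qstat():
    # Stand-in for one invocation of the batch system's status command.
    calls["n"] += 1
    return {"job1": "RUNNING", "job2": "QUEUED"}

cache = BatchStatusCache(fake_qstat, ttl=60.0)
states = [cache.status(j) for j in ("job1", "job2", "job3")]
# Three lookups, but only one batch-system query was made.
```

With N jobs in the queue this turns N queries per poll into one, at the cost of status being up to one TTL stale.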
LCG Jobmanagers
• The LCG jobmanager scripts
  • Similar structure to the existing Globus scripts
  • Perl scripts that interface between the batch system command-line tools and the jobmanager executable
  • OO Perl provides abstract methods that are common between different batch systems
• The LCG versions primarily remove the need for a shared filesystem between batch workers and the compute element
  • Decouple the batch system interaction from the Globus actions
• Intended to be compatible with the existing Globus jobmanager without modification
• Work with or without Condor-G job management
LCG Jobmanagers II
• Add task queues and service them by dedicated processes
• [Diagram: a job is staged in and placed on the export & submission queue, from which a service process submits it to the batch system; the grid monitor polls job status for the user via a batch status cache; an import queue handles stage-out and a cleanup queue handles cleanup, each serviced by its own dedicated process]
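The queue-and-worker structure above — one queue per task type, each drained by a dedicated service process — can be sketched as follows. This is an illustration only: threads and `queue.Queue` stand in for the dedicated processes of the real design, and the cleanup queue is used as the example.

```python
import queue
import threading

def service_loop(tasks, done):
    """Dedicated service loop for one task queue (here a cleanup-queue
    stand-in), so batch interaction is decoupled from the Globus side."""
    while True:
        item = tasks.get()
        if item is None:        # sentinel: shut the worker down
            tasks.task_done()
            break
        done.append(item)       # real version: run the cleanup action
        tasks.task_done()

# One queue per task type (export/submit, import, cleanup), each with
# its own worker; only the cleanup queue is shown here.
cleanup_q = queue.Queue()
done = []
worker = threading.Thread(target=service_loop, args=(cleanup_q, done))
worker.start()

for job in ("job1", "job2", "job3"):
    cleanup_q.put(job)          # producers just enqueue and return
cleanup_q.join()                # wait until the worker has drained it

cleanup_q.put(None)
worker.join()
```

The point of the design is that enqueueing is cheap and non-blocking for the caller, while the slow batch-system interaction happens asynchronously in the service process.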
GAHP server
• The GAHP server
  • is a thin interface between the gridmanager and the GRAM protocol used to communicate between the broker and the CEs
  • Many Globus problems showed up here, since the GAHP server exercises several Globus routines and is also potentially long lived
  • Runs a GASS server
• GAHP GASS server
  • Uses the Globus gass_server_ez routines to operate a GASS server (allows the jobmanager to read and write files on the broker)
  • Found a limitation in the way network connections are accepted
  • Only one network connection at a time can be accepted and enter the internal Globus 'established' state. This typically caused timeouts on GASS requests, which in turn caused various job failure modes
  • To overcome this, modified GASS server handling routines had to be written for the GAHP server
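The fix for the one-connection-at-a-time limitation is to re-arm the accept immediately and hand each established connection to its own handler, so a slow request cannot block the next accept. A hedged sketch using Python sockets and threads (the real fix is inside the C GASS server routines; the echo protocol here is invented purely for illustration):

```python
import socket
import threading

def handle(conn):
    """Serve one request in its own thread, so the accept loop is
    free to take the next connection immediately."""
    data = conn.recv(64)
    conn.sendall(b"ok:" + data)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(5)
host, port = server.getsockname()

def accept_loop(n):
    for _ in range(n):
        conn, _addr = server.accept()
        # Re-arm accept at once; the handler runs concurrently.
        threading.Thread(target=handle, args=(conn,)).start()

acceptor = threading.Thread(target=accept_loop, args=(2,))
acceptor.start()

replies = []
for name in (b"a", b"b"):
    c = socket.create_connection((host, port))
    c.sendall(name)
    replies.append(c.recv(64))
    c.close()

acceptor.join()
server.close()
```

In the serialised original, the second GASS request could time out while the first was still being handled; decoupling accept from handling removes that failure mode.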
CE Information provider
• Using the information provider from EDG (WP4)
  • A single implementation that supports one of PBS, LSF or Condor
  • Now with several LCG changes needed to support the batch system configurations found at various production sites
    • PBS routing queues
    • Greater ability to handle diverse LSF configurations
  • In the future may move to a different model
    • A single generic information provider shell
    • A specific module for each type of batch system
• Ranking derived from published information is problematic
  • By default, ranking is performed by the broker on a metric published in the information system
  • The metric is based on passive measurements of the state of the batch system (i.e. making the measurement does not change the state of the batch system)
  • The metric is defined per cluster (usually a batch system queue)
  • In practice it is difficult to generalise this method to produce meaningful metrics for the rich variety of batch scheduling policies found across sites
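To see why the ranking is fragile, it helps to sketch what the broker does with the published metric: it simply picks the compute element whose per-cluster value looks best, so any CE whose scheduler makes the metric misleading gets systematically over- or under-selected. The attribute name `EstimatedResponseTime` below is illustrative (lower assumed better, as for a queue-wait estimate), not necessarily the schema the broker uses:

```python
def rank_ces(ces):
    """Pick the compute element with the best published metric.
    Entries whose metric is unpublished are skipped; returns None
    if nothing is rankable."""
    usable = [ce for ce in ces if ce.get("EstimatedResponseTime") is not None]
    if not usable:
        return None
    return min(usable, key=lambda ce: ce["EstimatedResponseTime"])

ces = [
    {"name": "ce1.example.org", "EstimatedResponseTime": 120},
    {"name": "ce2.example.org", "EstimatedResponseTime": 30},
    {"name": "ce3.example.org", "EstimatedResponseTime": None},
]
best = rank_ces(ces)   # selects ce2.example.org
```

The whole decision rests on one passively measured number per cluster; fair-share schedulers, routing queues and per-VO limits all break the assumption that this number predicts the experience of the next submitted job.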
Miscellaneous I
• GridFTP
  • Firewall issues due to the current specification for E-block mode
    • An incoming data connection is required to transfer data onto a machine from a remote source
    • Can be avoided by either not using E-block mode or by avoiding transfers from a remote source to a destination behind a firewall
    • To be addressed in a future GridFTP specification (from GGF), which will include a new mode of operation: X-block mode
  • Performance for some operations, such as checking if a directory or file exists
    • Already addressed in Globus 3.2, but we have yet to evaluate this
• Performance of proxy delegation in the CoG
  • The size of the delegated proxy was taken from the service security context (i.e. it was based on the host certificate key size)
  • Impacted performance of tests of the d-Cache GridFTP server at CERN
Miscellaneous II
• Workload Management System problems
  • Usually some work is needed to disentangle Globus, Condor-G and WMS issues
  • The WMS team provides a good response to issues we think are relevant to their software. Typical problems are:
    • Bugs in existing features
    • Functionality changes to address specific, important problems – often problems seen in the full production system
    • Sometimes this requires a non-trivial amount of work by the developers
      • e.g. caching in the WM/NS of queries to the information system for the ranking metric at various compute elements
Summary
• Have discussed problems, not debugging techniques or tools
  • Most of the problems identified could be reproduced or understood on the certification testbed facility
  • For certain types of problem, such as the firewall issues, it is useful to have remote diagnostics such as traces and reports from remote grid nodes
  • Some problems may become apparent only in the deployed system
• Have given examples of the types of problem found while certifying LCG software, or problem reports fed back from the deployed system
  • Both bugs in the traditional sense and also some issues of end-to-end performance
    • which are sometimes solved with a relatively localised change
    • and sometimes reflect consequences of a basic architectural choice
• Many issues are tackled in the CT section, but in any case technical observations are fed back to the appropriate development team
  • VDT (Condor-G & Globus)
  • Sometimes Globus directly, for detailed discussion of some toolkit design questions
  • The WMS team for problems relating to the workload management components
Summary II
• Data management and storage
  • Not discussed much in this presentation, but there is a lot of activity in progress
  • Work on a possible solution for managed disc space has been ongoing this year within the CT section
    • A collaboration between developers and LCG
    • But lots of problems in software, packaging, performance, stability and ease of use remain to be resolved
  • Development of additional data management tools, as needed by the experiments
• The Deployment section of LCG is also active in tackling issues found during deployment or operation of the software
  • Scaling of the information system (based on OpenLDAP)
  • A modular information provider (for the SE and later for the CE)