Debugging and hardening grid middleware in the real world
David Smith
LCG Certification & Testing section
GridPP Collaboration meeting – 3 June 2004
Overview
• The Certification & Testing section
• Only a small part of the section's activity is debugging or end-to-end performance analysis and problem identification
  • But in this presentation I concentrate on this type of work
• Scope is limited to the current or recent LCG software
• Will discuss some of the details of technical problems encountered in the LCG software
• A selection of issues will be shown
  • many of which are resolved, some of which are still open
• It is essential that debugging and work to address problems is shared with developers and other people outside the group
  • However, a significant amount is done within the group, and I will concentrate on that work in this presentation
David.Smith@cern.ch
Topics
• The software components
• Will mostly discuss problems related to job management in the lowest level of the software: Globus & Condor-G
  • We use Globus 2.4, but many of the issues are also relevant to the Globus 3.x release
• Also some identification of problems in higher-level services, e.g. WMS and Data Management
• Globus problems:
  • Leaks, both memory and file descriptor
  • Logic problems – usually state machine problems or network callback functions
  • Limitations in functionality we want, either through the available implementation or the service design
• Condor-G: used by the WMS software for job submission and tracking
  • Good response from the Condor team for specific Condor problems
  • Shared work to address some problems in the area of Condor and Globus interactions
Globus
• Basically Globus is used for:
  • Secure connections
  • Job submission via GRAM
  • Some data access, e.g. the GridFTP server and client and GASS transfer
• Many 'small' things to address while trying to build a robust, large-scale service:
  • Memory and file descriptor leaks. Long-lived services are particularly sensitive, e.g. the GAHP server on the broker machine
• Memory leaks addressed in several Globus areas:
  • globus_gass_server_ez
  • globus_gass_transfer_http
  • globus_gram_client
  • globus_io_common
  • globus_gsi_proxy
• File descriptor leak in import_cred (part of GSSAPI)
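The descriptor leaks above follow a common pattern: a handler acquires two descriptors (e.g. both ends of an imported credential channel) but releases only one on every code path. The sketch below is a hedged illustration in Python, not Globus code; `FdTracker` and the handler names are invented for the example, and real pipes stand in for GSSAPI credentials.

```python
import os

class FdTracker:
    """Counts descriptors opened through this tracker that were never closed."""
    def __init__(self):
        self.open_fds = set()

    def open_pipe(self):
        r, w = os.pipe()
        self.open_fds.update((r, w))
        return r, w

    def close(self, fd):
        os.close(fd)
        self.open_fds.discard(fd)

def leaky_handler(tracker):
    # Buggy: acquires two descriptors but releases only one,
    # analogous to the import_cred leak described above.
    r, w = tracker.open_pipe()
    tracker.close(w)      # only one side is closed ...
    return r              # ... the read side leaks on every call

def fixed_handler(tracker):
    r, w = tracker.open_pipe()
    try:
        tracker.close(w)
    finally:
        tracker.close(r)  # always release both descriptors

leaky = FdTracker()
for _ in range(100):
    leaky_handler(leaky)
leaked = len(leaky.open_fds)       # grows by one per request

fixed = FdTracker()
for _ in range(100):
    fixed_handler(fixed)
remaining = len(fixed.open_fds)    # stays at zero
```

In a long-lived service such as the GAHP server, the leaky variant eventually exhausts the per-process descriptor limit, which is why these leaks matter far more there than in short-lived clients.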
Globus II
• Other problems
  • TCP port bind and listen race in globus_io_tcp
  • Set the gatekeeper port reusable to allow gatekeeper daemon restart
• Fixes in state machines that were giving rise to various problems:
  • Memory + descriptor leaks (sometimes left behind after closed connections)
  • Associated problems with 'constantly running' services: a descriptor at EOF is repeatedly selected but the callback never closes it
  • In the jobmanager – e.g. job managers terminated during stage-out could restart in the wrong state
• A few other problems, e.g. mutex & shutdown handling that could occasionally hang services such as the jobmanager
• Most of these problems were identified with large job 'storms' on the CT testbed
Globus III
• GSSAPI module activation
  • GSSAPI has some problems with module activation
  • Quite a severe problem for us – we needed to work around it
  • Currently still with Globus for resolution within the toolkit
• Design and implementation: load on the gatekeeper host
  • Generally, connections are handled by the gatekeeper process
  • Job-specific tasks are handled by the jobmanager process, one process for each job
  • A significant problem for large-scale use
  • To address this, Condor-G provides a facility to kill the jobmanager for a given job while it is pending or (perhaps) while running
  • A very useful feature: without it, it is difficult to have more than ~100 jobs per compute element
Globus IV
• Some work was required to harden the jobmanager stopping facility
• Work was needed on the service started on the gatekeeper to monitor a group of jobs (grid_monitor)
• Optimisation of the condor_gridmanager logic
  • to avoid restarting jobmanagers in error situations
• Protection against some collisions in the transfer of the job state list between the gatekeeper and the Condor-G submission machine (e.g. the broker)
• Optimisation of parameters controlling the communication between the gridmanager and the GAHP server
• Jobmanager scripts
  • Need a shared filesystem between batch workers and the compute element
  • The standard interface queries the batch system frequently: at least one query per grid job in the queue per poll interval
• To address the above issues, alternative jobmanagers were written
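The polling cost above is the key scaling issue: one batch-system query per grid job per poll interval. The usual remedy, and the one the batch status cache on the following slides embodies, is to make a single bulk query and answer all per-job lookups from the cached snapshot until it expires. A hedged sketch (class name, TTL and the `fake_qstat` stand-in are all illustrative, not the LCG implementation):

```python
import time

class BatchStatusCache:
    """Cache one bulk batch-system query (think a single 'qstat' call)
    and answer per-job status lookups from the snapshot until the
    TTL expires, instead of querying once per job per poll."""
    def __init__(self, bulk_query, ttl=30.0):
        self.bulk_query = bulk_query   # callable returning {job_id: state}
        self.ttl = ttl
        self._snapshot = {}
        self._stamp = -float("inf")    # force a refresh on first use

    def status(self, job_id):
        now = time.monotonic()
        if now - self._stamp > self.ttl:
            self._snapshot = self.bulk_query()  # one query for all jobs
            self._stamp = now
        return self._snapshot.get(job_id, "UNKNOWN")

calls = {"n": 0}
def fake_qstat():
    # Stand-in for one invocation of the batch system's status command.
    calls["n"] += 1
    return {"job1": "RUNNING", "job2": "QUEUED"}

cache = BatchStatusCache(fake_qstat, ttl=60.0)
states = [cache.status(j) for j in ("job1", "job2", "job3")]
# Three lookups, but only one batch-system query was made.
```

With N jobs in the queue this turns N queries per poll into one, at the cost of status being up to one TTL stale.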
LCG Jobmanagers
• The LCG jobmanager scripts
  • Similar structure to the existing Globus scripts
  • Perl scripts that interface between the batch system command-line tools and the jobmanager executable
  • OO Perl provides abstract methods that are common between different batch systems
• The LCG versions primarily remove the need for a shared filesystem between batch workers and the compute element
  • Decouple the batch system interaction from the Globus actions
• Intended to be compatible with the existing Globus jobmanager without modification
• Work with or without Condor-G job management
LCG Jobmanagers II
• Add task queues and service them by dedicated processes
• [Diagram: a job is staged in and placed on the export & submission queue, from which a service process submits it to the batch system; the grid monitor polls job status for the user via a batch status cache; an import queue handles stage-out and a cleanup queue handles cleanup, each serviced by its own dedicated process]
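The queue-and-worker structure above — one queue per task type, each drained by a dedicated service process — can be sketched as follows. This is an illustration only: threads and `queue.Queue` stand in for the dedicated processes of the real design, and the cleanup queue is used as the example.

```python
import queue
import threading

def service_loop(tasks, done):
    """Dedicated service loop for one task queue (here a cleanup-queue
    stand-in), so batch interaction is decoupled from the Globus side."""
    while True:
        item = tasks.get()
        if item is None:        # sentinel: shut the worker down
            tasks.task_done()
            break
        done.append(item)       # real version: run the cleanup action
        tasks.task_done()

# One queue per task type (export/submit, import, cleanup), each with
# its own worker; only the cleanup queue is shown here.
cleanup_q = queue.Queue()
done = []
worker = threading.Thread(target=service_loop, args=(cleanup_q, done))
worker.start()

for job in ("job1", "job2", "job3"):
    cleanup_q.put(job)          # producers just enqueue and return
cleanup_q.join()                # wait until the worker has drained it

cleanup_q.put(None)
worker.join()
```

The point of the design is that enqueueing is cheap and non-blocking for the caller, while the slow batch-system interaction happens asynchronously in the service process.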
GAHP server
• The GAHP server
  • is a thin interface between the gridmanager and the GRAM protocol used to communicate between the broker and the CEs
  • Many Globus problems showed up here, since the GAHP server exercises several Globus routines and is also potentially long lived
  • Runs a GASS server
• GAHP GASS server
  • Uses the Globus gass_server_ez routines to operate a GASS server (allows the jobmanager to read and write files on the broker)
  • Found a limitation in the way network connections are accepted
  • Only one network connection at a time can be accepted and enter the internal Globus 'established' state. This typically caused timeouts on GASS requests, which in turn caused various job failure modes
  • To overcome this, modified GASS server handling routines had to be written for the GAHP server
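The fix for the one-connection-at-a-time limitation is to re-arm the accept immediately and hand each established connection to its own handler, so a slow request cannot block the next accept. A hedged sketch using Python sockets and threads (the real fix is inside the C GASS server routines; the echo protocol here is invented purely for illustration):

```python
import socket
import threading

def handle(conn):
    """Serve one request in its own thread, so the accept loop is
    free to take the next connection immediately."""
    data = conn.recv(64)
    conn.sendall(b"ok:" + data)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(5)
host, port = server.getsockname()

def accept_loop(n):
    for _ in range(n):
        conn, _addr = server.accept()
        # Re-arm accept at once; the handler runs concurrently.
        threading.Thread(target=handle, args=(conn,)).start()

acceptor = threading.Thread(target=accept_loop, args=(2,))
acceptor.start()

replies = []
for name in (b"a", b"b"):
    c = socket.create_connection((host, port))
    c.sendall(name)
    replies.append(c.recv(64))
    c.close()

acceptor.join()
server.close()
```

In the serialised original, the second GASS request could time out while the first was still being handled; decoupling accept from handling removes that failure mode.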
CE Information provider
• Using the information provider from EDG (WP4)
  • A single implementation that supports one of PBS, LSF or Condor
  • Now with several LCG changes needed to support the batch system configurations found at various production sites
    • PBS routing queues
    • Greater ability to handle diverse LSF configurations
  • In the future may move to a different model
    • A single generic information provider shell
    • A specific module for each type of batch system
• Ranking derived from published information is problematic
  • By default, ranking is performed by the broker on a metric published in the information system
  • The metric is based on passive measurements of the state of the batch system (i.e. making the measurement does not change the state of the batch system)
  • The metric is defined per cluster (usually a batch system queue)
  • In practice it is difficult to generalise this method to produce meaningful metrics for the rich variety of batch scheduling policies found across sites
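To see why the ranking is fragile, it helps to sketch what the broker does with the published metric: it simply picks the compute element whose per-cluster value looks best, so any CE whose scheduler makes the metric misleading gets systematically over- or under-selected. The attribute name `EstimatedResponseTime` below is illustrative (lower assumed better, as for a queue-wait estimate), not necessarily the schema the broker uses:

```python
def rank_ces(ces):
    """Pick the compute element with the best published metric.
    Entries whose metric is unpublished are skipped; returns None
    if nothing is rankable."""
    usable = [ce for ce in ces if ce.get("EstimatedResponseTime") is not None]
    if not usable:
        return None
    return min(usable, key=lambda ce: ce["EstimatedResponseTime"])

ces = [
    {"name": "ce1.example.org", "EstimatedResponseTime": 120},
    {"name": "ce2.example.org", "EstimatedResponseTime": 30},
    {"name": "ce3.example.org", "EstimatedResponseTime": None},
]
best = rank_ces(ces)   # selects ce2.example.org
```

The whole decision rests on one passively measured number per cluster; fair-share schedulers, routing queues and per-VO limits all break the assumption that this number predicts the experience of the next submitted job.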
Miscellaneous I
• GridFTP
  • Firewall issues due to the current specification for E-block mode
    • An incoming data connection is required to transfer data onto a machine from a remote source
    • Can be avoided by either not using E-block mode or by avoiding transfers from a remote source to a destination behind a firewall
    • To be addressed in a future GridFTP specification (from GGF), which will include a new mode of operation: X-block mode
  • Performance for some operations, such as checking if a directory or file exists
    • Already addressed in Globus 3.2, but we have yet to evaluate this
• Performance of proxy delegation in the CoG
  • The size of the delegated proxy was taken from the service security context (i.e. it was based on the host certificate key size)
  • Impacted performance of tests of the d-Cache GridFTP server at CERN
Miscellaneous II
• Workload Management System problems
  • Usually some work is needed to disentangle Globus, Condor-G and WMS issues
  • The WMS team provides a good response to issues we think are relevant to their software. Typical problems are:
    • Bugs in existing features
    • Functionality changes to address specific, important problems – often problems seen in the full production system
    • Sometimes this requires a non-trivial amount of work by the developers
      • e.g. caching in the WM/NS of queries to the information system for the ranking metric at various compute elements
Summary
• Have discussed problems, not debugging techniques or tools
  • Most of the problems identified could be reproduced or understood on the certification testbed facility
  • For certain types of problem, such as the firewall issues, it is useful to have remote diagnostics such as traces and reports from remote grid nodes
  • Some problems may become apparent only in the deployed system
• Have given examples of the types of problem found while certifying LCG software, or problem reports fed back from the deployed system
  • Both bugs in the traditional sense and also some issues of end-to-end performance
    • which are sometimes solved with a relatively localised change
    • and sometimes reflect consequences of a basic architectural choice
• Many issues are tackled in the CT section, but in any case technical observations are fed back to the appropriate development team
  • VDT (Condor-G & Globus)
  • Sometimes Globus directly, for detailed discussion of some toolkit design questions
  • The WMS team for problems relating to the workload management components
Summary II
• Data management and storage
  • Not discussed much in this presentation, but there is a lot of activity in progress
  • Work on a possible solution for managed disc space has been ongoing this year within the CT section
    • A collaboration between developers and LCG
    • But lots of problems in software, packaging, performance, stability and ease of use remain to be resolved
  • Development of additional data management tools, as needed by the experiments
• The Deployment section of LCG is also active in tackling issues found during deployment or operation of the software
  • Scaling of the information system (based on OpenLDAP)
  • A modular information provider (for the SE and later for the CE)