Evaluation of the Globus GRAM Service
Massimo Sgaravatto, INFN Padova
Evaluation of GRAM Service
[Diagram: jobs are submitted (using Globus tools) to the GRAM services at Site1, Site2 and Site3, which sit in front of Condor, LSF and PBS respectively; each GRAM provides the GIS with information on the characteristics and status of its local resources]
Evaluation of GRAM Service
• Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit); see the sketch below
• GRAM as a uniform interface to different underlying resource management systems
• "Cooperation" between GRAM and GIS
• Evaluation of RSL as a uniform language to specify resources
• Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines
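A minimal submission sketch using the tools named above; the host and jobmanager names are illustrative, and the RSL file is of the form shown later in these slides:

  # run a command interactively through GRAM (output comes back to the terminal)
  % globus-job-run pc2.pd.infn.it/jobmanager-lsf /bin/hostname
  # submit in batch mode; a job contact URL is printed for later status queries
  % globus-job-submit pc2.pd.infn.it/jobmanager-lsf /bin/hostname
  # submit a job described by an RSL file
  % globusrun -b -r pc2.pd.infn.it/jobmanager-lsf -f file.rsl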
GRAM & fork system call
[Diagram: a Globus client submits to a Globus server, which runs the job locally via the fork system call]
GRAM & Condor
[Diagram: a Globus client submits to a Globus server on the Condor front-end machine, which forwards the job to the Condor pool]
GRAM & Condor
• Tests considering (see the sketch below for how the two job types are prepared):
  • Standard Condor jobs (relinked with the Condor library)
    • INFN WAN Condor pool configured as Globus resource
      • ~ 200 machines spread across different sites
      • Heterogeneous environment
      • No single file system and UID domain
  • Vanilla jobs ("normal" jobs)
    • PC farm configured as Globus resource
      • Single file system and UID domain
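As background, a sketch of how the two job types are typically prepared on the Condor side; this is standard Condor usage rather than commands taken from these tests, and the file names are illustrative:

  # standard Condor job: relink the application with the Condor library
  # (enables checkpointing and migration, useful on the heterogeneous WAN pool)
  % condor_compile gcc -o cmsim.std cmsim.c
  # vanilla job: the unmodified ("normal") binary is used as-is, which is why
  # a single file system and UID domain (the PC farm) keeps things simple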
GRAM & LSF
[Diagram: a Globus client submits to a Globus server on the LSF front-end machine, which dispatches the job to the LSF cluster]
Results
• Some bugs found and fixed (fixes included in the INFNGRID 1.1 distribution)
  • Standard output and error for vanilla Condor jobs
  • globus-job-status
  • …
• Some bugs can be solved without major re-design and/or re-implementation:
  • For LSF the RSL parameter (count=x) is translated into: bsub -n x … (see the sketch below)
    • This just allocates x processors and dispatches the job to the first one
    • It is meant for parallel applications
    • It should be: bsub … repeated x times
    • Maybe we don't need to solve this problem (see later…)
  • …
• Two major problems:
  • Scalability
  • Fault tolerance
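In plain LSF terms, the (count=x) issue looks like this; the executable path is reused from the RSL examples and the generated command lines are simplified:

  # what the LSF jobmanager currently generates for (count=4):
  % bsub -n 4 /diskCms/startcmsim.sh      # allocates 4 slots, runs the job once
  # what (count=4) should mean for independent jobs:
  % for i in 1 2 3 4; do bsub /diskCms/startcmsim.sh; done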
Globus GRAM Architecture
[Diagram: the client (pc1) contacts the Globus front-end machine (pc2), where a jobmanager hands the job to LSF/Condor/PBS/…]
pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl
file.rsl:
& (executable=/diskCms/startcmsim.sh)
  (stdin=/diskCms/PythiaOut/filename)
  (stdout=/diskCms/Cmsim/filename)
  (count=1)
Scalability
• One jobmanager for each globusrun
• If I want to submit 1000 jobs ???
  • 1000 globusrun
  • 1000 jobmanagers running on the front-end machine !!!
• % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl
  file.rsl:
  & (executable=/diskCms/startcmsim.sh)
    (stdin=/diskCms/PythiaOut/filename)
    (stdout=/diskCms/CmsimOut/filename)
    (count=1000)
  • It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
    • Condor allows this with $(Process) (see the sketch below)
  • Problems with job monitoring (globus-job-status)
  • Therefore (count=x) with x>1 is not very useful !
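For comparison, a minimal Condor submit description using $(Process) to give each of the 1000 jobs its own input and output file; the paths reuse the RSL example and are illustrative:

  executable = /diskCms/startcmsim.sh
  universe   = vanilla
  input      = /diskCms/PythiaOut/file.$(Process)
  output     = /diskCms/CmsimOut/file.$(Process)
  queue 1000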
Fault tolerance
• The jobmanager is not persistent
• If the jobmanager can't be contacted, Globus assumes that the job(s) has been completed
• Example of the problem (see the sketch below):
  • Submission of n jobs on a cluster managed by a local resource management system
  • Reboot of the front-end machine
  • The jobmanager(s) doesn't restart
  • Orphan jobs: Globus assumes that the jobs have been successfully completed
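A sketch of the scenario using the tools already shown; the contact string and RSL file are the ones from the architecture slide:

  # batch submission returns a job contact that identifies the jobmanager
  % globusrun -b -r pc2.pd.infn.it/jobmanager-lsf -f file.rsl
  # ... the front-end machine reboots and the jobmanager does not restart ...
  # the contact can no longer be reached, so the job is assumed to be completed,
  # even if the underlying LSF/Condor job is still queued or running
  % globus-job-status <job contact>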
GRAM & GIS
• How do the local GRAMs provide the GIS with the characteristics and status of local resources?
• Tests performed considering:
  • Condor pool
  • LSF cluster
GRAM & LSF & GIS
[GIS output for the LSF cluster, annotated: "Must be fixed"]
Jobs & GIS
• Info on Globus jobs published in the GIS (these entries can be queried as sketched below):
  • User
    • Subject of certificate
    • Local user name
  • RSL string
  • Globus job id
  • LSF/Condor/… job id
  • Status: Run/Pending/…
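A query sketch, assuming the GIS is the LDAP-based MDS shipped with Globus; the host, port, base DN and filter are illustrative rather than the exact schema used in these tests:

  # dump what the local GIS publishes; the job entries carry the attributes
  # listed above (certificate subject, local user name, RSL string, job ids, status)
  % ldapsearch -h pc2.pd.infn.it -p 2135 -b "o=Grid" "(objectclass=*)"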
GRAM & GIS
• The information on characteristics and status of local resources and on jobs is not enough
  • As local resources we must consider farms, not single workstations
  • Other information (e.g. total and available CPU power) is needed
• Fortunately the default schema can be integrated with other info provided by specific agents (see the sketch below)
  • The needed information must be identified first
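A sketch of what such an agent could publish for a farm, written as an LDIF fragment; every attribute and object class name here is hypothetical, since the needed information still has to be identified:

  # hypothetical farm-level entry added to the GIS by a site-specific agent
  dn: farmName=cms-farm, dc=pd, dc=infn, dc=it, o=Grid
  objectclass: FarmInfo
  totalCpuPower: 12000
  availableCpuPower: 7500
  totalJobSlots: 64
  freeJobSlots: 22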
RSL
• We need a uniform language to specify resources across different resource management systems
• The RSL syntax model seems suitable to define even complicated resource specification expressions
• The common set of RSL attributes is often not sufficient
• Attributes not belonging to the common set are ignored (see the sketch below)
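A sketch of the limitation; (localScratchDisk=…) stands for any site-specific attribute outside the common set and is a hypothetical name:

  & (executable=/diskCms/startcmsim.sh)
    (count=1)
    (localScratchDisk=2000)

The job is submitted as if the last relation were not there: the attribute is ignored rather than rejected.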
RSL
• More flexibility is required
  • Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
  • Using the same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach (see the sketch below)
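For comparison, a minimal sketch in the Condor Class-Ads style; INFN_SITE is a hypothetical attribute an administrator could define and advertise, which users then reference with the same vocabulary:

  # machine configuration (resource administrator): advertise a new attribute
  INFN_SITE    = "Padova"
  STARTD_EXPRS = INFN_SITE

  # submit description (user): match and rank on the attributes the machines offer
  requirements = (INFN_SITE == "Padova") && (OpSys == "LINUX")
  rank         = Mips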
Next steps
• Bug fixes
• Modification of the Globus LSF scripts for the GIS
• Problem with (count=x) and LSF ???
• Tests with real applications and real environments (CMS fall production)
• Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it
  • Let's start with the information provided by the underlying resource management system
• Tests with the GRAM API
• Tests with other resource management systems are not necessary
• Scalability and robustness problems
  • Not so simple and straightforward !!!
  • Up to the Workload Management WP, possible collaboration with the Globus team and the Condor team