Evaluation of the Globus GRAM Service
Massimo Sgaravatto, INFN Padova
Evaluation of GRAM Service
[Diagram: jobs are submitted (using Globus tools) to the GRAM services at Site1, Site2 and Site3, which sit in front of Condor, LSF and PBS respectively; each GRAM provides the GIS with information on the characteristics and status of its local resources]
Evaluation of GRAM Service
• Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit); see the sketch below
• GRAM as a uniform interface to different underlying resource management systems
• "Cooperation" between GRAM and GIS
• Evaluation of RSL as a uniform language to specify resources
• Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines
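A minimal submission sketch using the tools named above; the host and jobmanager names are illustrative, and the RSL file is of the form shown later in these slides:

  # run a command interactively through GRAM (output comes back to the terminal)
  % globus-job-run pc2.pd.infn.it/jobmanager-lsf /bin/hostname
  # submit in batch mode; a job contact URL is printed for later status queries
  % globus-job-submit pc2.pd.infn.it/jobmanager-lsf /bin/hostname
  # submit a job described by an RSL file
  % globusrun -b -r pc2.pd.infn.it/jobmanager-lsf -f file.rsl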
GRAM & fork system call
[Diagram: a Globus client submits to a Globus server, which runs the job locally via the fork system call]
GRAM & Condor
[Diagram: a Globus client submits to a Globus server on the Condor front-end machine, which forwards the job to the Condor pool]
GRAM & Condor
• Tests considering (see the sketch below for how the two job types are prepared):
  • Standard Condor jobs (relinked with the Condor library)
    • INFN WAN Condor pool configured as Globus resource
      • ~ 200 machines spread across different sites
      • Heterogeneous environment
      • No single file system and UID domain
  • Vanilla jobs ("normal" jobs)
    • PC farm configured as Globus resource
      • Single file system and UID domain
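As background, a sketch of how the two job types are typically prepared on the Condor side; this is standard Condor usage rather than commands taken from these tests, and the file names are illustrative:

  # standard Condor job: relink the application with the Condor library
  # (enables checkpointing and migration, useful on the heterogeneous WAN pool)
  % condor_compile gcc -o cmsim.std cmsim.c
  # vanilla job: the unmodified ("normal") binary is used as-is, which is why
  # a single file system and UID domain (the PC farm) keeps things simple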
GRAM & LSF
[Diagram: a Globus client submits to a Globus server on the LSF front-end machine, which dispatches the job to the LSF cluster]
Results
• Some bugs found and fixed (fixes included in the INFNGRID 1.1 distribution)
  • Standard output and error for vanilla Condor jobs
  • globus-job-status
  • …
• Some bugs can be solved without major re-design and/or re-implementation:
  • For LSF the RSL parameter (count=x) is translated into: bsub -n x … (see the sketch below)
    • This just allocates x processors and dispatches the job to the first one
    • It is meant for parallel applications
    • It should be: bsub … repeated x times
    • Maybe we don't need to solve this problem (see later…)
  • …
• Two major problems:
  • Scalability
  • Fault tolerance
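In plain LSF terms, the (count=x) issue looks like this; the executable path is reused from the RSL examples and the generated command lines are simplified:

  # what the LSF jobmanager currently generates for (count=4):
  % bsub -n 4 /diskCms/startcmsim.sh      # allocates 4 slots, runs the job once
  # what (count=4) should mean for independent jobs:
  % for i in 1 2 3 4; do bsub /diskCms/startcmsim.sh; done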
Globus GRAM Architecture
[Diagram: the client (pc1) contacts the Globus front-end machine (pc2), where a jobmanager hands the job to LSF/Condor/PBS/…]
pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl
file.rsl:
& (executable=/diskCms/startcmsim.sh)
  (stdin=/diskCms/PythiaOut/filename)
  (stdout=/diskCms/Cmsim/filename)
  (count=1)
Scalability
• One jobmanager for each globusrun
• If I want to submit 1000 jobs ???
  • 1000 globusrun
  • 1000 jobmanagers running on the front-end machine !!!
• % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl
  file.rsl:
  & (executable=/diskCms/startcmsim.sh)
    (stdin=/diskCms/PythiaOut/filename)
    (stdout=/diskCms/CmsimOut/filename)
    (count=1000)
  • It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
    • Condor allows this with $(Process) (see the sketch below)
  • Problems with job monitoring (globus-job-status)
  • Therefore (count=x) with x>1 is not very useful !
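For comparison, a minimal Condor submit description using $(Process) to give each of the 1000 jobs its own input and output file; the paths reuse the RSL example and are illustrative:

  executable = /diskCms/startcmsim.sh
  universe   = vanilla
  input      = /diskCms/PythiaOut/file.$(Process)
  output     = /diskCms/CmsimOut/file.$(Process)
  queue 1000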
Fault tolerance
• The jobmanager is not persistent
• If the jobmanager can't be contacted, Globus assumes that the job(s) has been completed
• Example of the problem (see the sketch below):
  • Submission of n jobs on a cluster managed by a local resource management system
  • Reboot of the front-end machine
  • The jobmanager(s) doesn't restart
  • Orphan jobs: Globus assumes that the jobs have been successfully completed
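A sketch of the scenario using the tools already shown; the contact string and RSL file are the ones from the architecture slide:

  # batch submission returns a job contact that identifies the jobmanager
  % globusrun -b -r pc2.pd.infn.it/jobmanager-lsf -f file.rsl
  # ... the front-end machine reboots and the jobmanager does not restart ...
  # the contact can no longer be reached, so the job is assumed to be completed,
  # even if the underlying LSF/Condor job is still queued or running
  % globus-job-status <job contact>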
GRAM & GIS
• How do the local GRAMs provide the GIS with the characteristics and status of local resources?
• Tests performed considering:
  • Condor pool
  • LSF cluster
GRAM & LSF & GIS
[GIS output for the LSF cluster, annotated: "Must be fixed"]
Jobs & GIS
• Info on Globus jobs published in the GIS (these entries can be queried as sketched below):
  • User
    • Subject of certificate
    • Local user name
  • RSL string
  • Globus job id
  • LSF/Condor/… job id
  • Status: Run/Pending/…
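A query sketch, assuming the GIS is the LDAP-based MDS shipped with Globus; the host, port, base DN and filter are illustrative rather than the exact schema used in these tests:

  # dump what the local GIS publishes; the job entries carry the attributes
  # listed above (certificate subject, local user name, RSL string, job ids, status)
  % ldapsearch -h pc2.pd.infn.it -p 2135 -b "o=Grid" "(objectclass=*)"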
GRAM & GIS
• The information on characteristics and status of local resources and on jobs is not enough
  • As local resources we must consider farms, not single workstations
  • Other information (e.g. total and available CPU power) is needed
• Fortunately the default schema can be integrated with other info provided by specific agents (see the sketch below)
  • The needed information must be identified first
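A sketch of what such an agent could publish for a farm, written as an LDIF fragment; every attribute and object class name here is hypothetical, since the needed information still has to be identified:

  # hypothetical farm-level entry added to the GIS by a site-specific agent
  dn: farmName=cms-farm, dc=pd, dc=infn, dc=it, o=Grid
  objectclass: FarmInfo
  totalCpuPower: 12000
  availableCpuPower: 7500
  totalJobSlots: 64
  freeJobSlots: 22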
RSL
• We need a uniform language to specify resources across different resource management systems
• The RSL syntax model seems suitable to define even complicated resource specification expressions
• The common set of RSL attributes is often not sufficient
• Attributes not belonging to the common set are ignored (see the sketch below)
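A sketch of the limitation; (localScratchDisk=…) stands for any site-specific attribute outside the common set and is a hypothetical name:

  & (executable=/diskCms/startcmsim.sh)
    (count=1)
    (localScratchDisk=2000)

The job is submitted as if the last relation were not there: the attribute is ignored rather than rejected.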
RSL
• More flexibility is required
  • Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
  • Using the same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach (see the sketch below)
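For comparison, a minimal sketch in the Condor Class-Ads style; INFN_SITE is a hypothetical attribute an administrator could define and advertise, which users then reference with the same vocabulary:

  # machine configuration (resource administrator): advertise a new attribute
  INFN_SITE    = "Padova"
  STARTD_EXPRS = INFN_SITE

  # submit description (user): match and rank on the attributes the machines offer
  requirements = (INFN_SITE == "Padova") && (OpSys == "LINUX")
  rank         = Mips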
Next steps
• Bug fixes
• Modification of the Globus LSF scripts for the GIS
• Problem with (count=x) and LSF ???
• Tests with real applications and real environments (CMS fall production)
• Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it
  • Let's start with the information provided by the underlying resource management system
• Tests with the GRAM API
• Tests with other resource management systems are not necessary
• Scalability and robustness problems
  • Not so simple and straightforward !!!
  • Up to the Workload Management WP, possible collaboration with the Globus team and the Condor team