Grid Workload Management Massimo Sgaravatto INFN Padova
Grid Workload Management WP • Goal: define and implement a suitable architecture for distributed scheduling and resource management in a GRID environment • Large heterogeneous environment • Large numbers (thousands) of independent users • Many challenging issues: • Optimizing the choice of execution location based on the availability of data, computation and network resources • Uniform interface to possibly different local resource management systems under different administrative domains • Priorities, policies on resource usage • Reliability, scalability, … • … • http://www.infn.it/workload-grid
Approach • We need much more experience with the various grid issues • The application requirements are not completely defined yet. They will evolve as more familiarity with the grid model is acquired • Fast prototyping instead of a classic top-down approach
Current activities • Report on current technology for Grid scheduling and resource management • Globus resource management • Condor • Survey on Grid scheduling systems • Focus on the implementation of a first prototype workload management system • This prototype will be integrated with the components implemented by the other WPs to form the project month 9 (September) deliverable • Grid accounting
Functionalities foreseen for the 1st release • First version of job description language (JDL) • First version of resource broker • Job submission service • First version of bookkeeping and logging services • First user interface
Block diagram of the currently foreseen components of the workload management system • Not a real architecture • Functional interactions among the various components • Dependencies on “external” functionalities
Job Description Language (JDL) • First release of the job description language (JDL), used at job submission to specify the job characteristics (application, input data set id, resources [required and preferable], …) • A document describing the syntax and semantics of a “prototype” JDL, based on Condor ClassAds, has been prepared • Ready to collect feedback from applications
Resource Broker • First version of the resource broker, which chooses the computing resources (queues or “single” nodes) to which jobs are submitted, considering • Access policies (grid-mapfiles in the Globus based prototype) • Characteristics and status of resources • Availability of input data set • Availability of the required run time/application environments • Resources required specified in the JDL • Resources required published in an Information Space (Globus GIS in the first prototype) + Replica Catalog • Ongoing implementation based on the Condor matchmaking library (see Salvatore’s presentation)
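The selection criteria listed on this slide can be illustrated with a minimal matchmaking sketch. It is not the Condor matchmaking library and not the prototype broker: the `Resource` and `Job` classes, their field names, and the default ranking by free CPUs are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Resource:
    """Hypothetical view of a computing resource as published in the information space."""
    name: str
    authorized_users: set   # stands in for an access policy such as a grid-mapfile
    free_cpus: int          # part of the published resource status
    environments: set       # run time/application environments installed
    datasets: set           # input data sets available at the resource


@dataclass
class Job:
    """Hypothetical job description (what a JDL document would carry)."""
    owner: str
    input_dataset: str
    environment: str
    min_cpus: int = 1


def matches(job: Job, res: Resource) -> bool:
    """Return True only if the resource satisfies every requirement of the job."""
    return (
        job.owner in res.authorized_users          # access policy
        and res.free_cpus >= job.min_cpus          # characteristics and status
        and job.environment in res.environments    # required environment
        and job.input_dataset in res.datasets      # input data availability
    )


def choose_resource(
    job: Job,
    resources: List[Resource],
    rank: Callable[[Resource], float] = lambda r: r.free_cpus,
) -> Optional[Resource]:
    """Filter by requirements, then pick the candidate with the highest rank."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    return max(candidates, key=rank)
```

The two-step shape (hard requirements first, then a ranking among the survivors) mirrors the distinction between required and preferable resources in the JDL; a different `rank` function can be passed to express another preference.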
Information Service • All the information needed by the broker is published in one Grid Information Space (Globus GIS/MDS for the first release) • A new MDS 2 alpha release will be available soon • Should address some of the existing shortcomings • Plug-in modules must be implemented • Index (for a first level query, to identify a set of candidate resources) • Information providers (to publish needed information about resources)
Job submission service • Job submission service based (for the first release) on: • Globus GRAM • Condor-G on top of Globus GRAM (to implement a reliable job submission service) • Globus GRAM • Comprehensive evaluation already done (collaboration with the “Evaluation of the Globus toolkit” WP) • Globus GRAM as a uniform interface to the different underlying resource management systems (LSF, Condor, PBS) • GRAM reporter (GRAM – GIS interaction) • RSL
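The idea of a uniform interface over different local resource management systems can be sketched as a small adapter layer. This is not how Globus GRAM is implemented: the class names and the `build_submission` helper are hypothetical, and the sketch only builds the local submission command line (`bsub`, `qsub`, `condor_submit`) without running it.

```python
from abc import ABC, abstractmethod
from typing import List


class LocalBackend(ABC):
    """Hypothetical adapter for one local resource management system."""

    @abstractmethod
    def submit_command(self, script: str, queue: str) -> List[str]:
        """Return the local command line that would submit the job."""


class LSFBackend(LocalBackend):
    def submit_command(self, script: str, queue: str) -> List[str]:
        return ["bsub", "-q", queue, script]


class PBSBackend(LocalBackend):
    def submit_command(self, script: str, queue: str) -> List[str]:
        return ["qsub", "-q", queue, script]


class CondorBackend(LocalBackend):
    def submit_command(self, script: str, queue: str) -> List[str]:
        # Assumption: `script` is a Condor submit description file;
        # the queue argument is ignored in this sketch.
        return ["condor_submit", script]


# Registry keyed by an illustrative system name.
BACKENDS = {
    "lsf": LSFBackend(),
    "pbs": PBSBackend(),
    "condor": CondorBackend(),
}


def build_submission(lrms: str, script: str, queue: str = "default") -> List[str]:
    """Single entry point: callers never see which local system is in use."""
    try:
        backend = BACKENDS[lrms]
    except KeyError:
        raise ValueError(f"unsupported resource management system: {lrms}")
    return backend.submit_command(script, queue)
```

Callers use only `build_submission`, so supporting another local system means adding one adapter class and one registry entry.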
Job submission service • Condor-G • First prototype implementation already tested • Promising, but many problems to fix • New Condor-G implementation under testing • Many problems fixed, but other issues are still open • Another new Condor-G implementation is expected to be released in a few weeks • Exploitation of a new persistent Globus jobmanager • Actively following the developments of Globus GRAM and Condor-G, and implementing the required customizations
Bookkeeping & Logging • Job monitoring and control • Job status • Used resources • Start time • End time • … • Record of significant events occurring in the workload management system
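The items tracked by the bookkeeping and logging services can be pictured as a per-job record plus an event list. The sketch below is only an assumption about how such a record might look: the `JobRecord` class, its field names, and the status strings are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class JobRecord:
    """Hypothetical bookkeeping entry for one job."""
    job_id: str
    status: str = "SUBMITTED"                 # illustrative status names
    resource: Optional[str] = None            # resource used by the job
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    events: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, event: str) -> None:
        """Append a significant event to the job's log."""
        self.events.append((when, event))

    def start(self, when: datetime, resource: str) -> None:
        """Record that the job started running on a resource."""
        self.status = "RUNNING"
        self.resource = resource
        self.start_time = when
        self.log(when, f"started on {resource}")

    def finish(self, when: datetime) -> None:
        """Record that the job ended."""
        self.status = "DONE"
        self.end_time = when
        self.log(when, "finished")

    def elapsed(self) -> Optional[timedelta]:
        """Wall-clock run time, or None while start or end is unknown."""
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the record easy to replay from a log.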
User interface • Command-line, for job management operations • List of resources “suitable” to run a job • Job submission (with the option of specifying where to submit the job or leaving this choice to the broker) • Job status monitoring • Job removal • Access to bookkeeping info for the job
Workload management system (1st prototype) • [Block diagram] • Jobs are submitted (using JDL [Class-Ads]) to the Broker • Resource Discovery: the Broker uses the GIS + Replica Catalog (and other info) and chooses the Globus resources to which the jobs must be submitted • Job submission service: Condor-G, able to provide a reliable/crash-proof job submission service • Globus GRAM as uniform interface to the different local resource management systems (CONDOR, LSF, PBS) on the farms at Site1, Site2 and Site3 • Information on characteristics and status of local resources
Grid Accounting • New problem • Working systems (even prototype implementations) don’t exist yet • Economy-based model for Grid accounting? • See Stefano’s presentation
Deliverables foreseen in the INFN-GRID proposal • D2.1.1 Technical assessment about Globus and Condor, interactions and usage (5/2001) • Done • D2.1.2 First resource broker implementation for high throughput applications (7/2001) • The resource broker should be easily customizable for high throughput applications • Usable after M9 release
Deliverables foreseen in the INFN-GRID proposal • D2.1.3 Comparison of different local resource managers (10/2001) • Condor, LSF, PBS • Farms with these resource management systems already in place and instrumented with the Globus software • D2.1.4 Study of the three workload systems and implementation of the workload system for Monte Carlo productions (12/2001) • Should be achievable