Practical approaches to Grid workload management in the EGEE project
Massimo Sgaravatto, INFN Padova
On behalf of the EGEE JRA1 IT-CZ cluster
CHEP 2004
www.eu-egee.org
EGEE is a project funded by the European Union under contract INFSO-RI-508833
EGEE project
• EGEE project
• Aim: build a consistent, robust and secure Grid infrastructure
• Focus first on two pilot application areas (HENP, biomedical applications)
• But the goal is to bring in other researchers in academia and industry
• Middleware activity (JRA1)
• Re-engineer Grid software to provide production-quality middleware
• Evolution towards emerging standards, based on Service Oriented Architectures
• Taking into account application requirements and production/deployment/management needs
• See talk #247 (E. Laure)
Workload management
• Grid workload and resource management is one of the key Grid middleware functionalities
• How to efficiently schedule a large number of different data-intensive jobs, submitted by a distributed community of users, to a Grid encompassing many heterogeneous resources
• Progress was made in various projects with different integrated software solutions:
• DataGrid Workload Management System
• Condor
• EuroGrid-Unicore resource broker
• …
• Still a lot to do
• Scalability, reliability
• Identification and handling of failures originating from different software layers, and possibly from 'foreign' Grid systems and resources
• Distributed (hierarchical?) super-scheduling
• Proper semantics of resource information collection and distribution (push, pull, index, cache, refresh)
• …
Workload Management System
• Provision of Grid Workload Management System services assigned to the "EGEE JRA1 Italian-Czech cluster"
• CESNET
• Datamat S.p.A.
• INFN
• Architecture of the EGEE WMS designed and being implemented
• Taking into account feedback and requirements from reference applications and deployment/production/management activities
• Taking into account previous experience from other Grid projects (in particular the DataGrid WMS)
• Set of Grid services
• Workload Manager (WM)
• Computing Element (CE): resource access
• Logging & Bookkeeping (L&B)
• Job Provenance (JP)
• Grid Accounting service
• Interoperating among themselves and with other EGEE Grid services
Workload Manager
Workload Manager
Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL).
Workload Manager
Keeps submission requests. Requests are kept for a while if no matching resources are available.
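The holding behaviour described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `TaskQueue` class, its `retention` parameter, the `retry` method and the `find_ce` callback are hypothetical names, not the EGEE implementation, and the slides do not say how long "a while" is.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    job_id: str
    submitted_at: float  # seconds on any consistent clock


@dataclass
class TaskQueue:
    """Hypothetical holding area for requests with no matching resource yet."""
    retention: float  # assumed limit on how long a request may wait
    pending: deque = field(default_factory=deque)

    def add(self, request: Request) -> None:
        self.pending.append(request)

    def retry(self, now: float, find_ce: Callable[[Request], Optional[str]]):
        """Retry every pending request once.

        Returns (dispatched, expired). Unmatched requests that are still
        inside the retention window stay in the queue for the next retry.
        """
        dispatched, expired, still_waiting = [], [], deque()
        for req in self.pending:
            ce = find_ce(req)
            if ce is not None:
                dispatched.append((req.job_id, ce))
            elif now - req.submitted_at > self.retention:
                expired.append(req.job_id)
            else:
                still_waiting.append(req)
        self.pending = still_waiting
        return dispatched, expired
```

A caller would invoke `retry` whenever new resource information arrives, passing a matchmaking function as `find_ce`.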
Workload Manager
Repository of resource information available to the matchmaker. Updated via notifications and/or active polling of sources.
Workload Manager
Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources.
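One way to picture this matchmaking step is a filter followed by a ranking. The sketch below is an assumption-laden illustration, not the EGEE matchmaker: the `Job` fields, the `match` function and the CE attributes (`os`, `free_cpus`, `est_wait_s`) are hypothetical, and utilization policies on resources are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Resource = Dict[str, object]  # one CE's published attributes (hypothetical)


@dataclass
class Job:
    job_id: str
    requirements: Callable[[Resource], bool]  # must hold for a CE to qualify
    rank: Callable[[Resource], float]         # higher value = preferred CE


def match(job: Job, resources: List[Resource]) -> Optional[Resource]:
    """Return the best-ranked CE that satisfies the job, or None."""
    candidates = [r for r in resources if job.requirements(r)]
    if not candidates:
        return None
    return max(candidates, key=job.rank)


# Example: require Linux with a free CPU, prefer the shortest estimated wait.
ces = [
    {"name": "ce-a", "os": "linux", "free_cpus": 2, "est_wait_s": 600},
    {"name": "ce-b", "os": "linux", "free_cpus": 8, "est_wait_s": 30},
    {"name": "ce-c", "os": "other", "free_cpus": 64, "est_wait_s": 0},
]
job = Job(
    "job-1",
    requirements=lambda r: r["os"] == "linux" and r["free_cpus"] >= 1,
    rank=lambda r: -r["est_wait_s"],
)
best = match(job, ces)  # selects ce-b: ce-c fails the requirement
```

Separating the hard requirement from the preference keeps "which CEs are acceptable" distinct from "which acceptable CE is best", which mirrors the slide's distinction between job requests and preferences.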
Scheduling policies
• Different possible policies
• Eager scheduling: a job is bound to a resource as soon as possible
• The job is then forwarded to that CE, where it will very likely end up in a queue
• Lazy scheduling: the job is held by the WM until a resource becomes available
• The job is then forwarded to that CE for immediate execution
• WM architecture able to accommodate both models (and intermediate solutions)
• Eager scheduling: matching a job against multiple resources
• Lazy scheduling: matching a resource against multiple jobs
• Strengths and weaknesses of the different policies in different scenarios still need to be investigated
• Evaluation of relevant metrics, covering both resource utilization and user satisfaction
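The symmetry between the two policies (one job against many resources versus one resource against many jobs) can be made concrete with a small sketch. Everything here is hypothetical: the function names, the `fits` predicate, the shortest-queue choice in the eager case and the first-fit choice in the lazy case are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional

Job = Dict[str, object]
Resource = Dict[str, object]
Fits = Callable[[Job, Resource], bool]


def eager_schedule(job: Job, resources: List[Resource], fits: Fits,
                   queue_len: Callable[[Resource], int]) -> Optional[Resource]:
    """Eager: match one job against many resources and bind it at once.

    The job will typically wait in the chosen CE's local queue, so this
    sketch picks the suitable CE with the shortest queue.
    """
    candidates = [r for r in resources if fits(job, r)]
    return min(candidates, key=queue_len) if candidates else None


def lazy_schedule(resource: Resource, waiting_jobs: List[Job],
                  fits: Fits) -> Optional[Job]:
    """Lazy: match one free resource against the jobs held by the WM.

    The first held job that fits is removed from the list and returned.
    """
    for job in waiting_jobs:
        if fits(job, resource):
            waiting_jobs.remove(job)
            return job
    return None
```

An intermediate policy could, for example, bind eagerly only when some CE reports a free slot and otherwise keep the job held for a lazy match.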
Computing Element
• Service representing a computing resource
• Main functionality: job management
• Run jobs
• Cancel jobs
• Suspend and resume jobs
• Provide info on "quality of service"
• How many resources match the job requirements?
• What is the estimated time until the job starts executing?
• …
• …
• Used by the WM or by any other client (e.g. end-user)
• CE architecture designed to support both push and pull models
• Push model: the job is pushed to the CE by the WM
• Pull model: the CE asks the WM for jobs
• These two models are somewhat mirrored in the resource information flow
• In order to 'pull' a job, a resource must choose where to 'push' information about itself
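The difference between the push and pull models comes down to which side takes the initiative. A minimal sketch, assuming hypothetical `WorkloadManager` and `ComputingElement` classes with in-memory queues (none of these names or methods come from the EGEE interfaces):

```python
from collections import deque
from typing import Deque, List, Optional


class WorkloadManager:
    """Hypothetical WM holding jobs that are not yet bound to a CE."""

    def __init__(self) -> None:
        self.held: Deque[str] = deque()

    def request_job(self, ce_name: str) -> Optional[str]:
        # Pull model: a CE has announced itself and asks for work.
        return self.held.popleft() if self.held else None


class ComputingElement:
    def __init__(self, name: str) -> None:
        self.name = name
        self.local_queue: List[str] = []

    def submit(self, job_id: str) -> None:
        # Push model: the WM (or any other client) hands the job to the CE.
        self.local_queue.append(job_id)

    def pull_from(self, wm: WorkloadManager) -> bool:
        # Pull model: the CE takes the initiative and asks the WM for a job.
        job_id = wm.request_job(self.name)
        if job_id is None:
            return False
        self.local_queue.append(job_id)
        return True
```

Note that `pull_from` needs a reference to a specific WM, which reflects the last bullet: to pull a job, the resource must first decide which WM to talk to.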
CE Architecture
[Diagram: a Client calls the CE, a Web service accepting job management requests (JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus). Other elements shown: CE Mon, LSF, PBS, ?, Worker Nodes.]
CE Architecture
[Diagram: Client, CE and CE Mon. CE Mon: asynchronous notifications about job/CE events; job requests (for a CE working in pull mode). Other elements shown: LSF, PBS, ?, Worker Nodes.]
Logging & Bookkeeping
• Collects and manages job-related events (e.g. submission, suitable CE found, start of execution, …) from the WMS components
• Processes these events to give a higher-level view of job states
• Both job states and raw data available to users
• Also via a Web service interface
• Possible to subscribe to receive notifications on particular job state changes
• The L&B event trail can be analyzed to identify problems with resources ("black holes", unusual failure rates, etc.)
• See poster #419 for more details
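Deriving job states from raw events amounts to folding an event stream into a per-job state. The sketch below assumes a hypothetical event-to-state table and tuple layout; the real L&B event and state names are not given in the slides. Events are sorted by timestamp because they come from several components and may arrive out of order.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from event type to resulting job state.
STATE_AFTER = {
    "submitted": "SUBMITTED",
    "ce_found": "SCHEDULED",
    "started": "RUNNING",
    "finished": "DONE",
    "failed": "FAILED",
}

Event = Tuple[str, float, str]  # (job_id, timestamp, event_type)


def job_states(events: Iterable[Event]) -> Dict[str, str]:
    """Reduce raw events to the latest known state of each job."""
    states: Dict[str, str] = {}
    for job_id, _ts, kind in sorted(events, key=lambda e: e[1]):
        # Event types without a mapping leave the current state unchanged.
        states[job_id] = STATE_AFTER.get(kind, states.get(job_id, "UNKNOWN"))
    return states
```

The same event trail could be grouped by CE instead of by job to count failures per resource, which is one way to look for the "black holes" mentioned above.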
Job Provenance
• Keeps track of the definition of submitted jobs, execution conditions and job life cycle for a long time
• Job life logs (JDL, timestamps, job IDs, …)
• Executable and input/output files
• Execution environment (OS, installed software version, …)
• Custom data provided by the user
• Used for
• Debugging
• Post-mortem analysis
• Comparison of job executions in an evolving environment
• Service components
• Primary Storage Server
• Keeps data in the most compact and economical form
• Index Servers
• Configured to support a set of queryable attributes
• See poster #419 for more details
Grid Accounting
• Accumulates information about the usage of Grid resources by users / groups (e.g. VOs)
• To be used
• To track resource usage
• To discover abuses (and help avoid them)
• Also possible to charge users for the resources they have used
• Allows implementation of submission policies based on resource usage
• Exchange market among Grid users and Grid resource owners, which should result in market equilibrium
• Load balancing on the Grid
Accounting architecture
[Diagram built up over four slides. Components shown: Accounting, Storage Element, Computing Element, Resource owner. Functions added step by step: resource metering (getting info about resource usage); reports about resource usage per user / VO / resource; resource pricing; cost computation.]
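The metering, reporting, pricing and cost-computation steps can be tied together in a few lines. This is a sketch under stated assumptions: the `UsageRecord` fields, the CPU-hour unit and the flat per-resource price table are hypothetical, since the slides do not specify what is metered or how prices are expressed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    """One metered usage report from a resource (hypothetical fields)."""
    user: str
    vo: str
    resource: str
    cpu_hours: float


def usage_per_vo(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Report: total metered CPU hours per VO."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.vo] += rec.cpu_hours
    return dict(totals)


def cost_per_user(records: Iterable[UsageRecord],
                  price_per_cpu_hour: Dict[str, float]) -> Dict[str, float]:
    """Cost computation: usage times the price set for each resource."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.user] += rec.cpu_hours * price_per_cpu_hour[rec.resource]
    return dict(totals)
```

With prices set per resource, a less loaded resource could advertise a lower price, which is one way the load-balancing effect mentioned on the Grid Accounting slide might arise.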
Status
• Workload Manager, Logging & Bookkeeping and Grid Accounting software inherited from the DataGrid WMS
• Being revised and complemented according to the new architecture
• E.g. Information Supermarket and TaskQueue are new developments
• Web service interfaces
• First implementation already deployed in the EGEE gLite prototype testbed
• Computing Element
• New development
• CEMon prototype already implemented
• Job Provenance
• New component being implemented
Links
• EGEE JRA1 IT-CZ cluster homepage
• http://egee-jra1-wm.mi.infn.it/egee-jra1-wm
• EGEE JRA1 (middleware activity) homepage
• http://egee-jra1.web.cern.ch/egee-jra1
• EGEE project homepage
• http://www.eu-egee.org