The Workload Management System in the DataGrid project Massimo Sgaravatto INFN Padova massimo.sgaravatto@pd.infn.it http://presentation.address
Grid vision
"Dependable, consistent, pervasive access to resources"
• Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, or trust relationships
• Make it easy to use diverse, geographically distributed, locally managed and controlled computing facilities as if they formed a coherent local cluster
• People have been discussing the Grid for several years …
• … but until a few years ago more or less only the Globus toolkit was available
• Globus toolkit: core services for Grid tools and applications (authentication, information service, resource management, etc.)
• Good basis to build on, but:
• No higher-level services
• Many problems (e.g. handling of lots of data) not addressed
• No production-quality implementations
• Not possible to do real work with Grids yet …
EU DataGrid
• DataGrid is funded by the European Union with the objective of building and exploiting the next-generation computing infrastructure, providing intensive computation and analysis of shared large-scale databases
• Enable data-intensive sciences by providing world-wide Grid test beds to large distributed scientific organizations ("Virtual Organizations", VOs)
• Duration: Jan 1, 2001 - Dec 31, 2003
• Applications/end-user communities: HEP, Earth Observation, Biology
• Specific project objectives:
• Middleware for fabric & Grid management
• Large-scale testbed
• Collaborate and coordinate with other projects
• Contribute to open standards and international bodies
DataGrid Main Partners
• CERN – International (Switzerland/France)
• CNRS – France
• ESA/ESRIN – International (Italy)
• INFN – Italy
• NIKHEF – The Netherlands
• PPARC – UK
Assistant Partners
• Industrial partners:
• Datamat (Italy)
• IBM-UK (UK)
• CS-SI (France)
• Research and academic institutes:
• CESNET (Czech Republic)
• Commissariat à l'énergie atomique (CEA) – France
• Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI) – Hungary
• Consiglio Nazionale delle Ricerche – Italy
• Helsinki Institute of Physics – Finland
• Institut de Fisica d'Altes Energies (IFAE) – Spain
• Istituto Trentino di Cultura (IRST) – Italy
• Konrad-Zuse-Zentrum für Informationstechnik Berlin – Germany
• Royal Netherlands Meteorological Institute (KNMI) – The Netherlands
• Ruprecht-Karls-Universität Heidelberg – Germany
• Stichting Academisch Rekencentrum Amsterdam (SARA) – The Netherlands
• Swedish Research Council – Sweden
DataGrid Work Packages
• The EDG collaboration is structured in 12 Work Packages:
• WP1: Workload Management System
• WP2: Data Management
• WP3: Grid Information and Monitoring
• WP4: Fabric Management
• WP5: Storage Element / Storage Resource Manager
• WP6: Testbed and Demonstrators
• WP7: Network Monitoring
• WP8: High Energy Physics Applications
• WP9: Earth Observation
• WP10: Biology
• WP11: Dissemination
• WP12: Management
WP1 Task
• The objective of the first DataGrid work package was (and is), according to the project "Technical Annex": to define and implement a suitable architecture for distributed scheduling and resource management in a Grid environment
• Many challenging issues:
• Large, heterogeneous environments
• Large numbers (thousands) of independent users
• Optimizing the choice of execution location based on the availability of data, computation and network resources
• Uniform interface to possibly different local resource management systems under different administrative domains
• Policies on resource usage
• Reliability, scalability, …
• …
EDG Tutorial Overview (diagram): Workload Management Services, Data Management Services, Information Service, Networking, Fabric Management
WP1 teams
• INFN
• INFN Catania
• INFN CNAF (Bologna)
• INFN Milano
• INFN Padova
• INFN Pisa
• INFN Torino
• CESNET (Czech Republic)
• Datamat SpA (Rome)
• Imperial College (UK)
Approach
• We needed much more experience with the various Grid issues
• The application requirements were not completely defined yet
• They evolved as more familiarity with the Grid model was acquired
• Fast prototyping instead of a classic top-down approach
• Implementation of a first prototype Workload Management System (WMS):
• By integrating existing tools
• Globus
• Condor
• And by implementing new middleware
Functionalities of the first WMS
• Lightweight User Interface (UI) to submit jobs and control them
• Also allows transferring a "small" amount of data between the client machine and the executing machine (input/output sandboxes)
• Job characteristics and requirements described via an appropriate Job Description Language (JDL)
• WP1's Resource Broker (RB) chooses an appropriate computing resource (Computing Element, CE) for the job, based on the constraints specified in the JDL and on the status of the Grid
• The RB strategy is to send the job to an appropriate CE:
• Where the submitting user has proper authorization
• That matches the characteristics specified in the JDL (architecture, computing power, application environment, etc.)
• Where the specified input data (and possibly the chosen output Storage Element) are determined to be "close enough"
• Throughout this process, WP1's Logging and Bookkeeping services maintain a "state machine" view of each job
dg-job-submit myjob.jdl

myjob.jdl:
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog, dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"$(CMS)/exe/sum.exe", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;
Experiences with the first WMS
• The first Workload Management System was deployed in the EDG testbed at the end of the first year of the project
• Application users have now been experimenting with this first release of the WMS for about a year and a half
• Stress tests and quasi-production activities
• CMS stress tests
• ATLAS efforts
• …
• Significant achievements exploited by the experiments
• … but various problems were also spotted
• Impacting in particular the reliability and scalability of the system
Review of the WP1 WMS architecture
• The WP1 Workload Management System architecture was reviewed
• To apply the "lessons" learned and address the shortcomings that emerged with the first release of the software, in particular:
• To address the reliability problems
• To address the scalability problems
• To support new functionalities
• To favor interoperability with other Grid frameworks, by allowing WP1 modules (e.g. the RB) to be exploited also "outside" the EDG WMS
Improvements wrt the first release of the WMS
• Reliability and scalability problems addressed
• No longer a monolithic long-lived process
• Some functionalities (e.g. matchmaking) delegated to pluggable modules
• Less exposed to memory leaks (coming not only from EDG software)
• No more multiple job info repositories
• No more job status inconsistencies, which caused problems
• Techniques to quickly recover from failures
• Reliable communications among components
• Done via the file system (filequeues); for example, jobs are not lost if the target entity is temporarily down: when it restarts it picks up and processes the queued jobs (see the sketch below)
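A minimal sketch of the filequeue idea (illustrative only: the directory layout, file naming and JSON payloads are assumptions, not the actual WP1 implementation). Each request is materialized as a file in a spool directory, so a consumer that crashes and restarts simply rescans the directory and resumes processing:

import json
import os
import tempfile

SPOOL_DIR = "/var/spool/wms_queue"   # hypothetical spool directory

def enqueue(request):
    """Atomically drop a request into the spool directory."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=SPOOL_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(request, f)
    final_path = tmp_path[:-4] + ".req"   # rename makes the request visible atomically
    os.rename(tmp_path, final_path)
    return final_path

def drain(handler):
    """Process every pending request; safe to call again after a restart."""
    for name in sorted(os.listdir(SPOOL_DIR)):
        if not name.endswith(".req"):
            continue
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            request = json.load(f)
        handler(request)   # if this fails, the file stays on disk and is retried later
        os.remove(path)    # remove only after successful handling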
Improvements wrt the first release of the WMS
• Flexibility and interoperability increased
• Much more feasible to exploit the Resource Broker also outside the DataGrid WMS
• Much easier to implement and "plug" into the system the module implementing a scheduling strategy defined according to one's own needs and requirements (a toy illustration follows this list)
• Glue Schema for Information Services to describe Grid resources
• Common schema agreed between US and EU High Energy Physics Grid projects
• Various release 1 problems fixed
• Various enhancements in design and implementation of the various modules
• Also due to enhancements in the underlying software (e.g. Condor and Globus)
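The "pluggable module" idea can be pictured with a toy registry of scheduling policies (the names and signatures below are illustrative assumptions, not the WP1 plug-in interface): the Workload Manager only needs to know that some configured policy maps a job and a list of candidate CEs to a chosen CE.

from typing import Callable, Dict, List

PolicyFn = Callable[[dict, List[dict]], dict]   # (job, candidate CEs) -> chosen CE

POLICIES: Dict[str, PolicyFn] = {}

def register_policy(name: str):
    """Decorator that plugs a scheduling strategy into the registry."""
    def wrap(fn: PolicyFn) -> PolicyFn:
        POLICIES[name] = fn
        return fn
    return wrap

@register_policy("max_free_cpus")
def max_free_cpus(job: dict, ces: List[dict]) -> dict:
    # Prefer the CE advertising the most free CPUs.
    return max(ces, key=lambda ce: ce.get("FreeCPUs", 0))

def choose_ce(job: dict, ces: List[dict], policy: str = "max_free_cpus") -> dict:
    return POLICIES[policy](job, ces)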
WP1 WMS reviewed architecture: details in EDG deliverable D1.4 …
Job submission (architecture diagram): UI → RB node (Network Server, Workload Manager, Job Controller - CondorG) → Computing Element / Storage Element; the RLS and the Information Service provide the RB with data locations and with CE/SE characteristics & status.
Job submission, job status: submitted. UI: allows users to access the functionalities of the WMS.
edg-job-submit myjob.jdl: the Job Description Language (JDL) is used to specify job characteristics and requirements (job status: submitted).

myjob.jdl:
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 6.2" && other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;
Job status: submitted → waiting. NS: network daemon responsible for accepting incoming requests; the job's Input Sandbox files are stored on the RB storage.
Job status: waiting. WM: responsible for taking the appropriate actions to satisfy the request.
Job status: waiting. The Workload Manager asks the Matchmaker/Broker: where must this job be executed?
Job status: waiting. Matchmaker: responsible for finding the "best" CE where to submit a job.
Job status: waiting. The Matchmaker queries the RLS (where, i.e. on which SEs, are the needed data?) and the Information Service (what is the status of the Grid?).
Job status: waiting. The Matchmaker returns the CE choice to the Workload Manager (a toy sketch of this evaluation follows).
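In practice the CE choice amounts to evaluating the JDL Requirements expression against the attributes each CE publishes (Glue schema) and ordering the surviving CEs by the Rank expression. A toy Python sketch of that logic, using attribute names from the JDL example above but made-up values, and none of the real Resource Broker code:

# Toy matchmaking: keep the CEs that satisfy Requirements, pick the best Rank.
candidate_ces = [
    {"GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 20000,
     "GlueCEStateFreeCPUs": 14},
    {"GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 5000,
     "GlueCEStateFreeCPUs": 40},
]

def requirements(ce):
    # Mirrors: other.GlueHostOperatingSystemName == "linux" &&
    #          other.GlueCEPolicyMaxWallClockTime > 10000
    return (ce["GlueHostOperatingSystemName"] == "linux"
            and ce["GlueCEPolicyMaxWallClockTime"] > 10000)

def rank(ce):
    # Mirrors: Rank = other.GlueCEStateFreeCPUs;
    return ce["GlueCEStateFreeCPUs"]

matching = [ce for ce in candidate_ces if requirements(ce)]
chosen = max(matching, key=rank) if matching else None
print(chosen)   # -> the first CE (the second one fails the wall clock requirement)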
Job status: waiting. JA (Job Adapter): responsible for the final "touches" to the job before submission (e.g. creation of the wrapper script, etc.).
Job status: ready. JC (Job Controller): responsible for the actual job management operations (done via CondorG).
Job status: scheduled. The Job Controller / CondorG submits the job to the chosen Computing Element, and the Input Sandbox files are transferred to the CE.
Job status: running. The job runs on the Computing Element with its Input Sandbox; "Grid-enabled" data transfers/accesses take place between the CE and the Storage Element.
Job status: done. The Output Sandbox files are transferred back to the RB storage.
Job status: done. The user retrieves the output with edg-job-get-output <dg-job-id>.
Job status: cleared. The Output Sandbox files are transferred to the UI and the job reaches its final state (submitted → waiting → ready → scheduled → running → done → cleared).
Job monitoring: edg-job-status <dg-job-id>, edg-job-get-logging-info <dg-job-id>. LB (Logging & Bookkeeping): receives and stores job events and computes the corresponding job status. LM (Log Monitor): parses the CondorG log file (where CondorG logs info about jobs) and notifies the LB.
Logging and Bookkeeping (LB) service
• Stores logging and bookkeeping information concerning events generated by the various components of the WMS ("push" model)
• Using this information, the LB service keeps a state machine view of each job (sketched below)
• Extended querying capabilities
• E.g. give me all jobs marked as 'XYZ' (user tag) and running on CE1 or CE2
• Possible to have more than one LB server per WMS
• Could be useful in case the LB is overloaded
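The "state machine view" can be sketched as follows (simplified: the states come from the walkthrough above, while the event tuples and component names are only illustrative, not the actual LB event schema). Components push events, and the LB folds them into the current job state:

# The ordering follows the job status values shown in the walkthrough above.
STATES = ["submitted", "waiting", "ready", "scheduled", "running", "done", "cleared"]

def job_state(events):
    """Return the furthest state reached, given (component, state) events."""
    reached = "submitted"
    for _component, state in events:
        if STATES.index(state) > STATES.index(reached):
            reached = state
    return reached

events = [("UserInterface", "submitted"), ("NetworkServer", "waiting"),
          ("JobAdapter", "ready"), ("JobController", "scheduled"),
          ("LogMonitor", "running")]
print(job_state(events))   # -> "running"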
Logging and Bookkeeping (LB) service
• LB components:
• Local logger: responsible for accepting messages from their sources and passing them to the inter-logger
• The information flow is implemented on top of inter-process communication mechanisms and is backed up by a log file that allows correct recovery of the inter-logger if problems occur
• Inter-logger: responsible for forwarding the messages to the bookkeeping servers
• The inter-logger, running as a separate process, makes the logging procedure robust with respect to local and network faults
• Bookkeeping server: accepts messages from the inter-logger and saves them on its permanent storage
• Supports queries (in particular job status queries) generated through a consumer API
User Interface (UI)
• Allows access to the functionalities of the WMS:
• Submit jobs
• See the suitable resources for a job (without submitting it)
• Cancel a job
• See the status of a job
• Retrieve the output of a job
• …
• Interfaces:
• Command line interface (Python); a scripting sketch follows below
• C++ and Java APIs
• GUI
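As an example of driving the WMS from the command line interface, a thin Python script can wrap the edg-job-* commands named in these slides (a sketch only: command options are omitted, and the job identifier would have to be parsed from the real command output, which is not reproduced here):

import subprocess

def run(cmd):
    """Run a UI command and return its textual output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def submit(jdl_path):
    # edg-job-submit prints information about the newly created job.
    return run(["edg-job-submit", jdl_path])

def status(job_id):
    return run(["edg-job-status", job_id])

def get_output(job_id):
    return run(["edg-job-get-output", job_id])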