Running your jobs everywhere David Colling Imperial College London
What do you want from a Grid job submission system?
Well, I cannot answer for you but this is my guess…
You want the Grid to be as easy to use as a conventional, local computing batch system.
This means:
• Simple “qsub” style commands
• All the data transparently available to the job
• You can monitor your jobs.
You may want more… but these are the basics
So if we have a whole range of local batch systems that can satisfy these criteria, why is doing this on the Grid so difficult?
Some of the problems of a distributed computing system are:
• Not all data is distributed to every site
• You do not have computer accounts at every site (See Andrew’s Talk)
• Your jobs are travelling across the WAN and so additional security is required (See Andrew’s Talk)
• It is difficult to gather coherent information about the remote sites.
• Everything (network, computers, disks etc.) breaks.
So how do we overcome these problems?
This is the subject of this talk…
This is going to be a conceptual treatment…
I am going to describe the solution developed by the European DataGrid Project (EDG), now adopted by the LHC Computing Grid (LCG), which will be the basis of the first EGEE release.
There are several other Grid projects (e.g. see Rick’s talk); however, they are conceptually very similar… although they do have important technical differences.
These are my personal views
The World as seen by the EDG
This is the world without Grids
• Sites are not identical.
• Different Computers
• Different Storage
• Different Files
• Different Usage Policies
Confused and unhappy user
So let's introduce some grid infrastructure… Security and an information system
So now the user knows about what machines are out there and can communicate with them… however, where to submit the job is too complex a decision for the user alone. What is needed is an automated system
[Diagram labels: Workload Management System (Resource Broker); Logging & Bookkeeping; Replica Location Service (Replica Catalogue); VO server. Each site consists of: a compute element, a storage element]
edg-job-submit myjob.jdl
myjob.jdl:
  JobType = "Normal";
  Executable = "$(CMS)/exe/sum.exe";
  InputData = "LF:testbed0-00019";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
  DataAccessProtocol = "gridftp";
  InputSandbox = {"/home/user/WP1testC","/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 6.2" && other.GlueCEPolicyMaxWallClockTime > 10000;
  Rank = other.GlueCEStateFreeCPUs;
Job & Input Sandbox
The WMS, using the RC, decides on the execution location
edg-job-get-output <dg-job-id>
Now a happy user
So does this system fulfil the requirements and overcome the problems?
• Security
  • GSI security model based on X.509 provides authentication
  • Authorisation via membership of virtual organisations (VO) and group pool accounts
If well implemented this is secure (I am told) and provides a way of authorising access to resources on which individuals do not have personal accounts
So does this system fulfil the requirements and overcome the problems?
Two different information systems have been tried within EDG/LCG.
• Hierarchical LDAP based system
  • Each site publishes a set of information about itself
  • This was slow and didn’t scale well
  • Improvements in later versions
• R-GMA (Relational Grid Monitoring Architecture)
  • Works on servlets
  • Allows users to implement their own monitoring in their executables.
  • Seems to scale
So does this system fulfil the requirements and overcome the problems?
So it appears that we have the information system that we need to be able to get a coherent picture of our Grid
So does this system fulfil the requirements and overcome the problems?
• Not all the data are at every site
  • The Replica Location Service knows about all the physical copies of the data.
  • The user specifies a logical file name
  • It can feed information into the WMS and provides sufficient information for the user job to be able to find the data it needs
  • Users can also register the output files
  • Still some scaling issues
Ways of handling the data are being developed and work for reasonable numbers of files
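The idea of a logical file name resolving to several physical copies can be sketched in a few lines of Python. This is only a conceptual illustration, not the EDG Replica Location Service API: the class `ReplicaCatalogue`, the helper `sites_holding` and the `example` host names are all hypothetical.

```python
from typing import Dict, List, Set
from urllib.parse import urlparse


class ReplicaCatalogue:
    """Hypothetical in-memory stand-in for a replica catalogue."""

    def __init__(self) -> None:
        # logical file name -> list of physical file names (URLs)
        self._replicas: Dict[str, List[str]] = {}

    def register(self, lfn: str, pfn: str) -> None:
        """Record that a physical copy `pfn` exists for logical name `lfn`."""
        self._replicas.setdefault(lfn, []).append(pfn)

    def lookup(self, lfn: str) -> List[str]:
        """Return every known physical copy (empty list if unknown)."""
        return list(self._replicas.get(lfn, []))


def sites_holding(catalogue: ReplicaCatalogue, lfn: str) -> Set[str]:
    """Hosts that hold a copy; the kind of hint a broker could use."""
    return {urlparse(pfn).hostname for pfn in catalogue.lookup(lfn)}


catalogue = ReplicaCatalogue()
lfn = "lfn:CARF_System.META.TestG4"
catalogue.register(lfn, "gsiftp://se.site-a.example/data/TestG4")
catalogue.register(lfn, "gsiftp://se.site-b.example/store/TestG4")
hosts = sites_holding(catalogue, lfn)
```

The user only ever names `lfn`; which physical copy is used is left to the infrastructure.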
So does this system fulfil the requirements and overcome the problems?
At the heart of the EDG/LCG is the WMS
• Takes the job along with its description in ClassAd format and input sandbox from the user
• Uses this description, information about the state of the resources and data location to decide an execution location
• Submits job to selected resource
• Returns output to the user after the job has completed
Built on Globus and Condor-G as well as original code
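Conceptually, the decision step is a filter followed by a sort: keep the resources that satisfy the job's Requirements, then order them by its Rank. The Python sketch below illustrates only that idea and is not the WMS implementation. The function `match` and the resource identifiers are hypothetical; the Glue attribute names are the ones used in the job descriptions in this talk.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]


def match(
    resources: List[Resource],
    requirements: Callable[[Resource], bool],
    rank: Callable[[Resource], float],
) -> List[Resource]:
    """Keep resources that meet the requirements, best rank first."""
    candidates = [r for r in resources if requirements(r)]
    return sorted(candidates, key=rank, reverse=True)


# Hypothetical snapshot of what an information system might publish.
resources = [
    {"id": "ce-a.example", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 20000, "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b.example", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 50000, "GlueCEStateFreeCPUs": 12},
    {"id": "ce-c.example", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 5000, "GlueCEStateFreeCPUs": 50},
]


def requirements(other: Resource) -> bool:
    # Mirrors a Requirements expression: Linux and a long enough time limit.
    return (other["GlueHostOperatingSystemName"] == "linux"
            and other["GlueCEPolicyMaxWallClockTime"] > 10000)


def rank(other: Resource) -> float:
    # Mirrors Rank = other.GlueCEStateFreeCPUs: more free CPUs is better.
    return other["GlueCEStateFreeCPUs"]


ranked = match(resources, requirements, rank)
ranked_ids = [r["id"] for r in ranked]
```

Here `ce-c.example` has the most free CPUs but is excluded because its wall-clock limit is too short, so the ranking only ever applies to resources that can actually run the job.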
So does this system fulfil the requirements and overcome the problems?
Is it straightforward to use?
Need to describe the job
#########################################
#
# ---- Sample Job Description File ----
#
#########################################
JobType = "Normal";
Executable = "sum.exe";
StdInput = "data.in";
InputSandbox = {"/home_firefox/fpacini/exe/sum.exe","/home1/data.in"};
OutputSandbox = {"data.out","sum.err"};
InputData = {"lfn:CARF_System.META.TestG4"};
Rank = other.GlueCEPolicyMaxCPUTime;
Requirements = other.GlueCEInfoLRMSType == "Condor" &&
               other.GlueHostArchitecturePlatformType == "INTEL" &&
               other.GlueHostOperatingSystemName == "LINUX" &&
               other.GlueCEStateFreeCPUs >= 2;
So does this system fulfil the requirements and overcome the problems?
Are the commands easy to use?
Some typical commands:
edg-job-list-match myjob.jdl
***************************************************************************
Computing Element IDs LIST
The following CE(s) matching your job requirements have been found:
*CEId*
bbq.mi.infn.it:2119/jobmanager-pbs-dque
skurut.cesnet.cz:2119/jobmanager-pbs-wp1
***************************************************************************
So does this system fulfil the requirements and overcome the problems?
edg-job-submit --vo cms myjob1.jdl
================= edg-job-submit Success ==================================
The job has been successfully submitted to the Network Server.
Your job is identified by (edg_jobId):
https://ibm139.cnaf.infn.it:9000/ZU9yOC7AP7AOEhMAHirG3
Use edg-job-status command to display current job status.
======================================================================
$> edg-job-status -v 0 https://ibm139.cnaf.infn.it:9000/_tO6hdgToYKGCuV68q-gqQ
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://ibm139.cnaf.infn.it:9000/_tO6hdgToYKGCuV68q-gqQ
Current Status: Scheduled
Destination: bbq.mi.infn.it:2119/jobmanager-pbs-dque
Status Reason: Job successfully submitted to Globus
reached on: Tue May 6 16:14:59 2003
*************************************************************
So does this system fulfil the requirements and overcome the problems?
edg-job-cancel
edg-job-get-logging-info
etc.
The commands are as easy to use as those of other batch systems.
So does this system fulfil the requirements and overcome the problems?
• Robustness
  • Uses Condor-G
  • A job is retried at a new site if it fails at the original one (up to a number of times specified in the ClassAd)
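The retry behaviour can be pictured as walking down the ranked list of sites until the job succeeds or the allowed number of retries is used up. The Python below is a minimal sketch of that policy only, not the Condor-G or WMS code; `submit_with_retries`, `run_at` and `retry_count` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional


def submit_with_retries(
    ranked_sites: List[str],
    run_at: Callable[[str], bool],
    retry_count: int,
) -> Optional[str]:
    """Try sites in rank order; return the site where the job succeeded.

    One initial attempt plus at most `retry_count` retries are made,
    each at a different site. Returns None if every attempt failed.
    """
    for site in ranked_sites[: retry_count + 1]:
        if run_at(site):
            return site
    return None


# Simulated grid: the first two sites are broken, the third works.
broken = {"ce-a.example", "ce-b.example"}
attempted: List[str] = []


def run_at(site: str) -> bool:
    attempted.append(site)  # record each attempt for inspection
    return site not in broken


sites = ["ce-a.example", "ce-b.example", "ce-c.example", "ce-d.example"]
succeeded_at = submit_with_retries(sites, run_at, retry_count=2)
gave_up = submit_with_retries(sites, run_at, retry_count=1)
```

With two retries the job lands on the third site; with only one retry it is abandoned, which is why the limit is left for the user to set in the job description.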
So does this system fulfil the requirements and overcome the problems?
However, this could still be inefficient (e.g. a job runs for hours or days before the machine crashes), so logical checkpointing was introduced.
• Attribute-value pairs are periodically saved to the LB service
• If a job fails because of a CE problem it can restart from the last saved state
• Provides a natural way of dividing up parameter scanning jobs
Still not perfectly robust, but is getting there.
The WMS is becoming robust
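Logical checkpointing can be illustrated with a parameter scan that saves a couple of attribute-value pairs after each step and, on restart, picks up from the last saved state. This Python sketch assumes a simple in-memory store in place of the LB service; `CheckpointStore`, `scan`, `SimulatedCrash` and the attribute names `next_index` and `running_total` are all hypothetical and not part of the EDG checkpointing API.

```python
from typing import Any, Dict, List, Optional


class CheckpointStore:
    """Hypothetical stand-in for the LB service: last state per job."""

    def __init__(self) -> None:
        self._states: Dict[str, Dict[str, Any]] = {}

    def save(self, job_id: str, state: Dict[str, Any]) -> None:
        self._states[job_id] = dict(state)

    def load(self, job_id: str) -> Dict[str, Any]:
        return dict(self._states.get(job_id, {}))


class SimulatedCrash(Exception):
    """Stands in for a compute element failing mid-job."""


def scan(job_id: str, parameters: List[int], store: CheckpointStore,
         crash_at: Optional[int] = None) -> int:
    """Sum the squares of the parameters, checkpointing after each one."""
    state = store.load(job_id)
    start = state.get("next_index", 0)      # resume point, 0 if fresh
    total = state.get("running_total", 0)
    for i in range(start, len(parameters)):
        if crash_at is not None and i == crash_at:
            raise SimulatedCrash(f"CE failed before step {i}")
        total += parameters[i] ** 2
        store.save(job_id, {"next_index": i + 1, "running_total": total})
    return total


store = CheckpointStore()
params = [1, 2, 3, 4]
try:
    scan("job-1", params, store, crash_at=2)   # fails after two steps
except SimulatedCrash:
    saved = store.load("job-1")                # state left behind
result = scan("job-1", params, store)          # restart: resumes at step 2
```

The same saved index is what makes it natural to split a parameter scan: different jobs can simply be started from different saved states.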
The EDG release satisfies our basic requirements (pretty much)
However, it also has additional functionality
• In the current release:
  • Support for interactive jobs
  • Support for MPI jobs
• Implemented but not yet released:
  • Dependent jobs (DAGs)
  • A distributed accounting system based on the Home Location Registers
DAGs
A = [
  Executable = "A.sh";
  PreScript = "PreA.sh";
  PreScriptArguments = { "1" };
  Children = { "B", "C" }
];
B = [
  Executable = "B.sh";
  PostScript = "PostA.sh";
  PostScriptArguments = { "$RETURN" };
  Children = { "D" }
];
C = [
  Executable = "C.sh";
  Children = { "D" }
];
D = [
  Executable = "D.sh";
  PreScript = "PreD.sh";
  PostScript = "PostD.sh";
  PostScriptArguments = { "1", "a" }
]
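The Children attributes above define a dependency graph: B and C wait for A, and D waits for both B and C. As a rough illustration of how an execution order follows from that, the Python sketch below groups the nodes into waves that could run in parallel, using the standard library `graphlib` module (Python 3.9+). The function `execution_waves` is hypothetical and says nothing about how the EDG software schedules DAGs internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Children relation copied from the DAG description above.
children: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def execution_waves(children: Dict[str, List[str]]) -> List[List[str]]:
    """Group nodes into waves; every node's parents are in earlier waves."""
    # TopologicalSorter wants node -> predecessors, so invert the relation.
    parents: Dict[str, Set[str]] = {node: set() for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)

    sorter = TopologicalSorter(parents)
    sorter.prepare()                      # raises CycleError on a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all nodes runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished
    return waves


waves = execution_waves(children)
```

For this graph the waves are A, then B and C together, then D.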
Notes of caution:
Yes, the system works and is now pretty robust; however, problems do still occur.
Constant monitoring is required or else site configurations seem to decay.
This is a problem of many interacting pieces of software, and problems at an individual site can go overlooked as jobs are just resubmitted elsewhere.
So what is there now?
EDG Application testbed:
• More than 1000 CPUs
• 5 terabytes of storage
• EDG software installed at more than 40 sites
• 60K successful jobs since Oct 2003 (current release)
http://www.hep.ph.ic.ac.uk/~stuatw/applet/ (links from http://www.hep.ph.ic.ac.uk/eScience/ )
So what is there now?
The LCG testbed (at the time of SC2003)
So you really can submit your jobs around the world
How to get started using the Grid
Get a certificate: http://ca.grid-support.ac.uk/
Sign the EDG and LCG usage rules: http://marianne.in2p3.fr/ http://lcg.web.cern.ch/LCG/ (EDG soon to be replaced by EGEE)
You will then become a member of a VO
How to get started using the Grid
Follow the examples in the user guides:
http://marianne.in2p3.fr/datagrid/documentation/EDG-Users-Guide-2.0.pdf
http://server11.infn.it/work-loadgrid/documents.html
User support is currently limited, but will grow significantly over the next few months.
And in the future…?
The future will bring changes in the underlying technology, almost certainly based on Web Services.
However, the functionality required will not change very much, and the LCG and EGEE users should be shielded from these changes.
Summary
Over the last three years the EDG has developed a working Grid that fulfils the basic user requirements.
This has been adopted by LCG and EGEE.
We are approaching a production scientific service.
The future may be based on new technology but will look similar to the user.