JRA7 and SAGA
Malcolm Illingworth, EPCC
OGF19, Chapel Hill, 29/01 – 02/02 2007
DEISA Objectives
• To deploy and operate a persistent, production-quality, distributed supercomputing environment with continental scope
• To enable scientific discovery across a broad spectrum of science and technology. Scientific impact (enabling new science) is the only criterion for success.
• User transparency (users should not be aware of complex grid technologies) and application transparency
• Minimal intrusion on applications
JRA7 Objectives
“To develop a single way of coordinating and integrating OGSA-based services for distributed resource management in a heterogeneous environment, and to use this to integrate a variety of existing user-level tools to provide the necessary high-level services in:
  - authentication, authorisation and accounting;
  - job preparation, submission and monitoring;
  - data movement for job input and output;
  - other areas to be determined by DEISA user requirements.”
DESHL: DEISA Services for the Heterogeneous management Layer
Current status and future plans
• Started in May 2004
• Decision taken to follow SAGA in mid-2005
• Project finishes in April 2008
• DESHL command line tool deployed and tested at all 11 DEISA sites
• DESHL training included in DEISA user training sessions since July 2005
• Some take-up from outside DEISA
• Recent focus on usability and robustness
• DESHL 4.1 due for release in April
• Possible inclusion by eDEISA in life sciences portal development (integration with EngineFrame)
The Big Picture
• At a local site a user wants to run a job on the DEISA heterogeneous environment
• User tools
• DEISA Services for the Heterogeneous management Layer (JRA7 DESHL): standards-based interfaces to allow user-level tools to interact across heterogeneous sites
  - Batch Job service
  - Data Management service
  - Information service
[Diagram labels: User, Job; two HPC Sites, each with UNICORE, DRM, Data-Mgt, Information and Resources (HPC, Data, Network)]
DESHL v4.1 Components
[Diagram labels: Client (Command Line Tool, SAGA Client Library, Grid Access Library, ARCON Client library); DESHL; Server (UNICORE Gateway)]
Command line tool functionality
• The precise set of operations is based upon application requirements, but the focus has been on file transfer and job submission.
• Data Transfer
  - upload/download files between a local workstation and a DEISA site
  - delete a file at a DEISA site
  - determine if a file exists on a DEISA site
  - list the contents of a directory on a DEISA site
  - rename a file on a DEISA site
  - copy/move a file between DEISA sites
• Job Management
  - determine the DEISA sites to which a user can submit a batch job
  - submit a batch job to a DEISA site
  - terminate a batch job at a DEISA site
  - view the status of a batch job on a DEISA site
  - retrieve job stdout and stderr
Client Library
• Provides factory classes for access to remote job services and remote file systems
• Specific implementation classes are specified via a properties file and hidden from the caller
• Changes in implementation should not be visible to the caller
• Remote resources are configured locally via a configuration file
• Jobs are specified to the CLT as SAGA directive scripts
• SAGA directives are translated to a JSDL script
• The JSDL script is submitted to a site via the Grid Library
• The Grid Library returns a Task object for the submitted JSDL script
SAGA Factory Classes
• SAGA interfaces are obtained from factory classes:

    DESHLNSDir dir = DESHLClientFactory.getNSDirFactory().getInstance(session);
    JobService js = DESHLClientFactory.getJobServiceFactory().getInstance(session);

• Caller identities are provided via a Session object containing the appropriate context objects
• TODO: currently have a UnicoreContext interface extending Context; will refactor to a SAGA-compliant attribute-based Context
• TODO: rename DESHLNSDir to NSDir
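The idea of hiding the implementation class behind a factory and a properties file can be sketched as follows. This is a minimal illustration only, not the DESHL source: the interface `DemoJobService`, the class `DemoUnicoreJobService`, the factory `DemoJobServiceFactory` and the property key `demo.jobservice.impl` are all hypothetical names invented for the example.

```java
import java.util.Properties;

/** Hypothetical stand-in for a SAGA-style interface; not the DESHL API. */
interface DemoJobService {
    String describe();
}

/** Hypothetical implementation; callers never name this class directly. */
class DemoUnicoreJobService implements DemoJobService {
    @Override
    public String describe() {
        return "demo UNICORE-backed job service";
    }
}

/** Factory that looks up the implementation class name in a Properties object. */
final class DemoJobServiceFactory {
    // Hypothetical property key, chosen for this sketch only.
    static final String IMPL_KEY = "demo.jobservice.impl";

    private final Properties config;

    DemoJobServiceFactory(Properties config) {
        this.config = config;
    }

    /** Instantiates whichever implementation the configuration names. */
    DemoJobService getInstance() {
        String className = config.getProperty(IMPL_KEY);
        if (className == null) {
            throw new IllegalStateException("Missing property: " + IMPL_KEY);
        }
        try {
            Class<? extends DemoJobService> impl =
                    Class.forName(className).asSubclass(DemoJobService.class);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real tool these properties would be loaded from a file.
        Properties config = new Properties();
        config.setProperty(IMPL_KEY, DemoUnicoreJobService.class.getName());
        DemoJobService js = new DemoJobServiceFactory(config).getInstance();
        System.out.println(js.describe());
    }
}
```

Because the caller only ever sees the interface type, swapping the implementation is a configuration change rather than a code change, which is the property the slide asks for.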
NSDir Interface (1)

    public interface DESHLNSDir {

        String[] list(String dir)
            throws SAGAException, BadParameterException, DoesNotExistException;

        boolean exists(String name)
            throws SAGAException, BadParameterException;

        boolean isDir(String name)
            throws SAGAException, BadParameterException, DoesNotExistException;

        boolean isFile(String name)
            throws SAGAException, BadParameterException, DoesNotExistException;
NSDir Interface (2)

        void copy(String source, String target, int[] copyFlags)
            throws SAGAException, BadParameterException, DoesNotExistException,
                   IncorrectStateException;

        void move(String source, String target, int[] moveFlags)
            throws SAGAException, BadParameterException, DoesNotExistException,
                   IncorrectStateException;

        void remove(String target, int[] removeFlags)
            throws SAGAException, BadParameterException, DoesNotExistException,
                   IncorrectStateException;

        void makeDir(String target, int[] makeDirFlags)
            throws SAGAException, BadParameterException, IncorrectStateException;
NSDir Interface (3)
• Methods implemented but not currently used (no persistence in the CLT application, so not currently relevant):

        String getURL() throws SAGAException;

        String getName() throws SAGAException;

        void changeDir(String dir)
            throws SAGAException, BadParameterException, DoesNotExistException;

        int getNumEntries() throws SAGAException;

        String getEntry(int entry) throws SAGAException, BadParameterException;
Job Service Interface

    public interface JobService {

        Job submitJob(JobDefinition jobDef) throws SAGAException;

        String[] list(boolean showAllDetails) throws SAGAException;

        Job getJob(String jobId) throws SAGAException;

        /* not specified by SAGA but very useful */
        public String[] listJobsForSite(String siteName, boolean showAllDetails)
            throws SAGAException;
    }
JobDefinition
• Contains the job description as a set of SAGA attributes
• The JobDefinition interface extends the Attribute interface
• The implementation defines the set of attributes we support
• The CLT reads SAGA definitions from a text file to build the job definition

Example simple job submission script:

    #!/bin/bash
    # Test job script for DESHL using SAGA.
    #
    # SAGA JobDefinition based directives:
    #$ SAGA_FileTransfer = file:///jobs/hello.sh#HOME > hello.sh
    #$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx
    #$ SAGA_JobCmd = hello.sh
    #$ SAGA_JobName = example job script
More complex example

    …
    # SAGA JobDefinition based directives:
    #$ SAGA_JobCmd = a.out
    #$ SAGA_FileTransfer = file:///unicore/a.out#HOME > a.out
    #$ SAGA_HostList = ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx
    #$ SAGA_FileTransfer = file:///TestOutput#HOME < TestOutput
    #$ SAGA_JobEnv = account_no=e24-sa
    #$ SAGA_JobEnv = stack_limit=200MB
    #$ SAGA_Memory = 24400
    #$ SAGA_NumTasks = 16
    #$ SAGA_NumCpus = 1
    #$ SAGA_WallClockSoftLimit = 3600
Currently supported attributes
• SAGA_JobCmd
• SAGA_JobArgs
• SAGA_JobEnv
• SAGA_JobName
• SAGA_FileTransfer
• SAGA_HostList (note: only one host can currently be specified, as DEISA does not have a broker)
• SAGA_NumTasks
• SAGA_NumCpus (interpreted as the number of threads per task)
• SAGA_Memory (the host uses this value to calculate stack and heap)
• SAGA_WallClockSoftLimit
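Reading `#$ NAME = value` directives out of a job script, as in the example scripts, might look roughly like the sketch below. The DESHL parser itself is not shown in these slides, so the rules used here are assumptions: a directive line starts with `#$`, the first `=` separates name from value, and repeated names (such as SAGA_JobEnv or SAGA_FileTransfer) accumulate into a list. The class name `SagaDirectiveParser` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative parser for "#$ NAME = value" lines; not the DESHL implementation. */
final class SagaDirectiveParser {
    private static final String PREFIX = "#$";

    /** Returns directive names mapped to their values, in order of first appearance. */
    static Map<String, List<String>> parse(String script) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        for (String line : script.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith(PREFIX)) {
                continue; // shebang, ordinary comments and shell commands are ignored
            }
            String directive = trimmed.substring(PREFIX.length());
            int eq = directive.indexOf('=');
            if (eq < 0) {
                continue; // assumption: lines without '=' are not directives
            }
            String name = directive.substring(0, eq).trim();
            String value = directive.substring(eq + 1).trim();
            if (name.isEmpty()) {
                continue;
            }
            // Repeated directives keep every value rather than overwriting.
            attributes.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        String script = String.join("\n",
                "#!/bin/bash",
                "# SAGA JobDefinition based directives:",
                "#$ SAGA_JobCmd = a.out",
                "#$ SAGA_JobEnv = account_no=e24-sa",
                "#$ SAGA_JobEnv = stack_limit=200MB",
                "#$ SAGA_NumTasks = 16");
        parse(script).forEach((name, values) -> System.out.println(name + " -> " + values));
    }
}
```

Splitting on the first `=` only is what lets values such as `account_no=e24-sa` survive intact.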
Job Interface
• Uses a subset of the SAGA Job interface
• Due to the translation steps (SAGA-JSDL-AJO), it is not possible to retrieve the SAGA job definition from the remote host

    public interface Job {

        String getJobId();

        JobState getJobState();

        String getJobStateDetail();

        void terminate();

        /* Not specified by SAGA but required by UNICORE to
         * retrieve output from USPACE and free resources. */
        void cleanUp(File toDir);
    }
Example job submission

    Session session;
    …
    // get the class factory
    JobServiceFactory factory = DESHLClientFactory.getJobServiceFactory();

    // get an instance of the job service from the factory
    JobService js = factory.getInstance(session);

    JobDefBuilder jobDefBuilder = new JobDefBuilder();
    ... // build up job definition from file or arguments

    // get the constructed job definition
    JobDefinition jobDef = jobDefBuilder.create();

    // submit the job, return a job instance
    Job submittedJob = js.submitJob(jobDef);

    // get the job identifier, e.g. to display to the user
    String jobID = submittedJob.getJobId();

    // get the job instance again from the job identifier
    Job remoteJob = js.getJob(jobID);

    // get the job's status
    JobState jobState = remoteJob.getJobState();

    // retrieve the job output to a specified directory
    remoteJob.cleanUp(new File("/home/malcolm/joboutputdir"));
Example copy operation

    Session session;
    int copyFlags[] = { NSDirFlags.copyFlags_NoRecursive, NSDirFlags.NoOverwrite };
    String source = "ssl://admin.hpcx.ac.uk:4433/EPCC%20HPCx/home/malcolm/test.dat";
    String target = "ssl://admin.hpcx.ac.uk:4433/IDRIS%20ZAHIR/home/malcolm/test.dat";

    // get an instance of the factory
    NSDirFactory factory = DESHLClientFactory.getNSDirFactory();

    // get an instance of the NSDir interface from the factory
    NSDir dir = factory.getInstance(session);

    // verify the source file exists
    boolean sourceFileExists = dir.exists(source);

    // copy the file to the other site
    dir.copy(source, target, copyFlags);

    // verify the file turned up at the remote site
    boolean targetFileExists = dir.exists(target);
Grid Access Library (roctopus)
• Presents a generalised object-oriented model for interacting with a UNICORE grid, not purely for DESHL
• Provides a general interface that can have multiple implementations
• Jobs are submitted to a Site as JSDL scripts, which returns a Task
• Presents a Task interface to represent executing jobs
• All of this is hidden from the user/application developer
• Authentication/authorisation is by existing UNICORE mechanisms, i.e. long-lived X.509 pairs
[Diagram: Grid 1 to 0..* Site; Site 1 to 0..* Storage; Storage 1 to 0..* File]
Grid Library interface
• Provides dedicated functions for file management/transfer
• Job submission/management via a rich Task interface
• Job is submitted as JSDL and a Task instance is returned
• The list of tasks at a remote site can be retrieved and manipulated

Example:

    JobDefinition jobDef;
    …
    XmlJobDefinitionDocument jsdl = JobDefJSDLConverter.jobDefToJSDL(jobDef);
    UnicoreLocation host = new UnicoreLocation(unicoreLocationStr);
    Site site = grid.locateSite(host);
    final Task task = site.submit(jobSubmission);
    task.startASync(new File[] {});
Current Issues (1)
• SAGA defines job identifiers as ‘[backend url]-[native id]’
• Example: ‘[ssh://remote.host.net:22/]-[1234]’
• (We escape any characters likely to be a problem on the command line)
• Fine programmatically …
• From a CLT perspective, not user friendly

    $ deshl submit -q ssl://myhost.ac.uk:4433/myNJS sleeper.sh
    Your job: ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131, has been successfully submitted.
    $ deshl status ssl%3A%2F%2Fmyhost.ac.uk%3A4433%2FmyNJS%2F957383131
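The escaped job identifier in the transcript has the shape that standard URL percent-encoding produces. The sketch below shows that encoding with `java.net.URLEncoder`; whether DESHL uses this class internally is an assumption, and the helper name `JobIdEscaping` and the choice of `/` to join the site and native id are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Illustrative helper; the class and method names are hypothetical, not DESHL code. */
final class JobIdEscaping {

    /** Joins a site URL and a native job id, then percent-encodes the result. */
    static String escape(String siteUrl, String nativeId) {
        try {
            return URLEncoder.encode(siteUrl + "/" + nativeId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Reverses escape(), giving back the readable "site/nativeId" form. */
    static String unescape(String escapedJobId) {
        try {
            return URLDecoder.decode(escapedJobId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String escaped = escape("ssl://myhost.ac.uk:4433/myNJS", "957383131");
        System.out.println(escaped);
        System.out.println(unescape(escaped));
    }
}
```

The encoded form is safe to paste into a shell, at the cost of readability, which is the usability problem the slide describes.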
Current Issues (2)
• Could save the job id to a file and use a simpler naming convention
• DESHL allows aliases to be defined for remote sites

    $ deshl submit -q myHost sleeper.sh
    Your job myHost%2F957383131 has been successfully submitted

    nsdir.copy("myhosta/home/malcolm/test.dat", "myhostb/home/malcolm/test.dat");

• Aliases are currently specified and handled outside of the SAGA standard; we would like to include this as an optional attribute in the context
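Alias handling of this kind amounts to replacing the leading path segment with a configured site URL. The following is a minimal sketch under stated assumptions: the class `SiteAliases` is hypothetical, aliases are treated as case-sensitive, and anything already containing `://` is passed through untouched. How DESHL actually stores and resolves aliases is not shown in the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative alias table; not the DESHL implementation. */
final class SiteAliases {
    private final Map<String, String> aliases = new HashMap<>();

    /** Registers a short name for a full site URL. */
    void define(String alias, String siteUrl) {
        aliases.put(alias, siteUrl);
    }

    /** Expands a leading alias such as "myHost/home/x"; full URLs pass through unchanged. */
    String expand(String location) {
        if (location.contains("://")) {
            return location; // already a full URL
        }
        int slash = location.indexOf('/');
        String head = slash < 0 ? location : location.substring(0, slash);
        String rest = slash < 0 ? "" : location.substring(slash);
        String siteUrl = aliases.get(head);
        if (siteUrl == null) {
            throw new IllegalArgumentException("Unknown site alias: " + head);
        }
        return siteUrl + rest;
    }

    public static void main(String[] args) {
        SiteAliases sites = new SiteAliases();
        sites.define("myHost", "ssl://myhost.ac.uk:4433/myNJS");
        System.out.println(sites.expand("myHost/957383131"));
    }
}
```

Failing loudly on an unknown alias, rather than passing the string through, avoids sending a malformed location to the remote site.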
Current Issues (3)
• Retrieving the job definition:
  - Not currently supported …
  - The job definition originates as a SAGA script
  - It is not possible to retrieve the original SAGA job definition from the remote host, because the host does not receive or understand it; we would need to rely on local persistence
  - It may be possible to get the JSDL description and reverse-translate it to SAGA
  - (We could store the original SAGA script in a local database with the job id)
• Debugging / exception reporting:
  - A layered architecture can be difficult to debug
  - It is sometimes unclear whether a problem is in the middleware or on the remote host; very clear exception reporting is required, or users will tend to blame the middleware for operational problems on the host
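The local-persistence option could be as simple as keeping each submitted script on the client, keyed by job id. The sketch below uses one plain file per job in place of the local database the slide mentions; the class `LocalJobDefinitionStore` and its methods are hypothetical, and it assumes the escaped job id is usable as a file name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Illustrative client-side store: one file per job, named after the escaped job id. */
final class LocalJobDefinitionStore {
    private final Path directory;

    LocalJobDefinitionStore(Path directory) {
        this.directory = directory;
    }

    /** Saves the original SAGA script at submission time. */
    void save(String escapedJobId, String sagaScript) {
        try {
            Files.createDirectories(directory);
            Files.write(directory.resolve(escapedJobId),
                    sagaScript.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored script, or empty if this client never saw the job. */
    Optional<String> load(String escapedJobId) {
        Path file = directory.resolve(escapedJobId);
        if (!Files.isRegularFile(file)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        LocalJobDefinitionStore store =
                new LocalJobDefinitionStore(Files.createTempDirectory("deshl-sketch"));
        String jobId = "myHost%2F957383131";
        store.save(jobId, "#$ SAGA_JobCmd = hello.sh\n");
        System.out.println(store.load(jobId).orElse("(no local copy)"));
    }
}
```

A store like this only helps on the machine that submitted the job, which is the limitation implied by relying on local persistence.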
Questions … ? http://forge.nesc.ac.uk/projects/deisa-jra7/ malcolm@epcc.ed.ac.uk